Python Load Shedding — Core Concepts
What Is Load Shedding?
Load shedding is the deliberate rejection of incoming requests when a system is at or near capacity. Instead of accepting more work than it can handle (which degrades performance for everyone), the system explicitly refuses excess requests and returns a fast error.
The term comes from electrical engineering — power grids shed load (cut power to some areas) to prevent a total blackout.
Load Shedding vs. Rate Limiting
These are often confused, but they solve different problems:
| Aspect | Rate Limiting | Load Shedding |
|---|---|---|
| Purpose | Enforce per-client fairness | Protect system capacity |
| Trigger | Client exceeds quota | System reaches capacity |
| Scope | Per client/API key | System-wide |
| Rejection | 429 Too Many Requests | 503 Service Unavailable |
| When | Always active | Only under stress |
Rate limiting says “you’re sending too many requests.” Load shedding says “we’re too busy for anyone right now.” You typically need both.
Shedding Strategies
1. LIFO (Last In, First Out) Shedding
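Reject the newest requests and serve the ones that have already been waiting. "Out" here means rejected: the last request to arrive is the first to be shed. Users who arrived first get served first, and new arrivals during overload get fast rejections. Do not confuse this with LIFO queue processing, where the newest request is served first.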
Best for: Interactive applications where users are still waiting for their request.
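A minimal sketch of this strategy, assuming a single in-process queue sits in front of the workers. The class name `BoundedAdmission` and the `max_depth` parameter are hypothetical, chosen for illustration; only the standard-library `queue` module is used.

```python
import queue


class BoundedAdmission:
    """Admit requests into a fixed-size FIFO queue; reject newcomers when full."""

    def __init__(self, max_depth: int) -> None:
        # max_depth is an assumed tuning knob: how many requests may wait.
        self._pending: queue.Queue = queue.Queue(maxsize=max_depth)

    def try_admit(self, request: object) -> bool:
        """Return True if the request was queued, False if it was shed."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            # Queue is full: shed the newest arrival immediately.
            return False

    def next_request(self) -> object:
        """Hand the longest-waiting request to a worker (raises queue.Empty if none)."""
        return self._pending.get_nowait()
```

A caller that gets `False` back from `try_admit` would respond with a fast rejection instead of letting the request wait.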
2. Priority-Based Shedding
Not all requests are equal. An authenticated user’s checkout request is more important than an anonymous user’s browse request. Assign priorities and shed low-priority traffic first.
Priority example:
- Health checks and admin endpoints (never shed)
- Payment and checkout (shed last)
- Authenticated user requests
- Anonymous user requests (shed early)
- Bot/crawler traffic (shed first)
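One way to express the priority list above is a per-tier load threshold. This is a sketch under stated assumptions: the `Priority` enum, the `SHED_AT` table, and every threshold value are hypothetical, and `load` is assumed to be a 0.0 to 1.0 utilization figure supplied by your own overload detection.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0        # health checks and admin endpoints: never shed
    CHECKOUT = 1        # payment and checkout: shed last
    AUTHENTICATED = 2
    ANONYMOUS = 3
    BOT = 4             # bots and crawlers: shed first


# Hypothetical thresholds: the load fraction at or above which a tier is shed.
# CRITICAL is deliberately absent, so it is never shed.
SHED_AT = {
    Priority.BOT: 0.70,
    Priority.ANONYMOUS: 0.80,
    Priority.AUTHENTICATED: 0.90,
    Priority.CHECKOUT: 0.98,
}


def should_shed(priority: Priority, load: float) -> bool:
    """Return True if a request of this priority should be rejected at this load."""
    threshold = SHED_AT.get(priority)
    if threshold is None:
        return False
    return load >= threshold
```

At a load of 0.75, for example, this sheds bot traffic while still serving anonymous and authenticated users.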
3. Deadline-Based Shedding
If a request has been in the queue longer than its useful lifetime, drop it. A user who sent a request 30 seconds ago has probably refreshed or left. Processing their stale request wastes resources.
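A rough sketch of the idea, assuming requests wait in a `collections.deque` and each one records when it was enqueued. `QueuedRequest`, `max_wait_seconds`, and `pop_fresh` are hypothetical names for illustration.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueuedRequest:
    payload: str
    max_wait_seconds: float  # assumed "useful lifetime" of this request
    enqueued_at: float = field(default_factory=time.monotonic)

    def is_stale(self, now: float) -> bool:
        return now - self.enqueued_at > self.max_wait_seconds


def pop_fresh(pending: deque, now: Optional[float] = None) -> Optional[QueuedRequest]:
    """Return the oldest request that is still worth serving.

    Stale requests found on the way are discarded. A real server would
    also send each of them an explicit rejection at that point.
    """
    now = time.monotonic() if now is None else now
    while pending:
        request = pending.popleft()
        if not request.is_stale(now):
            return request
    return None
```

The sketch uses `time.monotonic()` rather than wall-clock time so that system clock adjustments cannot make a request look older or younger than it is.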
4. Probabilistic Shedding
As load increases, randomly reject a growing percentage of requests. At 80% capacity, reject 10%. At 90%, reject 30%. At 100%, reject 50%. This creates a smooth degradation curve instead of a sharp cliff.
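The curve described above could be sketched with linear interpolation between the stated points. The 80%, 90%, and 100% points come from the text. The starting point at 70% load, the cap at 50% rejection beyond full load, and the function names are assumptions made for this example.

```python
import random
from typing import Optional

# (load fraction, rejection probability) pairs.
# (0.70, 0.0) is an assumed starting point so the curve ramps up from zero.
CURVE = [(0.70, 0.0), (0.80, 0.10), (0.90, 0.30), (1.00, 0.50)]


def rejection_probability(load: float) -> float:
    """Interpolate linearly between the points in CURVE."""
    if load <= CURVE[0][0]:
        return 0.0
    if load >= CURVE[-1][0]:
        return CURVE[-1][1]  # assumption: cap at the last stated value
    for (x0, y0), (x1, y1) in zip(CURVE, CURVE[1:]):
        if x0 <= load <= x1:
            return y0 + (y1 - y0) * (load - x0) / (x1 - x0)
    raise AssertionError("unreachable: load is inside the curve's range")


def should_shed_randomly(load: float, rng: Optional[random.Random] = None) -> bool:
    """Reject with a probability that grows with load."""
    rng = rng or random.Random()
    return rng.random() < rejection_probability(load)
```

Passing a seeded `random.Random` instance makes the decision reproducible in tests.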
Detecting Overload
How does your application know it’s overloaded?
- Request queue depth — if the queue is longer than N, start shedding
- Response latency — if P99 latency exceeds a threshold, the system is struggling
- CPU/memory usage — approaching resource limits
- Active connections — concurrent connections exceeding a safe threshold
- Event loop lag (asyncio) — if the event loop is falling behind, I/O is backing up
The best systems combine multiple signals. A single metric can be misleading — high CPU might be from a batch job, not from overload.
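One simple way to combine signals is to require at least two of them to breach their limits before declaring overload. The sketch below assumes the metrics are already collected elsewhere; the `Signals` and `Limits` classes, the default limit values, and the two-breach rule are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    queue_depth: int
    p99_latency_ms: float
    cpu_fraction: float       # 0.0 to 1.0
    active_connections: int


@dataclass(frozen=True)
class Limits:
    # Assumed example limits; tune them against your own measurements.
    queue_depth: int = 100
    p99_latency_ms: float = 500.0
    cpu_fraction: float = 0.90
    active_connections: int = 1000


def breached_signals(signals: Signals, limits: Limits) -> list:
    """Return the names of all signals that exceed their limit."""
    checks = {
        "queue_depth": signals.queue_depth > limits.queue_depth,
        "p99_latency": signals.p99_latency_ms > limits.p99_latency_ms,
        "cpu": signals.cpu_fraction > limits.cpu_fraction,
        "connections": signals.active_connections > limits.active_connections,
    }
    return [name for name, over in checks.items() if over]


def is_overloaded(signals: Signals, limits: Limits = Limits(), min_breaches: int = 2) -> bool:
    """Declare overload only when several independent signals agree."""
    return len(breached_signals(signals, limits)) >= min_breaches
```

With this rule, a batch job that pushes CPU to 97% while queue depth and latency stay healthy does not trigger shedding.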
The 503 Response
When shedding load, return HTTP 503 with useful headers:
HTTP/1.1 503 Service Unavailable
Retry-After: 5
X-Shed-Reason: capacity
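Retry-After tells well-behaved clients how many seconds to wait before trying again. Without it, clients have no guidance, and many will retry immediately and make the overload worse. X-Shed-Reason is a custom, non-standard header, useful for debugging and dashboards.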
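As a framework-neutral illustration, the response above can be produced from a plain WSGI application. This is a sketch, not a production middleware: `shed_response`, `make_app`, and the `is_overloaded` callable are hypothetical names, and the overload check is assumed to be supplied by the caller.

```python
def shed_response(start_response, retry_after_seconds: int = 5, reason: str = "capacity"):
    """Send a fast 503 with Retry-After and the custom X-Shed-Reason header."""
    body = b"Service temporarily overloaded. Please retry later.\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        ("Retry-After", str(retry_after_seconds)),
        ("X-Shed-Reason", reason),  # custom header, not part of any standard
    ]
    start_response("503 Service Unavailable", headers)
    return [body]


def make_app(is_overloaded):
    """Build a WSGI app that sheds load whenever is_overloaded() returns True."""

    def app(environ, start_response):
        if is_overloaded():
            # Reject before doing any expensive work.
            return shed_response(start_response)
        body = b"ok\n"
        start_response(
            "200 OK",
            [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ],
        )
        return [body]

    return app
```

The check runs before any real work starts, which is what keeps the rejection cheap.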
Common Misconception
“Load shedding means dropping requests silently.” Never do this. Always return an explicit rejection (503, or a queue-full error). Silent drops cause client-side timeouts — which are slow, confusing, and consume resources on both sides. A fast 503 is vastly better than a 30-second timeout.
When Load Shedding Is Critical
- Viral traffic events — your product is on the front page of a news site
- Black Friday / sales events — predictable but massive spikes
- DDoS mitigation — shedding bot traffic to protect real users
- Dependency slowdowns — when a slow database causes request queue growth
- Cascade prevention — shedding early prevents your system from becoming the slow dependency for others
The Counter-Intuitive Truth
Adding more capacity doesn’t eliminate the need for load shedding. No matter how big your system is, there’s always a load level that exceeds it. Load shedding is the safety valve that defines behavior at the boundary — the plan for what happens when demand exceeds supply.
Google, Amazon, and Netflix all use load shedding in production. It’s not a sign of under-provisioning — it’s a sign of thoughtful design.
One thing to remember: Load shedding trades a small number of fast failures for system-wide stability. Rejecting 5% of requests with a quick 503 is vastly better than making 100% of requests slow and unreliable.