Python Readiness & Liveness Probes — Core Concepts
Two Probes, Two Purposes
Container orchestrators like Kubernetes use probes to make automated decisions about your application’s lifecycle. Getting them wrong can cause anything from unnecessary restarts to complete outages.
| Probe | Question | Failure Action | Typical Endpoint |
|---|---|---|---|
| Liveness | “Is the process stuck?” | Kill and restart the container | /healthz |
| Readiness | “Can it serve traffic?” | Remove from load balancer | /readyz |
| Startup | “Has it finished booting?” | Hold off liveness and readiness until it passes; restart the container if it never does | /healthz |
Liveness: The Nuclear Option
A failed liveness probe triggers a container restart. This makes it the most dangerous probe — misconfigure it and your app enters a restart loop.
What liveness should check:
- Can the event loop respond? (Is the process deadlocked?)
- Is the main thread alive?
What liveness should NOT check:
- Database connectivity
- External service availability
- Disk space
Why? If your database goes down and your liveness check includes a database ping, Kubernetes restarts your app. The database is still down, so the new instance fails liveness again. Now you have a restart loop, and when the database recovers, all your instances are in a CrashLoopBackOff state instead of ready to serve.
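A dependency-free liveness check could look roughly like the sketch below. It assumes the main loop calls a hypothetical `beat()` method whenever it makes progress; the `Heartbeat` class, the `liveness_status` function, and the 30-second default are illustrative choices, not part of any framework.

```python
import time
from typing import Optional


class Heartbeat:
    """Tracks when the main loop last made progress (hypothetical helper)."""

    def __init__(self, max_age_seconds: float = 30.0) -> None:
        self.max_age_seconds = max_age_seconds
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        # Call this from the main loop each time it completes a unit of work.
        self._last_beat = time.monotonic()

    def is_alive(self, now: Optional[float] = None) -> bool:
        # Alive means recent progress was reported; no dependency is consulted.
        current = time.monotonic() if now is None else now
        return (current - self._last_beat) <= self.max_age_seconds


def liveness_status(heartbeat: Heartbeat) -> int:
    # Map the result to an HTTP status code for a /healthz handler.
    return 200 if heartbeat.is_alive() else 503
```

Nothing here touches a database or a network client, so a dependency outage cannot turn into a restart.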
Readiness: The Traffic Gate
A failed readiness probe removes the pod from the Service’s endpoint list — no traffic gets routed to it. When the probe passes again, traffic resumes. No restarts, no data loss.
What readiness should check:
- Database connection pool is available
- Required caches are reachable
- Any critical downstream service is responding
Readiness failures are expected and recoverable. During a deployment rollout, new pods stay not-ready until their first readiness probe passes. A database failover makes the app not-ready for a few seconds. This is normal and correct.
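One way to combine the dependency checks listed above is a small aggregator that a /readyz handler can call. This is a sketch only: `readiness_report` and the check names are hypothetical, and each check is assumed to be a zero-argument callable returning a truthy value when its dependency is usable.

```python
from typing import Callable, Dict, Tuple


def readiness_report(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, bool]]:
    """Run each dependency check and return (HTTP status, per-check results)."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as not ready; the process keeps running.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

Returning the per-check results alongside the status makes it easy to include them in the response body, so whoever is debugging can see which dependency took the pod out of rotation.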
Startup Probes: The Grace Period
Some Python apps take a long time to start — loading ML models, running migrations, warming caches. Without a startup probe, the liveness probe might kill the container before it finishes starting.
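The startup probe runs instead of liveness and readiness until it succeeds. Once it passes, it never runs again, and the other probes take over. If it never passes within its own periodSeconds × failureThreshold budget, the container is restarted, so size that budget for the slowest realistic start.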
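On the application side, a startup endpoint can be backed by a simple flag that flips once the slow work is done. The sketch below uses `threading.Event` from the standard library; `startup_complete`, `warm_up`, and `startup_status` are illustrative names, and the body of `warm_up` stands in for whatever loading the app really does.

```python
import threading

# Set once slow startup work has finished; the name is illustrative.
startup_complete = threading.Event()


def warm_up() -> None:
    # Placeholder for slow work such as loading a model or warming a cache.
    # Real loading code would run here before the flag is set.
    startup_complete.set()


def startup_status() -> int:
    # 503 while still booting, 200 once warm-up has completed.
    return 200 if startup_complete.is_set() else 503
```

Running `warm_up` in a background thread keeps the HTTP server free to answer the startup probe with 503 while loading is still in progress.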
The Configuration Triangle
Three settings control probe behavior:
- periodSeconds — how often the probe runs (default: 10)
- failureThreshold — how many consecutive failures before action (default: 3)
- timeoutSeconds — how long to wait for a response (default: 1)
The total tolerance is roughly periodSeconds × failureThreshold. With the defaults, your app can fail probes for about 30 seconds before a restart (liveness) or removal from traffic (readiness). A probe that does not answer within timeoutSeconds counts as a failure.
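The arithmetic is easy to sanity-check before committing a probe configuration. The helper below is a hypothetical convenience function whose defaults mirror the values listed above; the 10-second, 30-failure example is an arbitrary illustration of a generous budget.

```python
def failure_window_seconds(period_seconds: int = 10, failure_threshold: int = 3) -> int:
    """Approximate time a probe can keep failing before the orchestrator acts."""
    return period_seconds * failure_threshold


# Defaults: 10 s period x 3 failures = 30 s of tolerance.
default_window = failure_window_seconds()

# A more generous budget: 10 s period x 30 failures = 300 s.
generous_window = failure_window_seconds(period_seconds=10, failure_threshold=30)
```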
Common Mistake: Liveness That Checks Dependencies
This is the number-one probe misconfiguration. Google’s SRE team has documented entire outages caused by liveness probes that check databases. The rule is simple:
- Liveness = process health only (am I stuck?)
- Readiness = dependency health (can I do my job?)
If your app can recover from a dependency failure on its own (retry, reconnect, use a fallback), that’s a readiness issue, not a liveness issue.
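The split can be made visible in a single handler. The sketch below uses only the standard library's `http.server`; `probe_status`, `ProbeHandler`, and the `dependencies_ok` attribute are hypothetical names, and a real app would update that attribute from its own dependency checks.

```python
from http.server import BaseHTTPRequestHandler


def probe_status(path: str, dependencies_ok: bool) -> int:
    if path == "/healthz":
        # Liveness: being able to answer at all shows the process is not stuck.
        return 200
    if path == "/readyz":
        # Readiness: reflects dependency health, which may come and go.
        return 200 if dependencies_ok else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    # Assumed to be updated elsewhere by the app's dependency checks.
    dependencies_ok = True

    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, self.dependencies_ok))
        self.end_headers()

    def log_message(self, format: str, *args) -> None:
        # Silence per-request logging so frequent probes do not flood the logs.
        pass
```

To serve it, pass `ProbeHandler` to `http.server.HTTPServer` on whichever port the probes target. Note that `/healthz` returns 200 even when `dependencies_ok` is false, which is the behavior the rule asks for.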
Common Misconception
“If my readiness probe fails, Kubernetes restarts my pod.” No — that’s what liveness does. Readiness only controls traffic routing. Your pod stays running, and the moment the probe passes again, traffic flows back. Confusing the two leads to either fragile restart loops (too much in liveness) or silent traffic loss (nothing in readiness).
One thing to remember: Keep liveness probes simple and dependency-free. Put all the real checks in readiness. When in doubt, use a readiness probe instead of a liveness probe.