Python Readiness & Liveness Probes — Core Concepts

Two Probes, Two Purposes

Container orchestrators like Kubernetes use probes to make automated decisions about your application’s lifecycle. Getting them wrong can cause anything from unnecessary restarts to complete outages.

Probe      Question                    Failure Action                                   Typical Endpoint
Liveness   “Is the process stuck?”     Kill and restart the container                   /healthz
Readiness  “Can it serve traffic?”     Remove from load balancer                        /readyz
Startup    “Has it finished booting?”  Delay other probes; restart if it never passes   /healthz

Liveness: The Nuclear Option

A failed liveness probe triggers a container restart. This makes it the most dangerous probe — misconfigure it and your app enters a restart loop.

What liveness should check:

  • Can the event loop respond? (Is the process deadlocked?)
  • Is the main thread alive?

What liveness should NOT check:

  • Database connectivity
  • External service availability
  • Disk space

Why? If your database goes down and your liveness check includes a database ping, Kubernetes restarts your app. The database is still down, so the new instance fails liveness again. Now you have a restart loop, and when the database recovers, all your instances are in a CrashLoopBackOff state instead of ready to serve.
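A minimal sketch of a dependency-free liveness check, assuming the app has a main loop that can record a heartbeat. The Heartbeat class, liveness_response function, and the 30-second default are hypothetical names and values chosen for illustration, not part of any library.

```python
import time


class Heartbeat:
    """Tracks when the main loop last made progress (hypothetical helper)."""

    def __init__(self, max_age_seconds=30.0, clock=time.monotonic):
        self._clock = clock
        self._max_age = max_age_seconds
        self._last_beat = clock()

    def beat(self):
        # Call this from the main loop or worker thread on every iteration.
        self._last_beat = self._clock()

    def is_alive(self):
        # Alive means the loop has beaten recently; no dependency is consulted.
        return (self._clock() - self._last_beat) <= self._max_age


def liveness_response(heartbeat):
    """Return (status_code, body) for a /healthz handler."""
    if heartbeat.is_alive():
        return 200, "ok"
    return 503, "main loop stalled"
```

The handler never touches the database or any external service, so a dependency outage cannot trigger a restart. Whatever web framework serves /healthz only needs to translate the returned tuple into an HTTP response.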

Readiness: The Traffic Gate

A failed readiness probe removes the pod from the Service’s endpoint list — no traffic gets routed to it. When the probe passes again, traffic resumes. No restarts, no data loss.

What readiness should check:

  • Database connection pool is available
  • Required caches are reachable
  • Any critical downstream service is responding

Readiness failures are expected and recoverable. During a deployment rollout, new pods stay not-ready until their checks pass. A database failover makes the app not-ready for a few seconds. This is normal and correct.
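One way to sketch a readiness check is to run a set of named dependency checks and report 503 if any of them fails. The readiness_response function and the shape of its checks argument are assumptions for illustration; the real checks (a pool ping, a cache lookup) depend on the libraries the app uses.

```python
def readiness_response(checks):
    """Run named dependency checks and return (status_code, details).

    `checks` maps a name to a zero-argument callable that returns a truthy
    value when the dependency is usable. Any exception counts as a failure.
    """
    details = {}
    for name, check in checks.items():
        try:
            details[name] = bool(check())
        except Exception:
            # A raising check means the dependency is not usable right now.
            details[name] = False
    status = 200 if all(details.values()) else 503
    return status, details


# Usage with stand-in checks (hypothetical; replace with real pings).
status, details = readiness_response({
    "database": lambda: True,
    "cache": lambda: False,
})
```

Returning the per-dependency details alongside the status code makes it easier to see from the /readyz response body which dependency took the pod out of rotation.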

Startup Probes: The Grace Period

Some Python apps take a long time to start — loading ML models, running migrations, warming caches. Without a startup probe, the liveness probe might kill the container before it finishes starting.

The startup probe runs instead of liveness and readiness until it succeeds. Once it passes, it never runs again, and the other probes take over. If it never succeeds within its failure threshold, the container is restarted.
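A startup endpoint can be sketched as a flag that flips once the slow boot work is done. The names startup_complete, warm_up, and startup_response are hypothetical, and the boot steps are passed in as plain callables so the sketch stays independent of any framework.

```python
import threading

# Set once all slow boot work (model loading, cache warming) has finished.
startup_complete = threading.Event()


def warm_up(load_steps):
    """Run each slow boot step in order, then flag startup as finished."""
    for step in load_steps:
        step()
    startup_complete.set()


def startup_response():
    """Return (status_code, body) for the startup probe endpoint."""
    if startup_complete.is_set():
        return 200, "started"
    return 503, "starting"
```

In a real service warm_up would typically run in a background thread so the HTTP server can answer the startup probe with 503 while the work is still in progress.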

The Configuration Triangle

Three settings control probe behavior:

  • periodSeconds — how often the probe runs (default: 10)
  • failureThreshold — how many consecutive failures before action (default: 3)
  • timeoutSeconds — how long to wait for a response (default: 1)

The total tolerance is roughly periodSeconds × failureThreshold. With defaults, your app has about 30 seconds of consecutive failures before a restart (liveness) or removal from traffic (readiness).
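The arithmetic above can be written as a small helper, which is handy when sizing a startup probe for a slow-booting app. The function name failure_tolerance_seconds is hypothetical; its defaults mirror the probe defaults listed above.

```python
def failure_tolerance_seconds(period_seconds=10, failure_threshold=3):
    """Approximate time of consecutive failures tolerated before action."""
    return period_seconds * failure_threshold


default_budget = failure_tolerance_seconds()       # 10 * 3 = 30 seconds
slow_boot_budget = failure_tolerance_seconds(10, 30)  # 10 * 30 = 300 seconds
```

For example, an app that may need up to five minutes to load a model would want a startup probe whose period and threshold multiply to at least 300 seconds.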

Common Mistake: Liveness That Checks Dependencies

This is the number-one probe misconfiguration. Google’s SRE team has documented entire outages caused by liveness probes that check databases. The rule is simple:

  • Liveness = process health only (am I stuck?)
  • Readiness = dependency health (can I do my job?)

If your app can recover from a dependency failure on its own (retry, reconnect, use a fallback), that’s a readiness issue, not a liveness issue.

Common Misconception

“If my readiness probe fails, Kubernetes restarts my pod.” No — that’s what liveness does. Readiness only controls traffic routing. Your pod stays running, and the moment the probe passes again, traffic flows back. Confusing the two leads to either fragile restart loops (too much in liveness) or silent traffic loss (nothing in readiness).

One thing to remember: Keep liveness probes simple and dependency-free. Put all the real checks in readiness. When in doubt, use a readiness probe instead of a liveness probe.
