Python Health Check Patterns — Core Concepts

Beyond “200 OK”

The simplest health check — an endpoint that always returns 200 — is barely better than nothing. It confirms the process is running and the HTTP server is accepting connections, but tells you nothing about whether the app can actually do its job. A real health check strategy uses multiple layers.

Types of Health Checks

Shallow (Liveness) Checks

These answer: “Is the process alive and responsive?” They should be fast (under 10 ms), have no external dependencies, and never fail unless the process itself is broken.

Use case: container orchestrators checking if a process needs to be restarted.
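
As a rough illustration, a liveness endpoint can be a few lines of standard-library WSGI code. This is a minimal sketch, not a prescribed implementation: the /livez path, the liveness_app name, and the response body are assumptions made for this example.

```python
import json


def liveness_app(environ, start_response):
    """WSGI app that answers /livez without touching any external dependency."""
    if environ.get("PATH_INFO") == "/livez":
        # Hypothetical response body; only the 200 status matters to most probes.
        body = json.dumps({"status": "alive"}).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    body = b"not found"
    start_response("404 Not Found", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, wsgiref.simple_server.make_server("", 8000, liveness_app).serve_forever() from the standard library would serve it. Because the handler does no I/O beyond the response itself, it fails only when the process or its HTTP server is broken.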

Deep (Readiness) Checks

These answer: “Can this instance serve real traffic?” They verify connections to databases, caches, message brokers, and downstream services.

Use case: load balancers deciding whether to route traffic to this instance.
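
One way to sketch a readiness check is a function that runs a probe for each dependency and reports 503 if any of them fails. The readiness function and the probe names below are hypothetical; each probe is assumed to be a callable that raises an exception on failure.

```python
from typing import Callable, Dict, List, Tuple


def readiness(probes: Dict[str, Callable[[], None]]) -> Tuple[int, List[str]]:
    """Run every probe; return an HTTP status code and the names that failed."""
    failed = []
    for name, probe in probes.items():
        try:
            probe()  # a probe signals failure by raising
        except Exception:
            failed.append(name)
    # 503 tells a load balancer to stop routing traffic to this instance.
    return (200 if not failed else 503), failed
```

In a real service the probes would wrap calls such as a database query or a cache ping, each with its own short timeout.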

Startup Checks

These answer: “Has the app finished initializing?” Some apps need to load large models, warm caches, or run migrations before they’re ready.

Use case: preventing premature traffic during cold starts.
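
A startup check can be as small as a flag that the initialization code sets once it has finished. The sketch below assumes a hypothetical StartupGate class; the work done before mark_started() is called (loading models, warming caches, running migrations) is left to the application.

```python
import threading


class StartupGate:
    """Reports 503 until initialization has been marked complete."""

    def __init__(self) -> None:
        self._started = threading.Event()  # thread-safe boolean flag

    def mark_started(self) -> None:
        # Call this once, at the end of the app's initialization routine.
        self._started.set()

    def status_code(self) -> int:
        return 200 if self._started.is_set() else 503
```

The startup endpoint would return gate.status_code(), so traffic is held back until the flag flips.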

Anatomy of a Good Health Check

A well-designed check returns structured information:

Field     Purpose
status    healthy, degraded, or unhealthy
checks    Individual component results
duration  How long the check took
version   App version for debugging

The degraded state is important — it means “I can work, but something’s wrong.” Maybe the cache is down but the database is fine. The app can still serve requests, just slower.
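
The fields above could be assembled roughly as follows. This is a sketch under one assumption not stated in the text: checks are split into critical ones (a failure means unhealthy) and optional ones (a failure means degraded). The build_report name and the duration_ms key are likewise illustrative.

```python
import time
from typing import Callable, Dict


def build_report(
    critical: Dict[str, Callable[[], None]],
    optional: Dict[str, Callable[[], None]],
    version: str,
) -> dict:
    """Run all checks and summarize them as healthy, degraded, or unhealthy."""
    start = time.monotonic()
    results: Dict[str, str] = {}

    def run(checks: Dict[str, Callable[[], None]]) -> bool:
        all_ok = True
        for name, check in checks.items():
            try:
                check()  # a check signals failure by raising
                results[name] = "ok"
            except Exception as exc:
                results[name] = f"failed: {exc}"
                all_ok = False
        return all_ok

    critical_ok = run(critical)
    optional_ok = run(optional)

    if not critical_ok:
        status = "unhealthy"
    elif not optional_ok:
        status = "degraded"  # can still serve requests, but something is wrong
    else:
        status = "healthy"

    return {
        "status": status,
        "checks": results,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
        "version": version,
    }
```

With the database passed as critical and the cache as optional, a cache outage yields a degraded report rather than an unhealthy one, matching the scenario described above.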

What to Check (and What Not To)

Good checks:

  • Database: execute SELECT 1 with a short timeout
  • Redis/cache: PING command
  • Disk space: is usage below 90%?
  • Memory: is RSS within expected bounds?
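
Two of the checks in this list can be sketched with the standard library alone. SQLite stands in for whatever database the app actually uses, and the function names and the default path are assumptions made for illustration; the 90% threshold comes from the list above.

```python
import shutil
import sqlite3


def check_database(conn: sqlite3.Connection) -> bool:
    """Run the cheapest possible query to confirm the connection works."""
    row = conn.execute("SELECT 1").fetchone()
    return row == (1,)


def check_disk(path: str = "/", max_used_fraction: float = 0.90) -> bool:
    """Return True while disk usage at `path` stays below the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return False  # cannot compute a ratio; treat as failing
    return usage.used / usage.total < max_used_fraction
```

For SQLite, a connection opened with sqlite3.connect(path, timeout=1.0) limits how long a query waits on a locked database. Other database drivers expose their own connect and statement timeouts, which should be set just as tightly.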

Bad checks:

  • Calling external third-party APIs (their downtime shouldn’t mark you as unhealthy)
  • Running expensive queries that affect production traffic
  • Checks without timeouts (a stuck database connection blocks the health endpoint)
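
To illustrate the timeout point, one possible approach is to run each check on a worker thread and stop waiting after a deadline. This is a sketch, not the only option: run_with_timeout and the pool size are assumptions, and a driver-level timeout is still preferable where the client library offers one, because the abandoned thread keeps running until the check returns.

```python
import concurrent.futures
import time
from typing import Callable

# Shared pool. Deliberately not used as a context manager, since leaving a
# "with" block would wait for stuck checks and defeat the timeout.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check: Callable[[], None], timeout_s: float = 1.0) -> str:
    """Run `check`, giving up after `timeout_s` seconds."""
    future = _executor.submit(check)
    try:
        future.result(timeout=timeout_s)
        return "ok"
    except concurrent.futures.TimeoutError:
        # The worker thread is not killed; we only stop waiting for it.
        return "timeout"
    except Exception as exc:
        return f"failed: {exc}"
```

For example, run_with_timeout(lambda: time.sleep(5), timeout_s=0.5) returns "timeout" after about half a second instead of blocking the health endpoint for five.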

The Cascade Problem

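If Service A health-checks Service B, and Service B health-checks Service C, a single failure in C marks B unhealthy, which in turn marks A unhealthy. Load balancers and orchestrators then pull or restart instances that were working fine, so one failure can cascade and take down your entire system.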

The rule: only check direct dependencies. If your app talks to a database and a cache, check those. Don’t check the services that they depend on.

Common Misconception

“If the health check passes, the app is healthy.” Health checks only verify what they test. If your check doesn’t test disk I/O and the disk is failing, the check still passes. Design your checks around the specific failure modes you’ve seen in production.

One thing to remember: Good health checks are layered — a fast liveness check for orchestrators, a thorough readiness check for load balancers, and each check tests only direct dependencies with strict timeouts.

