Chaos Testing Applications — Core Concepts

What chaos engineering actually is

Chaos engineering is the practice of experimenting on a system to build confidence in its ability to withstand turbulent conditions. It’s not random destruction — it’s disciplined, hypothesis-driven experimentation.

The process follows a scientific method:

  1. Define steady state: What does “working normally” look like? (For example, response time under 200 ms and error rate below 0.1%.)
  2. Hypothesize: “If we kill one database replica, the system should fail over within 5 seconds with no user-visible errors.”
  3. Inject failure: Actually kill the replica.
  4. Observe: Did the system behave as expected?
  5. Learn: If it didn’t, fix the weakness and re-test.
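The five steps map fairly directly onto a small driver function. The following is a minimal sketch, not a real framework: run_experiment, ExperimentResult, and ToySystem are hypothetical names invented for illustration, and the "system" is a toy object with a replica counter rather than a real deployment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_after: bool

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_before and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Verify the baseline, inject the failure, observe, then roll back."""
    # Steps 1-2: do not inject into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(steady_before=False, steady_after=False)
    try:
        inject()                       # Step 3: inject the failure
        steady_after = steady_state()  # Step 4: observe
    finally:
        rollback()                     # always restore the system
    return ExperimentResult(steady_before=True, steady_after=steady_after)


class ToySystem:
    """Hypothetical stand-in for a service backed by database replicas."""

    def __init__(self, replicas: int = 2) -> None:
        self.replicas = replicas

    def healthy(self) -> bool:
        return self.replicas >= 1

    def kill_replica(self) -> None:
        self.replicas -= 1

    def restore_replica(self) -> None:
        self.replicas += 1


system = ToySystem(replicas=2)
result = run_experiment(system.healthy, system.kill_replica, system.restore_replica)
print(result.hypothesis_held)  # prints True: one replica survives the kill
```

Step 5 (learn) stays with the humans: when hypothesis_held is False, the result is a finding to fix and re-test, not a failed build to ignore.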

Types of failure injection

Different failures test different resilience capabilities:

Infrastructure failures

  • Kill a container or process
  • Fill up disk space
  • Exhaust available memory
  • Introduce network partitions between services

Application-level failures

  • Add latency to database queries
  • Return errors from external API calls
  • Corrupt cached data
  • Inject clock skew between services

Dependency failures

  • Make a third-party service unavailable
  • Throttle message queue throughput
  • Expire all active sessions simultaneously
  • Revoke authentication tokens

Each category reveals different weaknesses. Infrastructure failures test your deployment and orchestration. Application failures test your error handling and circuit breakers. Dependency failures test your degradation strategies.
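To make the circuit-breaker point concrete, here is a minimal sketch of a breaker being exercised by an injected failure. CircuitBreaker, CircuitOpenError, and flaky_api are hypothetical names written for this example, not a library API, and the thresholds are arbitrary.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests need not wait
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls suspended")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0
        return result


def flaky_api():
    """Injected fault: the external API call always errors."""
    raise ConnectionError("injected failure")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for attempt in range(3):
    try:
        breaker.call(flaky_api)
    except CircuitOpenError:
        print("breaker open, call skipped")
    except ConnectionError:
        print("call failed")
```

The experiment here is the loop at the bottom: after two injected failures the third call never reaches the dependency, which is the behavior the hypothesis would predict.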

Python tools for chaos testing

Chaos Toolkit is a widely used open-source chaos engineering framework written in Python. Experiments are declared in JSON or YAML:

{
  "title": "Database failover test",
  "description": "Stopping the primary database should not make the API unhealthy.",
  "steady-state-hypothesis": {
    "title": "API responds normally",
    "probes": [
      {
        "type": "probe",
        "name": "api-health",
        "provider": {
          "type": "http",
          "url": "http://localhost:8000/health"
        },
        "tolerance": 200
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-database-primary",
      "provider": {
        "type": "process",
        "path": "docker",
        "arguments": "stop postgres-primary"
      }
    }
  ]
}

toxiproxy-python is a Python client for Toxiproxy, a TCP proxy that simulates network conditions (latency, bandwidth limits, connection drops) between your Python service and its dependencies.

Fault-injection decorators wrap individual Python functions and raise errors or add delays on demand, so you can test resilience patterns without infrastructure changes.
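A decorator of this kind can be written in a few lines of standard-library Python. The sketch below is an illustration only: inject_fault, fetch_profile, and fetch_profile_with_fallback are hypothetical names, not part of any published package.

```python
import functools
import random
import time


def inject_fault(probability=0.0, exception=None, delay=0.0,
                 rng=random.random, sleep=time.sleep):
    """Before some calls, add latency and/or raise an exception.

    `exception` is an exception class; `rng` and `sleep` are injectable
    so the behavior can be made deterministic in tests.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                if delay:
                    sleep(delay)
                if exception is not None:
                    raise exception("injected fault")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault(probability=1.0, exception=TimeoutError)
def fetch_profile(user_id):
    """Pretend call to a slow dependency; it always times out here."""
    return {"id": user_id}


def fetch_profile_with_fallback(user_id):
    """The resilience pattern under test: degrade instead of crashing."""
    try:
        return fetch_profile(user_id)
    except TimeoutError:
        return {"id": user_id, "degraded": True}


print(fetch_profile_with_fallback(42))
```

In real use the probability would come from configuration and default to zero, so the decorator is inert unless an experiment switches it on.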

Steady state and blast radius

The most important concept in chaos engineering is steady state — the measurable baseline that tells you the system is healthy. Without a clear definition, you can’t tell whether your experiment revealed a problem or not.
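A steady-state definition is only useful if it can be evaluated mechanically. As a rough sketch, the thresholds mentioned earlier (response time under 200 ms, error rate below 0.1%) could be checked against a window of request samples like this. Sample, percentile, and in_steady_state are hypothetical names, and using the 95th percentile for latency is an assumption made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def in_steady_state(samples, max_p95_ms=200.0, max_error_rate=0.001):
    """True if p95 latency and error rate are both under their limits."""
    if not samples:
        return False  # no data: we cannot claim the system is healthy
    p95 = percentile([s.latency_ms for s in samples], 95)
    error_rate = sum(not s.ok for s in samples) / len(samples)
    return p95 < max_p95_ms and error_rate < max_error_rate


window = [Sample(latency_ms=50.0, ok=True)] * 1000
print(in_steady_state(window))
```

Treating an empty window as unhealthy is deliberate: missing metrics should stop an experiment rather than let it pass by default.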

Equally important is blast radius — how much of your system you’re willing to put at risk. Start small:

  • First experiment: inject failure in a development environment
  • Second: target a single instance in staging
  • Third: affect one availability zone in production
  • Eventually: run experiments in production during peak traffic (the level at which Netflix operates)

Never start with a big blast radius. A chaos experiment that takes down production isn’t chaos engineering — it’s an outage you caused.
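One way to enforce a blast radius is to refuse any experiment that targets more instances than the environment allows. The sketch below assumes per-environment caps that are made up for illustration; MAX_PERCENT and select_targets are hypothetical names, and real limits would come from your own risk tolerance.

```python
# Hypothetical caps: the share of instances an experiment may touch.
MAX_PERCENT = {"development": 100, "staging": 50, "production": 10}


def select_targets(instances, environment, requested):
    """Return the first `requested` instances, or raise if that is too many."""
    if environment not in MAX_PERCENT:
        raise ValueError(f"unknown environment: {environment!r}")
    if requested < 1:
        raise ValueError("an experiment needs at least one target")
    # Integer arithmetic avoids floating-point surprises at the boundary.
    allowed = len(instances) * MAX_PERCENT[environment] // 100
    if requested > allowed:
        raise ValueError(
            f"{requested} targets exceeds the limit of {allowed} "
            f"for {environment}"
        )
    return list(instances[:requested])


fleet = [f"web-{i}" for i in range(10)]
print(select_targets(fleet, "production", 1))
```

Failing loudly is the point of the guard: an experiment that asks for too much should stop before any failure is injected.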

Common misconception

People often think chaos testing is only for massive distributed systems like Netflix or Google. In reality, any Python web application with a database, a cache, and an external API call has enough failure modes to benefit from chaos experiments. A single FastAPI service that can’t handle a Redis timeout is just as vulnerable as a microservices architecture.
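The Redis example is easy to try on a small scale. Below is a minimal sketch of a read-through cache that treats a cache timeout as a miss, exercised with an in-memory fake instead of a real Redis client. FakeCache and load_user are hypothetical names, and a real client may raise its own timeout exception type rather than the built-in TimeoutError.

```python
class FakeCache:
    """In-memory stand-in for a cache client; `fail` injects timeouts."""

    def __init__(self):
        self.data = {}
        self.fail = False

    def get(self, key):
        if self.fail:
            raise TimeoutError("injected cache timeout")
        return self.data.get(key)

    def set(self, key, value):
        if self.fail:
            raise TimeoutError("injected cache timeout")
        self.data[key] = value


def load_user(user_id, cache, db):
    """Read-through cache lookup that degrades to the database on timeout."""
    key = f"user:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except TimeoutError:
        pass  # treat a cache failure as a cache miss
    user = db(user_id)
    try:
        cache.set(key, user)
    except TimeoutError:
        pass  # failing to repopulate the cache must not fail the request
    return user


def fake_db(user_id):
    return {"id": user_id}


cache = FakeCache()
cache.fail = True  # inject the failure
print(load_user(1, cache, fake_db))
```

The experiment is the last three lines: with the cache failing, the request should still return the user, only more slowly in a real system.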

When not to use chaos testing

Chaos testing requires observability. If you can’t measure your steady state — if you don’t have metrics, logs, and alerts in place — you won’t learn anything from injecting failures. You’ll just break things and shrug. Set up monitoring first, then start experimenting.

The one thing to remember: chaos testing applies the scientific method. Define what “normal” looks like, hypothesize what will happen when something breaks, inject the failure, and learn from the result.
