Chaos Testing Applications — Deep Dive

Chaos Toolkit deep dive

Chaos Toolkit (chaostoolkit) is a Python CLI and library that runs chaos experiments defined in JSON or YAML. Install it with extensions for your infrastructure:

pip install chaostoolkit chaostoolkit-kubernetes chaostoolkit-aws chaostoolkit-prometheus

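A full experiment declares a steady-state hypothesis, a method, and rollbacks, and runs in four phases: check the hypothesis, apply the method, check the hypothesis again, then roll back: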

# experiments/database-failover.yaml
title: "Database primary failover"
description: "Verify the application survives losing the primary database"

steady-state-hypothesis:
  title: "Application serves requests within SLA"
  probes:
    - name: "api-responds-200"
      type: probe
      provider:
        type: http
        url: "http://app:8000/api/orders"
        method: GET
        timeout: 5
      # For HTTP probes, a scalar tolerance is compared to the status code
      tolerance: 200
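    - name: "p99-latency-under-500ms"
      type: probe
      provider:
        type: python
        module: chaosprometheus.probes
        func: query
        arguments:
          query: "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
      # query returns the raw Prometheus response, so extract the sample value
      # (for example in a small wrapper probe) before applying a numeric range
      tolerance:
        type: range
        range: [0.0, 0.5]

# chaostoolkit-prometheus reads the server address from the configuration block
configuration:
  prometheus_base_url: "http://prometheus:9090"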

method:
  - name: "kill-primary-database"
    type: action
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "role=postgres-primary"
        ns: "production"
        qty: 1
    pauses:
      after: 30  # Wait 30s for failover

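  - name: "verify-writes-still-work"
    type: probe
    provider:
      type: http
      url: "http://app:8000/api/orders"
      method: POST
      headers:
        Content-Type: "application/json"
      # With a JSON content type, arguments are sent as the request body
      arguments:
        item: "chaos-test"
        quantity: 1
      timeout: 10
    # Tolerances are only evaluated in the steady-state hypothesis; in the
    # method this probe's result is simply recorded in the journal
    tolerance: [200, 201]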

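rollbacks:
  # Pods owned by a StatefulSet or Deployment are recreated by their controller,
  # so the rollback only has to make sure the desired replica count is in place.
  # The StatefulSet name "postgres" and the replica count are assumptions.
  - name: "restore-postgres-replicas"
    type: action
    provider:
      type: python
      module: chaosk8s.statefulset.actions
      func: scale_statefulset
      arguments:
        name: "postgres"
        replicas: 2
        ns: "production"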

Run it:

chaos run experiments/database-failover.yaml --journal-path=results/db-failover.json

The journal captures every probe result, action outcome, and timing — essential for post-experiment analysis.
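To make that analysis concrete, here is a minimal sketch that condenses a journal into the fields a review usually starts from. The helper names summarize_journal and load_journal are hypothetical, and the keys it reads (status, deviated, run, activity, duration) are assumptions about the journal layout that should be checked against a journal from your own run.

```python
import json
from pathlib import Path


def summarize_journal(journal: dict) -> dict:
    """Reduce a Chaos Toolkit journal to the fields most useful in a review."""
    activities = [
        {
            "name": entry.get("activity", {}).get("name", "<unnamed>"),
            "status": entry.get("status"),
            "duration": entry.get("duration"),
        }
        for entry in journal.get("run", [])
    ]
    return {
        "status": journal.get("status"),
        "deviated": journal.get("deviated", False),
        # Anything that did not report "succeeded" deserves a closer look
        "failed_activities": [
            a["name"] for a in activities if a["status"] != "succeeded"
        ],
        "activities": activities,
    }


def load_journal(path: str) -> dict:
    """Read a journal file written by `chaos run --journal-path=...`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling summarize_journal(load_journal("results/db-failover.json")) after the run above would give a compact view to paste into the experiment write-up.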

Network chaos with toxiproxy

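Toxiproxy is a TCP proxy that sits between your service and its dependencies, letting you inject network-level failures such as latency, dropped connections, and resets. The tests below assume that proxies named redis and postgres already exist on the Toxiproxy server and that the application connects through them: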

# conftest.py for chaos tests
import pytest
from toxiproxy import Toxiproxy


@pytest.fixture(scope="session")
def toxiproxy_client():
    return Toxiproxy(host="localhost", port=8474)


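@pytest.fixture
def redis_proxy(toxiproxy_client):
    proxy = toxiproxy_client.get_proxy("redis")
    yield proxy
    # Always restore after the test. enable() alone leaves added toxics in
    # place, so reset the server too (assumes the client exposes reset(),
    # which re-enables all proxies and removes all toxics)
    toxiproxy_client.reset()
    proxy.enable()


@pytest.fixture
def postgres_proxy(toxiproxy_client):
    proxy = toxiproxy_client.get_proxy("postgres")
    yield proxy
    toxiproxy_client.reset()
    proxy.enable()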


class TestRedisResilience:
    def test_cache_miss_falls_through(self, redis_proxy, api_client):
        """When Redis is down, requests should hit the database directly."""
        redis_proxy.disable()
        response = api_client.get("/api/products/123")
        assert response.status_code == 200
        assert response.json()["source"] == "database"

    def test_slow_redis_triggers_timeout(self, redis_proxy, api_client):
        """When Redis is slow, the circuit breaker should open."""
        redis_proxy.add_toxic(
            type="latency",
            attributes={"latency": 5000, "jitter": 1000},
        )
        response = api_client.get("/api/products/123")
        assert response.status_code == 200
        assert response.elapsed.total_seconds() < 2.0

    def test_redis_connection_reset(self, redis_proxy, api_client):
        """Random connection resets should not crash the service."""
        redis_proxy.add_toxic(
            type="reset_peer",
            attributes={"timeout": 500},
        )
        # Send 50 requests; none should return a server error (5xx)
        errors = []
        for _ in range(50):
            resp = api_client.get("/api/products/123")
            if resp.status_code >= 500:
                errors.append(resp.status_code)
        assert not errors, f"Got {len(errors)} server errors: {errors}"
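
The first test only passes if the service treats Redis as optional. A minimal sketch of such a read path follows, assuming hypothetical cache and db objects (with get/set and fetch_product methods) and a hypothetical CacheUnavailable exception; none of these names come from the application under test. With redis-py, the exceptions to catch would be redis.exceptions.ConnectionError and redis.exceptions.TimeoutError.

```python
class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when Redis cannot be reached."""


def get_product(product_id, cache, db):
    """Return a product, falling back to the database if the cache fails."""
    try:
        cached = cache.get(product_id)
    except CacheUnavailable:
        cached = None  # Treat an unreachable cache like a miss
    if cached is not None:
        return {**cached, "source": "cache"}

    product = db.fetch_product(product_id)
    try:
        cache.set(product_id, product)
    except CacheUnavailable:
        pass  # A failed cache write must not fail the request
    return {**product, "source": "database"}
```

The source field is what test_cache_miss_falls_through asserts on; keeping it in the response makes the fallback observable from outside the service.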

Custom failure injection in Python

For application-level chaos, inject failures directly into your code using decorators:

# chaos/injectors.py
import functools
import random
import time
from dataclasses import dataclass


@dataclass
class ChaosConfig:
    """Per-injection-point settings; chaos stays off until enabled."""
    enabled: bool = False
    failure_rate: float = 0.0
    latency_ms: int = 0
    exception_type: type = RuntimeError


_chaos_registry: dict[str, ChaosConfig] = {}


def chaos_point(name: str):
    """Mark a function as a chaos injection point."""
    def decorator(func):
        _chaos_registry[name] = ChaosConfig()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            config = _chaos_registry[name]
            if config.enabled:
                if random.random() < config.failure_rate:
                    raise config.exception_type(
                        f"Chaos injection: {name}"
                    )
                if config.latency_ms > 0:
                    time.sleep(config.latency_ms / 1000.0)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def enable_chaos(
    name: str,
    failure_rate: float = 0.1,
    latency_ms: int = 0,
    exception_type: type = RuntimeError,
):
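    config = _chaos_registry.get(name)
    if config is None:
        # Fail loudly: a typo in the name would otherwise leave chaos disabled
        raise KeyError(f"Unknown chaos point: {name}")
    config.enabled = True
    config.failure_rate = failure_rate
    config.latency_ms = latency_ms
    config.exception_type = exception_type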


def disable_chaos(name: str):
    config = _chaos_registry.get(name)
    if config:
        config.enabled = False

Usage in application code:

from chaos.injectors import chaos_point


class PaymentService:
    @chaos_point("payment.charge")
    def charge(self, amount: float, token: str) -> dict:
        return self._gateway.charge(amount, token)

    @chaos_point("payment.refund")
    def refund(self, transaction_id: str) -> dict:
        return self._gateway.refund(transaction_id)

In chaos tests:

from chaos.injectors import enable_chaos, disable_chaos


def test_checkout_retries_on_payment_failure():
    enable_chaos("payment.charge", failure_rate=0.5)
    try:
        result = checkout_service.process(order)
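        # With a 50% failure rate, 3 attempts all fail 12.5% of the time
        # (6.25% with 4), so this assertion is flaky unless the RNG is seeded
        # or the injected failure sequence is made deterministic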
        assert result.status == "completed"
    finally:
        disable_chaos("payment.charge")


def test_checkout_degrades_on_slow_payment():
    enable_chaos("payment.charge", latency_ms=3000)
    try:
        result = checkout_service.process(order)
        assert result.status in ("completed", "pending_payment")
        assert result.elapsed_seconds < 10
    finally:
        disable_chaos("payment.charge")
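
Random failure injection makes the first test probabilistic, so it helps to work out how often it can fail. Assuming each attempt fails independently with probability p, all n attempts fail with probability p^n. The sketch below computes that value and cross-checks it with a seeded simulation; both function names are hypothetical.

```python
import random


def all_attempts_fail_probability(failure_rate: float, attempts: int) -> float:
    """Probability that every one of `attempts` independent calls fails."""
    return failure_rate ** attempts


def simulate(failure_rate: float, attempts: int, trials: int, seed: int = 0) -> float:
    """Estimate the same probability empirically with a seeded RNG."""
    rng = random.Random(seed)
    exhausted = sum(
        all(rng.random() < failure_rate for _ in range(attempts))
        for _ in range(trials)
    )
    return exhausted / trials


print(all_attempts_fail_probability(0.5, 3))  # 0.125
```

A test that fails one run in eight will erode trust in the suite, so either lower the failure rate, raise the attempt count, or drive the injector from a seeded random.Random instance so the failure sequence is reproducible.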

Gameday framework

Structured chaos experiments for teams follow a “gameday” format:

# gameday/runner.py
import datetime
import json
from dataclasses import dataclass, field, asdict
from typing import Callable


@dataclass
class GamedayExperiment:
    name: str
    hypothesis: str
    inject: Callable
    verify: Callable
    rollback: Callable
    blast_radius: str
    owner: str
    results: dict = field(default_factory=dict)


@dataclass
class GamedayReport:
    date: str
    experiments: list[dict]
    participants: list[str]
    findings: list[str]
    action_items: list[str]


def run_gameday(
    experiments: list[GamedayExperiment],
    participants: list[str],
) -> GamedayReport:
    findings = []
    results = []

    for exp in experiments:
        print(f"\n{'='*60}")
        print(f"Experiment: {exp.name}")
        print(f"Hypothesis: {exp.hypothesis}")
        print(f"Blast radius: {exp.blast_radius}")
        print(f"{'='*60}")

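        try:
            exp.inject()
            passed = exp.verify()
            exp.results = {"passed": passed, "error": None}
            if not passed:
                findings.append(
                    f"FAILED: {exp.name} - hypothesis disproved"
                )
        except Exception as e:
            exp.results = {"passed": False, "error": str(e)}
            findings.append(f"ERROR: {exp.name} - {e}")
        finally:
            exp.rollback()

        # Record only plain data; the callables are not JSON-serializable
        results.append({
            "name": exp.name,
            "hypothesis": exp.hypothesis,
            "blast_radius": exp.blast_radius,
            "owner": exp.owner,
            "results": exp.results,
        })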

    return GamedayReport(
        date=datetime.date.today().isoformat(),
        experiments=results,
        participants=participants,
        findings=findings,
        action_items=[],  # Filled in during post-mortem
    )
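
A report is most useful once it is written somewhere the team can read it. The sketch below serializes any dataclass report to JSON; report_to_json and ExampleReport are hypothetical names, ExampleReport is a stand-in that mirrors the fields of GamedayReport, and the sample values are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


def report_to_json(report) -> str:
    """Serialize a dataclass report to indented JSON."""
    if not is_dataclass(report) or isinstance(report, type):
        raise TypeError("report must be a dataclass instance")
    # default=str stringifies anything json cannot encode (e.g. callables)
    return json.dumps(asdict(report), indent=2, default=str)


@dataclass
class ExampleReport:
    """Stand-in with the same fields as GamedayReport, for illustration."""
    date: str
    experiments: list
    participants: list
    findings: list
    action_items: list


example = ExampleReport(
    date="2024-01-10",
    experiments=[{"name": "db-failover", "results": {"passed": True, "error": None}}],
    participants=["alice", "bob"],
    findings=[],
    action_items=[],
)
print(report_to_json(example))
```

The default=str fallback is a deliberate trade-off: the dump never fails on an unexpected value, at the cost of turning that value into an opaque string.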

CI integration for automated chaos

Run chaos experiments automatically against staging, either on a schedule or on demand:

# .github/workflows/chaos.yml
name: Chaos Tests
on:
  workflow_dispatch:
  schedule:
    - cron: '0 10 * * 3'  # Wednesdays at 10:00 UTC

jobs:
  chaos-staging:
    runs-on: ubuntu-latest
    environment: staging
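    steps:
      - uses: actions/checkout@v4
      - run: pip install chaostoolkit chaostoolkit-kubernetes chaostoolkit-prometheus

      # KUBECONFIG must point to a file, so write the secret (assumed to hold
      # the kubeconfig contents) to disk and export its path for later steps
      - name: Configure cluster access
        env:
          KUBECONFIG_CONTENTS: ${{ secrets.STAGING_KUBECONFIG }}
        run: |
          mkdir -p results
          printf '%s' "$KUBECONFIG_CONTENTS" > "$RUNNER_TEMP/kubeconfig"
          echo "KUBECONFIG=$RUNNER_TEMP/kubeconfig" >> "$GITHUB_ENV"

      - name: Run database failover experiment
        run: chaos run experiments/database-failover.yaml --journal-path=results/db-failover.json

      - name: Run network partition experiment
        run: chaos run experiments/network-partition.yaml --journal-path=results/network-partition.json

      - name: Upload journals
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-journals
          path: results/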

      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -d '{"text":"Chaos experiment failed in staging"}'

Observability during experiments

Without metrics, chaos testing is just breaking things. Essential instrumentation:

# middleware/chaos_metrics.py
import time
from prometheus_client import Counter, Histogram, Gauge

REQUESTS_TOTAL = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "Request duration",
    ["method", "endpoint"],
)
CIRCUIT_BREAKER_STATE = Gauge(
    "circuit_breaker_state",
    "Circuit breaker state (0=closed, 1=open, 2=half-open)",
    ["service"],
)
ERROR_RATE = Gauge(
    "error_rate_percent",
    "Rolling error rate",
    ["service"],
)
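
The error_rate_percent gauge needs something to compute the rolling value it reports. One possible approach, sketched below with a hypothetical RollingErrorRate class, keeps the outcomes of the most recent requests in a fixed-size window; counting only 5xx responses as errors is an assumption.

```python
from collections import deque


class RollingErrorRate:
    """Error percentage over the most recent `window` requests."""

    def __init__(self, window: int = 100):
        # True means the request failed; old entries fall off automatically
        self._outcomes = deque(maxlen=window)

    def record(self, status_code: int) -> float:
        """Record one response and return the updated error rate in percent."""
        self._outcomes.append(status_code >= 500)
        return 100.0 * sum(self._outcomes) / len(self._outcomes)
```

In request middleware, the returned value could be fed straight into the gauge, for example ERROR_RATE.labels(service="orders").set(tracker.record(status)), where the service label value is illustrative.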

During a chaos experiment, watch these metrics on a dashboard in real time. The steady-state hypothesis should reference specific metric thresholds (for example, p99 latency under 500 ms, as in the failover experiment above), and the experiment should fail automatically if a metric breaches its threshold.
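
A threshold check ultimately has to turn a Prometheus response into a number. The sketch below parses the JSON returned by an instant query (the /api/v1/query endpoint) and tests it against a range. Both function names are hypothetical, and taking only the first series is a simplification that assumes the query aggregates down to a single result.

```python
def first_sample_value(response: dict) -> float:
    """Extract the first sample from a Prometheus instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = response["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    # Each vector sample is [unix_timestamp, "value-as-string"]
    _timestamp, raw_value = result[0]["value"]
    return float(raw_value)


def within_threshold(response: dict, lower: float, upper: float) -> bool:
    """True when the sampled value lies inside the closed range [lower, upper]."""
    value = first_sample_value(response)
    return lower <= value <= upper
```

histogram_quantile yields NaN when there is no traffic in the window; NaN fails both comparisons, so within_threshold reports a breach in that case, which is usually the safer default during an experiment.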

The one thing to remember: Effective chaos engineering combines declarative experiment definitions, network-level failure injection, application-level chaos points, structured gamedays, and always — always — observability to measure the impact.
