Chaos Testing Applications — Deep Dive
Chaos Toolkit deep dive
Chaos Toolkit (chaostoolkit) is a Python CLI and library that runs chaos experiments defined in JSON or YAML. Install it with extensions for your infrastructure:
pip install chaostoolkit chaostoolkit-kubernetes chaostoolkit-aws chaostoolkit-prometheus
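A full experiment runs in four phases: the steady-state hypothesis is checked, the method (actions and probes) runs, the hypothesis is checked again, and the rollbacks execute: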
# experiments/database-failover.yaml
title: "Database primary failover"
description: "Verify the application survives losing the primary database"

# chaostoolkit-prometheus reads the server address from configuration
configuration:
  prometheus_base_url: "http://prometheus:9090"

steady-state-hypothesis:
  title: "Application serves requests within SLA"
  probes:
    - name: "api-responds-200"
      type: probe
      provider:
        type: http
        url: "http://app:8000/api/orders"
        method: GET
        timeout: 5
      tolerance: 200
    - name: "p99-latency-under-500ms"
      type: probe
      provider:
        type: python
        module: chaosprometheus.probes
        func: query
        arguments:
          query: "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
      # query returns the raw Prometheus JSON response, so a numeric range
      # check may need a jsonpath tolerance or a small wrapper probe
      tolerance:
        type: range
        range: [0.0, 0.5]

method:
  - name: "kill-primary-database"
    type: action
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "role=postgres-primary"
        ns: "production"
        qty: 1
    pauses:
      after: 30  # Wait 30s for failover
  - name: "verify-writes-still-work"
    type: probe
    provider:
      type: http
      url: "http://app:8000/api/orders"
      method: POST
      headers:
        Content-Type: "application/json"
      # With a JSON content type, arguments are sent as the request body
      arguments:
        item: "chaos-test"
        quantity: 1
      timeout: 10
    # Tolerances are only evaluated in the steady-state hypothesis; inside
    # the method the response is just recorded in the journal
    tolerance: [200, 201]

rollbacks:
  # Pods owned by a StatefulSet or Deployment are recreated automatically;
  # this rollback only re-asserts the replica count.
  # Assumption: the primary runs in a StatefulSet named "postgres-primary".
  - name: "restart-primary"
    type: action
    provider:
      type: python
      module: chaosk8s.statefulset.actions
      func: scale_statefulset
      arguments:
        name: "postgres-primary"
        replicas: 1
        ns: "production"
Run it:
chaos run experiments/database-failover.yaml --journal-path=results/db-failover.json
The journal captures every probe result, action outcome, and timing — essential for post-experiment analysis.
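As a rough illustration of that post-experiment analysis, the sketch below condenses a journal into the fields usually reviewed first. The key names (`status`, `deviated`, `steady_states`, `run`, `activity`) follow the journal layout as I understand it and should be checked against a journal produced by your installed version; `summarize_journal` and `load_and_summarize` are hypothetical helper names, not part of Chaos Toolkit.

```python
import json
from pathlib import Path


def summarize_journal(journal: dict) -> dict:
    """Condense a Chaos Toolkit journal into the fields worth reviewing first.

    Key names are assumptions about the journal layout; missing keys are
    tolerated so a partial or aborted journal still produces a summary.
    """
    steady = journal.get("steady_states") or {}
    before = steady.get("before") or {}
    after = steady.get("after") or {}

    activities = []
    for run in journal.get("run", []):
        activity = run.get("activity", {})
        activities.append(
            {
                "name": activity.get("name"),
                "type": activity.get("type"),
                "status": run.get("status"),
                "duration_s": run.get("duration"),
            }
        )

    return {
        "status": journal.get("status"),
        "deviated": journal.get("deviated", False),
        "steady_before": before.get("steady_state_met"),
        "steady_after": after.get("steady_state_met"),
        "activities": activities,
    }


def load_and_summarize(path: str) -> dict:
    """Read a journal file from disk and summarize it."""
    return summarize_journal(json.loads(Path(path).read_text()))
```

Calling `load_and_summarize("results/db-failover.json")` after the run above would give a small dictionary that is easy to print in CI logs or attach to a post-mortem.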
Network chaos with toxiproxy
Toxiproxy is a TCP proxy that sits between your service and its dependencies, letting you inject network-level failures such as added latency, disabled connections, and connection resets:
# conftest.py for chaos tests
import pytest
from toxiproxy import Toxiproxy


@pytest.fixture(scope="session")
def toxiproxy_client():
    return Toxiproxy(host="localhost", port=8474)


def _restore(proxy):
    # Always restore after a test: remove any toxics it added, then re-enable.
    # toxics()/destroy_toxic() follow the toxiproxy-python client; adjust if
    # your client version differs.
    for toxic_name in list(proxy.toxics()):
        proxy.destroy_toxic(toxic_name)
    proxy.enable()


@pytest.fixture
def redis_proxy(toxiproxy_client):
    proxy = toxiproxy_client.get_proxy("redis")
    yield proxy
    _restore(proxy)


@pytest.fixture
def postgres_proxy(toxiproxy_client):
    proxy = toxiproxy_client.get_proxy("postgres")
    yield proxy
    _restore(proxy)


# test_redis_resilience.py (pytest does not collect tests from conftest.py)
# api_client is assumed to be a requests-style HTTP client fixture defined elsewhere.
class TestRedisResilience:
    def test_cache_miss_falls_through(self, redis_proxy, api_client):
        """When Redis is down, requests should hit the database directly."""
        redis_proxy.disable()
        response = api_client.get("/api/products/123")
        assert response.status_code == 200
        assert response.json()["source"] == "database"

    def test_slow_redis_triggers_timeout(self, redis_proxy, api_client):
        """When Redis is slow, the circuit breaker should open."""
        redis_proxy.add_toxic(
            type="latency",
            attributes={"latency": 5000, "jitter": 1000},
        )
        response = api_client.get("/api/products/123")
        assert response.status_code == 200
        assert response.elapsed.total_seconds() < 2.0

    def test_redis_connection_reset(self, redis_proxy, api_client):
        """Random connection resets should not crash the service."""
        redis_proxy.add_toxic(
            type="reset_peer",
            attributes={"timeout": 500},
        )
        # Send 50 requests; none should return a 500
        errors = []
        for _ in range(50):
            resp = api_client.get("/api/products/123")
            if resp.status_code == 500:
                errors.append(resp.text)  # an error body may not be valid JSON
        assert len(errors) == 0, f"Got {len(errors)} server errors"
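These tests presuppose that the service falls back to the database when the cache misbehaves and stops calling a cache that keeps failing. The sketch below shows one way that behavior could look; it is not taken from the application under test. `CircuitBreaker`, `get_product`, `cache_get`, and `db_get` are hypothetical names, and a real Redis client raises its own exception types rather than the built-in `ConnectionError` and `TimeoutError` used here.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures and allows a retry
    once `reset_after` seconds have passed (a simple half-open state)."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls go through
        # Open: only allow a trial call after the cool-down period.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def get_product(product_id, cache_get, db_get, breaker):
    """Return (product, source). Cache errors fall through to the database."""
    if breaker.allow():
        try:
            cached = cache_get(product_id)
            breaker.record_success()
            if cached is not None:
                return cached, "cache"
        except (ConnectionError, TimeoutError):
            breaker.record_failure()
    return db_get(product_id), "database"
```

Injecting the clock keeps the breaker testable without real sleeps, which matters when the chaos tests themselves already add seconds of latency.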
Custom failure injection in Python
For application-level chaos, inject failures directly into your code using decorators:
# chaos/injectors.py
import functools
import random
import time
from dataclasses import dataclass


@dataclass
class ChaosConfig:
    enabled: bool = False
    failure_rate: float = 0.0
    latency_ms: int = 0
    exception_type: type[Exception] = RuntimeError


_chaos_registry: dict[str, ChaosConfig] = {}


def chaos_point(name: str):
    """Mark a function as a chaos injection point."""
    def decorator(func):
        # Registered when the module is imported; disabled until enable_chaos()
        _chaos_registry[name] = ChaosConfig()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            config = _chaos_registry[name]
            if config.enabled:
                if random.random() < config.failure_rate:
                    raise config.exception_type(f"Chaos injection: {name}")
                if config.latency_ms > 0:
                    time.sleep(config.latency_ms / 1000.0)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def enable_chaos(
    name: str,
    failure_rate: float = 0.1,
    latency_ms: int = 0,
    exception_type: type[Exception] = RuntimeError,
) -> None:
    config = _chaos_registry.get(name)
    if config is None:
        # A typo would otherwise let a chaos test pass without injecting anything
        raise KeyError(f"Unknown chaos point: {name}")
    config.enabled = True
    config.failure_rate = failure_rate
    config.latency_ms = latency_ms
    config.exception_type = exception_type


def disable_chaos(name: str) -> None:
    config = _chaos_registry.get(name)
    if config:
        config.enabled = False
Usage in application code:
from chaos.injectors import chaos_point


class PaymentService:
    def __init__(self, gateway):
        # gateway is whatever payment client the application already uses
        self._gateway = gateway

    @chaos_point("payment.charge")
    def charge(self, amount: float, token: str) -> dict:
        return self._gateway.charge(amount, token)

    @chaos_point("payment.refund")
    def refund(self, transaction_id: str) -> dict:
        return self._gateway.refund(transaction_id)
In chaos tests:
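from chaos.injectors import enable_chaos, disable_chaos


# checkout_service and order are assumed to be pytest fixtures defined elsewhere.
def test_checkout_retries_on_payment_failure(checkout_service, order):
    enable_chaos("payment.charge", failure_rate=0.5)
    try:
        result = checkout_service.process(order)
        # With a 50% failure rate and 3 retries (4 attempts), all attempts
        # still fail about 6% of the time, so expect occasional flakes
        # unless the random module is seeded or the retry budget is raised.
        assert result.status == "completed"
    finally:
        disable_chaos("payment.charge")


def test_checkout_degrades_on_slow_payment(checkout_service, order):
    # failure_rate defaults to 0.1, so set it to 0 to inject latency only
    enable_chaos("payment.charge", failure_rate=0.0, latency_ms=3000)
    try:
        result = checkout_service.process(order)
        assert result.status in ("completed", "pending_payment")
        assert result.elapsed_seconds < 10
    finally:
        disable_chaos("payment.charge")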
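Because `chaos_point` draws a fresh random number on every call, a retry test with a non-zero failure rate can fail purely by chance. The arithmetic is sketched below, assuming injected failures are independent and that "3 retries" means one initial attempt plus three retries; both function names are hypothetical.

```python
def all_attempts_fail_probability(failure_rate: float, retries: int) -> float:
    """Chance that the first attempt and every retry all hit an injected failure."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if retries < 0:
        raise ValueError("retries must be non-negative")
    attempts = retries + 1
    return failure_rate ** attempts


def retries_for_flake_rate(failure_rate: float, max_flake_rate: float) -> int:
    """Smallest retry count that keeps the chance of total failure at or
    below max_flake_rate."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    if not 0.0 < max_flake_rate <= 1.0:
        raise ValueError("max_flake_rate must be in (0, 1]")
    retries = 0
    while failure_rate ** (retries + 1) > max_flake_rate:
        retries += 1
    return retries


print(all_attempts_fail_probability(0.5, 3))  # 0.0625
print(retries_for_flake_rate(0.5, 0.001))     # 9
```

At a 50% failure rate, three retries leave a 6.25% chance that the test fails even though the retry logic is correct. Lowering the injected rate, seeding `random`, or asserting on the number of attempts instead of the final status are all ways to make such a test deterministic.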
Gameday framework
Structured chaos experiments for teams follow a “gameday” format:
# gameday/runner.py
import datetime
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GamedayExperiment:
    name: str
    hypothesis: str
    inject: Callable[[], None]
    verify: Callable[[], bool]
    rollback: Callable[[], None]
    blast_radius: str
    owner: str
    results: dict = field(default_factory=dict)

    def to_record(self) -> dict:
        """JSON-serializable summary; the callables are left out on purpose."""
        return {
            "name": self.name,
            "hypothesis": self.hypothesis,
            "blast_radius": self.blast_radius,
            "owner": self.owner,
            "results": self.results,
        }


@dataclass
class GamedayReport:
    date: str
    experiments: list[dict]
    participants: list[str]
    findings: list[str]
    action_items: list[str]


def run_gameday(
    experiments: list[GamedayExperiment],
    participants: list[str],
) -> GamedayReport:
    findings = []
    results = []
    for exp in experiments:
        print(f"\n{'=' * 60}")
        print(f"Experiment: {exp.name}")
        print(f"Hypothesis: {exp.hypothesis}")
        print(f"Blast radius: {exp.blast_radius}")
        print(f"{'=' * 60}")
        try:
            exp.inject()
            passed = exp.verify()
            exp.results = {"passed": passed, "error": None}
            if not passed:
                findings.append(f"FAILED: {exp.name} - hypothesis disproved")
        except Exception as e:
            exp.results = {"passed": False, "error": str(e)}
            findings.append(f"ERROR: {exp.name} - {e}")
        finally:
            # A failing rollback is recorded instead of aborting the gameday
            try:
                exp.rollback()
            except Exception as e:
                findings.append(f"ROLLBACK FAILED: {exp.name} - {e}")
        results.append(exp.to_record())
    return GamedayReport(
        date=datetime.date.today().isoformat(),
        experiments=results,
        participants=participants,
        findings=findings,
        action_items=[],  # Filled in during post-mortem
    )
CI integration for automated chaos
Run chaos experiments against staging on a schedule or on demand, so regressions in resilience surface before they reach production:
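# .github/workflows/chaos.yml
name: Chaos Tests
on:
  workflow_dispatch:
  schedule:
    - cron: '0 10 * * 3'  # Wednesdays at 10:00 UTC
jobs:
  chaos-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install chaostoolkit chaostoolkit-kubernetes chaostoolkit-prometheus
      - name: Write kubeconfig
        # KUBECONFIG must point to a file; this assumes the secret holds the file contents
        run: |
          printf '%s' "$STAGING_KUBECONFIG" > "$RUNNER_TEMP/kubeconfig"
          echo "KUBECONFIG=$RUNNER_TEMP/kubeconfig" >> "$GITHUB_ENV"
          mkdir -p results
        env:
          STAGING_KUBECONFIG: ${{ secrets.STAGING_KUBECONFIG }}
      - name: Run database failover experiment
        run: chaos run experiments/database-failover.yaml --journal-path=results/db-failover.json
      - name: Run network partition experiment
        run: chaos run experiments/network-partition.yaml --journal-path=results/network-partition.json
      - name: Upload journals
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-journals
          path: results/
      - name: Notify on failure
        if: failure()
        run: |
          curl -X POST -H 'Content-Type: application/json' \
            -d '{"text":"Chaos experiment failed in staging"}' \
            "${{ secrets.SLACK_WEBHOOK }}"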
Observability during experiments
Without metrics, chaos testing is just breaking things. Essential instrumentation:
# middleware/chaos_metrics.py
import time
from prometheus_client import Counter, Histogram, Gauge

REQUESTS_TOTAL = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "Request duration",
    ["method", "endpoint"],
)

CIRCUIT_BREAKER_STATE = Gauge(
    "circuit_breaker_state",
    "Circuit breaker state (0=closed, 1=open, 2=half-open)",
    ["service"],
)

ERROR_RATE = Gauge(
    "error_rate_percent",
    "Rolling error rate",
    ["service"],
)
During a chaos experiment, watch these metrics on a dashboard in real time. The steady-state hypothesis should reference specific metric thresholds (for example, p99 latency under 0.5 seconds), and the experiment should fail automatically if a metric breaches its threshold.
The one thing to remember: Effective chaos engineering combines declarative experiment definitions, network-level failure injection, application-level chaos points, structured gamedays, and, above all, observability to measure the impact.