Python Load Testing with Locust — Deep Dive

Build production-grade load test suites with Locust — custom user classes, distributed execution, CI integration, and performance analysis patterns.

Building a realistic load test

A production-quality Locust test models actual user behavior, not just endpoint hammering:

# locustfile.py
from locust import HttpUser, task, between, tag

class WebsiteUser(HttpUser):
    """Simulates a typical e-commerce browsing session."""
    wait_time = between(1, 5)  # 1-5 seconds between actions
    
    def on_start(self):
        """Login when the user starts."""
        response = self.client.post("/api/auth/login", json={
            "email": "loadtest@example.com",
            "password": "test-password-123",
        })
        self.token = response.json().get("token", "")
        self.client.headers.update({"Authorization": f"Bearer {self.token}"})
    
    @tag("browse")
    @task(5)  # weight 5: picked 5x as often as a weight-1 task
    def browse_products(self):
        self.client.get("/api/products?page=1&limit=20")
    
    @tag("browse")
    @task(3)
    def view_product_detail(self):
        product_id = self._random_product_id()
        self.client.get(f"/api/products/{product_id}")
    
    @tag("purchase")
    @task(1)
    def add_to_cart(self):
        product_id = self._random_product_id()
        self.client.post("/api/cart/items", json={
            "product_id": product_id,
            "quantity": 1,
        })
    
    @tag("purchase")
    @task(1)
    def checkout(self):
        with self.client.post(
            "/api/orders",
            json={"payment_method": "test"},
            catch_response=True
        ) as response:
            if response.status_code == 201:
                response.success()
            elif response.status_code == 409:
                response.failure("Cart was empty")
            else:
                response.failure(f"Unexpected: {response.status_code}")
    
    def _random_product_id(self) -> int:
        import random
        return random.randint(1, 1000)

Task weights model real behavior: users browse much more than they buy. The @tag decorator lets you run subsets of tasks (locust --tags browse for browse-only tests).
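To see what those weights mean in practice, the selection shares can be worked out by hand. The following is a small standalone sketch, not part of Locust: the task_probabilities helper is a hypothetical name, and it assumes each task is picked with probability proportional to its weight.

```python
def task_probabilities(weights: dict[str, int]) -> dict[str, float]:
    """Return each task's share of picks, assuming weight-proportional selection."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Weights copied from the WebsiteUser class above.
shares = task_probabilities({
    "browse_products": 5,
    "view_product_detail": 3,
    "add_to_cart": 1,
    "checkout": 1,
})

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

With these weights, browsing tasks together account for 8 of every 10 picks, so a checkout is attempted on roughly one task in ten.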

Custom user types for mixed workloads

Real applications serve different types of users with different behaviors:

from locust import HttpUser, task, between, constant

class APIConsumer(HttpUser):
    """Simulates backend-to-backend API calls."""
    wait_time = constant(0.1)  # API clients are fast
    weight = 3  # 3x more API consumers than admin users
    
    @task
    def fetch_data(self):
        self.client.get("/api/v1/data/export", 
                       headers={"X-API-Key": "test-key"})

class AdminUser(HttpUser):
    """Simulates admin dashboard usage."""
    wait_time = between(3, 10)  # Admins read dashboards slowly
    weight = 1
    
    @task(3)
    def view_dashboard(self):
        self.client.get("/admin/dashboard")
    
    @task(1)
    def generate_report(self):
        self.client.post("/admin/reports/generate", json={
            "type": "monthly",
            "format": "csv",
        })

class MobileUser(HttpUser):
    """Simulates mobile app API calls."""
    wait_time = between(2, 8)
    weight = 6  # Most traffic comes from mobile
    
    def on_start(self):
        self.client.headers.update({
            "User-Agent": "MyApp/2.1 (iOS 17.4)",
            "Accept": "application/json",
        })
    
    @task
    def sync_feed(self):
        self.client.get("/api/mobile/feed?since=2026-03-01")

The weight attribute on each user class controls the ratio of user types. With weights 3:1:6, a 1000-user test runs 300 API consumers, 100 admin users, and 600 mobile users.
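The arithmetic behind that split can be sketched as follows. This is an illustration only: allocate_users is a hypothetical helper, and the largest-remainder rounding it uses is an assumption for totals that do not divide evenly. Locust's own dispatcher may round differently.

```python
def allocate_users(weights: dict[str, int], total_users: int) -> dict[str, int]:
    """Split total_users across user classes in proportion to weight.

    Uses largest-remainder rounding so the counts always sum to total_users.
    """
    weight_sum = sum(weights.values())
    exact = {name: total_users * w / weight_sum for name, w in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total_users - sum(counts.values())
    # Hand the remaining users to the classes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


WEIGHTS = {"APIConsumer": 3, "AdminUser": 1, "MobileUser": 6}
print(allocate_users(WEIGHTS, 1000))
```

Small totals are where the rounding matters: with only 7 users, the exact shares are 2.1, 0.7, and 4.2, so at least one class cannot match its weight exactly.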

Distributed execution architecture

For large-scale tests, Locust runs in master-worker mode:

# Master node (coordinates workers, serves web UI)
# Note: --expect-workers only takes effect together with --headless or --autostart
locust --master --expect-workers=4

# Worker nodes (generate actual load)
locust --worker --master-host=master-ip
locust --worker --master-host=master-ip
locust --worker --master-host=master-ip
locust --worker --master-host=master-ip

For cloud deployment, containerize the workers:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY locustfile.py .
CMD ["locust", "--worker", "--master-host", "locust-master"]

# docker-compose.yml for local distributed testing
services:
  master:
    build: .
    command: locust --master --expect-workers=4
    ports:
      - "8089:8089"
  worker:
    build: .
    command: locust --worker --master-host=master
    deploy:
      replicas: 4

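As a rough guide, each worker can handle approximately 5,000-10,000 simulated users, depending on test complexity and hardware, so four workers on modest machines can simulate 20,000-40,000 concurrent users. A worker is a single Python process, so run one worker per CPU core and watch worker CPU usage during the test: a saturated worker distorts the measurements.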
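For sizing a worker pool, a back-of-the-envelope calculation is usually enough. The sketch below uses the per-worker range quoted above. The workers_needed helper and its 20% headroom default are assumptions for illustration, not Locust settings.

```python
import math


def workers_needed(target_users: int, users_per_worker: int, headroom: float = 0.2) -> int:
    """Workers required for target_users, keeping `headroom` spare capacity per worker."""
    usable = users_per_worker * (1 - headroom)
    return math.ceil(target_users / usable)


# Pessimistic and optimistic per-worker capacity for a 30,000-user test.
pessimistic = workers_needed(30_000, 5_000)
optimistic = workers_needed(30_000, 10_000)
print(pessimistic, optimistic)
```

Planning for the pessimistic figure is the safer choice, since an under-provisioned load generator looks like a slow system under test.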

Headless execution and CI integration

For CI pipelines, run Locust without the web UI:

locust --headless \
  --users 500 \
  --spawn-rate 50 \
  --run-time 5m \
  --host https://staging.example.com \
  --csv results/loadtest \
  --html results/report.html

Parse the CSV output for automated pass/fail decisions:

# scripts/check_load_results.py
import csv
import sys

def check_results(csv_path: str) -> bool:
    """Fail CI if performance thresholds are exceeded."""
    thresholds = {
        "p95_response_time": 2000,  # ms
        "failure_rate": 0.01,       # 1%
        "avg_response_time": 500,   # ms
    }
    
    with open(f"{csv_path}_stats.csv") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row["Name"] == "Aggregated":
                p95 = float(row["95%"])
                avg = float(row["Average Response Time"])
                failures = int(row["Failure Count"])
                total = int(row["Request Count"])
                failure_rate = failures / total if total > 0 else 0
                
                if p95 > thresholds["p95_response_time"]:
                    print(f"FAIL: p95 response time {p95}ms > {thresholds['p95_response_time']}ms")
                    return False
                if failure_rate > thresholds["failure_rate"]:
                    print(f"FAIL: failure rate {failure_rate:.2%} > {thresholds['failure_rate']:.0%}")
                    return False
                if avg > thresholds["avg_response_time"]:
                    print(f"FAIL: avg response time {avg}ms > {thresholds['avg_response_time']}ms")
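                    return False
                print("PASS: All performance thresholds met")
                return True

    # Reaching this point means the CSV had no Aggregated row, so nothing was checked.
    print("FAIL: no 'Aggregated' row found in stats CSV")
    return False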

if __name__ == "__main__":
    sys.exit(0 if check_results(sys.argv[1]) else 1)
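The aggregated row can hide a single slow endpoint behind many fast ones, so a per-endpoint check is a natural extension. The following is a minimal sketch under stated assumptions: slow_endpoints is a hypothetical helper, and SAMPLE is made-up data that uses only a subset of the stats CSV columns.

```python
import csv
import io

# Made-up sample using a subset of the columns read by the script above.
SAMPLE = """Type,Name,Request Count,Failure Count,Average Response Time,95%
GET,/api/products,1200,3,180.5,420
POST,/api/orders,150,9,950.2,2600
,Aggregated,1350,12,266.0,900
"""


def slow_endpoints(stats_file, p95_limit_ms: float) -> list[str]:
    """Return endpoints whose p95 exceeds the limit, skipping the Aggregated row."""
    offenders = []
    for row in csv.DictReader(stats_file):
        if row["Name"] == "Aggregated":
            continue
        if float(row["95%"]) > p95_limit_ms:
            offenders.append(f'{row["Type"]} {row["Name"]}')
    return offenders


print(slow_endpoints(io.StringIO(SAMPLE), p95_limit_ms=2000))
```

In this sample the aggregated p95 of 900 ms passes a 2000 ms threshold even though the order endpoint does not, which is the case the per-endpoint check is meant to catch.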

Custom event hooks for advanced monitoring

Locust provides event hooks for custom metrics and integrations:

from locust import events

@events.request.add_listener
def on_request(request_type, name, response_time, response_length, 
               response, exception, context, **kwargs):
    """Send metrics to external monitoring."""
    if response_time > 5000:
        print(f"SLOW REQUEST: {name} took {response_time}ms")
    
    if exception:
        print(f"FAILED: {name} - {exception}")

@events.test_start.add_listener
def on_test_start(environment, **kwargs):
    print(f"Load test starting against {environment.host}")
    print(f"Target: {environment.runner.target_user_count} users")

@events.test_stop.add_listener  
def on_test_stop(environment, **kwargs):
    stats = environment.runner.stats
    total = stats.total
    print(f"Test complete: {total.num_requests} requests, "
          f"{total.num_failures} failures, "
          f"avg {total.avg_response_time:.0f}ms")
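Hooks can also decide the process exit code, which gives CI a pass/fail signal without parsing CSV files. The sketch below assumes the same thresholds as the CSV script above and keeps the decision in a plain function, exit_code_for (a hypothetical name), so it can be exercised without Locust installed. The stand-in environment at the bottom is a test double, not a Locust object.

```python
from types import SimpleNamespace


def exit_code_for(fail_ratio: float, avg_ms: float, p95_ms: float) -> int:
    """Return 1 if any threshold is exceeded, else 0."""
    if fail_ratio > 0.01 or avg_ms > 500 or p95_ms > 2000:
        return 1
    return 0


def on_quitting(environment, **kwargs):
    """Listener body: read aggregate stats and set the process exit code."""
    total = environment.stats.total
    environment.process_exit_code = exit_code_for(
        total.fail_ratio,
        total.avg_response_time,
        total.get_response_time_percentile(0.95),
    )


# In a locustfile this would be registered with:
#   from locust import events
#   events.quitting.add_listener(on_quitting)

# Stand-in environment with made-up numbers, used only to exercise the logic.
fake_total = SimpleNamespace(
    fail_ratio=0.02,
    avg_response_time=310.0,
    get_response_time_percentile=lambda percent: 1200.0,
)
fake_env = SimpleNamespace(stats=SimpleNamespace(total=fake_total), process_exit_code=None)
on_quitting(fake_env)
print(fake_env.process_exit_code)
```

Here the 2% failure ratio alone trips the check, even though both latency figures are within their limits.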

Performance analysis patterns

After running a load test, look for these patterns:

Linear degradation: Response times increase proportionally with users. This usually indicates CPU-bound processing — each request takes fixed time and requests queue up.

Cliff effect: Performance is fine up to N users, then suddenly collapses. This typically means a resource limit was hit — database connection pool exhausted, memory filled, or thread pool saturated.

Sawtooth pattern: Response times spike periodically then recover. Often caused by garbage collection pauses, cache expiration/rebuilds, or background job interference.

Flat then spike: Performance is constant regardless of load until a specific endpoint is hit. That endpoint is the bottleneck — often a database query missing an index, an N+1 query, or an unbounded data fetch.

Each pattern points to a different class of fix. Linear degradation benefits from caching or query optimization. Cliff effects need resource pool tuning. Sawtooth needs GC tuning or cache warming. Flat-then-spike needs endpoint-specific optimization.
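Telling a cliff from linear degradation can be partly automated when results are collected at several load levels. The following is a rough heuristic sketch, not an established method: find_cliff and its jump_factor default of 3.0 are assumptions, and the sample series are illustrative numbers, not measurements.

```python
def find_cliff(samples: list[tuple[int, float]], jump_factor: float = 3.0):
    """Return the user count where p95 first jumps by jump_factor over the previous step.

    samples: (user_count, p95_ms) pairs sorted by user_count.
    Returns None if no step-to-step jump reaches jump_factor.
    """
    for (_, prev_ms), (users, cur_ms) in zip(samples, samples[1:]):
        if prev_ms > 0 and cur_ms / prev_ms >= jump_factor:
            return users
    return None


# Illustrative numbers only, not measurements.
linear = [(100, 120.0), (200, 240.0), (300, 360.0), (400, 480.0)]
cliff = [(100, 120.0), (200, 125.0), (300, 130.0), (400, 2900.0)]

print(find_cliff(linear))
print(find_cliff(cliff))
```

The linear series never triples between steps, so it reports no cliff, while the second series is flagged at the load level where latency collapses. That level is the place to start looking for an exhausted pool or limit.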

One thing to remember: The most valuable load test result isn’t a number — it’s the story of how your system degrades. Understanding whether it degrades gracefully (slows down) or catastrophically (crashes) determines your production reliability.
