FastAPI Deployment to Production — Deep Dive

Optimized Dockerfile

A production Dockerfile balances image size, build speed, and security:

# Stage 1: Build dependencies
FROM python:3.12-slim AS builder

WORKDIR /app
RUN pip install --no-cache-dir poetry

COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.in-project true && \
    poetry install --only main --no-root --no-interaction --no-ansi

# Stage 2: Runtime
FROM python:3.12-slim AS runtime

RUN groupadd -r appuser && useradd -r -g appuser appuser

WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src/ ./src/

ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

USER appuser
EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

Key optimizations:

  • Multi-stage build: Builder stage has poetry and dev tools; runtime stage has only the virtual environment and source code. Image size drops from ~800MB to ~150MB.
  • Non-root user: Limits the damage if the application is compromised. A process that breaks out of the container does not automatically run as root on the host.
  • PYTHONDONTWRITEBYTECODE: Avoids .pyc files cluttering the container.
  • PYTHONUNBUFFERED: Ensures logs appear immediately in docker logs.
  • Layer ordering: Dependencies change less often than code. Putting COPY pyproject.toml before COPY src/ means dependency layers are cached across builds.
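
The inline HEALTHCHECK command above can also live in a small script, which is easier to extend with a timeout and an explicit exit code. A minimal sketch, assuming the app serves /health on port 8000 as in the Dockerfile; the function name check_health and its injectable opener parameter are illustrative and not part of any library:

```python
import urllib.error
import urllib.request


def check_health(url="http://localhost:8000/health", timeout=3.0,
                 opener=urllib.request.urlopen):
    """Return 0 if the endpoint answers with a 2xx status, otherwise 1."""
    try:
        with opener(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return 1
```

A script built on this would finish with raise SystemExit(check_health()), so Docker sees exit code 0 for healthy and 1 for unhealthy. The opener parameter exists only so the function can be exercised without a running server.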

Gunicorn configuration

A production Gunicorn config file (gunicorn.conf.py):

import multiprocessing
import os

# Worker configuration
workers = int(os.environ.get("WEB_CONCURRENCY", multiprocessing.cpu_count() * 2 + 1))
worker_class = "uvicorn.workers.UvicornWorker"
worker_tmp_dir = "/dev/shm"  # Use shared memory for heartbeat files

# Binding
bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"

# Timeouts
timeout = 120           # Kill workers that hang for >120s
graceful_timeout = 30   # Wait 30s for in-flight requests during shutdown
keepalive = 5           # Keep persistent connections alive for 5s

# Logging
accesslog = "-"         # stdout
errorlog = "-"          # stderr
loglevel = os.environ.get("LOG_LEVEL", "info")
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" %(D)s'

# Process naming
proc_name = "fastapi-app"

# Lifecycle hooks
def on_starting(server):
    """Run before workers are forked."""
    pass

def pre_fork(server, worker):
    """Run in master before each worker is forked."""
    pass

def post_fork(server, worker):
    """Run in each worker after fork."""
    pass

def worker_abort(worker):
    """Called when a worker is killed due to timeout."""
    import traceback
    traceback.print_stack()

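worker_tmp_dir = "/dev/shm" matters in Docker. Gunicorn workers signal liveness by updating a temporary heartbeat file. Inside a container, /tmp is usually on the disk-backed overlay filesystem rather than tmpfs, so those writes can stall under I/O pressure and cause the master to kill healthy workers. /dev/shm is memory-backed, which avoids the spurious kills.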
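
The worker formula in the config can overshoot inside containers, because multiprocessing.cpu_count() typically reports the host's cores rather than the container's CPU limit. A minimal sketch of a helper that keeps the (2 × cores) + 1 default, lets WEB_CONCURRENCY override it, and applies an upper bound; resolve_workers and the max_workers cap are hypothetical names, not Gunicorn settings:

```python
import multiprocessing
import os


def resolve_workers(env=None, cpu_count=None, max_workers=8):
    """Pick a Gunicorn worker count.

    WEB_CONCURRENCY wins when set; otherwise use (2 * cores) + 1,
    capped at max_workers (an illustrative default, tune per service).
    """
    env = os.environ if env is None else env
    cores = multiprocessing.cpu_count() if cpu_count is None else cpu_count

    override = env.get("WEB_CONCURRENCY")
    if override:
        return max(1, int(override))

    return max(1, min(cores * 2 + 1, max_workers))


print(resolve_workers(env={}, cpu_count=2))                         # 5
print(resolve_workers(env={}, cpu_count=16))                        # 8 (capped)
print(resolve_workers(env={"WEB_CONCURRENCY": "3"}, cpu_count=16))  # 3
```

In gunicorn.conf.py this would be used as workers = resolve_workers(). Setting WEB_CONCURRENCY explicitly in the container environment remains the most predictable option.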

Nginx reverse proxy configuration

upstream fastapi {
    least_conn;
    server 127.0.0.1:8000;
    # Add more servers for horizontal scaling
    # server 192.168.1.11:8000;
    # server 192.168.1.12:8000;
}

# Rate limiting zone. limit_req_zone is only valid in the http context,
# so it is declared outside the server block.
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    listen 443 ssl http2;
    server_name api.example.com;

    ssl_certificate /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;

    # Security headers
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;
    add_header Strict-Transport-Security "max-age=63072000" always;

    location / {
        limit_req zone=api burst=20 nodelay;

        proxy_pass http://fastapi;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 10s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Buffering (good for most APIs)
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 8k;
    }

    # WebSocket support
    location /ws {
        proxy_pass http://fastapi;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400;
    }

    # Health check (not rate limited: limit_req applies only in location /)
    location /health {
        proxy_pass http://fastapi;
    }
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    server_name api.example.com;
    return 301 https://$host$request_uri;
}

least_conn load balancing sends each request to the server with the fewest active connections, which works better than round-robin when request durations are uneven. It only has an effect once the upstream block lists more than one server.
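
To see why least-connections helps with uneven request durations, here is a small sketch of the two selection rules side by side. It illustrates the idea only and is not Nginx's implementation (which also accounts for server weights); the server names and connection counts are made up:

```python
from itertools import cycle


def pick_least_conn(active):
    """Return the server with the fewest active connections.

    Ties go to the server listed first, since min() keeps the first minimum.
    """
    return min(active, key=active.get)


# Hypothetical snapshot: app-1 is stuck on slow requests.
active = {"app-1": 12, "app-2": 3, "app-3": 4}

round_robin = cycle(active)        # rotates through servers, ignoring load
print(next(round_robin))           # app-1, even though it is the busiest
print(pick_least_conn(active))     # app-2
```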

Kubernetes deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: fastapi-app
  template:
    metadata:
      labels:
        app: fastapi-app
    spec:
      containers:
      - name: app
        image: registry.example.com/fastapi-app:v1.2.3
        ports:
        - containerPort: 8000
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: database-url
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 512Mi
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 20
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]

Key details:

  • readinessProbe: Kubernetes only sends traffic to pods that pass this check. Prevents routing to pods that are still starting.
  • livenessProbe: Kubernetes restarts the container when this check fails failureThreshold times in a row. Catches deadlocked or hung processes.
  • preStop sleep: Gives the load balancer time to remove the pod from rotation before the process shuts down. Without this, in-flight requests get dropped during deploys.
  • Resource limits: Prevents a single pod from consuming all node resources. Set based on profiling, not guesswork.
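
One timing detail is worth checking with arithmetic: the preStop sleep and the application's graceful shutdown both have to fit inside the pod's terminationGracePeriodSeconds, which defaults to 30 seconds when the manifest does not set it (as above). A rough sketch of that budget check, using the 10-second sleep from the manifest and the 30-second graceful_timeout from the Gunicorn config; the function name is illustrative:

```python
def shutdown_fits(grace_period_s, pre_stop_s, graceful_timeout_s):
    """True if preStop plus graceful shutdown fits in the pod's grace period.

    The grace-period clock starts when termination begins, so the preStop
    hook's run time counts against it. Once the period expires, the
    container is sent SIGKILL.
    """
    return pre_stop_s + graceful_timeout_s <= grace_period_s


print(shutdown_fits(30, pre_stop_s=10, graceful_timeout_s=30))  # False
print(shutdown_fits(45, pre_stop_s=10, graceful_timeout_s=30))  # True
```

With the values used in this article, the default grace period is 10 seconds short in the worst case, so either set a larger terminationGracePeriodSeconds in the pod spec or lower graceful_timeout.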

Zero-downtime deployment

The deployment pipeline for zero-downtime updates:

  1. Build and push the new Docker image with a version tag
  2. Run database migrations as a separate Kubernetes Job (before the app update)
  3. Update the Deployment image tag (triggers rolling update)
  4. Kubernetes rolls new pods (respecting maxUnavailable)
  5. New pods pass readiness checks before receiving traffic
  6. Old pods receive SIGTERM, finish in-flight requests during the grace period, then shut down
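
The bounds Kubernetes respects in step 4 follow directly from the Deployment settings. A small sketch of the arithmetic for absolute (non-percentage) values, using replicas: 3, maxUnavailable: 1, and maxSurge: 1 from the manifest above; the function is illustrative:

```python
def rolling_update_bounds(replicas, max_unavailable, max_surge):
    """Return (min_available, max_total) pod counts during a rolling update."""
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


print(rolling_update_bounds(3, max_unavailable=1, max_surge=1))  # (2, 4)
```

So at least two pods are available at every point in the rollout, and at most four exist at once.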

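Gunicorn already treats SIGTERM as a graceful shutdown: the master stops accepting new connections and gives workers up to graceful_timeout to finish in-flight requests. An explicit handler like the one below is only needed in processes Gunicorn does not manage (a standalone worker script, for example); installing it inside a Gunicorn worker can interfere with the worker's own shutdown handling:

import signal

def handle_sigterm(*args):
    """Exit cleanly on SIGTERM so finally blocks and atexit hooks run."""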
    raise SystemExit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

Structured logging

Production logs must be machine-parseable:

import uuid

import structlog
from fastapi import Request

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(),
)

logger = structlog.get_logger()

# Middleware adds request context (app is the FastAPI instance)
@app.middleware("http")
async def log_requests(request: Request, call_next):
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=request.headers.get("X-Request-ID", str(uuid.uuid4())),
        method=request.method,
        path=request.url.path,
    )
    response = await call_next(request)
    logger.info("request_completed", status=response.status_code)
    return response

JSON logs integrate with ELK (Elasticsearch, Logstash, Kibana), Datadog, Grafana Loki, and CloudWatch. Every log line emitted while handling a request carries the request ID, which makes it straightforward to trace a single request through the system.
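
The mechanism behind bind_contextvars is Python's contextvars module: values bound at the start of a request are visible to every log call made while handling it, without being passed around explicitly. A stdlib-only sketch of that idea, which approximates the behavior and is not structlog's implementation; bind, log, and handle_request are hypothetical helpers:

```python
import contextvars
import json
import uuid

# Holds the fields bound for the current request. asyncio tasks run in their
# own copy of the context, so concurrent requests do not overwrite each other.
_log_context = contextvars.ContextVar("log_context", default={})


def bind(**fields):
    """Attach fields to every log line emitted in the current context."""
    _log_context.set({**_log_context.get(), **fields})


def log(event, **fields):
    """Render one JSON log line that merges the bound context."""
    line = json.dumps({**_log_context.get(), "event": event, **fields})
    print(line)
    return line


def handle_request(path):
    _log_context.set({})  # start clean, like clear_contextvars()
    bind(request_id=str(uuid.uuid4()), path=path)
    log("db_query", table="users")
    return log("request_completed", status=200)


handle_request("/users/42")
```

Both printed lines share the same request_id and path even though neither log call received them as arguments.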

Prometheus metrics

import time

from fastapi import Request
from prometheus_client import Counter, Histogram, generate_latest
from starlette.responses import Response

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()

    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path,
    ).observe(duration)

    return response

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type="text/plain")

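Feed these metrics into Grafana dashboards for request rates, error rates, and latency percentiles, and alert on p99 latency breaches and error-rate spikes. Two caveats apply to the middleware as written. The endpoint label is the raw URL path, so routes with path parameters create one time series per distinct value; labeling by route template keeps cardinality bounded. And under Gunicorn each worker process keeps its own counters, so /metrics reports only the worker that served the scrape unless prometheus_client's multiprocess mode is configured.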
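
Latency percentiles are derived from the histogram buckets, which are cumulative: each bucket counts every observation less than or equal to its upper bound. A stdlib-only sketch of that counting with the bucket boundaries used above; it mirrors the idea rather than prometheus_client's internals, and the sample durations are invented:

```python
from bisect import bisect_left

BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def cumulative_counts(durations, buckets=BUCKETS):
    """Map each upper bound (plus +Inf) to the number of observations <= it."""
    per_bucket = [0] * (len(buckets) + 1)  # last slot is +Inf
    for value in durations:
        # bisect_left finds the first bound that is >= value.
        per_bucket[bisect_left(buckets, value)] += 1

    counts, running = {}, 0
    for bound, count in zip([*buckets, float("inf")], per_bucket):
        running += count
        counts[bound] = running
    return counts


samples = [0.004, 0.03, 0.03, 0.2, 0.7, 6.0]  # hypothetical durations (s)
counts = cumulative_counts(samples)
print(counts[0.05], counts[1.0], counts[float("inf")])  # 3 5 6
```

The 6.0-second sample lands only in +Inf, which is why the largest finite bucket should sit above the slowest latency you care to measure.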

Startup and shutdown lifecycle

from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    await init_database()
    await warm_cache()
    logger.info("application_started")
    yield
    # Shutdown
    await close_database_pool()
    await flush_metrics()
    logger.info("application_stopped")

app = FastAPI(lifespan=lifespan)

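The lifespan context manager replaces the deprecated @app.on_event("startup") and @app.on_event("shutdown") handlers. Code after yield runs only if startup completed, so if cleanup must also run when startup fails partway, wrap the startup steps and the yield in try/finally.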
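
Whether cleanup runs after a failed startup depends on how the context manager is written: code after yield is reached only if everything before it succeeded. A stdlib-only sketch of a try/finally arrangement that releases whatever was already acquired; demo_lifespan, its fail_on_cache flag, and the events list are stand-ins for a real lifespan function, pools, and caches:

```python
import asyncio
from contextlib import asynccontextmanager

events = []  # records what happened, for illustration


@asynccontextmanager
async def demo_lifespan(fail_on_cache=False):
    events.append("db_opened")
    try:
        if fail_on_cache:
            raise RuntimeError("cache warm-up failed")
        events.append("cache_warmed")
        yield
    finally:
        # Runs on normal shutdown and when startup fails after the DB opened.
        events.append("db_closed")


async def main():
    async with demo_lifespan():
        events.append("serving")
    try:
        async with demo_lifespan(fail_on_cache=True):
            events.append("serving")  # never reached
    except RuntimeError:
        events.append("startup_failed")


asyncio.run(main())
print(events)
```

In the second run the database is closed even though the application never started serving, which is the behavior the plain yield version does not provide.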

The one thing to remember: Production deployment is a system, not a command — multi-stage Docker builds for small secure images, Gunicorn with proper worker config and /dev/shm heartbeats, Nginx for TLS and rate limiting, structured JSON logging, Prometheus metrics, health-checked rolling deploys, and graceful shutdown handling all work together to keep your API reliable under real-world conditions.
