FastAPI Deployment to Production — Deep Dive
Optimized Dockerfile
A production Dockerfile balances image size, build speed, and security:
# Stage 1: Build dependencies
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir poetry
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.in-project true && \
    poetry install --only main --no-root --no-interaction --no-ansi
# Stage 2: Runtime
FROM python:3.12-slim AS runtime
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY src/ ./src/
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
USER appuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
Key optimizations:
- Multi-stage build: Builder stage has poetry and dev tools; runtime stage has only the virtual environment and source code. Image size drops from ~800MB to ~150MB.
- Non-root user: Limits the damage if the process is compromised; an attacker who breaks out of the container does not automatically hold root on the host.
- PYTHONDONTWRITEBYTECODE: Avoids .pyc files cluttering the container.
- PYTHONUNBUFFERED: Ensures logs appear immediately in docker logs.
- Layer ordering: Dependencies change less often than code. Putting COPY pyproject.toml before COPY src/ means dependency layers are cached across builds.
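The HEALTHCHECK one-liner in the Dockerfile is compact but hard to extend. As a rough sketch, the same check could live in a small script; the names check_health and exit_code are hypothetical, and the sketch assumes /health answers with a 2xx status when the app is ready.

```python
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def exit_code(url: str, timeout: float = 5.0) -> int:
    """Map the check to the exit codes Docker expects: 0 healthy, 1 unhealthy."""
    return 0 if check_health(url, timeout) else 1
```

A script built on this would end with sys.exit(exit_code("http://localhost:8000/health")) and be invoked from HEALTHCHECK in place of the inline python -c command.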
Gunicorn configuration
A production Gunicorn config file (gunicorn.conf.py):
import multiprocessing
import os
# Worker configuration
workers = int(os.environ.get("WEB_CONCURRENCY", multiprocessing.cpu_count() * 2 + 1))
worker_class = "uvicorn.workers.UvicornWorker"
worker_tmp_dir = "/dev/shm" # Use shared memory for heartbeat files
# Binding
bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"
# Timeouts
timeout = 120 # Kill workers that hang for >120s
graceful_timeout = 30 # Wait 30s for in-flight requests during shutdown
keepalive = 5 # Keep persistent connections alive for 5s
# Logging
accesslog = "-" # stdout
errorlog = "-" # stderr
loglevel = os.environ.get("LOG_LEVEL", "info")
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" %(D)s'
# Process naming
proc_name = "fastapi-app"
# Lifecycle hooks
def on_starting(server):
    """Run before workers are forked."""
    pass

def pre_fork(server, worker):
    """Run in master before each worker is forked."""
    pass

def post_fork(server, worker):
    """Run in each worker after fork."""
    pass

def worker_abort(worker):
    """Called when a worker is killed due to timeout."""
    import traceback
    traceback.print_stack()
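worker_tmp_dir = "/dev/shm" matters in Docker. Gunicorn's master detects hung workers through a heartbeat file that each worker updates in a temporary directory. Inside a container, the default /tmp usually sits on the disk-backed overlay filesystem, where a slow write can stall a worker long enough for it to look hung. /dev/shm is memory-backed, so heartbeats stay fast and spurious worker kills are avoided.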
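The workers line in the config combines an environment override with the common 2 x CPUs + 1 rule of thumb. The sketch below isolates that rule so it can be tried with different inputs; resolve_workers and its parameters are hypothetical names introduced only for illustration.

```python
import multiprocessing
import os


def resolve_workers(env=None, cpu_count=None):
    """Mirror the config's rule: WEB_CONCURRENCY wins, else 2 * CPUs + 1."""
    env = os.environ if env is None else env
    cpus = multiprocessing.cpu_count() if cpu_count is None else cpu_count
    return int(env.get("WEB_CONCURRENCY", cpus * 2 + 1))


print(resolve_workers(env={}, cpu_count=4))                        # 9
print(resolve_workers(env={"WEB_CONCURRENCY": "3"}, cpu_count=4))  # 3
```

In a container, multiprocessing.cpu_count() typically reports the host's CPU count rather than the container's CPU limit, which is one reason to set WEB_CONCURRENCY explicitly there.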
Nginx reverse proxy configuration
upstream fastapi {
least_conn;
server 127.0.0.1:8000;
# Add more servers for horizontal scaling
# server 192.168.1.11:8000;
# server 192.168.1.12:8000;
}
# Rate limiting zone (limit_req_zone is only valid in the http context, so it is declared outside the server block)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
server {
listen 443 ssl http2;
server_name api.example.com;
ssl_certificate /etc/letsencrypt/live/api.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
# Security headers
add_header X-Frame-Options DENY;
add_header X-Content-Type-Options nosniff;
add_header Strict-Transport-Security "max-age=63072000" always;
location / {
limit_req zone=api burst=20 nodelay;
proxy_pass http://fastapi;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Timeouts
proxy_connect_timeout 10s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
# Buffering (good for most APIs)
proxy_buffering on;
proxy_buffer_size 4k;
proxy_buffers 8 8k;
}
# WebSocket support
location /ws {
proxy_pass http://fastapi;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 86400;
}
# Health check (no limit_req in this location, so it bypasses the rate limit applied in location /)
location /health {
proxy_pass http://fastapi;
}
}
# Redirect HTTP to HTTPS
server {
listen 80;
server_name api.example.com;
return 301 https://$host$request_uri;
}
least_conn load balancing sends requests to the server with fewest active connections — better than round-robin for uneven request durations.
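To make the difference concrete, here is a toy comparison of the two selection rules. It is a simplified model, not Nginx's implementation (which also accounts for server weights); the server names and connection counts are made up.

```python
from itertools import cycle


def pick_least_conn(active):
    """Return the server with the fewest active connections (ties: first listed)."""
    return min(active, key=active.get)


servers = ["a", "b", "c"]
# Hypothetical snapshot: server "a" is stuck on several slow requests.
active = {"a": 5, "b": 1, "c": 2}

round_robin = cycle(servers)
print(next(round_robin))        # a  (round-robin ignores current load)
print(pick_least_conn(active))  # b
```

Round-robin keeps handing requests to the busy server on its turn, while the least-connections rule steers new work toward whichever server is currently least loaded.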
Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: fastapi-app
  template:
    metadata:
      labels:
        app: fastapi-app
    spec:
      containers:
        - name: app
          image: registry.example.com/fastapi-app:v1.2.3
          ports:
            - containerPort: 8000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
Key details:
- readinessProbe: Kubernetes only sends traffic to pods that pass this check. Prevents routing to pods that are still starting.
- livenessProbe: Kubernetes restarts pods that fail this check. Catches deadlocked or zombie processes.
- preStop sleep: Gives the load balancer time to remove the pod from rotation before the process shuts down. Without this, in-flight requests get dropped during deploys.
- Resource limits: Prevents a single pod from consuming all node resources. Set based on profiling, not guesswork.
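The strategy block in the manifest also bounds how many pods exist while a rollout is in progress. A small sketch of that arithmetic, using absolute numbers only (Kubernetes also accepts percentages, which this hypothetical helper does not handle):

```python
def rolling_update_bounds(replicas, max_unavailable, max_surge):
    """Return (min_available, max_total) pod counts during a rolling update."""
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


print(rolling_update_bounds(replicas=3, max_unavailable=1, max_surge=1))  # (2, 4)
```

With the values from the manifest, at least 2 pods keep serving traffic and at most 4 pods run at once, so the node pool needs headroom for one extra pod's resource requests during deploys.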
Zero-downtime deployment
The deployment pipeline for zero-downtime updates:
- Build and push the new Docker image with a version tag
- Run database migrations as a separate Kubernetes Job (before the app update)
- Update the Deployment image tag (triggers rolling update)
- Kubernetes rolls new pods (respecting maxUnavailable)
- New pods pass readiness checks before receiving traffic
- Old pods receive SIGTERM, finish in-flight requests during the grace period, then shut down
For Gunicorn, handle SIGTERM gracefully:
import signal

def handle_sigterm(*args):
    """Gunicorn sends SIGTERM to workers on shutdown."""
    raise SystemExit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
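Graceful shutdown only works if it fits inside the pod's termination grace period. Kubernetes counts the preStop hook against that period and sends SIGKILL once it expires; the default terminationGracePeriodSeconds is 30. The check below is a sketch with a hypothetical helper name, using the preStop sleep and Gunicorn graceful_timeout values shown earlier.

```python
def shutdown_fits(pre_stop_sleep, graceful_timeout, termination_grace_period=30):
    """True if preStop plus graceful shutdown finish before SIGKILL is sent."""
    return pre_stop_sleep + graceful_timeout <= termination_grace_period


print(shutdown_fits(pre_stop_sleep=10, graceful_timeout=30))                               # False
print(shutdown_fits(pre_stop_sleep=10, graceful_timeout=30, termination_grace_period=45))  # True
```

With a 10 s preStop sleep and a 30 s graceful_timeout, the default grace period is too short, so either terminationGracePeriodSeconds should be raised in the pod spec or one of the two timeouts lowered.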
Structured logging
Production logs must be machine-parseable:
import uuid

import structlog
from fastapi import FastAPI, Request

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(),
)
logger = structlog.get_logger()

app = FastAPI()  # or reuse the existing application instance

# Middleware adds request context
@app.middleware("http")
async def log_requests(request: Request, call_next):
    # Drop context left over from a previous request handled by this worker
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=request.headers.get("X-Request-ID", str(uuid.uuid4())),
        method=request.method,
        path=request.url.path,
    )
    response = await call_next(request)
    logger.info("request_completed", status=response.status_code)
    return response
JSON logs integrate with ELK (Elasticsearch, Logstash, Kibana), Datadog, Grafana Loki, and CloudWatch. Every log line includes the request ID, making it trivial to trace a single request through the system.
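To show the idea behind context-bound JSON logs without any third-party dependency, here is a sketch that swaps structlog for the standard library's logging and contextvars modules. It is not how structlog is implemented; JsonFormatter, request_context, and the sample values are assumptions made for the example.

```python
import contextvars
import io
import json
import logging

# Holds per-request fields; each asyncio task sees its own value.
request_context = contextvars.ContextVar("request_context", default={})


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, merged with the request context."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            **request_context.get(),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

# What a middleware would do at the start of a request.
request_context.set({"request_id": "abc-123", "path": "/items"})
demo_logger.info("request_completed")
print(stream.getvalue().strip())
# {"level": "info", "event": "request_completed", "request_id": "abc-123", "path": "/items"}
```

Because the request ID travels in a context variable, any log call made while handling that request picks it up without passing it through function arguments.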
Prometheus metrics
import time

from fastapi import FastAPI, Request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from starlette.responses import Response

app = FastAPI()  # or reuse the existing application instance

REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration = time.perf_counter() - start
    # Note: the raw path creates one label value per distinct URL, so paths
    # with IDs in them can produce a very large number of time series.
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code,
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path,
    ).observe(duration)
    return response

@app.get("/metrics")
async def metrics():
    # CONTENT_TYPE_LATEST is the content type Prometheus expects for this format
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
Feed these metrics into Grafana dashboards for request rates, error rates, and latency percentiles. Alert on p99 latency breaches and error rate spikes.
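Latency percentiles on a dashboard are estimated from the histogram's bucket counts, not from raw observations. The sketch below shows one simplified way to do that, in the spirit of PromQL's histogram_quantile (linear interpolation inside the matching bucket); the function names and sample durations are made up, and the real implementation handles more edge cases.

```python
import bisect

BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]  # upper bounds, in seconds


def cumulative_counts(durations, buckets=BUCKETS):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    ordered = sorted(durations)
    counts = [bisect.bisect_right(ordered, bound) for bound in buckets]
    counts.append(len(ordered))  # the implicit +Inf bucket
    return counts


def estimate_quantile(q, counts, buckets=BUCKETS):
    """Interpolate linearly inside the first bucket that reaches the target rank."""
    rank = q * counts[-1]
    for i, bound in enumerate(buckets):
        if counts[i] >= rank:
            lower = buckets[i - 1] if i > 0 else 0.0
            below = counts[i - 1] if i > 0 else 0
            in_bucket = counts[i] - below
            if in_bucket == 0:
                return lower  # no observations at all
            return lower + (bound - lower) * (rank - below) / in_bucket
    return buckets[-1]  # rank falls in +Inf: report the highest finite bound


durations = [0.02, 0.03, 0.04, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 3.0]
counts = cumulative_counts(durations)
print(counts)                          # [0, 3, 3, 4, 6, 9, 9, 10, 10]
print(estimate_quantile(0.5, counts))  # 0.375
```

The estimate can only be as precise as the bucket boundaries, which is why buckets should be chosen around the latencies that matter for alerting.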
Startup and shutdown lifecycle
from contextlib import asynccontextmanager

from fastapi import FastAPI

# init_database, warm_cache, close_database_pool, and flush_metrics are
# application-defined coroutines.

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    await init_database()
    await warm_cache()
    logger.info("application_started")
    yield
    # Shutdown
    await close_database_pool()
    await flush_metrics()
    logger.info("application_stopped")

app = FastAPI(lifespan=lifespan)
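The lifespan context manager replaces the deprecated @app.on_event("startup") and @app.on_event("shutdown") handlers and keeps setup and teardown in one place. Code after the yield runs only if startup reached the yield, so to release resources when a later startup step fails, wrap the remaining startup steps and the yield in try/finally.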
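The effect of try/finally in a lifespan-style context manager can be checked with the standard library alone. In this sketch the events list and the fail_warmup flag are stand-ins for real resources and a real startup failure.

```python
import asyncio
from contextlib import asynccontextmanager

events = []


@asynccontextmanager
async def lifespan(app, fail_warmup=False):
    events.append("db_opened")  # stand-in for init_database()
    try:
        if fail_warmup:
            raise RuntimeError("cache warm-up failed")
        events.append("cache_warmed")
        yield
    finally:
        # Runs after a clean shutdown and after a failed warm-up alike.
        events.append("db_closed")


async def run(fail_warmup):
    async with lifespan(app=None, fail_warmup=fail_warmup):
        events.append("serving")


asyncio.run(run(fail_warmup=False))
try:
    asyncio.run(run(fail_warmup=True))
except RuntimeError:
    pass

print(events)
# ['db_opened', 'cache_warmed', 'serving', 'db_closed', 'db_opened', 'db_closed']
```

In the second run the application never starts serving, yet the database stand-in is still closed because the failing step sits inside the try block.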
The one thing to remember: Production deployment is a system, not a command — multi-stage Docker builds for small secure images, Gunicorn with proper worker config and /dev/shm heartbeats, Nginx for TLS and rate limiting, structured JSON logging, Prometheus metrics, health-checked rolling deploys, and graceful shutdown handling all work together to keep your API reliable under real-world conditions.