Containerization — Deep Dive

The Linux kernel primitives behind containers, the OCI spec, how overlay filesystems work, container security attack surface, and why your Dockerfile is probably building images 3x bigger than they need to be.

Under the Hood: What the Docker CLI Actually Does

When you run docker run nginx, you probably imagine Docker doing something clever. What actually happens is a sequence of Linux syscalls you could reproduce manually if you wanted to spend a weekend.

Docker hands the work to containerd, which hands it to runc, and runc calls the kernel with:

clone() with namespace flags: creates a new process in isolated namespaces
unshare(): moves the calling process into new namespaces without creating a new process
pivot_root(): changes the container’s filesystem root
mount() with overlayfs: layers image filesystems together
Writes cgroup files in /sys/fs/cgroup/ to set resource limits

None of this requires Docker. You can build a minimal container in bash with just unshare, chroot, and a rootfs tarball. The point: containers are a userspace abstraction over kernel primitives that ship in Linux 3.8+ (2013). Docker just made them usable.
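
To make the "namespace flags" step less abstract: the flags are plain bits that get OR-ed into one mask and handed to clone() or unshare(). The sketch below only builds that mask, using the constant values from <linux/sched.h>. The namespace_mask helper and the TYPICAL_CONTAINER name are made up for illustration, and nothing here creates a namespace.

```python
# Namespace flag values as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount points
CLONE_NEWUTS = 0x04000000   # hostname and domain name
CLONE_NEWIPC = 0x08000000   # System V IPC, POSIX message queues
CLONE_NEWUSER = 0x10000000  # UID/GID mappings
CLONE_NEWPID = 0x20000000   # process IDs
CLONE_NEWNET = 0x40000000   # network stack


def namespace_mask(*flags: int) -> int:
    """OR the given namespace flags into the single mask the syscall expects."""
    mask = 0
    for flag in flags:
        mask |= flag
    return mask


# Hypothetical selection: everything except the user namespace.
TYPICAL_CONTAINER = namespace_mask(
    CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID, CLONE_NEWNET
)

if __name__ == "__main__":
    print(hex(TYPICAL_CONTAINER))
```

On Linux with Python 3.12 or newer, os.unshare() accepts a mask like this, but it needs the right privileges, so treat the snippet as a reading aid rather than a container runtime.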

Linux Namespaces: The Isolation Primitives

Linux has 8 namespace types as of kernel 5.6. Containers typically use 6:

Namespace	Flag	Isolates
Mount	`CLONE_NEWNS`	Filesystem mount points
UTS	`CLONE_NEWUTS`	Hostname and domain name
IPC	`CLONE_NEWIPC`	System V IPC, POSIX message queues
PID	`CLONE_NEWPID`	Process IDs (container sees PID 1)
Network	`CLONE_NEWNET`	Network interfaces, routing tables, ports
User	`CLONE_NEWUSER`	UID/GID mappings
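
You can see which namespaces a process belongs to by reading the symlinks under /proc/<pid>/ns, whose targets look like pid:[4026531836]. Two processes share a namespace exactly when those inode numbers match. A small sketch, assuming a Linux host; parse_ns_link and list_namespaces are illustrative names, not part of any library:

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def list_namespaces(ns_dir: str = "/proc/self/ns") -> dict[str, int]:
    """Map each entry in ns_dir to the inode of the namespace it points at."""
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):  # only present on Linux
        for name, inode in list_namespaces().items():
            print(f"{name:20} {inode}")
```

Run it once on the host and once inside a container: the inode numbers for pid, net, mnt and friends should differ, while a container started with --network=host would show the host’s net inode.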

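The PID namespace is interesting: inside a container, PID 1 is whatever your container entrypoint is. The host sees it as, say, PID 47291. Being PID 1 carries two responsibilities. First, the kernel does not apply default signal actions to PID 1, so an entrypoint that installs no SIGTERM handler simply ignores docker stop until SIGKILL arrives at the end of the grace period. Second, orphaned child processes are reparented to PID 1, which has to reap them or they accumulate as zombies. Many images get this wrong: the shell form CMD python app.py runs the app under /bin/sh -c, so the shell is PID 1 and does not forward signals. Use the exec form CMD ["python", "app.py"] (or end a wrapper script with exec python app.py), and add a proper init like tini if the app spawns children.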
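
Here is roughly what "handle SIGTERM" means for a Python entrypoint. This is a minimal sketch, not a full init: it installs a handler so the process actually reacts when it runs as PID 1, but it does not reap orphaned children (that is what tini is for). The serve loop is a stand-in for your real work.

```python
import signal
import time

_stop_requested = False


def _request_stop(signum, frame):
    """Signal handler: just record that a shutdown was requested."""
    global _stop_requested
    _stop_requested = True


def install_signal_handlers() -> None:
    """Without an explicit handler, PID 1 ignores SIGTERM entirely."""
    signal.signal(signal.SIGTERM, _request_stop)
    signal.signal(signal.SIGINT, _request_stop)


def stop_requested() -> bool:
    return _stop_requested


def serve(poll_seconds: float = 0.5) -> None:
    """Placeholder main loop: work until asked to stop, then clean up."""
    install_signal_handlers()
    while not stop_requested():
        time.sleep(poll_seconds)  # stand-in for handling one unit of work
    print("stop requested, closing connections and exiting")
```

Keeping the handler down to setting a flag is deliberate: the main loop decides when it is safe to stop, so in-flight work can finish before the process exits.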

The user namespace is the most complex, and Docker still does not enable it by default: you opt in with the userns-remap daemon option or by running Docker in rootless mode. It lets a container think it’s running as root (UID 0) while the host sees it as an unprivileged user (UID 100000+). This dramatically reduces the blast radius of a container escape.
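
The mapping itself is visible in /proc/<pid>/uid_map, where each line reads "inside-start outside-start length". The sketch below applies such a map by hand; the example range (0 mapped to 100000, length 65536) is an assumption chosen to match the UID 100000+ figure above, and the function names are illustrative.

```python
def parse_uid_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map content into (inside_start, outside_start, length) ranges."""
    ranges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = (int(field) for field in line.split())
        ranges.append((inside, outside, length))
    return ranges


def to_host_uid(container_uid: int, ranges: list[tuple[int, int, int]]):
    """Translate a UID seen inside the namespace to the UID the host sees."""
    for inside, outside, length in ranges:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # unmapped UIDs have no identity on the host side


if __name__ == "__main__":
    example = parse_uid_map("0 100000 65536\n")  # assumed example mapping
    print(to_host_uid(0, example), to_host_uid(1000, example))
```

With that mapping, root in the container is UID 100000 on the host, which owns nothing important there.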

Control Groups v2 and Resource Accounting

cgroups v1 had a fragmented, inconsistent interface — each subsystem (cpu, memory, blkio) lived in its own hierarchy. cgroups v2, stabilized in Linux 4.5 (2016) and default in most distros by 2022, unified everything under a single hierarchy at /sys/fs/cgroup/.

When you run docker run --memory=512m --cpus=1.5 nginx, Docker writes to files like:

/sys/fs/cgroup/<container-id>/memory.max       # 536870912 (bytes)
/sys/fs/cgroup/<container-id>/cpu.max          # 150000 100000

The cpu.max format is quota period in microseconds — 150000/100000 means 1.5 cores worth of CPU time per 100ms period.
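
The arithmetic behind those two files is simple enough to check by hand. A small sketch that turns the CLI values into what ends up in memory.max and cpu.max; parse_memory and cpu_max are hypothetical helpers, and the 100000-microsecond default period is taken from the example above:

```python
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memory(value: str) -> int:
    """Convert a size like '512m' into bytes (binary units)."""
    value = value.strip().lower()
    if value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)  # plain number: already bytes


def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Render a CPU count as the 'quota period' pair stored in cpu.max."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    print(parse_memory("512m"))  # value for memory.max
    print(cpu_max(1.5))          # value for cpu.max
```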

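What most people don’t realize: memory limits without swap limits are incomplete. On a host with swap enabled, a container started with --memory=512m and no --memory-swap may use up to another 512MB of swap, because Docker defaults the combined limit to twice the memory limit. The --memory-swap flag is the total of memory plus swap, so setting --memory-swap=512m (equal to --memory) leaves zero swap for the container. Without it, a container you expected to be OOM-killed can instead limp along on swap.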
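
The flag semantics trip people up because --memory-swap counts memory plus swap, while the cgroup v2 file memory.swap.max counts swap only. A sketch of the conversion, assuming cgroup v2 and Docker’s documented defaults; swap_max is an illustrative name, not a Docker API:

```python
from typing import Optional


def swap_max(memory: int, memory_swap: Optional[int] = None) -> str:
    """Swap-only allowance implied by a --memory / --memory-swap pair (bytes)."""
    if memory_swap is None:
        # Flag not given: the container may use as much swap as memory.
        return str(memory)
    if memory_swap == -1:
        return "max"  # unlimited swap
    if memory_swap < memory:
        raise ValueError("--memory-swap must be >= --memory")
    return str(memory_swap - memory)


if __name__ == "__main__":
    limit = 512 * 1024**2
    print(swap_max(limit))          # flag unset
    print(swap_max(limit, limit))   # swap disabled
```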

OOM behavior is also worth understanding. When a container’s process exceeds its memory limit, the kernel’s OOM killer runs inside the cgroup — it kills processes in the container, not on the host. This is generally desirable, but it means your container app needs to handle unexpected death gracefully.
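
If you suspect a container was OOM-killed, cgroup v2 keeps counters in a memory.events file next to memory.max. Each line is a name and a count, and oom_kill is the one to watch. A minimal parser, using a made-up sample rather than real output:

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the 'name count' lines of a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kills(text: str) -> int:
    """Number of processes in the cgroup killed by the OOM killer."""
    return parse_memory_events(text).get("oom_kill", 0)


# Illustrative content only; real values come from the container's cgroup.
SAMPLE = "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"

if __name__ == "__main__":
    print(oom_kills(SAMPLE))
```

A nonzero oom_kill alongside an exit code of 137 is usually the signature of a memory limit that is too tight.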

Overlay Filesystems: How Image Layers Are Mounted

This is the part most Docker tutorials skip entirely, and it explains a lot of mysterious behavior.

Docker images are stored as layers. The overlay2 storage driver (default since Docker 1.13) uses Linux’s overlayfs to merge these layers at runtime.

The structure for a container looks like:

lowerdir=layer1:layer2:layer3    # read-only image layers (colon-separated, top to bottom: leftmost is the topmost layer)
upperdir=/containers/<id>/diff   # read-write container layer
workdir=/containers/<id>/work    # overlayfs internal (must be empty)
merged=/containers/<id>/merged   # the container's view of the filesystem

When a container reads a file: overlayfs checks upperdir first, then walks down through the lowerdirs. When a container writes a file: it’s copy-on-write. The original is copied from lowerdir to upperdir, then modified. The lowerdir file is untouched.
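
The lookup and copy-up rules fit in a few lines if you model each layer as a dictionary. This is a toy model of the behavior described above, not how the kernel implements overlayfs: it ignores directories, whiteouts (deletions) and metadata, and ToyOverlay is a made-up name.

```python
class ToyOverlay:
    """Dictionary-based model of overlayfs reads and copy-on-write."""

    def __init__(self, lowers: list[dict[str, bytes]]):
        self.lowers = lowers  # read-only layers, topmost first
        self.upper: dict[str, bytes] = {}  # the container's writable layer

    def read(self, path: str) -> bytes:
        """Check the upper layer first, then walk down the lower layers."""
        if path in self.upper:
            return self.upper[path]
        for layer in self.lowers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path: str, data: bytes) -> None:
        """Modify a file; the first write copies the whole file up."""
        if path not in self.upper:
            try:
                self.upper[path] = self.read(path)  # copy-up from a lower layer
            except FileNotFoundError:
                self.upper[path] = b""  # brand-new file lives only in upper
        self.upper[path] += data


if __name__ == "__main__":
    base = {"/etc/motd": b"hello\n"}
    fs = ToyOverlay([base])
    fs.append("/etc/motd", b"world\n")
    print(fs.read("/etc/motd"), base["/etc/motd"])
```

Note that the copy-up duplicates the entire file even for a one-byte change, which is where the cost for large, frequently modified files comes from.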

This has a real performance implication: write-intensive workloads should use volumes, not the container filesystem. If your app writes lots of data to its own filesystem, the first write to any file that lives in an image layer copies the whole file up into upperdir, and every later read and write still goes through the overlay layer, which adds overhead. Databases especially should always use Docker volumes, which are plain host directories mounted into the container via bind mounts, with no overlay involved.

Image Size: Why Your Images Are Probably Too Big

The average Docker image on Docker Hub is around 300-400MB. Most of that is unnecessary.

Common anti-patterns:

Not using multi-stage builds. Build toolchains are huge. Your Go binary is 8MB, but the image that compiled it might be 800MB. Multi-stage builds let you compile in one stage and copy only the artifact into a minimal final image:

# Stage 1: build
FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
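# CGO_ENABLED=0 forces a statically linked binary; scratch has no libc to link against
RUN CGO_ENABLED=0 go build -o main ./cmd/server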

# Stage 2: run (scratch = empty base image)
FROM scratch
COPY --from=builder /app/main /main
ENTRYPOINT ["/main"]

The resulting image is the size of your binary. Nothing else.

Leaving package manager caches. apt-get update downloads package lists, and apt-get install can leave downloaded package files behind. A plain RUN apt-get update && apt-get install -y curl creates a layer with those files baked in. Doing it properly:

RUN apt-get update && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/*

The cleanup must happen in the same RUN instruction as the install: each RUN creates a layer, and files deleted in a later layer still exist in the earlier layer’s snapshot.
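
A toy model shows why the later delete doesn’t help. Each layer records the files it adds and the paths it deletes; the image you push is the sum of all layers, while the container only sees the merged result. The file sizes below are invented for illustration.

```python
def image_size(layers: list[dict]) -> int:
    """Bytes stored and shipped: every layer's added files, deletions or not."""
    return sum(sum(layer["adds"].values()) for layer in layers)


def visible_size(layers: list[dict]) -> int:
    """Bytes visible in the merged filesystem, applying layers bottom to top."""
    files: dict[str, int] = {}
    for layer in layers:
        for path in layer.get("deletes", ()):
            files.pop(path, None)
        files.update(layer["adds"])
    return sum(files.values())


# Hypothetical sizes: a 5 MB binary and 40 MB of package lists.
TWO_RUNS = [
    {"adds": {"/usr/bin/curl": 5_000_000, "/var/lib/apt/lists/index": 40_000_000}},
    {"adds": {}, "deletes": ["/var/lib/apt/lists/index"]},
]
ONE_RUN = [{"adds": {"/usr/bin/curl": 5_000_000}}]

if __name__ == "__main__":
    print(image_size(TWO_RUNS), visible_size(TWO_RUNS), image_size(ONE_RUN))
```

Both variants look identical from inside the container, but the two-RUN image still carries the 40 MB in its first layer.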

Using :latest as base. FROM python:latest pulls the full Debian-based image, roughly 1GB, and the tag moves under you between builds. FROM python:3.11-slim is around 130MB. FROM python:3.11-alpine is around 50MB. Alpine’s musl libc has some compatibility quirks, but slim variants of official images are almost always the right choice.

Container Security: The Real Attack Surface

Containers are not a security boundary in the way VMs are. It helps to understand where the walls actually are:

The shared kernel problem. All containers on a host share the host kernel. A kernel exploit (like Dirty COW in 2016, or Dirty Pipe in 2022) can affect all containers simultaneously. A hypervisor exposes a far narrower interface to its guests than the full syscall table a container can reach, so its attack surface is much smaller.

Privileged containers are effectively root on the host. docker run --privileged gives the container all Linux capabilities and access to host devices. This is sometimes needed (e.g., running Docker-in-Docker), but root inside a privileged container can mount host disks or load kernel modules, so reaching the host takes no exploit at all. Treat privileged containers as a serious risk.

Capability dropping. Linux capabilities are a fine-grained system for what root can do. By default, Docker drops most capabilities and keeps a limited set. You can go further with --cap-drop=ALL --cap-add=NET_BIND_SERVICE (for example). The principle of least privilege applies: your web app probably doesn’t need CAP_SYS_ADMIN.
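
To check what a process actually holds, look at the CapEff line in /proc/<pid>/status, which is a hex bitmask with one bit per capability. The decoder below covers only a subset of bit numbers from <linux/capability.h>. The mask 00000000a80425fb is the value commonly seen for Docker’s default set; treat it as an example, since defaults can vary by version and configuration.

```python
# Subset of capability bit numbers from <linux/capability.h>.
CAP_BITS = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 3: "CAP_FOWNER", 4: "CAP_FSETID",
    5: "CAP_KILL", 6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT", 21: "CAP_SYS_ADMIN", 27: "CAP_MKNOD",
    29: "CAP_AUDIT_WRITE", 31: "CAP_SETFCAP",
}


def decode_caps(hex_mask: str) -> list[str]:
    """Turn a CapEff-style hex mask into capability names, lowest bit first."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_BITS.get(bit, f"bit_{bit}"))
        bit += 1
    return names


if __name__ == "__main__":
    for name in decode_caps("00000000a80425fb"):  # example mask, see above
        print(name)
```

After --cap-drop=ALL --cap-add=NET_BIND_SERVICE you would expect the mask to shrink to a single bit, 0000000000000400.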

seccomp profiles. Docker ships a default seccomp profile that blocks about 44 syscalls considered dangerous. You can add custom profiles to restrict further. Spotify’s Backstage team uses custom seccomp profiles in production that only allow the specific syscalls their services need.
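
A custom profile is a JSON file with a default action plus a list of exceptions. The sketch below generates an allowlist-style profile: deny everything, then allow named syscalls. The syscall list is hypothetical and far too short to start a real container; in practice you would derive the list by tracing your service.

```python
import json


def allowlist_profile(syscalls: list[str]) -> dict:
    """Build a seccomp profile that denies every syscall except those listed."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",  # anything not listed fails with an error
        "syscalls": [
            {"names": sorted(set(syscalls)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


# Hypothetical list for illustration only.
EXAMPLE_SYSCALLS = ["read", "write", "close", "futex", "epoll_wait", "accept4", "exit_group"]

if __name__ == "__main__":
    print(json.dumps(allowlist_profile(EXAMPLE_SYSCALLS), indent=2))
```

Saved to a file, a profile like this is applied with docker run --security-opt seccomp=profile.json.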

Image supply chain. In 2021, the ua-parser-js npm package was compromised after an attacker hijacked the maintainer’s npm account and published malicious versions. Any Docker image built from node:* that installed one of those versions shipped the malware with it. This is why image scanning tools (Trivy, Snyk, Grype) matter: they check layers for CVEs, and some can detect secrets accidentally baked into images.

The OCI Spec and Runtime Diversity

The Open Container Initiative defines three specs:

OCI Image Spec: What a container image looks like (manifest, config, layers)
OCI Runtime Spec: What a container runtime must do (lifecycle: create, start, kill, delete)
OCI Distribution Spec: How registries serve images (HTTP API)
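
To give the image spec some shape: a manifest is a small JSON document that points at a config blob and a list of layer blobs, each identified by media type, SHA-256 digest and size. The sketch below assembles one from raw bytes. The blobs are placeholders rather than real gzipped tarballs, and descriptor and build_manifest are illustrative helpers, not a registry client.

```python
import hashlib
import json


def descriptor(media_type: str, blob: bytes) -> dict:
    """Content descriptor: how OCI documents refer to a blob."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(config_blob: bytes, layer_blobs: list[bytes]) -> dict:
    """Assemble an image manifest; layers are listed base layer first."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
            for blob in layer_blobs
        ],
    }


if __name__ == "__main__":
    manifest = build_manifest(b"{}", [b"placeholder-layer-1", b"placeholder-layer-2"])
    print(json.dumps(manifest, indent=2))
```

Because everything is addressed by digest, two images that share a base layer store and transfer that layer only once.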

runc is the reference OCI runtime: it’s the low-level piece that actually calls the Linux kernel. Docker uses runc under containerd. Kubernetes talks to containerd directly. Dockershim, the adapter that let the kubelet drive Docker Engine, was removed in Kubernetes 1.24 (released May 2022), a change that caused widespread confusion but changed nothing for most users, since containerd was already the actual runtime.

Alternative runtimes exist for specific use cases:

gVisor (Google): intercepts syscalls with a userspace kernel sandbox — much stronger isolation, some performance cost
Kata Containers: runs containers inside lightweight VMs; near-VM isolation, at the cost of slower startup and more memory per container than runc
Firecracker (AWS): powers Lambda and Fargate — microVMs in 125ms, used for millions of function executions per day

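Because gVisor (runsc) and Kata implement the OCI runtime spec, they are drop-in replacements for runc. Firecracker is a virtual machine monitor rather than an OCI runtime, and is typically reached through Kata or firecracker-containerd. You can configure Kubernetes (via RuntimeClass) to use gVisor for untrusted code and runc for trusted services, with the same image format.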

Layer Caching Strategy for CI/CD

In a CI environment, cache misses are expensive. A Python image that reinstalls 50 dependencies on every commit is wasting 2-3 minutes per build.

The golden rule: order Dockerfile instructions from least-frequently-changed to most-frequently-changed.

# Bad: cache busted on every code change because COPY invalidates pip install
FROM python:3.11-slim
COPY . /app
RUN pip install -r /app/requirements.txt

# Good: requirements only reinstall when requirements.txt changes
FROM python:3.11-slim
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt
COPY . /app
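
Why the order matters becomes obvious if you model the cache as a hash chain: each step’s key depends on its parent’s key, the instruction text, and the contents of any files it copies. This is a simplified sketch of the idea, not BuildKit’s actual keying, and the file contents are made up.

```python
import hashlib


def cache_keys(steps, files):
    """Chain a key per step from: parent key, instruction, copied file contents."""
    keys, parent = [], ""
    for instruction, copied in steps:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(instruction.encode())
        for path in copied:
            digest.update(files[path].encode())
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def reused_steps(steps, before, after):
    """Count leading steps whose cache key survives the change in files."""
    count = 0
    for old, new in zip(cache_keys(steps, before), cache_keys(steps, after)):
        if old != new:
            break
        count += 1
    return count


EVERYTHING = ["requirements.txt", "app.py"]
BAD = [
    ("FROM python:3.11-slim", []),
    ("COPY . /app", EVERYTHING),
    ("RUN pip install -r /app/requirements.txt", []),
]
GOOD = [
    ("FROM python:3.11-slim", []),
    ("COPY requirements.txt /app/", ["requirements.txt"]),
    ("RUN pip install -r /app/requirements.txt", []),
    ("COPY . /app", EVERYTHING),
]

if __name__ == "__main__":
    v1 = {"requirements.txt": "flask\n", "app.py": "print('v1')\n"}
    v2 = {"requirements.txt": "flask\n", "app.py": "print('v2')\n"}
    print(reused_steps(BAD, v1, v2), reused_steps(GOOD, v1, v2))
```

With a code-only change, the bad ordering keeps just the FROM step, while the good ordering keeps everything up to and including pip install.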

In CI systems like GitHub Actions, where every run starts on a fresh machine, you can persist the Docker layer cache between runs with BuildKit’s --cache-from and --cache-to flags (docker buildx build). Docker’s docker/build-push-action for GitHub Actions supports this directly with cache-from: type=gha and cache-to: type=gha,mode=max.

One Thing to Remember

The container filesystem is a stack of read-only layers with a thin read-write layer on top. Anything your running container writes lives only in that top layer and is deleted when the container is removed, which under an orchestrator happens on every restart or reschedule. Design your apps accordingly: state goes in volumes, secrets go in environment variables or secret managers, and logs go to stdout so the runtime can collect them.
