Kubernetes — Deep Dive

etcd, the scheduler algorithm, custom controllers, and why most Kubernetes outages aren't Kubernetes's fault — a technical teardown of the control plane and the patterns that keep production clusters alive.

What Kubernetes Actually Is (Architecture First)

Most people learn Kubernetes by touching kubectl. That’s fine for day-to-day use, but to understand why things fail (and they will fail), you need the architecture.

Kubernetes is a distributed system built around a single source of truth: etcd, a distributed key-value store. Every pod spec, deployment, node registration, and secret lives in etcd. The control plane is a set of reconciliation loops that watch the API server (the only component that reads and writes etcd directly) for changes and act on them.

Control Plane Components

┌─────────────────────────────────────────────┐
│                Control Plane                │
│                                             │
│  ┌─────────────┐   ┌──────────────────────┐ │
│  │  kube-api   │   │        etcd          │ │
│  │   server    │◄──│  (source of truth)   │ │
│  └──────┬──────┘   └──────────────────────┘ │
│         │                                   │
│  ┌──────▼──────┐   ┌──────────────────────┐ │
│  │  scheduler  │   │  controller-manager  │ │
│  └─────────────┘   └──────────────────────┘ │
└─────────────────────────────────────────────┘

kube-apiserver — The only component anything talks to directly. All state changes go through the API server. kubectl, the scheduler, controller-manager, kubelets on nodes — everyone hits the API server. It validates requests, handles auth/authz, and writes to etcd.

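etcd: Raft-based distributed KV store. For production, you run 3 or 5 etcd nodes. Raft requires a majority, floor(n/2)+1 members, to agree before a write commits, so a 3-node cluster tolerates 1 failure and a 5-node cluster tolerates 2. Avoid 2 or 4 nodes: Raft's majority rule still prevents split-brain, but an even-sized cluster raises the quorum without tolerating any more failures than the next smaller odd size (4 nodes tolerate 1 failure, same as 3).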
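The quorum arithmetic is easy to check by hand or in a few lines of Go. This is a minimal sketch; Quorum and FaultTolerance are illustrative names, not part of etcd.

```go
package main

import "fmt"

// Quorum returns how many voting members must acknowledge a write
// in a Raft cluster of n members: floor(n/2) + 1.
func Quorum(n int) int {
	return n/2 + 1
}

// FaultTolerance returns how many members can fail while the
// remaining ones can still form a quorum.
func FaultTolerance(n int) int {
	return n - Quorum(n)
}

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("members=%d quorum=%d tolerates=%d\n", n, Quorum(n), FaultTolerance(n))
	}
}
```

Four members need a quorum of three but still tolerate only one failure, which is why etcd cluster sizes stay odd.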

kube-scheduler: Watches for pods with no assigned node (.spec.nodeName is empty), filters and scores candidate nodes, and binds the pod to the winner, which sets .spec.nodeName. It doesn't start the pod; it only records the decision through the API server. The kubelet on the chosen node notices and takes over.

controller-manager — Runs ~20 built-in control loops. The ReplicaSet controller watches for pods belonging to a ReplicaSet and creates/deletes them to match replicas. The Node controller watches for nodes that stop reporting heartbeats and marks them NotReady. Each controller is a simple loop: observe current state → compare to desired state → act.
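The observe, compare, act cycle can be sketched in a few lines. This is a toy model with hypothetical names (Action, Reconcile), not the real ReplicaSet controller, which works on pod objects and label selectors rather than bare counts.

```go
package main

import "fmt"

// Action is what one reconcile pass decides to do.
type Action struct {
	Create int // pods to create
	Delete int // pods to delete
}

// Reconcile compares the desired replica count with the observed one
// and returns the corrective action. It is idempotent: when the two
// counts match, it returns the zero Action.
func Reconcile(desired, observed int) Action {
	switch {
	case observed < desired:
		return Action{Create: desired - observed}
	case observed > desired:
		return Action{Delete: observed - desired}
	default:
		return Action{}
	}
}

func main() {
	desired, observed := 3, 1
	for pass := 1; ; pass++ {
		a := Reconcile(desired, observed)
		if a == (Action{}) {
			fmt.Printf("pass %d: in sync at %d replicas\n", pass, observed)
			break
		}
		fmt.Printf("pass %d: create=%d delete=%d\n", pass, a.Create, a.Delete)
		observed += a.Create - a.Delete // pretend the action succeeded
	}
}
```

Because each pass only looks at current versus desired state, a controller that crashes and restarts simply picks up where the cluster actually is.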

Worker Node Components

kubelet — Runs on every node. Watches the API server for pods scheduled to its node, instructs the container runtime (containerd, in modern clusters) to start/stop containers, reports pod status back to the API server. If the kubelet crashes, pods already running on that node keep running — but Kubernetes has no visibility into them until kubelet recovers.

kube-proxy — Implements Service networking. Maintains iptables (or ipvs) rules that route traffic for Service ClusterIPs to the backend pod IPs. When pods come and go, kube-proxy updates the rules. This is why Services work as stable endpoints even as pod IPs change constantly.

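Container runtime: containerd (via the CRI, the Container Runtime Interface) actually pulls images and manages container lifecycle. Kubernetes 1.24 removed dockershim, the built-in adapter that let the kubelet drive Docker Engine, because Docker Engine does not implement the CRI and the adapter was an extra layer to maintain. containerd, which Docker itself uses underneath, handles everything Kubernetes needs, and images built with Docker still run unchanged.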

The Scheduling Algorithm

The scheduler runs in two phases: filtering and scoring.

Filtering removes any node that cannot run the pod:

Insufficient CPU/memory resources
Taints that the pod doesn’t tolerate
NodeSelector labels don’t match
Affinity/anti-affinity rules violated
Pod can’t be scheduled on control-plane nodes (unless configured)

Scoring ranks the remaining nodes. Default scoring functions include:

LeastAllocated — prefer nodes with more free resources (spreads load)
ImageLocality — prefer nodes that already have the container image (avoids pull latency)
InterPodAffinity — prefer nodes that satisfy soft affinity rules
NodeAffinity — weight nodes matching preference selectors

The node with the highest aggregate score wins. This is pluggable — you can write custom scheduler plugins if the defaults don’t fit your hardware topology.
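The two phases can be sketched as follows. This is a simplified model under stated assumptions: the Node and Pod types are hypothetical, only CPU and a single taint are considered, and the score imitates the LeastAllocated idea rather than reproducing the real plugin's formula. Note that the filter compares requests against allocatable capacity, not live usage.

```go
package main

import "fmt"

// Node is a hypothetical, CPU-only view of a cluster node.
type Node struct {
	Name                string
	AllocatableMilliCPU int64
	RequestedMilliCPU   int64 // sum of requests of pods already bound
	Tainted             bool
}

// Pod is a hypothetical pod with one CPU request.
type Pod struct {
	RequestMilliCPU int64
	ToleratesTaint  bool
}

// Filter drops nodes that cannot run the pod at all.
func Filter(pod Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.Tainted && !pod.ToleratesTaint {
			continue
		}
		if n.RequestedMilliCPU+pod.RequestMilliCPU > n.AllocatableMilliCPU {
			continue
		}
		feasible = append(feasible, n)
	}
	return feasible
}

// Score returns 0..100: the percentage of CPU left unrequested
// after placing the pod. More free capacity scores higher.
func Score(pod Pod, n Node) int64 {
	if n.AllocatableMilliCPU == 0 {
		return 0
	}
	free := n.AllocatableMilliCPU - n.RequestedMilliCPU - pod.RequestMilliCPU
	return free * 100 / n.AllocatableMilliCPU
}

// Schedule returns the best node's name, or false if none is feasible.
// Ties go to the first node seen; the real scheduler picks among ties at random.
func Schedule(pod Pod, nodes []Node) (string, bool) {
	best, bestScore := "", int64(-1)
	for _, n := range Filter(pod, nodes) {
		if s := Score(pod, n); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best, bestScore >= 0
}

func main() {
	nodes := []Node{
		{Name: "a", AllocatableMilliCPU: 4000, RequestedMilliCPU: 3000},
		{Name: "b", AllocatableMilliCPU: 4000, RequestedMilliCPU: 1000},
		{Name: "c", AllocatableMilliCPU: 8000, Tainted: true},
	}
	name, ok := Schedule(Pod{RequestMilliCPU: 500}, nodes)
	fmt.Println(name, ok)
}
```

Node c has the most free CPU but is filtered out by its taint before scoring ever happens, so the pod lands on b.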

Custom Controllers: The Operator Pattern

This is the part most engineers don’t reach until they’ve been running Kubernetes for a year, and it’s arguably the most powerful feature.

The built-in controllers handle generic workloads. But what if you’re running a stateful database cluster that has specific requirements for startup order, backup procedures, and failover? You can encode that operational knowledge in a Custom Controller (aka Operator).

An Operator consists of:

A Custom Resource Definition (CRD) — a new API object type, e.g., kind: PostgresCluster
A controller — a program that watches for PostgresCluster objects and reconciles them

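// Simplified controller reconcile loop (controller-runtime).
// The v1alpha1 import path is a placeholder for your operator's API package.
import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    v1alpha1 "example.com/postgres-operator/api/v1alpha1"
)

// Reconcile drives one PostgresCluster toward its desired state.
func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Fetch the object; if it was deleted in the meantime, there is nothing to do.
    var cluster v1alpha1.PostgresCluster
    if err := r.Get(ctx, req.NamespacedName, &cluster); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Ensure StatefulSet exists with correct replica count.
    // Returning the error requeues with backoff. controller-runtime ignores
    // the Result when err is non-nil, so a RequeueAfter here would have no effect.
    if err := r.ensureStatefulSet(ctx, &cluster); err != nil {
        return ctrl.Result{}, err
    }

    // Ensure primary/replica roles are correctly assigned
    if err := r.reconcileReplication(ctx, &cluster); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}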

This pattern is now how almost every complex software system gets deployed on Kubernetes. Prometheus Operator, Strimzi (Kafka), CockroachDB Operator, cert-manager, and Istio all follow it. The OperatorHub lists over 300 as of 2025.

The mental model: if a human operator could write a runbook for managing a system, you can encode that runbook in a controller.

Pod Disruption Budgets and Why They Save You

Here’s a scenario: you have 3 replicas of your web app. Node maintenance kicks off cluster-wide. Kubernetes starts draining nodes (evicting pods). Without a PodDisruptionBudget, it can evict all 3 pods simultaneously. Your service is down.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app

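Now the Eviction API, which kubectl drain and most upgrade tooling use, will refuse any voluntary eviction that would bring your available pod count below 2. Node drain will evict one pod, wait for a replacement to become ready, then evict the next. Slower, but you stay up. A PDB does not protect against involuntary disruptions such as node crashes, or against pods that are deleted directly.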
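The budget check itself is simple arithmetic. Here is a minimal sketch of the minAvailable case; DisruptionsAllowed is an illustrative function that mirrors the idea behind the PDB's status.disruptionsAllowed field, not the actual controller code.

```go
package main

import "fmt"

// DisruptionsAllowed returns how many pods may be voluntarily evicted
// right now without dropping below minAvailable healthy pods.
func DisruptionsAllowed(currentHealthy, minAvailable int) int {
	if d := currentHealthy - minAvailable; d > 0 {
		return d
	}
	return 0
}

func main() {
	healthy, minAvailable := 3, 2

	fmt.Println("before drain:", DisruptionsAllowed(healthy, minAvailable))

	healthy-- // one pod evicted, replacement not ready yet
	fmt.Println("after one eviction:", DisruptionsAllowed(healthy, minAvailable))

	healthy++ // replacement became ready on another node
	fmt.Println("after replacement is ready:", DisruptionsAllowed(healthy, minAvailable))
}
```

With 3 replicas and minAvailable: 2, the drain gets exactly one eviction at a time and blocks until the replacement reports ready.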

Most Kubernetes outages I've seen in postmortems aren't failures. They're voluntary disruptions (cluster upgrades, node rotations, autoscaler scale-downs) that happened to a deployment without PDBs. This is an easy win that most teams don't implement until after their first self-inflicted outage.

Resource Requests vs Limits — The Most Misunderstood Setting

Every pod spec can set requests and limits for CPU and memory.

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "1000m"

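Requests: what the scheduler uses. To place this pod, a node must have at least 256Mi of memory (and 250m of CPU) not already claimed by other pods' requests. The scheduler compares requests against the node's allocatable capacity, not against live usage. Requests affect bin-packing.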

Limits — what the runtime enforces. If the container tries to use more than 512Mi memory, the kernel OOM-kills it. If it tries to use more than 1 CPU core, it gets throttled (not killed).

The subtlety most people miss: CPU throttling is silent and devastating. A CPU-throttled pod doesn’t crash — it just gets slower. It’s nearly impossible to detect without metrics. Latency spikes, timeouts accumulate, and nothing in your logs explains why. Setting CPU limits too tight is the hidden cause of many mysterious latency issues.
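To see why throttling bites harder than the limit suggests, it helps to do the arithmetic. On Linux, a CPU limit becomes a quota of CPU time per scheduling period (100ms by default), and every busy thread draws from the same quota. The sketch below is a simplified model with hypothetical function names; it assumes fully CPU-bound threads, each running on its own core.

```go
package main

import "fmt"

// Default CFS bandwidth period in microseconds (100ms).
const cfsPeriodMicros = 100000

// QuotaMicros converts a CPU limit in millicores into the CPU time
// the container may consume per period.
func QuotaMicros(limitMilliCPU int64) int64 {
	return limitMilliCPU * cfsPeriodMicros / 1000
}

// ThrottledMicrosPerPeriod estimates how much wall-clock time per period
// the container spends throttled, given busyThreads CPU-bound threads
// that all draw from the same quota.
func ThrottledMicrosPerPeriod(limitMilliCPU, busyThreads int64) int64 {
	if busyThreads <= 0 {
		return 0
	}
	runWall := QuotaMicros(limitMilliCPU) / busyThreads // wall time until the quota runs out
	if runWall >= cfsPeriodMicros {
		return 0
	}
	return cfsPeriodMicros - runWall
}

func main() {
	fmt.Println("quota for 1000m:", QuotaMicros(1000), "us per period")
	fmt.Println("1 thread, 1000m limit:", ThrottledMicrosPerPeriod(1000, 1), "us throttled")
	fmt.Println("4 threads, 1000m limit:", ThrottledMicrosPerPeriod(1000, 4), "us throttled")
}
```

Under this model, a 4-thread burst against a 1-core limit burns the quota in 25ms and then stalls for the remaining 75ms of every period, even though average CPU usage looks like exactly one core.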

Enforcement is asymmetric: exceeding a memory limit kills the container (OOM kill), while hitting a CPU limit only throttles it. Many teams now advocate setting memory limits but leaving CPU limits unset, relying on CPU requests to guarantee each container its proportional share when the node is contended.

Networking Model

Kubernetes makes a guarantee: every pod can communicate with every other pod directly, without NAT. No port mapping, no proxy — flat network across the entire cluster.

This is implemented by CNI plugins (Container Network Interface). Each plugin implements the guarantee differently:

Flannel — overlay network using VXLAN tunneling. Simple, high overhead.
Calico — BGP routing where possible, VXLAN for cross-subnet. Good performance, complex config.
Cilium: eBPF-based. Can replace kube-proxy and bypass iptables entirely, with significantly lower latency at scale. It powers GKE Dataplane V2 and is available on AKS as Azure CNI Powered by Cilium.

Cilium’s rise is one of the bigger infrastructure shifts of the 2020s. eBPF lets Cilium hook into the Linux kernel’s packet processing at a level that kube-proxy’s iptables chains can’t reach. At 10,000+ services, iptables rule count becomes a serious performance problem; Cilium sidesteps it entirely.
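The scaling difference comes down to data structures: a sequential rule chain is scanned in order, while an eBPF map is a hash lookup. The sketch below is only an analogy in plain Go, with hypothetical types and made-up addresses; it does not model real iptables chains or eBPF programs.

```go
package main

import "fmt"

// Rule maps one Service ClusterIP to a backend, like one entry in a chain.
type Rule struct {
	ClusterIP string
	Backend   string
}

// LinearLookup walks the rules in order, the way a packet traverses a
// sequential chain. It returns the backend and how many rules it examined.
func LinearLookup(rules []Rule, ip string) (string, int) {
	for i, r := range rules {
		if r.ClusterIP == ip {
			return r.Backend, i + 1
		}
	}
	return "", len(rules)
}

// BuildIndex builds a hash map keyed by ClusterIP, standing in for an eBPF map.
func BuildIndex(rules []Rule) map[string]string {
	idx := make(map[string]string, len(rules))
	for _, r := range rules {
		idx[r.ClusterIP] = r.Backend
	}
	return idx
}

func main() {
	var rules []Rule
	for i := 0; i < 10000; i++ {
		ip := fmt.Sprintf("10.96.%d.%d", i/256, i%256)
		rules = append(rules, Rule{ClusterIP: ip, Backend: fmt.Sprintf("pod-%d", i)})
	}
	last := rules[len(rules)-1].ClusterIP

	backend, examined := LinearLookup(rules, last)
	fmt.Printf("linear: %s after %d rules\n", backend, examined)

	fmt.Printf("hashed: %s after one lookup\n", BuildIndex(rules)[last])
}
```

The cost of the linear scan grows with the number of Services, while the hashed lookup stays roughly constant.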

What Actually Goes Wrong in Production

From postmortem analysis and my own experience, the recurring failure modes:

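etcd storage pressure: etcd has a default storage quota of 2GB (configurable; 8GB is the suggested maximum). Old revisions accumulate, and compaction alone does not return the freed space to the filesystem; that takes a defragmentation. If the database hits the quota, etcd raises a NOSPACE alarm and rejects writes, so the cluster effectively freezes: no new objects can be created, no updates processed. The API server requests compaction periodically by default, but defragmentation is left to you. Automate the full sequence: etcdctl compact $(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision'), then etcdctl defrag, then etcdctl alarm disarm if the alarm has fired.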

ImagePullBackOff cascades: A brief registry outage causes new pods to fail to pull images. Kubernetes retries with exponential backoff, so pods can stay stuck for minutes after the registry recovers. If this hits during a rollout or a node drain, replacement pods never become ready: the rollout stalls, and drains block on your PDB because no spare healthy pods exist. Use imagePullPolicy: IfNotPresent with immutable image tags for production images, so nodes that already cache the image don't depend on the registry.
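The retry schedule explains why recovery lags behind the registry. Here is a minimal sketch of capped exponential backoff; the 10-second base is an assumption for illustration, and the 300-second cap is the limit the Kubernetes documentation describes for image pulls, so check the behavior of your version.

```go
package main

import "fmt"

// BackoffSeconds returns the first n retry delays in seconds:
// the base delay doubled on each attempt and capped at max.
func BackoffSeconds(base, max, n int) []int {
	if n <= 0 {
		return nil
	}
	delays := make([]int, 0, n)
	d := base
	for i := 0; i < n; i++ {
		if d > max {
			d = max
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

func main() {
	// Assumed values: 10s base, 300s cap.
	fmt.Println(BackoffSeconds(10, 300, 7))
}
```

After a handful of failures, a pod waits the full cap between attempts, so a two-minute registry blip can leave pods pending for several minutes longer.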

Noisy neighbor eviction storms — A memory-hungry pod on a node triggers node pressure. Kubernetes starts evicting pods. Those pods get rescheduled to other nodes. Those nodes now have more load. More evictions. Cascades happen fast. Proper resource requests prevent this by ensuring the scheduler doesn’t overcommit nodes.

API server request storms — Many controllers, monitoring agents, and operators all poll the API server. At scale, this becomes significant load. The watch mechanism (long-lived HTTP connections that stream changes) is much more efficient than polling and should always be preferred in custom controllers.
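The shape of a watch-driven controller can be sketched with a channel standing in for the event stream. This is a stdlib-only illustration, not client-go: Event and Consume are hypothetical names, and the event kinds follow the ADDED, MODIFIED, DELETED types that watch responses carry.

```go
package main

import "fmt"

// Event is a change notification, loosely modeled on a watch event.
type Event struct {
	Kind string // "ADDED", "MODIFIED", or "DELETED"
	Name string
}

// Consume applies a stream of events to a local cache and returns the
// cache once the stream closes. The consumer does work only when
// something changes, instead of re-listing everything on a timer.
func Consume(events <-chan Event) map[string]bool {
	cache := make(map[string]bool)
	for ev := range events {
		switch ev.Kind {
		case "ADDED", "MODIFIED":
			cache[ev.Name] = true
		case "DELETED":
			delete(cache, ev.Name)
		}
	}
	return cache
}

func main() {
	events := make(chan Event, 3)
	events <- Event{Kind: "ADDED", Name: "web-1"}
	events <- Event{Kind: "ADDED", Name: "web-2"}
	events <- Event{Kind: "DELETED", Name: "web-1"}
	close(events)

	cache := Consume(events)
	fmt.Println("pods in cache:", len(cache))
}
```

In a real controller, client-go's informers maintain this kind of cache for you on top of a single watch connection per resource type.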

When Not to Use Kubernetes

This doesn’t get said enough: Kubernetes is operationally expensive.

A managed cluster (EKS, GKE, AKS) costs ~$100-300/month before any workloads. You need engineers who understand it. You need monitoring, logging, security policies, regular version upgrades (a new minor version every 4 months, with 14-month support windows). You’re signing up for a platform, not just a deployment tool.

For teams under ~10 engineers with fewer than a dozen services, the tradeoff is almost always wrong. Fly.io, Railway, Render, or even ECS are dramatically simpler and will handle most workloads just fine.

Kubernetes shines when you have: many independent services, teams that deploy independently, need for fine-grained resource isolation, or compliance requirements that benefit from namespace-level separation.

One Thing to Remember

Everything in Kubernetes is a reconciliation loop comparing desired state to actual state. Understanding this — from the scheduler to your custom operators — explains both why it’s reliable and why it fails in the specific, predictable ways it does. Most Kubernetes problems are resource misconfiguration, not Kubernetes itself.
