Kubernetes — Core Concepts
The Problem It Solved
In 2011, Netflix had a near-catastrophic infrastructure failure. Engineers were manually SSHing into servers, tracking which ones were healthy in spreadsheets, restarting crashed processes by hand. The company was growing 100% year-over-year. That approach was going to kill them.
Google had been solving this problem internally for about a decade with a system called Borg. In 2014, it open-sourced a new system built on those lessons: Kubernetes (Greek for “helmsman” or “pilot”). Within a few years, it had become the de facto standard for running containerized applications at scale.
This is the story of how it works.
The Core Idea: Declare What You Want, Not How to Get It
Most software instructions are imperative: do this, then do that. “Start a server. Copy this file. Restart the process.”
Kubernetes is declarative. You tell it the desired state of the world:
“I want 3 copies of my web app running. Each needs 512MB of memory and 1 CPU core. If any copy fails, replace it. Forward internet traffic to whichever ones are healthy.”
Kubernetes figures out how to make that happen — and more importantly, it keeps checking to make sure that state remains true. If the world drifts from what you declared, Kubernetes corrects it. This is called the reconciliation loop, and it runs constantly.
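The reconciliation idea fits in a few lines of Python. This is a toy model, not Kubernetes code: the `reconcile` function, the list of pod names, and the `web-N` naming scheme are all assumptions made up for illustration.

```python
from itertools import count


def reconcile(desired_replicas, running, next_id):
    """Return the actions needed to move `running` toward the desired count.

    `running` is a list of pod names and `next_id` is an iterator of fresh
    numeric ids. Both are hypothetical stand-ins for real cluster state.
    """
    actions = []
    # Too few pods: start new ones until the count matches.
    while len(running) < desired_replicas:
        name = f"web-{next(next_id)}"
        running.append(name)
        actions.append(("start", name))
    # Too many pods: stop the extras.
    while len(running) > desired_replicas:
        name = running.pop()
        actions.append(("stop", name))
    return actions


ids = count(1)
pods = []
first_pass = reconcile(3, pods, ids)   # empty cluster: three pods get started
pods.remove("web-2")                   # simulate a crash
second_pass = reconcile(3, pods, ids)  # drift detected: one replacement
third_pass = reconcile(3, pods, ids)   # state already matches: no actions
print(first_pass, second_pass, third_pass, pods)
```

The point of the sketch is the third call: when the actual state already matches the desired state, the loop does nothing, which is why it is safe to run it over and over.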
The Building Blocks
Pods — The Smallest Unit
A pod is one or more containers packaged together. Usually it’s one container per pod, but tightly-coupled processes (like an app and its log collector) can share a pod.
Pods are ephemeral. They’re meant to die. Kubernetes will kill and replace them constantly — during updates, after crashes, when rescheduling to a different machine. Designing for this is a mental shift most engineers struggle with initially.
Nodes — The Machines
A node is a physical or virtual machine in your cluster. Pods run on nodes. A cluster can be as small as a single node on a laptop or as large as thousands of nodes in a data center.
There are two kinds:
- Control plane nodes run the Kubernetes brain (the API server, the scheduler, the controllers that run the reconciliation loops, and etcd, the state database)
- Worker nodes run your actual application pods
Deployments — How You Describe Apps
You rarely create a pod directly. You create a Deployment: a description of what you want running, how many copies, what image to use, and what to do during updates.
```yaml
# Excerpt from a Deployment spec; other required fields are omitted
replicas: 5
image: my-web-app:v2.4
```
Kubernetes creates 5 pods from that spec. Kill one manually — it creates another within seconds. The scheduler decides which node each pod lands on, based on available resources.
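To make the scheduling step concrete, here is a minimal sketch in Python. It assumes a much simpler rule than the real scheduler uses: filter out nodes that lack enough free CPU and memory, then pick the one with the most free memory. The `Node` class, the `pick_node` function, and the node names and capacities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millicores: int
    free_memory_mb: int


def pick_node(nodes, cpu_millicores, memory_mb):
    """Pick a node for one pod, or return None if nothing fits."""
    # Filter: keep only nodes with enough free CPU and memory.
    candidates = [
        n for n in nodes
        if n.free_cpu_millicores >= cpu_millicores
        and n.free_memory_mb >= memory_mb
    ]
    if not candidates:
        return None  # in a real cluster the pod would stay Pending
    # Score: prefer the node with the most free memory (an arbitrary rule here).
    best = max(candidates, key=lambda n: n.free_memory_mb)
    # Reserve the resources so the next pod sees the updated capacity.
    best.free_cpu_millicores -= cpu_millicores
    best.free_memory_mb -= memory_mb
    return best.name


nodes = [
    Node("node-a", free_cpu_millicores=2000, free_memory_mb=1024),
    Node("node-b", free_cpu_millicores=4000, free_memory_mb=4096),
    Node("node-c", free_cpu_millicores=500, free_memory_mb=8192),
]
# Five replicas, each asking for 1 CPU core (1000 millicores) and 512 MB.
placements = [pick_node(nodes, 1000, 512) for _ in range(5)]
print(placements)
```

Note that `node-c` never receives a pod even though it has the most memory, because it fails the CPU filter. Filtering before scoring is the part of this sketch that carries over to the real thing.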
Services — Stable Network Addresses
Pods are temporary, but your users need a stable place to send requests. A Service is a stable address (a fixed virtual IP and DNS name) that routes traffic to whichever pods match its label selector and are currently passing their readiness checks. If pod A dies and pod B replaces it on a different machine with a different IP, callers of the Service don’t notice: the list of healthy pods behind it updates automatically.
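A rough Python sketch of that behavior follows. It assumes pods are plain dictionaries with an IP, a set of labels, and a ready flag, and that requests are spread round-robin; the `endpoints` helper and all the addresses are invented for illustration, and real clusters do this inside the networking layer rather than in application code.

```python
from itertools import cycle


def endpoints(pods, selector):
    """Return the IPs of ready pods whose labels match every selector entry."""
    return [
        p["ip"] for p in pods
        if p["ready"]
        and all(p["labels"].get(k) == v for k, v in selector.items())
    ]


pods = [
    {"ip": "10.0.0.4", "labels": {"app": "web"}, "ready": True},
    {"ip": "10.0.1.7", "labels": {"app": "web"}, "ready": False},  # failing checks
    {"ip": "10.0.2.9", "labels": {"app": "web"}, "ready": True},
    {"ip": "10.0.3.2", "labels": {"app": "db"}, "ready": True},    # different app
]

before = endpoints(pods, {"app": "web"})

# Pod 10.0.0.4 dies and a replacement comes up elsewhere with a new IP.
pods = [p for p in pods if p["ip"] != "10.0.0.4"]
pods.append({"ip": "10.0.5.1", "labels": {"app": "web"}, "ready": True})
after = endpoints(pods, {"app": "web"})

# Spread four requests across the current endpoints, round-robin.
backend = cycle(after)
routed = [next(backend) for _ in range(4)]
print(before, after, routed)
```

Callers only ever ask for "the web service"; the set of IPs behind that name is recomputed from labels and readiness, so a replaced pod needs no change on the caller's side.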
ConfigMaps and Secrets
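App configuration (database URLs, feature flags, API endpoints) goes in a ConfigMap. Sensitive values (passwords, tokens) go in a Secret. Both get injected into pods at runtime, as environment variables or mounted files, so you don’t bake credentials into your container images. Note that Secret values are only base64-encoded by default, not encrypted, so access to them still needs to be locked down.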
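From the application's side, injected configuration usually just looks like environment variables. The sketch below shows one way an app might read them in Python. The variable names (`DATABASE_URL`, `FEATURE_NEW_CHECKOUT`, `DB_PASSWORD`), the values, and the `load_settings` helper are assumptions for illustration, not anything Kubernetes prescribes.

```python
import base64
import os


def load_settings(env=os.environ):
    """Read settings that the platform injected as environment variables."""
    return {
        "database_url": env["DATABASE_URL"],  # required, from a ConfigMap
        "new_checkout": env.get("FEATURE_NEW_CHECKOUT", "false") == "true",
        "db_password": env["DB_PASSWORD"],    # required, from a Secret
    }


# Stand-in for what the container would see at runtime.
fake_env = {
    "DATABASE_URL": "postgres://db.internal:5432/shop",
    "FEATURE_NEW_CHECKOUT": "true",
    "DB_PASSWORD": "s3cr3t",
}
settings = load_settings(fake_env)

# Secret manifests store values base64-encoded. That is an encoding,
# not encryption: anyone who can read the value can decode it.
encoded = base64.b64encode(b"s3cr3t").decode("ascii")
decoded = base64.b64decode(encoded).decode("ascii")
print(settings["database_url"], settings["new_checkout"], encoded, decoded)
```

Because the image contains no credentials, the same image can run in staging and production with different injected values.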
Common Misconception: Kubernetes Manages Your Data
It doesn’t, at least not by default. Kubernetes is built primarily to manage compute, and a pod’s local filesystem is ephemeral: if the pod dies, anything written there dies with it. That is a large part of why databases have a reputation for being painful to run in Kubernetes.
Many teams therefore run stateful systems (Postgres, MySQL, Cassandra) outside Kubernetes on managed cloud services (AWS RDS, Google Cloud SQL), and only use Kubernetes for stateless application services. There are ways to handle state in Kubernetes (PersistentVolumes provide storage that outlives any single pod, and StatefulSets give pods stable identities), but doing so is significantly more complex and often not worth it.
How Updates Actually Work
This is where Kubernetes earns its reputation. Say you want to update your web app from version 1 to version 2.
A rolling update replaces pods gradually: start one v2 pod, wait for it to pass its health checks, remove one v1 pod, and repeat until all 5 pods run v2. If a new pod never becomes healthy, the rollout stalls instead of pressing on, and the remaining v1 pods keep serving traffic. Nothing goes down.
If you notice a bug in v2 after the rollout, one command (kubectl rollout undo) rolls back to v1. Kubernetes performs another rolling update, this time back to the previous version.
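The update and rollback logic can be modeled in a few lines of Python. This is a simplified sketch that replaces exactly one pod at a time; real rollouts can be configured to move several pods at once. The `rolling_update` function and the `is_healthy` callback (a stand-in for a health check) are hypothetical.

```python
def rolling_update(pods, new_version, is_healthy):
    """Replace pods one at a time; stop if a new pod fails its health check.

    `pods` is a list of version strings, one per running pod.
    `is_healthy(version, slot)` is a hypothetical stand-in for a health check.
    Returns (pods_after, completed).
    """
    pods = list(pods)  # work on a copy so the caller's list is untouched
    for slot, old_version in enumerate(pods):
        if old_version == new_version:
            continue
        # The new pod starts first; the old one keeps serving in the meantime.
        if not is_healthy(new_version, slot):
            return pods, False  # rollout halts, remaining old pods untouched
        pods[slot] = new_version  # new pod is healthy, retire the old one
    return pods, True


v1_fleet = ["v1"] * 5

# Happy path: every v2 pod passes its check.
updated, ok = rolling_update(v1_fleet, "v2", lambda version, slot: True)

# Bad release: the third v2 pod never becomes healthy.
stalled, stalled_ok = rolling_update(v1_fleet, "v2", lambda version, slot: slot < 2)

# Rollback is the same procedure aimed at the previous version.
rolled_back, rb_ok = rolling_update(updated, "v1", lambda version, slot: True)
print(updated, ok, stalled, stalled_ok, rolled_back, rb_ok)
```

In the stalled case the fleet ends up as a mix of two v2 pods and three v1 pods, all of them healthy, which is why a failed rollout degrades nothing for users.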
The Ecosystem
Kubernetes alone is powerful, but most teams pair it with a set of surrounding tools:
| Tool | What It Does |
|---|---|
| Helm | Package manager for Kubernetes apps (like apt, but for clusters) |
| Istio | Service mesh — manages traffic between services, adds encryption |
| Prometheus | Metrics collection for monitoring |
| Argo CD | GitOps: keeps your cluster in sync with manifests stored in a Git repo |
| cert-manager | Automatic TLS certificates |
This ecosystem is also why Kubernetes has a steep learning curve. You don’t just learn one thing — you learn an entire platform.
Who Actually Uses This?
A large share of companies with sizable engineering teams. Airbnb, Spotify, and The New York Times have all publicly described moving large parts of their infrastructure onto Kubernetes.
But it’s also genuinely overkill for small teams. A startup with 3 engineers probably doesn’t need Kubernetes — a simpler platform like Railway, Fly.io, or even Heroku will do fine and won’t require a dedicated DevOps engineer to maintain.
The honest answer: Kubernetes is the right tool when you have enough services and traffic that the complexity pays for itself.
One Thing to Remember
Kubernetes doesn’t run your app — it keeps your desired state of the app alive. You declare “3 copies, always healthy,” and it continuously works to make that true, healing failures and rerouting traffic automatically. That shift from imperative to declarative is the whole idea.