TensorFlow Serving — Core Concepts
What TensorFlow Serving Does
TensorFlow Serving is a production-grade system for deploying ML models as network services. It wraps SavedModel files in a high-performance C++ server that exposes REST and gRPC APIs for inference requests.
Unlike running model.predict() in a Python script, TF Serving handles:
- Concurrent requests from many clients simultaneously
- Model versioning with zero-downtime updates
- Request batching to maximize GPU utilization
- Hardware acceleration on GPUs and TPUs with optimized inference paths
Architecture Overview
TF Serving has three core concepts:
Servables
A servable is anything the system can serve — typically a TensorFlow SavedModel. Each servable has a name and a version number. Multiple versions can be loaded simultaneously.
Loaders
Loaders know how to load and unload servables. The default loader reads SavedModel directories. When a new version appears on disk (or in cloud storage), a file-system source detects it and creates a loader for it, so the new version is picked up automatically.
Managers
Managers oversee the lifecycle of servables — deciding which versions to load, which to unload, and how to handle transitions. The default policy keeps the latest version active and gracefully drains the previous one.
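To make the manager's role concrete, here is a toy Python sketch of a load-before-unload transition. It is not TF Serving's actual implementation (the real server is written in C++); the class name `ToyManager`, its `aspire` method, and the event log are hypothetical and exist only for illustration.

```python
class ToyManager:
    """Toy model of a manager that keeps one version of a servable active.

    Hypothetical illustration only; not TF Serving's real C++ API.
    """

    def __init__(self, name):
        self.name = name
        self.active_version = None   # version currently receiving requests
        self.events = []             # ordered log of lifecycle actions

    def aspire(self, new_version):
        """Switch to new_version, loading it before unloading the old one."""
        if new_version == self.active_version:
            return
        old_version = self.active_version
        self.events.append(("load", new_version))
        self.active_version = new_version      # new requests now go here
        if old_version is not None:
            self.events.append(("unload", old_version))


manager = ToyManager("my_model")
manager.aspire(1)
manager.aspire(2)
print(manager.events)
```

The point of the ordering is that the old version is only released after the new one is ready, so there is never a moment with no version available.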
How Model Updates Work
The update flow is what makes TF Serving production-ready:
- You train a new model and save it to models/my_model/2/ (version 2)
- TF Serving detects the new version directory
- It loads version 2 into memory
- Once loaded, it routes new requests to version 2
- In-flight requests to version 1 complete normally
- Version 1 is unloaded
No restart, no dropped requests. This is critical for systems that serve millions of users — you cannot take the service down every time you retrain.
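The detection step can be sketched in a few lines of Python. This is a simplified stand-in, assuming that versions are subdirectories with plain integer names and that the highest number wins; the function name `latest_version` is hypothetical, and the real server polls the base path periodically rather than being called by hand.

```python
import os
import tempfile


def latest_version(base_path):
    """Return the highest numeric version directory under base_path, or None.

    Only subdirectories whose names are plain integers count as versions.
    """
    versions = [
        int(entry)
        for entry in os.listdir(base_path)
        if entry.isdecimal() and os.path.isdir(os.path.join(base_path, entry))
    ]
    return max(versions) if versions else None


# Simulate a model directory that gains a second version.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "1"))
    before = latest_version(base)
    os.mkdir(os.path.join(base, "2"))
    os.mkdir(os.path.join(base, "tmp"))   # non-numeric names are ignored
    after = latest_version(base)

print(before, after)
```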
Two API Protocols
REST API
Simple HTTP/JSON endpoints. Easy to test with curl or any HTTP client:
POST http://localhost:8501/v1/models/my_model:predict
{
  "instances": [
    {"input": [1.0, 2.0, 3.0, 4.0]}
  ]
}
Response:
{
  "predictions": [[0.1, 0.8, 0.1]]
}
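The same request can be built from Python with only the standard library. This is a minimal sketch, assuming a server on localhost:8501 and a model named my_model whose signature takes a field called "input", as in the example above; the helper names `build_predict_request` and `parse_predictions` are hypothetical.

```python
import json
import urllib.request


def build_predict_request(model_name, instances, host="localhost", port=8501):
    """Build (but do not send) a POST request for the REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(response_body):
    """Extract the "predictions" list from a JSON response body."""
    return json.loads(response_body)["predictions"]


request = build_predict_request("my_model", [{"input": [1.0, 2.0, 3.0, 4.0]}])
print(request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(parse_predictions(response.read()))
```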
gRPC API
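A binary protocol (Protocol Buffers over HTTP/2), conventionally served on port 8500. It is typically faster than REST for large payloads (images, embeddings) because it avoids JSON serialization overhead. Throughput can be 2-5x higher for batch inference workloads, though the actual gain depends on payload size and model cost, so benchmark your own workload.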
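A rough way to see where the overhead comes from is to compare the size of the same numbers as JSON text and as packed float32 bytes. This sketch does not use gRPC or measure it; it only contrasts a text encoding with a 4-bytes-per-value binary encoding, which is how Protocol Buffers stores 32-bit floats. The sample values are made up, and part of the gap comes from JSON writing them at full double precision.

```python
import json
from array import array

# 1,000 example values standing in for an embedding or image slice.
values = [i / 7 for i in range(1000)]

json_size = len(json.dumps(values).encode("utf-8"))
binary_size = len(array("f", values).tobytes())   # float32: 4 bytes each

print(json_size, binary_size)
```

Size is only part of the cost: the server also has to parse the JSON text back into numbers, which the binary format avoids.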
Request Batching
Without batching, each request runs independently through the model. This underutilizes GPUs, which excel at parallel computation.
TF Serving’s batching scheduler collects requests that arrive within a time window and processes them as a single batch:
| Config | Effect |
|---|---|
| max_batch_size | Maximum requests per batch |
| batch_timeout_micros | How long to wait for a full batch (microseconds) |
| num_batch_threads | Parallel batch processing threads |
| max_enqueued_batches | Queue depth before rejecting requests |
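A well-tuned batcher can increase throughput by 3-10x on GPU while adding only milliseconds of latency. The exact trade-off depends on the model, the hardware, and the traffic pattern; the extra wait per request is governed mainly by batch_timeout_micros.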
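The interaction between max_batch_size and batch_timeout_micros can be illustrated with a small single-threaded simulation. This is a simplified model, not the real scheduler: it ignores num_batch_threads and max_enqueued_batches, and it closes a timed-out batch only when the next request arrives. The function name `form_batches` is hypothetical.

```python
def form_batches(arrivals_us, max_batch_size, batch_timeout_micros):
    """Group sorted arrival timestamps (microseconds) into batches.

    A batch closes when it is full, or when a request arrives later than
    the timeout measured from the batch's first request.
    """
    batches = []
    current = []
    for t in arrivals_us:
        if current and t - current[0] > batch_timeout_micros:
            batches.append(current)      # timed out: run what we have
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)      # full: run immediately
            current = []
    if current:
        batches.append(current)          # flush the final partial batch
    return batches


arrivals = [0, 100, 200, 5000, 5100, 5200, 5300, 5400]
batches = form_batches(arrivals, max_batch_size=4, batch_timeout_micros=1000)
print(batches)
```

With these made-up arrival times, the first three requests form a partial batch because of the timeout, the next four fill a batch exactly, and the last request is flushed on its own.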
Model Configuration
TF Serving can serve multiple models from one process. You list them in a configuration file, passed to the server with the --model_config_file flag:
model_config_list {
  config {
    name: "classifier"
    base_path: "/models/classifier"
    model_platform: "tensorflow"
  }
  config {
    name: "recommender"
    base_path: "/models/recommender"
    model_platform: "tensorflow"
  }
}
Each model gets its own endpoint. You can serve a classifier, a recommender, and a text encoder from a single TF Serving instance.
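If the list of models is maintained elsewhere (for example in a deployment script), the configuration text and the per-model endpoints can be generated rather than written by hand. This is a sketch under assumptions: the helper names `render_model_config` and `predict_url` are hypothetical, and the host and port are the same localhost:8501 used in the REST example.

```python
def render_model_config(models):
    """Render a model_config_list text block from {name: base_path}."""
    lines = ["model_config_list {"]
    for name, base_path in models.items():
        lines += [
            "  config {",
            f'    name: "{name}"',
            f'    base_path: "{base_path}"',
            '    model_platform: "tensorflow"',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


def predict_url(name, host="localhost", port=8501):
    """REST predict endpoint for a named model."""
    return f"http://{host}:{port}/v1/models/{name}:predict"


models = {
    "classifier": "/models/classifier",
    "recommender": "/models/recommender",
}
config_text = render_model_config(models)
print(config_text)
for name in models:
    print(predict_url(name))
```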
Common Misconception
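“TF Serving only works with TensorFlow models.” TF Serving is optimized for TensorFlow SavedModels, but its servable abstraction is generic: it can serve any model format for which a compatible servable implementation exists. In practice that usually means writing and building custom C++ code, and adapters for formats such as ONNX are third-party efforts rather than part of the official distribution, so check their maintenance status before relying on them. For non-TensorFlow models, alternatives like Triton Inference Server or BentoML are often simpler.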
When to Use TF Serving
Good fit: High-throughput prediction services, models that update frequently, GPU-accelerated inference, teams already using TensorFlow.
Not ideal: Simple batch prediction jobs (just use model.predict()), models that need complex pre/post-processing (consider wrapping TF Serving behind a FastAPI gateway), or multi-framework environments (consider Triton).
The one thing to remember: TF Serving turns SavedModels into production endpoints with zero-downtime updates and automatic batching — it bridges the gap between “model works” and “model serves users.”