TensorFlow Serving — Core Concepts

What TensorFlow Serving Does

TensorFlow Serving is a production-grade system for deploying ML models as network services. It wraps SavedModel files in a high-performance C++ server that exposes REST and gRPC APIs for inference requests.

Unlike running model.predict() in a Python script, TF Serving handles:

  • Concurrent requests from many clients simultaneously
  • Model versioning with zero-downtime updates
  • Request batching to maximize GPU utilization
  • Hardware acceleration on GPUs and TPUs with optimized inference paths

Architecture Overview

TF Serving has three core concepts:

Servables

A servable is anything the system can serve — typically a TensorFlow SavedModel. Each servable has a name and a version number. Multiple versions can be loaded simultaneously.

Loaders

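Loaders know how to load and unload a servable. The default loader reads SavedModel directories. Discovery is handled by a companion component called a source: when a new version appears on disk (or in cloud storage), the file-system source detects it and hands a loader for that version to the manager.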
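To make the discovery step concrete, here is a minimal Python sketch of scanning a model's base path for numeric version directories. It is an illustration only, not TF Serving's implementation (which is C++); the function names and the temporary directory layout are hypothetical.

```python
import os
import tempfile


def discover_versions(base_path):
    """Return the numeric version subdirectories under base_path, ascending."""
    versions = []
    for entry in os.listdir(base_path):
        full = os.path.join(base_path, entry)
        # Only directories whose names are plain ASCII integers count as versions.
        if os.path.isdir(full) and entry.isascii() and entry.isdigit():
            versions.append(int(entry))
    return sorted(versions)


def latest_version(base_path):
    """Return the highest version number, or None if no version exists yet."""
    versions = discover_versions(base_path)
    return versions[-1] if versions else None


# Build a throwaway layout: <tmp>/my_model/1, <tmp>/my_model/2, plus a stray folder.
_root = tempfile.mkdtemp()
model_dir = os.path.join(_root, "my_model")
for name in ("1", "2", "notes"):
    os.makedirs(os.path.join(model_dir, name))

found = discover_versions(model_dir)
newest = latest_version(model_dir)
print(found, newest)
```

The stray "notes" folder is ignored because its name is not a number, which is why version directories are conventionally named 1, 2, 3 and so on.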

Managers

Managers oversee the lifecycle of servables — deciding which versions to load, which to unload, and how to handle transitions. The default policy keeps the latest version active and gracefully drains the previous one.

How Model Updates Work

The update flow is what makes TF Serving production-ready:

  1. You train a new model and save it to models/my_model/2/ (version 2)
  2. TF Serving detects the new version directory
  3. It loads version 2 into memory
  4. Once loaded, it routes new requests to version 2
  5. In-flight requests to version 1 complete normally
  6. Version 1 is unloaded

No restart, no dropped requests. This is critical for systems that serve millions of users — you cannot take the service down every time you retrain.
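The ordering of those six steps can be sketched as a toy state machine. This is a simplified model written for illustration, assuming a single model and one version switch; ToyManager and its methods are hypothetical names, not part of TF Serving.

```python
class ToyManager:
    """Toy model of the load-new-then-unload-old transition (not TF Serving code)."""

    def __init__(self):
        self.loaded = set()      # versions currently in memory
        self.active = None       # version that receives new requests
        self.in_flight = {}      # version -> number of unfinished requests
        self.events = []         # ordered log of what happened

    def start_request(self):
        """Route a new request to the active version and return that version."""
        self.in_flight[self.active] = self.in_flight.get(self.active, 0) + 1
        return self.active

    def finish_request(self, version):
        """Mark one request as done; unload the version if it is retired and idle."""
        self.in_flight[version] -= 1
        self._maybe_unload(version)

    def new_version(self, version):
        """Load the new version first, switch traffic, then retire the old one."""
        self.loaded.add(version)
        self.events.append(f"load {version}")
        old, self.active = self.active, version
        self.events.append(f"route {version}")
        if old is not None:
            self._maybe_unload(old)

    def _maybe_unload(self, version):
        # A version is only dropped once it is inactive and has no open requests.
        idle = self.in_flight.get(version, 0) == 0
        if version != self.active and version in self.loaded and idle:
            self.loaded.discard(version)
            self.events.append(f"unload {version}")


manager = ToyManager()
manager.new_version(1)
v = manager.start_request()        # a request starts on version 1
manager.new_version(2)             # version 2 appears while that request runs
routed = manager.start_request()   # new traffic goes to version 2
manager.finish_request(v)          # the old request completes, version 1 can go
print(manager.events)
```

The key design point is that the unload of version 1 is deferred until its last in-flight request has finished, so both versions are briefly in memory at once.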

Two API Protocols

REST API

Simple HTTP/JSON endpoints. Easy to test with curl or any HTTP client:

POST http://localhost:8501/v1/models/my_model:predict

{
  "instances": [
    {"input": [1.0, 2.0, 3.0, 4.0]}
  ]
}

Response:

{
  "predictions": [[0.1, 0.8, 0.1]]
}
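The same request can be built from Python with only the standard library. This is a minimal sketch that assumes a server is listening on localhost:8501 and serving a model named my_model, as in the example above; the helper names are hypothetical. Only the request construction runs here, since predict() needs a live server.

```python
import json
import urllib.request


def build_predict_request(model_name, instances, host="localhost", port=8501):
    """Build (but do not send) a POST request for the REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(model_name, instances, timeout=5.0):
    """Send the request and return the "predictions" field of the JSON reply."""
    request = build_predict_request(model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]


request = build_predict_request("my_model", [{"input": [1.0, 2.0, 3.0, 4.0]}])
print(request.get_method(), request.full_url)
```

With a running server, predict("my_model", [{"input": [1.0, 2.0, 3.0, 4.0]}]) would return the list under "predictions" in the response.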

gRPC API

Binary protocol on port 8500. Significantly faster than REST for large payloads (images, embeddings) because it avoids JSON serialization overhead. Throughput can be 2-5x higher for batch inference workloads.
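A rough way to see where the JSON overhead comes from is to compare the size of the same numbers encoded as text and as fixed-width binary. The sketch below is not gRPC or Protocol Buffers; it uses Python's struct module as a stand-in for a binary float32 encoding, and the 1,000-element vector is a made-up example.

```python
import json
import struct

# A hypothetical 1,000-element float vector, e.g. one embedding.
values = [i / 7.0 for i in range(1000)]

# JSON spells every number out as decimal text.
json_bytes = json.dumps({"instances": [values]}).encode("utf-8")

# A binary encoding stores each value as a fixed 4-byte float32.
binary_bytes = struct.pack(f"<{len(values)}f", *values)

print(len(json_bytes), len(binary_bytes))
```

Besides the size difference, the text form also has to be parsed back into floats on the server, which is extra work a binary payload avoids.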

Request Batching

Without batching, each request runs independently through the model. This underutilizes GPUs, which excel at parallel computation.

TF Serving’s batching scheduler collects requests that arrive within a time window and processes them as a single batch:

Config                  Effect
max_batch_size          Maximum requests per batch
batch_timeout_micros    How long to wait for a full batch
num_batch_threads       Parallel batch processing threads
max_enqueued_batches    Queue depth before rejecting requests

A well-tuned batcher can increase throughput by 3-10x on GPU while adding only milliseconds of latency.
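The interaction between batch size and timeout can be sketched with a small simulation. This is a toy model under simplifying assumptions (a single queue, arrival times given in microseconds, the timeout measured from the first request in a batch); form_batches is a hypothetical helper and does not reproduce TF Serving's actual scheduler.

```python
def form_batches(arrivals_us, max_batch_size, batch_timeout_micros):
    """Group sorted request arrival times (microseconds) into batches.

    A batch closes when it is full, or when a later request arrives after the
    timeout measured from the batch's first request. Toy model only.
    """
    batches = []
    current = []
    for t in arrivals_us:
        # Close the open batch if this arrival falls after its deadline.
        if current and t - current[0] > batch_timeout_micros:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it reaches the size limit.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 100, 200, 300, 5000, 5100, 9000]
batches = form_batches(arrivals, max_batch_size=3, batch_timeout_micros=1000)
print(batches)
# → [[0, 100, 200], [300], [5000, 5100], [9000]]
```

Here seven requests become four model invocations. Raising batch_timeout_micros would merge more of them at the cost of extra waiting time for the earliest request in each batch.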

Model Configuration

TF Serving supports multiple models simultaneously through a configuration file:

model_config_list {
  config {
    name: "classifier"
    base_path: "/models/classifier"
    model_platform: "tensorflow"
  }
  config {
    name: "recommender"
    base_path: "/models/recommender"
    model_platform: "tensorflow"
  }
}

Each model gets its own endpoint. You can serve a classifier, a recommender, and a text encoder from a single TF Serving instance.
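As a sketch of how model names map to endpoints, the snippet below renders a configuration block like the one above from a dictionary and derives the REST predict URL for each model. The helper names are hypothetical, and the host and port assume the REST example earlier in this article.

```python
def render_model_config(models):
    """Render a model_config_list text block from {name: base_path} pairs."""
    lines = ["model_config_list {"]
    for name, base_path in models.items():
        lines += [
            "  config {",
            f'    name: "{name}"',
            f'    base_path: "{base_path}"',
            '    model_platform: "tensorflow"',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


def predict_endpoints(models, host="localhost", port=8501):
    """Map each model name to its REST predict URL."""
    return {
        name: f"http://{host}:{port}/v1/models/{name}:predict" for name in models
    }


models = {
    "classifier": "/models/classifier",
    "recommender": "/models/recommender",
}
config_text = render_model_config(models)
endpoints = predict_endpoints(models)
print(config_text)
print(endpoints["recommender"])
```

The rendered text would typically be written to a file and passed to the server through its --model_config_file flag.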

Common Misconception

“TF Serving only works with TensorFlow models.” TF Serving is optimized for TensorFlow SavedModels, but its servable abstraction is extensible: it can serve any model format for which a compatible servable implementation exists, including custom inference code. In practice this means writing and maintaining C++ extensions, so for non-TensorFlow models, alternatives like Triton Inference Server or BentoML are often simpler.

When to Use TF Serving

Good fit: High-throughput prediction services, models that update frequently, GPU-accelerated inference, teams already using TensorFlow.

Not ideal: Simple batch prediction jobs (just use model.predict()), models that need complex pre/post-processing (consider wrapping TF Serving behind a FastAPI gateway), or multi-framework environments (consider Triton).

The one thing to remember: TF Serving turns SavedModels into production endpoints with zero-downtime updates and automatic batching — it bridges the gap between “model works” and “model serves users.”

