TensorFlow Serving — Deep Dive
Production Deployment with Docker
The standard deployment path uses the official TF Serving Docker image:
# Pull the GPU-enabled image
docker pull tensorflow/serving:latest-gpu
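# Serve the model stored under /path/to/models/classifier
# (the image loads /models/${MODEL_NAME} and serves it under that name)
docker run -d --name tf-serving \
  --gpus all \
  -p 8500:8500 \
  -p 8501:8501 \
  -v /path/to/models:/models \
  -e MODEL_NAME=classifier \
  tensorflow/serving:latest-gpu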
For multiple models, mount a config file:
docker run -d --name tf-serving \
  --gpus all \
  -p 8500:8500 -p 8501:8501 \
  -v /path/to/models:/models \
  -v /path/to/config:/config \
  tensorflow/serving:latest-gpu \
  --model_config_file=/config/models.config \
  --model_config_file_poll_wait_seconds=60
The --model_config_file_poll_wait_seconds flag tells TF Serving to re-read the config file every 60 seconds, so models can be added or removed without restarting the server.
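As a rough illustration of what a multi-model models.config might contain, the sketch below renders one from a Python dict. The helper name `render_model_config`, the model names, and the base paths are assumptions made for this example; only the `model_config_list`, `config`, `name`, `base_path`, and `model_platform` fields come from TF Serving's config format.

```python
def render_model_config(models):
    """Render a TF Serving model_config_list in protobuf text format.

    `models` maps a model name to its base path inside the container.
    """
    blocks = []
    for name, base_path in models.items():
        blocks.append(
            "  config {\n"
            f'    name: "{name}"\n'
            f'    base_path: "{base_path}"\n'
            '    model_platform: "tensorflow"\n'
            "  }"
        )
    return "model_config_list {\n" + "\n".join(blocks) + "\n}\n"


# Hypothetical model names and paths, for illustration only.
config_text = render_model_config({
    "classifier": "/models/classifier",
    "ranker": "/models/ranker",
})
print(config_text)
```

Writing the rendered text to the mounted config path would let the polling flag pick up the change on its next cycle.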
SavedModel Export for Serving
Models must include a serving signature that defines input/output tensor names:
import tensorflow as tf

model = tf.keras.models.load_model("trained_model")

# Export with explicit signatures
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 224, 224, 3], dtype=tf.float32, name="image")
])
def serve(image):
    predictions = model(image, training=False)
    return {"class_probabilities": predictions}

tf.saved_model.save(
    model,
    "models/classifier/1",
    signatures={"serving_default": serve}
)
Inspect the exported signature:
saved_model_cli show --dir models/classifier/1 --tag_set serve \
--signature_def serving_default
This shows the exact tensor names and shapes clients need to construct requests.
gRPC Client Implementation
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

def predict(image_array, model_name="classifier", host="localhost:8500"):
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"
    request.inputs["image"].CopyFrom(
        tf.make_tensor_proto(image_array, dtype=tf.float32)
    )
    response = stub.Predict(request, timeout=5.0)
    probs = tf.make_ndarray(response.outputs["class_probabilities"])
    return probs

# Usage
image = np.random.rand(1, 224, 224, 3).astype(np.float32)
result = predict(image)
print(f"Top class: {np.argmax(result)}")
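The containers above also publish port 8501, which is TF Serving's REST endpoint. As a hedged sketch of the same call over REST using only the standard library: the helper names `build_rest_predict` and `rest_predict` are made up for this example, and it assumes the row-oriented ("instances") request format with a single input.

```python
import json
import urllib.request


def build_rest_predict(instances, model_name="classifier",
                       host="http://localhost:8501", version_label=None):
    """Return (url, body) for a TF Serving REST predict call."""
    path = f"/v1/models/{model_name}"
    if version_label is not None:
        # Optional: address a labeled version instead of the default one.
        path += f"/labels/{version_label}"
    url = f"{host}{path}:predict"
    body = json.dumps({
        "signature_name": "serving_default",
        "instances": instances,
    }).encode("utf-8")
    return url, body


def rest_predict(instances, timeout=5.0, **kwargs):
    """POST the request and return the parsed "predictions" field."""
    url, body = build_rest_predict(instances, **kwargs)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["predictions"]


# Building the request needs no running server; the payload here is a toy value.
url, body = build_rest_predict([[0.0, 1.0]], version_label="canary")
print(url)
```

REST is convenient for debugging and for clients without gRPC support; JSON encoding of large image tensors is considerably bulkier than the protobuf form, so gRPC is usually the better fit for high-throughput paths.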
Connection Pooling for High Throughput
import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Create the channel once and reuse it across requests
channel = grpc.insecure_channel(
    "localhost:8500",
    options=[
        ("grpc.max_send_message_length", 50 * 1024 * 1024),     # 50 MiB
        ("grpc.max_receive_message_length", 50 * 1024 * 1024),  # 50 MiB
        ("grpc.keepalive_time_ms", 30000),     # ping every 30 s
        ("grpc.keepalive_timeout_ms", 5000),   # wait 5 s for the ping ack
    ]
)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
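One way to make channel reuse explicit is a small cache that hands out a single long-lived channel per target. This is only a sketch: `ChannelCache` is a hypothetical helper, and the channel factory is injected so the example runs without a server. In a real client the factory would wrap `grpc.insecure_channel` with the options shown above.

```python
import threading


class ChannelCache:
    """Hand out one long-lived channel per target, created on first use."""

    def __init__(self, channel_factory):
        self._factory = channel_factory
        self._channels = {}
        self._lock = threading.Lock()

    def get(self, target):
        # The lock keeps two threads from creating duplicate channels.
        with self._lock:
            if target not in self._channels:
                self._channels[target] = self._factory(target)
            return self._channels[target]


# Stand-in factory that records each creation instead of opening a connection.
created = []


def fake_factory(target):
    created.append(target)
    return object()


cache = ChannelCache(fake_factory)
first = cache.get("localhost:8500")
second = cache.get("localhost:8500")
print(first is second, created)
```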
Batching Configuration
Create a batching_parameters.txt file:
max_batch_size { value: 128 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 1000000 }
pad_variable_length_inputs: true
Launch with batching enabled:
docker run -d \
  --gpus all \
  -p 8500:8500 -p 8501:8501 \
  -v /models:/models \
  -v /config:/config \
  tensorflow/serving:latest-gpu \
  --model_config_file=/config/models.config \
  --enable_batching=true \
  --batching_parameters_file=/config/batching_parameters.txt
Tuning Batch Parameters
| Parameter | Too Low | Too High |
|---|---|---|
| max_batch_size | Underutilizes GPU | OOM on large inputs |
| batch_timeout_micros | Small batches, poor throughput | High latency for first request in batch |
| num_batch_threads | Serialized batch processing | Excessive context switching |
Start with max_batch_size=32, batch_timeout_micros=5000 (5ms), and adjust based on profiling. Monitor the batching_session:batch_size metric to see actual batch sizes.
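To get a feel for how these two parameters interact before profiling, a back-of-the-envelope estimate can help. The sketch below assumes evenly spaced arrivals and ignores model execution time, so treat its numbers as a starting point rather than a prediction; `estimate_batching` is a name invented for this example.

```python
def estimate_batching(requests_per_second, max_batch_size, batch_timeout_micros):
    """Rough steady-state estimate, assuming evenly spaced arrivals."""
    window_seconds = batch_timeout_micros / 1_000_000
    arrivals_per_window = requests_per_second * window_seconds
    expected_batch = max(1.0, min(float(max_batch_size), arrivals_per_window))
    # A batch is dispatched as soon as it fills, so the added wait is capped by
    # whichever comes first: the timeout or the time needed to fill the batch.
    fill_seconds = max_batch_size / requests_per_second
    worst_wait_ms = min(window_seconds, fill_seconds) * 1000
    return expected_batch, worst_wait_ms


for rps in (500, 4000, 20000):
    batch, wait_ms = estimate_batching(rps, max_batch_size=32, batch_timeout_micros=5000)
    print(f"{rps:>6} req/s -> ~{batch:.1f} per batch, up to {wait_ms:.1f} ms added wait")
```

At low request rates the timeout dominates and batches stay small, which suggests that raising max_batch_size alone will not help until traffic (or the timeout) increases.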
Model Versioning and Rollouts
Version Policy
Control which versions stay loaded:
model_config_list {
  config {
    name: "classifier"
    base_path: "/models/classifier"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 3
        versions: 4
      }
    }
    version_labels {
      key: "stable"
      value: 3
    }
    version_labels {
      key: "canary"
      value: 4
    }
  }
}
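By default, a label can only be assigned to a version that is already loaded; to set labels at startup, launch the server with --allow_version_labels_for_unavailable_models=true. Clients can then request a version by label: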
request.model_spec.version_label = "canary"
A/B Testing Architecture
Route traffic between versions using a gateway (Envoy, Istio, or a custom FastAPI proxy):
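import random

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    instances: list  # request payload; the shape depends on the model

@app.post("/predict")
async def predict(request: PredictRequest):
    # Send roughly 10% of traffic to the canary version
    version = "canary" if random.random() < 0.1 else "stable"
    # call_tf_serving and log_prediction are application-specific helpers (not shown)
    result = await call_tf_serving(request, version_label=version)
    # Log version for analysis
    log_prediction(version=version, result=result)
    return result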
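Random assignment means the same user can bounce between versions from one request to the next. If per-user consistency matters for the analysis, one common alternative is to hash a stable identifier into a bucket. The sketch below swaps that technique in for the random split; `pick_version` and the user ID format are assumptions for illustration.

```python
import hashlib


def pick_version(user_id, canary_percent=10):
    """Map a user ID to "canary" or "stable" deterministically."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# The same user always lands on the same version.
print(pick_version("user-42") == pick_version("user-42"))

# Across many users the split approaches the configured percentage.
share = sum(pick_version(f"user-{i}") == "canary" for i in range(10_000)) / 10_000
print(f"canary share: {share:.3f}")
```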
Canary Rollout Process
- Deploy new version alongside current
- Route 5% traffic → monitor error rate and latency
- Increase to 25% → check accuracy metrics
- Full rollout at 100% → unload old version
- If metrics degrade at any step → rollback by changing the version label
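The steps above can be sketched as a small decision function that either advances to the next traffic stage or signals a rollback. The stage list mirrors the percentages in the list; the metric names and the ratio thresholds (1.5x error rate, 1.2x P99 latency) are illustrative assumptions, not recommended values.

```python
STAGES = [5, 25, 100]  # percent of traffic on the new version


def next_step(current_percent, canary, stable,
              max_error_ratio=1.5, max_latency_ratio=1.2):
    """Return the next traffic percentage, or 0 to roll back."""
    error_ok = canary["error_rate"] <= stable["error_rate"] * max_error_ratio
    latency_ok = canary["p99_ms"] <= stable["p99_ms"] * max_latency_ratio
    if not (error_ok and latency_ok):
        return 0  # roll back: point the label back at the old version
    later = [s for s in STAGES if s > current_percent]
    return later[0] if later else current_percent


# Made-up metrics for illustration.
stable = {"error_rate": 0.002, "p99_ms": 80.0}
healthy = next_step(5, {"error_rate": 0.002, "p99_ms": 85.0}, stable)
regressed = next_step(25, {"error_rate": 0.010, "p99_ms": 85.0}, stable)
print(healthy, regressed)
```

Note that with a stable error rate of exactly zero, any canary error triggers a rollback under this rule; a production check would likely add an absolute floor and a minimum sample size.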
GPU Sharing and Resource Management
Multiple Models on One GPU
By default, the TF Serving process reserves most of the GPU's memory at startup, and every model it loads shares that pool. Control allocation:
# Limit TF Serving to 50% of GPU memory
docker run -d \
--gpus all \
-e TF_FORCE_GPU_ALLOW_GROWTH=true \
tensorflow/serving:latest-gpu \
--per_process_gpu_memory_fraction=0.5
TF_FORCE_GPU_ALLOW_GROWTH=true allocates memory incrementally rather than reserving the full fraction upfront.
Multi-GPU Assignment
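model_config_list {
  config {
    name: "vision_model"
    base_path: "/models/vision"
    model_platform: "tensorflow"
    # Intended for GPU 0 (the config format has no per-model GPU field)
  }
  config {
    name: "nlp_model"
    base_path: "/models/nlp"
    model_platform: "tensorflow"
    # Intended for GPU 1
  }
}
The model config itself has no per-model device setting, so pinning happens at the process level: run one TF Serving instance per GPU and restrict each with CUDA_VISIBLE_DEVICES (or Docker's --gpus device=N), or bake explicit device placement into the exported graph for fine-grained control.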
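As a sketch of the one-process-per-GPU approach, the helper below builds a `docker run` argument list for each GPU. The function name, the container names, the port scheme (8500, 8510, ...), and the model directory names are all assumptions for illustration; `--gpus device=N` is Docker's syntax for exposing a single GPU.

```python
def serving_commands(models_by_gpu, image="tensorflow/serving:latest-gpu"):
    """Build one `docker run` argv per GPU, each serving a single model."""
    commands = []
    for gpu_index, model_name in sorted(models_by_gpu.items()):
        grpc_port = 8500 + 10 * gpu_index  # hypothetical host port scheme
        commands.append([
            "docker", "run", "-d",
            "--name", f"tf-serving-gpu{gpu_index}",
            "--gpus", f"device={gpu_index}",
            "-p", f"{grpc_port}:8500",
            "-v", "/path/to/models:/models",
            "-e", f"MODEL_NAME={model_name}",
            image,
        ])
    return commands


# Directory names under /models are assumed to match the model names.
cmds = serving_commands({0: "vision", 1: "nlp"})
for argv in cmds:
    print(" ".join(argv))
```

Each list can be passed to `subprocess.run` as-is; a gateway or service mesh then routes requests to the right container by model.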
Monitoring and Observability
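TF Serving can expose Prometheus-compatible metrics at http://localhost:8501/monitoring/prometheus/metrics. The endpoint is off by default: enable it by passing --monitoring_config_file with a prometheus_config block that sets enable: true and the path above.
Key metrics to monitor: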
| Metric | What It Tells You |
|---|---|
| :tensorflow:serving:request_count | Total requests by model and method |
| :tensorflow:serving:request_latency | Prediction latency histogram |
| :tensorflow:core:graph_runs | Model execution count |
| :tensorflow:serving:batching_session:batch_size | Actual batch sizes |
Alerting Thresholds
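# Prometheus alerting rules
# Metric names follow the table above. Assumptions to verify against your
# server's output: latency buckets are in microseconds, and successful
# requests carry status="OK".
groups:
  - name: tf-serving
    rules:
      - alert: HighLatency
        expr: 'histogram_quantile(0.99, sum by (le, model_name) (rate(:tensorflow:serving:request_latency_bucket[5m]))) > 100000'
        for: 5m
        annotations:
          summary: "P99 latency > 100ms for {{ $labels.model_name }}"
      - alert: HighErrorRate
        expr: 'sum by (model_name) (rate(:tensorflow:serving:request_count{status!="OK"}[5m])) / sum by (model_name) (rate(:tensorflow:serving:request_count[5m])) > 0.01'
        for: 2m
        annotations:
          summary: "Error rate > 1% for {{ $labels.model_name }}"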
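To make the latency rule less opaque, here is a minimal sketch of how a quantile is estimated from cumulative histogram buckets, using linear interpolation within the matching bucket in the way PromQL's histogram_quantile does. The bucket boundaries and counts are made up, and the values are assumed to be microseconds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the overflow bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Interpolate linearly between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative cumulative counts per latency bucket (upper bounds in microseconds).
sample = [(10_000, 600), (50_000, 900), (100_000, 980), (500_000, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)
print(f"estimated P99: {p99 / 1000:.0f} ms")
```

Because the estimate is interpolated inside a bucket, its precision is limited by the bucket boundaries: a wide top bucket makes the P99 figure correspondingly coarse.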
Performance Optimization
Warm-Up Requests
Cold models cause latency spikes on the first request (TensorFlow graph compilation, GPU memory allocation). Add warm-up data:
# Create tf_serving_warmup_requests in the model directory
models/classifier/1/assets.extra/tf_serving_warmup_requests
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

# Write the warm-up record into the version directory's assets.extra folder
warmup_dir = "models/classifier/1/assets.extra"
tf.io.gfile.makedirs(warmup_dir)
with tf.io.TFRecordWriter(f"{warmup_dir}/tf_serving_warmup_requests") as writer:
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "classifier"
    request.model_spec.signature_name = "serving_default"
    request.inputs["image"].CopyFrom(
        tf.make_tensor_proto(np.zeros((1, 224, 224, 3), dtype=np.float32))
    )
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request)
    )
    writer.write(log.SerializeToString())
TensorRT Integration
For NVIDIA GPUs, convert models to TensorRT for optimized inference:
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="models/classifier/1",
    conversion_params=trt.TrtConversionParams(
        precision_mode=trt.TrtPrecisionMode.FP16,
        max_workspace_size_bytes=1 << 30
    )
)
converter.convert()
converter.save("models/classifier_trt/1")
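TensorRT-converted models can run several times faster than standard TensorFlow on the same GPU (2-6x is a commonly quoted range) by fusing operations and optimizing memory access patterns. The actual gain depends on the model, the precision mode, and the hardware, and FP16 can shift outputs slightly, so compare both latency and accuracy before and after converting.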
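Since the speedup varies by model, it is worth measuring rather than assuming. The sketch below is a generic timing harness, not part of TF Serving or TensorRT: `benchmark` is a hypothetical helper that times any zero-argument callable, and the workload shown is a stand-in so the example runs on its own.

```python
import statistics
import time


def benchmark(fn, warmup=5, iterations=50):
    """Time a zero-argument callable; return (median_ms, p99_ms)."""
    for _ in range(warmup):
        fn()  # discard cold-start effects
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return statistics.median(samples), samples[p99_index]


# Stand-in workload; in practice `fn` would wrap a predict call against the
# baseline model and then against the TensorRT-converted one.
median_ms, p99_ms = benchmark(lambda: sum(range(10_000)))
print(f"median {median_ms:.3f} ms, p99 {p99_ms:.3f} ms")
```

Running the same harness against both model versions with identical inputs gives a like-for-like comparison of median and tail latency.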
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-classifier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving-classifier
  template:
    metadata:
      labels:
        app: tf-serving-classifier  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving:latest-gpu
          ports:
            - containerPort: 8500
            - containerPort: 8501
          args:
            # /config/models.config must be mounted into the pod (volume not shown)
            - --model_config_file=/config/models.config
            - --enable_batching=true
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "8Gi"
            requests:
              memory: "4Gi"
          readinessProbe:
            httpGet:
              path: /v1/models/classifier
              port: 8501
            initialDelaySeconds: 30
          livenessProbe:
            httpGet:
              path: /v1/models/classifier
              port: 8501
            initialDelaySeconds: 60
Use Horizontal Pod Autoscaler (HPA) based on request latency or GPU utilization to scale replicas dynamically.
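For intuition about how the HPA reacts, its core scaling rule is desired = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1 (0.1 by default). A minimal sketch of that arithmetic follows; the replica bounds and the GPU utilization numbers are illustrative assumptions, and real HPA behavior also involves stabilization windows not modeled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Apply the HPA scaling rule to a single metric reading."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Target of 60% GPU utilization across 3 replicas (illustrative numbers).
print(desired_replicas(3, current_metric=90, target_metric=60))  # scale up
print(desired_replicas(3, current_metric=62, target_metric=60))  # within tolerance
print(desired_replicas(3, current_metric=30, target_metric=60))  # scale down
```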
The one thing to remember: Production TF Serving requires warm-up requests, tuned batching, version labels for safe rollouts, and Prometheus monitoring — the serving binary is just the start; the operational infrastructure around it is what makes it reliable.