TensorFlow Serving — Deep Dive

Production Deployment with Docker

The standard deployment path uses the official TF Serving Docker image:

# Pull the GPU-enabled image
docker pull tensorflow/serving:latest-gpu

# Serve a model; the image loads it from /models/${MODEL_NAME} inside the container
docker run -d --name tf-serving \
    --gpus all \
    -p 8500:8500 \
    -p 8501:8501 \
    -v /path/to/models:/models \
    -e MODEL_NAME=my_classifier \
    tensorflow/serving:latest-gpu

For multiple models, mount a config file:

docker run -d --name tf-serving \
    --gpus all \
    -p 8500:8500 -p 8501:8501 \
    -v /path/to/models:/models \
    -v /path/to/config:/config \
    tensorflow/serving:latest-gpu \
    --model_config_file=/config/models.config \
    --model_config_file_poll_wait_seconds=60

The model_config_file_poll_wait_seconds flag tells TF Serving to re-read the config file every 60 seconds, so models can be added or removed without restarting the server.
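
The mounted models.config is a protobuf text-format file with one config entry per model. As a rough sketch, such a file could be generated as shown here; the model names and base paths are placeholders, and render_model_config is a hypothetical helper rather than part of TF Serving.

```python
def render_model_config(models):
    """Render a model_config_list block in protobuf text format.

    `models` maps a model name to its base path inside the container.
    """
    entries = []
    for name, base_path in models.items():
        entries.append(
            "  config {\n"
            f'    name: "{name}"\n'
            f'    base_path: "{base_path}"\n'
            '    model_platform: "tensorflow"\n'
            "  }"
        )
    return "model_config_list {\n" + "\n".join(entries) + "\n}\n"


# Placeholder model names and paths, for illustration only
config_text = render_model_config({
    "classifier": "/models/classifier",
    "detector": "/models/detector",
})
print(config_text)
```

Writing this text to /path/to/config/models.config on the host makes it visible to the container through the /config mount.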

SavedModel Export for Serving

Models must include a serving signature that defines input/output tensor names:

import tensorflow as tf

model = tf.keras.models.load_model("trained_model")

# Export with explicit signatures
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 224, 224, 3], dtype=tf.float32, name="image")
])
def serve(image):
    predictions = model(image, training=False)
    return {"class_probabilities": predictions}

tf.saved_model.save(
    model,
    "models/classifier/1",
    signatures={"serving_default": serve}
)

Inspect the exported signature:

saved_model_cli show --dir models/classifier/1 --tag_set serve \
    --signature_def serving_default

This shows the exact tensor names and shapes clients need to construct requests.
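
As an illustration of how a client might use that information, the sketch below builds a JSON body for the REST predict endpoint (port 8501) and checks the input shape first. It assumes the single-input signature exported above; nested_shape and build_predict_payload are hypothetical helper names.

```python
import json

EXPECTED_SHAPE = (224, 224, 3)  # per-example shape from the exported signature


def nested_shape(value):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def build_predict_payload(images, signature_name="serving_default"):
    """Build the JSON body for POST /v1/models/<name>:predict (row format)."""
    shape = nested_shape(images)
    if shape[1:] != EXPECTED_SHAPE:
        raise ValueError(f"expected [batch, 224, 224, 3], got {list(shape)}")
    return json.dumps({"signature_name": signature_name, "instances": images})


# One all-zero image as a nested list; a real client would send pixel data
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
payload = build_predict_payload([image])
print(len(payload), "bytes")
```

Rejecting a badly shaped input on the client side gives a clearer error than the one the server would return.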

gRPC Client Implementation

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

def predict(image_array, model_name="classifier", host="localhost:8500"):
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"
    request.inputs["image"].CopyFrom(
        tf.make_tensor_proto(image_array, dtype=tf.float32)
    )

    response = stub.Predict(request, timeout=5.0)
    probs = tf.make_ndarray(response.outputs["class_probabilities"])
    return probs

# Usage
image = np.random.rand(1, 224, 224, 3).astype(np.float32)
result = predict(image)
print(f"Top class: {np.argmax(result)}")

Connection Pooling for High Throughput

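import grpc
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Create the channel once at startup and reuse it across requests;
# opening a new channel per call adds connection setup to every prediction
channel = grpc.insecure_channel(
    "localhost:8500",
    options=[
        ("grpc.max_send_message_length", 50 * 1024 * 1024),     # 50 MB
        ("grpc.max_receive_message_length", 50 * 1024 * 1024),  # 50 MB
        ("grpc.keepalive_time_ms", 30000),    # send a keepalive ping every 30 s
        ("grpc.keepalive_timeout_ms", 5000),  # wait up to 5 s for the ping reply
    ]
)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)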
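
A single channel multiplexes all requests over one HTTP/2 connection, which can become a bottleneck under heavy concurrency. One possible extension, sketched here under that assumption, is a small round-robin pool of long-lived channels. ChannelPool is a hypothetical class, and the factory is injected so the example runs without a server; a real client would pass grpc.insecure_channel.

```python
import itertools
import threading


class ChannelPool:
    """Round-robin pool of long-lived channels, created once and reused."""

    def __init__(self, target, size, channel_factory):
        # channel_factory is e.g. grpc.insecure_channel in a real client
        self._channels = [channel_factory(target) for _ in range(size)]
        self._cycle = itertools.cycle(self._channels)
        self._lock = threading.Lock()

    def next_channel(self):
        # Guard the shared iterator so concurrent callers advance it safely
        with self._lock:
            return next(self._cycle)


# Stand-in factory so the sketch runs without a server
created = []


def fake_channel(target):
    created.append(target)
    return f"channel-{len(created)}"


pool = ChannelPool("localhost:8500", size=2, channel_factory=fake_channel)
picked = [pool.next_channel() for _ in range(4)]
print(picked)  # ['channel-1', 'channel-2', 'channel-1', 'channel-2']
```

Whether a pool helps depends on the workload, so measure throughput with one channel before adding more.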

Batching Configuration

Create a batching_parameters.txt file:

max_batch_size { value: 128 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 1000000 }
pad_variable_length_inputs: true

Launch with batching enabled:

docker run -d \
    --gpus all \
    -p 8500:8500 -p 8501:8501 \
    -v /models:/models \
    -v /config:/config \
    tensorflow/serving:latest-gpu \
    --model_config_file=/config/models.config \
    --enable_batching=true \
    --batching_parameters_file=/config/batching_parameters.txt

Tuning Batch Parameters

| Parameter | Too Low | Too High |
| --- | --- | --- |
| max_batch_size | Underutilizes GPU | OOM on large inputs |
| batch_timeout_micros | Small batches, poor throughput | High latency for first request in batch |
| num_batch_threads | Serialized batch processing | Excessive context switching |

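The example file above sets max_batch_size to 128. A more conservative starting point is max_batch_size=32 with batch_timeout_micros=5000 (5 ms); adjust from there based on profiling. Monitor the batching_session:batch_size metric to see the batch sizes actually formed.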
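
To get a feel for how the timeout and the size cap interact, a back-of-envelope estimate can help. The sketch below assumes one example per request and a steady arrival rate, which real traffic rarely has, so treat the numbers as rough guidance only. expected_batch_size is a hypothetical helper.

```python
def expected_batch_size(requests_per_second, batch_timeout_micros, max_batch_size):
    """Rough batch-size estimate: the request that opens the batch plus the
    arrivals expected during the timeout window, capped at max_batch_size."""
    arrivals = requests_per_second * batch_timeout_micros / 1_000_000
    return min(max_batch_size, 1 + int(arrivals))


for rps in (100, 2000, 20000):
    size = expected_batch_size(rps, batch_timeout_micros=5000, max_batch_size=32)
    print(f"{rps:>6} req/s -> about {size} requests per batch")
```

At low request rates the timeout expires before a second request arrives, so batching adds latency without improving throughput.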

Model Versioning and Rollouts

Version Policy

Control which versions stay loaded:

model_config_list {
  config {
    name: "classifier"
    base_path: "/models/classifier"
    model_version_policy {
      specific { versions: 3, versions: 4 }
    }
    version_labels {
      key: "stable"
      value: 3
    }
    version_labels {
      key: "canary"
      value: 4
    }
  }
}

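A label can only point at a version that is already loaded, unless the server is started with --allow_version_labels_for_unavailable_models=true. Clients can then request a version by label: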

request.model_spec.version_label = "canary"

A/B Testing Architecture

Route traffic between versions using a gateway (Envoy, Istio, or a custom FastAPI proxy):

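import logging
import random

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
logger = logging.getLogger("ab_gateway")

# Assumed REST address of the TF Serving container
TF_SERVING_URL = "http://localhost:8501/v1/models/classifier"

class PredictRequest(BaseModel):
    instances: list  # row-format inputs for the REST predict API

async def call_tf_serving(request: PredictRequest, version_label: str) -> dict:
    # The REST API selects a labeled version with /labels/<label>
    url = f"{TF_SERVING_URL}/labels/{version_label}:predict"
    async with httpx.AsyncClient(timeout=5.0) as client:
        response = await client.post(url, json={"instances": request.instances})
        response.raise_for_status()
        return response.json()

@app.post("/predict")
async def predict(request: PredictRequest):
    # Send roughly 10% of traffic to the canary version
    version = "canary" if random.random() < 0.1 else "stable"
    result = await call_tf_serving(request, version_label=version)

    # Log version for analysis
    logger.info("served by version=%s", version)
    return result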
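
A purely random split can send the same user to different versions on consecutive requests, which muddies A/B comparisons. One common alternative, sketched here, hashes a stable identifier so that assignment is sticky. The user_id argument and the assign_version helper are assumptions; they do not appear in the gateway above.

```python
import hashlib


def assign_version(user_id, canary_fraction=0.1):
    """Deterministically map a user ID to "canary" or "stable"."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into the unit interval
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"


# The same user always lands on the same version
assert assign_version("user-42") == assign_version("user-42")

# Across many users the split approaches the configured fraction
sample = [assign_version(f"user-{i}") for i in range(10_000)]
print(sample.count("canary") / len(sample))
```

Raising canary_fraction keeps existing canary users on the canary, because each user's bucket value never changes.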

Canary Rollout Process

  1. Deploy new version alongside current
  2. Route 5% traffic → monitor error rate and latency
  3. Increase to 25% → check accuracy metrics
  4. Full rollout at 100% → unload old version
  5. If metrics degrade at any step → rollback by changing the version label
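
The steps above can be sketched as a small decision function that a rollout script might call after each observation window. The thresholds and the next_canary_fraction name are illustrative assumptions, not recommended values.

```python
STAGES = [0.05, 0.25, 1.0]  # canary traffic fractions from the steps above


def next_canary_fraction(current, error_rate, p99_latency_ms,
                         max_error_rate=0.01, max_p99_ms=100.0):
    """Return the canary fraction to use next; 0.0 means roll back."""
    if error_rate > max_error_rate or p99_latency_ms > max_p99_ms:
        return 0.0  # metrics degraded: send all traffic back to stable
    later = [stage for stage in STAGES if stage > current]
    return later[0] if later else current  # already at full rollout


print(next_canary_fraction(0.05, error_rate=0.002, p99_latency_ms=40))  # 0.25
print(next_canary_fraction(0.25, error_rate=0.030, p99_latency_ms=40))  # 0.0
print(next_canary_fraction(1.0, error_rate=0.001, p99_latency_ms=35))   # 1.0
```

A return value of 0.0 corresponds to step 5: traffic returns to the stable version.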

GPU Sharing and Resource Management

Multiple Models on One GPU

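TF Serving allocates GPU memory per process, not per model: by default TensorFlow typically reserves most of the memory on each visible GPU at startup, and every model the server loads draws from that pool. To leave room for other processes on the same GPU, cap the allocation: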

# Limit TF Serving to 50% of GPU memory
docker run -d \
    --gpus all \
    -e TF_FORCE_GPU_ALLOW_GROWTH=true \
    tensorflow/serving:latest-gpu \
    --per_process_gpu_memory_fraction=0.5

TF_FORCE_GPU_ALLOW_GROWTH=true allocates memory incrementally rather than reserving the full fraction upfront.

Multi-GPU Assignment

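model_config_list {
  config {
    name: "vision_model"
    base_path: "/models/vision"
  }
  config {
    name: "nlp_model"
    base_path: "/models/nlp"
  }
}

The model config has no per-model GPU field, so this file alone does not pin a model to a device. A common approach is to run one TF Serving process per GPU, each with its own config, and restrict it with CUDA_VISIBLE_DEVICES (for example, CUDA_VISIBLE_DEVICES=0 for the vision model and CUDA_VISIBLE_DEVICES=1 for the NLP model). For finer-grained control, set TensorFlow’s device placement in the graph at export time.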

Monitoring and Observability

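TF Serving exposes Prometheus-compatible metrics at http://localhost:8501/monitoring/prometheus/metrics. The endpoint is off by default; enable it by passing --monitoring_config_file with a prometheus_config block that sets enable: true and the path.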

Key metrics to monitor:

| Metric | What It Tells You |
| --- | --- |
| :tensorflow:serving:request_count | Total requests by model and method |
| :tensorflow:serving:request_latency | Prediction latency histogram |
| :tensorflow:core:graph_runs | Model execution count |
| :tensorflow:serving:batching_session:batch_size | Actual batch sizes |
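
For a quick check without a full Prometheus setup, the raw text from the metrics endpoint can be parsed directly. The sketch below computes a per-model error rate from request_count. The sample text, the label names, and the assumption that successful requests carry status="OK" are illustrative and should be verified against real server output.

```python
import re
from collections import defaultdict

# Assumed sample of the text exposition format; real output has more series
SAMPLE = '''\
:tensorflow:serving:request_count{model_name="classifier",status="OK"} 980
:tensorflow:serving:request_count{model_name="classifier",status="UNAVAILABLE"} 20
'''

LINE = re.compile(r'^(?P<name>[^{\s]+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def error_rate_by_model(text):
    """Compute the share of non-OK requests per model from raw metrics text."""
    totals = defaultdict(float)
    errors = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match or match["name"] != ":tensorflow:serving:request_count":
            continue
        labels = dict(LABEL.findall(match["labels"]))
        count = float(match["value"])
        totals[labels["model_name"]] += count
        if labels.get("status") != "OK":
            errors[labels["model_name"]] += count
    return {model: errors[model] / totals[model] for model in totals if totals[model]}


print(error_rate_by_model(SAMPLE))  # {'classifier': 0.02}
```

These counters are cumulative since server start, so a live check would compare two scrapes rather than read a single one.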

Alerting Thresholds

# Prometheus alerting rules
groups:
  - name: tf-serving
    rules:
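      - alert: HighLatency
        # request_latency is recorded in microseconds, so 100 ms = 100000
        expr: histogram_quantile(0.99, sum by (le, model_name) (rate(:tensorflow:serving:request_latency_bucket[5m]))) > 100000
        for: 5m
        annotations:
          summary: "P99 latency > 100ms for {{ $labels.model_name }}"

      - alert: HighErrorRate
        # Assumes failed requests carry a status label other than "OK"
        expr: sum by (model_name) (rate(:tensorflow:serving:request_count{status!="OK"}[5m])) / sum by (model_name) (rate(:tensorflow:serving:request_count[5m])) > 0.01
        for: 2m
        annotations:
          summary: "Error rate > 1% for {{ $labels.model_name }}"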

Performance Optimization

Warm-Up Requests

Cold models cause latency spikes on the first request (TensorFlow graph compilation, GPU memory allocation). Add warm-up data:

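# Create tf_serving_warmup_requests in the model directory
models/classifier/1/assets.extra/tf_serving_warmup_requests

import os

import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

# Warm-up records live in assets.extra inside the version directory
warmup_dir = "models/classifier/1/assets.extra"
os.makedirs(warmup_dir, exist_ok=True)

with tf.io.TFRecordWriter(os.path.join(warmup_dir, "tf_serving_warmup_requests")) as writer:
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "classifier"
    request.model_spec.signature_name = "serving_default"
    request.inputs["image"].CopyFrom(
        tf.make_tensor_proto(np.zeros((1, 224, 224, 3), dtype=np.float32))
    )
    # Wrap the request in a PredictionLog record, the format the server replays at load time
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request)
    )
    writer.write(log.SerializeToString())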

TensorRT Integration

For NVIDIA GPUs, convert models to TensorRT for optimized inference:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="models/classifier/1",
    conversion_params=trt.TrtConversionParams(
        precision_mode=trt.TrtPrecisionMode.FP16,
        max_workspace_size_bytes=1 << 30
    )
)
converter.convert()
converter.save("models/classifier_trt/1")

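TensorRT-optimized models can run 2-6x faster than standard TensorFlow on the same GPU by fusing operations and optimizing memory access patterns, though the actual gain depends on the model, batch size, and precision mode. FP16 can also change outputs slightly, so compare accuracy before and after conversion.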

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving-classifier
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving-classifier
  template:
    metadata:
      labels:
        app: tf-serving-classifier
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving:latest-gpu
          ports:
            - containerPort: 8500
            - containerPort: 8501
          # Assumes /config and /models are provided through volume mounts (not shown)
          args:
            - --model_config_file=/config/models.config
            - --enable_batching=true
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "8Gi"
            requests:
              memory: "4Gi"
          readinessProbe:
            httpGet:
              path: /v1/models/classifier
              port: 8501
            initialDelaySeconds: 30
          livenessProbe:
            httpGet:
              path: /v1/models/classifier
              port: 8501
            initialDelaySeconds: 60

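Use a Horizontal Pod Autoscaler (HPA) to scale replicas dynamically on request latency or GPU utilization. Both are custom metrics, so the cluster needs a metrics adapter (for example, the Prometheus adapter) to expose them to the HPA.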
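
To reason about how such an autoscaler reacts, it helps to apply its core scaling rule by hand. The sketch below implements the documented HPA formula, desired = ceil(current replicas x current metric / target metric), and ignores the tolerance band and stabilization windows that a real controller also applies. The utilization figures are made-up inputs.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Kubernetes HPA rule: ceil(current_replicas * current / target),
    clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 3 replicas at 85% average GPU utilization against a 60% target
print(desired_replicas(3, current_metric=85, target_metric=60))  # 5
```

Because each replica requests a whole GPU, max_replicas is effectively bounded by the number of GPUs available in the cluster.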

The one thing to remember: Production TF Serving requires warm-up requests, tuned batching, version labels for safe rollouts, and Prometheus monitoring. The serving binary is just the start; the operational infrastructure around it is what makes it reliable.
