OCR in Python — Deep Dive

Build robust OCR pipelines in Python with Tesseract tuning, deep learning models, and document understanding for production workloads.

Production OCR is not about picking the right library — it is about engineering the pipeline around the library so that messy real-world documents produce reliable, structured output. This guide covers preprocessing engineering, engine tuning, layout analysis, and building systems that handle thousands of documents per hour.

Preprocessing pipeline

Adaptive binarization

Simple global thresholding fails on documents with shadows, stains, or uneven lighting. Adaptive methods compute a threshold for each pixel based on its neighborhood:

import cv2

# `image` is a BGR array, e.g. as returned by cv2.imread()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Gaussian adaptive threshold: handles gradual lighting changes (blockSize must be odd)
binary = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY, blockSize=31, C=10
)

For severely degraded documents (old newspapers, faxes), Sauvola binarization (threshold_sauvola in skimage.filters) often produces cleaner results because its per-pixel threshold depends on the local standard deviation as well as the local mean.
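To make the role of the local standard deviation concrete, here is a minimal pure-Python sketch of the Sauvola rule, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of a pixel's neighborhood. The function names, the 3x3 window, and the list-of-lists image format are illustrative assumptions; for real images skimage.filters.threshold_sauvola is the practical choice.

```python
from statistics import mean, pstdev

def sauvola_threshold(window, k=0.2, r=128.0):
    """Sauvola threshold for one neighborhood of gray values (0-255)."""
    m = mean(window)
    s = pstdev(window)  # local (population) standard deviation
    return m * (1.0 + k * (s / r - 1.0))

def sauvola_binarize(gray, window_size=3, k=0.2, r=128.0):
    """Binarize a 2D list of gray values; windows are clipped at the borders."""
    half = window_size // 2
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            threshold = sauvola_threshold(window, k, r)
            row.append(255 if gray[y][x] > threshold else 0)
        out.append(row)
    return out

# Illustrative input: one dark "ink" pixel on a light background.
tiny_page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binarized = sauvola_binarize(tiny_page)
```

In a flat region the standard deviation is zero, so the threshold drops below the local mean and uniform background stays white even where the page is darker overall.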

Deskew with projection profiles

Summing pixel values across horizontal lines creates a projection profile. Correctly aligned text produces sharp peaks (text lines) and deep valleys (white space between lines). Rotate the image in small increments and pick the angle that maximizes the variance of the horizontal projection profile.

import numpy as np
from scipy.ndimage import rotate

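def find_skew(binary_image, angle_range=(-5, 5), steps=100):
    """Return the rotation (in degrees) that best aligns text lines horizontally."""
    best_angle = 0
    best_variance = 0
    for angle in np.linspace(*angle_range, steps):
        # order=0 (nearest neighbor) keeps the image strictly binary
        rotated = rotate(binary_image, angle, reshape=False, order=0)
        # Row sums: horizontal text gives sharp peaks and deep valleys
        profile = rotated.sum(axis=1)
        variance = np.var(profile)
        if variance > best_variance:
            best_variance = variance
            best_angle = angle
    return best_angle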

Noise removal

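Morphological opening (erosion followed by dilation) removes small specks without destroying text. OpenCV applies it to the white pixels, so for dark text on a white background either invert the image first (as below) or use closing instead. For salt-and-pepper noise, a median filter with kernel size 3 (cv2.medianBlur) works well.

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
# Invert so text and specks are white, open to drop the specks, then invert back
inverted = cv2.bitwise_not(binary)
cleaned = cv2.bitwise_not(cv2.morphologyEx(inverted, cv2.MORPH_OPEN, kernel))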

Tesseract tuning

Page segmentation modes (PSM)

Tesseract’s PSM controls how it segments the image before recognition. The default (PSM 3, fully automatic) often fails on single-line text, tables, or sparse text, so choose the mode that matches the input:

PSM	Description	Use case
3	Fully automatic	Multi-paragraph documents
4	Assume single column	Simple letters
6	Assume uniform text block	Cropped text regions
7	Treat as single line	License plates, serial numbers
11	Sparse text, no order	Receipts, signage
13	Raw line, no segmentation	Pre-segmented lines

import pytesseract

config = "--psm 6 --oem 1"  # OEM 1 = LSTM engine
text = pytesseract.image_to_string(image, config=config)

Whitelisting characters

When you know the expected character set (e.g., digits only for invoice amounts):

config = "--psm 7 -c tessedit_char_whitelist=0123456789."
text = pytesseract.image_to_string(amount_crop, config=config)
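Since PSM, OEM, and whitelist are all passed through one config string, a small helper can keep the choices in one place. The sketch below is an assumption about how you might organize this; the helper name and the region kinds in REGION_CONFIGS are hypothetical, and only the flags already shown above are used.

```python
def tesseract_config(psm=6, oem=1, whitelist=None):
    """Build a config string for pytesseract's `config` argument."""
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        # Restrict recognition to the given characters (no spaces in the value)
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)

# Hypothetical mapping from region kind to settings; adjust to your documents.
REGION_CONFIGS = {
    "page": tesseract_config(psm=3),
    "block": tesseract_config(psm=6),
    "line": tesseract_config(psm=7),
    "amount": tesseract_config(psm=7, whitelist="0123456789."),
}
```

A call such as pytesseract.image_to_string(amount_crop, config=REGION_CONFIGS["amount"]) then picks up the digits-only settings.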

Custom training

For specialized fonts or domains (e.g., typewriter text, engineering drawings), fine-tune Tesseract’s LSTM model:

Generate training images with text2image or collect real samples.
Create ground truth .gt.txt files.
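Run the tesstrain Makefile (make training, with START_MODEL set to the base model) to fine-tune; the older tesstrain.sh script works from fonts and training text rather than .gt.txt files.
As a rough guide, 500–2,000 training line images are often enough for a noticeable improvement; confirm the gain on a held-out set.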

Deep learning OCR architectures

CRNN (CNN + RNN + CTC)

The classic deep OCR architecture:

CNN backbone extracts feature columns from the input image.
Bidirectional LSTM models dependencies between columns.
CTC (Connectionist Temporal Classification) loss handles alignment between variable-length predictions and labels without requiring character-level bounding boxes.
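At inference time the CTC output still has to be collapsed into a string. A minimal sketch of greedy (best-path) decoding follows: take the top class per feature column, merge repeats, and drop the blank symbol. The function name, the alphabet, and the one-hot scores are illustrative assumptions, not part of any specific library.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Collapse per-frame predictions into a string (best-path CTC decoding).

    frame_scores: one list of class scores per feature column (time step).
    alphabet: maps class index to character; index `blank` is the CTC blank.
    """
    decoded = []
    previous = blank
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a character only when it differs from the previous frame
        # and is not the blank symbol.
        if best != previous and best != blank:
            decoded.append(alphabet[best])
        previous = best
    return "".join(decoded)

def one_hot(index, size):
    """Helper for the example: a score vector with a single winning class."""
    return [1.0 if i == index else 0.0 for i in range(size)]

alphabet = ["-", "a", "l"]  # index 0 is the blank
# Per-frame winners: a, l, l, blank, l
frames = [one_hot(i, 3) for i in (1, 2, 2, 0, 2)]
decoded_text = ctc_greedy_decode(frames, alphabet)
```

The blank between the two runs of "l" is what lets CTC represent a genuine double letter; without it the repeats would merge into one.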

TrOCR (Transformer-based)

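Microsoft’s TrOCR pairs a ViT-style image Transformer encoder (initialized from BEiT or DeiT) with an autoregressive Transformer text decoder (initialized from RoBERTa or MiniLM, not GPT-2). Its authors report that it outperforms CRNN-based baselines on handwriting recognition, with state-of-the-art results on the IAM Handwriting Database at the time of publication.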

from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")

# cropped_line: an RGB PIL image containing a single line of text
pixel_values = processor(images=cropped_line, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

PaddleOCR’s PP-OCRv4

PP-OCRv4 combines a lightweight DB-based text detection model, a direction classifier, and an SVTR-based recognition model into one pipeline. It is designed to run on CPU; figures such as 20+ FPS and >95% accuracy depend on hardware, image size, and language, so benchmark it on your own documents.

Document layout analysis

For complex documents (multi-column papers, invoices with tables, forms), OCR alone is insufficient. Layout analysis classifies regions as text, table, figure, header, or footer before sending each to the appropriate processing pipeline.
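The routing step can be sketched as a small dispatcher that looks up a handler by region type. Everything here is a hypothetical illustration: the region dictionaries, the handler names, and the stub outputs stand in for whatever your layout model and OCR engines actually return.

```python
def route_regions(regions, handlers, default=None):
    """Send each detected region to the handler registered for its type."""
    results = []
    for region in regions:
        handler = handlers.get(region["type"], default)
        if handler is None:
            continue  # no pipeline for this region type (e.g. figures)
        results.append({"type": region["type"], "output": handler(region)})
    return results

# Stub handlers; in a real pipeline these would call OCR or table extraction.
def ocr_text(region):
    return f"text from {region['bbox']}"

def extract_table(region):
    return f"table from {region['bbox']}"

handlers = {"Text": ocr_text, "Title": ocr_text, "Table": extract_table}
regions = [
    {"type": "Title", "bbox": (0, 0, 100, 20)},
    {"type": "Figure", "bbox": (0, 30, 100, 90)},
    {"type": "Table", "bbox": (0, 100, 100, 160)},
]
routed = route_regions(regions, handlers)
```

Keeping the mapping in a dictionary makes it easy to add a new region type without touching the loop.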

LayoutParser + Detectron2

import layoutparser as lp

model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)

layout = model.detect(image)
text_blocks = lp.Layout([b for b in layout if b.type == "Text"])

Table extraction

Tables are the hardest layout element. Dedicated tools like img2table, Camelot (for PDFs), or PaddleOCR’s table recognition module detect cell boundaries and map text to a grid structure.
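To illustrate what "mapping text to a grid" involves, here is a deliberately simplified sketch that groups OCR word boxes into rows by vertical position and orders each row left to right. The input format, the function name, and the 10-pixel tolerance are assumptions; real tables with merged cells, wrapped text, or skew need the dedicated tools named above.

```python
def words_to_rows(words, row_tolerance=10):
    """Group OCR word boxes into table rows by vertical position.

    words: list of dicts with "text", "x", "y" (top-left corner, in pixels).
    row_tolerance: max vertical distance (pixels) from a row's first word.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        if rows and abs(word["y"] - rows[-1][0]["y"]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, read cells left to right
    return [[w["text"] for w in sorted(row, key=lambda w: w["x"])] for row in rows]

words = [
    {"text": "Price", "x": 200, "y": 12},
    {"text": "Item", "x": 10, "y": 10},
    {"text": "9.99", "x": 200, "y": 48},
    {"text": "Widget", "x": 10, "y": 50},
]
grid = words_to_rows(words)
```

The tolerance absorbs the small vertical jitter that OCR boxes show within one printed row.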

Building a document processing pipeline

import cv2
import pytesseract
from scipy.ndimage import rotate

# Assumes find_skew() from the deskew section and a layout model
# (layout_model) created once, as in the LayoutParser example.

def process_document(image_path: str) -> dict:
    image = cv2.imread(str(image_path))
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # 1. Preprocess
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 10)

    # 2. Deskew (cval=255 fills the exposed corners with white)
    angle = find_skew(binary)
    rotated = rotate(binary, angle, reshape=False, order=0, cval=255)

    # 3. Layout analysis (the detector expects a 3-channel image)
    layout = layout_model.detect(cv2.cvtColor(rotated, cv2.COLOR_GRAY2BGR))

    # 4. OCR each region, top to bottom, then left to right
    results = []
    for block in sorted(layout, key=lambda b: (b.block.y_1, b.block.x_1)):
        box = block.block
        x1, y1 = max(int(box.x_1), 0), max(int(box.y_1), 0)
        x2, y2 = int(box.x_2), int(box.y_2)
        crop = rotated[y1:y2, x1:x2]
        text = pytesseract.image_to_string(crop, config="--psm 6")
        results.append({
            "type": block.type,
            "bbox": [x1, y1, x2, y2],
            "text": text.strip(),
        })

    return {"file": str(image_path), "blocks": results}

Performance optimization

Parallel processing

OCR is CPU-bound (for Tesseract) or GPU-bound (for deep models). Use concurrent.futures.ProcessPoolExecutor to process multiple pages in parallel:

from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":  # guard needed where workers are spawned (Windows, macOS)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_document, document_paths))

GPU batching for deep models

EasyOCR and PaddleOCR support batch inference. Group detected text regions into batches of 32–64 and process them together; on a GPU this can give roughly 3–5× the throughput of one-by-one inference, depending on the model and hardware.
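The grouping itself is engine-independent and can be sketched in a few lines. Here recognize_batch is a hypothetical callable standing in for whichever batch API your engine exposes; it is assumed to take a list of crops and return one string per crop.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def recognize_in_batches(crops, recognize_batch, batch_size=32):
    """Run a batch recognizer over all crops, preserving input order."""
    texts = []
    for batch in batched(crops, batch_size):
        texts.extend(recognize_batch(batch))
    return texts

# Stub recognizer for illustration: records batch sizes and echoes labels.
seen_batch_sizes = []

def fake_recognize_batch(batch):
    seen_batch_sizes.append(len(batch))
    return [f"text-{crop}" for crop in batch]

all_texts = recognize_in_batches(list(range(70)), fake_recognize_batch, batch_size=32)
```

Sorting crops by width before batching is a common refinement, since it reduces the padding needed inside each batch.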

Caching

Re-processing identical documents wastes compute. Hash the input image bytes and cache the OCR result under that hash (content-addressable storage: SHA-256 hash → result JSON), which also deduplicates repeated uploads.
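A minimal file-based sketch of such a cache is shown below. The function name, the one-JSON-file-per-hash layout, and the run_ocr callable are assumptions for illustration; a production system would more likely use an object store or key-value database.

```python
import hashlib
import json
from pathlib import Path

def cached_ocr(image_bytes, run_ocr, cache_dir):
    """Return the cached OCR result for these bytes, computing it on a miss."""
    key = hashlib.sha256(image_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = run_ocr(image_bytes)  # must return JSON-serializable data
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Note that the key covers only the raw bytes: the same page re-saved with different compression hashes differently, and cached results should be invalidated (for example by adding a pipeline version to the key) when preprocessing or models change.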

Evaluation and monitoring

Ground truth creation

For new document types, manually transcribe 50–100 samples. Use these for:

Benchmark accuracy before and after pipeline changes.
Regression testing in CI: fail the build if the character error rate (CER) exceeds a set threshold.
A/B testing preprocessing changes.
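Character error rate is the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below is a plain dynamic-programming implementation; the function names and the 0.05 threshold in the usage note are illustrative assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """CER = edit distance / reference length (can exceed 1.0)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

In a CI test this could be used as assert character_error_rate(truth, ocr_output) <= 0.05, with the threshold chosen from your current baseline.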

Production monitoring

Log per-document metrics:

Mean character confidence
Number of low-confidence words (below 0.6, i.e. 60 on Tesseract’s 0–100 scale)
Processing time

Alert when the percentage of low-confidence documents exceeds historical norms — this signals new document formats, degraded scan quality, or upstream changes.
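These metrics and the alert rule can be sketched as two small functions. The names, the 0-to-1 confidence scale, and the "twice the baseline" tolerance are assumptions for illustration; word-level confidences are used here because that is what most engines report.

```python
from statistics import mean

def document_metrics(confidences, seconds, low_threshold=0.6):
    """Summarize one document; confidences are per-word scores in [0, 1]."""
    # Some engines report -1 for entries that contain no recognized text
    valid = [c for c in confidences if c >= 0]
    return {
        "mean_confidence": mean(valid) if valid else 0.0,
        "low_confidence_words": sum(c < low_threshold for c in valid),
        "processing_seconds": seconds,
    }

def should_alert(recent_metrics, baseline_rate, doc_threshold=0.6, tolerance=2.0):
    """True when the share of low-confidence documents exceeds tolerance x baseline."""
    if not recent_metrics:
        return False
    low = sum(m["mean_confidence"] < doc_threshold for m in recent_metrics)
    return low / len(recent_metrics) > baseline_rate * tolerance
```

Here baseline_rate is the historical fraction of low-confidence documents, for example computed over the previous month of logs.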

The one thing to remember: Production OCR accuracy is won in preprocessing and pipeline engineering, not model selection — the best recognition engine cannot fix a blurry, skewed, noisy input image.
