OCR in Python — Deep Dive
Production OCR is not about picking the right library — it is about engineering the pipeline around the library so that messy real-world documents produce reliable, structured output. This guide covers preprocessing engineering, engine tuning, layout analysis, and building systems that handle thousands of documents per hour.
Preprocessing pipeline
Adaptive binarization
Simple global thresholding fails on documents with shadows, stains, or uneven lighting. Adaptive methods compute a threshold for each pixel based on its neighborhood:
import cv2
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Gaussian adaptive threshold — handles gradual lighting changes
binary = cv2.adaptiveThreshold(
gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY, blockSize=31, C=10
)
For severely degraded documents (old newspapers, faxes), Sauvola binarization from skimage.filters produces cleaner results by incorporating local standard deviation.
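As a minimal sketch of the Sauvola approach, assuming scikit-image is installed and the input is a grayscale NumPy array: `threshold_sauvola` returns a per-pixel threshold map built from the local mean and standard deviation. The helper name `sauvola_binarize`, the synthetic page, and the values `window_size=25` and `k=0.2` are illustrative assumptions, not tuned settings.

```python
import numpy as np
from skimage.filters import threshold_sauvola

def sauvola_binarize(gray: np.ndarray, window_size: int = 25, k: float = 0.2) -> np.ndarray:
    """Return a uint8 image: 255 for background, 0 for text."""
    # Per-pixel threshold derived from the local mean and standard deviation.
    thresholds = threshold_sauvola(gray, window_size=window_size, k=k)
    return np.where(gray > thresholds, 255, 0).astype(np.uint8)

# Synthetic page: a left-to-right lighting gradient with one dark "glyph".
page = np.tile(np.linspace(150, 250, 100), (100, 1)).astype(np.uint8)
page[45:55, 45:55] = 20
binary_page = sauvola_binarize(page)
```

The window should be large enough to cover a few characters; a window smaller than the stroke width tends to hollow out thick glyphs.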
Deskew with projection profiles
Summing pixel values across horizontal lines creates a projection profile. Correctly aligned text produces sharp peaks (text lines) and deep valleys (white space between lines). Rotate the image in small increments and pick the angle that maximizes the variance of the horizontal projection profile.
import numpy as np
from scipy.ndimage import rotate

def find_skew(binary_image, angle_range=(-5, 5), steps=100):
    best_angle = 0
    best_variance = 0
    for angle in np.linspace(*angle_range, steps):
        rotated = rotate(binary_image, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)
        variance = np.var(profile)
        if variance > best_variance:
            best_variance = variance
            best_angle = angle
    return best_angle
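To see why profile variance is a useful alignment score, the following NumPy-only sketch compares a perfectly aligned synthetic page with a skewed copy. The skew here is a crude column shift standing in for a real rotation, and `profile_variance` is a hypothetical helper name.

```python
import numpy as np

def profile_variance(binary_image: np.ndarray) -> float:
    """Variance of the row sums (the horizontal projection profile)."""
    return float(np.var(binary_image.sum(axis=1)))

# Synthetic page: ten text lines, each 4 rows of ink (1) and 6 rows of gap (0).
aligned = np.zeros((100, 100), dtype=np.int64)
aligned[(np.arange(100) % 10) < 4, :] = 1

# Crude stand-in for skew: shift each block of 10 columns one row further down.
skewed = np.empty_like(aligned)
for col in range(100):
    skewed[:, col] = np.roll(aligned[:, col], col // 10)

aligned_score = profile_variance(aligned)
skewed_score = profile_variance(skewed)
```

On the aligned page every row is either all ink or all gap, so the profile swings between extremes. On the skewed copy the ink is smeared evenly across rows and the variance collapses, which is the signal the angle search exploits.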
Noise removal
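Morphological opening (erosion followed by dilation) removes small specks without destroying text. OpenCV treats white pixels as the foreground, so opening removes small white specks; for dark text on a light background, invert the image first (or use a closing instead) to remove dark specks. For salt-and-pepper noise, a median filter with kernel size 3 works well.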
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
Tesseract tuning
Page segmentation modes (PSM)
Tesseract’s PSM controls how it segments the image before recognition. The default (PSM 3 — fully automatic) fails on single-line text, tables, or sparse text. Choose the right mode:
| PSM | Description | Use case |
|---|---|---|
| 3 | Fully automatic | Multi-paragraph documents |
| 4 | Assume single column | Simple letters |
| 6 | Assume uniform text block | Cropped text regions |
| 7 | Treat as single line | License plates, serial numbers |
| 11 | Sparse text, no order | Receipts, signage |
| 13 | Raw line, no segmentation | Pre-segmented lines |
import pytesseract

config = "--psm 6 --oem 1" # OEM 1 = LSTM engine
text = pytesseract.image_to_string(image, config=config)
Whitelisting characters
When you know the expected character set (e.g., digits only for invoice amounts):
config = "--psm 7 -c tessedit_char_whitelist=0123456789."
text = pytesseract.image_to_string(amount_crop, config=config)
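A whitelist constrains which characters can appear, not whether the result is well formed, so it helps to validate the output afterwards. The sketch below uses two hypothetical helpers, `build_tesseract_config` and `parse_amount`, that are not part of pytesseract; the amount pattern is an assumption about the expected format. Some Tesseract 4.0 builds reportedly ignore the whitelist when the LSTM engine is active, so it is worth checking the behavior of your installed version.

```python
import re
from typing import Optional

def build_tesseract_config(psm: int, oem: int = 1, whitelist: Optional[str] = None) -> str:
    """Assemble a Tesseract config string (hypothetical helper, not part of pytesseract)."""
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)

# Assumed format: whole number, optionally followed by exactly two decimals.
AMOUNT_PATTERN = re.compile(r"\d+(\.\d{2})?")

def parse_amount(raw_text: str) -> Optional[str]:
    """Return the cleaned amount if it looks like 123 or 123.45, else None."""
    candidate = raw_text.strip()
    return candidate if AMOUNT_PATTERN.fullmatch(candidate) else None

amount_config = build_tesseract_config(psm=7, whitelist="0123456789.")
```

The resulting string can be passed as `config=amount_config` to `pytesseract.image_to_string`, and anything `parse_amount` rejects can be routed to manual review.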
Custom training
For specialized fonts or domains (e.g., typewriter text, engineering drawings), fine-tune Tesseract’s LSTM model:
- Generate training images with text2image or collect real samples.
- Create ground truth .gt.txt files.
- Run tesstrain.sh to fine-tune from the base model.
- Typically 500–2,000 training samples achieve significant improvement.
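The ground-truth step can be sketched with the standard library alone. This assumes the common convention of one `.gt.txt` file per line image, sharing the image's base name; `write_ground_truth` and the file names are hypothetical.

```python
from pathlib import Path
import tempfile

def write_ground_truth(transcriptions: dict, out_dir: Path) -> list:
    """Write one `<image stem>.gt.txt` file per line image (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_name, text in transcriptions.items():
        gt_path = out_dir / (Path(image_name).stem + ".gt.txt")
        # One text line per file, matching the single text line in the image.
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_path)
    return written

# Example with made-up file names and transcriptions.
demo_dir = Path(tempfile.mkdtemp()) / "ground-truth"
gt_files = write_ground_truth(
    {"invoice_0001.png": "Total due: 42.00", "invoice_0002.png": "Ref 8841-A"},
    demo_dir,
)
```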
Deep learning OCR architectures
CRNN (CNN + RNN + CTC)
The classic deep OCR architecture:
- CNN backbone extracts feature columns from the input image.
- Bidirectional LSTM models dependencies between columns.
- CTC (Connectionist Temporal Classification) loss handles alignment between variable-length predictions and labels without requiring character-level bounding boxes.
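At inference time, the simplest way to turn per-column predictions into text is greedy (best-path) CTC decoding: collapse consecutive repeats, then drop the blank symbol. The sketch below shows only that decoding step, with a made-up alphabet and frame sequence; it does not include the network or the loss.

```python
from typing import List, Sequence

def ctc_greedy_decode(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Collapse repeated labels, then drop blanks (best-path CTC decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

# Hypothetical alphabet: index 0 is the CTC blank.
alphabet = ["-", "c", "a", "t", "o"]
frames = [0, 1, 1, 0, 4, 4, 0, 4, 3, 3]  # per-column argmax from the recurrent layer
decoded_text = "".join(alphabet[i] for i in ctc_greedy_decode(frames))
```

The blank between the two runs of `o` is what lets the decoder emit a genuine double letter instead of merging them.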
TrOCR (Transformer-based)
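Microsoft’s TrOCR pairs a Vision Transformer (ViT)-style image encoder with an autoregressive Transformer text decoder (the released checkpoints initialize the decoder from RoBERTa or MiniLM weights rather than GPT-2). It outperforms CRNN-style models on handwriting recognition and reported state-of-the-art results on the IAM Handwriting Database when it was published.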
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-large-handwritten")
pixel_values = processor(images=cropped_line, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
PaddleOCR’s PP-OCRv4
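PP-OCRv4 combines a lightweight detection model (DB++), a direction classifier, and a recognition model (SVTR) into a pipeline designed for fast CPU inference. Figures such as 20+ FPS on CPU and >95% accuracy are commonly quoted, but both depend on hardware, input resolution, and the benchmark, so measure on your own documents.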
Document layout analysis
For complex documents (multi-column papers, invoices with tables, forms), OCR alone is insufficient. Layout analysis classifies regions as text, table, figure, header, or footer before sending each to the appropriate processing pipeline.
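The routing idea can be sketched as a lookup from region type to handler. The handlers here are placeholders, and the region dictionaries, type names, and fallback behavior are assumptions for illustration.

```python
from typing import Callable, Dict, List

# Placeholder handlers: in a real pipeline these would call OCR,
# a table extractor, or an image archiver.
def handle_text(region: dict) -> dict:
    return {"kind": "text", "bbox": region["bbox"]}

def handle_table(region: dict) -> dict:
    return {"kind": "table", "bbox": region["bbox"]}

def skip_region(region: dict) -> dict:
    return {"kind": "skipped", "bbox": region["bbox"]}

HANDLERS: Dict[str, Callable[[dict], dict]] = {
    "Text": handle_text,
    "Title": handle_text,
    "List": handle_text,
    "Table": handle_table,
    "Figure": skip_region,
}

def route_regions(regions: List[dict]) -> List[dict]:
    """Send each region to the handler registered for its type."""
    return [HANDLERS.get(r["type"], skip_region)(r) for r in regions]

routed = route_regions([
    {"type": "Title", "bbox": (40, 20, 560, 60)},
    {"type": "Table", "bbox": (40, 300, 560, 520)},
    {"type": "Stamp", "bbox": (400, 700, 480, 780)},  # unknown type falls back to skip
])
```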
LayoutParser + Detectron2
import layoutparser as lp
model = lp.Detectron2LayoutModel(
"lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)
layout = model.detect(image)
text_blocks = lp.Layout([b for b in layout if b.type == "Text"])
Table extraction
Tables are the hardest layout element. Dedicated tools like img2table, Camelot (for PDFs), or PaddleOCR’s table recognition module detect cell boundaries and map text to a grid structure.
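To illustrate what "mapping text to a grid" means, here is a deliberately simplified sketch that groups OCR word centers into rows by vertical position and orders each row left to right. It does not detect cell borders or handle spanning cells, and it is not how the tools named above work internally; `words_to_grid`, the tolerance, and the coordinates are assumptions.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, x_center, y_center)

def words_to_grid(words: List[Word], row_tolerance: float = 10.0) -> List[List[str]]:
    """Group words into rows by y position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # Join the current row if the word is within tolerance of the row's first word.
        if rows and abs(word[2] - rows[-1][0][2]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w[0] for w in sorted(row, key=lambda w: w[1])] for row in rows]

# Made-up word centers, as an OCR engine might report them.
grid = words_to_grid([
    ("Qty", 200, 52), ("Item", 50, 50), ("2", 201, 81),
    ("Widget", 52, 80), ("Price", 350, 49), ("9.99", 352, 83),
])
```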
Building a document processing pipeline
from pathlib import Path
import json

import cv2
import pytesseract
from scipy.ndimage import rotate

def process_document(image_path: str) -> dict:
    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(image_path)
    # 1. Preprocess
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 10)
    # 2. Deskew (find_skew is the function from the deskew section)
    angle = find_skew(binary)
    rotated = rotate(binary, angle, reshape=False, order=0)
    # 3. Layout analysis (layout_model is a detector such as the Detectron2LayoutModel above)
    layout = layout_model.detect(rotated)
    # 4. OCR each text region, top to bottom
    results = []
    for block in sorted(layout, key=lambda b: (b.block.y_1, b.block.x_1)):
        crop = rotated[int(block.block.y_1):int(block.block.y_2),
                       int(block.block.x_1):int(block.block.x_2)]
        text = pytesseract.image_to_string(crop, config="--psm 6")
        results.append({
            "type": block.type,
            "bbox": [block.block.x_1, block.block.y_1, block.block.x_2, block.block.y_2],
            "text": text.strip(),
        })
    return {"file": str(image_path), "blocks": results}
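Sorting blocks by their top-left corner works for single-column pages but interleaves the columns of a multi-column layout. One possible refinement, sketched below under the assumption of equal-width columns and a known page width, is to assign each block to a column first. `reading_order` is a hypothetical helper; full-width headers or uneven columns would need extra handling.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_1, y_1, x_2, y_2)

def reading_order(boxes: List[Box], page_width: float, columns: int = 2) -> List[Box]:
    """Order boxes column by column, top to bottom within each column."""
    column_width = page_width / columns

    def sort_key(box: Box):
        x_center = (box[0] + box[2]) / 2
        column = min(int(x_center // column_width), columns - 1)
        return (column, box[1])

    return sorted(boxes, key=sort_key)

# Two-column page: sorting by (y_1, x_1) alone would interleave the columns.
blocks = [(320, 100, 580, 200), (20, 110, 280, 400), (320, 220, 580, 500), (20, 420, 280, 700)]
ordered = reading_order(blocks, page_width=600)
```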
Performance optimization
Parallel processing
OCR is CPU-bound (for Tesseract) or GPU-bound (for deep models). Use concurrent.futures.ProcessPoolExecutor to process multiple pages in parallel:
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":  # needed where worker processes are spawned (Windows, macOS)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_document, document_paths))
GPU batching for deep models
EasyOCR and PaddleOCR support batch inference. Group detected text regions into batches of 32–64 and process them together for 3–5× throughput improvement over one-by-one inference.
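The batching logic itself is independent of the OCR library. The sketch below groups crops into fixed-size batches and feeds them to a batch recognizer; `recognize_batch` is a stand-in parameter for whatever batch call your engine exposes, and the fake recognizer exists only so the example runs without a model.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def recognize_in_batches(crops: Sequence[T],
                         recognize_batch: Callable[[Sequence[T]], List[str]],
                         batch_size: int = 32) -> List[str]:
    """Run a batch recognizer over all crops, preserving input order."""
    texts: List[str] = []
    for batch in batched(crops, batch_size):
        texts.extend(recognize_batch(batch))
    return texts

# Stand-in recognizer so the sketch runs without a GPU or model.
calls = []
def fake_recognizer(batch):
    calls.append(len(batch))
    return [f"text-{crop}" for crop in batch]

texts = recognize_in_batches(list(range(70)), fake_recognizer, batch_size=32)
```

Sorting crops by width before batching can reduce padding waste, at the cost of having to restore the original order afterwards.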
Caching
Hash the input image and cache OCR results. Re-processing identical documents wastes compute. Use content-addressable storage (SHA-256 hash → result JSON) for deduplication.
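A minimal sketch of such a cache, using only the standard library: the file name is the SHA-256 digest of the image bytes and the value is the result JSON. `cached_ocr` and the directory layout are hypothetical, and the stand-in OCR function only counts invocations.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable

def cached_ocr(image_bytes: bytes, cache_dir: Path,
               run_ocr: Callable[[bytes], dict]) -> dict:
    """Return the cached result for these exact bytes, running OCR only on a miss."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    cache_file = cache_dir / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = run_ocr(image_bytes)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result

# Stand-in OCR function that counts how often it is actually invoked.
ocr_calls = []
def fake_ocr(data: bytes) -> dict:
    ocr_calls.append(1)
    return {"text": "hello", "size": len(data)}

cache_dir = Path(tempfile.mkdtemp()) / "ocr-cache"
first = cached_ocr(b"same-bytes", cache_dir, fake_ocr)
second = cached_ocr(b"same-bytes", cache_dir, fake_ocr)
```

Hashing raw bytes only deduplicates byte-identical files. Including a pipeline version in the key is a sensible extension, so that results are recomputed after preprocessing or model changes.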
Evaluation and monitoring
Ground truth creation
For new document types, manually transcribe 50–100 samples. Use these for:
- Benchmark accuracy before and after pipeline changes.
- Regression testing in CI — fail the build if CER exceeds threshold.
- A/B testing preprocessing changes.
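Character error rate (CER) is the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. A dependency-free sketch follows; the threshold value and the sample strings are made up, and a CI test would average CER over the whole ground-truth set.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)

CER_THRESHOLD = 0.05  # hypothetical budget; pick one from your own baseline
cer = character_error_rate("Invoice 2024-001", "Invoice 2O24-0O1")
```

Here two zeros were misread as the letter O in a 16-character reference, so the sample exceeds the assumed budget and a regression test built on it would fail.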
Production monitoring
Log per-document metrics:
- Mean character confidence
- Number of low-confidence words (below 0.6, which is 60 on Tesseract’s 0–100 confidence scale)
- Processing time
Alert when the percentage of low-confidence documents exceeds historical norms — this signals new document formats, degraded scan quality, or upstream changes.
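These metrics and the alert rule can be sketched as follows. The code assumes confidences have already been normalized to a 0-1 scale (Tesseract reports 0-100, with -1 for non-word boxes that should be filtered out first). The function names, the baseline rate, and the tolerance factor are illustrative assumptions.

```python
from statistics import mean
from typing import Dict, List

def document_metrics(word_confidences: List[float], low_threshold: float = 0.6) -> Dict[str, float]:
    """Summarize per-word confidences given on a 0-1 scale."""
    if not word_confidences:
        return {"mean_confidence": 0.0, "low_confidence_words": 0}
    return {
        "mean_confidence": mean(word_confidences),
        "low_confidence_words": sum(c < low_threshold for c in word_confidences),
    }

def should_alert(recent_docs: List[Dict[str, float]], baseline_rate: float,
                 doc_threshold: float = 0.6, tolerance: float = 2.0) -> bool:
    """Alert when the share of low-confidence documents exceeds tolerance x baseline."""
    if not recent_docs:
        return False
    low = sum(d["mean_confidence"] < doc_threshold for d in recent_docs)
    return low / len(recent_docs) > baseline_rate * tolerance

# Three made-up documents: one clean, two degraded.
docs = [document_metrics(c) for c in ([0.9, 0.95, 0.8], [0.4, 0.5, 0.7], [0.3, 0.55, 0.5])]
alert = should_alert(docs, baseline_rate=0.1)
```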
The one thing to remember: Production OCR accuracy is won in preprocessing and pipeline engineering, not model selection — the best recognition engine cannot fix a blurry, skewed, noisy input image.