OCR in Python — Core Concepts

Understand how Tesseract, EasyOCR, and PaddleOCR extract text from images — and when each tool fits best.

Optical Character Recognition (OCR) converts images of text into machine-readable strings. The pipeline has three stages: preprocessing the image, detecting where text lives, and recognizing what the text says.

The OCR pipeline

1. Preprocessing

Raw images rarely arrive clean. Preprocessing can improve accuracy substantially:

Binarization: Convert to black text on white background. Adaptive thresholding handles uneven lighting.
Deskewing: Straighten rotated documents. A skew of even 2 degrees hurts line segmentation.
Denoising: Remove specks, scanner artifacts, and compression artifacts.
Resolution upscaling: OCR engines such as Tesseract generally perform best at around 300 DPI. If the source is lower, upscale with bicubic interpolation before processing.
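To make the binarization step concrete, here is a minimal pure-Python sketch of adaptive mean thresholding. The function name, the window size, the offset, and the tiny sample patch are all assumptions chosen for illustration; in practice a library routine such as OpenCV's cv2.adaptiveThreshold would be used on real images.

```python
def adaptive_threshold(gray, window=3, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes 0 (ink) if it is darker than the mean of its local
    window minus `offset`; otherwise it becomes 255 (paper).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                gray[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Hypothetical 3x6 patch: paper brightness fades from 200 to 100 left to
# right (uneven lighting), with one dark stroke in each half.
patch = [
    [200, 180, 160, 140, 120, 100],
    [200, 80, 160, 140, 20, 100],
    [200, 180, 160, 140, 120, 100],
]
binary = adaptive_threshold(patch)
print(binary[1])
```

A single global cutoff of 128 would misclassify the darker paper on the right-hand side as ink; comparing each pixel with its own neighbourhood keeps only the two strokes.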

2. Text detection

Find the regions that contain text. Modern detectors like CRAFT and EAST output bounding boxes or polygons around words or text lines, which lets them cope with multi-column layouts and rotated signs; CRAFT's character-level approach also handles curved text.
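Neural detectors are too large to show inline, so the sketch below swaps in a much simpler classical technique, a horizontal projection profile, purely to illustrate what a detection stage outputs. It assumes a clean, already binarized, unrotated page and only finds the row range of each text line; the function name and the sample page are hypothetical.

```python
def find_text_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges that contain ink.

    `binary` is a list of rows where 0 means ink and 255 means paper.
    `bottom` is exclusive, like a Python slice.
    """
    lines = []
    start = None
    for y, row in enumerate(binary):
        ink = sum(1 for pixel in row if pixel == 0)
        if ink >= min_ink and start is None:
            start = y                   # a text line begins on this row
        elif ink < min_ink and start is not None:
            lines.append((start, y))    # the line ended on the previous row
            start = None
    if start is not None:               # text runs to the bottom edge
        lines.append((start, len(binary)))
    return lines


# Hypothetical 6x4 page with two text lines separated by blank rows.
page = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [0, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [255, 0, 0, 0],
]
line_ranges = find_text_lines(page)
print(line_ranges)
```

Each range can then be cropped and passed to the recognition stage. This approach breaks down on skewed, curved, or multi-column input, which is the gap learned detectors fill.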

3. Text recognition

Each detected region is fed to a recognition model that outputs the character sequence. Traditional approaches segment individual characters; modern deep learning models (CRNN, attention-based) read the entire text region end-to-end.
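As an illustration of end-to-end reading, the sketch below assumes a CRNN trained with CTC, a common pairing in which the network emits one label per horizontal time step, including a special blank. Greedy decoding collapses repeated labels and then drops the blanks. The alphabet, the blank index, and the frame labels are made up for the example.

```python
BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def ctc_greedy_decode(frame_labels, alphabet):
    """Collapse repeated labels, then drop blanks.

    `frame_labels` holds the most likely label index for each time step;
    `alphabet[i - 1]` is the character for label index i.
    """
    chars = []
    previous = BLANK
    for label in frame_labels:
        # Emit a character only when the label changes and is not blank.
        if label != BLANK and label != previous:
            chars.append(alphabet[label - 1])
        previous = label
    return "".join(chars)


alphabet = "celo"  # hypothetical character set: c=1, e=2, l=3, o=4
# The blank between the two runs of label 3 is what preserves the double "l".
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
decoded = ctc_greedy_decode(frames, alphabet)
print(decoded)
```

No character segmentation is needed: the model labels every time step and the decoder works out where one character ends and the next begins.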

Python OCR libraries

Tesseract (via pytesseract)

Open-source engine originally developed at HP in the 1980s and later sponsored by Google. Version 4 added an LSTM-based recognition engine that handles more fonts and languages.

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("document.png"), lang="eng")

Strengths: 100+ languages, extensive documentation, no GPU needed. Weaknesses: Struggles with complex layouts, handwriting, and curved text.

EasyOCR

Deep learning-based, supports 80+ languages including CJK scripts.

import easyocr

reader = easyocr.Reader(["en", "fr"])
results = reader.readtext("sign.jpg")
# Returns: [(bbox, text, confidence), ...]

Strengths: Multi-language in one call, handles scene text (signs, labels). Weaknesses: Slower than Tesseract on large documents, higher memory usage.

PaddleOCR

From Baidu, excels at CJK text and structured documents like tables and forms.

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")
results = ocr.ocr("receipt.jpg")

Strengths: Best-in-class for Chinese/Japanese/Korean, built-in layout analysis. Weaknesses: Larger installation footprint, documentation primarily in Chinese.

Choosing the right tool

Scenario	Best choice	Why
Clean scanned documents	Tesseract	Fast, free, accurate on printed text
Street signs and photos	EasyOCR	Scene text detection built in
Receipts and invoices	PaddleOCR	Layout analysis + table extraction
Handwritten notes	Google Cloud Vision or Azure AI	Cloud models trained on massive handwriting datasets

Common misconception

People often assume OCR produces perfect text from any image. In reality, accuracy depends heavily on image quality. A blurry phone photo of a crumpled receipt might yield around 70% accuracy, while a 300-DPI scan of a printed page can exceed 99%. Preprocessing is often more important than model choice.

Post-processing matters

Raw OCR output contains errors. Practical systems add:

Spell checking: Flag and correct obvious misreads like “rnatch” → “match.”
Regular expressions: Extract structured data (dates, amounts, emails) from noisy text.
Confidence filtering: OCR engines return a confidence score for each word or text region. Discard or flag results below a threshold (e.g., 0.7 on EasyOCR's 0 to 1 scale; Tesseract reports 0 to 100).
Layout reconstruction: Reassemble detected text blocks into the correct reading order, especially for multi-column documents.
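Two of these steps, confidence filtering and regular-expression extraction, can be sketched together. The results list mimics the (bbox, text, confidence) tuples that EasyOCR's readtext returns, but every value in it is invented, and the threshold and the patterns are assumptions that would need tuning for real documents.

```python
import re

# Hypothetical results in the (bbox, text, confidence) shape that
# EasyOCR's readtext returns; the values are made up for illustration.
results = [
    ([[10, 10], [120, 10], [120, 30], [10, 30]], "Invoice date: 2024-03-15", 0.94),
    ([[10, 40], [90, 40], [90, 60], [10, 60]], "Tota1 due", 0.52),
    ([[100, 40], [160, 40], [160, 60], [100, 60]], "$1,249.50", 0.88),
]

MIN_CONFIDENCE = 0.7  # threshold on a 0-1 confidence scale

# Keep confident results; set the rest aside for review instead of
# silently dropping them.
kept = [text for _bbox, text, conf in results if conf >= MIN_CONFIDENCE]
flagged = [text for _bbox, text, conf in results if conf < MIN_CONFIDENCE]

clean_text = " ".join(kept)

# ISO-style dates and dollar amounts; real documents need broader patterns.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", clean_text)
amounts = re.findall(r"\$\d{1,3}(?:,\d{3})*\.\d{2}", clean_text)

print(dates, amounts, flagged)
```

Filtering first keeps low-confidence misreads such as "Tota1" out of the text that the patterns run over, while the flagged list preserves them for manual checking.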

Accuracy metrics

Character Error Rate (CER) measures the edit distance between predicted and ground truth text, divided by the length of the ground truth. A CER of 0.02 means roughly 2 character edits (substitutions, insertions, or deletions) per 100 ground truth characters.

Word Error Rate (WER) does the same at the word level. It is usually higher than CER because a single wrong character makes the entire word count as an error.
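Both metrics can be computed with one Levenshtein routine that works on any sequence: characters for CER, words for WER. This is a minimal sketch with hypothetical function names and sample strings; it assumes a non-empty ground truth and whitespace tokenization, and does no text normalization.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    # previous[j] = distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(truth, predicted):
    """Character edits divided by ground truth length (truth must be non-empty)."""
    return edit_distance(truth, predicted) / len(truth)


def wer(truth, predicted):
    """Word edits divided by ground truth word count (truth must be non-empty)."""
    truth_words = truth.split()
    return edit_distance(truth_words, predicted.split()) / len(truth_words)


truth = "match the invoice total"
predicted = "rnatch the invoice total"  # "m" misread as "rn"

print(f"CER = {cer(truth, predicted):.3f}")
print(f"WER = {wer(truth, predicted):.3f}")
```

The misread costs two character edits out of 23 characters but one word out of four, which is why the word-level score comes out higher here.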

The one thing to remember: OCR accuracy depends on image quality and preprocessing as much as on the recognition engine itself. Clean input often beats a better model.
