OCR in Python — Core Concepts

Optical Character Recognition (OCR) converts images of text into machine-readable strings. The pipeline has three stages: preprocessing the image, detecting where text lives, and recognizing what the text says.

The OCR pipeline

1. Preprocessing

Raw images rarely arrive clean. Preprocessing improves accuracy dramatically:

  • Binarization: Convert to black text on white background. Adaptive thresholding handles uneven lighting.
  • Deskewing: Straighten rotated documents. A skew of even 2 degrees hurts line segmentation.
  • Denoising: Remove specks, scanner artifacts, and compression artifacts.
  • Resolution upscaling: Most OCR engines perform best on input at roughly 300 DPI or higher. If the source is lower, upscale with bicubic interpolation before processing.
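
As a rough illustration of the binarization step, the sketch below implements mean-based adaptive thresholding in plain Python on a tiny hypothetical image. The function name `adaptive_threshold`, its `window` and `offset` parameters, and the pixel values are assumptions made for illustration; a real pipeline would normally rely on an image library (for example OpenCV's `cv2.adaptiveThreshold`) rather than nested lists.

```python
def adaptive_threshold(gray, window=3, offset=15):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Each pixel is compared with the mean of its local window minus `offset`:
    clearly darker pixels become 0 (ink), everything else 255 (paper).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            values = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(values) / len(values)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Hypothetical 3x7 strip: the paper fades from bright (left) to dim (right),
# with one dark "ink" pixel in the bright half and one in the dim half.
image = [
    [220, 200, 180, 160, 140, 120, 100],
    [220,  60, 180, 160, 140,  30, 100],
    [220, 200, 180, 160, 140, 120, 100],
]

binary = adaptive_threshold(image)

# A single global cut-off, for comparison: it also blackens the dim paper.
global_binary = [[0 if v < 128 else 255 for v in row] for row in image]

for row in binary:
    print(row)
```

On this strip the adaptive version marks only the two ink pixels as black, while the fixed cut-off at 128 also turns the dimly lit paper on the right black, which is the uneven-lighting failure the bullet above refers to.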

2. Text detection

Locate the regions that contain text. Modern detectors like CRAFT and EAST output bounding boxes (axis-aligned, rotated, or in some models polygonal) around words or text lines, which lets them cope with multi-column layouts, rotated signs, and, to varying degrees, curved text.

3. Text recognition

Each detected region is fed to a recognition model that outputs the character sequence. Traditional approaches segment individual characters; modern deep learning models (CRNN, attention-based) read the entire text region end-to-end.
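
To make the end-to-end idea a little more concrete: CRNN-style models are typically trained with CTC, where the network emits one label per horizontal slice of the region and a decoding step turns that into the final string. The sketch below shows only greedy CTC decoding on a hand-written label sequence; the function name `ctc_greedy_decode`, the `-` blank symbol, and the sample frames are assumptions for illustration, and attention-based recognizers decode differently.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol; real models use a reserved class index


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse repeated labels, then drop blanks (greedy CTC decoding)."""
    # Step 1: merge runs of identical consecutive labels into one label.
    collapsed = [label for label, _run in groupby(frame_labels)]
    # Step 2: remove the blank symbol that separates genuine repeats.
    return "".join(label for label in collapsed if label != blank)


# Hypothetical per-frame best labels for an image of the word "hello".
# The blank between the two "l" runs is what keeps the double letter.
frames = list("hh-ee-l-ll-oo")
decoded = ctc_greedy_decode(frames)
print(decoded)
```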

Python OCR libraries

Tesseract (via pytesseract)

An open-source engine originally developed at HP Labs in the 1980s and later sponsored by Google. Version 4 added an LSTM-based recognition engine that handles more fonts and languages.

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("document.png"), lang="eng")

Strengths: 100+ languages, extensive documentation, no GPU needed. Weaknesses: Struggles with complex layouts, handwriting, and curved text.

EasyOCR

Deep learning-based, supports 80+ languages including CJK scripts.

import easyocr

reader = easyocr.Reader(["en", "fr"])
results = reader.readtext("sign.jpg")
# Returns: [(bbox, text, confidence), ...]

Strengths: Multi-language in one call, handles scene text (signs, labels). Weaknesses: Slower than Tesseract on large documents, higher memory usage.

PaddleOCR

From Baidu, excels at CJK text and structured documents like tables and forms.

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")
results = ocr.ocr("receipt.jpg")

Strengths: Best-in-class for Chinese/Japanese/Korean, built-in layout analysis. Weaknesses: Larger installation footprint, documentation primarily in Chinese.

Choosing the right tool

Scenario                 | Best choice                     | Why
Clean scanned documents  | Tesseract                       | Fast, free, accurate on printed text
Street signs and photos  | EasyOCR                         | Scene text detection built in
Receipts and invoices    | PaddleOCR                       | Layout analysis + table extraction
Handwritten notes        | Google Cloud Vision or Azure AI | Cloud models trained on massive handwriting datasets

Common misconception

People often assume OCR produces perfect text from any image. In reality, accuracy depends heavily on image quality. A blurry phone photo of a crumpled receipt might yield 70% accuracy, while a 300-DPI scan of a printed page can exceed 99%. Preprocessing is often more important than model choice.

Post-processing matters

Raw OCR output contains errors. Practical systems add:

  • Spell checking: Flag and correct obvious misreads like “rnatch” → “match.”
  • Regular expressions: Extract structured data (dates, amounts, emails) from noisy text.
  • Confidence filtering: Most OCR engines return per-word confidence scores. Discard or flag words below a threshold (e.g., 0.7 on a 0-1 scale; some engines, such as Tesseract, report 0-100 instead).
  • Layout reconstruction: Reassemble detected text blocks into the correct reading order, especially for multi-column documents.
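
Two of the steps above, confidence filtering and regular-expression extraction, can be sketched together as follows. The detections are hypothetical data in the (bbox, text, confidence) shape shown for EasyOCR earlier, and the helper name `keep_confident`, the 0.7 threshold, and both patterns are illustrative assumptions that would need tuning for real documents.

```python
import re

# Hypothetical detections: (bbox, text, confidence), confidence on a 0-1 scale.
detections = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "Date:", 0.98),
    ([[90, 0], [200, 0], [200, 20], [90, 20]], "2024-03-15", 0.95),
    ([[0, 30], [80, 30], [80, 50], [0, 50]], "Total", 0.91),
    ([[90, 30], [160, 30], [160, 50], [90, 50]], "$42.50", 0.88),
    ([[0, 60], [40, 60], [40, 80], [0, 80]], "~#", 0.21),  # likely noise
]


def keep_confident(results, threshold=0.7):
    """Return the text of detections whose confidence meets the threshold."""
    return [text for _bbox, text, conf in results if conf >= threshold]


# Illustrative patterns: ISO-style dates and dollar amounts.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_RE = re.compile(r"\$\s?\d+(?:\.\d{2})?")

words = keep_confident(detections)
line = " ".join(words)

dates = DATE_RE.findall(line)
amounts = AMOUNT_RE.findall(line)
print(dates, amounts)
```

Filtering before matching keeps low-confidence debris out of the text the patterns run on; a system that cannot afford to lose data might flag those words for review instead of dropping them.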

Accuracy metrics

Character Error Rate (CER) measures the edit distance between predicted and ground truth text, divided by the length of the ground truth. A CER of 0.02 means 2% of characters are wrong.

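Word Error Rate (WER) does the same at the word level. It is usually higher than CER because a single wrong character makes the entire word count as incorrect, although this is not guaranteed: when one long word is garbled completely, CER can exceed WER.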
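
Both metrics reduce to one edit-distance routine applied to characters or to words. The following is a minimal sketch using a standard Levenshtein implementation; the function names and the sample strings are assumptions for illustration, and dedicated evaluation packages may normalize text (case, whitespace, punctuation) differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the reference
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, prediction):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, prediction) / len(reference)


def wer(reference, prediction):
    """Word Error Rate: word edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


# Hypothetical ground truth and OCR output with one misread character.
truth = "total amount due"
ocr = "tota1 amount due"

print(f"CER = {cer(truth, ocr):.4f}")  # 1 edit over 16 characters
print(f"WER = {wer(truth, ocr):.4f}")  # 1 wrong word out of 3
```

In this example a single misread character gives a CER of 0.0625 but a WER of about 0.33, since the whole word counts as wrong.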

The one thing to remember: OCR accuracy depends on image quality and preprocessing as much as on the recognition engine itself, and clean input often beats a better model.

