How to extract text from an image using Python
Tesseract via pytesseract, EasyOCR, PaddleOCR, and the API-based path — what each one is best at, what they break on, and the few lines of code to get started.
OCR in Python is mature but uneven. The right library depends on the language, the image quality, and whether you care about layout preservation or just raw text.
Tesseract via pytesseract
The default. Free, well-supported, handles 100+ languages. Install Tesseract at the OS level (brew install tesseract or apt install tesseract-ocr), then:
```python
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("page.png"), lang="eng")
print(text)
```
For better results on scans, preprocess: convert to grayscale, threshold, deskew. The OpenCV docs have one-liners for each. Skipping preprocessing on noisy scans is the most common reason Tesseract output looks broken.
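As a rough sketch of the grayscale-and-threshold steps (numpy-only here for clarity; OpenCV's `cvtColor` and `threshold` are the usual one-liners and handle more cases, and deskewing needs OpenCV proper):

```python
import numpy as np

def binarize(rgb: np.ndarray, thresh: int = 128) -> np.ndarray:
    """Grayscale + global threshold for an HxWx3 uint8 image.

    A numpy-only sketch of what cv2.cvtColor + cv2.threshold do;
    Otsu thresholding (cv2.THRESH_OTSU) picks `thresh` automatically.
    """
    # ITU-R BT.601 luminance weights, the same ones OpenCV uses
    gray = rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Light pixels (background) -> 255, dark pixels (text) -> 0
    return np.where(gray > thresh, 255, 0).astype(np.uint8)
```

A fixed threshold of 128 is a placeholder; real scans usually want Otsu or adaptive thresholding because lighting varies across the page.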
Strengths: clean printed English, large fonts, structured documents. Weaknesses: handwritten text, low-res screenshots, anything with mixed fonts in the same line.
EasyOCR
Deep-learning-based, pip-installable, no system dependencies. Better than Tesseract on natural-scene text (signs, receipts, labels) and on non-Latin scripts.
```python
import easyocr

reader = easyocr.Reader(["en"])
results = reader.readtext("receipt.jpg")
for bbox, text, conf in results:
    print(text, conf)
```
Returns bounding boxes and confidence scores per detection, which is useful when you need to filter low-confidence reads. The first run is slow because it downloads model weights, and inference is generally slower than Tesseract, especially on CPU.
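Filtering on that confidence score is a one-liner. A sketch against a hypothetical results list in EasyOCR's `(bbox, text, confidence)` shape (the 0.5 cutoff is an arbitrary choice, not a library default):

```python
# Hypothetical detections in EasyOCR's (bbox, text, confidence) format
results = [
    ([[0, 0], [60, 0], [60, 20], [0, 20]], "TOTAL", 0.98),
    ([[0, 30], [60, 30], [60, 50], [0, 50]], "$12.4O", 0.41),  # noisy read
]

MIN_CONF = 0.5  # tune per dataset; poor scans may need a lower bar
kept = [text for _bbox, text, conf in results if conf >= MIN_CONF]
print(kept)  # -> ['TOTAL']
```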
PaddleOCR
From Baidu. Strongest on Chinese, Japanese, Korean. Comparable to EasyOCR on English. Heavier install but worth it if you're processing CJK documents at scale.
Handwriting
None of the open-source OCR engines handle handwriting well. For handwritten text, the realistic options are Azure AI Vision's Read API, Google Cloud Vision, or a multimodal model like the one ExtractFox uses for the handwriting extractor.
API-based extraction
When you need not just text but structure (fields, tables, key-value pairs), the multimodal route skips the OCR-then-parse pipeline entirely. You send the image, declare what you want as a schema, and get structured output back:
```python
import requests

API_KEY = "..."  # your API key

with open("invoice.jpg", "rb") as f:
    r = requests.post(
        "https://extractfox.com/api/extract",
        files={"file": f},
        data={"vertical": "invoice"},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
print(r.json())
```
Worth it when extraction quality matters more than running locally for free.
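Downstream handling is then plain dict access. A sketch against a hypothetical response shape (the field names here are invented for illustration; check the actual API's response format):

```python
# Hypothetical structured response; real field names depend on the API
response = {
    "vendor": "Acme Corp",
    "total": "1,204.50",
    "line_items": [
        {"description": "Widgets", "amount": "980.00"},
        {"description": "Shipping", "amount": "224.50"},
    ],
}

# Normalize the total into a number for downstream use
total = float(response["total"].replace(",", ""))
print(total)  # -> 1204.5
```

This is the practical difference from the OCR-then-parse pipeline: there is no regex step between the image and the fields.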
Choosing
- Clean printed English, no internet → Tesseract.
- Receipts, signs, multilingual → EasyOCR.
- CJK at scale → PaddleOCR.
- Handwriting → Cloud Vision or multimodal.
- Need structured fields, not just text → API-based.
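The checklist above can be folded into a tiny helper (the rules mirror the bullets; the function and return names are mine):

```python
def pick_ocr(*, handwriting: bool = False, cjk: bool = False,
             scene_text: bool = False, needs_structure: bool = False) -> str:
    """Map the decision list to a tool name.

    Order matters: the most constraining requirements are checked first.
    """
    if needs_structure:
        return "api"           # structured fields, not just text
    if handwriting:
        return "cloud-vision"  # no open-source engine handles this well
    if cjk:
        return "paddleocr"
    if scene_text:
        return "easyocr"       # receipts, signs, multilingual
    return "tesseract"         # clean printed text, runs offline for free

print(pick_ocr())                      # -> tesseract
print(pick_ocr(handwriting=True))      # -> cloud-vision
```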