How to extract data from an ID document
Pull identity fields from driver's licenses, national ID cards, and residence permits — with Python OCR, KYC APIs, and AI extraction for any country or format.
Extracting data from an ID document — driver's license, national ID card, residence permit, voter card — means reading printed fields that vary by country, card type, and issue year. The same field (date of birth) appears in different positions, different formats, and different languages across the hundreds of ID formats in use worldwide.
What you can extract from an ID document
- Names — surname, given names, full name as printed
- Date of birth and sex
- Document number, issue date, expiry date
- Issuing country, issuing authority or state
- Address (where printed on the card — many IDs don't include it)
- Machine-readable zone (MRZ) — the two or three lines of machine-readable text on the back of many cards
- Nationality and place of birth
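One way to hold these fields in code is a small record type in which every field is optional, since no single card prints all of them. The sketch below is illustrative only: the `IDFields` name, its attribute names, the `DATE_FORMATS` list, and the `normalize_date` helper are assumptions, not a standard schema.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

# Hypothetical list of printed date layouts to try, in order.
# Day-first is assumed here; adjust the list per issuing country.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def normalize_date(raw: str) -> Optional[date]:
    """Return a date for the first layout that matches, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


@dataclass
class IDFields:
    """Illustrative container; any field may be absent on a given card."""
    surname: Optional[str] = None
    given_names: Optional[str] = None
    date_of_birth: Optional[date] = None
    sex: Optional[str] = None
    document_number: Optional[str] = None
    issue_date: Optional[date] = None
    expiry_date: Optional[date] = None
    issuing_country: Optional[str] = None
    address: Optional[str] = None
    nationality: Optional[str] = None


record = IDFields(surname="SMITH", date_of_birth=normalize_date("01.01.1980"))
print(record.date_of_birth)  # address stays None when the card omits it
```

Keeping dates as `date` objects rather than strings makes later checks, such as expiry comparisons, independent of how each country prints them.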
Method 1: Tesseract OCR (open-source, format-agnostic)
Tesseract reads text from ID photos but returns raw text — you still need to parse the output into named fields.
- pip install pytesseract pillow && brew install tesseract  # brew is macOS; on Debian/Ubuntu use: sudo apt install tesseract-ocr
- from PIL import Image; import pytesseract
- text = pytesseract.image_to_string(Image.open('id_front.jpg'))
- # text contains all the raw characters — you need to find DOB, name, number yourself
The practical challenge: ID cards carry dense, small text in non-standard fonts, and Tesseract is noticeably less accurate on card photos than on ordinary document text. And unlike passports, whose MRZ fields are standardized, national IDs vary by country, so there is no universal field schema to parse against.
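Turning raw OCR text into named fields usually comes down to label matching. Here is a minimal sketch, assuming the card prints English labels such as "Surname" or "Date of birth" next to each value. The label patterns and the `parse_id_text` helper are hypothetical and would need a variant for each card layout.

```python
import re
from typing import Dict, Optional

# Hypothetical label patterns for one English-language layout.
FIELD_PATTERNS = {
    "surname": re.compile(
        r"^\s*(?:surname|last name)\b[\s:.\-]*(.+)$", re.I),
    "given_names": re.compile(
        r"^\s*(?:given names?|first name)\b[\s:.\-]*(.+)$", re.I),
    "date_of_birth": re.compile(
        r"^\s*(?:date of birth|dob)\b[\s:.\-]*"
        r"(\d{1,2}[./\-]\d{1,2}[./\-]\d{2,4})", re.I),
    "document_number": re.compile(
        r"^\s*(?:document|card|licen[cs]e) (?:no|number)\b[\s:.\-]*"
        r"([A-Z0-9]+)", re.I),
}


def parse_id_text(text: str) -> Dict[str, Optional[str]]:
    """Scan OCR output line by line; keep the first match per field."""
    fields: Dict[str, Optional[str]] = {name: None for name in FIELD_PATTERNS}
    for line in text.splitlines():
        for name, pattern in FIELD_PATTERNS.items():
            if fields[name] is None:
                match = pattern.search(line)
                if match:
                    fields[name] = match.group(1).strip()
    return fields


# Stand-in for the string returned by pytesseract.image_to_string(...)
sample = (
    "Surname: SMITH\n"
    "Given names: JOHN WILLIAM\n"
    "Date of birth: 01.01.1980\n"
    "Document No. AB123456\n"
)
print(parse_id_text(sample))
```

Fields that never match stay `None`, which lets downstream code tell "not printed on this card" and "OCR missed it" apart from a wrong value.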
Method 2: MRZ parsing (for cards with a machine-readable zone)
Many national ID cards and residence permits include an MRZ — two or three lines at the bottom with fixed-position fields that follow ICAO 9303 standards. Libraries like mrz (Python) parse these reliably.
- pip install mrz
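- from mrz.checker.td1 import TD1CodeChecker  # TD1 = 3-line ID card MRZ, 30 characters per line
- # The checker takes ONE string with the three lines joined by newlines (sample data, not a real card)
- checker = TD1CodeChecker('IDGBRXXXXXX<<<6<<<<<<<<<<<<<<<\n8001014M2501017GBR<<<<<<<<<<<2\nSMITH<<JOHN<WILLIAM<<<<<<<<<<<')
- fields = checker.fields()  # parsed values; bool(checker) reports whether the check digits validate
- print(fields.surname, fields.name, fields.birth_date)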
MRZ parsing is very accurate when the MRZ is visible and readable, because ICAO 9303 fixes the position of every field and adds check digits that expose misread characters. The limitations: not all ID cards have an MRZ, and the OCR step that lifts the MRZ lines from a photo still introduces errors on low-quality images.
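OCR misreads in an MRZ can often be caught with the ICAO 9303 check digit: each character is mapped to a number (digits as themselves, A to Z as 10 to 35, the filler `<` as 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. A standard-library sketch follows; the function names are illustrative.

```python
WEIGHTS = (7, 3, 1)  # repeating weights defined by ICAO 9303


def char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def check_digit(field: str) -> int:
    """Compute the check digit for one MRZ field."""
    total = sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field))
    return total % 10


def field_is_valid(field: str, digit: str) -> bool:
    """True when the printed check digit matches the computed one."""
    return digit in "0123456789" and len(digit) == 1 and check_digit(field) == int(digit)


print(check_digit("L898902C3"))       # 6 (document number of the ICAO specimen passport)
print(field_is_valid("740812", "2"))  # True
print(field_is_valid("740B12", "2"))  # False: an 8 misread as B is caught
```

A failed check tells you the field was misread but not which character is wrong, so a common follow-up is to retry OCR on the MRZ region or ask the user for a new photo.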
Method 3: AI extraction (any country, any format, front and back)
AI extraction uses a vision model that recognizes ID document structure — it finds the name, date of birth, and document number based on their visual context and position, not on fixed coordinates. This means it works across hundreds of national ID formats without per-country templates.
ID documents vs passports
Passports follow a tighter international standard (ICAO 9303 TD3) — two MRZ lines, consistent field positions across all countries. National ID cards are more varied: different sizes (TD1, TD2), different field layouts per country, and some countries don't include an MRZ at all. If you're processing passports specifically, the passport extractor handles the MRZ parsing and additional passport-specific fields.
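Because each ICAO 9303 format has fixed dimensions (TD1 is three lines of 30 characters, TD2 is two lines of 36, TD3 is two lines of 44), the format can be inferred from the shape of the MRZ text before choosing a parser. A small sketch, with `detect_mrz_format` as a hypothetical helper name:

```python
from typing import Optional

# (line count, characters per line) -> ICAO 9303 format
MRZ_FORMATS = {(3, 30): "TD1", (2, 36): "TD2", (2, 44): "TD3"}


def detect_mrz_format(mrz: str) -> Optional[str]:
    """Infer the MRZ format from line count and line length.

    Returns None when the shape matches no known format, which usually
    means the OCR step dropped or added characters.
    Note: visas (MRV-A, MRV-B) share the TD3 and TD2 dimensions, so
    check the document code as well if visas can appear in your input.
    """
    lines = [ln.strip() for ln in mrz.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    lengths = {len(ln) for ln in lines}
    if len(lengths) != 1:
        return None  # lines of unequal length: not a clean MRZ read
    return MRZ_FORMATS.get((len(lines), lengths.pop()))


# Placeholder lines of the right shape, not real MRZ data
td1_like = "\n".join(["<" * 30] * 3)
td3_like = "\n".join(["<" * 44] * 2)
print(detect_mrz_format(td1_like), detect_mrz_format(td3_like))
```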
KYC and identity verification workflows
For KYC onboarding — verifying identity at signup — the typical flow is: user uploads a front (and sometimes back) photo of their ID, extraction returns the structured fields, and downstream logic checks the expiry date, validates the document number format, and compares the name against the account holder. AI extraction handles the 'read the document' step; your business logic handles the verification rules.
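The downstream checks described above might look like the following sketch. It assumes the extraction step has already produced a dict with `expiry_date`, `document_number`, and `full_name` keys; those key names, the `DOC_NUMBER_RE` pattern, and the helper functions are all hypothetical, and real document number rules differ per country and card type.

```python
import re
import unicodedata
from datetime import date

# Hypothetical format rule: 6 to 12 uppercase letters or digits.
DOC_NUMBER_RE = re.compile(r"[A-Z0-9]{6,12}")


def normalize_name(name: str) -> str:
    """Strip accents, punctuation, case, and extra spaces (Latin script only)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    letters_only = re.sub(r"[^A-Za-z ]", " ", no_accents)
    return " ".join(letters_only.upper().split())


def run_checks(fields: dict, account_name: str, today: date) -> dict:
    """Return one boolean per verification rule."""
    return {
        "not_expired": fields["expiry_date"] >= today,
        "number_format_ok": bool(
            DOC_NUMBER_RE.fullmatch(fields["document_number"])),
        "name_matches": normalize_name(fields["full_name"])
        == normalize_name(account_name),
    }


extracted = {
    "expiry_date": date(2030, 1, 1),
    "document_number": "AB123456",
    "full_name": "José Müller-Smith",
}
print(run_checks(extracted, "jose muller smith", today=date(2025, 6, 1)))
```

Passing `today` in explicitly keeps the expiry rule testable. Exact name equality after normalization is deliberately strict; production systems often add fuzzy matching or manual review for near misses.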
Frequently asked questions
How do I extract data from a driver's license photo?
Upload the photo to an AI extractor. Tesseract can read text from a license image but returns raw text — you'd need to write parsing logic to extract the name, DOB, and license number from the OCR output. AI extraction returns named fields directly.
Does ID document extraction work on cards from any country?
AI extraction works across all major national ID formats because it reads the document visually rather than applying per-country templates. Tesseract and MRZ parsers are more limited — MRZ parsing only works on cards that include a machine-readable zone.
Can I extract the MRZ from an ID card?
Yes. If the card has an MRZ (two or three rows of alphanumeric characters at the bottom), AI extraction returns both the raw MRZ lines and the parsed fields (surname, given names, DOB, document number, expiry, nationality, check digits).
Is AI-extracted ID data accurate enough for KYC?
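For the structured fields (name, DOB, expiry date), accuracy on clear photos is very high. Extraction only reads what is printed, though; it does not prove the card is real. The recommended approach for regulated KYC is to use extraction for structured data capture and to run separate checks alongside it: a document authenticity check to confirm the card is genuine, and a liveness check to confirm the person presenting it is physically present.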