How to extract data from a resume or CV
Parse resume PDFs into structured candidate data — contact info, work history, education, skills — using Python, resume parsers, or AI. Works on any CV layout.
Resume parsing is the problem of turning a PDF into a structured candidate record: name, email, phone, job history with dates and companies, education, skills. The hard part is that every candidate designs their resume differently — two columns, infographic layouts, different section headings, tables vs bullets.
Method 1: pdfminer or pdfplumber (raw text extraction)
Extracting raw text from a resume PDF is straightforward. Turning that text into structured fields — that's where it gets hard.
    pip install pdfplumber

    import pdfplumber

    with pdfplumber.open('resume.pdf') as pdf:
        text = '\n'.join(p.extract_text() or '' for p in pdf.pages)
    # text now contains all the resume text; you still need to parse it into fields
From there, extracting the email is easy (regex). Phone numbers are trickier (many formats). The job history is the hard part: section headings vary ('Experience', 'Work History', 'Professional Background', 'Career'), and date formats differ across candidates. Handling job history is roughly 80% of the engineering effort in a homegrown parser.
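As a rough illustration of the easy and the harder parts, here is a minimal sketch that pulls an email and phone number with regular expressions and groups lines under recognised section headings. The patterns, the `SECTION_ALIASES` table, and the function names are assumptions made for this example; they cover common cases only and would need extending for real resumes.

```python
import re

# Deliberately simple patterns: they cover common cases, not every valid address or number.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")

# Heading variants mapped to one canonical section name (illustrative, not exhaustive).
SECTION_ALIASES = {
    "experience": "experience",
    "work history": "experience",
    "professional background": "experience",
    "career": "experience",
    "education": "education",
    "skills": "skills",
}


def find_email(text):
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


def find_phone(text):
    match = PHONE_RE.search(text)
    return match.group(0).strip() if match else None


def split_sections(text):
    """Group non-empty lines under the most recent recognised heading."""
    sections = {"header": []}
    current = "header"
    for line in text.splitlines():
        key = line.strip().rstrip(":").lower()
        if key in SECTION_ALIASES:
            current = SECTION_ALIASES[key]
            sections.setdefault(current, [])
        elif line.strip():
            sections[current].append(line.strip())
    return sections


sample = """Jane Doe
jane.doe@example.com | +1 (555) 010-2345

Work History
Acme Corp, Senior Engineer, Jan 2020 - Present

Education
BSc Computer Science, 2015
"""

print(find_email(sample))
print(find_phone(sample))
print(split_sections(sample)["experience"])
```

The heading lookup only works when a heading sits alone on its line and matches an alias exactly, which is the fragility described above: every new heading variant means another table entry.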
Method 2: Open-source resume parsers
Several open-source libraries are purpose-built for resumes:
- pyresparser — Python, uses spaCy NLP, extracts name, email, skills, education, and experience
- resume-parser (npm) — JavaScript, works on plain text extracted from PDF
- OpenResume parser — designed for the OpenResume format but works on many PDFs
These work well on standard Western resume formats. They struggle on two-column layouts (where text is extracted in the wrong reading order), infographic CVs, non-English resumes, and academic CVs with publication lists. Skill extraction in particular usually matches against a fixed list of known skills, so it misses domain-specific skills that are not on that list.
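To make the two-column problem concrete, the sketch below reorders positioned words so the left column is read in full before the right one. The input dicts use the `text`, `x0`, and `top` keys that pdfplumber's `extract_words()` returns; the midpoint split and the `reading_order` function are assumptions for illustration, and real layouts would need proper gutter detection.

```python
def reading_order(words, page_width, line_tolerance=3.0):
    """Return text with the left column read fully before the right column.

    Assumes exactly two columns split at the page midpoint, which is a
    simplification of what real resumes look like.
    """
    midpoint = page_width / 2
    columns = ([], [])
    for word in words:
        columns[0 if word["x0"] < midpoint else 1].append(word)

    lines_out = []
    for column in columns:
        rows = []  # each row holds words that share roughly the same vertical position
        for word in sorted(column, key=lambda w: w["top"]):
            if rows and abs(word["top"] - rows[-1][0]["top"]) <= line_tolerance:
                rows[-1].append(word)
            else:
                rows.append([word])
        for row in rows:
            row.sort(key=lambda w: w["x0"])  # left to right within a line
            lines_out.append(" ".join(w["text"] for w in row))
    return "\n".join(lines_out)


# Synthetic word boxes: a "Skills" sidebar on the left, "Experience" on the right.
words = [
    {"text": "Skills", "x0": 40, "top": 100},
    {"text": "Experience", "x0": 320, "top": 100},
    {"text": "Python", "x0": 40, "top": 120},
    {"text": "Acme", "x0": 320, "top": 120},
    {"text": "Corp", "x0": 360, "top": 120},
    {"text": "SQL", "x0": 40, "top": 140},
]

print(reading_order(words, page_width=612))
```

A naive top-to-bottom read of the same words would interleave the columns ("Skills Experience", "Python Acme Corp"), which is how a skill ends up attached to a job title.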
Method 3: AI extraction (any layout, any language, any format)
Multimodal AI reads the resume as a document. It handles two-column layouts correctly because it sees the visual structure, not just the text flow. It understands that 'Managed a team of 12 engineers' belongs to the role heading it sits under, not to whichever role happens to come next in the extracted text.
What you can extract from a resume
- Contact info — full name, email, phone, location, LinkedIn URL, GitHub, portfolio
- Headline and summary
- Work experience — company, title, start and end dates, description bullets
- Education — institution, degree, field, dates, GPA (if shown)
- Skills — technical skills, languages, tools, certifications
- Publications, patents, awards (for academic CVs)
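One way to pin these fields down is a small typed schema that every parser output has to fit. The sketch below uses Python dataclasses; the class and field names (`Candidate`, `Position`, `field_of_study`, and so on) are illustrative choices rather than any standard, and dates are kept as plain strings for simplicity.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Position:
    company: str
    title: str
    start_date: Optional[str] = None  # e.g. "2020-01"; None when not stated
    end_date: Optional[str] = None    # None may also mean a current role
    bullets: List[str] = field(default_factory=list)


@dataclass
class Education:
    institution: str
    degree: Optional[str] = None
    field_of_study: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    gpa: Optional[str] = None  # only if shown on the resume


@dataclass
class Candidate:
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = None
    links: List[str] = field(default_factory=list)  # LinkedIn, GitHub, portfolio
    summary: Optional[str] = None
    experience: List[Position] = field(default_factory=list)
    education: List[Education] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)


candidate = Candidate(
    full_name="Jane Doe",
    email="jane.doe@example.com",
    experience=[Position(company="Acme Corp", title="Senior Engineer", start_date="2020-01")],
    skills=["Python", "SQL"],
)
record = asdict(candidate)  # nested plain dicts and lists, ready for json.dumps
print(record["experience"][0]["company"])
```

Optional fields default to `None` or an empty list, so a sparse resume still produces a record with the same keys as a detailed one.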
Parsing resumes at scale for recruiting
For volume recruiting, where you process hundreds of applications, the bottleneck is a consistent schema. A homegrown parser may return different fields from one resume to the next, or miss skills on infographic layouts. An AI parser returns the same schema for every candidate, so filtering by skill, location, or years of experience is a simple spreadsheet operation.
A common flow: applicants submit PDFs, an automation routes each one to the extraction API, and the structured JSON lands in a candidate tracking sheet or ATS. Each candidate becomes a row with comparable fields.
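The last step of that flow, turning each JSON record into a comparable row, might look like the sketch below. The column names and the input keys (`full_name`, `experience`, `skills`, and so on) are assumptions about what the parsed JSON contains; adjust them to whatever schema your extractor actually returns.

```python
import csv
import io

COLUMNS = ["full_name", "email", "location", "latest_title", "latest_company", "skills"]


def to_row(record):
    """Flatten one parsed resume (a dict) into a flat spreadsheet row."""
    jobs = record.get("experience") or []
    latest = jobs[0] if jobs else {}  # assumes jobs are listed most recent first
    return {
        "full_name": record.get("full_name", ""),
        "email": record.get("email", ""),
        "location": record.get("location", ""),
        "latest_title": latest.get("title", ""),
        "latest_company": latest.get("company", ""),
        "skills": "; ".join(record.get("skills") or []),
    }


def write_sheet(records, out):
    """Write a header plus one row per candidate to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))


records = [
    {"full_name": "Jane Doe", "email": "jane@example.com", "location": "Berlin",
     "experience": [{"title": "Senior Engineer", "company": "Acme Corp"}],
     "skills": ["Python", "SQL"]},
    {"full_name": "Sam Lee", "email": "sam@example.com", "skills": []},
]
buffer = io.StringIO()
write_sheet(records, buffer)
print(buffer.getvalue())
```

Missing fields become empty cells rather than missing columns, which keeps every candidate row comparable.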
Frequently asked questions
How do I parse a resume PDF into structured data?
Use pyresparser for a free Python option (works on standard formats), or an AI extractor for any CV layout. AI extraction handles two-column layouts, infographic CVs, and non-English resumes that rule-based parsers miss.
What fields can I extract from a resume automatically?
Name, email, phone, location, every job with company, title, and dates, education history, and a deduplicated skills list. With AI extraction you can also ask for specific fields — 'years of Python experience', 'highest degree earned' — as part of the same request.
How do I extract resume data from many PDFs at once?
Use an extraction API: POST each PDF and collect the JSON. The output schema is consistent across all resumes regardless of format (same field names, same structure), so you can build a candidate comparison spreadsheet directly.
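A batch run can be sketched as a loop that hands each PDF to an extractor and records failures instead of stopping. No specific API is assumed here: `extract` is a placeholder for your own client function that takes the PDF bytes and returns a dict, and `parse_folder` is a hypothetical helper name.

```python
from pathlib import Path


def parse_folder(folder, extract):
    """Run `extract` on every PDF in `folder`; collect results and failures.

    `extract` is a stand-in for your own client call (PDF bytes in, dict out).
    """
    results, failures = {}, {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = extract(pdf_path.read_bytes())
        except Exception as exc:  # one unreadable file should not stop the batch
            failures[pdf_path.name] = str(exc)
    return results, failures
```

Keeping failures in a separate mapping makes it easy to retry or manually review the resumes that did not parse.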
Why do open-source resume parsers miss skills?
Most parsers use a skill dictionary — a fixed list of known skills to look for. Domain-specific, niche, or newly coined skills that aren't in the list are missed. AI extraction reasons about the document instead of pattern-matching, so it picks up skills the dictionary doesn't know about.
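The limitation is easy to reproduce with a toy version of dictionary matching. The skill list and the `match_skills` function below are made up for illustration and are far smaller than what a real parser ships with, but the failure mode is the same.

```python
import re

# A tiny stand-in for the fixed skill list a dictionary-based parser relies on.
KNOWN_SKILLS = {"python", "sql", "docker", "project management"}


def match_skills(text, known=KNOWN_SKILLS):
    """Return the known skills that appear in `text` as whole words or phrases."""
    lowered = text.lower()
    found = set()
    for skill in known:
        # Lookarounds stop "sql" from matching inside a longer word.
        if re.search(r"(?<!\w)" + re.escape(skill) + r"(?!\w)", lowered):
            found.add(skill)
    return sorted(found)


resume_text = "Skills: Python, SQL, Kubernetes, single-cell RNA sequencing analysis"
print(match_skills(resume_text))  # ['python', 'sql']
```

Kubernetes and the sequencing skill are on the resume but not in the list, so they never reach the output.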