How to extract data from an insurance policy PDF
Pull coverage limits, premiums, exclusions, and policy dates from any insurance policy PDF — declarations pages, endorsements, and full policies from any carrier.
Insurance policies are among the most complex documents to extract data from: hundreds of pages of legal language, carrier-specific field names for the same concept, endorsements that override parts of the base policy, and exclusions buried in sub-sections. The declarations page is the structured summary — the rest of the policy is prose.
Declarations page vs the full policy
The declarations page (dec page) is the structured summary at the front of the policy. It contains the policy number, named insured, coverage types, limits, premiums, effective dates, and sometimes the deductible. It's the page brokers and claims teams look at first.
The rest of the policy — conditions, definitions, exclusions, endorsements — is prose. Extracting specific clause language (e.g., 'what exactly is excluded under the pollution exclusion?') requires reading the full document, not just the dec page.
What you can extract from an insurance policy
- Policy header — policy number, carrier, policy type, named insured
- Effective date and expiration date
- Coverage types and limits — per occurrence, aggregate, deductible
- Premium — annual or installment
- Additional insureds and loss payees
- Endorsements list — policy modifications attached
- Exclusions — what the policy does not cover
- Retroactive date (for claims-made policies like E&O and D&O)
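One way to pin these fields down is a typed record that every extraction method fills in. The sketch below is one possible shape for such a schema, not a standard: the class names, field names, and sample values are all hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CoverageLimit:
    coverage_type: str                    # e.g. "General Liability"
    per_occurrence: Optional[int] = None  # whole dollars
    aggregate: Optional[int] = None
    deductible: Optional[int] = None


@dataclass
class PolicyRecord:
    # Policy header
    policy_number: str
    carrier: str
    policy_type: str
    named_insured: str
    # Dates kept as ISO 8601 strings so they sort and compare correctly
    effective_date: str
    expiration_date: str
    coverages: list[CoverageLimit] = field(default_factory=list)
    premium: Optional[float] = None
    additional_insureds: list[str] = field(default_factory=list)
    loss_payees: list[str] = field(default_factory=list)
    endorsements: list[str] = field(default_factory=list)
    exclusions: list[str] = field(default_factory=list)
    retroactive_date: Optional[str] = None  # claims-made policies only


# Made-up sample values, for illustration only
record = PolicyRecord(
    policy_number="GL-0000001",
    carrier="Example Mutual",
    policy_type="CGL",
    named_insured="Acme Widgets LLC",
    effective_date="2025-01-01",
    expiration_date="2026-01-01",
    coverages=[CoverageLimit("General Liability", 1_000_000, 2_000_000, 5_000)],
    premium=4_250.00,
)
print(asdict(record))
```

Optional fields default to None or an empty list, so a dec-page-only extraction can fill in the header and limits while leaving exclusions for a later pass over the full policy.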
Method 1: Python pdfplumber for dec page text extraction
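For policy PDFs with a text layer (digitally generated, not scanned images), pdfplumber extracts the dec page text in a few lines. Install it first:

    pip install pdfplumber

Then read the first page and match the labeled fields:

    import re
    import pdfplumber

    with pdfplumber.open('policy.pdf') as pdf:
        # The dec page is usually page 1; fall back to '' if no text is found
        dec_page = pdf.pages[0].extract_text() or ''

    # re.search returns a match object (use .group(1) for the value) or None
    policy_number = re.search(r'Policy No[.:]*\s+([A-Z0-9-]+)', dec_page, re.I)
    effective = re.search(r'Effective Date[:\s]+([\d/]+)', dec_page, re.I)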
This approach works on a single carrier's policies, where you know the field labels. It breaks when a new carrier labels 'Effective Date' as 'Inception Date', 'Policy Period From', or 'Coverage Start'. For a brokerage handling policies from dozens of carriers, keeping carrier-specific regex patterns up to date is a constant burden.
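To make the maintenance cost concrete, here is a minimal sketch of what multi-carrier regex extraction tends to turn into: a table of label aliases per field. The `FIELD_ALIASES` table and the `find_field` helper are hypothetical names for illustration, and the date pattern assumes numeric slash-separated dates.

```python
import re

# Hypothetical alias table: every label a carrier might use for one field.
# Each new carrier typically means another entry here.
FIELD_ALIASES = {
    "effective_date": [
        "Effective Date",
        "Inception Date",
        "Policy Period From",
        "Coverage Start",
    ],
}

# Assumes numeric dates such as 01/01/2025 or 1/1/25
DATE_PATTERN = r"(\d{1,2}/\d{1,2}/\d{2,4})"


def find_field(text, field):
    """Try each known label in turn; return the first date found, else None."""
    for label in FIELD_ALIASES[field]:
        pattern = re.escape(label) + r"\s*[:\-]?\s*" + DATE_PATTERN
        match = re.search(pattern, text, re.I)
        if match:
            return match.group(1)
    return None


print(find_field("Inception Date: 01/01/2025", "effective_date"))
print(find_field("Term begins 01/01/2025", "effective_date"))  # unknown label
```

The second call returns None because 'Term begins' is not in the table, which is exactly the failure mode a new carrier introduces.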
Method 2: AI extraction (semantic, works across any carrier)
AI extraction reads the policy document semantically: it treats 'Policy Period: 01/01/2025 to 01/01/2026', 'Inception: January 1, 2025', and 'Effective: 1/1/25' as the same field, the policy's effective date. It can also read endorsements in the context of the base policy and return the effective (post-endorsement) values.
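Whichever method produces the raw value, it helps to normalize dates before storing or comparing them. The sketch below is an assumption about how that step might look rather than part of any extraction tool: `normalize_date` and the `DATE_FORMATS` list are hypothetical, the formats assume US month-first ordering, and month names assume an English locale.

```python
from datetime import date, datetime

# Assumed formats: US month-first numeric dates and English month names
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%B %d, %Y", "%b %d, %Y")


def normalize_date(raw: str) -> date:
    """Parse a date string in any of the known formats."""
    cleaned = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"Unrecognized date format: {raw!r}")


for raw in ("01/01/2025", "January 1, 2025", "1/1/25"):
    print(raw, "->", normalize_date(raw).isoformat())
```

All three spellings from the paragraph above normalize to 2025-01-01, so records from different carriers can be compared directly.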
Extracting exclusions
Exclusions are often the most important part of a policy to extract, and the hardest. They're buried in the body of the policy, often labeled by section reference ('Exclusion A-4') rather than descriptive headings. Common exclusions that brokers check for include pollution, cyber, PFAS/PFOA, mold, asbestos, professional services, and employment practices.
To extract exclusions, describe what you're looking for in plain language. A request such as 'List every exclusion as a row with the exclusion name and a one-sentence summary' returns a structured list rather than a wall of policy text. To target a single exclusion, name it: 'extract the pollution exclusion verbatim'.
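Once exclusions come back as rows, checking them against a broker's watchlist is a small step. The sketch below assumes each row is a dict with `name` and `summary` keys; that row shape, the `flag_exclusions` helper, and the sample rows are hypothetical.

```python
# Watchlist taken from the common exclusions brokers check for
WATCHLIST = [
    "pollution", "cyber", "PFAS", "PFOA", "mold",
    "asbestos", "professional services", "employment practices",
]


def flag_exclusions(rows):
    """Map each watchlist term to the names of exclusions that mention it."""
    hits = {}
    for row in rows:
        haystack = f"{row['name']} {row['summary']}".lower()
        for term in WATCHLIST:
            if term.lower() in haystack:
                hits.setdefault(term, []).append(row["name"])
    return hits


# Made-up rows in the assumed name/summary shape
rows = [
    {"name": "Total Pollution Exclusion",
     "summary": "Excludes claims arising from the release of pollutants."},
    {"name": "Exclusion A-4",
     "summary": "Excludes damage caused by mold, fungi, or bacteria."},
    {"name": "Exclusion B-2",
     "summary": "Excludes loss caused by war or military action."},
]
print(flag_exclusions(rows))
```

This is plain substring matching, so it can over-match (for example 'mold' inside 'molding') and it misses paraphrases; treat a hit as a prompt to read the clause, not as a conclusion.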
Policy comparison for renewals
At renewal time, brokers compare expiring and renewing policies to identify coverage changes. Extract the same fields from both PDFs and diff the outputs: a limit that dropped from $2M to $1M, an exclusion added, a deductible that changed. Without extraction, this comparison is done by eye — a slow, error-prone process on 50-page documents.
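A minimal sketch of that diff, assuming both policies have already been extracted into flat dicts with the same keys. The `diff_policies` helper, the field names, and the sample values are hypothetical.

```python
def diff_policies(expiring: dict, renewing: dict) -> list:
    """Return (field, old, new) tuples for every difference.

    List-valued fields (e.g. exclusions) are compared item by item:
    an added item appears as (field, None, item), a removed one as
    (field, item, None).
    """
    changes = []
    for key in sorted(set(expiring) | set(renewing)):
        old, new = expiring.get(key), renewing.get(key)
        if isinstance(old, list) or isinstance(new, list):
            old_items, new_items = set(old or []), set(new or [])
            for item in sorted(new_items - old_items):
                changes.append((key, None, item))
            for item in sorted(old_items - new_items):
                changes.append((key, item, None))
        elif old != new:
            changes.append((key, old, new))
    return changes


# Made-up extraction results for the two policy terms
expiring = {"per_occurrence_limit": 2_000_000, "deductible": 5_000,
            "exclusions": ["Pollution"]}
renewing = {"per_occurrence_limit": 1_000_000, "deductible": 10_000,
            "exclusions": ["Pollution", "Cyber"]}

for field, old, new in diff_policies(expiring, renewing):
    print(f"{field}: {old} -> {new}")
```

The diff is only as reliable as the extraction feeding it, so both PDFs should go through the same extraction method and the same field names.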
Building a searchable policy library
For a brokerage with thousands of policies, the highest-value use of extraction is building a queryable database: every policy as a JSON record, indexed by carrier, named insured, line of business, effective dates, limits, and expiration. Renewal outreach, compliance audits, and client reporting all become database queries instead of manual document reviews.
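As a rough illustration, Python's built-in sqlite3 module is enough to prototype such a library. The table layout, the `add_policy` and `expiring_between` helpers, and the sample records below are assumptions made for this sketch; a production system would likely need more columns and a real database.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent library
conn.execute(
    """
    CREATE TABLE policies (
        policy_number    TEXT PRIMARY KEY,
        carrier          TEXT,
        named_insured    TEXT,
        line_of_business TEXT,
        effective_date   TEXT,  -- ISO 8601, so string order equals date order
        expiration_date  TEXT,
        record_json      TEXT   -- the full extracted record
    )
    """
)
conn.execute("CREATE INDEX idx_expiration ON policies (expiration_date)")

INDEXED = ("policy_number", "carrier", "named_insured",
           "line_of_business", "effective_date", "expiration_date")


def add_policy(record: dict) -> None:
    """Store the indexed columns plus the whole record as JSON."""
    values = [record.get(col) for col in INDEXED] + [json.dumps(record)]
    conn.execute(
        "INSERT OR REPLACE INTO policies VALUES (?, ?, ?, ?, ?, ?, ?)", values
    )


def expiring_between(start: str, end: str) -> list:
    """Policies whose expiration date falls in [start, end], soonest first."""
    cursor = conn.execute(
        "SELECT policy_number, named_insured, expiration_date FROM policies "
        "WHERE expiration_date BETWEEN ? AND ? ORDER BY expiration_date",
        (start, end),
    )
    return cursor.fetchall()


# Made-up records, for illustration only
add_policy({"policy_number": "GL-0000001", "carrier": "Example Mutual",
            "named_insured": "Acme Widgets LLC", "line_of_business": "CGL",
            "effective_date": "2025-01-01", "expiration_date": "2026-01-01"})
add_policy({"policy_number": "PL-0000002", "carrier": "Sample Insurance Co",
            "named_insured": "Birch Consulting Inc", "line_of_business": "E&O",
            "effective_date": "2025-06-15", "expiration_date": "2026-06-15"})

print(expiring_between("2026-01-01", "2026-03-31"))
```

With this in place, a renewal outreach list is a single call to `expiring_between` instead of a manual review of each document.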
Frequently asked questions
How do I extract coverage limits from an insurance policy PDF?
Coverage limits are on the declarations page. For a single carrier, use pdfplumber to extract the dec page text, then apply regex patterns for labels such as 'Each Occurrence', 'General Aggregate', and 'Per Claim'. For multi-carrier portfolios, AI extraction handles the varying field labels without per-carrier patterns.
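The regex route for limits might look like the sketch below. It assumes the dec page text has already been extracted, that each limit sits on the same line as its label, and that amounts are written with a dollar sign; `extract_limits` and the sample text are hypothetical.

```python
import re

LIMIT_LABELS = ["Each Occurrence", "General Aggregate", "Per Claim"]


def extract_limits(dec_text: str) -> dict:
    """Return {label: amount in whole dollars} for each label found."""
    limits = {}
    for label in LIMIT_LABELS:
        # Label, then anything except '$' or a newline, then the amount
        pattern = re.escape(label) + r"[^$\n]*\$\s*([\d,]+)"
        match = re.search(pattern, dec_text, re.I)
        if match:
            limits[label] = int(match.group(1).replace(",", ""))
    return limits


# Made-up dec page excerpt
sample = """Each Occurrence Limit ........ $1,000,000
General Aggregate Limit ...... $2,000,000"""
print(extract_limits(sample))
```

Labels that do not appear, such as 'Per Claim' in this sample, are simply left out of the result.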
Can I extract data from all types of insurance policies?
Yes: auto, home, commercial general liability (CGL), professional liability (E&O), D&O, health, and life. The fields differ by policy type (a CGL policy has per-occurrence and aggregate limits; a life policy has a death benefit and beneficiaries), but AI extraction can handle each type given a type-appropriate schema.
How do endorsements affect data extraction?
Endorsements modify the base policy — they can change limits, add exclusions, or add additional insureds. When extracting, specify whether you want the base policy fields or the effective (post-endorsement) values. AI extraction can read the endorsements in context and return the final effective values.
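The difference between base and effective values can be sketched as a merge. This assumes endorsements have already been extracted into a simple change format (`set_limits`, `add_exclusions`, `add_additional_insureds`); that format, the `apply_endorsements` helper, and the sample data are hypothetical, and real endorsements can make changes this sketch does not model, such as deleting an exclusion.

```python
import copy


def apply_endorsements(base: dict, endorsements: list) -> dict:
    """Return effective (post-endorsement) values without mutating base."""
    effective = copy.deepcopy(base)
    for change in endorsements:  # assumed to be in the order they take effect
        effective["limits"].update(change.get("set_limits", {}))
        for field in ("exclusions", "additional_insureds"):
            for item in change.get(f"add_{field}", []):
                if item not in effective[field]:
                    effective[field].append(item)
    return effective


# Made-up base policy and endorsements
base = {
    "limits": {"each_occurrence": 1_000_000, "general_aggregate": 2_000_000},
    "exclusions": ["Pollution"],
    "additional_insureds": [],
}
endorsements = [
    {"set_limits": {"general_aggregate": 3_000_000}},
    {"add_exclusions": ["Mold"],
     "add_additional_insureds": ["Acme Property Management LLC"]},
]

effective = apply_endorsements(base, endorsements)
print(effective)
```

Keeping the base record untouched means both views stay available: the base fields for audit, the effective fields for answering what the policy covers today.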
How do I extract exclusions from an insurance policy?
Describe what you want: 'list every exclusion as a row with name and one-sentence summary' or 'extract the pollution exclusion verbatim'. AI extraction finds the relevant sections regardless of where they appear in the document.