Workflow | April 4, 2026 | 5 min read

How to extract key information from emails

Sender, dates, attachments, action items, deal numbers — emails are unstructured text wrapping structured data. Here's how to pull the structure out reliably for CRM, support, or finance workflows.

By Dawid Sibinski

Two layers of information live in every email. The headers and metadata (from, to, subject, dates, attachments, message-id) are trivially structured. The body is the hard part — it's prose that may contain order numbers, dates, dollar amounts, action items, signatures, and quoted history from a long thread.

Headers and structure: easy

Python's stdlib email.parser handles the headers cleanly:

    from email import message_from_bytes
    from email.policy import default

    with open("message.eml", "rb") as f:
        msg = message_from_bytes(f.read(), policy=default)

    print(msg["From"], msg["Subject"], msg["Date"])
    for part in msg.iter_attachments():
        print(part.get_filename())

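For Gmail or Outlook, call the API directly: Gmail API messages.get with format=full, or Microsoft Graph /me/messages. Gmail returns parsed headers plus the MIME parts, with body data base64url-encoded. Graph returns the body as HTML by default and as plain text when the request carries the header Prefer: outlook.body-content-type="text".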
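
As a rough illustration of the Gmail side, the sketch below walks the payload of a messages.get response (format=full) that has already been fetched and parsed into a dict. The helper names header_map and find_body are made up for this example, and it assumes the body text is UTF-8, which real mail does not guarantee.

```python
import base64


def header_map(payload):
    """Collapse Gmail's list of {"name", "value"} headers into a dict keyed by lowercase name."""
    return {h["name"].lower(): h["value"] for h in payload.get("headers", [])}


def find_body(payload, mime_type="text/plain"):
    """Depth-first search of the MIME tree for the first part of the given type."""
    if payload.get("mimeType") == mime_type:
        data = payload.get("body", {}).get("data")
        if data:
            # Body data is URL-safe base64; restore padding in case it was stripped.
            padded = data + "=" * (-len(data) % 4)
            # Assumption: UTF-8 text. Undecodable bytes are replaced, not raised.
            return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    for part in payload.get("parts", []):
        text = find_body(part, mime_type)
        if text is not None:
            return text
    return None
```

Calling find_body(message["payload"]) returns the plain-text part if there is one, and find_body(message["payload"], "text/html") returns the HTML alternative.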

Body extraction: harder than it looks

Three problems compound:

  • Replies and forwards quote prior messages, often with broken delimiters.
  • Signatures vary per sender and per device.
  • HTML emails have inline styles, tracking pixels, and footer disclaimers that aren't part of the actual content.
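
For the HTML problem, a minimal stdlib-only sketch is shown here: it keeps visible text and drops style and script content. Tracking pixels disappear on their own because img tags carry no text. The class and function names are illustrative, and footer disclaimers are not detected; those still need a separate rule.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of non-visible elements."""

    SKIP = {"style", "script", "head", "title"}
    BLOCK = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")  # block-level tags start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)
```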

The talon library (open-sourced by Mailgun) handles signature and quote extraction in Python. It is not perfect, but it is the best free option. talon.signature.bruteforce.extract_signature and talon.quotations.extract_from work well enough for most flows.
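
To make the problem concrete, here is a deliberately naive, stdlib-only sketch of quote and signature stripping. It is not how talon works internally; it only handles ">"-prefixed quotes, an English "On ... wrote:" header, and the conventional "-- " signature separator. The function name is made up for this example.

```python
import re

# Matches lines like "On Mon, 3 Mar 2025 at 10:12, Jane Doe <jane@example.com> wrote:"
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")
# Conventional signature separator: two dashes, optionally followed by a space.
SIG_DELIMITER = re.compile(r"^-- ?$")


def strip_quotes_and_signature(body):
    """Return (reply_text, signature) using simple line-based heuristics."""
    kept = []
    for line in body.splitlines():
        if REPLY_HEADER.match(line):
            break  # everything below is treated as quoted history
        if line.lstrip().startswith(">"):
            continue  # skip individual quoted lines
        kept.append(line)

    signature = ""
    for i, line in enumerate(kept):
        if SIG_DELIMITER.match(line):
            signature = "\n".join(kept[i + 1:]).strip()
            kept = kept[:i]
            break
    return "\n".join(kept).strip(), signature
```

Senders who sign off with "Sent from my phone", localized reply headers, or no separator at all will slip through, which is the gap a dedicated library is meant to cover.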

Pulling structured fields out of the body

Once you have the clean body, the typical needs are:

  • Order/invoice/ticket numbers: regex on known prefixes works for in-house systems.
  • Dates and times: dateparser (Python) handles natural-language dates in 200+ locales.
  • Dollar amounts: regex catches them; libraries like price-parser handle currency edge cases.
  • Action items: best done with an LLM, not regex.
  • Signatures and contact details: talon for the signature block, plus validators such as phonenumbers and email-validator for the phone numbers and addresses inside it.
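
The regex items in the list above might look like the sketch below. The INV/ORD/TKT prefixes are hypothetical stand-ins for whatever your own systems emit, and the amount pattern only covers US-style dollar figures; dates and other currencies are left to the libraries named in the list.

```python
import re
from decimal import Decimal

# Hypothetical in-house prefixes; replace with the ones your systems emit.
REFERENCE = re.compile(r"\b(INV|ORD|TKT)-(\d{3,10})\b")
# Dollar amounts such as $1,250.00 or $40 (optional two-digit cents).
DOLLARS = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def extract_references(text):
    """Return reference numbers like 'INV-10422' in order of appearance."""
    return [f"{prefix}-{number}" for prefix, number in REFERENCE.findall(text)]


def extract_dollar_amounts(text):
    """Return dollar amounts as Decimal values, with thousands separators removed."""
    amounts = []
    for whole, cents in DOLLARS.findall(text):
        amounts.append(Decimal(whole.replace(",", "") + cents))
    return amounts
```

Decimal is used rather than float so that amounts survive a round trip into a finance system without binary rounding noise.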

When the email body is the document

Common patterns where the body itself contains the unstructured data you want:

  • Customer support: extract intent, sentiment, urgency for routing.
  • Sales: extract company name, contact role, deal stage signals from prospect replies.
  • Operations: extract shipping notifications, status updates, and exception alerts from system emails.

An LLM with a target schema beats handwritten regex for all of these once volume is more than a few dozen messages a day.
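
One way to picture "an LLM with a target schema" is the sketch below, built around the support-routing case. The SupportTriage fields, the urgency levels, and the call_llm parameter are all assumptions for illustration: call_llm stands for any function you supply that takes a prompt string and returns the model's text, so no particular provider API is implied.

```python
import json
from dataclasses import dataclass, fields

URGENCY_LEVELS = {"low", "normal", "high"}  # assumed routing levels


@dataclass
class SupportTriage:
    intent: str
    sentiment: str
    urgency: str


def build_prompt(body):
    keys = ", ".join(f.name for f in fields(SupportTriage))
    return (
        "Read the customer email below and reply with a single JSON object "
        f"containing exactly these string keys: {keys}. "
        f"urgency must be one of {sorted(URGENCY_LEVELS)}.\n\n"
        f"EMAIL:\n{body}"
    )


def triage(body, call_llm):
    """call_llm is a hypothetical callable: prompt string in, response string out."""
    raw = call_llm(build_prompt(body))
    data = json.loads(raw)  # raises a ValueError subclass on non-JSON output
    expected = {f.name for f in fields(SupportTriage)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError("response does not match the target schema")
    if data["urgency"] not in URGENCY_LEVELS:
        raise ValueError(f"unexpected urgency: {data['urgency']!r}")
    return SupportTriage(**data)
```

Validating the response against the schema before it reaches the router matters as much as the prompt: a rejected message can be retried or sent to a human instead of being misfiled.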

When the value is in attachments

If the email is a transport for an invoice PDF or a signed contract, the email body matters less than the attachment. ExtractFox's API takes file uploads or URLs: wire your inbox processor (Make, Zapier, or custom code) to fetch attachments matching a rule, hand them to the right extractor, and drop the results into your accounting or CRM system. The email becomes the trigger; the document becomes the data.
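
For a custom inbox processor, the "attachments matching a rule" step could be sketched with the stdlib alone. The function name and the PDF-only default rule are assumptions for this example, and the upload to an extractor is left out because it depends on that service's API.

```python
from email import message_from_bytes
from email.policy import default


def matching_attachments(raw_bytes, content_types=("application/pdf",)):
    """Yield (filename, payload_bytes) for attachments whose MIME type matches the rule."""
    msg = message_from_bytes(raw_bytes, policy=default)
    for part in msg.iter_attachments():
        if part.get_content_type() in content_types:
            # decode=True undoes the base64 transfer encoding and returns raw bytes.
            yield part.get_filename(), part.get_payload(decode=True)
```

Each yielded pair can then be passed to whatever upload function your extractor client provides.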
