Engineering · March 19, 2026 · 5 min read

How to extract facts from text

Named entities, claims, relations — what "fact extraction" actually means in NLP, the libraries that handle each piece, and how the LLM era changed which ones are worth using.

By Dawid Sibinski

"Extract facts from text" sounds like one task. In NLP it's three: named entity recognition (find the names), relation extraction (link them), and claim extraction (find statements that assert something). Pick the wrong subtask and you'll spend a week building the wrong thing.

Named entities: a solved problem

spaCy and Stanza both ship with pretrained NER models for ~20 languages. They identify people, organizations, locations, dates, money, and more:

    import spacy

    # Transformer pipeline; en_core_web_sm is the fast CPU alternative.
    nlp = spacy.load("en_core_web_trf")
    doc = nlp("Acme Corp acquired Foo Inc in 2023 for $400M.")
    for ent in doc.ents:
        print(ent.text, ent.label_)

Output: Acme Corp ORG, Foo Inc ORG, 2023 DATE, $400M MONEY. Free and accurate enough for most catalog and search use cases; use the small en_core_web_sm model when you need CPU-friendly speed over transformer accuracy.
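Stanza's API is close to spaCy's. A minimal sketch with the same sentence (the first run downloads the English models):

    import stanza

    stanza.download("en")  # one-time model download
    nlp = stanza.Pipeline("en", processors="tokenize,ner")
    doc = nlp("Acme Corp acquired Foo Inc in 2023 for $400M.")
    for ent in doc.ents:
        print(ent.text, ent.type)  # Stanza uses .type rather than spaCy's .label_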

Relation extraction: harder

Knowing that "Acme" and "Foo" appear in the same sentence isn't enough — you want the relation (acquired) and the direction (Acme → Foo). Pretrained relation extraction is patchy. Options:

  • OpenIE (from Stanford CoreNLP) — extracts subject–verb–object triples without supervision. It works, but the output is noisy.
  • REBEL — a newer transformer model that generates (subject, relation, object) triples directly from text.
  • An LLM with a typed schema — by far the most reliable option in 2024+, especially for domain-specific relations; see the sketch after this list.
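A minimal sketch of the typed-schema approach. The call_llm helper is a placeholder for whatever provider client you use, and the schema and prompt wording are illustrative, not a prescribed setup:

    import json
    from pydantic import BaseModel

    class Relation(BaseModel):
        subject: str
        relation: str  # e.g. "acquired"
        object: str

    PROMPT = (
        "Extract every relation in the text as JSON: "
        '{"relations": [{"subject": ..., "relation": ..., "object": ...}]}. '
        "Only include relations the text explicitly states.\n\nText: "
    )

    def call_llm(prompt: str) -> str:
        # Placeholder: wire this to your provider's completion client,
        # ideally with JSON-mode or structured-output enforcement enabled.
        raise NotImplementedError

    def extract_relations(text: str) -> list[Relation]:
        payload = json.loads(call_llm(PROMPT + text))
        return [Relation.model_validate(r) for r in payload["relations"]]

Validating against a typed model means a malformed extraction fails loudly at parse time instead of leaking bad triples downstream.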

Claim extraction: hardest

"Pull every assertion this article makes about climate policy" is what most people actually mean by "extract facts." Pre-LLM, this was an open research problem. With an LLM and a typed schema (claim, claimant, evidence_quote, page_number), it's tractable for most domains.

Practical setup: hybrid

  1. Use spaCy or Stanza for fast NER as a first pass — cheap, repeatable, deterministic.
  2. Use an LLM with a JSON schema for relations and claims — more expensive and more capable; ground every extraction in a quote span.
  3. Validate everything against a source quote, as sketched below — for fact extraction this is non-negotiable, since hallucinated facts are worse than missing ones.
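Step 3 can be a single substring check. A minimal sketch; the whitespace normalization is an assumption about how forgiving the match should be:

    def is_grounded(evidence_quote: str, source_text: str) -> bool:
        """Accept an extraction only if its quote appears verbatim in the source."""
        def normalize(s: str) -> str:
            return " ".join(s.split())  # tolerate line-break and spacing drift
        return normalize(evidence_quote) in normalize(source_text)

    # Usage: keep only claims the model actually grounded in the document.
    # claims = [c for c in claims if is_grounded(c.evidence_quote, source_text)]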

Domain-specific facts

  • Legal: Blackstone (a spaCy-based pipeline) for case-law NER and clause detection.
  • Medical: scispaCy and BioBERT for biomedical entities and relations (see the sketch after this list).
  • Finance: a custom LLM schema usually outperforms domain libraries here.
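For the biomedical case, a minimal scispaCy sketch, assuming the en_core_sci_sm model has been installed alongside the scispacy package (the example sentence is illustrative):

    import spacy

    # en_core_sci_sm is scispaCy's small biomedical pipeline; it is
    # installed separately from the scispaCy release page.
    nlp = spacy.load("en_core_sci_sm")
    doc = nlp("Metformin reduces hepatic glucose production in type 2 diabetes.")
    for ent in doc.ents:
        print(ent.text, ent.label_)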

When facts live in documents, not raw text

Most fact extraction in production starts from PDFs and images, not clean text strings. ExtractFox handles the OCR-and-layout step; pipe its structured output into your fact extractor downstream. Getting the document-to-text step right is the fastest way to make the rest of the pipeline reliable.


Stop reading, start extracting

Drop a PDF or image into ExtractFox and get structured data back in seconds.

Try a free extraction →