Industry · March 5, 2026 · 6 min read

What "unstructured data" actually means — and how to extract from it

Unstructured data isn't disorganized data — it's data without a schema. Here's the practical taxonomy, why it became extractable in the last few years, and the patterns that work.

By Dawid Sibinski

Estimates put unstructured data at somewhere between 80% and 90% of all data inside an enterprise. The number is shaky but the shape is right: most of what an organization knows is in documents, emails, transcripts, images, and PDFs — not in databases. Until recently, that data was effectively read-only for software.

The actual definition

"Unstructured" doesn't mean disorganized. A scanned invoice has plenty of structure — a person can find the total in two seconds. What it lacks is a machine-readable schema. The data is there; the contract for getting at it isn't.

The useful taxonomy is three buckets:

  • Structured: rows and columns, fixed schema. Databases, CSVs, well-formed JSON.
  • Semi-structured: schema present but flexible. Loose JSON, XML with optional fields, log files with consistent keys.
  • Unstructured: no machine schema at all. PDFs, scanned documents, images, audio, video, free-text email bodies.

Why it suddenly became extractable

The two unlocks:

  1. Multimodal models that read documents the way a person does, instead of token-by-token from OCR. They see the page — header, body, totals — and can map it to whatever schema you ask for.
  2. Structured output as a first-class API feature. Modern models will reliably return JSON conforming to a schema you define, not free-form prose you have to parse.

Together, these turn extraction from a parser-engineering project into a schema-design project. You describe what you want; the model maps the document to it.
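What "schema-design project" means in practice can be sketched in a few lines: define the fields you want, then validate whatever JSON the model returns against them. This is a minimal stdlib-only sketch; the field names and the `validate_extraction` helper are illustrative assumptions, not a real API.

```python
import json

# Illustrative invoice schema: the fields you ask the model for.
INVOICE_SCHEMA = {
    "vendor": str,          # who issued the invoice
    "invoice_number": str,
    "total": float,         # grand total as a number, not a string
}

def validate_extraction(raw_json: str, schema: dict) -> dict:
    """Parse model output and check every field exists with the right type."""
    data = json.loads(raw_json)
    for field, expected_type in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
    return data

# A well-formed model response passes straight through.
extracted = validate_extraction(
    '{"vendor": "Acme Corp", "invoice_number": "INV-042", "total": 119.5}',
    INVOICE_SCHEMA,
)
```

The point is that all the engineering effort lives in the schema and the validation step, not in a parser for each document layout.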

Patterns that work in production

1. Pre-built schema for known document types

Invoices, receipts, contracts, and bank statements all have predictable shapes. Define the schema once, run every incoming document through it. This is what most ExtractFox tools do under the hood.

2. Schema generation for one-off requests

When you don't know what fields a document has, ask the model. "Extract everything that looks like a key-value pair from this PDF" returns a useful first pass. Refine with a second extraction call against the schema you got.
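To show the shape of that first pass, here is a deterministic stand-in: pull out anything that looks like a "Key: value" line from raw document text. In practice the model does this far more robustly across layouts; this sketch only illustrates the kind of result you would then refine into a schema for the second extraction call.

```python
import re

def rough_key_values(text: str) -> dict:
    """Collect 'Key: value' lines into a dict as a first-pass schema guess."""
    pairs = {}
    for match in re.finditer(r"^([A-Za-z][A-Za-z ]+):\s*(.+)$", text, re.MULTILINE):
        # Normalize keys into snake_case field names.
        key = match.group(1).strip().lower().replace(" ", "_")
        pairs[key] = match.group(2).strip()
    return pairs

sample = """Invoice Number: INV-042
Due Date: 2026-04-01
Total: 119.50"""
first_pass = rough_key_values(sample)
```

The keys of `first_pass` become the candidate schema; the second call re-extracts against that schema with proper types.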

3. Confidence-aware human review

Models will give you confidence signals if you ask. Route high-confidence extractions straight through; queue low-confidence ones for a person to verify. This is the difference between a demo and a production pipeline.
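The routing logic itself is small. A sketch, assuming field-level confidence scores come back alongside the extraction; the 0.9 cutoff and the `confidences` format are assumptions to tune per use case:

```python
REVIEW_THRESHOLD = 0.9  # assumed cutoff; tune against your own error tolerance

def route(extraction: dict) -> str:
    """Return 'auto' if every field clears the threshold, else 'review'."""
    confidences = extraction.get("confidences", {})
    if confidences and min(confidences.values()) >= REVIEW_THRESHOLD:
        return "auto"
    # Missing or low confidence on any field means a person checks it.
    return "review"

decision_hi = route({"total": 119.5, "confidences": {"total": 0.97}})
decision_lo = route({"total": 119.5, "confidences": {"total": 0.62}})
```

Gating on the minimum field confidence, rather than the average, means one shaky field is enough to trigger review.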

What's still hard

  • Extremely long documents (200+ pages) — context limits matter; chunk strategically.
  • Documents that mix languages mid-page.
  • Tables that span pages with footers in between.
  • Hand-drawn diagrams and complex chemical or mathematical notation.

These aren't unsolvable, but they need careful prompting and often a second pass. Most real-world extraction needs aren't in this category.
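"Chunk strategically" can be as simple as overlapping windows of pages, so a table or clause cut at a boundary still appears whole in at least one chunk. A sketch; the window and overlap sizes are assumptions to tune against your model's context limit:

```python
def chunk_pages(pages: list, window: int = 20, overlap: int = 2) -> list:
    """Group a list of page texts into overlapping windows of pages."""
    chunks = []
    step = window - overlap
    for start in range(0, len(pages), step):
        chunks.append(pages[start:start + window])
        if start + window >= len(pages):
            break  # last window already reaches the end
    return chunks

# 200 pages -> windows of 20 pages, each sharing 2 pages with the next.
chunks = chunk_pages([f"page {i}" for i in range(200)], window=20, overlap=2)
```

Each chunk is extracted independently, and the overlapping pages let a merge step reconcile anything split across a boundary.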

Where to start

Pick the most painful unstructured-data workflow in your organization — the one where someone re-keys data from PDFs every week. Run a sample through a multimodal extractor, measure accuracy, decide whether to wire it into a pipeline or keep it as a manual-trigger tool. Almost no one regrets removing the typing step.

