Tutorial · April 5, 2026 · 4 min read

How to extract formulas from an Excel file or PDF

Showing all formulas in a sheet, exporting them programmatically with openpyxl, and pulling math from a PDF where the formulas are rendered images, not LaTeX.

By Dawid Sibinski

Two very different tasks share the phrase "extract formulas." One is auditing an Excel sheet to see what's calculated where. The other is reading mathematical expressions out of a PDF (research paper, textbook). The tools don't overlap.

Excel: see every formula at once

Show Formulas mode: Formulas tab → Show Formulas (Ctrl+`). Every cell flips from value to formula. Print, screenshot, or save-as PDF in this mode for an audit-ready view of the whole sheet.

Excel: dump formulas to a flat list

openpyxl is the canonical Python library:

    from openpyxl import load_workbook

    wb = load_workbook("model.xlsx", data_only=False)
    for sheet in wb.sheetnames:
        ws = wb[sheet]
        for row in ws.iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value.startswith("="):
                    print(f"{sheet}!{cell.coordinate}: {cell.value}")

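Crucial flag: data_only=False. This is the default and returns the formula text; data_only=True returns the value Excel cached the last time it saved the file, which is None if the workbook has never been calculated and saved by Excel. Output is one line per formula, ready to dump to CSV for audit or migration.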
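For the CSV step, a minimal sketch might look like the following. The helper names (formula_rows, iter_cells, write_formulas_csv) and the output file name are illustrative assumptions, not part of openpyxl; only load_workbook, iter_rows, and the cell attributes come from the library.

```python
import csv


def formula_rows(cells):
    """Yield (sheet, coordinate, formula) for cells whose value is a formula string."""
    for sheet, coord, value in cells:
        if isinstance(value, str) and value.startswith("="):
            yield sheet, coord, value


def iter_cells(path):
    """Yield (sheet title, coordinate, value) for every cell in the workbook."""
    from openpyxl import load_workbook  # third-party; imported only when needed

    wb = load_workbook(path, data_only=False)  # keep formula text, not cached values
    for ws in wb.worksheets:
        for row in ws.iter_rows():
            for cell in row:
                yield ws.title, cell.coordinate, cell.value


def write_formulas_csv(cells, out):
    """Write a header plus one CSV row per formula; return the number of formulas."""
    writer = csv.writer(out)
    writer.writerow(["sheet", "cell", "formula"])
    count = 0
    for row in formula_rows(cells):
        writer.writerow(row)
        count += 1
    return count


# Example usage (file names are placeholders):
#   with open("formulas.csv", "w", newline="", encoding="utf-8") as f:
#       write_formulas_csv(iter_cells("model.xlsx"), f)
```

Keeping the filter separate from the workbook reader means the same CSV writer can be fed from any source of (sheet, cell, value) tuples.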

Excel: visual map of formula cells

Home → Find & Select → Formulas selects every formula cell on the active sheet. This is useful before a refactor: you see the calculation skeleton at a glance and can spot orphan formulas or cells that should be calculated but aren't.

PDF: math formulas as text or LaTeX

Most academic PDFs render math as graphics, not selectable text. Three options:

Mathpix Snip — paid, the gold standard. Snip a region, get LaTeX out. Mature and accurate.
PaddleOCR with the formula model — open-source, decent but inconsistent on complex notation.
Multimodal LLM — paste the screenshot, ask for LaTeX. Modern models (Claude, GPT-4o, Gemini) handle most printed math accurately.

PDF: when the formulas are typed text (not graphics)

Some born-digital PDFs encode math as Unicode (∫, ∑, π) plus regular text. Standard PDF text extraction (pdfplumber, pdfminer.six) gets these. The output isn't LaTeX, but for indexing or search it's enough.
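As a rough sketch of that indexing use case, the snippet below pulls page text with pdfplumber and keeps only the lines that contain a math-looking character. The helper names and the heuristic itself (non-ASCII characters in Unicode category Sm, plus the Greek block) are assumptions made for illustration; tune them to your documents.

```python
import unicodedata


def is_math_char(ch):
    """Heuristic: non-ASCII math symbols (category Sm) or Greek letters."""
    non_ascii_symbol = unicodedata.category(ch) == "Sm" and ord(ch) > 127
    greek = "\u0370" <= ch <= "\u03ff"
    return non_ascii_symbol or greek


def math_lines(text):
    """Return the lines of text that contain at least one math-looking character."""
    return [line for line in text.splitlines() if any(is_math_char(c) for c in line)]


def pdf_math_lines(path):
    """Return (page number, line) pairs for math-looking lines in a PDF."""
    import pdfplumber  # third-party; imported only when needed

    found = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""  # guard against pages with no text
            found.extend((number, line) for line in math_lines(text))
    return found


# Example usage (file name is a placeholder):
#   for page_number, line in pdf_math_lines("paper.pdf"):
#       print(page_number, line)
```

ASCII operators such as = and + are deliberately ignored here, since they would match nearly every line of ordinary prose and tables.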

Edge case: reverse-engineering a sheet's logic

If the goal is understanding the model, not just listing cells: combine the openpyxl dump with a graph library to render a dependency tree. cellx (PyPI) does this — feed it the workbook, get a Graphviz diagram of which cells feed which.
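To make the dependency idea concrete, here is a small standard-library sketch that turns a formula dump into Graphviz DOT text. It is not how any particular package works: the names REF, references, to_dot, and load_formulas are hypothetical, and the regular expression is a deliberate simplification that only recognises A1-style cells and ranges (no defined names, table references, or whole-column ranges).

```python
import re

# Optional sheet prefix, then an A1-style cell or range. The lookarounds skip
# function names such as LOG10( and fragments of longer identifiers.
REF = re.compile(
    r"(?<![A-Za-z0-9_.])"
    r"(?:(?:'[^']+'|[A-Za-z_][A-Za-z0-9_.]*)!)?"
    r"\$?[A-Z]{1,3}\$?[0-9]+(?::\$?[A-Z]{1,3}\$?[0-9]+)?"
    r"(?![A-Za-z0-9_(])"
)


def references(formula, sheet):
    """Return the distinct cell/range references in a formula, sheet-qualified."""
    body = re.sub(r'"[^"]*"', '""', formula)  # ignore text inside string literals
    refs = []
    for match in REF.finditer(body):
        ref = match.group(0).replace("$", "")  # treat $A$1 and A1 as the same cell
        if "!" not in ref:
            ref = f"{sheet}!{ref}"
        if ref not in refs:
            refs.append(ref)
    return refs


def to_dot(formulas):
    """Build DOT text from a dict mapping 'Sheet!A1' to its formula string."""
    lines = ["digraph deps {"]
    for target, formula in formulas.items():
        sheet = target.split("!", 1)[0]
        for source in references(formula, sheet):
            lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


def load_formulas(path):
    """Collect {'Sheet!A1': '=...'} from a workbook using openpyxl."""
    from openpyxl import load_workbook  # third-party; imported only when needed

    formulas = {}
    for ws in load_workbook(path).worksheets:  # data_only defaults to False
        for row in ws.iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value.startswith("="):
                    formulas[f"{ws.title}!{cell.coordinate}"] = cell.value
    return formulas


# Example usage (file name is a placeholder):
#   print(to_dot(load_formulas("model.xlsx")))   # pipe the output to `dot -Tsvg`
```

Ranges are kept as single nodes (for example "S!A1:A3") rather than expanded cell by cell, which keeps the diagram readable on sheets with large SUM ranges.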