How to extract data from a chart or graph (image or PDF)
Reverse-engineer bar, line, pie, and scatter charts back into numbers using WebPlotDigitizer, Python, and AI extraction. Works on screenshots, report PDFs, and dashboard photos.
Charts bury data. A line chart showing five years of revenue has dozens of specific numbers locked inside an image — and the only way to get them back is to reconstruct them. This is called chart digitization or chart data extraction, and it matters any time you need to analyze, compare, or re-plot data that was published only as a graphic.
Common scenarios: competitor annual reports that publish charts but not the underlying tables, scientific papers where the data is in a figure rather than the supplementary materials, dashboard screenshots from a tool you no longer have access to, and PDF reports where the data table wasn't included.
Method 1: WebPlotDigitizer (free, browser-based)
WebPlotDigitizer (automeris.io/WebPlotDigitizer) is the standard tool for manual chart digitization. You upload the chart image, mark the axis endpoints and calibrate the scale, then click on each data point. The tool returns a CSV of the coordinates mapped to the actual values.
- Upload the chart PNG or JPEG.
- Select the chart type (XY, bar, polar, ternary, map).
- Click the calibration points on the axes (for an XY chart, two known values on each axis, four in total) and enter their actual values.
- Use the point picker to click each data point — or enable automatic detection for clean charts.
- Export as CSV.
WebPlotDigitizer is accurate for clean charts with clear axis labels. Multi-series charts are tedious: each series is digitized separately, and unless automatic detection can tell the series apart, you pick every point by hand. For a bar chart with 20 bars in each of 4 series, that's 80 clicks. It isn't suited to pie charts, where the axis calibration model doesn't apply.
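The calibration step boils down to linear interpolation between two known points on each axis. The sketch below illustrates that idea only; it is not WebPlotDigitizer's code. The helper name `make_axis_mapper` and the pixel coordinates are made up for the example, and it assumes the axes are aligned with the image (no rotation or skew).

```python
import math

def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, log_scale=False):
    """Return a function that converts a pixel coordinate to a data value.

    pixel_a and pixel_b are the pixel positions of two calibration marks on
    one axis; value_a and value_b are the values printed at those marks.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration points must sit at different pixels")
    if log_scale:
        # On a log axis, pixels are linear in log10(value).
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        value = value_a + (pixel - pixel_a) * slope
        return 10 ** value if log_scale else value

    return to_value

# Image rows grow downward, so the larger y value has the smaller pixel row.
y_to_value = make_axis_mapper(pixel_a=400, value_a=0.0, pixel_b=100, value_b=300.0)
x_to_value = make_axis_mapper(pixel_a=50, value_a=2019, pixel_b=450, value_b=2023)

# A point clicked at pixel (250, 250) maps to (x_to_value(250), y_to_value(250)).
point = (x_to_value(250), y_to_value(250))
```

Building one mapper per axis keeps the x and y calibrations independent, which is why two marks per axis are enough for an unrotated XY chart.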
Method 2: Python for programmatic chart digitization
For batch processing — extracting the same chart from 50 quarterly reports — Python lets you automate what WebPlotDigitizer does manually. The approach depends on chart type.
For bar charts where each bar is a distinct color:
- pip install pillow numpy
- from PIL import Image; import numpy as np
- img = np.array(Image.open('chart.png').convert('RGB'))
- # Find the chart area by locating the axes (darker pixel rows/columns)
- # For each series color, create a mask and measure column heights
- # Map pixel height to value using the y-axis calibration
This works well for simple bar charts with clean colors. It breaks on anti-aliased edges, gradients, overlapping series, and any chart where the colors aren't distinct. You'd spend more time on the pixel logic than the data is worth unless you're processing dozens of identical charts.
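The color-mask steps above can be sketched as follows. This is a minimal illustration under several assumptions: the function name `bar_heights`, the tolerance of 30, and the synthetic test image are all invented for the example, the series color is assumed to appear nowhere else in the bar's columns (no legend swatch above the bars), and the y-axis is assumed to start at zero on the bar baseline.

```python
import numpy as np

def bar_heights(img, color, tolerance=30):
    """Measure bars of one series color in an RGB image array.

    Returns a list of (x_start, x_end, height_px), one entry per contiguous
    run of columns that contain the color.
    """
    # Sum of absolute channel differences between each pixel and the color.
    diff = np.abs(img.astype(int) - np.array(color)).sum(axis=2)
    mask = diff <= tolerance
    counts = mask.sum(axis=0)              # matching pixels in each column
    cols = np.flatnonzero(counts > 0)
    if cols.size == 0:
        return []
    # Split the column indices wherever there is a gap between bars.
    runs = np.split(cols, np.flatnonzero(np.diff(cols) > 1) + 1)
    # The median column count resists ragged, anti-aliased bar edges.
    return [(int(r[0]), int(r[-1]), int(np.median(counts[r]))) for r in runs]

# Synthetic chart: white background, two blue bars standing on row 90.
img = np.full((100, 120, 3), 255, dtype=np.uint8)
blue = (31, 119, 180)
img[50:90, 10:30] = blue                   # 40 px tall
img[10:90, 50:70] = blue                   # 80 px tall

bars = bar_heights(img, blue)
# Calibration: assume the y-axis runs from 0 at row 90 to 100 at row 10.
units_per_pixel = 100 / 80
values = [height * units_per_pixel for _, _, height in bars]
```

For a real file, `img` would come from `np.array(Image.open('chart.png').convert('RGB'))` as in the steps above.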
A more practical Python approach is to extract the chart from the PDF (using pdfplumber to locate the chart's bounding box on the page and crop to it), then hand the cropped image to an AI model for the actual data recovery.
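A rough sketch of the PDF cropping step, assuming pdfplumber is installed and the chart's bounding box is already known. The function names, the 300 DPI default, and the example box are illustrative choices, not part of any tool. For charts embedded as raster images, the entries in pdfplumber's `page.images` carry `x0`, `top`, `x1`, and `bottom` keys that can supply the box; vector-drawn charts do not appear there, so their box has to be found some other way.

```python
def points_to_pixels(bbox, dpi):
    """Convert a PDF box in points (1/72 inch) to pixel coordinates at dpi."""
    scale = dpi / 72
    return tuple(round(v * scale) for v in bbox)

def crop_chart(pdf_path, page_index, bbox, dpi=300):
    """Render the region bbox=(x0, top, x1, bottom) of one page as a PIL image."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # crop() restricts the page to the box; to_image() rasterizes it.
        return page.crop(bbox).to_image(resolution=dpi).original

# Size check for a 4 x 2 inch chart region rendered at 144 DPI.
example_box = (72, 144, 360, 288)
pixel_box = points_to_pixels(example_box, dpi=144)
```

Calling `crop_chart("report.pdf", 0, example_box).save("chart.png")` (with a hypothetical file name) would write the cropped chart to disk, ready to pass to whichever extractor you use.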
Method 3: AI extraction (any chart type, reads axis labels and legends)
AI extraction with a vision model reads the chart the way a person does — it identifies the chart type, reads the axis labels and scales, interprets the legend, and returns the data series as a table. The key advantage is that it understands context: it knows that a y-axis labeled 'Revenue ($M)' means the values should be multiplied by a million, and that a bar labeled '2023' maps to that year.
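Whatever model you use, its reply still has to be turned into numbers with the axis unit applied. The sketch below assumes the model was asked to answer with a two-column CSV and to report the y-axis title separately. The reply text, the function names, and the unit table are all hypothetical, and the unit pattern is a narrow heuristic that only recognizes an uppercase K, M, or B in parentheses.

```python
import csv
import io
import re

UNIT_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def axis_multiplier(axis_title):
    """Return the scale implied by a unit such as '($M)' in an axis title."""
    match = re.search(r"\(\s*[$€£¥]?\s*([KMB])\s*\)", axis_title)
    return UNIT_MULTIPLIERS[match.group(1)] if match else 1.0

def parse_series(csv_text, axis_title):
    """Parse a 'label,value' CSV reply and scale values by the axis unit."""
    scale = axis_multiplier(axis_title)
    rows = csv.reader(io.StringIO(csv_text.strip()))
    next(rows)  # skip the header row
    return [(label.strip(), float(value) * scale) for label, value in rows]

# Hypothetical model reply for a chart whose y-axis reads 'Revenue ($M)'.
reply = "year,revenue\n2021,12.5\n2022,14\n2023,17.25\n"
series = parse_series(reply, "Revenue ($M)")
```

Keeping the unit handling in your own code, rather than asking the model to multiply, makes it easy to see which scale was applied.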
Chart type difficulty ranking
Not all charts are equally hard to digitize. From easiest to hardest:
- Bar charts with clear labels on each bar — the values are often printed on the bars; no reconstruction needed.
- Line charts with few series and clear gridlines — AI can read the values at each data point from the gridline intersections.
- Pie/donut charts — sector angles must be measured; AI does this reasonably well if the labels and percentages are visible.
- Stacked bar charts — each segment's value requires subtracting adjacent stack layers; prone to accumulation error.
- Scatter plots — when points overlap, some are lost. AI recovers most points but dense regions are unreliable.
- Dual-axis charts — one axis per side with different scales; require AI to map series to the correct axis.
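Two of the reconstructions in the list above are simple arithmetic once the raw readings exist. The following sketch, with made-up function names and readings, shows the stacked-bar subtraction and the pie angle conversion.

```python
def stacked_segments(boundaries):
    """Turn cumulative stack boundaries into per-segment values.

    boundaries[0] is the baseline and each later entry is the top of the
    next layer, all already converted from pixels to axis values.
    """
    return [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]

def pie_shares(angles_deg):
    """Convert measured sector angles into percentage shares."""
    total = sum(angles_deg)   # normalizing absorbs small measurement drift
    return [100 * angle / total for angle in angles_deg]

segments = stacked_segments([0, 40, 65, 100])
shares = pie_shares([180, 90, 90])
```

Each stacked segment depends on two boundary readings, so a reading error of up to e on each boundary can put that segment off by as much as 2e. That is the source of the error build-up noted for stacked bars.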
When the chart has printed value labels
The easiest case: many charts print the value directly on or next to each bar or point. In this case, you don't need to reconstruct values from pixel positions at all — you just extract the text. This is simpler and more accurate than any pixel-position approach. If your chart has labels, use an image-to-text or AI extractor and ask it to 'list each bar label and its value as a table'.
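If the extractor hands back plain text rather than a table, a small parser can pair each label with its number. This is a sketch only: the input lines are invented, and the pattern assumes each line ends with a single value, optionally followed by a percent sign.

```python
import re

# Label, whitespace, then a number with optional thousands separators and %.
LINE = re.compile(r"^(?P<label>.+?)\s+(?P<value>[-+]?[\d,]*\.?\d+)\s*(?P<unit>%)?$")

def parse_labelled_values(text):
    """Return (label, value, unit) for every line that ends in a number."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            value = float(match["value"].replace(",", ""))
            rows.append((match["label"], value, match["unit"] or ""))
    return rows

# Hypothetical text read from a chart's printed labels.
ocr_text = """Q1 2023  1,250
Q2 2023  1,410.5
Q3 2023  98%
Chart title with no number"""

table = parse_labelled_values(ocr_text)
```

Lines without a trailing number, such as a chart title, are skipped rather than guessed at.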
Recovering chart data from a locked PDF
When a report is a password-locked or print-only PDF, you can still screenshot individual pages and run the chart extractor on the screenshots. The quality depends on screenshot resolution: higher DPI means better axis-label legibility and more accurate value recovery. On a Mac with a Retina display, screenshots are captured at 2x pixel density; zoom in on the chart before capturing to get even more pixels per label.
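To see why resolution matters, consider how much of the axis a single pixel represents. The numbers below are illustrative, not measurements from any particular chart.

```python
def value_per_pixel(axis_min, axis_max, axis_length_px):
    """Axis units covered by one pixel along the axis."""
    return (axis_max - axis_min) / axis_length_px

# The same 0 to 500 axis captured at two screenshot resolutions.
standard = value_per_pixel(0, 500, 250)   # axis spans 250 px
doubled = value_per_pixel(0, 500, 500)    # axis spans 500 px at 2x
```

With these example figures, a one-pixel misreading costs 2 axis units in the smaller capture and 1 unit in the 2x capture, so doubling the resolution halves the smallest error a position-based reading can achieve.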
Frequently asked questions
How do I extract data from a chart image?
For manual extraction of a clean chart, WebPlotDigitizer (free, browser-based) lets you calibrate the axes and click each point. For automated extraction of any chart type, AI extraction reads the axis labels and legend and returns the data series as a table — without calibration steps.
Can I extract data from a pie chart?
Yes. Pie charts are harder than bar or line charts because values must be inferred from sector angles, but AI extraction handles them well when the labels and percentages are visible. If percentages are printed in the chart, the extractor reads them directly instead of estimating angles, so the result matches the printed figures.
How do I extract data from a chart in a PDF?
Screenshot the relevant PDF page (higher resolution is better) and run it through a chart data extractor. Alternatively, if the PDF has an embedded chart from Excel or PowerPoint, sometimes the source data is still embedded — check with a PDF editor before going the screenshot route.
How accurate is AI chart data extraction?
For charts with printed value labels: exact, as long as the labels are legible. For bar charts with clear gridlines: typically within 1-2% of the true value. For dense scatter plots or unlabeled line charts with fine granularity: less reliable. Always sanity-check the output totals against any aggregate figures shown in the source.
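The sanity check can be as simple as comparing the sum of the extracted values with a total quoted in the source. In this sketch the function name, the sample figures, and the 2% tolerance (chosen to echo the range above) are all assumptions to adjust for your own data.

```python
def check_total(values, reported_total, rel_tol=0.02):
    """Compare the extracted sum with a reported aggregate.

    Returns (ok, relative_error), where ok is True if the sum is within
    rel_tol of the reported total.
    """
    extracted_total = sum(values)
    rel_error = abs(extracted_total - reported_total) / abs(reported_total)
    return rel_error <= rel_tol, rel_error

# Hypothetical quarterly values read from a chart, and two candidate totals.
quarters = [120, 135, 150, 160]
ok, error = check_total(quarters, reported_total=565)
mismatch_ok, mismatch_error = check_total(quarters, reported_total=600)
```

A failed check does not say which value is wrong, only that at least one reading, or the unit scale, needs a second look.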