Extracting Tabular Data from PDFs with Layout-Aware Parsing
Why PDF Table Extraction is Hard
PDFs store content as positioned text fragments, not as structured data. A table in a PDF is rendered as individual text blocks positioned on a grid — there's no native "table" concept in the PDF format.
Layout-aware parsing reconstructs the table structure by analysing the spatial relationships between text fragments: row alignment, column spacing, and cell boundaries.
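The idea can be illustrated with a minimal sketch. It is not the API's actual algorithm: the Fragment type, the tolerance values, and the convention that y is measured from the top of the page are all assumptions made for illustration. Fragments with nearly equal y values are grouped into rows, and clusters of left edges act as column anchors.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment: its text plus the position of its left edge."""
    text: str
    x: float  # distance from the left page edge
    y: float  # distance from the top page edge (assumed convention)


def group_rows(fragments, y_tol=2.0):
    """Group fragments into rows; neighbours within y_tol share a row."""
    rows = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if rows and abs(frag.y - rows[-1][-1].y) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return rows


def column_starts(fragments, x_tol=5.0):
    """Cluster left edges; each cluster's smallest x becomes a column anchor."""
    starts = []
    for x in sorted(f.x for f in fragments):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def build_table(fragments, y_tol=2.0, x_tol=5.0):
    """Place every fragment in the cell of its row and nearest column anchor."""
    cols = column_starts(fragments, x_tol)
    table = []
    for row in group_rows(fragments, y_tol):
        cells = [""] * len(cols)
        for frag in sorted(row, key=lambda f: f.x):
            idx = min(range(len(cols)), key=lambda i: abs(cols[i] - frag.x))
            cells[idx] = (cells[idx] + " " + frag.text).strip()
        table.append(cells)
    return table
```

Real documents need more than this (merged cells, wrapped text, right-aligned numbers, ruled borders), which is why fixed tolerances like the ones above tend to break on anything but simple grids.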
API Usage
curl -X POST https://api.toolkitapi.io/v1/pdf/extract-tables \
  -H "X-API-Key: $API_KEY" \
  -F "file=@report.pdf" \
  -F "pages=1,3,5"
The response lists each detected table with its page number and rows:
{
  "tables": [
    {
      "page": 1,
      "rows": [
        ["Product", "Q1", "Q2", "Q3"],
        ["Widget A", "1200", "1450", "1600"],
        ["Widget B", "800", "920", "1100"]
      ]
    }
  ]
}
Output Formats
Pass format=csv to receive a ZIP file containing one CSV per table,
or format=excel for an .xlsx file with tables on separate sheets.
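For the CSV option, the ZIP body can be unpacked in memory with the standard library. The sketch below assumes the archive entries are UTF-8 encoded and end in .csv; the entry names themselves are not documented here, so the function simply returns whatever names it finds. read_csv_tables is a hypothetical helper, not part of the API.

```python
import csv
import io
import zipfile


def read_csv_tables(zip_bytes):
    """Return {entry_name: list_of_rows} for every .csv entry in the archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV file
            # utf-8-sig also strips a leading byte-order mark if one is present
            text = archive.read(name).decode("utf-8-sig")
            tables[name] = list(csv.reader(io.StringIO(text, newline="")))
    return tables
```

Pass it the raw response body, for example the bytes written by curl with --output tables.zip.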
Scanned PDFs
Scanned documents are images, not text. The API detects pages that contain no extractable text and applies OCR to them automatically. Accuracy depends on scan quality; scans at 300 DPI or higher give the best results.
Page Selection
Use the pages parameter to extract tables from specific pages only. This is faster and
cheaper for large documents where tables appear on known pages. Single pages and ranges can be combined:
{ "pages": "1-5,10,15-20" }
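It can be useful to check a page specification on the client before sending it, for instance to count how many pages a request will touch. The helper below is a sketch under the assumption that pages are 1-based and ranges are inclusive; how the API itself treats malformed or overlapping ranges is not documented here, and parse_pages is a hypothetical name.

```python
def parse_pages(spec):
    """Expand a spec like '1-5,10,15-20' into a sorted list of unique pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

With the specification above, parse_pages("1-5,10,15-20") yields twelve pages: 1 to 5, 10, and 15 to 20.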
Post-Processing
The extracted rows are strings. If the original table contained numbers, cast them after extraction:
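import pandas as pd

# `response` is the parsed JSON body returned by the API (a dict)
rows = response["tables"][0]["rows"]
headers = rows[0]  # the first row holds the column names
data = [dict(zip(headers, row)) for row in rows[1:]]
df = pd.DataFrame(data)
# Cast a numeric column; cells that cannot be parsed become NaN
df["Q1"] = pd.to_numeric(df["Q1"], errors="coerce")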
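When pandas is not available, the same casting step can be sketched with plain Python. The helpers below are hypothetical, and they assume that a comma inside a cell is a thousands separator (so "1,200" means 1200) and that the first row holds the headers. Cells that do not parse as numbers become None.

```python
def to_number(cell):
    """Convert a cell string to int or float; return None if it is not numeric."""
    cleaned = cell.strip().replace(",", "")  # assumes ',' groups thousands
    if not cleaned:
        return None
    try:
        return int(cleaned)
    except ValueError:
        pass
    try:
        return float(cleaned)
    except ValueError:
        return None


def cast_rows(rows, text_columns=("Product",)):
    """Turn header + data rows into dicts, casting every non-text column."""
    headers, *body = rows
    records = []
    for row in body:
        record = {}
        for header, cell in zip(headers, row):
            record[header] = cell if header in text_columns else to_number(cell)
        records.append(record)
    return records
```

Listing the text columns explicitly keeps labels such as product codes from being turned into numbers by accident.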