Extracting Tabular Data from PDFs with Layout-Aware Parsing


Why PDF Table Extraction is Hard

PDFs store content as positioned text fragments, not as structured data. A table is typically drawn as individual text runs placed at page coordinates, and unless the file is tagged, nothing in it records which runs form a row, a column, or a cell.

Layout-aware parsing reconstructs the table structure by analysing the spatial relationships between text fragments: row alignment, column spacing, and cell boundaries.
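
As a rough illustration of the row-alignment step, the sketch below clusters text fragments by vertical position and then orders each cluster left to right. The Fragment type, the y_tolerance value, and the coordinate convention (y grows down the page) are assumptions made for this example; a real parser also has to infer column boundaries and handle merged or wrapped cells.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A text run and the page coordinates of its origin (hypothetical type)."""
    text: str
    x: float
    y: float


def group_into_rows(fragments, y_tolerance=2.0):
    """Cluster fragments into rows by y position, then sort each row by x."""
    rows = []  # each entry is [reference_y, [fragments in that row]]
    for frag in sorted(fragments, key=lambda f: f.y):
        # A fragment joins the current row if it sits close to that row's first fragment.
        if rows and abs(frag.y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append([frag.y, [frag]])
    return [[f.text for f in sorted(members, key=lambda f: f.x)] for _, members in rows]


# Fragments arrive in arbitrary order with slightly uneven baselines.
fragments = [
    Fragment("Q1", 120, 50.4), Fragment("Product", 20, 50.0),
    Fragment("1200", 120, 70.1), Fragment("Widget A", 20, 70.0),
]
table = group_into_rows(fragments)
```

Here table comes out as two rows, each ordered left to right, even though the input fragments were shuffled.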

API Usage

curl -X POST https://api.toolkitapi.io/v1/pdf/extract-tables \
  -H "X-API-Key: $API_KEY" \
  -F "[email protected]" \
  -F "pages=1,3,5"

Example response:

{
  "tables": [
    {
      "page": 1,
      "rows": [
        ["Product", "Q1", "Q2", "Q3"],
        ["Widget A", "1200", "1450", "1600"],
        ["Widget B", "800",  "920",  "1100"]
      ]
    }
  ]
}
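
The same call can be sketched in Python using only the standard library. The endpoint, the X-API-Key header, and the pages field come from the curl example; the upload field name ("file" here) is an assumption and should be checked against the API reference. Building the request is kept separate from sending it so the payload can be inspected first.

```python
import json
import urllib.request
import uuid

API_URL = "https://api.toolkitapi.io/v1/pdf/extract-tables"


def build_request(pdf_bytes, filename, pages, api_key, file_field="file"):
    """Build (but do not send) a multipart/form-data POST request.

    file_field is an assumed name for the upload field, not a documented one.
    """
    boundary = uuid.uuid4().hex
    parts = [
        f'--{boundary}\r\nContent-Disposition: form-data; name="pages"\r\n\r\n{pages}\r\n'.encode(),
        (f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n').encode(),
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    headers = {
        "X-API-Key": api_key,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(API_URL, data=b"".join(parts), headers=headers, method="POST")


def extract_tables(request):
    """Send the request and return the list stored under the "tables" key."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["tables"]
```

A caller would read the PDF bytes from disk, pass them to build_request along with the page selection and API key, and hand the result to extract_tables.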

Output Formats

Pass format=csv to receive a ZIP file containing one CSV per table, or format=excel for an .xlsx file with tables on separate sheets.
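
A minimal sketch of unpacking the format=csv result, assuming the response body is the raw ZIP archive and the CSV files are UTF-8 encoded. The member names inside the archive are not documented here, so the code simply reads whatever .csv entries it finds.

```python
import csv
import io
import zipfile


def read_csv_tables(zip_bytes):
    """Return {member_name: rows} for every CSV file inside a ZIP archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            # Assumes UTF-8; "utf-8-sig" also strips a byte-order mark if present.
            text = archive.read(name).decode("utf-8-sig")
            tables[name] = list(csv.reader(io.StringIO(text, newline="")))
    return tables
```

Each value is a list of rows in the same shape as the JSON response, so downstream code can treat both formats alike.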

Scanned PDFs

Scanned documents are images, not text. The API automatically detects whether OCR is needed and applies it when the page contains no extractable text. Accuracy depends on scan quality — 300 DPI or higher gives the best results.
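
The rule described above (OCR only where a page has no extractable text) can be approximated client-side, for example to estimate in advance how many pages of a document will go through OCR. This is only a sketch of that rule; how the API makes the decision internally is not documented here, and page_texts is assumed to be the per-page text obtained from any PDF text extractor.

```python
def pages_needing_ocr(page_texts):
    """Return 1-based page numbers whose extracted text is empty or whitespace only."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]
```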

Page Selection

Use the pages parameter to extract tables from specific pages only. It accepts comma-separated page numbers and ranges. This is faster and cheaper for large documents where tables appear on known pages:

{ "pages": "1-5,10,15-20" }
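
To validate a page selection before sending it, a small helper can expand the range syntax into explicit page numbers. This is a sketch based on the format shown in the example; whether the API accepts whitespace, open-ended ranges, or other variants is not covered here, and parse_pages is a hypothetical helper name.

```python
def parse_pages(spec):
    """Expand a spec such as "1-5,10,15-20" into a sorted list of unique page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, sep, end = part.partition("-")
        first = int(start)
        last = int(end) if sep else first
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```

Malformed input such as "5-" or "a" raises ValueError, so a bad selection fails locally instead of costing an API call.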

Post-Processing

Every cell in the extracted rows is a string, even when the original table held numbers. Cast numeric columns after extraction:

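import pandas as pd

# `response` is the parsed JSON body returned by the extract-tables call
rows = response["tables"][0]["rows"]
headers, *records = rows  # the first row holds the column names
df = pd.DataFrame(records, columns=headers)
# Cast a numeric column; cells that cannot be parsed become NaN
df["Q1"] = pd.to_numeric(df["Q1"], errors="coerce")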
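
Where pandas is not available, or where every column should be cast in one pass, a plain-Python variant may be enough. The sketch below assumes that a comma inside a cell is a thousands separator, which will be wrong for locales that use it as the decimal mark; to_number and cast_rows are hypothetical helper names.

```python
def to_number(cell):
    """Convert a cell such as "1,200" or "3.5" to int or float.

    Returns None when neither int() nor float() accepts the cleaned text.
    """
    cleaned = cell.strip().replace(",", "")  # assumes "," is a thousands separator
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return None


def cast_rows(rows):
    """Turn [header, *records] into dicts, casting numeric cells and keeping the rest as strings."""
    headers, *records = rows
    result = []
    for record in records:
        converted = {}
        for header, cell in zip(headers, record):
            number = to_number(cell)
            converted[header] = cell if number is None else number
        result.append(converted)
    return result
```

Non-numeric cells such as product names pass through unchanged, so the helper can be applied to a whole table without naming the numeric columns.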
