PDF OCR

Extract text from scanned or image-based PDFs using optical character recognition

POST /v1/pdf/ocr (1 credit)

curl -X POST "https://pdf.toolkitapi.io/v1/pdf/ocr" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://toolkitapi.io/scanned-document.pdf",
    "pages": "1-3",
    "language": "eng",
    "dpi": 300
  }'

import httpx

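# OCR can take longer than httpx's default 5-second timeout, so allow more time.
resp = httpx.post(
    "https://pdf.toolkitapi.io/v1/pdf/ocr",
    json={
        "url": "https://toolkitapi.io/scanned-document.pdf",
        "pages": "1-3",
        "language": "eng",
        "dpi": 300,
    },
    timeout=60.0,
)
resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
print(resp.json())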

const resp = await fetch("https://pdf.toolkitapi.io/v1/pdf/ocr", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    "url": "https://toolkitapi.io/scanned-document.pdf",
    "pages": "1-3",
    "language": "eng",
    "dpi": 300
  }),
});
const data = await resp.json();
console.log(data);

Response 200 OK

{
  "pages": [
    {"page": 1, "text": "INVOICE\n\nInvoice Number: INV-2024-0042\nDate: December 15, 2024\n\nBill To:\nAcme Corp...", "word_count": 87, "confidence": null},
    {"page": 2, "text": "Item Description          Qty    Price\nWidget A                  10    $25.00\nWidget B                   5    $42.00...", "word_count": 124, "confidence": null}
  ],
  "total_pages": 5,
  "extracted_pages": 2,
  "total_word_count": 211,
  "language": "eng"
}
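
The response above can be consumed with a few lines of Python. This is a minimal sketch that assumes only the field names shown in the sample; `summarize_ocr_response` is a hypothetical helper for illustration, not part of the API.

```python
from typing import Any


def summarize_ocr_response(data: dict[str, Any]) -> dict[str, Any]:
    """Join per-page text and recompute the word total from a response dict."""
    pages = sorted(data.get("pages", []), key=lambda p: p["page"])
    full_text = "\n\n".join(p["text"] for p in pages)
    word_total = sum(p["word_count"] for p in pages)
    # JSON null becomes None in Python; such pages carry no confidence score.
    unscored = [p["page"] for p in pages if p.get("confidence") is None]
    return {
        "text": full_text,
        "word_total": word_total,
        "unscored_pages": unscored,
    }


# Trimmed copy of the sample response (page text shortened for brevity).
sample = {
    "pages": [
        {"page": 1, "text": "INVOICE", "word_count": 87, "confidence": None},
        {"page": 2, "text": "Item Description", "word_count": 124, "confidence": None},
    ],
    "total_pages": 5,
    "extracted_pages": 2,
    "total_word_count": 211,
    "language": "eng",
}

summary = summarize_ocr_response(sample)
print(summary["word_total"], summary["unscored_pages"])
```

Comparing `word_total` against `total_word_count` is a cheap sanity check that every extracted page was read.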

How to Use

1. Provide the PDF via `pdf` (base64) or `url` (public URL).

2. Optionally set `pages` to OCR specific pages. Omit to process all pages.

3. Set `language` to the Tesseract language code matching your document (default: `eng`). Common codes: `fra` (French), `deu` (German), `spa` (Spanish).

4. Adjust `dpi` if needed: higher values (300–600) improve accuracy for poor-quality scans, while lower values (150) are faster.
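
The steps above can be sketched as a small payload builder, which is handy when the PDF is a local file rather than a public URL. `build_ocr_payload` is a hypothetical helper. It assumes the `pdf` field takes standard base64 with no data-URI prefix and that exactly one of `pdf` or `url` should be sent; neither detail is confirmed here, so check the API reference before relying on it.

```python
import base64
from typing import Any, Optional


def build_ocr_payload(
    pdf_bytes: Optional[bytes] = None,
    url: Optional[str] = None,
    pages: Optional[str] = None,
    language: str = "eng",
    dpi: int = 300,
) -> dict[str, Any]:
    """Build a JSON body for POST /v1/pdf/ocr from the documented fields."""
    # Assumption: exactly one source is expected, raw bytes or a public URL.
    if (pdf_bytes is None) == (url is None):
        raise ValueError("provide exactly one of pdf_bytes or url")
    payload: dict[str, Any] = {"language": language, "dpi": dpi}
    if pdf_bytes is not None:
        # Assumption: plain standard-alphabet base64, no "data:" prefix.
        payload["pdf"] = base64.b64encode(pdf_bytes).decode("ascii")
    else:
        payload["url"] = url
    # Leaving "pages" out asks the service to process every page.
    if pages is not None:
        payload["pages"] = pages
    return payload


# Placeholder bytes; in practice read them with open(path, "rb").read().
payload = build_ocr_payload(pdf_bytes=b"%PDF-1.7 example", pages="1-3", language="fra")
print(sorted(payload))
```

The resulting dict can be passed as the `json=` argument in the Python request example.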

About This Tool

PDF OCR extracts text from scanned documents and image-only PDFs where standard text extraction returns empty results. It renders each page to an image at your specified DPI, then runs Tesseract OCR to recognize and extract the text content.

This is the companion to the text extraction endpoint. If `/pdf/text` returns blank pages, the PDF likely contains scanned images rather than embedded text, and OCR is the right tool. The endpoint supports multiple languages and lets you tune the rendering resolution for accuracy.

Higher DPI settings produce sharper page images, which improves OCR accuracy at the cost of processing time. The default of 300 DPI works well for most scanned documents.
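
To make the DPI trade-off concrete, the arithmetic below shows how large a rendered page image becomes at different settings. It is an illustration of the general relationship (PDF pages are measured in points, 72 per inch), not a description of how the service rasterizes pages internally; `rendered_size` is a hypothetical helper.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rasterized at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (150, 300, 600):
    w, h = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI quadruples the pixel count, which is why 600 DPI costs noticeably more processing time than 300.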

Why Use This Tool

Scanned document digitization — Extract text from paper documents that were scanned to PDF
Legacy document processing — Make old image-based PDFs searchable and indexable
Invoice and receipt parsing — Pull text from photographed or scanned financial documents
Multilingual extraction — OCR documents in various languages using Tesseract language packs
Accessibility conversion — Create text versions of image-only PDFs for screen readers

Frequently Asked Questions

What languages are supported?

Tesseract supports 100+ languages. Common codes: `eng` (English), `fra` (French), `deu` (German), `spa` (Spanish), `ita` (Italian), `por` (Portuguese), `jpn` (Japanese), `chi_sim` (Simplified Chinese).

How does DPI affect accuracy?

Higher DPI renders sharper page images, giving Tesseract more detail to work with. For clean scans, 200–300 DPI is sufficient. For poor-quality or small-text scans, try 400–600 DPI. Lower DPI is faster but may miss small text.

When should I use OCR vs regular text extraction?

Try `/pdf/text` first — it's faster and more accurate for digitally-created PDFs. Use OCR only when text extraction returns empty or garbled results, indicating the PDF contains images rather than embedded text.
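
The "text first, OCR second" advice can be sketched as a small fallback routine. Everything here is illustrative: `looks_empty` and `extract_with_fallback` are hypothetical helpers, the 20-character threshold is an arbitrary assumption, and the two callables stand in for real requests to `/pdf/text` and `/v1/pdf/ocr`, whose exact response handling is left to the caller. The check covers only empty output; detecting garbled text would need a separate heuristic.

```python
from typing import Callable, List


def looks_empty(page_texts: List[str], min_chars: int = 20) -> bool:
    """Heuristic: empty if no page has at least min_chars non-space characters."""
    return all(len("".join(t.split())) < min_chars for t in page_texts)


def extract_with_fallback(
    extract_text: Callable[[], List[str]],
    run_ocr: Callable[[], List[str]],
) -> tuple[str, List[str]]:
    """Try plain text extraction first; fall back to OCR if it comes back empty."""
    texts = extract_text()
    if not looks_empty(texts):
        return "text", texts
    # Only pay for OCR when plain extraction found nothing usable.
    return "ocr", run_ocr()


# Stubs standing in for the two API calls.
method, pages = extract_with_fallback(
    extract_text=lambda: ["", "  \n"],
    run_ocr=lambda: ["INVOICE\n\nInvoice Number: INV-2024-0042"],
)
print(method, len(pages))
```

Passing the two calls in as functions keeps the decision logic testable without network access.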
