PDF OCR
Extract text from scanned or image-based PDFs using optical character recognition
POST /v1/pdf/ocr
curl -X POST "https://pdf.toolkitapi.io/v1/pdf/ocr" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://toolkitapi.io/scanned-document.pdf",
    "pages": "1-3",
    "language": "eng",
    "dpi": 300
  }'
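import httpx

# Send the OCR request as a JSON body.
resp = httpx.post(
    "https://pdf.toolkitapi.io/v1/pdf/ocr",
    json={
        "url": "https://toolkitapi.io/scanned-document.pdf",
        "pages": "1-3",
        "language": "eng",
        "dpi": 300,
    },
)
resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
print(resp.json())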
const resp = await fetch("https://pdf.toolkitapi.io/v1/pdf/ocr", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    "url": "https://toolkitapi.io/scanned-document.pdf",
    "pages": "1-3",
    "language": "eng",
    "dpi": 300
  }),
});
const data = await resp.json();
console.log(data);
{
  "pages": [
    {"page": 1, "text": "INVOICE\n\nInvoice Number: INV-2024-0042\nDate: December 15, 2024\n\nBill To:\nAcme Corp...", "word_count": 87, "confidence": null},
    {"page": 2, "text": "Item Description Qty Price\nWidget A 10 $25.00\nWidget B 5 $42.00...", "word_count": 124, "confidence": null}
  ],
  "total_pages": 5,
  "extracted_pages": 2,
  "total_word_count": 211,
  "language": "eng"
}
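A response of this shape can be flattened into a single string and sanity-checked on the client. The sketch below is only an illustration built on the fields visible in the sample response. The helper names `join_pages` and `check_counts` are hypothetical, and the assumption that `total_word_count` always equals the sum of the per-page `word_count` values is inferred from the sample (87 + 124 = 211), not stated by the API.

```python
def join_pages(data: dict) -> str:
    """Concatenate per-page OCR text in page order, separated by form feeds."""
    pages = sorted(data["pages"], key=lambda p: p["page"])
    return "\f".join(p["text"] for p in pages)


def check_counts(data: dict) -> bool:
    """Return True when the summary fields agree with the per-page entries.

    Assumes extracted_pages == len(pages) and that total_word_count is the
    sum of the page word counts, as the sample response suggests.
    """
    pages = data["pages"]
    return (
        data["extracted_pages"] == len(pages)
        and data["total_word_count"] == sum(p["word_count"] for p in pages)
    )


# Abbreviated copy of the sample response (page text shortened).
sample = {
    "pages": [
        {"page": 1, "text": "INVOICE", "word_count": 87, "confidence": None},
        {"page": 2, "text": "Item Description Qty Price", "word_count": 124, "confidence": None},
    ],
    "total_pages": 5,
    "extracted_pages": 2,
    "total_word_count": 211,
    "language": "eng",
}

full_text = join_pages(sample)
counts_ok = check_counts(sample)
```

Note that `confidence` is `null` in the sample, so client code should not assume it is always a number.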
How to Use
1. Provide the PDF via `pdf` (base64) or `url` (public URL).
2. Optionally set `pages` to OCR specific pages. Omit to process all pages.
3. Set `language` to the Tesseract language code matching your document (default: `eng`). Common codes: `fra` (French), `deu` (German), `spa` (Spanish).
4. Adjust `dpi` if needed: higher values (300–600) improve accuracy for poor-quality scans, while lower values (150) are faster.
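The four steps above can be sketched as a small payload builder. This is a minimal illustration, not an official client: the function name `build_ocr_payload` is hypothetical, standard (not URL-safe) base64 is assumed for the `pdf` field, and rejecting requests that supply both `pdf` and `url` is an assumption, since the documentation does not say how the endpoint handles that case.

```python
import base64
from typing import Optional


def build_ocr_payload(
    pdf_bytes: Optional[bytes] = None,
    url: Optional[str] = None,
    pages: Optional[str] = None,
    language: str = "eng",
    dpi: int = 300,
) -> dict:
    """Build a JSON body for POST /v1/pdf/ocr from the documented parameters."""
    # Assumption: exactly one source is expected, so both-or-neither is rejected.
    if (pdf_bytes is None) == (url is None):
        raise ValueError("provide exactly one of pdf_bytes or url")

    payload: dict = {"language": language, "dpi": dpi}
    if url is not None:
        payload["url"] = url
    else:
        # The `pdf` field carries the file contents as base64 text
        # (standard alphabet assumed).
        payload["pdf"] = base64.b64encode(pdf_bytes).decode("ascii")
    if pages is not None:
        # Leaving `pages` out asks the endpoint to process every page.
        payload["pages"] = pages
    return payload
```

The returned dict can be passed as the `json=` argument of `httpx.post`, in place of the literal body used in the Python request example.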
About This Tool
PDF OCR extracts text from scanned documents and image-only PDFs where standard text extraction returns empty results. It renders each page to an image at your specified DPI, then runs Tesseract OCR to recognize and extract the text content.
This is the companion to the text extraction endpoint. If `/pdf/text` returns blank pages, the PDF likely contains scanned images rather than embedded text, and OCR is the right tool for it. The OCR endpoint supports multiple languages and lets you tune the rendering resolution for accuracy.
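One way to apply this rule in client code is to inspect the text extraction result and fall back to OCR only when pages come back blank. The sketch below is hedged: the response format of `/pdf/text` is not documented here, so it assumes a list of per-page objects with `page` and `text` keys, mirroring the OCR response. Both helper names are hypothetical.

```python
def blank_pages(text_pages: list[dict]) -> list[int]:
    """Return page numbers whose extracted text is missing or whitespace-only.

    Assumes each entry looks like {"page": <int>, "text": <str or None>}.
    """
    return [p["page"] for p in text_pages if not (p.get("text") or "").strip()]


def needs_ocr(text_pages: list[dict]) -> bool:
    """True when at least one page came back blank from text extraction."""
    return bool(blank_pages(text_pages))
```

A caller would run text extraction first, then send the document to the OCR endpoint only if `needs_ocr` returns `True`, which avoids OCR processing time for PDFs that already carry embedded text.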
Higher DPI settings produce sharper page images, which improves OCR accuracy at the cost of processing time. The default of 300 DPI works well for most scanned documents.
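To make the trade-off concrete, the pixel dimensions of a rendered page are the page size in inches multiplied by the DPI, so the pixel count grows with the square of the DPI. The sketch below works this out for a US Letter page (8.5 × 11 in), which is an illustrative choice. How the service rasterizes pages internally, and how closely processing time tracks pixel count, are not documented and are assumptions here.

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter page at the DPI values mentioned in the text.
for dpi in (150, 300, 600):
    w, h = rendered_size(8.5, 11, dpi)
    print(f"{dpi} DPI -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")

# 150 DPI -> 1275 x 1650 px (2.1 megapixels)
# 300 DPI -> 2550 x 3300 px (8.4 megapixels)
# 600 DPI -> 5100 x 6600 px (33.7 megapixels)
```

Doubling the DPI from 300 to 600 quadruples the number of pixels per page, which is why the higher setting is best reserved for scans that are hard to read at the default.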
Why Use This Tool
- Scanned document digitization — Extract text from paper documents that were scanned to PDF
- Legacy document processing — Make old image-based PDFs searchable and indexable
- Invoice and receipt parsing — Pull text from photographed or scanned financial documents
- Multilingual extraction — OCR documents in various languages using Tesseract language packs
- Accessibility conversion — Create text versions of image-only PDFs for screen readers
Frequently Asked Questions
What languages are supported?
How does DPI affect accuracy?
When should I use OCR vs regular text extraction?