PDF Text Extractor
Extract text content from a PDF, page by page
/v1/pdf/text
curl -X POST "https://pdf.toolkitapi.io/v1/pdf/text" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://toolkitapi.io/document.pdf",
    "pages": "1-5"
  }'
import httpx

resp = httpx.post(
    "https://pdf.toolkitapi.io/v1/pdf/text",
    json={
        "url": "https://toolkitapi.io/document.pdf",
        "pages": "1-5",
    },
)
print(resp.json())
const resp = await fetch("https://pdf.toolkitapi.io/v1/pdf/text", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://toolkitapi.io/document.pdf",
    pages: "1-5",
  }),
});
const data = await resp.json();
console.log(data);
{
  "pages": [
    {
      "page": 1,
      "text": "Annual Report 2024\n\nThis document summarizes the financial performance...",
      "word_count": 342,
      "char_count": 2105
    },
    {
      "page": 2,
      "text": "Revenue Overview\n\nTotal revenue increased by 15% year-over-year...",
      "word_count": 518,
      "char_count": 3240
    }
  ],
  "total_pages": 24,
  "extracted_pages": 2,
  "total_word_count": 860,
  "total_char_count": 5345
}
Description
How to Use
1. Provide the PDF via `pdf` (base64) or `url` (public URL).
2. Optionally set `pages` to target specific pages (e.g. `"1-3,5,10"`). Omit to extract from all pages.
3. The response contains per-page text with word and character counts, plus aggregate totals.
4. If a page returns empty text, it may contain images or scanned content — try the `/pdf/ocr` endpoint instead.
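The steps above can be sketched as a small request-body builder. This is a minimal sketch, not the service's official client: `build_payload` is a hypothetical helper, and it assumes that `pdf` takes a plain base64 string in the same JSON body as `url` and `pages`, and that exactly one of `pdf` or `url` should be sent.

```python
import base64


def build_payload(pdf_bytes=None, url=None, pages=None):
    """Build the JSON body for POST /v1/pdf/text (hypothetical helper)."""
    # Assumption: the API expects exactly one source, either bytes or a URL.
    if (pdf_bytes is None) == (url is None):
        raise ValueError("provide exactly one of pdf_bytes or url")

    payload = {}
    if pdf_bytes is not None:
        # Assumption: `pdf` is standard base64 text with no data-URI prefix.
        payload["pdf"] = base64.b64encode(pdf_bytes).decode("ascii")
    else:
        payload["url"] = url

    if pages is not None:
        # A page spec such as "1-3,5,10"; leave the key out to extract all pages.
        payload["pages"] = pages
    return payload
```

The returned dictionary can be passed as the `json=` argument of `httpx.post`, as in the Python example earlier on this page.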
About This Tool
PDF Text Extractor pulls text content from PDF documents page by page. It reads the embedded text layer directly, with no OCR involved, which makes it fast and accurate for digitally created PDFs.
Use this tool when you need to index document content for search, feed PDF text into an LLM or NLP pipeline, or simply grab the readable content from a report. Each page comes with word and character counts, which are handy for content analysis and for estimating processing costs.
For scanned documents or image-only PDFs where this endpoint returns empty text, use the OCR endpoint instead.
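One way to decide which pages to send to OCR is to scan the response for pages with no text. The sketch below assumes the response shape shown in the sample output above; `pages_needing_ocr` is a hypothetical helper, and treating whitespace-only text as empty is an assumption of this example, not documented behavior.

```python
def pages_needing_ocr(result):
    """Return the page numbers whose extracted text is empty."""
    empty = []
    for page in result.get("pages", []):
        # Assumption: whitespace-only text counts as empty for this check.
        if not page.get("text", "").strip():
            empty.append(page["page"])
    return empty
```

The resulting page numbers could then be joined into a `pages` value (for example `",".join(map(str, numbers))`) for a follow-up OCR request.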
Why Use This Tool
- Search indexing — Extract document text for full-text search databases
- LLM ingestion — Feed PDF content into language models for summarization or Q&A
- Content analysis — Count words and characters for readability or cost estimation
- Data migration — Pull text from PDF archives into structured formats
- Compliance review — Extract document content for automated policy checks
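For the LLM ingestion and cost estimation cases, the per-page `word_count` can be used to group pages into batches before sending them downstream. This is a rough sketch under stated assumptions: `batch_pages` and the `max_words` budget are hypothetical, and word counts are only a proxy for whatever unit a given model bills by.

```python
def batch_pages(pages, max_words):
    """Group consecutive pages so each batch stays within max_words."""
    batches = []
    current = []
    current_words = 0
    for page in pages:
        words = page["word_count"]
        # Close the current batch if this page would push it over the budget.
        if current and current_words + words > max_words:
            batches.append(current)
            current = []
            current_words = 0
        current.append(page["page"])
        current_words += words
    if current:
        batches.append(current)
    return batches
```

A single page larger than the budget still gets its own batch here; splitting within a page would need to work on the `text` field instead.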
Frequently Asked Questions
Why does a page return empty text?
Does this preserve formatting?
Is there a page limit?