PDF Text Extractor

Extract text content from a PDF, page by page

POST /v1/pdf/text (1 credit)

curl

curl -X POST "https://pdf.toolkitapi.io/v1/pdf/text" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://toolkitapi.io/document.pdf",
    "pages": "1-5"
  }'

Python

import httpx

resp = httpx.post(
    "https://pdf.toolkitapi.io/v1/pdf/text",
    json={
        "url": "https://toolkitapi.io/document.pdf",
        "pages": "1-5",
    },
)
print(resp.json())

JavaScript

const resp = await fetch("https://pdf.toolkitapi.io/v1/pdf/text", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    "url": "https://toolkitapi.io/document.pdf",
    "pages": "1-5"
  }),
});
const data = await resp.json();
console.log(data);

Response 200 OK
{
  "pages": [
    {
      "page": 1,
      "text": "Annual Report 2024\n\nThis document summarizes the financial performance...",
      "word_count": 342,
      "char_count": 2105
    },
    {
      "page": 2,
      "text": "Revenue Overview\n\nTotal revenue increased by 15% year-over-year...",
      "word_count": 518,
      "char_count": 3240
    }
  ],
  "total_pages": 24,
  "extracted_pages": 2,
  "total_word_count": 860,
  "total_char_count": 5345
}

How to Use

1. Provide the PDF via `pdf` (base64) or `url` (public URL).

2. Optionally set `pages` to target specific pages (e.g. `"1-3,5,10"`). Omit it to extract all pages.

3. The response contains per-page text with word and character counts, plus aggregate totals.

4. If a page returns empty text, it may contain images or scanned content; try the `/pdf/ocr` endpoint instead.
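
The first two steps can be sketched as a small helper that builds the JSON request body. This is an illustration, not the official client: the name `build_text_request` is hypothetical, and it assumes the `pdf` field takes a standard base64 string and that only one of `pdf` or `url` should be sent per request.

```python
import base64
from typing import Optional


def build_text_request(
    pdf_bytes: Optional[bytes] = None,
    url: Optional[str] = None,
    pages: Optional[str] = None,
) -> dict:
    """Build a JSON body for POST /v1/pdf/text (hypothetical helper)."""
    # Assumption: exactly one source is sent, either inline bytes or a URL.
    if (pdf_bytes is None) == (url is None):
        raise ValueError("provide exactly one of pdf_bytes or url")

    body = {}
    if pdf_bytes is not None:
        # Assumption: `pdf` carries the file as standard base64 text.
        body["pdf"] = base64.b64encode(pdf_bytes).decode("ascii")
    else:
        body["url"] = url

    # Leaving `pages` out requests every page.
    if pages is not None:
        body["pages"] = pages
    return body
```

The returned dictionary can be passed as the `json=` argument of `httpx.post`, as in the Python request example.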

About This Tool

PDF Text Extractor pulls text content from PDF documents page by page. It reads the embedded text layer directly, with no OCR involved, which makes it fast and accurate for digitally created PDFs.

Use this tool when you need to index document content for search, feed PDF text into an LLM or NLP pipeline, or simply grab the readable content from a report. Each page comes with word and character counts, which is handy for content analysis and estimating processing costs.
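
As one way to use the per-page counts when feeding text into an LLM or indexing pipeline, the sketch below groups pages into chunks that stay within a word budget. The function name `chunk_pages` and the `max_words` parameter are hypothetical; it relies only on the `pages`, `text`, and `word_count` fields shown in the sample response.

```python
from typing import Iterator


def chunk_pages(response: dict, max_words: int) -> Iterator[str]:
    """Yield page text grouped so each chunk stays within max_words.

    A single page longer than max_words is emitted as its own chunk.
    """
    chunk = []
    words = 0
    for page in response["pages"]:
        if not page["text"]:
            continue  # nothing to index on this page
        # Close the current chunk if this page would exceed the budget.
        if chunk and words + page["word_count"] > max_words:
            yield "\n\n".join(chunk)
            chunk, words = [], 0
        chunk.append(page["text"])
        words += page["word_count"]
    if chunk:
        yield "\n\n".join(chunk)
```

Word counts are only a rough proxy for model tokens, so a real pipeline would likely leave some headroom in `max_words`.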

For scanned documents or image-only PDFs where this endpoint returns empty text, use the `/pdf/ocr` endpoint instead.
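
A minimal sketch of that fallback: collect the pages whose text came back empty and format them as a `pages` string for a follow-up request. The helper name `empty_pages_spec` is hypothetical, treating whitespace-only text as empty is a choice made here, and it is an assumption that `/pdf/ocr` accepts the same `pages` syntax as this endpoint.

```python
def empty_pages_spec(response: dict) -> str:
    """Return a comma-separated list of pages whose text is empty."""
    empty = [
        str(page["page"])
        for page in response["pages"]
        # Whitespace-only text is treated as empty in this sketch.
        if not page["text"].strip()
    ]
    return ",".join(empty)
```

An empty return value means every extracted page had text, so no OCR pass is needed.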

Frequently Asked Questions

Why does a page return empty text?
The page likely contains scanned images rather than embedded text. Use the `/pdf/ocr` endpoint to extract text from image-based PDFs using optical character recognition.

Does this preserve formatting?
The extracted text preserves reading order and line breaks but not visual formatting like fonts, colors, or exact layout positioning. It's plain text, not rich text.

Is there a page limit?
There's no hard page limit; the tool processes whatever pages you request. For very large documents, consider extracting specific page ranges to keep response sizes manageable.
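
For large documents, one possible approach is to split the page count into fixed-size ranges and request them one at a time. The sketch below only generates the `pages` strings. `page_batches` and `batch_size` are hypothetical names, and it assumes the page count is already known, for example from the `total_pages` field of an earlier response.

```python
from typing import Iterator


def page_batches(total_pages: int, batch_size: int) -> Iterator[str]:
    """Yield `pages` strings such as "1-10" that cover 1..total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        # A one-page batch is written as a single number, not a range.
        yield str(start) if start == end else f"{start}-{end}"
```

Each yielded string can be sent as the `pages` value of a separate request.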
