PDF Text Extractor

Extract text content from a PDF, page by page

POST /v1/pdf/text (1 credit)

curl -X POST "https://pdf.toolkitapi.io/v1/pdf/text" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://toolkitapi.io/document.pdf",
    "pages": "1-5"
  }'

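import httpx

# Request the text of pages 1-5 from a publicly reachable PDF.
resp = httpx.post(
    "https://pdf.toolkitapi.io/v1/pdf/text",
    json={
        "url": "https://toolkitapi.io/document.pdf",
        "pages": "1-5",
    },
)
resp.raise_for_status()  # raise on a 4xx/5xx status instead of parsing an error body
print(resp.json())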

const resp = await fetch("https://pdf.toolkitapi.io/v1/pdf/text", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    "url": "https://toolkitapi.io/document.pdf",
    "pages": "1-5"
  }),
});
const data = await resp.json();
console.log(data);

Response 200 OK

{
  "pages": [
    {
      "page": 1,
      "text": "Annual Report 2024\n\nThis document summarizes the financial performance...",
      "word_count": 342,
      "char_count": 2105
    },
    {
      "page": 2,
      "text": "Revenue Overview\n\nTotal revenue increased by 15% year-over-year...",
      "word_count": 518,
      "char_count": 3240
    }
  ],
  "total_pages": 24,
  "extracted_pages": 2,
  "total_word_count": 860,
  "total_char_count": 5345
}
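A minimal sketch of consuming a response shaped like the one above. The helper name `summarize_extraction` is hypothetical, and `sample` is a shortened copy of the sample response, not live output. It joins the per-page text and checks that the per-page counts add up to the reported totals.

```python
def summarize_extraction(payload: dict) -> dict:
    """Join per-page text and cross-check the aggregate counts (hypothetical helper)."""
    pages = payload["pages"]
    word_sum = sum(p["word_count"] for p in pages)
    char_sum = sum(p["char_count"] for p in pages)
    return {
        # Separate pages with a blank line so page boundaries stay visible.
        "full_text": "\n\n".join(p["text"] for p in pages),
        "word_count": word_sum,
        "char_count": char_sum,
        "totals_match": (
            word_sum == payload["total_word_count"]
            and char_sum == payload["total_char_count"]
        ),
    }


# Shortened copy of the sample response (text fields abbreviated).
sample = {
    "pages": [
        {"page": 1, "text": "Annual Report 2024\n\nThis document summarizes...",
         "word_count": 342, "char_count": 2105},
        {"page": 2, "text": "Revenue Overview\n\nTotal revenue increased...",
         "word_count": 518, "char_count": 3240},
    ],
    "total_pages": 24,
    "extracted_pages": 2,
    "total_word_count": 860,
    "total_char_count": 5345,
}

summary = summarize_extraction(sample)
print(summary["word_count"], summary["char_count"], summary["totals_match"])
```

In the sample, `total_pages` (24) describes the whole document, while `extracted_pages` and the word and character totals cover only the pages returned.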

Description

Extracts the embedded text of a PDF page by page and returns word and character counts for each page, plus totals across the extracted pages.

How to Use

1. Provide the PDF via `pdf` (base64) or `url` (public URL).

2. Optionally set `pages` to target specific pages (e.g. `"1-3,5,10"`). Omit it to extract all pages.

3. The response contains per-page text with word and character counts, plus aggregate totals.

4. If a page returns empty text, it may contain images or scanned content; try the `/pdf/ocr` endpoint instead.
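The first two steps can be sketched as a small request-body builder. This is an illustration under assumptions: `build_payload` is a hypothetical helper, the `pages` grammar is inferred from the `"1-3,5,10"` example, and the sketch assumes plain standard base64 for `pdf` and that exactly one of `pdf` or `url` should be sent. The API may accept more than this.

```python
import base64
import re

# Assumed grammar, inferred from the "1-3,5,10" example:
# comma-separated page numbers or "start-end" ranges.
_PAGES_SPEC = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def build_payload(pdf_bytes=None, url=None, pages=None) -> dict:
    """Build a JSON body for /v1/pdf/text (hypothetical helper)."""
    # Assumption: send exactly one source, either raw bytes or a public URL.
    if (pdf_bytes is None) == (url is None):
        raise ValueError("provide exactly one of pdf_bytes or url")

    payload = {}
    if pdf_bytes is not None:
        # Assumption: standard base64 without a data-URI prefix.
        payload["pdf"] = base64.b64encode(pdf_bytes).decode("ascii")
    else:
        payload["url"] = url

    if pages is not None:
        if not _PAGES_SPEC.fullmatch(pages):
            raise ValueError(f"unrecognized pages spec: {pages!r}")
        payload["pages"] = pages
    return payload


print(build_payload(url="https://toolkitapi.io/document.pdf", pages="1-3,5,10"))
```

The returned dict can be passed as the `json=` argument in the Python example near the top of this page.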

About This Tool

PDF Text Extractor pulls text content from PDF documents page by page. It reads the embedded text layer directly — no OCR involved — making it fast and accurate for digitally-created PDFs.

Use this tool when you need to index document content for search, feed PDF text into an LLM or NLP pipeline, or simply grab the readable content from a report. Each page comes with word and character counts, which is handy for content analysis and estimating processing costs.

For scanned documents or image-only PDFs where this endpoint returns empty text, use the `/pdf/ocr` endpoint instead.

Why Use This Tool

Search indexing — Extract document text for full-text search databases
LLM ingestion — Feed PDF content into language models for summarization or Q&A
Content analysis — Count words and characters for readability or cost estimation
Data migration — Pull text from PDF archives into structured formats
Compliance review — Extract document content for automated policy checks

Frequently Asked Questions

Why does a page return empty text?

The page likely contains scanned images rather than embedded text. Use the `/pdf/ocr` endpoint to extract text from image-based PDFs using optical character recognition.
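One way to spot such pages is to scan the response for entries whose `text` is empty or only whitespace. The sketch below uses hypothetical helper names and an invented three-page response. It also formats the result as a `pages` string in the syntax this endpoint documents; whether `/pdf/ocr` accepts the same parameter is an assumption not confirmed here.

```python
def pages_needing_ocr(payload: dict) -> list[int]:
    """Return page numbers whose extracted text is empty or whitespace only."""
    return [p["page"] for p in payload["pages"] if not p["text"].strip()]


def to_pages_spec(page_numbers: list[int]) -> str:
    """Format page numbers as a comma-separated spec such as "2,3"."""
    return ",".join(str(n) for n in page_numbers)


# Invented response for illustration: page 2 is empty, page 3 is whitespace only.
response = {
    "pages": [
        {"page": 1, "text": "Annual Report 2024", "word_count": 3, "char_count": 18},
        {"page": 2, "text": "", "word_count": 0, "char_count": 0},
        {"page": 3, "text": "  \n", "word_count": 0, "char_count": 3},
    ]
}

empty = pages_needing_ocr(response)
print(empty, to_pages_spec(empty))
```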

Does this preserve formatting?

The extracted text preserves reading order and line breaks but not visual formatting like fonts, colors, or exact layout positioning. It's plain text, not rich text.

Is there a page limit?

There's no hard page limit — the tool processes whatever pages you request. For very large documents, consider extracting specific page ranges to keep response sizes manageable.
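A minimal sketch of that approach: split a document into fixed-size page ranges and send one request per range. `page_ranges` is a hypothetical helper and the chunk size of 10 is arbitrary. The page count would come from `total_pages` in an earlier response.

```python
def page_ranges(total_pages: int, chunk_size: int) -> list[str]:
    """Split pages 1..total_pages into `pages` specs of at most chunk_size pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single trailing page is written as "21", not "21-21".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


# A 24-page document, as in the sample response, in chunks of 10 pages.
chunks = page_ranges(24, 10)
print(chunks)
```

Each string can be sent as the `pages` value of a separate request.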
