PDF Text Extractor: Bulk PDF to Text and Metadata, $1 per 1,000 PDFs

Pull plain text, page-by-page content, and full metadata from any PDF URL. Each result includes the title, author, creation date, page count, and character count, plus flags for scanned (image-only) or encrypted documents. Built for bulk: pass in 10,000 URLs and get back structured rows. Ideal for legal discovery, RAG ingestion, compliance audits, and document workflow automation.


Pricing: $0.001/PDF
RAM: 128 MB
Coverage: Any URL
Output fields: 12+
Proxy: Apify datacenter
Tech: HTTP + pdfcpu
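
To make the per-PDF price concrete, here is a minimal sketch of the arithmetic. It assumes the $0.001/PDF rate quoted above is the only charge and ignores any platform usage that may be billed separately; the function name is illustrative, not part of the Actor.

```python
from decimal import Decimal

# Rate taken from the listing: $0.001 per PDF, i.e. $1 per 1,000 PDFs.
PRICE_PER_PDF_USD = Decimal("0.001")


def estimate_cost_usd(pdf_count: int) -> Decimal:
    """Return the estimated extraction charge for pdf_count documents."""
    if pdf_count < 0:
        raise ValueError("pdf_count must be non-negative")
    # Decimal avoids the rounding drift that float multiplication can introduce.
    return PRICE_PER_PDF_USD * pdf_count


if __name__ == "__main__":
    for n in (1_000, 10_000, 250_000):
        print(f"{n:>7} PDFs -> ${estimate_cost_usd(n):.2f}")
```

At this rate, the 10,000-URL batch mentioned in the introduction works out to $10.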

What you get

Page text: full extracted text plus a per-page array for paginated workflows
Metadata: title, author, subject, creator, producer, creation_date, modification_date
Stats: page_count, char_count, word_count, file_size_bytes
Flags: is_scanned (image-only PDF), is_encrypted, is_form
Source: source_url, content_type, http_status, scraped_at
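
The flags are what let a pipeline decide what to do with each document next. The sketch below routes records into buckets using the field names listed above. The bucket names, and the choice to treat an http_status of 400 or higher as a failed download, are assumptions made for illustration rather than documented behavior.

```python
from typing import Iterable


def triage(records: Iterable[dict]) -> dict[str, list[str]]:
    """Group source URLs by the follow-up action their flags suggest."""
    buckets: dict[str, list[str]] = {
        "ready": [],       # text extracted, usable as-is
        "needs_ocr": [],   # is_scanned: image-only, no text layer
        "encrypted": [],   # is_encrypted: needs a password or manual handling
        "failed": [],      # assumption: http_status >= 400 means the fetch failed
    }
    for rec in records:
        url = rec.get("source_url", "")
        status = rec.get("http_status")
        if status is not None and status >= 400:
            buckets["failed"].append(url)
        elif rec.get("is_encrypted"):
            buckets["encrypted"].append(url)
        elif rec.get("is_scanned"):
            buckets["needs_ocr"].append(url)
        else:
            buckets["ready"].append(url)
    return buckets


if __name__ == "__main__":
    # Hypothetical records, shaped like the fields described above.
    sample = [
        {"source_url": "https://example.com/report.pdf", "http_status": 200,
         "is_scanned": False, "is_encrypted": False},
        {"source_url": "https://example.com/scan.pdf", "http_status": 200,
         "is_scanned": True, "is_encrypted": False},
    ]
    print(triage(sample))
```

Routing scanned documents to a separate OCR step keeps empty text fields out of downstream search or embedding jobs.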

Primary use cases

Legal discovery. Bulk-extract text from court filings, contracts, and pleadings for keyword search and review
RAG ingestion. Feed PDF reports, whitepapers, and manuals directly into vector databases
Compliance audits. Scan thousands of policy PDFs for required clauses, signatures, or regulatory keywords
Research aggregation. Pull abstracts, authors, and full text from academic preprint URLs
Invoice and receipt parsing. Extract structured text from supplier PDFs in accounts-payable pipelines
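
As one illustration of the compliance-audit case, the following sketch checks an extracted text field for a list of required phrases. The function name and the sample phrases are hypothetical. Whitespace is normalized before matching because extracted PDF text often breaks a phrase across lines.

```python
def missing_clauses(text: str, required: list[str]) -> list[str]:
    """Return the required phrases that do not appear in text.

    Matching is case-insensitive and ignores differences in whitespace,
    so a phrase split across a line break still counts as present.
    """
    haystack = " ".join(text.split()).casefold()
    return [
        phrase
        for phrase in required
        if " ".join(phrase.split()).casefold() not in haystack
    ]


if __name__ == "__main__":
    # Hypothetical extracted text and clause list.
    extracted = "This Data\nProcessing Agreement is governed by ..."
    print(missing_clauses(extracted, ["data processing agreement", "limitation of liability"]))
```

Plain substring matching is a deliberately simple choice here; a real audit would likely need fuzzier matching for reworded clauses.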

API example

# Start a run via the Apify API
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~pdf-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrls": [
      "https://example.com/report-2026.pdf",
      "https://example.com/contract-v3.pdf",
      "https://example.com/whitepaper.pdf"
    ],
    "extractText": true,
    "extractMetadata": true,
    "perPageText": false
  }'

# Or use with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/pdf-extractor
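
For Python callers, the same request can be assembled with the standard library alone. This is a sketch that mirrors the curl call above (same endpoint, same input fields, token passed as a query parameter); the helper name build_run_request is an assumption, and the actual send is left commented out so the snippet runs without network access.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"
ACTOR_ID = "santamaria-automations~pdf-extractor"  # as in the curl example


def build_run_request(
    token: str, pdf_urls: list[str], per_page_text: bool = False
) -> urllib.request.Request:
    """Build the POST request that starts a run, without sending it."""
    query = urllib.parse.urlencode({"token": token})
    payload = {
        "pdfUrls": pdf_urls,
        "extractText": True,
        "extractMetadata": True,
        "perPageText": per_page_text,
    }
    return urllib.request.Request(
        url=f"{API_BASE}/{ACTOR_ID}/runs?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_run_request("YOUR_TOKEN", ["https://example.com/report-2026.pdf"])
    print(req.get_method(), req.full_url)
    # To actually start the run, uncomment:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any run is billed.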

Integrations

n8n, Make, Zapier: trigger runs and process records via webhook
AI Agents (MCP): Claude Desktop, Cursor, VS Code, LangChain, LlamaIndex
Python, Node.js: Apify SDK for programmatic access
Google Sheets, Airtable: bulk input in, structured data out

Output fields

Field	Type	Example
source_url	string	https://example.com/report.pdf
title	string	Annual Report 2026
author	string	Acme Corp
page_count	integer	142
char_count	integer	284512
text	string	Executive Summary...
creation_date	string	2026-01-15T09:30:00Z
is_scanned	boolean	false
is_encrypted	boolean	false
file_size_bytes	integer	4218940
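
Once rows are loaded, the string and integer fields usually need a little post-processing. The sketch below parses creation_date into a timezone-aware datetime and derives an average characters-per-page figure from the example row. Both helper names are illustrative, and it assumes creation_date always arrives in the ISO 8601 form shown in the table.

```python
from datetime import datetime


def parse_creation_date(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-01-15T09:30:00Z."""
    # datetime.fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so replace it with an explicit UTC offset for older versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def chars_per_page(record: dict) -> float:
    """Average extracted characters per page; 0.0 when page_count is zero."""
    pages = record["page_count"]
    return record["char_count"] / pages if pages else 0.0


if __name__ == "__main__":
    # Values copied from the example column of the table above.
    row = {
        "page_count": 142,
        "char_count": 284512,
        "creation_date": "2026-01-15T09:30:00Z",
    }
    print(parse_creation_date(row["creation_date"]).isoformat())
    print(round(chars_per_page(row)))
```

For the example row this comes to roughly 2,004 characters per page.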

Related Actors

HTML to Markdown: web pages converted to clean markdown for the same pipelines
Wikipedia Scraper: extract Wikipedia articles by title or URL with metadata
Sitemap URL Discovery: find every PDF link on a site before bulk extraction

Open on Apify → Try it now (free tier available)