Pull plain text, page-by-page content, and full metadata from any PDF URL. Returns title, author, creation date, page count, character count, and flags for scanned (image-only) or encrypted documents. Built for bulk: pass it 10,000 URLs and it returns structured rows. Ideal for legal discovery, RAG ingestion, compliance audits, and document workflow automation.

    # Start a run via the Apify API
    curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~pdf-extractor/runs?token=YOUR_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "pdfUrls": [
          "https://example.com/report-2026.pdf",
          "https://example.com/contract-v3.pdf",
          "https://example.com/whitepaper.pdf"
        ],
        "extractText": true,
        "extractMetadata": true,
        "perPageText": false
      }'

    # Or use with AI agents via MCP:
    # https://mcp.apify.com?tools=santamaria-automations/pdf-extractor

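
For callers who prefer Python to curl, the same request can be sketched with the standard library alone. The endpoint and input fields are copied from the curl example; the helper name `build_run_request` is hypothetical and not part of the actor or any Apify SDK. The sketch only builds the request, so it runs without a token or network access.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example above.
RUNS_ENDPOINT = "https://api.apify.com/v2/acts/santamaria-automations~pdf-extractor/runs"


def build_run_request(pdf_urls, token, per_page_text=False):
    """Hypothetical helper: build (but do not send) the POST that starts a run."""
    payload = {
        "pdfUrls": list(pdf_urls),
        "extractText": True,
        "extractMetadata": True,
        "perPageText": per_page_text,
    }
    # The token goes in the query string, as in the curl example.
    url = RUNS_ENDPOINT + "?" + urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    ["https://example.com/report-2026.pdf", "https://example.com/contract-v3.pdf"],
    token="YOUR_TOKEN",
)
```

Passing `request` to `urllib.request.urlopen` would send it and start the run. Keeping construction separate from sending makes the payload easy to inspect before a large batch is submitted.
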
| Field | Type | Example |
|---|---|---|
| source_url | string | https://example.com/report.pdf |
| title | string | Annual Report 2026 |
| author | string | Acme Corp |
| page_count | integer | 142 |
| char_count | integer | 284512 |
| text | string | Executive Summary... |
| creation_date | string | 2026-01-15T09:30:00Z |
| is_scanned | boolean | false |
| is_encrypted | boolean | false |
| file_size_bytes | integer | 4218940 |
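
One way to use the `is_scanned` and `is_encrypted` flags downstream is to triage rows before ingestion. The sketch below assumes each output row arrives as a dictionary keyed by the field names in the table; the `triage` function, the bucket names, and the sample rows are all illustrative and not produced by the actor.

```python
def triage(rows):
    """Sort source URLs into buckets based on the boolean flags in each row."""
    buckets = {"ready": [], "needs_ocr": [], "encrypted": []}
    for row in rows:
        if row.get("is_encrypted"):
            # Encrypted documents are checked first and set aside.
            buckets["encrypted"].append(row["source_url"])
        elif row.get("is_scanned"):
            # Image-only PDFs are candidates for a separate OCR step.
            buckets["needs_ocr"].append(row["source_url"])
        else:
            buckets["ready"].append(row["source_url"])
    return buckets


# Made-up sample rows that follow the field names in the table.
sample_rows = [
    {"source_url": "https://example.com/report.pdf", "is_scanned": False, "is_encrypted": False},
    {"source_url": "https://example.com/scan.pdf", "is_scanned": True, "is_encrypted": False},
    {"source_url": "https://example.com/locked.pdf", "is_scanned": False, "is_encrypted": True},
]

result = triage(sample_rows)
```

Here `result["ready"]` holds the URLs whose `text` can go straight into an index, while the other two buckets can be routed to OCR or manual review.
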