PDF Text Extractor: Bulk PDF to Text and Metadata, $1 per 1,000 PDFs

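Extract plain text, per-page content, and full metadata from any PDF URL. Returns the title, author, creation date, page count, and total character count, plus flags for scanned (image-only) or encrypted documents. Built for bulk: feed in 10,000 URLs and get back structured rows. Suited to legal e-discovery, RAG ingestion, compliance audits, and document workflow automation.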

Open on Apify → Try it now

Price: $0.001/PDF
Memory: 128 MB
Coverage: any URL
Output fields: 12+
Proxy: Apify datacenter
Technology: HTTP + pdfcpu
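
As a quick check on the pricing figures, the sketch below multiplies the listed per-PDF price by a batch size. It assumes the $0.001 per-PDF price is the only charge being estimated; `run_cost` is a hypothetical helper, not part of the Actor. `Decimal` is used to avoid floating-point rounding on currency values.

```python
from decimal import Decimal

# Per-PDF price in USD, taken from the pricing figure on this page.
PRICE_PER_PDF = Decimal("0.001")


def run_cost(pdf_count):
    """Estimated extraction cost in USD for a batch of pdf_count PDFs."""
    return PRICE_PER_PDF * pdf_count


# 1,000 PDFs matches the $1 headline price; 10,000 URLs comes to $10.
print(run_cost(1_000), run_cost(10_000))
```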

What you get

Page text: full extracted text, plus a per-page array for paginated workflows
Metadata: title, author, subject, creator, producer, creation_date, modification_date
Statistics: page_count, char_count, word_count, file_size_bytes
Flags: is_scanned (image-only PDF), is_encrypted, is_form
Source: source_url, content_type, http_status, scraped_at

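Top use cases

Legal e-discovery. Extract text in bulk from court filings, contracts, and pleadings for keyword search and review
RAG ingestion. Feed PDF reports, white papers, and manuals straight into a vector database
Compliance audits. Scan thousands of policy PDFs for required clauses, signatures, or regulatory keywords
Research aggregation. Extract abstracts, authors, and full text from academic preprint URLs
Invoice and receipt parsing. Extract structured text from vendor PDFs in accounts payable pipelines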
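
The compliance-audit case can be sketched as a keyword check over extracted records. This is a minimal illustration, assuming each record is a dict carrying the `source_url` and `text` fields listed on this page; `find_missing_terms` and the sample records are hypothetical, and plain substring matching stands in for whatever matching a real audit would need.

```python
def find_missing_terms(records, required_terms):
    """Map each source_url to the required terms absent from its text.

    Matching is a case-insensitive substring check. Records whose text
    contains every required term are left out of the result.
    """
    missing = {}
    for record in records:
        text = (record.get("text") or "").lower()
        absent = [term for term in required_terms if term.lower() not in text]
        if absent:
            missing[record["source_url"]] = absent
    return missing


# Hypothetical sample records shaped like the Actor's output rows.
sample_records = [
    {"source_url": "https://example.com/policy-a.pdf",
     "text": "This policy covers data retention and incident response."},
    {"source_url": "https://example.com/policy-b.pdf",
     "text": "This policy covers data retention only."},
]

gaps = find_missing_terms(sample_records, ["Data Retention", "Incident Response"])
print(gaps)
```

Records flagged as scanned or encrypted would typically have little or no text, so they would show up here as missing every term; it may be worth filtering them out first.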

API example

# Start a run via the Apify API
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~pdf-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrls": [
      "https://example.com/report-2026.pdf",
      "https://example.com/contract-v3.pdf",
      "https://example.com/whitepaper.pdf"
    ],
    "extractText": true,
    "extractMetadata": true,
    "perPageText": false
  }'

# Or use it with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/pdf-extractor
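
For callers working in Python, the same request can be assembled with the standard library. This is a sketch rather than an official client: the endpoint and input field names are copied from the curl example above, while `build_run_request` is a hypothetical helper. It builds the request without sending it, so the payload can be inspected first.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example on this page.
ACTOR_RUNS_URL = (
    "https://api.apify.com/v2/acts/santamaria-automations~pdf-extractor/runs"
)


def build_run_request(token, pdf_urls, per_page_text=False):
    """Build (but do not send) the POST request that starts an Actor run."""
    payload = {
        "pdfUrls": list(pdf_urls),
        "extractText": True,
        "extractMetadata": True,
        "perPageText": per_page_text,
    }
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{ACTOR_RUNS_URL}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "YOUR_TOKEN",
    ["https://example.com/report-2026.pdf", "https://example.com/contract-v3.pdf"],
)
# To actually start the run, pass the request to urllib.request.urlopen(request).
```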

Integrations

n8n, Make, Zapier: trigger runs and process records via webhooks
AI agents (MCP): Claude Desktop, Cursor, VS Code, LangChain, LlamaIndex
Python, Node.js: programmatic access with the Apify SDK
Google Sheets, Airtable: bulk input in, structured data out

Output fields

Field	Type	Example
source_url	string	https://example.com/report.pdf
title	string	2026 Annual Report
author	string	Acme Corp
page_count	integer	142
char_count	integer	284,512
text	string	Executive summary...
creation_date	string	2026-01-15T09:30:00Z
is_scanned	boolean	false
is_encrypted	boolean	false
file_size_bytes	integer	4,218,940
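
The `is_scanned` and `is_encrypted` flags lend themselves to routing records before downstream processing. The sketch below assumes each output row is a dict with the fields in the table above; `split_by_extractability` and the sample rows are hypothetical, and sending scanned files to OCR is one possible follow-up, not something the Actor is stated to do.

```python
def split_by_extractability(records):
    """Split records into (ready, needs_ocr, locked) using the flag fields.

    Encrypted documents are checked first, since an encrypted file may
    also be reported as having no extractable text.
    """
    ready, needs_ocr, locked = [], [], []
    for record in records:
        if record.get("is_encrypted"):
            locked.append(record)
        elif record.get("is_scanned"):
            needs_ocr.append(record)
        else:
            ready.append(record)
    return ready, needs_ocr, locked


# Hypothetical sample rows using field names from the output table.
rows = [
    {"source_url": "https://example.com/report.pdf", "page_count": 142,
     "is_scanned": False, "is_encrypted": False},
    {"source_url": "https://example.com/scan.pdf", "page_count": 3,
     "is_scanned": True, "is_encrypted": False},
    {"source_url": "https://example.com/locked.pdf", "page_count": 10,
     "is_scanned": False, "is_encrypted": True},
]

ready, needs_ocr, locked = split_by_extractability(rows)
print(len(ready), len(needs_ocr), len(locked))
```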
