Website Content Crawler: Clean Markdown for AI and RAG Pipelines

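Extract clean Markdown and plain text from any website, optimized for AI ingestion, RAG pipelines, and LLM context windows. Readability-style main-content extraction strips navigation, footers, sidebars, and ads, so your AI receives only the content that matters. Use a flat fetch (depth 0) for a list of URLs, or crawl a whole site up to a maximum depth of 5. Up to 20 workers run in parallel.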


Pricing: $1 per 1,000 pages + $0.25 start fee
Memory: 128 MB
Output: Markdown + text
Concurrency: up to 20 workers
Crawl depth: 0 to 5 levels
Engine: HTTP-only Go
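
To make the depth setting concrete, here is a minimal sketch of a depth-limited, breadth-first crawl order. It is an illustration only, not the Actor's actual implementation (which is written in Go): the `crawl_order` function and the in-memory `SITE` link map are hypothetical stand-ins for real HTTP fetches.

```python
from collections import deque

MAX_DEPTH_LIMIT = 5  # the listing states crawl depth ranges from 0 to 5


def crawl_order(start_urls, links, max_depth):
    """Return (url, depth) pairs in breadth-first order, up to max_depth.

    `links` is a hypothetical map of url -> list of internal links,
    standing in for pages fetched over HTTP.
    """
    if not 0 <= max_depth <= MAX_DEPTH_LIMIT:
        raise ValueError("max_depth must be between 0 and 5")
    unique_starts = list(dict.fromkeys(start_urls))  # drop duplicate start URLs
    seen = set(unique_starts)
    queue = deque((url, 0) for url in unique_starts)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # a flat fetch (max_depth=0) never follows links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical site structure used only for illustration.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://docs.example.com/b",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/a/deep"],
}

flat = crawl_order(["https://docs.example.com/"], SITE, 0)
two_levels = crawl_order(["https://docs.example.com/"], SITE, 2)
```

With depth 0 only the start URLs are visited; each additional level follows internal links one hop further from the start URLs.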

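What you get per page

Clean Markdown: up to 50,000 characters, preserving headings, lists, links, and code blocks
Plain text: up to 10,000 characters, with all HTML removed, ready for embedding
Page metadata: URL, title (og:title or HTML title), meta description, word count
Content type detection: article, blog, documentation, or generic, for RAG routing
Crawl context: depth, start URL, discovered internal links, HTTP status code, scraped_at timestamp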

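Main use cases

RAG knowledge bases. Crawl a company documentation site and load the clean Markdown into a vector store
LLM grounding. Ground agents in up-to-date content from blogs, news, and product documentation
AI summarization pipelines. Extract article text in bulk for summarization or topic clustering
Competitor content analysis. Crawl competitor blogs and product pages for structured analysis
Offline reading. Convert web pages to Markdown in bulk for archiving or static-site rebuilds
ML training data. Build a clean text corpus from a curated list of authoritative sources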
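
For the RAG knowledge base case, the Markdown usually needs to be split into chunks before embedding. The following is one possible sketch, not part of the Actor: the `split_by_heading` helper and its `max_chars` default are assumptions chosen for illustration. It starts a new chunk at each Markdown heading and ignores heading-like lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_heading(markdown, max_chars=2000):
    """Split Markdown into chunks that each begin at a heading.

    Sections longer than max_chars are cut into fixed-size pieces.
    """
    sections, current = [], []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # track fenced code blocks
        elif not in_fence and HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


doc = "# Intro\nHello\n\n## Setup\n```\n# not a heading\n```\nDone"
chunks = split_by_heading(doc)
```

Splitting on headings keeps each chunk topically coherent; the fixed-size fallback is a crude guard for very long sections and could be replaced with sentence-aware splitting.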

API example

# Fetch a set of documentation pages directly (no crawling)
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~website-content-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      "https://docs.example.com/api/overview",
      "https://docs.example.com/api/authentication"
    ],
    "maxDepth": 0,
    "extractMainContent": true
  }'

# Or use it with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/website-content-crawler
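
The same request can be assembled in Python with only the standard library. This is a sketch mirroring the curl call above: the endpoint, Actor ID, and input fields are copied from that example, while the `build_run_request` helper is a hypothetical name. The request is built but not sent, so it can be inspected first.

```python
import json
import urllib.parse
import urllib.request

# Actor ID as it appears in the curl example.
ACTOR_ID = "santamaria-automations~website-content-crawler"


def build_run_request(token, start_urls, max_depth=0, extract_main_content=True):
    """Build (but do not send) the POST request that starts an Actor run."""
    query = urllib.parse.urlencode({"token": token})
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?{query}"
    payload = {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "extractMainContent": extract_main_content,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "YOUR_TOKEN",
    [
        "https://docs.example.com/api/overview",
        "https://docs.example.com/api/authentication",
    ],
)

# To start the run (requires a valid token and network access):
# with urllib.request.urlopen(request) as response:
#     run = json.load(response)
```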

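Integrations

n8n, Make, Zapier: trigger crawls and push Markdown to your vector database
AI agents (MCP): Claude Desktop, Cursor, VS Code, LangChain, LlamaIndex
Python, Node.js: programmatic access through the Apify SDK
Pinecone, Weaviate, Qdrant: feed Markdown directly into embedding pipelines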

Output fields

Field	Type	Description
url	string	URL of the crawled page
title	string	Page title
description	string	Meta description
markdown	string	Clean Markdown, up to 50,000 characters
text	string	Plain text, up to 10,000 characters
word_count	integer	Word count of the plain text
content_type	string	article, blog, documentation, or generic
depth	integer	Crawl depth (0 = start URL)
status_code	integer	HTTP status code
scraped_at	string	ISO 8601 UTC timestamp
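
As an illustration of how these fields might be consumed, the sketch below groups records by `content_type` for RAG routing and skips pages that did not return HTTP 200 or have no text. The `route_records` helper and the `SAMPLE` records are hypothetical; only the field names come from the table above.

```python
from collections import defaultdict


def route_records(records):
    """Group successfully fetched, non-empty records by content_type."""
    routed = defaultdict(list)
    for record in records:
        if record.get("status_code") != 200:
            continue  # skip pages that did not return HTTP 200
        if not record.get("text"):
            continue  # nothing to embed
        routed[record.get("content_type") or "generic"].append(record)
    return dict(routed)


# Hypothetical records using the field names from the table above.
SAMPLE = [
    {"url": "https://docs.example.com/api/overview", "text": "Overview of the API",
     "content_type": "documentation", "depth": 0, "status_code": 200},
    {"url": "https://example.com/blog/launch", "text": "We launched",
     "content_type": "blog", "depth": 1, "status_code": 200},
    {"url": "https://example.com/missing", "text": "",
     "content_type": "article", "depth": 1, "status_code": 404},
]

groups = route_records(SAMPLE)
```

Each group could then be sent to its own index or chunking strategy, for example shorter chunks for documentation than for long-form articles.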
