Website Content Crawler: Clean Markdown for AI & RAG Pipelines

Extract clean Markdown and plain text from any website, optimized for AI ingestion, RAG pipelines, and LLM context windows. Readability-style main content extraction strips navigation, footers, sidebars, and ads so your AI gets only the content that matters. Use a flat fetch (maxDepth 0) for URL lists, or crawl entire sites up to 5 levels deep, with up to 20 parallel workers.

Pricing: $1/1k pages + $0.25 start
RAM: 128MB
Output: Markdown + text
Concurrency: Up to 20 workers
Crawl depth: 0 to 5 levels
Engine: HTTP-only Go
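
For budgeting, the listed pricing can be turned into a rough estimate. The sketch below assumes pages are billed pro rata at $1 per 1,000 and that the $0.25 start fee applies once per run; both are readings of the price line above rather than confirmed billing rules, and `estimate_cost_usd` is a hypothetical helper name.

```python
def estimate_cost_usd(pages: int, runs: int = 1) -> float:
    """Rough cost estimate from the listed pricing.

    Assumptions (not confirmed by the listing):
    - pages are billed pro rata at $1 per 1,000 pages
    - the $0.25 start fee is charged once per run
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if runs < 1:
        raise ValueError("runs must be at least 1")
    per_page_usd = 1.00 / 1000  # $1 per 1,000 pages
    start_fee_usd = 0.25        # charged per run start (assumed)
    return round(pages * per_page_usd + runs * start_fee_usd, 4)


if __name__ == "__main__":
    # One run over 5,000 pages: 5.00 for pages + 0.25 start fee
    print(estimate_cost_usd(5000))
```

Under these assumptions, a single 5,000-page run comes to $5.25, and splitting the same pages across several runs adds one start fee per extra run.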

What you get per page

Primary use cases

API example

# Flat fetch a list of documentation pages (no crawling)
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~website-content-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      "https://docs.example.com/api/overview",
      "https://docs.example.com/api/authentication"
    ],
    "maxDepth": 0,
    "extractMainContent": true
  }'

# Crawl a blog 2 levels deep with 10 workers
# {"startUrls":["https://blog.example.com"],"maxDepth":2,"maxConcurrency":10}

# Or use with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/website-content-crawler
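
The same request can be built from Python. This is a minimal sketch using only the standard library: it reuses the endpoint and input fields from the curl example above and checks the documented limits (depth 0 to 5, up to 20 workers) before sending. The helper names `build_input` and `start_run` are hypothetical, and the shape of the JSON response is not described here, so it is returned unparsed beyond JSON decoding.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
ACTOR_RUNS_URL = (
    "https://api.apify.com/v2/acts/"
    "santamaria-automations~website-content-crawler/runs"
)


def build_input(start_urls, max_depth=0, max_concurrency=None,
                extract_main_content=True):
    """Build the Actor input, enforcing the limits stated on this page."""
    urls = list(start_urls)
    if not urls:
        raise ValueError("at least one start URL is required")
    if not 0 <= max_depth <= 5:
        raise ValueError("maxDepth must be between 0 and 5")
    payload = {
        "startUrls": urls,
        "maxDepth": max_depth,
        "extractMainContent": extract_main_content,
    }
    if max_concurrency is not None:
        if not 1 <= max_concurrency <= 20:
            raise ValueError("maxConcurrency must be between 1 and 20")
        payload["maxConcurrency"] = max_concurrency
    return payload


def start_run(payload, token):
    """POST the input to the runs endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{ACTOR_RUNS_URL}?token={token}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Same settings as the blog crawl example: 2 levels deep, 10 workers.
    body = build_input(["https://blog.example.com"], max_depth=2,
                       max_concurrency=10)
    print(json.dumps(body, indent=2))
    # start_run(body, "YOUR_TOKEN")  # uncomment with a real token
```

Validating on the client side avoids paying a start fee for a run that would be rejected or clamped because of an out-of-range depth or worker count.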

Integrations

Output fields

Field         Type     Description
url           string   URL of the crawled page
title         string   Page title (og:title or HTML title)
description   string   Meta description
markdown      string   Clean Markdown, up to 50,000 chars
text          string   Plain text, up to 10,000 chars
word_count    integer  Word count of plain text
content_type  string   article, blog, documentation, generic
depth         integer  Crawl depth (0 = start URL)
status_code   integer  HTTP status code
scraped_at    string   ISO 8601 UTC timestamp
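
For a RAG pipeline, these fields are usually filtered and split before embedding. The sketch below is one possible approach, not part of the Actor: it keeps records with `status_code` 200 and a minimum `word_count`, then splits `markdown` on blank lines into chunks of bounded size. The helper names, the 20-word threshold, and the 2,000-character chunk size are illustrative assumptions.

```python
def chunk_markdown(markdown, max_chars=2000):
    """Pack blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than max_chars is hard-split.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks


def records_to_chunks(records, min_words=20, max_chars=2000):
    """Turn crawler output records into chunk dicts ready for embedding."""
    out = []
    for rec in records:
        # Skip failed fetches and near-empty pages (thresholds are assumptions).
        if rec.get("status_code") != 200:
            continue
        if rec.get("word_count", 0) < min_words:
            continue
        pieces = chunk_markdown(rec.get("markdown", ""), max_chars)
        for index, piece in enumerate(pieces):
            out.append({
                "url": rec["url"],
                "title": rec.get("title", ""),
                "chunk_index": index,
                "text": piece,
            })
    return out


if __name__ == "__main__":
    sample = [{
        "url": "https://docs.example.com/api/overview",
        "title": "Overview",
        "markdown": "# Overview\n\nFirst paragraph.\n\nSecond paragraph.",
        "word_count": 120,
        "status_code": 200,
    }]
    for chunk in records_to_chunks(sample, max_chars=30):
        print(chunk["chunk_index"], repr(chunk["text"]))
```

Keeping `url` and `title` on every chunk lets retrieved passages be cited back to their source page.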

Related Actors

Open on Apify → Try it now (free tier available)