Extract clean Markdown and plain text from any website, optimized for AI ingestion, RAG pipelines, and LLM context windows. Readability-style main-content extraction strips navigation, footers, sidebars, and ads so your AI gets only the content that matters. Use a flat fetch (`maxDepth: 0`) for URL lists, or crawl entire sites up to depth 5, with up to 20 parallel workers.

```bash
# Flat fetch a list of documentation pages (no crawling)
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~website-content-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      "https://docs.example.com/api/overview",
      "https://docs.example.com/api/authentication"
    ],
    "maxDepth": 0,
    "extractMainContent": true
  }'

# Crawl a blog 2 levels deep with 10 workers
# {"startUrls":["https://blog.example.com"],"maxDepth":2,"maxConcurrency":10}

# Or use with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/website-content-crawler
```

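
The same request can be issued from Python. The following is a minimal sketch using only the standard library. The endpoint, actor ID, and input fields (`startUrls`, `maxDepth`, `maxConcurrency`, `extractMainContent`) are taken from the curl example above. The helper names `build_run_request` and `start_run` are hypothetical, and the shape of the JSON response is not described here, so the sketch returns it undecoded beyond `json.load`.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and actor ID are copied from the curl example above.
ACTOR_ID = "santamaria-automations~website-content-crawler"
RUNS_URL = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs"


def build_run_request(token, start_urls, max_depth=0, max_concurrency=None,
                      extract_main_content=True):
    """Build (but do not send) the POST request that starts a run."""
    payload = {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "extractMainContent": extract_main_content,
    }
    # Only send maxConcurrency when the caller sets it explicitly.
    if max_concurrency is not None:
        payload["maxConcurrency"] = max_concurrency
    url = f"{RUNS_URL}?{urllib.parse.urlencode({'token': token})}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_run(token, start_urls, **options):
    """Send the request and return the decoded JSON response."""
    request = build_run_request(token, start_urls, **options)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, `start_run("YOUR_TOKEN", ["https://blog.example.com"], max_depth=2, max_concurrency=10)` would mirror the blog crawl shown in the comments above. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.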
| Field | Type | Description |
|---|---|---|
| url | string | URL of the crawled page |
| title | string | Page title (og:title or HTML title) |
| description | string | Meta description |
| markdown | string | Clean Markdown, up to 50,000 chars |
| text | string | Plain text, up to 10,000 chars |
| word_count | integer | Word count of plain text |
| content_type | string | One of: article, blog, documentation, generic |
| depth | integer | Crawl depth (0 = start URL) |
| status_code | integer | HTTP status code |
| scraped_at | string | ISO 8601 UTC timestamp |
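
To illustrate how these fields might be consumed in a RAG pipeline, here is a rough sketch that loads output items into a typed record, keeps only successful pages, and splits the `markdown` field into chunks. It assumes each item is a JSON object carrying the fields from the table above. The names `PageRecord`, `usable_pages`, and `chunk_markdown`, along with the `min_words` and `max_chars` thresholds, are illustrative choices and not part of the actor.

```python
from dataclasses import dataclass, fields


@dataclass
class PageRecord:
    """One output item, mirroring the field table above."""
    url: str
    title: str
    description: str
    markdown: str
    text: str
    word_count: int
    content_type: str
    depth: int
    status_code: int
    scraped_at: str


FIELD_NAMES = {f.name for f in fields(PageRecord)}


def usable_pages(items, min_words=50):
    """Keep pages that returned HTTP 200 and have at least min_words words."""
    records = [
        # Ignore any keys that are not listed in the table.
        PageRecord(**{k: v for k, v in item.items() if k in FIELD_NAMES})
        for item in items
    ]
    return [r for r in records if r.status_code == 200 and r.word_count >= min_words]


def chunk_markdown(markdown, max_chars=2000):
    """Group blank-line-separated paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for paragraph in markdown.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on blank lines keeps Markdown paragraphs and headings intact, which tends to give more coherent chunks than cutting at a fixed character offset. An item that lacks one of the table's fields raises a `TypeError` in this sketch.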