Website Content Crawler: Clean Markdown for AI & RAG Pipelines

Extract clean Markdown and plain text from any website, optimized for AI ingestion, RAG pipelines, and LLM context windows. Readability-style main-content extraction strips navigation, footers, sidebars, and ads, so your AI gets only the content that matters. Use a flat fetch (maxDepth 0) for URL lists, or crawl entire sites up to 5 levels deep, with up to 20 parallel workers.

Open on Apify → Try it now

Pricing: $1/1k pages + $0.25 start
RAM: 128MB
Output: Markdown + text
Concurrency: Up to 20 workers
Crawl depth: 0 to 5 levels
Engine: HTTP-only Go

What you get per page

Clean Markdown: up to 50,000 characters, with headings, lists, links, and code blocks preserved
Plain text: up to 10,000 characters with all HTML removed, ready for embeddings
Page metadata: URL, title (og:title or HTML title), meta description, word count
Content type detection: article, blog, documentation, or generic, useful for RAG routing
Crawl context: depth, start URL, internal links discovered, HTTP status code, scraped_at timestamp
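
Because each page carries a detected content type, a downstream pipeline can send pages to different indexes. The following is a minimal Python sketch, assuming each record is a dict with the fields listed above; the collection names and the route_records helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical mapping from detected content type to a vector-store collection.
ROUTES = {
    "documentation": "docs_index",
    "article": "editorial_index",
    "blog": "editorial_index",
}
DEFAULT_COLLECTION = "general_index"  # used for "generic" or unexpected values


def route_records(records):
    """Group crawler records by target collection based on content_type."""
    grouped = defaultdict(list)
    for record in records:
        collection = ROUTES.get(record.get("content_type"), DEFAULT_COLLECTION)
        grouped[collection].append(record)
    return dict(grouped)
```

Records with a missing or unrecognized content_type fall back to the default collection rather than being dropped.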

Primary use cases

RAG knowledge bases. Crawl company documentation sites and feed clean Markdown into vector stores
LLM grounding. Feed agents with up-to-date content from blog posts, news, and product docs
AI summarization pipelines. Extract article text for batch summarization or topic clustering
Competitor content analysis. Crawl competitor blogs and product pages for structured analysis
Offline reading. Bulk-convert web pages to Markdown for archival or static-site rebuilds
ML training data. Build clean text corpora from a curated list of authoritative sources
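
For the RAG use case, the Markdown usually has to be split into chunks before embedding. One possible approach, sketched below in Python, is to start a new chunk at every heading and whenever a size budget is exceeded. The function name chunk_markdown and the 2,000-character default are assumptions for illustration, not something the crawler prescribes.

```python
import re

# ATX-style Markdown heading: 1 to 6 "#" characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(markdown, max_chars=2000):
    """Split Markdown into chunks that begin at headings.

    A chunk is also closed once adding the next line would exceed max_chars,
    so a single very long line can still produce an oversized chunk.
    """
    chunks, current, size = [], [], 0
    in_fence = False  # "#" lines inside fenced code blocks are not headings
    for line in markdown.splitlines():
        starts_section = not in_fence and bool(HEADING.match(line))
        too_big = size + len(line) + 1 > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current).strip())
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Splitting on headings keeps each chunk topically coherent, which tends to help retrieval; the size cap is a safeguard for long sections and may still cut through a code block.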

API example

# Flat fetch a list of documentation pages (no crawling)
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~website-content-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [
      "https://docs.example.com/api/overview",
      "https://docs.example.com/api/authentication"
    ],
    "maxDepth": 0,
    "extractMainContent": true
  }'

# Crawl a blog 2 levels deep with 10 workers
# {"startUrls":["https://blog.example.com"],"maxDepth":2,"maxConcurrency":10}

# Or use with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/website-content-crawler
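
The same request can be prepared from Python. The sketch below uses only the standard library and mirrors the curl call above. The function name build_run_request and the client-side range checks are illustrative: the ranges come from the limits quoted on this page (depth 0 to 5, up to 20 workers), and the lower bound of 1 worker is an assumption.

```python
import json
import urllib.parse
import urllib.request

ACTOR_ID = "santamaria-automations~website-content-crawler"
API_BASE = "https://api.apify.com/v2/acts"


def build_run_request(token, start_urls, max_depth=0, max_concurrency=None,
                      extract_main_content=True):
    """Build (without sending) the POST request that starts a crawler run."""
    if not 0 <= max_depth <= 5:
        raise ValueError("maxDepth must be between 0 and 5")
    if max_concurrency is not None and not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")

    payload = {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "extractMainContent": extract_main_content,
    }
    if max_concurrency is not None:
        payload["maxConcurrency"] = max_concurrency

    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_BASE}/{ACTOR_ID}/runs?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen() would send it; keeping construction separate from sending makes the payload easy to inspect or log first.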

Integrations

n8n, Make, Zapier: trigger crawls and push Markdown into your vector DB
AI Agents (MCP): Claude Desktop, Cursor, VS Code, LangChain, LlamaIndex
Python, Node.js: Apify API client libraries for programmatic access
Pinecone, Weaviate, Qdrant: pipe Markdown directly into your embedding pipeline

Output fields

Field	Type	Description
url	string	URL of the crawled page
title	string	Page title (og:title or HTML title)
description	string	Meta description
markdown	string	Clean Markdown, up to 50,000 chars
text	string	Plain text, up to 10,000 chars
word_count	integer	Word count of plain text
content_type	string	article, blog, documentation, generic
depth	integer	Crawl depth (0 = start URL)
status_code	integer	HTTP status code
scraped_at	string	ISO 8601 UTC timestamp
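
When consuming results in Python, it can help to load each item into a typed record. The sketch below assumes items arrive as dicts keyed by the field names in the table above and that scraped_at looks like 2024-05-01T12:00:00Z; PageRecord and parse_record are hypothetical names, not part of the Actor.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PageRecord:
    url: str
    title: str
    description: str
    markdown: str
    text: str
    word_count: int
    content_type: str
    depth: int
    status_code: int
    scraped_at: datetime

    @property
    def ok(self):
        """True when the page was fetched with a 2xx HTTP status."""
        return 200 <= self.status_code < 300


def parse_record(item):
    """Convert one raw result item (a dict) into a PageRecord."""
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so it is rewritten as an explicit UTC offset first.
    scraped_at = datetime.fromisoformat(item["scraped_at"].replace("Z", "+00:00"))
    return PageRecord(
        url=item["url"],
        title=item.get("title", ""),
        description=item.get("description", ""),
        markdown=item.get("markdown", ""),
        text=item.get("text", ""),
        word_count=int(item.get("word_count", 0)),
        content_type=item.get("content_type", "generic"),
        depth=int(item.get("depth", 0)),
        status_code=int(item["status_code"]),
        scraped_at=scraped_at,
    )
```

Filtering on record.ok before embedding is one simple way to keep error pages out of an index.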

Related Actors

HTML to Markdown: single-page HTML conversion without crawling
Website Email Scraper: same crawl engine, contact-extraction output
Sitemap URL Discovery: feed start URLs from robots.txt and sitemap indexes

Open on Apify → Try it now (free tier available)