Convert any HTML page to clean, LLM-ready Markdown. The actor strips page chrome (navigation, ads, sidebars) and preserves headings, tables, fenced code blocks, images with alt text, and links. For each page it returns the title, the primary content as Markdown, the word count, arrays of extracted images and links, and the inferred canonical URL. It is built for batch work: feed it 10,000 article URLs and it returns one row per page. Typical uses are LLM training corpora, RAG ingestion, documentation mirrors, and content monitoring.

```bash
# Start a run via the Apify API
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~html-to-markdown/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://blog.example.com/post-1",
      "https://docs.example.com/getting-started",
      "https://news.example.com/article-2026"
    ],
    "extractImages": true,
    "extractLinks": true,
    "mainContentOnly": true
  }'

# Or use with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/html-to-markdown
```

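
For larger batches it can be easier to build the request body in code than to hand-edit JSON. The following is a minimal Python sketch of the same call as the curl example, using only the standard library. The endpoint and input keys come from that example; the function names `build_run_input` and `start_run` are hypothetical, and the shape of the API response is not described here, so the sketch simply returns the decoded JSON as-is.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
ACTOR_RUNS_URL = (
    "https://api.apify.com/v2/acts/"
    "santamaria-automations~html-to-markdown/runs"
)


def build_run_input(urls, extract_images=True, extract_links=True,
                    main_content_only=True):
    """Build the JSON input from the curl example (hypothetical helper).

    Strips whitespace, drops blank lines and removes duplicate URLs
    while keeping the original order.
    """
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return {
        "urls": cleaned,
        "extractImages": extract_images,
        "extractLinks": extract_links,
        "mainContentOnly": main_content_only,
    }


def start_run(token, run_input):
    """POST the input to the runs endpoint and return the decoded JSON.

    The response structure is an assumption left to the caller to inspect.
    """
    query = urllib.parse.urlencode({"token": token})
    request = urllib.request.Request(
        f"{ACTOR_RUNS_URL}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical use would be to read one URL per line from a file, pass the lines to `build_run_input`, and hand the result to `start_run` together with your token.
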
| Field | Type | Example |
|---|---|---|
| source_url | string | https://blog.example.com/post-1 |
| title | string | Building RAG Pipelines |
| main_content | string | # Building RAG Pipelines\n\nA practical guide... |
| word_count | integer | 1842 |
| reading_time_minutes | integer | 8 |
| language | string | en |
| canonical_url | string | https://blog.example.com/post-1 |
| images | array | [{"src":"...","alt":"diagram"}] |
| links | array | [{"href":"...","text":"docs"}] |
| scraped_at | string | 2026-06-13T10:15:42Z |
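
As an illustration of how these rows might feed a RAG ingestion step, here is a small Python sketch that turns output rows into JSON Lines records. It assumes each row is a dict with the field names from the table above; the function name `to_jsonl_records`, the `min_words` threshold, and the record layout (`id`, `text`, and so on) are hypothetical choices, not part of the actor.

```python
import json


def to_jsonl_records(rows, min_words=50):
    """Yield one JSON line per row with at least `min_words` words.

    `rows` is an iterable of dicts using the field names from the
    output table. The cutoff and record layout are illustrative.
    """
    for row in rows:
        # Treat a missing or null word_count as zero words.
        if (row.get("word_count") or 0) < min_words:
            continue
        record = {
            # Prefer the canonical URL as a stable identifier.
            "id": row.get("canonical_url") or row["source_url"],
            "title": row.get("title", ""),
            "text": row["main_content"],
            "language": row.get("language"),
            "scraped_at": row.get("scraped_at"),
        }
        yield json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    example_rows = [
        {
            "source_url": "https://blog.example.com/post-1",
            "title": "Building RAG Pipelines",
            "main_content": "# Building RAG Pipelines\n\nA practical guide...",
            "word_count": 1842,
            "language": "en",
            "canonical_url": "https://blog.example.com/post-1",
            "scraped_at": "2026-06-13T10:15:42Z",
        }
    ]
    for line in to_jsonl_records(example_rows):
        print(line)
```

Keying records on `canonical_url` rather than `source_url` is one way to avoid storing the same article twice when several input URLs resolve to the same page.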