HTML to Markdown Converter: Bulk Web Pages to Clean MD, $1 per 1,000 Pages

Convert any HTML page to clean, LLM-ready Markdown. Strips page chrome (navs, ads, sidebars) and preserves headings, tables, fenced code blocks, images with alt text, and links. Returns the page title, primary content as Markdown, word count, extracted image and link arrays, and the canonical URL. Built for batch: feed it 10,000 article URLs and it returns one row per page. Suited to LLM training corpora, RAG ingestion, documentation mirrors, and content monitoring.

Pricing: $0.001/page
RAM: 128 MB
Coverage: Any URL
Output fields: 10+
Proxy: Apify datacenter
Tech: HTTP + Readability

What you get

Clean markdown: main_content as GitHub-flavored Markdown, no chrome
Structure preserved: headings, lists, tables, fenced code blocks, blockquotes
Extracted assets: images[] with alt text, links[] with anchor and href
Stats: word_count, char_count, reading_time_minutes
Metadata: title, meta_description, canonical_url, language, scraped_at

Primary use cases

LLM training corpora. Convert thousands of blog posts and docs into clean markdown for fine-tuning
RAG ingestion. Feed cleaned page content into vector databases without HTML noise
Documentation mirrors. Snapshot competitor docs or knowledge bases into structured markdown
Content monitoring. Track changes to important pages over time using diff-friendly markdown
AI agent context. Give agents readable page content instead of raw HTML to save tokens
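
For the RAG ingestion case, a common preprocessing step is to split each page's main_content at its headings before embedding. The sketch below is one way to do that with the standard library only. The function name chunk_by_heading and the chunk shape are illustrative assumptions, not something the Actor provides.

```python
import re

# ATX heading: one to six '#' characters followed by whitespace and text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at each heading.

    Lines inside fenced code blocks are never treated as headings,
    so a '# comment' in a code sample stays with its section.
    """
    chunks = [{"heading": None, "lines": []}]  # text before the first heading
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence and HEADING.match(line):
            chunks.append({"heading": line.lstrip("#").strip(), "lines": []})
        else:
            chunks[-1]["lines"].append(line)

    result = [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
    # Drop the leading chunk if the page started directly with a heading.
    return [c for c in result if c["heading"] is not None or c["text"]]
```

Each returned chunk carries its heading, which can be stored as metadata next to the embedding so retrieved passages keep their context.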

API example

# Start a run via the Apify API
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~html-to-markdown/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://blog.example.com/post-1",
      "https://docs.example.com/getting-started",
      "https://news.example.com/article-2026"
    ],
    "extractImages": true,
    "extractLinks": true,
    "mainContentOnly": true
  }'

# Or use with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/html-to-markdown
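
The same request can be assembled in Python without extra dependencies. This is a minimal sketch that mirrors the curl call above using only the standard library. The helper names build_run_request and start_run are hypothetical, and the shape of the JSON response is not assumed here.

```python
import json
import urllib.parse
import urllib.request

# Actor ID exactly as it appears in the curl example.
ACTOR_ID = "santamaria-automations~html-to-markdown"


def build_run_request(urls, token):
    """Build (but do not send) the POST request that starts a run."""
    query = urllib.parse.urlencode({"token": token})
    endpoint = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?{query}"
    payload = {
        "urls": list(urls),
        "extractImages": True,
        "extractLinks": True,
        "mainContentOnly": True,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_run(urls, token):
    """Send the request and return the parsed JSON response as-is."""
    with urllib.request.urlopen(build_run_request(urls, token)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any pages are billed.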

Integrations

n8n, Make, Zapier: trigger runs and process records via webhook
AI Agents (MCP): Claude Desktop, Cursor, VS Code, LangChain, LlamaIndex
Python, Node.js: Apify SDK for programmatic access
Google Sheets, Airtable: bulk input in, structured data out

Output fields

Field	Type	Example
source_url	string	https://blog.example.com/post-1
title	string	Building RAG Pipelines
main_content	string	# Building RAG Pipelines\n\nA practical guide...
word_count	integer	1842
reading_time_minutes	integer	8
language	string	en
canonical_url	string	https://blog.example.com/post-1
images	array	[{"src":"...","alt":"diagram"}]
links	array	[{"href":"...","text":"docs"}]
scraped_at	string	2026-06-13T10:15:42Z
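
When consuming rows downstream, it can help to load them into a typed structure. The sketch below assumes each row is a dict with exactly the fields in the table above. PageRecord and parse_record are illustrative names, and the handling of missing images or links is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PageRecord:
    source_url: str
    title: str
    main_content: str
    word_count: int
    reading_time_minutes: int
    language: str
    canonical_url: str
    images: list
    links: list
    scraped_at: datetime


def parse_record(row: dict) -> PageRecord:
    """Convert one output row into a PageRecord with a timezone-aware timestamp."""
    return PageRecord(
        source_url=row["source_url"],
        title=row["title"],
        main_content=row["main_content"],
        word_count=int(row["word_count"]),
        reading_time_minutes=int(row["reading_time_minutes"]),
        language=row["language"],
        canonical_url=row["canonical_url"],
        # Assumption: treat absent or null arrays as empty lists.
        images=list(row.get("images") or []),
        links=list(row.get("links") or []),
        # Replace the trailing 'Z' so fromisoformat also works before Python 3.11.
        scraped_at=datetime.fromisoformat(row["scraped_at"].replace("Z", "+00:00")),
    )
```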

Related Actors

PDF Text Extractor: PDF companion for the same content pipelines
SEO Metadata Extractor: pull SEO metadata in parallel with markdown
Sitemap URL Discovery: enumerate every URL on a site before bulk conversion

Open on Apify → Try it now (free tier available)