Sitemap and URL Discovery: Find Every URL on Any Site, $1 per 1,000 Sitemaps

Parse robots.txt, sitemap.xml, and nested sitemap indexes to discover every published URL on any website. Returns each URL with its last-modified date, change frequency, and priority. Supports gzipped sitemaps and sitemap-index chains. Built for bulk: pass a list of domains and it returns the full URL inventory for each. Ideal for SEO audits, content monitoring, competitive intelligence, and pre-crawl URL discovery.

Open on Apify → Try it now

Pricing

$0.001/sitemap

RAM

128MB

Coverage

Any domain

Output fields

Proxy

Apify datacenter

Tech

Native XML + gzip

What you get

URL inventory: every from sitemap.xml and nested sitemap indexes
Per-URL metadata: lastmod, changefreq, priority, image_count, video_count
Sitemap source: sitemap_url, sitemap_type (urlset / sitemapindex), is_gzipped
robots.txt: declared sitemaps, allowed/disallowed paths, crawl_delay
Aggregates: total_urls, last_modified_max, scraped_at

Primary use cases

SEO audits. Compare declared URLs to crawled URLs to spot indexation gaps
Content monitoring. Diff sitemaps over time to detect new, changed, or removed pages
Competitive intelligence. Map a competitor's full content footprint by parsing their sitemap chain
Pre-crawl URL discovery. Hand the URL list to a downstream crawler for targeted scraping
Migration QA. Verify every URL in the old sitemap redirects correctly after a relaunch

API example

# Start a run via the Apify API
curl -X POST "https://api.apify.com/v2/acts/santamaria-automations~sitemap-url-discovery/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "domains": [
      "https://example.com",
      "https://competitor.com",
      "https://news-site.com"
    ],
    "maxUrlsPerSite": 10000,
    "followSitemapIndexes": true,
    "parseRobotsTxt": true
  }'

# Or use with AI agents via MCP:
# https://mcp.apify.com?tools=santamaria-automations/sitemap-url-discovery

Integrations

n8n, Make, Zapier: trigger runs and process records via webhook
AI Agents (MCP): Claude Desktop, Cursor, VS Code, LangChain, LlamaIndex
Python, Node.js: Apify SDK for programmatic access
Google Sheets, Airtable: bulk input in, structured data out

Output fields

Field	Type	Example
source_domain	string	example.com
sitemap_url	string	https://example.com/sitemap.xml
total_urls	integer	12,486
url	string	https://example.com/blog/post-1
lastmod	string	2026-06-10T14:00:00Z
changefreq	string	weekly
priority	number	0.8
robots_sitemaps	array	["https://example.com/sitemap.xml"]
is_gzipped	boolean	false
scraped_at	string	2026-06-13T10:15:42Z

Related Actors

HTML to Markdown: feed discovered URLs into a bulk markdown extractor
SEO Metadata Extractor: extract SEO metadata from each discovered URL
Broken Link Checker: validate every URL in the sitemap returns 200

Open on Apify → Try it now (free tier available)