How to extract all links from a website
Crawling a whole site for every internal and external URL — the right way with a sitemap, the brute-force way with Scrapy, and what to do when neither works.
Pulling every URL is two different problems, depending on the site. If it publishes a sitemap, you're done in one HTTP request. If it doesn't, you're crawling, which means rate limits, robots.txt, and de-duplication.
1. The sitemap (always check first)
Most sites publish /sitemap.xml or /sitemap_index.xml. One curl gets you a list of every URL the site wants indexed:
```bash
curl -s https://example.com/sitemap.xml | grep -oE 'https?://[^<]+' | sort -u
```
If the root sitemap is an index pointing to per-section sitemaps, follow those too. The Python package ultimate-sitemap-parser handles the index recursion automatically.
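Here's a minimal sketch of that, assuming the package is installed (pip install ultimate-sitemap-parser) and using its documented sitemap_tree_for_homepage entry point:

```python
from usp.tree import sitemap_tree_for_homepage

# Discovers the root sitemap (including Sitemap: lines in robots.txt),
# follows any nested sitemap indexes, and flattens everything into one list.
tree = sitemap_tree_for_homepage("https://example.com/")
urls = sorted({page.url for page in tree.all_pages()})
print("\n".join(urls))
```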
What you'll miss: pages the site doesn't want indexed (admin pages, paginated archives), and pages that just aren't in the sitemap because someone forgot. For those, you need to crawl.
2. Crawling with Scrapy
The right tool for a real site crawl. Scrapy handles concurrency, retries, and request de-duplication out of the box, and it will respect robots.txt once ROBOTSTXT_OBEY is on (project templates enable it by default, but standalone runspider scripts don't, so the spider below sets it explicitly).
```python
import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com"]
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "ROBOTSTXT_OBEY": True,  # no project settings under runspider, so opt in here
    }

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            yield {"url": url}  # record every link, internal or external
            if url.startswith("https://example.com"):
                yield response.follow(href, self.parse)  # but only crawl internal ones
```
Run it with scrapy runspider spider.py -O links.csv. The DOWNLOAD_DELAY keeps you polite; raise it for smaller sites that can't absorb the traffic.
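One flag worth knowing: uppercase -O overwrites links.csv on each run, while lowercase -o appends to the existing file, which silently duplicates rows across runs.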
3. Lightweight: requests + BeautifulSoup
For a single page or a tightly scoped crawl, Scrapy is overkill; requests + BeautifulSoup handle it in a short script:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://example.com")
soup = BeautifulSoup(r.text, "html.parser")
links = {urljoin(r.url, a["href"]) for a in soup.find_all("a", href=True)}
print("\n".join(sorted(links)))
```
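If "tightly scoped" means a handful of pages rather than one, the same two libraries stretch to a small breadth-first crawl. A sketch, with START and MAX_PAGES as stand-in values you'd tune:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque

START = "https://example.com"  # hypothetical crawl root
MAX_PAGES = 50                 # hard cap keeps the crawl tightly scoped

seen, found = set(), set()
queue = deque([START])
while queue and len(seen) < MAX_PAGES:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException:
        continue  # skip pages that time out or error
    soup = BeautifulSoup(r.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(r.url, a["href"]).split("#")[0]  # normalize away fragments
        found.add(link)
        if urlparse(link).netloc == urlparse(START).netloc:
            queue.append(link)  # only follow links on the same host

print("\n".join(sorted(found)))
```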
4. JavaScript-rendered sites
Scrapy and requests both see only the initial HTML response. Single-page apps that build their links client-side (React, Vue, or Svelte without SSR) hand both of them a near-empty shell. Swap in Playwright, which drives a real browser:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    page.wait_for_load_state("networkidle")
    links = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
```
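First run: pip install playwright, then playwright install chromium to download the browser binary. Note that the href property read inside the browser is already resolved to an absolute URL, so there's no urljoin step here.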
5. No-code: Chrome extensions
Link Klipper and Link Grabber both extract every link from the current page to CSV. Fine for one-off needs; not useful for crawling more than the page you're on.
Etiquette and law
- Always check robots.txt first. Ignoring it doesn't make scraping illegal by itself, but it shapes your legal exposure if there's ever a dispute.
- Throttle. A polite 0.5–1s delay between requests keeps you from looking like a denial-of-service attack.
- Set a real User-Agent that identifies you and how to contact you. "MyResearchScraper (contact: you@example.com)" gets way fewer blocks than "python-requests/2.31."
- Cache aggressively. If you're iterating on parsing logic, save the raw HTML locally and re-parse it instead of re-fetching; the helper sketched below does exactly that.
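A minimal helper that folds the last three habits together; the User-Agent string, html_cache directory, and one-second default delay are placeholder choices to adapt:

```python
import hashlib
import pathlib
import time

import requests

CACHE = pathlib.Path("html_cache")  # hypothetical local cache directory
HEADERS = {"User-Agent": "MyResearchScraper (contact: you@example.com)"}

def fetch(url: str, delay: float = 1.0) -> str:
    """Return a page's HTML, hitting the network only on a cache miss."""
    CACHE.mkdir(exist_ok=True)
    key = CACHE / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if key.exists():
        return key.read_text()  # re-parse runs never re-fetch
    time.sleep(delay)           # throttle real requests only
    r = requests.get(url, headers=HEADERS, timeout=10)
    r.raise_for_status()
    key.write_text(r.text)
    return r.text
```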