
Crawl a page

An honest primer on web crawlers. Fetch + parse is 80% of the work. We show the other 20%.


What a crawler actually is

A crawler does three things: fetch, parse, decide. Fetch pulls HTML over HTTP. Parse extracts structured data from it. Decide chooses whether to follow links, when to stop, and how fast to go. Modern crawlers add a storage layer and a queue — but at the core, it’s those three.
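The three steps can be sketched as one loop. This is a minimal illustration, not the playground's code: the regex-based `extractLinks`, the `decide` policy (same host, unseen, depth-limited), and the names are assumptions, and politeness is omitted here (see Politeness below).

```javascript
// Parse: pull hrefs out of HTML and resolve them against the page URL.
// Regex extraction is brittle; fine for a sketch, not for production.
function extractLinks(html, baseUrl) {
  return [...html.matchAll(/<a[^>]+href="([^"]+)"/gi)]
    .map((m) => {
      try { return new URL(m[1], baseUrl).href; } catch { return null; }
    })
    .filter(Boolean);
}

// Decide: follow only same-host links we haven't seen, up to maxDepth.
function decide(link, startHost, seen, depth, maxDepth) {
  return depth < maxDepth && !seen.has(link) &&
         new URL(link).hostname === startHost;
}

async function crawl(startUrl, maxDepth = 2) {
  const startHost = new URL(startUrl).hostname;
  const seen = new Set([startUrl]);
  const queue = [{ url: startUrl, depth: 0 }];
  const pages = [];
  while (queue.length) {
    const { url, depth } = queue.shift();
    const html = await (await fetch(url)).text();     // fetch
    pages.push({ url, html });
    for (const link of extractLinks(html, url)) {     // parse
      if (decide(link, startHost, seen, depth, maxDepth)) { // decide
        seen.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return pages;
}
```

The queue makes it breadth-first; swapping `shift()` for `pop()` would make it depth-first.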

The playground above runs a worker-sandboxed version of the first two. Try it: hit Run. You’ll see the fixture page’s title, h1s, and first 10 links printed as JSON.

Parsing HTML three ways

Approach | When to use | Trade-offs
Regex (what the playground uses) | Extracting one obvious pattern from known-shape HTML | Fast but brittle. Breaks on malformed markup. Do not parse arbitrary HTML this way.
DOMParser (in-browser or in a DOM-polyfilled worker) | Any client-side scrape, any reasonable HTML | Real DOM access; fine in a Worker with linkedom or a similar polyfill.
cheerio (Node.js, server-side) | Most server-side scrapers | jQuery-style API; needs a Node runtime; lighter than full jsdom.
jsdom | You need script execution | Heavy. Reach for it only if you must run the page's JS.

The playground uses regex because the fixture page is known-shape and we want the worker-sandboxed version to be small. A production crawler should almost always use cheerio (Node) or linkedom (browser/worker) instead.

Politeness

Every real crawler needs four things the playground doesn’t have:

1. robots.txt respect. Before you fetch https://example.com/foo, fetch https://example.com/robots.txt and check that your user agent is allowed. Half of crawler-gets-blocked stories are a robots.txt you could have read.
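A minimal check might look like the sketch below. It is deliberately naive: it only reads Disallow rules under "User-agent: *" and ignores Allow, wildcards, and Crawl-delay. A production crawler should use a full robots.txt parser.

```javascript
// Naive robots.txt check: collect Disallow path prefixes that apply to
// all user agents ("User-agent: *"), then test a path against them.
function disallowedPaths(robotsTxt) {
  const paths = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim();     // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key.trim())) applies = value === '*';
    else if (applies && /^disallow$/i.test(key.trim()) && value) paths.push(value);
  }
  return paths;
}

function isAllowed(robotsTxt, path) {
  return !disallowedPaths(robotsTxt).some((p) => path.startsWith(p));
}

// Usage: fetch robots.txt once per host, then gate every request.
// const robots = await (await fetch('https://example.com/robots.txt')).text();
// if (isAllowed(robots, '/foo')) { /* safe to fetch /foo */ }
```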

2. A real User-Agent header. Identify yourself:

fetch(url, {
  headers: {
    'User-Agent': 'VorluxBot/1.0 (+https://vorluxai.com/bots)',
  },
});

This lets site owners block you if they want to. You want that — it’s the contract that lets crawling keep working for everyone.

3. Rate limiting. One request every 250ms per host is the industry “polite default.” More aggressive and you’ll get rate-limited or banned; less aggressive and you won’t finish in this century.

function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}
for (const url of urls) {
  await fetch(url);
  await sleep(250);
}

4. Response caching. If you re-crawl daily, record the Last-Modified / ETag headers from the first run and send If-Modified-Since / If-None-Match on the second. You'll save 90% of the bytes.
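A sketch of that second run, assuming a simple in-memory cache (the `{ etag, lastModified, body }` entry shape is an assumption; a real crawler would persist it):

```javascript
// Build conditional-request headers from what was cached on the first run.
function conditionalHeaders(entry) {
  const headers = {};
  if (entry?.etag) headers['If-None-Match'] = entry.etag;
  if (entry?.lastModified) headers['If-Modified-Since'] = entry.lastModified;
  return headers;
}

async function cachedFetch(url, cache) {
  const entry = cache.get(url);
  const res = await fetch(url, { headers: conditionalHeaders(entry) });
  if (res.status === 304) return entry.body; // unchanged: reuse cached body
  const body = await res.text();
  cache.set(url, {
    etag: res.headers.get('ETag'),
    lastModified: res.headers.get('Last-Modified'),
    body,
  });
  return body;
}
```

On a 304 Not Modified the server sends headers only, so the page's bytes never cross the wire again.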

When you graduate to a real tool

We use a mix: Playwright for SPA-heavy targets, cheerio + a custom scheduler for static HTML, Scrapy only when someone already in-house knows it.

How VORLUX uses this in production

Our KB crawler is a cheerio-based Node service running in the orchestrator. It fetches known-good sources (Docebo docs, EU AI Act publications, Spanish regulatory sites), runs structure-aware parsing, and feeds content into the vorlux_content.db SQLite store with FTS5 indexing. It is rate-limited to one request per 400ms per host. The news aggregator is a separate Playwright job for a handful of SPA-backed news sites that don't render server-side.

The crawl loop dashboard shows you the queue depth, success rate, and last-fetched-at per source. We re-run the KB crawler nightly and the news aggregator every 4 hours. About 1,400 pages in the KB today came through this pipeline.

Go try it

In the playground, change the fetch URL to your own site, and update the regex to group all <a> tags by their hostname — expect output like {"vorluxai.com": 12, "animejs.com": 3, "codesandbox.io": 1}.

Solution
(async () => {
  const res = await fetch('YOUR_URL_HERE');
  const html = await res.text();

  const links = [...html.matchAll(/<a[^>]+href="([^"]+)"/gi)].map((m) => m[1]);
  const byHost = {};
  for (const link of links) {
    try {
      const host = new URL(link, 'YOUR_URL_HERE').hostname;
      byHost[host] = (byHost[host] || 0) + 1;
    } catch {
      /* malformed URL, skip (relative URLs resolve via the base argument) */
    }
  }
  self.postMessage({ type: 'result', data: byHost });
})();

Where to go next

If you’re building a data ingestion pipeline and want production-grade help, that’s one of the things Vorlux consults on. Reach out.