
Crawl a page

An honest primer on web crawlers. Fetch + parse is 80% of the work. We show the other 20%.


What a crawler actually is

A crawler does three things: fetch, parse, decide. Fetch pulls HTML over HTTP. Parse extracts structured data from it. Decide chooses whether to follow links, when to stop, and how fast to go. Modern crawlers add a storage layer and a queue — but at the core, it’s those three.
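The three steps can be sketched as one loop. This is a minimal illustration, not the playground's code: the regex-based `extractLinks`, the `decide` policy (same host, unseen, depth-limited), and the names are assumptions, and politeness is omitted here (see Politeness below).

```javascript
// Parse: pull hrefs out of HTML and resolve them against the page URL.
// Regex extraction is brittle; fine for a sketch, not for production.
function extractLinks(html, baseUrl) {
  return [...html.matchAll(/<a[^>]+href="([^"]+)"/gi)]
    .map((m) => {
      try { return new URL(m[1], baseUrl).href; } catch { return null; }
    })
    .filter(Boolean);
}

// Decide: follow only same-host links we haven't seen, up to maxDepth.
function decide(link, startHost, seen, depth, maxDepth) {
  return depth < maxDepth && !seen.has(link) &&
         new URL(link).hostname === startHost;
}

async function crawl(startUrl, maxDepth = 2) {
  const startHost = new URL(startUrl).hostname;
  const seen = new Set([startUrl]);
  const queue = [{ url: startUrl, depth: 0 }];
  const pages = [];
  while (queue.length) {
    const { url, depth } = queue.shift();
    const html = await (await fetch(url)).text();     // fetch
    pages.push({ url, html });
    for (const link of extractLinks(html, url)) {     // parse
      if (decide(link, startHost, seen, depth, maxDepth)) { // decide
        seen.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return pages;
}
```

The queue makes it breadth-first; swapping `shift()` for `pop()` would make it depth-first.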

The playground above runs a worker-sandboxed version of the first two. Try it: hit Run. You’ll see the fixture page’s title, h1s, and first 10 links printed as JSON.

Parsing HTML three ways

Approach | When to use | Trade-offs
Regex (what the playground uses) | Extracting one obvious pattern from known-shape HTML | Fast but brittle. Breaks on malformed markup. Do not parse arbitrary HTML this way.
DOMParser (in-browser or in a DOM-polyfilled worker) | Any client-side scrape, any reasonable HTML | Real DOM access; fine in a Worker with linkedom or a similar polyfill.
cheerio (Node.js, server-side) | Most server-side scrapers | jQuery-style API; needs a Node runtime; lighter than full jsdom.
jsdom | You need script execution | Heavy. Reach for it only if you must run the page's JS.

The playground uses regex because the fixture page is known-shape and we want the worker-sandboxed version to be small. A production crawler should almost always use cheerio (Node) or linkedom (browser/worker) instead.

Politeness

Every real crawler needs four things the playground doesn’t have:

1. robots.txt respect. Before you fetch https://example.com/foo, fetch https://example.com/robots.txt and check that your user agent is allowed. Half of crawler-gets-blocked stories are a robots.txt you could have read.
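A minimal check might look like the sketch below. It is deliberately naive: it only reads Disallow rules under "User-agent: *" and ignores Allow, wildcards, and Crawl-delay. A production crawler should use a full robots.txt parser.

```javascript
// Naive robots.txt check: collect Disallow path prefixes that apply to
// all user agents ("User-agent: *"), then test a path against them.
function disallowedPaths(robotsTxt) {
  const paths = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim();     // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key.trim())) applies = value === '*';
    else if (applies && /^disallow$/i.test(key.trim()) && value) paths.push(value);
  }
  return paths;
}

function isAllowed(robotsTxt, path) {
  return !disallowedPaths(robotsTxt).some((p) => path.startsWith(p));
}

// Usage: fetch robots.txt once per host, then gate every request.
// const robots = await (await fetch('https://example.com/robots.txt')).text();
// if (isAllowed(robots, '/foo')) { /* safe to fetch /foo */ }
```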

2. A real User-Agent header. Identify yourself:

fetch(url, {
  headers: {
    'User-Agent': 'VorluxBot/1.0 (+https://vorluxai.com/bots)',
  },
});

This lets site owners block you if they want to. You want that — it’s the contract that lets crawling keep working for everyone.

3. Rate limiting. One request every 250ms per host is the industry “polite default.” More aggressive and you’ll get rate-limited or banned; less aggressive and you won’t finish in this century.

function sleep(ms) {
  return new Promise((r) => setTimeout(r, ms));
}
for (const url of urls) {
  await fetch(url);
  await sleep(250);
}

4. Response caching. If you re-crawl daily, record the Last-Modified / ETag headers from the first run and send If-Modified-Since / If-None-Match on the second. You'll save 90% of the bytes.
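A sketch of that second run, assuming a simple in-memory cache (the `{ etag, lastModified, body }` entry shape is an assumption; a real crawler would persist it):

```javascript
// Build conditional-request headers from what was cached on the first run.
function conditionalHeaders(entry) {
  const headers = {};
  if (entry?.etag) headers['If-None-Match'] = entry.etag;
  if (entry?.lastModified) headers['If-Modified-Since'] = entry.lastModified;
  return headers;
}

async function cachedFetch(url, cache) {
  const entry = cache.get(url);
  const res = await fetch(url, { headers: conditionalHeaders(entry) });
  if (res.status === 304) return entry.body; // unchanged: reuse cached body
  const body = await res.text();
  cache.set(url, {
    etag: res.headers.get('ETag'),
    lastModified: res.headers.get('Last-Modified'),
    body,
  });
  return body;
}
```

On a 304 Not Modified the server sends headers only, so the page's bytes never cross the wire again.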

When you graduate to a real tool

We use a mix: Playwright for SPA-heavy targets, cheerio + a custom scheduler for static HTML, Scrapy only when someone already in-house knows it.

How VORLUX uses this in production

Our KB crawler is a cheerio-based Node service running in the orchestrator. It fetches known-good sources (Docebo docs, EU AI Act publications, Spanish regulatory sites), runs structure-aware parsing, and feeds content into the vorlux_content.db SQLite store with FTS5 indexing. It is rate-limited to one request per 400ms per host. The news aggregator is a separate Playwright job for a handful of SPA-backed news sites that don't render server-side.

The crawl loop dashboard shows you the queue depth, success rate, and last-fetched-at per source. We re-run the KB crawler nightly and the news aggregator every 4 hours. About 1,400 pages in the KB today came through this pipeline.

Go try it

In the playground, change the fetch URL to your own site, and update the regex to group all <a> tags by their hostname — expect output like {"vorluxai.com": 12, "animejs.com": 3, "codesandbox.io": 1}.

Solution
(async () => {
  const res = await fetch('YOUR_URL_HERE');
  const html = await res.text();

  const links = [...html.matchAll(/<a[^>]+href="([^"]+)"/gi)].map((m) => m[1]);
  const byHost = {};
  for (const link of links) {
    try {
      const host = new URL(link, 'YOUR_URL_HERE').hostname;
      byHost[host] = (byHost[host] || 0) + 1;
    } catch {
      /* malformed URL, skip (relative URLs resolve via the base argument) */
    }
  }
  self.postMessage({ type: 'result', data: byHost });
})();

Where to go next

If you’re building a data ingestion pipeline and want production-grade help, that’s one of the things Vorlux consults on. Reach out.