Browsertrix
Overview
Browsertrix (formerly Browsertrix Crawler) is a headless browser-based web crawler developed by Webrecorder that creates high-fidelity WARC (Web ARChive) files. Unlike traditional crawlers (wget, HTTrack) that download raw HTML, Browsertrix uses a real browser engine (Chromium) to render pages, executing JavaScript, loading dynamic content, and capturing the page as a user would actually see it.
This makes Browsertrix essential for archiving modern web applications, single-page apps (SPAs), sites with lazy-loaded content, and pages that rely heavily on client-side rendering. The resulting WARC files can be replayed with pixel-perfect fidelity using tools like ReplayWeb.page.
Browsertrix is available both as a standalone Docker-based crawler (Browsertrix Crawler) and as a full cloud-hosted platform (Browsertrix Cloud) with team collaboration, scheduling, and a web UI.
Key Features
- Browser-based crawling – uses headless Chromium to render pages exactly as a browser would, capturing JavaScript-rendered content, SPAs, and dynamically loaded elements.
- WARC output – produces standard WARC files compatible with the broader web archiving ecosystem (Wayback Machine, ReplayWeb.page, pywb).
- Configurable crawling – supports seed URLs, URL scoping (same-domain, same-path, custom regex), depth limits, and page limits.
- Behavioral scripts – can execute custom behaviors during crawling (scroll to load lazy content, click through paginated content, dismiss cookie banners, log in to authenticated sites).
- Parallel crawling – runs multiple browser instances concurrently for faster crawling of large sites.
- Docker-native – runs as a Docker container, making deployment and scaling straightforward.
- Cloud platform – Browsertrix Cloud adds team collaboration, crawl scheduling, quality review, and a web-based management interface.
Basic Usage
Browsertrix Crawler (Docker)
# Crawl a single site
docker run -v $PWD/crawls:/crawls \
webrecorder/browsertrix-crawler crawl \
--url "https://example.com" \
--scopeType domain \
--limit 100
# Output is in ./crawls/collections/
ls crawls/collections/*/
# archive/ indexes/ pages/
Crawl Configuration
# crawl-config.yaml
seeds:
  - url: "https://docs.example.com"
    scopeType: prefix
  - url: "https://blog.example.com"
    scopeType: domain
limit: 500
workers: 4
behaviors: autoscroll,autoplay
blockAds: true
docker run -v $PWD:/config -v $PWD/crawls:/crawls \
webrecorder/browsertrix-crawler crawl \
--config /config/crawl-config.yaml
WARC Format
WARC (Web ARChive) is an ISO standard (ISO 28500) for storing web crawl data. Each WARC file contains:
- HTTP request/response pairs – the full network traffic of each page load.
- Rendered page resources – HTML, CSS, JavaScript, images, fonts, and media files as served to the browser.
- Metadata records – crawl timestamps, software version, and configuration.
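To see these record types for a given crawl, the records can be enumerated with the warcio Python library; a minimal sketch, where the path under crawls/collections/ is illustrative rather than a fixed Browsertrix layout:

# List record types and target URIs in a WARC produced by a crawl.
# Requires: pip install warcio  (the WARC path below is illustrative)
from warcio.archiveiterator import ArchiveIterator

with open("crawls/collections/my-crawl/archive/data.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        uri = record.rec_headers.get_header("WARC-Target-URI")
        print(record.rec_type, uri or "")

Typical output interleaves warcinfo, request, response, and resource records, which is why replay tooling is needed to reconstruct pages from them.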
WARC files can be replayed (viewed as the original website) using:
- ReplayWeb.page – client-side WARC replay in the browser
- pywb – Python-based Wayback Machine implementation
- OpenWayback – Java-based Wayback Machine
git-annex / DataLad Integration
Integration level: external.
WARC files are binary archives that can be large (hundreds of MB to GB). They are well-suited for git-annex content-addressed storage.
Importing Browsertrix Output into a DataLad Dataset
# Create a dataset for web archives
datalad create web-warcs
cd web-warcs
# Configure annex for WARC files
echo "*.warc annex.largefiles=anything" >> .gitattributes
echo "*.warc.gz annex.largefiles=anything" >> .gitattributes
echo "*.cdx annex.largefiles=nothing" >> .gitattributes
# Run a crawl
docker run -v $PWD/crawls:/crawls \
webrecorder/browsertrix-crawler crawl \
--url "https://example.com" --scopeType domain
# Import crawl results (create target directories first)
mkdir -p warcs indexes
cp crawls/collections/*/archive/*.warc.gz ./warcs/
cp crawls/collections/*/indexes/*.cdx ./indexes/
# Save with provenance
datalad save -m "Archive example.com via Browsertrix"
Automated Crawl with Provenance
datalad run -m "Crawl example.com with Browsertrix" \
--output "warcs/" --output "indexes/" \
'mkdir -p warcs indexes && \
docker run -v $PWD/crawls:/crawls \
webrecorder/browsertrix-crawler crawl \
--url "https://example.com" --scopeType domain && \
cp crawls/collections/*/archive/*.warc.gz warcs/ && \
cp crawls/collections/*/indexes/*.cdx indexes/'
AI Readiness
Level: ai-manual.
WARC files are binary archives that are not directly consumable by LLMs:
- Replay required – WARC content must be replayed through a Wayback Machine implementation or extracted with specialized tools (warctools, warcio) before the text content is accessible.
- Mixed content – a single WARC file contains HTML, CSS, JavaScript, images, and other resources interleaved. Extracting just the text content requires parsing.
- CDX indexes – the companion CDX (capture index) files are structured text providing a machine-readable index of what is in the WARC. These are ai-ready (see the sketch below).
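The exact index flavor depends on the crawler version: recent Browsertrix Crawler releases write CDXJ lines (a URL key, a timestamp, then a JSON object), while classic space-delimited CDX needs a column-based parser. A minimal sketch, assuming CDXJ and an illustrative filename:

# Parse CDXJ-style index lines: "<urlkey> <timestamp> <json>".
# The filename is illustrative; adjust to the actual file under indexes/.
import json

with open("indexes/index.cdxj") as f:
    for line in f:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # skip blank or non-CDXJ lines
        urlkey, timestamp, payload = parts
        entry = json.loads(payload)
        # Each entry points back into the WARC (url, mime, status, offset, filename, ...)
        print(timestamp, entry.get("url"), entry.get("mime"), entry.get("status"))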
For AI workflows, the recommended pipeline is:
- Crawl with Browsertrix to create WARC files (preserves full fidelity).
- Extract text content using warcio or similar WARC processing libraries (see the sketch after this list).
- Index extracted text for search and RAG pipelines.
- Keep the original WARCs in git-annex for long-term preservation and replay.
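A minimal sketch of the extraction step, assuming the warcio library and Python's standard-library HTML parser; the WARC path is illustrative, and a production pipeline would likely use a dedicated text-extraction library instead:

# Extract plain text from HTML response records in a WARC file.
# Requires: pip install warcio  (the WARC path is illustrative)
from html.parser import HTMLParser
from warcio.archiveiterator import ArchiveIterator

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

with open("warcs/example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type") or ""
        if "text/html" not in ctype:
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        parser = TextExtractor()
        parser.feed(record.content_stream().read().decode("utf-8", errors="replace"))
        print(url, " ".join(parser.parts)[:200])  # e.g. feed this into an indexing/RAG step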
The WARC format is optimized for archival fidelity, not for AI consumption. Think of it as the “master copy” from which AI-friendly derivatives can be produced.