Web
The web is the connective tissue of modern research – project homepages, documentation sites, blog posts, institutional pages, and online supplements all live at URLs that may not persist.
Link rot is well-documented: studies consistently find that a significant fraction of URLs cited in published papers become unreachable within a few years.
This section catalogs tools for capturing web content and archiving it into git or git-annex repositories.
Approaches
- Full-site archival – Tools like ArchiveBox and HTTrack crawl entire websites and preserve them as self-contained archives (WARC, static HTML, or single-file snapshots).
- Single-page capture – SingleFile captures individual pages as self-contained HTML files with all resources inlined, ideal for preserving specific pages or articles.
- Headless crawling – Browsertrix uses a headless browser to capture JavaScript-heavy sites that static crawlers cannot handle, producing standards-compliant WARC archives.
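A minimal command-line sketch of the three approaches, assuming ArchiveBox, the single-file-cli package, and Docker are installed; the example.com URLs and output paths are placeholders, and exact flags vary by tool version:

```sh
# Single-page capture: save one page as self-contained HTML (single-file-cli)
single-file "https://example.com/supplement" supplement.html

# Full-site archival: initialize an ArchiveBox collection and add a URL to it
archivebox init
archivebox add "https://example.com/project-docs/"
# (HTTrack is an alternative static mirror: httrack "https://example.com/" -O ./mirror)

# Headless crawling: Browsertrix crawler via Docker, writing WARC/WACZ output
docker run -v "$PWD/crawls:/crawls" webrecorder/browsertrix-crawler \
  crawl --url "https://example.com/app/" --generateWACZ
```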
Integration with git-annex
Web archives can be large (especially WARC files from full-site crawls), making git-annex the natural storage backend. The typical pattern is:
- Run the archival tool to capture content locally
- Import the output into a git-annex repository
- Use `git annex addurl` to record the original URL as a retrievable source
- Replicate to special remotes for backup
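As a concrete sketch of that pattern (the repository name, file paths, and the `backup-s3` special remote below are hypothetical placeholders, not prescribed names):

```sh
# 1. Capture content locally with one of the tools above,
#    e.g. a crawl whose output lands in ../crawls/

# 2. Import the output into a git-annex repository
git init web-archive && cd web-archive
git annex init
cp -r ../crawls .
git annex add crawls
git commit -m "Add crawl of example.com"

# 3. Record the original URL as a retrievable source for a captured page;
#    --relaxed registers the URL without requiring the live content to
#    match the local snapshot byte-for-byte
git annex addurl --relaxed --file=crawls/page.html "https://example.com/page.html"

# 4. Replicate to a previously configured special remote for backup
git annex copy --to=backup-s3 crawls
```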
For simpler needs, `git annex addurl` and `datalad download-url` can directly archive individual URLs without an intermediate tool.
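For example, inside an existing repository (paths and URLs are illustrative):

```sh
# Fetch a single URL straight into the annex, recording where it came from
git annex addurl --file=supplement.pdf "https://example.com/supplement.pdf"

# The DataLad equivalent, which also registers the URL with git-annex
# and saves the change in the dataset history
datalad download-url -d . -O supplement.pdf "https://example.com/supplement.pdf"
```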