ArchiveBox
Overview #
ArchiveBox is a self-hosted web archiving tool that takes a list of URLs and saves snapshots of each page in multiple formats simultaneously. For every URL, it can produce an HTML copy, a PDF rendering, a full-page screenshot, a WARC archive, extracted media files, and a git-tracked history of changes.
Unlike browser extensions that save a single page at a time, ArchiveBox is designed for bulk archiving – it can ingest URLs from bookmarks exports, browser history, RSS feeds, Pocket, Pinboard, or plain text lists. It runs as a local server with a web UI, CLI, and REST API.
Key Features #
- Multiple output formats – for each URL, ArchiveBox can save:
- Static HTML (wget mirror)
- Single-file HTML (via SingleFile)
- PDF rendering (via headless Chromium)
- Full-page screenshot (PNG)
- WARC archive (for replay with tools like ReplayWeb.page)
- Extracted audio/video/images
- Git history of page changes over time
- Plain text extraction
- DOM dump
- Bulk import – accepts URLs from browser bookmarks (Chrome, Firefox), Pocket, Pinboard, RSS/Atom feeds, browser history databases, or plain text files.
- Web UI and API – Django-based web interface for browsing archives, searching content, and managing snapshots. REST API for programmatic access.
- Scheduling – built-in scheduler for periodic re-archiving of URLs (see the example after this list).
- Full-text search – search across archived page content using Sonic or Ripgrep backends.
- Self-hosted – runs on your own hardware; archiving does not rely on any third-party services. Docker or bare-metal installation.
- Deduplication – detects and avoids re-archiving identical content.
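The scheduler mentioned above is driven from the CLI. A minimal sketch of a daily re-archiving job, assuming a recent ArchiveBox release (exact flags can differ between versions):
# Inside the archive directory: re-pull an RSS feed once per day,
# archiving the feed itself plus the pages it links to (--depth=1)
archivebox schedule --every=day --depth=1 "https://example.com/feed.rss"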
Basic Usage #
# Install via pip or Docker
pip install archivebox
# Initialize an archive
mkdir ~/web-archive && cd ~/web-archive
archivebox init
# Add URLs
archivebox add "https://example.com/important-page"
# Import from bookmarks
archivebox add < bookmarks.html
# Import from a text file of URLs
archivebox add < urls.txt
# Start the web UI
archivebox server 0.0.0.0:8000
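The same workflow has a Docker equivalent using the official archivebox/archivebox image; a minimal sketch, binding the current directory as the data volume (paths and port are illustrative):
# Initialize and use an archive via Docker instead of pip
mkdir ~/web-archive && cd ~/web-archive
docker run -v "$PWD":/data -it archivebox/archivebox init
docker run -v "$PWD":/data -it archivebox/archivebox add "https://example.com/important-page"
docker run -v "$PWD":/data -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000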
Output Structure #
ArchiveBox organizes its output into one directory per snapshot, named by timestamp:
archive/
  1707123456.789/       # timestamp-based snapshot ID
    index.json          # metadata (URL, title, timestamps, status)
    singlefile.html     # self-contained HTML
    output.pdf          # PDF rendering
    screenshot.png      # full-page screenshot
    warc/               # WARC archive
    media/              # extracted media files
    git/                # git-tracked page history
    readability/        # extracted article text
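Because each snapshot carries its own index.json, the archive can be summarized with ordinary shell tools. A minimal sketch, assuming jq is installed and that index.json exposes url, title, and timestamp fields (field names may vary between ArchiveBox versions):
# Print one line per snapshot: timestamp, URL, title
for snap in archive/*/; do
    jq -r '[.timestamp, .url, (.title // "")] | @tsv' "$snap/index.json"
done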
git-annex / DataLad Integration #
Integration level: external.
ArchiveBox manages its own storage and does not directly integrate with git-annex or DataLad. However, its output directory can be imported into a DataLad dataset for version-controlled, distributed archival.
Importing ArchiveBox Output into git-annex #
# Create a DataLad dataset for web archives
datalad create web-archive
cd web-archive
# Configure git-annex to handle ArchiveBox output appropriately
# Large files (PDFs, screenshots, WARCs, media) go to annex
# Small files (index.json, readability text) go to git
cat >> .gitattributes << 'EOF'
*.pdf annex.largefiles=anything
*.png annex.largefiles=anything
*.warc annex.largefiles=anything
*.warc.gz annex.largefiles=anything
*.mp4 annex.largefiles=anything
*.mp3 annex.largefiles=anything
singlefile.html annex.largefiles=anything
index.json annex.largefiles=nothing
EOF
# Copy ArchiveBox output into the dataset
cp -r ~/web-archive/archive/ ./archive/
# Save with DataLad
datalad save -m "Import ArchiveBox snapshots"
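To confirm that the .gitattributes rules had the intended effect, git-annex can report which files went into the annex rather than plain git; a quick check along these lines:
# List annexed files (should include PDFs, screenshots, WARCs, media)
git annex find archive/
# Summary statistics for the annex
git annex info --fast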
Periodic Archival Workflow #
# Run ArchiveBox to update archives
archivebox add < new-urls.txt
# Sync changes into the DataLad dataset
rsync -av ~/web-archive/archive/ ./archive/
datalad save -m "Update web archive $(date +%Y-%m-%d)"
For a more automated approach, wrap the archiving step with datalad run
to record the full provenance of each archival operation.
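A sketch of such a datalad run wrapper, assuming the ArchiveBox data directory lives at ~/web-archive, the current directory is the DataLad dataset, and new-urls.txt sits inside ~/web-archive (all paths are illustrative):
# Record the archive update and sync as a single provenance-tracked step
datalad run -m "Update web archive $(date +%Y-%m-%d)" \
    --output archive/ \
    "(cd ~/web-archive && archivebox add < new-urls.txt) && rsync -av ~/web-archive/archive/ archive/"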
AI Readiness #
Level: ai-partial.
ArchiveBox produces a mix of AI-friendly and AI-challenging outputs:
AI-ready components:
- index.json – structured metadata (URL, title, timestamps, tags) is immediately parseable.
- Readability-extracted text – clean article text stripped of navigation and boilerplate.
- Plain text extractions – directly consumable by LLMs.
AI-partial components:
- singlefile.html – contains the full page content but mixed with styling, scripts, and layout markup. Useful but noisy for LLM consumption.
- PDF renderings – require PDF text extraction for LLM use.
AI-manual components:
- Screenshots (PNG) – require OCR or vision models.
- WARC archives – require specialized tools to extract and replay content.
- Media files – require transcription for audio/video.
The readability extraction and plain text outputs are the most valuable
for AI workflows. For building a searchable knowledge base from archived
web pages, the text extractions plus index.json metadata provide a solid
foundation for RAG (retrieval-augmented generation) pipelines.
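As a concrete starting point, the snippet below gathers each snapshot's readability text together with its source URL into a flat corpus directory; a minimal sketch, assuming the readability extractor wrote readability/content.txt inside each snapshot directory (file names and layout can vary between ArchiveBox versions):
# Build a flat plain-text corpus for a RAG pipeline
mkdir -p corpus
for snap in archive/*/; do
    txt="$snap/readability/content.txt"
    [ -f "$txt" ] || continue
    url=$(jq -r '.url' "$snap/index.json")
    out="corpus/$(basename "$snap").txt"
    printf 'SOURCE: %s\n\n' "$url" > "$out"
    cat "$txt" >> "$out"
done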