con/annextube
annextube is the flagship tool in the con/serve ecosystem for archiving YouTube content into DataLad/git-annex repositories. It wraps yt-dlp with native DataLad integration, providing a purpose-built pipeline for preserving video channels and playlists with full metadata, incremental updates, and content-addressed storage.
Why annextube? #
YouTube content is ephemeral. Channels disappear, videos get removed, playlists are reorganized, and community posts vanish without warning. For research groups that rely on educational content, conference recordings, or tutorial series, this creates a real preservation problem.
annextube solves this by treating YouTube archival as a first-class DataLad workflow:
- Video files are stored in git-annex (content-addressed, deduplicated, and transferable to remote storage)
- Metadata (titles, descriptions, upload dates, thumbnails, subtitles) is stored in git (version-tracked, diffable, searchable)
- Incremental updates mean you only download what is new since the last run
- Provenance is captured automatically through DataLad’s run records
Architecture #
annextube builds on a clear separation between large binary content and structured metadata:
```
my-channel-archive/
  .datalad/
  .git/
  .gitattributes
  videos/
    <video-id>/
      <video-id>.mp4            # git-annex (content-addressed)
      <video-id>.info.json      # git (yt-dlp metadata)
      <video-id>.description    # git (video description)
      <video-id>.en.vtt         # git (English subtitles)
      <video-id>.thumbnail.jpg  # git-annex (thumbnail image)
  channel_metadata.json         # git (channel-level metadata)
```
This layout means:
- `git log` shows you when videos were added, metadata changed, or descriptions updated
- `git annex whereis` tells you where each video file is stored (local, S3, institutional storage)
- `datalad status` gives you an instant overview of what has changed
- Standard git tools (diff, blame, log) work on all the text metadata
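For example, a few standard commands cover most day-to-day inspection of an archive (using the `<video-id>` placeholder from the layout above):

```
# When was this video's metadata added or changed?
git log --oneline -- videos/<video-id>/

# Which storage locations hold the actual video file?
git annex whereis videos/<video-id>/<video-id>.mp4

# What has changed since the last save?
datalad status

# How did the description evolve over recent commits?
git diff HEAD~5 -- videos/<video-id>/<video-id>.description
```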
Key Features #
Channel and Playlist Archival #
Archive an entire YouTube channel or specific playlists. annextube handles pagination, rate limiting, and error recovery automatically.
Incremental Updates #
After the initial archive, subsequent runs only download new videos and updated metadata. This makes it practical to maintain living archives of active channels without re-downloading everything.
Metadata Extraction and Storage #
For each video, annextube extracts and stores:
- info.json – the full yt-dlp metadata dump including title, description, upload date, duration, view count, tags, categories, chapters, and more
- Description files – the video description as a standalone text file for easy searching
- Thumbnails – preserved in git-annex alongside the video
- Subtitles and transcripts – auto-generated and manual subtitles in VTT/SRT format, stored in git for full-text searchability
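Because the metadata is plain JSON in git, it can be queried directly from a checkout. A sketch using jq (not an annextube dependency, just a convenient tool), with paths following the layout above:

```
# Pull a few fields out of one video's yt-dlp metadata
jq '{title, upload_date, duration, view_count}' videos/<video-id>/<video-id>.info.json

# List all archived videos in upload order
jq -r '[.upload_date, .title] | @tsv' videos/*/*.info.json | sort
```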
Subtitle and Transcript Archival #
Subtitles are particularly valuable for AI readiness. annextube downloads all available subtitle tracks (both creator-uploaded and YouTube’s auto-generated captions) and stores them as plain text files in git. This means:
- Full-text search across all archived video transcripts using standard `grep`/`git grep`
- LLM-based analysis of video content without needing to process the video files themselves
- Structured subtitle formats (VTT with timestamps) enable time-aligned references back to the source video
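A hedged sketch of both patterns, assuming the layout above (the VTT-stripping regex is simplified and illustrative):

```
# Full-text search every archived transcript, with file and line numbers
git grep -n "neural network" -- '*.vtt'

# Reduce one subtitle track to plain text for LLM input:
# drop the WEBVTT header, timestamp cues, and blank lines
grep -vE '^(WEBVTT|Kind:|Language:|[0-9:.]+ -->)' \
    videos/<video-id>/<video-id>.en.vtt | awk 'NF'
```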
DataLad Integration #
annextube operates as a DataLad-aware tool:
- Creates proper DataLad datasets for new archives
- Uses `datalad save` to commit changes with meaningful messages
- Supports DataLad's `run` mechanism for full provenance tracking
- Works with DataLad siblings for pushing archives to remote storage
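For full provenance, the archival command can be wrapped in DataLad's standard run mechanism (a sketch; the direct invocation shown under Usage works too):

```
# Record the command and its resulting changes as a re-runnable commit
datalad run -m "Archive @ChannelName" \
    annextube archive https://www.youtube.com/@ChannelName

# Run records are embedded in commit messages and easy to find later
git log --oneline --grep="DATALAD RUNCMD"
```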
Installation #
```
pip install annextube
```
Or with uv:
```
uv pip install annextube
```
Prerequisites #
- Python 3.8+
- git-annex
- DataLad
- yt-dlp (installed as a dependency)
Usage #
Archive a YouTube Channel #
```
# Create a new DataLad dataset for the archive
datalad create my-channel-archive
cd my-channel-archive

# Archive an entire channel
annextube archive https://www.youtube.com/@ChannelName
```
Archive a Playlist #
```
annextube archive https://www.youtube.com/playlist?list=PLxxxxxxxx
```
Incremental Update #
```
# Run again later to pick up new videos
annextube archive https://www.youtube.com/@ChannelName
```
annextube tracks what has already been downloaded and only fetches new content.
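Because updates are incremental, keeping an archive current can be as simple as a scheduled job. A hypothetical crontab entry (path and schedule are placeholders):

```
# Refresh the archive every Monday at 03:00
0 3 * * 1  cd /data/my-channel-archive && annextube archive https://www.youtube.com/@ChannelName
```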
Example Workflow: Archiving a Research Channel #
A complete workflow for archiving a conference channel and making it available for AI-assisted analysis:
```
# 1. Create the archive dataset
datalad create -c text2git conference-talks
cd conference-talks

# 2. Initial archive of the channel
annextube archive https://www.youtube.com/@ConferenceName

# 3. Push video files to institutional S3 storage
git annex initremote s3 type=S3 encryption=none bucket=conference-archive
git annex copy --to s3 .

# 4. The metadata and subtitles are in git; push to GitHub/Forgejo
datalad push --to origin

# 5. Later: update with new uploads
annextube archive https://www.youtube.com/@ConferenceName

# 6. Search across all archived transcripts
git grep "interesting topic" -- '*.vtt'
```
AI Readiness #
annextube produces ai-partial output:
| Component | AI Ready? | Notes |
|---|---|---|
| info.json metadata | Yes | Structured JSON, directly parseable |
| Video descriptions | Yes | Plain text, immediately usable |
| Subtitles/transcripts | Yes | Time-stamped text, excellent for RAG |
| Video files | No | Require transcription (Whisper, etc.) |
| Thumbnails | No | Require vision model for analysis |
The combination of structured metadata and subtitle text means that a large fraction of a channel’s informational content is immediately accessible to AI systems without any additional processing. For videos lacking subtitles, tools like OpenAI Whisper can be run on the git-annex-stored video files to generate transcripts.
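A sketch of that fallback using the openai-whisper CLI (assumed to be installed separately; the model size and output options are illustrative):

```
# Fetch the annexed video content locally first
datalad get videos/<video-id>/<video-id>.mp4

# Generate a VTT transcript alongside the archived subtitles
whisper videos/<video-id>/<video-id>.mp4 --model small \
    --output_format vtt --output_dir videos/<video-id>/

# Commit the new transcript to the dataset
datalad save -m "Add Whisper transcript for <video-id>"
```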
Comparison with yt-dlp #
annextube builds on yt-dlp and shares its download capabilities, but adds the DataLad/git-annex layer:
| Feature | yt-dlp alone | annextube |
|---|---|---|
| Video download | Yes | Yes (via yt-dlp) |
| Metadata extraction | Yes | Yes (via yt-dlp) |
| git-annex storage | Manual setup | Automatic |
| DataLad integration | None | Native |
| Incremental updates | Manual tracking | Built-in |
| Content deduplication | No | Via git-annex |
| Remote storage (S3, etc.) | No | Via git-annex special remotes |
| Provenance tracking | No | Via DataLad run records |
If you just need to download a few videos, yt-dlp is simpler. If you are building a persistent, versioned, deduplicated archive of YouTube content integrated with your research data management infrastructure, annextube is the right tool.
Limitations and Caveats #
- YouTube rate limiting: Heavy archival can trigger rate limits. annextube inherits yt-dlp's throttling behavior, but long-running archives of large channels may need to be done in stages.
- YouTube Terms of Service: annextube is intended for archiving content for research preservation purposes. Users should be aware of YouTube's ToS and applicable copyright considerations.
- Beta status: The tool is functional and actively used but the API and output format may still evolve.
- yt-dlp dependency: Changes in YouTube’s infrastructure occasionally break yt-dlp, which cascades to annextube. Keeping yt-dlp updated is important.
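In practice this usually amounts to upgrading both packages together:

```
pip install --upgrade yt-dlp annextube
```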
See Also #
- yt-dlp – the underlying download engine
- gallery-dl – similar concept for image galleries
- Zoom Recording Archival – archiving video from another platform