Skip to main content
  1. Tools/

Media

Research media – conference talks, lab meeting recordings, tutorial videos, podcast episodes, and image datasets – is some of the hardest content to preserve. Files are large, hosting platforms impose retention limits, and binary formats resist version control.

git-annex was designed precisely for this problem: content-addressed storage that tracks large files without bloating the git repository.

This section catalogs tools for downloading, organizing, and archiving media artifacts into git-annex/DataLad repositories.

Platforms and Formats #

YouTube – The dominant platform for research talks, tutorials, and conference recordings. annextube provides DataLad-native YouTube archival; yt-dlp offers a more general-purpose approach.

Zoom – Ubiquitous for lab meetings and virtual conferences. Recordings often have expiration dates, making proactive archival essential. See Zoom Archival.

Podcasts and Audio – Research podcasts, interview recordings, and audio datasets. yt-dlp handles most podcast feeds; specialized tools exist for specific use cases.

Image Galleries – Figures, microscopy images, and photo documentation. gallery-dl archives images from numerous hosting platforms.

AI Readiness #

Media files are inherently ai-manual – binary content that requires transcription or captioning before an LLM can work with it. However, many tools also capture structured metadata (titles, descriptions, timestamps, chapter markers) that is ai-ready on its own. A practical archival strategy preserves both the media files (in git-annex) and their metadata (in git) so that AI workflows can operate on the metadata while the full media remains available for human review.