Personal Archive

The Goal #

A single person wants to build a comprehensive personal digital archive that captures their digital life across platforms and services, stored in git-annex repositories they control, with browsable frontends for day-to-day access.

This is the “data hoarder” use case in its purest form: not a research lab, not an institution – just one person who wants to own their data.

Data Sources #

Google Account (via Google Takeout) #

Google Takeout is the single largest ingestion event. A typical personal Google account export includes:

| Source | What | Volume |
|---|---|---|
| Gmail | Years of email, attachments | 5-50 GB |
| Google Photos | Every photo/video from phone backups | 20-200+ GB |
| Google Drive | Documents, spreadsheets, presentations | 1-50 GB |
| YouTube | Watch history, liked videos, playlists, subscriptions | Metadata only (small) |
| Calendar | Events spanning years | Small |
| Contacts | Address book | Small |
| Location History | GPS tracks, place visits | Moderate |
| Chrome | Bookmarks, browsing history | Small |
| Google Maps | Saved places, reviews, starred locations | Small |
| Keep | Notes, lists, voice memos | Small |
| Google Fit | Activity and health data | Moderate |
| Hangouts/Chat | Messaging history | Moderate |

Status: the ingestion workflow is documented under Google Takeout. The main gap is automated splitting and metadata reassembly – a connector that takes a raw Takeout dump and produces well-organized, domain-specific datasets.
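As a rough illustration of the splitting step, the sketch below routes archive member paths to target datasets by their top-level Takeout folder. The folder names in `TAKEOUT_ROUTES` and the helper names are assumptions for illustration only; real Takeout folder names vary with account language and export date, so the mapping would need to be checked against an actual dump.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Hypothetical mapping from Takeout top-level folders to vault datasets.
# Verify the folder names against your own export before relying on them.
TAKEOUT_ROUTES = {
    "Mail": "email",
    "Google Photos": "photos",
    "Drive": "drive",
    "YouTube and YouTube Music": "youtube",
    "Calendar": "calendar",
    "Contacts": "contacts",
    "Location History": "location-history",
}


def route_takeout_member(member: str, default: str = "google-takeout-raw") -> str:
    """Return the target dataset name for one archive member path."""
    parts = PurePosixPath(member).parts
    # Drop the leading "Takeout" directory when present.
    if parts and parts[0] == "Takeout":
        parts = parts[1:]
    if not parts:
        return default
    # Anything without a known route stays in the raw dataset.
    return TAKEOUT_ROUTES.get(parts[0], default)


def plan_split(members: Iterable[str]) -> Dict[str, List[str]]:
    """Group member paths by target dataset, preserving input order."""
    plan = defaultdict(list)
    for member in members:
        plan[route_takeout_member(member)].append(member)
    return dict(plan)
```

Unrouted folders fall back to `google-takeout-raw`, so nothing is silently dropped while the mapping is still incomplete.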

Photos #

Photos are often the most emotionally valuable part of a personal archive and the most voluminous. Sources include:

  • Google Photos (from Takeout) – the primary source for most Android users
  • Apple Photos (via iCloud export or direct DCIM copy)
  • Camera imports – SD cards, USB transfers
  • Messaging apps – photos shared in WhatsApp, Telegram, Signal
  • Social media – photos posted to Instagram, Facebook (via their export tools)

Desired end state:

~/vault/personal/photos/
    ├── import-google-2026-02/    # Takeout dump, metadata reassembled
    ├── import-camera-2025/       # Direct camera imports
    ├── albums/                   # Curated albums (links or metadata)
    ├── photos.tsv                # Hierarchical summary index
    └── .datalad/
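One possible shape for the `photos.tsv` summary index is a per-directory count of files and bytes. The sketch below is an assumption about what such an index could contain; the column names and the function names are hypothetical and not taken from any existing tool.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path
from typing import List, Tuple


def summarize_photos(root: Path) -> List[Tuple[str, int, int]]:
    """Return (directory, file_count, total_bytes) rows, sorted by directory.

    Hidden entries such as .datalad/ and .git/ are skipped. Annexed files
    whose content is not present locally (dangling symlinks) are skipped too,
    because is_file() follows symlinks.
    """
    stats = defaultdict(lambda: [0, 0])
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if any(part.startswith(".") for part in rel.parts):
            continue
        if not path.is_file():
            continue
        entry = stats[rel.parent.as_posix()]
        entry[0] += 1
        entry[1] += path.stat().st_size
    return sorted((d, count, size) for d, (count, size) in stats.items())


def to_tsv(rows: List[Tuple[str, int, int]]) -> str:
    """Serialize summary rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["directory", "files", "bytes"])
    writer.writerows(rows)
    return buf.getvalue()
```

Writing the result is then a matter of `(root / "photos.tsv").write_text(to_tsv(summarize_photos(root)))`.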

With browsable frontends:

  • PhotoPrism for AI-assisted browsing (face recognition, map view, auto-classification)
  • Photoview for a lightweight, filesystem-native gallery (deployed in Lab-in-a-Box, LiaB)
  • copyparty for zero-setup quick browsing and sharing

Personal Messaging #

| Platform | Tool | Format | Notes |
|---|---|---|---|
| Telegram (personal channels, groups, DMs) | tg-archive | HTML + JSON | Personal channels are an increasingly common “microblog” |
| WhatsApp | Export from app settings | Text + media | Manual process, no good automated tool |
| Signal | Signal backup decoder | SQLite | Encrypted backup requires passphrase |
| Slack (personal workspaces) | slackdump | JSON | For personal or small-team workspaces |
| Matrix (personal rooms) | matrix-archive | HTML/JSON | Self-hosted servers simplify this |
| Discord | DiscordChatExporter | HTML/JSON | Personal servers and DMs |

Desired end state:

~/vault/personal/messaging/
    ├── telegram/
    │   ├── channels/        # Personal channels (microblog-style)
    │   ├── groups/          # Group chats
    │   └── dms/             # Direct messages
    ├── whatsapp/
    ├── signal/
    └── .datalad/

YouTube (Personal Collection) #

A personal YouTube presence includes:

  • Watch history – years of viewing data
  • Liked videos – curated collection of valuable content
  • Playlists – organized collections (educational, music, etc.)
  • Subscriptions – channels you follow
  • Own uploads (if any) – personal content you’ve published

Metadata comes from Google Takeout. Actual video archival for liked/playlist videos uses annextube or yt-dlp.

Desired end state:

~/vault/personal/youtube/
    ├── watch-history/
    │   └── history.json         # From Takeout
    ├── liked-videos/
    │   └── videos.tsv           # Index of liked videos
    ├── playlists/
    │   ├── playlists.tsv        # All playlists
    │   └── {playlist-name}/
    │       ├── videos.tsv
    │       └── {video_id}/
    │           ├── metadata.json
    │           └── video.mkv    # git-annex
    ├── subscriptions/
    │   └── subscriptions.json   # From Takeout
    └── .datalad/
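To show how Takeout metadata could feed an index such as `videos.tsv`, here is a minimal sketch that extracts video IDs from a watch-history export. It assumes the JSON export format, in which each entry carries `title`, `titleUrl` and `time` fields; those field names should be verified against a real export (Takeout can also deliver this history as HTML). The function names are hypothetical.

```python
import json
from typing import List, Optional, Tuple
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url: str) -> Optional[str]:
    """Return the value of the `v` query parameter, or None if absent."""
    ids = parse_qs(urlparse(url).query).get("v")
    return ids[0] if ids else None


def history_rows(history_json: str) -> List[Tuple[str, str, str]]:
    """Return (video_id, watched_at, title) for entries that link to a video."""
    rows = []
    for entry in json.loads(history_json):
        video_id = video_id_from_url(entry.get("titleUrl", ""))
        if video_id is None:
            # Entries without a watch URL (for example removed videos) are skipped.
            continue
        rows.append((video_id, entry.get("time", ""), entry.get("title", "")))
    return rows
```

The resulting IDs could then be handed to annextube or yt-dlp for the actual video download.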

Other Personal Data #

| Source | Description | Tool |
|---|---|---|
| Personal website/blog | Self-hosted content | httrack or direct git archive |
| Social media | Facebook, Twitter/X, LinkedIn exports | Platform export tools, gallery-dl for media |
| Browser bookmarks | Cross-browser bookmark collection | From Takeout + browser exports |
| Password manager export | Credential metadata (NOT passwords in git!) | Manual, encrypted |
| Health/fitness data | Strava, Garmin, Apple Health | Platform APIs + export tools |
| Financial records | Bank statements, receipts | Manual import, git-annex |
| Music library | Purchased music, playlists | Direct file archive |

Hypothetical Vault Organization #

TODO: AI-generated layout, to be curated.

The personal archive as a DataLad superdataset:

~/vault/personal/                      # DataLad superdataset
    ├── google-takeout-raw/            # Raw Takeout dump (archival reference)
    ├── photos/                        # Canonical photo collection
    ├── email/                         # Gmail MBOX → processed email
    ├── messaging/                     # All messaging platforms
    ├── youtube/                       # Video collection + metadata
    ├── drive/                         # Google Drive documents
    ├── calendar/                      # Calendar events
    ├── contacts/                      # Address book
    ├── location-history/              # GPS tracks
    ├── web-archives/                  # Saved web pages
    ├── music/                         # Music library
    └── .datalad/

Each subdirectory is a nested DataLad dataset, following YODA principles. This allows independent version tracking, selective replication, and fine-grained control over which content is placed on which remote via git-annex wanted expressions.
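The nested layout could be bootstrapped with `datalad create`, using `-d` to register each new dataset in the superdataset. The sketch below only plans the commands as argument lists so they can be reviewed before anything runs; the `SUBDATASETS` list mirrors the hypothetical layout above, and `plan_vault_commands` is an illustrative helper, not part of DataLad.

```python
from pathlib import Path
from typing import List

# Subdataset names taken from the hypothetical layout above.
SUBDATASETS = [
    "google-takeout-raw", "photos", "email", "messaging", "youtube",
    "drive", "calendar", "contacts", "location-history",
    "web-archives", "music",
]


def plan_vault_commands(root: Path) -> List[List[str]]:
    """Return the datalad invocations that would create the vault layout."""
    # First the superdataset itself.
    commands = [["datalad", "create", str(root)]]
    # Then each subdataset, registered in the superdataset via -d.
    for name in SUBDATASETS:
        commands.append(["datalad", "create", "-d", str(root), str(root / name)])
    return commands
```

Once the plan looks right, each entry can be executed with `subprocess.run(cmd, check=True)`.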

Workflow Overview #

TODO: AI-generated layout, to be curated.

flowchart TD
    GT[Google Takeout dump] --> extract[Extract & split]
    extract --> photos[Photos dataset]
    extract --> email[Email dataset]
    extract --> yt_meta[YouTube metadata]
    extract --> drive[Drive documents]
    extract --> other[Calendar, Contacts, ...]

    TG[Telegram export] --> messaging[Messaging dataset]
    WA[WhatsApp export] --> messaging
    SIG[Signal backup] --> messaging

    yt_meta --> yt[YouTube dataset]
    annextube[annextube / yt-dlp] --> yt

    rclone[rclone incremental sync] -.-> photos
    rclone -.-> drive

    photos --> photoprism[PhotoPrism / Photoview]
    yt --> annextube_ui[annextube Svelte UI]
    messaging --> search[Full-text search]

    subgraph vault[Personal Vault - git-annex / DataLad]
        photos
        email
        yt
        drive
        messaging
        other
    end

    subgraph frontends[Visualization Frontends]
        photoprism
        annextube_ui
        search
    end

Relevant Tools #

| Component | Tool | Status |
|---|---|---|
| Google Takeout download | Google Takeout | Manual, no full automation yet |
| Photo browsing | PhotoPrism, Photoview, copyparty | Deployable |
| Telegram archival | tg-archive | Working |
| YouTube video archival | annextube, yt-dlp | Working |
| Cloud sync | rclone | Working |
| Image gallery archival | gallery-dl | Working |
| Infrastructure deployment | Lab-in-a-Box | Alpha |

Distribution and Privacy #

A personal archive contains some of the most sensitive data imaginable: location history, private messages, financial records, health data, and photos.

The privacy and access control principles apply with maximum force here:

  • Archive aggressively – your personal data is yours, and platforms can lose or restrict it at any time
  • Distribute selectively – use git-annex wanted expressions to ensure private content never reaches public remotes
  • Encrypt at rest – use git-annex encryption for any remote storage (S3, Backblaze, etc.)
  • Separate personal from professional – the personal vault should be a separate superdataset from any institutional or lab archives, with independent access control

# Example: keep private content off a public remote.
# Wanted expressions are preferences: they are honored by
# `git annex sync --content` and `--auto` transfers, not by an explicit `git annex copy`.
git annex wanted public-remote "include=*.md or include=*.tsv"
git annex wanted encrypted-backup "anything"
git annex wanted here "anything"
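Before pointing a wanted expression at a public remote, it can help to preview which paths it would select. The sketch below is a deliberately tiny emulation, not git-annex's own matcher: it understands only `anything` and `include=GLOB` terms joined by `or`, and it approximates glob matching with Python's `fnmatch`. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def wanted_by(expression: str, path: str) -> bool:
    """Approximate whether `path` matches a simple wanted expression.

    Supports only "anything" and "include=GLOB" terms joined by " or ".
    """
    expression = expression.strip()
    if expression == "anything":
        return True
    for term in expression.split(" or "):
        term = term.strip()
        if not term.startswith("include="):
            raise ValueError(f"unsupported term: {term!r}")
        # fnmatch's "*" also matches "/", so "*.tsv" matches nested files.
        if fnmatchcase(path, term[len("include="):]):
            return True
    return False


def preview(expression: str, paths: Iterable[str]) -> List[str]:
    """Return the paths that the expression would select, in input order."""
    return [p for p in paths if wanted_by(expression, p)]
```

For example, `preview("include=*.md or include=*.tsv", paths)` lists only Markdown and TSV files, which makes it easy to spot an index file that would leak private names or locations.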

See Also #