Google Takeout
Table of Contents
Google Takeout is Google’s official data export service. It lets you download a copy of your data from 70+ Google products in a single (often enormous) archive. For anyone building a personal digital archive, a Takeout dump is typically the single largest ingestion event – tens or hundreds of gigabytes covering years of digital life.
What You Get #
A Takeout export can include data from any combination of these services:
| Service | Format | Typical Size | AI Readiness |
|---|---|---|---|
| Gmail | MBOX | Gigabytes | ai-ready (text), ai-partial (attachments) |
| Google Photos | JPEG/PNG/MP4 + JSON metadata | Tens of GB | ai-manual (media), ai-ready (metadata) |
| Google Drive | Original files or converted (DOCX→PDF) | Variable | Mixed |
| YouTube | Watch history, playlists, subscriptions (JSON) | Small | ai-ready |
| Calendar | ICS | Small | ai-ready |
| Contacts | vCard (VCF) | Small | ai-ready |
| Location History | JSON (semantic locations, raw signals) | Moderate | ai-ready |
| Chrome | Bookmarks (HTML), history (JSON) | Small | ai-ready |
| Google Maps | Reviews, saved places (JSON/GeoJSON) | Small | ai-ready |
| Keep | Notes as HTML + JSON | Small | ai-ready |
| Fit | Activity data (TCX) | Moderate | ai-partial |
| Hangouts / Chat | Conversation history (JSON) | Moderate | ai-ready |
| Blogger | Posts (Atom XML) | Small | ai-ready |
| Google Pay | Transactions (CSV) | Small | ai-ready |
The full list is much longer – Takeout supports 70+ products.
The Challenge #
A Takeout archive is a raw dump, not a curated dataset. The challenges for con/serve-style archival are:
- Scale – a full export can be 50-500+ GB, split across multiple ZIP files
- Heterogeneous formats – MBOX, JSON, ICS, vCard, HTML, binary media, all mixed together
- Metadata coupling – Google Photos stores metadata in sidecar
.jsonfiles alongside each image/video, requiring reassembly - No incremental export – each Takeout is a full dump. There is no “export only what changed since last time.”
- Nested archives – the export itself is a set of ZIP or TGZ files that must be extracted before processing
Ingestion Status #
No fully automated workflow exists yet for downloading, extracting, and categorizing a Takeout dump into domain-specific git-annex datasets. The process today is largely manual: request the export, download the ZIP files, extract, and sort by hand.
Key pain points that remain unsolved:
- Download automation – Takeout delivers download links via email or pushes to Drive/Dropbox. There is no stable API for triggering or retrieving exports programmatically.
- Splitting and categorization – the raw dump mixes all services into a flat directory tree. Splitting into domain-specific datasets (photos, email, YouTube metadata, etc.) requires manual scripting.
- Photo metadata reassembly – Google Photos stores GPS, descriptions, people tags, and album membership in sidecar
.jsonfiles separate from the images. Merging this metadata back into EXIF is fragile and format-dependent. Various community tools exist but none are comprehensive. - Deduplication across exports – since each Takeout is a full dump (no incremental mode), successive exports overlap massively. Identifying what is new requires content-based comparison.
rclone can access Google Drive and Google Photos directly for incremental sync between Takeout dumps, but rclone’s Google Photos backend is read-only and does not faithfully export album structure.
git-annex / DataLad Integration #
Integration level: external.
Google Takeout produces ZIP/TGZ archives that must be manually extracted and imported into git-annex. A dedicated connector that automates extraction, metadata reassembly, and import into domain-specific datasets would be a valuable contribution to the con/serve ecosystem (see Personal Archive user story).
AI Readiness #
Level: ai-partial.
Takeout exports are a mixed bag:
- Structured metadata (JSON from YouTube, Calendar ICS, Contacts vCard, Location History, Chrome bookmarks) is immediately AI-consumable
- Email (MBOX) is text-heavy and highly AI-relevant, but requires parsing MIME structure for attachments
- Photos/Videos are binary media requiring vision models or transcription
- Google Drive documents exported as PDF/DOCX may need text extraction
The metadata sidecar files are among the most AI-valuable outputs – they contain structured information about years of digital activity (where you were, what you watched, who you communicated with) in machine-readable JSON.
Limitations and Caveats #
- No incremental export: every Takeout is a full dump, which means deduplication against previous exports is essential
- Rate limited: Google limits how frequently you can request exports and how long download links remain active
- Format instability: Google occasionally changes export formats without notice. Community tooling around Takeout metadata parsing can break.
- Album reconstruction: Google Photos album membership is in sidecar JSON, not in the file hierarchy. Reconstructing albums requires parsing these files.
- Shared content: content shared with you by others may or may not be included, depending on the product
See Also #
- rclone – incremental sync from Google Drive/Photos between Takeout dumps
- Personal Archive – the user story for building a complete personal digital archive
- Ingestion Patterns – patterns for bringing external data into git-annex
- gallery-dl – for archiving photos from other image hosting platforms