python-github-backup
Overview
python-github-backup (also available as the github-backup CLI command) creates
local backups of GitHub repositories along with their associated metadata. Unlike
a plain git clone, it captures the full ecosystem around a repository: issues,
pull requests, milestones, labels, releases, wikis, and stargazer, watcher, and
fork metadata.
This makes it a practical tool for archiving the social and project-management layer of GitHub, which would otherwise remain locked into the platform and be lost during migrations or account closures.
Key Features
- Repository cloning – bare or mirror clones of the git repository itself.
- Issue and PR backup – all issues and pull requests with their full comment threads, labels, milestones, and assignees, saved as JSON files.
- Release assets – download release tarballs and attached binaries.
- Wiki backup – clone the wiki repository (if present).
- Stars and watchers – record who has starred and who is watching the repository.
- Organization-wide backup – iterate over all repositories in a GitHub organization with a single command.
- Incremental backups – re-running the tool updates an existing backup with only new and changed data (see the refresh example under Basic Usage).
- GitHub Enterprise support – works with self-hosted GitHub instances, as sketched just below.
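For self-hosted instances, the --github-host flag points the tool at an alternative API host. A minimal sketch, assuming an Enterprise instance reachable at the placeholder hostname github.example.com:
# Backup a repository from a self-hosted GitHub Enterprise instance
# (github.example.com is a placeholder hostname)
github-backup USER --token "$GITHUB_TOKEN" \
    --github-host github.example.com \
    --repository REPO \
    --issues --pulls \
    --output-directory ./backups/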
Basic Usage
pip install github-backup
# Backup a single repository with all metadata
github-backup USER --token "$GITHUB_TOKEN" \
--repository REPO \
--issues --pulls --milestones --labels --releases \
--wikis --stars --watchers \
--output-directory ./backups/
# Backup all repositories in an organization
github-backup ORG --token "$GITHUB_TOKEN" \
--organization \
--issues --pulls --releases --wikis \
--output-directory ./backups/
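Re-runs pick up where the last backup left off; the -i/--incremental flag restricts issue and pull request fetching to items updated since the previous run. A minimal refresh sketch, reusing the placeholders from the commands above:
# Refresh an existing backup; only issues/PRs updated since the
# last run are fetched again
github-backup USER --token "$GITHUB_TOKEN" \
    --repository REPO \
    --issues --pulls --incremental \
    --output-directory ./backups/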
Output Structure
The backup produces a directory per repository containing:
REPO/
repository/ # bare git clone
issues/ # one JSON file per issue
pull_requests/ # one JSON file per PR
milestones/ # milestone metadata
releases/ # release metadata + downloaded assets
wiki/ # wiki git repository clone
stars.json # list of stargazers
watchers.json # list of watchers
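This layout makes spot checks easy with standard tools. A small sketch, assuming the per-issue files mirror the GitHub REST API issue schema (fields such as number, state, and title):
# Print number, state, and title for every backed-up issue
for f in REPO/issues/*.json; do
    jq -r '[.number, .state, .title] | @tsv' "$f"
done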
git-annex / DataLad Integration
Integration level: git-only.
python-github-backup writes its output to a local directory. The JSON metadata files are small and well-suited for direct git tracking, while release assets (binaries, tarballs) can be large and are better suited for git-annex.
A recommended workflow for archiving into a DataLad dataset:
# Create a DataLad dataset for GitHub backups
datalad create github-archive
cd github-archive
# Run the backup
github-backup ORG --token "$GITHUB_TOKEN" \
--organization --all \
--output-directory .
# Save everything -- git-annex will handle large files automatically
# based on .gitattributes annex.largefiles settings
datalad save -m "GitHub backup $(date +%Y-%m-%d)"
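The annex.largefiles policy referenced in the comment above lives in .gitattributes. One possible policy, committed before the first save (the 100kb threshold is an arbitrary choice, not a project default), keeps the small JSON metadata in git and routes larger files such as release assets to git-annex:
# .gitattributes -- small JSON stays in git, larger files go to the annex
* annex.largefiles=largerthan=100kb
*.json annex.largefiles=nothing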
For periodic backups, wrap the command with datalad run to record
provenance:
datalad run -m "Update GitHub backup" \
--output "ORG/" \
github-backup ORG --token "$GITHUB_TOKEN" \
--organization --all \
--output-directory .
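To schedule this, a crontab entry can invoke the wrapped command. A sketch assuming the dataset lives at /data/github-archive and that GITHUB_TOKEN is defined in cron's environment (cron jobs do not inherit your interactive shell environment):
# m h dom mon dow -- run the provenance-tracked backup nightly at 02:30
30 2 * * * cd /data/github-archive && datalad run -m "Update GitHub backup" --output "ORG/" github-backup ORG --token "$GITHUB_TOKEN" --organization --all --output-directory .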
AI Readiness
Level: ai-ready.
The tool stores all metadata (issues, PRs, milestones, releases, stars) as JSON files that any programming language can parse and that LLMs can consume directly:
- Issue and PR JSON includes full comment threads with timestamps and author info.
- The structured format makes it straightforward to build summaries, search across issues, or analyze project activity patterns.
- Wiki content (cloned as a git repository of markdown files) is immediately readable by LLMs.
No format conversion is needed for AI consumption. The JSON metadata is well-structured and self-describing.
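For example, simple analytics questions can be answered over the raw files with a one-liner; a sketch, again assuming GitHub-API-style fields in the per-issue JSON:
# Tally issues by state across an entire repository backup
jq -s 'group_by(.state) | map({state: .[0].state, count: length})' REPO/issues/*.json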