Software Project

The Goal #

An open-source software project lives on GitHub: a GitHub organization with dozens of repositories, active issue trackers, pull request workflows, GitHub Actions CI/CD, Discussions for design conversations, and Slack for real-time team communication.

A prototypical target is the DANDI project – a neuroscience data archive with ~50 repositories, thousands of issues and PRs, extensive CI pipelines, and an active Slack workspace.

The goal is to archive all project activity into a vault with ongoing synchronization, so that if GitHub changes terms, restricts access, or the project migrates platforms, the complete record of development activity is preserved – not just the code, but the conversations, decisions, and CI history that explain why the code is the way it is.

Data Sources #

GitHub Repositories #

The code itself is already in git. But a git clone captures only commits and refs – it misses the platform layer that lives on GitHub’s servers.

For each repository, the vault needs:

| Artifact | Tool | Format |
|---|---|---|
| Git history (commits, branches, tags) | git clone --mirror | Git |
| Issues and PRs (with comments, labels, milestones) | git-bug, python-github-backup | Git refs / JSON |
| Discussions | gh-discussions-export | JSON / Markdown |
| Wiki pages | gh-md | Markdown |
| Releases and artifacts | python-github-backup | JSON + binaries |
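
As a minimal sketch of the first row, the helper below builds the `git clone --mirror` invocation for one repository. The function name and the destination layout (`<vault>/repos/<repo>/git`) are illustrative assumptions, not part of any existing tool.

```python
from pathlib import PurePosixPath


def mirror_clone_command(org: str, repo: str, vault_root: str) -> list[str]:
    """Return the argv for a mirror clone of one GitHub repository."""
    url = f"https://github.com/{org}/{repo}.git"
    # Hypothetical destination: one directory per repository inside the vault.
    dest = PurePosixPath(vault_root) / "repos" / repo / "git"
    return ["git", "clone", "--mirror", url, str(dest)]


# Print the commands for a few repositories of the example organization.
for name in ["dandi-cli", "dandi-archive", "dandischema"]:
    print(" ".join(mirror_clone_command("dandi", name, "project-vault")))
```

The argv list can be passed to `subprocess.run` once the destination layout is settled.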

Some repositories are private. These must be archived with distribution-restrictions=private so they are excluded from any public-facing remotes.
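
One way this tagging could look in practice, assuming the content is tracked by git-annex: `git annex metadata --set` attaches the field, and a preferred-content expression set with `git annex wanted` keeps tagged files away from a public remote. The helper names and the remote name are hypothetical; the sketch only builds the commands and does not run them.

```python
def tag_private_command(paths: list[str]) -> list[str]:
    """argv that marks the given annexed files as distribution-restricted."""
    return [
        "git", "annex", "metadata",
        "--set", "distribution-restrictions=private",
        *paths,
    ]


def exclude_restricted_command(remote: str) -> list[str]:
    """argv that tells git-annex the remote does not want any restricted file."""
    return [
        "git", "annex", "wanted", remote,
        "not metadata=distribution-restrictions=*",
    ]


print(tag_private_command(["issues/1234.json"]))
print(exclude_restricted_command("public-mirror"))
```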

GitHub Actions (CI Logs) #

CI runs produce build logs, test output, and artifacts that are essential for debugging regressions and understanding project history. GitHub retains Actions logs for 90 days by default; after that, the build history is gone.
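
To make the retention window concrete, the sketch below estimates when the logs of a given run would expire. It assumes the run's `created_at` timestamp in the UTC format returned by the GitHub API and the default 90-day window; retention is configurable, so the window is a parameter. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=90)


def log_expiry(created_at: str, retention: timedelta = DEFAULT_RETENTION) -> datetime:
    """Estimate when the logs of a workflow run stop being available."""
    # GitHub API timestamps look like "2024-01-01T00:00:00Z".
    created = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
    return created.replace(tzinfo=timezone.utc) + retention


def days_left(created_at: str, now: datetime,
              retention: timedelta = DEFAULT_RETENTION) -> int:
    """Whole days remaining before the logs expire (negative once expired)."""
    return (log_expiry(created_at, retention) - now).days
```

An archival job that runs more often than the retention window never loses a log; the calculation is mainly useful for spotting runs that are about to fall out of reach after an outage.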

con/tinuous archives CI logs from GitHub Actions (and other CI services) into git-annex repositories, preserving the complete build history as a DataLad dataset.

Slack (Team Communication) #

The project’s Slack workspace contains design discussions, user support threads, release coordination, triage decisions, and the informal exchanges that never make it into issues or docs.

slackdump archives these conversations into structured JSON.

| Channel type | Content | Privacy |
|---|---|---|
| Public channels (#general, #dev, #releases) | Development discussion, announcements | private (workspace-internal) |
| Private channels (#core-team, #security) | Sensitive design and security discussions | private |
| DMs | One-on-one coordination | private |

Organization Metadata #

Beyond individual repositories, the GitHub organization itself has structure: teams and their memberships, organization-level settings, repository permissions, and team discussions. This metadata defines who could do what – important context for understanding the project’s governance history.

Forgejo-Aneksajo Mirror #

The vault’s forge counterpart is a Forgejo-Aneksajo instance that mirrors the GitHub organization.

GitHub Authentication #

Forgejo supports OAuth2 authentication providers, including GitHub. Configuring the Forgejo-Aneksajo instance to use GitHub as an authentication source means:

  • Project members log in with their existing GitHub accounts
  • No separate credentials to manage
  • Team memberships and org roles can be synced (Forgejo supports mapping OAuth2 groups to Forgejo organizations/teams)

This makes the Forgejo mirror feel like a natural extension of the GitHub-based workflow rather than a separate system.

Repository Mirroring #

Forgejo supports mirror repositories – repositories that automatically sync from an upstream source. Each GitHub repository in the organization can be configured as a Forgejo mirror:

# Forgejo API: create a mirror of a GitHub repo
# The body is JSON, so the Content-Type header must say so.
curl -X POST https://forgejo.example.org/api/v1/repos/migrate \
    -H "Authorization: token $FORGEJO_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "clone_addr": "https://github.com/dandi/dandi-cli",
        "mirror": true,
        "repo_name": "dandi-cli",
        "repo_owner": "dandi"
    }'

This gives the project a self-hosted backup of all code with a browsable web interface, independent of GitHub’s availability.

Private Repositories #

Private GitHub repositories are mirrored as private Forgejo repositories. The authentication token used for mirroring needs access to private repos, and the Forgejo instance’s access controls ensure only authorized team members can see them.

In the vault, content from private repos carries distribution-restrictions=private metadata.

Hypothetical Vault Organization #

TODO: AI-generated layout, to be curated.

The layout follows the self-contained per-entity grouping principle: everything about a given repository – code, issues, CI logs, discussions – lives under one per-repo superdataset rather than being scattered across artifact-type trees. // marks subdataset boundaries; plain / is a directory within the same dataset.

project-vault//                          # DataLad superdataset
    repos/                               # Organizational directory (not a subdataset)
        dandi-cli//                      # Per-repo superdataset: everything about dandi-cli
            git//                        # Git mirror (the repo itself)
            issues//                     # git-bug bridge or JSON export
            discussions//                # Exported GitHub Discussions
            wiki//                       # Wiki pages
            tinuous-logs//               # CI log archive (con/tinuous)
            releases/                    # Release artifacts (plain dir or subdataset)
        dandi-archive//
            git//
            issues//
            ...
        dandischema//
            ...
    communications/
        slack//                          # Archived Slack workspace
    .datalad/
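
The `//` notation can be read mechanically. As a small illustration, the hypothetical helper below takes a path written with these markers and lists the root of every dataset along it; a trailing plain directory is not reported as a dataset.

```python
def dataset_boundaries(marked_path: str) -> list[str]:
    """List dataset roots along a path that marks boundaries with '//'."""
    roots: list[str] = []
    prefix = ""
    # Everything before each '//' ends a dataset; the final piece does not.
    for segment in marked_path.split("//")[:-1]:
        segment = segment.strip("/")
        prefix = f"{prefix}/{segment}" if prefix else segment
        roots.append(prefix)
    return roots


print(dataset_boundaries("project-vault//repos/dandi-cli//tinuous-logs//"))
print(dataset_boundaries("project-vault//repos/dandi-cli//releases/"))
```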

Each aspect within a per-repo superdataset is its own subdataset (git//, issues//, tinuous-logs//), enabling independent synchronization schedules – issues sync frequently, CI logs archive nightly – while datalad get repos/dandi-cli retrieves everything about that repository as a unit.
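
Independent schedules could be expressed as a simple interval table. The intervals below are placeholder assumptions (the text only says that issues sync frequently and CI logs nightly), and `due_for_sync` is a hypothetical helper that reports which subdatasets need a refresh.

```python
from datetime import datetime, timedelta

# Assumed intervals, for illustration only.
SYNC_INTERVALS = {
    "issues": timedelta(hours=1),
    "git": timedelta(hours=6),
    "tinuous-logs": timedelta(days=1),
}


def due_for_sync(last_synced: dict[str, datetime], now: datetime) -> list[str]:
    """Return the subdatasets whose interval has elapsed or that never synced."""
    due = []
    for aspect, interval in SYNC_INTERVALS.items():
        last = last_synced.get(aspect)
        if last is None or now - last >= interval:
            due.append(aspect)
    return due
```

Because each aspect is its own subdataset, a scheduler can update only the entries returned here and leave the rest of the per-repo superdataset untouched.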

Distribution and Privacy #

| Content | Distribution | Rationale |
|---|---|---|
| Public repo mirrors | Forgejo (public) | Redundancy, self-hosted access |
| Private repo mirrors | Forgejo (private) + encrypted backup | Access-controlled |
| Issues/PRs (public repos) | Forgejo / vault | Public project metadata |
| Issues/PRs (private repos) | Private only | Contains restricted discussion |
| CI logs | Vault (private) | May contain secrets in error messages |
| Slack archives | Private (encrypted backup) | Workspace-internal communications |
| Discussions | Vault (mirrors repo visibility) | Follows source repo access |
| Release artifacts | Forgejo / public mirrors | Already public |
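
The table can be condensed into a small policy check. This is a simplified sketch: the content-kind labels are made up for illustration, and treating release artifacts of private repositories as restricted is a conservative assumption the table does not spell out.

```python
# Kinds that stay private regardless of repository visibility.
ALWAYS_PRIVATE = {"ci-logs", "slack"}
# Kinds whose visibility follows the source repository (assumed for releases).
FOLLOWS_REPO = {"repo-mirror", "issues", "discussions", "releases"}


def may_publish(kind: str, repo_is_private: bool = False) -> bool:
    """Decide whether a kind of content may go to a public-facing remote."""
    if kind in ALWAYS_PRIVATE:
        return False
    if kind in FOLLOWS_REPO:
        return not repo_is_private
    raise ValueError(f"unknown content kind: {kind!r}")


print(may_publish("issues"))                        # public repo issues
print(may_publish("issues", repo_is_private=True))  # private repo issues
print(may_publish("ci-logs"))                       # CI logs stay private
```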

Workflow Overview #

TODO: AI-generated layout, to be curated.

flowchart TD
    gh_org[GitHub Organization] -->|git mirror| repos[repos/]
    gh_org -->|git-bug bridge| issues[issues/]
    gh_org -->|gh-discussions-export| discussions[discussions/]
    gh_org -->|con/tinuous| ci[ci/]
    gh_org -->|python-github-backup| releases[releases/]
    slack[Slack Workspace] -->|slackdump| comms[communications/slack/]
    repos -->|Forgejo mirror sync| forgejo[Forgejo-Aneksajo]
    issues --> forgejo
    forgejo -->|GitHub OAuth| auth[GitHub Authentication]
    subgraph vault[Project Vault -- git-annex / DataLad]
        repos
        issues
        discussions
        ci
        comms
        releases
    end
    subgraph mirror[Self-Hosted Mirror]
        forgejo
        auth
    end

Relevant Tools #

| Component | Tool | Status |
|---|---|---|
| Issue archival | git-bug | Stable |
| Repository backup | python-github-backup | Stable |
| Discussions export | gh-discussions-export | Beta |
| Wiki export | gh-md | Stable |
| CI log archival | con/tinuous | Stable |
| Slack archival | slackdump | Working |
| Self-hosted forge | Forgejo-Aneksajo | Beta |
| Deployment | Lab-in-a-Box | Alpha |

See Also #