Conservation to External Resources
The con/serve architecture has two halves: ingestion (pulling artifacts in) and conservation (pushing them out to durable, distributed locations). This page describes the outbound half – how content in your git-annex vault gets replicated, published, and backed up to external resources.
The guiding principle is simple: your data should exist in at least two places you control, and ideally also in a domain-specific archive that outlives your lab.
git-annex Special Remotes #
The primary mechanism for outbound distribution is the git-annex special remote. A special remote is a storage backend that git-annex can push content to and pull content from. git-annex ships with built-in support for several backends and has an extensible protocol for third-party implementations.
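Whatever the backend, the same two verbs move content across a special remote boundary (remote name and path are placeholders):
# Push annexed content out to a configured special remote
git annex copy --to <remote-name> path/to/data
# Retrieve it again later, from this or any other remote that has it
git annex get --from <remote-name> path/to/data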
Built-in Special Remotes #
| Remote Type | Use Case |
|---|---|
| S3 | Amazon S3 and S3-compatible services (Wasabi, MinIO, Backblaze B2) |
| rsync | Any server with SSH and rsync – the simplest backup target |
| web | Register URLs as content sources (not a backup target, but a distribution mechanism) |
| bittorrent | Distribute large datasets via BitTorrent (like web, a download-only source, not a backup target) |
| directory | Local or mounted filesystem path (USB drives, NAS, NFS mounts) |
| glacier | Amazon Glacier for cold archival storage |
Third-Party Special Remotes #
| Remote | Provider Coverage |
|---|---|
| git-annex-remote-rclone | 70+ cloud providers via rclone – Google Drive, Dropbox, OneDrive, Azure, etc. |
| git-annex-remote-globus | Globus endpoints (CONP-PCNO-specific, not a generic Globus remote) |
Example: Setting Up S3 Backup #
# Configure an S3 special remote
git annex initremote s3-backup type=S3 \
bucket=lab-archive-2026 \
encryption=shared \
chunk=100MiB
# Copy all content to S3
git annex copy --to s3-backup .
# Verify content is on S3
git annex whereis .
Example: rclone to Google Drive #
# Configure rclone (one-time setup)
rclone config
# ... configure a "gdrive" remote ...
# Configure git-annex to use rclone
git annex initremote gdrive type=external \
    externaltype=rclone \
    target=gdrive \
    prefix=lab-archive \
    encryption=none
# Copy content
git annex copy --to gdrive .
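Either way, a newly configured remote can be sanity-checked with git-annex's built-in test suite, which stores, retrieves, and removes temporary test keys:
# Exercise store/retrieve/remove against the remote before trusting it
git annex testremote gdrive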
DataLad Publishing #
DataLad provides higher-level commands for publishing datasets to various targets. These wrap git-annex operations with dataset-aware logic (handling nested subdatasets, metadata, etc.).
datalad push #
# Push to any configured sibling (git remote + optional annex special remote)
datalad push --to origin
# Push including all subdatasets
datalad push --to origin --recursive
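To see which siblings are configured as push targets, and which of them carry an annex, list them first:
# List configured siblings and their capabilities
datalad siblings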
datalad create-sibling-* #
DataLad includes commands for creating siblings on various platforms:
| Command | Target |
|---|---|
| create-sibling-github | GitHub repositories |
| create-sibling-gitlab | GitLab repositories |
| create-sibling-gogs | Gitea/Forgejo/Gogs instances (including Forgejo-Aneksajo) |
| create-sibling-osf | Open Science Framework (via datalad-osf) |
| create-sibling-ria | RIA (Remote Indexed Archive) stores |
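As an example of the last row, creating and publishing to a RIA store on an institutional server takes two commands; the hostname and store path below are placeholders:
# Create a sibling backed by a RIA store (creating the store if needed)
datalad create-sibling-ria -s ria-backup --new-store-ok \
    "ria+ssh://archive.example.org/srv/ria-store"
# Publish to it
datalad push --to ria-backup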
Example: Publishing to OSF #
# Install the DataLad OSF extension (one-time)
pip install datalad-osf
# Create an OSF sibling
datalad create-sibling-osf --name osf \
--title "My Research Dataset" \
--category data
# Push dataset
datalad push --to osf
rclone as Universal Adapter #
rclone deserves special mention because it serves a dual role in the con/serve architecture:
- Ingestion: pull files from cloud storage into git-annex repos
- Conservation: push archived content back to cloud storage as a backup or distribution target
With support for 70+ providers, rclone ensures that no matter where your institution or collaborators keep their cloud storage, you can replicate your archive there.
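A minimal sketch of the dual role, assuming an rclone remote named gdrive and illustrative paths:
# Ingestion: pull files down from cloud storage, then annex them
rclone copy gdrive:shared/incoming ./incoming
git annex add incoming/
git commit -m "Ingest files from Google Drive"
# Conservation: push the archived content back out via the special remote
git annex copy --to gdrive incoming/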
Distribution Patterns #
Backup (Resilience) #
Replicate content to multiple locations for disaster recovery:
# Local NAS
git annex initremote nas type=directory directory=/mnt/nas/archive encryption=none
git annex copy --to nas .
# Cloud (S3)
git annex copy --to s3-backup .
# Off-site (rsync to collaborator's server)
git annex initremote offsite type=rsync \
    rsyncurl=user@remote.example.org:/archive \
    encryption=none
git annex copy --to offsite .
Publish (Accessibility) #
Make datasets discoverable and downloadable by the community:
- Domain archives: OpenNeuro, DANDI, EMBER, OSF
- Institutional repositories: via RIA stores or institutional Forgejo instances
- DataLad Hub: hosted DataLad dataset sharing
Distribute (Scale) #
For very large datasets, use distribution mechanisms that scale:
- BitTorrent special remote: peer-to-peer distribution
- Web special remote: register download URLs so consumers can git annex get from the original source
- CDN-backed S3: put content on S3 behind CloudFront or similar
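For the web remote, registering a source URL is a one-liner; the URL below is a placeholder:
# Record a public download URL for an annexed file, so consumers
# can git annex get it straight from the original source
git annex addurl --file data/dataset.tar.gz \
    https://example.org/downloads/dataset.tar.gz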
Privacy-Aware Distribution #
Not all vault content should reach all remotes. Many ingested artifacts come from restricted sources – private Slack workspaces, access-controlled cloud drives, embargoed manuscripts, or recordings with consent constraints. The vault preserves them, but distribution must respect their original access restrictions.
git-annex’s wanted expressions can encode these boundaries using custom metadata.
In practice, we use a distribution-restrictions metadata field:
# Mark restricted content
git annex metadata --set distribution-restrictions=private some/private/file
git annex metadata --set distribution-restrictions=sensitive some/sensitive/data
# Public-facing remote: only DataLad metadata (always needed) and
# anything NOT marked with distribution restrictions
git annex wanted origin \
    "include=.datalad/* or (not metadata=distribution-restrictions=*)"
# Encrypted backup gets everything -- no restrictions
git annex wanted s3-encrypted "anything"
# Lab-internal Forgejo gets everything except sensitive
git annex wanted lab-forgejo \
"not metadata=distribution-restrictions=sensitive"
Combined with repository-level access control (private repos on GitHub or Forgejo) and encryption on special remotes, this gives fine-grained control over what leaves the vault and in what form.
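For instance, an encrypted special remote can be initialized so that everything it receives is GPG-encrypted before upload; the bucket name and key ID below are placeholders:
# hybrid encryption: content is encrypted to the listed GPG key,
# and additional keys can be granted access later
git annex initremote s3-encrypted type=S3 \
    bucket=lab-archive-encrypted \
    encryption=hybrid keyid=0xDEADBEEF \
    chunk=100MiB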
Collect Metadata at Ingestion Time #
These distribution decisions can only be made if the relevant metadata exists. This places an obligation on the ingestion side of the pipeline: when archiving artifacts, capture and record the provenance and rights information needed to make distribution decisions later.
Metadata to collect at ingestion time:
- Original owner / copyright holder – who created or owns the content
- License – under what terms it was originally shared (Creative Commons, institutional policy, proprietary, etc.)
- Source access level – was this from a public channel, a private workspace, a restricted drive, an authenticated portal?
- Consent constraints – for recordings: were participants informed? Was consent given for internal use only, or for broader distribution?
- Data Use Ontology (DUO) – for research data, DUO provides a standardized vocabulary for data use conditions (e.g., “general research use”, “disease-specific research”, “no general methods research”). Annotating datasets with DUO terms at ingestion time enables machine-readable access decisions downstream.
- Embargo periods – does the content have a time-limited restriction (e.g., pre-publication embargo, grant reporting period)?
The principle is: you cannot selectively distribute what you have not annotated. Collecting rights and provenance metadata at ingestion time – even when immediate distribution is not planned – preserves the ability to make informed sharing decisions in the future.
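As a sketch, such annotations can be attached with git annex metadata at ingestion time; the field names below are illustrative, not a fixed schema:
# Record rights and provenance metadata alongside the artifact
git annex metadata \
    --set owner="Example Lab" \
    --set license=CC-BY-4.0 \
    --set source-access=private-workspace \
    --set duo=DUO:0000042 \
    --set embargo-until=2027-01-01 \
    recordings/interview-03.mp4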
Content Policies #
git-annex’s numcopies and required settings let you express policies about where content must exist:
# Require at least 2 copies of everything
git annex numcopies 2
# Require that archival content always exists on S3
git annex required s3-backup "include=*"
# Keep locally only content not yet archived to cold storage;
# the cold-storage remote wants everything
git annex wanted here "not in=s3-backup"
git annex wanted s3-backup "anything"
These policies are stored in the git-annex branch and shared across all clones, ensuring consistent archival behavior regardless of who runs the distribution commands.
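Compliance with these policies can be checked at any point; with --fast, fsck verifies presence on a remote without downloading anything:
# Verify that s3-backup actually holds what the location log claims
git annex fsck --from s3-backup --fast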
See Also #
- rclone – universal cloud storage adapter
- Ingestion Patterns – the inbound counterpart
- DataLad Hub – hosted publishing target