Vault Organization
The con/serve vault is a DataLad superdataset containing archived digital research artifacts of many types – messaging histories, video recordings, code forge metadata, AI coding sessions, CI logs, scholarly publications, web captures, cloud storage mirrors, and more. How should this heterogeneous collection be organized on disk?
This page surveys existing approaches to directory organization, evaluates their fit for the vault use case, and identifies the principles that should guide the layout.
The Challenge
The vault poses an unusual organizational problem. Most directory layout standards assume a single artifact type (neuroimaging sessions, data science notebooks, library holdings) or a single organizational axis (by project, by date, by type). The vault must handle:
- Multiple artifact types with fundamentally different internal structures (JSON message exports, video files, git repositories, tabular metadata)
- Multiple sources (Slack, Telegram, GitHub, YouTube, Zoom, Strava, …) that each have their own identity model and access patterns
- Multiple organizational axes – by source platform, by project/team, by time period, by access level
- Nesting – each leaf may itself be a DataLad dataset with its own internal structure and history
- Selective distribution – some content is public, some private, some embargoed (see Privacy-Aware Distribution)
Survey of Existing Approaches
Personal Knowledge Management
PARA Method (Tiago Forte) – Four top-level buckets: Projects (active, time-bound), Areas (ongoing responsibilities), Resources (reference material), Archives (inactive). Designed for personal productivity. The actionability axis (active vs. archived) is interesting but the categories don’t map to artifact types or provenance. An institutional vault is almost entirely “Archive” in PARA terms.
Johnny Decimal – Max 10 “areas” (numbered 10-19, 20-29, …), max 10 “categories” per area, then individual items below. The constraint is deliberately rigid – “no more than ten things in any folder.” It could inspire numbering discipline, but the 10×10 = 100 ceiling is too small for a multi-source vault, and the system explicitly discourages deep nesting – the opposite of what DataLad subdatasets encourage.
Research Project Templates
Cookiecutter Data Science – data/ (external, interim, processed), notebooks/, models/, reports/, src/. Good for a single analysis project, but entirely code-and-data oriented – no concept of communications, media, or heterogeneous artifact types.
Kedro – A Python data engineering framework that adds two ideas on top of a Cookiecutter-style layout: a numbered data-layer hierarchy (data/01_raw/, 02_intermediate/, 03_primary/, … 08_reporting/) that encodes pipeline stages in directory names, and a Data Catalog (conf/catalog.yml) that decouples logical dataset names from physical paths and formats. The catalog abstraction is an interesting parallel to DataLad metadata – both let you refer to data by name rather than path. But Kedro is fundamentally a pipeline execution framework (a DAG of Python functions), not a storage layout standard; its directory structure serves pipeline stages, not archival organization. For a detailed comparison of Kedro and DataLad/YODA approaches, see the DataLad Handbook draft chapter.
Harvard RDM / university guides – Generic advice: organize by project, time, location, or file type; document your scheme; keep depth to 3-4 levels. These guides acknowledge heterogeneous data but don’t provide a concrete taxonomy.
Digital Preservation Standards
BagIt (Library of Congress, RFC 8493) – A packaging format: data/ payload + manifest + checksums. A bag is a transfer unit, not an organizational hierarchy. Relevant as a transport mechanism (git-annex already does content-addressing better), but it provides no guidance for what goes inside.
OCFL (Oxford Common File Layout) – Content-addressed objects with versioned directories (v1/content/, v2/content/, …) and an inventory.json. Forward deltas, immutable versions. Very close in spirit to git-annex (content-addressed, versioned, self-describing). But OCFL organizes per object – the hierarchy above the object root is left to the repository.
RO-Crate (Research Object Crate) – A directory with ro-crate-metadata.json (JSON-LD, schema.org-based) describing all contained files. Lightweight, composable, domain-agnostic. It doesn’t prescribe internal folder structure – it describes whatever is there. RO-Crate could complement DataLad as a metadata layer but doesn’t solve the layout question.
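As a concrete sketch, the skeleton of an ro-crate-metadata.json can be generated with nothing but the Python standard library. The @context, descriptor entity, and root dataset entity follow the RO-Crate 1.1 layout; the dataset name and the tracking.parquet file entry are hypothetical:

```python
# Minimal RO-Crate 1.1 skeleton: a descriptor entity pointing at the root
# dataset, the root dataset itself, and one contained file. RO-Crate
# describes whatever files are present; it does not arrange them.
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",      # the descriptor entity
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",                          # the root dataset entity
            "@type": "Dataset",
            "name": "example subdataset",         # hypothetical name
            "hasPart": [{"@id": "tracking.parquet"}],
        },
        {"@id": "tracking.parquet", "@type": "File"},  # hypothetical file
    ],
}

with open("ro-crate-metadata.json", "w") as f:
    json.dump(crate, f, indent=2)
```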
BIDS: More Than Neuroimaging
BIDS (Brain Imaging Data Structure) – Often dismissed as neuroimaging-specific, but its general compositional principles are highly relevant:
Entity-label system – BIDS encodes metadata directly in path components using key-value entities: sub-01/ses-retest/anat/sub-01_ses-retest_T1w.nii.gz. Each path segment carries structured meaning without requiring a database or sidecar lookup.
Hierarchical specificity – Information becomes progressively specific moving down the tree: dataset → subject → session → data type → file. Higher levels define context; lower levels define content.
Metadata inheritance – Sidecar JSON files at higher levels apply to all files below, with lower-level sidecars overriding specific fields. Reduces duplication while allowing per-file precision.
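To make the inheritance rule concrete, here is a minimal sketch (not the reference BIDS implementation; it simplifies matching by ignoring filename entities) that resolves the effective metadata for a file by merging sidecars from the dataset root down to the file’s directory:

```python
# BIDS-style sidecar inheritance, simplified: walk from the dataset root
# down to the target file's directory, merging any sidecar JSON found at
# each level. Deeper, more specific sidecars override shallower ones.
import json
from pathlib import Path

def effective_metadata(root: Path, target: Path, sidecar_name: str) -> dict:
    rel = target.parent.relative_to(root)
    levels = [root]
    for part in rel.parts:            # root -> sub-01 -> ses-retest -> anat
        levels.append(levels[-1] / part)
    merged: dict = {}
    for directory in levels:          # top-down, so deeper keys win
        sidecar = directory / sidecar_name
        if sidecar.is_file():
            merged.update(json.loads(sidecar.read_text()))
    return merged

# Hypothetical usage: a T1w.json may exist at the root, subject, and
# session levels; the result is their merged, leaf-overridden view.
# effective_metadata(Path("ds"),
#                    Path("ds/sub-01/ses-retest/anat/sub-01_ses-retest_T1w.nii.gz"),
#                    "T1w.json")
```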
Separation of concerns – Raw data, source data (sourcedata/), and derived outputs (derivatives/) live in separate directory trees, preventing accidental modification and clarifying provenance.
Study-level structure – A BIDS dataset is also a project: dataset_description.json at the root, participants.tsv for the subject registry, code/, docs/, CHANGES. The EMBER study template makes this explicit: code/, derivatives/, docs/, logs/, scratch/, sourcedata/raw/.
These principles – entity-labeled paths, metadata inheritance, raw/derivative separation – apply to any structured collection, not just brain scans.
Hive Partitioning
BIDS path structure is essentially hive partitioning – the same pattern used by DuckDB, Apache Arrow, and Parquet datasets:
```
# Hive partitioning (DuckDB, Arrow)
orders/year=2021/month=01/data.parquet

# BIDS (current syntax)
sub-01/ses-retest/anat/sub-01_ses-retest_T1w.nii.gz

# BIDS (proposed = syntax, aligning with hive convention)
sub=01/ses=retest/anat/sub=01_ses=retest_T1w.nii.gz
```
When directory names encode key=value pairs, any tool that understands hive partitioning – DuckDB, Pandas, Polars, Arrow – can query the directory tree directly as a dataset, without an index. This is a powerful property for a vault: the folder hierarchy is the queryable schema.
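A minimal sketch with DuckDB’s Python API, assuming the orders/ tree from the example above exists on disk: no index, catalog, or schema registration is involved, and the year and month columns materialize from the directory names themselves.

```python
# Query a hive-partitioned tree directly. The year/month partition keys
# are read out of the key=value directory names, not from the files.
import duckdb

print(duckdb.sql("""
    SELECT year, month, count(*) AS n_files
    FROM read_parquet('orders/year=*/month=*/data.parquet',
                      hive_partitioning = true)
    GROUP BY year, month
    ORDER BY year, month
"""))
```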
mykrok demonstrates this pattern outside neuroimaging – it organizes Strava activity backups (GPS tracks, photos, metadata) using hive-partitioned paths:
```
data/
├── athletes.tsv
├── mykrok.html
└── athl=alice/
    ├── athlete.json
    ├── sessions.tsv
    └── ses=2024-03-15T08-30/
        ├── info.json
        ├── tracking.parquet
        └── photos/
```
Text files are tracked by git, binary content by git-annex – the same split the vault uses. The athl= / ses= hierarchy is directly queryable by DuckDB’s read_parquet('data/athl=*/ses=*/tracking.parquet', hive_partitioning=true).
DataLad Native Nesting
DataLad’s own nesting model is the vault’s foundational mechanism: a superdataset tracks subdataset state (a commit SHA), not content. Each subdataset has its own annex and history. The YODA convention adds inputs/ for consumed data, code/ for scripts, and outputs/ for results – but that’s per-analysis, not per-vault. Nesting depth is unlimited, and subdatasets can live on different servers, making the superdataset a curated catalog pointing to distributed storage.
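A minimal sketch of that mechanism with DataLad’s Python API (the vault/, messaging, and media paths are hypothetical; passing dataset= to create registers the new dataset as a subdataset of the parent):

```python
# DataLad nesting in miniature: the superdataset records each subdataset
# as a commit SHA, while every subdataset keeps its own history and annex.
import datalad.api as dl

dl.create(path="vault")                             # the superdataset
dl.create(path="vault/messaging", dataset="vault")  # registered subdataset
dl.create(path="vault/media", dataset="vault")      # another subdataset

# The superdataset now tracks two pointers, not their content.
for sub in dl.subdatasets(dataset="vault"):
    print(sub["path"], sub.get("gitshasum"))
```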
Emerging Principles
From this survey, several principles emerge for the vault layout:
- Entity-labeled paths (from BIDS / hive partitioning) – encode organizational metadata in directory names so the hierarchy is self-describing and queryable (see the sketch after this list)
- Shallow top, deep leaves (from Johnny Decimal + DataLad nesting) – keep the top-level vault structure broad and flat (artifact categories), but allow each subdataset to have arbitrarily rich internal structure
- Separation of raw and derived (from BIDS / Cookiecutter) – keep ingested raw artifacts separate from any processed/derived outputs
- Metadata at every level (from BIDS / RO-Crate / OCFL) – each directory level should carry a description file (dataset_description.json, .datalad/, or equivalent) that makes it self-contained
- Distribution metadata alongside content (from con/serve privacy model) – distribution-restrictions and provenance annotations travel with the data, enabling selective outbound distribution
- Queryable without a database (from hive partitioning / mykrok) – directory structure doubles as a query schema for tools like DuckDB, eliminating the need for a separate catalog database
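To illustrate the first and last principles together: with entity-labeled components, a hypothetical vault path can be parsed into structured metadata using plain string operations, with no catalog lookup required. The src=/team=/year= layout below is illustrative, not a committed vault schema:

```python
# Entity-labeled paths are self-describing: every key=value directory
# segment doubles as queryable metadata, so the hierarchy is the schema.
from pathlib import PurePosixPath

def path_entities(path: str) -> dict:
    """Collect key=value pairs from a path's directory segments."""
    return dict(part.split("=", 1)
                for part in PurePosixPath(path).parts
                if "=" in part)

print(path_entities("vault/src=slack/team=acme/year=2024/export.jsonl"))
# -> {'src': 'slack', 'team': 'acme', 'year': '2024'}
```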
See Also
- Ingestion Patterns – how data enters the vault
- Conservation to External Resources – how data leaves the vault
- Domain Extensions – domain-specific internal structures within subdatasets