Concepts

The Tools section catalogs individual tools. The Infrastructure section describes the services that host them. This section describes the patterns and architectural concepts that tie everything together.

These are not tool-specific – they apply across artifact types and describe how the con/serve ecosystem works as a whole.

Topics

Ingestion Patterns – Common strategies for pulling data into git-annex repositories: direct download, API extraction, crawling, mounting, and bridging.

Conservation to External Resources – How to publish and back up from your git-annex vault to cloud storage, domain archives, and institutional repositories.

Vault Organization – A survey of directory organization approaches – PARA, Johnny Decimal, BagIt, OCFL, RO-Crate, BIDS, hive partitioning – and the principles that should guide the layout of a heterogeneous DataLad superdataset vault.

Data-Visualization Separation – The MVC principle applied to archived data: keep collected data in standard formats (TSV, JSON, Parquet), build hierarchical summaries for navigation, and let use-case-appropriate viewers (VisiData, Datasette, custom HTML) attach freely.

Automation and Pipelines – Triggering ingestion on external events, multi-step data transformation (ETL), human-and-AI-in-the-loop curation, branch-based workflow orchestration (BIDS-flux), observability dashboards, and idempotent processing over git/git-annex/DataLad.

Metadata Extraction and Dependencies – Extracting data knowledge from vault datasets into hierarchical summary tables, with git-native dependency tracking for incremental recomputation. Covers per-record granularity (update one row when one subject changes), cascading dependencies, datalad-metalad as extraction engine, and the connection to derivative reprocessing.

Experience Ledger – Compressing operational experiences into reusable knowledge: failure patterns, resource baselines, and operational heuristics extracted from execution logs (con/duct, con/tinuous, ReproMan). Operational knowledge specific to the processing system, distinct from data knowledge about the datasets themselves.

AI Agents and Vault Operations – How AI agents operate within the vault – discovery, ingestion, architecture, and pipeline monitoring – and the solidification lifecycle from agent-assisted exploration to deterministic pipelines.

Domain Extensions – How the generic con/serve platform extends to domain-specific workflows like neuroimaging, genomics, or digital humanities, adding specialized formats, conversion pipelines, and publishing targets.