
Metadata Extraction and Dependencies


Every dataset in the vault contains more knowledge than its raw files alone express. A BIDS neuroimaging dataset has per-subject file counts, scan durations, voxel dimensions, modality coverage. A YouTube archive has per-channel video counts, total durations, caption availability. A Slack export has message counts, attachment sizes, date ranges.

This data knowledge – intrinsic properties of the data itself, independent of where or how it is processed – needs to be extracted, summarized into navigable tables, and kept up to date as the data evolves.

This is distinct from operational knowledge (which processing steps fail on which systems, how much memory a particular cluster needs) – that is the domain of the Experience Ledger. Data knowledge is about the data; operational knowledge is about the infrastructure. The two connect – the experience ledger correlates failures with dataset characteristics – but they are maintained separately.

The Pattern: Hierarchical Extraction #

The vault already demonstrates this pattern in several places. The Data-Visualization Separation page describes the presentation side; here we focus on the extraction and dependency tracking that keeps the summaries current.

Existing Examples #

annextube builds a three-tier summary pyramid: metadata.json (per-video) → videos.tsv (per-channel) → channels.tsv (all channels). Each level is distilled from the one below.

mykrok does the same with hive-partitioned paths: info.json (per-session) → sessions.tsv (per-athlete) → athletes.tsv (all athletes).

BIDS applies the pattern to neuroimaging: common sidecar metadata (e.g., task-rest_bold.json) at higher directory levels applies to all files below (metadata inheritance); participants.tsv at the root summarizes all subjects; dataset_description.json describes the dataset as a whole. HeuDiConv extracts shared acquisition parameters and places them at the appropriate inheritance level.

OpenNeuroStudies extends this with sourcedata+subjects.tsv – a per-subject operational profile:

| source_id | subject_id | bold_num | bold_size | bold_duration_total | bold_voxels_total | t1w_num | t1w_size | datatypes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ds000001 | sub-01 | 3 | 141871303 | 1800.0 | 40550415 | 6 | 63237 | anat,func |
| ds000001 | sub-02 | 3 | 139465821 | 1800.0 | 40550415 | 4 | 87612 | anat,func |

These summaries distill operationally relevant properties: file counts and sizes by modality, scan durations, voxel counts – exactly the fields that predict resource requirements for downstream processing.

The General Rule #

Every dataset type in the vault should maintain summary tables at appropriate granularities:

| Artifact type | Per-item summary | Aggregation level | Top-level summary |
| --- | --- | --- | --- |
| BIDS neuroimaging | per-subject: modalities, file sizes, durations, voxel counts | per-session or per-task | participants.tsv + sourcedata+subjects.tsv |
| YouTube archive | per-video: duration, resolution, caption availability | per-channel: video count, total duration | channels.tsv |
| Slack workspace | per-channel: message count, attachment size, date range | per-workspace | workspaces.tsv |
| Citations | per-reference: has PDF, PDF size, year | per-collection | collections.tsv |
| Git forge | per-issue/PR: comment count, attachment size | per-repository | repos.tsv |

These are data knowledge tables – properties of the content itself, useful for any consumer (human, AI, pipeline scheduler, dashboard) regardless of what system processes them.

Row-Level Dependency Tracking #

The key challenge is keeping summaries up to date without re-extracting everything from scratch.

When sub-01/ses-01 gets new data, only that row in sourcedata+subjects.tsv needs updating. When study-ds000001/ changes, only that study’s row in the vault-level studies.tsv needs refreshing. When a single video is added to a YouTube channel, only that video’s row in videos.tsv and the aggregated channels.tsv row need updating.

This is a dependency tracking problem, and git already solves it.

Git as Dependency Tracker #

Git knows exactly what changed between any two states. The dependency mechanism is:

  1. Detect what changed: git diff --name-only between the last extraction and the current state identifies which leaf directories have new or modified content
  2. Map changes to summary rows: directory structure encodes the mapping – a change under sub-01/ses-01/ maps to the sub-01 + ses-01 row in the per-subject summary
  3. Re-extract affected rows only: run the extractor scoped to the changed paths
  4. Update summary tables: merge new rows into existing TSV/Parquet, then propagate upward to higher-level aggregations
  5. Commit the update: datalad run wraps the extraction, recording provenance (which inputs triggered which summary update)

Because the extraction is wrapped in datalad run, the whole process is idempotent: git reset --hard to any prior state and re-running the extraction pipeline will regenerate exactly the correct summaries. No external dependency database is needed – git is the dependency tracker.
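The path-to-row mapping in step 2 can be sketched in a few lines. This is an illustrative sketch, not vault tooling: `rows_for_paths` and `changed_paths` are hypothetical helper names, and the regex assumes a BIDS-style `sub-*/ses-*/` layout.

```python
import re
import subprocess

SESSION_RE = re.compile(r"(sub-[^/]+)/(ses-[^/]+)/")

def rows_for_paths(paths):
    """Map changed file paths to the (subject, session) summary rows they affect."""
    rows = set()
    for path in paths:
        m = SESSION_RE.match(path)
        if m:
            rows.add(m.groups())
    return sorted(rows)

def changed_paths(since="HEAD~1"):
    """Paths touched since the last extraction, per git."""
    out = subprocess.run(
        ["git", "diff", "--name-only", since],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()
```

Multiple changed files under the same session collapse into a single row to re-extract, which is what keeps the update proportional to the change rather than the dataset size.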

Example: Incremental BIDS Summary Update #

# What changed since last extraction?
git diff --name-only HEAD~1 -- sub-*/ses-*/

# Output: sub-01/ses-02/func/sub-01_ses-02_task-rest_bold.nii.gz
# → only sub-01/ses-02 needs re-extraction

# Re-extract that row
datalad run -m "Update sourcedata+subjects.tsv for sub-01/ses-02" \
    --input sub-01/ses-02/ \
    --output sourcedata+subjects.tsv \
    python code/extract_subject_summary.py --subject sub-01 --session ses-02

The --input and --output declarations make the dependency explicit in the provenance record. datalad rerun can replay this extraction later if the extraction script itself is updated.

Cascading Dependencies #

Changes propagate upward through the summary hierarchy:

sub-01/ses-02/ changed
  → update row in sourcedata+subjects.tsv         (per-subject summary)
  → update row in sourcedata+datasets.tsv          (per-dataset summary, if in a superdataset)
  → update row in studies.tsv                       (vault-level overview)

The same cascade applies to derivatives: if sub-01/ses-02 has new raw data, existing derivatives for that subject/session may need recomputation. The dependency is the same: git knows what changed, the pipeline knows what depends on it.

This connects directly to the BIDS-flux branching model: a new or updated subject/session arrives on a feature branch; merging it triggers both metadata re-extraction and derivative reprocessing for exactly the affected records.

Building on datalad-metalad #

datalad-metalad provides the extraction framework for this: pluggable extractors that run against datasets and files, producing structured metadata records stored in git, aggregated up superdataset hierarchies, and queried via meta-dump. The DataLad Registry (23,443 datasets, ~2 PB of annexed content) is built on metalad extractors running at scale.

Metalad’s Extraction Pipeline #

| Command | Purpose |
| --- | --- |
| meta-extract | Run an extractor against a dataset or file |
| meta-add | Store metadata records in the dataset’s git repo |
| meta-aggregate | Consolidate metadata from subdatasets upward |
| meta-dump | Query and filter stored metadata |
| meta-conduct | Orchestrate multi-stage extraction pipelines |

Existing Extractors #

| Extractor | Scope | What it captures |
| --- | --- | --- |
| metalad_core | dataset + file | Dataset ID, version, contributors, dates, contentbytesize (total and per-file), distributions (remote URLs, annex UUIDs) |
| metalad_annex | file | git-annex metadata fields (including distribution-restrictions, custom key-value pairs) |
| metalad_runprov | dataset | PROV-DM records from datalad run commits: activities, generated entities, agent associations |
| bids_dataset | dataset | BIDS entities present (subjects, sessions, tasks), variable names from TSV files, dataset_description.json fields |
| nifti1 | file | NIfTI header: voxel dimensions, data type, affine, TR, image shape |
| dicom | file | DICOM tags: patient, study, series, acquisition parameters |
| genericjson_dataset/file | both | Arbitrary JSON metadata from sidecar files |
| external_dataset/file | both | Delegate to an external process – the general extension mechanism |

The metalad_core extractor already captures per-file contentbytesize and dataset-level totals – the raw material for summary tables. The nifti1 and dicom extractors already read acquisition parameters from imaging files. What’s missing is the summarization step that compacts per-file metadata into per-subject/per-record rows and exports them as tabular files.
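That summarization step is mechanically simple once per-file records exist. A sketch of compacting per-file metadata into one per-subject row, assuming hypothetical record dicts whose keys loosely mirror what metalad_core and nifti1 report (contentbytesize, datatype, per-run duration and voxel count) rather than the extractors’ actual JSON-LD output:

```python
def summarize_subject(file_records):
    """Compact per-file metadata records into one per-subject summary row."""
    row = {
        "bold_num": 0,
        "bold_size": 0,
        "bold_duration_total": 0.0,
        "bold_voxels_total": 0,
        "datatypes": set(),
    }
    for rec in file_records:
        row["datatypes"].add(rec["datatype"])
        if rec.get("suffix") == "bold":
            row["bold_num"] += 1
            row["bold_size"] += rec["contentbytesize"]
            row["bold_duration_total"] += rec.get("duration", 0.0)
            row["bold_voxels_total"] += rec.get("voxels", 0)
    row["datatypes"] = ",".join(sorted(row["datatypes"]))
    return row
```

The per-record row is then ready to merge into sourcedata+subjects.tsv; the shape matches the table shown earlier (counts, sizes, durations, voxel totals, datatype coverage).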

Gaps in Existing Tooling #

The metalad framework is extensible – new extractors can be Python classes or external scripts, meta-conduct supports multi-stage pipelines, and storage in git means results are versioned alongside the data. But several gaps remain between what metalad provides today and what the vault’s extraction pattern requires:

  • Per-record summarization – existing extractors capture per-file metadata (NIfTI headers, DICOM tags, file sizes) and dataset-level presence inventories (which entities exist), but do not compute per-subject/per-record operational profiles like sourcedata+subjects.tsv. The summarization step – compacting per-file metadata into per-record rows – is missing.
  • Tabular output – metalad stores metadata as JSON-LD, which is semantically rich but not directly consumable as the TSV/Parquet summary tables that operators, dashboards, and query tools need.
  • Incremental/scoped extraction – metalad currently extracts against a full dataset or file. The row-level dependency tracking described above requires scoping extraction to changed paths and merging results into existing summaries.
  • Aggregation granularity – meta-aggregate consolidates subdataset metadata upward, but the same incremental update logic is needed: when one subdataset changes, only its contribution to the parent should be refreshed.

Execution Telemetry and the Experience Ledger #

Two of metalad’s extractors serve the Experience Ledger rather than data knowledge:

  • metalad_runprov extracts PROV-DM records from datalad run commits – the provenance layer of operational knowledge
  • A new execution telemetry extractor would read con/duct .duct/ JSON Lines logs and produce structured metadata (peak RSS, CPU, wall time, exit code) linked to the datalad run commit

These extractors produce operational knowledge (specific to the processing system), not data knowledge (intrinsic to the data). The metalad framework can host both, but the outputs feed different consumers: data summaries feed dashboards and query tools; execution telemetry feeds the experience ledger and its failure pattern analysis.

Prior Art: Nipoppy Trackers and Neurobagel Digest #

Nipoppy and Neurobagel have already built concrete implementations of processing status tracking and metadata dashboarding – without DataLad/git-annex underneath, but with reusable components.

Nipoppy’s Processing Trackers #

Nipoppy maintains two tracker TSV files:

curation_status.tsv – tracks BIDSification progress per subject/session:

| participant_id | session_id | in_pre_reorg | in_post_reorg | in_bids |
| --- | --- | --- | --- | --- |
| sub-01 | ses-01 | true | true | true |
| sub-02 | ses-01 | true | true | false |

processing_status.tsv – tracks pipeline completion per subject/session:

| participant_id | session_id | pipeline_name | pipeline_version | pipeline_step | status |
| --- | --- | --- | --- | --- | --- |
| sub-01 | ses-01 | fmriprep | 24.1.1 | participant | SUCCESS |
| sub-01 | ses-01 | mriqc | 24.2.0 | participant | FAIL |

Status is determined by file-existence checks: each pipeline declares expected output paths (with glob patterns and [[NIPOPPY_PARTICIPANT_ID]] / [[NIPOPPY_SESSION_ID]] templates) in a tracker.json configuration. The tracker runs independently after pipeline execution, checking whether all declared outputs exist.

This is exactly the “summary table of processing state” that the vault needs – and the file-existence approach is simple, idempotent, and works with any pipeline.

Neurobagel Digest #

Neurobagel Digest consumes Nipoppy’s processing_status.tsv files and renders them as interactive dashboards (Dash/Plotly) for exploring per-subject pipeline availability. The digest schema extends the tracker with timing columns (pipeline_starttime, pipeline_endtime) and supports both imaging and phenotypic digests.

The broader Neurobagel ecosystem adds:

  • bagel-cli – extracts subject-level metadata from BIDS datasets + annotated phenotypic TSVs, producing JSON-LD for a semantic graph database (SPARQL-queryable)
  • annotation-tool – web UI for mapping phenotypic columns to controlled vocabularies (SNOMED CT, NCIT, NIDM)
  • federation-api – cross-site query across distributed Neurobagel nodes

Reuse and Integration #

For the vault, the relevant components are:

Directly reusable:

  • Nipoppy’s tracker pattern (file-existence status determination) generalizes to any pipeline and any artifact type. The same approach works for annextube caption extraction, Slack export completeness, or any datalad run-wrapped step.
  • The processing_status.tsv schema is a natural output format for vault pipeline status.
  • Digest’s dashboarding of processing status TSVs.

Interoperability gap:

  • Nipoppy assumes traditional filesystem storage, not git-annex content-addressed storage. File-existence checks need adaptation for annexed files (which appear as symlinks; content may or may not be locally present).
  • Nipoppy’s manifest.tsv (ground-truth participant list) overlaps with BIDS participants.tsv – in a DataLad-backed vault, git history already provides the ground truth for what exists.
  • Neurobagel’s graph database (RDF/SPARQL) is a separate infrastructure component; the vault’s approach leans toward summary tables queryable by lightweight tools (DuckDB, VisiData) rather than requiring a graph store.

Integration path: the vault should be able to produce Nipoppy-compatible tracker outputs from its own processing records (datalad run provenance + file existence), so that Neurobagel digest and bagel-cli can consume them without requiring Nipoppy’s own orchestration layer. Conversely, datasets organized by Nipoppy can be ingested into the vault with their tracker outputs preserved as metadata.

Relation to Derivatives Recomputation #

The dependency tracking mechanism for metadata extraction is the same one needed for derivative recomputation.

If sub-01/ses-02 gets new raw data:

  • The metadata extraction pipeline updates that subject’s row in the summary table
  • The derivative pipeline checks whether existing derivatives (fMRIPrep outputs, MRIQC reports) are still valid for that subject/session
  • If not, it triggers reprocessing – but only for the affected records, not the entire dataset

Both are triggered by the same signal (git diff showing what changed), follow the same pattern (scope to affected records, rerun, commit), and benefit from the same idempotency guarantee (git reset --hard + datalad rerun).

The Automation and Pipelines page describes the execution primitives (datalad run, con/duct) and orchestration (BIDS-flux, Forgejo Actions). This page describes the dependency logic that determines what needs re-execution and at what granularity.

Open Questions #

  • metalad revival scope – how much of metalad’s current codebase is reusable as-is vs. needing modernization? Last released January 2024 with 141 open issues – is the extraction pipeline stable enough to build on, or does the core need work first?
  • Tabular vs. JSON-LD – operators and agents need flat tables; metalad stores rich graphs. Is the right answer a tabular export processor in meta-conduct, a separate meta-dump --format=tsv output mode, or tabular files generated alongside JSON-LD during extraction?
  • Granularity model – metalad supports file-level and dataset-level extraction. Per-subject/per-record summaries are an intermediate level that doesn’t cleanly map to either. How to represent this in the extractor API?
  • Schema alignment – can metalad’s JSON-LD output be aligned with the DataLad concepts vocabulary so that registry, catalog, and extraction results all speak the same schema?
  • Cross-vault metadata sharing – if two labs extract the same metadata from similar datasets, can they share summary tables (without sharing the data itself) to pool knowledge about dataset characteristics?
