
Domain Extensions


The con/serve platform is media-agnostic at its core. git-annex does not care whether a file is a video, a DICOM scan, a genome sequence, or a PDF – it stores content-addressed blobs and tracks their locations. DataLad does not care whether a dataset holds Slack exports or neuroimaging data – it manages versioned collections of files.

This generality is a strength, but real research domains have specific needs: specialized file formats, conversion pipelines, metadata standards, and community archives. Domain extensions layer these domain-specific concerns on top of the generic con/serve platform without modifying its core.

The Extension Model

A domain extension adds four layers to the generic platform:

1. Domain-Specific Ingestion

Specialized tools for acquiring data in domain formats:

  • What formats are produced by instruments and services in this domain?
  • What tools convert raw acquisitions into standardized representations?
  • What metadata must be captured alongside the data?
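
As a sketch of what ingestion can look like in practice, the following uses DataLad's Python API to create a dataset and record an instrument export. The paths and session naming are hypothetical, and a real extension would also capture sidecar metadata:

```python
import shutil
import datalad.api as dl

# Create a dataset to receive raw acquisitions (path is hypothetical).
ds = dl.create(path="raw-acquisitions")

# Stage files exported by the instrument (source path is hypothetical),
# then record them; git-annex stores the content-addressed blobs.
shutil.copytree("/mnt/scanner-export/session-001",
                "raw-acquisitions/session-001")
ds.save(message="Ingest raw instrument export for session 001")
```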

2. Conversion Pipelines

Transformations from raw/proprietary formats to community standards:

  • What community standard does this domain use? (BIDS, NWB, FASTQ, TEI, etc.)
  • What tools perform the conversion?
  • What validation ensures correctness?
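
Wrapping a converter in `datalad run` records the command, its inputs, and its outputs in the dataset history, so the conversion itself becomes provenance. A minimal sketch, in which the converter and validator commands are hypothetical placeholders:

```python
import datalad.api as dl

ds = dl.Dataset("raw-acquisitions")

# Run the (placeholder) converter under `datalad run` so the command,
# its inputs, and its outputs are recorded in the dataset history.
ds.run(
    cmd="convert-to-standard --in {inputs} --out {outputs}",  # hypothetical tool
    inputs=["session-001"],
    outputs=["standardized/session-001"],
    message="Convert session 001 to the community standard",
)

# Validation can be captured the same way, with a domain validator
# (also a placeholder here).
ds.run(cmd="community-validator standardized", message="Validate conversion")
```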

3. Publishing Targets

Domain-specific archives and repositories:

  • Where does this domain publish datasets?
  • What metadata schemas do these archives require?
  • What tools handle the submission process?
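
A publishing step usually amounts to creating a sibling for the target archive and pushing to it. The sketch below assumes the datalad-osf extension is installed; the project title and sibling name are illustrative:

```python
import datalad.api as dl

ds = dl.Dataset("standardized")

# With the datalad-osf extension installed, an OSF project can act as
# the publishing target (title and sibling name are illustrative).
dl.create_sibling_osf(title="My Domain Dataset", name="osf", dataset=ds)

# Push version history and annexed file content to the sibling.
ds.push(to="osf")
```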

4. Metadata Standards

Controlled vocabularies, ontologies, and schemas:

  • What ontologies describe this domain’s data?
  • How are they represented in the DataLad dataset metadata?
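
One lightweight approach is to keep ontology-based descriptions as JSON-LD files versioned inside the dataset itself. The sketch below uses schema.org terms purely for illustration; a real extension would draw on its domain's ontologies:

```python
import json
import datalad.api as dl

ds = dl.Dataset("standardized")

# Illustrative JSON-LD description using schema.org terms; a real
# extension would use its domain's controlled vocabularies.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Session 001, standardized",
    "keywords": ["example-domain", "raw-conversion"],
}
with open("standardized/metadata.jsonld", "w") as f:
    json.dump(record, f, indent=2)

# Record the metadata file in the dataset history alongside the data.
ds.save(path="metadata.jsonld", message="Add dataset-level metadata")
```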

Example: Neuroimaging Extension

Neuroimaging is the most developed domain extension in the con/serve ecosystem, reflecting the platform's origins in the Center for Open Neuroscience.

Ingestion

| Source | Tool | Output |
| --- | --- | --- |
| MRI scanner (DICOM) | Direct acquisition / PACS export | Raw DICOM files |
| Stimulus presentation | ReproStim | Stimulus logs, screen recordings |
| Behavioral events | CurDes BIRCH | Event timing files |
| Environmental sensors | Custom loggers | Temperature, humidity, noise logs |

Conversion

| From | To | Tool |
| --- | --- | --- |
| DICOM | BIDS | HeuDiConv with ReproIn naming |
| Raw physiology | BIDS physio | Custom scripts |
| Stimulus logs | BIDS events | ReproStim exporters |

The BIDS (Brain Imaging Data Structure) standard defines directory layouts, naming conventions, and metadata files for neuroimaging datasets. HeuDiConv and ReproIn automate the conversion from scanner-native DICOM to BIDS-compliant DataLad datasets.
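
A typical invocation, sketched here with hypothetical paths, runs HeuDiConv with its built-in ReproIn heuristic so that subject and session identifiers are inferred from the scanner protocol names:

```python
import subprocess

# Convert a DICOM session to BIDS with the ReproIn heuristic.
# Paths are hypothetical; with ReproIn naming, subject and session
# are inferred from the scanner protocol names.
subprocess.run(
    [
        "heudiconv",
        "--files", "/mnt/scanner-export/session-001",
        "-f", "reproin",     # built-in ReproIn heuristic
        "--bids",            # emit a BIDS-compliant layout
        "--datalad",         # maintain the output as DataLad datasets
        "-o", "bids",
    ],
    check=True,
)
```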

Publishing

| Target | Tool | What Gets Published |
| --- | --- | --- |
| OpenNeuro | datalad push / OpenNeuro CLI | BIDS datasets |
| DANDI | datalad-dandi | NWB neurophysiology data |
| EMBER | DataLad publish | Multi-modal brain data |
| OSF | datalad-osf | Any dataset |

Metadata Standards

  • BIDS – file naming, directory structure, JSON sidecars
  • NWB (Neurodata Without Borders) – electrophysiology data format
  • NIDM (Neuroimaging Data Model) – provenance and results reporting
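
At minimum, a BIDS dataset carries a `dataset_description.json` at its root, with `Name` and `BIDSVersion` as the required fields. A minimal example (values and the output path are illustrative):

```python
import json

# Minimal BIDS dataset_description.json; "Name" and "BIDSVersion"
# are the required fields. The "bids/" root is hypothetical.
description = {
    "Name": "Example Neuroimaging Study",
    "BIDSVersion": "1.8.0",
    "License": "CC0",
    "Authors": ["A. Researcher"],
}
with open("bids/dataset_description.json", "w") as f:
    json.dump(description, f, indent=2)
```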

Other Potential Domain Extensions

The extension model is not limited to neuroimaging. Any research domain with specialized formats, conversion needs, and publishing targets can define its own extension.

Genomics

| Layer | Examples |
| --- | --- |
| Ingestion | FASTQ from sequencers, BAM/CRAM from alignment |
| Conversion | Raw reads to aligned, annotated genomes |
| Publishing | SRA, ENA, GEO, dbGaP |
| Standards | FASTQ, BAM, VCF, BED |
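
Generic con/serve tooling already covers much of this: for example, `datalad addurls` can populate a dataset from a table of archive URLs. In this sketch the dataset, the CSV file, and its `accession`/`url` columns are all hypothetical:

```python
import datalad.api as dl

# "genomics-raw" is an existing dataset; runs.csv is a hypothetical
# table with "accession" and "url" columns, one row per FASTQ file
# hosted by an archive such as ENA.
dl.addurls(
    dataset="genomics-raw",
    urlfile="runs.csv",
    urlformat="{url}",
    filenameformat="fastq/{accession}.fastq.gz",
)
```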

Environmental Science

| Layer | Examples |
| --- | --- |
| Ingestion | Sensor networks, satellite imagery, weather stations |
| Conversion | Raw telemetry to NetCDF, CF conventions |
| Publishing | PANGAEA, EOSDIS, DataONE |
| Standards | CF conventions, ISO 19115, EML |

Digital Humanities

| Layer | Examples |
| --- | --- |
| Ingestion | Digitized manuscripts, archival photographs, oral histories |
| Conversion | OCR, TEI encoding, IIIF manifests |
| Publishing | HathiTrust, Internet Archive, institutional repositories |
| Standards | TEI, Dublin Core, METS, IIIF |

Building a Domain Extension

A domain extension is not a formal plugin system – it is a pattern. To create one:

  1. Identify the domain-specific tools that your community already uses
  2. Document the ingestion sources and the formats they produce
  3. Define conversion pipelines from raw to standardized formats
  4. List the publishing targets and their submission requirements
  5. Map the metadata standards to DataLad dataset metadata

The result is a set of documentation, scripts, and configurations that sit alongside the generic con/serve tools. The core platform handles storage, versioning, and distribution; the domain extension handles everything that is specific to your field.
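
Purely as an illustration of the pattern (there is no formal plugin API), the layers identified in the steps above can be summarized in a small configuration that the extension's scripts read:

```python
# Illustrative only: a summary of the extension's layers as data that
# accompanying ingestion/conversion/publishing scripts could read.
# None of these names come from a formal con/serve API.
EXTENSION = {
    "domain": "example-domain",
    "ingestion": {
        "sources": ["instrument-X export"],
        "formats": ["raw-X"],
    },
    "conversion": {
        "pipeline": "convert-to-standard",
        "validator": "community-validator",
    },
    "publishing": {
        "targets": ["community-archive"],
        "requirements": ["community metadata schema"],
    },
    "metadata": {
        "standards": ["domain-ontology", "schema.org"],
    },
}
```

Keeping this summary as plain data rather than code underlines the point above: the extension is documentation and convention, and the generic tools do the work.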
