Development Roadmap

This page outlines the active and near-term development tracks for the con/serve platform. Each track is a concrete workstream with a defined goal, not a speculative wish list.

Track 1: Containerized Ingestion Tool Collection #

Goal: A curated collection of OCI containers, each packaging one ingestion tool with its dependencies, for consistent and reliable deployment across environments.

Motivation #

Ingestion tools have heterogeneous dependencies – slackdump needs Go, annextube needs Python and yt-dlp, tg-archive needs Python, ArchiveBox needs Node and Chromium. Installing them all on a single host creates conflicts and makes reproducible deployment difficult.

Containers solve this: each tool runs in an isolated environment with pinned dependencies, and the collection can be tested as a unit.

Scope #

The initial collection should cover the tools named as data sources in the user stories:

Container	Tool(s)	User stories
`conserve/slackdump`	slackdump	Neuroimaging Lab, Software Project
`conserve/annextube`	annextube, yt-dlp	Personal Archive
`conserve/tg-archive`	tg-archive	Personal Archive
`conserve/git-bug`	git-bug	Software Project, Neuroimaging Lab
`conserve/tinuous`	con/tinuous	Software Project, Neuroimaging Lab
`conserve/github-backup`	python-github-backup	Software Project
`conserve/gh-export`	gh-discussions-export, gh-md	Software Project

Design Considerations #

- Base image: minimal Debian or Alpine, with git and git-annex pre-installed so containers can interact with vault datasets directly
- Entrypoint pattern: each container wraps the tool such that it can be invoked as a datalad run command, enabling provenance-tracked ingestion
- Registry: containers published to a registry accessible from the Lab-in-a-Box deployment and from HPC environments (via Singularity/Apptainer conversion)
- Testing: each container tested against a synthetic vault to verify the ingestion workflow end-to-end
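
The entrypoint pattern could be sketched as follows. This is a minimal illustration, not a settled design: it assumes Podman as the runtime, takes the image name from the Scope table, and mounts the vault at a hypothetical `/vault` path inside the container. The function name and its parameters are invented for this example; the only external options used are `datalad run -d/-m/--output` and `podman run --rm/-v/-w`.

```python
import shlex


def build_ingest_command(vault, image, tool_args, output_dir, message):
    """Return the argv for a provenance-tracked, containerized ingestion run.

    Nothing is executed here; the caller decides whether to print the
    command or hand it to subprocess.run().
    """
    vault = str(vault)
    # Inner command: run the tool in its container with the vault bind-mounted.
    # The /vault mount point and working directory are illustrative choices.
    container_cmd = [
        "podman", "run", "--rm",
        "-v", f"{vault}:/vault",
        "-w", "/vault",
        image,
        *tool_args,
    ]
    # Outer command: datalad run records the inner command and its declared
    # outputs in the dataset history.
    return [
        "datalad", "run",
        "-d", vault,
        "-m", message,
        "--output", output_dir,
        "--",
        *container_cmd,
    ]


if __name__ == "__main__":
    argv = build_ingest_command(
        vault="/srv/vaults/lab",            # hypothetical vault path
        image="conserve/slackdump",
        tool_args=["--help"],               # placeholder for real tool arguments
        output_dir="slack",
        message="Ingest Slack workspace export",
    )
    print(shlex.join(argv))
```

Building the command as a list rather than a shell string keeps quoting out of the picture and makes the wrapper easy to test against a synthetic vault without running a container.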

Relationship to datalad-container #

These containers complement datalad-container, which manages containers for processing (BIDS Apps). The ingestion containers serve a different role – they bring data into the vault rather than processing data already in it – but the registration and invocation patterns should be compatible.

Track 2: Lab-in-a-Box Deployment and Testing #

Goal: Streamline the Lab-in-a-Box deployment of Forgejo-Aneksajo, HedgeDoc, and associated services, with automated testing against the user story scenarios.

Current State #

The projects/liab-conserve/ directory contains a working podman-compose development stack:

Service	Status
Forgejo-Aneksajo (behind go-away proxy)	Working
HedgeDoc	Working
User provisioning scripts	Working
Vagrant VM testing (Debian, Ubuntu, Rocky)	Working

The pyinfra deployment in projects/liab-deployments/ targets bare Debian servers with rootless Podman and systemd user units.

Development Goals #

- Integrate ingestion containers – the containerized tool collection (Track 1) should be deployable alongside the core services. A Lab-in-a-Box deployment should include not just the forge and editor, but the ingestion tools needed for the target use case.
- User story test scenarios – automated tests that exercise real ingestion workflows against the deployed stack:
  - Create a vault on Forgejo-Aneksajo
  - Run an ingestion container to import test data
  - Verify the dataset structure, metadata, and provenance
  - Test distribution-restrictions enforcement
- GitHub OAuth integration – the Software Project story requires Forgejo to authenticate via GitHub OAuth and sync team memberships. This needs to be a tested, documented deployment option.
- Multi-distro testing – the Vagrant setup already targets Debian, Ubuntu, and Rocky. CI should exercise deployments on all three to catch distribution-specific issues.
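
The provenance step of the user story test scenarios could be checked along these lines. The sketch assumes the ingestion was recorded with `datalad run` and that the commit message carries a JSON run record between the two marker lines shown in the code; the marker strings reflect how DataLad writes run commits as far as we know and should be verified against the installed version. The function names are hypothetical, and in a real test the message would come from something like `git log -1 --format=%B` in the vault.

```python
import json

# Marker lines that (by assumption) delimit the run record in a commit message.
RUN_RECORD_START = "=== Do not change lines below ==="
RUN_RECORD_END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message):
    """Return the run record embedded in a commit message as a dict, or None."""
    start = commit_message.find(RUN_RECORD_START)
    end = commit_message.find(RUN_RECORD_END)
    if start == -1 or end == -1 or end < start:
        return None
    payload = commit_message[start + len(RUN_RECORD_START):end]
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


def check_provenance(commit_message, expected_outputs):
    """List problems found in one ingestion commit; an empty list means it passed."""
    record = extract_run_record(commit_message)
    if record is None:
        return ["commit has no run record"]
    problems = []
    if record.get("exit") != 0:
        problems.append(f"command exited with {record.get('exit')}")
    missing = sorted(set(expected_outputs) - set(record.get("outputs", [])))
    if missing:
        problems.append(f"outputs not declared: {', '.join(missing)}")
    return problems


if __name__ == "__main__":
    # Synthetic commit message standing in for a real ingestion commit.
    message = "\n".join([
        "[DATALAD RUNCMD] Ingest test data",
        "",
        RUN_RECORD_START,
        json.dumps({"cmd": "echo test", "exit": 0, "outputs": ["slack"]}, indent=1),
        RUN_RECORD_END,
    ])
    print(check_provenance(message, ["slack"]))
```

Returning a list of problems instead of raising on the first one lets a single test run report every gap in an import.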

Relation to Production #

The liab-conserve compose stack is the development environment; liab-deployments is the production deployment tool. Changes should flow from dev (compose) to production (pyinfra) after testing in Vagrant VMs.

Track 3: Ingestion Skills for Agent-Assisted Operations #

Goal: Create the first Claude Code skills (.claude/commands/) that encode repeatable ingestion workflows, bridging the gap between ad-hoc agent use and the solidified pipelines they will eventually become.

Motivation #

The agent TODO sections identify operations that are candidates for solidification. Skills are the intermediate step: more structured than a freeform agent conversation, less rigid than a deterministic CLI tool.

A skill encodes:

- What questions to ask (source type, access credentials, vault location)
- What tools to invoke and in what order
- What metadata to set
- What verification to perform
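
The four elements a skill encodes can be captured in a small data structure and rendered as a Markdown command file. The following is a sketch under stated assumptions: `SkillSpec`, `render_skill`, the section titles, and the example values are all hypothetical, and the only external fact relied on is that the skills live as files under `.claude/commands/`.

```python
from dataclasses import dataclass, field


@dataclass
class SkillSpec:
    """The four things a skill encodes: questions, tools, metadata, verification."""
    name: str
    questions: list = field(default_factory=list)
    tools: list = field(default_factory=list)        # invoked in list order
    metadata: dict = field(default_factory=dict)
    verification: list = field(default_factory=list)


def render_skill(spec):
    """Render a spec as Markdown text suitable for a command file."""
    lines = [f"# {spec.name}", "", "## Ask the operator"]
    lines += [f"- {question}" for question in spec.questions]
    lines += ["", "## Run, in order"]
    lines += [f"{i}. {tool}" for i, tool in enumerate(spec.tools, start=1)]
    lines += ["", "## Set metadata"]
    lines += [f"- {key}: {value}" for key, value in spec.metadata.items()]
    lines += ["", "## Verify"]
    lines += [f"- {check}" for check in spec.verification]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    spec = SkillSpec(
        name="vault.add-source",
        questions=["Source type?", "Access credentials?", "Vault location?"],
        tools=["ingestion tool wrapped in datalad run"],
        metadata={"distribution-restrictions": "as chosen by the operator"},
        verification=["Run record present in the ingestion commit"],
    )
    print(render_skill(spec))
```

Keeping the spec separate from its rendered text means the same structure could later feed a deterministic CLI tool once a skill is ready to be solidified.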

Both human operators and agents can invoke skills identically (/vault.add-source, /vault.discover), making them the shared interface between human and AI-assisted operations.

Initial Skill Set #

Derived from agent TODOs and user story workflows:

Skill	Agent	Purpose
`/vault.discover`	discovery-curator	Scan environment for archivable data sources, produce prioritized inventory
`/vault.add-source`	ingestion-curator	Walk through adding a new data source: classify ingestion paradigm, create `datalad run` wrapper, set metadata
`/vault.audit-ingestion`	ingestion-curator	Audit existing imports for metadata completeness, provenance coverage, idempotency
`/vault.triage`	pipeline-operator	Investigate a pipeline failure: check logs, consult experience ledger, propose resolution
`/vault.layout`	vault-architect	Design or review vault directory structure against YODA/STAMPED conventions

Development Approach #

Start with /vault.add-source – it is the most commonly needed operation and touches all the concepts (ingestion patterns, metadata annotation, distribution-restrictions, datalad run).

The skill should:

- Ask what kind of source is being added (messaging, media, code artifacts, etc.)
- Identify the appropriate tool from the catalog
- Generate the datalad run wrapper command
- Specify metadata annotations
- Recommend a vault location following conventions
- If a container exists (Track 1), use it
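
The steps listed above might translate into a planning function like the one below. It is only a sketch: the catalog grouping by source kind is an assumption (tool and image names are taken from the Track 1 Scope table), `plan_add_source` and its parameters are hypothetical, the `/vault` mount point is illustrative, and recording `distribution-restrictions` through `git annex metadata --set` is assumed rather than confirmed as the project's convention. Tool-specific arguments are deliberately left out.

```python
# Hypothetical catalog: source kind -> (tool, container image or None).
CATALOG = {
    "messaging": ("slackdump", "conserve/slackdump"),
    "media": ("annextube", "conserve/annextube"),
    "code artifacts": ("python-github-backup", "conserve/github-backup"),
}


def plan_add_source(kind, vault_root, location, restriction=None):
    """Return a reviewable plan (a plain dict) for adding one data source."""
    if kind not in CATALOG:
        raise ValueError(f"no catalogued tool for source kind {kind!r}")
    tool, image = CATALOG[kind]

    # Prefer the container when one exists, otherwise call the tool directly.
    if image:
        runner = ["podman", "run", "--rm",
                  "-v", f"{vault_root}:/vault", "-w", "/vault", image]
    else:
        runner = [tool]

    # Wrap the runner in datalad run so the import is provenance-tracked.
    run = ["datalad", "run", "-d", vault_root,
           "-m", f"Ingest {kind} source with {tool}",
           "--output", location, "--", *runner]

    # Optional metadata annotation on everything under the target location.
    annotate = None
    if restriction:
        annotate = ["git", "annex", "metadata",
                    "--set", f"distribution-restrictions={restriction}",
                    location]

    return {"tool": tool, "container": image, "location": location,
            "run": run, "annotate": annotate}


if __name__ == "__main__":
    plan = plan_add_source("messaging", "/srv/vaults/lab", "slack",
                           restriction="sensitive")
    for key, value in plan.items():
        print(f"{key}: {value}")
```

Because the plan is plain data, it can be reviewed, edited, and committed before anything runs, which matches the structured-output goal described for skills.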

Each skill invocation produces structured output that can be reviewed, edited, and committed. Over time, patterns that recur across invocations become candidates for solidification into deterministic CLI commands – the /vault.add-source skill might eventually generate a vault-ingest config file that a CLI tool consumes.

Research Projects #

Self-Contained GitHub Repository Archival #

The Software Project user story archives multiple artifact types per GitHub repository (code, issues, CI logs, discussions). The vault layout groups these under a per-repo superdataset, but the vault’s deep hierarchy doesn’t map directly to a forge’s flat {org}/{repo} namespace.

The self-contain-github-repo project explores this duality: how to organize repo artifacts on the vault side (subdataset structure) and represent them on the forge side (naming conventions, git namespaces, ref prefixes). This directly affects Track 2 (how repos are organized on Forgejo-Aneksajo) and Track 1 (how ingestion containers deposit artifacts into the vault).

Dependencies Between Tracks #

Track 1: Containers ─────────┐
                              ├──→ Track 2: LiaB Testing
Track 2: LiaB Deployment ────┘        (test containers in deployed stack)
                                           │
Track 3: Skills ───────────────────────────┘
    (skills invoke containers,
     test against deployed stack)

Track 1 and Track 2 can proceed in parallel. Track 3 benefits from both but can start independently with skills that invoke tools directly (without containers).

See Also #