citations-collector
Table of Contents
citations-collector is a tool for building and maintaining comprehensive, version-controlled collections of scholarly citations. Given a set of seed publications (your lab’s papers, a project’s key references), it discovers all citing and cited works across multiple sources, acquires full-text PDFs where available, syncs with Zotero for reference management, and stores everything in a DataLad dataset with full provenance tracking.
The Problem #
Scholarly citation management is fragmented:
- Discovery is scattered across databases: CrossRef has DOI metadata, OpenCitations has citation graphs, DataCite covers datasets, OpenAlex aggregates from multiple sources. No single service has complete coverage.
- PDF acquisition is manual: you search Google Scholar, check Unpaywall, try preprint servers, and download one PDF at a time. The provenance (where did this PDF come from?) is lost.
- Reference management lives in Zotero or Mendeley – proprietary databases that do not version-track changes or integrate with data management workflows.
- Citation relationships (who cites whom, and how) are not captured in reference managers at all.
citations-collector addresses all four problems in a single, DataLad-native workflow.
Architecture #
my-citations/ # DataLad dataset
.datalad/
citations/
seed-papers.yaml # Seed publications (input)
discovered/
10.1234-paper-a.yaml # Discovered citation records
10.5678-paper-b.yaml # Each with full metadata
...
graph/
citation-graph.json # Citation relationships
pdfs/
10.1234-paper-a.pdf # git-annex (content-addressed)
10.5678-paper-b.pdf # git-annex (provenance tracked)
zotero/
collection-export.json # Zotero library sync
schema/
citation.yaml # LinkML schema definition
Data Model #
citations-collector uses a LinkML schema to define its data model. The schema is aligned with established scholarly ontologies:
- CiTO (Citation Typing Ontology) – classifies citation relationships (cites, cites as authority, cites as data source, etc.)
- FaBiO (FRBR-aligned Bibliographic Ontology) – describes bibliographic entities (journal article, conference paper, dataset, etc.)
This means citation records are not just flat metadata – they carry semantic information about why one paper cites another and what kind of scholarly entity each record represents.
Key Features #
Multi-Source Citation Discovery #
citations-collector queries multiple scholarly databases and merges the results:
| Source | What It Provides |
|---|---|
| CrossRef | DOI metadata, reference lists, funding information |
| OpenCitations | Open citation graph data (who cites whom) |
| DataCite | Dataset citations and data-paper linkages |
| OpenAlex | Aggregated scholarly metadata from multiple sources |
By querying all four sources, citations-collector builds a more complete picture than any single database provides. CrossRef may have the DOI metadata but miss the citation graph; OpenCitations has the graph but may lack recent papers; DataCite captures dataset-to-paper links that others miss; OpenAlex fills gaps from its broad aggregation.
Zotero Synchronization #
citations-collector integrates with Zotero for reference management:
- Import from Zotero: pull an existing Zotero library or collection as seed publications
- Export to Zotero: push discovered citations back to Zotero for use in writing workflows
- Sync: keep the DataLad dataset and Zotero library in sync as new citations are discovered
This bridges the gap between the citation discovery pipeline and the day-to-day reference management that researchers actually use when writing papers.
PDF Acquisition with Provenance #
For each discovered citation, citations-collector attempts to acquire the full-text PDF through legal open-access channels:
- Unpaywall – checks for open-access versions via the Unpaywall API
- Publisher OA repositories – direct links from CrossRef metadata
- Preprint servers – arXiv, bioRxiv, medRxiv, SSRN
- Institutional repositories – where available
Each PDF is stored in git-annex with provenance metadata:
# git-annex records where the PDF was obtained
git annex whereis pdfs/10.1234-paper-a.pdf
# => web: https://arxiv.org/pdf/2026.12345
# => web: https://doi.org/10.1234/paper-a (via Unpaywall)
This means the acquisition source is permanently recorded. If a PDF needs to be re-downloaded (e.g., after a storage failure), git-annex knows where it came from.
Citation Graph Analysis #
Beyond individual citation records, citations-collector builds a citation graph that captures relationships between papers:
{
"edges": [
{
"source": "10.1234/paper-a",
"target": "10.5678/paper-b",
"type": "cites",
"cito_type": "cites_as_data_source"
}
]
}
This graph enables analyses like:
- Which of our papers has the most downstream citations?
- What are the key “bridge” papers connecting two research areas?
- Which datasets are most frequently cited by papers in our field?
DataLad Integration #
citations-collector is DataLad-native:
- Creates proper DataLad datasets for new citation collections
- Uses
datalad saveto commit changes with meaningful messages - Supports incremental updates – re-running discovery only fetches new citations
- Tracks provenance through DataLad run records
Usage #
Initialize a Citation Collection #
# Create a new citation dataset
citations-collector init my-lab-citations
# Add seed publications (your lab's papers)
citations-collector add-seed --doi 10.1234/our-paper-1
citations-collector add-seed --doi 10.5678/our-paper-2
# Or import seeds from a Zotero collection
citations-collector add-seed --zotero-collection "Lab Papers"
Discover Citations #
# Discover all citing and cited works
citations-collector discover
# This queries CrossRef, OpenCitations, DataCite, and OpenAlex
# and stores results in citations/discovered/
Acquire PDFs #
# Attempt to acquire PDFs for all discovered citations
citations-collector acquire-pdfs
# PDFs are stored in git-annex with provenance URLs
Sync with Zotero #
# Push discovered citations to a Zotero collection
citations-collector sync-zotero --collection "Discovered Citations"
Update #
# Re-run to pick up new citations
citations-collector discover --update
# Only new citations since the last run are fetched
AI Readiness #
Level: ai-ready.
citations-collector produces highly structured, AI-consumable output at every level:
| Component | Format | AI Use Case |
|---|---|---|
| Citation records | YAML with LinkML schema | Metadata extraction, summarization |
| Citation graph | JSON | Network analysis, relationship discovery |
| PDF full text | PDF (many with text layers) | RAG, literature review, question answering |
| Zotero export | JSON/BibTeX | Bibliography generation, duplicate detection |
| LinkML schema | YAML | Schema-aware querying, validation |
The structured metadata and citation graph are immediately usable by LLMs for tasks like:
- “Summarize the key themes across all papers citing our dataset”
- “Identify the most influential papers in this citation network”
- “Generate a literature review section covering these 50 citations”
- “Find papers that cite both our method paper and our competitor’s”
The LinkML schema provides type information that AI systems can use for schema-aware processing, reducing hallucination and improving extraction accuracy.
Limitations and Caveats #
- Alpha status: The tool is functional but the CLI interface, schema, and output format are still evolving.
- API rate limits: CrossRef, OpenCitations, DataCite, and OpenAlex all have rate limits. Large discovery runs need to respect these.
- PDF availability: Not all papers have legally accessible PDFs. The acquisition pipeline only uses open-access sources.
- Zotero API: Zotero’s API has its own rate limits and authentication requirements.
- Citation completeness: No single source has complete citation data. The multi-source approach improves coverage but gaps remain, especially for very recent publications.
See Also #
- Zotero – reference management integration
- Conservation to External Resources – publishing citation datasets
- Ingestion Patterns – API extraction pattern used by citations-collector