rclone
rclone is a command-line program for managing files on cloud storage. It supports over 70 cloud providers – from the major platforms (Google Drive, Dropbox, OneDrive, S3, Azure Blob) to specialized services (Backblaze B2, Wasabi, MEGA, pCloud, Storj) and generic protocols (SFTP, FTP, WebDAV, HTTP).
In the con/serve architecture, rclone is the universal adapter between the git-annex vault and the cloud storage ecosystem. It plays a dual role that no other single tool covers:
- Ingestion: pull files from any cloud provider into a local directory for git-annex import
- Distribution: act as a git-annex special remote so archived content can be pushed back to cloud storage for backup, sharing, or compliance
This bidirectionality makes rclone the most versatile tool in the cloud storage section.
Why rclone? #
Research groups accumulate files across dozens of cloud services – Google Drive folders shared by collaborators, Dropbox directories synced from instruments, S3 buckets holding processed datasets, OneDrive directories mandated by institutional IT. Each service has its own API, authentication flow, and sync semantics.
rclone provides a single, consistent interface to all of them. Learn one tool, access everything.
Supported Providers #
rclone supports over 70 providers. Key ones for research workflows:
| Provider | Use Case |
|---|---|
| Google Drive | Institutional G Suite accounts, shared drives |
| Dropbox | Collaborator file sharing |
| Amazon S3 | Archival storage, compute data staging |
| Backblaze B2 | Low-cost archival storage |
| Microsoft OneDrive | Institutional Office 365 accounts |
| Box | Enterprise file sharing (common in universities) |
| SFTP | Any server with SSH access |
| WebDAV | OwnCloud, Nextcloud, and other WebDAV servers |
| Azure Blob | Microsoft cloud research credits |
| Google Cloud Storage | GCP-based research infrastructure |
| Wasabi | Hot cloud storage, S3-compatible, no egress fees |
Core Operations #
Configuration #
# Interactive configuration wizard
rclone config
# Result: named remotes stored in ~/.config/rclone/rclone.conf
# Example: a remote named "gdrive" pointing to Google Drive
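The configuration file itself is plain INI-style text, which makes it easy to inspect or back up. A minimal illustration of what a Google Drive entry might look like (the token value is a placeholder issued during the OAuth flow):
# Illustrative contents of ~/.config/rclone/rclone.conf
[gdrive]
type = drive
scope = drive
token = {"access_token":"...","refresh_token":"...","expiry":"..."}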
Sync and Copy #
# Copy files from Google Drive to local directory
rclone copy gdrive:shared-project/ ./local-copy/
# Sync (make destination match source, deleting extras)
rclone sync gdrive:shared-project/ ./local-copy/
# Copy with progress and stats
rclone copy gdrive:shared-project/ ./local-copy/ --progress
# Copy only files modified within the last 30 days
rclone copy gdrive:shared-project/ ./local-copy/ --max-age 30d
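Because sync deletes files on the destination that are absent from the source, it is worth previewing destructive runs first. A short sketch using rclone's standard preview flags:
# Preview what sync would change without modifying anything
rclone sync gdrive:shared-project/ ./local-copy/ --dry-run
# Or confirm each operation interactively
rclone sync gdrive:shared-project/ ./local-copy/ --interactive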
Mount as Filesystem #
# Mount Google Drive as a local filesystem (FUSE)
rclone mount gdrive:shared-project/ /mnt/gdrive --daemon
# Now access cloud files as if they were local
ls /mnt/gdrive/
cp /mnt/gdrive/important-file.csv ./
# Unmount when done
fusermount -u /mnt/gdrive
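Some applications expect full read/write filesystem semantics that a plain mount does not provide; rclone's VFS cache modes help with this. One commonly used variant:
# Enable write caching for better compatibility with applications that rewrite files in place
rclone mount gdrive:shared-project/ /mnt/gdrive --daemon --vfs-cache-mode writes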
List and Check #
# List files on a remote
rclone ls gdrive:shared-project/
# Check for differences between local and remote
rclone check ./local-copy/ gdrive:shared-project/
# Show storage usage
rclone about gdrive:
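Before ingesting a large remote directory it also helps to know how much data is involved; rclone can report totals without transferring anything:
# Report total object count and size for a remote path
rclone size gdrive:shared-project/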
git-annex Integration #
rclone’s most important role in the con/serve stack is as a git-annex special remote via git-annex-remote-rclone. This lets git-annex use any rclone-supported provider as a storage backend.
Setup #
# Install the bridge
pip install git-annex-remote-rclone
# Or from source
git clone https://github.com/DanielDent/git-annex-remote-rclone.git
cp git-annex-remote-rclone/git-annex-remote-rclone ~/bin/
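git-annex discovers external special remotes by searching PATH for an executable named git-annex-remote-&lt;externaltype&gt;, so the bridge script must be executable and on PATH. A quick check, assuming ~/bin is on your PATH:
chmod +x ~/bin/git-annex-remote-rclone
which git-annex-remote-rclone   # should print the path to the script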
Configure a Special Remote #
# Initialize an rclone-backed special remote
git annex initremote gdrive-backup type=external \
    externaltype=rclone \
    target=gdrive \
    prefix=annex-backup \
    encryption=shared \
    chunk=50MiB
# Copy content to the remote
git annex copy --to gdrive-backup .
# Verify
git annex whereis .
Ingestion Workflow #
Pull files from cloud storage into a git-annex repository:
# 1. Copy from cloud into a staging directory inside the dataset
cd my-dataset
rclone copy gdrive:lab-data/experiment-2026/ ./staging/
# 2. Add to git-annex and record the change
git annex add staging/
datalad save -m "Import experiment data from Google Drive"
# Or, if the remote is also configured as a git-annex special remote with
# importtree=yes support, import a tree directly (remote name illustrative)
git annex import main --from gdrive-import
Distribution Workflow #
Push archived content to cloud storage for backup:
# Push all content to Google Drive backup
git annex copy --to gdrive-backup .
# Push specific files
git annex copy --to gdrive-backup data/large-files/
# Set a preferred content expression so the remote wants all data files
git annex wanted gdrive-backup "include=data/*"
# Auto-distribute based on preferred content
git annex sync --content
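After distribution it is useful to confirm which remotes actually hold which files. git-annex has built-in queries for this; the remote name below matches the example above:
# Show a per-remote presence matrix for annexed files
git annex list
# List content not yet present on the backup remote
git annex find --not --in gdrive-backup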
Complete Example: Lab File Consolidation #
A research group has files scattered across Google Drive, Dropbox, and an institutional S3 bucket. Here is how to consolidate them into a single DataLad dataset:
# Create the dataset
datalad create lab-archive
cd lab-archive
# Pull from Google Drive
rclone copy gdrive:shared-lab-data/ ./google-drive-import/
git annex add google-drive-import/
datalad save -m "Import shared lab data from Google Drive"
# Pull from Dropbox
rclone copy dropbox:instrument-outputs/ ./dropbox-import/
git annex add dropbox-import/
datalad save -m "Import instrument outputs from Dropbox"
# Pull from S3
rclone copy s3:research-bucket/processed/ ./s3-import/
git annex add s3-import/
datalad save -m "Import processed data from S3"
# Set up backup remotes
git annex initremote wasabi type=external externaltype=rclone \
target=wasabi prefix=lab-archive encryption=shared
git annex copy --to wasabi .
# Now everything is in one versioned, deduplicated, backed-up dataset
git annex whereis .
AI Readiness #
Level: ai-partial.
rclone itself is a transport tool – it does not transform content. Its AI readiness depends entirely on what is being transferred:
- Structured data (JSON, CSV, markdown) synced from cloud storage is immediately AI-consumable
- Binary content (images, videos, proprietary formats) requires domain-specific processing
- rclone’s own configuration and output (JSON output flags, rclone lsjson) is structured and AI-friendly
rclone’s lsjson command is particularly useful for AI-assisted workflows:
# Get structured file listing from any provider
rclone lsjson gdrive:shared-project/ --recursive
This produces JSON that can be processed by LLMs to understand directory structures, identify file types, or plan ingestion strategies.
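As a sketch of how that output can be consumed downstream: lsjson emits a JSON array of entries (fields include Path, Name, Size, MimeType, ModTime, and IsDir), so a small jq filter produces a compact listing suitable for inspection or for pasting into an LLM prompt:
# Summarize entries as "path<TAB>size" for quick inspection or prompting
rclone lsjson gdrive:shared-project/ --recursive | jq -r '.[] | "\(.Path)\t\(.Size)"'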
Comparison with Direct Provider Tools #
| Feature | rclone | Provider-specific tools |
|---|---|---|
| Provider coverage | 70+ | One each |
| Learning curve | Learn once | Learn each separately |
| git-annex integration | Via git-annex-remote-rclone | Manual or none |
| FUSE mount | Yes | Varies |
| Unified CLI | Yes | Different tools, flags, behaviors |
| Bandwidth control | Built-in (--bwlimit) | Varies |
| Encryption | Client-side (crypt remote) | Provider-dependent |
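The bandwidth-control row in the table deserves a concrete illustration: --bwlimit accepts a timetable, so long-running syncs can yield bandwidth during work hours. The times, rates, and remote path below are arbitrary examples:
# Cap transfers at 1 MiB/s between 08:00 and 19:00, unlimited otherwise
rclone copy ./results/ gdrive:backup/ --bwlimit "08:00,1M 19:00,off"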
Limitations and Caveats #
- Authentication complexity: Each provider has its own auth flow (OAuth2, API keys, service accounts). Initial setup requires navigating provider-specific credential management.
- Rate limits: Cloud providers impose API rate limits. Large sync operations may need --tpslimit and --transfers tuned per provider (see the example after this list).
- Eventual consistency: Some providers (especially S3-compatible ones) offer only eventual consistency guarantees. Operations immediately after writes may not see the latest state.
- FUSE mount performance: rclone mount works, but it is not as fast as local filesystem access. It is suitable for browsing and occasional file access, not for heavy random I/O.
- Provider-specific quirks: Google Drive’s file versioning, Dropbox’s case insensitivity, OneDrive’s character restrictions – rclone handles most of these, but edge cases exist.
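A hedged example of per-provider tuning for a large Google Drive pull (the values are starting points, not recommendations):
# Limit API calls to ~8/s and reduce parallelism to stay under Drive quotas
rclone copy gdrive:shared-lab-data/ ./google-drive-import/ --tpslimit 8 --transfers 4 --checkers 8 --progress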
See Also #
- Conservation to External Resources – outbound distribution patterns using rclone
- Ingestion Patterns – inbound ingestion patterns including mount-and-copy