Audience: Omics Data Scientists, Bioinformaticians, and Research Teams
Theme: From Download Manifests to Verified FASTQ Assets
Once studies have been identified, accession systems understood, and metadata evaluated, the next step is acquiring the sequencing data.
Data download transforms study selection and metadata assets into locally accessible sequencing files that can be validated, organized, transferred, and assembled into downstream reference datasets. A reproducible acquisition workflow requires more than simply downloading files. Researchers must understand where files reside, how they are accessed, how downloads can be resumed, and how file integrity can be verified.
Learning Objectives
By the end of this chapter, you will be able to:
Select appropriate repositories for sequencing data download.
Reuse the CDI data acquisition Conda environment.
Install and verify SRA Toolkit commands.
Build download manifests from metadata assets.
Retrieve sequencing files through ENA and NCBI workflows.
Organize downloaded files into a reproducible project structure.
Verify file presence, file size, checksums, and sample counts.
Generate FASTQ inventories for downstream validation.
Why a Data Download System Matters
Metadata acquisition tells us what data exist.
The Data Download System retrieves the actual sequencing files required for downstream analysis.
flowchart TD
    A[Metadata Assets] --> B[Download Manifest]
    B --> C[Repository Selection]
    C --> D[Download Execution]
    D --> E[Integrity Verification]
    E --> F[FASTQ Inventory]
    F --> G[Data Validation System]
The Data Download System begins with metadata assets generated in Chapter 04 and ends with a documented inventory of downloaded files.
Environment Reuse
The CDI Data Acquisition System uses a single Conda environment throughout the workflow.
conda activate cdi-data-acquisition
This avoids unnecessary environment switching and ensures that metadata acquisition, data download, validation, and dataset assembly operate within a consistent software environment.
CDI Data Acquisition Environment
The environment used throughout this book is defined as:
Successful execution confirms that the shared CDI data acquisition environment is ready for metadata acquisition, data download, and data validation workflows.
Common Sources of Downloadable Data
Public repositories provide access to a variety of downloadable file types.
| Repository | Typical Data | Common Role in This System |
| --- | --- | --- |
| SRA | Raw sequencing reads | Run accession source and SRA Toolkit download |
| ENA | Raw sequencing reads and FASTQ URLs | Direct FASTQ retrieval and checksum validation |
| GEO | Processed and supplementary data | Study-level context and supplementary files |
| MGnify | Processed microbiome outputs | Processed microbiome reference outputs |
| GWAS Catalog | Summary statistics and study information | Association study discovery and summary statistics |
For large public sequencing datasets, the two most important download sources are usually ENA and NCBI SRA.
Common Download Formats
Researchers frequently encounter:
FASTQ
FASTA
BAM
CRAM
TSV
CSV
Metadata spreadsheets
For raw sequencing acquisition, the most common target output is FASTQ. In microbiome, RNA-seq, metagenomics, and many other NGS workflows, FASTQ files serve as the starting point for downstream quality control and analysis.
Metadata Assets as Inputs
The Data Download System consumes metadata assets generated during metadata acquisition.
The manifest becomes the system contract: it defines what should be downloaded, from where, and how completeness will later be checked.
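As a minimal sketch of how metadata could be translated into such a manifest, the function below selects four columns by header name from a tab-separated ENA metadata file. The column names (run_accession, fastq_ftp, fastq_md5, fastq_bytes) are the ones this chapter lists for ENA metadata. The function name build_manifest and the four-column layout are assumptions made for illustration and may differ from what 05a-build-download-manifest.sh produces.

```bash
# build_manifest METADATA_TSV
# Print a four-column manifest (header included) to stdout.
# Columns are located by name, so their order in the input does not matter.
build_manifest() {
    awk -F '\t' -v OFS='\t' \
        -v want='run_accession,fastq_ftp,fastq_md5,fastq_bytes' '
        NR == 1 {
            n = split(want, names, ",")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Emit the wanted columns in a fixed order for every line,
            # including the header line itself.
            line = $(pos[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[names[j]])
            print line
        }
    ' "$1"
}

# Hypothetical usage (the output path is an assumption):
#   build_manifest data/metadata/ena-prjna477349.tsv > data/manifests/download-manifest.tsv
```

Selecting columns by name rather than by position keeps the manifest stable if the repository report adds or reorders fields, and the early exit makes a missing column visible before any download starts.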
Download Strategy
The CDI Data Acquisition System uses a staged download strategy that validates repository workflows before initiating large-scale data retrieval.
Build Manifest
↓
Create Test Manifest
↓
Validate Repository Workflows
├── ENA
└── NCBI/SRA
↓
Production Download
This approach reduces risk by ensuring that:
Metadata have been translated correctly into download manifests.
Repository access is functioning as expected.
Download scripts operate correctly.
Retrieved files can be validated before large-scale acquisition.
Bandwidth and storage resources are used efficiently.
The test manifest generated from the full download manifest serves as a representative subset for repository validation.
download-manifest.tsv
↓
test-manifest.tsv
↓
Repository Validation
↓
Production Download
Successful test downloads provide confidence that the acquisition workflow is functioning correctly before scaling to the complete dataset.
Only after successful repository validation should production downloads be initiated.
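One simple way to derive a test manifest is to keep the header and the first few data rows of the full manifest. The sketch below assumes a tab-separated manifest with a single header line; the function name make_test_manifest and the default of three rows are illustrative choices, not a description of 05a-build-download-manifest.sh.

```bash
# make_test_manifest FULL_MANIFEST [N]
# Print the header line plus the first N data rows (default 3).
make_test_manifest() {
    head -n "$(( ${2:-3} + 1 ))" "$1"
}

# Hypothetical usage (paths are assumptions):
#   make_test_manifest data/manifests/download-manifest.tsv 3 > data/manifests/test-manifest.tsv
```

Taking the first rows is deterministic and easy to audit, but it is only representative if the manifest is not sorted in a way that groups unusual runs at the top. A random sample is a reasonable alternative when that is a concern.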
Download Tools
Several tools are commonly used to retrieve public omics datasets.
SRA Toolkit
The SRA Toolkit is commonly used for retrieving sequencing data from the NCBI Sequence Read Archive.
| Tool | Purpose |
| --- | --- |
| prefetch | Download SRA files |
| fasterq-dump | Convert SRA files to FASTQ |
| vdb-validate | Validate downloaded SRA files |
FTP and HTTPS Utilities
ENA and other repositories often expose direct file links that can be retrieved with common command-line tools.
| Tool | Purpose |
| --- | --- |
| wget | Download files from HTTP, HTTPS, or FTP links |
| curl | Retrieve files and web resources |
| md5sum | Verify MD5 checksums on Linux |
| md5 | Verify MD5 checksums on macOS |
Cloud-Based Access
Some repositories also provide access through cloud infrastructure and object storage systems. Cloud transfer workflows are introduced later in the book, after local download and validation concepts are established.
Installing SRA Toolkit
Activate the shared CDI data acquisition environment:
If the commands return version information, the toolkit is available in the active environment.
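A minimal sketch of such an availability check is shown below. It assumes the three SRA Toolkit commands named earlier in this chapter (prefetch, fasterq-dump, vdb-validate) are the ones of interest; the helper name check_tools is hypothetical.

```bash
# check_tools TOOL...
# Report each tool as found or missing on the PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical usage, after activating the environment:
#   check_tools prefetch fasterq-dump vdb-validate
#   prefetch --version
```

Checking with command -v first separates "not installed" from "installed but failing", which makes environment problems easier to diagnose than a bare version call.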
Checking Download Utilities
Many systems already include curl. Some may require wget installation.
curl --version
wget --version
If wget is missing, install it in the same environment:
conda install -c conda-forge wget
For checksum verification, Linux commonly uses:
md5sum --version
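On macOS, the equivalent tool is md5. It typically does not accept a --version flag, so confirm that it is on the PATH instead:
command -v md5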
The exact checksum command can differ by operating system, but the purpose is the same: confirm that downloaded files match the expected repository checksums.
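To keep later scripts independent of the operating system, the difference can be hidden behind a small wrapper. The sketch below is one possible approach; the function name md5_of is hypothetical.

```bash
# md5_of FILE
# Print only the hexadecimal MD5 digest of FILE.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{print $1}'   # GNU coreutils (Linux)
    elif command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                      # BSD/macOS: -q prints the digest only
    else
        echo "no MD5 tool found" >&2
        return 1
    fi
}
```

Because both branches print the bare digest, the result can be compared directly against a checksum string taken from repository metadata.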
Recommended Directory Structure
Downloaded files should be organized systematically.
The expected count becomes one of the first validation checks after download.
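The sketch below shows one way to create a project layout and derive the expected run count. The directory names are taken from paths that appear elsewhere in this chapter (data/metadata, data/manifests, data/raw/fastq, data/logs, scripts/bash); treating them as the complete layout is an assumption, as are the function names init_project and expected_runs.

```bash
# init_project ROOT
# Create the working directories referenced in this chapter under ROOT.
init_project() {
    mkdir -p "$1/data/metadata" "$1/data/manifests" \
             "$1/data/raw/fastq" "$1/data/logs" "$1/scripts/bash"
}

# expected_runs MANIFEST
# Count data rows: every non-empty line after the header.
expected_runs() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}
```

Note that the number of runs is not necessarily the number of FASTQ files: a paired-end run normally yields two files, so the expected file count depends on the library layout recorded in the metadata.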
System Scripts
The CDI Data Download System is implemented through a set of reusable Bash scripts.
scripts/bash/
├── 05a-build-download-manifest.sh
│   └── Create production and test manifests
│
├── 05b-download-ena-fastq.sh
│   └── Download FASTQ files from ENA
│
├── 05c-download-ncbi-sra.sh
│   └── Download and convert SRA accessions
│
├── 05d-verify-downloads.sh
│   └── Verify downloaded FASTQ assets
│
└── 05e-build-fastq-inventory.sh
    └── Generate FASTQ inventory reports
These scripts operate on the directory structure and manifest created in the previous sections and provide a reproducible implementation of the Data Download System.
Before executing the full workflow, it is good practice to validate the environment, directory structure, and download process using a single accession. The following section demonstrates this verification step before scaling to larger manifests and automated script execution.
Testing and Verifying the Download System
Before downloading an entire dataset, validate the workflow on a single accession from the download manifest.
Rather than immediately downloading hundreds of sequencing runs, this preliminary test confirms that repository access, directory structure, software installation, FASTQ conversion, and basic file inspection all work as expected.
Once the workflow has been verified, the same approach can be scaled to the complete download manifest through the CDI Data Download System scripts.
The following sections describe two common strategies for large-scale data acquisition: direct FASTQ retrieval from ENA and SRA-based retrieval using the NCBI SRA Toolkit. Rather than executing individual commands manually, these workflows are implemented through the reusable Bash scripts introduced above.
Repository Selection Strategy
The recommended strategy is to use NCBI for study discovery and accession resolution, then use ENA for direct FASTQ download when suitable links and checksums are available.
Study Discovery
↓
NCBI
↓
Metadata Acquisition
↓
ENA
↓
Download Planning
↓
Data Download
| Scenario | Preferred Repository |
| --- | --- |
| Metadata discovery | NCBI |
| FASTQ download with direct URLs | ENA |
| Missing FASTQ URLs | NCBI SRA Toolkit |
| Need SRA-native retrieval | NCBI SRA Toolkit |
| Controlled-access data | Repository-specific process |
| Supplementary processed files | GEO, ENA, project archive, or journal supplement |
This strategy avoids treating all repositories as interchangeable. Each repository contributes differently to the acquisition system.
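The two sequencing-data rows of this strategy can be expressed as a per-run rule: use ENA when a FASTQ URL is present, otherwise fall back to the SRA Toolkit. The sketch below assumes a tab-separated manifest whose first column is run_accession and whose second column is fastq_ftp; that column order and the function name choose_route are assumptions for illustration.

```bash
# choose_route MANIFEST
# Print "run_accession<TAB>route" for every data row.
# A run with a non-empty fastq_ftp value is routed to ENA, otherwise to SRA.
choose_route() {
    awk -F '\t' -v OFS='\t' 'NR > 1 && $1 != "" {
        print $1, ($2 != "" ? "ENA" : "SRA")
    }' "$1"
}
```

Controlled-access data and supplementary files are deliberately left out of this rule, because they follow repository-specific processes rather than a simple URL check.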
Download Workflow 1: ENA Direct FASTQ Download
ENA is often preferred when direct FASTQ links and MD5 checksums are available.
ENA Metadata
↓
FASTQ URLs
↓
wget or curl
↓
FASTQ Files
↓
Checksum Verification
The ENA metadata file may contain columns such as:
run_accession
fastq_ftp
fastq_md5
fastq_bytes
The exact column names should be inspected before writing download commands:
head -n 1 data/metadata/ena-prjna477349.tsv
The ENA download workflow is implemented by:
bash scripts/bash/05b-download-ena-fastq.sh
This script extracts FASTQ URLs from the ENA metadata file, writes them to:
data/manifests/ena-fastq-urls.txt
and downloads the files into:
data/raw/fastq/
Download logs are written to:
data/logs/download-ena.log
The script uses wget --continue, which allows interrupted downloads to resume when possible.
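The URL extraction and resumable download steps might look like the sketch below. It is not the content of 05b-download-ena-fastq.sh. It assumes a manifest whose second column is fastq_ftp, that paired-end runs list several paths separated by semicolons, and that the paths carry no scheme prefix (which is how ENA typically reports them), so a scheme is prepended. The function names and the default scheme are illustrative.

```bash
# ena_urls MANIFEST [SCHEME]
# Print one URL per line from the fastq_ftp column (assumed to be column 2).
ena_urls() {
    awk -F '\t' -v scheme="${2:-ftp://}" 'NR > 1 && $2 != "" {
        n = split($2, parts, ";")
        for (i = 1; i <= n; i++) if (parts[i] != "") print scheme parts[i]
    }' "$1"
}

# download_urls URL_FILE OUT_DIR LOG_FILE
# Fetch every URL listed in URL_FILE. --continue resumes partial files,
# and --append-output adds to the log instead of overwriting it.
download_urls() {
    wget --continue --input-file="$1" --directory-prefix="$2" --append-output="$3"
}

# Hypothetical usage with the paths named in this section
# (the manifest path is an assumption):
#   ena_urls data/manifests/download-manifest.tsv > data/manifests/ena-fastq-urls.txt
#   download_urls data/manifests/ena-fastq-urls.txt data/raw/fastq data/logs/download-ena.log
```

Writing the URL list to a file before downloading keeps a record of exactly what was requested, which supports later auditing.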
Download Workflow 2: NCBI SRA Retrieval
When direct FASTQ links are unavailable, the SRA Toolkit workflow can be used.
This test-first pattern prevents accidentally launching a full dataset download before the workflow has been validated.
For large datasets, compression can take time. Faster alternatives such as pigz can be used when parallel compression is required.
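A hedged sketch of an SRA Toolkit retrieval with a dry-run default is shown below. It is not the content of 05c-download-ncbi-sra.sh. The function names, the DRY_RUN, THREADS, and COMPRESSOR variables, and the assumption that prefetch stores each run under SRA_DIR/ACCESSION are all illustrative.

```bash
# run COMMAND...
# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fetch_sra ACCESSION SRA_DIR FASTQ_DIR
# Download, validate, convert to FASTQ, then compress the FASTQ files.
# Set COMPRESSOR=pigz to use parallel compression instead of gzip.
fetch_sra() {
    run prefetch "$1" --output-directory "$2" &&
    run vdb-validate "$2/$1" &&
    run fasterq-dump "$2/$1" --split-files --outdir "$3" --threads "${THREADS:-4}" &&
    run find "$3" -name "$1*.fastq" -exec "${COMPRESSOR:-gzip}" {} +
}

# Hypothetical usage:
#   fetch_sra SRR000001 data/raw/sra data/raw/fastq             # prints the plan
#   DRY_RUN=0 fetch_sra SRR000001 data/raw/sra data/raw/fastq   # executes it
```

Defaulting to a dry run means the commands must be reviewed once before anything is transferred, and chaining the steps with && stops the sequence as soon as one step fails.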
The previous single-accession test demonstrated that the download workflow is functioning correctly on a representative sequencing run. After scaling to the complete download manifest, additional checks should be performed to confirm file integrity and completeness before proceeding to downstream validation.
Common post-download checks include:
Expected file counts
Checksum validation
Metadata consistency
Sample completeness
The following section focuses on checksum verification, one of the most important safeguards against incomplete or corrupted downloads.
Checksum Verification
If ENA provides MD5 checksums, they should be used to verify that downloaded FASTQ files match the files distributed by the repository.
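One way to perform this comparison is sketched below. It assumes a tab-separated manifest with run_accession, fastq_ftp, and fastq_md5 in the first three columns, with semicolon-separated lists in the last two whose entries pair up in order. The function names are hypothetical, and 05d-verify-downloads.sh may work differently.

```bash
# file_md5 FILE: print the bare MD5 digest (md5sum on Linux, md5 on macOS).
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1
    else
        md5 -q "$1"
    fi
}

# verify_md5 MANIFEST FASTQ_DIR
# Print OK, MISMATCH, or MISSING for each expected file.
# Returns non-zero if any file is missing or does not match.
verify_md5() {
    pairs=$(mktemp "${TMPDIR:-/tmp}/md5pairs.XXXXXX") || return 1
    # Expand each row into "file name<TAB>expected digest" lines.
    awk -F '\t' 'NR > 1 && $2 != "" {
        n = split($2, url, ";"); split($3, sum, ";")
        for (i = 1; i <= n; i++) {
            m = split(url[i], seg, "/")
            print seg[m] "\t" sum[i]
        }
    }' "$1" > "$pairs"

    failed=0
    while IFS="$(printf '\t')" read -r file expected; do
        if [ ! -f "$2/$file" ]; then
            echo "MISSING  $file"; failed=1
        elif [ "$(file_md5 "$2/$file")" = "$expected" ]; then
            echo "OK       $file"
        else
            echo "MISMATCH $file"; failed=1
        fi
    done < "$pairs"
    rm -f "$pairs"
    return "$failed"
}
```

Reporting missing files separately from mismatches distinguishes a download that never started from one that was truncated or corrupted, which usually call for different follow-up actions.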
The inventory provides a concise summary of acquired sequencing assets and supports:
FASTQ file counting
Storage estimation
Download completeness assessment
Sample tracking
Dataset auditing
The expected number of FASTQ files, total storage requirements, and sample coverage can all be assessed from the inventory before formal validation begins.
The inventory becomes a key input to the Data Validation System introduced in Chapter 06.
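A minimal inventory can be built from file names and sizes alone. The sketch below is an illustration under stated assumptions, not the content of 05e-build-fastq-inventory.sh: the function name fastq_inventory, the two-column layout, and the restriction to files ending in .fastq or .fastq.gz are all choices made here. Read-level statistics, which the system validation attributes to SeqKit, are outside this sketch.

```bash
# fastq_inventory FASTQ_DIR
# Print a TSV with one row per FASTQ file: file name and size in bytes.
fastq_inventory() {
    printf 'file\tbytes\n'
    for f in "$1"/*.fastq "$1"/*.fastq.gz; do
        [ -f "$f" ] || continue          # skip glob patterns that matched nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(wc -c < "$f" | tr -d ' ')"
    done
}

# Hypothetical usage (the output path is an assumption):
#   fastq_inventory data/raw/fastq > data/manifests/fastq-inventory.tsv
```

Summing the bytes column gives a storage estimate, and counting the rows gives the file count to compare against the manifest.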
End-to-End Download Example
Suppose a healthy reference microbiome project identifies 850 eligible samples.
The workflow proceeds as follows:
850 Eligible Samples
↓
Metadata Acquisition
↓
Download Manifest
↓
Single-Accession Test
↓
Workflow Verification
↓
ENA or NCBI Download Workflow
↓
Download Verification
↓
FASTQ Inventory
↓
Data Validation System
At this stage, the objective is not analysis. The objective is to acquire a complete and reproducible sequencing dataset suitable for downstream validation and reference dataset assembly.
Common Challenges
Researchers frequently encounter:
Interrupted downloads
Missing files
Incomplete metadata
Repository-specific formats
Storage limitations
Slow transfer speeds
Inconsistent file naming
Paired-end files split across multiple URLs
Planning for these challenges improves acquisition reliability.
Reproducible Download Workflows
Every download process should be documented.
Important records include:
Repository source
Accessions used
Download date
Commands executed
File counts
Checksum results
Validation results
The goal is not only to download files, but to make the download process auditable and repeatable.
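Several of these records can be captured automatically at download time. The sketch below appends one tab-separated line per action; the function name log_download, the field order, and the log path in the usage comment are hypothetical.

```bash
# log_download LOG_FILE REPOSITORY ACCESSION COMMAND...
# Append one record: UTC timestamp, repository, accession, command line.
log_download() {
    log=$1; repo=$2; acc=$3
    shift 3
    printf '%s\t%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$repo" "$acc" "$*" >> "$log"
}

# Hypothetical usage:
#   log_download data/logs/provenance.tsv ENA SRR000001 wget --continue "$url"
```

Appending rather than overwriting preserves the history of retries, which is often what an audit needs to reconstruct.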
Summary
The Data Download System converts metadata assets into verified sequencing files.
Metadata Assets
↓
Download Manifest
↓
Single-Accession Test
↓
Workflow Verification
↓
ENA or NCBI Download Workflow
↓
Download Verification
↓
FASTQ Inventory
↓
Data Validation System
A reliable download system should answer five questions:
What should be downloaded?
Where should it be downloaded from?
Has the download workflow been validated?
How can downloaded files be verified?
What inventory records the final acquired files?
System Validation
The CDI Data Download System was validated using both ENA and NCBI acquisition workflows.
Validation included:
Manifest generation (133 accessions)
Test manifest generation (3 accessions)
ENA direct FASTQ download
NCBI SRA download and FASTQ conversion
FASTQ verification using SeqKit
FASTQ inventory generation
The validation workflow successfully generated sequencing files, verification reports, and inventory records from publicly available microbiome sequencing datasets.
Looking Ahead
After sequencing files have been downloaded, verified, and inventoried, the next challenge is determining whether the acquired dataset is complete, internally consistent, and suitable for downstream analysis.
In the next chapter, we implement the Data Validation System to evaluate file integrity, sample completeness, metadata consistency, and overall dataset readiness before reference dataset assembly and analysis.