Downloading files is only one part of the data acquisition process. Before datasets can be used for analysis or reference dataset construction, researchers must verify that the downloaded files are complete, consistent, and aligned with project expectations.
Data validation provides confidence that the acquired dataset accurately represents the intended studies and samples.
Even when data originate from reputable public repositories, validation remains essential.
Potential issues include:
Identifying these issues early prevents downstream problems and improves reproducibility.
flowchart TD
    A[Data Download] --> B[Data Validation]
    B --> C[Quality Assessment]
    C --> D[Dataset Assembly]
Validation acts as a checkpoint between data acquisition and dataset construction.
Before initiating production downloads, the CDI Data Acquisition System validates repository workflows using a representative test manifest.
Test Manifest
↓
SRR7450741
SRR7450738
SRR7450759
↓
├── ENA Retrieval
│       ↓
│   6 FASTQ Files
│
└── NCBI Retrieval
        ↓
    6 FASTQ Files
↓
Repository Validation
↓
Production Readiness
The objective of repository validation is to confirm that:
find data/raw/ena -name "*.fastq.gz" | wc -l
Output:
6
find data/raw/ncbi -name "*.fastq.gz" | wc -l
Output:
6
Validation summary:
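| Repository | Expected FASTQ Files | Downloaded FASTQ Files | Status |
|------------|----------------------|------------------------|--------|
| ENA        | 6                    | 6                      | PASS   |
| NCBI/SRA   | 6                    | 6                      | PASS   |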
Overall result:
Repository Validation Status
ENA: PASS
NCBI/SRA: PASS
READY FOR PRODUCTION DOWNLOAD
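The per-repository comparison above can be scripted so that the PASS or FAIL decision does not depend on reading counts by eye. The following is a minimal sketch, not the CDI implementation: `check_fastq_count` is a hypothetical helper name, and the expected count is passed in by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: check_fastq_count is a hypothetical helper, not part of the CDI scripts.
# Prints one tab-separated row: label, expected count, observed count, status.
check_fastq_count() {
    local label="$1" dir="$2" expected="$3"
    local observed status
    # Count regular files ending in .fastq.gz; tr removes padding that some wc builds add.
    observed=$(find "$dir" -type f -name "*.fastq.gz" | wc -l | tr -d '[:space:]')
    if [ "$observed" -eq "$expected" ]; then
        status="PASS"
    else
        status="FAIL"
    fi
    printf '%s\t%s\t%s\t%s\n' "$label" "$expected" "$observed" "$status"
    # Return non-zero on FAIL so callers can stop before production downloads.
    [ "$status" = "PASS" ]
}
```

For the test manifest, a call such as `check_fastq_count ENA data/raw/ena 6 && check_fastq_count NCBI/SRA data/raw/ncbi 6` would succeed only when both repositories returned all six files.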
The CDI Data Acquisition System uses a multi-layer validation strategy.
flowchart TD
    A[Repository Validation] --> B[File Count Validation]
    B --> C[Integrity Validation]
    C --> D[Metadata Validation]
    D --> E[FASTQ Structure Validation]
    E --> F[Validation Report]
Each validation layer addresses a different source of acquisition risk.
data/
├── metadata/
├── manifests/
├── logs/
├── raw/
│   ├── ena/
│   ├── ncbi/
│   ├── sra/
│   └── fastq/
└── validation/
    ├── repository-validation.tsv
    ├── file-validation.tsv
    ├── metadata-validation.tsv
    └── validation-report.tsv
The validation directory stores outputs generated during the validation process.
File-level validation focuses on verifying the integrity of downloaded files.
Common checks include:
These checks help identify incomplete or damaged downloads.
Repository validation example:
find data/raw/ena -name "*.fastq.gz" | wc -l
find data/raw/ncbi -name "*.fastq.gz" | wc -l
Expected result:
ENA: 6
NCBI/SRA: 6
Production validation example:
find data/raw/fastq \
  -name "*.fastq.gz" \
  | wc -l
Compare against:
wc -l data/manifests/ena-fastq-urls.txt
Expected outcome:
Manifest Files = Downloaded Files
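The two counts can also be compared in one step. The sketch below assumes the manifest lists one URL per line; `compare_manifest_to_downloads` is a hypothetical name. It counts non-empty lines with `grep -c .` rather than `wc -l`, because `wc -l` counts newline characters and is off by one when a manifest has a trailing blank line or lacks a final newline.

```shell
#!/usr/bin/env bash
# Sketch only: compare_manifest_to_downloads is a hypothetical helper.
# Assumes the manifest holds one download URL per line.
compare_manifest_to_downloads() {
    local manifest="$1" dir="$2"
    local expected observed
    # Count non-empty manifest lines.
    expected=$(grep -c . "$manifest")
    # Count downloaded FASTQ files.
    observed=$(find "$dir" -type f -name "*.fastq.gz" | wc -l | tr -d '[:space:]')
    printf 'manifest=%s downloaded=%s\n' "$expected" "$observed"
    # Succeed only when the counts agree.
    [ "$expected" -eq "$observed" ]
}
```

A call such as `compare_manifest_to_downloads data/manifests/ena-fastq-urls.txt data/raw/fastq` returns a non-zero status when the counts differ, which makes it usable as a gate in a larger script.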
Many repositories provide checksums that can be used to verify file integrity.
Common checksum methods include:
Checksum validation helps confirm that downloaded files match repository records.
Example:
md5sum sample.fastq.gz
Most sequencing files are distributed as compressed archives.
Before analysis, compressed files should be tested for corruption.
gzip -tv data/raw/fastq/*.gz
Expected output:
OK
Files failing this check should be re-downloaded.
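The checksum and compression checks can be combined into a short routine that names the files needing a fresh download. This is a sketch under stated assumptions: both function names are hypothetical, and the checksum list is assumed to be in the format that GNU `md5sum` itself writes (`<digest>  <path>`), with paths resolved relative to the current directory.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of the CDI scripts.

# Print the path of every .gz file directly inside a directory that fails `gzip -t`.
list_corrupt_gzip() {
    local dir="$1" f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
    done
}

# Verify files against a checksum list in md5sum format.
# --quiet suppresses the per-file OK lines; the exit status is non-zero
# if any listed file is missing or does not match its recorded digest.
verify_md5_list() {
    local checksum_file="$1"
    md5sum -c --quiet "$checksum_file"
}
```

Running `list_corrupt_gzip data/raw/fastq` prints nothing when every file decompresses cleanly, so an empty result corresponds to the all-OK case.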
Researchers should verify that expected sample counts match downloaded data.
Example:
Expected Samples: 850
Downloaded Samples: 850
Status: PASS
Discrepancies should be investigated before proceeding.
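A sample count can be derived from the downloaded file names instead of being tallied by hand. The sketch below assumes files are named `<accession>.fastq.gz` or `<accession>_1.fastq.gz` / `<accession>_2.fastq.gz`, and that each accession corresponds to one sample; `check_sample_count` is a hypothetical name. Projects in which one sample spans several runs would need a mapping from metadata instead.

```shell
#!/usr/bin/env bash
# Sketch only: check_sample_count is a hypothetical helper.
# Assumes one accession per sample and the _1/_2 paired-end naming convention.
check_sample_count() {
    local dir="$1" expected="$2" observed
    # Strip the directory and the optional _1/_2 suffix, then count unique accessions.
    observed=$(find "$dir" -type f -name "*.fastq.gz" |
        sed -E 's#.*/##; s/(_[12])?\.fastq\.gz$//' |
        sort -u | wc -l | tr -d '[:space:]')
    printf 'Expected Samples: %s\n' "$expected"
    printf 'Downloaded Samples: %s\n' "$observed"
    if [ "$observed" -eq "$expected" ]; then
        printf 'Status: PASS\n'
    else
        printf 'Status: FAIL\n'
        return 1
    fi
}
```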
Validation should also compare downloaded files against metadata records.
Questions to consider:
Consistency between metadata and sequencing data is critical.
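One way to check this consistency is to compare the set of accessions in the metadata with the set found on disk. The sketch below assumes a tab-separated metadata file with a header row and the run accession in the first column; that layout and the name `compare_metadata_to_files` are illustrative assumptions, not the project's actual schema.

```shell
#!/usr/bin/env bash
# Sketch only: compare_metadata_to_files is a hypothetical helper.
# Assumes a TSV with a header row and the run accession in column 1.
# Output: accessions found only in the metadata start at column 1;
#         accessions found only on disk are indented by one tab.
compare_metadata_to_files() {
    local metadata_tsv="$1" dir="$2"
    local meta_ids file_ids
    meta_ids=$(mktemp)
    file_ids=$(mktemp)
    # Skip the header row and collect unique accessions from the metadata.
    tail -n +2 "$metadata_tsv" | cut -f1 | sort -u > "$meta_ids"
    # Derive accessions from file names by dropping the optional _1/_2 suffix.
    find "$dir" -type f -name "*.fastq.gz" |
        sed -E 's#.*/##; s/(_[12])?\.fastq\.gz$//' | sort -u > "$file_ids"
    # comm -3 suppresses the lines that appear in both sorted inputs.
    comm -3 "$meta_ids" "$file_ids"
    rm -f "$meta_ids" "$file_ids"
}
```

Empty output means every metadata record has a downloaded file and every file has a metadata record.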
Duplicate samples may occur when:
Duplicate detection helps prevent inflated sample counts and biased analyses.
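Two simple forms of duplicate detection are sketched below: repeated accessions in a manifest, and files with identical content. Both function names are hypothetical. The content check relies on GNU `md5sum` and assumes file paths contain no whitespace; it reports byte-identical files only, so the same sample submitted under two accessions with different compression would not be caught.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of the CDI scripts.

# Print accessions listed more than once in a manifest (one accession per line).
find_duplicate_accessions() {
    sort "$1" | uniq -d
}

# Print one line per group of byte-identical FASTQ files: the MD5 digest
# followed by the paths that share it. Assumes paths contain no whitespace.
find_duplicate_files() {
    local dir="$1"
    find "$dir" -type f -name "*.fastq.gz" -exec md5sum {} + |
        sort |
        awk '{ count[$1]++; paths[$1] = paths[$1] " " $2 }
             END { for (d in count) if (count[d] > 1) print d paths[d] }'
}
```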
For sequencing projects, validating FASTQ structure is particularly important.
Each FASTQ record should contain four lines:
@SEQ_ID
SEQUENCE
+
QUALITY
Quick inspection:
zcat sample.fastq.gz | head
This confirms that the file can be decompressed and follows the expected FASTQ format.
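Visual inspection covers only the first few lines. A scripted check can apply the four-line rule to many records. The sketch below is one possible approach, with `validate_fastq_structure` as a hypothetical name: it checks that each header starts with `@`, each separator starts with `+`, and each quality string has the same length as its sequence. It inspects the first 1000 records by default and does not validate base or quality characters.

```shell
#!/usr/bin/env bash
# Sketch only: validate_fastq_structure is a hypothetical helper.
# Usage: validate_fastq_structure FILE [MAX_RECORDS]
# Returns 0 when the inspected records are well formed, 1 otherwise.
validate_fastq_structure() {
    local file="$1" max_records="${2:-1000}"
    # gzip -dc is used instead of zcat because zcat behaves differently on some systems.
    gzip -dc "$file" | head -n "$((max_records * 4))" | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header does not start with @"; bad = 1; exit }
        NR % 4 == 2 { seq_len = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator does not start with +"; bad = 1; exit }
        NR % 4 == 0 && length($0) != seq_len { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
        END {
            # A line count that is zero or not a multiple of four means a missing or truncated record.
            if (!bad && (NR == 0 || NR % 4 != 0)) { print "missing or incomplete record"; bad = 1 }
            exit bad + 0
        }'
}
```

The exit status comes from `awk`, the last command in the pipeline, so `validate_fastq_structure sample.fastq.gz && echo PASS` prints PASS only for a well-formed file.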
Many sequencing studies generate paired-end reads.
Each sample should contain:
sample_1.fastq.gz
sample_2.fastq.gz
Verify that forward and reverse files exist for every sample.
ls *_1.fastq.gz | wc -l
ls *_2.fastq.gz | wc -l
Expected outcome:
Forward Reads = Reverse Reads
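Equal counts do not guarantee that every sample is paired: one sample missing its reverse file and another missing its forward file still produce matching totals. A per-sample check, sketched below with the hypothetical name `find_unpaired_reads`, lists each file whose mate is absent. It assumes the `_1` / `_2` naming shown above.

```shell
#!/usr/bin/env bash
# Sketch only: find_unpaired_reads is a hypothetical helper.
# Prints every *_1.fastq.gz or *_2.fastq.gz file in a directory whose mate is missing.
find_unpaired_reads() {
    local dir="$1" f mate
    for f in "$dir"/*_1.fastq.gz "$dir"/*_2.fastq.gz; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        case "$f" in
            *_1.fastq.gz) mate="${f%_1.fastq.gz}_2.fastq.gz" ;;
            *)            mate="${f%_2.fastq.gz}_1.fastq.gz" ;;
        esac
        # Report the file when its expected mate does not exist.
        [ -e "$mate" ] || printf '%s\n' "$f"
    done
}
```

Empty output indicates that every forward file has a reverse file and vice versa.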
Different repositories may provide data in various formats.
Common examples include:
Researchers should verify that files are in the expected format and can be successfully opened or parsed.
To support reproducibility, validation should be automated whenever possible.
Example system component:
scripts/bash/06-validate-downloads.sh
Workflow:
Repository Validation
↓
File Count Check
↓
Compression Check
↓
Metadata Check
↓
FASTQ Validation
↓
Validation Report
Automated validation minimizes manual errors and ensures consistency across projects.
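To illustrate how such a workflow can be chained, the sketch below runs two of the checks and writes one tab-separated row per check. It is not the contents of the project's validation script; `record_check` and `run_validation` are hypothetical names, and only the file count and compression steps are shown.

```shell
#!/usr/bin/env bash
# Sketch only: an illustration of the pattern, not the project's validation script.

# Append one row to a tab-separated report: check name, status, value.
record_check() {
    local report="$1" check="$2" status="$3" value="$4"
    printf '%s\t%s\t%s\n' "$check" "$status" "$value" >> "$report"
}

# Run a file count check and a compression check on one directory,
# writing a header row followed by one row per check.
run_validation() {
    local dir="$1" expected="$2" report="$3"
    local count failed=0 f
    printf 'check\tstatus\tvalue\n' > "$report"

    # File count check: compare the observed count with the expected count.
    count=$(find "$dir" -type f -name "*.fastq.gz" | wc -l | tr -d '[:space:]')
    if [ "$count" -eq "$expected" ]; then
        record_check "$report" fastq_file_count PASS "$count"
    else
        record_check "$report" fastq_file_count FAIL "$count"
    fi

    # Compression check: count files that fail gzip's integrity test.
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue
        gzip -t "$f" 2>/dev/null || failed=$((failed + 1))
    done
    if [ "$failed" -eq 0 ]; then
        record_check "$report" compression_validation PASS "$failed"
    else
        record_check "$report" compression_validation FAIL "$failed"
    fi
}
```

Writing every result through a single function keeps the report format uniform, which makes later steps such as summarizing or archiving the report simpler.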
Validation results should be recorded as part of the acquisition workflow.
Recommended records include:
Documentation improves transparency and reproducibility.
Suppose a project building a healthy reference microbiome dataset downloads data for 850 eligible samples.
Validation may involve:
850 Expected Samples
↓
Verify Downloaded Files
↓
Check Metadata Consistency
↓
Check FASTQ Integrity
↓
Remove Duplicates
↓
Validated Dataset
Only after successful validation should samples be considered for reference dataset assembly.
Researchers frequently encounter:
A structured validation workflow helps address these challenges systematically.
Data validation is not merely a technical exercise.
It reduces the risk of:
Validation protects the integrity of downstream scientific work.
The validation script generates a machine-readable report summarizing validation outcomes.
Example execution:
bash scripts/bash/06-validate-download.sh data/raw/ena
Output:
check                   status  value
fastq_file_count        PASS    6
compression_validation  PASS    0
paired_end_validation   PASS    3
fastq_validation        PASS    data/raw/ena/SRR7450738_1.fastq.gz
Interpretation:
FASTQ File Count
  Expected: 6
  Observed: 6
  Status: PASS

Compression Validation
  Failed Files: 0
  Status: PASS

Paired-End Validation
  Forward Reads: 3
  Reverse Reads: 3
  Status: PASS

FASTQ Validation
  Test File: SRR7450738_1.fastq.gz
  Status: PASS
Overall result:
Repository Validation: PASS
File Validation: PASS
Compression Validation: PASS
Paired-End Validation: PASS
FASTQ Validation: PASS
Overall Status:
VALIDATED
The validation report provides a reproducible record of dataset integrity and serves as the final quality gate before production-scale data acquisition and reference dataset assembly.
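Because the report is machine-readable, the overall status can be derived from it rather than judged by eye. The sketch below assumes a tab-separated report with a header row and the status in the second column, as in the example output above. The function name `overall_status` and the `FAILED` label are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: overall_status is a hypothetical helper; FAILED is an assumed label.
# Reads a tab-separated report (header row, then check/status/value rows) and
# prints VALIDATED only when at least one check exists and every status is PASS.
overall_status() {
    awk -F '\t' '
        NR > 1 && $2 != "PASS" { failed++ }
        END {
            ok = (NR > 1 && failed == 0)
            print (ok ? "VALIDATED" : "FAILED")
            exit (ok ? 0 : 1)
        }
    ' "$1"
}
```

The non-zero exit status on failure lets the function act as a quality gate, for example `overall_status data/validation/validation-report.tsv || exit 1` at the start of a production download script.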
Data validation transforms downloaded files into trusted analytical assets.
Repository Validation
↓
File Validation
↓
Compression Validation
↓
Metadata Validation
↓
FASTQ Validation
↓
Validated Dataset
A validated dataset provides confidence that acquisition objectives have been achieved and that downstream analyses are based on complete and reliable data.
Once datasets have been validated, the next challenge is managing storage, transfers, and scalable access to acquired data.
In the next chapter, we explore cloud storage and data transfer strategies that support reproducible and collaborative data acquisition workflows while preserving the integrity of validated datasets.