Audience: Omics Data Scientists, Bioinformaticians, and Research Teams
Theme: Building a Healthy Reference Microbiome from Public Data
Earlier chapters of this guide covered the major components of a reproducible data acquisition workflow: study discovery, accession systems, metadata acquisition, data download, validation, cloud storage, and reference dataset assembly.
This chapter integrates those components through a practical case study inspired by a common challenge in microbiome research:
How can we build a healthy reference microbiome dataset from public repositories?
The objective is not to perform downstream microbiome analysis, but to demonstrate the acquisition workflow required to construct a reusable reference dataset.
Project Objective
The goal is to assemble a healthy human gut microbiome reference dataset suitable for future comparative analyses.
Potential applications include:
Population benchmarking
Health reference construction
Method development
Educational workflows
Commercial microbiome products
Step 1: Define the Target Population
Before searching repositories, define the target cohort.
Inclusion Criteria
Human samples
Healthy individuals
Stool samples
Publicly available metadata
Publicly available sequencing data
Exclusion Criteria
Disease cohorts
Antibiotic-treated individuals
Unknown health status
Non-human samples
Missing metadata
These criteria establish the scope of the reference dataset.
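The criteria above can be written down as an explicit screening rule, so every candidate sample is judged the same way and each rejection is traceable. The following is a minimal sketch, not a definitive implementation. The `SampleRecord` fields, the accepted values ("homo sapiens", "healthy", "stool"), and the sample IDs are all assumptions made for illustration; real repository metadata uses varied field names and vocabularies and would need to be harmonized first.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SampleRecord:
    """Hypothetical, simplified metadata record (field names are illustrative)."""
    sample_id: str
    host: Optional[str]                 # e.g. "Homo sapiens"; None if not reported
    health_status: Optional[str]        # e.g. "healthy"; None if not reported
    sample_type: Optional[str]          # e.g. "stool"; None if not reported
    antibiotics_recent: Optional[bool]  # None if not reported
    has_public_reads: bool              # True if sequencing data is downloadable


def _norm(value: Optional[str]) -> Optional[str]:
    """Lower-case and trim a text field, passing None through unchanged."""
    return value.strip().lower() if value is not None else None


def exclusion_reasons(rec: SampleRecord) -> List[str]:
    """Return every reason a record fails the criteria (empty list means include)."""
    reasons: List[str] = []
    host = _norm(rec.host)
    status = _norm(rec.health_status)
    stype = _norm(rec.sample_type)

    if host is None or stype is None or rec.antibiotics_recent is None:
        reasons.append("missing metadata")
    if host is not None and host != "homo sapiens":
        reasons.append("non-human sample")
    if status is None:
        reasons.append("unknown health status")
    elif status != "healthy":
        reasons.append("not recorded as healthy")
    if stype is not None and stype != "stool":
        reasons.append("not a stool sample")
    if rec.antibiotics_recent:
        reasons.append("antibiotic-treated")
    if not rec.has_public_reads:
        reasons.append("no public sequencing data")
    return reasons


def screen(
    records: List[SampleRecord],
) -> Tuple[List[SampleRecord], Dict[str, List[str]]]:
    """Split records into kept samples and a map of rejected ID -> reasons."""
    kept: List[SampleRecord] = []
    rejected: Dict[str, List[str]] = {}
    for rec in records:
        reasons = exclusion_reasons(rec)
        if reasons:
            rejected[rec.sample_id] = reasons
        else:
            kept.append(rec)
    return kept, rejected


# Made-up example records; the IDs and values are not real accessions.
records = [
    SampleRecord("S1", "Homo sapiens", "healthy", "stool", False, True),
    SampleRecord("S2", "Homo sapiens", "IBD", "stool", False, True),
    SampleRecord("S3", "Homo sapiens", None, "stool", False, True),
    SampleRecord("S4", "Mus musculus", "healthy", "stool", False, True),
    SampleRecord("S5", "Homo sapiens", "healthy", "stool", True, True),
    SampleRecord("S6", "Homo sapiens", "healthy", "stool", None, False),
]

kept, rejected = screen(records)
print("kept:", [r.sample_id for r in kept])
for sample_id, reasons in rejected.items():
    print(sample_id, "->", "; ".join(reasons))
```

Recording the reasons for each rejection, rather than silently dropping samples, keeps the cohort definition auditable when the criteria are revised later.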
flowchart TD
A[Public Data Landscape] --> B[Study Discovery]
B --> C[Accession Systems]
C --> D[Metadata Acquisition]
D --> E[Data Download]
E --> F[Data Validation]
F --> G[Cloud Storage and Transfer]
G --> H[Reference Dataset Assembly]
H --> I[Healthy Reference Dataset]
The AlphaBiomics case study demonstrates how each component of the CDI Data Acquisition System contributes to the creation of a reproducible reference dataset.
Next Steps
The workflow presented in this guide provides a foundation for many downstream activities, including microbiome analysis, RNA-Seq studies, benchmarking projects, clinical evidence generation, and commercial product development.
The acquisition process is complete. The resulting reference dataset is now ready to support the next stage of scientific investigation.