AlphaBiomics Case Study

Published: Jun 2026

ID: DAS-009
Type: Case Study
Audience: Omics Data Scientists, Bioinformaticians, and Research Teams
Theme: Building a Healthy Reference Microbiome from Public Data

Throughout this guide, we explored the major components of a reproducible data acquisition workflow, including study discovery, accession systems, metadata acquisition, data download, validation, cloud storage, and reference dataset assembly.

This chapter integrates those components through a practical case study inspired by a common challenge in microbiome research:

How can we build a healthy reference microbiome dataset from public repositories?

The objective is not to perform downstream microbiome analysis, but to demonstrate the acquisition workflow required to construct a reusable reference dataset.

Project Objective

The goal is to assemble a healthy human gut microbiome reference dataset suitable for future comparative analyses.

Potential applications include:

Population benchmarking
Health reference construction
Method development
Educational workflows
Commercial microbiome products

Step 1: Define the Target Population

Before searching repositories, define the target cohort.

Inclusion Criteria

Human samples
Healthy individuals
Stool samples
Publicly available metadata
Publicly available sequencing data

Exclusion Criteria

Disease cohorts
Antibiotic-treated individuals
Unknown health status
Non-human samples
Missing metadata

These criteria establish the scope of the reference dataset.
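
The criteria above can be written down as an explicit rule so that every later decision is traceable to them. The following is a minimal sketch, not a definitive implementation: the `SampleRecord` fields and the normalized values ("human", "stool", "healthy") are assumptions made for illustration, and a real project would adapt them to its own metadata schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical, already-normalized metadata for one sample."""
    host: Optional[str]
    body_site: Optional[str]
    health_status: Optional[str]
    antibiotic_use: Optional[bool]  # None means "not reported"
    has_public_reads: bool


def exclusion_reason(sample: SampleRecord) -> Optional[str]:
    """Return the first exclusion reason, or None if the sample is eligible."""
    if sample.host is None or sample.body_site is None or sample.antibiotic_use is None:
        return "missing metadata"
    if sample.host != "human":
        return "non-human sample"
    if sample.body_site != "stool":
        return "not a stool sample"
    if sample.health_status is None:
        return "unknown health status"
    if sample.health_status != "healthy":
        return "disease cohort"
    if sample.antibiotic_use:
        return "antibiotic-treated"
    if not sample.has_public_reads:
        return "no public sequencing data"
    return None
```

Returning a reason instead of a plain yes/no keeps a record of why each sample was excluded, which is useful when the cohort definition is reviewed later.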

Step 2: Study Discovery

The next step is identifying candidate studies.

flowchart TD

A[Healthy Reference Objective]
--> B[Repository Search]

B --> C[Candidate Studies]

C --> D[Study Evaluation]

Potential repositories include:

NCBI (SRA, BioProject, and BioSample)
ENA (European Nucleotide Archive)
MGnify
GEO (Gene Expression Omnibus)

At this stage, the objective is to identify studies that may contain eligible samples.
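
A first-pass screen of candidate studies can be sketched as a coarse keyword filter over study titles and descriptions. This is only an illustration: the record layout (`accession`, `title`, `description`), the keyword lists, and the function names are assumptions, and a keyword match only flags a study for manual evaluation. It does not establish eligibility.

```python
# Illustrative keyword lists; a real project would tune these and still review by hand.
HEALTH_TERMS = ("healthy", "control")
SITE_TERMS = ("stool", "fecal", "faecal", "gut")


def may_contain_eligible_samples(study: dict[str, str]) -> bool:
    """Coarse text screen: the study mentions both a health term and a site term."""
    text = f"{study.get('title', '')} {study.get('description', '')}".lower()
    return any(t in text for t in HEALTH_TERMS) and any(t in text for t in SITE_TERMS)


def shortlist(studies: list[dict[str, str]]) -> list[str]:
    """Return accessions of candidate studies, in input order, without duplicates."""
    seen: set[str] = set()
    selected: list[str] = []
    for study in studies:
        accession = study["accession"]
        if accession in seen:
            continue  # the same study can appear in more than one repository search
        seen.add(accession)
        if may_contain_eligible_samples(study):
            selected.append(accession)
    return selected
```

The screen is deliberately permissive. A disease study with healthy controls passes it, because its control samples may still be eligible once sample-level metadata are examined.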

Step 3: Accession Retrieval

For each candidate study:

flowchart TD

A[BioProject]
--> B[BioSamples]

B --> C[Experiments]

C --> D[Runs]

Accession identifiers provide the link between studies, samples, metadata, and downloadable sequencing data.
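
One way to walk this hierarchy programmatically is the ENA Portal API file report, which lists the runs under a project together with their experiment and sample accessions. The sketch below builds the request URL and parses the tab-separated response. The selected fields are one reasonable choice rather than a required set, the function names are illustrative, and the actual network request is left to the caller.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

# Fields requested per run; adjust to the needs of the project.
FIELDS = (
    "run_accession",
    "experiment_accession",
    "sample_accession",
    "fastq_ftp",
    "fastq_md5",
)


def filereport_url(project_accession: str) -> str:
    """Build the file report URL listing all runs under a project accession."""
    params = {
        "accession": project_accession,
        "result": "read_run",
        "fields": ",".join(FIELDS),
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


def parse_filereport(tsv_text: str) -> list[dict[str, str]]:
    """Parse the tab-separated report into one dictionary per run."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]
```

The response text can be fetched with any HTTP client, for example `urllib.request.urlopen(filereport_url(accession)).read().decode()`, and should be saved alongside the parsed table so the accession mapping can be reproduced later.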

Step 4: Metadata Acquisition

Metadata determine whether samples satisfy project requirements.

Variable	Example
Health Status	Healthy
Body Site	Stool
Age	Adult
Antibiotic Use	No
Host	Human

Metadata acquisition is often the most time-consuming stage of the workflow, because field names, vocabularies, and missing-value conventions vary between studies and repositories.
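
Much of that effort goes into mapping free-text values onto a small controlled vocabulary. The following sketch shows the idea under stated assumptions: the field names, the synonym table, and the list of missing-value markers are illustrative and would need to be extended for real submissions.

```python
from typing import Optional

# Markers treated as "no value reported" (illustrative, not exhaustive).
MISSING = {"", "na", "n/a", "unknown", "missing",
           "not applicable", "not collected", "not provided"}

# Hypothetical synonym table: raw value -> controlled term, per field.
SYNONYMS = {
    "host": {"human": "human", "homo sapiens": "human"},
    "body_site": {"stool": "stool", "feces": "stool", "faeces": "stool", "fecal": "stool"},
    "health_status": {"healthy": "healthy", "healthy control": "healthy"},
    "antibiotic_use": {"no": "no", "none": "no", "yes": "yes"},
}


def normalize(field: str, raw: Optional[str]) -> Optional[str]:
    """Map a raw metadata value to a controlled term, or None if missing."""
    if raw is None:
        return None
    value = raw.strip().lower()
    if value in MISSING:
        return None
    # Unmapped values pass through in lowercase so they can be reviewed by hand.
    return SYNONYMS.get(field, {}).get(value, value)
```

A bare "control" is intentionally left unmapped here, since control groups are not always healthy and that judgment belongs to study evaluation.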

Step 5: Sample Filtering

A typical filtering process may resemble:

2,000 Samples
        ↓
Remove Disease Samples
        ↓
Remove Antibiotic-Treated Samples
        ↓
Remove Missing Metadata
        ↓
850 Eligible Samples

Filtering transforms candidate studies into a defined cohort.
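
The funnel above can be sketched as an ordered list of filters that logs how many samples remain after each step. The field names and values are assumptions for illustration, and the step order mirrors the diagram rather than any required sequence.

```python
from typing import Callable, Optional

Sample = dict[str, Optional[str]]

REQUIRED_FIELDS = ("host", "body_site", "health_status", "antibiotic_use")

# Each step is (label, predicate); a sample is kept when the predicate is True.
# Samples with an unreported health status survive the first step and are
# removed by the missing-metadata step instead.
STEPS: list[tuple[str, Callable[[Sample], bool]]] = [
    ("Remove disease samples", lambda s: s.get("health_status") in ("healthy", None)),
    ("Remove antibiotic-treated samples", lambda s: s.get("antibiotic_use") != "yes"),
    ("Remove missing metadata", lambda s: all(s.get(f) is not None for f in REQUIRED_FIELDS)),
]


def run_funnel(samples: list[Sample]) -> tuple[list[Sample], list[tuple[str, int]]]:
    """Apply each step in order and log how many samples remain after it."""
    remaining = list(samples)
    log = [("Candidate samples", len(remaining))]
    for label, keep in STEPS:
        remaining = [s for s in remaining if keep(s)]
        log.append((label, len(remaining)))
    return remaining, log
```

Keeping the per-step counts makes the path from candidate samples to the final cohort reportable, in the same form as the diagram.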

Step 6: Data Download

Once eligible samples are identified:

Eligible Samples
        ↓
Retrieve Run Accessions
        ↓
Download Data
        ↓
Organize Files

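Downloaded data should be stored in a predictable directory structure, for example one directory per study and run accession, so that the location of any file can be derived from its accessions.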
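
One possible layout is `raw/<study accession>/<run accession>/<file name>`. The helper below is a sketch of that convention. The layout itself and the function name are assumptions, not a standard, and the URL in the usage note is a placeholder.

```python
from pathlib import Path, PurePosixPath


def local_path(root: Path, study: str, run: str, remote_url: str) -> Path:
    """Derive the local destination of a remote file from its accessions."""
    filename = PurePosixPath(remote_url).name  # last component of the remote path
    return root / "raw" / study / run / filename
```

For example, `local_path(Path("data"), "PRJEB00000", "ERR0000001", "ftp.example.org/fastq/ERR0000001_1.fastq.gz")` yields `data/raw/PRJEB00000/ERR0000001/ERR0000001_1.fastq.gz`. Because the path depends only on accessions and the remote file name, a rerun of the download step writes to the same locations.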

Step 7: Data Validation

Validation confirms:

Expected sample counts
File integrity
Metadata consistency
Absence of duplicates

Only validated samples should proceed to dataset assembly.
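
Three of these checks (sample counts, file integrity, and duplicates) can be sketched against a download manifest. The manifest layout, with `run_accession`, `path`, and `md5` keys, is an assumption made for this example. It presumes the repository publishes MD5 checksums for its files. Metadata consistency checks depend on the project schema and are not shown.

```python
import hashlib
from collections import Counter
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(manifest: list[dict[str, str]], expected_count: int) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems: list[str] = []
    if len(manifest) != expected_count:
        problems.append(f"expected {expected_count} runs, found {len(manifest)}")
    counts = Counter(entry["run_accession"] for entry in manifest)
    for run, n in counts.items():
        if n > 1:
            problems.append(f"duplicate run: {run}")
    for entry in manifest:
        path = Path(entry["path"])
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif md5sum(path) != entry["md5"]:
            problems.append(f"checksum mismatch: {path}")
    return problems
```

Collecting all problems instead of stopping at the first one gives a complete picture of what needs to be downloaded again.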

Step 8: Reference Dataset Assembly

Validated samples are combined into a unified resource.

flowchart TD

A[Validated Study 1]
--> D[Reference Dataset]

B[Validated Study 2]
--> D

C[Validated Study 3]
--> D

During assembly, metadata are harmonized to a shared schema, and provenance information such as the source study and retrieval date is preserved for every sample.
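
Assembly with provenance can be sketched as a merge that tags every sample with its source study and retrieval date. The record layout and the two provenance field names (`source_study`, `retrieved_on`) are assumptions for illustration, and the input is presumed to be already validated and harmonized.

```python
from datetime import date


def assemble(studies: dict[str, list[dict[str, str]]], retrieved_on: date) -> list[dict[str, str]]:
    """Merge per-study sample tables into one table with provenance columns."""
    reference: list[dict[str, str]] = []
    seen: set[str] = set()
    for study_accession, samples in studies.items():
        for sample in samples:
            key = sample["sample_accession"]
            if key in seen:
                continue  # the same sample can be registered under more than one study
            seen.add(key)
            reference.append({
                **sample,
                "source_study": study_accession,
                "retrieved_on": retrieved_on.isoformat(),
            })
    return reference
```

When a sample appears in several studies, this sketch keeps the first occurrence. A real project might instead record every source study for that sample.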

Final Dataset

The resulting dataset contains:

Selected studies
Harmonized metadata
Validated sequencing data
Documentation
Provenance records

The dataset is now suitable for downstream analysis and interpretation.

Lessons Learned

This case study highlights several important principles:

Public data are not automatically reference datasets.
Metadata often determine dataset quality.
Validation is essential.
Reproducibility should be considered throughout the workflow.
Reference datasets are constructed, not simply downloaded.

CDI Data Acquisition System Summary

flowchart TD

A[Public Data Landscape]
--> B[Study Discovery]

B --> C[Accession Systems]

C --> D[Metadata Acquisition]

D --> E[Data Download]

E --> F[Data Validation]

F --> G[Cloud Storage and Transfer]

G --> H[Reference Dataset Assembly]

H --> I[Healthy Reference Dataset]

The AlphaBiomics case study demonstrates how each component of the CDI Data Acquisition System contributes to the creation of a reproducible reference dataset.

Next Steps

The workflow presented in this guide provides a foundation for many downstream activities, including microbiome analysis, RNA-Seq studies, benchmarking projects, clinical evidence generation, and commercial product development.

The acquisition process is complete. The resulting reference dataset is now ready to support the next stage of scientific investigation.