Study Discovery

Published

Jun 2026

ID: DAS-002
Type: Foundations
Audience: Omics Data Scientists, Bioinformaticians, and Research Teams
Theme: Finding Studies That Match Your Question

Public repositories contain millions of samples and thousands of studies spanning a wide range of biological questions, populations, technologies, and experimental designs. The challenge is rarely a lack of data; it is identifying studies that are relevant, trustworthy, and suitable for a specific objective.

Study discovery is the process of transforming a research objective into a set of candidate studies that can support downstream analysis. Before any data are downloaded, researchers must determine which studies are most appropriate for their intended purpose.

Why Study Discovery Matters

Effective data acquisition begins with selecting the right studies.

For example, a project focused on constructing a healthy reference microbiome cannot rely on a keyword search alone. Researchers must determine:

Which studies contain healthy individuals?
Which samples are human?
Which body sites are relevant?
Which metadata fields are available?
Which studies satisfy project-specific criteria?

The quality of study discovery directly influences the quality of the resulting dataset.

From Question to Search Strategy

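Every study discovery process begins with a clearly defined objective. That objective is translated into a search strategy, the strategy is used to discover candidate studies, and each candidate is then evaluated.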

flowchart LR
A[Research Objective] --> B[Search Strategy]
B --> C[Study Discovery]
C --> D[Study Evaluation]

Defining Inclusion Criteria

Healthy Reference Microbiome Example

Include:

Human samples
Healthy individuals
Stool samples
Adequate metadata
Publicly available sequencing data

Defining Exclusion Criteria

Examples of exclusion criteria include:

Disease cohorts
Antibiotic-treated individuals
Animal samples
Missing phenotype information
Incomplete metadata
Poor documentation
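
The inclusion and exclusion criteria above can be written down as an explicit, repeatable check. The following is a minimal sketch, assuming each candidate study has already been summarized into a small record; the `StudyRecord` class, its field names, and the accession strings are hypothetical and do not correspond to any repository schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StudyRecord:
    """Hypothetical study summary; field names are illustrative only."""
    accession: str
    host: str
    body_site: str
    health_status: str
    antibiotics_recent: Optional[bool]  # None means "not reported"
    has_public_reads: bool


def passes_criteria(study: StudyRecord) -> Tuple[bool, List[str]]:
    """Return (keep, reasons); reasons lists every criterion the study fails."""
    reasons = []
    if study.host.strip().lower() != "homo sapiens":
        reasons.append("non-human host")
    if study.body_site.strip().lower() != "stool":
        reasons.append("body site is not stool")
    if study.health_status.strip().lower() != "healthy":
        reasons.append("not a healthy cohort")
    if study.antibiotics_recent is None:
        reasons.append("antibiotic exposure not reported")
    elif study.antibiotics_recent:
        reasons.append("antibiotic-treated individuals")
    if not study.has_public_reads:
        reasons.append("sequencing data not publicly available")
    return (not reasons, reasons)


# Placeholder accessions, not real studies.
candidates = [
    StudyRecord("STUDY_A", "Homo sapiens", "stool", "healthy", False, True),
    StudyRecord("STUDY_B", "Mus musculus", "stool", "healthy", None, True),
]
shortlist = [s.accession for s in candidates if passes_criteria(s)[0]]
```

Returning the reasons alongside the decision keeps a record of why each study was excluded, which makes the selection easier to audit later.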

Repository Search Interfaces

Repository	Typical Use
NCBI	Broad study discovery
GEO	Functional genomics
ENA	Sequence archives
MGnify	Microbiome studies
GWAS Catalog	Variant-trait association studies
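
Several of these repositories can also be searched programmatically. As one illustration, the sketch below builds a search URL for the NCBI E-utilities ESearch endpoint without sending any request. The search terms and the choice of the `bioproject` database are assumptions made for this example, and the helper name `build_esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(terms, db="bioproject", retmax=50):
    """Combine search terms with AND and return an ESearch URL (no request is sent)."""
    query = " AND ".join(f"({term})" for term in terms)
    params = {"db": db, "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Illustrative terms only; a real search strategy would be refined iteratively.
url = build_esearch_url(["gut microbiome", "healthy", "stool"])
print(url)
```

Keeping the query as a list of terms makes the search strategy easy to record and rerun, which matters when the same search has to be repeated as repositories grow.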

Metadata-Driven Discovery

Relevant metadata may include:

Disease status
Age
Sex
Geography
Body site
Treatment status
Sequencing platform
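
Whether these fields can actually drive discovery depends on how often they are filled in. The sketch below, under the assumption that sample metadata has been loaded as a list of dictionaries, reports the fraction of samples with a usable value for each field. The field names and the set of "missing" placeholder strings are assumptions, since repositories and submitters encode absent values differently.

```python
# Hypothetical field names mirroring the list above.
REQUIRED_FIELDS = [
    "disease_status", "age", "sex", "geography",
    "body_site", "treatment_status", "sequencing_platform",
]

# Placeholder strings treated as missing; extend to match the data at hand.
MISSING_TOKENS = {"", "na", "n/a", "not provided", "not collected", "missing", "unknown"}


def is_missing(value):
    """True if the value is absent or a common placeholder for absent data."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def field_completeness(samples, fields=REQUIRED_FIELDS):
    """Return {field: fraction of samples with a usable value}."""
    total = len(samples)
    completeness = {}
    for name in fields:
        present = sum(1 for sample in samples if not is_missing(sample.get(name)))
        completeness[name] = present / total if total else 0.0
    return completeness


# Toy records for illustration; not real sample metadata.
samples = [
    {"age": "34", "sex": "female"},
    {"age": "not provided", "sex": "male"},
]
report = field_completeness(samples, ["age", "sex", "body_site"])
```

A study whose key fields are mostly empty may still match a keyword search, but it cannot be filtered on those fields with any confidence.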

Study Evaluation Checklist

Does the study address the target population?
Is metadata available and sufficiently detailed?
Are raw sequencing data available?
Is the sample size adequate?
Is study documentation complete?
Can the study support the intended analysis?
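
Recording the checklist answers in a structured form makes evaluations comparable across studies and reviewers. The following is a minimal sketch; the question keys are hypothetical shorthand for the six questions above, and the decision rule (exclude on any "no", flag for review when an answer is missing) is an assumption for illustration, not a rule stated in this chapter.

```python
# Hypothetical keys, one per checklist question above.
CHECKLIST = (
    "addresses_target_population",
    "metadata_sufficient",
    "raw_data_available",
    "sample_size_adequate",
    "documentation_complete",
    "supports_intended_analysis",
)


def evaluate_study(answers):
    """Summarize checklist answers (True/False per key) into a decision."""
    unanswered = [q for q in CHECKLIST if q not in answers]
    failed = [q for q in CHECKLIST if answers.get(q) is False]
    if failed:
        decision = "exclude"
    elif unanswered:
        decision = "needs review"
    else:
        decision = "shortlist"
    return {"decision": decision, "failed": failed, "unanswered": unanswered}


# Example: every question answered "yes" except sample size, which is undecided.
answers = {q: True for q in CHECKLIST if q != "sample_size_adequate"}
result = evaluate_study(answers)
```

In practice, thresholds such as what counts as an adequate sample size depend on the intended analysis and should be fixed before the evaluation starts.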

AlphaBiomics Example

Objective: build a healthy reference gut microbiome from public studies.

flowchart TD
A[Healthy Reference Goal] --> B[Search Public Repositories]
B --> C[Identify Candidate Studies]
C --> D[Review Metadata]
D --> E[Apply Inclusion and Exclusion Criteria]
E --> F[Shortlist Studies]

Looking Ahead

In the next chapter, we examine accession systems and the relationships between BioProjects, BioSamples, Experiments, and Runs.