Public Data Landscape

Published: Jun 2026

ID: DAS-001
Type: Foundations
Audience: Omics Data Scientists, Bioinformaticians, and Research Teams
Theme: Understanding the Public Omics Ecosystem

A successful data acquisition workflow begins with understanding where public omics data reside. Modern sequencing projects generate enormous quantities of biological data that are deposited into public repositories for preservation, sharing, and reuse. These repositories form a global ecosystem that supports scientific discovery, reproducibility, education, and innovation.

Before searching for studies or downloading files, it is important to understand the major repository systems, how they relate to one another, and what types of data they contain.

Why Public Repositories Matter

Public repositories serve several important functions:

Long-term data preservation
Scientific transparency
Reproducibility
Data reuse
Meta-analysis
Benchmarking and method development

Without these repositories, large-scale analyses that pool data across many independent studies, such as meta-analyses and method benchmarks, would not be possible.

The Global Repository Ecosystem

Most public omics data are distributed through a small number of major repository systems.

flowchart LR

A[NCBI]
B[EMBL-EBI]
C[DDBJ]

A <--> B
B <--> C
A <--> C

Together these organizations support the global exchange of biological data.

The International Nucleotide Sequence Database Collaboration

The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing partnership among three major organizations:

Organization	Region
NCBI	United States
EMBL-EBI	Europe
DDBJ	Japan

These repositories routinely exchange data, ensuring that submitted datasets become available throughout the international community.

Major Repository Components

The public data landscape consists of several interconnected resources.

BioProject

A BioProject describes the overall research initiative.

Examples:

Human gut microbiome study
RNA-Seq disease study
Agricultural genomics project

BioSample

BioSample records describe individual biological specimens.

Examples:

Stool sample
Blood sample
Tissue biopsy

Sequence Archives

Sequence archives store the raw sequencing data generated from samples.

Examples include:

SRA (Sequence Read Archive)
ENA (European Nucleotide Archive)
DRA (DDBJ Sequence Read Archive)
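
The three archives above can often be told apart by the prefix of a run accession. The following is a minimal sketch, assuming the commonly used run prefixes SRR, ERR, and DRR (a detail not stated in this chapter); the function name `archive_for_run` and the example accession are hypothetical placeholders. Because the archives exchange data, the prefix indicates where a run was submitted, not the only place it can be retrieved.

```python
# Run-accession prefixes per archive. These prefixes are an assumption added
# for illustration; accession systems are covered in a later chapter.
RUN_PREFIX_TO_ARCHIVE = {
    "SRR": "SRA (Sequence Read Archive)",
    "ERR": "ENA (European Nucleotide Archive)",
    "DRR": "DRA (DDBJ Sequence Read Archive)",
}


def archive_for_run(accession: str) -> str:
    """Return the archive a run accession was submitted to, based on its prefix."""
    prefix = accession.strip().upper()[:3]
    if prefix not in RUN_PREFIX_TO_ARCHIVE:
        raise ValueError(f"Unrecognized run accession prefix: {accession!r}")
    return RUN_PREFIX_TO_ARCHIVE[prefix]


# "ERR0000001" is a made-up placeholder, not a real record.
print(archive_for_run("ERR0000001"))
```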

Functional Genomics Repositories

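These repositories provide processed data (for example, expression matrices) and study-level information, and they typically link to the raw reads held in the sequence archives.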

Examples:

GEO (Gene Expression Omnibus)
ArrayExpress

Domain-Specific Examples

Microbiome

Common resources:

BioProject
BioSample
SRA
MGnify

RNA-Seq

Common resources:

GEO
SRA
ArrayExpress

GWAS

Common resources:

GWAS Catalog
dbGaP

Single-Cell

Common resources:

GEO
SRA
ArrayExpress
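
Several resources recur across domains. As a rough illustration, the sketch below encodes the Microbiome, GWAS, and Single-Cell lists above as a dictionary and inverts it, so that each resource maps to the domains that use it. The names `DOMAIN_RESOURCES` and `domains_by_resource` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Resource lists copied from the Microbiome, GWAS, and Single-Cell entries above.
DOMAIN_RESOURCES = {
    "Microbiome": ["BioProject", "BioSample", "SRA", "MGnify"],
    "GWAS": ["GWAS Catalog", "dbGaP"],
    "Single-Cell": ["GEO", "SRA", "ArrayExpress"],
}


def domains_by_resource(domain_resources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a domain -> resources mapping into resource -> domains."""
    index = defaultdict(list)
    for domain, resources in domain_resources.items():
        for resource in resources:
            index[resource].append(domain)
    return dict(index)


index = domains_by_resource(DOMAIN_RESOURCES)
print(index["SRA"])  # ['Microbiome', 'Single-Cell']
```

An inverted view like this makes it easy to see which repositories are worth learning first because they serve more than one domain.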

A First Look at Repository Relationships

A common relationship encountered in many public datasets is:

flowchart LR

BioProject --> BioSample
BioSample --> Experiment
Experiment --> Run

Understanding these relationships is essential for navigating accession systems, which are covered in a later chapter.
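
The BioProject, BioSample, Experiment, and Run chain can be modeled as a simple nested data structure. This is a minimal sketch, not the schema of any repository: the class fields, the helper `runs_in_project`, and all identifiers such as "PROJECT-1" are made-up placeholders used only to show how a project fans out to its runs.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    accession: str  # placeholder identifier, not a real record


@dataclass
class Experiment:
    accession: str
    runs: list[Run] = field(default_factory=list)


@dataclass
class BioSample:
    accession: str
    experiments: list[Experiment] = field(default_factory=list)


@dataclass
class BioProject:
    accession: str
    samples: list[BioSample] = field(default_factory=list)


def runs_in_project(project: BioProject) -> list[str]:
    """Walk BioProject -> BioSample -> Experiment -> Run and collect run IDs."""
    return [
        run.accession
        for sample in project.samples
        for experiment in sample.experiments
        for run in experiment.runs
    ]


project = BioProject(
    accession="PROJECT-1",
    samples=[
        BioSample("SAMPLE-1", [Experiment("EXP-1", [Run("RUN-1"), Run("RUN-2")])]),
        BioSample("SAMPLE-2", [Experiment("EXP-2", [Run("RUN-3")])]),
    ],
)

print(runs_in_project(project))  # ['RUN-1', 'RUN-2', 'RUN-3']
```

In practice, the runs are the records that point to downloadable sequence files, so flattening a project down to its run list is usually the last step before retrieval.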

Looking Ahead

Understanding where public omics data reside is the first step in the data acquisition process. The next challenge is identifying studies that match a specific scientific objective.

In the next chapter, we explore strategies for discovering relevant studies across public repositories.