A successful data acquisition workflow begins with understanding where public omics data reside. Modern sequencing projects generate enormous quantities of biological data that are deposited into public repositories for preservation, sharing, and reuse. These repositories form a global ecosystem that supports scientific discovery, reproducibility, education, and innovation.
Before searching for studies or downloading files, it is important to understand the major repository systems, how they relate to one another, and what types of data they contain.
Public repositories serve several important functions, including the long-term preservation, sharing, and reuse of data.
Without these repositories, many large-scale biological discoveries would not be possible.
Most public omics data are distributed through a small number of major repository systems.
flowchart LR
A[NCBI]
B[EMBL-EBI]
C[DDBJ]
A <--> B
B <--> C
A <--> C
Together these organizations support the global exchange of biological data.
The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing partnership between three major organizations:
| Organization | Region |
|---|---|
| NCBI | United States |
| EMBL-EBI | Europe |
| DDBJ | Japan |
These repositories routinely exchange data, ensuring that submitted datasets become available throughout the international community.
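Because the three members exchange records, the same read-archive entry can be retrieved from any of them, and the leading letter of its accession indicates which member issued it (S for NCBI, E for EMBL-EBI, D for DDBJ). The following is a minimal sketch of that lookup. The function name `issuing_member` and the sample accessions are illustrative assumptions, and the pattern only covers read-archive style identifiers such as `SRR`, `ERX`, or `DRP`.

```python
import re

# First letter of a read-archive accession -> INSDC member that issued it.
ISSUING_MEMBER = {
    "S": "NCBI",
    "E": "EMBL-EBI",
    "D": "DDBJ",
}

# Matches study (P), sample (S), experiment (X), and run (R) accessions,
# e.g. SRP..., ERS..., DRX..., SRR... (illustrative, not exhaustive).
_READ_ARCHIVE_PATTERN = re.compile(r"([SED])R[PSXR]\d+")


def issuing_member(accession: str) -> str:
    """Return the INSDC member that issued a read-archive accession."""
    match = _READ_ARCHIVE_PATTERN.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"Not a recognized read-archive accession: {accession!r}")
    return ISSUING_MEMBER[match.group(1)]


# Placeholder accessions used only to exercise the function.
for example in ("SRR000001", "ERR000001", "DRR000001"):
    print(example, "->", issuing_member(example))
```

The issuing member only records where a dataset was submitted. Thanks to the exchange described above, it does not restrict where the data can be downloaded.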
The public data landscape consists of several interconnected resources.
A BioProject record describes the overall research initiative.
BioSample records describe individual biological specimens.
Sequence archives store the raw sequencing data generated from samples.
Other repositories provide processed data and study-level information.
A common relationship encountered in many public datasets is:
flowchart LR
BioProject --> BioSample
BioSample --> Experiment
Experiment --> Run
Understanding these relationships is essential for navigating accession systems, which are covered in a later chapter.
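One way to picture the BioProject, BioSample, Experiment, and Run relationship is as nested records. The sketch below models it with Python dataclasses and walks the hierarchy to collect every run under a project. The class layout, the `run_accessions` helper, and the identifiers (`PROJECT_1`, `RUN_1`, and so on) are hypothetical placeholders rather than real accessions or a repository schema, and the strict tree is a simplification of what public records can look like.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    accession: str


@dataclass
class Experiment:
    accession: str
    runs: list[Run] = field(default_factory=list)


@dataclass
class BioSample:
    accession: str
    experiments: list[Experiment] = field(default_factory=list)


@dataclass
class BioProject:
    accession: str
    samples: list[BioSample] = field(default_factory=list)

    def run_accessions(self) -> list[str]:
        """Collect the accession of every run reachable from this project."""
        return [
            run.accession
            for sample in self.samples
            for experiment in sample.experiments
            for run in experiment.runs
        ]


# Hypothetical project with two samples and three runs in total.
project = BioProject(
    "PROJECT_1",
    samples=[
        BioSample(
            "SAMPLE_1",
            experiments=[Experiment("EXPERIMENT_1", runs=[Run("RUN_1"), Run("RUN_2")])],
        ),
        BioSample(
            "SAMPLE_2",
            experiments=[Experiment("EXPERIMENT_2", runs=[Run("RUN_3")])],
        ),
    ],
)

print(project.run_accessions())  # ['RUN_1', 'RUN_2', 'RUN_3']
```

Flattening the hierarchy this way mirrors a common practical task: starting from a single project and ending with the list of runs to download.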
Understanding where public omics data reside is the first step in the data acquisition process. The next challenge is identifying studies that match a specific scientific objective.
In the next chapter, we explore strategies for discovering relevant studies across public repositories.