Public repositories contain enormous amounts of biological data, but locating the correct files often requires navigating multiple layers of identifiers and accession systems.
A single study may contain hundreds or thousands of samples, multiple experiments, and numerous sequencing runs. Understanding how these entities relate to one another is essential for efficient data acquisition and reproducible dataset assembly.
This chapter introduces the accession systems commonly encountered in public omics repositories and explains how they connect different components of a study.
Accession identifiers provide a structured way to organize, track, retrieve, and reference biological data.
They help researchers answer questions such as:
Without accession systems, large-scale public repositories would be difficult to navigate and maintain.
A common hierarchy used by major sequence repositories is:
flowchart TD
    A[BioProject] --> B[BioSample]
    B --> C[Experiment]
    C --> D[Run]
Each level describes a different aspect of the data acquisition process.
| Level | Description | Example Prefix |
|---|---|---|
| BioProject | Research project | PRJNA |
| BioSample | Biological specimen | SAMN |
| Experiment | Sequencing experiment | SRX |
| Run | Sequencing output | SRR |
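The prefixes in the table are enough to tell which level of the hierarchy an accession belongs to. The following is a minimal sketch, assuming an accession is one of the four prefixes above followed by digits; the function name `classify_accession` is illustrative, and other prefixes in use at these repositories are deliberately not covered.

```python
import re

# Prefixes copied from the table above; any other prefix is reported as "Unknown".
PREFIX_TO_LEVEL = {
    "PRJNA": "BioProject",
    "SAMN": "BioSample",
    "SRX": "Experiment",
    "SRR": "Run",
}

# Assumption: an accession is a known prefix followed by one or more digits.
_ACCESSION_RE = re.compile(r"^(PRJNA|SAMN|SRX|SRR)(\d+)$")


def classify_accession(accession: str) -> str:
    """Return the hierarchy level for an accession, or 'Unknown'."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return "Unknown"
    return PREFIX_TO_LEVEL[match.group(1)]


# Placeholder accessions, not real records.
for acc in ["PRJNA123456", "SAMN12345678", "SRX1234567", "SRR12345678"]:
    print(acc, "->", classify_accession(acc))
```

A check like this is useful when a supplementary table mixes identifiers from several levels and they need to be sorted before any lookup.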
A BioProject represents the overall research initiative.
Examples:
Common accession prefix: PRJNA. Example accession:
PRJNA123456
A BioSample describes an individual biological specimen.
Examples:
BioSample records often contain valuable metadata such as age, sex, disease status, body site, and geographic origin.
Common accession prefix: SAMN. Example accession:
SAMN12345678
An Experiment describes how a sample was processed and sequenced.
Examples include:
Common accession prefix: SRX. Example accession:
SRX1234567
A Run represents the actual sequencing output generated from an experiment.
Runs typically correspond to downloadable sequence files.
Common accession prefix: SRR. Example accession:
SRR12345678
Although accession systems differ slightly between repositories, the underlying concepts are similar.
| Repository | Study | Sample | Data |
|---|---|---|---|
| NCBI | BioProject | BioSample | SRA |
| GEO | GSE | GSM | Supplementary Files |
| ENA | Project | Sample | Runs |
| DDBJ | BioProject | BioSample | DRA |
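When notes or scripts span several repositories, it can help to keep this terminology in one place. The sketch below simply encodes the table as a dictionary; the names `REPOSITORY_TERMS` and `term_for` are illustrative and not part of any repository's tooling.

```python
# Terminology copied from the table above, one entry per repository.
REPOSITORY_TERMS = {
    "NCBI": {"study": "BioProject", "sample": "BioSample", "data": "SRA"},
    "GEO": {"study": "GSE", "sample": "GSM", "data": "Supplementary Files"},
    "ENA": {"study": "Project", "sample": "Sample", "data": "Runs"},
    "DDBJ": {"study": "BioProject", "sample": "BioSample", "data": "DRA"},
}


def term_for(repository: str, level: str) -> str:
    """Return the repository's term for 'study', 'sample', or 'data'."""
    try:
        return REPOSITORY_TERMS[repository.upper()][level.lower()]
    except KeyError:
        raise ValueError(f"Unknown repository or level: {repository!r}, {level!r}") from None


print(term_for("GEO", "study"))
print(term_for("ENA", "data"))
```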
Suppose we identify a microbiome study relevant to our project.
flowchart TD
    A[BioProject] --> B[BioSample Metadata]
    B --> C[Experiment]
    C --> D[Run Accessions]
    D --> E[Data Download]
Researchers often begin with a BioProject and work downward toward downloadable run files.
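The top-down traversal can be pictured as a walk over nested records. The following is a minimal in-memory sketch, not a repository client: the class names mirror the hierarchy, `collect_run_accessions` is a hypothetical helper, and all accessions are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experiment:
    accession: str
    runs: List[str] = field(default_factory=list)  # run accessions


@dataclass
class BioSample:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)


@dataclass
class BioProject:
    accession: str
    samples: List[BioSample] = field(default_factory=list)


def collect_run_accessions(project: BioProject) -> List[str]:
    """Walk BioProject -> BioSample -> Experiment -> Run and list every run."""
    return [
        run
        for sample in project.samples
        for experiment in sample.experiments
        for run in experiment.runs
    ]


# Placeholder accessions for illustration only.
example_project = BioProject(
    accession="PRJNA123456",
    samples=[
        BioSample("SAMN12345678", [Experiment("SRX1234567", ["SRR12345678", "SRR12345679"])]),
        BioSample("SAMN12345679", [Experiment("SRX1234568", ["SRR12345680"])]),
    ],
)

print(collect_run_accessions(example_project))
```

Keeping the parent accession on every level makes it possible to trace any downloaded run back to its sample and project.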
A workflow for assembling a healthy reference microbiome dataset may proceed as follows:
Healthy Reference Objective
↓
Identify BioProject
↓
Retrieve BioSamples
↓
Review Metadata
↓
Select Eligible Samples
↓
Collect Run Accessions
↓
Download Data
This process illustrates why accession systems are central to reproducible data acquisition.
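The selection steps in this workflow can be sketched as a filter over a metadata table. The example below assumes one record per run with hypothetical field names (`bioproject`, `biosample`, `run`, `disease_status`, `body_site`) and placeholder accessions; real metadata exports use repository-specific column names and values.

```python
from typing import Dict, List

# Hypothetical metadata records, one per run. Accessions and values are placeholders.
RECORDS = [
    {"bioproject": "PRJNA123456", "biosample": "SAMN00000001", "run": "SRR00000001",
     "disease_status": "healthy", "body_site": "gut"},
    {"bioproject": "PRJNA123456", "biosample": "SAMN00000002", "run": "SRR00000002",
     "disease_status": "disease", "body_site": "gut"},
    {"bioproject": "PRJNA123456", "biosample": "SAMN00000003", "run": "SRR00000003",
     "disease_status": "healthy", "body_site": "skin"},
    {"bioproject": "PRJNA654321", "biosample": "SAMN00000004", "run": "SRR00000004",
     "disease_status": "healthy", "body_site": "gut"},
]


def select_run_accessions(
    records: List[Dict[str, str]],
    bioproject: str,
    criteria: Dict[str, str],
) -> List[str]:
    """Return sorted run accessions from one BioProject whose metadata match all criteria."""
    selected = set()
    for record in records:
        if record.get("bioproject") != bioproject:
            continue  # record belongs to a different study
        if all(record.get(key) == value for key, value in criteria.items()):
            selected.add(record["run"])
    return sorted(selected)


# Healthy gut samples from the chosen project.
runs = select_run_accessions(
    RECORDS,
    bioproject="PRJNA123456",
    criteria={"disease_status": "healthy", "body_site": "gut"},
)
print(runs)
```

Saving the resulting accession list together with the criteria used to produce it gives a record from which the dataset can be reassembled later.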
Accession systems tell us where data reside, but metadata tell us whether those data are suitable for our objective.
In the next chapter, we explore metadata acquisition and how metadata drive study selection, sample filtering, and reference dataset construction.