Metadata Acquisition

Published

Jun 2026

Introduction

Once relevant studies have been discovered and their accession systems understood, the next step is metadata acquisition.

Metadata provide the context needed to understand biological samples, sequencing experiments, and associated data files. They help researchers determine whether a study is relevant to a particular objective and whether individual samples should be included or excluded from downstream analyses.

Without metadata, sequencing files become difficult to interpret, compare, validate, or integrate into a reproducible dataset.

Metadata acquisition is the process of discovering, retrieving, validating, evaluating, and organizing information that describes sequencing studies, samples, experiments, sequencing runs, and associated files.

Before any sequencing data are downloaded, metadata provide the information needed to understand:

What was studied
Which samples were collected
How sequencing was performed
Which sequencing runs are available
Which files can be downloaded
Whether a dataset meets project requirements

For many projects, metadata acquisition is the most important stage of the data acquisition workflow because it determines which data ultimately enter the analysis pipeline.

What Is Metadata?

Metadata are data about data.

In public sequencing repositories, metadata describe the context surrounding biological data rather than the sequencing reads themselves.

Data	Metadata
FASTQ file	File size, sequencing platform, run accession
Sequencing run	Run date, read count, instrument model
Biological sample	Organism, body site, disease status
Study	Title, description, publication

Why Metadata Matter

Metadata support:

Study discovery
Sample selection
Quality assessment
Download planning
Dataset assembly
Reproducible research

Metadata Across the Data Acquisition Lifecycle

Study Discovery
      ↓
Metadata Acquisition
      ↓
Metadata Evaluation
      ↓
Download Planning
      ↓
Data Download
      ↓
Reference Dataset Assembly

Metadata Categories

Study Metadata

Examples:

BioProject accession
Study title
Study description
Associated publication

Sample Metadata

Examples:

BioSample accession
Organism
Body site
Disease status
Geographic location

Experiment Metadata

Examples:

Library strategy
Library source
Library selection
Sequencing platform

Run Metadata

Examples:

Run accession
Read counts
Base counts
Instrument model

File Metadata

Examples:

FASTQ locations
File sizes
MD5 checksums
Download URLs

Metadata Sources

Repository Metadata

NCBI
ENA
DDBJ

Publication Metadata

Study objectives
Experimental design
Cohort descriptions

Supplementary Metadata

Sample sheets
Metadata spreadsheets
Clinical annotation tables

Curated Metadata

Download manifests
Integrated metadata tables
Project inventories

Metadata Completeness

Not all studies provide the same level of metadata.

Rich metadata may include:

Age
Sex
Disease status
Treatment information
Body site
Collection date

Limited metadata may include only accession identifiers and basic sequencing information.

Metadata completeness often determines whether a dataset can be reused for a particular analysis.

Inclusion and Exclusion Criteria

Metadata are frequently used to decide which samples should enter a project.

Examples of inclusion criteria:

Healthy individuals
Human gut microbiome samples
Shotgun metagenomic sequencing

Examples of exclusion criteria:

Disease cohorts
Intervention studies
Incomplete metadata
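
Criteria like these can be applied mechanically once the metadata sit in a table. The sketch below assumes a hypothetical comma-separated sample sheet whose column names (sample, host_status, body_site, library_strategy) and values are invented for illustration; real repositories use different names and vocabularies. It also assumes no field contains a quoted comma, which would call for a proper CSV parser.

```shell
# Keep the header plus rows that satisfy every inclusion criterion.
# Columns are located by header name, so the filter survives column reordering.
# Column names and values are hypothetical.
filter_samples() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
    $col["host_status"] == "healthy" &&
    $col["body_site"] == "gut" &&
    $col["library_strategy"] == "WGS"
  '
}

# Demonstration on an inline, made-up sample sheet.
filter_samples <<'EOF'
sample,host_status,body_site,library_strategy
S1,healthy,gut,WGS
S2,disease,gut,WGS
S3,healthy,skin,WGS
S4,healthy,gut,AMPLICON
S5,,gut,WGS
EOF
```

In this made-up sheet only S1 passes. S5 is dropped because its host status is empty, which is one way of treating incomplete metadata as an exclusion criterion.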

Healthy Reference Microbiome Example

Building a healthy reference microbiome dataset requires careful metadata review before any sequencing files are downloaded.

Samples may be excluded because of:

Disease status
Antibiotic exposure
Missing metadata
Inappropriate sample types

Metadata Challenges

Common metadata challenges include:

Missing metadata
Inconsistent terminology
Ambiguous sample descriptions
Repository-specific differences
Incomplete publications

These challenges often require manual review and curation.
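
Before manual review, it can help to measure how much is missing. The following sketch counts empty or placeholder values per column in a simple CSV table. The list of placeholder strings (NA, N/A, missing, not collected) is an assumption; submitters use many variants, so it should be adapted to the dataset at hand.

```shell
# Report, for each column, how many data rows hold an empty or placeholder value.
# Assumes simple CSV on stdin with a header row and no quoted commas.
count_missing() {
  awk -F',' '
    NR == 1 { n = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    {
      for (i = 1; i <= n; i++) {
        v = tolower($i)
        if (v == "" || v == "na" || v == "n/a" || v == "missing" || v == "not collected")
          miss[i]++
      }
    }
    END { for (i = 1; i <= n; i++) printf "%s\t%d\n", name[i], miss[i] }
  '
}

# Demonstration on an inline, made-up table.
count_missing <<'EOF'
sample,age,sex,body_site
S1,34,female,gut
S2,,male,stool
S3,NA,,Gut
EOF
```

The made-up table also shows a terminology problem that a count of missing values cannot detect: gut, stool, and Gut may all describe the same sample type and would need harmonizing by hand.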

Metadata as a Scientific Asset

Metadata are not merely supporting information.

Well-curated metadata become scientific assets that:

Enable reproducible research
Support dataset integration
Improve study interpretation
Facilitate downstream analyses

AlphaBiomics Example

In projects such as healthy reference microbiome construction, metadata become the foundation for:

Study selection
Sample selection
Dataset integration
Reference dataset assembly

Installing Entrez Direct (EDirect)

EDirect provides command-line access to NCBI databases and is widely used for metadata retrieval from BioProject, BioSample, SRA, and PubMed.

Throughout this chapter we use three core commands:

esearch — search NCBI databases
efetch — retrieve records
xtract — extract fields from structured outputs

Environment Setup

Create an environment.yml file containing the required metadata acquisition tools.

name: cdi-data-acquisition

channels:
  - conda-forge
  - bioconda

dependencies:
  - entrez-direct

Create the environment:

mamba env create -f environment.yml

Activate the environment:

conda activate cdi-data-acquisition

Verification

Confirm that EDirect tools are available:

which esearch
which efetch
which xtract

Example output:

/Users/tmbmacbookair/anaconda3/envs/cdi-data-acquisition/bin/esearch
/Users/tmbmacbookair/anaconda3/envs/cdi-data-acquisition/bin/efetch
/Users/tmbmacbookair/anaconda3/envs/cdi-data-acquisition/bin/xtract
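
The same check can be wrapped in a small function so that a workflow script stops early when a tool is missing. This is a minimal sketch; the function name check_tools is not part of EDirect.

```shell
# Report any required command that is not on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "MISSING: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_tools esearch efetch xtract \
  || echo "Activate the cdi-data-acquisition environment before continuing." >&2
```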

Next, test communication with NCBI:

esearch -db sra \
-query "PRJNA477349[bioproject]"

Example output:

<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>MCID_6a292bec9551de53c901708f</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>133</Count>
  <Step>1</Step>
  <Elapsed>3</Elapsed>
</ENTREZ_DIRECT>

Interpretation:

Db indicates the database queried (sra).
WebEnv identifies the Entrez History server session, which EDirect uses to pass result sets between piped commands.
QueryKey identifies the result set within the session.
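Count reports the number of matching SRA records (133 for this BioProject). Entrez indexes SRA at the experiment level, so Count equals the number of sequencing runs only when each experiment contains a single run, which is consistent with the 133 run records retrieved later in this chapter. When retrieving the RunInfo table, expect 134 lines in total: 133 sequencing runs plus 1 header row.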
Step indicates the workflow step executed by EDirect.
Elapsed reports the query execution time in seconds.

A successful query confirms that EDirect is installed correctly and can communicate with NCBI. The result count also provides a useful first validation of the dataset before retrieving detailed metadata.
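
To reuse the count in a script, the value inside the Count element can be captured rather than read by eye. The sketch below uses sed so that it can be tried on saved output without network access; extract_count is a name chosen for this example. With EDirect installed, piping the esearch output to xtract -pattern ENTREZ_DIRECT -element Count should give the same value.

```shell
# Print the integer inside <Count>...</Count> from esearch output on stdin.
extract_count() {
  sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}

# Demonstration on a trimmed copy of the example response shown above.
extract_count <<'EOF'
<ENTREZ_DIRECT>
  <Db>sra</Db>
  <Count>133</Count>
</ENTREZ_DIRECT>
EOF

# Live usage (requires network access):
#   esearch -db sra -query "PRJNA477349[bioproject]" | extract_count
```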

Tip

Using an environment file ensures that metadata acquisition workflows remain reproducible and portable across systems.

Worked Example: From BioProject to Download Manifest

BioProject:

PRJNA477349

Retrieve Run Metadata

# Create the output directory if it does not already exist
mkdir -p data/metadata

esearch -db sra \
-query 'PRJNA477349[bioproject]' \
| efetch -format runinfo \
> data/metadata/runinfo-prjna477349.csv

Validate Metadata

wc -l data/metadata/runinfo-prjna477349.csv

Expected:

134 lines: 133 sequencing runs plus one header row.
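
In a scripted workflow, the comparison against the expected count can be automated so that a mismatch stops the pipeline. The following is a minimal sketch; check_row_count is a hypothetical helper, and it treats the first line as the header and ignores blank lines.

```shell
# Compare the number of data rows (non-blank lines after the header)
# with an expected count. Returns non-zero on a mismatch.
check_row_count() {
  file=$1
  expected=$2
  actual=$(awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$file")
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual data rows in $file"
  else
    echo "MISMATCH: expected $expected data rows, found $actual in $file" >&2
    return 1
  fi
}

# Usage, once the RunInfo file exists:
#   check_row_count data/metadata/runinfo-prjna477349.csv 133
```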

Extract Run Accessions

cut -d',' -f1 data/metadata/runinfo-prjna477349.csv | head

Explanation:

-d',' specifies comma-delimited fields.
-f1 extracts the first column.

Count the records with the header row removed (expected: 133):

tail -n +2 data/metadata/runinfo-prjna477349.csv | wc -l

Create Download Manifest

cut -d',' -f1 data/metadata/runinfo-prjna477349.csv \
| tail -n +2 \
> data/metadata/srr-accessions.txt
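
Because later steps depend on this manifest, it is worth checking its contents as well as its length. The sketch below assumes that run accessions consist of SRR, ERR, or DRR followed by digits, and flags anything else along with duplicates. The function name validate_manifest and the accessions in the demonstration are placeholders.

```shell
# Validate a run-accession manifest read from stdin: every line must look like
# an INSDC run accession (SRR/ERR/DRR plus digits) and no accession may repeat.
# Exits non-zero if any problem is found.
validate_manifest() {
  awk '
    !/^[SED]RR[0-9]+$/ { printf "line %d: unexpected value \"%s\"\n", NR, $0; bad++ }
    seen[$0]++         { printf "line %d: duplicate %s\n", NR, $0; bad++ }
    END { printf "%d lines checked, %d problems\n", NR, bad; exit (bad > 0) }
  '
}

# Demonstration with placeholder accessions; the stray header is reported.
printf 'Run\nSRR0000001\nSRR0000002\n' | validate_manifest || true

# Usage on the real manifest:
#   validate_manifest < data/metadata/srr-accessions.txt
```

A leftover header line is the most likely failure in practice, for example if the tail step is skipped when the manifest is regenerated.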

Metadata Validation

Validation activities include:

Record counts
Accession completeness
Field verification
Manifest generation

Outputs:

runinfo-prjna477349.csv
srr-accessions.txt

Metadata Acquisition Beyond NCBI

The INSDC Ecosystem

Public sequencing metadata are distributed across the International Nucleotide Sequence Database Collaboration (INSDC), a partnership between three major repositories:

INSDC
│
├── NCBI (USA)
├── ENA  (Europe)
└── DDBJ (Japan)

These repositories regularly exchange records and often contain the same underlying studies, samples, and sequencing runs.

ENA Example

In addition to NCBI, metadata can be retrieved directly from the European Nucleotide Archive (ENA).

curl -o data/metadata/ena-prjna477349.tsv \
"https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNA477349&result=read_run"

Example output:

100 33717    0 33717    0     0  12205      0 --:--:--  0:00:02 --:--:-- 12203

Interpretation:

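100 in the first column is the percentage of the transfer completed. It shows that the transfer finished, not that the server returned the expected content, so the validation step below still matters.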
33717 bytes of metadata were downloaded.
The metadata were saved to:

data/metadata/ena-prjna477349.tsv
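
The file report can also be asked for specific columns through the fields parameter of the ENA Portal API. The sketch below only builds the request URL, so it can be inspected without network access. The selection of fields (run_accession, fastq_ftp, fastq_md5, fastq_bytes), the helper name ena_filereport_url, and the output file name in the usage comment are choices made for this example.

```shell
# Build an ENA Portal API file-report URL for a study accession.
# $1 = accession, $2 = optional comma-separated field list
ena_filereport_url() {
  accession=$1
  fields=${2:-run_accession,fastq_ftp,fastq_md5,fastq_bytes}
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=%s&format=tsv\n' \
    "$accession" "$fields"
}

ena_filereport_url PRJNA477349

# To download (requires network access). The --fail option makes curl exit
# non-zero on an HTTP error instead of saving the error page:
#   curl --fail -o data/metadata/ena-prjna477349-files.tsv \
#     "$(ena_filereport_url PRJNA477349)"
```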

Validation

Confirm that the metadata file was created successfully:

wc -l data/metadata/ena-prjna477349.tsv

Expected line count: 134

Interpretation:

The ENA file contains:

133 sequencing run records
1 header row

This matches the NCBI metadata retrieval results obtained earlier.

NCBI Versus ENA

The same BioProject can often be accessed through both NCBI and ENA.

For BioProject PRJNA477349:

EDirect Search
      ↓
Count = 133

NCBI RunInfo
      ↓
133 records

ENA File Report
      ↓
133 records

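The agreement between the three counts provides confidence that metadata retrieval was successful and complete. Equal counts do not by themselves guarantee identical records, so a stricter check compares the run accession lists from the two repositories.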
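
One way to go beyond counts is to compare the accessions themselves. The sketch below assumes that the NCBI RunInfo CSV has a column named Run and that the ENA file report has a column named run_accession; the helper names column_values and compare_runs are hypothetical.

```shell
# Print the non-empty values of a named column from a delimited table.
# $1 = file, $2 = field separator, $3 = column name from the header row
column_values() {
  awk -F"$2" -v want="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) c = i; next }
    c && $c != "" { print $c }
  ' "$1"
}

# Compare run accessions in an NCBI RunInfo CSV ($1) and an ENA TSV ($2).
# Prints accessions found in only one file and returns non-zero if any exist.
compare_runs() {
  ncbi_list=$(mktemp) || return 2
  ena_list=$(mktemp) || return 2
  column_values "$1" ',' Run | LC_ALL=C sort -u > "$ncbi_list"
  column_values "$2" '\t' run_accession | LC_ALL=C sort -u > "$ena_list"
  # comm -3 hides shared lines: column 1 is NCBI only, column 2 (indented) is ENA only.
  only=$(LC_ALL=C comm -3 "$ncbi_list" "$ena_list")
  rm -f "$ncbi_list" "$ena_list"
  if [ -z "$only" ]; then
    echo "Run accessions match"
  else
    printf '%s\n' "$only"
    return 1
  fi
}

# Usage, once both files exist:
#   compare_runs data/metadata/runinfo-prjna477349.csv data/metadata/ena-prjna477349.tsv
```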

Repository	Primary Strength
NCBI	Metadata discovery, accession relationships, and repository navigation
ENA	Download metadata, FASTQ locations, file sizes, and checksums
DDBJ	Alternative INSDC access point

Repository Selection Strategy

A practical metadata acquisition workflow often combines NCBI and ENA.

Study Discovery
      ↓
NCBI
      ↓
Metadata Validation
      ↓
ENA
      ↓
Download Planning
      ↓
Data Download

In this workflow:

NCBI is used to discover studies and retrieve metadata.
ENA is used to obtain download-oriented metadata such as FASTQ locations, file sizes, and MD5 checksums.
DDBJ serves as an additional access point within the INSDC ecosystem when needed.
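
As an illustration of download planning, file sizes from an ENA report can be summed to estimate the storage a project will need. The sketch assumes a tab-separated report that includes a fastq_bytes column, with semicolons separating the sizes of multiple files per run. The helper name total_fastq_bytes and the demonstration values are made up.

```shell
# Sum the fastq_bytes column of an ENA file report read from stdin.
# Runs with several FASTQ files list one size per file, separated by semicolons.
total_fastq_bytes() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "fastq_bytes") c = i; next }
    c {
      n = split($c, part, ";")
      for (j = 1; j <= n; j++) total += part[j]
    }
    END { printf "%.0f\n", total }
  '
}

# Demonstration on made-up values: one paired-end run and one single-file run.
printf 'run_accession\tfastq_bytes\nRUN_A\t100;200\nRUN_B\t50\n' | total_fastq_bytes
```

If the report lacks a fastq_bytes column, the function prints 0, so a zero total is a cue to check which fields were requested.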

Metadata Acquisition Outputs

The workflows presented in this chapter generate reusable metadata assets that support downstream data acquisition activities.

data/
└── metadata/
    ├── runinfo-prjna477349.csv
    ├── ena-prjna477349.tsv
    └── srr-accessions.txt

File	Purpose
`runinfo-prjna477349.csv`	Run-level metadata retrieved from NCBI SRA
`ena-prjna477349.tsv`	Download-oriented metadata retrieved from ENA
`srr-accessions.txt`	Download manifest containing sequencing run accessions

Together, these assets provide the foundation for reproducible data download and dataset assembly workflows.

Summary

In this chapter, we transformed a study accession into validated metadata assets suitable for downstream analysis.

Study Accession
      ↓
Metadata Discovery
      ↓
Metadata Retrieval
      ↓
Metadata Validation
      ↓
Metadata Evaluation
      ↓
Download Manifest

Along the way, we explored:

Metadata categories and sources
Metadata completeness and quality
Inclusion and exclusion criteria
NCBI metadata retrieval using EDirect
ENA metadata retrieval using the ENA API
Validation and manifest generation
Metadata as a scientific asset

The resulting metadata assets provide the information needed to identify relevant samples, plan downloads, and support reproducible dataset construction.

Looking Ahead

Metadata acquisition determines what data should be downloaded.

The next stage of the Data Acquisition System focuses on retrieving the data themselves.

In Chapter 05, we use the download manifest generated here to acquire sequencing data from public repositories, validate downloaded files, and prepare them for downstream analysis.

Metadata Acquisition
      ↓
Download Manifest
      ↓
Data Download
      ↓
FASTQ Files