Python for Genomics Sequencing — Core Concepts
The genomics data explosion
A single Illumina NovaSeq run produces up to 6 terabytes of raw data. Oxford Nanopore’s PromethION generates reads exceeding 100,000 bases. The UK Biobank has sequenced 500,000 genomes. This scale demands efficient computational tools, and Python sits at the center of most genomics workflows — either as the primary analysis language or as the glue connecting specialized tools.
Sequencing technologies
| Technology | Read length | Error rate | Throughput | Use case |
|---|---|---|---|---|
| Illumina (short-read) | 150-300 bp | 0.1% | Very high | Standard WGS, exome, RNA-seq |
| Oxford Nanopore | 10k-1M+ bp | 2-5% | Moderate | Structural variants, real-time |
| PacBio HiFi | 10-25k bp | <1% | Moderate | High-quality long reads |
| Element Biosciences | 150-300 bp | 0.1% | High | Cost-competitive short reads |
Each technology produces different file formats and requires different processing strategies.
Core file formats
FASTQ — raw reads
@read_001
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIII
Each read has four lines: header, sequence, separator, and quality scores (ASCII-encoded Phred scores where I = quality 40, meaning 99.99% accuracy).
BAM/CRAM — aligned reads
After alignment to a reference genome, reads are stored in BAM (binary) or CRAM (compressed) format. Python’s pysam library reads both:
import pysam
bam = pysam.AlignmentFile("sample.bam", "rb")
# Count reads aligned to chromosome 1, positions 1M-2M
count = 0
for read in bam.fetch("chr1", 1_000_000, 2_000_000):
if not read.is_unmapped and read.mapping_quality >= 30:
count += 1
print(f"High-quality reads in region: {count}")
VCF — variants
Variant Call Format stores genetic differences found in a sample:
from cyvcf2 import VCF
vcf = VCF("sample.vcf.gz")
for variant in vcf:
if variant.CHROM == "chr17" and variant.QUAL > 30:
print(f"{variant.CHROM}:{variant.POS} {variant.REF}→{variant.ALT[0]} "
f"Quality={variant.QUAL:.0f}")
The standard genomics pipeline
Step 1: Quality control
Before analysis, check read quality with FastQC and trim low-quality bases:
import subprocess
# Quality check
subprocess.run(["fastqc", "reads_R1.fastq.gz", "reads_R2.fastq.gz"])
# Trimming with fastp (Python wrapper)
subprocess.run([
"fastp",
"-i", "reads_R1.fastq.gz",
"-I", "reads_R2.fastq.gz",
"-o", "trimmed_R1.fastq.gz",
"-O", "trimmed_R2.fastq.gz",
"--qualified_quality_phred", "20",
"--length_required", "50",
])
Step 2: Alignment
Map reads to a reference genome. BWA-MEM2 is the standard for short reads:
# BWA-MEM2 alignment
subprocess.run([
"bwa-mem2", "mem",
"-t", "16", # threads
"-R", "@RG\\tID:sample\\tSM:sample\\tPL:ILLUMINA",
"reference.fa",
"trimmed_R1.fastq.gz",
"trimmed_R2.fastq.gz",
], stdout=open("aligned.sam", "w"))
# Sort and index with samtools
subprocess.run(["samtools", "sort", "-@", "8", "-o", "sorted.bam", "aligned.sam"])
subprocess.run(["samtools", "index", "sorted.bam"])
Step 3: Variant calling
Identify where the sample’s DNA differs from the reference:
# GATK HaplotypeCaller (gold standard for germline variants)
subprocess.run([
"gatk", "HaplotypeCaller",
"-R", "reference.fa",
"-I", "sorted.bam",
"-O", "variants.g.vcf.gz",
"-ERC", "GVCF",
])
For Python-native variant calling, DeepVariant (Google) uses a CNN trained on pileup images and is invoked through Docker or Python scripts.
Step 4: Annotation
Add biological meaning to raw variants:
# Using pyVEP or subprocess to Ensembl VEP
# Common annotations: gene name, consequence (missense, frameshift),
# population frequency (gnomAD), pathogenicity (ClinVar)
import pandas as pd
annotated = pd.read_csv("annotated_variants.tsv", sep="\t")
pathogenic = annotated[
(annotated["ClinVar_CLNSIG"].str.contains("Pathogenic", na=False)) &
(annotated["gnomAD_AF"] < 0.01) # Rare variants
]
print(f"Rare pathogenic variants: {len(pathogenic)}")
Python-native genomics libraries
| Library | Purpose |
|---|---|
pysam | Read BAM/CRAM/VCF files |
cyvcf2 | Fast VCF parsing (Cython-backed) |
pyfaidx | Random access to FASTA reference genomes |
pybedtools | Genomic interval operations |
scikit-allel | Population genetics, allele frequency analysis |
pygenomics | General genomics utilities |
Common misconception
“Sequencing a genome gives you all the answers.” Sequencing provides raw data — billions of letters. Interpreting what those letters mean is the hard part. Most of the genome (>98%) does not code for proteins. Many variants are “variants of uncertain significance” (VUS) — we simply do not know their effect yet. The gap between data generation and clinical interpretation is where most of the challenge lies.
Real-world applications
- 23andMe and Ancestry use Python pipelines to genotype millions of customers, comparing SNP arrays against reference panels to report ancestry and health risk.
- Genomics England’s 100,000 Genomes Project used Python-based pipelines to sequence 100,000 NHS patients, identifying diagnostic variants for rare diseases and cancer.
- The Cancer Genome Atlas (TCGA) characterized 33 cancer types using Python workflows for somatic mutation calling, copy number analysis, and gene expression profiling.
The one thing to remember: Python orchestrates the entire genomics pipeline — from raw FASTQ quality control through alignment, variant calling, and clinical annotation — serving as both the glue connecting specialized tools and the primary language for downstream analysis and interpretation.
See Also
- Python Biopython Bioinformatics How Python helps scientists read the instruction manual hidden inside every living thing's DNA.
- Python Clinical Trial Analysis How Python helps scientists figure out whether a new medicine actually works by crunching the numbers from clinical trials.
- Python Drug Interaction Modeling How Python helps scientists figure out which medicines are safe to take together and which combinations could be dangerous.
- Python Medical Image Analysis How Python helps doctors see inside your body more clearly by teaching computers to read X-rays, MRIs, and CT scans.
- Python Pandemic Modeling How Python helps scientists predict the spread of diseases like COVID-19 and plan the best ways to slow them down.