Python for Genomics Sequencing — Core Concepts

The genomics data explosion

A single Illumina NovaSeq run produces up to 6 terabytes of raw data. Oxford Nanopore’s PromethION generates reads exceeding 100,000 bases. The UK Biobank has sequenced 500,000 genomes. This scale demands efficient computational tools, and Python sits at the center of most genomics workflows — either as the primary analysis language or as the glue connecting specialized tools.

Sequencing technologies

TechnologyRead lengthError rateThroughputUse case
Illumina (short-read)150-300 bp0.1%Very highStandard WGS, exome, RNA-seq
Oxford Nanopore10k-1M+ bp2-5%ModerateStructural variants, real-time
PacBio HiFi10-25k bp<1%ModerateHigh-quality long reads
Element Biosciences150-300 bp0.1%HighCost-competitive short reads

Each technology produces different file formats and requires different processing strategies.

Core file formats

FASTQ — raw reads

@read_001
ATCGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIII

Each read has four lines: header, sequence, separator, and quality scores (ASCII-encoded Phred scores where I = quality 40, meaning 99.99% accuracy).

BAM/CRAM — aligned reads

After alignment to a reference genome, reads are stored in BAM (binary) or CRAM (compressed) format. Python’s pysam library reads both:

import pysam

bam = pysam.AlignmentFile("sample.bam", "rb")

# Count reads aligned to chromosome 1, positions 1M-2M
count = 0
for read in bam.fetch("chr1", 1_000_000, 2_000_000):
    if not read.is_unmapped and read.mapping_quality >= 30:
        count += 1
print(f"High-quality reads in region: {count}")

VCF — variants

Variant Call Format stores genetic differences found in a sample:

from cyvcf2 import VCF

vcf = VCF("sample.vcf.gz")
for variant in vcf:
    if variant.CHROM == "chr17" and variant.QUAL > 30:
        print(f"{variant.CHROM}:{variant.POS} {variant.REF}{variant.ALT[0]} "
              f"Quality={variant.QUAL:.0f}")

The standard genomics pipeline

Step 1: Quality control

Before analysis, check read quality with FastQC and trim low-quality bases:

import subprocess

# Quality check
subprocess.run(["fastqc", "reads_R1.fastq.gz", "reads_R2.fastq.gz"])

# Trimming with fastp (Python wrapper)
subprocess.run([
    "fastp",
    "-i", "reads_R1.fastq.gz",
    "-I", "reads_R2.fastq.gz",
    "-o", "trimmed_R1.fastq.gz",
    "-O", "trimmed_R2.fastq.gz",
    "--qualified_quality_phred", "20",
    "--length_required", "50",
])

Step 2: Alignment

Map reads to a reference genome. BWA-MEM2 is the standard for short reads:

# BWA-MEM2 alignment
subprocess.run([
    "bwa-mem2", "mem",
    "-t", "16",           # threads
    "-R", "@RG\\tID:sample\\tSM:sample\\tPL:ILLUMINA",
    "reference.fa",
    "trimmed_R1.fastq.gz",
    "trimmed_R2.fastq.gz",
], stdout=open("aligned.sam", "w"))

# Sort and index with samtools
subprocess.run(["samtools", "sort", "-@", "8", "-o", "sorted.bam", "aligned.sam"])
subprocess.run(["samtools", "index", "sorted.bam"])

Step 3: Variant calling

Identify where the sample’s DNA differs from the reference:

# GATK HaplotypeCaller (gold standard for germline variants)
subprocess.run([
    "gatk", "HaplotypeCaller",
    "-R", "reference.fa",
    "-I", "sorted.bam",
    "-O", "variants.g.vcf.gz",
    "-ERC", "GVCF",
])

For Python-native variant calling, DeepVariant (Google) uses a CNN trained on pileup images and is invoked through Docker or Python scripts.

Step 4: Annotation

Add biological meaning to raw variants:

# Using pyVEP or subprocess to Ensembl VEP
# Common annotations: gene name, consequence (missense, frameshift),
# population frequency (gnomAD), pathogenicity (ClinVar)

import pandas as pd

annotated = pd.read_csv("annotated_variants.tsv", sep="\t")
pathogenic = annotated[
    (annotated["ClinVar_CLNSIG"].str.contains("Pathogenic", na=False)) &
    (annotated["gnomAD_AF"] < 0.01)  # Rare variants
]
print(f"Rare pathogenic variants: {len(pathogenic)}")

Python-native genomics libraries

LibraryPurpose
pysamRead BAM/CRAM/VCF files
cyvcf2Fast VCF parsing (Cython-backed)
pyfaidxRandom access to FASTA reference genomes
pybedtoolsGenomic interval operations
scikit-allelPopulation genetics, allele frequency analysis
pygenomicsGeneral genomics utilities

Common misconception

“Sequencing a genome gives you all the answers.” Sequencing provides raw data — billions of letters. Interpreting what those letters mean is the hard part. Most of the genome (>98%) does not code for proteins. Many variants are “variants of uncertain significance” (VUS) — we simply do not know their effect yet. The gap between data generation and clinical interpretation is where most of the challenge lies.

Real-world applications

  • 23andMe and Ancestry use Python pipelines to genotype millions of customers, comparing SNP arrays against reference panels to report ancestry and health risk.
  • Genomics England’s 100,000 Genomes Project used Python-based pipelines to sequence 100,000 NHS patients, identifying diagnostic variants for rare diseases and cancer.
  • The Cancer Genome Atlas (TCGA) characterized 33 cancer types using Python workflows for somatic mutation calling, copy number analysis, and gene expression profiling.

The one thing to remember: Python orchestrates the entire genomics pipeline — from raw FASTQ quality control through alignment, variant calling, and clinical annotation — serving as both the glue connecting specialized tools and the primary language for downstream analysis and interpretation.

pythongenomicsbioinformatics

See Also