Python for Electronic Health Records — Core Concepts

Learn how Python extracts, transforms, and analyzes electronic health record data for clinical research and predictive modeling.

What EHR data looks like

Electronic health records are not a single tidy spreadsheet. A typical EHR system like Epic or Cerner stores data across hundreds of tables covering:

Demographics — age, sex, race, insurance
Encounters — hospital visits, admissions, discharges
Diagnoses — coded with ICD-10 (International Classification of Diseases)
Procedures — coded with CPT (Current Procedural Terminology)
Medications — prescriptions, administration records, pharmacy data
Lab results — blood tests, imaging reports, pathology
Clinical notes — free-text physician notes, nursing assessments, discharge summaries
Vital signs — heart rate, blood pressure, temperature, oxygen saturation

The sheer variety of data types — structured codes, numerical measurements, timestamps, and unstructured text — makes EHR analysis a multi-modal data engineering problem.

Common data models

Different hospitals use different EHR vendors, making cross-institution research difficult. Common data models standardize the schema:

Model	Used by	Key feature
OMOP CDM	OHDSI network (3+ billion patient records)	Standardized vocabularies, analytics tools
PCORnet CDM	Patient-Centered Outcomes Research	Focus on comparative effectiveness
i2b2	Academic medical centers	Star-schema for fast queries
FHIR	Modern health apps	Interoperability standard with a REST API for real-time access (an exchange format rather than a research schema)

OMOP (Observational Medical Outcomes Partnership) is the most widely adopted for research. It maps local hospital codes to standard concepts, so “aspirin” at Hospital A and “ASA 325mg” at Hospital B both become the same OMOP concept ID.
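The mapping idea can be sketched with a plain lookup table. This is only an illustration: the site names and the `LOCAL_TO_STANDARD` dictionary are hypothetical, the aspirin concept ID is an assumption to verify against your own vocabulary, and real OMOP pipelines resolve codes through the vocabulary tables rather than a hand-written dict.

```python
# Hypothetical local-code lookup; real mappings live in the OMOP vocabulary tables.
ASPIRIN_CONCEPT_ID = 1112807  # assumed ingredient-level concept ID; verify before use

LOCAL_TO_STANDARD = {
    ("hospital_a", "aspirin"): ASPIRIN_CONCEPT_ID,
    ("hospital_b", "ASA 325mg"): ASPIRIN_CONCEPT_ID,
}


def to_standard_concept(site: str, local_code: str) -> int:
    """Return the standard concept ID, or 0 when the local code is unmapped."""
    return LOCAL_TO_STANDARD.get((site, local_code), 0)


records = [
    {"site": "hospital_a", "drug": "aspirin"},
    {"site": "hospital_b", "drug": "ASA 325mg"},
    {"site": "hospital_b", "drug": "unknown-local-code"},
]
concept_ids = [to_standard_concept(r["site"], r["drug"]) for r in records]
print(concept_ids)
```

Unmapped codes fall back to 0 here, following the OMOP convention of using concept ID 0 when no standard concept matches.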

Python’s EHR ecosystem

Data extraction and transformation

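import pandas as pd
import sqlalchemy

# Connect to the OMOP database (placeholder URL; load real credentials from
# configuration or the environment rather than hard-coding them)
engine = sqlalchemy.create_engine("postgresql://user:pass@omop-server/cdm")

# Extract HbA1c results for patients with a type 2 diabetes diagnosis.
# EXISTS avoids repeating each lab row once per matching diagnosis record.
query = """
SELECT p.person_id, p.year_of_birth, p.gender_concept_id,
       m.measurement_date, m.value_as_number, c.concept_name
FROM person p
JOIN measurement m ON p.person_id = m.person_id
JOIN concept c ON m.measurement_concept_id = c.concept_id
WHERE EXISTS (
        SELECT 1 FROM condition_occurrence co
        WHERE co.person_id = p.person_id
          AND co.condition_concept_id = 201826  -- Type 2 diabetes (exact concept only, no descendants)
      )
  AND c.concept_name = 'Hemoglobin A1c'  -- illustrative; match the name or ID used in your vocabulary
ORDER BY p.person_id, m.measurement_date
"""
df = pd.read_sql(query, engine)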
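Once the result is in a DataFrame, a typical next step is reducing repeated measurements to one row per patient. The sketch below assumes the column names from the query above and uses a small made-up frame in place of the database result, so the values are purely illustrative.

```python
import pandas as pd

# Stand-in for the query result; all values are invented for illustration.
df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2],
    "year_of_birth": [1960, 1960, 1975, 1975, 1975],
    "measurement_date": pd.to_datetime(
        ["2021-01-10", "2021-07-15", "2020-03-01", "2020-09-01", "2021-03-01"]
    ),
    "value_as_number": [8.1, 7.2, 6.9, None, 6.4],
})

# Drop rows with no numeric result, then keep each patient's most recent value.
latest = (
    df.dropna(subset=["value_as_number"])
      .sort_values(["person_id", "measurement_date"])
      .groupby("person_id", as_index=False)
      .last()
)

# Approximate age in whole years, since OMOP's person table stores year_of_birth.
latest["age_at_measurement"] = (
    latest["measurement_date"].dt.year - latest["year_of_birth"]
)
print(latest)
```

Keeping the most recent value is one choice among several. Depending on the study, the first value after diagnosis or a mean over a window may be more appropriate.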

Clinical NLP — extracting information from notes

An often-cited estimate is that up to 80% of clinically relevant information lives in unstructured text. Python NLP tools extract structured data from clinical notes:

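import medspacy
from medspacy.ner import TargetRule

# The default pipeline includes a rule-based target matcher and ConText but
# no target rules, so doc.ents stays empty until rules (or an NER model) are added.
nlp = medspacy.load()
target_matcher = nlp.get_pipe("medspacy_target_matcher")  # pipe name in medspaCy 1.x
target_matcher.add([
    TargetRule("chest pain", "PROBLEM"),
    TargetRule("hypertension", "PROBLEM"),
    TargetRule("diabetes mellitus type 2", "PROBLEM"),
    TargetRule("coronary artery disease", "PROBLEM"),
    TargetRule("shortness of breath", "PROBLEM"),
])

text = """
Patient presents with chest pain radiating to left arm.
History of hypertension and diabetes mellitus type 2.
No family history of coronary artery disease.
Denies shortness of breath.
"""

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:40s} | {ent.label_:15s} | Negated: {ent._.is_negated}")

Illustrative output (exact spans and flags depend on the target rules and the medspaCy version):

chest pain                               | PROBLEM         | Negated: False
hypertension                             | PROBLEM         | Negated: False
diabetes mellitus type 2                 | PROBLEM         | Negated: False
coronary artery disease                  | PROBLEM         | Negated: True
shortness of breath                      | PROBLEM         | Negated: True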

Negation detection is critical — “denies chest pain” has the opposite meaning of “reports chest pain.”
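To make the idea concrete, here is a deliberately simplified, NegEx-style check in plain Python. It is not medspaCy's ConText algorithm: the trigger list and the `is_negated` helper are hypothetical, and the rule (any trigger word earlier in the same sentence) ignores scope terminators, double negation, and most real-world phrasing.

```python
import re

NEGATION_TRIGGERS = ("no", "denies", "without")  # tiny illustrative trigger list


def is_negated(sentence: str, target: str) -> bool:
    """True if a negation trigger appears before `target` in the same sentence."""
    lowered = sentence.lower()
    position = lowered.find(target.lower())
    if position == -1:
        raise ValueError(f"{target!r} not found in sentence")
    # Compare whole words so that "with" does not match "without".
    preceding_words = re.findall(r"[a-z]+", lowered[:position])
    return any(word in NEGATION_TRIGGERS for word in preceding_words)


examples = [
    ("Denies shortness of breath.", "shortness of breath"),
    ("Patient presents with chest pain radiating to left arm.", "chest pain"),
    ("No family history of coronary artery disease.", "coronary artery disease"),
]
results = [is_negated(sentence, target) for sentence, target in examples]
print(results)
```

Production systems add scope limits and many more triggers, which is why a maintained library is usually preferable to hand-rolled rules.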

De-identification

Sharing EHR data for research generally requires removing protected health information (PHI) to meet HIPAA de-identification standards. Python tools help automate this:

Presidio (Microsoft) — detects and anonymizes names, dates, medical record numbers, addresses
philter — rule-based PHI filter designed specifically for clinical notes
Date shifting — shifts all dates for a patient by a random offset, preserving time intervals
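Date shifting is simple enough to sketch directly in pandas. The `shift_dates` helper, its parameters, and the column names below are hypothetical, and the sketch assumes the date columns are already datetime-typed. Each patient gets one random offset that is applied to all of that patient's dates.

```python
import random

import pandas as pd


def shift_dates(df, id_col, date_cols, max_days=365, seed=None):
    """Shift every date for a patient by a single per-patient random offset."""
    rng = random.Random(seed)
    # One offset (in days) per patient, drawn from [-max_days, max_days].
    offset_days = {
        pid: rng.randint(-max_days, max_days) for pid in df[id_col].unique()
    }
    per_row_offset = pd.to_timedelta(df[id_col].map(offset_days), unit="D")
    shifted = df.copy()
    for col in date_cols:
        shifted[col] = shifted[col] + per_row_offset
    return shifted


visits = pd.DataFrame({
    "person_id": [1, 1, 2],
    "admit_date": pd.to_datetime(["2021-01-01", "2021-03-01", "2021-01-01"]),
    "discharge_date": pd.to_datetime(["2021-01-05", "2021-03-04", "2021-01-02"]),
})
shifted = shift_dates(visits, "person_id", ["admit_date", "discharge_date"], seed=0)
print(shifted)
```

Because the offset is constant within a patient, lengths of stay and gaps between visits are unchanged. The offsets (or the seed that generates them) are themselves sensitive and would need to be stored securely or discarded.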

Predictive modeling on EHR data

Readmission prediction

Hospital readmissions within 30 days are often estimated to cost Medicare roughly $26 billion annually. Models that predict readmission risk help target interventions.

Common features include the number of prior admissions, length of stay, number of medications, lab values at discharge, and comorbidity scores (Charlson, Elixhauser).
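As a rough sketch of how the outcome label and two of these features might be derived, the example below works from a made-up admissions table. The column names are assumptions, and real definitions usually add exclusions (planned readmissions, transfers, deaths) that are omitted here.

```python
import pandas as pd

# Invented admissions for two patients, sorted by patient and admission date.
admissions = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "admit_date": pd.to_datetime(
        ["2021-01-01", "2021-01-20", "2021-06-01", "2021-02-01"]
    ),
    "discharge_date": pd.to_datetime(
        ["2021-01-05", "2021-01-25", "2021-06-03", "2021-02-10"]
    ),
}).sort_values(["person_id", "admit_date"])

# Next admission date for the same patient (NaT when there is none).
next_admit = admissions.groupby("person_id")["admit_date"].shift(-1)
days_to_next = (next_admit - admissions["discharge_date"]).dt.days

admissions["length_of_stay"] = (
    admissions["discharge_date"] - admissions["admit_date"]
).dt.days
admissions["prior_admissions"] = admissions.groupby("person_id").cumcount()
# Missing next admissions compare as False, i.e. no observed readmission.
admissions["readmit_30d"] = days_to_next.between(0, 30)
print(admissions)
```

Note that "no observed readmission" is not the same as "no readmission": a patient readmitted at a different hospital would still be labeled False here.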

Sepsis early warning

Sepsis kills an estimated 270,000 or more Americans per year. Models that flag sepsis 4 to 6 hours before clinical recognition can give clinicians time to treat earlier. Features include vital sign trends, lab values (lactate, white blood cell count), and medication patterns.
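Vital sign trends are usually turned into features with rolling windows. The sketch below assumes hourly heart-rate readings for a single patient. The column names and the 3-hour window are illustrative choices, not a validated feature set.

```python
import pandas as pd

# Invented hourly heart-rate readings for one patient.
vitals = pd.DataFrame({
    "charttime": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00",
        "2021-01-01 03:00", "2021-01-01 04:00", "2021-01-01 05:00",
    ]),
    "heart_rate": [80, 82, 85, 95, 110, 118],
}).set_index("charttime")

features = pd.DataFrame({
    # Time-based windows cover the 3 hours up to and including each reading.
    "hr_mean_3h": vitals["heart_rate"].rolling("3h").mean(),
    "hr_max_3h": vitals["heart_rate"].rolling("3h").max(),
    # Row-based difference; equals a 3-hour change only for regular hourly data.
    "hr_change_3h": vitals["heart_rate"].diff(periods=3),
})
print(features)
```

Features at each timestamp use only past and current readings, which matters for early-warning models: anything computed from future values would leak the outcome.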

Disease phenotyping

Identifying cohorts of patients with specific conditions requires combining diagnosis codes, medications, labs, and notes. A type 2 diabetes phenotype might require an ICD-10 code of E11.x, an HbA1c of 6.5% or higher, and a metformin prescription. Python pipelines automate these definitions across millions of records.
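A rule-based phenotype like this reduces to set intersections over patient IDs. The tables and column names below are made up for illustration. The sketch uses `>= 6.5` for HbA1c, the usual diagnostic cutoff, and requires all three criteria.

```python
import pandas as pd

# Invented source tables; a real pipeline would query these from the database.
diagnoses = pd.DataFrame({"person_id": [1, 2, 3], "icd10": ["E11.9", "E11.65", "I10"]})
labs = pd.DataFrame({"person_id": [1, 2, 3], "hba1c": [7.4, 6.1, 6.8]})
meds = pd.DataFrame({"person_id": [1, 2], "drug": ["metformin", "Metformin"]})

# One set of patient IDs per criterion.
has_code = set(diagnoses.loc[diagnoses["icd10"].str.startswith("E11"), "person_id"])
has_lab = set(labs.loc[labs["hba1c"] >= 6.5, "person_id"])
has_med = set(meds.loc[meds["drug"].str.lower() == "metformin", "person_id"])

# Patients meeting all three criteria.
cohort = sorted(int(pid) for pid in has_code & has_lab & has_med)
print(cohort)
```

Keeping each criterion as its own set makes it easy to report how many patients each rule excludes, or to relax the definition (for example, any two of three) without rewriting the logic.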

Common misconception

“EHR data is clean and ready for analysis.” EHR data is collected for billing and clinical care, not research. It contains missing values (a lab not ordered is not the same as a normal result), inconsistent coding (the same condition coded differently by different providers), and loss to follow-up (patients who leave the health system disappear from the data, which can bias outcomes). Careful data cleaning and domain expertise are essential before any analysis.
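The point about missing labs can be shown in a few lines. This sketch, with an invented lactate column, records whether the test was ordered at all instead of filling the gap with a "normal" value.

```python
import pandas as pd

# Invented lactate results; patient 2 never had the test ordered.
labs = pd.DataFrame({"person_id": [1, 2, 3], "lactate": [1.1, None, 4.2]})

# Keep the fact that a test was (or was not) performed as its own feature.
labs["lactate_measured"] = labs["lactate"].notna()

# The missing value is deliberately left as NaN rather than imputed as normal.
print(labs)
```

Whether a test was ordered often reflects clinical suspicion, so the indicator can carry signal of its own.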

Real-world applications

MIMIC-IV, a de-identified dataset from Beth Israel Deaconess Medical Center that is free to credentialed researchers and covers hundreds of thousands of hospital admissions and tens of thousands of ICU stays, has spawned thousands of Python-based research papers on mortality prediction, treatment effectiveness, and clinical decision support.
Epic’s Sepsis Model (deployed in 200+ hospitals) uses EHR data including vital signs, labs, and nursing assessments to alert clinicians to deteriorating patients.
OHDSI’s ATLAS platform, backed by Python and R analytics, enables researchers to run the same analysis across 300+ hospitals worldwide without moving patient data.

The one thing to remember: Python transforms messy, multi-modal EHR data into research-ready datasets through standardized data models like OMOP, clinical NLP for unstructured notes, and careful de-identification — but the biggest challenge is not the code, it is understanding the clinical context that generated the data.
