Python for Electronic Health Records — Core Concepts
What EHR data looks like
Electronic health records are not a single tidy spreadsheet. A typical EHR system such as Epic or Cerner (now Oracle Health) stores data across hundreds of tables covering:
- Demographics — age, sex, race, insurance
- Encounters — hospital visits, admissions, discharges
- Diagnoses — coded with ICD-10 (International Classification of Diseases)
- Procedures — coded with CPT (Current Procedural Terminology)
- Medications — prescriptions, administration records, pharmacy data
- Lab results — blood tests, imaging reports, pathology
- Clinical notes — free-text physician notes, nursing assessments, discharge summaries
- Vital signs — heart rate, blood pressure, temperature, oxygen saturation
The sheer variety of data types — structured codes, numerical measurements, timestamps, and unstructured text — makes EHR analysis a multi-modal data engineering problem.
Common data models
Different hospitals use different EHR vendors, making cross-institution research difficult. Common data models standardize the schema:
| Model | Used by | Key feature |
|---|---|---|
| OMOP CDM | OHDSI network (3+ billion patient records) | Standardized vocabularies, analytics tools |
| PCORnet CDM | Patient-Centered Outcomes Research | Focus on comparative effectiveness |
| i2b2 | Academic medical centers | Star-schema for fast queries |
| FHIR | Modern health apps and EHR vendor APIs | Exchange standard (rather than a research schema) with a REST API for real-time access |
OMOP (Observational Medical Outcomes Partnership) is the most widely adopted for research. It maps local hospital codes to standard concepts, so “aspirin” at Hospital A and “ASA 325mg” at Hospital B both resolve to standard concepts that roll up to the same aspirin ingredient concept ID.
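The mapping idea can be sketched with a plain lookup table. This is only an illustration: the site names, local codes, and the concept IDs 9000001 and 9000002 are made-up placeholders, not real vocabulary entries. The one convention taken from OMOP itself is that concept ID 0 stands for "no matching concept".

```python
# Hypothetical local-to-standard mapping. Site names, local codes, and
# concept IDs are placeholders, not real OMOP vocabulary entries.
SOURCE_TO_CONCEPT = {
    ("hospital_a", "aspirin"): 9000001,
    ("hospital_b", "ASA 325mg"): 9000001,
    ("hospital_b", "APAP 500mg"): 9000002,
}


def to_standard_concept(site: str, local_code: str) -> int:
    """Return the standard concept ID for a site-specific code, or 0 if unmapped."""
    return SOURCE_TO_CONCEPT.get((site, local_code), 0)


records = [
    {"site": "hospital_a", "drug": "aspirin"},
    {"site": "hospital_b", "drug": "ASA 325mg"},
    {"site": "hospital_b", "drug": "unknown-local-code"},
]

# Attach a standard concept ID to every record so that downstream queries
# can group by concept instead of by each hospital's local spelling.
mapped = [
    {**r, "drug_concept_id": to_standard_concept(r["site"], r["drug"])}
    for r in records
]
```

In a real OMOP instance this lookup lives in the vocabulary tables rather than in application code; the dictionary only shows the shape of the transformation.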
Python’s EHR ecosystem
Data extraction and transformation
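import pandas as pd
import sqlalchemy

# Connect to the OMOP database. The URL is a placeholder; load real
# credentials from configuration or the environment, not from source code.
engine = sqlalchemy.create_engine("postgresql://user:pass@omop-server/cdm")

# HbA1c measurements for patients with a type 2 diabetes condition record.
# EXISTS is used instead of a JOIN on condition_occurrence so that each
# measurement is not repeated once per matching condition row.
query = """
SELECT p.person_id, p.year_of_birth, p.gender_concept_id,
       m.measurement_date, m.value_as_number, c.concept_name
FROM person p
JOIN measurement m ON p.person_id = m.person_id
JOIN concept c ON m.measurement_concept_id = c.concept_id
WHERE EXISTS (
    SELECT 1
    FROM condition_occurrence co
    WHERE co.person_id = p.person_id
      AND co.condition_concept_id = 201826  -- Type 2 diabetes
)
  -- Illustrative name filter: standard concept names vary by vocabulary,
  -- so production queries usually match on concept IDs instead.
  AND c.concept_name = 'Hemoglobin A1c'
ORDER BY p.person_id, m.measurement_date
"""
df = pd.read_sql(query, engine)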
Clinical NLP — extracting information from notes
By a commonly cited estimate, up to 80% of clinically relevant information lives in unstructured text. Python NLP tools extract structured data from clinical notes:
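import medspacy
from medspacy.ner import TargetRule

nlp = medspacy.load()

# The default pipeline ships without entity rules, so register the concepts
# to look for. These rules are illustrative, not a clinical vocabulary.
terms = ["chest pain", "hypertension", "diabetes mellitus type 2",
         "coronary artery disease", "shortness of breath"]
nlp.get_pipe("medspacy_target_matcher").add([TargetRule(t, "PROBLEM") for t in terms])

text = """
Patient presents with chest pain radiating to left arm.
History of hypertension and diabetes mellitus type 2.
No family history of coronary artery disease.
Denies shortness of breath.
"""
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text} | {ent.label_} | Negated: {ent._.is_negated}")
Output (approximately, given these rules and medspaCy’s default ConText rules):
chest pain | PROBLEM | Negated: False
hypertension | PROBLEM | Negated: False
diabetes mellitus type 2 | PROBLEM | Negated: False
coronary artery disease | PROBLEM | Negated: True
shortness of breath | PROBLEM | Negated: True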
Negation detection is critical — “denies chest pain” has the opposite meaning of “reports chest pain.”
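To make the idea concrete, here is a deliberately tiny sketch of trigger-based negation in plain Python. It is not how medspaCy implements ConText; the trigger list and the rule "a trigger anywhere before the term in the same sentence negates it" are simplifying assumptions chosen for illustration.

```python
import re

# Illustrative trigger phrases; real systems use much larger curated lists
# plus rules that end a negation's scope (for example at "but").
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")
_TRIGGER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in NEGATION_TRIGGERS) + r")\b"
)


def is_negated(sentence: str, term: str) -> bool:
    """Return True if a negation trigger appears before `term` in the sentence."""
    lowered = sentence.lower()
    start = lowered.find(term.lower())
    if start == -1:
        return False  # term not mentioned at all
    return _TRIGGER_RE.search(lowered[:start]) is not None


sentences = [
    ("Patient presents with chest pain radiating to left arm.", "chest pain"),
    ("Denies shortness of breath.", "shortness of breath"),
]
results = [(term, is_negated(sentence, term)) for sentence, term in sentences]
```

A rule this simple cannot tell "no family history of X" (a statement about relatives) from "no X" (a statement about the patient), which is one reason dedicated clinical NLP libraries track several context types rather than negation alone.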
De-identification
Research on EHR data generally requires removing or masking protected health information (PHI) to meet HIPAA de-identification standards. Python tools automate much of this:
- Presidio (Microsoft) — detects and anonymizes names, dates, medical record numbers, addresses
- philter — rule-based PHI filter designed specifically for clinical notes
- Date shifting — shifts all dates for a patient by a random offset, preserving time intervals
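Date shifting is simple enough to sketch with the standard library. The function names and the one-year maximum shift below are assumptions made for this example; `random.Random` is used only so the demo is reproducible, and a real pipeline would draw offsets from a secure source and keep the per-patient mapping secret.

```python
import random
from datetime import date, timedelta


def assign_offsets(patient_ids, max_days=365, seed=None):
    """Draw one nonzero shift per patient, between -max_days and +max_days."""
    rng = random.Random(seed)
    return {
        pid: timedelta(days=rng.randint(1, max_days) * rng.choice((-1, 1)))
        for pid in sorted(patient_ids)
    }


def shift_dates(events, offsets):
    """Apply each patient's own offset to every (patient_id, date) event."""
    return [(pid, d + offsets[pid]) for pid, d in events]


events = [
    ("p1", date(2023, 1, 1)),   # admission
    ("p1", date(2023, 1, 5)),   # discharge, 4 days later
    ("p2", date(2023, 3, 10)),
]
offsets = assign_offsets({pid for pid, _ in events}, seed=42)
shifted = shift_dates(events, offsets)

# Within one patient the interval between events is unchanged.
stay_length = shifted[1][1] - shifted[0][1]
```

Because every date for a patient moves by the same amount, length of stay and time between visits survive, while calendar dates no longer match the real ones.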
Predictive modeling on EHR data
Readmission prediction
Hospital readmissions within 30 days are estimated to cost Medicare alone about $26 billion annually. Models that predict readmission risk help target interventions.
Common features include the number of prior admissions, length of stay, number of medications, lab values at discharge, and comorbidity scores (Charlson, Elixhauser).
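Before any model is trained, the label and the simplest features have to be derived from raw admissions. The sketch below does that for synthetic data in plain Python; the tuple layout, the field names, and the choice to count the gap from discharge to the next admission are assumptions for illustration, and real definitions usually also exclude planned readmissions and transfers.

```python
from datetime import date

# Synthetic admissions: (patient_id, admit_date, discharge_date).
admissions = [
    ("p1", date(2023, 1, 1), date(2023, 1, 5)),
    ("p1", date(2023, 1, 20), date(2023, 1, 22)),
    ("p1", date(2023, 6, 1), date(2023, 6, 3)),
    ("p2", date(2023, 3, 10), date(2023, 3, 12)),
]


def build_rows(admissions, window_days=30):
    """Build one row per admission with basic features and a readmission label."""
    by_patient = {}
    for pid, admit, discharge in sorted(admissions, key=lambda a: (a[0], a[1])):
        by_patient.setdefault(pid, []).append((admit, discharge))

    rows = []
    for pid, stays in by_patient.items():
        for i, (admit, discharge) in enumerate(stays):
            next_admit = stays[i + 1][0] if i + 1 < len(stays) else None
            readmitted = (
                next_admit is not None
                and (next_admit - discharge).days <= window_days
            )
            rows.append({
                "patient_id": pid,
                "admit_date": admit,
                "length_of_stay": (discharge - admit).days,
                "prior_admissions": i,  # earlier stays for the same patient
                "readmitted_30d": readmitted,
            })
    return rows


rows = build_rows(admissions)
```

Note that `prior_admissions` only counts stays that happened before the current one, which keeps information from the future out of the features.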
Sepsis early warning
Sepsis kills an estimated 270,000 or more Americans per year. Models that flag sepsis 4-6 hours before clinical recognition aim to give clinicians time to start treatment earlier. Features include vital sign trends, lab values (lactate, white blood cell count), and medication patterns.
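A "vital sign trend" usually means some summary of how a measurement is changing over a recent window. One simple option, shown here as a sketch, is the least-squares slope of heart rate over time. The function names, feature names, and input layout are assumptions for this example, not part of any deployed sepsis model.

```python
def slope_per_hour(hours, values):
    """Least-squares slope of `values` against time in hours (0.0 if undefined)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_t = sum(hours) / n
    mean_v = sum(values) / n
    denom = sum((t - mean_t) ** 2 for t in hours)
    if denom == 0:
        return 0.0  # all readings share one timestamp
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(hours, values))
    return num / denom


def sepsis_features(hr_readings, lactate_readings):
    """Summarize readings given as (hours_since_admission, value), sorted by time."""
    hr_hours = [t for t, _ in hr_readings]
    hr_values = [v for _, v in hr_readings]
    return {
        "hr_latest": hr_values[-1] if hr_values else None,
        "hr_slope_per_hour": slope_per_hour(hr_hours, hr_values),
        # None means "not measured", which is different from a normal value.
        "lactate_latest": lactate_readings[-1][1] if lactate_readings else None,
    }


heart_rate = [(0, 80), (1, 90), (2, 100), (3, 110)]
lactate = [(1, 1.8), (3, 3.2)]
features = sepsis_features(heart_rate, lactate)
```

Here the heart rate rises by 10 beats per minute each hour, so the slope feature captures a deterioration that the latest value alone would not show.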
Disease phenotyping
Identifying cohorts of patients with specific conditions requires combining diagnosis codes, medications, labs, and notes. A diabetes phenotype might require: ICD-10 code E11.x + HbA1c ≥ 6.5% + prescribed metformin. Python pipelines automate these definitions across millions of records.
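The example phenotype can be written as a small rule over a patient record. This sketch assumes a hypothetical `PatientRecord` structure and treats the HbA1c threshold as inclusive (6.5% or higher, the usual diagnostic cutoff); real phenotype definitions typically add time windows and alternative qualifying paths.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, flattened view of one patient's structured data."""
    patient_id: str
    icd10_codes: set = field(default_factory=set)
    hba1c_values: list = field(default_factory=list)  # percent
    medications: set = field(default_factory=set)


def has_t2dm_phenotype(patient: PatientRecord) -> bool:
    """Require an E11.x code AND an HbA1c of 6.5% or higher AND metformin."""
    has_code = any(code.startswith("E11") for code in patient.icd10_codes)
    has_lab = any(value >= 6.5 for value in patient.hba1c_values)
    has_drug = "metformin" in {m.lower() for m in patient.medications}
    return has_code and has_lab and has_drug


patients = [
    PatientRecord("p1", {"E11.9"}, [7.1, 6.8], {"Metformin"}),
    PatientRecord("p2", {"E11.9"}, [6.0], {"metformin"}),   # lab below threshold
    PatientRecord("p3", {"E10.9"}, [8.2], {"insulin"}),     # type 1 code, no metformin
]
cohort = [p.patient_id for p in patients if has_t2dm_phenotype(p)]
```

Requiring all three signals trades sensitivity for specificity: patients managed with other drugs, or whose labs were drawn elsewhere, are excluded.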
Common misconception
“EHR data is clean and ready for analysis.” EHR data is collected for billing and clinical care, not research. It contains missing values (a lab not ordered is not the same as a normal result), inconsistent coding (the same condition coded differently by different providers), and loss to follow-up (patients who leave the health system disappear from the data, which biases results toward those who remain). Careful data cleaning and domain expertise are essential before any analysis.
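One practical consequence of "not ordered is not the same as normal" is to keep missing labs as missing and record whether each one was measured, instead of silently filling in a normal value. A minimal sketch, with hypothetical field names:

```python
def add_missing_indicators(rows, lab_names):
    """Copy each row and add a `<lab>_measured` flag per lab, leaving values as-is."""
    out = []
    for row in rows:
        features = dict(row)
        for lab in lab_names:
            # A lab counts as measured only if a value is actually present.
            features[f"{lab}_measured"] = row.get(lab) is not None
        out.append(features)
    return out


# Synthetic rows: p2 never had a lactate ordered.
rows_in = [
    {"patient_id": "p1", "lactate": 1.1, "wbc": 7.5},
    {"patient_id": "p2", "lactate": None, "wbc": 14.2},
    {"patient_id": "p3", "wbc": 6.1},
]
rows_out = add_missing_indicators(rows_in, ["lactate", "wbc"])
```

The indicator itself often carries signal, since clinicians tend to order a test when they already suspect a problem.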
Real-world applications
- MIMIC-IV, a dataset freely available to credentialed researchers, covers roughly 300,000 patients and more than 70,000 ICU stays at Beth Israel Deaconess Medical Center. It has spawned thousands of Python-based research papers on mortality prediction, treatment effectiveness, and clinical decision support.
- Epic’s Sepsis Model (deployed in 200+ hospitals) uses EHR data including vital signs, labs, and nursing assessments to alert clinicians to deteriorating patients.
- OHDSI’s ATLAS platform, together with OHDSI’s analytics packages (mostly R, with Python usable against the same OMOP tables), enables researchers to run the same analysis across 300+ hospitals worldwide without moving patient data.
The one thing to remember: Python transforms messy, multi-modal EHR data into research-ready datasets through standardized data models like OMOP, clinical NLP for unstructured notes, and careful de-identification — but the biggest challenge is not the code, it is understanding the clinical context that generated the data.
See Also
- Python FHIR Health Data: How Python speaks the universal language that lets hospitals, apps, and doctors share your health information safely.