Lifelines for Survival Analysis — Core Concepts
Why survival analysis matters
Standard regression predicts a value. Classification predicts a category. But many real-world questions are about when, not just if, something will happen. And crucially, for many subjects the event has not yet happened when you analyze the data. Such incomplete observations are called censored, and handling them is what sets survival analysis apart from ordinary regression and classification.
Without proper handling of censoring, naive approaches either throw away incomplete observations (losing data) or treat “still alive” the same as “died at the end of the study” (introducing bias). Lifelines handles censoring correctly by design.
Key concepts
The survival function S(t)
The survival function gives the probability that the event has not happened by time t, that is, S(t) = P(T > t) where T is the time of the event. At time zero, S(0) = 1 (no subject has experienced the event yet). As t grows, S(t) can only stay flat or decrease. The shape of this curve reveals whether failures happen early, late, or at a constant rate.
Censoring
A subject is right-censored if the event has not occurred by the end of observation. Examples:
- A patient is still alive when the study ends
- A customer has not yet churned
- A machine is still running
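Lifelines needs two pieces of information per subject: a duration (how long the subject was observed) and an event indicator (1 if the event happened, 0 if censored). The column names are up to you. Univariate fitters such as KaplanMeierFitter take these as the durations and event_observed arguments of fit, while regression models such as CoxPHFitter take a DataFrame plus the duration_col and event_col names.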
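As a minimal sketch of how the two values can be derived from raw records, the example below uses only the standard library. The study cutoff date, the records and the helper name to_survival_row are all hypothetical, chosen for illustration.

```python
from datetime import date

# Hypothetical study cutoff: nothing after this date has been observed.
STUDY_END = date(2024, 12, 31)

# Hypothetical records: (subject id, start date, event date or None if no event yet).
records = [
    ("a", date(2024, 1, 1), date(2024, 3, 1)),     # event observed
    ("b", date(2024, 6, 1), None),                 # still event-free at the cutoff
    ("c", date(2024, 9, 15), date(2024, 10, 15)),  # event observed
]


def to_survival_row(start, event_date, study_end=STUDY_END):
    """Return (duration_in_days, event_observed) for one subject."""
    if event_date is not None and event_date <= study_end:
        return (event_date - start).days, 1
    # Right-censored: we only know the subject lasted at least until the cutoff.
    return (study_end - start).days, 0


rows = [to_survival_row(start, end) for _, start, end in records]
durations = [d for d, _ in rows]
event_observed = [e for _, e in rows]

print(durations)       # days under observation
print(event_observed)  # 1 = event happened, 0 = censored
```

Subject "b" keeps its 213 days of observation instead of being dropped, which is exactly the information a censoring-aware method can use.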
Hazard function h(t)
The hazard is the instantaneous rate of the event at time t, given survival up to t. A rising hazard means failure becomes more likely over time (aging). A falling hazard suggests early failures that thin out (infant mortality in electronics). A constant hazard means the event is memoryless (like radioactive decay).
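The three hazard shapes can be illustrated with the Weibull distribution, whose hazard in the common scale/shape form is h(t) = (rho / lambda) * (t / lambda)^(rho - 1). The sketch below is plain Python rather than a Lifelines call; the parameter values are arbitrary assumptions picked to show each case.

```python
def weibull_hazard(t, lambda_, rho):
    """Hazard of a Weibull distribution with scale lambda_ and shape rho."""
    return (rho / lambda_) * (t / lambda_) ** (rho - 1)


times = [1.0, 2.0, 4.0]

# rho > 1: hazard rises with time (aging, wear-out).
rising = [weibull_hazard(t, lambda_=2.0, rho=2.0) for t in times]

# rho < 1: hazard falls with time (early failures thin out).
falling = [weibull_hazard(t, lambda_=2.0, rho=0.5) for t in times]

# rho == 1: hazard is constant (the memoryless exponential case).
constant = [weibull_hazard(t, lambda_=2.0, rho=1.0) for t in times]

print("rising:  ", rising)
print("falling: ", falling)
print("constant:", constant)
```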
Core estimators in Lifelines
Kaplan-Meier estimator
The most common non-parametric estimator. It makes no assumptions about the shape of the survival curve — it just follows the data. Each observed event causes the curve to step down. Censored observations are accounted for by adjusting the “at risk” count.
The Kaplan-Meier curve is the first thing most analysts plot. It answers: “What fraction of subjects are still event-free at each time point?”
Nelson-Aalen estimator
An alternative that estimates the cumulative hazard function rather than the survival function directly. Useful when you want to visualize how risk accumulates over time.
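To show what the estimator computes, here is a from-scratch sketch in plain Python: at each event time it adds (events at that time) / (subjects still at risk) to a running total. This is the textbook form, not the Lifelines implementation, and the function name nelson_aalen and the toy data are assumptions for illustration. In Lifelines the corresponding class is NelsonAalenFitter, whose result is exposed as cumulative_hazard_; its default handling of tied event times may give slightly different values than this sketch.

```python
import math
from collections import Counter


def nelson_aalen(durations, event_observed):
    """Return [(time, cumulative_hazard)] at each distinct event time."""
    events_at = Counter(t for t, e in zip(durations, event_observed) if e == 1)
    cumulative = 0.0
    curve = []
    for t in sorted(events_at):
        # Subjects whose observation lasted at least until t are still at risk.
        at_risk = sum(1 for d in durations if d >= t)
        cumulative += events_at[t] / at_risk
        curve.append((t, cumulative))
    return curve


# Hypothetical toy data: 1 = event observed, 0 = censored.
durations = [2, 3, 3, 5, 8, 8, 9, 12]
event_observed = [1, 1, 0, 1, 0, 1, 1, 0]

curve = nelson_aalen(durations, event_observed)
for t, h in curve:
    # exp(-H(t)) gives a survival estimate that is close to Kaplan-Meier.
    print(t, round(h, 4), round(math.exp(-h), 4))
```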
Cox proportional hazards model
The workhorse of survival regression. It relates the hazard to covariates (age, treatment group, product tier) without assuming a specific baseline hazard shape. The “proportional” part means each covariate multiplies the hazard by a constant factor across all time points.
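Key output: hazard ratios. A hazard ratio of 2.0 for a covariate means that, holding the other covariates fixed, subjects with that characteristic experience the event at twice the rate at any given time. A ratio below 1.0 indicates a lower rate.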
Parametric models
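When you can assume a specific distribution for survival times, parametric models (Weibull, Log-Normal, Log-Logistic) provide smoother curves and the ability to extrapolate beyond the observed data range. That extrapolation is only as trustworthy as the distributional assumption behind it. Lifelines ships fitters for these families, among others (for example WeibullFitter, LogNormalFitter and LogLogisticFitter).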
A typical analysis workflow
- Prepare data — Ensure each row has a duration and an event indicator. Handle missing data.
- Plot Kaplan-Meier curves — Visualize overall survival and survival by group.
- Compare groups — Use the log-rank test to determine if survival differs significantly between groups.
- Model covariates — Fit a Cox model or parametric model to quantify how variables affect survival.
- Validate — Check proportional hazards assumption (Schoenfeld residuals), assess concordance index.
- Predict — Estimate median survival time or survival probability at specific time points for new subjects.
Common misconception
People often confuse survival analysis with simple time-series analysis. They are different problems. Time-series analysis models how a quantity changes over time (stock prices, temperature). Survival analysis models when a one-time event occurs (death, failure, churn). The statistical machinery is completely different because survival analysis must handle censoring — you rarely have censored stock prices.
Where Lifelines fits in the ecosystem
- Lifelines — pure Python, easy API, great for learning and small-to-medium datasets
- scikit-survival — integrates with scikit-learn pipelines, supports random survival forests
- R survival package — the gold standard in biostatistics, more parametric model options
- PySurvival — deep learning survival models
Lifelines is arguably the most approachable starting point in Python. Its API works directly with pandas DataFrames, and its plotting methods build on matplotlib.
The one thing to remember: Lifelines brings statistically rigorous time-to-event analysis to Python, correctly handling the censored observations that would trip up ordinary regression or classification approaches.
See Also
- Python Statsmodels Regression: How Python draws the best-fit line through messy data and tells you whether to trust it.