Plotnine (ggplot for Python) — Deep Dive

Plotnine implements the layered Grammar of Graphics with a pipeline that transforms data through statistical computations, scale mappings, and coordinate projections before rendering via Matplotlib. Understanding this pipeline enables custom extensions, sophisticated multi-layer plots, and publication-ready output.

The Rendering Pipeline

When you call print(plot) or plot.save(), plotnine executes a multi-stage pipeline:

  1. Data preparation — each layer receives its data (from the layer or the plot default)
  2. Aesthetic mapping — aes() columns are identified
  3. Statistical transformation — the layer’s stat computes derived data (e.g., stat_bin creates bin counts)
  4. Scale training — scales learn the data range across all layers
  5. Scale mapping — data values are transformed to visual values (positions, colors, sizes)
  6. Faceting — data is split into panels
  7. Coordinate transformation — positions are projected (Cartesian, flipped, or transformed axes)
  8. Rendering — Matplotlib draws the actual graphics

This pipeline means that geom_bar() with stat='count' doesn’t require you to pre-aggregate — the stat stage handles counting. Each stage is modular and replaceable.

Advanced Aesthetic Patterns

Aesthetics support expressions and computed columns:

from plotnine import *
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=365),
    'sales': np.random.lognormal(4, 0.5, 365),
    'region': np.random.choice(['East', 'West', 'Central'], 365),
    'returns': np.random.lognormal(2, 0.8, 365)
})

# Computed aesthetic — net value
(ggplot(df, aes(x='date', y='sales - returns', color='region'))
 + geom_line(alpha=0.7)
 + geom_smooth(method='loess', se=True, span=0.3)
 + labs(y='Net Sales', title='Daily Net Sales by Region'))

The after_stat() function accesses computed statistics within aesthetic mappings. This lets you normalize histograms or access density values:

# Density-scaled histogram with a kernel density overlay
(ggplot(df, aes(x='sales'))
 + geom_histogram(aes(y=after_stat('density')), bins=40, fill='#3498db', alpha=0.7)
 + geom_density(color='red', size=1.2)  # plotnine sets line width via size
 + labs(y='Density'))
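
It can help to see what the density variable represents. The NumPy sketch below assumes the usual definition (count divided by total count times bin width), which is what makes a histogram comparable to a density curve. The bin edges NumPy picks may differ from the ones plotnine would choose, so treat this as an illustration of the arithmetic rather than a reproduction of the plot.

```python
import numpy as np

rng = np.random.default_rng(0)
sales = rng.lognormal(4, 0.5, 365)   # same kind of data as the example above

counts, edges = np.histogram(sales, bins=40)
widths = np.diff(edges)

# Density rescales counts so that the bar areas sum to 1
density = counts / (counts.sum() * widths)

total_area = np.sum(density * widths)
print(total_area)
```

Because the areas sum to one, the bars and a kernel density estimate share the same y-axis scale, which raw counts would not.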

Position Adjustments

Position adjustments control how overlapping geoms are arranged:

# Dodged boxplots: groups side by side
(ggplot(df, aes(x='region', y='sales', fill='region'))
 + geom_boxplot(position='dodge'))

# Stacked bars with stat identity
monthly = df.groupby([df['date'].dt.month, 'region'])['sales'].sum().reset_index()
monthly.columns = ['month', 'region', 'sales']

(ggplot(monthly, aes(x='month', y='sales', fill='region'))
 + geom_col(position='stack'))

# Jittered points to avoid overplotting
(ggplot(df, aes(x='region', y='sales'))
 + geom_jitter(width=0.2, alpha=0.3, size=1)
 + geom_boxplot(alpha=0.5, outlier_shape=''))

position_dodge(), position_stack(), position_fill() (normalized stacking), position_jitter(), and position_nudge() provide fine-grained control. position_dodge(width=0.8) adjusts spacing between grouped bars.

Coordinate Systems

Coordinate systems transform how positions map to the final display:

# Flipped coordinates — horizontal bars
(ggplot(df.head(20), aes(x='region', y='sales'))
 + geom_col(fill='steelblue')
 + coord_flip())

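# Polar coordinates: plotnine does not currently provide ggplot2's coord_polar(),
# so pie/donut charts have to be drawn with Matplotlib directly.
# A single stacked bar shows the same part-to-whole split:
region_totals = df.groupby('region')['sales'].sum().reset_index()
(ggplot(region_totals, aes(x='1', y='sales', fill='region'))
 + geom_col(width=0.5)
 + coord_flip()
 + theme_void())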

# Fixed aspect ratio for spatial data
(ggplot(df, aes(x='sales', y='returns'))
 + geom_point(alpha=0.3)
 + coord_fixed(ratio=1))

coord_cartesian(xlim=(0, 100)) zooms without removing data points (unlike scale_x_continuous(limits=...) which filters data before stat computation). This distinction matters for boxplots and smoothers that depend on all data being present.

Advanced Faceting

Facets support several configuration options beyond basic splitting:

# Free scales per panel
(ggplot(df, aes(x='date', y='sales'))
 + geom_line()
 + facet_wrap('~region', scales='free_y', ncol=1)
 + theme(figure_size=(10, 8)))

# Two-variable grid
df['quarter'] = df['date'].dt.quarter.astype(str)
(ggplot(df, aes(x='sales', y='returns'))
 + geom_point(alpha=0.3)
 + geom_smooth(method='lm', color='red')
 + facet_grid('quarter ~ region')
 + theme(figure_size=(12, 10)))

scales='free_y' allows each panel its own y-axis range — useful when comparing trends rather than absolute values. scales='free' frees both axes. The default shared scales make magnitude comparison easy but can compress panels with different ranges.

Custom Stats

Creating custom statistical transformations lets you extend plotnine’s grammar:

from plotnine.stats.stat import stat  # base class for custom stats

class stat_rolling_mean(stat):
    REQUIRED_AES = {'x', 'y'}
    DEFAULT_PARAMS = {'geom': 'line', 'position': 'identity',
                      'window': 30, 'min_periods': 1}
    
    @classmethod
    def compute_group(cls, data, scales, **params):
        data = data.sort_values('x').copy()
        data['y'] = data['y'].rolling(
            params['window'], min_periods=params['min_periods']
        ).mean()
        return data.dropna(subset=['y'])

# Use the custom stat
(ggplot(df, aes(x='date', y='sales'))
 + geom_point(alpha=0.1, size=0.5)
 + stat_rolling_mean(window=14, color='red', size=1.5)   # line width is set via size
 + stat_rolling_mean(window=60, color='blue', size=1.5)
 + labs(title='Sales with 14-day and 60-day moving averages'))

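The compute_group class method receives the DataFrame for one group, with columns named after aesthetics (x, y) rather than the original column names, and must return a DataFrame containing the columns the geom needs. The result is then drawn with the stat's default geom ('line' here).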
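
The per-group contract can be tried out without plotnine. The sketch below reuses the body of compute_group as a plain function and applies it group by group to a hand-made frame. The layer_data frame and the rolling_mean_group name are illustrative stand-ins, not plotnine API.

```python
import pandas as pd

def rolling_mean_group(data, window=30, min_periods=1):
    """Same logic as compute_group above, as a plain function."""
    data = data.sort_values('x').copy()
    data['y'] = data['y'].rolling(window, min_periods=min_periods).mean()
    return data.dropna(subset=['y'])

# Hypothetical layer data after aesthetic mapping: columns are named
# by aesthetic ('x', 'y', 'group'), not by the original column names
layer_data = pd.DataFrame({
    'x': [3, 1, 2, 2, 1],
    'y': [30.0, 10.0, 20.0, 8.0, 4.0],
    'group': [1, 1, 1, 2, 2],
})

# The function runs once per group and the pieces are recombined
pieces = [rolling_mean_group(g, window=2)
          for _, g in layer_data.groupby('group')]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Each group is sorted and smoothed independently, so a moving average never leaks across groups (for example across regions when color='region' is mapped).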

Multi-Layer Composition Patterns

Complex analytical figures often require multiple data sources and geom types:

# Reference lines and annotations alongside data
summary = df.groupby('region').agg(
    mean_sales=('sales', 'mean'),
    median_sales=('sales', 'median')
).reset_index()

(ggplot(df, aes(x='region', y='sales'))
 + geom_violin(fill='lightblue', alpha=0.5)
 + geom_jitter(width=0.15, alpha=0.2, size=0.8)
 + geom_point(data=summary, mapping=aes(x='region', y='mean_sales'),
              color='red', size=4, shape='D')
 + geom_hline(yintercept=df['sales'].median(), 
              linetype='dashed', color='gray')
 + annotate('text', x=0.5, y=df['sales'].median() + 5,
            label='Overall Median', size=8, color='gray')
 + labs(title='Sales Distribution with Group Means'))

Each layer can specify its own data and mapping, overriding the plot defaults. This enables combining raw data, summaries, and annotations in a single coherent visualization.

Theme Customization for Publication

Journals and conferences have specific formatting requirements. Plotnine’s theme system handles this:

publication_theme = (
    theme_minimal() +
    theme(
        figure_size=(7, 5),
        text=element_text(family='serif', size=10),
        # plotnine's element_text takes weight= for bold text (ggplot2 uses face=)
        axis_title=element_text(size=11, weight='bold'),
        axis_text=element_text(size=9),
        legend_position='bottom',
        legend_title=element_text(size=10, weight='bold'),
        legend_text=element_text(size=9),
        panel_grid_minor=element_blank(),
        panel_grid_major=element_line(color='#e0e0e0', size=0.3),
        strip_text=element_text(size=10, weight='bold'),
        plot_title=element_text(size=13, weight='bold', ha='left'),
        plot_margin=0.05
    )
)

# Apply to any plot
(ggplot(df, aes(x='sales', y='returns', color='region'))
 + geom_point(alpha=0.4)
 + geom_smooth(method='lm', se=True)
 + facet_wrap('~region')
 + publication_theme
 + labs(title='Sales vs Returns by Region',
        x='Total Sales ($)', y='Returns ($)'))

Export and Rendering

Plotnine saves to any format Matplotlib supports:

plot = (ggplot(df, aes(x='date', y='sales', color='region'))
        + geom_line()
        + publication_theme)

# Vector formats for publication
plot.save('figure.pdf', dpi=300, width=7, height=5)
plot.save('figure.svg', width=7, height=5)

# Raster for web
plot.save('figure.png', dpi=150, width=10, height=6)

For batch figure generation, loop over parameters and save programmatically. Since plotnine plots are Python objects, you can store them in lists and render later — enabling report-generation pipelines that define all figures before writing any files.

Performance Considerations

Plotnine renders through Matplotlib, which means it inherits Matplotlib’s performance characteristics. For datasets over ~50K rows, geom_point() can be slow. Strategies:

  • Use geom_bin_2d() for large scatter plots: it bins data into rectangular cells, reducing rendering load from N points to M bins (ggplot2's geom_hex() is not available in plotnine)
  • Pre-aggregate with Pandas before plotting
  • Reduce alpha and size for dense point clouds
  • For faceted plots with many panels, increase figure_size proportionally to avoid cramped rendering

# 2D binning for large datasets (large_df is synthetic stand-in data)
large_df = pd.DataFrame({'x': np.random.normal(size=200_000),
                         'y': np.random.normal(size=200_000)})

(ggplot(large_df, aes(x='x', y='y'))
 + geom_bin_2d(bins=50)
 + scale_fill_cmap(cmap_name='viridis')
 + theme_minimal())
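
The pre-aggregation strategy from the list above can also be done entirely before plotnine sees the data. This sketch bins synthetic points on a 50 x 50 grid with NumPy and keeps one row per non-empty cell. The grid size and column names are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Count points in a 50 x 50 grid
counts, x_edges, y_edges = np.histogram2d(x, y, bins=50)

# Long-format table with one row per cell, using bin centers as coordinates
x_mid = (x_edges[:-1] + x_edges[1:]) / 2
y_mid = (y_edges[:-1] + y_edges[1:]) / 2
cells = pd.DataFrame({
    'x': np.repeat(x_mid, len(y_mid)),   # x varies slowest, matching ravel()
    'y': np.tile(y_mid, len(x_mid)),
    'n': counts.ravel(),
})
cells = cells[cells['n'] > 0]   # drop empty cells

print(len(cells), 'rows instead of', n)
```

The resulting cells table has at most 2,500 rows and can be drawn with geom_tile(aes(x='x', y='y', fill='n')), so rendering cost no longer grows with the number of raw points.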

Plotnine vs. ggplot2 Differences

While plotnine mirrors ggplot2’s API closely, some differences matter:

  • String formulas use '~variable' syntax instead of R’s ~variable
  • after_stat() takes the computed variable as a string, e.g. after_stat('density'), where ggplot2 uses a bare name (or ..density.. in older code)
  • Some ggplot2 extensions (ggridges, ggrepel) don’t exist in plotnine yet
  • Performance is generally slower than ggplot2 due to Matplotlib’s rendering overhead
  • Multi-line plot expressions must be wrapped in parentheses, because Python does not continue a statement onto a new line that starts with +

One thing to remember: Plotnine's Grammar of Graphics pipeline (data → stats → scales → coordinates → rendering) gives you composable, predictable charts in which each component can be swapped independently, which makes it a systematic approach to statistical visualization in Python.
