12  Distributions

Histograms, density plots, boxplots, violins, and ridgelines for exploring the shape and spread of data.

(ns plotje-book.distributions
  (:require
   ;; Rdatasets -- standard datasets
   [scicloj.metamorph.ml.rdatasets :as rdatasets]
   ;; Kindly -- notebook rendering protocol
   [scicloj.kindly.v4.kind :as kind]
   ;; Plotje -- composable plotting
   [scicloj.plotje.api :as pj]))

Histogram

Distribution of sepal length across all species.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :sepal-length))
[Plot: histogram; x-axis "sepal length", 4.5 to 8.0; y-axis 0 to 25]

Colored Histogram

Split by species – each group gets its own color.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :sepal-length {:color :species}))
[Plot: histogram; x-axis "sepal length", 4.5 to 8.0; y-axis 0 to 16; legend "species": setosa, versicolor, virginica]

Petal Width Histogram

Petal width has a bimodal distribution.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :petal-width))
[Plot: histogram; x-axis "petal width", 0.0 to 2.6; y-axis 0 to 40]

Histogram with Custom Title

(-> (rdatasets/reshape2-tips)
    (pj/lay-histogram :total-bill)
    (pj/options {:title "Distribution of Total Bill"
                 :x-label "Amount ($)"}))
[Plot: histogram; title "Distribution of Total Bill"; x-axis "Amount ($)", 5 to 50; y-axis 0 to 70]

Density-Normalized Histogram

Pass {:normalize :density} so the y-axis shows probability density instead of raw counts. This makes the histogram directly comparable with a density curve overlay.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :sepal-length {:normalize :density :alpha 0.5})
    pj/lay-density)
[Plot: histogram with density curve; x-axis "sepal length", 3 to 10; y-axis 0.0 to 0.45]
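To make the density normalization concrete, here is a minimal sketch in plain Clojure using the textbook definition: each bar height is the bin count divided by (total count × bin width), so the bar areas sum to 1. The helper names (`bin-counts`, `density-heights`) and the sample values are hypothetical, and this is not necessarily how Plotje computes its bins internally.

```clojure
;; Sketch only: standard density normalization, not Plotje's own code.
(defn bin-counts
  "Counts values into n-bins equal-width bins spanning [lo, hi]."
  [xs lo hi n-bins]
  (let [width (/ (- hi lo) (double n-bins))
        ;; Clamp the index so that x = hi lands in the last bin.
        idx   (fn [x] (min (dec n-bins) (int (/ (- x lo) width))))]
    (reduce (fn [counts x] (update counts (idx x) inc))
            (vec (repeat n-bins 0))
            xs)))

(defn density-heights
  "Bar heights chosen so that the bar areas (height * width) sum to 1."
  [counts width]
  (let [n (reduce + counts)]
    (mapv #(/ % (* n width)) counts)))

;; Hypothetical sample, binned into 8 bins of width 0.5 over [4.0, 8.0].
(def sample-xs [4.3 4.9 5.0 5.1 5.8 6.0 6.3 6.4 7.0 7.9])
(def sample-counts (bin-counts sample-xs 4.0 8.0 8))
;; => [1 1 2 1 3 0 1 1]
(def sample-heights (density-heights sample-counts 0.5))

;; Total area under the bars; approximately 1.0.
(* 0.5 (reduce + sample-heights))
```

Because a density curve also integrates to 1, bars normalized this way share a y-scale with the curve, which is why the overlay lines up.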

Log-Scale Histogram

When bin counts span orders of magnitude, the smallest bars disappear next to the largest on a linear y-axis. A log y-scale lets every bar register. The data below doubles the count of each successive bin (1, 2, 4, …, 512); on a log axis the doubling shows as a uniform staircase – each bar a fixed step above the previous, the same step every time.

On a log scale, the lower bound of the y-axis comes from the smallest positive bin count rather than from a visual zero baseline, because log scales have no zero. Empty bins draw no bar and do not pull the axis down.

(-> {:x (mapcat (fn [i] (repeat (long (Math/pow 2 i)) i)) (range 10))}
    (pj/lay-histogram {:bins 10})
    (pj/scale :y :log)
    (pj/options {:title "Log Y on Histogram"}))
[Plot: histogram; title "Log Y on Histogram"; x-axis "x", 0 to 9; y-axis (log) 1, 10, 100, 1000]
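The uniform staircase can be checked numerically without any plotting. The sketch below rebuilds the same data in plain Clojure, counts each value, and shows that a constant ratio between neighbouring counts becomes a constant difference after taking log10. The names `doubling-xs`, `doubling-counts`, `ratios`, and `log-steps` are illustrative only.

```clojure
;; Same data as the histogram above: value i appears 2^i times.
(def doubling-xs
  (mapcat (fn [i] (repeat (long (Math/pow 2 i)) i)) (range 10)))

;; Count per distinct value, ordered by value.
(def doubling-counts (mapv (frequencies doubling-xs) (range 10)))
;; => [1 2 4 8 16 32 64 128 256 512]

;; Ratio between each bin and the previous one: always 2.
(def ratios (map / (rest doubling-counts) doubling-counts))

;; On a log10 axis a constant ratio becomes a constant step,
;; here log10(2), roughly 0.301 per bin.
(def log-steps
  (map (fn [a b] (- (Math/log10 (double b)) (Math/log10 (double a))))
       doubling-counts
       (rest doubling-counts)))
```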

Density Plot

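A smooth curve that estimates the probability density function. Unlike a histogram, it has no bins, so its shape does not depend on bin width or bin edges; the bandwidth plays the analogous smoothing role.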

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length))
[Plot: density curve; x-axis "sepal length", 3 to 10; y-axis 0.0 to 0.4]

Grouped Density

Per-species density curves with automatic color mapping.

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length {:color :species}))
[Plot: density curves; x-axis "sepal length", 4 to 9; y-axis 0.0 to 1.2; legend "species": setosa, versicolor, virginica]

Density with Custom Bandwidth

A narrow bandwidth reveals more detail; a wide bandwidth smooths more.

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length {:bandwidth 0.3}))
[Plot: density curve; x-axis "sepal length", 3 to 10; y-axis 0.0 to 0.4]
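To see why bandwidth controls smoothness, here is a minimal kernel density estimate in plain Clojure. It assumes a Gaussian kernel, which is a common default; this chapter does not state which kernel or default bandwidth Plotje uses, so treat the functions `gaussian-kernel` and `kde` and the sample `pts` as a hypothetical illustration rather than the library's implementation.

```clojure
;; Sketch only: Gaussian KDE, assumed kernel, not Plotje's own code.
(defn gaussian-kernel
  "Standard normal density at u."
  [^double u]
  (/ (Math/exp (* -0.5 u u)) (Math/sqrt (* 2.0 Math/PI))))

(defn kde
  "Returns a function x -> estimated density for sample xs and bandwidth h."
  [xs h]
  (let [n (count xs)]
    (fn [x]
      (/ (reduce + (map #(gaussian-kernel (/ (- x %) h)) xs))
         (* n h)))))

;; Three clustered points and one far away.
(def pts [1.0 1.1 1.2 3.0])

;; Narrow bandwidth: a tall, sharp peak over the cluster.
(def narrow-at-cluster ((kde pts 0.1) 1.1))

;; Wide bandwidth: the same cluster is smeared into a lower, broader bump.
(def wide-at-cluster ((kde pts 1.0) 1.1))
```

Each observation contributes one bump of width `h`; the estimate is the average of those bumps, so a smaller `h` keeps nearby observations visible as separate features while a larger `h` merges them.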

Rug

A rug shows the raw data positions as short tick marks along the axis. Layered with a density curve, it shows the smooth shape and the underlying observations together.

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length)
    pj/lay-rug)
[Plot: density curve with rug; x-axis "sepal length", 3 to 10; y-axis 0.0 to 0.4]

Strip Plot (Jitter)

When plotting a numeric column against a categorical column, points stack on the same band positions. :jitter true spreads them with small random offsets along the categorical axis.

(-> (rdatasets/datasets-iris)
    (pj/lay-point :species :sepal-width {:jitter true}))
[Plot: jittered points; x-axis "species": setosa, versicolor, virginica; y-axis "sepal width", 2.0 to 4.5]

Pass a number to control the jitter amount in drawing units.

(-> (rdatasets/datasets-iris)
    (pj/lay-point :species :sepal-width {:jitter 10 :alpha 0.5}))
[Plot: jittered points; x-axis "species": setosa, versicolor, virginica; y-axis "sepal width", 2.0 to 4.5]
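The idea behind jitter can be sketched in a few lines of plain Clojure: every point starts at the center of its category band and is shifted by a small random offset along the categorical axis. The sketch assumes uniformly distributed offsets in [-amount, amount); the chapter does not say which distribution Plotje draws from, and `jitter-positions` is a hypothetical helper, not part of the Plotje API.

```clojure
(import 'java.util.Random)

;; Sketch only: uniform jitter, an assumed scheme.
(defn jitter-positions
  "Shifts each band center by a uniform random offset in [-amount, amount)."
  [centers amount ^Random rng]
  (mapv (fn [c] (+ c (* amount (- (* 2.0 (.nextDouble rng)) 1.0))))
        centers))

;; Five points in three bands (band centers 0, 1, 2).
(def centers [0.0 0.0 1.0 1.0 2.0])

;; A seeded generator makes the layout reproducible.
(def jittered (jitter-positions centers 0.2 (Random. 42)))
```

Only the categorical coordinate is perturbed, so the numeric values stay exact; the offsets exist purely to separate overlapping points.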

Boxplot

A boxplot shows the median, the quartiles, whiskers that reach at most 1.5 × IQR (interquartile range) beyond the quartiles, and outliers drawn as individual points.

(-> (rdatasets/datasets-iris)
    (pj/lay-boxplot :species :sepal-width))
[Plot: boxplot; x-axis "species": setosa, versicolor, virginica; y-axis "sepal width", 2.0 to 4.5]

The 1.5 × IQR rule holds by construction: each whisker stays within the Tukey fence [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], and every outlier point falls outside it.
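The Tukey fence is easy to compute by hand. The sketch below uses linear-interpolation quartiles, which is one common convention among several; Plotje's exact quartile method is not stated in this chapter, so numbers may differ slightly from the plot. The functions `quantile` and `tukey-summary` are hypothetical helpers written for this illustration.

```clojure
;; Sketch only: quartile convention is an assumption.
(defn quantile
  "Quantile p (0 to 1) of a sorted vector, by linear interpolation."
  [sorted-xs p]
  (let [last-idx (dec (count sorted-xs))
        pos      (double (* p last-idx))
        lo       (long (Math/floor pos))
        hi       (min (inc lo) last-idx)
        frac     (- pos lo)]
    (+ (nth sorted-xs lo)
       (* frac (- (nth sorted-xs hi) (nth sorted-xs lo))))))

(defn tukey-summary
  "Quartiles, fence, whisker ends, and outliers for a collection of numbers."
  [xs]
  (let [s       (vec (sort xs))
        q1      (quantile s 0.25)
        q3      (quantile s 0.75)
        iqr     (- q3 q1)
        lower   (- q1 (* 1.5 iqr))
        upper   (+ q3 (* 1.5 iqr))
        inside? #(<= lower % upper)
        inside  (filter inside? s)]
    {:q1 q1
     :q3 q3
     :fence [lower upper]
     ;; Whiskers end at the most extreme observations inside the fence.
     :whiskers [(first inside) (last inside)]
     :outliers (vec (remove inside? s))}))

(tukey-summary [1 2 3 4 5 6 7 8 9 100])
;; => {:q1 3.25, :q3 7.75, :fence [-3.5 14.5], :whiskers [1 9], :outliers [100]}
```

Note that the whiskers end at actual data values (1 and 9), not at the fence itself; the fence only decides which observations count as outliers.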

Grouped Boxplot

Side-by-side boxplots colored by a grouping variable.

(-> (rdatasets/reshape2-tips)
    (pj/lay-boxplot :day :total-bill {:color :smoker}))
[Plot: grouped boxplot; x-axis "day": Sun, Sat, Thur, Fri; y-axis "total bill", 5 to 50; legend "smoker": No, Yes]

Each color group gets a distinct dodge offset, visible as side-by-side boxes within each day.

Horizontal Boxplot

Flip the coordinates with (pj/coord :flip) to draw the boxes horizontally.

(-> (rdatasets/datasets-iris)
    (pj/lay-boxplot :species :sepal-width)
    (pj/coord :flip))
[Plot: horizontal boxplot; x-axis "sepal width", 2.0 to 4.5; y-axis "species": setosa, versicolor, virginica]

Violin Plot

A violin shows the full density shape per category – more informative than a boxplot for multimodal distributions.

(-> (rdatasets/reshape2-tips)
    (pj/lay-violin :day :total-bill))
[Plot: violin plot; x-axis "day": Sun, Sat, Thur, Fri; y-axis "total bill", -20 to 70]

Grouped Violin

Color splits each category into side-by-side violins.

(-> (rdatasets/reshape2-tips)
    (pj/lay-violin :day :total-bill {:color :smoker}))
[Plot: grouped violin plot; x-axis "day": Sun, Sat, Thur, Fri; y-axis "total bill", -20 to 70; legend "smoker": No, Yes]

Each color group gets a distinct dodge offset, visible as side-by-side violins within each day.

Horizontal Violin

(-> (rdatasets/datasets-iris)
    (pj/lay-violin :species :petal-length)
    (pj/coord :flip))
[Plot: horizontal violin plot; x-axis "petal length", 1 to 8; y-axis "species": setosa, versicolor, virginica]

Ridgeline Plot

Overlapping density curves stacked vertically by category – good for comparing distribution shapes across many groups.

(-> (rdatasets/datasets-iris)
    (pj/lay-ridgeline :species :sepal-length))
[Plot: ridgeline plot; x-axis "sepal length", 4 to 9; y-axis "species": setosa, versicolor, virginica]

Colored Ridgeline

Map :color to the same categorical column so that each curve gets its own color.

(-> (rdatasets/datasets-iris)
    (pj/lay-ridgeline :species :sepal-length {:color :species}))
[Plot: colored ridgeline plot; x-axis "sepal length", 4 to 9; y-axis "species": setosa, versicolor, virginica; legend "species": setosa, versicolor, virginica]

Comparing Multiple Columns

Pass a vector of column names to pj/lay-histogram (or any lay-* function) to create one panel per column. This is useful for comparing the shape of different variables side by side.

(pj/lay-histogram (rdatasets/datasets-iris) [:sepal-length :sepal-width :petal-length])
[Plot: three histogram panels: "sepal length", "sepal width", "petal length"]

Combine with :color to see group differences within each column.

(pj/lay-density (rdatasets/datasets-iris) [:sepal-length :sepal-width :petal-length] {:color :species})
[Plot: three density panels: "sepal length", "sepal width", "petal length"; legend "species": setosa, versicolor, virginica]

The multi-column vector works with any lay-* function – histograms, density curves, boxplots, violin plots, and more.

See Also

  • Core Concepts – mappings and aesthetics referenced throughout
  • Relationships – two-distribution comparisons via heatmap, contour, and SPLOM

What’s Next

  • Ranking – bar charts and lollipop plots for categorical comparisons
  • Faceting – split distributions by groups into separate panels
source: notebooks/plotje_book/distributions.clj