13  Distributions

Histograms, density plots, boxplots, violins, and ridgelines for exploring the shape and spread of data.

(ns plotje-book.distributions
  (:require
   ;; Rdatasets -- standard datasets
   [scicloj.metamorph.ml.rdatasets :as rdatasets]
   ;; Kindly -- notebook rendering protocol
   [scicloj.kindly.v4.kind :as kind]
   ;; Plotje -- composable plotting
   [scicloj.plotje.api :as pj]))

Histogram

Distribution of sepal length across all species.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :sepal-length))
[Figure: histogram of sepal length; x-axis: sepal length]

Colored Histogram

Split by species – each group gets its own color.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :sepal-length {:color :species}))
[Figure: histogram of sepal length colored by species; x-axis: sepal length; legend: species (setosa, versicolor, virginica)]

Petal Width Histogram

Petal width has a bimodal distribution.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :petal-width))
[Figure: histogram of petal width; x-axis: petal width]

Histogram with Custom Title

(-> (rdatasets/reshape2-tips)
    (pj/lay-histogram :total-bill)
    (pj/options {:title "Distribution of Total Bill"
                 :x-label "Amount ($)"}))
[Figure: histogram titled "Distribution of Total Bill"; x-axis: Amount ($)]

Density-Normalized Histogram

Pass {:normalize :density} so the y-axis shows probability density instead of raw counts. This makes the histogram directly comparable with a density curve overlay.

(-> (rdatasets/datasets-iris)
    (pj/lay-histogram :sepal-length {:normalize :density :alpha 0.5})
    pj/lay-density)
[Figure: density-normalized histogram of sepal length with a density curve overlay; x-axis: sepal length]
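
To see what density normalization means numerically, here is a minimal sketch in plain Clojure, independent of Plotje. The function name density-heights, the fixed bin width, and the sample values are assumptions made for illustration; Plotje's own binning may differ. Each bar height is count / (n * bin-width), so the bar areas sum to 1, which is what puts the bars on the same scale as a density curve.

```clojure
;; Hypothetical helper, not part of Plotje.
(defn density-heights
  "Bins xs into fixed-width bins starting at lo and returns a sorted map
  from bin index to density height: count / (n * bin-width)."
  [xs lo bin-width]
  (let [n (count xs)
        ;; bin index of each value: floor((x - lo) / bin-width)
        counts (frequencies
                (map #(long (Math/floor (/ (- % lo) bin-width))) xs))]
    (into (sorted-map)
          (map (fn [[bin c]] [bin (/ c (* n bin-width))]))
          counts)))

;; Made-up sample values, chosen only for the example.
(def sample [4.6 4.9 5.0 5.1 5.4 5.8 6.1 6.3 6.4 7.0])

(def heights (density-heights sample 4.5 0.5))

;; Total area = sum of (height * bin-width); close to 1.0 up to rounding.
(def total-area (reduce + (map #(* % 0.5) (vals heights))))
```

Raw counts depend on the sample size and bin width, while density heights do not, so only the latter can share a y-axis with a density estimate.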

Density Plot

A smooth curve estimating the probability density function. Unlike a histogram, it does not depend on bin width or bin placement, though its shape does depend on the bandwidth.

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length))
[Figure: density curve of sepal length; x-axis: sepal length]

Grouped Density

Per-species density curves with automatic color mapping.

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length {:color :species}))
[Figure: density curves of sepal length, one per species; x-axis: sepal length; legend: species (setosa, versicolor, virginica)]

Density with Custom Bandwidth

The :bandwidth option controls smoothing: a narrow bandwidth reveals more detail, while a wide bandwidth smooths more.

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length {:bandwidth 0.3}))
[Figure: density curve of sepal length with bandwidth 0.3; x-axis: sepal length]
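
The effect of bandwidth can be sketched with a hand-rolled Gaussian kernel density estimate in plain Clojure. This is an illustration, not Plotje's implementation: the functions gaussian-kernel and kde-at, the choice of a Gaussian kernel, and the sample values are all assumptions made for the example.

```clojure
;; Hypothetical helpers, not part of Plotje.
(defn gaussian-kernel
  "Standard normal density at u."
  [u]
  (/ (Math/exp (* -0.5 u u))
     (Math/sqrt (* 2 Math/PI))))

(defn kde-at
  "Kernel density estimate at x for data xs with bandwidth h:
  (1 / (n * h)) * sum over xi of K((x - xi) / h)."
  [xs h x]
  (/ (reduce + (map #(gaussian-kernel (/ (- x %) h)) xs))
     (* (count xs) h)))

;; Made-up data: two clusters of values, around 5.0 and around 7.0.
(def sample [4.9 5.0 5.1 6.9 7.0 7.1])

;; Estimated density in the gap between the clusters (x = 6.0).
(def narrow-gap (kde-at sample 0.2 6.0)) ; narrow bandwidth: the gap stays near zero
(def wide-gap   (kde-at sample 1.0 6.0)) ; wide bandwidth: the gap is smoothed over
```

With the narrow bandwidth the two clusters stay separate peaks. With the wide one the estimate at 6.0 is substantial, and at this setting the curve merges into a single broad bump centered on the gap.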

Rug

A rug shows the raw data positions as short tick marks along the axis. Layered with a density curve, it shows the smooth shape and the underlying observations together.

(-> (rdatasets/datasets-iris)
    (pj/lay-density :sepal-length)
    pj/lay-rug)
[Figure: density curve of sepal length with rug marks along the axis; x-axis: sepal length]

Boxplot

A boxplot draws the median, the quartiles, whiskers that reach at most 1.5×IQR (interquartile range) beyond the quartiles, and individual outlier points.

(-> (rdatasets/datasets-iris)
    (pj/lay-boxplot :species :sepal-width))
[Figure: boxplots of sepal width by species; x-axis: species (setosa, versicolor, virginica); y-axis: sepal width]

The 1.5×IQR rule can be checked against the plan: each whisker stays within the Tukey fence [Q1 - 1.5*IQR, Q3 + 1.5*IQR], and every outlier falls outside it.

(let [plan (-> (rdatasets/datasets-iris)
               (pj/lay-boxplot :species :sepal-width)
               pj/plan)
      box-layer (first (filter #(= :boxplot (:mark %))
                               (:layers (first (:panels plan)))))]
  (mapv (fn [{:keys [q1 q3 whisker-lo whisker-hi outliers]}]
          (let [iqr (- q3 q1)
                lo-fence (- q1 (* 1.5 iqr))
                hi-fence (+ q3 (* 1.5 iqr))]
            {:whisker-lo-in-fence (>= whisker-lo lo-fence)
             :whisker-hi-in-fence (<= whisker-hi hi-fence)
             :outliers-outside-fence
             (every? (fn [o] (or (< o lo-fence) (> o hi-fence)))
                     outliers)}))
        (:boxes box-layer)))
[{:whisker-lo-in-fence true,
  :whisker-hi-in-fence true,
  :outliers-outside-fence true}
 {:whisker-lo-in-fence true,
  :whisker-hi-in-fence true,
  :outliers-outside-fence true}
 {:whisker-lo-in-fence true,
  :whisker-hi-in-fence true,
  :outliers-outside-fence true}]
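
For intuition about where such numbers come from, the following plain-Clojure sketch derives box statistics from raw values. It is not Plotje's code: the helpers quantile and box-stats are hypothetical, the sample values are made up, and the linear-interpolation quantile rule (the one R calls type 7) is an assumption, since the source does not say which quantile definition Plotje uses.

```clojure
;; Hypothetical helpers, not part of Plotje.
(defn quantile
  "Linear-interpolation quantile (R's type 7) of xs at probability p."
  [xs p]
  (let [v    (vec (sort xs))
        pos  (* p (dec (count v)))          ; fractional index into sorted data
        lo   (long (Math/floor pos))
        hi   (min (inc lo) (dec (count v)))
        frac (- pos lo)]
    (+ (* (- 1 frac) (v lo)) (* frac (v hi)))))

(defn box-stats
  "Quartiles, whiskers, and outliers under the 1.5*IQR Tukey rule."
  [xs]
  (let [q1       (quantile xs 0.25)
        q3       (quantile xs 0.75)
        iqr      (- q3 q1)
        lo-fence (- q1 (* 1.5 iqr))
        hi-fence (+ q3 (* 1.5 iqr))
        inside?  #(<= lo-fence % hi-fence)
        inside   (filter inside? xs)]
    {:q1 q1
     :q3 q3
     ;; whiskers end at the most extreme values still inside the fences
     :whisker-lo (apply min inside)
     :whisker-hi (apply max inside)
     :outliers   (vec (remove inside? xs))}))

;; Made-up sample with one value far above the rest.
(def stats (box-stats [2.0 2.2 2.3 2.5 2.6 2.8 3.0 4.4]))
```

In this sample the upper fence lies between 3.0 and 4.4, so the upper whisker stops at 3.0 and 4.4 is reported as an outlier. The whisker marks a data value, not the fence itself.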

Grouped Boxplot

Side-by-side boxplots colored by a grouping variable.

(-> (rdatasets/reshape2-tips)
    (pj/lay-boxplot :day :total-bill {:color :smoker}))
[Figure: boxplots of total bill by day, colored by smoker; x-axis: day (Sun, Sat, Thur, Fri); y-axis: total bill; legend: smoker (No, Yes)]

Check the dodge grouping: the box layer carries one color category per smoker level, and dodging gives each category a distinct offset.

(let [plan (-> (rdatasets/reshape2-tips)
               (pj/lay-boxplot :day :total-bill {:color :smoker})
               pj/plan)
      panel (first (:panels plan))
      box-layer (first (filter #(= :boxplot (:mark %)) (:layers panel)))
      cats (:color-categories box-layer)]
  (count cats))
2

Horizontal Boxplot

Flip the coordinates with (pj/coord :flip) for a horizontal orientation.

(-> (rdatasets/datasets-iris)
    (pj/lay-boxplot :species :sepal-width)
    (pj/coord :flip))
[Figure: horizontal boxplots of sepal width by species; x-axis: sepal width; y-axis: species (setosa, versicolor, virginica)]

Violin Plot

A violin shows the full density shape per category – more informative than a boxplot for multimodal distributions.

(-> (rdatasets/reshape2-tips)
    (pj/lay-violin :day :total-bill))
[Figure: violin plots of total bill by day; x-axis: day (Sun, Sat, Thur, Fri); y-axis: total bill]

Grouped Violin

Color splits each category into side-by-side violins.

(-> (rdatasets/reshape2-tips)
    (pj/lay-violin :day :total-bill {:color :smoker}))
[Figure: violin plots of total bill by day, colored by smoker; x-axis: day (Sun, Sat, Thur, Fri); y-axis: total bill; legend: smoker (No, Yes)]

Check the dodge grouping: the violin layer carries one color category per smoker level, and dodging gives each category a distinct offset.

(let [plan (-> (rdatasets/reshape2-tips)
               (pj/lay-violin :day :total-bill {:color :smoker})
               pj/plan)
      panel (first (:panels plan))
      viol-layer (first (filter #(= :violin (:mark %)) (:layers panel)))
      cats (:color-categories viol-layer)]
  (count cats))
2

Horizontal Violin

(-> (rdatasets/datasets-iris)
    (pj/lay-violin :species :petal-length)
    (pj/coord :flip))
[Figure: horizontal violin plots of petal length by species; x-axis: petal length; y-axis: species (setosa, versicolor, virginica)]

Ridgeline Plot

Overlapping density curves stacked vertically by category – good for comparing distribution shapes across many groups.

(-> (rdatasets/datasets-iris)
    (pj/lay-ridgeline :species :sepal-length))
[Figure: ridgeline plot of sepal length by species; x-axis: sepal length; y-axis: species (setosa, versicolor, virginica)]

Colored Ridgeline

Map :color to the same categorical column to give each curve its own color.

(-> (rdatasets/datasets-iris)
    (pj/lay-ridgeline :species :sepal-length {:color :species}))
[Figure: ridgeline plot of sepal length by species, colored by species; x-axis: sepal length; y-axis: species (setosa, versicolor, virginica); legend: species]

Comparing Multiple Columns

Pass a vector of column names to pj/lay-histogram (or any lay-* function) to create one panel per column. This is useful for comparing the shape of different variables side by side.

(pj/lay-histogram (rdatasets/datasets-iris) [:sepal-length :sepal-width :petal-length])
[Figure: three histogram panels; panels: sepal length, sepal width, petal length]

Combine with :color to see group differences within each column.

(pj/lay-density (rdatasets/datasets-iris) [:sepal-length :sepal-width :petal-length] {:color :species})
[Figure: three density panels colored by species; panels: sepal length, sepal width, petal length; legend: species (setosa, versicolor, virginica)]

The multi-column vector works with any lay-* function – histograms, density curves, boxplots, violin plots, and more.

What’s Next

  • Ranking – bar charts and lollipop plots for categorical comparisons
  • Faceting – split distributions by groups into separate panels
source: notebooks/plotje_book/distributions.clj