16  Relationships

Scatter plots, regression, smoothing, density estimation, and heatmaps – revealing structure between two variables.

Scatter is the foundation. Each row becomes a point in the plane, and the eye reads structure off the cloud. Regression and smoothing draw trend lines through it; 2D density and contours reveal where the cloud is dense or sparse; the scatter-plot matrix (SPLOM) at the end shows every pair of columns at once.

(ns plotje-book.relationships
  (:require
   ;; Kindly -- notebook rendering protocol
   [scicloj.kindly.v4.kind :as kind]
   ;; Rdatasets -- standard datasets
   [scicloj.metamorph.ml.rdatasets :as rdatasets]
   ;; Plotje -- composable plotting
   [scicloj.plotje.api :as pj]
   ;; Fastmath -- random number generation
   [fastmath.random :as rng]))

Basic Scatter

Sepal length against sepal width with no color mapping: every point uses the default mark.

(-> (rdatasets/datasets-iris)
    (pj/lay-point :sepal-length :sepal-width))
[Figure: scatter plot; x: sepal length, y: sepal width]

Colored by Species

Adding :color :species groups points by species with distinct colors.

(-> (rdatasets/datasets-iris)
    (pj/lay-point :sepal-length :sepal-width {:color :species}))
[Figure: scatter plot; x: sepal length, y: sepal width; color: species (setosa, versicolor, virginica)]

Petal Dimensions

Petal length vs width – a strongly correlated pair, set up here as the running example for the regression sections below.

(-> (rdatasets/datasets-iris)
    (pj/lay-point :petal-length :petal-width {:color :species}))
[Figure: scatter plot; x: petal length, y: petal width; color: species (setosa, versicolor, virginica)]

Linear Regression

A single regression line through all data.

(-> (rdatasets/datasets-iris)
    (pj/lay-point :sepal-length :sepal-width)
    (pj/lay-smooth {:stat :linear-model}))
[Figure: scatter plot with one regression line; x: sepal length, y: sepal width]
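For intuition about what a linear-model fit computes, here is a minimal sketch of ordinary least squares in plain Clojure. It is not Plotje's implementation: `fit-line` is a hypothetical helper, and the data are made up so the result is easy to check by hand.

```clojure
;; Ordinary least squares for y = intercept + slope * x.
;; `fit-line` is a hypothetical helper for illustration, not part of Plotje.
(defn fit-line
  [xs ys]
  (let [n      (double (count xs))
        mean-x (/ (reduce + xs) n)
        mean-y (/ (reduce + ys) n)
        ;; sum of cross-deviations and sum of squared x-deviations
        sxy    (reduce + (map (fn [x y] (* (- x mean-x) (- y mean-y))) xs ys))
        sxx    (reduce + (map (fn [x] (* (- x mean-x) (- x mean-x))) xs))
        slope  (/ sxy sxx)]
    {:slope     slope
     :intercept (- mean-y (* slope mean-x))}))

;; Points that lie exactly on y = 1 + 2x
(fit-line [1 2 3 4] [3 5 7 9])
;; => {:slope 2.0, :intercept 1.0}
```

The slope is the ratio of the x-y co-variation to the x variation, and the intercept forces the line through the point of means.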

Per-Group Regression

Setting :color :species in pj/pose groups both layers by species, so pj/lay-smooth fits one regression line per group.

(-> (rdatasets/datasets-iris)
    (pj/pose :petal-length :petal-width {:color :species})
    pj/lay-point
    (pj/lay-smooth {:stat :linear-model}))
[Figure: scatter plot with one regression line per species; x: petal length, y: petal width; color: species (setosa, versicolor, virginica)]

Regression with Confidence Ribbon

Pass {:confidence-band true} to show a 95% confidence band around the line.

(-> (rdatasets/datasets-iris)
    (pj/pose :sepal-length :sepal-width {:color :species})
    pj/lay-point
    (pj/lay-smooth {:stat :linear-model :confidence-band true}))
[Figure: scatter plot with per-species regression lines and confidence bands; x: sepal length, y: sepal width; color: species (setosa, versicolor, virginica)]

Pass :level to widen or narrow the band. A 99% band is wider than the default 95% band because it must cover more of the regression’s uncertainty; an 80% band is narrower:

(-> (rdatasets/datasets-iris)
    (pj/pose :sepal-length :sepal-width)
    pj/lay-point
    (pj/lay-smooth {:stat :linear-model :confidence-band true :level 0.80}))
[Figure: scatter plot with regression line and 80% confidence band; x: sepal length, y: sepal width]

(-> (rdatasets/datasets-iris)
    (pj/pose :sepal-length :sepal-width)
    pj/lay-point
    (pj/lay-smooth {:stat :linear-model :confidence-band true :level 0.99}))
[Figure: scatter plot with regression line and 99% confidence band; x: sepal length, y: sepal width]

Tips with Regression

Do smokers and non-smokers tip differently?

(-> (rdatasets/reshape2-tips)
    (pj/pose :total-bill :tip {:color :smoker})
    pj/lay-point
    (pj/lay-smooth {:stat :linear-model}))
[Figure: scatter plot with one regression line per group; x: total bill, y: tip; color: smoker (No, Yes)]

LOESS Smoothing

A smooth curve through noisy data: here, a sine wave with a small amount of uniform noise added.

(-> (let [r (rng/rng :jdk 42)
          xs (vec (range 50))]
      {:x xs
       :y (mapv #(+ (Math/sin (* % 0.2))
                    (* 0.3 (- (rng/drandom r) 0.5)))
                xs)})
    (pj/lay-point :x :y)
    (pj/lay-smooth {:bandwidth 0.2}))
[Figure: scatter plot with smoothed curve; x: x, y: y]
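The core step of LOESS is a weighted linear fit over the points nearest to each evaluation position. The sketch below shows that step in plain Clojure. `loess-at` and its `span` argument (the fraction of points used per local fit) are hypothetical names, and whether Plotje's :bandwidth corresponds to `span` is an assumption.

```clojure
;; Local linear regression at x0 with tricube weights (one LOESS evaluation).
;; `loess-at` and `span` are hypothetical; this is not Plotje's implementation.
(defn loess-at
  [xs ys span x0]
  (let [k     (max 2 (long (Math/ceil (* span (count xs)))))
        ;; the k nearest neighbours of x0, as [distance x y]
        near  (take k (sort-by first
                               (map (fn [x y] [(Math/abs (double (- x x0))) x y])
                                    xs ys)))
        dmax  (first (last near))
        ;; tricube weight: 1 at x0, falling to 0 at the farthest neighbour
        wt    (fn [d]
                (if (zero? dmax)
                  1.0
                  (let [u (/ d dmax)]
                    (Math/pow (- 1.0 (* u u u)) 3.0))))
        ws    (map (comp wt first) near)
        nx    (map second near)
        ny    (map #(nth % 2) near)
        sw    (reduce + ws)
        mx    (/ (reduce + (map * ws nx)) sw)   ; weighted mean of x
        my    (/ (reduce + (map * ws ny)) sw)   ; weighted mean of y
        sxy   (reduce + (map (fn [w x y] (* w (- x mx) (- y my))) ws nx ny))
        sxx   (reduce + (map (fn [w x] (* w (- x mx) (- x mx))) ws nx))
        slope (if (zero? sxx) 0.0 (/ sxy sxx))]
    ;; value of the local line at x0
    (+ my (* slope (- x0 mx)))))

;; On exactly linear data (y = 1 + 2x) the local fit recovers the line,
;; so the result at x0 = 4.5 equals 10.0 up to floating-point rounding.
(let [xs (range 10)
      ys (map #(+ 1 (* 2 %)) xs)]
  (loess-at xs ys 0.5 4.5))
```

Evaluating `loess-at` on a grid of x positions and joining the results gives the smoothed curve; a smaller `span` follows the data more closely, a larger one smooths more.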

Heatmap (Auto-Binned)

Bin x and y into a grid and count the points in each cell.

(-> (rdatasets/datasets-iris)
    (pj/lay-tile :sepal-length :sepal-width))
[Figure: heatmap; x: sepal length, y: sepal width; fill: count]
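The bin-and-count step can be sketched in a few lines of plain Clojure. `bin-counts` and the fixed cell width are illustrative assumptions; Plotje chooses its own bins.

```clojure
;; Count points per grid cell.
;; `bin-counts` is a hypothetical helper; `width` is the side of a square cell.
(defn bin-counts
  [points width]
  (frequencies
   (for [[x y] points]
     ;; integer cell index along each axis
     [(long (Math/floor (double (/ x width))))
      (long (Math/floor (double (/ y width))))])))

(bin-counts [[5.1 3.5] [5.3 3.4] [6.2 2.9] [5.0 3.6]] 0.5)
;; => {[10 7] 2, [10 6] 1, [12 5] 1}
```

Each key is a cell index and each value is the count that a heatmap would map to tile color.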

Heatmap (Pre-Computed)

When the cell values are already computed, map a numeric column to tile color with :fill.

(def grid-data
  (let [r (rng/rng :jdk 99)]
    {:x (for [i (range 5) _j (range 5)] i)
     :y (for [_i (range 5) j (range 5)] j)
     :value (repeatedly 25 #(rng/irandom r 100))}))
(-> grid-data
    (pj/lay-tile :x :y {:fill :value}))
[Figure: 5x5 tile grid; x: x, y: y; legend: fill]

Density 2D

KDE-smoothed 2D density heatmap.

(-> (rdatasets/datasets-iris)
    (pj/lay-density-2d :sepal-length :sepal-width))
[Figure: 2D density heatmap; x: sepal length, y: sepal width; fill: relative density]
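As a rough illustration of kernel density estimation in two dimensions, the sketch below evaluates a Gaussian product-kernel estimate at a single location. `kde-2d`, `gaussian`, and the bandwidths `hx` and `hy` are hypothetical; Plotje's kernel and bandwidth selection may differ.

```clojure
;; Standard normal density, used as the kernel.
(defn gaussian [u]
  (/ (Math/exp (* -0.5 u u))
     (Math/sqrt (* 2.0 Math/PI))))

;; Gaussian product-kernel density estimate at (x, y).
;; `kde-2d`, `hx`, and `hy` are illustrative assumptions, not Plotje's API.
(defn kde-2d
  [points hx hy x y]
  (/ (reduce + (for [[px py] points]
                 (* (gaussian (/ (- x px) hx))
                    (gaussian (/ (- y py) hy)))))
     (* (count points) hx hy)))

(def pts [[5.0 3.5] [5.2 3.4] [6.5 3.0]])

;; The estimate is higher near the cluster than far from every point.
(> (kde-2d pts 0.5 0.5 5.1 3.45)
   (kde-2d pts 0.5 0.5 8.0 2.0))
;; => true
```

Evaluating the estimate over a grid of (x, y) locations yields the surface that a density heatmap colors and that contour lines slice at fixed heights.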

Density 2D with Points

Overlay scatter points on the density heatmap.

(-> (rdatasets/datasets-iris)
    (pj/lay-density-2d :sepal-length :sepal-width)
    (pj/lay-point {:alpha 0.5}))
[Figure: 2D density heatmap with scatter points overlaid; x: sepal length, y: sepal width; fill: relative density]

Contour Lines

Iso-density contour lines from 2D KDE.

(-> (rdatasets/datasets-iris)
    (pj/lay-contour :sepal-length :sepal-width))
[Figure: iso-density contour lines; x: sepal length, y: sepal width; legend: relative density]

Contour with Points

Contour lines overlaid on scatter points.

(-> (rdatasets/datasets-iris)
    (pj/lay-point :sepal-length :sepal-width {:alpha 0.3})
    (pj/lay-contour {:levels 8}))
[Figure: scatter points with contour lines overlaid; x: sepal length, y: sepal width; legend: relative density]

Scatter Plot Matrix (SPLOM)

pj/cross generates every pairing of items from two lists. Passing column names produces a grid of scatter plots – one per pair of variables. Same-column pairs fall on the diagonal and are inferred as histograms.

Start small: two variables crossed with themselves give a 2x2 grid. Off-diagonal cells (where the row and column variables differ) get scatter plots; diagonal cells (where they match) get histograms.

(def small-cols [:sepal-length :petal-length])
(-> (rdatasets/datasets-iris)
    (pj/pose (pj/cross small-cols small-cols) {:color :species}))
[Figure: 2x2 scatter-plot matrix of sepal-length and petal-length; histograms on the diagonal; color: species (setosa, versicolor, virginica)]
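The pairing behind the grid can be sketched with a plain `for` comprehension. `cross-sketch` is a hypothetical stand-in, and the exact shape of the value pj/cross returns is an assumption here.

```clojure
;; All ordered pairs from two sequences (their Cartesian product).
;; `cross-sketch` is a hypothetical helper, not part of Plotje.
(defn cross-sketch [as bs]
  (vec (for [a as, b bs] [a b])))

(def pairs
  (cross-sketch [:sepal-length :petal-length]
                [:sepal-length :petal-length]))

pairs
;; => [[:sepal-length :sepal-length] [:sepal-length :petal-length]
;;     [:petal-length :sepal-length] [:petal-length :petal-length]]

;; Diagonal cells pair a column with itself.
(filter (fn [[a b]] (= a b)) pairs)
;; => ([:sepal-length :sepal-length] [:petal-length :petal-length])
```

Two columns give four cells, two of them on the diagonal; four columns give sixteen cells with four on the diagonal.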

The full 4x4 SPLOM follows the same pattern with iris’s four numeric columns:

(def cols [:sepal-length :sepal-width :petal-length :petal-width])
(-> (rdatasets/datasets-iris)
    (pj/pose (pj/cross cols cols) {:color :species}))
[Figure: 4x4 scatter-plot matrix of sepal-length, sepal-width, petal-length, petal-width; histograms on the diagonal; color: species (setosa, versicolor, virginica)]

Layer types are inferred per cell, exactly as in the 2x2 grid: histograms where x = y, scatter plots elsewhere. All panels share the color aesthetic set at the composite root.

See the Faceting chapter for more SPLOM variations, and the Customization chapter for brush selection.

See Also

  • Composition – composite poses (the SPLOM is one) and shared scales
  • Distributions – one-variable shape and spread

source: notebooks/plotje_book/relationships.clj