16 Intro to data visualization with Tableplot

This tutorial will guide us through an exploration of the classic Iris dataset using the Tableplot library in Clojure. We will demonstrate how to use Tableplot’s Plotly API to create various visualizations, while explaining the core ideas and functionality of the API.

16.1 Setup

(ns noj-book.tableplot-datavis-intro
  (:require [scicloj.tableplot.v1.plotly :as plotly]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [scicloj.metamorph.ml.rdatasets :as rdatasets]))

16.2 Introduction

Tableplot is a Clojure library for creating data visualizations using a functional grammar inspired by ggplot2 and the layered grammar of graphics. It allows for composable plots, where layers can be built up incrementally and data transformations can be seamlessly integrated.

In this tutorial, we will:

Inspect the Iris dataset using Tablecloth.
Create various types of plots using Tableplot’s Plotly API.
Explore the relationships between different variables in the dataset.
Demonstrate how to customize plots and use different features of the API.

16.3 Looking into the Iris Dataset

First, let’s look into the Iris dataset using the scicloj.metamorph.ml.rdatasets namespace, that allows us to fetch data from the Rdatasets collection.

(rdatasets/datasets-iris)

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

:rownames	:sepal-length	:sepal-width	:petal-length	:petal-width	:species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
…	…	…	…	…	…
140	6.9	3.1	5.4	2.1	virginica
141	6.7	3.1	5.6	2.4	virginica
142	6.9	3.1	5.1	2.3	virginica
143	5.8	2.7	5.1	1.9	virginica
144	6.8	3.2	5.9	2.3	virginica
145	6.7	3.3	5.7	2.5	virginica
146	6.7	3.0	5.2	2.3	virginica
147	6.3	2.5	5.0	1.9	virginica
148	6.5	3.0	5.2	2.0	virginica
149	6.2	3.4	5.4	2.3	virginica
150	5.9	3.0	5.1	1.8	virginica

The Iris dataset contains measurements for 150 iris flowers from three species (setosa, versicolor, virginica). The variables are:

sepal-length: Length of the sepal (cm)
sepal-width: Width of the sepal (cm)
petal-length: Length of the petal (cm)
petal-width: Width of the petal (cm)
species: Species of the iris flower

16.4 Scatter Plot

Let’s start by creating a simple scatter plot to visualize the relationship between sepal-length and sepal-width.

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :sepal-length
      :=y :sepal-width}))

This plot shows the distribution of sepal length and width for the flowers in the dataset.

16.4.1 Adding Color by Species

To distinguish between the different species, we can add color encoding based on the species column.

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :sepal-length
      :=y :sepal-width
      :=color :species}))

Now, each species is represented by a different color, making it easier to see any patterns or differences between them.

16.5 Exploring Petal Measurements

Next, let’s explore how petal measurements vary across species.

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :petal-length
      :=y :petal-width
      :=color :species}))

This plot shows a clearer separation between species based on petal measurements compared to sepal measurements.

16.5.1 Text plots

In certain cases, it might be useful to label the datapoints

(-> (rdatasets/datasets-iris)
    (tc/map-columns :species-short [:species] #(subs % 0 2))
    (plotly/layer-text
     {:=x :petal-length
      :=y :petal-width
      :=text :species-short}))

16.6 Combining Sepal and Petal Measurements

We can create a scatter plot matrix (SPLOM) to visualize the relationships between all pairs of variables.

(-> (rdatasets/datasets-iris)
    (plotly/splom
     {:=colnames [:sepal-length :sepal-width :petal-length :petal-width]
      :=color :species
      :=height 600
      :=width 600}))

The SPLOM shows pairwise scatter plots for all combinations of the selected variables, with points colored by species.

16.7 Histograms

Let’s create histograms to explore the distribution of sepal-length.

(-> (rdatasets/datasets-iris)
    (plotly/layer-histogram
     {:=x :sepal-length
      :=histnorm "count"
      :=histogram-nbins 20}))

16.7.1 Histograms by Species

To see how the distribution of sepal-length varies by species, we can add color encoding.

(-> (rdatasets/datasets-iris)
    (plotly/layer-histogram
     {:=x :sepal-length
      :=color :species
      :=histnorm "count"
      :=histogram-nbins 20
      :=mark-opacity 0.7}))

16.8 Density Plots

Another way to visualize the distribution of a variable is with a density plot. These can also be colored by species.

(-> (rdatasets/datasets-iris)
    (plotly/layer-density
     {:=x :sepal-length
      :=color :species}))

16.9 Bar Charts

(-> (rdatasets/datasets-iris)
    (tc/group-by [:species])
    (tc/aggregate {:mean-width #(tcc/mean (:sepal-width %))})
    (plotly/layer-bar
     {:=x :species
      :=y :mean-width}))

16.10 Box Plots

Box plots are useful for comparing distributions across categories.

(-> (rdatasets/datasets-iris)
    (plotly/layer-boxplot
     {:=y :sepal-length
      :=x :species}))

This box plot shows the distribution of sepal-length for each species.

16.11 Violin Plots

Violin plots provide a richer representation of the distribution.

(-> (rdatasets/datasets-iris)
    (plotly/layer-violin
     {:=y :sepal-length
      :=x :species
      :=box-visible true
      :=meanline-visible true}))

16.12 Scatter Plot with Trend Lines

We can add a smoothing layer to show trend lines in the data.

(-> (rdatasets/datasets-iris)
    (plotly/base
     {:=x :sepal-length
      :=y :sepal-width
      :=color :species})
    plotly/layer-point
    plotly/layer-smooth)

This plot shows a scatter plot of sepal measurements with trend lines added for each species.

16.13 Customizing Plots

Tableplot allows for customization of plot aesthetics.

16.13.1 Changing Marker Sizes

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :sepal-length
      :=y :sepal-width
      :=color :species
      :=symbol :species
      :=mark-size 15}))

16.13.2 Changing Marker Color (for all marks)

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :sepal-length
      :=y :sepal-width
      :=symbol :species
      :=mark-size 15
      :=mark-color :darkblue}))

16.13.3 Adjusting Opacity

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :sepal-length
      :=y :sepal-width
      :=color :species
      :=mark-size 15
      :=mark-opacity 0.6}))

16.13.4 Changing axis titles

If you desire different axis titles than the variable names, those can be changed as well:

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :sepal-length
      :=y :sepal-width
      :=color :species
      :=mark-size 15
      :=mark-opacity 0.6
      :=x-title "Sepal length"
      :=y-title "Sepal width"}))

16.14 3d Scatter Plot

We can create a 3d scatter plot to visualize relationships in three dimensions.

(-> (rdatasets/datasets-iris)
    (plotly/layer-point
     {:=x :sepal-length
      :=y :sepal-width
      :=z :petal-length
      :=color :species
      :=coordinates :3d
      :=mark-size 5}))

16.15 Conclusion

In this tutorial, we have explored the Iris dataset using the Tableplot library in Clojure. We demonstrated how to create various types of plots, customize them, and explore relationships in the data.

Tableplot’s API is designed to be intuitive and flexible, allowing for the creation of complex plots with simple, composable functions.

For more information and advanced usage, refer to the Tableplot documentation.

16.16 Appendix: Understanding the Tableplot API

The core idea of the Tableplot API is to build plots by composing layers. Each layer corresponds to a visual representation of data, such as points, lines, bars, etc.

16.16.1 Basic Functions

plotly/base: Specifies common parameters of layers.
plotly/layer-point: Adds a scatter plot layer with points.
plotly/layer-line: Adds a line plot layer.
plotly/layer-bar: Adds a bar plot layer.
plotly/layer-boxplot: Adds a box plot layer.
plotly/layer-violin: Adds a violin plot layer.
plotly/layer-histogram: Adds a histogram layer.
plotly/layer-density: Adds a density plot layer.
plotly/layer-smooth: Adds a smoothing layer (trend line).
plotly/splom: Creates a scatter plot matrix (SPLOM).

16.16.2 Parameters

Parameters are provided as a map, with keys prefixed by := to distinguish them from dataset columns.

:=x: The x-axis variable.
:=y: The y-axis variable.
:=z: The z-axis variable (for 3D plots).
:=color: Variable used to color the data points.
:=symbol: Variable used to determine marker symbols.
:=mark-opacity: Opacity of the markers.
:=mark-size: Size of the markers.
:=mark-color: Color of the markers.
:=histogram-nbins: Number of bins in the x-axis for histograms.
:=box-visible: Whether to show box plot inside violin plots.
:=meanline-visible: Whether to show mean line in violin plots.
:=x-title: The title of the x-axis.
:=y-title: The title of the y-axis.

For a complete list of parameters, see the Plotly API reference

16.16.3 Composing Plots

Plots are built by starting with a dataset and chaining layer functions.

(comment
  (-> dataset
      (plotly/layer-point
       {:=x :x-variable
        :=y :y-variable})))

Multiple layers can be added to create more complex plots, sharing parameters defined in base.

(comment
  (-> dataset
      (plotly/base
       {:=x :x-variable
        :=y :y-variable})
      (plotly/layer-point {... ...})
      (plotly/layer-smooth {... ...})))

16.17 References

source: notebooks/noj_book/tableplot_datavis_intro.clj