3  Datasets

You do not need to know about datasets to plot with Plotje: you can pass plain Clojure data (a map of columns or a vector of row maps) directly. Still, understanding datasets is recommended background.

This chapter gives a brief introduction. For full documentation, see the Tablecloth and tech.ml.dataset docs.

(ns plotje-book.datasets
  (:require
   ;; Tablecloth -- dataset manipulation
   [tablecloth.api :as tc]
   ;; Kindly -- notebook rendering protocol
   [scicloj.kindly.v4.kind :as kind]
   ;; Plotje -- composable plotting
   [scicloj.plotje.api :as pj]
   ;; Rdatasets -- standard datasets
   [scicloj.metamorph.ml.rdatasets :as rdatasets]))

Plain Data Works

Plotje accepts plain Clojure data – a map of columns or a vector of row maps. No dataset wrapping needed:

(-> [{:month "Jan" :temperature 5}
     {:month "Feb" :temperature 7}
     {:month "Mar" :temperature 12}
     {:month "Apr" :temperature 16}]
    (pj/lay-line :month :temperature)
    pj/lay-point)
[Plot: line with points; x axis "month" (Jan, Feb, Mar, Apr), y axis "temperature" (ticks from 6 to 16)]

This is all you need for quick plots. The rest of this chapter covers datasets, which become useful as your data grows.
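The example above uses a vector of row maps. The other plain shape, a map of columns, might look like the sketch below. The names `rows` and `columns` are illustrative only; the round trip through `tc/dataset` and `tc/rows` is just a way to check that both shapes describe the same table.

```clojure
(require '[tablecloth.api :as tc]
         '[scicloj.plotje.api :as pj])

;; Row-oriented: one map per observation.
(def rows
  [{:month "Jan" :temperature 5}
   {:month "Feb" :temperature 7}
   {:month "Mar" :temperature 12}
   {:month "Apr" :temperature 16}])

;; Column-oriented: one vector per variable (same data as `rows`).
(def columns
  {:month       ["Jan" "Feb" "Mar" "Apr"]
   :temperature [5 7 12 16]})

;; Convert the column map to a dataset and read it back as row maps.
(def round-trip
  (mapv #(into {} %) (tc/rows (tc/dataset columns) :as-maps)))

;; Plot from the column-oriented shape.
(-> columns
    (pj/lay-line :month :temperature)
    pj/lay-point)
```

Row maps tend to be convenient for hand-written data; column maps are closer to how a dataset stores its values.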

What Is a Dataset?

A dataset is a columnar table backed by efficient typed arrays. It is the Clojure equivalent of an R data frame or a Python pandas DataFrame.

The core implementation is tech.ml.dataset. Tablecloth is a higher-level wrapper with a more ergonomic API. Plotje uses Tablecloth internally and in its documentation.
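To make "columnar" concrete, here is a minimal sketch of looking at the same dataset by column and by row. The name `ds` is illustrative; `tc/info` is used only to show that each column carries its own datatype.

```clojure
(require '[tablecloth.api :as tc])

(def ds
  (tc/dataset {:x [1 2 3 4 5]
               :y [10 20 15 30 25]}))

;; A dataset can be called like a function of the column name.
(def y-values (vec (ds :y)))
y-values ;; => [10 20 15 30 25]

;; The same data viewed row by row, as maps.
(def first-row (into {} (first (tc/rows ds :as-maps))))
first-row ;; => {:x 1, :y 10}

;; Per-column summary, including a :datatype entry for each column.
(tc/info ds)
```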

Creating Datasets

From a map of columns

(tc/dataset {:x [1 2 3 4 5]
             :y [10 20 15 30 25]})

_unnamed [5 2]:

:x :y
1 10
2 20
3 15
4 30
5 25

From a vector of row maps

(tc/dataset [{:name "Alice" :score 92}
             {:name "Bob"   :score 85}
             {:name "Carol" :score 97}])

_unnamed [3 2]:

:name :score
Alice 92
Bob 85
Carol 97

From a sequence of row vectors

(tc/dataset [["Alice" 92]
             ["Bob"   85]
             ["Carol" 97]]
            {:column-names [:name :score]})

:_unnamed [3 2]:

:name :score
Alice 92
Bob 85
Carol 97

From a CSV or URL

(tc/dataset "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"
            {:key-fn keyword})

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

:rownames :Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
… … … … … …
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica

(The :key-fn keyword option converts CSV string headers like "Sepal.Length" to keywords like :Sepal.Length. Without it, column names remain strings.)
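The effect of :key-fn can be checked without a network connection. The sketch below writes a two-row CSV to a temporary file (the file name and contents are made up for illustration) and loads it with and without the option.

```clojure
(require '[tablecloth.api :as tc])

;; A throwaway CSV file, removed when the JVM exits.
(def csv-file
  (doto (java.io.File/createTempFile "key-fn-demo" ".csv")
    (.deleteOnExit)))

(spit csv-file "Sepal.Length,Species\n5.1,setosa\n4.9,setosa\n")

;; Default: column names stay as strings.
(def with-strings (tc/dataset (.getPath csv-file)))

;; With :key-fn, each header string is passed through `keyword`.
(def with-keywords (tc/dataset (.getPath csv-file) {:key-fn keyword}))

(tc/column-names with-strings)  ;; => ("Sepal.Length" "Species")
(tc/column-names with-keywords) ;; => (:Sepal.Length :Species)
```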

The RDatasets collection

Many examples in this book use datasets from the RDatasets collection – over 2,300 datasets from R packages, available as CSV files.

The Clojure bridge is provided by the metamorph.ml library, which you can add as a direct dependency from Clojars.

Or use the Noj toolkit, which includes it along with other data science libraries.

Each dataset has a memoized accessor function. The first call fetches the CSV from the web; subsequent calls return the cached dataset instantly:

(rdatasets/datasets-iris)

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

:rownames :sepal-length :sepal-width :petal-length :petal-width :species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
… … … … … …
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica

Column names are kebab-case keywords (:sepal-length, not Sepal.Length).
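If you load one of these CSV files yourself and want similar column names, a key function along the following lines could be passed as :key-fn. This is a sketch under assumptions: `kebab-keyword` is a hypothetical helper written for this example, not the function metamorph.ml uses, and it may not reproduce that library's naming in every case.

```clojure
(require '[clojure.string :as str])

(defn kebab-keyword
  "Hypothetical helper: turns a header such as \"Sepal.Length\"
  into a kebab-case keyword such as :sepal-length."
  [header]
  (-> header
      (str/replace #"[._\s]+" "-") ; dots, underscores, spaces -> hyphen
      str/lower-case
      keyword))

(kebab-keyword "Sepal.Length") ;; => :sepal-length
(kebab-keyword "Species")      ;; => :species
```

It would be used as `{:key-fn kebab-keyword}` in the options map of `tc/dataset`.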

A few datasets used throughout this book:

(kind/table
 {:column-names ["Function" "Rows" "Description"]
  :row-maps
  (let [mpg (rdatasets/ggplot2-mpg)]
    [{"Function" (kind/code "rdatasets/datasets-iris")
      "Rows" (tc/row-count (rdatasets/datasets-iris))
      "Description" "Iris flower measurements by species"}
     {"Function" (kind/code "rdatasets/reshape2-tips")
      "Rows" (tc/row-count (rdatasets/reshape2-tips))
      "Description" "Restaurant tips with bill, day, time, smoker"}
     {"Function" (kind/code "rdatasets/ggplot2-mpg")
      "Rows" (tc/row-count mpg)
      "Description" (str "Fuel economy for "
                         (count (distinct (mpg :model)))
                         " car models")}
     {"Function" (kind/code "rdatasets/ggplot2-diamonds")
      "Rows" (tc/row-count (rdatasets/ggplot2-diamonds))
      "Description" "Diamond price, carat, cut, color, clarity"}
     {"Function" (kind/code "rdatasets/gapminder-gapminder")
      "Rows" (tc/row-count (rdatasets/gapminder-gapminder))
      "Description" "Country-level life expectancy and GDP"}
     {"Function" (kind/code "rdatasets/datasets-mtcars")
      "Rows" (tc/row-count (rdatasets/datasets-mtcars))
      "Description" "Motor Trend car road tests"}])})
Function                       Rows   Description
rdatasets/datasets-iris        150    Iris flower measurements by species
rdatasets/reshape2-tips        244    Restaurant tips with bill, day, time, smoker
rdatasets/ggplot2-mpg          234    Fuel economy for 38 car models
rdatasets/ggplot2-diamonds     53940  Diamond price, carat, cut, color, clarity
rdatasets/gapminder-gapminder  1704   Country-level life expectancy and GDP
rdatasets/datasets-mtcars      32     Motor Trend car road tests

Useful Tablecloth operations

The examples in this book use a handful of Tablecloth functions. Here is a quick reference:

tc/head – first N rows:

(tc/head (rdatasets/datasets-iris) 3)

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [3 6]:

:rownames :sepal-length :sepal-width :petal-length :petal-width :species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa

tc/select-rows – filter rows by predicate:

(-> (rdatasets/datasets-iris)
    (tc/select-rows #(= "setosa" (:species %))))

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [50 6]:

:rownames :sepal-length :sepal-width :petal-length :petal-width :species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
… … … … … …
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa

tc/group-by and tc/aggregate – split and summarize:

(-> (rdatasets/datasets-iris)
    (tc/group-by [:species])
    (tc/aggregate {:mean-sl (fn [ds] (/ (reduce + (ds :sepal-length))
                                        (tc/row-count ds)))}))

_unnamed [3 2]:

:species :mean-sl
setosa 5.006
versicolor 5.936
virginica 6.588
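The mean above is computed by hand with reduce. As a variation, the sketch below uses `mean` from tech.v3.datatype.functional (part of dtype-next, the library underneath tech.ml.dataset) on a small in-memory dataset. The dataset, the column names, and the name `group-means` are made up for illustration.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.functional :as dfn])

(def group-means
  (-> (tc/dataset {:species ["a" "a" "b"]
                   :len     [1.0 3.0 5.0]})
      (tc/group-by [:species])
      ;; Each aggregation function receives one group as a dataset.
      (tc/aggregate {:mean-len (fn [ds] (dfn/mean (ds :len)))})))

;; One row per group: "a" has mean 2.0, "b" has mean 5.0.
(tc/rows group-means :as-maps)
```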

tc/order-by – sort rows:

(-> (rdatasets/datasets-mtcars)
    (tc/order-by [:mpg] :desc)
    (tc/head 3))

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv [3 12]:

:rownames       :mpg  :cyl  :disp  :hp  :drat  :wt    :qsec  :vs  :am  :gear  :carb
Toyota Corolla  33.9  4     71.1   65   4.22   1.835  19.90  1    1    4      1
Fiat 128        32.4  4     78.7   66   4.08   2.200  19.47  1    1    4      1
Lotus Europa    30.4  4     95.1   113  3.77   1.513  16.90  1    1    5      2

tc/column-names – list columns:

(tc/column-names (rdatasets/datasets-iris))
(:rownames
 :sepal-length
 :sepal-width
 :petal-length
 :petal-width
 :species)

tc/row-count – number of rows:

(tc/row-count (rdatasets/ggplot2-diamonds))
53940
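These functions compose with the threading macro. The sketch below chains several of them on a small made-up dataset; the names `scores` and `top-red` and the data itself are illustrative only.

```clojure
(require '[tablecloth.api :as tc])

(def scores
  (tc/dataset [{:name "Alice" :team "red"  :score 92}
               {:name "Bob"   :team "blue" :score 85}
               {:name "Carol" :team "red"  :score 97}
               {:name "Dave"  :team "blue" :score 78}]))

;; Keep the red team, sort by score (highest first), take the top row.
(def top-red
  (-> scores
      (tc/select-rows #(= "red" (:team %)))
      (tc/order-by [:score] :desc)
      (tc/head 1)))

(tc/row-count top-red)  ;; => 1
(first (top-red :name)) ;; => "Carol"
```

Each step returns a new dataset, so intermediate results can be inspected by cutting the chain short.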

Datasets and Plotje

When you pass plain data to Plotje, it is coerced to a dataset internally. You can also pass a dataset directly – the result is the same:

Plain data:

(-> {:x [1 2 3] :y [4 5 6]}
    (pj/lay-point :x :y))
[Plot: scatter of three points; x axis "x" (1.0 to 3.0), y axis "y" (4.0 to 6.0)]

Dataset:

(-> (tc/dataset {:x [1 2 3] :y [4 5 6]})
    (pj/lay-point :x :y))
[Plot: scatter of three points; x axis "x" (1.0 to 3.0), y axis "y" (4.0 to 6.0)]

Both produce the same plot. Use whichever is more convenient for your workflow.
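One practical reason to start from a dataset is that you can transform it with Tablecloth before plotting. A minimal sketch, using only calls shown in this chapter and a made-up dataset (`measurements` and `group-a` are illustrative names):

```clojure
(require '[tablecloth.api :as tc]
         '[scicloj.plotje.api :as pj])

(def measurements
  (tc/dataset {:x     [1 2 3 4 5 6]
               :y     [4 5 6 7 8 9]
               :group ["a" "a" "a" "b" "b" "b"]}))

;; Filter with Tablecloth first...
(def group-a
  (tc/select-rows measurements #(= "a" (:group %))))

;; ...then hand the resulting dataset to Plotje.
(pj/lay-point group-a :x :y)
```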

What’s Next

  • Pose Model – how Plotje composes layers, aesthetics, and layer types
  • Quickstart – if you skipped straight here, go back and build your first plots
source: notebooks/plotje_book/datasets.clj