3 Datasets

You do not need to know about datasets to plot with Plotje – you can pass plain Clojure data (maps, vectors of maps) directly. But understanding datasets is recommended background for four reasons:

Performance: datasets are columnar and backed by typed arrays. For large data (thousands of rows and above), they are significantly faster than plain Clojure maps and vectors.
Ergonomics: many people find that dataset operations – filtering, grouping, aggregation – read more naturally as a pipeline than the equivalent Clojure core code. This is a matter of taste, but the convention is widespread in the Clojure data science ecosystem.
Column types matter for plotting: dataset columns carry type information (numeric, categorical, temporal) that Plotje uses to choose scales, axis formatting, and statistical transforms. A column of doubles gets a continuous axis; a column of strings gets a categorical axis. When you pass plain Clojure data, types are inferred on coercion – sometimes not the way you expect. Working with datasets gives you explicit control.
Understanding Plotje internals: Plotje coerces your data to a dataset internally. Knowing what a dataset is helps you understand column names, types, and the inference rules.
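To see which type each column actually received, you can read the datatype from the column metadata. The following is a minimal sketch, assuming Tablecloth is on the classpath; the dataset, its column names, and the helper `column-datatypes` are made up for illustration, and the way Plotje maps each datatype to a scale is not shown here:

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical data: :year-str holds numbers stored as strings, a common
;; surprise when values arrive from CSV-like sources.
(def typed-example
  (tc/dataset {:price    [1.5 2.25 3.0]
               :quantity [1 2 3]
               :label    ["a" "b" "c"]
               :year-str ["2021" "2022" "2023"]}))

(defn column-datatypes
  "Hypothetical helper: returns a map from column name to datatype keyword,
  read from each column's metadata."
  [ds]
  (into {}
        (map (fn [col-name]
               [col-name (-> ds (get col-name) meta :datatype)]))
        (tc/column-names ds)))

(column-datatypes typed-example)

;; Tablecloth can also print the same information as a table:
(tc/info typed-example :columns)
```

In this sketch :year-str is a string column, so by the rule above it would get a categorical axis even though its values look numeric.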

This chapter gives a brief introduction. For full documentation, see the Tablecloth and tech.ml.dataset docs.

(ns plotje-book.datasets
  (:require
   ;; Tablecloth -- dataset manipulation
   [tablecloth.api :as tc]
   ;; Kindly -- notebook rendering protocol
   [scicloj.kindly.v4.kind :as kind]
   ;; Plotje -- composable plotting
   [scicloj.plotje.api :as pj]
   ;; Rdatasets -- standard datasets
   [scicloj.metamorph.ml.rdatasets :as rdatasets]
   [clojure.string :as str]))

Plain Data Works

Plotje accepts plain Clojure data – a map of columns or a vector of row maps. No dataset wrapping needed:

(-> [{:month "Jan" :temperature 5}
     {:month "Feb" :temperature 7}
     {:month "Mar" :temperature 12}
     {:month "Apr" :temperature 16}]
    (pj/lay-line :month :temperature)
    pj/lay-point)

This is all you need for quick plots. The rest of this chapter covers datasets, which become useful as your data grows.

What Is a Dataset?

A dataset is a columnar table backed by efficient typed arrays. It is the Clojure equivalent of an R data frame or a Python pandas DataFrame.

The core implementation is tech.ml.dataset. Tablecloth is a higher-level wrapper with a more convenient API. Plotje uses Tablecloth internally and in its documentation.
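The columnar layout becomes concrete when you look at the same dataset column by column and row by row. A small sketch, assuming Tablecloth is on the classpath; `small-ds` is an illustrative name:

```clojure
(require '[tablecloth.api :as tc])

(def small-ds
  (tc/dataset {:x [1 2 3]
               :y [10 20 15]}))

;; Column-oriented view: one entry per column.
(tc/columns small-ds :as-map)

;; Row-oriented view: one map per row, built on demand from the columns.
(tc/rows small-ds :as-maps)

;; A dataset can be called like a function to get a single column,
;; which behaves like a sequence of values.
(vec (small-ds :y))
```

The data is stored by column; the row maps are a convenience view for code that prefers to work row by row.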

Creating Datasets

From a map of columns

(tc/dataset {:x [1 2 3 4 5]
             :y [10 20 15 30 25]})

_unnamed [5 2]:

:x	:y
1	10
2	20
3	15
4	30
5	25

From a vector of row maps

(tc/dataset [{:name "Alice" :score 92}
             {:name "Bob"   :score 85}
             {:name "Carol" :score 97}])

_unnamed [3 2]:

:name	:score
Alice	92
Bob	85
Carol	97

From a sequence of row vectors

(tc/dataset [["Alice" 92]
             ["Bob"   85]
             ["Carol" 97]]
            {:column-names [:name :score]})

(Tablecloth logs a warning for this call: dataset creation behaviour changed for 2d 2-element arrays in v7.029. See https://github.com/scicloj/tablecloth/issues/142 for details.)

:_unnamed [3 2]:

:name	:score
Alice	92
Bob	85
Carol	97

From a CSV or URL

(tc/dataset "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"
            {:key-fn keyword})

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

:rownames	:Sepal.Length	:Sepal.Width	:Petal.Length	:Petal.Width	:Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
…	…	…	…	…	…
140	6.9	3.1	5.4	2.1	virginica
141	6.7	3.1	5.6	2.4	virginica
142	6.9	3.1	5.1	2.3	virginica
143	5.8	2.7	5.1	1.9	virginica
144	6.8	3.2	5.9	2.3	virginica
145	6.7	3.3	5.7	2.5	virginica
146	6.7	3.0	5.2	2.3	virginica
147	6.3	2.5	5.0	1.9	virginica
148	6.5	3.0	5.2	2.0	virginica
149	6.2	3.4	5.4	2.3	virginica
150	5.9	3.0	5.1	1.8	virginica

(The :key-fn keyword option converts CSV string headers like "Sepal.Length" to keywords like :Sepal.Length. Without it, column names remain strings.)
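Because :key-fn is an ordinary function applied to each header, it can do more than call keyword. The sketch below defines a hypothetical helper, `kebab-keyword`, that also lower-cases the header and replaces dots, underscores, and whitespace with hyphens. It is an illustration of the option, not a description of how any library normalises names:

```clojure
(require '[clojure.string :as str])

(defn kebab-keyword
  "Hypothetical helper: turns a header such as \"Sepal.Length\" into
  :sepal-length."
  [header]
  (-> header
      str/trim
      str/lower-case
      (str/replace #"[._\s]+" "-")
      keyword))

(map kebab-keyword ["Sepal.Length" "Petal.Width" "Species"])
;; => (:sepal-length :petal-width :species)

;; Passed as :key-fn, it would be applied to every CSV header while loading
;; (left inside `comment` because it needs Tablecloth and network access):
(comment
  (tc/dataset "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"
              {:key-fn kebab-keyword}))
```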

The RDatasets collection

Many examples in this book use datasets from the RDatasets collection – over 2,300 datasets from R packages, available as CSV files.

The Clojure bridge is provided by the metamorph.ml library. You can add it as a direct dependency, or use the Noj toolkit, which includes it along with other data science libraries.

Each dataset has a memoized accessor function. The first call fetches the CSV from the web; subsequent calls return the cached dataset instantly:

(rdatasets/datasets-iris)

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

:rownames	:sepal-length	:sepal-width	:petal-length	:petal-width	:species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
…	…	…	…	…	…
140	6.9	3.1	5.4	2.1	virginica
141	6.7	3.1	5.6	2.4	virginica
142	6.9	3.1	5.1	2.3	virginica
143	5.8	2.7	5.1	1.9	virginica
144	6.8	3.2	5.9	2.3	virginica
145	6.7	3.3	5.7	2.5	virginica
146	6.7	3.0	5.2	2.3	virginica
147	6.3	2.5	5.0	1.9	virginica
148	6.5	3.0	5.2	2.0	virginica
149	6.2	3.4	5.4	2.3	virginica
150	5.9	3.0	5.1	1.8	virginica

Column names here are kebab-case keywords (:sepal-length rather than :Sepal.Length, as in the direct CSV load above).
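The "fetch once, then reuse" behaviour of the accessors can be pictured with clojure.core/memoize. This is only a sketch of the general technique, not the library's implementation; `fetch-rows` is a hypothetical stand-in for a slow web request:

```clojure
;; Counts how often the (simulated) download actually runs.
(def fetch-count (atom 0))

(defn fetch-rows
  "Hypothetical stand-in for a slow web fetch; returns a tiny table."
  [dataset-name]
  (swap! fetch-count inc)
  {:name dataset-name
   :rows [{:x 1} {:x 2}]})

;; memoize caches the result for each distinct argument list.
(def fetch-rows-memo (memoize fetch-rows))

(fetch-rows-memo "iris") ; runs fetch-rows
(fetch-rows-memo "iris") ; served from the cache
@fetch-count
;; => 1
```

A cache of this kind lives in memory, so in this sketch a fresh REPL session would fetch again on the first call.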

A few datasets used throughout this book:

_unnamed [6 3]:

:function	:rows	:description
datasets-iris	150	Edgar Anderson’s Iris Data
reshape2-tips	244	Tipping data
ggplot2-mpg	234	Fuel economy data from 1999 to 2008 for 38 popular models of…
ggplot2-diamonds	53940	Prices of over 50,000 round cut diamonds
gapminder-gapminder	1704	Gapminder data
datasets-mtcars	32	Motor Trend Car Road Tests

Useful Tablecloth operations

The examples in this book use a handful of Tablecloth functions. Here is a quick reference:

tc/head – first N rows:

(tc/head (rdatasets/datasets-iris) 3)

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [3 6]:

:rownames	:sepal-length	:sepal-width	:petal-length	:petal-width	:species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa

tc/select-rows – filter rows by predicate:

(-> (rdatasets/datasets-iris)
    (tc/select-rows #(= "setosa" (:species %))))

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [50 6]:

:rownames	:sepal-length	:sepal-width	:petal-length	:petal-width	:species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
…	…	…	…	…	…
40	5.1	3.4	1.5	0.2	setosa
41	5.0	3.5	1.3	0.3	setosa
42	4.5	2.3	1.3	0.3	setosa
43	4.4	3.2	1.3	0.2	setosa
44	5.0	3.5	1.6	0.6	setosa
45	5.1	3.8	1.9	0.4	setosa
46	4.8	3.0	1.4	0.3	setosa
47	5.1	3.8	1.6	0.2	setosa
48	4.6	3.2	1.4	0.2	setosa
49	5.3	3.7	1.5	0.2	setosa
50	5.0	3.3	1.4	0.2	setosa

tc/group-by and tc/aggregate – split and summarize:

(-> (rdatasets/datasets-iris)
    (tc/group-by [:species])
    (tc/aggregate {:mean-sl (fn [ds] (/ (reduce + (ds :sepal-length))
                                        (tc/row-count ds)))}))

_unnamed [3 2]:

:species	:mean-sl
setosa	5.006
versicolor	5.936
virginica	6.588
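For comparison, here is roughly what the same split-and-summarize step looks like with Clojure core functions on a vector of row maps. The rows are made-up values rather than the real iris data, and `mean` and `mean-by` are illustrative helpers:

```clojure
;; A few hand-written rows standing in for a larger table.
(def flowers
  [{:species "setosa"     :sepal-length 5.0}
   {:species "setosa"     :sepal-length 5.5}
   {:species "versicolor" :sepal-length 6.0}
   {:species "versicolor" :sepal-length 7.0}])

(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (count xs)))

(defn mean-by
  "Groups rows by group-key and averages value-key within each group."
  [group-key value-key rows]
  (->> (group-by group-key rows)
       (map (fn [[group group-rows]]
              {group-key group
               :mean     (mean (map value-key group-rows))}))
       (sort-by group-key)))

(mean-by :species :sepal-length flowers)
;; => ({:species "setosa", :mean 5.25} {:species "versicolor", :mean 6.5})
```

Both versions express the same idea; which one reads better is the matter of taste mentioned at the start of the chapter.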

tc/order-by – sort rows:

(-> (rdatasets/datasets-mtcars)
    (tc/order-by [:mpg] :desc)
    (tc/head 3))

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv [3 12]:

:rownames	:mpg	:cyl	:disp	:hp	:drat	:wt	:qsec	:vs	:am	:gear	:carb
Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1
Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2

tc/column-names – list columns:

(tc/column-names (rdatasets/datasets-iris))

(:rownames
 :sepal-length
 :sepal-width
 :petal-length
 :petal-width
 :species)

tc/row-count – number of rows:

(tc/row-count (rdatasets/ggplot2-diamonds))

53940

Datasets and Plotje

When you pass plain data to Plotje, it is coerced to a dataset internally. You can also pass a dataset directly – the result is the same:

Plain data:

(-> {:x [1 2 3] :y [4 5 6]}
    (pj/lay-point :x :y))

Dataset:

(-> (tc/dataset {:x [1 2 3] :y [4 5 6]})
    (pj/lay-point :x :y))

Both produce the same plot. Use whichever is more convenient for your workflow.

From Wide to Long

Plotje plots long-form (tidy) data: one row per observation, with categories and groupings in their own columns. Real datasets often arrive in wide form – each measurement in its own column. tc/pivot->longer is the canonical reshape:

(def temps-wide
  (tc/dataset
   {:month   ["Jan" "Feb" "Mar"]
    :tokyo   [3 5 9]
    :paris   [4 6 11]
    :nairobi [22 23 24]}))

temps-wide

_unnamed [3 4]:

:month	:tokyo	:paris	:nairobi
Jan	3	4	22
Feb	5	6	23
Mar	9	11	24

Three city columns become two: a :city label column and a :temperature value column. The row count triples (3 months times 3 cities equals 9):

(def temps-long
  (tc/pivot->longer temps-wide [:tokyo :paris :nairobi]
                    {:target-columns :city
                     :value-column-name :temperature}))

temps-long

_unnamed [9 3]:

:month	:city	:temperature
Jan	:tokyo	3
Feb	:tokyo	5
Mar	:tokyo	9
Jan	:paris	4
Feb	:paris	6
Mar	:paris	11
Jan	:nairobi	22
Feb	:nairobi	23
Mar	:nairobi	24
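To make the reshape less of a black box, the same transformation can be sketched on plain row maps with Clojure core functions. The function `wide->long` is a hypothetical illustration of the idea, not how Tablecloth implements tc/pivot->longer:

```clojure
(def temps-wide-rows
  [{:month "Jan" :tokyo 3 :paris 4  :nairobi 22}
   {:month "Feb" :tokyo 5 :paris 6  :nairobi 23}
   {:month "Mar" :tokyo 9 :paris 11 :nairobi 24}])

(defn wide->long
  "For every measure column and every row, emits one long-form row that keeps
  the remaining columns, stores the measure's name under label-key, and its
  value under value-key."
  [rows measure-keys label-key value-key]
  (vec
   (for [measure measure-keys
         row     rows]
     (-> (apply dissoc row measure-keys)
         (assoc label-key measure
                value-key (get row measure))))))

(def temps-long-rows
  (wide->long temps-wide-rows [:tokyo :paris :nairobi] :city :temperature))

(count temps-long-rows)
;; => 9

(first temps-long-rows)
;; => {:month "Jan", :city :tokyo, :temperature 3}
```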

Plot the long form by mapping the new label column to :color:

(-> temps-long
    (pj/lay-line :month :temperature
                 {:color :city}))

The inverse, tc/pivot->wider, reshapes long back to wide – useful for tabular reports but rarely the right shape for plotting.
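A short sketch of that inverse, assuming Tablecloth is on the classpath. The dataset here is a reduced, illustrative variant of the long table, with city names as strings:

```clojure
(require '[tablecloth.api :as tc])

(def temps-long-example
  (tc/dataset {:month       ["Jan" "Feb" "Jan" "Feb"]
               :city        ["tokyo" "tokyo" "paris" "paris"]
               :temperature [3 5 4 6]}))

;; The values of :city become column names; :temperature fills the cells.
(def temps-wide-again
  (tc/pivot->wider temps-long-example :city :temperature))

temps-wide-again

;; One row per month remains.
(tc/row-count temps-wide-again)
```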

What’s Next

Poses – how Plotje composes layers, aesthetics, and layer types
Quickstart – if you skipped straight here, go back and build your first plots

source: notebooks/plotje_book/datasets.clj