3 Datasets
You do not need to know about datasets to plot with Plotje: you can pass plain Clojure data (maps, vectors of maps) directly. But understanding datasets is recommended background for four reasons:
Performance: datasets are columnar and backed by typed arrays. For large data (thousands of rows and above), they are significantly faster than plain Clojure maps and vectors.
Ergonomics: many people find that dataset operations (filtering, grouping, aggregation) read more naturally as a pipeline than the equivalent Clojure core code. This is a matter of taste, but the convention is widespread in the Clojure data science ecosystem.
Column types matter for plotting: dataset columns carry type information (numeric, categorical, temporal) that Plotje uses to choose scales, axis formatting, and statistical transforms. A column of doubles gets a continuous axis; a column of strings gets a categorical axis. When you pass plain Clojure data, types are inferred on coercion, sometimes not the way you expect. Working with datasets gives you explicit control.
Understanding Plotje internals: Plotje coerces your data to a dataset internally. Knowing what a dataset is helps you understand column names, types, and the inference rules.
This chapter gives a brief introduction. For full documentation, see the Tablecloth and tech.ml.dataset docs.
(ns plotje-book.datasets
  (:require
   ;; Tablecloth -- dataset manipulation
   [tablecloth.api :as tc]
   ;; Kindly -- notebook rendering protocol
   [scicloj.kindly.v4.kind :as kind]
   ;; Plotje -- composable plotting
   [scicloj.plotje.api :as pj]
   ;; Rdatasets -- standard datasets
   [scicloj.metamorph.ml.rdatasets :as rdatasets]))
Plain Data Works
Plotje accepts plain Clojure data: a map of columns or a vector of row maps. No dataset wrapping needed:
(-> [{:month "Jan" :temperature 5}
     {:month "Feb" :temperature 7}
     {:month "Mar" :temperature 12}
     {:month "Apr" :temperature 16}]
    (pj/lay-line :month :temperature)
    pj/lay-point)
This is all you need for quick plots. The rest of this chapter covers datasets, which become useful as your data grows.
What Is a Dataset?
A dataset is a columnar table backed by efficient typed arrays. It is the Clojure equivalent of an R data frame or a Python pandas DataFrame.
The core implementation is tech.ml.dataset. Tablecloth is a higher-level wrapper with a more ergonomic API. Plotje uses Tablecloth internally and in its documentation.
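To make "columnar" and "typed" concrete, here is a minimal sketch that builds a small dataset and inspects the datatype of each column. The data and the name `sample` are made up for illustration; `tc/info` with `:columns` and the `:datatype` entry in column metadata are the only library features relied on.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical sample data: an integer, a floating-point, and a string column.
(def sample
  (tc/dataset {:count  [1 2 3]
               :weight [1.5 2.5 3.5]
               :label  ["a" "b" "c"]}))

;; A summary dataset with one row per column, including name and datatype.
(tc/info sample :columns)

;; Calling the dataset with a column name returns that column;
;; its metadata records the inferred datatype.
(:datatype (meta (sample :count)))
(:datatype (meta (sample :weight)))
(:datatype (meta (sample :label)))
```

With this input one would expect the three datatypes to come back as :int64, :float64, and :string, which is the kind of information that decides between a continuous and a categorical axis.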
Creating Datasets
From a map of columns
(tc/dataset {:x [1 2 3 4 5]
             :y [10 20 15 30 25]})
_unnamed [5 2]:
| :x | :y |
|---|---|
| 1 | 10 |
| 2 | 20 |
| 3 | 15 |
| 4 | 30 |
| 5 | 25 |
From a vector of row maps
(tc/dataset [{:name "Alice" :score 92}
             {:name "Bob" :score 85}
             {:name "Carol" :score 97}])
_unnamed [3 2]:
| :name | :score |
|---|---|
| Alice | 92 |
| Bob | 85 |
| Carol | 97 |
From a sequence of row vectors
(tc/dataset [["Alice" 92]
             ["Bob" 85]
             ["Carol" 97]]
            {:column-names [:name :score]})
:_unnamed [3 2]:
| :name | :score |
|---|---|
| Alice | 92 |
| Bob | 85 |
| Carol | 97 |
From a CSV or URL
(tc/dataset "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"
            {:key-fn keyword})
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:
| :rownames | :Sepal.Length | :Sepal.Width | :Petal.Length | :Petal.Width | :Species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| … | … | … | … | … | … |
| 140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
| 142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
| 145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
(The :key-fn keyword option converts CSV string headers like "Sepal.Length" to keywords like :Sepal.Length. Without it, column names remain strings.)
The RDatasets collection
Many examples in this book use datasets from the RDatasets collection: over 2,300 datasets from R packages, available as CSV files.
The Clojure bridge is provided by the metamorph.ml library. You can add it as a direct dependency, or use the Noj toolkit, which includes it along with other data science libraries.
Each dataset has a memoized accessor function. The first call fetches the CSV from the web; subsequent calls return the cached dataset instantly:
(rdatasets/datasets-iris)
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:
| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| … | … | … | … | … | … |
| 140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
| 142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
| 145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
Column names are kebab-case keywords (:sepal-length, not Sepal.Length).
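The fetch-once behaviour of the accessor functions can be pictured with `clojure.core/memoize`. This is only a sketch of the general idea, not the library's actual implementation: `fetch-dataset`, `cached-dataset`, and `fetch-count` are hypothetical names, and the "download" is simulated with a plain map.

```clojure
;; Counts how often the simulated download actually runs.
(def fetch-count (atom 0))

(defn fetch-dataset
  "Hypothetical stand-in for downloading and parsing a CSV."
  [dataset-name]
  (swap! fetch-count inc)
  {:name dataset-name :rows 150})

;; memoize caches the result for each distinct argument list.
(def cached-dataset (memoize fetch-dataset))

(cached-dataset "iris") ; first call: runs fetch-dataset
(cached-dataset "iris") ; second call: answered from the cache

@fetch-count
```

After both calls the counter holds 1, because the second call never reaches `fetch-dataset`.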
A few datasets used throughout this book:
(kind/table
 {:column-names ["Function" "Rows" "Description"]
  :row-maps
  (let [mpg (rdatasets/ggplot2-mpg)]
    [{"Function" (kind/code "rdatasets/datasets-iris")
      "Rows" (tc/row-count (rdatasets/datasets-iris))
      "Description" "Iris flower measurements by species"}
     {"Function" (kind/code "rdatasets/reshape2-tips")
      "Rows" (tc/row-count (rdatasets/reshape2-tips))
      "Description" "Restaurant tips with bill, day, time, smoker"}
     {"Function" (kind/code "rdatasets/ggplot2-mpg")
      "Rows" (tc/row-count mpg)
      "Description" (str "Fuel economy for "
                         (count (distinct (mpg :model)))
                         " car models")}
     {"Function" (kind/code "rdatasets/ggplot2-diamonds")
      "Rows" (tc/row-count (rdatasets/ggplot2-diamonds))
      "Description" "Diamond price, carat, cut, color, clarity"}
     {"Function" (kind/code "rdatasets/gapminder-gapminder")
      "Rows" (tc/row-count (rdatasets/gapminder-gapminder))
      "Description" "Country-level life expectancy and GDP"}
     {"Function" (kind/code "rdatasets/datasets-mtcars")
      "Rows" (tc/row-count (rdatasets/datasets-mtcars))
      "Description" "Motor Trend car road tests"}])})
| Function | Rows | Description |
|---|---|---|
| rdatasets/datasets-iris | 150 | Iris flower measurements by species |
| rdatasets/reshape2-tips | 244 | Restaurant tips with bill, day, time, smoker |
| rdatasets/ggplot2-mpg | 234 | Fuel economy for 38 car models |
| rdatasets/ggplot2-diamonds | 53940 | Diamond price, carat, cut, color, clarity |
| rdatasets/gapminder-gapminder | 1704 | Country-level life expectancy and GDP |
| rdatasets/datasets-mtcars | 32 | Motor Trend car road tests |
Useful Tablecloth operations
The examples in this book use a handful of Tablecloth functions. Here is a quick reference:
tc/head (first N rows):
(tc/head (rdatasets/datasets-iris) 3)
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [3 6]:
| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
tc/select-rows (filter rows by predicate):
(-> (rdatasets/datasets-iris)
    (tc/select-rows #(= "setosa" (:species %))))
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [50 6]:
| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| … | … | … | … | … | … |
| 40 | 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 41 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 42 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
| 43 | 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 44 | 5.0 | 3.5 | 1.6 | 0.6 | setosa |
| 45 | 5.1 | 3.8 | 1.9 | 0.4 | setosa |
| 46 | 4.8 | 3.0 | 1.4 | 0.3 | setosa |
| 47 | 5.1 | 3.8 | 1.6 | 0.2 | setosa |
| 48 | 4.6 | 3.2 | 1.4 | 0.2 | setosa |
| 49 | 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 50 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
tc/group-by and tc/aggregate (split and summarize):
(-> (rdatasets/datasets-iris)
    (tc/group-by [:species])
    (tc/aggregate {:mean-sl (fn [ds] (/ (reduce + (ds :sepal-length))
                                        (tc/row-count ds)))}))
_unnamed [3 2]:
| :species | :mean-sl |
|---|---|
| setosa | 5.006 |
| versicolor | 5.936 |
| virginica | 6.588 |
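For comparison with the ergonomics point made at the start of the chapter, here is roughly what the same split-and-summarize step looks like over plain row maps using only clojure.core. The rows are a made-up miniature, not the real iris data, and `mean` and `mean-by-species` are names chosen for this sketch.

```clojure
;; Hypothetical row maps standing in for a few measurements.
(def rows
  [{:species "setosa"     :sepal-length 5.0}
   {:species "setosa"     :sepal-length 5.5}
   {:species "versicolor" :sepal-length 6.0}
   {:species "versicolor" :sepal-length 6.5}])

(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (count xs)))

(def mean-by-species
  (->> rows
       (group-by :species)                      ; map of species -> rows
       (map (fn [[species group]]
              {:species species
               :mean-sl (mean (map :sepal-length group))}))
       (sort-by :species)
       vec))

mean-by-species
```

Both versions do the same work. Which one reads better is, as noted earlier, a matter of taste; the dataset pipeline keeps the grouping column and the result table shape implicit, while the core version spells them out.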
tc/order-by (sort rows):
(-> (rdatasets/datasets-mtcars)
    (tc/order-by [:mpg] :desc)
    (tc/head 3))
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv [3 12]:
| :rownames | :mpg | :cyl | :disp | :hp | :drat | :wt | :qsec | :vs | :am | :gear | :carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
tc/column-names (list columns):
(tc/column-names (rdatasets/datasets-iris))
(:rownames
 :sepal-length
 :sepal-width
 :petal-length
 :petal-width
 :species)
tc/row-count (number of rows):
(tc/row-count (rdatasets/ggplot2-diamonds))
53940
Datasets and Plotje
When you pass plain data to Plotje, it is coerced to a dataset internally. You can also pass a dataset directly; the result is the same:
Plain data:
(-> {:x [1 2 3] :y [4 5 6]}
    (pj/lay-point :x :y))
Dataset:
(-> (tc/dataset {:x [1 2 3] :y [4 5 6]})
    (pj/lay-point :x :y))
Both produce the same plot. Use whichever is more convenient for your workflow.
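Passing a dataset becomes convenient when the data needs preparation before plotting. The sketch below filters a small, made-up dataset with Tablecloth and hands the result to Plotje. It assumes `pj/lay-point` accepts a filtered dataset the same way it accepts the datasets shown above; `readings` and `site-a` are hypothetical names.

```clojure
(require '[tablecloth.api :as tc]
         '[scicloj.plotje.api :as pj])

;; Hypothetical readings from two sites.
(def readings
  (tc/dataset {:day         [1 2 3 4]
               :temperature [5 7 12 16]
               :site        ["a" "b" "a" "b"]}))

;; Keep only the rows for site "a".
(def site-a
  (tc/select-rows readings #(= "a" (:site %))))

;; Plot the filtered dataset, mirroring the lay-point calls above.
(-> site-a
    (pj/lay-point :day :temperature))
```

The filtering step stays in Tablecloth and the plotting step stays in Plotje, so each part of the pipeline can be inspected on its own.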
What's Next
- Pose Model: how Plotje composes layers, aesthetics, and layer types
- Quickstart: if you skipped straight here, go back and build your first plots