5 Table processing with Tablecloth
author: Cvetomir Dimov and Daniel Slutsky
last change: 2025-02-02
Tablecloth is a table processing library inspired by the dataframe ergonomics typical to the R ecosystem, specifically the Tidyverse, but offers certain advantages on top of that.
It is built on top of the data structures and functions of tech.ml.dataset, a high-performance table processing library (often called TMD), which is built on top of dtype-next, an array-programming library.
In this tutorial, we will see a few of the core ideas of Tablecloth. You are encouraged to look into the main documentation for more information.
5.1 Recommended resources
Data Manipulation in Clojure Compared to R and Python by Kira Howe (McLean) (2024-07-18)
Dealing with out-of-memory faulty csv’s with Clojure, Duckdb, and Parquet by Georgy Toporkov (2024-01-22)
A beginner-friendly intro by Mey Beisaron (Func Prog Sweden, 2023-03-22):
- An in-depth walkthrough by Ethan Miller (data-recur group, 2022-11-05):
5.2 About this tutorial
In this tutorial, we will demonstrate the ergonomics of so-called dataset data structures provided by Tablecloth. Datasets are table-like data structures, often called data-frames in other data science platforms.
We will assume basic familiarity with Clojure.
A lot of what we demonstrate here can also be implemented with the usual Clojure data structures such as vectors and maps. However, datasets offer not only performance advantages in space and time, but also certain usability features, which are arguably expressive and powerful.
We will often use treading macros, mostly ->. This approach is compatible with data science cultures such as the Tidyverse in R.
5.3 Setup
We create a namespace and require the following namespaces:
tablecloth.api- the main Tablecloth Dataset APItablecloth.column.api- the Tablecloth Column APItech.v3.dataset.printto control printing (from tech.ml.dataset)clojure.stringfor string processingclojure.java.iofor file input/outputscicloj.kindly.v4.kindof the Kindly standard to control the way certain values are displayedjava-time.apiof Clojure.Java-Time for some time calculationstech.v3.datatype.datetimeof date and time operations (from dtype-next)
(ns noj-book.tablecloth-table-processing
(:require [tablecloth.api :as tc]
[tablecloth.column.api :as tcc]
[tech.v3.dataset.print :as print]
[clojure.string :as str]
[clojure.java.io :as io]
[scicloj.kindly.v4.kind :as kind]
[java-time.api :as java-time]
[tech.v3.datatype.datetime :as datetime]))5.4 Creating a dataset
Assume we have a vector of vectors representing bike trips. Each trip has the type of the bike and the coordinates (latitude and longitude) of the trip’s start and end. We can turn this data structure into a dataset:
(tc/dataset [["classic_bike" 41.906866 -87.626217 41.92393131136619 -87.63582453131676]
["electric_bike" 41.869312286 -87.673897266 41.8895 -87.688257]
["classic_bike" 41.95600355078549 -87.68016144633293 41.886875 -87.62603]]
{:column-names [:rideable-type
:start-lat :start-lng
:end-lat :end-lng]}):_unnamed [3 5]:
| :rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
|---|---|---|---|---|
| classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
| electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
| classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
Sometimes, data may arrive in different shapes. We can also create such a dataset from a vector of maps:
(tc/dataset [{:rideable-type "classic_bike"
:start-lat 41.906866
:start-lng -87.626217
:end-lat 41.92393131136619
:end-lng -87.63582453131676}
{:rideable-type "electric_bike"
:start-lat 41.869312286
:start-lng -87.673897266
:end-lat 41.8895
:end-lng -87.688257}
{:rideable-type "classic_bike"
:start-lat 41.95600355078549
:start-lng -87.68016144633293
:end-lat 41.886875
:end-lng -87.62603}])_unnamed [3 5]:
| :rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
|---|---|---|---|---|
| classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
| electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
| classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
.. and also, from a map of vectors:
(tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
:start-lat [41.906866 41.869312286 41.95600355078549]
:start-lng [-87.626217 -87.673897266 -87.68016144633293]
:end-lat [41.92393131136619 41.8895 41.886875]
:end-lng [-87.63582453131676 -87.688257 -87.62603]})_unnamed [3 5]:
| :rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
|---|---|---|---|---|
| classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
| electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
| classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
Let us hold it in a var to explore it further:
(def some-trips
(tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
:start-lat [41.906866 41.869312286 41.95600355078549]
:start-lng [-87.626217 -87.673897266 -87.68016144633293]
:end-lat [41.92393131136619 41.8895 41.886875]
:end-lng [-87.63582453131676 -87.688257 -87.62603]}))5.5 Displaying a dataset
In an environment compatible with the Kindly standard, the default way a dataset is displayed is by printing it.
some-trips_unnamed [3 5]:
| :rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
|---|---|---|---|---|
| classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
| electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
| classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
If necessary, we may customize the printing using the tech.v3.dataset.print namespace of the tech.ml.dataset library.
For example:
(-> {:x (range 99)
:y (repeatedly 99 rand)}
tc/dataset
;; show at most 9 values
(print/print-range 9))_unnamed [99 2]:
| :x | :y |
|---|---|
| 0 | 0.24059880 |
| 1 | 0.56570539 |
| 2 | 0.87940284 |
| 3 | 0.39176387 |
| … | … |
| 94 | 0.82057912 |
| 95 | 0.15309230 |
| 96 | 0.39630696 |
| 97 | 0.37409969 |
| 98 | 0.33125979 |
We may also explicitly turn it into an HTML table:
(kind/table some-trips)| rideable-type | start-lat | start-lng | end-lat | end-lng |
|---|---|---|---|---|
| classic_bike | 41.906866 | -87.626217 | 41.92393131136619 | -87.63582453131676 |
| electric_bike | 41.869312286 | -87.673897266 | 41.8895 | -87.688257 |
| classic_bike | 41.95600355078549 | -87.68016144633293 | 41.886875 | -87.62603 |
This does not matter much for now, but it can be handy when certain inner values should be visualized in a certain way.
It is possible to use datatables to render kind/table and specify datatables options (see the full list).
(kind/table some-trips
{:use-datatables true
:datatables {:scrollY 100}})| rideable-type | start-lat | start-lng | end-lat | end-lng |
|---|---|---|---|---|
| classic_bike | 41.906866 | -87.626217 | 41.92393131136619 | -87.63582453131676 |
| electric_bike | 41.869312286 | -87.673897266 | 41.8895 | -87.688257 |
| classic_bike | 41.95600355078549 | -87.68016144633293 | 41.886875 | -87.62603 |
We may also nest the usual printed table inside other visualization kinds, such as Hiccup.
(kind/hiccup
[:div {:style {:width "60%"
:max-height "400px"
:overflow-x :auto
:overflow-y :auto
:background "floralwhite"}}
some-trips])_unnamed [3 5]:
| :rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
|---|---|---|---|---|
| classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
| electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
| classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
5.6 What is a dataset?
Let us explore this data structure, our little dataset of bike trips.
A dataset is a value of the Dataset datatype defined in the tech.ml.dataset library:
(type some-trips)tech.v3.dataset.impl.dataset.DatasetOne thing worth knowing about this datatype is that it is extended by quite a few interfaces and protocols.
For example, it behaves as a map.
(map? some-trips)trueThe keys are the column names:
(keys some-trips)(:rideable-type :start-lat :start-lng :end-lat :end-lng)(tc/column-names some-trips)(:rideable-type :start-lat :start-lng :end-lat :end-lng).. and the values are the columns:
(:start-lat some-trips)#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]Now we need to discuss what columns are.
5.7 What is a column?
(:start-lat some-trips)#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]A column is a value of Column datatype defined in the tech.ml.dataset library:
(-> some-trips
:start-lat
type)tech.v3.dataset.impl.column.ColumnThis datatype is also extended by quite a few interfaces and protocols.
For example, it is sequential.
(-> some-trips
:start-lat
sequential?)trueSo, we can use the sequence abstraction with it:
(->> (:start-lat some-trips)
(take 2))(41.906866 41.869312286)(->> (:rideable-type some-trips)
(filter #{"classic_bike"}))("classic_bike" "classic_bike")It is also assiciative:
(-> some-trips
:rideable-type
(assoc 2 "my strange and unique bike"))["classic_bike" "electric_bike" "my strange and unique bike"]5.8 Working with Columns
We may use Tablecloth’s Column API to create and process columns. For example:
(tcc/column ["classic_bike" "electrical_bike" "classic_bike"])#tech.v3.dataset.column<string>[3]
null
[classic_bike, electrical_bike, classic_bike]What is the average latitude where trips tend to start?
(-> some-trips
:start-lat
tcc/mean)41.9107272789285What is the type of elements in this Column?
(-> some-trips
:start-lat
tcc/typeof):float64Let us look into our latitudes in radians rather than degrees:
(-> some-trips
:start-lat
(tcc/* (/ Math/PI 180)))#tech.v3.dataset.column<float64>[3]
null
[0.7314, 0.7308, 0.7323]You see, columns are typed, and this has implications for both for time and space performance as well as ergonomics.
5.9 The data in columns
You will probably not need this detail most times, but it is worth knowing that the data actually held by the Column can be accessed as the .data field, and it can be of varying data structures (see list of all datatypes).
For example:
(-> some-trips
:start-lat
.data
type)tech.v3.datatype.array_buffer.ArrayBuffer(-> some-trips
:rideable-type
.data
type)tech.v3.dataset.string_table.StringTable(-> (range 9)
tcc/column
.data
type)clojure.lang.LongRangeBehind the scenes, tech.ml.dataset makes sure to use efficient data structures for columns, so that, e.g., random access by index will be efficient:
(-> some-trips
:start-lat
(nth 2))41.95600355078549The following is quick too!
(-> (range 1000000)
(tcc/* 1000)
(nth 10000))10000000Here we rely on the “lazy and non-caching” semantics of the underlying dtype-next library, which is a topic worth its own tutorial.
5.10 Operations on columns
In the examples with columns above, we used tcc/* to multiply the values of a column by a scalar. The result was another column. This operation can be applied to multiple two (or more) columns as well, or a mixture of columns and scalars.
(tcc/* (:start-lat some-trips)
(:end-lng some-trips)
3
4)#tech.v3.dataset.column<float64>[3]
null
[-4.407E+04, -4.406E+04, -4.412E+04]A long list of other operations have been implemented, among which arithmetic operations, trigonometric functions, and various predicates. Note that the tcc/min and tcc/max functions output a column as well.
(tcc/min (:start-lat some-trips)
(:end-lat some-trips))#tech.v3.dataset.column<float64>[3]
null
[41.91, 41.87, 41.89]They also work with scalar inputs.
(tcc/min (:start-lat some-trips)
100)#tech.v3.dataset.column<float64>[3]
null
[41.91, 41.87, 41.96](tcc/min (:start-lat some-trips)
-100)#tech.v3.dataset.column<float64>[3]
null
[-100.0, -100.0, -100.0]Other operations produce a scalar as an output. These include various measures of central tendency.
(-> some-trips
:start-lat
tcc/mean)41.9107272789285(-> some-trips
:end-lat
tcc/median)41.8895Important operations in this category are functions that reduce over the columns, such as reduce-*, reduce-+, reduce-min, and reduce-max. For example, the following can be used to find the maximum value in a column:
(-> some-trips
:start-lng
tcc/reduce-max)-87.6262175.11 Summarizing datasets
We can use the tc/info function to summarize a dataset:
(tc/info some-trips)_unnamed: descriptive-stats [5 12]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| :rideable-type | :string | 3 | 0 | classic_bike | classic_bike | classic_bike | |||||
| :start-lat | :float64 | 3 | 0 | 41.86931229 | 41.91072728 | 41.95600355 | 0.04347443 | 0.39652474 | 41.91 | 41.96 | |
| :start-lng | :float64 | 3 | 0 | -87.68016145 | -87.66009190 | -87.62621700 | 0.02950325 | 1.64462246 | -87.63 | -87.68 | |
| :end-lat | :float64 | 3 | 0 | 41.88687500 | 41.90010210 | 41.92393131 | 0.02067839 | 1.70070295 | 41.92 | 41.89 | |
| :end-lng | :float64 | 3 | 0 | -87.68825700 | -87.65003718 | -87.62603000 | 0.03345967 | -1.56657916 | -87.64 | -87.63 |
5.12 Reading datasets
Datasets are often read from files.
Let us read a file from the Divvy Data, where the history of bike sharing trips in Chicago is shared publicly. Chicago bike trips dataset.
Let us first download the data of one month.
(def bike-trips-filename
"202304-divvy-tripdata.zip")(if (.exists (io/as-file bike-trips-filename))
[:already-downloaded bike-trips-filename]
(with-open [in (io/input-stream "https://divvy-tripdata.s3.amazonaws.com/202304-divvy-tripdata.zip")
out (io/output-stream bike-trips-filename)]
(io/copy in out)
[:just-downloaded bike-trips-filename]))[:just-downloaded "202304-divvy-tripdata.zip"]In this case, it is a CSV file held inside a compressed ZIP file, but other formats are supported as well.
First, let us read just a few rows:
(-> bike-trips-filename
tc/dataset
time)202304-divvy-tripdata.zip [426590 13]:
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8FE8F7D9C10E88C7 | electric_bike | 2023-04-02 08:37:28 | 2023-04-02 08:41:37 | 41.80000000 | -87.60000000 | 41.79000000 | -87.60000000 | member | ||||
| 34E4ED3ADF1D821B | electric_bike | 2023-04-19 11:29:02 | 2023-04-19 11:52:12 | 41.87000000 | -87.65000000 | 41.93000000 | -87.68000000 | member | ||||
| 5296BF07A2F77CB5 | electric_bike | 2023-04-19 08:41:22 | 2023-04-19 08:43:22 | 41.93000000 | -87.66000000 | 41.93000000 | -87.66000000 | member | ||||
| 40759916B76D5D52 | electric_bike | 2023-04-19 13:31:30 | 2023-04-19 13:35:09 | 41.92000000 | -87.65000000 | 41.91000000 | -87.65000000 | member | ||||
| 77A96F460101AC63 | electric_bike | 2023-04-19 12:05:36 | 2023-04-19 12:10:26 | 41.91000000 | -87.65000000 | 41.91000000 | -87.63000000 | member | ||||
| 8D6A2328E19DC168 | electric_bike | 2023-04-19 12:17:34 | 2023-04-19 12:21:38 | 41.91000000 | -87.63000000 | 41.92000000 | -87.65000000 | member | ||||
| C97BBA66E07889F9 | electric_bike | 2023-04-19 09:35:48 | 2023-04-19 09:45:00 | 41.93000000 | -87.66000000 | 41.91000000 | -87.65000000 | member | ||||
| 6687AD4C575FF734 | electric_bike | 2023-04-11 16:13:43 | 2023-04-11 16:18:41 | 42.00000000 | -87.66000000 | 41.99000000 | -87.66000000 | member | ||||
| A8FA4F73B22BC11F | electric_bike | 2023-04-11 16:29:24 | 2023-04-11 16:40:23 | 41.99000000 | -87.66000000 | 42.00000000 | -87.66000000 | member | ||||
| 81E158FE63D99994 | electric_bike | 2023-04-19 17:35:40 | 2023-04-19 17:36:11 | 41.88000000 | -87.65000000 | 41.88000000 | -87.65000000 | member | ||||
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| BAE25DBA0CD0DF40 | electric_bike | 2023-04-14 07:12:34 | 2023-04-14 07:42:07 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93186915 | -87.70120764 | 41.87593267 | -87.63058454 | member |
| 5A98F86A573AAB2C | electric_bike | 2023-04-24 07:27:02 | 2023-04-24 07:58:22 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93198085 | -87.70128024 | 41.87593267 | -87.63058454 | member |
| 92B02F2FBDC1FB9F | classic_bike | 2023-04-12 08:16:48 | 2023-04-12 08:40:08 | Michigan Ave & Lake St | TA1305000011 | May St & Taylor St | 13160 | 41.88602200 | -87.62439800 | 41.86948210 | -87.65548640 | member |
| 119A6C53607EAA4A | electric_bike | 2023-04-28 07:24:54 | 2023-04-28 07:53:44 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93190396 | -87.70125461 | 41.87593267 | -87.63058454 | member |
| 9BCF1E8BA027EAFA | electric_bike | 2023-04-21 07:15:06 | 2023-04-21 07:41:45 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93193305 | -87.70126295 | 41.87593267 | -87.63058454 | member |
| A17D800CE963661A | classic_bike | 2023-04-11 15:46:42 | 2023-04-11 15:50:03 | Michigan Ave & Lake St | TA1305000011 | Clark St & Randolph St | TA1305000030 | 41.88602200 | -87.62439800 | 41.88457623 | -87.63188991 | member |
| 8B441A6C436E9900 | classic_bike | 2023-04-29 21:20:21 | 2023-04-29 21:30:19 | Halsted St & 18th St | 13099 | Blue Island Ave & 18th St | 13135 | 41.85750568 | -87.64599144 | 41.85755600 | -87.66153500 | casual |
| 3980D64BE11540F1 | classic_bike | 2023-04-24 09:16:05 | 2023-04-24 09:22:27 | Halsted St & 18th St | 13099 | Blue Island Ave & 18th St | 13135 | 41.85750568 | -87.64599144 | 41.85755600 | -87.66153500 | casual |
| 3EF4B49FF7DAA02C | classic_bike | 2023-04-18 07:53:51 | 2023-04-18 07:59:16 | Franklin St & Jackson Blvd | TA1305000025 | Clark St & Randolph St | TA1305000030 | 41.87770796 | -87.63532114 | 41.88457623 | -87.63188991 | casual |
| 210B2ED6583DC231 | classic_bike | 2023-04-29 07:33:55 | 2023-04-29 07:38:57 | Michigan Ave & Lake St | TA1305000011 | Clark St & Randolph St | TA1305000030 | 41.88602200 | -87.62439800 | 41.88457623 | -87.63188991 | casual |
| D29CB39B9E3FC46A | electric_bike | 2023-04-18 08:00:32 | 2023-04-18 08:02:35 | Franklin St & Jackson Blvd | TA1305000025 | Clark St & Ida B Wells Dr | TA1305000009 | 41.87813417 | -87.63525450 | 41.87593267 | -87.63058454 | casual |
(-> bike-trips-filename
(tc/dataset {:num-rows 3}))202304-divvy-tripdata.zip [3 13]:
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8FE8F7D9C10E88C7 | electric_bike | 2023-04-02 08:37:28 | 2023-04-02 08:41:37 | 41.80 | -87.60 | 41.79 | -87.60 | member | ||||
| 34E4ED3ADF1D821B | electric_bike | 2023-04-19 11:29:02 | 2023-04-19 11:52:12 | 41.87 | -87.65 | 41.93 | -87.68 | member | ||||
| 5296BF07A2F77CB5 | electric_bike | 2023-04-19 08:41:22 | 2023-04-19 08:43:22 | 41.93 | -87.66 | 41.93 | -87.66 | member |
So reading a dataset is easy, but sometimes we may wish to pass a few options to handle it a bit better.
For example, you see that by default, the column names are strings:
(-> bike-trips-filename
(tc/dataset {:num-rows 9})
tc/column-names)("ride_id"
"rideable_type"
"started_at"
"ended_at"
"start_station_name"
"start_station_id"
"end_station_name"
"end_station_id"
"start_lat"
"start_lng"
"end_lat"
"end_lng"
"member_casual")We can apply the keyword function to all of them, to conveniently have keywords instead.
(-> bike-trips-filename
(tc/dataset {:num-rows 9
:key-fn keyword})
tc/column-names)(:ride_id
:rideable_type
:started_at
:ended_at
:start_station_name
:start_station_id
:end_station_name
:end_station_id
:start_lat
:start_lng
:end_lat
:end_lng
:member_casual)Even better, we may process the names to replace underscores with dashes.
(-> bike-trips-filename
(tc/dataset {:num-rows 9
:key-fn (fn [s]
(-> s
(str/replace #"_" "-")
keyword))})
tc/column-names)(:ride-id
:rideable-type
:started-at
:ended-at
:start-station-name
:start-station-id
:end-station-name
:end-station-id
:start-lat
:start-lng
:end-lat
:end-lng
:member-casual)Also, the date-time columns are parsed as strings.
(-> bike-trips-filename
(tc/dataset {:num-rows 9
:key-fn (fn [s]
(-> s
(str/replace #"_" "-")
keyword))})
:started-at
tcc/typeof):stringLet us specify our own parsing for these columns.
(def datetime-parser [:local-date-time
"yyyy-MM-dd HH:mm:ss"])(-> bike-trips-filename
(tc/dataset {:num-rows 9
:key-fn (fn [s]
(-> s
(str/replace #"_" "-")
keyword))
;; Note we use the original column names
;; when defining the parser:
:parser-fn {"started_at" datetime-parser
"ended_at" datetime-parser}})
:started-at
tcc/typeof):local-date-timeLet us now read the whole dataset and hold it in a var for further exploration. We use defonce so that next time we evaluate this expression, nothing will happen. This practice is handy when reading big files.
(defonce trips
(-> bike-trips-filename
(tc/dataset {:key-fn (fn [s]
(-> s
(str/replace #"_" "-")
keyword))
:parser-fn {"started_at" datetime-parser
"ended_at" datetime-parser}})))trips202304-divvy-tripdata.zip [426590 13]:
| :ride-id | :rideable-type | :started-at | :ended-at | :start-station-name | :start-station-id | :end-station-name | :end-station-id | :start-lat | :start-lng | :end-lat | :end-lng | :member-casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8FE8F7D9C10E88C7 | electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | 41.80000000 | -87.60000000 | 41.79000000 | -87.60000000 | member | ||||
| 34E4ED3ADF1D821B | electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | 41.87000000 | -87.65000000 | 41.93000000 | -87.68000000 | member | ||||
| 5296BF07A2F77CB5 | electric_bike | 2023-04-19T08:41:22 | 2023-04-19T08:43:22 | 41.93000000 | -87.66000000 | 41.93000000 | -87.66000000 | member | ||||
| 40759916B76D5D52 | electric_bike | 2023-04-19T13:31:30 | 2023-04-19T13:35:09 | 41.92000000 | -87.65000000 | 41.91000000 | -87.65000000 | member | ||||
| 77A96F460101AC63 | electric_bike | 2023-04-19T12:05:36 | 2023-04-19T12:10:26 | 41.91000000 | -87.65000000 | 41.91000000 | -87.63000000 | member | ||||
| 8D6A2328E19DC168 | electric_bike | 2023-04-19T12:17:34 | 2023-04-19T12:21:38 | 41.91000000 | -87.63000000 | 41.92000000 | -87.65000000 | member | ||||
| C97BBA66E07889F9 | electric_bike | 2023-04-19T09:35:48 | 2023-04-19T09:45 | 41.93000000 | -87.66000000 | 41.91000000 | -87.65000000 | member | ||||
| 6687AD4C575FF734 | electric_bike | 2023-04-11T16:13:43 | 2023-04-11T16:18:41 | 42.00000000 | -87.66000000 | 41.99000000 | -87.66000000 | member | ||||
| A8FA4F73B22BC11F | electric_bike | 2023-04-11T16:29:24 | 2023-04-11T16:40:23 | 41.99000000 | -87.66000000 | 42.00000000 | -87.66000000 | member | ||||
| 81E158FE63D99994 | electric_bike | 2023-04-19T17:35:40 | 2023-04-19T17:36:11 | 41.88000000 | -87.65000000 | 41.88000000 | -87.65000000 | member | ||||
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| BAE25DBA0CD0DF40 | electric_bike | 2023-04-14T07:12:34 | 2023-04-14T07:42:07 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93186915 | -87.70120764 | 41.87593267 | -87.63058454 | member |
| 5A98F86A573AAB2C | electric_bike | 2023-04-24T07:27:02 | 2023-04-24T07:58:22 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93198085 | -87.70128024 | 41.87593267 | -87.63058454 | member |
| 92B02F2FBDC1FB9F | classic_bike | 2023-04-12T08:16:48 | 2023-04-12T08:40:08 | Michigan Ave & Lake St | TA1305000011 | May St & Taylor St | 13160 | 41.88602200 | -87.62439800 | 41.86948210 | -87.65548640 | member |
| 119A6C53607EAA4A | electric_bike | 2023-04-28T07:24:54 | 2023-04-28T07:53:44 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93190396 | -87.70125461 | 41.87593267 | -87.63058454 | member |
| 9BCF1E8BA027EAFA | electric_bike | 2023-04-21T07:15:06 | 2023-04-21T07:41:45 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93193305 | -87.70126295 | 41.87593267 | -87.63058454 | member |
| A17D800CE963661A | classic_bike | 2023-04-11T15:46:42 | 2023-04-11T15:50:03 | Michigan Ave & Lake St | TA1305000011 | Clark St & Randolph St | TA1305000030 | 41.88602200 | -87.62439800 | 41.88457623 | -87.63188991 | member |
| 8B441A6C436E9900 | classic_bike | 2023-04-29T21:20:21 | 2023-04-29T21:30:19 | Halsted St & 18th St | 13099 | Blue Island Ave & 18th St | 13135 | 41.85750568 | -87.64599144 | 41.85755600 | -87.66153500 | casual |
| 3980D64BE11540F1 | classic_bike | 2023-04-24T09:16:05 | 2023-04-24T09:22:27 | Halsted St & 18th St | 13099 | Blue Island Ave & 18th St | 13135 | 41.85750568 | -87.64599144 | 41.85755600 | -87.66153500 | casual |
| 3EF4B49FF7DAA02C | classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | Franklin St & Jackson Blvd | TA1305000025 | Clark St & Randolph St | TA1305000030 | 41.87770796 | -87.63532114 | 41.88457623 | -87.63188991 | casual |
| 210B2ED6583DC231 | classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | Michigan Ave & Lake St | TA1305000011 | Clark St & Randolph St | TA1305000030 | 41.88602200 | -87.62439800 | 41.88457623 | -87.63188991 | casual |
| D29CB39B9E3FC46A | electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | Franklin St & Jackson Blvd | TA1305000025 | Clark St & Ida B Wells Dr | TA1305000009 | 41.87813417 | -87.63525450 | 41.87593267 | -87.63058454 | casual |
(tc/info trips)202304-divvy-tripdata.zip: descriptive-stats [13 12]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| :ride-id | :string | 426590 | 0 | 89CC272248838ACC | 8FE8F7D9C10E88C7 | D29CB39B9E3FC46A | |||||
| :rideable-type | :string | 426590 | 0 | electric_bike | electric_bike | electric_bike | |||||
| :started-at | :local-date-time | 426590 | 0 | 2023-04-01T00:00:02 | 2023-04-16T05:15:35.589 | 2023-04-30T23:59:05 | 6.79308822E+08 | 0.08185427 | 2023-04-02T08:37:28 | 2023-04-18T08:00:32 | |
| :ended-at | :local-date-time | 426590 | 0 | 2023-04-01T00:03:10 | 2023-04-16T05:32:48.230 | 2023-05-03T10:37:12 | 6.79294547E+08 | 0.08147835 | 2023-04-02T08:41:37 | 2023-04-18T08:02:35 | |
| :start-station-name | :string | 362776 | 63814 | Franklin St & Jackson Blvd | |||||||
| :start-station-id | :string | 362776 | 63814 | TA1305000025 | |||||||
| :end-station-name | :string | 357960 | 68630 | Clark St & Ida B Wells Dr | |||||||
| :end-station-id | :string | 357960 | 68630 | TA1305000009 | |||||||
| :start-lat | :float64 | 426590 | 0 | 41.65 | 41.90 | 42.07 | 4.70980524E-02 | -0.35851400 | 41.80 | 41.88 | |
| :start-lng | :float64 | 426590 | 0 | -87.83 | -87.65 | -87.52 | 2.76893978E-02 | -0.82254345 | -87.60 | -87.64 | |
| :end-lat | :float64 | 426155 | 435 | 41.65 | 41.90 | 42.08 | 4.72412894E-02 | -0.37120814 | 41.79 | 41.88 | |
| :end-lng | :float64 | 426155 | 435 | -88.11 | -87.65 | -87.53 | 2.78076951E-02 | -0.82990154 | -87.60 | -87.63 | |
| :member-casual | :string | 426590 | 0 | member | member | casual |
It is a whole month of bike trips!
5.13 Getting the rows of a dataset
Datasets are organized by columns for efficiency, making use of the knowledge of homogenous types in columns.
Sometimes, however, it is useful to work with rows as well.
The tc/rows function provides the rows of a dataset, either as vectors or as maps. Note, however, that it does not copy the data. Rather, it provides a row-wise view of the column-wise dataset.
(take 2 (tc/rows trips))(["8FE8F7D9C10E88C7"
"electric_bike"
#object[java.time.LocalDateTime 0x27c5ae14 "2023-04-02T08:37:28"]
#object[java.time.LocalDateTime 0x491719c "2023-04-02T08:41:37"]
nil
nil
nil
nil
41.8
-87.6
41.79
-87.6
"member"]
["34E4ED3ADF1D821B"
"electric_bike"
#object[java.time.LocalDateTime 0x725575e2 "2023-04-19T11:29:02"]
#object[java.time.LocalDateTime 0x3fa841ea "2023-04-19T11:52:12"]
nil
nil
nil
nil
41.87
-87.65
41.93
-87.68
"member"])(take 2 (tc/rows trips :as-maps))({:ride-id "8FE8F7D9C10E88C7",
:rideable-type "electric_bike",
:started-at
#object[java.time.LocalDateTime 0x27c5ae14 "2023-04-02T08:37:28"],
:ended-at
#object[java.time.LocalDateTime 0x491719c "2023-04-02T08:41:37"],
:start-station-name nil,
:start-station-id nil,
:end-station-name nil,
:end-station-id nil,
:start-lat 41.8,
:start-lng -87.6,
:end-lat 41.79,
:end-lng -87.6,
:member-casual "member"}
{:ride-id "34E4ED3ADF1D821B",
:rideable-type "electric_bike",
:started-at
#object[java.time.LocalDateTime 0x725575e2 "2023-04-19T11:29:02"],
:ended-at
#object[java.time.LocalDateTime 0x3fa841ea "2023-04-19T11:52:12"],
:start-station-name nil,
:start-station-id nil,
:end-station-name nil,
:end-station-id nil,
:start-lat 41.87,
:start-lng -87.65,
:end-lat 41.93,
:end-lng -87.68,
:member-casual "member"})As you may know, Clojure shines in processing plain data, structured or unstructured, such as vectors and maps of any content. We do not lose any of that when using datasets, as we can still view them as rows which are just maps or vectors.
5.14 Querying datasets
Tablecloth offers various ways to view a subset of a dataset. Typically, they do not copy the data but provide views of the same space in memory.
The first few trips:
(tc/head trips)202304-divvy-tripdata.zip [5 13]:
| :ride-id | :rideable-type | :started-at | :ended-at | :start-station-name | :start-station-id | :end-station-name | :end-station-id | :start-lat | :start-lng | :end-lat | :end-lng | :member-casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8FE8F7D9C10E88C7 | electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | 41.80 | -87.60 | 41.79 | -87.60 | member | ||||
| 34E4ED3ADF1D821B | electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | 41.87 | -87.65 | 41.93 | -87.68 | member | ||||
| 5296BF07A2F77CB5 | electric_bike | 2023-04-19T08:41:22 | 2023-04-19T08:43:22 | 41.93 | -87.66 | 41.93 | -87.66 | member | ||||
| 40759916B76D5D52 | electric_bike | 2023-04-19T13:31:30 | 2023-04-19T13:35:09 | 41.92 | -87.65 | 41.91 | -87.65 | member | ||||
| 77A96F460101AC63 | electric_bike | 2023-04-19T12:05:36 | 2023-04-19T12:10:26 | 41.91 | -87.65 | 41.91 | -87.63 | member |
Just a few columns:
(-> trips
(tc/select-columns [:rideable-type :started-at :ended-at])
(print/print-range 5))202304-divvy-tripdata.zip [426590 3]:
| :rideable-type | :started-at | :ended-at |
|---|---|---|
| electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 |
| electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 |
| … | … | … |
| classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 |
| classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 |
| electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 |
Only rows about classical bikes, and just a few columns:
(-> trips
(tc/select-rows (fn [row]
(-> row :rideable-type (= "classic_bike"))))
tc/head
(tc/select-columns [:rideable-type :started-at :ended-at])
(print/print-range 5))202304-divvy-tripdata.zip [5 3]:
| :rideable-type | :started-at | :ended-at |
|---|---|---|
| classic_bike | 2023-04-10T17:34:35 | 2023-04-10T18:02:36 |
| classic_bike | 2023-04-12T12:29:46 | 2023-04-12T12:54 |
| classic_bike | 2023-04-29T20:57:10 | 2023-04-29T20:57:13 |
| classic_bike | 2023-04-20T17:03:11 | 2023-04-20T17:24:58 |
| classic_bike | 2023-04-07T21:14:14 | 2023-04-07T21:15:14 |
5.15 Adding columns
Here we will demonstrate some of the ways to extend a dataset with new columns. For clarity, let us focus on a dataset with just a few of the columns:
(-> trips
(tc/select-columns [:start-lat :end-lat])
(print/print-range 5))202304-divvy-tripdata.zip [426590 2]:
| :start-lat | :end-lat |
|---|---|
| 41.80000000 | 41.79000000 |
| 41.87000000 | 41.93000000 |
| … | … |
| 41.87770796 | 41.88457623 |
| 41.88602200 | 41.88457623 |
| 41.87813417 | 41.87593267 |
One can create new columns by applying the dataset equivalents to the column operations discussed above. These operations have the same names, but now take a dataset and the name of the new column as additional inputs. For example, the min function, when taken from the tablecloth.api, can be used as follows:
(-> trips
(tc/select-columns [:start-lat :end-lat])
(tc/min :min-lat [:start-lat :end-lat])
(print/print-range 5))202304-divvy-tripdata.zip [426590 3]:
| :start-lat | :end-lat | :min-lat |
|---|---|---|
| 41.80000000 | 41.79000000 | 41.79000000 |
| 41.87000000 | 41.93000000 | 41.87000000 |
| … | … | … |
| 41.87770796 | 41.88457623 | 41.87770796 |
| 41.88602200 | 41.88457623 | 41.88457623 |
| 41.87813417 | 41.87593267 | 41.87593267 |
When no built-in operations exist, we can use the functions tc/map-columns and tc/add-columns.
The tc/map-columns function is useful when one needs to apply a function to the values in one or more of the existing columns, for every row.
(-> trips
(tc/select-columns [:rideable-type :started-at :ended-at])
(print/print-range 5)
(tc/map-columns :duration
[:started-at :ended-at]
java-time/duration))202304-divvy-tripdata.zip [426590 4]:
| :rideable-type | :started-at | :ended-at | :duration |
|---|---|---|---|
| electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S |
| electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S |
| … | … | … | … |
| classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S |
| classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S |
| electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S |
(-> trips
(tc/select-columns [:rideable-type :started-at :ended-at])
(print/print-range 5)
(tc/map-columns :duration
[:started-at :ended-at]
java-time/duration)
(tc/map-columns :duration-in-seconds
[:duration]
#(java-time/as % :seconds)))202304-divvy-tripdata.zip [426590 5]:
| :rideable-type | :started-at | :ended-at | :duration | :duration-in-seconds |
|---|---|---|---|---|
| electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S | 249 |
| electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S | 1390 |
| … | … | … | … | … |
| classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S | 325 |
| classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S | 302 |
| electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S | 123 |
The tc/add-columns function is useful when one needs to apply a function to a whole dataset and return a whole column at once. This combines nicely with the column API (tcc).
(-> trips
(tc/select-columns [:rideable-type :started-at :ended-at])
(print/print-range 5)
(tc/map-columns :duration
[:started-at :ended-at]
java-time/duration)
(tc/map-columns :duration-in-seconds
[:duration]
#(java-time/as % :seconds))
(tc/add-column :duration-in-minutes
(fn [ds]
(-> ds
:duration-in-seconds
(tcc/* 1/60)))))202304-divvy-tripdata.zip [426590 6]:
| :rideable-type | :started-at | :ended-at | :duration | :duration-in-seconds | :duration-in-minutes |
|---|---|---|---|---|---|
| electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S | 249 | 4.150 |
| electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S | 1390 | 23.17 |
| … | … | … | … | … | … |
| classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S | 325 | 5.417 |
| classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S | 302 | 5.033 |
| electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S | 123 | 2.050 |
Let us also add a column of the hour when each trip started – an integer from 0 to 23.
To do that, we will first add the hour as a column. Earlier, we did some time processing using the java-time API. Now, we will demonstrate a different way, using the datetime namespace of dtype-next. This namespace offers some handy functions that act on whole columns.
Let us keep this dataset in a var.
(def preprocessed-trips
(-> trips
(tc/select-columns [:rideable-type :started-at :ended-at])
(print/print-range 5)
(tc/map-columns :duration
[:started-at :ended-at]
java-time/duration)
(tc/map-columns :duration-in-seconds
[:duration]
#(java-time/as % :seconds))
(tc/add-column :duration-in-minutes
(fn [ds]
(-> ds
:duration-in-seconds
(tcc/* 1/60))))
(tc/add-column :hour
(fn [ds]
(datetime/long-temporal-field
:hours
(:started-at ds))))))preprocessed-trips202304-divvy-tripdata.zip [426590 7]:
| :rideable-type | :started-at | :ended-at | :duration | :duration-in-seconds | :duration-in-minutes | :hour |
|---|---|---|---|---|---|---|
| electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S | 249 | 4.150 | 8 |
| electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S | 1390 | 23.17 | 11 |
| … | … | … | … | … | … | … |
| classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S | 325 | 5.417 | 7 |
| classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S | 302 | 5.033 | 7 |
| electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S | 123 | 2.050 | 8 |
5.16 Grouping and summarizing
For how long is a bike used typically? We can answer this question by using one of the built-in summarizing operations. These are equivalent to the column operations that output a scalar value, but require the name of the column that we use as input.
(-> preprocessed-trips
(tc/median :duration-in-minutes))_unnamed [1 1]:
| summary |
|---|
| 8.91666667 |
This gives us a summary measure over the entire dataset. Alternatively, we can group by the values in one of the columns.
(-> preprocessed-trips
(tc/group-by [:rideable-type])
(print/print-range 5))_unnamed [3 3]:
| :name | :group-id | :data |
|---|---|---|
| {:rideable-type electric_bike} | 0 | Group: {:rideable-type “electric_bike”} [247965 7]: |
| {:rideable-type classic_bike} | 1 | Group: {:rideable-type “classic_bike”} [169738 7]: |
| {:rideable-type docked_bike} | 2 | Group: {:rideable-type “docked_bike”} [8887 7]: |
The resulting dataset is a dataset of a special kind, a grouped dataset. Its :data column contains whole datasets, which are the groups. In our case, these are the groups of bike trips starting in a given hour, for every hour throughout the day.
We can then apply an operation conditional on the groups.
(-> preprocessed-trips
(tc/group-by [:rideable-type])
(tc/mean :duration-in-minutes))_unnamed [3 2]:
| :rideable-type | summary |
|---|---|
| electric_bike | 11.76538409 |
| classic_bike | 18.54251022 |
| docked_bike | 143.70773039 |
More generally, we can use tc/aggregate to apply an arbitrary summary function.
How does bike usage change throughout the day?
Let us see how the number of trips and their median length change by the hour.
Let us group the trips by the hour:
(-> preprocessed-trips
(tc/group-by [:hour])
(print/print-range 5))_unnamed [24 3]:
| :name | :group-id | :data |
|---|---|---|
| {:hour 8} | 0 | Group: {:hour 8} [22170 7]: |
| {:hour 11} | 1 | Group: {:hour 11} [20593 7]: |
| … | … | … |
| {:hour 23} | 21 | Group: {:hour 23} [7327 7]: |
| {:hour 6} | 22 | Group: {:hour 6} [8863 7]: |
| {:hour 5} | 23 | Group: {:hour 5} [2802 7]: |
Now, we can aggregate over the groups to receive a summary. The resulting summary dataset will no longer be a grouped dataset. We will order the summary by the hour.
(-> preprocessed-trips
(tc/group-by [:hour])
(tc/aggregate {:n-trips tc/row-count
:median-duration (fn [ds]
(tcc/median (:duration-in-seconds ds)))})
(tc/order-by [:hour])
(print/print-range :all))_unnamed [24 3]:
| :hour | :n-trips | :median-duration |
|---|---|---|
| 0 | 5111 | 487.0 |
| 1 | 3235 | 478.0 |
| 2 | 1874 | 461.0 |
| 3 | 1165 | 491.0 |
| 4 | 950 | 452.0 |
| 5 | 2802 | 394.0 |
| 6 | 8863 | 434.0 |
| 7 | 17065 | 473.0 |
| 8 | 22170 | 472.0 |
| 9 | 17316 | 466.0 |
| 10 | 17489 | 510.0 |
| 11 | 20593 | 519.0 |
| 12 | 24050 | 536.0 |
| 13 | 24837 | 560.0 |
| 14 | 25900 | 585.0 |
| 15 | 30730 | 574.0 |
| 16 | 40374 | 590.0 |
| 17 | 46889 | 597.0 |
| 18 | 38281 | 578.0 |
| 19 | 27107 | 537.0 |
| 20 | 17252 | 505.0 |
| 21 | 14388 | 502.0 |
| 22 | 10822 | 508.0 |
| 23 | 7327 | 485.0 |
We can see a peak of usage between 17:00 to 18:00 and a possibly slight tendency for longer trips (in time) around the afternoon hours.
For further examples of summarizing, see the Tablecloth aggregate documentation.