5 Table processing with Tablecloth
author: Daniel Slutsky
last change: 2024-12-08
Tablecloth is a table processing library inspired by the dataframe ergonomics typicall to the R ecosystem, specifically the Tidyverse, but offers certain advantages on top of that.
It is built on top of the data structures and functions of tech.ml.dataset, a high-performance table processing library (often called TMD), which is built on top of dtype-next, an array-programming library.
In this tutorial, we will see a few of the core ideas of Tablecloth. You are encouraged to look into the main documentation for more information.
5.1 Recommended resources
Data Manipulation in Clojure Compared to R and Python by Kira Howe (McLean) (2024-07-18)
Dealing with out-of-memory faulty csvβs with Clojure, Duckdb, and Parquet by Georgy Toporkov (2024-01-22)
A beginner-friendly intro by Mey Beisaron (Func Prog Sweden, 2023-03-22):
- An in-depth walkthrough by Ethan Miller (data-recur group, 2022-11-05):
5.2 About this tutorial
In this tutorial, we will demonstrate the ergonomics of so-called dataset data strucures provided by Tablecloth. Datasets are table-like data structures, often called data-frames in other data science platforms.
We will assume basic familiarity with Clojure.
A lot of what we demonstrate here can also be implemented with the usual Clojure data strucures such as vectors and maps. However, datasets offer not only performance advantages in space and time, but also certain usability features, which are arguably expressive and powerful.
We will often use treading macros, mostly ->
. This approach is compatible with data science cultures such as the Tidyverse in R.
5.3 Setup
We create a namespace and require the Tablecloth API namespaces: The main Dataset API tablecloth.api
and the Column API tablecloth.column.api
that weβll see below. We will also use tech.v3.dataset.print
to control printing, clojure.string
for some string processing, Kindly to control the way certain things are displayed, Clojure.Java-Time for some time calculations, and the datetime
namespace of dtype-next.
ns noj-book.tablecloth-table-processing
(:require [tablecloth.api :as tc]
(:as tcc]
[tablecloth.column.api print :as print]
[tech.v3.dataset.:as str]
[clojure.string :as kind]
[scicloj.kindly.v4.kind :as java-time]
[java-time.api :as datetime])) [tech.v3.datatype.datetime
5.4 Creating a dataset
Assume we have a vector of vectors representing bike trips. Each trip has the type of the bike and the coordinates (latitude and longitude) of the tripβs start and end. We can turn this data structure into a dataset:
"classic_bike" 41.906866 -87.626217 41.92393131136619 -87.63582453131676]
(tc/dataset [["electric_bike" 41.869312286 -87.673897266 41.8895 -87.688257]
["classic_bike" 41.95600355078549 -87.68016144633293 41.886875 -87.62603]]
[:column-names [:rideable-type
{:start-lat :start-lng
:end-lat :end-lng]})
:_unnamed [3 5]:
:rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
---|---|---|---|---|
classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
Sometimes, data may arrive in different shapes. We can also create such a dataset from a vector of maps:
:rideable-type "classic_bike"
(tc/dataset [{:start-lat 41.906866
:start-lng -87.626217
:end-lat 41.92393131136619
:end-lng -87.63582453131676}
:rideable-type "electric_bike"
{:start-lat 41.869312286
:start-lng -87.673897266
:end-lat 41.8895
:end-lng -87.688257 }
:rideable-type "classic_bike"
{:start-lat 41.95600355078549
:start-lng -87.68016144633293
:end-lat 41.886875
:end-lng -87.62603}])
_unnamed [3 5]:
:rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
---|---|---|---|---|
classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
.. and also, from a map of vectors:
:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
(tc/dataset {:start-lat [41.906866 41.869312286 41.95600355078549]
:start-lng [-87.626217 -87.673897266 -87.68016144633293]
:end-lat [41.92393131136619 41.8895 41.886875]
:end-lng [-87.63582453131676 -87.688257 -87.62603]})
_unnamed [3 5]:
:rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
---|---|---|---|---|
classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
Let us hold it in a var to explore it further:
def some-trips
(:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
(tc/dataset {:start-lat [41.906866 41.869312286 41.95600355078549]
:start-lng [-87.626217 -87.673897266 -87.68016144633293]
:end-lat [41.92393131136619 41.8895 41.886875]
:end-lng [-87.63582453131676 -87.688257 -87.62603]}))
5.5 Displaying a dataset
In an environment compatible with the Kindly standard, the default way a dataset is displayed is by printing it.
some-trips
_unnamed [3 5]:
:rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
---|---|---|---|---|
classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
If necessary, we may customize the printing using the tech.v3.dataset.print
namespace of the tech.ml.dataset library.
For example:
-> {:x (range 99)
(:y (repeatedly 99 rand)}
tc/dataset;; show at most 9 values
9)) (print/print-range
_unnamed [99 2]:
:x | :y |
---|---|
0 | 0.35437556 |
1 | 0.50994490 |
2 | 0.39946618 |
3 | 0.01589818 |
β¦ | β¦ |
94 | 0.16528626 |
95 | 0.99148194 |
96 | 0.45386767 |
97 | 0.92543445 |
98 | 0.57312692 |
We may also explicitly turn it into an HTML table:
(kind/table some-trips)
rideable-type | start-lat | start-lng | end-lat | end-lng |
---|---|---|---|---|
classic_bike | 41.906866 | -87.626217 | 41.92393131136619 | -87.63582453131676 |
electric_bike | 41.869312286 | -87.673897266 | 41.8895 | -87.688257 |
classic_bike | 41.95600355078549 | -87.68016144633293 | 41.886875 | -87.62603 |
This does not matter much for now, but it can be handy when certain inner values should be visualized in a certain way.
It is possible to use datatables to reneder kind/table
and specify datatables options (see the full list).
(kind/table some-trips:use-datatables true
{:datatables {:scrollY 100}})
rideable-type | start-lat | start-lng | end-lat | end-lng |
---|---|---|---|---|
classic_bike | 41.906866 | -87.626217 | 41.92393131136619 | -87.63582453131676 |
electric_bike | 41.869312286 | -87.673897266 | 41.8895 | -87.688257 |
classic_bike | 41.95600355078549 | -87.68016144633293 | 41.886875 | -87.62603 |
We may also nest the usual printed table inside other visualization kinds, such as Hiccup.
(kind/hiccup:div {:style {:width "60%"
[:max-height "400px"
:overflow-x :auto
:overflow-y :auto
:background "floralwhite"}}
some-trips])
_unnamed [3 5]:
:rideable-type | :start-lat | :start-lng | :end-lat | :end-lng |
---|---|---|---|---|
classic_bike | 41.90686600 | -87.62621700 | 41.92393131 | -87.63582453 |
electric_bike | 41.86931229 | -87.67389727 | 41.88950000 | -87.68825700 |
classic_bike | 41.95600355 | -87.68016145 | 41.88687500 | -87.62603000 |
For use in this tutorial, let us define our own customized view:
defn compact-view [dataset]
(
(kind/hiccup:div {:style {:max-width "100%"
[:max-height "400px"
:overflow-x :auto
:overflow-y :auto}}
dataset]))
5.6 What is a dataset?
Let us explore this data structure, our little dataset of bike trips.
A dataset is a value of the Dataset
datatype defined in the tech.ml.dataset library:
type some-trips) (
tech.v3.dataset.impl.dataset.Dataset
One thing worth knowing about this datatype is that it is extended by quite a few interfaces and protocols.
For example, it behaves as a map.
map? some-trips) (
true
The keys are the column names:
keys some-trips) (
:rideable-type :start-lat :start-lng :end-lat :end-lng) (
(tc/column-names some-trips)
:rideable-type :start-lat :start-lng :end-lat :end-lng) (
.. and the values are the columns:
:start-lat some-trips) (
3]
#tech.v3.dataset.column<float64>[:start-lat
41.91, 41.87, 41.96] [
Now we need to discuss what columns are.
5.7 What is a column?
:start-lat some-trips) (
3]
#tech.v3.dataset.column<float64>[:start-lat
41.91, 41.87, 41.96] [
A column is a value of Column
datatype defined in the tech.ml.dataset library:
-> some-trips
(:start-lat
type)
tech.v3.dataset.impl.column.Column
This datatype is also extended by quite a few interfaces and protocols.
For example, it is sequential.
-> some-trips
(:start-lat
sequential?)
true
So, we can use the sequence abstraction with it:
->> (:start-lat some-trips)
(take 2)) (
41.906866 41.869312286) (
->> (:rideable-type some-trips)
(filter #{"classic_bike"})) (
"classic_bike" "classic_bike") (
It is also assiciative:
-> some-trips
(:rideable-type
assoc 2 "my strange and unique bike")) (
"classic_bike" "electric_bike" "my strange and unique bike"] [
5.8 Working with Columns
We may use Tableclothβs Column API to create and process columns. For example:
"classic_bike" "electrical_bike" "classic_bike"]) (tcc/column [
3]
#tech.v3.dataset.column<string>[
null [classic_bike, electrical_bike, classic_bike]
What is the average latitude where trips tend to start?
-> some-trips
(:start-lat
tcc/mean)
41.9107272789285
What is the type of elements in this Column?
-> some-trips
(:start-lat
tcc/typeof)
:float64
Let us look into our latitudes in radians rather than degrees:
-> some-trips
(:start-lat
/ Math/PI 180))) (tcc/* (
3]
#tech.v3.dataset.column<float64>[
null0.7314, 0.7308, 0.7323] [
You see, columns are typed, and this has implications for both for time and space performance as well as ergonomics.
5.9 The data in columns
You will probably not need this detail most times, but it is worth knowing that the data actually held by the Column can be accessed as the .data
field, and it can be of varying data structures.
For example:
-> some-trips
(:start-lat
.datatype)
tech.v3.datatype.array_buffer.ArrayBuffer
-> some-trips
(:rideable-type
.datatype)
tech.v3.dataset.string_table.StringTable
-> (range 9)
(
tcc/column
.datatype)
clojure.lang.LongRange
Behind the scenes, tech.ml.dataset makes sure to use efficient data structures for columns, so that, e.g., random access by index will be efficient:
-> some-trips
(:start-lat
nth 2)) (
41.95600355078549
The following is quick too!
-> (range 1000000)
(1000)
(tcc/* nth 10000)) (
10000000
Here we rely on the βlazy and noncachingβ semantics of the undelying dtype-next library, which is a topic worth its own tutorial.
5.10 Summarizing datasets
We can use the tc/info
function to summarize a dataset:
-> some-trips
(
tc/info compact-view)
_unnamed: descriptive-stats [5 12]:
:col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
---|---|---|---|---|---|---|---|---|---|---|---|
:rideable-type | :string | 3 | 0 | classic_bike | classic_bike | classic_bike | |||||
:start-lat | :float64 | 3 | 0 | 41.86931229 | 41.91072728 | 41.95600355 | 0.04347443 | 0.39652474 | 41.91 | 41.96 | |
:start-lng | :float64 | 3 | 0 | -87.68016145 | -87.66009190 | -87.62621700 | 0.02950325 | 1.64462246 | -87.63 | -87.68 | |
:end-lat | :float64 | 3 | 0 | 41.88687500 | 41.90010210 | 41.92393131 | 0.02067839 | 1.70070295 | 41.92 | 41.89 | |
:end-lng | :float64 | 3 | 0 | -87.68825700 | -87.65003718 | -87.62603000 | 0.03345967 | -1.56657916 | -87.64 | -87.63 |
5.11 Reading datasets
Datasets are often read from files.
Let us read a file from the Chicago bike trips dataset. In this case, it is a CSV file compressed by gzip, but other formats are supported as well.
First, let us read just a few rows:
-> "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
(:num-rows 3})
(tc/dataset { compact-view)
data/chicago-bikes/202304_divvy_tripdata.csv.gz [3 13]:
ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8FE8F7D9C10E88C7 | electric_bike | 2023-04-02 08:37:28 | 2023-04-02 08:41:37 | 41.80 | -87.60 | 41.79 | -87.60 | member | ||||
34E4ED3ADF1D821B | electric_bike | 2023-04-19 11:29:02 | 2023-04-19 11:52:12 | 41.87 | -87.65 | 41.93 | -87.68 | member | ||||
5296BF07A2F77CB5 | electric_bike | 2023-04-19 08:41:22 | 2023-04-19 08:43:22 | 41.93 | -87.66 | 41.93 | -87.66 | member |
So reading a dataset is easy, but sometimes we may wish to pass a few options to handle it a bit better.
For example, you see that by default, the column names are strings:
-> "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
(:num-rows 9})
(tc/dataset { tc/column-names)
"ride_id"
("rideable_type"
"started_at"
"ended_at"
"start_station_name"
"start_station_id"
"end_station_name"
"end_station_id"
"start_lat"
"start_lng"
"end_lat"
"end_lng"
"member_casual")
We can apply the keyword
function to all of them, to conveniently have keywords instead.
-> "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
(:num-rows 9
(tc/dataset {:key-fn keyword})
tc/column-names)
:ride_id
(:rideable_type
:started_at
:ended_at
:start_station_name
:start_station_id
:end_station_name
:end_station_id
:start_lat
:start_lng
:end_lat
:end_lng
:member_casual)
Even better, we may process the names to replace underscores with dashes.
-> "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
(:num-rows 9
(tc/dataset {:key-fn (fn [s]
-> s
(#"_" "-")
(str/replace keyword))})
tc/column-names)
:ride-id
(:rideable-type
:started-at
:ended-at
:start-station-name
:start-station-id
:end-station-name
:end-station-id
:start-lat
:start-lng
:end-lat
:end-lng
:member-casual)
Also, the date-time columns are parsed as strings.
-> "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
(:num-rows 9
(tc/dataset {:key-fn (fn [s]
-> s
(#"_" "-")
(str/replace keyword))})
:started-at
tcc/typeof)
:string
Let us specify our own parsing for these columns.
def datetime-parser [:local-date-time
("yyyy-MM-dd HH:mm:ss"])
-> "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
(:num-rows 9
(tc/dataset {:key-fn (fn [s]
-> s
(#"_" "-")
(str/replace keyword))
;; Note we use the original column names
;; when defining the parser:
:parser-fn {"started_at" datetime-parser
"ended_at" datetime-parser}})
:started-at
tcc/typeof)
:local-date-time
Let us now read the whole dataset and hold it in a var for further exploration. We use defonce
so that next time we evaluate this expression, nothing will happen. This practice is handy when reading big files.
defonce trips
(-> "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
(:key-fn (fn [s]
(tc/dataset {-> s
(#"_" "-")
(str/replace keyword))
:parser-fn {"started_at" datetime-parser
"ended_at" datetime-parser}})))
(compact-view trips)
data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 13]:
:ride-id | :rideable-type | :started-at | :ended-at | :start-station-name | :start-station-id | :end-station-name | :end-station-id | :start-lat | :start-lng | :end-lat | :end-lng | :member-casual |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8FE8F7D9C10E88C7 | electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | 41.80000000 | -87.60000000 | 41.79000000 | -87.60000000 | member | ||||
34E4ED3ADF1D821B | electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | 41.87000000 | -87.65000000 | 41.93000000 | -87.68000000 | member | ||||
5296BF07A2F77CB5 | electric_bike | 2023-04-19T08:41:22 | 2023-04-19T08:43:22 | 41.93000000 | -87.66000000 | 41.93000000 | -87.66000000 | member | ||||
40759916B76D5D52 | electric_bike | 2023-04-19T13:31:30 | 2023-04-19T13:35:09 | 41.92000000 | -87.65000000 | 41.91000000 | -87.65000000 | member | ||||
77A96F460101AC63 | electric_bike | 2023-04-19T12:05:36 | 2023-04-19T12:10:26 | 41.91000000 | -87.65000000 | 41.91000000 | -87.63000000 | member | ||||
8D6A2328E19DC168 | electric_bike | 2023-04-19T12:17:34 | 2023-04-19T12:21:38 | 41.91000000 | -87.63000000 | 41.92000000 | -87.65000000 | member | ||||
C97BBA66E07889F9 | electric_bike | 2023-04-19T09:35:48 | 2023-04-19T09:45 | 41.93000000 | -87.66000000 | 41.91000000 | -87.65000000 | member | ||||
6687AD4C575FF734 | electric_bike | 2023-04-11T16:13:43 | 2023-04-11T16:18:41 | 42.00000000 | -87.66000000 | 41.99000000 | -87.66000000 | member | ||||
A8FA4F73B22BC11F | electric_bike | 2023-04-11T16:29:24 | 2023-04-11T16:40:23 | 41.99000000 | -87.66000000 | 42.00000000 | -87.66000000 | member | ||||
81E158FE63D99994 | electric_bike | 2023-04-19T17:35:40 | 2023-04-19T17:36:11 | 41.88000000 | -87.65000000 | 41.88000000 | -87.65000000 | member | ||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
BAE25DBA0CD0DF40 | electric_bike | 2023-04-14T07:12:34 | 2023-04-14T07:42:07 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93186915 | -87.70120764 | 41.87593267 | -87.63058454 | member |
5A98F86A573AAB2C | electric_bike | 2023-04-24T07:27:02 | 2023-04-24T07:58:22 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93198085 | -87.70128024 | 41.87593267 | -87.63058454 | member |
92B02F2FBDC1FB9F | classic_bike | 2023-04-12T08:16:48 | 2023-04-12T08:40:08 | Michigan Ave & Lake St | TA1305000011 | May St & Taylor St | 13160 | 41.88602200 | -87.62439800 | 41.86948210 | -87.65548640 | member |
119A6C53607EAA4A | electric_bike | 2023-04-28T07:24:54 | 2023-04-28T07:53:44 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93190396 | -87.70125461 | 41.87593267 | -87.63058454 | member |
9BCF1E8BA027EAFA | electric_bike | 2023-04-21T07:15:06 | 2023-04-21T07:41:45 | Richmond St & Diversey Ave | 15645 | Clark St & Ida B Wells Dr | TA1305000009 | 41.93193305 | -87.70126295 | 41.87593267 | -87.63058454 | member |
A17D800CE963661A | classic_bike | 2023-04-11T15:46:42 | 2023-04-11T15:50:03 | Michigan Ave & Lake St | TA1305000011 | Clark St & Randolph St | TA1305000030 | 41.88602200 | -87.62439800 | 41.88457623 | -87.63188991 | member |
8B441A6C436E9900 | classic_bike | 2023-04-29T21:20:21 | 2023-04-29T21:30:19 | Halsted St & 18th St | 13099 | Blue Island Ave & 18th St | 13135 | 41.85750568 | -87.64599144 | 41.85755600 | -87.66153500 | casual |
3980D64BE11540F1 | classic_bike | 2023-04-24T09:16:05 | 2023-04-24T09:22:27 | Halsted St & 18th St | 13099 | Blue Island Ave & 18th St | 13135 | 41.85750568 | -87.64599144 | 41.85755600 | -87.66153500 | casual |
3EF4B49FF7DAA02C | classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | Franklin St & Jackson Blvd | TA1305000025 | Clark St & Randolph St | TA1305000030 | 41.87770796 | -87.63532114 | 41.88457623 | -87.63188991 | casual |
210B2ED6583DC231 | classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | Michigan Ave & Lake St | TA1305000011 | Clark St & Randolph St | TA1305000030 | 41.88602200 | -87.62439800 | 41.88457623 | -87.63188991 | casual |
D29CB39B9E3FC46A | electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | Franklin St & Jackson Blvd | TA1305000025 | Clark St & Ida B Wells Dr | TA1305000009 | 41.87813417 | -87.63525450 | 41.87593267 | -87.63058454 | casual |
-> trips
(
tc/info compact-view)
data/chicago-bikes/202304_divvy_tripdata.csv.gz: descriptive-stats [13 12]:
:col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
---|---|---|---|---|---|---|---|---|---|---|---|
:ride-id | :string | 426590 | 0 | 89CC272248838ACC | 8FE8F7D9C10E88C7 | D29CB39B9E3FC46A | |||||
:rideable-type | :string | 426590 | 0 | electric_bike | electric_bike | electric_bike | |||||
:started-at | :local-date-time | 426590 | 0 | 2023-04-01T00:00:02 | 2023-04-16T05:15:35.589 | 2023-04-30T23:59:05 | 6.79308822E+08 | 0.08185427 | 2023-04-02T08:37:28 | 2023-04-18T08:00:32 | |
:ended-at | :local-date-time | 426590 | 0 | 2023-04-01T00:03:10 | 2023-04-16T05:32:48.230 | 2023-05-03T10:37:12 | 6.79294547E+08 | 0.08147835 | 2023-04-02T08:41:37 | 2023-04-18T08:02:35 | |
:start-station-name | :string | 362776 | 63814 | Franklin St & Jackson Blvd | |||||||
:start-station-id | :string | 362776 | 63814 | TA1305000025 | |||||||
:end-station-name | :string | 357960 | 68630 | Clark St & Ida B Wells Dr | |||||||
:end-station-id | :string | 357960 | 68630 | TA1305000009 | |||||||
:start-lat | :float64 | 426590 | 0 | 41.65 | 41.90 | 42.07 | 4.70980524E-02 | -0.35851400 | 41.80 | 41.88 | |
:start-lng | :float64 | 426590 | 0 | -87.83 | -87.65 | -87.52 | 2.76893978E-02 | -0.82254345 | -87.60 | -87.64 | |
:end-lat | :float64 | 426155 | 435 | 41.65 | 41.90 | 42.08 | 4.72412894E-02 | -0.37120814 | 41.79 | 41.88 | |
:end-lng | :float64 | 426155 | 435 | -88.11 | -87.65 | -87.53 | 2.78076951E-02 | -0.82990154 | -87.60 | -87.63 | |
:member-casual | :string | 426590 | 0 | member | member | casual |
It is a whole month of bike trips!
5.12 Getting the rows of a dataset
Datasets are organized by columns for efficiency, making use of the knowledge of homogenous types in columns.
Sometimes, however, it is useful to work with rows as well.
The tc/rows
function provides the rows of a dataset, either as vectors or as maps. Note, however, that it does not copy the data. Rather, it provides a rowwise view of the columnwise dataset.
take 2 (tc/rows trips)) (
"8FE8F7D9C10E88C7"
(["electric_bike"
time.LocalDateTime 0x15e0dac2 "2023-04-02T08:37:28"]
#object[java.time.LocalDateTime 0x427051b2 "2023-04-02T08:41:37"]
#object[java.nil
nil
nil
nil
41.8
87.6
-41.79
87.6
-"member"]
"34E4ED3ADF1D821B"
["electric_bike"
time.LocalDateTime 0x6e5bcef0 "2023-04-19T11:29:02"]
#object[java.time.LocalDateTime 0x1327563a "2023-04-19T11:52:12"]
#object[java.nil
nil
nil
nil
41.87
87.65
-41.93
87.68
-"member"])
take 2 (tc/rows trips :as-maps)) (
:ride-id "8FE8F7D9C10E88C7",
({:rideable-type "electric_bike",
:started-at
time.LocalDateTime 0x15e0dac2 "2023-04-02T08:37:28"],
#object[java.:ended-at
time.LocalDateTime 0x427051b2 "2023-04-02T08:41:37"],
#object[java.:start-station-name nil,
:start-station-id nil,
:end-station-name nil,
:end-station-id nil,
:start-lat 41.8,
:start-lng -87.6,
:end-lat 41.79,
:end-lng -87.6,
:member-casual "member"}
:ride-id "34E4ED3ADF1D821B",
{:rideable-type "electric_bike",
:started-at
time.LocalDateTime 0x6e5bcef0 "2023-04-19T11:29:02"],
#object[java.:ended-at
time.LocalDateTime 0x1327563a "2023-04-19T11:52:12"],
#object[java.:start-station-name nil,
:start-station-id nil,
:end-station-name nil,
:end-station-id nil,
:start-lat 41.87,
:start-lng -87.65,
:end-lat 41.93,
:end-lng -87.68,
:member-casual "member"})
As you may know, Clojure shines in processing plain data, structured or unstructured, such as vectors and maps of any content. We do not lose any of that when using datasets, as we can still view them as rows which are just maps or vectors.
5.13 Querying datasets
Tablecloth offers various ways to view a subset of a dataset. Typically, they do not copy the data but provide views of the same space in memory.
The first few trips:
-> trips
(
tc/head compact-view)
data/chicago-bikes/202304_divvy_tripdata.csv.gz [5 13]:
:ride-id | :rideable-type | :started-at | :ended-at | :start-station-name | :start-station-id | :end-station-name | :end-station-id | :start-lat | :start-lng | :end-lat | :end-lng | :member-casual |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8FE8F7D9C10E88C7 | electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | 41.80 | -87.60 | 41.79 | -87.60 | member | ||||
34E4ED3ADF1D821B | electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | 41.87 | -87.65 | 41.93 | -87.68 | member | ||||
5296BF07A2F77CB5 | electric_bike | 2023-04-19T08:41:22 | 2023-04-19T08:43:22 | 41.93 | -87.66 | 41.93 | -87.66 | member | ||||
40759916B76D5D52 | electric_bike | 2023-04-19T13:31:30 | 2023-04-19T13:35:09 | 41.92 | -87.65 | 41.91 | -87.65 | member | ||||
77A96F460101AC63 | electric_bike | 2023-04-19T12:05:36 | 2023-04-19T12:10:26 | 41.91 | -87.65 | 41.91 | -87.63 | member |
Just a few columns:
-> trips
(:rideable-type :started-at :ended-at])
(tc/select-columns [5)) (print/print-range
data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 3]:
:rideable-type | :started-at | :ended-at |
---|---|---|
electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 |
electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 |
β¦ | β¦ | β¦ |
classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 |
classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 |
electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 |
Only rows about classical bikes, and just a few columns:
-> trips
(fn [row]
(tc/select-rows (-> row :rideable-type (= "classic_bike"))))
(
tc/head:rideable-type :started-at :ended-at])
(tc/select-columns [5)) (print/print-range
data/chicago-bikes/202304_divvy_tripdata.csv.gz [5 3]:
:rideable-type | :started-at | :ended-at |
---|---|---|
classic_bike | 2023-04-10T17:34:35 | 2023-04-10T18:02:36 |
classic_bike | 2023-04-12T12:29:46 | 2023-04-12T12:54 |
classic_bike | 2023-04-29T20:57:10 | 2023-04-29T20:57:13 |
classic_bike | 2023-04-20T17:03:11 | 2023-04-20T17:24:58 |
classic_bike | 2023-04-07T21:14:14 | 2023-04-07T21:15:14 |
5.14 Adding columns
Here we will demonstrate some of the ways to extend a dataset with new columns.
Let us compute how the bike trips are. For clarity, let us focus on a dataset with just a few of the columns:
-> trips
(:rideable-type :started-at :ended-at])
(tc/select-columns [5)) (print/print-range
data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 3]:
:rideable-type | :started-at | :ended-at |
---|---|---|
electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 |
electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 |
β¦ | β¦ | β¦ |
classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 |
classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 |
electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 |
The tc/map-columns
function is useful when one needs to apply a function to the values in one or more of the existings columns, for every row.
-> trips
(:rideable-type :started-at :ended-at])
(tc/select-columns [5)
(print/print-range :duration
(tc/map-columns :started-at :ended-at]
[ java-time/duration))
data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 4]:
:rideable-type | :started-at | :ended-at | :duration |
---|---|---|---|
electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S |
electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S |
β¦ | β¦ | β¦ | β¦ |
classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S |
classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S |
electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S |
-> trips
(:rideable-type :started-at :ended-at])
(tc/select-columns [5)
(print/print-range :duration
(tc/map-columns :started-at :ended-at]
[
java-time/duration):duration-in-seconds
(tc/map-columns :duration]
[% :seconds))) #(java-time/as
data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 5]:
:rideable-type | :started-at | :ended-at | :duration | :duration-in-seconds |
---|---|---|---|---|
electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S | 249 |
electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S | 1390 |
β¦ | β¦ | β¦ | β¦ | β¦ |
classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S | 325 |
classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S | 302 |
electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S | 123 |
The tc/add-columns
function is useful when one needs to apply a function to a whole dataset and return a whole column at once. This combines nicely with the column API (tcc
).
-> trips
(:rideable-type :started-at :ended-at])
(tc/select-columns [5)
(print/print-range :duration
(tc/map-columns :started-at :ended-at]
[
java-time/duration):duration-in-seconds
(tc/map-columns :duration]
[% :seconds))
#(java-time/as :duration-in-minutes
(tc/add-column fn [ds]
(-> ds
(:duration-in-seconds
1/60))))) (tcc/*
data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 6]:
:rideable-type | :started-at | :ended-at | :duration | :duration-in-seconds | :duration-in-minutes |
---|---|---|---|---|---|
electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S | 249 | 4.150 |
electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S | 1390 | 23.17 |
β¦ | β¦ | β¦ | β¦ | β¦ | β¦ |
classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S | 325 | 5.417 |
classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S | 302 | 5.033 |
electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S | 123 | 2.050 |
Let us also add a column of the hour where each trip started β an integer between 0 to 23.
To do that, we will first add the hour as a column. Earlier, we did some time processing using the java-time
API. Now, we will demonstrate a different way, using the datetime
namespace of dtype-next. This namespace offers some handy functions that act on whole columns.
Let us keep this dataset in a var.
def preprocessed-trips
(-> trips
(:rideable-type :started-at :ended-at])
(tc/select-columns [5)
(print/print-range :duration
(tc/map-columns :started-at :ended-at]
[
java-time/duration):duration-in-seconds
(tc/map-columns :duration]
[% :seconds))
#(java-time/as :duration-in-minutes
(tc/add-column fn [ds]
(-> ds
(:duration-in-seconds
1/60))))
(tcc/* :hour
(tc/add-column fn [ds]
(
(datetime/long-temporal-field:hours
:started-at ds)))))) (
preprocessed-trips
data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 7]:
:rideable-type | :started-at | :ended-at | :duration | :duration-in-seconds | :duration-in-minutes | :hour |
---|---|---|---|---|---|---|
electric_bike | 2023-04-02T08:37:28 | 2023-04-02T08:41:37 | PT4M9S | 249 | 4.150 | 8 |
electric_bike | 2023-04-19T11:29:02 | 2023-04-19T11:52:12 | PT23M10S | 1390 | 23.17 | 11 |
β¦ | β¦ | β¦ | β¦ | β¦ | β¦ | β¦ |
classic_bike | 2023-04-18T07:53:51 | 2023-04-18T07:59:16 | PT5M25S | 325 | 5.417 | 7 |
classic_bike | 2023-04-29T07:33:55 | 2023-04-29T07:38:57 | PT5M2S | 302 | 5.033 | 7 |
electric_bike | 2023-04-18T08:00:32 | 2023-04-18T08:02:35 | PT2M3S | 123 | 2.050 | 8 |
5.15 Grouping and summarizing
How does bike usage change througout the day?
Let us see how the number of trips and their median length change by the hour.
Let us group the trips by the hour:
-> preprocessed-trips
(:hour])
(tc/group-by [5)) (print/print-range
_unnamed [24 3]:
:name | :group-id | :data |
---|---|---|
{:hour 8} | 0 | Group: {:hour 8} [22170 7]: |
{:hour 11} | 1 | Group: {:hour 11} [20593 7]: |
β¦ | β¦ | β¦ |
{:hour 23} | 21 | Group: {:hour 23} [7327 7]: |
{:hour 6} | 22 | Group: {:hour 6} [8863 7]: |
{:hour 5} | 23 | Group: {:hour 5} [2802 7]: |
The resulting dataset is a dataset of a special kind, a grouped dataset. Its :data
column contains whole datasets, which are the groups. In our case, these are the groups of bike trips starting in a given hour, for every hour throughout the day.
Now, we can aggregate over the groups to recieve a summary. The resulting summary dataset will no longer be a grouped dataset. We will order the summary by the hour.
-> preprocessed-trips
(:hour])
(tc/group-by [:n-trips tc/row-count
(tc/aggregate {:median-duration (fn [ds]
:duration-in-seconds ds)))})
(tcc/median (:hour])
(tc/order-by [:all)) (print/print-range
_unnamed [24 3]:
:hour | :n-trips | :median-duration |
---|---|---|
0 | 5111 | 487.0 |
1 | 3235 | 478.0 |
2 | 1874 | 461.0 |
3 | 1165 | 491.0 |
4 | 950 | 452.0 |
5 | 2802 | 394.0 |
6 | 8863 | 434.0 |
7 | 17065 | 473.0 |
8 | 22170 | 472.0 |
9 | 17316 | 466.0 |
10 | 17489 | 510.0 |
11 | 20593 | 519.0 |
12 | 24050 | 536.0 |
13 | 24837 | 560.0 |
14 | 25900 | 585.0 |
15 | 30730 | 574.0 |
16 | 40374 | 590.0 |
17 | 46889 | 597.0 |
18 | 38281 | 578.0 |
19 | 27107 | 537.0 |
20 | 17252 | 505.0 |
21 | 14388 | 502.0 |
22 | 10822 | 508.0 |
23 | 7327 | 485.0 |
We can see a peak of usage between 17:00 to 18:00 and a possibly slight tendendcy for longer trips (in time) around the afternoon hours.