5  Table processing with Tablecloth

author: Cvetomir Dimov and Daniel Slutsky

last change: 2025-02-02

Tablecloth is a table processing library inspired by the dataframe ergonomics typicall to the R ecosystem, specifically the Tidyverse, but offers certain advantages on top of that.

It is built on top of the data structures and functions of tech.ml.dataset, a high-performance table processing library (often called TMD), which is built on top of dtype-next, an array-programming library.

In this tutorial, we will see a few of the core ideas of Tablecloth. You are encouraged to look into the main documentation for more information.

5.2 About this tutorial

In this tutorial, we will demonstrate the ergonomics of so-called dataset data strucures provided by Tablecloth. Datasets are table-like data structures, often called data-frames in other data science platforms.

We will assume basic familiarity with Clojure.

A lot of what we demonstrate here can also be implemented with the usual Clojure data strucures such as vectors and maps. However, datasets offer not only performance advantages in space and time, but also certain usability features, which are arguably expressive and powerful.

We will often use treading macros, mostly ->. This approach is compatible with data science cultures such as the Tidyverse in R.

5.3 Setup

We create a namespace and require the following namespaces:

(ns noj-book.tablecloth-table-processing
  (:require [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tech.v3.dataset.print :as print]
            [clojure.string :as str]
            [clojure.java.io :as io]
            [scicloj.kindly.v4.kind :as kind]
            [java-time.api :as java-time]
            [tech.v3.datatype.datetime :as datetime]))

5.4 Creating a dataset

Assume we have a vector of vectors representing bike trips. Each trip has the type of the bike and the coordinates (latitude and longitude) of the trip’s start and end. We can turn this data structure into a dataset:

(tc/dataset [["classic_bike" 41.906866 -87.626217 41.92393131136619 -87.63582453131676]
             ["electric_bike" 41.869312286 -87.673897266 41.8895 -87.688257]
             ["classic_bike" 41.95600355078549 -87.68016144633293 41.886875 -87.62603]]
            {:column-names [:rideable-type
                            :start-lat :start-lng
                            :end-lat :end-lng]})

:_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

Sometimes, data may arrive in different shapes. We can also create such a dataset from a vector of maps:

(tc/dataset [{:rideable-type "classic_bike"
              :start-lat     41.906866
              :start-lng     -87.626217
              :end-lat       41.92393131136619
              :end-lng       -87.63582453131676}
             {:rideable-type "electric_bike"
              :start-lat     41.869312286
              :start-lng     -87.673897266
              :end-lat       41.8895
              :end-lng       -87.688257 }
             {:rideable-type "classic_bike"
              :start-lat     41.95600355078549
              :start-lng     -87.68016144633293
              :end-lat       41.886875
              :end-lng       -87.62603}])

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

.. and also, from a map of vectors:

(tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
             :start-lat     [41.906866 41.869312286 41.95600355078549]
             :start-lng     [-87.626217 -87.673897266 -87.68016144633293]
             :end-lat       [41.92393131136619 41.8895 41.886875]
             :end-lng       [-87.63582453131676 -87.688257 -87.62603]})

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

Let us hold it in a var to explore it further:

(def some-trips
  (tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
               :start-lat     [41.906866 41.869312286 41.95600355078549]
               :start-lng     [-87.626217 -87.673897266 -87.68016144633293]
               :end-lat       [41.92393131136619 41.8895 41.886875]
               :end-lng       [-87.63582453131676 -87.688257 -87.62603]}))

5.5 Displaying a dataset

In an environment compatible with the Kindly standard, the default way a dataset is displayed is by printing it.

some-trips

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

If necessary, we may customize the printing using the tech.v3.dataset.print namespace of the tech.ml.dataset library.

For example:

(-> {:x (range 99)
     :y (repeatedly 99 rand)}
    tc/dataset
    ;; show at most 9 values
    (print/print-range 9))

_unnamed [99 2]:

:x :y
0 0.10963269
1 0.46255653
2 0.84352554
3 0.20056546
94 0.90482520
95 0.07842168
96 0.47287917
97 0.56064950
98 0.58327654

We may also explicitly turn it into an HTML table:

(kind/table some-trips)
rideable-type start-lat start-lng end-lat end-lng
classic_bike 41.906866 -87.626217 41.92393131136619 -87.63582453131676
electric_bike 41.869312286 -87.673897266 41.8895 -87.688257
classic_bike 41.95600355078549 -87.68016144633293 41.886875 -87.62603

This does not matter much for now, but it can be handy when certain inner values should be visualized in a certain way.

It is possible to use datatables to reneder kind/table and specify datatables options (see the full list).

(kind/table some-trips
            {:use-datatables true
             :datatables     {:scrollY 100}})
rideable-typestart-latstart-lngend-latend-lng
classic_bike41.906866-87.62621741.92393131136619-87.63582453131676
electric_bike41.869312286-87.67389726641.8895-87.688257
classic_bike41.95600355078549-87.6801614463329341.886875-87.62603

We may also nest the usual printed table inside other visualization kinds, such as Hiccup.

(kind/hiccup
 [:div {:style {:width "60%"
                :max-height "400px"
                :overflow-x :auto
                :overflow-y :auto
                :background "floralwhite"}}
  some-trips])

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

5.6 What is a dataset?

Let us explore this data structure, our little dataset of bike trips.

A dataset is a value of the Dataset datatype defined in the tech.ml.dataset library:

(type some-trips)
tech.v3.dataset.impl.dataset.Dataset

One thing worth knowing about this datatype is that it is extended by quite a few interfaces and protocols.

For example, it behaves as a map.

(map? some-trips)
true

The keys are the column names:

(keys some-trips)
(:rideable-type :start-lat :start-lng :end-lat :end-lng)
(tc/column-names some-trips)
(:rideable-type :start-lat :start-lng :end-lat :end-lng)

.. and the values are the columns:

(:start-lat some-trips)
#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]

Now we need to discuss what columns are.

5.7 What is a column?

(:start-lat some-trips)
#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]

A column is a value of Column datatype defined in the tech.ml.dataset library:

(-> some-trips
    :start-lat
    type)
tech.v3.dataset.impl.column.Column

This datatype is also extended by quite a few interfaces and protocols.

For example, it is sequential.

(-> some-trips
    :start-lat
    sequential?)
true

So, we can use the sequence abstraction with it:

(->> (:start-lat some-trips)
     (take 2))
(41.906866 41.869312286)
(->> (:rideable-type some-trips)
     (filter #{"classic_bike"}))
("classic_bike" "classic_bike")

It is also assiciative:

(-> some-trips
    :rideable-type
    (assoc 2 "my strange and unique bike"))
["classic_bike" "electric_bike" "my strange and unique bike"]

5.8 Working with Columns

We may use Tablecloth’s Column API to create and process columns. For example:

(tcc/column ["classic_bike" "electrical_bike" "classic_bike"])
#tech.v3.dataset.column<string>[3]
null
[classic_bike, electrical_bike, classic_bike]

What is the average latitude where trips tend to start?

(-> some-trips
    :start-lat
    tcc/mean)
41.9107272789285

What is the type of elements in this Column?

(-> some-trips
    :start-lat
    tcc/typeof)
:float64

Let us look into our latitudes in radians rather than degrees:

(-> some-trips
    :start-lat
    (tcc/* (/ Math/PI 180)))
#tech.v3.dataset.column<float64>[3]
null
[0.7314, 0.7308, 0.7323]

You see, columns are typed, and this has implications for both for time and space performance as well as ergonomics.

5.9 The data in columns

You will probably not need this detail most times, but it is worth knowing that the data actually held by the Column can be accessed as the .data field, and it can be of varying data structures (see list of all datatypes).

For example:

(-> some-trips
    :start-lat
    .data
    type)
tech.v3.datatype.array_buffer.ArrayBuffer
(-> some-trips
    :rideable-type
    .data
    type)
tech.v3.dataset.string_table.StringTable
(-> (range 9)
    tcc/column
    .data
    type)
clojure.lang.LongRange

Behind the scenes, tech.ml.dataset makes sure to use efficient data structures for columns, so that, e.g., random access by index will be efficient:

(-> some-trips
    :start-lat
    (nth 2))
41.95600355078549

The following is quick too!

(-> (range 1000000)
    (tcc/* 1000)
    (nth 10000))
10000000

Here we rely on the “lazy and noncaching” semantics of the undelying dtype-next library, which is a topic worth its own tutorial.

5.10 Operations on columns

In the examples with columns above, we used tcc/* to multiply the values of a column by a scalar. The result was another column. This operation can be applied to multiple two (or more) columns as well, or a mixture of columns and scalars.

(tcc/* (:start-lat some-trips)
       (:end-lng some-trips)
       3
       4)
#tech.v3.dataset.column<float64>[3]
null
[-4.407E+04, -4.406E+04, -4.412E+04]

A long list of other operations have been implemented, among which arithmetic operations, trigonometric functions, and various predicates. Note that the tcc/min and tcc/max functions output a column as well.

(tcc/min (:start-lat some-trips)
         (:end-lat some-trips))
#tech.v3.dataset.column<float64>[3]
null
[41.91, 41.87, 41.89]

They also work with scalar inputs.

(tcc/min (:start-lat some-trips)
         100)
#tech.v3.dataset.column<float64>[3]
null
[41.91, 41.87, 41.96]
(tcc/min (:start-lat some-trips)
         -100)
#tech.v3.dataset.column<float64>[3]
null
[-100.0, -100.0, -100.0]

Other operations produce a scalar as an output. These include various measures of central tendency.

(-> some-trips
    :start-lat
    tcc/mean)
41.9107272789285
(-> some-trips
    :end-lat
    tcc/median)
41.8895

Important operations in this category are functions that reduce over the columns, such as reduce-*, reduce-+, reduce-min, and reduce-max. For example, the following can be used to find the maximum value in a column:

(-> some-trips
    :start-lng
    tcc/reduce-max)
-87.626217

5.11 Summarizing datasets

We can use the tc/info function to summarize a dataset:

(tc/info some-trips)

_unnamed: descriptive-stats [5 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:rideable-type :string 3 0 classic_bike classic_bike classic_bike
:start-lat :float64 3 0 41.86931229 41.91072728 41.95600355 0.04347443 0.39652474 41.91 41.96
:start-lng :float64 3 0 -87.68016145 -87.66009190 -87.62621700 0.02950325 1.64462246 -87.63 -87.68
:end-lat :float64 3 0 41.88687500 41.90010210 41.92393131 0.02067839 1.70070295 41.92 41.89
:end-lng :float64 3 0 -87.68825700 -87.65003718 -87.62603000 0.03345967 -1.56657916 -87.64 -87.63

5.12 Reading datasets

Datasets are often read from files.

Let us read a file from the Divvy Data, where the history of bike sharing trips in Chicago is shared publicly. Chicago bike trips dataset.

Let us first download the data of one month.

(def bike-trips-filename
  "202304-divvy-tripdata.zip")
(if (.exists (io/as-file bike-trips-filename))
  [:already-downloaded bike-trips-filename]
  (with-open [in (io/input-stream "https://divvy-tripdata.s3.amazonaws.com/202304-divvy-tripdata.zip")
              out (io/output-stream bike-trips-filename)]
    (io/copy in out)
    [:just-downloaded bike-trips-filename]))
[:just-downloaded "202304-divvy-tripdata.zip"]

In this case, it is a CSV file held inside a compressed ZIP file, but other formats are supported as well.

First, let us read just a few rows:

(-> bike-trips-filename 
    tc/dataset
    time)

202304-divvy-tripdata.zip [426590 13]:

ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37 41.80000000 -87.60000000 41.79000000 -87.60000000 member
34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12 41.87000000 -87.65000000 41.93000000 -87.68000000 member
5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22 41.93000000 -87.66000000 41.93000000 -87.66000000 member
40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09 41.92000000 -87.65000000 41.91000000 -87.65000000 member
77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26 41.91000000 -87.65000000 41.91000000 -87.63000000 member
8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38 41.91000000 -87.63000000 41.92000000 -87.65000000 member
C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00 41.93000000 -87.66000000 41.91000000 -87.65000000 member
6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41 42.00000000 -87.66000000 41.99000000 -87.66000000 member
A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23 41.99000000 -87.66000000 42.00000000 -87.66000000 member
81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11 41.88000000 -87.65000000 41.88000000 -87.65000000 member
BAE25DBA0CD0DF40 electric_bike 2023-04-14 07:12:34 2023-04-14 07:42:07 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93186915 -87.70120764 41.87593267 -87.63058454 member
5A98F86A573AAB2C electric_bike 2023-04-24 07:27:02 2023-04-24 07:58:22 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93198085 -87.70128024 41.87593267 -87.63058454 member
92B02F2FBDC1FB9F classic_bike 2023-04-12 08:16:48 2023-04-12 08:40:08 Michigan Ave & Lake St TA1305000011 May St & Taylor St 13160 41.88602200 -87.62439800 41.86948210 -87.65548640 member
119A6C53607EAA4A electric_bike 2023-04-28 07:24:54 2023-04-28 07:53:44 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93190396 -87.70125461 41.87593267 -87.63058454 member
9BCF1E8BA027EAFA electric_bike 2023-04-21 07:15:06 2023-04-21 07:41:45 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93193305 -87.70126295 41.87593267 -87.63058454 member
A17D800CE963661A classic_bike 2023-04-11 15:46:42 2023-04-11 15:50:03 Michigan Ave & Lake St TA1305000011 Clark St & Randolph St TA1305000030 41.88602200 -87.62439800 41.88457623 -87.63188991 member
8B441A6C436E9900 classic_bike 2023-04-29 21:20:21 2023-04-29 21:30:19 Halsted St & 18th St 13099 Blue Island Ave & 18th St 13135 41.85750568 -87.64599144 41.85755600 -87.66153500 casual
3980D64BE11540F1 classic_bike 2023-04-24 09:16:05 2023-04-24 09:22:27 Halsted St & 18th St 13099 Blue Island Ave & 18th St 13135 41.85750568 -87.64599144 41.85755600 -87.66153500 casual
3EF4B49FF7DAA02C classic_bike 2023-04-18 07:53:51 2023-04-18 07:59:16 Franklin St & Jackson Blvd TA1305000025 Clark St & Randolph St TA1305000030 41.87770796 -87.63532114 41.88457623 -87.63188991 casual
210B2ED6583DC231 classic_bike 2023-04-29 07:33:55 2023-04-29 07:38:57 Michigan Ave & Lake St TA1305000011 Clark St & Randolph St TA1305000030 41.88602200 -87.62439800 41.88457623 -87.63188991 casual
D29CB39B9E3FC46A electric_bike 2023-04-18 08:00:32 2023-04-18 08:02:35 Franklin St & Jackson Blvd TA1305000025 Clark St & Ida B Wells Dr TA1305000009 41.87813417 -87.63525450 41.87593267 -87.63058454 casual
(->  bike-trips-filename
     (tc/dataset {:num-rows 3}))

202304-divvy-tripdata.zip [3 13]:

ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37 41.80 -87.60 41.79 -87.60 member
34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12 41.87 -87.65 41.93 -87.68 member
5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22 41.93 -87.66 41.93 -87.66 member

So reading a dataset is easy, but sometimes we may wish to pass a few options to handle it a bit better.

For example, you see that by default, the column names are strings:

(->  bike-trips-filename
     (tc/dataset {:num-rows 9})
     tc/column-names)
("ride_id"
 "rideable_type"
 "started_at"
 "ended_at"
 "start_station_name"
 "start_station_id"
 "end_station_name"
 "end_station_id"
 "start_lat"
 "start_lng"
 "end_lat"
 "end_lng"
 "member_casual")

We can apply the keyword function to all of them, to conveniently have keywords instead.

(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn keyword})
     tc/column-names)
(:ride_id
 :rideable_type
 :started_at
 :ended_at
 :start_station_name
 :start_station_id
 :end_station_name
 :end_station_id
 :start_lat
 :start_lng
 :end_lat
 :end_lng
 :member_casual)

Even better, we may process the names to replace underscores with dashes.

(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))})
     tc/column-names)
(:ride-id
 :rideable-type
 :started-at
 :ended-at
 :start-station-name
 :start-station-id
 :end-station-name
 :end-station-id
 :start-lat
 :start-lng
 :end-lat
 :end-lng
 :member-casual)

Also, the date-time columns are parsed as strings.

(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))})
     :started-at
     tcc/typeof)
:string

Let us specify our own parsing for these columns.

(def datetime-parser [:local-date-time
                      "yyyy-MM-dd HH:mm:ss"])
(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))
                  ;; Note we use the original column names
                  ;; when defining the parser:
                  :parser-fn {"started_at" datetime-parser
                              "ended_at" datetime-parser}})
     :started-at
     tcc/typeof)
:local-date-time

Let us now read the whole dataset and hold it in a var for further exploration. We use defonce so that next time we evaluate this expression, nothing will happen. This practice is handy when reading big files.

(defonce trips
  (->  bike-trips-filename
       (tc/dataset {:key-fn    (fn [s]
                                 (-> s
                                     (str/replace #"_" "-")
                                     keyword))
                    :parser-fn {"started_at" datetime-parser
                                "ended_at"   datetime-parser}})))
trips

202304-divvy-tripdata.zip [426590 13]:

:ride-id :rideable-type :started-at :ended-at :start-station-name :start-station-id :end-station-name :end-station-id :start-lat :start-lng :end-lat :end-lng :member-casual
8FE8F7D9C10E88C7 electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 41.80000000 -87.60000000 41.79000000 -87.60000000 member
34E4ED3ADF1D821B electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 41.87000000 -87.65000000 41.93000000 -87.68000000 member
5296BF07A2F77CB5 electric_bike 2023-04-19T08:41:22 2023-04-19T08:43:22 41.93000000 -87.66000000 41.93000000 -87.66000000 member
40759916B76D5D52 electric_bike 2023-04-19T13:31:30 2023-04-19T13:35:09 41.92000000 -87.65000000 41.91000000 -87.65000000 member
77A96F460101AC63 electric_bike 2023-04-19T12:05:36 2023-04-19T12:10:26 41.91000000 -87.65000000 41.91000000 -87.63000000 member
8D6A2328E19DC168 electric_bike 2023-04-19T12:17:34 2023-04-19T12:21:38 41.91000000 -87.63000000 41.92000000 -87.65000000 member
C97BBA66E07889F9 electric_bike 2023-04-19T09:35:48 2023-04-19T09:45 41.93000000 -87.66000000 41.91000000 -87.65000000 member
6687AD4C575FF734 electric_bike 2023-04-11T16:13:43 2023-04-11T16:18:41 42.00000000 -87.66000000 41.99000000 -87.66000000 member
A8FA4F73B22BC11F electric_bike 2023-04-11T16:29:24 2023-04-11T16:40:23 41.99000000 -87.66000000 42.00000000 -87.66000000 member
81E158FE63D99994 electric_bike 2023-04-19T17:35:40 2023-04-19T17:36:11 41.88000000 -87.65000000 41.88000000 -87.65000000 member
BAE25DBA0CD0DF40 electric_bike 2023-04-14T07:12:34 2023-04-14T07:42:07 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93186915 -87.70120764 41.87593267 -87.63058454 member
5A98F86A573AAB2C electric_bike 2023-04-24T07:27:02 2023-04-24T07:58:22 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93198085 -87.70128024 41.87593267 -87.63058454 member
92B02F2FBDC1FB9F classic_bike 2023-04-12T08:16:48 2023-04-12T08:40:08 Michigan Ave & Lake St TA1305000011 May St & Taylor St 13160 41.88602200 -87.62439800 41.86948210 -87.65548640 member
119A6C53607EAA4A electric_bike 2023-04-28T07:24:54 2023-04-28T07:53:44 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93190396 -87.70125461 41.87593267 -87.63058454 member
9BCF1E8BA027EAFA electric_bike 2023-04-21T07:15:06 2023-04-21T07:41:45 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93193305 -87.70126295 41.87593267 -87.63058454 member
A17D800CE963661A classic_bike 2023-04-11T15:46:42 2023-04-11T15:50:03 Michigan Ave & Lake St TA1305000011 Clark St & Randolph St TA1305000030 41.88602200 -87.62439800 41.88457623 -87.63188991 member
8B441A6C436E9900 classic_bike 2023-04-29T21:20:21 2023-04-29T21:30:19 Halsted St & 18th St 13099 Blue Island Ave & 18th St 13135 41.85750568 -87.64599144 41.85755600 -87.66153500 casual
3980D64BE11540F1 classic_bike 2023-04-24T09:16:05 2023-04-24T09:22:27 Halsted St & 18th St 13099 Blue Island Ave & 18th St 13135 41.85750568 -87.64599144 41.85755600 -87.66153500 casual
3EF4B49FF7DAA02C classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 Franklin St & Jackson Blvd TA1305000025 Clark St & Randolph St TA1305000030 41.87770796 -87.63532114 41.88457623 -87.63188991 casual
210B2ED6583DC231 classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 Michigan Ave & Lake St TA1305000011 Clark St & Randolph St TA1305000030 41.88602200 -87.62439800 41.88457623 -87.63188991 casual
D29CB39B9E3FC46A electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 Franklin St & Jackson Blvd TA1305000025 Clark St & Ida B Wells Dr TA1305000009 41.87813417 -87.63525450 41.87593267 -87.63058454 casual
(tc/info trips)

202304-divvy-tripdata.zip: descriptive-stats [13 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:ride-id :string 426590 0 89CC272248838ACC 8FE8F7D9C10E88C7 D29CB39B9E3FC46A
:rideable-type :string 426590 0 electric_bike electric_bike electric_bike
:started-at :local-date-time 426590 0 2023-04-01T00:00:02 2023-04-16T05:15:35.589 2023-04-30T23:59:05 6.79308822E+08 0.08185427 2023-04-02T08:37:28 2023-04-18T08:00:32
:ended-at :local-date-time 426590 0 2023-04-01T00:03:10 2023-04-16T05:32:48.230 2023-05-03T10:37:12 6.79294547E+08 0.08147835 2023-04-02T08:41:37 2023-04-18T08:02:35
:start-station-name :string 362776 63814 Franklin St & Jackson Blvd
:start-station-id :string 362776 63814 TA1305000025
:end-station-name :string 357960 68630 Clark St & Ida B Wells Dr
:end-station-id :string 357960 68630 TA1305000009
:start-lat :float64 426590 0 41.65 41.90 42.07 4.70980524E-02 -0.35851400 41.80 41.88
:start-lng :float64 426590 0 -87.83 -87.65 -87.52 2.76893978E-02 -0.82254345 -87.60 -87.64
:end-lat :float64 426155 435 41.65 41.90 42.08 4.72412894E-02 -0.37120814 41.79 41.88
:end-lng :float64 426155 435 -88.11 -87.65 -87.53 2.78076951E-02 -0.82990154 -87.60 -87.63
:member-casual :string 426590 0 member member casual

It is a whole month of bike trips!

5.13 Getting the rows of a dataset

Datasets are organized by columns for efficiency, making use of the knowledge of homogenous types in columns.

Sometimes, however, it is useful to work with rows as well.

The tc/rows function provides the rows of a dataset, either as vectors or as maps. Note, however, that it does not copy the data. Rather, it provides a rowwise view of the columnwise dataset.

(take 2 (tc/rows trips))
(["8FE8F7D9C10E88C7"
  "electric_bike"
  #object[java.time.LocalDateTime 0x165dae02 "2023-04-02T08:37:28"]
  #object[java.time.LocalDateTime 0x1d41d126 "2023-04-02T08:41:37"]
  nil
  nil
  nil
  nil
  41.8
  -87.6
  41.79
  -87.6
  "member"]
 ["34E4ED3ADF1D821B"
  "electric_bike"
  #object[java.time.LocalDateTime 0xd7638ac "2023-04-19T11:29:02"]
  #object[java.time.LocalDateTime 0xa812fb2 "2023-04-19T11:52:12"]
  nil
  nil
  nil
  nil
  41.87
  -87.65
  41.93
  -87.68
  "member"])
(take 2 (tc/rows trips :as-maps))
({:ride-id "8FE8F7D9C10E88C7",
  :rideable-type "electric_bike",
  :started-at
  #object[java.time.LocalDateTime 0x165dae02 "2023-04-02T08:37:28"],
  :ended-at
  #object[java.time.LocalDateTime 0x1d41d126 "2023-04-02T08:41:37"],
  :start-station-name nil,
  :start-station-id nil,
  :end-station-name nil,
  :end-station-id nil,
  :start-lat 41.8,
  :start-lng -87.6,
  :end-lat 41.79,
  :end-lng -87.6,
  :member-casual "member"}
 {:ride-id "34E4ED3ADF1D821B",
  :rideable-type "electric_bike",
  :started-at
  #object[java.time.LocalDateTime 0xd7638ac "2023-04-19T11:29:02"],
  :ended-at
  #object[java.time.LocalDateTime 0xa812fb2 "2023-04-19T11:52:12"],
  :start-station-name nil,
  :start-station-id nil,
  :end-station-name nil,
  :end-station-id nil,
  :start-lat 41.87,
  :start-lng -87.65,
  :end-lat 41.93,
  :end-lng -87.68,
  :member-casual "member"})

As you may know, Clojure shines in processing plain data, structured or unstructured, such as vectors and maps of any content. We do not lose any of that when using datasets, as we can still view them as rows which are just maps or vectors.

5.14 Querying datasets

Tablecloth offers various ways to view a subset of a dataset. Typically, they do not copy the data but provide views of the same space in memory.

The first few trips:

(tc/head trips)

202304-divvy-tripdata.zip [5 13]:

:ride-id :rideable-type :started-at :ended-at :start-station-name :start-station-id :end-station-name :end-station-id :start-lat :start-lng :end-lat :end-lng :member-casual
8FE8F7D9C10E88C7 electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 41.80 -87.60 41.79 -87.60 member
34E4ED3ADF1D821B electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 41.87 -87.65 41.93 -87.68 member
5296BF07A2F77CB5 electric_bike 2023-04-19T08:41:22 2023-04-19T08:43:22 41.93 -87.66 41.93 -87.66 member
40759916B76D5D52 electric_bike 2023-04-19T13:31:30 2023-04-19T13:35:09 41.92 -87.65 41.91 -87.65 member
77A96F460101AC63 electric_bike 2023-04-19T12:05:36 2023-04-19T12:10:26 41.91 -87.65 41.91 -87.63 member

Just a few columns:

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5))

202304-divvy-tripdata.zip [426590 3]:

:rideable-type :started-at :ended-at
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35

Only rows about classical bikes, and just a few columns:

(-> trips
    (tc/select-rows (fn [row]
                      (-> row :rideable-type (= "classic_bike"))))
    tc/head
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5))

202304-divvy-tripdata.zip [5 3]:

:rideable-type :started-at :ended-at
classic_bike 2023-04-10T17:34:35 2023-04-10T18:02:36
classic_bike 2023-04-12T12:29:46 2023-04-12T12:54
classic_bike 2023-04-29T20:57:10 2023-04-29T20:57:13
classic_bike 2023-04-20T17:03:11 2023-04-20T17:24:58
classic_bike 2023-04-07T21:14:14 2023-04-07T21:15:14

5.15 Adding columns

Here we will demonstrate some of the ways to extend a dataset with new columns. For clarity, let us focus on a dataset with just a few of the columns:

(-> trips
    (tc/select-columns [:start-lat :end-lat])
    (print/print-range 5))

202304-divvy-tripdata.zip [426590 2]:

:start-lat :end-lat
41.80000000 41.79000000
41.87000000 41.93000000
41.87770796 41.88457623
41.88602200 41.88457623
41.87813417 41.87593267

One can create new columns by applying the dataset equivalents to the column operations discussed above. These operations have the same names, but now take a dataset and the name of the new column as additional inputs. For example, the min function, when taken from the tablecloth.api, can be used as follows:

(-> trips
    (tc/select-columns [:start-lat :end-lat])
    (tc/min :min-lat [:start-lat :end-lat])
    (print/print-range 5))

202304-divvy-tripdata.zip [426590 3]:

:start-lat :end-lat :min-lat
41.80000000 41.79000000 41.79000000
41.87000000 41.93000000 41.87000000
41.87770796 41.88457623 41.87770796
41.88602200 41.88457623 41.88457623
41.87813417 41.87593267 41.87593267

When no built-in operations exist, we can use the functions tc/map-columns and tc/add-columns.

The tc/map-columns function is useful when one needs to apply a function to the values in one or more of the existings columns, for every row.

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration))

202304-divvy-tripdata.zip [426590 4]:

:rideable-type :started-at :ended-at :duration
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S
(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration)
    (tc/map-columns :duration-in-seconds
                    [:duration]
                    #(java-time/as % :seconds)))

202304-divvy-tripdata.zip [426590 5]:

:rideable-type :started-at :ended-at :duration :duration-in-seconds
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S 249
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S 1390
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S 325
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S 302
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S 123

The tc/add-columns function is useful when one needs to apply a function to a whole dataset and return a whole column at once. This combines nicely with the column API (tcc).

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration)
    (tc/map-columns :duration-in-seconds
                    [:duration]
                    #(java-time/as % :seconds))
    (tc/add-column :duration-in-minutes
                   (fn [ds]
                     (-> ds
                         :duration-in-seconds
                         (tcc/* 1/60)))))

202304-divvy-tripdata.zip [426590 6]:

:rideable-type :started-at :ended-at :duration :duration-in-seconds :duration-in-minutes
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S 249 4.150
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S 1390 23.17
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S 325 5.417
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S 302 5.033
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S 123 2.050

Let us also add a column of the hour where each trip started – an integer between 0 to 23.

To do that, we will first add the hour as a column. Earlier, we did some time processing using the java-time API. Now, we will demonstrate a different way, using the datetime namespace of dtype-next. This namespace offers some handy functions that act on whole columns.

Let us keep this dataset in a var.

(def preprocessed-trips
  (-> trips
      (tc/select-columns [:rideable-type :started-at :ended-at])
      (print/print-range 5)
      (tc/map-columns :duration
                      [:started-at :ended-at]
                      java-time/duration)
      (tc/map-columns :duration-in-seconds
                      [:duration]
                      #(java-time/as % :seconds))
      (tc/add-column :duration-in-minutes
                     (fn [ds]
                       (-> ds
                           :duration-in-seconds
                           (tcc/* 1/60))))
      (tc/add-column :hour
                     (fn [ds]
                       (datetime/long-temporal-field
                        :hours
                        (:started-at ds))))))
preprocessed-trips

202304-divvy-tripdata.zip [426590 7]:

:rideable-type :started-at :ended-at :duration :duration-in-seconds :duration-in-minutes :hour
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S 249 4.150 8
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S 1390 23.17 11
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S 325 5.417 7
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S 302 5.033 7
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S 123 2.050 8

5.16 Grouping and summarizing

For how long is a bike used typically? We can answer this question by using one of the built-in summarizing operations. These are equivalent to the column operations that output a scalar value, but require the name of the column that we use as input.

(-> preprocessed-trips
    (tc/median :duration-in-minutes))

_unnamed [1 1]:

summary
8.91666667

This gives us a summary measure over the entire dataset. Alternatively, we can group by the values in one of the columns.

(-> preprocessed-trips
    (tc/group-by [:rideable-type])
    (print/print-range 5))

_unnamed [3 3]:

:name :group-id :data
{:rideable-type electric_bike} 0 Group: {:rideable-type “electric_bike”} [247965 7]:
{:rideable-type classic_bike} 1 Group: {:rideable-type “classic_bike”} [169738 7]:
{:rideable-type docked_bike} 2 Group: {:rideable-type “docked_bike”} [8887 7]:

The resulting dataset is a dataset of a special kind, a grouped dataset. Its :data column contains whole datasets, which are the groups. In our case, these are the groups of bike trips starting in a given hour, for every hour throughout the day.

We can then apply an operation conditional on the groups.

(-> preprocessed-trips
    (tc/group-by [:rideable-type])
    (tc/mean :duration-in-minutes))

_unnamed [3 2]:

:rideable-type summary
electric_bike 11.76538409
classic_bike 18.54251022
docked_bike 143.70773039

More generally, we can use tc/aggregate to apply an arbitrary summary function.

How does bike usage change througout the day?

Let us see how the number of trips and their median length change by the hour.

Let us group the trips by the hour:

(-> preprocessed-trips
    (tc/group-by [:hour])
    (print/print-range 5))

_unnamed [24 3]:

:name :group-id :data
{:hour 8} 0 Group: {:hour 8} [22170 7]:
{:hour 11} 1 Group: {:hour 11} [20593 7]:
{:hour 23} 21 Group: {:hour 23} [7327 7]:
{:hour 6} 22 Group: {:hour 6} [8863 7]:
{:hour 5} 23 Group: {:hour 5} [2802 7]:

Now, we can aggregate over the groups to recieve a summary. The resulting summary dataset will no longer be a grouped dataset. We will order the summary by the hour.

(-> preprocessed-trips
    (tc/group-by [:hour])
    (tc/aggregate {:n-trips tc/row-count
                   :median-duration (fn [ds]
                                      (tcc/median (:duration-in-seconds ds)))})
    (tc/order-by [:hour])
    (print/print-range :all))

_unnamed [24 3]:

:hour :n-trips :median-duration
0 5111 487.0
1 3235 478.0
2 1874 461.0
3 1165 491.0
4 950 452.0
5 2802 394.0
6 8863 434.0
7 17065 473.0
8 22170 472.0
9 17316 466.0
10 17489 510.0
11 20593 519.0
12 24050 536.0
13 24837 560.0
14 25900 585.0
15 30730 574.0
16 40374 590.0
17 46889 597.0
18 38281 578.0
19 27107 537.0
20 17252 505.0
21 14388 502.0
22 10822 508.0
23 7327 485.0

We can see a peak of usage between 17:00 to 18:00 and a possibly slight tendendcy for longer trips (in time) around the afternoon hours.

For further examples of summarizing, see the Tablecloth aggregate documentation.

source: notebooks/noj_book/tablecloth_table_processing.clj