5  Table processing with Tablecloth

author: Daniel Slutsky

last change: 2024-12-08

Tablecloth is a table processing library inspired by the dataframe ergonomics typicall to the R ecosystem, specifically the Tidyverse, but offers certain advantages on top of that.

It is built on top of the data structures and functions of tech.ml.dataset, a high-performance table processing library (often called TMD), which is built on top of dtype-next, an array-programming library.

In this tutorial, we will see a few of the core ideas of Tablecloth. You are encouraged to look into the main documentation for more information.

5.2 About this tutorial

In this tutorial, we will demonstrate the ergonomics of so-called dataset data strucures provided by Tablecloth. Datasets are table-like data structures, often called data-frames in other data science platforms.

We will assume basic familiarity with Clojure.

A lot of what we demonstrate here can also be implemented with the usual Clojure data strucures such as vectors and maps. However, datasets offer not only performance advantages in space and time, but also certain usability features, which are arguably expressive and powerful.

We will often use treading macros, mostly ->. This approach is compatible with data science cultures such as the Tidyverse in R.

5.3 Setup

We create a namespace and require the Tablecloth API namespaces: The main Dataset API tablecloth.api and the Column API tablecloth.column.api that we’ll see below. We will also use tech.v3.dataset.print to control printing, clojure.string for some string processing, Kindly to control the way certain things are displayed, Clojure.Java-Time for some time calculations, and the datetime namespace of dtype-next.

(ns noj-book.tablecloth-table-processing
  (:require [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tech.v3.dataset.print :as print]
            [clojure.string :as str]
            [scicloj.kindly.v4.kind :as kind]
            [java-time.api :as java-time]
            [tech.v3.datatype.datetime :as datetime]))

5.4 Creating a dataset

Assume we have a vector of vectors representing bike trips. Each trip has the type of the bike and the coordinates (latitude and longitude) of the trip’s start and end. We can turn this data structure into a dataset:

(tc/dataset [["classic_bike" 41.906866 -87.626217 41.92393131136619 -87.63582453131676]
             ["electric_bike" 41.869312286 -87.673897266 41.8895 -87.688257]
             ["classic_bike" 41.95600355078549 -87.68016144633293 41.886875 -87.62603]]
            {:column-names [:rideable-type
                            :start-lat :start-lng
                            :end-lat :end-lng]})

:_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

Sometimes, data may arrive in different shapes. We can also create such a dataset from a vector of maps:

(tc/dataset [{:rideable-type "classic_bike"
              :start-lat     41.906866
              :start-lng     -87.626217
              :end-lat       41.92393131136619
              :end-lng       -87.63582453131676}
             {:rideable-type "electric_bike"
              :start-lat     41.869312286
              :start-lng     -87.673897266
              :end-lat       41.8895
              :end-lng       -87.688257 }
             {:rideable-type "classic_bike"
              :start-lat     41.95600355078549
              :start-lng     -87.68016144633293
              :end-lat       41.886875
              :end-lng       -87.62603}])

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

.. and also, from a map of vectors:

(tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
             :start-lat     [41.906866 41.869312286 41.95600355078549]
             :start-lng     [-87.626217 -87.673897266 -87.68016144633293]
             :end-lat       [41.92393131136619 41.8895 41.886875]
             :end-lng       [-87.63582453131676 -87.688257 -87.62603]})

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

Let us hold it in a var to explore it further:

(def some-trips
  (tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
               :start-lat     [41.906866 41.869312286 41.95600355078549]
               :start-lng     [-87.626217 -87.673897266 -87.68016144633293]
               :end-lat       [41.92393131136619 41.8895 41.886875]
               :end-lng       [-87.63582453131676 -87.688257 -87.62603]}))

5.5 Displaying a dataset

In an environment compatible with the Kindly standard, the default way a dataset is displayed is by printing it.

some-trips

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

If necessary, we may customize the printing using the tech.v3.dataset.print namespace of the tech.ml.dataset library.

For example:

(-> {:x (range 99)
     :y (repeatedly 99 rand)}
    tc/dataset
    ;; show at most 9 values
    (print/print-range 9))

_unnamed [99 2]:

:x :y
0 0.35437556
1 0.50994490
2 0.39946618
3 0.01589818
… …
94 0.16528626
95 0.99148194
96 0.45386767
97 0.92543445
98 0.57312692

We may also explicitly turn it into an HTML table:

(kind/table some-trips)
rideable-type start-lat start-lng end-lat end-lng
classic_bike 41.906866 -87.626217 41.92393131136619 -87.63582453131676
electric_bike 41.869312286 -87.673897266 41.8895 -87.688257
classic_bike 41.95600355078549 -87.68016144633293 41.886875 -87.62603

This does not matter much for now, but it can be handy when certain inner values should be visualized in a certain way.

It is possible to use datatables to reneder kind/table and specify datatables options (see the full list).

(kind/table some-trips
            {:use-datatables true
             :datatables     {:scrollY 100}})
rideable-typestart-latstart-lngend-latend-lng
classic_bike41.906866-87.62621741.92393131136619-87.63582453131676
electric_bike41.869312286-87.67389726641.8895-87.688257
classic_bike41.95600355078549-87.6801614463329341.886875-87.62603

We may also nest the usual printed table inside other visualization kinds, such as Hiccup.

(kind/hiccup
 [:div {:style {:width "60%"
                :max-height "400px"
                :overflow-x :auto
                :overflow-y :auto
                :background "floralwhite"}}
  some-trips])

_unnamed [3 5]:

:rideable-type :start-lat :start-lng :end-lat :end-lng
classic_bike 41.90686600 -87.62621700 41.92393131 -87.63582453
electric_bike 41.86931229 -87.67389727 41.88950000 -87.68825700
classic_bike 41.95600355 -87.68016145 41.88687500 -87.62603000

For use in this tutorial, let us define our own customized view:

(defn compact-view [dataset]
  (kind/hiccup
   [:div {:style {:max-width "100%"
                  :max-height "400px"
                  :overflow-x :auto
                  :overflow-y :auto}}
    dataset]))

5.6 What is a dataset?

Let us explore this data structure, our little dataset of bike trips.

A dataset is a value of the Dataset datatype defined in the tech.ml.dataset library:

(type some-trips)
tech.v3.dataset.impl.dataset.Dataset

One thing worth knowing about this datatype is that it is extended by quite a few interfaces and protocols.

For example, it behaves as a map.

(map? some-trips)
true

The keys are the column names:

(keys some-trips)
(:rideable-type :start-lat :start-lng :end-lat :end-lng)
(tc/column-names some-trips)
(:rideable-type :start-lat :start-lng :end-lat :end-lng)

.. and the values are the columns:

(:start-lat some-trips)
#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]

Now we need to discuss what columns are.

5.7 What is a column?

(:start-lat some-trips)
#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]

A column is a value of Column datatype defined in the tech.ml.dataset library:

(-> some-trips
    :start-lat
    type)
tech.v3.dataset.impl.column.Column

This datatype is also extended by quite a few interfaces and protocols.

For example, it is sequential.

(-> some-trips
    :start-lat
    sequential?)
true

So, we can use the sequence abstraction with it:

(->> (:start-lat some-trips)
     (take 2))
(41.906866 41.869312286)
(->> (:rideable-type some-trips)
     (filter #{"classic_bike"}))
("classic_bike" "classic_bike")

It is also assiciative:

(-> some-trips
    :rideable-type
    (assoc 2 "my strange and unique bike"))
["classic_bike" "electric_bike" "my strange and unique bike"]

5.8 Working with Columns

We may use Tablecloth’s Column API to create and process columns. For example:

(tcc/column ["classic_bike" "electrical_bike" "classic_bike"])
#tech.v3.dataset.column<string>[3]
null
[classic_bike, electrical_bike, classic_bike]

What is the average latitude where trips tend to start?

(-> some-trips
    :start-lat
    tcc/mean)
41.9107272789285

What is the type of elements in this Column?

(-> some-trips
    :start-lat
    tcc/typeof)
:float64

Let us look into our latitudes in radians rather than degrees:

(-> some-trips
    :start-lat
    (tcc/* (/ Math/PI 180)))
#tech.v3.dataset.column<float64>[3]
null
[0.7314, 0.7308, 0.7323]

You see, columns are typed, and this has implications for both for time and space performance as well as ergonomics.

5.9 The data in columns

You will probably not need this detail most times, but it is worth knowing that the data actually held by the Column can be accessed as the .data field, and it can be of varying data structures.

For example:

(-> some-trips
    :start-lat
    .data
    type)
tech.v3.datatype.array_buffer.ArrayBuffer
(-> some-trips
    :rideable-type
    .data
    type)
tech.v3.dataset.string_table.StringTable
(-> (range 9)
    tcc/column
    .data
    type)
clojure.lang.LongRange

Behind the scenes, tech.ml.dataset makes sure to use efficient data structures for columns, so that, e.g., random access by index will be efficient:

(-> some-trips
    :start-lat
    (nth 2))
41.95600355078549

The following is quick too!

(-> (range 1000000)
    (tcc/* 1000)
    (nth 10000))
10000000

Here we rely on the β€œlazy and noncaching” semantics of the undelying dtype-next library, which is a topic worth its own tutorial.

5.10 Summarizing datasets

We can use the tc/info function to summarize a dataset:

(-> some-trips
    tc/info
    compact-view)

_unnamed: descriptive-stats [5 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:rideable-type :string 3 0 classic_bike classic_bike classic_bike
:start-lat :float64 3 0 41.86931229 41.91072728 41.95600355 0.04347443 0.39652474 41.91 41.96
:start-lng :float64 3 0 -87.68016145 -87.66009190 -87.62621700 0.02950325 1.64462246 -87.63 -87.68
:end-lat :float64 3 0 41.88687500 41.90010210 41.92393131 0.02067839 1.70070295 41.92 41.89
:end-lng :float64 3 0 -87.68825700 -87.65003718 -87.62603000 0.03345967 -1.56657916 -87.64 -87.63

5.11 Reading datasets

Datasets are often read from files.

Let us read a file from the Chicago bike trips dataset. In this case, it is a CSV file compressed by gzip, but other formats are supported as well.

First, let us read just a few rows:

(->  "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
     (tc/dataset {:num-rows 3})
     compact-view)

data/chicago-bikes/202304_divvy_tripdata.csv.gz [3 13]:

ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37 41.80 -87.60 41.79 -87.60 member
34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12 41.87 -87.65 41.93 -87.68 member
5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22 41.93 -87.66 41.93 -87.66 member

So reading a dataset is easy, but sometimes we may wish to pass a few options to handle it a bit better.

For example, you see that by default, the column names are strings:

(->  "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
     (tc/dataset {:num-rows 9})
     tc/column-names)
("ride_id"
 "rideable_type"
 "started_at"
 "ended_at"
 "start_station_name"
 "start_station_id"
 "end_station_name"
 "end_station_id"
 "start_lat"
 "start_lng"
 "end_lat"
 "end_lng"
 "member_casual")

We can apply the keyword function to all of them, to conveniently have keywords instead.

(->  "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
     (tc/dataset {:num-rows 9
                  :key-fn keyword})
     tc/column-names)
(:ride_id
 :rideable_type
 :started_at
 :ended_at
 :start_station_name
 :start_station_id
 :end_station_name
 :end_station_id
 :start_lat
 :start_lng
 :end_lat
 :end_lng
 :member_casual)

Even better, we may process the names to replace underscores with dashes.

(->  "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))})
     tc/column-names)
(:ride-id
 :rideable-type
 :started-at
 :ended-at
 :start-station-name
 :start-station-id
 :end-station-name
 :end-station-id
 :start-lat
 :start-lng
 :end-lat
 :end-lng
 :member-casual)

Also, the date-time columns are parsed as strings.

(->  "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))})
     :started-at
     tcc/typeof)
:string

Let us specify our own parsing for these columns.

(def datetime-parser [:local-date-time
                      "yyyy-MM-dd HH:mm:ss"])
(->  "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))
                  ;; Note we use the original column names
                  ;; when defining the parser:
                  :parser-fn {"started_at" datetime-parser
                              "ended_at" datetime-parser}})
     :started-at
     tcc/typeof)
:local-date-time

Let us now read the whole dataset and hold it in a var for further exploration. We use defonce so that next time we evaluate this expression, nothing will happen. This practice is handy when reading big files.

(defonce trips
  (->  "data/chicago-bikes/202304_divvy_tripdata.csv.gz"
       (tc/dataset {:key-fn    (fn [s]
                                 (-> s
                                     (str/replace #"_" "-")
                                     keyword))
                    :parser-fn {"started_at" datetime-parser
                                "ended_at"   datetime-parser}})))
(compact-view
 trips)

data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 13]:

:ride-id :rideable-type :started-at :ended-at :start-station-name :start-station-id :end-station-name :end-station-id :start-lat :start-lng :end-lat :end-lng :member-casual
8FE8F7D9C10E88C7 electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 41.80000000 -87.60000000 41.79000000 -87.60000000 member
34E4ED3ADF1D821B electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 41.87000000 -87.65000000 41.93000000 -87.68000000 member
5296BF07A2F77CB5 electric_bike 2023-04-19T08:41:22 2023-04-19T08:43:22 41.93000000 -87.66000000 41.93000000 -87.66000000 member
40759916B76D5D52 electric_bike 2023-04-19T13:31:30 2023-04-19T13:35:09 41.92000000 -87.65000000 41.91000000 -87.65000000 member
77A96F460101AC63 electric_bike 2023-04-19T12:05:36 2023-04-19T12:10:26 41.91000000 -87.65000000 41.91000000 -87.63000000 member
8D6A2328E19DC168 electric_bike 2023-04-19T12:17:34 2023-04-19T12:21:38 41.91000000 -87.63000000 41.92000000 -87.65000000 member
C97BBA66E07889F9 electric_bike 2023-04-19T09:35:48 2023-04-19T09:45 41.93000000 -87.66000000 41.91000000 -87.65000000 member
6687AD4C575FF734 electric_bike 2023-04-11T16:13:43 2023-04-11T16:18:41 42.00000000 -87.66000000 41.99000000 -87.66000000 member
A8FA4F73B22BC11F electric_bike 2023-04-11T16:29:24 2023-04-11T16:40:23 41.99000000 -87.66000000 42.00000000 -87.66000000 member
81E158FE63D99994 electric_bike 2023-04-19T17:35:40 2023-04-19T17:36:11 41.88000000 -87.65000000 41.88000000 -87.65000000 member
... ... ... ... ... ... ... ... ... ... ... ... ...
BAE25DBA0CD0DF40 electric_bike 2023-04-14T07:12:34 2023-04-14T07:42:07 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93186915 -87.70120764 41.87593267 -87.63058454 member
5A98F86A573AAB2C electric_bike 2023-04-24T07:27:02 2023-04-24T07:58:22 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93198085 -87.70128024 41.87593267 -87.63058454 member
92B02F2FBDC1FB9F classic_bike 2023-04-12T08:16:48 2023-04-12T08:40:08 Michigan Ave & Lake St TA1305000011 May St & Taylor St 13160 41.88602200 -87.62439800 41.86948210 -87.65548640 member
119A6C53607EAA4A electric_bike 2023-04-28T07:24:54 2023-04-28T07:53:44 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93190396 -87.70125461 41.87593267 -87.63058454 member
9BCF1E8BA027EAFA electric_bike 2023-04-21T07:15:06 2023-04-21T07:41:45 Richmond St & Diversey Ave 15645 Clark St & Ida B Wells Dr TA1305000009 41.93193305 -87.70126295 41.87593267 -87.63058454 member
A17D800CE963661A classic_bike 2023-04-11T15:46:42 2023-04-11T15:50:03 Michigan Ave & Lake St TA1305000011 Clark St & Randolph St TA1305000030 41.88602200 -87.62439800 41.88457623 -87.63188991 member
8B441A6C436E9900 classic_bike 2023-04-29T21:20:21 2023-04-29T21:30:19 Halsted St & 18th St 13099 Blue Island Ave & 18th St 13135 41.85750568 -87.64599144 41.85755600 -87.66153500 casual
3980D64BE11540F1 classic_bike 2023-04-24T09:16:05 2023-04-24T09:22:27 Halsted St & 18th St 13099 Blue Island Ave & 18th St 13135 41.85750568 -87.64599144 41.85755600 -87.66153500 casual
3EF4B49FF7DAA02C classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 Franklin St & Jackson Blvd TA1305000025 Clark St & Randolph St TA1305000030 41.87770796 -87.63532114 41.88457623 -87.63188991 casual
210B2ED6583DC231 classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 Michigan Ave & Lake St TA1305000011 Clark St & Randolph St TA1305000030 41.88602200 -87.62439800 41.88457623 -87.63188991 casual
D29CB39B9E3FC46A electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 Franklin St & Jackson Blvd TA1305000025 Clark St & Ida B Wells Dr TA1305000009 41.87813417 -87.63525450 41.87593267 -87.63058454 casual
(-> trips
    tc/info
    compact-view)

data/chicago-bikes/202304_divvy_tripdata.csv.gz: descriptive-stats [13 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:ride-id :string 426590 0 89CC272248838ACC 8FE8F7D9C10E88C7 D29CB39B9E3FC46A
:rideable-type :string 426590 0 electric_bike electric_bike electric_bike
:started-at :local-date-time 426590 0 2023-04-01T00:00:02 2023-04-16T05:15:35.589 2023-04-30T23:59:05 6.79308822E+08 0.08185427 2023-04-02T08:37:28 2023-04-18T08:00:32
:ended-at :local-date-time 426590 0 2023-04-01T00:03:10 2023-04-16T05:32:48.230 2023-05-03T10:37:12 6.79294547E+08 0.08147835 2023-04-02T08:41:37 2023-04-18T08:02:35
:start-station-name :string 362776 63814 Franklin St & Jackson Blvd
:start-station-id :string 362776 63814 TA1305000025
:end-station-name :string 357960 68630 Clark St & Ida B Wells Dr
:end-station-id :string 357960 68630 TA1305000009
:start-lat :float64 426590 0 41.65 41.90 42.07 4.70980524E-02 -0.35851400 41.80 41.88
:start-lng :float64 426590 0 -87.83 -87.65 -87.52 2.76893978E-02 -0.82254345 -87.60 -87.64
:end-lat :float64 426155 435 41.65 41.90 42.08 4.72412894E-02 -0.37120814 41.79 41.88
:end-lng :float64 426155 435 -88.11 -87.65 -87.53 2.78076951E-02 -0.82990154 -87.60 -87.63
:member-casual :string 426590 0 member member casual

It is a whole month of bike trips!

5.12 Getting the rows of a dataset

Datasets are organized by columns for efficiency, making use of the knowledge of homogenous types in columns.

Sometimes, however, it is useful to work with rows as well.

The tc/rows function provides the rows of a dataset, either as vectors or as maps. Note, however, that it does not copy the data. Rather, it provides a rowwise view of the columnwise dataset.

(take 2 (tc/rows trips))
(["8FE8F7D9C10E88C7"
  "electric_bike"
  #object[java.time.LocalDateTime 0x15e0dac2 "2023-04-02T08:37:28"]
  #object[java.time.LocalDateTime 0x427051b2 "2023-04-02T08:41:37"]
  nil
  nil
  nil
  nil
  41.8
  -87.6
  41.79
  -87.6
  "member"]
 ["34E4ED3ADF1D821B"
  "electric_bike"
  #object[java.time.LocalDateTime 0x6e5bcef0 "2023-04-19T11:29:02"]
  #object[java.time.LocalDateTime 0x1327563a "2023-04-19T11:52:12"]
  nil
  nil
  nil
  nil
  41.87
  -87.65
  41.93
  -87.68
  "member"])
(take 2 (tc/rows trips :as-maps))
({:ride-id "8FE8F7D9C10E88C7",
  :rideable-type "electric_bike",
  :started-at
  #object[java.time.LocalDateTime 0x15e0dac2 "2023-04-02T08:37:28"],
  :ended-at
  #object[java.time.LocalDateTime 0x427051b2 "2023-04-02T08:41:37"],
  :start-station-name nil,
  :start-station-id nil,
  :end-station-name nil,
  :end-station-id nil,
  :start-lat 41.8,
  :start-lng -87.6,
  :end-lat 41.79,
  :end-lng -87.6,
  :member-casual "member"}
 {:ride-id "34E4ED3ADF1D821B",
  :rideable-type "electric_bike",
  :started-at
  #object[java.time.LocalDateTime 0x6e5bcef0 "2023-04-19T11:29:02"],
  :ended-at
  #object[java.time.LocalDateTime 0x1327563a "2023-04-19T11:52:12"],
  :start-station-name nil,
  :start-station-id nil,
  :end-station-name nil,
  :end-station-id nil,
  :start-lat 41.87,
  :start-lng -87.65,
  :end-lat 41.93,
  :end-lng -87.68,
  :member-casual "member"})

As you may know, Clojure shines in processing plain data, structured or unstructured, such as vectors and maps of any content. We do not lose any of that when using datasets, as we can still view them as rows which are just maps or vectors.

5.13 Querying datasets

Tablecloth offers various ways to view a subset of a dataset. Typically, they do not copy the data but provide views of the same space in memory.

The first few trips:

(-> trips
    tc/head
    compact-view)

data/chicago-bikes/202304_divvy_tripdata.csv.gz [5 13]:

:ride-id :rideable-type :started-at :ended-at :start-station-name :start-station-id :end-station-name :end-station-id :start-lat :start-lng :end-lat :end-lng :member-casual
8FE8F7D9C10E88C7 electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 41.80 -87.60 41.79 -87.60 member
34E4ED3ADF1D821B electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 41.87 -87.65 41.93 -87.68 member
5296BF07A2F77CB5 electric_bike 2023-04-19T08:41:22 2023-04-19T08:43:22 41.93 -87.66 41.93 -87.66 member
40759916B76D5D52 electric_bike 2023-04-19T13:31:30 2023-04-19T13:35:09 41.92 -87.65 41.91 -87.65 member
77A96F460101AC63 electric_bike 2023-04-19T12:05:36 2023-04-19T12:10:26 41.91 -87.65 41.91 -87.63 member

Just a few columns:

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5))

data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 3]:

:rideable-type :started-at :ended-at
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12
… … …
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35

Only rows about classical bikes, and just a few columns:

(-> trips
    (tc/select-rows (fn [row]
                      (-> row :rideable-type (= "classic_bike"))))
    tc/head
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5))

data/chicago-bikes/202304_divvy_tripdata.csv.gz [5 3]:

:rideable-type :started-at :ended-at
classic_bike 2023-04-10T17:34:35 2023-04-10T18:02:36
classic_bike 2023-04-12T12:29:46 2023-04-12T12:54
classic_bike 2023-04-29T20:57:10 2023-04-29T20:57:13
classic_bike 2023-04-20T17:03:11 2023-04-20T17:24:58
classic_bike 2023-04-07T21:14:14 2023-04-07T21:15:14

5.14 Adding columns

Here we will demonstrate some of the ways to extend a dataset with new columns.

Let us compute how the bike trips are. For clarity, let us focus on a dataset with just a few of the columns:

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5))

data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 3]:

:rideable-type :started-at :ended-at
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12
… … …
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35

The tc/map-columns function is useful when one needs to apply a function to the values in one or more of the existings columns, for every row.

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration))

data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 4]:

:rideable-type :started-at :ended-at :duration
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S
… … … …
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S
(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration)
    (tc/map-columns :duration-in-seconds
                    [:duration]
                    #(java-time/as % :seconds)))

data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 5]:

:rideable-type :started-at :ended-at :duration :duration-in-seconds
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S 249
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S 1390
… … … … …
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S 325
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S 302
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S 123

The tc/add-columns function is useful when one needs to apply a function to a whole dataset and return a whole column at once. This combines nicely with the column API (tcc).

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration)
    (tc/map-columns :duration-in-seconds
                    [:duration]
                    #(java-time/as % :seconds))
    (tc/add-column :duration-in-minutes
                   (fn [ds]
                     (-> ds
                         :duration-in-seconds
                         (tcc/* 1/60)))))

data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 6]:

:rideable-type :started-at :ended-at :duration :duration-in-seconds :duration-in-minutes
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S 249 4.150
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S 1390 23.17
… … … … … …
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S 325 5.417
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S 302 5.033
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S 123 2.050

Let us also add a column of the hour where each trip started – an integer between 0 to 23.

To do that, we will first add the hour as a column. Earlier, we did some time processing using the java-time API. Now, we will demonstrate a different way, using the datetime namespace of dtype-next. This namespace offers some handy functions that act on whole columns.

Let us keep this dataset in a var.

(def preprocessed-trips
  (-> trips
      (tc/select-columns [:rideable-type :started-at :ended-at])
      (print/print-range 5)
      (tc/map-columns :duration
                      [:started-at :ended-at]
                      java-time/duration)
      (tc/map-columns :duration-in-seconds
                      [:duration]
                      #(java-time/as % :seconds))
      (tc/add-column :duration-in-minutes
                     (fn [ds]
                       (-> ds
                           :duration-in-seconds
                           (tcc/* 1/60))))
      (tc/add-column :hour
                     (fn [ds]
                       (datetime/long-temporal-field
                        :hours
                        (:started-at ds))))))
preprocessed-trips

data/chicago-bikes/202304_divvy_tripdata.csv.gz [426590 7]:

:rideable-type :started-at :ended-at :duration :duration-in-seconds :duration-in-minutes :hour
electric_bike 2023-04-02T08:37:28 2023-04-02T08:41:37 PT4M9S 249 4.150 8
electric_bike 2023-04-19T11:29:02 2023-04-19T11:52:12 PT23M10S 1390 23.17 11
… … … … … … …
classic_bike 2023-04-18T07:53:51 2023-04-18T07:59:16 PT5M25S 325 5.417 7
classic_bike 2023-04-29T07:33:55 2023-04-29T07:38:57 PT5M2S 302 5.033 7
electric_bike 2023-04-18T08:00:32 2023-04-18T08:02:35 PT2M3S 123 2.050 8

5.15 Grouping and summarizing

How does bike usage change througout the day?

Let us see how the number of trips and their median length change by the hour.

Let us group the trips by the hour:

(-> preprocessed-trips
    (tc/group-by [:hour])
    (print/print-range 5))

_unnamed [24 3]:

:name :group-id :data
{:hour 8} 0 Group: {:hour 8} [22170 7]:
{:hour 11} 1 Group: {:hour 11} [20593 7]:
… … …
{:hour 23} 21 Group: {:hour 23} [7327 7]:
{:hour 6} 22 Group: {:hour 6} [8863 7]:
{:hour 5} 23 Group: {:hour 5} [2802 7]:

The resulting dataset is a dataset of a special kind, a grouped dataset. Its :data column contains whole datasets, which are the groups. In our case, these are the groups of bike trips starting in a given hour, for every hour throughout the day.

Now, we can aggregate over the groups to recieve a summary. The resulting summary dataset will no longer be a grouped dataset. We will order the summary by the hour.

(-> preprocessed-trips
    (tc/group-by [:hour])
    (tc/aggregate {:n-trips tc/row-count
                   :median-duration (fn [ds]
                                      (tcc/median (:duration-in-seconds ds)))})
    (tc/order-by [:hour])
    (print/print-range :all))

_unnamed [24 3]:

:hour :n-trips :median-duration
0 5111 487.0
1 3235 478.0
2 1874 461.0
3 1165 491.0
4 950 452.0
5 2802 394.0
6 8863 434.0
7 17065 473.0
8 22170 472.0
9 17316 466.0
10 17489 510.0
11 20593 519.0
12 24050 536.0
13 24837 560.0
14 25900 585.0
15 30730 574.0
16 40374 590.0
17 46889 597.0
18 38281 578.0
19 27107 537.0
20 17252 505.0
21 14388 502.0
22 10822 508.0
23 7327 485.0

We can see a peak of usage between 17:00 to 18:00 and a possibly slight tendendcy for longer trips (in time) around the afternoon hours.

source: notebooks/noj_book/tablecloth_table_processing.clj