5 Table processing with Tablecloth

author: Cvetomir Dimov and Daniel Slutsky

last change: 2025-02-02

Tablecloth is a table processing library inspired by the dataframe ergonomics typicall to the R ecosystem, specifically the Tidyverse, but offers certain advantages on top of that.

It is built on top of the data structures and functions of tech.ml.dataset, a high-performance table processing library (often called TMD), which is built on top of dtype-next, an array-programming library.

In this tutorial, we will see a few of the core ideas of Tablecloth. You are encouraged to look into the main documentation for more information.

5.1 Recommended resources

Data Manipulation in Clojure Compared to R and Python by Kira Howe (McLean) (2024-07-18)
Dealing with out-of-memory faulty csv’s with Clojure, Duckdb, and Parquet by Georgy Toporkov (2024-01-22)
A beginner-friendly intro by Mey Beisaron (Func Prog Sweden, 2023-03-22):

An in-depth walkthrough by Ethan Miller (data-recur group, 2022-11-05):

5.2 About this tutorial

In this tutorial, we will demonstrate the ergonomics of so-called dataset data strucures provided by Tablecloth. Datasets are table-like data structures, often called data-frames in other data science platforms.

We will assume basic familiarity with Clojure.

A lot of what we demonstrate here can also be implemented with the usual Clojure data strucures such as vectors and maps. However, datasets offer not only performance advantages in space and time, but also certain usability features, which are arguably expressive and powerful.

We will often use treading macros, mostly ->. This approach is compatible with data science cultures such as the Tidyverse in R.

5.3 Setup

We create a namespace and require the following namespaces:

tablecloth.api - the main Tablecloth Dataset API
tablecloth.column.api - the Tablecloth Column API
tech.v3.dataset.print to control printing (from tech.ml.dataset)
clojure.string for string processing
clojure.java.io for file input/output
scicloj.kindly.v4.kind of the Kindly standard to control the way certain values are displayed
java-time.api of Clojure.Java-Time for some time calculations
tech.v3.datatype.datetime of date and time operations (from dtype-next)

(ns noj-book.tablecloth-table-processing
  (:require [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tech.v3.dataset.print :as print]
            [clojure.string :as str]
            [clojure.java.io :as io]
            [scicloj.kindly.v4.kind :as kind]
            [java-time.api :as java-time]
            [tech.v3.datatype.datetime :as datetime]))

5.4 Creating a dataset

Assume we have a vector of vectors representing bike trips. Each trip has the type of the bike and the coordinates (latitude and longitude) of the trip’s start and end. We can turn this data structure into a dataset:

(tc/dataset [["classic_bike" 41.906866 -87.626217 41.92393131136619 -87.63582453131676]
             ["electric_bike" 41.869312286 -87.673897266 41.8895 -87.688257]
             ["classic_bike" 41.95600355078549 -87.68016144633293 41.886875 -87.62603]]
            {:column-names [:rideable-type
                            :start-lat :start-lng
                            :end-lat :end-lng]})

:_unnamed [3 5]:

:rideable-type	:start-lat	:start-lng	:end-lat	:end-lng
classic_bike	41.90686600	-87.62621700	41.92393131	-87.63582453
electric_bike	41.86931229	-87.67389727	41.88950000	-87.68825700
classic_bike	41.95600355	-87.68016145	41.88687500	-87.62603000

Sometimes, data may arrive in different shapes. We can also create such a dataset from a vector of maps:

(tc/dataset [{:rideable-type "classic_bike"
              :start-lat     41.906866
              :start-lng     -87.626217
              :end-lat       41.92393131136619
              :end-lng       -87.63582453131676}
             {:rideable-type "electric_bike"
              :start-lat     41.869312286
              :start-lng     -87.673897266
              :end-lat       41.8895
              :end-lng       -87.688257 }
             {:rideable-type "classic_bike"
              :start-lat     41.95600355078549
              :start-lng     -87.68016144633293
              :end-lat       41.886875
              :end-lng       -87.62603}])

_unnamed [3 5]:

:rideable-type	:start-lat	:start-lng	:end-lat	:end-lng
classic_bike	41.90686600	-87.62621700	41.92393131	-87.63582453
electric_bike	41.86931229	-87.67389727	41.88950000	-87.68825700
classic_bike	41.95600355	-87.68016145	41.88687500	-87.62603000

.. and also, from a map of vectors:

(tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
             :start-lat     [41.906866 41.869312286 41.95600355078549]
             :start-lng     [-87.626217 -87.673897266 -87.68016144633293]
             :end-lat       [41.92393131136619 41.8895 41.886875]
             :end-lng       [-87.63582453131676 -87.688257 -87.62603]})

_unnamed [3 5]:

:rideable-type	:start-lat	:start-lng	:end-lat	:end-lng
classic_bike	41.90686600	-87.62621700	41.92393131	-87.63582453
electric_bike	41.86931229	-87.67389727	41.88950000	-87.68825700
classic_bike	41.95600355	-87.68016145	41.88687500	-87.62603000

Let us hold it in a var to explore it further:

(def some-trips
  (tc/dataset {:rideable-type ["classic_bike" "electric_bike" "classic_bike"]
               :start-lat     [41.906866 41.869312286 41.95600355078549]
               :start-lng     [-87.626217 -87.673897266 -87.68016144633293]
               :end-lat       [41.92393131136619 41.8895 41.886875]
               :end-lng       [-87.63582453131676 -87.688257 -87.62603]}))

5.5 Displaying a dataset

In an environment compatible with the Kindly standard, the default way a dataset is displayed is by printing it.

some-trips

_unnamed [3 5]:

:rideable-type	:start-lat	:start-lng	:end-lat	:end-lng
classic_bike	41.90686600	-87.62621700	41.92393131	-87.63582453
electric_bike	41.86931229	-87.67389727	41.88950000	-87.68825700
classic_bike	41.95600355	-87.68016145	41.88687500	-87.62603000

If necessary, we may customize the printing using the tech.v3.dataset.print namespace of the tech.ml.dataset library.

For example:

(-> {:x (range 99)
     :y (repeatedly 99 rand)}
    tc/dataset
    ;; show at most 9 values
    (print/print-range 9))

_unnamed [99 2]:

:x	:y
0	0.14069322
1	0.91375369
2	0.38493058
3	0.29880738
…	…
94	0.34595240
95	0.46465434
96	0.79280554
97	0.36360836
98	0.69083800

We may also explicitly turn it into an HTML table:

(kind/table some-trips)

rideable-type	start-lat	start-lng	end-lat	end-lng
classic_bike	41.906866	-87.626217	41.92393131136619	-87.63582453131676
electric_bike	41.869312286	-87.673897266	41.8895	-87.688257
classic_bike	41.95600355078549	-87.68016144633293	41.886875	-87.62603

This does not matter much for now, but it can be handy when certain inner values should be visualized in a certain way.

It is possible to use datatables to reneder kind/table and specify datatables options (see the full list).

(kind/table some-trips
            {:use-datatables true
             :datatables     {:scrollY 100}})

rideable-type	start-lat	start-lng	end-lat	end-lng
classic_bike	41.906866	-87.626217	41.92393131136619	-87.63582453131676
electric_bike	41.869312286	-87.673897266	41.8895	-87.688257
classic_bike	41.95600355078549	-87.68016144633293	41.886875	-87.62603

We may also nest the usual printed table inside other visualization kinds, such as Hiccup.

(kind/hiccup
 [:div {:style {:width "60%"
                :max-height "400px"
                :overflow-x :auto
                :overflow-y :auto
                :background "floralwhite"}}
  some-trips])

_unnamed [3 5]:

:rideable-type	:start-lat	:start-lng	:end-lat	:end-lng
classic_bike	41.90686600	-87.62621700	41.92393131	-87.63582453
electric_bike	41.86931229	-87.67389727	41.88950000	-87.68825700
classic_bike	41.95600355	-87.68016145	41.88687500	-87.62603000

5.6 What is a dataset?

Let us explore this data structure, our little dataset of bike trips.

A dataset is a value of the Dataset datatype defined in the tech.ml.dataset library:

(type some-trips)

tech.v3.dataset.impl.dataset.Dataset

One thing worth knowing about this datatype is that it is extended by quite a few interfaces and protocols.

For example, it behaves as a map.

(map? some-trips)

true

The keys are the column names:

(keys some-trips)

(:rideable-type :start-lat :start-lng :end-lat :end-lng)

(tc/column-names some-trips)

(:rideable-type :start-lat :start-lng :end-lat :end-lng)

.. and the values are the columns:

(:start-lat some-trips)

#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]

Now we need to discuss what columns are.

5.7 What is a column?

(:start-lat some-trips)

#tech.v3.dataset.column<float64>[3]
:start-lat
[41.91, 41.87, 41.96]

A column is a value of Column datatype defined in the tech.ml.dataset library:

(-> some-trips
    :start-lat
    type)

tech.v3.dataset.impl.column.Column

This datatype is also extended by quite a few interfaces and protocols.

For example, it is sequential.

(-> some-trips
    :start-lat
    sequential?)

true

So, we can use the sequence abstraction with it:

(->> (:start-lat some-trips)
     (take 2))

(41.906866 41.869312286)

(->> (:rideable-type some-trips)
     (filter #{"classic_bike"}))

("classic_bike" "classic_bike")

It is also assiciative:

(-> some-trips
    :rideable-type
    (assoc 2 "my strange and unique bike"))

["classic_bike" "electric_bike" "my strange and unique bike"]

5.8 Working with Columns

We may use Tablecloth’s Column API to create and process columns. For example:

(tcc/column ["classic_bike" "electrical_bike" "classic_bike"])

#tech.v3.dataset.column<string>[3]
null
[classic_bike, electrical_bike, classic_bike]

What is the average latitude where trips tend to start?

(-> some-trips
    :start-lat
    tcc/mean)

41.9107272789285

What is the type of elements in this Column?

(-> some-trips
    :start-lat
    tcc/typeof)

:float64

Let us look into our latitudes in radians rather than degrees:

(-> some-trips
    :start-lat
    (tcc/* (/ Math/PI 180)))

#tech.v3.dataset.column<float64>[3]
null
[0.7314, 0.7308, 0.7323]

You see, columns are typed, and this has implications for both for time and space performance as well as ergonomics.

5.9 The data in columns

You will probably not need this detail most times, but it is worth knowing that the data actually held by the Column can be accessed as the .data field, and it can be of varying data structures (see list of all datatypes).

For example:

(-> some-trips
    :start-lat
    .data
    type)

tech.v3.datatype.array_buffer.ArrayBuffer

(-> some-trips
    :rideable-type
    .data
    type)

tech.v3.dataset.string_table.StringTable

(-> (range 9)
    tcc/column
    .data
    type)

clojure.lang.LongRange

Behind the scenes, tech.ml.dataset makes sure to use efficient data structures for columns, so that, e.g., random access by index will be efficient:

(-> some-trips
    :start-lat
    (nth 2))

41.95600355078549

The following is quick too!

(-> (range 1000000)
    (tcc/* 1000)
    (nth 10000))

10000000

Here we rely on the “lazy and noncaching” semantics of the undelying dtype-next library, which is a topic worth its own tutorial.

5.10 Operations on columns

In the examples with columns above, we used tcc/* to multiply the values of a column by a scalar. The result was another column. This operation can be applied to multiple two (or more) columns as well, or a mixture of columns and scalars.

(tcc/* (:start-lat some-trips)
       (:end-lng some-trips)
       3
       4)

#tech.v3.dataset.column<float64>[3]
null
[-4.407E+04, -4.406E+04, -4.412E+04]

A long list of other operations have been implemented, among which arithmetic operations, trigonometric functions, and various predicates. Note that the tcc/min and tcc/max functions output a column as well.

(tcc/min (:start-lat some-trips)
         (:end-lat some-trips))

#tech.v3.dataset.column<float64>[3]
null
[41.91, 41.87, 41.89]

They also work with scalar inputs.

(tcc/min (:start-lat some-trips)
         100)

#tech.v3.dataset.column<float64>[3]
null
[41.91, 41.87, 41.96]

(tcc/min (:start-lat some-trips)
         -100)

#tech.v3.dataset.column<float64>[3]
null
[-100.0, -100.0, -100.0]

Other operations produce a scalar as an output. These include various measures of central tendency.

(-> some-trips
    :start-lat
    tcc/mean)

41.9107272789285

(-> some-trips
    :end-lat
    tcc/median)

41.8895

Important operations in this category are functions that reduce over the columns, such as reduce-*, reduce-+, reduce-min, and reduce-max. For example, the following can be used to find the maximum value in a column:

(-> some-trips
    :start-lng
    tcc/reduce-max)

-87.626217

5.11 Summarizing datasets

We can use the tc/info function to summarize a dataset:

(tc/info some-trips)

_unnamed: descriptive-stats [5 12]:

:col-name	:datatype	:n-valid	:min	:mean	:mode	:max	:standard-deviation	:skew	:first	:last
:rideable-type	:string	3			classic_bike				classic_bike	classic_bike
:start-lat	:float64	3	41.86931229	41.91072728		41.95600355	0.04347443	0.39652474	41.91	41.96
:start-lng	:float64	3	-87.68016145	-87.66009190		-87.62621700	0.02950325	1.64462246	-87.63	-87.68
:end-lat	:float64	3	41.88687500	41.90010210		41.92393131	0.02067839	1.70070295	41.92	41.89
:end-lng	:float64	3	-87.68825700	-87.65003718		-87.62603000	0.03345967	-1.56657916	-87.64	-87.63

5.12 Reading datasets

Datasets are often read from files.

Let us read a file from the Divvy Data, where the history of bike sharing trips in Chicago is shared publicly. Chicago bike trips dataset.

Let us first download the data of one month.

(def bike-trips-filename
  "202304-divvy-tripdata.zip")

(if (.exists (io/as-file bike-trips-filename))
  [:already-downloaded bike-trips-filename]
  (with-open [in (io/input-stream "https://divvy-tripdata.s3.amazonaws.com/202304-divvy-tripdata.zip")
              out (io/output-stream bike-trips-filename)]
    (io/copy in out)
    [:just-downloaded bike-trips-filename]))

[:just-downloaded "202304-divvy-tripdata.zip"]

In this case, it is a CSV file held inside a compressed ZIP file, but other formats are supported as well.

First, let us read just a few rows:

(-> bike-trips-filename 
    tc/dataset
    time)

202304-divvy-tripdata.zip [426590 13]:

ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual
8FE8F7D9C10E88C7	electric_bike	2023-04-02 08:37:28	2023-04-02 08:41:37					41.80000000	-87.60000000	41.79000000	-87.60000000	member
34E4ED3ADF1D821B	electric_bike	2023-04-19 11:29:02	2023-04-19 11:52:12					41.87000000	-87.65000000	41.93000000	-87.68000000	member
5296BF07A2F77CB5	electric_bike	2023-04-19 08:41:22	2023-04-19 08:43:22					41.93000000	-87.66000000	41.93000000	-87.66000000	member
40759916B76D5D52	electric_bike	2023-04-19 13:31:30	2023-04-19 13:35:09					41.92000000	-87.65000000	41.91000000	-87.65000000	member
77A96F460101AC63	electric_bike	2023-04-19 12:05:36	2023-04-19 12:10:26					41.91000000	-87.65000000	41.91000000	-87.63000000	member
8D6A2328E19DC168	electric_bike	2023-04-19 12:17:34	2023-04-19 12:21:38					41.91000000	-87.63000000	41.92000000	-87.65000000	member
C97BBA66E07889F9	electric_bike	2023-04-19 09:35:48	2023-04-19 09:45:00					41.93000000	-87.66000000	41.91000000	-87.65000000	member
6687AD4C575FF734	electric_bike	2023-04-11 16:13:43	2023-04-11 16:18:41					42.00000000	-87.66000000	41.99000000	-87.66000000	member
A8FA4F73B22BC11F	electric_bike	2023-04-11 16:29:24	2023-04-11 16:40:23					41.99000000	-87.66000000	42.00000000	-87.66000000	member
81E158FE63D99994	electric_bike	2023-04-19 17:35:40	2023-04-19 17:36:11					41.88000000	-87.65000000	41.88000000	-87.65000000	member
…	…	…	…	…	…	…	…	…	…	…	…	…
BAE25DBA0CD0DF40	electric_bike	2023-04-14 07:12:34	2023-04-14 07:42:07	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93186915	-87.70120764	41.87593267	-87.63058454	member
5A98F86A573AAB2C	electric_bike	2023-04-24 07:27:02	2023-04-24 07:58:22	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93198085	-87.70128024	41.87593267	-87.63058454	member
92B02F2FBDC1FB9F	classic_bike	2023-04-12 08:16:48	2023-04-12 08:40:08	Michigan Ave & Lake St	TA1305000011	May St & Taylor St	13160	41.88602200	-87.62439800	41.86948210	-87.65548640	member
119A6C53607EAA4A	electric_bike	2023-04-28 07:24:54	2023-04-28 07:53:44	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93190396	-87.70125461	41.87593267	-87.63058454	member
9BCF1E8BA027EAFA	electric_bike	2023-04-21 07:15:06	2023-04-21 07:41:45	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93193305	-87.70126295	41.87593267	-87.63058454	member
A17D800CE963661A	classic_bike	2023-04-11 15:46:42	2023-04-11 15:50:03	Michigan Ave & Lake St	TA1305000011	Clark St & Randolph St	TA1305000030	41.88602200	-87.62439800	41.88457623	-87.63188991	member
8B441A6C436E9900	classic_bike	2023-04-29 21:20:21	2023-04-29 21:30:19	Halsted St & 18th St	13099	Blue Island Ave & 18th St	13135	41.85750568	-87.64599144	41.85755600	-87.66153500	casual
3980D64BE11540F1	classic_bike	2023-04-24 09:16:05	2023-04-24 09:22:27	Halsted St & 18th St	13099	Blue Island Ave & 18th St	13135	41.85750568	-87.64599144	41.85755600	-87.66153500	casual
3EF4B49FF7DAA02C	classic_bike	2023-04-18 07:53:51	2023-04-18 07:59:16	Franklin St & Jackson Blvd	TA1305000025	Clark St & Randolph St	TA1305000030	41.87770796	-87.63532114	41.88457623	-87.63188991	casual
210B2ED6583DC231	classic_bike	2023-04-29 07:33:55	2023-04-29 07:38:57	Michigan Ave & Lake St	TA1305000011	Clark St & Randolph St	TA1305000030	41.88602200	-87.62439800	41.88457623	-87.63188991	casual
D29CB39B9E3FC46A	electric_bike	2023-04-18 08:00:32	2023-04-18 08:02:35	Franklin St & Jackson Blvd	TA1305000025	Clark St & Ida B Wells Dr	TA1305000009	41.87813417	-87.63525450	41.87593267	-87.63058454	casual

(->  bike-trips-filename
     (tc/dataset {:num-rows 3}))

202304-divvy-tripdata.zip [3 13]:

ride_id	rideable_type	started_at	ended_at	start_lat	start_lng	end_lat	end_lng	member_casual
8FE8F7D9C10E88C7	electric_bike	2023-04-02 08:37:28	2023-04-02 08:41:37	41.80	-87.60	41.79	-87.60	member
34E4ED3ADF1D821B	electric_bike	2023-04-19 11:29:02	2023-04-19 11:52:12	41.87	-87.65	41.93	-87.68	member
5296BF07A2F77CB5	electric_bike	2023-04-19 08:41:22	2023-04-19 08:43:22	41.93	-87.66	41.93	-87.66	member

So reading a dataset is easy, but sometimes we may wish to pass a few options to handle it a bit better.

For example, you see that by default, the column names are strings:

(->  bike-trips-filename
     (tc/dataset {:num-rows 9})
     tc/column-names)

("ride_id"
 "rideable_type"
 "started_at"
 "ended_at"
 "start_station_name"
 "start_station_id"
 "end_station_name"
 "end_station_id"
 "start_lat"
 "start_lng"
 "end_lat"
 "end_lng"
 "member_casual")

We can apply the keyword function to all of them, to conveniently have keywords instead.

(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn keyword})
     tc/column-names)

(:ride_id
 :rideable_type
 :started_at
 :ended_at
 :start_station_name
 :start_station_id
 :end_station_name
 :end_station_id
 :start_lat
 :start_lng
 :end_lat
 :end_lng
 :member_casual)

Even better, we may process the names to replace underscores with dashes.

(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))})
     tc/column-names)

(:ride-id
 :rideable-type
 :started-at
 :ended-at
 :start-station-name
 :start-station-id
 :end-station-name
 :end-station-id
 :start-lat
 :start-lng
 :end-lat
 :end-lng
 :member-casual)

Also, the date-time columns are parsed as strings.

(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))})
     :started-at
     tcc/typeof)

:string

Let us specify our own parsing for these columns.

(def datetime-parser [:local-date-time
                      "yyyy-MM-dd HH:mm:ss"])

(->  bike-trips-filename
     (tc/dataset {:num-rows 9
                  :key-fn (fn [s]
                            (-> s
                                (str/replace #"_" "-")
                                keyword))
                  ;; Note we use the original column names
                  ;; when defining the parser:
                  :parser-fn {"started_at" datetime-parser
                              "ended_at" datetime-parser}})
     :started-at
     tcc/typeof)

:local-date-time

Let us now read the whole dataset and hold it in a var for further exploration. We use defonce so that next time we evaluate this expression, nothing will happen. This practice is handy when reading big files.

(defonce trips
  (->  bike-trips-filename
       (tc/dataset {:key-fn    (fn [s]
                                 (-> s
                                     (str/replace #"_" "-")
                                     keyword))
                    :parser-fn {"started_at" datetime-parser
                                "ended_at"   datetime-parser}})))

trips

202304-divvy-tripdata.zip [426590 13]:

:ride-id	:rideable-type	:started-at	:ended-at	:start-station-name	:start-station-id	:end-station-name	:end-station-id	:start-lat	:start-lng	:end-lat	:end-lng	:member-casual
8FE8F7D9C10E88C7	electric_bike	2023-04-02T08:37:28	2023-04-02T08:41:37					41.80000000	-87.60000000	41.79000000	-87.60000000	member
34E4ED3ADF1D821B	electric_bike	2023-04-19T11:29:02	2023-04-19T11:52:12					41.87000000	-87.65000000	41.93000000	-87.68000000	member
5296BF07A2F77CB5	electric_bike	2023-04-19T08:41:22	2023-04-19T08:43:22					41.93000000	-87.66000000	41.93000000	-87.66000000	member
40759916B76D5D52	electric_bike	2023-04-19T13:31:30	2023-04-19T13:35:09					41.92000000	-87.65000000	41.91000000	-87.65000000	member
77A96F460101AC63	electric_bike	2023-04-19T12:05:36	2023-04-19T12:10:26					41.91000000	-87.65000000	41.91000000	-87.63000000	member
8D6A2328E19DC168	electric_bike	2023-04-19T12:17:34	2023-04-19T12:21:38					41.91000000	-87.63000000	41.92000000	-87.65000000	member
C97BBA66E07889F9	electric_bike	2023-04-19T09:35:48	2023-04-19T09:45					41.93000000	-87.66000000	41.91000000	-87.65000000	member
6687AD4C575FF734	electric_bike	2023-04-11T16:13:43	2023-04-11T16:18:41					42.00000000	-87.66000000	41.99000000	-87.66000000	member
A8FA4F73B22BC11F	electric_bike	2023-04-11T16:29:24	2023-04-11T16:40:23					41.99000000	-87.66000000	42.00000000	-87.66000000	member
81E158FE63D99994	electric_bike	2023-04-19T17:35:40	2023-04-19T17:36:11					41.88000000	-87.65000000	41.88000000	-87.65000000	member
…	…	…	…	…	…	…	…	…	…	…	…	…
BAE25DBA0CD0DF40	electric_bike	2023-04-14T07:12:34	2023-04-14T07:42:07	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93186915	-87.70120764	41.87593267	-87.63058454	member
5A98F86A573AAB2C	electric_bike	2023-04-24T07:27:02	2023-04-24T07:58:22	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93198085	-87.70128024	41.87593267	-87.63058454	member
92B02F2FBDC1FB9F	classic_bike	2023-04-12T08:16:48	2023-04-12T08:40:08	Michigan Ave & Lake St	TA1305000011	May St & Taylor St	13160	41.88602200	-87.62439800	41.86948210	-87.65548640	member
119A6C53607EAA4A	electric_bike	2023-04-28T07:24:54	2023-04-28T07:53:44	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93190396	-87.70125461	41.87593267	-87.63058454	member
9BCF1E8BA027EAFA	electric_bike	2023-04-21T07:15:06	2023-04-21T07:41:45	Richmond St & Diversey Ave	15645	Clark St & Ida B Wells Dr	TA1305000009	41.93193305	-87.70126295	41.87593267	-87.63058454	member
A17D800CE963661A	classic_bike	2023-04-11T15:46:42	2023-04-11T15:50:03	Michigan Ave & Lake St	TA1305000011	Clark St & Randolph St	TA1305000030	41.88602200	-87.62439800	41.88457623	-87.63188991	member
8B441A6C436E9900	classic_bike	2023-04-29T21:20:21	2023-04-29T21:30:19	Halsted St & 18th St	13099	Blue Island Ave & 18th St	13135	41.85750568	-87.64599144	41.85755600	-87.66153500	casual
3980D64BE11540F1	classic_bike	2023-04-24T09:16:05	2023-04-24T09:22:27	Halsted St & 18th St	13099	Blue Island Ave & 18th St	13135	41.85750568	-87.64599144	41.85755600	-87.66153500	casual
3EF4B49FF7DAA02C	classic_bike	2023-04-18T07:53:51	2023-04-18T07:59:16	Franklin St & Jackson Blvd	TA1305000025	Clark St & Randolph St	TA1305000030	41.87770796	-87.63532114	41.88457623	-87.63188991	casual
210B2ED6583DC231	classic_bike	2023-04-29T07:33:55	2023-04-29T07:38:57	Michigan Ave & Lake St	TA1305000011	Clark St & Randolph St	TA1305000030	41.88602200	-87.62439800	41.88457623	-87.63188991	casual
D29CB39B9E3FC46A	electric_bike	2023-04-18T08:00:32	2023-04-18T08:02:35	Franklin St & Jackson Blvd	TA1305000025	Clark St & Ida B Wells Dr	TA1305000009	41.87813417	-87.63525450	41.87593267	-87.63058454	casual

(tc/info trips)

202304-divvy-tripdata.zip: descriptive-stats [13 12]:

:col-name	:datatype	:n-valid	:n-missing	:min	:mean	:mode	:max	:standard-deviation	:skew	:first	:last
:ride-id	:string	426590	0			89CC272248838ACC				8FE8F7D9C10E88C7	D29CB39B9E3FC46A
:rideable-type	:string	426590	0			electric_bike				electric_bike	electric_bike
:started-at	:local-date-time	426590	0	2023-04-01T00:00:02	2023-04-16T05:15:35.589		2023-04-30T23:59:05	6.79308822E+08	0.08185427	2023-04-02T08:37:28	2023-04-18T08:00:32
:ended-at	:local-date-time	426590	0	2023-04-01T00:03:10	2023-04-16T05:32:48.230		2023-05-03T10:37:12	6.79294547E+08	0.08147835	2023-04-02T08:41:37	2023-04-18T08:02:35
:start-station-name	:string	362776	63814								Franklin St & Jackson Blvd
:start-station-id	:string	362776	63814								TA1305000025
:end-station-name	:string	357960	68630								Clark St & Ida B Wells Dr
:end-station-id	:string	357960	68630								TA1305000009
:start-lat	:float64	426590	0	41.65	41.90		42.07	4.70980524E-02	-0.35851400	41.80	41.88
:start-lng	:float64	426590	0	-87.83	-87.65		-87.52	2.76893978E-02	-0.82254345	-87.60	-87.64
:end-lat	:float64	426155	435	41.65	41.90		42.08	4.72412894E-02	-0.37120814	41.79	41.88
:end-lng	:float64	426155	435	-88.11	-87.65		-87.53	2.78076951E-02	-0.82990154	-87.60	-87.63
:member-casual	:string	426590	0			member				member	casual

It is a whole month of bike trips!

5.13 Getting the rows of a dataset

Datasets are organized by columns for efficiency, making use of the knowledge of homogenous types in columns.

Sometimes, however, it is useful to work with rows as well.

The tc/rows function provides the rows of a dataset, either as vectors or as maps. Note, however, that it does not copy the data. Rather, it provides a rowwise view of the columnwise dataset.

(take 2 (tc/rows trips))

(["8FE8F7D9C10E88C7"
  "electric_bike"
  #object[java.time.LocalDateTime 0x6f2c0754 "2023-04-02T08:37:28"]
  #object[java.time.LocalDateTime 0x28f0792e "2023-04-02T08:41:37"]
  nil
  nil
  nil
  nil
  41.8
  -87.6
  41.79
  -87.6
  "member"]
 ["34E4ED3ADF1D821B"
  "electric_bike"
  #object[java.time.LocalDateTime 0x333291d4 "2023-04-19T11:29:02"]
  #object[java.time.LocalDateTime 0x26d15c3 "2023-04-19T11:52:12"]
  nil
  nil
  nil
  nil
  41.87
  -87.65
  41.93
  -87.68
  "member"])

(take 2 (tc/rows trips :as-maps))

({:ride-id "8FE8F7D9C10E88C7",
  :rideable-type "electric_bike",
  :started-at
  #object[java.time.LocalDateTime 0x6f2c0754 "2023-04-02T08:37:28"],
  :ended-at
  #object[java.time.LocalDateTime 0x28f0792e "2023-04-02T08:41:37"],
  :start-station-name nil,
  :start-station-id nil,
  :end-station-name nil,
  :end-station-id nil,
  :start-lat 41.8,
  :start-lng -87.6,
  :end-lat 41.79,
  :end-lng -87.6,
  :member-casual "member"}
 {:ride-id "34E4ED3ADF1D821B",
  :rideable-type "electric_bike",
  :started-at
  #object[java.time.LocalDateTime 0x333291d4 "2023-04-19T11:29:02"],
  :ended-at
  #object[java.time.LocalDateTime 0x26d15c3 "2023-04-19T11:52:12"],
  :start-station-name nil,
  :start-station-id nil,
  :end-station-name nil,
  :end-station-id nil,
  :start-lat 41.87,
  :start-lng -87.65,
  :end-lat 41.93,
  :end-lng -87.68,
  :member-casual "member"})

As you may know, Clojure shines in processing plain data, structured or unstructured, such as vectors and maps of any content. We do not lose any of that when using datasets, as we can still view them as rows which are just maps or vectors.

5.14 Querying datasets

Tablecloth offers various ways to view a subset of a dataset. Typically, they do not copy the data but provide views of the same space in memory.

The first few trips:

(tc/head trips)

202304-divvy-tripdata.zip [5 13]:

:ride-id	:rideable-type	:started-at	:ended-at	:start-lat	:start-lng	:end-lat	:end-lng	:member-casual
8FE8F7D9C10E88C7	electric_bike	2023-04-02T08:37:28	2023-04-02T08:41:37	41.80	-87.60	41.79	-87.60	member
34E4ED3ADF1D821B	electric_bike	2023-04-19T11:29:02	2023-04-19T11:52:12	41.87	-87.65	41.93	-87.68	member
5296BF07A2F77CB5	electric_bike	2023-04-19T08:41:22	2023-04-19T08:43:22	41.93	-87.66	41.93	-87.66	member
40759916B76D5D52	electric_bike	2023-04-19T13:31:30	2023-04-19T13:35:09	41.92	-87.65	41.91	-87.65	member
77A96F460101AC63	electric_bike	2023-04-19T12:05:36	2023-04-19T12:10:26	41.91	-87.65	41.91	-87.63	member

Just a few columns:

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5))

202304-divvy-tripdata.zip [426590 3]:

:rideable-type	:started-at	:ended-at
electric_bike	2023-04-02T08:37:28	2023-04-02T08:41:37
electric_bike	2023-04-19T11:29:02	2023-04-19T11:52:12
…	…	…
classic_bike	2023-04-18T07:53:51	2023-04-18T07:59:16
classic_bike	2023-04-29T07:33:55	2023-04-29T07:38:57
electric_bike	2023-04-18T08:00:32	2023-04-18T08:02:35

Only rows about classical bikes, and just a few columns:

(-> trips
    (tc/select-rows (fn [row]
                      (-> row :rideable-type (= "classic_bike"))))
    tc/head
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5))

202304-divvy-tripdata.zip [5 3]:

:rideable-type	:started-at	:ended-at
classic_bike	2023-04-10T17:34:35	2023-04-10T18:02:36
classic_bike	2023-04-12T12:29:46	2023-04-12T12:54
classic_bike	2023-04-29T20:57:10	2023-04-29T20:57:13
classic_bike	2023-04-20T17:03:11	2023-04-20T17:24:58
classic_bike	2023-04-07T21:14:14	2023-04-07T21:15:14

5.15 Adding columns

Here we will demonstrate some of the ways to extend a dataset with new columns. For clarity, let us focus on a dataset with just a few of the columns:

(-> trips
    (tc/select-columns [:start-lat :end-lat])
    (print/print-range 5))

202304-divvy-tripdata.zip [426590 2]:

:start-lat	:end-lat
41.80000000	41.79000000
41.87000000	41.93000000
…	…
41.87770796	41.88457623
41.88602200	41.88457623
41.87813417	41.87593267

One can create new columns by applying the dataset equivalents to the column operations discussed above. These operations have the same names, but now take a dataset and the name of the new column as additional inputs. For example, the min function, when taken from the tablecloth.api, can be used as follows:

(-> trips
    (tc/select-columns [:start-lat :end-lat])
    (tc/min :min-lat [:start-lat :end-lat])
    (print/print-range 5))

202304-divvy-tripdata.zip [426590 3]:

:start-lat	:end-lat	:min-lat
41.80000000	41.79000000	41.79000000
41.87000000	41.93000000	41.87000000
…	…	…
41.87770796	41.88457623	41.87770796
41.88602200	41.88457623	41.88457623
41.87813417	41.87593267	41.87593267

When no built-in operations exist, we can use the functions tc/map-columns and tc/add-columns.

The tc/map-columns function is useful when one needs to apply a function to the values in one or more of the existings columns, for every row.

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration))

202304-divvy-tripdata.zip [426590 4]:

:rideable-type	:started-at	:ended-at	:duration
electric_bike	2023-04-02T08:37:28	2023-04-02T08:41:37	PT4M9S
electric_bike	2023-04-19T11:29:02	2023-04-19T11:52:12	PT23M10S
…	…	…	…
classic_bike	2023-04-18T07:53:51	2023-04-18T07:59:16	PT5M25S
classic_bike	2023-04-29T07:33:55	2023-04-29T07:38:57	PT5M2S
electric_bike	2023-04-18T08:00:32	2023-04-18T08:02:35	PT2M3S

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration)
    (tc/map-columns :duration-in-seconds
                    [:duration]
                    #(java-time/as % :seconds)))

202304-divvy-tripdata.zip [426590 5]:

:rideable-type	:started-at	:ended-at	:duration	:duration-in-seconds
electric_bike	2023-04-02T08:37:28	2023-04-02T08:41:37	PT4M9S	249
electric_bike	2023-04-19T11:29:02	2023-04-19T11:52:12	PT23M10S	1390
…	…	…	…	…
classic_bike	2023-04-18T07:53:51	2023-04-18T07:59:16	PT5M25S	325
classic_bike	2023-04-29T07:33:55	2023-04-29T07:38:57	PT5M2S	302
electric_bike	2023-04-18T08:00:32	2023-04-18T08:02:35	PT2M3S	123

The tc/add-columns function is useful when one needs to apply a function to a whole dataset and return a whole column at once. This combines nicely with the column API (tcc).

(-> trips
    (tc/select-columns [:rideable-type :started-at :ended-at])
    (print/print-range 5)
    (tc/map-columns :duration
                    [:started-at :ended-at]
                    java-time/duration)
    (tc/map-columns :duration-in-seconds
                    [:duration]
                    #(java-time/as % :seconds))
    (tc/add-column :duration-in-minutes
                   (fn [ds]
                     (-> ds
                         :duration-in-seconds
                         (tcc/* 1/60)))))

202304-divvy-tripdata.zip [426590 6]:

:rideable-type	:started-at	:ended-at	:duration	:duration-in-seconds	:duration-in-minutes
electric_bike	2023-04-02T08:37:28	2023-04-02T08:41:37	PT4M9S	249	4.150
electric_bike	2023-04-19T11:29:02	2023-04-19T11:52:12	PT23M10S	1390	23.17
…	…	…	…	…	…
classic_bike	2023-04-18T07:53:51	2023-04-18T07:59:16	PT5M25S	325	5.417
classic_bike	2023-04-29T07:33:55	2023-04-29T07:38:57	PT5M2S	302	5.033
electric_bike	2023-04-18T08:00:32	2023-04-18T08:02:35	PT2M3S	123	2.050

Let us also add a column of the hour where each trip started – an integer between 0 to 23.

To do that, we will first add the hour as a column. Earlier, we did some time processing using the java-time API. Now, we will demonstrate a different way, using the datetime namespace of dtype-next. This namespace offers some handy functions that act on whole columns.

Let us keep this dataset in a var.

(def preprocessed-trips
  (-> trips
      (tc/select-columns [:rideable-type :started-at :ended-at])
      (print/print-range 5)
      (tc/map-columns :duration
                      [:started-at :ended-at]
                      java-time/duration)
      (tc/map-columns :duration-in-seconds
                      [:duration]
                      #(java-time/as % :seconds))
      (tc/add-column :duration-in-minutes
                     (fn [ds]
                       (-> ds
                           :duration-in-seconds
                           (tcc/* 1/60))))
      (tc/add-column :hour
                     (fn [ds]
                       (datetime/long-temporal-field
                        :hours
                        (:started-at ds))))))

preprocessed-trips

202304-divvy-tripdata.zip [426590 7]:

:rideable-type	:started-at	:ended-at	:duration	:duration-in-seconds	:duration-in-minutes	:hour
electric_bike	2023-04-02T08:37:28	2023-04-02T08:41:37	PT4M9S	249	4.150	8
electric_bike	2023-04-19T11:29:02	2023-04-19T11:52:12	PT23M10S	1390	23.17	11
…	…	…	…	…	…	…
classic_bike	2023-04-18T07:53:51	2023-04-18T07:59:16	PT5M25S	325	5.417	7
classic_bike	2023-04-29T07:33:55	2023-04-29T07:38:57	PT5M2S	302	5.033	7
electric_bike	2023-04-18T08:00:32	2023-04-18T08:02:35	PT2M3S	123	2.050	8

5.16 Grouping and summarizing

For how long is a bike used typically? We can answer this question by using one of the built-in summarizing operations. These are equivalent to the column operations that output a scalar value, but require the name of the column that we use as input.

(-> preprocessed-trips
    (tc/median :duration-in-minutes))

_unnamed [1 1]:

summary
8.91666667

This gives us a summary measure over the entire dataset. Alternatively, we can group by the values in one of the columns.

(-> preprocessed-trips
    (tc/group-by [:rideable-type])
    (print/print-range 5))

_unnamed [3 3]:

:name	:group-id	:data
{:rideable-type electric_bike}	0	Group: {:rideable-type “electric_bike”} [247965 7]:
{:rideable-type classic_bike}	1	Group: {:rideable-type “classic_bike”} [169738 7]:
{:rideable-type docked_bike}	2	Group: {:rideable-type “docked_bike”} [8887 7]:

The resulting dataset is a dataset of a special kind, a grouped dataset. Its :data column contains whole datasets, which are the groups. In our case, these are the groups of bike trips starting in a given hour, for every hour throughout the day.

We can then apply an operation conditional on the groups.

(-> preprocessed-trips
    (tc/group-by [:rideable-type])
    (tc/mean :duration-in-minutes))

_unnamed [3 2]:

:rideable-type	summary
electric_bike	11.76538409
classic_bike	18.54251022
docked_bike	143.70773039

More generally, we can use tc/aggregate to apply an arbitrary summary function.

How does bike usage change througout the day?

Let us see how the number of trips and their median length change by the hour.

Let us group the trips by the hour:

(-> preprocessed-trips
    (tc/group-by [:hour])
    (print/print-range 5))

_unnamed [24 3]:

:name	:group-id	:data
{:hour 8}	0	Group: {:hour 8} [22170 7]:
{:hour 11}	1	Group: {:hour 11} [20593 7]:
…	…	…
{:hour 23}	21	Group: {:hour 23} [7327 7]:
{:hour 6}	22	Group: {:hour 6} [8863 7]:
{:hour 5}	23	Group: {:hour 5} [2802 7]:

Now, we can aggregate over the groups to recieve a summary. The resulting summary dataset will no longer be a grouped dataset. We will order the summary by the hour.

(-> preprocessed-trips
    (tc/group-by [:hour])
    (tc/aggregate {:n-trips tc/row-count
                   :median-duration (fn [ds]
                                      (tcc/median (:duration-in-seconds ds)))})
    (tc/order-by [:hour])
    (print/print-range :all))

_unnamed [24 3]:

:hour	:n-trips	:median-duration
0	5111	487.0
1	3235	478.0
2	1874	461.0
3	1165	491.0
4	950	452.0
5	2802	394.0
6	8863	434.0
7	17065	473.0
8	22170	472.0
9	17316	466.0
10	17489	510.0
11	20593	519.0
12	24050	536.0
13	24837	560.0
14	25900	585.0
15	30730	574.0
16	40374	590.0
17	46889	597.0
18	38281	578.0
19	27107	537.0
20	17252	505.0
21	14388	502.0
22	10822	508.0
23	7327	485.0

We can see a peak of usage between 17:00 to 18:00 and a possibly slight tendendcy for longer trips (in time) around the afternoon hours.

For further examples of summarizing, see the Tablecloth aggregate documentation.

source: notebooks/noj_book/tablecloth_table_processing.clj