7 Advanced Table Processing with Tablecloth - draft 🛠

authors: Cvetomir Dimov and Daniel Slutsky

last change: 2025-01-30

7.1 Setup

(ns noj-book.tablecloth-advanced-table-processing
  (:require [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tech.v3.dataset.print :as print]
            [clojure.string :as str]
            [scicloj.kindly.v4.kind :as kind]
            [java-time.api :as java-time]
            [tech.v3.datatype.datetime :as datetime]))

7.2 Pivoting

The following is inpired by R4DS Tidy data / Pivoting.

Pivoting datasets means changing their shape - either splitting a culumn into several or putting the data from two or more columns into one. The two functions that perform these transformations are pivot->longer and pivot->wider. Here we will demonstrate how they work.

We will use the dataset cases-data for this demontration, which corresponds to table4a in the aforementioned chapter of R4DS.

(def cases-data
  (tc/dataset
   [["Aghanistan" 745 2666]
    ["Brazil" 37737 80488]
    ["China" 212258 213766]]
   {:column-names [:country 1999 2000]}))

It stores the number of cases in the population of a country in two different years. The data for each year are stored in a separate column.

cases-data

:_unnamed [3 3]:

:country	1999	2000
Aghanistan	745	2666
Brazil	37737	80488
China	212258	213766

pivot->longer places the data from two or more columns into one column, and their column names into another. In this case, we would like to transform cases-data by placing the data from columns 1999 and 2000 into a single column, called :cases, and also placing the column names into a new column called :year. This can be done by providing a list of all columns that will be pivoted.

(-> cases-data
    (tc/pivot->longer [1999 2000]
                      {:target-columns :year
                       :value-column-name :cases}))

:_unnamed [6 3]:

:country	:year	:cases
Aghanistan	1999	745
Brazil	1999	37737
China	1999	212258
Aghanistan	2000	2666
Brazil	2000	80488
China	2000	213766

Alternatively one can provide a predicate that selects the columns whose names agree with the predicate.

(-> cases-data
    (tc/pivot->longer (fn [column-name]
                        (not= column-name :country))
                      {:target-columns :year
                       :value-column-name :cases}))

:_unnamed [6 3]:

:country	:year	:cases
Aghanistan	1999	745
Brazil	1999	37737
China	1999	212258
Aghanistan	2000	2666
Brazil	2000	80488
China	2000	213766

A more clojure-y way of writing the same predicate would use a set of column names. We can also order the resulting dataset and save the result for later.

(def cases-longer
  (-> cases-data
      (tc/pivot->longer (complement #{:country})
                        {:target-columns :year
                         :value-column-name :cases})
      (tc/order-by [:country :year])))

For more examples of pivot->longer, see tablecloth’s pivot longer documentation.

pivot->wider is the reverse operation of pivot->longer: it spreads data into more columns. It needs as input a column (or columns) that holds the column names of the new dataset and another one that holds the value. We can return to our initial dataset like this:

(-> cases-longer
    (tc/pivot->wider [:year]
                     [:cases]))

:_unnamed [3 3]:

:country	1999	2000
Aghanistan	745	2666
Brazil	37737	80488
China	212258	213766

For additional examples with pivot-wider, see tablecloth’s pivot wider documentation.

7.3 Joining datasets

The following is inpired by R4DS Relational data / Mutating joins.

Joining datasets means putting them together, side by side. When we join, we specify one or more columns that they have in common and we would like them to share. The result of the join is a new dataset that has these shared columns and all other columns, aligned by the shared columns.

To demonstrate joins, we will use the dataset population-data in addition to cases-data. The former corresponds to table4b in chapter 12 of R4DS. It stores the population of a country for different years in different columns.

(def population-data
  (tc/dataset
   [["Aghanistan" 19987071 20595360]
    ["Brazil" 172006362 174504898]
    ["China" 1272915272 1280428583]]
   {:column-names [:country 1999 2000]}))

For our demonstration, we will pivot->longer that dataset.

(def population-longer
  (-> population-data
      (tc/pivot->longer (complement #{:country})
                        {:target-columns :year
                         :value-column-name :population})
      (tc/order-by [:country :year])))

population-longer

:_unnamed [6 3]:

:country	:year	:population
Aghanistan	1999	19987071
Aghanistan	2000	20595360
Brazil	1999	172006362
Brazil	2000	174504898
China	1999	1272915272
China	2000	1280428583

(tc/left-join
 cases-longer
 population-longer
 [:country :year])

left-outer-join [6 6]:

:country	:year	:cases	::_unnamed-right.country	::_unnamed-right.year	:population
Aghanistan	1999	745	Aghanistan	1999	19987071
Aghanistan	2000	2666	Aghanistan	2000	20595360
Brazil	1999	37737	Brazil	1999	172006362
Brazil	2000	80488	Brazil	2000	174504898
China	1999	212258	China	1999	1272915272
China	2000	213766	China	2000	1280428583

The new dataset holds all the columns from the old datasets. In its current implemenation, the left-join operation also copies the columns that we join over for the right dataset. This is done so that this information is not lost. In our example, the two datasets have the exact same values for the column that we join over, so we can drop them. Note that we need to order the dataset after this transformation.

(def population-and-cases
  (-> (tc/left-join
       cases-longer
       population-longer
       [:country :year])
      (tc/select-columns [:country :year :cases :population])
      (tc/order-by [:country :year])))

population-and-cases

left-outer-join [6 4]:

:country	:year	:cases	:population
Aghanistan	1999	745	19987071
Aghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

In a left-join like this one, the first data set remains unchanged and the second dataset is adapted accordingly. For example, if the second dataset misses some of the column values for the columns that we join over, we will have empty values in the corresponding rows. If it has additional column values, those will be dropped. The converse is true for a right join (function right-join) - the second dataset well remain unchanged. In addition, one can perform an inner-join, in which all rows not shared among the datasets are dropped, and a full-join, in which no row is dropped.

For additional examples, see tablecloth’s join/concat datasets documentation.

7.4 Concatenating datasets

The dataset other-cases-data is similar to cases-data.

(def other-cases-data
  (tc/dataset
   [["Aghanistan" 745 2666]
    ["Brunei" 32 123]
    ["Chile" 11234 10993]]
   {:column-names [:country 1999 2000]}))

other-cases-data

:_unnamed [3 3]:

:country	1999	2000
Aghanistan	745	2666
Brunei	32	123
Chile	11234	10993

We can concatenate the two datasets together:

(tc/concat cases-data other-cases-data)

:_unnamed [6 3]:

:country	1999	2000
Aghanistan	745	2666
Brazil	37737	80488
China	212258	213766
Aghanistan	745	2666
Brunei	32	123
Chile	11234	10993

This has the unfortunate effect of repeating the data for Afghanistan. To return only unique rows, use union:

(tc/union cases-data other-cases-data)

union [5 3]:

:country	1999	2000
Aghanistan	745	2666
Brazil	37737	80488
China	212258	213766
Brunei	32	123
Chile	11234	10993

We can also identify the common rows between the two datasets:

(tc/intersect cases-data other-cases-data)

intersection [1 3]:

:country	1999	2000
Aghanistan	745	2666

Finally, we can subtract the common rows from the first dataset:

(tc/difference cases-data other-cases-data)

difference [2 3]:

:country	1999	2000
Brazil	37737	80488
China	212258	213766

For additional examples, see tablecloth’s join/concat datasets documentation.

7.5 Joining and separating columns

The following is inpired by R4DS Tidy data / Separating and uniting.

Two other operations that are commonly performed are joining the values of two columns into a new column and separating one column into two new columns. We can join several columns into one with join-columns.

(tc/join-columns population-and-cases
                 :ratio [:cases :population])

left-outer-join [6 3]:

:country	:year	:ratio
Aghanistan	1999	745-19987071
Aghanistan	2000	2666-20595360
Brazil	1999	37737-172006362
Brazil	2000	80488-174504898
China	1999	212258-1272915272
China	2000	213766-1280428583

We can use any separator that we want. In this case, we will use a divisor “/”, because we compute a ratio. We will save this dataset for later.

(def ratio-data
  (tc/join-columns population-and-cases
                   :ratio [:cases :population]
                   {:separator "/"}))

ratio-data

left-outer-join [6 3]:

:country	:year	:ratio
Aghanistan	1999	745/19987071
Aghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

The reverse operation is separate-column. We can split the new dataset back into the old one.

(tc/separate-column ratio-data :ratio [:cases :population] "/")

left-outer-join [6 4]:

:country	:year	:cases	:population
Aghanistan	1999	745	19987071
Aghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

For additional examples, see tablecloth’s join/separate columns documentation

source: notebooks/noj_book/tablecloth_advanced_data_processing.clj