8  Data manipulation

(ns chapter-3-data-manipulation.3-data-manipulation
  ;; {:nextjournal.clerk/visibility {:code :hide}
  ;;  :nextjournal.clerk/toc true}
  (:require [tablecloth.api :as tc]
            [tech.v3.datatype.functional :as fun]
            [tech.v3.dataset.column :as tdsc]
            [tech.v3.datatype.rolling :as rolling]
            [clojure.string :as str]
            [fastmath.stats :as stats]
            [scicloj.kind-clerk.api :as kind-clerk]))
(kind-clerk/setup!)
:ok

This is a work in progress of the code examples that will make up chapter 3 of the Clojure data cookbook

Once data is loaded and ready to work with, here’s how to do some of the most common data manipulation tasks.

8.1 Sorting

(def dataset (tc/dataset [{:country "Canada"
                           :size 10000000}
                          {:country "USA"
                           :size 9000000}
                          {:country "Germany"
                           :size 80000}]))

8.1.1 Sorting columns

Give the column headers in the order you want

(-> dataset
    (tc/reorder-columns [:country :size]))

_unnamed [3 2]:

:country :size
Canada 10000000
USA 9000000
Germany 80000

8.1.2 Sorting rows

(-> dataset
    (tc/order-by [:size] [:desc]))

_unnamed [3 2]:

:country :size
Canada 10000000
USA 9000000
Germany 80000

8.1.3 Custom sorting functions

e.g. length of the country name

(-> dataset
    (tc/order-by (fn [row] (-> row :country count))
                 :desc))

_unnamed [3 2]:

:country :size
Germany 80000
Canada 10000000
USA 9000000

8.2 Selecting one column or multiple columns

(-> dataset
    (tc/select-columns [:country]))

_unnamed [3 1]:

:country
Canada
USA
Germany

8.3 Randomizing order

(-> dataset
    tc/shuffle)

_unnamed [3 2]:

:country :size
USA 9000000
Canada 10000000
Germany 80000

8.4 Repeatable randomisation

(-> dataset
    (tc/shuffle {:seed 100}))

_unnamed [3 2]:

:country :size
Canada 10000000
Germany 80000
USA 9000000

Finding unique rows

(def dupes (tc/dataset [{:country "Canada"
                         :size 10000000}
                        {:country "Canada"
                         :size 10000303}
                        {:country "United states"
                         :size 9000000}
                        {:country "United States"
                         :size 9000000}
                        {:country "Germany"
                         :size 80000}]))

(def “USA” #{“USA” “United States” “United states of America”}) https://scicloj.github.io/tablecloth/index.html#Unique

(-> dupes
    tc/unique-by)

_unnamed [5 2]:

:country :size
Canada 10000000
Canada 10000303
United states 9000000
United States 9000000
Germany 80000
(-> dupes
    (tc/unique-by :size))

_unnamed [4 2]:

:country :size
Canada 10000000
Canada 10000303
United states 9000000
Germany 80000
(-> dupes
    (tc/unique-by :country))

_unnamed [4 2]:

:country :size
Canada 10000000
United states 9000000
United States 9000000
Germany 80000
(-> dupes
    (tc/unique-by #(-> % :country str/lower-case)))

_unnamed [3 2]:

:country :size
Canada 10000000
United states 9000000
Germany 80000
(-> dupes
    (tc/unique-by #(-> % :country str/lower-case)
                  {:strategy (fn [vals]
                               (case (tdsc/column-name vals)
                                 :size (apply max vals)
                                 :country (last vals)))}))

_unnamed [3 2]:

:country :size
Canada 10000303
United States 9000000
Germany 80000

could use this to rename vals to a canonical one (e.g. convert everything that matches set of USA to “USA”) Adding computed columns to data “lengthening” or “widening” data, making it “tidy” e.g. converting a column with numbers to a category (>5 “yes”, <5 “no”), summing multiple columns into a new one

(-> dataset
    (tc/add-column :area [9000000 8000000 1000000]))

_unnamed [3 3]:

:country :size :area
Canada 10000000 9000000
USA 9000000 8000000
Germany 80000 1000000
(-> dataset
    (tc/add-column :population [40000000 100000000 80000000])
    (tc/rename-columns {:size :area})
    (tc/convert-types :population :double)
    (tc/add-column :density (fn [d]
                              (fun// (:population d) (:area d)))))

_unnamed [3 4]:

:country :area :population :density
Canada 10000000 4.0e07 4.00000000
USA 9000000 1.0e08 11.11111111
Germany 80000 8.0e07 1000.00000000

vs, probably preferable

(-> dataset
    (tc/add-column :population [40000000 100000000 80000000])
    (tc/rename-columns {:size :area})
    (tc/add-column :density (fn [ds]
                              (fun// (fun/* 1.0 (:population ds)) (:area ds)))))

_unnamed [3 4]:

:country :area :population :density
Canada 10000000 40000000 4.00000000
USA 9000000 100000000 11.11111111
Germany 80000 80000000 1000.00000000
  • Removing columns
(-> dataset
    (tc/drop-columns :size))

_unnamed [3 1]:

:country
Canada
USA
Germany
  • Transforming values
  • Working with nested data structures, really nice libraries in Clojure for doing this (specter, meander)
  • All values in a column
  • Conditional transformation (e.g. “truncate only 11 digit phone numbers to 10 digits”)
  • Rearranging order of columns
  • Renaming columns
  • Filtering rows
  • Single filter, multiple filters
(-> dataset
    (tc/select-rows (fn [row]
                      (< 1000000 (:size row)))))

_unnamed [2 2]:

:country :size
Canada 10000000
USA 9000000
  • Aggregating rows (counts, groups)
(def co2-over-time (tc/dataset "data/co2_over_time.csv"))
(-> co2-over-time
    (tc/aggregate {:average-co2 (fn [ds]
                                  (/ (reduce + (get ds "CO2"))
                                     (count (get ds "CO2"))))}))

_unnamed [1 1]:

:average-co2
355.31093117

Add a column for year

(-> co2-over-time
    (tc/map-columns "Year" "Date" (memfn getYear)))

data/co2_over_time.csv [741 4]:

Date CO2 adjusted CO2 Year
1958-03-01 315.70 314.44 1958
1958-04-01 317.46 315.16 1958
1958-05-01 317.51 314.71 1958
1958-07-01 315.86 315.19 1958
1958-08-01 314.93 316.19 1958
1958-09-01 313.21 316.08 1958
1958-11-01 313.33 315.20 1958
1958-12-01 314.67 315.43 1958
1959-01-01 315.58 315.54 1959
1959-02-01 316.49 315.86 1959
2019-06-01 413.96 411.38 2019
2019-07-01 411.85 411.03 2019
2019-08-01 410.08 411.62 2019
2019-09-01 408.55 412.06 2019
2019-10-01 408.43 412.06 2019
2019-11-01 410.29 412.56 2019
2019-12-01 411.85 412.78 2019
2020-01-01 413.37 413.32 2020
2020-02-01 414.09 413.33 2020
2020-03-01 414.51 412.94 2020
2020-04-01 416.18 413.35 2020

Group by year

(-> co2-over-time
    (tc/group-by (fn [row]
                   (.getYear (get row "Date")))))

_unnamed [63 3]:

:name :group-id :data
1958 0 Group: 1958 [8 3]:
1959 1 Group: 1959 [12 3]:
1960 2 Group: 1960 [12 3]:
1961 3 Group: 1961 [12 3]:
1962 4 Group: 1962 [12 3]:
1963 5 Group: 1963 [12 3]:
1964 6 Group: 1964 [9 3]:
1965 7 Group: 1965 [12 3]:
1966 8 Group: 1966 [12 3]:
1967 9 Group: 1967 [12 3]:
2010 52 Group: 2010 [12 3]:
2011 53 Group: 2011 [12 3]:
2012 54 Group: 2012 [12 3]:
2013 55 Group: 2013 [12 3]:
2014 56 Group: 2014 [12 3]:
2015 57 Group: 2015 [12 3]:
2016 58 Group: 2016 [12 3]:
2017 59 Group: 2017 [12 3]:
2018 60 Group: 2018 [12 3]:
2019 61 Group: 2019 [12 3]:
2020 62 Group: 2020 [4 3]:

Get average temp per year tablecloth applies the aggregate fn to every groups dataset

(defn round2
  "Round a double to the given precision (number of significant digits)"
  [precision d]
  (let [factor (Math/pow 10 precision)]
    (/ (Math/round (* d factor)) factor)))
(-> co2-over-time
    (tc/group-by (fn [row]
                   (.getYear (get row "Date"))))
    (tc/aggregate {:average-co2 (fn [ds]
                                  (round2 2
                                          (/ (reduce + (get ds "CO2"))
                                             (count (get ds "CO2")))))}))

_unnamed [63 2]:

:$group-name :average-co2
1958 315.33
1959 315.98
1960 316.91
1961 317.65
1962 318.45
1963 318.99
1964 319.20
1965 320.04
1966 321.37
1967 322.18
2010 389.90
2011 391.65
2012 393.87
2013 396.57
2014 398.61
2015 400.89
2016 404.28
2017 406.58
2018 408.59
2019 411.50
2020 414.54

Can rename the column to be more descriptive

(-> co2-over-time
    (tc/group-by (fn [row]
                   (.getYear (get row "Date"))))
    (tc/aggregate {:average-co2 (fn [ds]
                                  (/ (reduce + (get ds "CO2"))
                                     (count (get ds "CO2"))))})
    (tc/rename-columns {:$group-name :year}))

_unnamed [63 2]:

:year :average-co2
1958 315.33375000
1959 315.98166667
1960 316.90916667
1961 317.64500000
1962 318.45416667
1963 318.99250000
1964 319.20111111
1965 320.03583333
1966 321.36916667
1967 322.18083333
2010 389.90083333
2011 391.64833333
2012 393.87000000
2013 396.56666667
2014 398.61416667
2015 400.88500000
2016 404.27750000
2017 406.58416667
2018 408.58750000
2019 411.49500000
2020 414.53750000

Concatenating datasets

(def ds1 (tc/dataset [{:id "id1" :b "val1"}
                      {:id "id2" :b "val2"}
                      {:id "id3" :b "val3"}]))
(def ds2 (tc/dataset [{:id "id1" :b "val4"}
                      {:id "id5" :b "val5"}
                      {:id "id6" :b "val6"}]))

Naively concats rows

(tc/concat ds1 ds2 (tc/dataset [{:id "id3" :b "other value"}]))

_unnamed [7 2]:

:id :b
id1 val1
id2 val2
id3 val3
id1 val4
id5 val5
id6 val6
id3 other value
(tc/concat ds1 (tc/dataset [{:b "val4" :c "text"}
                            {:b "val5" :c "hi"}
                            {:b "val6" :c "test"}]))

_unnamed [6 3]:

:id :b :c
id1 val1
id2 val2
id3 val3
val4 text
val5 hi
val6 test

De-duping

(tc/union ds1 ds2)

union [6 2]:

:id :b
id1 val1
id2 val2
id3 val3
id1 val4
id5 val5
id6 val6
  • Merging datasets
  • When column headers are the same or different, on multiple columns TODO explain set logic and SQL joins
(def ds3 (tc/dataset {:id [1 2 3 4]
                      :b ["val1" "val2" "val3" "val4"]}))
(def ds4 (tc/dataset {:id [1 2 3 4]
                      :c ["val1" "val2" "val3" "val4"]}))

Keep all columns

(tc/full-join ds3 ds4 :id)

full-join [4 4]:

:id :b :right.id :c
1 val1 1 val1
2 val2 2 val2
3 val3 3 val3
4 val4 4 val4

“Merge” datasets on a given column where rows have a value

(tc/inner-join ds3 ds4 :id)

inner-join [4 3]:

:id :b :c
1 val1 val1
2 val2 val2
3 val3 val3
4 val4 val4

Drop rows missing a value

(tc/inner-join (tc/dataset {:id [1 2 3 4]
                      :b ["val1" "val2" "val3"]})
               (tc/dataset {:id [1 2 3 4]
                      :c ["val1" "val2" "val3" "val4"]})
               :id)

inner-join [4 3]:

:id :b :c
1 val1 val1
2 val2 val2
3 val3 val3
4 val4
(tc/right-join (tc/dataset {:id [1 2 3 ]
                            :b ["val1" "val2" "val3"]})
               (tc/dataset {:id [1 2 3 4]
                            :c ["val1" "val2" "val3" "val4"]})
               :id)

right-outer-join [4 4]:

:id :b :right.id :c
1 val1 1 val1
2 val2 2 val2
3 val3 3 val3
4 val4

scratch

(tc/left-join (tc/dataset {:email ["asdf"]
                            :name ["asdfads"]
                            :entry-id [1 2 3]})
               (tc/dataset {:entry-id [1 2 3]
                            :upload-count [2 3 4]
                            :catgory ["art" "science"]})
               :entry-id)

left-outer-join [3 6]:

:entry-id :email :name :right.entry-id :upload-count :catgory
1 asdf asdfads 1 2 art
2 2 3 science
3 3 4
(tc/dataset {:email ["asdf"]
             :name ["asdfads"]
             :entry-id [1 2 3]})

_unnamed [3 3]:

:email :name :entry-id
asdf asdfads 1
2
3
(tc/dataset {:entry-id [1 2 3]
             :upload-count [2 3 4]
             :catgory ["art" "science"]})

_unnamed [3 3]:

:entry-id :upload-count :catgory
1 2 art
2 3 science
3 4

see tablecloth join stuff Inner join, only keeps rows with the specified column value in common

(tc/inner-join ds1 ds2 :id)

inner-join [1 3]:

:id :b :right.b
id1 val1 val4
  • Converting between wide and long formats? Signal processing/time series analysis
  • Compute rolling average to be able to plot a trend line
(def exp-moving-avg
  (let [data (get co2-over-time "adjusted CO2")
        moving-avg
        (->> data
             (reduce (fn [acc next]
                       (conj acc (+ (* 0.9 (last acc)) (* 0.1 next))))
                     [(first data)])
             rest)]
    (tc/dataset [["Exponential moving average" moving-avg]])))
  • widen dataset to include new row that’s already in order
(tc/append co2-over-time exp-moving-avg)

data/co2_over_time.csv [741 4]:

Date CO2 adjusted CO2 Exponential moving average
1958-03-01 315.70 314.44 314.44000000
1958-04-01 317.46 315.16 314.51200000
1958-05-01 317.51 314.71 314.53180000
1958-07-01 315.86 315.19 314.59762000
1958-08-01 314.93 316.19 314.75685800
1958-09-01 313.21 316.08 314.88917220
1958-11-01 313.33 315.20 314.92025498
1958-12-01 314.67 315.43 314.97122948
1959-01-01 315.58 315.54 315.02810653
1959-02-01 316.49 315.86 315.11129588
2019-06-01 413.96 411.38 409.42307506
2019-07-01 411.85 411.03 409.58376755
2019-08-01 410.08 411.62 409.78739079
2019-09-01 408.55 412.06 410.01465172
2019-10-01 408.43 412.06 410.21918654
2019-11-01 410.29 412.56 410.45326789
2019-12-01 411.85 412.78 410.68594110
2020-01-01 413.37 413.32 410.94934699
2020-02-01 414.09 413.33 411.18741229
2020-03-01 414.51 412.94 411.36267106
2020-04-01 416.18 413.35 411.56140396
  • Rolling average over a 12 point range
(def rolling-average
  (tc/dataset [["Rolling average"
                (-> co2-over-time
                    (get "adjusted CO2")
                    (rolling/fixed-rolling-window 12
                                                  fun/mean
                                                  {:relative-window-position :left}))]]))
(tc/append co2-over-time rolling-average)

data/co2_over_time.csv [741 4]:

Date CO2 adjusted CO2 Rolling average
1958-03-01 315.70 314.44 314.44000000
1958-04-01 317.46 315.16 314.50000000
1958-05-01 317.51 314.71 314.52250000
1958-07-01 315.86 315.19 314.58500000
1958-08-01 314.93 316.19 314.73083333
1958-09-01 313.21 316.08 314.86750000
1958-11-01 313.33 315.20 314.93083333
1958-12-01 314.67 315.43 315.01333333
1959-01-01 315.58 315.54 315.10500000
1959-02-01 316.49 315.86 315.22333333
2019-06-01 413.96 411.38 410.14000000
2019-07-01 411.85 411.03 410.38583333
2019-08-01 410.08 411.62 410.63500000
2019-09-01 408.55 412.06 410.88333333
2019-10-01 408.43 412.06 411.08750000
2019-11-01 410.29 412.56 411.26916667
2019-12-01 411.85 412.78 411.48833333
2020-01-01 413.37 413.32 411.69250000
2020-02-01 414.09 413.33 411.89500000
2020-03-01 414.51 412.94 412.10166667
2020-04-01 416.18 413.35 412.32083333
  • Train a model to predict the next 10 years
(-> co2-over-time
    )

data/co2_over_time.csv [741 3]:

Date CO2 adjusted CO2
1958-03-01 315.70 314.44
1958-04-01 317.46 315.16
1958-05-01 317.51 314.71
1958-07-01 315.86 315.19
1958-08-01 314.93 316.19
1958-09-01 313.21 316.08
1958-11-01 313.33 315.20
1958-12-01 314.67 315.43
1959-01-01 315.58 315.54
1959-02-01 316.49 315.86
2019-06-01 413.96 411.38
2019-07-01 411.85 411.03
2019-08-01 410.08 411.62
2019-09-01 408.55 412.06
2019-10-01 408.43 412.06
2019-11-01 410.29 412.56
2019-12-01 411.85 412.78
2020-01-01 413.37 413.32
2020-02-01 414.09 413.33
2020-03-01 414.51 412.94
2020-04-01 416.18 413.35
  • Summarizing data (mean, standard deviation, confidence intervals etc.)
  • Standard deviation using fastmath
(def avg-co2-by-year
  (-> co2-over-time
      (tc/group-by (fn [row]
                     (.getYear (get row "Date"))))
      (tc/aggregate {:average-co2 (fn [ds]
                                    (stats/mean (get ds "adjusted CO2"))
                                    ;; (/ (reduce + (get ds "CO2"))
                                    ;;    (count (get ds "CO2")))
                                    )
                     :standard-deviation (fn [ds]
                                           (stats/stddev (get ds "adjusted CO2")))})
      ;; (tc/rename-columns {:$group-name :year})
      ))
  • Overall average
(stats/mean (:average-co2 avg-co2-by-year))
355.56414902998233
  • Long term average 1991-2020
(-> avg-co2-by-year
    ;; (tc/select-rows (fn [row] (< 1990 (:year row))))
    ;; :average-co2
    ;; mean
    )

_unnamed [63 3]:

:$group-name :average-co2 :standard-deviation
1958 315.30000000 0.60318204
1959 315.97750000 0.47259679
1960 316.90750000 0.42004599
1961 317.63833333 0.45170049
1962 318.44833333 0.37201743
1963 318.98750000 0.28813270
1964 319.67888889 0.20127372
1965 320.03083333 0.50883929
1966 321.36250000 0.37363388
1967 322.17500000 0.32326460
2010 389.89333333 0.67686891
2011 391.64500000 0.71908401
2012 393.86500000 0.87383689
2013 396.55833333 0.72002315
2014 398.60500000 0.68076828
2015 400.87833333 1.02130784
2016 404.27416667 0.95601881
2017 406.57750000 0.64441834
2018 408.58166667 0.99862481
2019 411.48833333 0.74410206
2020 413.23500000 0.19706175
  • Working with sequential data
  • Smoothing out data
  • Calculating a moving average
  • Averaging a sequence in blocks
  • Run length encoding?
  • Filling nil s with last non-nil value?
(def sparse-dataset
  (tc/dataset {:a [nil 2 3 4 nil nil 7 8]
               :b [10 11 12 nil nil nil 16 nil]}))
(-> sparse-dataset
    (tc/replace-missing :up))

_unnamed [8 2]:

:a :b
2 10
2 11
3 12
4 16
7 16
7 16
7 16
8
(-> sparse-dataset
    (tc/replace-missing :updown))

_unnamed [8 2]:

:a :b
2 10
2 11
3 12
4 16
7 16
7 16
7 16
8 16
(-> sparse-dataset
    (tc/replace-missing :down))

_unnamed [8 2]:

:a :b
10
2 11
3 12
4 12
4 12
4 12
7 16
8 16
(-> sparse-dataset
    (tc/replace-missing :downup))

_unnamed [8 2]:

:a :b
2 10
2 11
3 12
4 12
4 12
4 12
7 16
8 16
(-> sparse-dataset
    (tc/replace-missing :lerp))

_unnamed [8 2]:

:a :b
2.0 10.0
2.0 11.0
3.0 12.0
4.0 13.0
5.0 14.0
6.0 15.0
7.0 16.0
8.0 16.0
(-> sparse-dataset
    (tc/replace-missing :all :value 100))

_unnamed [8 2]:

:a :b
100 10
2 11
3 12
4 100
100 100
100 100
7 16
8 100
(-> sparse-dataset
    (tc/replace-missing :a :value 100))

_unnamed [8 2]:

:a :b
100 10
2 11
3 12
4
100
100
7 16
8
source: book/chapter_3_data_manipulation/3_data_manipulation.clj