8 Data manipulation

(ns chapter-3-data-manipulation.3-data-manipulation
  ;; {:nextjournal.clerk/visibility {:code :hide}
  ;;  :nextjournal.clerk/toc true}
  (:require [tablecloth.api :as tc]
            [tech.v3.datatype.functional :as fun]
            [tech.v3.dataset.column :as tdsc]
            [tech.v3.datatype.rolling :as rolling]
            [clojure.string :as str]
            [fastmath.stats :as stats]
            [scicloj.kind-clerk.api :as kind-clerk]))

(kind-clerk/setup!)

:ok

This is a work in progress of the code examples that will make up chapter 3 of the Clojure data cookbook

Once data is loaded and ready to work with, here’s how to do some of the most common data manipulation tasks.

8.1 Sorting

(def dataset (tc/dataset [{:country "Canada"
                           :size 10000000}
                          {:country "USA"
                           :size 9000000}
                          {:country "Germany"
                           :size 80000}]))

8.1.1 Sorting columns

Give the column headers in the order you want

(-> dataset
    (tc/reorder-columns [:country :size]))

_unnamed [3 2]:

:country	:size
Canada	10000000
USA	9000000
Germany	80000

8.1.2 Sorting rows

(-> dataset
    (tc/order-by [:size] [:desc]))

_unnamed [3 2]:

:country	:size
Canada	10000000
USA	9000000
Germany	80000

8.1.3 Custom sorting functions

e.g. length of the country name

(-> dataset
    (tc/order-by (fn [row] (-> row :country count))
                 :desc))

_unnamed [3 2]:

:country	:size
Germany	80000
Canada	10000000
USA	9000000

8.2 Selecting one column or multiple columns

(-> dataset
    (tc/select-columns [:country]))

_unnamed [3 1]:

:country
Canada
USA
Germany

8.3 Randomizing order

(-> dataset
    tc/shuffle)

_unnamed [3 2]:

:country	:size
USA	9000000
Canada	10000000
Germany	80000

8.4 Repeatable randomisation

(-> dataset
    (tc/shuffle {:seed 100}))

_unnamed [3 2]:

:country	:size
Canada	10000000
Germany	80000
USA	9000000

Finding unique rows

(def dupes (tc/dataset [{:country "Canada"
                         :size 10000000}
                        {:country "Canada"
                         :size 10000303}
                        {:country "United states"
                         :size 9000000}
                        {:country "United States"
                         :size 9000000}
                        {:country "Germany"
                         :size 80000}]))

(def “USA” #{“USA” “United States” “United states of America”}) https://scicloj.github.io/tablecloth/index.html#Unique

(-> dupes
    tc/unique-by)

_unnamed [5 2]:

:country	:size
Canada	10000000
Canada	10000303
United states	9000000
United States	9000000
Germany	80000

(-> dupes
    (tc/unique-by :size))

_unnamed [4 2]:

:country	:size
Canada	10000000
Canada	10000303
United states	9000000
Germany	80000

(-> dupes
    (tc/unique-by :country))

_unnamed [4 2]:

:country	:size
Canada	10000000
United states	9000000
United States	9000000
Germany	80000

(-> dupes
    (tc/unique-by #(-> % :country str/lower-case)))

_unnamed [3 2]:

:country	:size
Canada	10000000
United states	9000000
Germany	80000

(-> dupes
    (tc/unique-by #(-> % :country str/lower-case)
                  {:strategy (fn [vals]
                               (case (tdsc/column-name vals)
                                 :size (apply max vals)
                                 :country (last vals)))}))

_unnamed [3 2]:

:country	:size
Canada	10000303
United States	9000000
Germany	80000

could use this to rename vals to a canonical one (e.g. convert everything that matches set of USA to “USA”) Adding computed columns to data “lengthening” or “widening” data, making it “tidy” e.g. converting a column with numbers to a category (>5 “yes”, <5 “no”), summing multiple columns into a new one

(-> dataset
    (tc/add-column :area [9000000 8000000 1000000]))

_unnamed [3 3]:

:country	:size	:area
Canada	10000000	9000000
USA	9000000	8000000
Germany	80000	1000000

(-> dataset
    (tc/add-column :population [40000000 100000000 80000000])
    (tc/rename-columns {:size :area})
    (tc/convert-types :population :double)
    (tc/add-column :density (fn [d]
                              (fun// (:population d) (:area d)))))

_unnamed [3 4]:

:country	:area	:population	:density
Canada	10000000	4.0e07	4.00000000
USA	9000000	1.0e08	11.11111111
Germany	80000	8.0e07	1000.00000000

vs, probably preferable

(-> dataset
    (tc/add-column :population [40000000 100000000 80000000])
    (tc/rename-columns {:size :area})
    (tc/add-column :density (fn [ds]
                              (fun// (fun/* 1.0 (:population ds)) (:area ds)))))

_unnamed [3 4]:

:country	:area	:population	:density
Canada	10000000	40000000	4.00000000
USA	9000000	100000000	11.11111111
Germany	80000	80000000	1000.00000000

Removing columns

(-> dataset
    (tc/drop-columns :size))

_unnamed [3 1]:

:country
Canada
USA
Germany

Transforming values
Working with nested data structures, really nice libraries in Clojure for doing this (specter, meander)
All values in a column
Conditional transformation (e.g. “truncate only 11 digit phone numbers to 10 digits”)
Rearranging order of columns
Renaming columns
Filtering rows
Single filter, multiple filters

(-> dataset
    (tc/select-rows (fn [row]
                      (< 1000000 (:size row)))))

_unnamed [2 2]:

:country	:size
Canada	10000000
USA	9000000

Aggregating rows (counts, groups)

(def co2-over-time (tc/dataset "data/co2_over_time.csv"))

(-> co2-over-time
    (tc/aggregate {:average-co2 (fn [ds]
                                  (/ (reduce + (get ds "CO2"))
                                     (count (get ds "CO2"))))}))

_unnamed [1 1]:

:average-co2
355.31093117

Add a column for year

(-> co2-over-time
    (tc/map-columns "Year" "Date" (memfn getYear)))

data/co2_over_time.csv [741 4]:

Date	CO2	adjusted CO2	Year
1958-03-01	315.70	314.44	1958
1958-04-01	317.46	315.16	1958
1958-05-01	317.51	314.71	1958
1958-07-01	315.86	315.19	1958
1958-08-01	314.93	316.19	1958
1958-09-01	313.21	316.08	1958
1958-11-01	313.33	315.20	1958
1958-12-01	314.67	315.43	1958
1959-01-01	315.58	315.54	1959
1959-02-01	316.49	315.86	1959
…	…	…	…
2019-06-01	413.96	411.38	2019
2019-07-01	411.85	411.03	2019
2019-08-01	410.08	411.62	2019
2019-09-01	408.55	412.06	2019
2019-10-01	408.43	412.06	2019
2019-11-01	410.29	412.56	2019
2019-12-01	411.85	412.78	2019
2020-01-01	413.37	413.32	2020
2020-02-01	414.09	413.33	2020
2020-03-01	414.51	412.94	2020
2020-04-01	416.18	413.35	2020

Group by year

(-> co2-over-time
    (tc/group-by (fn [row]
                   (.getYear (get row "Date")))))

_unnamed [63 3]:

:name	:group-id	:data
1958	0	Group: 1958 [8 3]:
1959	1	Group: 1959 [12 3]:
1960	2	Group: 1960 [12 3]:
1961	3	Group: 1961 [12 3]:
1962	4	Group: 1962 [12 3]:
1963	5	Group: 1963 [12 3]:
1964	6	Group: 1964 [9 3]:
1965	7	Group: 1965 [12 3]:
1966	8	Group: 1966 [12 3]:
1967	9	Group: 1967 [12 3]:
…	…	…
2010	52	Group: 2010 [12 3]:
2011	53	Group: 2011 [12 3]:
2012	54	Group: 2012 [12 3]:
2013	55	Group: 2013 [12 3]:
2014	56	Group: 2014 [12 3]:
2015	57	Group: 2015 [12 3]:
2016	58	Group: 2016 [12 3]:
2017	59	Group: 2017 [12 3]:
2018	60	Group: 2018 [12 3]:
2019	61	Group: 2019 [12 3]:
2020	62	Group: 2020 [4 3]:

Get average temp per year tablecloth applies the aggregate fn to every groups dataset

(defn round2
  "Round a double to the given precision (number of significant digits)"
  [precision d]
  (let [factor (Math/pow 10 precision)]
    (/ (Math/round (* d factor)) factor)))

(-> co2-over-time
    (tc/group-by (fn [row]
                   (.getYear (get row "Date"))))
    (tc/aggregate {:average-co2 (fn [ds]
                                  (round2 2
                                          (/ (reduce + (get ds "CO2"))
                                             (count (get ds "CO2")))))}))

_unnamed [63 2]:

:$group-name	:average-co2
1958	315.33
1959	315.98
1960	316.91
1961	317.65
1962	318.45
1963	318.99
1964	319.20
1965	320.04
1966	321.37
1967	322.18
…	…
2010	389.90
2011	391.65
2012	393.87
2013	396.57
2014	398.61
2015	400.89
2016	404.28
2017	406.58
2018	408.59
2019	411.50
2020	414.54

Can rename the column to be more descriptive

(-> co2-over-time
    (tc/group-by (fn [row]
                   (.getYear (get row "Date"))))
    (tc/aggregate {:average-co2 (fn [ds]
                                  (/ (reduce + (get ds "CO2"))
                                     (count (get ds "CO2"))))})
    (tc/rename-columns {:$group-name :year}))

_unnamed [63 2]:

:year	:average-co2
1958	315.33375000
1959	315.98166667
1960	316.90916667
1961	317.64500000
1962	318.45416667
1963	318.99250000
1964	319.20111111
1965	320.03583333
1966	321.36916667
1967	322.18083333
…	…
2010	389.90083333
2011	391.64833333
2012	393.87000000
2013	396.56666667
2014	398.61416667
2015	400.88500000
2016	404.27750000
2017	406.58416667
2018	408.58750000
2019	411.49500000
2020	414.53750000

Concatenating datasets

(def ds1 (tc/dataset [{:id "id1" :b "val1"}
                      {:id "id2" :b "val2"}
                      {:id "id3" :b "val3"}]))

(def ds2 (tc/dataset [{:id "id1" :b "val4"}
                      {:id "id5" :b "val5"}
                      {:id "id6" :b "val6"}]))

Naively concats rows

(tc/concat ds1 ds2 (tc/dataset [{:id "id3" :b "other value"}]))

_unnamed [7 2]:

:id	:b
id1	val1
id2	val2
id3	val3
id1	val4
id5	val5
id6	val6
id3	other value

(tc/concat ds1 (tc/dataset [{:b "val4" :c "text"}
                            {:b "val5" :c "hi"}
                            {:b "val6" :c "test"}]))

_unnamed [6 3]:

:id	:b	:c
id1	val1
id2	val2
id3	val3
	val4	text
	val5	hi
	val6	test

De-duping

(tc/union ds1 ds2)

union [6 2]:

:id	:b
id1	val1
id2	val2
id3	val3
id1	val4
id5	val5
id6	val6

Merging datasets
When column headers are the same or different, on multiple columns TODO explain set logic and SQL joins

(def ds3 (tc/dataset {:id [1 2 3 4]
                      :b ["val1" "val2" "val3" "val4"]}))

(def ds4 (tc/dataset {:id [1 2 3 4]
                      :c ["val1" "val2" "val3" "val4"]}))

Keep all columns

(tc/full-join ds3 ds4 :id)

full-join [4 4]:

:id	:b	:right.id	:c
1	val1	1	val1
2	val2	2	val2
3	val3	3	val3
4	val4	4	val4

“Merge” datasets on a given column where rows have a value

(tc/inner-join ds3 ds4 :id)

inner-join [4 3]:

:id	:b	:c
1	val1	val1
2	val2	val2
3	val3	val3
4	val4	val4

Drop rows missing a value

(tc/inner-join (tc/dataset {:id [1 2 3 4]
                      :b ["val1" "val2" "val3"]})
               (tc/dataset {:id [1 2 3 4]
                      :c ["val1" "val2" "val3" "val4"]})
               :id)

inner-join [4 3]:

:id	:b	:c
1	val1	val1
2	val2	val2
3	val3	val3
4		val4

(tc/right-join (tc/dataset {:id [1 2 3 ]
                            :b ["val1" "val2" "val3"]})
               (tc/dataset {:id [1 2 3 4]
                            :c ["val1" "val2" "val3" "val4"]})
               :id)

right-outer-join [4 4]:

:id	:b	:right.id	:c
1	val1	1	val1
2	val2	2	val2
3	val3	3	val3
		4	val4

scratch

(tc/left-join (tc/dataset {:email ["asdf"]
                            :name ["asdfads"]
                            :entry-id [1 2 3]})
               (tc/dataset {:entry-id [1 2 3]
                            :upload-count [2 3 4]
                            :catgory ["art" "science"]})
               :entry-id)

left-outer-join [3 6]:

:entry-id	:email	:name	:right.entry-id	:upload-count	:catgory
1	asdf	asdfads	1	2	art
2			2	3	science
3			3	4

(tc/dataset {:email ["asdf"]
             :name ["asdfads"]
             :entry-id [1 2 3]})

_unnamed [3 3]:

:email	:name	:entry-id
asdf	asdfads	1
		2
		3

(tc/dataset {:entry-id [1 2 3]
             :upload-count [2 3 4]
             :catgory ["art" "science"]})

_unnamed [3 3]:

:entry-id	:upload-count	:catgory
1	2	art
2	3	science
3	4

see tablecloth join stuff Inner join, only keeps rows with the specified column value in common

(tc/inner-join ds1 ds2 :id)

inner-join [1 3]:

:id	:b	:right.b
id1	val1	val4

Converting between wide and long formats? Signal processing/time series analysis
Compute rolling average to be able to plot a trend line

(def exp-moving-avg
  (let [data (get co2-over-time "adjusted CO2")
        moving-avg
        (->> data
             (reduce (fn [acc next]
                       (conj acc (+ (* 0.9 (last acc)) (* 0.1 next))))
                     [(first data)])
             rest)]
    (tc/dataset [["Exponential moving average" moving-avg]])))

widen dataset to include new row that’s already in order

(tc/append co2-over-time exp-moving-avg)

data/co2_over_time.csv [741 4]:

Date	CO2	adjusted CO2	Exponential moving average
1958-03-01	315.70	314.44	314.44000000
1958-04-01	317.46	315.16	314.51200000
1958-05-01	317.51	314.71	314.53180000
1958-07-01	315.86	315.19	314.59762000
1958-08-01	314.93	316.19	314.75685800
1958-09-01	313.21	316.08	314.88917220
1958-11-01	313.33	315.20	314.92025498
1958-12-01	314.67	315.43	314.97122948
1959-01-01	315.58	315.54	315.02810653
1959-02-01	316.49	315.86	315.11129588
…	…	…	…
2019-06-01	413.96	411.38	409.42307506
2019-07-01	411.85	411.03	409.58376755
2019-08-01	410.08	411.62	409.78739079
2019-09-01	408.55	412.06	410.01465172
2019-10-01	408.43	412.06	410.21918654
2019-11-01	410.29	412.56	410.45326789
2019-12-01	411.85	412.78	410.68594110
2020-01-01	413.37	413.32	410.94934699
2020-02-01	414.09	413.33	411.18741229
2020-03-01	414.51	412.94	411.36267106
2020-04-01	416.18	413.35	411.56140396

Rolling average over a 12 point range

(def rolling-average
  (tc/dataset [["Rolling average"
                (-> co2-over-time
                    (get "adjusted CO2")
                    (rolling/fixed-rolling-window 12
                                                  fun/mean
                                                  {:relative-window-position :left}))]]))

(tc/append co2-over-time rolling-average)

data/co2_over_time.csv [741 4]:

Date	CO2	adjusted CO2	Rolling average
1958-03-01	315.70	314.44	314.44000000
1958-04-01	317.46	315.16	314.50000000
1958-05-01	317.51	314.71	314.52250000
1958-07-01	315.86	315.19	314.58500000
1958-08-01	314.93	316.19	314.73083333
1958-09-01	313.21	316.08	314.86750000
1958-11-01	313.33	315.20	314.93083333
1958-12-01	314.67	315.43	315.01333333
1959-01-01	315.58	315.54	315.10500000
1959-02-01	316.49	315.86	315.22333333
…	…	…	…
2019-06-01	413.96	411.38	410.14000000
2019-07-01	411.85	411.03	410.38583333
2019-08-01	410.08	411.62	410.63500000
2019-09-01	408.55	412.06	410.88333333
2019-10-01	408.43	412.06	411.08750000
2019-11-01	410.29	412.56	411.26916667
2019-12-01	411.85	412.78	411.48833333
2020-01-01	413.37	413.32	411.69250000
2020-02-01	414.09	413.33	411.89500000
2020-03-01	414.51	412.94	412.10166667
2020-04-01	416.18	413.35	412.32083333

Train a model to predict the next 10 years

(-> co2-over-time
    )

data/co2_over_time.csv [741 3]:

Date	CO2	adjusted CO2
1958-03-01	315.70	314.44
1958-04-01	317.46	315.16
1958-05-01	317.51	314.71
1958-07-01	315.86	315.19
1958-08-01	314.93	316.19
1958-09-01	313.21	316.08
1958-11-01	313.33	315.20
1958-12-01	314.67	315.43
1959-01-01	315.58	315.54
1959-02-01	316.49	315.86
…	…	…
2019-06-01	413.96	411.38
2019-07-01	411.85	411.03
2019-08-01	410.08	411.62
2019-09-01	408.55	412.06
2019-10-01	408.43	412.06
2019-11-01	410.29	412.56
2019-12-01	411.85	412.78
2020-01-01	413.37	413.32
2020-02-01	414.09	413.33
2020-03-01	414.51	412.94
2020-04-01	416.18	413.35

Summarizing data (mean, standard deviation, confidence intervals etc.)
Standard deviation using fastmath

(def avg-co2-by-year
  (-> co2-over-time
      (tc/group-by (fn [row]
                     (.getYear (get row "Date"))))
      (tc/aggregate {:average-co2 (fn [ds]
                                    (stats/mean (get ds "adjusted CO2"))
                                    ;; (/ (reduce + (get ds "CO2"))
                                    ;;    (count (get ds "CO2")))
                                    )
                     :standard-deviation (fn [ds]
                                           (stats/stddev (get ds "adjusted CO2")))})
      ;; (tc/rename-columns {:$group-name :year})
      ))

Overall average

(stats/mean (:average-co2 avg-co2-by-year))

355.56414902998233

Long term average 1991-2020

(-> avg-co2-by-year
    ;; (tc/select-rows (fn [row] (< 1990 (:year row))))
    ;; :average-co2
    ;; mean
    )

_unnamed [63 3]:

:$group-name	:average-co2	:standard-deviation
1958	315.30000000	0.60318204
1959	315.97750000	0.47259679
1960	316.90750000	0.42004599
1961	317.63833333	0.45170049
1962	318.44833333	0.37201743
1963	318.98750000	0.28813270
1964	319.67888889	0.20127372
1965	320.03083333	0.50883929
1966	321.36250000	0.37363388
1967	322.17500000	0.32326460
…	…	…
2010	389.89333333	0.67686891
2011	391.64500000	0.71908401
2012	393.86500000	0.87383689
2013	396.55833333	0.72002315
2014	398.60500000	0.68076828
2015	400.87833333	1.02130784
2016	404.27416667	0.95601881
2017	406.57750000	0.64441834
2018	408.58166667	0.99862481
2019	411.48833333	0.74410206
2020	413.23500000	0.19706175

Working with sequential data
Smoothing out data
Calculating a moving average
Averaging a sequence in blocks
Run length encoding?
Filling nil s with last non-nil value?

(def sparse-dataset
  (tc/dataset {:a [nil 2 3 4 nil nil 7 8]
               :b [10 11 12 nil nil nil 16 nil]}))

(-> sparse-dataset
    (tc/replace-missing :up))

_unnamed [8 2]:

:a	:b
2	10
2	11
3	12
4	16
7	16
7	16
7	16
8

(-> sparse-dataset
    (tc/replace-missing :updown))

_unnamed [8 2]:

:a	:b
2	10
2	11
3	12
4	16
7	16
7	16
7	16
8	16

(-> sparse-dataset
    (tc/replace-missing :down))

_unnamed [8 2]:

:a	:b
	10
2	11
3	12
4	12
4	12
4	12
7	16
8	16

(-> sparse-dataset
    (tc/replace-missing :downup))

_unnamed [8 2]:

:a	:b
2	10
2	11
3	12
4	12
4	12
4	12
7	16
8	16

(-> sparse-dataset
    (tc/replace-missing :lerp))

_unnamed [8 2]:

:a	:b
2.0	10.0
2.0	11.0
3.0	12.0
4.0	13.0
5.0	14.0
6.0	15.0
7.0	16.0
8.0	16.0

(-> sparse-dataset
    (tc/replace-missing :all :value 100))

_unnamed [8 2]:

:a	:b
100	10
2	11
3	12
4	100
100	100
100	100
7	16
8	100

(-> sparse-dataset
    (tc/replace-missing :a :value 100))

_unnamed [8 2]:

:a	:b
100	10
2	11
3	12
4
100
100
7	16
8

source: book/chapter_3_data_manipulation/3_data_manipulation.clj