tech-ml-version

“7.007”

tablecloth-version

“7.007”

Introduction

tech.ml.dataset is a great and fast library which brings columnar dataset to the Clojure. Chris Nuernberger has been working on this library for last year as a part of bigger tech.ml stack.

I’ve started to test the library and help to fix uncovered bugs. My main goal was to compare functionalities with the other standards from other platforms. I focused on R solutions: dplyr, tidyr and data.table.

During conversions of the examples I’ve come up how to reorganized existing tech.ml.dataset functions into simple to use API. The main goals were:

  • Focus on dataset manipulation functionality, leaving other parts of tech.ml like pipelines, datatypes, readers, ML, etc.
  • Single entry point for common operations - one function dispatching on given arguments.
  • group-by results with special kind of dataset - a dataset containing subsets created after grouping as a column.
  • Most operations recognize regular dataset and grouped dataset and process data accordingly.
  • One function form to enable thread-first on dataset.

If you want to know more about tech.ml.dataset and dtype-next please refer their documentation:

SOURCE CODE

Join the discussion on Zulip

Let’s require main namespace and define dataset used in most examples:

(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.functional :as dfn])
(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
                      :V2 (range 1 10)
                      :V3 (take 9 (cycle [0.5 1.0 1.5]))
                      :V4 (take 9 (cycle ["A" "B" "C"]))}))
DS

_unnamed [9 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
2 2 1.0 B
1 3 1.5 C
2 4 0.5 A
1 5 1.0 B
2 6 1.5 C
1 7 0.5 A
2 8 1.0 B
1 9 1.5 C

Functionality

Dataset

Dataset is a special type which can be considered as a map of columns implemented around tech.ml.dataset library. Each column can be considered as named sequence of typed data. Supported types include integers, floats, string, boolean, date/time, objects etc.

Dataset creation

Dataset can be created from various of types of Clojure structures and files:

  • single values
  • sequence of maps
  • map of sequences or values
  • sequence of columns (taken from other dataset or created manually)
  • sequence of pairs: [string column-data] or [keyword column-data]
  • array of any arrays
  • file types: raw/gzipped csv/tsv, json, xls(x) taken from local file system or URL
  • input stream

tc/dataset accepts:

  • data
  • options (see documentation of tech.ml.dataset/->dataset function for full list):
    • :dataset-name - name of the dataset
    • :num-rows - number of rows to read from file
    • :header-row? - indication if first row in file is a header
    • :key-fn - function applied to column names (eg. keyword, to convert column names to keywords)
    • :separator - column separator
    • :single-value-column-name - name of the column when single value is provided
    • :column-names - in case you want to name columns - only works for sequential input (arrays) or empty dataset
    • :layout - for numerical, native array of arrays - treat entries :as-rows or :as-columns (default)

tc/let-dataset accepts bindings symbol-column-data to simulate R’s tibble function. Each binding is converted into a column. You can refer previous columns to in further bindings (as in let).


Empty dataset.

(tc/dataset)
_unnamed [0 0]

Empty dataset with column names

(tc/dataset nil {:column-names [:a :b]})
_unnamed [0 2]:

| :a | :b |
|----|----|

Sequence of pairs (first = column name, second = value(s)).

(tc/dataset [[:A 33] [:B 5] [:C :a]])

_unnamed [1 3]:

:A :B :C
33 5 :a

Not sequential values are repeated row-count number of times.

(tc/dataset [[:A [1 2 3 4 5 6]] [:B "X"] [:C :a]])

_unnamed [6 3]:

:A :B :C
1 X :a
2 X :a
3 X :a
4 X :a
5 X :a
6 X :a

Dataset created from map (keys = column names, vals = value(s)). Works the same as sequence of pairs.

(tc/dataset {:A 33})
(tc/dataset {:A [1 2 3]})
(tc/dataset {:A [3 4 5] :B "X"})

_unnamed [1 1]:

:A
33

_unnamed [3 1]:

:A
1
2
3

_unnamed [3 2]:

:A :B
3 X
4 X
5 X

You can put any value inside a column

(tc/dataset {:A [[3 4 5] [:a :b]] :B "X"})

_unnamed [2 2]:

:A :B
[3 4 5] X
[:a :b] X

Sequence of maps

(tc/dataset [{:a 1 :b 3} {:b 2 :a 99}])
(tc/dataset [{:a 1 :b [1 2 3]} {:a 2 :b [3 4]}])

_unnamed [2 2]:

:a :b
1 3
99 2

_unnamed [2 2]:

:a :b
1 [1 2 3]
2 [3 4]

Missing values are marked by nil

(tc/dataset [{:a nil :b 1} {:a 3 :b 4} {:a 11}])

_unnamed [3 2]:

:a :b
1
3 4
11

Reading from arrays, by default :as-rows

(-> (map int-array [[1 2] [3 4] [5 6]])
    (into-array)
    (tc/dataset))

:_unnamed [3 2]:

0 1
1 2
3 4
5 6

:as-columns

(-> (map int-array [[1 2] [3 4] [5 6]])
    (into-array)
    (tc/dataset {:layout :as-columns}))

:_unnamed [2 3]:

0 1 2
1 3 5
2 4 6

:as-rows with names

(-> (map int-array [[1 2] [3 4] [5 6]])
    (into-array)
    (tc/dataset {:layout :as-rows
                 :column-names [:a :b]}))

:_unnamed [3 2]:

:a :b
1 2
3 4
5 6

Any objects

(-> (map to-array [[:a :z] ["ee" "ww"] [9 10]])
    (into-array)
    (tc/dataset {:column-names [:a :b :c]
                 :layout :as-columns}))

:_unnamed [2 3]:

:a :b :c
:a ee 9
:z ww 10

Create dataset using macro let-dataset to simulate R tibble function. Each binding is converted into a column.

(tc/let-dataset [x (range 1 6)
                  y 1
                  z (dfn/+ x y)])

_unnamed [5 3]:

:x :y :z
1 1 2
2 1 3
3 1 4
4 1 5
5 1 6

Import CSV file

(tc/dataset "data/family.csv")

data/family.csv [5 5]:

family dob_child1 dob_child2 gender_child1 gender_child2
1 1998-11-26 2000-01-29 1 2
2 1996-06-22 2
3 2002-07-11 2004-04-05 2 2
4 2004-10-10 2009-08-27 1 1
5 2000-12-05 2005-02-28 2 1

Import from URL

(defonce ds (tc/dataset "https://vega.github.io/vega-lite/examples/data/seattle-weather.csv"))
ds

https://vega.github.io/vega-lite/examples/data/seattle-weather.csv [1461 6]:

date precipitation temp_max temp_min wind weather
2012-01-01 0.0 12.8 5.0 4.7 drizzle
2012-01-02 10.9 10.6 2.8 4.5 rain
2012-01-03 0.8 11.7 7.2 2.3 rain
2012-01-04 20.3 12.2 5.6 4.7 rain
2012-01-05 1.3 8.9 2.8 6.1 rain
2012-01-06 2.5 4.4 2.2 2.2 rain
2012-01-07 0.0 7.2 2.8 2.3 rain
2012-01-08 0.0 10.0 2.8 2.0 sun
2012-01-09 4.3 9.4 5.0 3.4 rain
2012-01-10 1.0 6.1 0.6 3.4 rain
2015-12-21 27.4 5.6 2.8 4.3 rain
2015-12-22 4.6 7.8 2.8 5.0 rain
2015-12-23 6.1 5.0 2.8 7.6 rain
2015-12-24 2.5 5.6 2.2 4.3 rain
2015-12-25 5.8 5.0 2.2 1.5 rain
2015-12-26 0.0 4.4 0.0 2.5 sun
2015-12-27 8.6 4.4 1.7 2.9 rain
2015-12-28 1.5 5.0 1.7 1.3 rain
2015-12-29 0.0 7.2 0.6 2.6 fog
2015-12-30 0.0 5.6 -1.0 3.4 sun
2015-12-31 0.0 5.6 -2.1 3.5 sun

When none of above works, singleton dataset is created. Along with the error message from the exception thrown by tech.ml.dataset

(tc/dataset 999)

_unnamed [1 2]:

:\(value | :\)error
999 Don’t know how to create ISeq from: java.lang.Long

To see the stack trace, turn it on by setting :stack-trace? to true.


Set column name for single value. Also set the dataset name and turn off creating error message column.

(tc/dataset 999 {:single-value-column-name "my-single-value"
                 :error-column? false})
(tc/dataset 999 {:single-value-column-name ""
                 :dataset-name "Single value"
                 :error-column? false})

_unnamed [1 1]:

my-single-value
999

Single value [1 1]:

0
999

Saving

Export dataset to a file or output stream can be done by calling tc/write!. Function accepts:

  • dataset
  • file name with one of the extensions: .csv, .tsv, .csv.gz and .tsv.gz or output stream
  • options:
  • :separator - string or separator char.
(tc/write! ds "output.tsv.gz")
(.exists (clojure.java.io/file "output.tsv.gz"))
1462
true
Nippy
(tc/write! DS "output.nippy.gz")
nil
(tc/dataset "output.nippy.gz")

output.nippy.gz [9 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
2 2 1.0 B
1 3 1.5 C
2 4 0.5 A
1 5 1.0 B
2 6 1.5 C
1 7 0.5 A
2 8 1.0 B
1 9 1.5 C

Columns and rows

Get columns and rows as sequences. column, columns and rows treat grouped dataset as regular one. See Groups to read more about grouped datasets.

Possible result types:

  • :as-seq or :as-seqs - sequence of seqences (default)
  • :as-maps - sequence of maps (rows)
  • :as-map - map of sequences (columns)
  • :as-double-arrays - array of double arrays
  • :as-vecs - sequence of vectors (rows)

For rows setting :nil-missing? option to false will elide keys for nil values.


Select column.

(ds "wind")
(tc/column ds "date")
#tech.v3.dataset.column<float64>[1461]
wind
[4.700, 4.500, 2.300, 4.700, 6.100, 2.200, 2.300, 2.000, 3.400, 3.400, 5.100, 1.900, 1.300, 5.300, 3.200, 5.000, 5.600, 5.000, 1.600, 2.300...]
#tech.v3.dataset.column<packed-local-date>[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...]

Columns as sequence

(take 2 (tc/columns ds))
(#tech.v3.dataset.column<packed-local-date>[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...] #tech.v3.dataset.column<float64>[1461]
precipitation
[0.000, 10.90, 0.8000, 20.30, 1.300, 2.500, 0.000, 0.000, 4.300, 1.000, 0.000, 0.000, 0.000, 4.100, 5.300, 2.500, 8.100, 19.80, 15.20, 13.50...])

Columns as map

(keys (tc/columns ds :as-map))
("date" "precipitation" "temp_max" "temp_min" "wind" "weather")

Rows as sequence of sequences

(take 2 (tc/rows ds))
([#object[java.time.LocalDate 0x3ad18613 "2012-01-01"] 0.0 12.8 5.0 4.7 "drizzle"] [#object[java.time.LocalDate 0x63f0fc93 "2012-01-02"] 10.9 10.6 2.8 4.5 "rain"])

Select rows/columns as double-double-array

(-> ds
    (tc/select-columns :type/numerical)
    (tc/head)
    (tc/rows :as-double-arrays))
#object["[[D" 0x7913631e "[[D@7913631e"]
(-> ds
    (tc/select-columns :type/numerical)
    (tc/head)
    (tc/columns :as-double-arrays))
#object["[[D" 0x33ac292c "[[D@33ac292c"]

Rows as sequence of maps

(clojure.pprint/pprint (take 2 (tc/rows ds :as-maps)))
({"date" #object[java.time.LocalDate 0x46e26443 "2012-01-01"],
  "precipitation" 0.0,
  "temp_max" 12.8,
  "temp_min" 5.0,
  "wind" 4.7,
  "weather" "drizzle"}
 {"date" #object[java.time.LocalDate 0x1f2b8ae2 "2012-01-02"],
  "precipitation" 10.9,
  "temp_max" 10.6,
  "temp_min" 2.8,
  "wind" 4.5,
  "weather" "rain"})

Rows with missing values

(-> {:a [1 nil 2]
     :b [3 4 nil]}
    (tc/dataset)
    (tc/rows :as-maps))
[{:a 1, :b 3} {:a nil, :b 4} {:a 2, :b nil}]

Rows with elided missing values

(-> {:a [1 nil 2]
     :b [3 4 nil]}
    (tc/dataset)
    (tc/rows :as-maps {:nil-missing? false}))
[{:a 1, :b 3} {:b 4} {:a 2}]

Single entry

Get single value from the table using get-in from Clojure API or get-entry. First argument is column name, second is row number.

(get-in ds ["wind" 2])
2.3
(tc/get-entry ds "wind" 2)
2.3

Printing

Dataset is printed using dataset->str or print-dataset functions. Options are the same as in tech.ml.dataset/dataset-data->str. Most important is :print-line-policy which can be one of the: :single, :repl or :markdown.

(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :markdown})
_unnamed [2 3]:

| :name | :group-id |                                                                                                                                                                                                                                                             :data |
|------:|----------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     1 |         0 | Group: 1 [5 4]:<br><br>\| :V1 \| :V2 \| :V3 \| :V4 \|<br>\|----:\|----:\|----:\|-----\|<br>\|   1 \|   1 \| 0.5 \|   A \|<br>\|   1 \|   3 \| 1.5 \|   C \|<br>\|   1 \|   5 \| 1.0 \|   B \|<br>\|   1 \|   7 \| 0.5 \|   A \|<br>\|   1 \|   9 \| 1.5 \|   C \| |
|     2 |         1 |                                   Group: 2 [4 4]:<br><br>\| :V1 \| :V2 \| :V3 \| :V4 \|<br>\|----:\|----:\|----:\|-----\|<br>\|   2 \|   2 \| 1.0 \|   B \|<br>\|   2 \|   4 \| 0.5 \|   A \|<br>\|   2 \|   6 \| 1.5 \|   C \|<br>\|   2 \|   8 \| 1.0 \|   B \| |
(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :repl})
_unnamed [2 3]:

| :name | :group-id |                          :data |
|------:|----------:|--------------------------------|
|     1 |         0 | Group: 1 [5 4]:                |
|       |           |                                |
|       |           | \| :V1 \| :V2 \| :V3 \| :V4 \| |
|       |           | \|----:\|----:\|----:\|-----\| |
|       |           | \|   1 \|   1 \| 0.5 \|   A \| |
|       |           | \|   1 \|   3 \| 1.5 \|   C \| |
|       |           | \|   1 \|   5 \| 1.0 \|   B \| |
|       |           | \|   1 \|   7 \| 0.5 \|   A \| |
|       |           | \|   1 \|   9 \| 1.5 \|   C \| |
|     2 |         1 | Group: 2 [4 4]:                |
|       |           |                                |
|       |           | \| :V1 \| :V2 \| :V3 \| :V4 \| |
|       |           | \|----:\|----:\|----:\|-----\| |
|       |           | \|   2 \|   2 \| 1.0 \|   B \| |
|       |           | \|   2 \|   4 \| 0.5 \|   A \| |
|       |           | \|   2 \|   6 \| 1.5 \|   C \| |
|       |           | \|   2 \|   8 \| 1.0 \|   B \| |
(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :single})
_unnamed [2 3]:

| :name | :group-id |           :data |
|------:|----------:|-----------------|
|     1 |         0 | Group: 1 [5 4]: |
|     2 |         1 | Group: 2 [4 4]: |

Group-by

Grouping by is an operation which splits dataset into subdatasets and pack it into new special type of… dataset. I distinguish two types of dataset: regular dataset and grouped dataset. The latter is the result of grouping.

Grouped dataset is annotated in by :grouped? meta tag and consist following columns:

  • :name - group name or structure
  • :group-id - integer assigned to the group
  • :data - groups as datasets

Almost all functions recognize type of the dataset (grouped or not) and operate accordingly.

You can’t apply reshaping or join/concat functions on grouped datasets.

Grouping

Grouping is done by calling group-by function with arguments:

  • ds - dataset
  • grouping-selector - what to use for grouping
  • options:
  • :result-type - what to return:
  • :as-dataset (default) - return grouped dataset
  • :as-indexes - return rows ids (row number from original dataset)
  • :as-map - return map with group names as keys and subdataset as values
  • :as-seq - return sequens of subdatasets
  • :select-keys - list of the columns passed to a grouping selector function

All subdatasets (groups) have set name as the group name, additionally group-id is in meta.

Grouping can be done by:

  • single column name
  • seq of column names
  • map of keys (group names) and row indexes
  • value returned by function taking row as map (limited to :select-keys)

Note: currently dataset inside dataset is printed recursively so it renders poorly from markdown. So I will use :as-seq result type to show just group names and groups.


List of columns in grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/column-names))
(:V1 :V2 :V3 :V4)

List of columns in grouped dataset treated as regular dataset

(-> DS
    (tc/group-by :V1)
    (tc/as-regular-dataset)
    (tc/column-names))
(:name :group-id :data)

Content of the grouped dataset

(tc/columns (tc/group-by DS :V1) :as-map)
{:name #tech.v3.dataset.column<int64>[2]
:name
[1, 2], :group-id #tech.v3.dataset.column<int64>[2]
:group-id
[0, 1], :data #tech.v3.dataset.column<dataset>[2]
:data
[Group: 1 [5 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   3 | 1.5 |   C |
|   1 |   5 | 1.0 |   B |
|   1 |   7 | 0.5 |   A |
|   1 |   9 | 1.5 |   C |
, Group: 2 [4 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   2 |   4 | 0.5 |   A |
|   2 |   6 | 1.5 |   C |
|   2 |   8 | 1.0 |   B |
]}

Grouped dataset as map

(keys (tc/group-by DS :V1 {:result-type :as-map}))
(1 2)
(vals (tc/group-by DS :V1 {:result-type :as-map}))

(Group: 1 [5 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
1 3 1.5 C
1 5 1.0 B
1 7 0.5 A
1 9 1.5 C

Group: 2 [4 4]:

:V1 :V2 :V3 :V4
2 2 1.0 B
2 4 0.5 A
2 6 1.5 C
2 8 1.0 B

)


Group dataset as map of indexes (row ids)

(tc/group-by DS :V1 {:result-type :as-indexes})
{1 [0 2 4 6 8], 2 [1 3 5 7]}

Grouped datasets are printed as follows by default.

(tc/group-by DS :V1)

_unnamed [2 3]:

:name :group-id :data
1 0 Group: 1 [5 4]:
2 1 Group: 2 [4 4]:

To get groups as sequence or a map can be done from grouped dataset using groups->seq and groups->map functions.

Groups as seq can be obtained by just accessing :data column.

I will use temporary dataset here.

(let [ds (-> {"a" [1 1 2 2]
              "b" ["a" "b" "c" "d"]}
             (tc/dataset)
             (tc/group-by "a"))]
  (seq (ds :data))) ;; seq is not necessary but Markdown treats `:data` as command here

(Group: 1 [2 2]:

a b
1 a
1 b

Group: 2 [2 2]:

a b
2 c
2 d

)

(-> {"a" [1 1 2 2]
     "b" ["a" "b" "c" "d"]}
    (tc/dataset)
    (tc/group-by "a")
    (tc/groups->seq))

(Group: 1 [2 2]:

a b
1 a
1 b

Group: 2 [2 2]:

a b
2 c
2 d

)


Groups as map

(-> {"a" [1 1 2 2]
     "b" ["a" "b" "c" "d"]}
    (tc/dataset)
    (tc/group-by "a")
    (tc/groups->map))

{1 Group: 1 [2 2]:

a b
1 a
1 b

, 2 Group: 2 [2 2]:

a b
2 c
2 d

}


Grouping by more than one column. You can see that group names are maps. When ungrouping is done these maps are used to restore column names.

(tc/group-by DS [:V1 :V3] {:result-type :as-seq})

(Group: {:V1 1, :V3 0.5} [2 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
1 7 0.5 A

Group: {:V1 2, :V3 1.0} [2 4]:

:V1 :V2 :V3 :V4
2 2 1.0 B
2 8 1.0 B

Group: {:V1 1, :V3 1.5} [2 4]:

:V1 :V2 :V3 :V4
1 3 1.5 C
1 9 1.5 C

Group: {:V1 2, :V3 0.5} [1 4]:

:V1 :V2 :V3 :V4
2 4 0.5 A

Group: {:V1 1, :V3 1.0} [1 4]:

:V1 :V2 :V3 :V4
1 5 1.0 B

Group: {:V1 2, :V3 1.5} [1 4]:

:V1 :V2 :V3 :V4
2 6 1.5 C

)


Grouping can be done by providing just row indexes. This way you can assign the same row to more than one group.

(tc/group-by DS {"group-a" [1 2 1 2]
                  "group-b" [5 5 5 1]} {:result-type :as-seq})

(Group: group-a [4 4]:

:V1 :V2 :V3 :V4
2 2 1.0 B
1 3 1.5 C
2 2 1.0 B
1 3 1.5 C

Group: group-b [4 4]:

:V1 :V2 :V3 :V4
2 6 1.5 C
2 6 1.5 C
2 6 1.5 C
2 2 1.0 B

)


You can group by a result of grouping function which gets row as map and should return group name. When map is used as a group name, ungrouping restore original column names.

(tc/group-by DS (fn [row] (* (:V1 row)
                             (:V3 row))) {:result-type :as-seq})

(Group: 0.5 [2 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
1 7 0.5 A

Group: 2.0 [2 4]:

:V1 :V2 :V3 :V4
2 2 1.0 B
2 8 1.0 B

Group: 1.5 [2 4]:

:V1 :V2 :V3 :V4
1 3 1.5 C
1 9 1.5 C

Group: 1.0 [2 4]:

:V1 :V2 :V3 :V4
2 4 0.5 A
1 5 1.0 B

Group: 3.0 [1 4]:

:V1 :V2 :V3 :V4
2 6 1.5 C

)


You can use any predicate on column to split dataset into two groups.

(tc/group-by DS (comp #(< % 1.0) :V3) {:result-type :as-seq})

(Group: true [3 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
2 4 0.5 A
1 7 0.5 A

Group: false [6 4]:

:V1 :V2 :V3 :V4
2 2 1.0 B
1 3 1.5 C
1 5 1.0 B
2 6 1.5 C
2 8 1.0 B
1 9 1.5 C

)


juxt is also helpful

(tc/group-by DS (juxt :V1 :V3) {:result-type :as-seq})

(Group: [1 0.5] [2 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
1 7 0.5 A

Group: [2 1.0] [2 4]:

:V1 :V2 :V3 :V4
2 2 1.0 B
2 8 1.0 B

Group: [1 1.5] [2 4]:

:V1 :V2 :V3 :V4
1 3 1.5 C
1 9 1.5 C

Group: [2 0.5] [1 4]:

:V1 :V2 :V3 :V4
2 4 0.5 A

Group: [1 1.0] [1 4]:

:V1 :V2 :V3 :V4
1 5 1.0 B

Group: [2 1.5] [1 4]:

:V1 :V2 :V3 :V4
2 6 1.5 C

)


tech.ml.dataset provides an option to limit columns which are passed to grouping functions. It’s done for performance purposes.

(tc/group-by DS identity {:result-type :as-seq
                           :select-keys [:V1]})

(Group: {:V1 1} [5 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
1 3 1.5 C
1 5 1.0 B
1 7 0.5 A
1 9 1.5 C

Group: {:V1 2} [4 4]:

:V1 :V2 :V3 :V4
2 2 1.0 B
2 4 0.5 A
2 6 1.5 C
2 8 1.0 B

)

Ungrouping

Ungrouping simply concats all the groups into the dataset. Following options are possible

  • :order? - order groups according to the group name ascending order. Default: false
  • :add-group-as-column - should group name become a column? If yes column is created with provided name (or :$group-name if argument is true). Default: nil.
  • :add-group-id-as-column - should group id become a column? If yes column is created with provided name (or :$group-id if argument is true). Default: nil.
  • :dataset-name - to name resulting dataset. Default: nil (_unnamed)

If group name is a map, it will be splitted into separate columns. Be sure that groups (subdatasets) doesn’t contain the same columns already.

If group name is a vector, it will be splitted into separate columns. If you want to name them, set vector of target column names as :add-group-as-column argument.

After ungrouping, order of the rows is kept within the groups but groups are ordered according to the internal storage.


Grouping and ungrouping.

(-> DS
    (tc/group-by :V3)
    (tc/ungroup))

_unnamed [9 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
2 4 0.5 A
1 7 0.5 A
2 2 1.0 B
1 5 1.0 B
2 8 1.0 B
1 3 1.5 C
2 6 1.5 C
1 9 1.5 C

Groups sorted by group name and named.

(-> DS
    (tc/group-by :V3)
    (tc/ungroup {:order? true
                  :dataset-name "Ordered by V3"}))

Ordered by V3 [9 4]:

:V1 :V2 :V3 :V4
1 1 0.5 A
2 4 0.5 A
1 7 0.5 A
2 2 1.0 B
1 5 1.0 B
2 8 1.0 B
1 3 1.5 C
2 6 1.5 C
1 9 1.5 C

Groups sorted descending by group name and named.

(-> DS
    (tc/group-by :V3)
    (tc/ungroup {:order? :desc
                  :dataset-name "Ordered by V3 descending"}))

Ordered by V3 descending [9 4]:

:V1 :V2 :V3 :V4
1 3 1.5 C
2 6 1.5 C
1 9 1.5 C
2 2 1.0 B
1 5 1.0 B
2 8 1.0 B
1 1 0.5 A
2 4 0.5 A
1 7 0.5 A

Let’s add group name and id as additional columns

(-> DS
    (tc/group-by (comp #(< % 4) :V2))
    (tc/ungroup {:add-group-as-column true
                  :add-group-id-as-column true}))

_unnamed [9 6]:

:\(group-name | :\)group-id :V1 :V2 :V3 :V4
true 0 1 1 0.5 A
true 0 2 2 1.0 B
true 0 1 3 1.5 C
false 1 2 4 0.5 A
false 1 1 5 1.0 B
false 1 2 6 1.5 C
false 1 1 7 0.5 A
false 1 2 8 1.0 B
false 1 1 9 1.5 C

Let’s assign different column names

(-> DS
    (tc/group-by (comp #(< % 4) :V2))
    (tc/ungroup {:add-group-as-column "Is V2 less than 4?"
                  :add-group-id-as-column "group id"}))

_unnamed [9 6]:

Is V2 less than 4? group id :V1 :V2 :V3 :V4
true 0 1 1 0.5 A
true 0 2 2 1.0 B
true 0 1 3 1.5 C
false 1 2 4 0.5 A
false 1 1 5 1.0 B
false 1 2 6 1.5 C
false 1 1 7 0.5 A
false 1 2 8 1.0 B
false 1 1 9 1.5 C

If we group by map, we can automatically create new columns out of group names.

(-> DS
    (tc/group-by (fn [row] {"V1 and V3 multiplied" (* (:V1 row)
                                                      (:V3 row))
                            "V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
    (tc/ungroup {:add-group-as-column true}))

_unnamed [9 6]:

V1 and V3 multiplied V4 as lowercase :V1 :V2 :V3 :V4
0.5 a 1 1 0.5 A
0.5 a 1 7 0.5 A
2.0 b 2 2 1.0 B
2.0 b 2 8 1.0 B
1.5 c 1 3 1.5 C
1.5 c 1 9 1.5 C
1.0 a 2 4 0.5 A
1.0 b 1 5 1.0 B
3.0 c 2 6 1.5 C

We can add group names without separation

(-> DS
    (tc/group-by (fn [row] {"V1 and V3 multiplied" (* (:V1 row)
                                                      (:V3 row))
                            "V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
    (tc/ungroup {:add-group-as-column "just map"
                  :separate? false}))

_unnamed [9 5]:

just map :V1 :V2 :V3 :V4
{“V1 and V3 multiplied” 0.5, “V4 as lowercase” “a”} 1 1 0.5 A
{“V1 and V3 multiplied” 0.5, “V4 as lowercase” “a”} 1 7 0.5 A
{“V1 and V3 multiplied” 2.0, “V4 as lowercase” “b”} 2 2 1.0 B
{“V1 and V3 multiplied” 2.0, “V4 as lowercase” “b”} 2 8 1.0 B
{“V1 and V3 multiplied” 1.5, “V4 as lowercase” “c”} 1 3 1.5 C
{“V1 and V3 multiplied” 1.5, “V4 as lowercase” “c”} 1 9 1.5 C
{“V1 and V3 multiplied” 1.0, “V4 as lowercase” “a”} 2 4 0.5 A
{“V1 and V3 multiplied” 1.0, “V4 as lowercase” “b”} 1 5 1.0 B
{“V1 and V3 multiplied” 3.0, “V4 as lowercase” “c”} 2 6 1.5 C

The same applies to group names as sequences

(-> DS
    (tc/group-by (juxt :V1 :V3))
    (tc/ungroup {:add-group-as-column "abc"}))

_unnamed [9 6]:

:abc-0 :abc-1 :V1 :V2 :V3 :V4
1 0.5 1 1 0.5 A
1 0.5 1 7 0.5 A
2 1.0 2 2 1.0 B
2 1.0 2 8 1.0 B
1 1.5 1 3 1.5 C
1 1.5 1 9 1.5 C
2 0.5 2 4 0.5 A
1 1.0 1 5 1.0 B
2 1.5 2 6 1.5 C

Let’s provide column names

(-> DS
    (tc/group-by (juxt :V1 :V3))
    (tc/ungroup {:add-group-as-column ["v1" "v3"]}))

_unnamed [9 6]:

v1 v3 :V1 :V2 :V3 :V4
1 0.5 1 1 0.5 A
1 0.5 1 7 0.5 A
2 1.0 2 2 1.0 B
2 1.0 2 8 1.0 B
1 1.5 1 3 1.5 C
1 1.5 1 9 1.5 C
2 0.5 2 4 0.5 A
1 1.0 1 5 1.0 B
2 1.5 2 6 1.5 C

Also we can supress separation

(-> DS
    (tc/group-by (juxt :V1 :V3))
    (tc/ungroup {:separate? false
                  :add-group-as-column true}))
;; => _unnamed [9 5]:

_unnamed [9 5]:

:$group-name :V1 :V2 :V3 :V4
[1 0.5] 1 1 0.5 A
[1 0.5] 1 7 0.5 A
[2 1.0] 2 2 1.0 B
[2 1.0] 2 8 1.0 B
[1 1.5] 1 3 1.5 C
[1 1.5] 1 9 1.5 C
[2 0.5] 2 4 0.5 A
[1 1.0] 1 5 1.0 B
[2 1.5] 2 6 1.5 C

Other functions

To check if dataset is grouped or not just use grouped? function.

(tc/grouped? DS)
nil
(tc/grouped? (tc/group-by DS :V1))
true

If you want to remove grouping annotation (to make all the functions work as with regular dataset) you can use unmark-group or as-regular-dataset (alias) functions.

It can be important when you want to remove some groups (rows) from grouped dataset using drop-rows or something like that.

(-> DS
    (tc/group-by :V1)
    (tc/as-regular-dataset)
    (tc/grouped?))
nil

You can also operate on grouped dataset as a regular one in case you want to access its columns using without-grouping-> threading macro.

(-> DS
    (tc/group-by [:V4 :V1])
    (tc/without-grouping->
     (tc/order-by (comp (juxt :V4 :V1) :name))))

_unnamed [6 3]:

:name :group-id :data
{:V4 “A”, :V1 1} 0 Group: {:V4 “A”, :V1 1} [2 4]:
{:V4 “A”, :V1 2} 3 Group: {:V4 “A”, :V1 2} [1 4]:
{:V4 “B”, :V1 1} 4 Group: {:V4 “B”, :V1 1} [1 4]:
{:V4 “B”, :V1 2} 1 Group: {:V4 “B”, :V1 2} [2 4]:
{:V4 “C”, :V1 1} 2 Group: {:V4 “C”, :V1 1} [2 4]:
{:V4 “C”, :V1 2} 5 Group: {:V4 “C”, :V1 2} [1 4]:

This is considered internal.

If you want to implement your own mapping function on grouped dataset you can call process-group-data and pass function operating on datasets. Result should be a dataset to have ungrouping working.

(-> DS
    (tc/group-by :V1)
    (tc/process-group-data #(str "Shape: " (vector (tc/row-count %) (tc/column-count %))))
    (tc/as-regular-dataset))

_unnamed [2 3]:

:name :group-id :data
1 0 Shape: [5 4]
2 1 Shape: [4 4]

Columns

Column is a special tech.ml.dataset structure. For our purposes we cat treat columns as typed and named sequence bound to particular dataset.

Type of the data is inferred from a sequence during column creation.

Names

To select dataset columns or column names columns-selector is used. columns-selector can be one of the following:

  • :all keyword - selects all columns
  • column name - for single column
  • sequence of column names - for collection of columns
  • regex - to apply pattern on column names or datatype
  • filter predicate - to filter column names or datatype
  • type namespaced keyword for specific datatype or group of datatypes

Column name can be anything.

column-names function returns names according to columns-selector and optional meta-field. meta-field is one of the following:

  • :name (default) - to operate on column names
  • :datatype - to operated on column types
  • :all - if you want to process all metadata

Datatype groups are:

  • :type/numerical - any numerical type
  • :type/float - floating point number (:float32 and :float64)
  • :type/integer - any integer
  • :type/datetime - any datetime type

If qualified keyword starts with :!type, complement set is used.


To select all column names you can use column-names function.

(tc/column-names DS)
(:V1 :V2 :V3 :V4)

or

(tc/column-names DS :all)
(:V1 :V2 :V3 :V4)

In case you want to select column which has name :all (or is sequence or map), put it into a vector. Below code returns empty sequence since there is no such column in the dataset.

(tc/column-names DS [:all])
()

Obviously selecting single name returns it’s name if available

(tc/column-names DS :V1)
(tc/column-names DS "no such column")
(:V1)
()

Select sequence of column names.

(tc/column-names DS [:V1 "V2" :V3 :V4 :V5])
(:V1 :V3 :V4)

Select names based on regex, columns ends with 1 or 4

(tc/column-names DS #".*[14]")
(:V1 :V4)

Select names based on regex operating on type of the column (to check what are the column types, call (tc/info DS :columns). Here we want to get integer columns only.

(tc/column-names DS #"^:int.*" :datatype)
(:V1 :V2)

or

(tc/column-names DS :type/integer)
(:V1 :V2)

And finally we can use predicate to select names. Let’s select double precision columns.

(tc/column-names DS #{:float64} :datatype)
(:V3)

or

(tc/column-names DS :type/float64)
(:V3)

If you want to select all columns but given, use complement function. Works only on a predicate.

(tc/column-names DS (complement #{:V1}))
(tc/column-names DS (complement #{:float64}) :datatype)
(tc/column-names DS :!type/float64)
(:V2 :V3 :V4)
(:V1 :V2 :V4)
(:V1 :V2 :V4)

You can select column names based on all column metadata at once by using :all metadata selector. Below we want to select column names ending with 1 which have long datatype.

(tc/column-names DS (fn [meta]
                       (and (= :int64 (:datatype meta))
                            (clojure.string/ends-with? (:name meta) "1"))) :all)
(:V1)

Select

select-columns creates dataset with columns selected by columns-selector as described above. Function works on regular and grouped dataset.


Select only float64 columns

(tc/select-columns DS #(= :float64 %) :datatype)

_unnamed [9 1]:

:V3
0.5
1.0
1.5
0.5
1.0
1.5
0.5
1.0
1.5

or

(tc/select-columns DS :type/float64)

_unnamed [9 1]:

:V3
0.5
1.0
1.5
0.5
1.0
1.5
0.5
1.0
1.5

Select all but :V1 columns

(tc/select-columns DS (complement #{:V1}))

_unnamed [9 3]:

:V2 :V3 :V4
1 0.5 A
2 1.0 B
3 1.5 C
4 0.5 A
5 1.0 B
6 1.5 C
7 0.5 A
8 1.0 B
9 1.5 C

If we have grouped data set, column selection is applied to every group separately.

(-> DS
    (tc/group-by :V1)
    (tc/select-columns [:V2 :V3])
    (tc/groups->map))

{1 Group: 1 [5 2]:

:V2 :V3
1 0.5
3 1.5
5 1.0
7 0.5
9 1.5

, 2 Group: 2 [4 2]:

:V2 :V3
2 1.0
4 0.5
6 1.5
8 1.0

}

Drop

drop-columns creates dataset with removed columns.


Drop float64 columns

(tc/drop-columns DS #(= :float64 %) :datatype)

_unnamed [9 3]:

:V1 :V2 :V4
1 1 A
2 2 B
1 3 C
2 4 A
1 5 B
2 6 C
1 7 A
2 8 B
1 9 C

or

(tc/drop-columns DS :type/float64)

_unnamed [9 3]:

:V1 :V2 :V4
1 1 A
2 2 B
1 3 C
2 4 A
1 5 B
2 6 C
1 7 A
2 8 B
1 9 C

Drop all columns but :V1 and :V2

(tc/drop-columns DS (complement #{:V1 :V2}))

_unnamed [9 2]:

:V1 :V2
1 1
2 2
1 3
2 4
1 5
2 6
1 7
2 8
1 9

If we have grouped data set, column selection is applied to every group separately. Selected columns are dropped.

(-> DS
    (tc/group-by :V1)
    (tc/drop-columns [:V2 :V3])
    (tc/groups->map))

{1 Group: 1 [5 2]:

:V1 :V4
1 A
1 C
1 B
1 A
1 C

, 2 Group: 2 [4 2]:

:V1 :V4
2 B
2 A
2 C
2 B

}

Rename

If you want to rename colums use rename-columns and pass map where keys are old names, values new ones.

You can also pass mapping function with optional columns-selector

(tc/rename-columns DS {:V1 "v1"
                        :V2 "v2"
                        :V3 [1 2 3]
                        :V4 (Object.)})

_unnamed [9 4]:

v1 v2 [1 2 3]
1 1 0.5 A
2 2 1.0 B
1 3 1.5 C
2 4 0.5 A
1 5 1.0 B
2 6 1.5 C
1 7 0.5 A
2 8 1.0 B
1 9 1.5 C

Map all names with function

(tc/rename-columns DS (comp str second name))

_unnamed [9 4]:

1 2 3 4
1 1 0.5 A
2 2 1.0 B
1 3 1.5 C
2 4 0.5 A
1 5 1.0 B
2 6 1.5 C
1 7 0.5 A
2 8 1.0 B
1 9 1.5 C

Map selected names with function

(tc/rename-columns DS [:V1 :V3] (comp str second name))

_unnamed [9 4]:

1 :V2 3 :V4
1 1 0.5 A
2 2 1.0 B
1 3 1.5 C
2 4 0.5 A
1 5 1.0 B
2 6 1.5 C
1 7 0.5 A
2 8 1.0 B
1 9 1.5 C

Function works on grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/rename-columns {:V1 "v1"
                         :V2 "v2"
                         :V3 [1 2 3]
                         :V4 (Object.)})
    (tc/groups->map))

{1 Group: 1 [5 4]:

v1 v2 [1 2 3]
1 1 0.5 A
1 3 1.5 C
1 5 1.0 B
1 7 0.5 A
1 9 1.5 C

, 2 Group: 2 [4 4]:

v1 v2 [1 2 3]
2 2 1.0 B
2 4 0.5 A
2 6 1.5 C
2 8 1.0 B

}

Add or update

To add (or replace existing) column call add-column function. Function accepts:

  • ds - a dataset
  • column-name - if it’s existing column name, column will be replaced
  • column - can be column (from other dataset), sequence, single value or function. Too big columns are always trimmed. Too small are cycled or extended with missing values (according to size-strategy argument)
  • size-strategy (optional) - when new column is shorter than dataset row count, following strategies are applied:
    • :cycle - repeat data
    • :na - append missing values
    • :strict - (default) throws an exception when sizes mismatch

Function works on grouped dataset.


Add single value as column

(tc/add-column DS :V5 "X")

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :V5
1 1 0.5 A X
2 2 1.0 B X
1 3 1.5 C X
2 4 0.5 A X
1 5 1.0 B X
2 6 1.5 C X
1 7 0.5 A X
2 8 1.0 B X
1 9 1.5 C X

Replace one column (column is trimmed)

(tc/add-column DS :V1 (repeatedly rand))

_unnamed [9 4]:

:V1 :V2 :V3 :V4
0.41257790 1 0.5 A
0.55281322 2 1.0 B
0.15073657 3 1.5 C
0.18311522 4 0.5 A
0.76664862 5 1.0 B
0.95296580 6 1.5 C
0.24539929 7 0.5 A
0.24504991 8 1.0 B
0.10504470 9 1.5 C

Copy column

(tc/add-column DS :V5 (DS :V1))

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :V5
1 1 0.5 A 1
2 2 1.0 B 2
1 3 1.5 C 1
2 4 0.5 A 2
1 5 1.0 B 1
2 6 1.5 C 2
1 7 0.5 A 1
2 8 1.0 B 2
1 9 1.5 C 1

When function is used, argument is whole dataset and the result should be column, sequence or single value

(tc/add-column DS :row-count tc/row-count)

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :row-count
1 1 0.5 A 9
2 2 1.0 B 9
1 3 1.5 C 9
2 4 0.5 A 9
1 5 1.0 B 9
2 6 1.5 C 9
1 7 0.5 A 9
2 8 1.0 B 9
1 9 1.5 C 9

Above example run on grouped dataset, applies function on each group separately.

(-> DS
    (tc/group-by :V1)
    (tc/add-column :row-count tc/row-count)
    (tc/ungroup))

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :row-count
1 1 0.5 A 5
1 3 1.5 C 5
1 5 1.0 B 5
1 7 0.5 A 5
1 9 1.5 C 5
2 2 1.0 B 4
2 4 0.5 A 4
2 6 1.5 C 4
2 8 1.0 B 4

When column which is added is longer than row count in dataset, column is trimmed. When column is shorter, it’s cycled or missing values are appended.

(tc/add-column DS :V5 [:r :b] :cycle)

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :V5
1 1 0.5 A :r
2 2 1.0 B :b
1 3 1.5 C :r
2 4 0.5 A :b
1 5 1.0 B :r
2 6 1.5 C :b
1 7 0.5 A :r
2 8 1.0 B :b
1 9 1.5 C :r
(tc/add-column DS :V5 [:r :b] :na)

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :V5
1 1 0.5 A :r
2 2 1.0 B :b
1 3 1.5 C
2 4 0.5 A
1 5 1.0 B
2 6 1.5 C
1 7 0.5 A
2 8 1.0 B
1 9 1.5 C

Exception is thrown when :strict (default) strategy is used and column size is not equal row count

(try
  (tc/add-column DS :V5 [:r :b])
  (catch Exception e (str "Exception caught: "(ex-message e))))
"Exception caught: Column size (2) should be exactly the same as dataset row count (9). Consider `:cycle` or `:na` strategy."

Tha same applies for grouped dataset

(-> DS
    (tc/group-by :V3)
    (tc/add-column :V5 [:r :b] :na)
    (tc/ungroup))

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :V5
1 1 0.5 A :r
2 4 0.5 A :b
1 7 0.5 A
2 2 1.0 B :r
1 5 1.0 B :b
2 8 1.0 B
1 3 1.5 C :r
2 6 1.5 C :b
1 9 1.5 C

Let’s use other column to fill groups

(-> DS
    (tc/group-by :V3)
    (tc/add-column :V5 (DS :V2) :cycle)
    (tc/ungroup))

_unnamed [9 5]:

:V1 :V2 :V3 :V4 :V5
1 1 0.5 A 1
2 4 0.5 A 2
1 7 0.5 A 3
2 2 1.0 B 1
1 5 1.0 B 2
2 8 1.0 B 3
1 3 1.5 C 1
2 6 1.5 C 2
1 9 1.5 C 3

In case you want to add or update several columns you can call add-columns and provide map where keys are column names, vals are columns.

(tc/add-columns DS {:V1 #(map inc (% :V1))
                               :V5 #(map (comp keyword str) (% :V4))
                               :V6 11})

_unnamed [9 6]:

:V1 :V2 :V3 :V4 :V5 :V6
2 1 0.5 A :A 11