tech-ml-version
“7.007”
tablecloth-version
“7.007”
tech.ml.dataset is a great and fast library which brings columnar dataset to the Clojure. Chris Nuernberger has been working on this library for last year as a part of bigger tech.ml
stack.
I’ve started to test the library and help to fix uncovered bugs. My main goal was to compare functionalities with the other standards from other platforms. I focused on R solutions: dplyr, tidyr and data.table.
During conversions of the examples I’ve come up how to reorganized existing tech.ml.dataset
functions into simple to use API. The main goals were:
tech.ml
like pipelines, datatypes, readers, ML, etc.group-by
results with special kind of dataset - a dataset containing subsets created after grouping as a column.If you want to know more about tech.ml.dataset
and dtype-next
please refer their documentation:
Join the discussion on Zulip
Let’s require main namespace and define dataset used in most examples:
require '[tablecloth.api :as tc]
(:as dfn])
'[tech.v3.datatype.functional def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
(:V2 (range 1 10)
:V3 (take 9 (cycle [0.5 1.0 1.5]))
:V4 (take 9 (cycle ["A" "B" "C"]))}))
DS
_unnamed [9 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 4 | 0.5 | A |
1 | 5 | 1.0 | B |
2 | 6 | 1.5 | C |
1 | 7 | 0.5 | A |
2 | 8 | 1.0 | B |
1 | 9 | 1.5 | C |
Dataset is a special type which can be considered as a map of columns implemented around tech.ml.dataset
library. Each column can be considered as named sequence of typed data. Supported types include integers, floats, string, boolean, date/time, objects etc.
Dataset can be created from various of types of Clojure structures and files:
[string column-data]
or [keyword column-data]
tc/dataset
accepts:
tech.ml.dataset/->dataset
function for full list):
:dataset-name
- name of the dataset:num-rows
- number of rows to read from file:header-row?
- indication if first row in file is a header:key-fn
- function applied to column names (eg. keyword
, to convert column names to keywords):separator
- column separator:single-value-column-name
- name of the column when single value is provided:column-names
- in case you want to name columns - only works for sequential input (arrays) or empty dataset:layout
- for numerical, native array of arrays - treat entries :as-rows
or :as-columns
(default)tc/let-dataset
accepts bindings symbol
-column-data
to simulate R’s tibble
function. Each binding is converted into a column. You can refer previous columns to in further bindings (as in let
).
Empty dataset.
(tc/dataset)
_unnamed [0 0]
Empty dataset with column names
nil {:column-names [:a :b]}) (tc/dataset
_unnamed [0 2]:
| :a | :b |
|----|----|
Sequence of pairs (first = column name, second = value(s)).
:A 33] [:B 5] [:C :a]]) (tc/dataset [[
_unnamed [1 3]:
:A | :B | :C |
---|---|---|
33 | 5 | :a |
Not sequential values are repeated row-count number of times.
:A [1 2 3 4 5 6]] [:B "X"] [:C :a]]) (tc/dataset [[
_unnamed [6 3]:
:A | :B | :C |
---|---|---|
1 | X | :a |
2 | X | :a |
3 | X | :a |
4 | X | :a |
5 | X | :a |
6 | X | :a |
Dataset created from map (keys = column names, vals = value(s)). Works the same as sequence of pairs.
:A 33})
(tc/dataset {:A [1 2 3]})
(tc/dataset {:A [3 4 5] :B "X"}) (tc/dataset {
_unnamed [1 1]:
:A |
---|
33 |
_unnamed [3 1]:
:A |
---|
1 |
2 |
3 |
_unnamed [3 2]:
:A | :B |
---|---|
3 | X |
4 | X |
5 | X |
You can put any value inside a column
:A [[3 4 5] [:a :b]] :B "X"}) (tc/dataset {
_unnamed [2 2]:
:A | :B |
---|---|
[3 4 5] | X |
[:a :b] | X |
Sequence of maps
:a 1 :b 3} {:b 2 :a 99}])
(tc/dataset [{:a 1 :b [1 2 3]} {:a 2 :b [3 4]}]) (tc/dataset [{
_unnamed [2 2]:
:a | :b |
---|---|
1 | 3 |
99 | 2 |
_unnamed [2 2]:
:a | :b |
---|---|
1 | [1 2 3] |
2 | [3 4] |
Missing values are marked by nil
:a nil :b 1} {:a 3 :b 4} {:a 11}]) (tc/dataset [{
_unnamed [3 2]:
:a | :b |
---|---|
1 | |
3 | 4 |
11 |
Reading from arrays, by default :as-rows
-> (map int-array [[1 2] [3 4] [5 6]])
(into-array)
( (tc/dataset))
:_unnamed [3 2]:
0 | 1 |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
:as-columns
-> (map int-array [[1 2] [3 4] [5 6]])
(into-array)
(:layout :as-columns})) (tc/dataset {
:_unnamed [2 3]:
0 | 1 | 2 |
---|---|---|
1 | 3 | 5 |
2 | 4 | 6 |
:as-rows
with names
-> (map int-array [[1 2] [3 4] [5 6]])
(into-array)
(:layout :as-rows
(tc/dataset {:column-names [:a :b]}))
:_unnamed [3 2]:
:a | :b |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
Any objects
-> (map to-array [[:a :z] ["ee" "ww"] [9 10]])
(into-array)
(:column-names [:a :b :c]
(tc/dataset {:layout :as-columns}))
:_unnamed [2 3]:
:a | :b | :c |
---|---|---|
:a | ee | 9 |
:z | ww | 10 |
Create dataset using macro let-dataset
to simulate R tibble
function. Each binding is converted into a column.
range 1 6)
(tc/let-dataset [x (1
y z (dfn/+ x y)])
_unnamed [5 3]:
:x | :y | :z |
---|---|---|
1 | 1 | 2 |
2 | 1 | 3 |
3 | 1 | 4 |
4 | 1 | 5 |
5 | 1 | 6 |
Import CSV file
"data/family.csv") (tc/dataset
data/family.csv [5 5]:
family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
---|---|---|---|---|
1 | 1998-11-26 | 2000-01-29 | 1 | 2 |
2 | 1996-06-22 | 2 | ||
3 | 2002-07-11 | 2004-04-05 | 2 | 2 |
4 | 2004-10-10 | 2009-08-27 | 1 | 1 |
5 | 2000-12-05 | 2005-02-28 | 2 | 1 |
Import from URL
defonce ds (tc/dataset "https://vega.github.io/vega-lite/examples/data/seattle-weather.csv")) (
ds
https://vega.github.io/vega-lite/examples/data/seattle-weather.csv [1461 6]:
date | precipitation | temp_max | temp_min | wind | weather |
---|---|---|---|---|---|
2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |
2012-01-06 | 2.5 | 4.4 | 2.2 | 2.2 | rain |
2012-01-07 | 0.0 | 7.2 | 2.8 | 2.3 | rain |
2012-01-08 | 0.0 | 10.0 | 2.8 | 2.0 | sun |
2012-01-09 | 4.3 | 9.4 | 5.0 | 3.4 | rain |
2012-01-10 | 1.0 | 6.1 | 0.6 | 3.4 | rain |
… | … | … | … | … | … |
2015-12-21 | 27.4 | 5.6 | 2.8 | 4.3 | rain |
2015-12-22 | 4.6 | 7.8 | 2.8 | 5.0 | rain |
2015-12-23 | 6.1 | 5.0 | 2.8 | 7.6 | rain |
2015-12-24 | 2.5 | 5.6 | 2.2 | 4.3 | rain |
2015-12-25 | 5.8 | 5.0 | 2.2 | 1.5 | rain |
2015-12-26 | 0.0 | 4.4 | 0.0 | 2.5 | sun |
2015-12-27 | 8.6 | 4.4 | 1.7 | 2.9 | rain |
2015-12-28 | 1.5 | 5.0 | 1.7 | 1.3 | rain |
2015-12-29 | 0.0 | 7.2 | 0.6 | 2.6 | fog |
2015-12-30 | 0.0 | 5.6 | -1.0 | 3.4 | sun |
2015-12-31 | 0.0 | 5.6 | -2.1 | 3.5 | sun |
When none of above works, singleton dataset is created. Along with the error message from the exception thrown by tech.ml.dataset
999) (tc/dataset
_unnamed [1 2]:
:\(value | :\)error | |
---|---|
999 | Don’t know how to create ISeq from: java.lang.Long |
To see the stack trace, turn it on by setting :stack-trace?
to true
.
Set column name for single value. Also set the dataset name and turn off creating error message column.
999 {:single-value-column-name "my-single-value"
(tc/dataset :error-column? false})
999 {:single-value-column-name ""
(tc/dataset :dataset-name "Single value"
:error-column? false})
_unnamed [1 1]:
my-single-value |
---|
999 |
Single value [1 1]:
0 |
---|
999 |
Export dataset to a file or output stream can be done by calling tc/write!
. Function accepts:
.csv
, .tsv
, .csv.gz
and .tsv.gz
or output stream:separator
- string or separator char."output.tsv.gz")
(tc/write! ds "output.tsv.gz")) (.exists (clojure.java.io/file
1462
true
"output.nippy.gz") (tc/write! DS
nil
"output.nippy.gz") (tc/dataset
output.nippy.gz [9 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 4 | 0.5 | A |
1 | 5 | 1.0 | B |
2 | 6 | 1.5 | C |
1 | 7 | 0.5 | A |
2 | 8 | 1.0 | B |
1 | 9 | 1.5 | C |
Get columns and rows as sequences. column
, columns
and rows
treat grouped dataset as regular one. See Groups
to read more about grouped datasets.
Possible result types:
:as-seq
or :as-seqs
- sequence of seqences (default):as-maps
- sequence of maps (rows):as-map
- map of sequences (columns):as-double-arrays
- array of double arrays:as-vecs
- sequence of vectors (rows)For rows
setting :nil-missing?
option to false
will elide keys for nil values.
Select column.
"wind")
(ds "date") (tc/column ds
#tech.v3.dataset.column<float64>[1461]
wind
[4.700, 4.500, 2.300, 4.700, 6.100, 2.200, 2.300, 2.000, 3.400, 3.400, 5.100, 1.900, 1.300, 5.300, 3.200, 5.000, 5.600, 5.000, 1.600, 2.300...]
#tech.v3.dataset.column<packed-local-date>[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...]
Columns as sequence
take 2 (tc/columns ds)) (
(#tech.v3.dataset.column<packed-local-date>[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...] #tech.v3.dataset.column<float64>[1461]
precipitation
[0.000, 10.90, 0.8000, 20.30, 1.300, 2.500, 0.000, 0.000, 4.300, 1.000, 0.000, 0.000, 0.000, 4.100, 5.300, 2.500, 8.100, 19.80, 15.20, 13.50...])
Columns as map
keys (tc/columns ds :as-map)) (
("date" "precipitation" "temp_max" "temp_min" "wind" "weather")
Rows as sequence of sequences
take 2 (tc/rows ds)) (
([#object[java.time.LocalDate 0x3ad18613 "2012-01-01"] 0.0 12.8 5.0 4.7 "drizzle"] [#object[java.time.LocalDate 0x63f0fc93 "2012-01-02"] 10.9 10.6 2.8 4.5 "rain"])
Select rows/columns as double-double-array
-> ds
(:type/numerical)
(tc/select-columns
(tc/head):as-double-arrays)) (tc/rows
#object["[[D" 0x7913631e "[[D@7913631e"]
-> ds
(:type/numerical)
(tc/select-columns
(tc/head):as-double-arrays)) (tc/columns
#object["[[D" 0x33ac292c "[[D@33ac292c"]
Rows as sequence of maps
take 2 (tc/rows ds :as-maps))) (clojure.pprint/pprint (
({"date" #object[java.time.LocalDate 0x46e26443 "2012-01-01"],
"precipitation" 0.0,
"temp_max" 12.8,
"temp_min" 5.0,
"wind" 4.7,
"weather" "drizzle"}
{"date" #object[java.time.LocalDate 0x1f2b8ae2 "2012-01-02"],
"precipitation" 10.9,
"temp_max" 10.6,
"temp_min" 2.8,
"wind" 4.5,
"weather" "rain"})
Rows with missing values
-> {:a [1 nil 2]
(:b [3 4 nil]}
(tc/dataset):as-maps)) (tc/rows
[{:a 1, :b 3} {:a nil, :b 4} {:a 2, :b nil}]
Rows with elided missing values
-> {:a [1 nil 2]
(:b [3 4 nil]}
(tc/dataset):as-maps {:nil-missing? false})) (tc/rows
[{:a 1, :b 3} {:b 4} {:a 2}]
Get single value from the table using get-in
from Clojure API or get-entry
. First argument is column name, second is row number.
get-in ds ["wind" 2]) (
2.3
"wind" 2) (tc/get-entry ds
2.3
Dataset is printed using dataset->str
or print-dataset
functions. Options are the same as in tech.ml.dataset/dataset-data->str
. Most important is :print-line-policy
which can be one of the: :single
, :repl
or :markdown
.
:V1) {:print-line-policy :markdown}) (tc/print-dataset (tc/group-by DS
_unnamed [2 3]:
| :name | :group-id | :data |
|------:|----------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1 | 0 | Group: 1 [5 4]:<br><br>\| :V1 \| :V2 \| :V3 \| :V4 \|<br>\|----:\|----:\|----:\|-----\|<br>\| 1 \| 1 \| 0.5 \| A \|<br>\| 1 \| 3 \| 1.5 \| C \|<br>\| 1 \| 5 \| 1.0 \| B \|<br>\| 1 \| 7 \| 0.5 \| A \|<br>\| 1 \| 9 \| 1.5 \| C \| |
| 2 | 1 | Group: 2 [4 4]:<br><br>\| :V1 \| :V2 \| :V3 \| :V4 \|<br>\|----:\|----:\|----:\|-----\|<br>\| 2 \| 2 \| 1.0 \| B \|<br>\| 2 \| 4 \| 0.5 \| A \|<br>\| 2 \| 6 \| 1.5 \| C \|<br>\| 2 \| 8 \| 1.0 \| B \| |
:V1) {:print-line-policy :repl}) (tc/print-dataset (tc/group-by DS
_unnamed [2 3]:
| :name | :group-id | :data |
|------:|----------:|--------------------------------|
| 1 | 0 | Group: 1 [5 4]: |
| | | |
| | | \| :V1 \| :V2 \| :V3 \| :V4 \| |
| | | \|----:\|----:\|----:\|-----\| |
| | | \| 1 \| 1 \| 0.5 \| A \| |
| | | \| 1 \| 3 \| 1.5 \| C \| |
| | | \| 1 \| 5 \| 1.0 \| B \| |
| | | \| 1 \| 7 \| 0.5 \| A \| |
| | | \| 1 \| 9 \| 1.5 \| C \| |
| 2 | 1 | Group: 2 [4 4]: |
| | | |
| | | \| :V1 \| :V2 \| :V3 \| :V4 \| |
| | | \|----:\|----:\|----:\|-----\| |
| | | \| 2 \| 2 \| 1.0 \| B \| |
| | | \| 2 \| 4 \| 0.5 \| A \| |
| | | \| 2 \| 6 \| 1.5 \| C \| |
| | | \| 2 \| 8 \| 1.0 \| B \| |
:V1) {:print-line-policy :single}) (tc/print-dataset (tc/group-by DS
_unnamed [2 3]:
| :name | :group-id | :data |
|------:|----------:|-----------------|
| 1 | 0 | Group: 1 [5 4]: |
| 2 | 1 | Group: 2 [4 4]: |
Grouping by is an operation which splits dataset into subdatasets and pack it into new special type of… dataset. I distinguish two types of dataset: regular dataset and grouped dataset. The latter is the result of grouping.
Grouped dataset is annotated in by :grouped?
meta tag and consist following columns:
:name
- group name or structure:group-id
- integer assigned to the group:data
- groups as datasetsAlmost all functions recognize type of the dataset (grouped or not) and operate accordingly.
You can’t apply reshaping or join/concat functions on grouped datasets.
Grouping is done by calling group-by
function with arguments:
ds
- datasetgrouping-selector
- what to use for grouping:result-type
- what to return::as-dataset
(default) - return grouped dataset:as-indexes
- return rows ids (row number from original dataset):as-map
- return map with group names as keys and subdataset as values:as-seq
- return sequens of subdatasets:select-keys
- list of the columns passed to a grouping selector functionAll subdatasets (groups) have set name as the group name, additionally group-id
is in meta.
Grouping can be done by:
:select-keys
)Note: currently dataset inside dataset is printed recursively so it renders poorly from markdown. So I will use :as-seq
result type to show just group names and groups.
List of columns in grouped dataset
-> DS
(:V1)
(tc/group-by (tc/column-names))
(:V1 :V2 :V3 :V4)
List of columns in grouped dataset treated as regular dataset
-> DS
(:V1)
(tc/group-by
(tc/as-regular-dataset) (tc/column-names))
(:name :group-id :data)
Content of the grouped dataset
:V1) :as-map) (tc/columns (tc/group-by DS
{:name #tech.v3.dataset.column<int64>[2]
:name
[1, 2], :group-id #tech.v3.dataset.column<int64>[2]
:group-id
[0, 1], :data #tech.v3.dataset.column<dataset>[2]
:data
[Group: 1 [5 4]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 1 | 1 | 0.5 | A |
| 1 | 3 | 1.5 | C |
| 1 | 5 | 1.0 | B |
| 1 | 7 | 0.5 | A |
| 1 | 9 | 1.5 | C |
, Group: 2 [4 4]:
| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
| 2 | 2 | 1.0 | B |
| 2 | 4 | 0.5 | A |
| 2 | 6 | 1.5 | C |
| 2 | 8 | 1.0 | B |
]}
Grouped dataset as map
keys (tc/group-by DS :V1 {:result-type :as-map})) (
(1 2)
vals (tc/group-by DS :V1 {:result-type :as-map})) (
(Group: 1 [5 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
1 | 3 | 1.5 | C |
1 | 5 | 1.0 | B |
1 | 7 | 0.5 | A |
1 | 9 | 1.5 | C |
Group: 2 [4 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 2 | 1.0 | B |
2 | 4 | 0.5 | A |
2 | 6 | 1.5 | C |
2 | 8 | 1.0 | B |
)
Group dataset as map of indexes (row ids)
:V1 {:result-type :as-indexes}) (tc/group-by DS
{1 [0 2 4 6 8], 2 [1 3 5 7]}
Grouped datasets are printed as follows by default.
:V1) (tc/group-by DS
_unnamed [2 3]:
:name | :group-id | :data |
---|---|---|
1 | 0 | Group: 1 [5 4]: |
2 | 1 | Group: 2 [4 4]: |
To get groups as sequence or a map can be done from grouped dataset using groups->seq
and groups->map
functions.
Groups as seq can be obtained by just accessing :data
column.
I will use temporary dataset here.
let [ds (-> {"a" [1 1 2 2]
("b" ["a" "b" "c" "d"]}
(tc/dataset)"a"))]
(tc/group-by seq (ds :data))) ;; seq is not necessary but Markdown treats `:data` as command here (
(Group: 1 [2 2]:
a | b |
---|---|
1 | a |
1 | b |
Group: 2 [2 2]:
a | b |
---|---|
2 | c |
2 | d |
)
-> {"a" [1 1 2 2]
("b" ["a" "b" "c" "d"]}
(tc/dataset)"a")
(tc/group-by (tc/groups->seq))
(Group: 1 [2 2]:
a | b |
---|---|
1 | a |
1 | b |
Group: 2 [2 2]:
a | b |
---|---|
2 | c |
2 | d |
)
Groups as map
-> {"a" [1 1 2 2]
("b" ["a" "b" "c" "d"]}
(tc/dataset)"a")
(tc/group-by (tc/groups->map))
{1 Group: 1 [2 2]:
a | b |
---|---|
1 | a |
1 | b |
, 2 Group: 2 [2 2]:
a | b |
---|---|
2 | c |
2 | d |
}
Grouping by more than one column. You can see that group names are maps. When ungrouping is done these maps are used to restore column names.
:V1 :V3] {:result-type :as-seq}) (tc/group-by DS [
(Group: {:V1 1, :V3 0.5} [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
1 | 7 | 0.5 | A |
Group: {:V1 2, :V3 1.0} [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 2 | 1.0 | B |
2 | 8 | 1.0 | B |
Group: {:V1 1, :V3 1.5} [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 3 | 1.5 | C |
1 | 9 | 1.5 | C |
Group: {:V1 2, :V3 0.5} [1 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 4 | 0.5 | A |
Group: {:V1 1, :V3 1.0} [1 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 5 | 1.0 | B |
Group: {:V1 2, :V3 1.5} [1 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 6 | 1.5 | C |
)
Grouping can be done by providing just row indexes. This way you can assign the same row to more than one group.
"group-a" [1 2 1 2]
(tc/group-by DS {"group-b" [5 5 5 1]} {:result-type :as-seq})
(Group: group-a [4 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
Group: group-b [4 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 6 | 1.5 | C |
2 | 6 | 1.5 | C |
2 | 6 | 1.5 | C |
2 | 2 | 1.0 | B |
)
You can group by a result of grouping function which gets row as map and should return group name. When map is used as a group name, ungrouping restore original column names.
fn [row] (* (:V1 row)
(tc/group-by DS (:V3 row))) {:result-type :as-seq}) (
(Group: 0.5 [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
1 | 7 | 0.5 | A |
Group: 2.0 [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 2 | 1.0 | B |
2 | 8 | 1.0 | B |
Group: 1.5 [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 3 | 1.5 | C |
1 | 9 | 1.5 | C |
Group: 1.0 [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 4 | 0.5 | A |
1 | 5 | 1.0 | B |
Group: 3.0 [1 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 6 | 1.5 | C |
)
You can use any predicate on column to split dataset into two groups.
comp #(< % 1.0) :V3) {:result-type :as-seq}) (tc/group-by DS (
(Group: true [3 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 4 | 0.5 | A |
1 | 7 | 0.5 | A |
Group: false [6 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
1 | 5 | 1.0 | B |
2 | 6 | 1.5 | C |
2 | 8 | 1.0 | B |
1 | 9 | 1.5 | C |
)
juxt
is also helpful
juxt :V1 :V3) {:result-type :as-seq}) (tc/group-by DS (
(Group: [1 0.5] [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
1 | 7 | 0.5 | A |
Group: [2 1.0] [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 2 | 1.0 | B |
2 | 8 | 1.0 | B |
Group: [1 1.5] [2 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 3 | 1.5 | C |
1 | 9 | 1.5 | C |
Group: [2 0.5] [1 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 4 | 0.5 | A |
Group: [1 1.0] [1 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 5 | 1.0 | B |
Group: [2 1.5] [1 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 6 | 1.5 | C |
)
tech.ml.dataset
provides an option to limit columns which are passed to grouping functions. It’s done for performance purposes.
identity {:result-type :as-seq
(tc/group-by DS :select-keys [:V1]})
(Group: {:V1 1} [5 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
1 | 3 | 1.5 | C |
1 | 5 | 1.0 | B |
1 | 7 | 0.5 | A |
1 | 9 | 1.5 | C |
Group: {:V1 2} [4 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
2 | 2 | 1.0 | B |
2 | 4 | 0.5 | A |
2 | 6 | 1.5 | C |
2 | 8 | 1.0 | B |
)
Ungrouping simply concats all the groups into the dataset. Following options are possible
:order?
- order groups according to the group name ascending order. Default: false
:add-group-as-column
- should group name become a column? If yes column is created with provided name (or :$group-name
if argument is true
). Default: nil
.:add-group-id-as-column
- should group id become a column? If yes column is created with provided name (or :$group-id
if argument is true
). Default: nil
.:dataset-name
- to name resulting dataset. Default: nil
(_unnamed)If group name is a map, it will be splitted into separate columns. Be sure that groups (subdatasets) doesn’t contain the same columns already.
If group name is a vector, it will be splitted into separate columns. If you want to name them, set vector of target column names as :add-group-as-column
argument.
After ungrouping, order of the rows is kept within the groups but groups are ordered according to the internal storage.
Grouping and ungrouping.
-> DS
(:V3)
(tc/group-by (tc/ungroup))
_unnamed [9 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 4 | 0.5 | A |
1 | 7 | 0.5 | A |
2 | 2 | 1.0 | B |
1 | 5 | 1.0 | B |
2 | 8 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 6 | 1.5 | C |
1 | 9 | 1.5 | C |
Groups sorted by group name and named.
-> DS
(:V3)
(tc/group-by :order? true
(tc/ungroup {:dataset-name "Ordered by V3"}))
Ordered by V3 [9 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 4 | 0.5 | A |
1 | 7 | 0.5 | A |
2 | 2 | 1.0 | B |
1 | 5 | 1.0 | B |
2 | 8 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 6 | 1.5 | C |
1 | 9 | 1.5 | C |
Groups sorted descending by group name and named.
-> DS
(:V3)
(tc/group-by :order? :desc
(tc/ungroup {:dataset-name "Ordered by V3 descending"}))
Ordered by V3 descending [9 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
1 | 3 | 1.5 | C |
2 | 6 | 1.5 | C |
1 | 9 | 1.5 | C |
2 | 2 | 1.0 | B |
1 | 5 | 1.0 | B |
2 | 8 | 1.0 | B |
1 | 1 | 0.5 | A |
2 | 4 | 0.5 | A |
1 | 7 | 0.5 | A |
Let’s add group name and id as additional columns
-> DS
(comp #(< % 4) :V2))
(tc/group-by (:add-group-as-column true
(tc/ungroup {:add-group-id-as-column true}))
_unnamed [9 6]:
:\(group-name | :\)group-id | :V1 | :V2 | :V3 | :V4 | |
---|---|---|---|---|---|
true | 0 | 1 | 1 | 0.5 | A |
true | 0 | 2 | 2 | 1.0 | B |
true | 0 | 1 | 3 | 1.5 | C |
false | 1 | 2 | 4 | 0.5 | A |
false | 1 | 1 | 5 | 1.0 | B |
false | 1 | 2 | 6 | 1.5 | C |
false | 1 | 1 | 7 | 0.5 | A |
false | 1 | 2 | 8 | 1.0 | B |
false | 1 | 1 | 9 | 1.5 | C |
Let’s assign different column names
-> DS
(comp #(< % 4) :V2))
(tc/group-by (:add-group-as-column "Is V2 less than 4?"
(tc/ungroup {:add-group-id-as-column "group id"}))
_unnamed [9 6]:
Is V2 less than 4? | group id | :V1 | :V2 | :V3 | :V4 |
---|---|---|---|---|---|
true | 0 | 1 | 1 | 0.5 | A |
true | 0 | 2 | 2 | 1.0 | B |
true | 0 | 1 | 3 | 1.5 | C |
false | 1 | 2 | 4 | 0.5 | A |
false | 1 | 1 | 5 | 1.0 | B |
false | 1 | 2 | 6 | 1.5 | C |
false | 1 | 1 | 7 | 0.5 | A |
false | 1 | 2 | 8 | 1.0 | B |
false | 1 | 1 | 9 | 1.5 | C |
If we group by map, we can automatically create new columns out of group names.
-> DS
(fn [row] {"V1 and V3 multiplied" (* (:V1 row)
(tc/group-by (:V3 row))
("V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
:add-group-as-column true})) (tc/ungroup {
_unnamed [9 6]:
V1 and V3 multiplied | V4 as lowercase | :V1 | :V2 | :V3 | :V4 |
---|---|---|---|---|---|
0.5 | a | 1 | 1 | 0.5 | A |
0.5 | a | 1 | 7 | 0.5 | A |
2.0 | b | 2 | 2 | 1.0 | B |
2.0 | b | 2 | 8 | 1.0 | B |
1.5 | c | 1 | 3 | 1.5 | C |
1.5 | c | 1 | 9 | 1.5 | C |
1.0 | a | 2 | 4 | 0.5 | A |
1.0 | b | 1 | 5 | 1.0 | B |
3.0 | c | 2 | 6 | 1.5 | C |
We can add group names without separation
-> DS
(fn [row] {"V1 and V3 multiplied" (* (:V1 row)
(tc/group-by (:V3 row))
("V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
:add-group-as-column "just map"
(tc/ungroup {:separate? false}))
_unnamed [9 5]:
just map | :V1 | :V2 | :V3 | :V4 |
---|---|---|---|---|
{“V1 and V3 multiplied” 0.5, “V4 as lowercase” “a”} | 1 | 1 | 0.5 | A |
{“V1 and V3 multiplied” 0.5, “V4 as lowercase” “a”} | 1 | 7 | 0.5 | A |
{“V1 and V3 multiplied” 2.0, “V4 as lowercase” “b”} | 2 | 2 | 1.0 | B |
{“V1 and V3 multiplied” 2.0, “V4 as lowercase” “b”} | 2 | 8 | 1.0 | B |
{“V1 and V3 multiplied” 1.5, “V4 as lowercase” “c”} | 1 | 3 | 1.5 | C |
{“V1 and V3 multiplied” 1.5, “V4 as lowercase” “c”} | 1 | 9 | 1.5 | C |
{“V1 and V3 multiplied” 1.0, “V4 as lowercase” “a”} | 2 | 4 | 0.5 | A |
{“V1 and V3 multiplied” 1.0, “V4 as lowercase” “b”} | 1 | 5 | 1.0 | B |
{“V1 and V3 multiplied” 3.0, “V4 as lowercase” “c”} | 2 | 6 | 1.5 | C |
The same applies to group names as sequences
-> DS
(juxt :V1 :V3))
(tc/group-by (:add-group-as-column "abc"})) (tc/ungroup {
_unnamed [9 6]:
:abc-0 | :abc-1 | :V1 | :V2 | :V3 | :V4 |
---|---|---|---|---|---|
1 | 0.5 | 1 | 1 | 0.5 | A |
1 | 0.5 | 1 | 7 | 0.5 | A |
2 | 1.0 | 2 | 2 | 1.0 | B |
2 | 1.0 | 2 | 8 | 1.0 | B |
1 | 1.5 | 1 | 3 | 1.5 | C |
1 | 1.5 | 1 | 9 | 1.5 | C |
2 | 0.5 | 2 | 4 | 0.5 | A |
1 | 1.0 | 1 | 5 | 1.0 | B |
2 | 1.5 | 2 | 6 | 1.5 | C |
Let’s provide column names
-> DS
(juxt :V1 :V3))
(tc/group-by (:add-group-as-column ["v1" "v3"]})) (tc/ungroup {
_unnamed [9 6]:
v1 | v3 | :V1 | :V2 | :V3 | :V4 |
---|---|---|---|---|---|
1 | 0.5 | 1 | 1 | 0.5 | A |
1 | 0.5 | 1 | 7 | 0.5 | A |
2 | 1.0 | 2 | 2 | 1.0 | B |
2 | 1.0 | 2 | 8 | 1.0 | B |
1 | 1.5 | 1 | 3 | 1.5 | C |
1 | 1.5 | 1 | 9 | 1.5 | C |
2 | 0.5 | 2 | 4 | 0.5 | A |
1 | 1.0 | 1 | 5 | 1.0 | B |
2 | 1.5 | 2 | 6 | 1.5 | C |
Also we can supress separation
-> DS
(juxt :V1 :V3))
(tc/group-by (:separate? false
(tc/ungroup {:add-group-as-column true}))
;; => _unnamed [9 5]:
_unnamed [9 5]:
:$group-name | :V1 | :V2 | :V3 | :V4 |
---|---|---|---|---|
[1 0.5] | 1 | 1 | 0.5 | A |
[1 0.5] | 1 | 7 | 0.5 | A |
[2 1.0] | 2 | 2 | 1.0 | B |
[2 1.0] | 2 | 8 | 1.0 | B |
[1 1.5] | 1 | 3 | 1.5 | C |
[1 1.5] | 1 | 9 | 1.5 | C |
[2 0.5] | 2 | 4 | 0.5 | A |
[1 1.0] | 1 | 5 | 1.0 | B |
[2 1.5] | 2 | 6 | 1.5 | C |
To check if dataset is grouped or not just use grouped?
function.
(tc/grouped? DS)
nil
:V1)) (tc/grouped? (tc/group-by DS
true
If you want to remove grouping annotation (to make all the functions work as with regular dataset) you can use unmark-group
or as-regular-dataset
(alias) functions.
It can be important when you want to remove some groups (rows) from grouped dataset using drop-rows
or something like that.
-> DS
(:V1)
(tc/group-by
(tc/as-regular-dataset) (tc/grouped?))
nil
You can also operate on grouped dataset as a regular one in case you want to access its columns using without-grouping->
threading macro.
-> DS
(:V4 :V1])
(tc/group-by [
(tc/without-grouping->comp (juxt :V4 :V1) :name)))) (tc/order-by (
_unnamed [6 3]:
:name | :group-id | :data |
---|---|---|
{:V4 “A”, :V1 1} | 0 | Group: {:V4 “A”, :V1 1} [2 4]: |
{:V4 “A”, :V1 2} | 3 | Group: {:V4 “A”, :V1 2} [1 4]: |
{:V4 “B”, :V1 1} | 4 | Group: {:V4 “B”, :V1 1} [1 4]: |
{:V4 “B”, :V1 2} | 1 | Group: {:V4 “B”, :V1 2} [2 4]: |
{:V4 “C”, :V1 1} | 2 | Group: {:V4 “C”, :V1 1} [2 4]: |
{:V4 “C”, :V1 2} | 5 | Group: {:V4 “C”, :V1 2} [1 4]: |
This is considered internal.
If you want to implement your own mapping function on grouped dataset you can call process-group-data
and pass function operating on datasets. Result should be a dataset to have ungrouping working.
-> DS
(:V1)
(tc/group-by str "Shape: " (vector (tc/row-count %) (tc/column-count %))))
(tc/process-group-data #( (tc/as-regular-dataset))
_unnamed [2 3]:
:name | :group-id | :data |
---|---|---|
1 | 0 | Shape: [5 4] |
2 | 1 | Shape: [4 4] |
Column is a special tech.ml.dataset
structure. For our purposes we cat treat columns as typed and named sequence bound to particular dataset.
Type of the data is inferred from a sequence during column creation.
To select dataset columns or column names columns-selector
is used. columns-selector
can be one of the following:
:all
keyword - selects all columnstype
namespaced keyword for specific datatype or group of datatypesColumn name can be anything.
column-names
function returns names according to columns-selector
and optional meta-field
. meta-field
is one of the following:
:name
(default) - to operate on column names:datatype
- to operated on column types:all
- if you want to process all metadataDatatype groups are:
:type/numerical
- any numerical type:type/float
- floating point number (:float32
and :float64
):type/integer
- any integer:type/datetime
- any datetime typeIf qualified keyword starts with :!type
, complement set is used.
To select all column names you can use column-names
function.
(tc/column-names DS)
(:V1 :V2 :V3 :V4)
or
:all) (tc/column-names DS
(:V1 :V2 :V3 :V4)
In case you want to select column which has name :all
(or is sequence or map), put it into a vector. Below code returns empty sequence since there is no such column in the dataset.
:all]) (tc/column-names DS [
()
Obviously selecting single name returns it’s name if available
:V1)
(tc/column-names DS "no such column") (tc/column-names DS
(:V1)
()
Select sequence of column names.
:V1 "V2" :V3 :V4 :V5]) (tc/column-names DS [
(:V1 :V3 :V4)
Select names based on regex, columns ends with 1
or 4
#".*[14]") (tc/column-names DS
(:V1 :V4)
Select names based on regex operating on type of the column (to check what are the column types, call (tc/info DS :columns)
. Here we want to get integer columns only.
#"^:int.*" :datatype) (tc/column-names DS
(:V1 :V2)
or
:type/integer) (tc/column-names DS
(:V1 :V2)
And finally we can use predicate to select names. Let’s select double precision columns.
:float64} :datatype) (tc/column-names DS #{
(:V3)
or
:type/float64) (tc/column-names DS
(:V3)
If you want to select all columns but given, use complement
function. Works only on a predicate.
complement #{:V1}))
(tc/column-names DS (complement #{:float64}) :datatype)
(tc/column-names DS ( (tc/column-names DS :!type/float64)
(:V2 :V3 :V4)
(:V1 :V2 :V4)
(:V1 :V2 :V4)
You can select column names based on all column metadata at once by using :all
metadata selector. Below we want to select column names ending with 1
which have long
datatype.
fn [meta]
(tc/column-names DS (and (= :int64 (:datatype meta))
(:name meta) "1"))) :all) (clojure.string/ends-with? (
(:V1)
select-columns
creates dataset with columns selected by columns-selector
as described above. Function works on regular and grouped dataset.
Select only float64 columns
= :float64 %) :datatype) (tc/select-columns DS #(
_unnamed [9 1]:
:V3 |
---|
0.5 |
1.0 |
1.5 |
0.5 |
1.0 |
1.5 |
0.5 |
1.0 |
1.5 |
or
:type/float64) (tc/select-columns DS
_unnamed [9 1]:
:V3 |
---|
0.5 |
1.0 |
1.5 |
0.5 |
1.0 |
1.5 |
0.5 |
1.0 |
1.5 |
Select all but :V1
columns
complement #{:V1})) (tc/select-columns DS (
_unnamed [9 3]:
:V2 | :V3 | :V4 |
---|---|---|
1 | 0.5 | A |
2 | 1.0 | B |
3 | 1.5 | C |
4 | 0.5 | A |
5 | 1.0 | B |
6 | 1.5 | C |
7 | 0.5 | A |
8 | 1.0 | B |
9 | 1.5 | C |
If we have grouped data set, column selection is applied to every group separately.
-> DS
(:V1)
(tc/group-by :V2 :V3])
(tc/select-columns [ (tc/groups->map))
{1 Group: 1 [5 2]:
:V2 | :V3 |
---|---|
1 | 0.5 |
3 | 1.5 |
5 | 1.0 |
7 | 0.5 |
9 | 1.5 |
, 2 Group: 2 [4 2]:
:V2 | :V3 |
---|---|
2 | 1.0 |
4 | 0.5 |
6 | 1.5 |
8 | 1.0 |
}
drop-columns
creates dataset with removed columns.
Drop float64 columns
= :float64 %) :datatype) (tc/drop-columns DS #(
_unnamed [9 3]:
:V1 | :V2 | :V4 |
---|---|---|
1 | 1 | A |
2 | 2 | B |
1 | 3 | C |
2 | 4 | A |
1 | 5 | B |
2 | 6 | C |
1 | 7 | A |
2 | 8 | B |
1 | 9 | C |
or
:type/float64) (tc/drop-columns DS
_unnamed [9 3]:
:V1 | :V2 | :V4 |
---|---|---|
1 | 1 | A |
2 | 2 | B |
1 | 3 | C |
2 | 4 | A |
1 | 5 | B |
2 | 6 | C |
1 | 7 | A |
2 | 8 | B |
1 | 9 | C |
Drop all columns but :V1
and :V2
complement #{:V1 :V2})) (tc/drop-columns DS (
_unnamed [9 2]:
:V1 | :V2 |
---|---|
1 | 1 |
2 | 2 |
1 | 3 |
2 | 4 |
1 | 5 |
2 | 6 |
1 | 7 |
2 | 8 |
1 | 9 |
If we have grouped data set, column selection is applied to every group separately. Selected columns are dropped.
-> DS
(:V1)
(tc/group-by :V2 :V3])
(tc/drop-columns [ (tc/groups->map))
{1 Group: 1 [5 2]:
:V1 | :V4 |
---|---|
1 | A |
1 | C |
1 | B |
1 | A |
1 | C |
, 2 Group: 2 [4 2]:
:V1 | :V4 |
---|---|
2 | B |
2 | A |
2 | C |
2 | B |
}
If you want to rename colums use rename-columns
and pass map where keys are old names, values new ones.
You can also pass mapping function with optional columns-selector
:V1 "v1"
(tc/rename-columns DS {:V2 "v2"
:V3 [1 2 3]
:V4 (Object.)})
_unnamed [9 4]:
v1 | v2 | [1 2 3] | java.lang.Object@705ee1e0 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 4 | 0.5 | A |
1 | 5 | 1.0 | B |
2 | 6 | 1.5 | C |
1 | 7 | 0.5 | A |
2 | 8 | 1.0 | B |
1 | 9 | 1.5 | C |
Map all names with function
comp str second name)) (tc/rename-columns DS (
_unnamed [9 4]:
1 | 2 | 3 | 4 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 4 | 0.5 | A |
1 | 5 | 1.0 | B |
2 | 6 | 1.5 | C |
1 | 7 | 0.5 | A |
2 | 8 | 1.0 | B |
1 | 9 | 1.5 | C |
Map selected names with function
:V1 :V3] (comp str second name)) (tc/rename-columns DS [
_unnamed [9 4]:
1 | :V2 | 3 | :V4 |
---|---|---|---|
1 | 1 | 0.5 | A |
2 | 2 | 1.0 | B |
1 | 3 | 1.5 | C |
2 | 4 | 0.5 | A |
1 | 5 | 1.0 | B |
2 | 6 | 1.5 | C |
1 | 7 | 0.5 | A |
2 | 8 | 1.0 | B |
1 | 9 | 1.5 | C |
Function works on grouped dataset
-> DS
(:V1)
(tc/group-by :V1 "v1"
(tc/rename-columns {:V2 "v2"
:V3 [1 2 3]
:V4 (Object.)})
(tc/groups->map))
{1 Group: 1 [5 4]:
v1 | v2 | [1 2 3] | java.lang.Object@1eb7c845 |
---|---|---|---|
1 | 1 | 0.5 | A |
1 | 3 | 1.5 | C |
1 | 5 | 1.0 | B |
1 | 7 | 0.5 | A |
1 | 9 | 1.5 | C |
, 2 Group: 2 [4 4]:
v1 | v2 | [1 2 3] | java.lang.Object@1eb7c845 |
---|---|---|---|
2 | 2 | 1.0 | B |
2 | 4 | 0.5 | A |
2 | 6 | 1.5 | C |
2 | 8 | 1.0 | B |
}
To add (or replace existing) column call add-column
function. Function accepts:
ds
- a datasetcolumn-name
- if it’s existing column name, column will be replacedcolumn
- can be column (from other dataset), sequence, single value or function. Too big columns are always trimmed. Too small are cycled or extended with missing values (according to size-strategy
argument)size-strategy
(optional) - when new column is shorter than dataset row count, following strategies are applied:
:cycle
- repeat data:na
- append missing values:strict
- (default) throws an exception when sizes mismatchFunction works on grouped dataset.
Add single value as column
:V5 "X") (tc/add-column DS
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :V5 |
---|---|---|---|---|
1 | 1 | 0.5 | A | X |
2 | 2 | 1.0 | B | X |
1 | 3 | 1.5 | C | X |
2 | 4 | 0.5 | A | X |
1 | 5 | 1.0 | B | X |
2 | 6 | 1.5 | C | X |
1 | 7 | 0.5 | A | X |
2 | 8 | 1.0 | B | X |
1 | 9 | 1.5 | C | X |
Replace one column (column is trimmed)
:V1 (repeatedly rand)) (tc/add-column DS
_unnamed [9 4]:
:V1 | :V2 | :V3 | :V4 |
---|---|---|---|
0.41257790 | 1 | 0.5 | A |
0.55281322 | 2 | 1.0 | B |
0.15073657 | 3 | 1.5 | C |
0.18311522 | 4 | 0.5 | A |
0.76664862 | 5 | 1.0 | B |
0.95296580 | 6 | 1.5 | C |
0.24539929 | 7 | 0.5 | A |
0.24504991 | 8 | 1.0 | B |
0.10504470 | 9 | 1.5 | C |
Copy column
:V5 (DS :V1)) (tc/add-column DS
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :V5 |
---|---|---|---|---|
1 | 1 | 0.5 | A | 1 |
2 | 2 | 1.0 | B | 2 |
1 | 3 | 1.5 | C | 1 |
2 | 4 | 0.5 | A | 2 |
1 | 5 | 1.0 | B | 1 |
2 | 6 | 1.5 | C | 2 |
1 | 7 | 0.5 | A | 1 |
2 | 8 | 1.0 | B | 2 |
1 | 9 | 1.5 | C | 1 |
When function is used, argument is whole dataset and the result should be column, sequence or single value
:row-count tc/row-count) (tc/add-column DS
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :row-count |
---|---|---|---|---|
1 | 1 | 0.5 | A | 9 |
2 | 2 | 1.0 | B | 9 |
1 | 3 | 1.5 | C | 9 |
2 | 4 | 0.5 | A | 9 |
1 | 5 | 1.0 | B | 9 |
2 | 6 | 1.5 | C | 9 |
1 | 7 | 0.5 | A | 9 |
2 | 8 | 1.0 | B | 9 |
1 | 9 | 1.5 | C | 9 |
Above example run on grouped dataset, applies function on each group separately.
-> DS
(:V1)
(tc/group-by :row-count tc/row-count)
(tc/add-column (tc/ungroup))
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :row-count |
---|---|---|---|---|
1 | 1 | 0.5 | A | 5 |
1 | 3 | 1.5 | C | 5 |
1 | 5 | 1.0 | B | 5 |
1 | 7 | 0.5 | A | 5 |
1 | 9 | 1.5 | C | 5 |
2 | 2 | 1.0 | B | 4 |
2 | 4 | 0.5 | A | 4 |
2 | 6 | 1.5 | C | 4 |
2 | 8 | 1.0 | B | 4 |
When column which is added is longer than row count in dataset, column is trimmed. When column is shorter, it’s cycled or missing values are appended.
:V5 [:r :b] :cycle) (tc/add-column DS
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :V5 |
---|---|---|---|---|
1 | 1 | 0.5 | A | :r |
2 | 2 | 1.0 | B | :b |
1 | 3 | 1.5 | C | :r |
2 | 4 | 0.5 | A | :b |
1 | 5 | 1.0 | B | :r |
2 | 6 | 1.5 | C | :b |
1 | 7 | 0.5 | A | :r |
2 | 8 | 1.0 | B | :b |
1 | 9 | 1.5 | C | :r |
:V5 [:r :b] :na) (tc/add-column DS
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :V5 |
---|---|---|---|---|
1 | 1 | 0.5 | A | :r |
2 | 2 | 1.0 | B | :b |
1 | 3 | 1.5 | C | |
2 | 4 | 0.5 | A | |
1 | 5 | 1.0 | B | |
2 | 6 | 1.5 | C | |
1 | 7 | 0.5 | A | |
2 | 8 | 1.0 | B | |
1 | 9 | 1.5 | C |
Exception is thrown when :strict
(default) strategy is used and column size is not equal row count
try
(:V5 [:r :b])
(tc/add-column DS catch Exception e (str "Exception caught: "(ex-message e)))) (
"Exception caught: Column size (2) should be exactly the same as dataset row count (9). Consider `:cycle` or `:na` strategy."
Tha same applies for grouped dataset
-> DS
(:V3)
(tc/group-by :V5 [:r :b] :na)
(tc/add-column (tc/ungroup))
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :V5 |
---|---|---|---|---|
1 | 1 | 0.5 | A | :r |
2 | 4 | 0.5 | A | :b |
1 | 7 | 0.5 | A | |
2 | 2 | 1.0 | B | :r |
1 | 5 | 1.0 | B | :b |
2 | 8 | 1.0 | B | |
1 | 3 | 1.5 | C | :r |
2 | 6 | 1.5 | C | :b |
1 | 9 | 1.5 | C |
Let’s use other column to fill groups
-> DS
(:V3)
(tc/group-by :V5 (DS :V2) :cycle)
(tc/add-column (tc/ungroup))
_unnamed [9 5]:
:V1 | :V2 | :V3 | :V4 | :V5 |
---|---|---|---|---|
1 | 1 | 0.5 | A | 1 |
2 | 4 | 0.5 | A | 2 |
1 | 7 | 0.5 | A | 3 |
2 | 2 | 1.0 | B | 1 |
1 | 5 | 1.0 | B | 2 |
2 | 8 | 1.0 | B | 3 |
1 | 3 | 1.5 | C | 1 |
2 | 6 | 1.5 | C | 2 |
1 | 9 | 1.5 | C | 3 |
In case you want to add or update several columns you can call add-columns
and provide map where keys are column names, vals are columns.
:V1 #(map inc (% :V1))
(tc/add-columns DS {:V5 #(map (comp keyword str) (% :V4))
:V6 11})
_unnamed [9 6]:
:V1 | :V2 | :V3 | :V4 | :V5 | :V6 |
---|---|---|---|---|---|
2 | 1 | 0.5 | A | :A | 11 |