tech.ml.dataset is a great and fast library which brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as a part of the bigger tech.ml stack.
I've started to test the library and help fix uncovered bugs. My main goal was to compare its functionality with standards from other platforms. I focused on R solutions: dplyr, tidyr and data.table.
While converting the examples I came up with a way to reorganize existing tech.ml.dataset functions into a simple-to-use API. The main goals were:
Focus on dataset manipulation functionality, leaving out other parts of tech.ml like pipelines, datatypes, readers, ML, etc.
Single entry point for common operations - one function dispatching on given arguments.
group-by results in a special kind of dataset - a dataset containing subsets created after grouping as a column.
Most operations recognize regular dataset and grouped dataset and process data accordingly.
One function form to enable thread-first on dataset.
If you want to know more about tech.ml.dataset and dtype-next please refer to their documentation:
Dataset is a special type which can be considered a map of columns, implemented on top of the tech.ml.dataset library. Each column can be considered a named sequence of typed data. Supported types include integers, floats, strings, booleans, date/time, objects, etc.
Dataset creation
A dataset can be created from various types of Clojure structures and files:
single values
sequence of maps
map of sequences or values
sequence of tech.v3.dataset.columns (taken from other dataset or created manually)
array of any arrays
file types: raw/gzipped csv/tsv, json, xls(x) taken from local file system or URL
input stream
tc/dataset accepts:
data
options (see documentation of tech.ml.dataset/->dataset function for full list):
:dataset-name - name of the dataset
:num-rows - number of rows to read from file
:header-row? - indication if first row in file is a header
:key-fn - function applied to column names (eg. keyword, to convert column names to keywords)
:separator - column separator
:single-value-column-name - name of the column when single value is provided
:column-names - in case you want to name columns - only works for sequential input (arrays) or empty dataset
:layout - for numerical, native array of arrays - treat entries :as-rows or :as-columns (default)
tc/let-dataset accepts bindings symbol-column-data to simulate R's tibble function. Each binding is converted into a column. You can refer to previous columns in further bindings (as in let).
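A minimal sketch, assuming tablecloth.api is aliased as tc (the binding names :x, :y and :z are arbitrary):

```clojure
;; bindings behave like `let`: z can refer to x and y
(tc/let-dataset [x (range 1 4)            ;; column :x -> 1 2 3
                 y 5                      ;; single value, repeated per row
                 z (map + x (repeat y))]) ;; column :z built from :x and y
```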
Empty dataset.
(tc/dataset)
_unnamed [0 0]
Empty dataset with column names
(tc/dataset nil {:column-names [:a :b]})

_unnamed [0 2]:

| :a | :b |
|----|----|
Dataset created from map (keys = column names, vals = value(s)).
(tc/dataset {:A 33})

_unnamed [1 1]:

| :A |
|---:|
| 33 |
(tc/dataset {:A [1 2 3]})

_unnamed [3 1]:

| :A |
|---:|
|  1 |
|  2 |
|  3 |
(tc/dataset {:A [3 4 5] :B ["X" "Y" "Z"]})

_unnamed [3 2]:

| :A | :B |
|---:|----|
|  3 |  X |
|  4 |  Y |
|  5 |  Z |
Non-sequential values are repeated row-count number of times.
Get a single value from the dataset using get-in from the Clojure API or get-entry. The first argument is the column name, the second is the row number.
(get-in ds ["wind" 2])
2.3
(tc/get-entry ds "wind" 2)
2.3
Printing
A dataset is printed using the dataset->str or print-dataset functions. Options are the same as in tech.ml.dataset/dataset-data->str. Most important is :print-line-policy which can be one of: :single, :repl or :markdown.
Grouping by is an operation which splits a dataset into subdatasets and packs them into a new, special type of… dataset. I distinguish two types of dataset: a regular dataset and a grouped dataset. The latter is the result of grouping.
A grouped dataset is annotated by the :grouped? meta tag and consists of the following columns:
:name - group name or structure
:group-id - integer assigned to the group
:data - groups as datasets
Almost all functions recognize type of the dataset (grouped or not) and operate accordingly.
You can’t apply reshaping or join/concat functions on grouped datasets.
Grouping
Grouping is done by calling group-by function with arguments:
ds - dataset
grouping-selector - what to use for grouping
options:
:result-type - what to return:
:as-dataset (default) - return grouped dataset
:as-indexes - return rows ids (row number from original dataset)
:as-map - return map with group names as keys and subdataset as values
:as-seq - return sequence of subdatasets
:select-keys - list of the columns passed to a grouping selector function
All subdatasets (groups) have their name set to the group name; additionally, group-id is stored in the metadata.
Grouping can be done by:
single column name
seq of column names
map of keys (group names) and row indexes
value returned by function taking row as map (limited to :select-keys)
Note: currently a dataset inside a dataset is printed recursively, so it renders poorly in markdown. I will use the :as-seq result type to show just group names and groups.
List of columns in grouped dataset
(-> DS (tc/group-by :V1) (tc/column-names))
(:V1 :V2 :V3 :V4)
List of columns in grouped dataset treated as regular dataset
You can group by the result of a grouping function which gets a row as a map and should return a group name. When a map is used as a group name, ungrouping restores the original column names.
Ungrouping simply concats all the groups into a dataset. The following options are possible:
:order? - order groups according to the group name ascending order. Default: false
:add-group-as-column - should group name become a column? If yes column is created with provided name (or :$group-name if argument is true). Default: nil.
:add-group-id-as-column - should group id become a column? If yes column is created with provided name (or :$group-id if argument is true). Default: nil.
:dataset-name - to name resulting dataset. Default: nil (_unnamed)
If a group name is a map, it will be split into separate columns. Be sure that the groups (subdatasets) don't contain the same columns already.
If a group name is a vector, it will be split into separate columns. If you want to name them, set a vector of target column names as the :add-group-as-column argument.
After ungrouping, order of the rows is kept within the groups but groups are ordered according to the internal storage.
Grouping and ungrouping.
(-> DS (tc/group-by :V3) (tc/ungroup))

_unnamed [9 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   5 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   6 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |
Groups sorted by group name and named.
(-> DS (tc/group-by :V3) (tc/ungroup {:order? true :dataset-name "Ordered by V3"}))

Ordered by V3 [9 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   5 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   6 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |
Groups sorted descending by group name and named.
(-> DS (tc/group-by :V3) (tc/ungroup {:order? :desc :dataset-name "Ordered by V3 descending"}))
To check whether a dataset is grouped or not just use the grouped? function.
(tc/grouped? DS)
nil
(tc/grouped? (tc/group-by DS :V1))
true
If you want to remove grouping annotation (to make all the functions work as with regular dataset) you can use unmark-group or as-regular-dataset (alias) functions.
It can be important when you want to remove some groups (rows) from grouped dataset using drop-rows or something like that.
If you want to implement your own mapping function on a grouped dataset you can call process-group-data and pass a function operating on datasets. The result should be a dataset for ungrouping to work.
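A minimal sketch, assuming the DS example dataset used throughout: take the first row of every group, then ungroup.

```clojure
(-> DS
    (tc/group-by :V4)                  ;; grouped dataset with :data column
    (tc/process-group-data #(tc/head % 1)) ;; applied to each group dataset
    (tc/ungroup))
```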
Column is a special tech.ml.dataset structure. For our purposes we can treat columns as typed and named sequences bound to a particular dataset.
Type of the data is inferred from a sequence during column creation.
Names
To select dataset columns or column names columns-selector is used. columns-selector can be one of the following:
:all keyword - selects all columns
column name - for single column
sequence of column names - for collection of columns
regex - to apply pattern on column names or datatype
filter predicate - to filter column names or datatype
type namespaced keyword for specific datatype or group of datatypes
Column name can be anything.
column-names function returns names according to columns-selector and optional meta-field. meta-field is one of the following:
:name (default) - to operate on column names
:datatype - to operate on column types
:all - if you want to process all metadata
Datatype groups are:
:type/numerical - any numerical type
:type/float - floating point number (:float32 and :float64)
:type/integer - any integer
:type/datetime - any datetime type
If qualified keyword starts with :!type, complement set is used.
To select all column names you can use column-names function.
(tc/column-names DS)
(:V1 :V2 :V3 :V4)
or
(tc/column-names DS :all)
(:V1 :V2 :V3 :V4)
In case you want to select a column whose name is :all (or is a sequence or a map), put it into a vector. The code below returns an empty sequence since there is no such column in the dataset.
(tc/column-names DS [:all])
()
Obviously selecting a single name returns its name if available.
(tc/column-names DS :V1)
(:V1)
(tc/column-names DS "no such column")
()
Select sequence of column names.
(tc/column-names DS [:V1 "V2" :V3 :V4 :V5])
(:V1 :V3 :V4)
Select names based on a regex, matching column names ending with 1 or 4.
(tc/column-names DS #".*[14]")
(:V1 :V4)
Select names based on a regex operating on the type of the column (to check what the column types are, call (tc/info DS :columns)). Here we want to get integer columns only.
(tc/column-names DS #"^:int.*" :datatype)
(:V1 :V2)
or
(tc/column-names DS :type/integer)
(:V1 :V2)
And finally we can use predicate to select names. Let’s select double precision columns.
(tc/column-names DS #{:float64} :datatype)
(:V3)
or
(tc/column-names DS :type/float64)
(:V3)
If you want to select all columns except the given ones, use the complement function. This works only on a predicate.
You can select column names based on all column metadata at once by using :all metadata selector. Below we want to select column names ending with 1 which have long datatype.
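A sketch of such a selector, assuming the DS example dataset; with :all, the predicate receives the whole metadata map:

```clojure
;; columns whose name ends with "1" and whose datatype is :int64
(tc/column-names DS (fn [{:keys [name datatype]}]
                      (and (= datatype :int64)
                           (clojure.string/ends-with? (str name) "1")))
                 :all)
```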
To add (or replace existing) column call add-column function. Function accepts:
ds - a dataset
column-name - if it’s existing column name, column will be replaced
column - can be a column (from another dataset), a sequence, a single value or a function. Columns that are too long are always trimmed; too-short ones are cycled or extended with missing values (according to the size-strategy argument)
size-strategy (optional) - when new column is shorter than dataset row count, following strategies are applied:
:cycle - repeat data
:na - append missing values
:strict - (default) throws an exception when sizes mismatch
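For example (a sketch using the DS example dataset): adding a two-element column with the :cycle strategy repeats its data up to the full row count.

```clojure
;; DS has 9 rows; [:r :b] is cycled to fill them all
(-> DS (tc/add-column :V5 [:r :b] :cycle))
```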
The other way of creating or updating a column is to map rows as with the regular map function. The arity of the mapping function should be the same as the number of selected columns.
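A sketch using map-columns on the DS example dataset; the mapping function receives one value per selected column:

```clojure
;; new column :sum-v1-v2 built row by row from :V1 and :V2
(-> DS (tc/map-columns :sum-v1-v2 [:V1 :V2] (fn [v1 v2] (+ v1 v2))))
```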
Converting a column to a given datatype can be done using the convert-types function. Not all types can be converted automatically, and some conversions require slow parsing (every conversion from string). In cases where conversion is not possible you can pass a conversion function.
Arguments:
ds - dataset
Two options:
coltype-map in case when you want to convert several columns, keys are column names, vals are new types
column-selector and new-types - column name and new datatype (or datatypes as sequence)
new-types can be:
a type like :int64 or :string or sequence of types
or a sequence of pairs of datatype and conversion function
After conversion, additional information is given on problematic values.
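For example (a sketch on the DS example dataset), converting a single column or several at once via a coltype map:

```clojure
(tc/convert-types DS :V1 :float64)
(tc/convert-types DS {:V1 :float64
                      :V2 :string})
```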
The other conversion is casting a column into a Java array (->array) of the column's type, or of a type provided as an argument. A grouped dataset returns a sequence of arrays.
You can also cast the type to the other one (if casting is possible):
(tc/->array DS :V4 :string)
["A", "B", "C", "A", "B", "C", "A", "B", "C"]
(tc/->array DS :V1 :float32)
[1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
Column Operations
There are a large number of column operations that can be performed on the columns in your dataset. These operations are a similar set as the Column API operations, but instead of operating directly on columns, they take a Dataset and a columns-selector.
The behavior of the operations differ based on their return value:
If an operation would return a scalar value, then the function behaves like an aggregator and can be used in group-by expressions. Such functions will have the general signature:
(dataset columns-selector) => dataset
If an operation would return another column, then the function will not behave like an aggregator. Instead, it asks you to specify a target column, which it will add to the dataset that it returns. The signature of these functions is:
(ds target-col columns-selector) => dataset
As there are a large number of operations, we will illustrate their usage. To begin with, here are some examples of operations whose result is a new column:
DS
_unnamed [9 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   4 | 0.5 |   A |
|   1 |   5 | 1.0 |   B |
|   2 |   6 | 1.5 |   C |
|   1 |   7 | 0.5 |   A |
|   2 |   8 | 1.0 |   B |
|   1 |   9 | 1.5 |   C |
(-> DS (tc/+ :SUM [:V1 :V2]))

_unnamed [9 5]:

| :V1 | :V2 | :V3 | :V4 | :SUM |
|----:|----:|----:|-----|-----:|
|   1 |   1 | 0.5 |   A |    2 |
|   2 |   2 | 1.0 |   B |    4 |
|   1 |   3 | 1.5 |   C |    4 |
|   2 |   4 | 0.5 |   A |    6 |
|   1 |   5 | 1.0 |   B |    6 |
|   2 |   6 | 1.5 |   C |    8 |
|   1 |   7 | 0.5 |   A |    8 |
|   2 |   8 | 1.0 |   B |   10 |
|   1 |   9 | 1.5 |   C |   10 |
(-> DS (tc/* :MULTIPLY [:V1 :V2]))

_unnamed [9 5]:

| :V1 | :V2 | :V3 | :V4 | :MULTIPLY |
|----:|----:|----:|-----|----------:|
|   1 |   1 | 0.5 |   A |         1 |
|   2 |   2 | 1.0 |   B |         4 |
|   1 |   3 | 1.5 |   C |         3 |
|   2 |   4 | 0.5 |   A |         8 |
|   1 |   5 | 1.0 |   B |         5 |
|   2 |   6 | 1.5 |   C |        12 |
|   1 |   7 | 0.5 |   A |         7 |
|   2 |   8 | 1.0 |   B |        16 |
|   1 |   9 | 1.5 |   C |         9 |
Now let’s look at operations that would return a scalar:
(-> DS (tc/mean [:V1]))

_unnamed [1 1]:

|    summary |
|-----------:|
| 1.44444444 |
Notice that we did not supply a target column to the mean function. Since mean does not return a column, we do not provide this argument. Instead, we simply provide the dataset and a columns-selector. We then get back a dataset with the result.
Now let’s use this function within a grouping expression:
(-> DS (tc/group-by [:V4]) (tc/mean [:V2]))

_unnamed [3 2]:

| :V4 | summary |
|-----|--------:|
|   A |     4.0 |
|   B |     5.0 |
|   C |     6.0 |
In this example, we grouped DS and then passed the resulting grouped Dataset directly to the mean operation. Since mean returns a scalar, it operates as an aggregator, summarizing the results of each group.
Rows
Rows can be selected or dropped using various selectors:
row id(s) - row index as a number or a sequence of numbers (the first row has index 0, the second 1 and so on)
sequence of true/false values
filter by predicate (argument is row as a map)
When predicate is used you may want to limit columns passed to the function (select-keys option).
Additionally you may want to precalculate some values which will be visible to the predicate as additional columns. It's done internally by calling add-columns on a dataset; :pre is used as the column definitions.
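A sketch, assuming the DS example dataset: precalculate a :mean column via :pre and use it in the predicate (this works the same on a grouped dataset).

```clojure
(require '[tech.v3.datatype.functional :as dfn])

(-> DS
    (tc/group-by :V4)
    (tc/select-rows (fn [row] (<= (:mean row) (:V2 row)))
                    {:pre {:mean #(dfn/mean (% :V2))}}) ;; per-group column
    (tc/ungroup))
```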
Select
Select fifth row
(tc/select-rows DS 4)
_unnamed [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   5 | 1.0 |   B |
Select 3 rows
(tc/select-rows DS [1 4 5])

_unnamed [3 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   1 |   5 | 1.0 |   B |
|   2 |   6 | 1.5 |   C |
Select rows using sequence of true/false values
(tc/select-rows DS [true nil nil true])

_unnamed [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
Select rows using predicate
(tc/select-rows DS (comp #(< % 1) :V3))

_unnamed [3 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |
The same works on grouped dataset, let’s select first row from every group.
Aggregating produces a single row out of a dataset.
An aggregator is a function, or a sequence or map of functions, which accepts a dataset as an argument and returns a single value, a sequence of values or a map.
Where a map is given as an input or returned as a result, keys are treated as column names.
A grouped dataset is ungrouped after aggregation. This can be turned off by setting :ungroup to false. In case you want to pass additional ungrouping parameters, add them to the options.
By default resulting column names are prefixed with the summary prefix (set it with the :default-column-name-prefix option).
Let's calculate the sum of column :V2.
(tc/aggregate DS #(reduce + (% :V2)))

_unnamed [1 1]:

| summary |
|--------:|
|      45 |
Let’s give resulting column a name.
(tc/aggregate DS {:sum-of-V2 #(reduce + (% :V2))})

_unnamed [1 1]:

| :sum-of-V2 |
|-----------:|
|         45 |
Sequential result is spread into separate columns
(tc/aggregate DS #(take 5 (% :V2)))

_unnamed [1 5]:

| :summary-0 | :summary-1 | :summary-2 | :summary-3 | :summary-4 |
|-----------:|-----------:|-----------:|-----------:|-----------:|
|          1 |          2 |          3 |          4 |          5 |
You can combine all variants and rename the default prefix.
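A sketch of combining a plain function with a map-returning one, and a custom prefix:

```clojure
(tc/aggregate DS [#(reduce + (% :V1))                   ;; unnamed -> prefixed
                  (fn [ds] {:max-v2 (reduce max (ds :V2))})] ;; named by key
              {:default-column-name-prefix "agg"})
```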
You can also perform columnar aggregation. aggregate-columns selects columns and applies an aggregating function (or a sequence of functions) to each column separately.
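For example (a sketch on the DS example dataset), summing :V1 and :V2 independently:

```clojure
(tc/aggregate-columns DS [:V1 :V2] #(reduce + %))
```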
Cross tabulation built from two sets of columns. First rows and cols are used to construct grouped dataset, then aggregation function is applied for each pair. By default it counts rows from each group.
Options are:
:aggregator - function which aggregates values of the grouped dataset; the default is row-count
:marginal-rows and :marginal-cols - if true, sum of rows and cols are added as an additional columns and row. May be custom function which accepts pure row and col as a seq.
:replace-missing? - should missing values be replaced (default: true) with :missing-value (default: 0)
:pivot? - if false, flat aggregation result is returned (default: false)
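A minimal sketch, counting co-occurrences of :V1 and :V4 in the DS example dataset:

```clojure
(tc/crosstab DS :V1 :V4)
;; with row/column sums added
(tc/crosstab DS :V1 :V4 {:marginal-rows true :marginal-cols true})
```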
A custom comparator can also be used in case objects are not comparable by default. Let's define an artificial one: if the Euclidean distance is lower than 2, compare along z, else along x and y. We use the first three columns for that.
unique-by removes rows which contain the same data. By default it removes duplicates from the whole dataset. You can also pass a list of columns or functions (similar to group-by) to restrict duplicate detection to them. The default strategy is to keep the first row. More strategies below.
unique-by works on groups
Remove duplicates from whole dataset
(tc/unique-by DS)
_unnamed [9 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   4 | 0.5 |   A |
|   1 |   5 | 1.0 |   B |
|   2 |   6 | 1.5 |   C |
|   1 |   7 | 0.5 |   A |
|   2 |   8 | 1.0 |   B |
|   1 |   9 | 1.5 |   C |
Remove duplicates from each group selected by column.
(tc/unique-by DS :V1)

_unnamed [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
Pair of columns
(tc/unique-by DS [:V1 :V3])

_unnamed [6 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   4 | 0.5 |   A |
|   1 |   5 | 1.0 |   B |
|   2 |   6 | 1.5 |   C |
A function can also be used; here we split the dataset by modulo 3 on column :V2.
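The example described above could look like this (a sketch; the function receives each row as a map):

```clojure
(tc/unique-by DS (fn [m] (mod (:V2 m) 3)))
```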
Missing values can be replaced using several strategies. replace-missing accepts:
dataset
column selector, default: :all
strategy, default: :nearest
value (optional)
single value
sequence of values (cycled)
function, applied on column(s) with stripped missings
map with [index,value] pairs
Strategies are:
:value - replace with given value
:up - copy values up
:down - copy values down
:updown - copy values up and then down for missing values at the end
:downup - copy values down and then up for missing values at the beginning
:mid or :nearest - copy values around known values
:midpoint - use average value from previous and next non-missing
:lerp - tries to linearly approximate values; works for numbers and datetimes, otherwise applies :nearest. For numbers it always results in a float datatype.
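A sketch on a small, hypothetical dataset:

```clojure
(def with-missing (tc/dataset {:a [nil 2 nil 4 nil]}))

(tc/replace-missing with-missing :a :value 0) ;; nil -> 0
(tc/replace-missing with-missing :a :down)    ;; copy known values downward
(tc/replace-missing with-missing :a :lerp)    ;; linear interpolation
```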
To pack or unpack the data into single value you can use fold-by and unroll functions.
fold-by groups dataset and packs columns data from each group separately into desired datastructure (like vector or sequence). unroll does the opposite.
unroll unfolds sequences stored in data, multiplying other ones when necessary. You can unroll more than one column at once (folded data should have the same size!).
Options:
:indexes? if true (or column name), information about index of unrolled sequence is added.
:datatypes list of datatypes which should be applied to restored columns, a map
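For example (a sketch on the DS example dataset): fold rows by :V3/:V4, then unroll back.

```clojure
(-> DS
    (tc/fold-by [:V3 :V4])  ;; :V1 and :V2 become vectors per group
    (tc/unroll [:V1 :V2]))  ;; restore one row per element
```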
When column names contain observation data, such column names can be split and the data can be restored into separate columns.
(def who (tc/dataset "data/who.csv.gz"))
(->> who (tc/column-names) (take 10) (tc/select-columns who))
data/who.csv.gz [7240 10]:

| country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 |
|---------|------|------|-----:|------------:|-------------:|-------------:|-------------:|-------------:|-------------:|
| Afghanistan | AF | AFG | 1980 | | | | | | |
| Afghanistan | AF | AFG | 1981 | | | | | | |
| Afghanistan | AF | AFG | 1982 | | | | | | |
| Afghanistan | AF | AFG | 1983 | | | | | | |
| Afghanistan | AF | AFG | 1984 | | | | | | |
| Afghanistan | AF | AFG | 1985 | | | | | | |
| Afghanistan | AF | AFG | 1986 | | | | | | |
| Afghanistan | AF | AFG | 1987 | | | | | | |
| Afghanistan | AF | AFG | 1988 | | | | | | |
| Afghanistan | AF | AFG | 1989 | | | | | | |
| … | … | … | … | … | … | … | … | … | … |
| Zimbabwe | ZW | ZWE | 2003 | 133 | 874 | 3048 | 2228 | 981 | 367 |
| Zimbabwe | ZW | ZWE | 2004 | 187 | 833 | 2908 | 2298 | 1056 | 366 |
| Zimbabwe | ZW | ZWE | 2005 | 210 | 837 | 2264 | 1855 | 762 | 295 |
| Zimbabwe | ZW | ZWE | 2006 | 215 | 736 | 2391 | 1939 | 896 | 348 |
| Zimbabwe | ZW | ZWE | 2007 | 138 | 500 | 3693 | 0 | 716 | 292 |
| Zimbabwe | ZW | ZWE | 2008 | 127 | 614 | 0 | 3316 | 704 | 263 |
| Zimbabwe | ZW | ZWE | 2009 | 125 | 578 | 3471 | | 681 | 293 |
| Zimbabwe | ZW | ZWE | 2010 | 150 | 710 | 2208 | 1682 | 761 | 350 |
| Zimbabwe | ZW | ZWE | 2011 | 152 | 784 | 2467 | 2071 | 780 | 377 |
| Zimbabwe | ZW | ZWE | 2012 | 120 | 783 | 2421 | 2086 | 796 | 360 |
| Zimbabwe | ZW | ZWE | 2013 | | | | | | |
(tc/pivot->longer who #(clojure.string/starts-with? % "new") {:target-columns [:diagnosis :gender :age] :splitter #"new_?(.*)_(.)(.*)" :value-column-name :count})

data/who.csv.gz [76046 8]:

| country | iso2 | iso3 | year | :diagnosis | :gender | :age | :count |
|---------|------|------|-----:|------------|---------|------|-------:|
| Albania | AL | ALB | 2013 | rel | m | 1524 | 60 |
| Algeria | DZ | DZA | 2013 | rel | m | 1524 | 1021 |
| Andorra | AD | AND | 2013 | rel | m | 1524 | 0 |
| Angola | AO | AGO | 2013 | rel | m | 1524 | 2992 |
| Anguilla | AI | AIA | 2013 | rel | m | 1524 | 0 |
| Antigua and Barbuda | AG | ATG | 2013 | rel | m | 1524 | 1 |
| Argentina | AR | ARG | 2013 | rel | m | 1524 | 1124 |
| Armenia | AM | ARM | 2013 | rel | m | 1524 | 116 |
| Australia | AU | AUS | 2013 | rel | m | 1524 | 105 |
| Austria | AT | AUT | 2013 | rel | m | 1524 | 44 |
| … | … | … | … | … | … | … | … |
| United Arab Emirates | AE | ARE | 2013 | rel | m | 2534 | 9 |
| United Kingdom of Great Britain and Northern Ireland | GB | GBR | 2013 | rel | m | 2534 | 1158 |
| United States of America | US | USA | 2013 | rel | m | 2534 | 829 |
| Uruguay | UY | URY | 2013 | rel | m | 2534 | 142 |
| Uzbekistan | UZ | UZB | 2013 | rel | m | 2534 | 2371 |
| Vanuatu | VU | VUT | 2013 | rel | m | 2534 | 9 |
| Venezuela (Bolivarian Republic of) | VE | VEN | 2013 | rel | m | 2534 | 739 |
| Viet Nam | VN | VNM | 2013 | rel | m | 2534 | 6302 |
| Yemen | YE | YEM | 2013 | rel | m | 2534 | 1113 |
| Zambia | ZM | ZMB | 2013 | rel | m | 2534 | 7808 |
| Zimbabwe | ZW | ZWE | 2013 | rel | m | 2534 | 5331 |
When the data contains multiple observations per row, we can use a splitter and a pattern for target columns to create new columns and put values there. In the following dataset we have two observations, dob and gender, for two children. We want to put child information into a column and leave dob and gender as values.
(def family (tc/dataset "data/family.csv"))
family
data/family.csv [5 5]:

| family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
|-------:|------------|------------|--------------:|--------------:|
| 1 | 1998-11-26 | 2000-01-29 | 1 | 2 |
| 2 | 1996-06-22 | | 2 | |
| 3 | 2002-07-11 | 2004-04-05 | 2 | 2 |
| 4 | 2004-10-10 | 2009-08-27 | 1 | 1 |
| 5 | 2000-12-05 | 2005-02-28 | 2 | 1 |
(tc/pivot->longer family (complement #{"family"}) {:target-columns [nil :child] :splitter "_" :datatypes {"gender" :int16}})

data/family.csv [9 4]:

| family | :child | dob | gender |
|-------:|--------|------------|-------:|
| 1 | child1 | 1998-11-26 | 1 |
| 2 | child1 | 1996-06-22 | 2 |
| 3 | child1 | 2002-07-11 | 2 |
| 4 | child1 | 2004-10-10 | 1 |
| 5 | child1 | 2000-12-05 | 2 |
| 1 | child2 | 2000-01-29 | 2 |
| 3 | child2 | 2004-04-05 | 2 |
| 4 | child2 | 2009-08-27 | 1 |
| 5 | child2 | 2005-02-28 | 1 |
Similar here, we have two observations: x and y in four groups.
columns-selector - values from selected columns are converted to new columns
value-columns - what are values
When multiple columns are used as the columns selector, names are joined using the :concat-columns-with option. :concat-columns-with can be a string or a function (default: "_"). The function accepts a sequence of names.
When columns-selector creates non unique set of values, they are folded using :fold-fn (default: vec) option.
When value-columns is a sequence, multiple observations as columns are created by appending value column names to the new columns. Column names are joined using the :concat-value-with option. :concat-value-with can be a string or a function (default: "-"). The function accepts the current column name and value.
Use station as a name source for columns and seen for values
(def fish (tc/dataset "data/fish_encounters.csv"))
fish
data/fish_encounters.csv [114 3]:

| fish | station | seen |
|-----:|---------|-----:|
| 4842 | Release | 1 |
| 4842 | I80_1 | 1 |
| 4842 | Lisbon | 1 |
| 4842 | Rstr | 1 |
| 4842 | Base_TD | 1 |
| 4842 | BCE | 1 |
| 4842 | BCW | 1 |
| 4842 | BCE2 | 1 |
| 4842 | BCW2 | 1 |
| 4842 | MAE | 1 |
| … | … | … |
| 4862 | BCE | 1 |
| 4862 | BCW | 1 |
| 4862 | BCE2 | 1 |
| 4862 | BCW2 | 1 |
| 4863 | Release | 1 |
| 4863 | I80_1 | 1 |
| 4864 | Release | 1 |
| 4864 | I80_1 | 1 |
| 4865 | Release | 1 |
| 4865 | I80_1 | 1 |
| 4865 | Lisbon | 1 |
(tc/pivot->wider fish "station" "seen" {:drop-missing? false})

data/fish_encounters.csv [19 12]:

| fish | Release | I80_1 | Lisbon | Rstr | Base_TD | BCE | BCW | BCE2 | BCW2 | MAE | MAW |
|-----:|--------:|------:|-------:|-----:|--------:|----:|----:|-----:|-----:|----:|----:|
| 4842 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4843 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4844 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4858 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4861 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4857 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | |
| 4862 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | |
| 4850 | 1 | 1 | | 1 | 1 | 1 | 1 | | | | |
| 4845 | 1 | 1 | 1 | 1 | 1 | | | | | | |
| 4855 | 1 | 1 | 1 | 1 | 1 | | | | | | |
| 4859 | 1 | 1 | 1 | 1 | 1 | | | | | | |
| 4848 | 1 | 1 | 1 | 1 | | | | | | | |
| 4847 | 1 | 1 | 1 | | | | | | | | |
| 4865 | 1 | 1 | 1 | | | | | | | | |
| 4849 | 1 | 1 | | | | | | | | | |
| 4851 | 1 | 1 | | | | | | | | | |
| 4854 | 1 | 1 | | | | | | | | | |
| 4863 | 1 | 1 | | | | | | | | | |
| 4864 | 1 | 1 | | | | | | | | | |
If selected columns contain multiple values, such values should be folded.
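A sketch on a small, hypothetical dataset where the key column is not unique, so values are folded into vectors:

```clojure
(-> (tc/dataset {:k [:a :a :b] :name ["x" "x" "y"] :v [1 2 3]})
    (tc/pivot->wider :name :v {:fold-fn vec}))
```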
Joins accept left-side and right-side datasets and columns selector. Options are the same as in tech.ml.dataset functions.
A column selector can be a map with :left and :right keys to specify column names separate for left and right dataset.
The differences from the tech.ml.dataset join functions are: argument order (datasets first) and the possibility to join on multiple columns.
Multiple-column joins create a temporary index column from the column selection. The method for creating the index is based on the :hashing option and defaults to identity. Prior to 7.000-beta-50, the hash function was used, which caused hash collisions in certain cases.
Additionally set operations are defined: intersect and difference.
To concat two datasets rowwise you can choose:
concat - concats rows for matching columns, the number of columns should be equal.
union - like concat but returns unique values
bind - concats rows, adding missing columns filled with missing values
To add two datasets columnwise use append. The number of rows should be equal.
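A sketch on two small, hypothetical datasets:

```clojure
(def top    (tc/dataset {:a [1 2] :b [3 4]}))
(def bottom (tc/dataset {:a [5]   :c [6]}))

(tc/concat top (tc/dataset {:a [9] :b [9]})) ;; same column sets required
(tc/bind top bottom)  ;; :b and :c are padded with missing values
```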
When the :hashing option is used, data from the join columns are preprocessed by applying the join-columns function with :result-type set to the value of :hashing. This helps to create custom joining behaviour. The function used for hashing will get a vector of row values from the join columns.
In the following example we will join columns on value modulo 5.
(tc/left-join ds1 ds2 :b {:hashing (fn [[v]] (mod v 5))})
left-outer-join [16 8]:

| :a | :b | :c | :right.a | :right.b | :right.c | :d | :e |
|---:|---:|----|---------:|---------:|----------|----|---:|
| 3 | 105 | t | | 110 | d | X | 3 |
| 2 | 104 | | 1 | 109 | a | X | 4 |
| 4 | 109 | t | 1 | 109 | a | X | 4 |
| 1 | 103 | s | 2 | 108 | t | X | 5 |
| | 108 | c | 2 | 108 | t | X | 5 |
| 2 | 102 | b | 5 | 107 | a | X | 6 |
| | 107 | a | 5 | 107 | a | X | 6 |
| 1 | 101 | a | 4 | 106 | t | X | 7 |
| 4 | 106 | r | 4 | 106 | t | X | 7 |
| 3 | 105 | t | 3 | 105 | a | X | |
| 2 | 104 | | 2 | 104 | b | X | 8 |
| 4 | 109 | t | 2 | 104 | b | X | 8 |
| 1 | 103 | s | 1 | 103 | l | X | 1 |
| | 108 | c | 1 | 103 | l | X | 1 |
| 2 | 102 | b | | 102 | e | X | 1 |
| | 107 | a | | 102 | e | X | 1 |
Cross
Cross product from selected columns
(tc/cross-join ds1 ds2 [:a :b])

cross-join [81 4]:

| :a | :b | :right.a | :right.b |
|---:|---:|---------:|---------:|
| 1 | 101 | | 110 |
| 1 | 101 | 1 | 109 |
| 1 | 101 | 2 | 108 |
| 1 | 101 | 5 | 107 |
| 1 | 101 | 4 | 106 |
| 1 | 101 | 3 | 105 |
| 1 | 101 | 2 | 104 |
| 1 | 101 | 1 | 103 |
| 1 | 101 | | 102 |
| 2 | 102 | | 110 |
| … | … | … | … |
| | 108 | 1 | 103 |
| | 108 | | 102 |
| 4 | 109 | | 110 |
| 4 | 109 | 1 | 109 |
| 4 | 109 | 2 | 108 |
| 4 | 109 | 5 | 107 |
| 4 | 109 | 4 | 106 |
| 4 | 109 | 3 | 105 |
| 4 | 109 | 2 | 104 |
| 4 | 109 | 1 | 103 |
| 4 | 109 | | 102 |
(tc/cross-join ds1 ds2 {:left [:a :b] :right :e})

cross-join [81 3]:

| :a | :b | :e |
|---:|---:|---:|
| 1 | 101 | 3 |
| 1 | 101 | 4 |
| 1 | 101 | 5 |
| 1 | 101 | 6 |
| 1 | 101 | 7 |
| 1 | 101 | |
| 1 | 101 | 8 |
| 1 | 101 | 1 |
| 1 | 101 | 1 |
| 2 | 102 | 3 |
| … | … | … |
| | 108 | 1 |
| | 108 | 1 |
| 4 | 109 | 3 |
| 4 | 109 | 4 |
| 4 | 109 | 5 |
| 4 | 109 | 6 |
| 4 | 109 | 7 |
| 4 | 109 | |
| 4 | 109 | 8 |
| 4 | 109 | 1 |
| 4 | 109 | 1 |
Expand
Similar to cross product but works on a single dataset.
(tc/expand ds2 :a :c :d)

cross-join [36 3]:

| :a | :c | :d |
|---:|----|----|
| | d | X |
| | a | X |
| | t | X |
| | b | X |
| | l | X |
| | e | X |
| 1 | d | X |
| 1 | a | X |
| 1 | t | X |
| 1 | b | X |
| … | … | … |
| 4 | a | X |
| 4 | t | X |
| 4 | b | X |
| 4 | l | X |
| 4 | e | X |
| 3 | d | X |
| 3 | a | X |
| 3 | t | X |
| 3 | b | X |
| 3 | l | X |
| 3 | e | X |
Columns can be also bundled (nested) in tuples which are treated as a single entity during cross product.
(tc/expand ds2 [:a :c] [:e :b])

cross-join [81 4]:

| :a | :c | :e | :b |
|---:|----|---:|---:|
| | d | 3 | 110 |
| | d | 4 | 109 |
| | d | 5 | 108 |
| | d | 6 | 107 |
| | d | 7 | 106 |
| | d | | 105 |
| | d | 8 | 104 |
| | d | 1 | 103 |
| | d | 1 | 102 |
| 1 | a | 3 | 110 |
| … | … | … | … |
| 1 | l | 1 | 103 |
| 1 | l | 1 | 102 |
| | e | 3 | 110 |
| | e | 4 | 109 |
| | e | 5 | 108 |
| | e | 6 | 107 |
| | e | 7 | 106 |
| | e | | 105 |
| | e | 8 | 104 |
| | e | 1 | 103 |
| | e | 1 | 102 |
Complete
Same as expand with all other columns preserved (filled with missing values if necessary).
In ML world very often you need to test given model and prepare collection of train and test datasets. split creates new dataset with two additional columns:
:$split-name - with :train, :test, :split-2, … values
:$split-id - id of the split group (for k-fold and repeating)
split-type can be one of the following:
:kfold (default) - k-fold strategy, :k defines number of folds (defaults to 5), produces k splits
:bootstrap - :ratio defines ratio of observations put into result (defaults to 1.0), produces 1 split
:holdout - split into two or more parts with given ratio(s) (defaults to 2/3), produces 1 split
:holdouts - splits into two parts for an ascending ratio. The range of ratios is given by the steps option
:loo - leave one out, produces the same number of splits as number of observations
:holdout can also accept probabilities or ratios and can split into more than 2 subdatasets
Additionally you can provide:
:seed - for random number generator
:shuffle? - turn on/off shuffle of the rows (default: true)
:repeats - repeat procedure :repeats times
:partition-selector - same as in group-by for stratified splitting to reflect dataset structure in splits.
:split-names names of subdatasets different than default, ie. [:train :test :split-2 ...]
:split-col-name - a column where name of split is stored, either :train or :test values (default: :$split-name)
:split-id-col-name - a column where id of the train/test pair is stored (default: :$split-id)
In case of grouped dataset each group is processed separately.
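A sketch using k-fold splitting with a fixed seed on the DS example dataset:

```clojure
(-> DS
    (tc/split :kfold {:k 3 :seed 42}) ;; adds :$split-name and :$split-id
    (tc/head 5))
```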
A column in tablecloth is a named sequence of typed data. It is the building block of the dataset since datasets are at base just a map of columns. Like dataset, the column is defined within tech.ml.dataset. In this section, we will show how you can interact with the column by itself.
Let’s begin by requiring the Column API, which we suggest you alias as tcc:
(require '[tablecloth.column.api :as tcc])
Creation & Identity
Creating an empty column is pretty straightforward:
(tcc/column)
#tech.v3.dataset.column<boolean>[0]
null
[]
You can also create a Column from a vector or a sequence
When you create a column, the underlying system will try to autodetect its type. We can illustrate that behavior using the tcc/typeof function to check the type of a column:
(-> (tcc/column [1 2 3 4 5]) (tcc/typeof))
:int64
(-> (tcc/column [:a :b :c :d :e]) (tcc/typeof))
:keyword
Columns containing heterogeneous data will receive type :object:
(-> (tcc/column [1 :b 3 :c 5]) (tcc/typeof))
:object
You can also use the tcc/typeof? function to check the type of a column as an assertion:
(-> (tcc/column [1 2 3 4 6]) (tcc/typeof? :boolean))
false
(-> (tcc/column [1 2 3 4 6]) (tcc/typeof? :int64))
true
Tablecloth has a concept of “concrete” and “general” types. A general type is the broad category of type and the concrete type is the actual type in memory. For example, a concrete type is a 64-bit integer :int64, which is also of the general type :integer.
The typeof? function supports checking both.
(-> (tcc/column [1 2 3 4 6]) (tcc/typeof? :int64))
true
(-> (tcc/column [1 2 3 4 6]) (tcc/typeof? :integer))
true
Access & Manipulation
Column Access
The method for accessing a particular index position in a column is the same as for Clojure vectors:
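For example (a sketch): a column can be indexed like a vector.

```clojure
(def a-column (tcc/column [:a :b :c]))

(a-column 1)     ;; access by index, like calling a vector
(nth a-column 2) ;; nth also works
```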
If you need to create a discontinuous subset of the column, you can use the select function. This function accepts a sequence of index positions or a sequence of booleans. When using boolean select, the values at index positions containing true are selected:
Select the values at index positions 1 and 9:
(-> (tcc/column (range 10)) (tcc/select [1 9]))
#tech.v3.dataset.column<int64>[2]
null
[1, 9]
Select the values at index positions 0 and 2 using boolean select:
Here, we used the odd? operator to create a boolean array indicating all index positions in the Column that included an odd value. Then we pass that boolean Column to the select function using the ->> to send it to the second argument position.
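The example described above could look like this (a sketch; tcc/odd? produces a boolean column):

```clojure
(def col (tcc/column (range 10)))

(->> (tcc/odd? col)    ;; boolean column marking odd values
     (tcc/select col)) ;; passed as select's second argument via ->>
```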
(-> (tcc/column [{:position 2 :text "and then stopped"} {:position 1 :text "I ran fast"}]) (tcc/sort-column (fn [a b] (< (:position a) (:position b)))))
#tech.v3.dataset.column<persistent-map>[2]
null
[{:position 1, :text "I ran fast"}, {:position 2, :text "and then stopped"}]
Operations
The Column API contains a large number of operations, which have been lifted from the underlying tech.ml.dataset library and adapted to Tablecloth. These operations all take one or more columns as an argument, and they return either a scalar value or a new column, depending on the operation. In other words, their signature is this general form:
(column1 column2 ...) => column | scalar
Because these functions all take a column as the first argument, they are easy to use with the pipe -> macro, as with all functions in Tablecloth.
Given the incredibly large number of available functions, we will illustrate instead how these operations can be used.
Columns with missing values can have those values replaced. This function offers different strategies, and the default is :nearest.
:down - Take the previous value, or use provided value.
:up - Take the next value, or use provided value.
:downup - Take the previous value, otherwise take the next value.
:updown - Take the next value, otherwise take the previous value.
:nearest - Use the nearest of next or previous values. (Strategy :mid is an alias for :nearest).
:midpoint - Use the midpoint of averaged values between previous and next (non-missing) values.
:abb - Impute missing values with approximate Bayesian bootstrap. See R's ABB.
:lerp - Linearly interpolate values between previous and next nonmissing rows.
:value - Provide a value explicitly. The value may be a function, in which case it will be called on the column with missing values elided and the return value will be used as the filler.
tablecloth.pipeline exports special versions of the API which create functions operating only on a dataset. This makes it possible to chain operations and compose them easily.
There are two ways to create pipelines:
functional, as a composition of functions
declarative, separating task declarations and concrete parametrization.
Pipeline operations are prepared to work with metamorph library. That means that result of the pipeline is wrapped into a map and dataset is stored under :metamorph/data key.
Warning: Duplicated metamorph pipeline functions are removed from tablecloth.pipeline namespace.
Below you can find a comparison between the functionality of data.table and the Clojure dataset API. I leave it without comments; please refer to the original document for details:
How can we specify just the columns we would like to compute the mean() on?
R
kable(head(flights[carrier == "AA",                        ## Only on trips with carrier "AA"
                   lapply(.SD, mean),                      ## compute the mean
                   by = .(origin, dest, month),            ## for every 'origin,dest,month'
                   .SDcols = c("arr_delay", "dep_delay")])) ## for just those specified in .SDcols
### API tour
Below snippets are taken from [A data.table and dplyr tour](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/) written by Atrebas (permission granted).
I keep the structure and subtitles but skip the `data.table` and `dplyr` examples.
Example data