2 API reference
Setup
In this notebook, we will use Tablecloth and Tableplot for code examples, alongside Tablemath.
(ns tablemath-book.reference
  (:require [scicloj.tablemath.v1.api :as tm]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [scicloj.tableplot.v1.plotly :as plotly]
            [tablemath-book.utils :as utils]))
Reference
polynomial
[column degree]
Given a column and an integer degree, return a vector of columns with all its powers up to that degree, named appropriately.
Examples
(-> [1 2 3]
    (tcc/column {:name :x})
    (tm/polynomial 4))

[#tech.v3.dataset.column<int64>[3]
:x
[1, 2, 3]
 #tech.v3.dataset.column<int64>[3]
:x2
[1, 4, 9]
 #tech.v3.dataset.column<int64>[3]
:x3
[1, 8, 27]
 #tech.v3.dataset.column<int64>[3]
:x4
[1, 16, 81]]
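To make the naming and the arithmetic explicit, here is a minimal plain-Clojure sketch of the same idea. It is not Tablemath's implementation: powers-sketch is a hypothetical helper that works on ordinary vectors rather than columns, and it assumes the naming scheme seen in the output above (the original name for the first power, then the name with the degree appended).

```clojure
;; Hypothetical helper, not part of Tablemath: computes the powers of a
;; plain vector and names them after the pattern seen in the output above.
(defn powers-sketch
  [col-name xs degree]
  (mapv (fn [d]
          [(if (= d 1)
             col-name
             (keyword (str (name col-name) d)))
           ;; x^d as a product of d copies of x
           (mapv (fn [x] (reduce * (repeat d x))) xs)])
        (range 1 (inc degree))))

(powers-sketch :x [1 2 3] 4)
;; => [[:x [1 2 3]] [:x2 [1 4 9]] [:x3 [1 8 27]] [:x4 [1 16 81]]]
```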
one-hot
[column]
[column {:keys [values include-first], :or {values (distinct column), include-first false}}]
Given a column, create a vector of integer binary columns, each encoding the presence or absence of one of its values.
E.g., if the column name is :x, and one of the values is :A, then a resulting binary column will have 1 in all the rows where column has :A.
The sequence of values to generate the binary columns is defined as follows: either the value provided for the :values key if present, or the distinct values in column in their order of appearance. If the value of the option key :include-first is false (which is the default), then the first value is omitted. This is handy for avoiding multicollinearity in linear regression.
Supported options:
- :values - the values to encode as columns - default nil
- :include-first - should the first value be included - default false
Examples
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x}))

[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0]
 #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]})

[#tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0]
 #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]
             :include-first true})

[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0]
 #tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0]
 #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
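The selection rule for the encoded values can be sketched in plain Clojure. This is only an illustration under stated assumptions: one-hot-sketch is a hypothetical helper working on vectors instead of columns, and it returns [name values] pairs with string names, which merely imitate the printed column names above.

```clojure
;; Hypothetical helper, not part of Tablemath: mirrors the documented rule
;; for choosing which values get a binary column.
(defn one-hot-sketch
  [col-name xs {:keys [values include-first]
                :or {include-first false}}]
  (let [;; explicit :values win; otherwise distinct values in order of appearance
        candidates (or values (distinct xs))
        ;; unless :include-first is true, the first value is dropped
        encoded (if include-first candidates (rest candidates))]
    (mapv (fn [v]
            [(str col-name "=" v)
             (mapv (fn [x] (if (= x v) 1 0)) xs)])
          encoded)))

(one-hot-sketch :x [:B :A :A :B :B :C] {})
;; => [[":x=:A" [0 1 1 0 0 0]] [":x=:C" [0 0 0 0 0 1]]]

(one-hot-sketch :x [:B :A :A :B :B :C] {:values [:A :B :C]})
;; => [[":x=:B" [1 0 0 1 1 0]] [":x=:C" [0 0 0 0 0 1]]]
```

Dropping the first value matters because the full set of indicator columns always sums to a column of ones, which duplicates the intercept of a linear model.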
with
[m expr]
Evaluate expression expr in the context of destructuring all the keys of map m.
Examples
(tm/with {:x 3 :y 9}
         '(+ x y))

12

(tm/with (tc/dataset {:x (range 4)
                      :y 9})
         '(tcc/+ x y))

#tech.v3.dataset.column<int64>[4]
null
[9, 10, 11, 12]
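One way to picture "destructuring all the keys" is a let form built from the keys of the map. The following is a rough sketch under assumptions, not a description of how tm/with is actually written: with-sketch is a hypothetical name, it only handles keyword keys, and it embeds the map in the evaluated form, which works for plain data such as numbers.

```clojure
;; Hypothetical sketch, not Tablemath's implementation: binds every keyword
;; key of m to a local symbol of the same name, then evaluates expr.
(defn with-sketch
  [m expr]
  (let [syms (mapv (comp symbol name) (keys m))]
    (eval `(let [{:keys ~syms} ~m]
             ~expr))))

(with-sketch {:x 3 :y 9}
             '(+ x y))
;; => 12
```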
columns-with
[dataset specs]
Compute a sequence of named columns by a given sequence of specs in the context of a given dataset.
Each spec is one of the following:
- a keyword or string - in that case, we just take the corresponding column of the original dataset.
- a vector of two elements [nam expr], where the first is a string or a keyword. In that case, nam is interpreted as a name or a name-prefix for the resulting columns, and expr is handled as an expression, as described in the next item.
- any other Clojure form - in that case, we treat it as an expression, and evaluate it while destructuring the column names of dataset as well as all the columns created by previous specs; the evaluation is expected to return one of the following:
  - a column (or the data to create a column, e.g., a vector of numbers)
  - a sequential of columns
  - a map from column names to columns
In any case, the result of the spec is turned into a sequence of named columns, which is concatenated to the columns from the previous specs. Some default naming mechanisms are invoked if column names are missing.
Columns of strings and keywords that have at most 20 distinct values are one-hot-encoded by default.
Eventually, the sequence of all resulting columns is returned.
Examples
Note the naming of the resulting columns, and note they can sequentially depend on each other.
(tm/columns-with (tc/dataset {"v" [4 5 6]
                              :w [:A :B :C]
                              :x (range 3)
                              :y (reverse (range 3))})
                 [:v
                  :w
                  :x
                  '(tcc/+ x y)
                  [:z '(tcc/+ x y)]
                  [:z1000 '(tcc/* z 1000)]
                  '((juxt tcc/+ tcc/*) x y)
                  [:p '((juxt tcc/+ tcc/*) x y)]
                  '{:a (tcc/+ x y)
                    :b (tcc/* x y)}
                  [:p '{:a (tcc/+ x y)
                        :b (tcc/* x y)}]
                  '[(tcc/column (tcc/+ x y) {:name :c})
                    (tcc/column (tcc/* x y) {:name :d})]
                  [:p '[(tcc/column (tcc/+ x y) {:name :c})
                        (tcc/column (tcc/* x y) {:name :d})]]])

(#tech.v3.dataset.column<int64>[3]
:w=:B
[0, 1, 0]
 #tech.v3.dataset.column<int64>[3]
:w=:C
[0, 0, 1]
 #tech.v3.dataset.column<int64>[3]
:x
[0, 1, 2]
 #tech.v3.dataset.column<int64>[3]
(tcc/+ x y)
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
:z
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
:z1000
[2000, 2000, 2000]
 #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_0
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_1
[0, 1, 0]
 #tech.v3.dataset.column<int64>[3]
:p_0
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
:p_1
[0, 1, 0]
 #tech.v3.dataset.column<int64>[3]
:a
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
:b
[0, 1, 0]
 #tech.v3.dataset.column<int64>[3]
:pa
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
:pb
[0, 1, 0]
 #tech.v3.dataset.column<int64>[3]
:c
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
:d
[0, 1, 0]
 #tech.v3.dataset.column<int64>[3]
:pc
[2, 2, 2]
 #tech.v3.dataset.column<int64>[3]
:pd
[0, 1, 0])
design
[dataset target-specs feature-specs]
Given a dataset and sequences target-specs, feature-specs, generate a new dataset from the columns generated by columns-with from these two sequences. The columns from target-specs will be marked as targets for modelling (e.g., regression, classification).
(Inspired by metamorph.ml.design-matrix but adapted for columnwise computation.)
Examples
(tm/design (tc/dataset {"v" [4 5 6]
                        :w [:A :B :C]
                        :x (range 3)
                        :y (reverse (range 3))})
           [:y]
           [:v
            :w
            :x
            '(tcc/+ x y)
            [:z '(tcc/+ x y)]
            [:z1000 '(tcc/* z 1000)]
            '((juxt tcc/+ tcc/*) x y)
            [:p '((juxt tcc/+ tcc/*) x y)]
            '{:a (tcc/+ x y)
              :b (tcc/* x y)}
            [:p '{:a (tcc/+ x y)
                  :b (tcc/* x y)}]
            '[(tcc/column (tcc/+ x y) {:name :c})
              (tcc/column (tcc/* x y) {:name :d})]
            [:p '[(tcc/column (tcc/+ x y) {:name :c})
                  (tcc/column (tcc/* x y) {:name :d})]]])

_unnamed [3 19]:
| :y | :w=:B | :w=:C | :x | (tcc/+ x y) | :z | :z1000 | ((juxt tcc/+ tcc/*) x y)_0 | ((juxt tcc/+ tcc/*) x y)_1 | :p_0 | :p_1 | :a | :b | :pa | :pb | :c | :d | :pc | :pd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 0 | 0 | 2 | 2 | 2000 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| 1 | 1 | 0 | 1 | 2 | 2 | 2000 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 |
| 0 | 0 | 1 | 2 | 2 | 2 | 2000 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
lm
[dataset]
[dataset options]
Compute a linear regression model for dataset. The first column marked as target is the target. All the columns not marked as target are the features. The resulting model is of type fastmath.ml.regression.LMData, generated by Fastmath. It can be summarized by summary.
See fastmath.ml.regression.lm for options.
Examples
Linear relationship
(def linear-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ (* 2 x)
                           -3
                           (* 3 (rand)))))))

(-> linear-toydata
    plotly/layer-point)
Note how the coefficients fit the way we generated the data:
(-> linear-toydata
    (tm/design [:y]
               [:x])
    tm/lm
    tm/summary)

Residuals:

|      :min |       :q1 |   :median |     :q3 |     :max |
|-----------+-----------+-----------+---------+----------|
| -1.443772 | -0.350436 | -0.244968 | 0.76477 | 1.198845 |

Coefficients:

|     :name | :estimate |  :stderr |  :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+-----------+----------+----------------------|
| Intercept | -1.009014 | 0.523742 | -1.926548 | 0.095406 | [-2.247467 0.229439] |
|        :x |  1.846975 | 0.110008 | 16.789489 |   1.0E-6 |  [1.586848 2.107102] |

F-statistic: 281.8869508632346 on degrees of freedom: {:residual 7, :model 1, :intercept 1}
p-value: 6.506718930321398E-7

R2: 0.975769068214805
Adjusted R2: 0.9723075065312058
Residual standard error: 0.8521167114773845 on 7 degrees of freedom
AIC: 26.398491770973536
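For a single feature, the estimates in the Coefficients table can be cross-checked against the closed-form least-squares solution. The sketch below is plain Clojure and independent of Tablemath and Fastmath; simple-ols is a hypothetical helper, and the noise-free data is an assumption chosen so that the expected result is exact.

```clojure
;; Hypothetical helper: closed-form ordinary least squares for one feature.
;; slope = sum((x - mean-x) * (y - mean-y)) / sum((x - mean-x)^2)
;; intercept = mean-y - slope * mean-x
(defn simple-ols
  [xs ys]
  (let [n (double (count xs))
        mean-x (/ (reduce + xs) n)
        mean-y (/ (reduce + ys) n)
        sxy (reduce + (map (fn [x y] (* (- x mean-x) (- y mean-y))) xs ys))
        sxx (reduce + (map (fn [x] (* (- x mean-x) (- x mean-x))) xs))
        slope (/ sxy sxx)]
    {:intercept (- mean-y (* slope mean-x))
     :slope slope}))

;; Same generating rule as linear-toydata, but without the random term.
(let [xs (range 9)
      ys (map (fn [x] (+ (* 2 x) -3)) xs)]
  (simple-ols xs ys))
;; => {:intercept -3.0, :slope 2.0}
```

In the toy data, the random term (* 3 (rand)) has mean 1.5, so the intercept is expected to land near -1.5 rather than -3, while the slope should stay close to 2.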
Cubic relationship
(def cubic-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ 50
                           (* 4 x)
                           (* -9 x x)
                           (* x x x)
                           (* 3 (rand)))))))

(-> cubic-toydata
    plotly/layer-point)
Note how the coefficients fit the way we generated the data:
(-> cubic-toydata
    (tm/design [:y]
               ['(tm/polynomial x 3)])
    tm/lm
    tm/summary)

Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -1.200883 | -0.416095 | 0.089125 | 0.557187 | 0.843251 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | 51.854441 | 0.761092 |   68.13167 |      0.0 | [49.897993 53.810889] |
|        :x |  4.161807 |  0.87841 |   4.737889 |  0.00516 |   [1.903784 6.419831] |
|       :x2 | -9.051775 | 0.265215 | -34.130006 |      0.0 |  [-9.73353 -8.370019] |
|       :x3 |  1.004353 | 0.021754 |  46.167917 |      0.0 |   [0.948432 1.060275] |

F-statistic: 4069.899285168728 on degrees of freedom: {:residual 5, :model 3, :intercept 1}
p-value: 6.905335636631094E-9

R2: 0.99959065708713
Adjusted R2: 0.999345051339408
Residual standard error: 0.8213817632386143 on 5 degrees of freedom
AIC: 26.709002578008814
Categorical relationship
(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

(def categorical-toydata
  (-> {:t (range 18)
       :day-of-week (->> days-of-week
                         (repeat 3)
                         (apply concat))}
      tc/dataset
      (tc/map-columns :traffic
                      [:day-of-week]
                      (fn [dow]
                        (+ (case dow
                             :Sat 50
                             :Sun 50
                             60)
                           (* 5 (rand)))))))

(-> categorical-toydata
    (plotly/layer-point {:=x :t
                         :=y :traffic
                         :=color :day-of-week
                         :=mark-size 10})
    (plotly/layer-line {:=x :t
                        :=y :traffic}))
A model with all days except for one, dropping one category to avoid multicollinearity (note that the dropped category is the first one in order of appearance, which is Monday in the output below):
(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week)])
    tm/lm
    tm/summary)

Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.089684 | -0.725653 | 0.027399 | 0.743488 | 1.896184 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |   :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
|         Intercept | 61.711979 | 0.785093 | 78.604691 |      0.0 |  [59.984002 63.439957] |
| :day-of-week=:Tue |  2.061152 | 1.110289 |  1.856411 | 0.090358 |   [-0.382577 4.504882] |
| :day-of-week=:Wed |  0.460481 | 1.110289 |   0.41474 | 0.686306 |    [-1.983249 2.90421] |
| :day-of-week=:Thu |  0.560154 | 1.110289 |  0.504512 | 0.623855 |   [-1.883575 3.003884] |
| :day-of-week=:Fri |  2.637936 | 1.241341 |   2.12507 | 0.057066 |   [-0.094236 5.370109] |
| :day-of-week=:Sat | -8.227844 | 1.241341 | -6.628191 |   3.7E-5 | [-10.960016 -5.495671] |
| :day-of-week=:Sun |  -9.23519 | 1.241341 | -7.439689 |   1.3E-5 | [-11.967362 -6.503017] |

F-statistic: 28.03879101493394 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 4.726292776258134E-6

R2: 0.9386272863637274
Adjusted R2: 0.9051512607439424
Residual standard error: 1.359820669918769 on 11 degrees of freedom
AIC: 69.28191236879977
A model with all days except for one, dropping one category to avoid multicollinearity, and specifying the order of encoded values:
(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week})])
    tm/lm
    tm/summary)

Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.089684 | -0.725653 | 0.027399 | 0.743488 | 1.896184 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |   :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
|         Intercept | 61.711979 | 0.785093 | 78.604691 |      0.0 |  [59.984002 63.439957] |
| :day-of-week=:Tue |  2.061152 | 1.110289 |  1.856411 | 0.090358 |   [-0.382577 4.504882] |
| :day-of-week=:Wed |  0.460481 | 1.110289 |   0.41474 | 0.686306 |    [-1.983249 2.90421] |
| :day-of-week=:Thu |  0.560154 | 1.110289 |  0.504512 | 0.623855 |   [-1.883575 3.003884] |
| :day-of-week=:Fri |  2.637936 | 1.241341 |   2.12507 | 0.057066 |   [-0.094236 5.370109] |
| :day-of-week=:Sat | -8.227844 | 1.241341 | -6.628191 |   3.7E-5 | [-10.960016 -5.495671] |
| :day-of-week=:Sun |  -9.23519 | 1.241341 | -7.439689 |   1.3E-5 | [-11.967362 -6.503017] |

F-statistic: 28.03879101493394 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 4.726292776258134E-6

R2: 0.9386272863637274
Adjusted R2: 0.9051512607439424
Residual standard error: 1.359820669918769 on 11 degrees of freedom
AIC: 69.28191236879977
A model with all days and no intercept; dropping the intercept avoids multicollinearity and makes the coefficients easier to interpret. Note how the coefficients fit the way we generated the data:
(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week
                              :include-first true})])
    (tm/lm {:intercept? false})
    tm/summary)

Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.089684 | -0.725653 | 0.027399 | 0.743488 | 1.896184 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |  :confidence-interval |
|-------------------+-----------+----------+-----------+----------+-----------------------|
| :day-of-week=:Mon | 61.711979 | 0.785093 | 78.604691 |      0.0 | [59.984002 63.439957] |
| :day-of-week=:Tue | 63.773132 | 0.785093 | 81.230053 |      0.0 | [62.045154 65.501109] |
| :day-of-week=:Wed |  62.17246 | 0.785093 | 79.191222 |      0.0 | [60.444483 63.900438] |
| :day-of-week=:Thu | 62.272133 | 0.785093 | 79.318179 |      0.0 | [60.544156 64.000111] |
| :day-of-week=:Fri | 64.349916 | 0.961538 | 66.923915 |      0.0 | [62.233584 66.466247] |
| :day-of-week=:Sat | 53.484136 | 0.961538 | 55.623504 |      0.0 | [51.367804 55.600468] |
| :day-of-week=:Sun |  52.47679 | 0.961538 | 54.575864 |      0.0 | [50.360458 54.593121] |

F-statistic: 5127.2787830847365 on degrees of freedom: {:residual 11, :model 7, :intercept 0}
p-value: 0.0

R2: 0.9996936099697633
Adjusted R2: 0.9994986344959763
Residual standard error: 1.359820669918769 on 11 degrees of freedom
AIC: 69.28191236879977
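With one indicator column per category and no intercept, each least-squares coefficient equals the mean of the target within that category, which is why the estimates above sit near 60 plus noise on weekdays and 50 plus noise on weekends. A plain-Clojure sketch of that per-category mean follows; group-means is a hypothetical helper and the rows are made-up illustration data, not taken from categorical-toydata.

```clojure
;; Hypothetical helper: mean of value-key within each group of group-key.
(defn group-means
  [rows group-key value-key]
  (into {}
        (map (fn [[group group-rows]]
               [group (/ (reduce + (map value-key group-rows))
                         (double (count group-rows)))]))
        (group-by group-key rows)))

;; Made-up rows, for illustration only.
(group-means [{:day-of-week :Mon :traffic 62.0}
              {:day-of-week :Mon :traffic 64.0}
              {:day-of-week :Sat :traffic 51.0}
              {:day-of-week :Sat :traffic 53.0}]
             :day-of-week
             :traffic)
;; => {:Mon 63.0, :Sat 52.0}
```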