2 API reference
Setup
In this notebook, we will use Tablecloth and Tableplot for code examples, alongside Tablemath.
(ns tablemath-book.reference
  (:require [scicloj.tablemath.v1.api :as tm]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [scicloj.tableplot.v1.plotly :as plotly]
            [tablemath-book.utils :as utils]))
Reference
polynomial
[column degree]
Given a column and an integer degree, return a vector of columns with all its powers up to that degree, named appropriately.
Examples
(-> [1 2 3]
    (tcc/column {:name :x})
    (tm/polynomial 4))

[#tech.v3.dataset.column<int64>[3]
:x
[1, 2, 3] #tech.v3.dataset.column<int64>[3]
:x2
[1, 4, 9] #tech.v3.dataset.column<int64>[3]
:x3
[1, 8, 27] #tech.v3.dataset.column<int64>[3]
:x4
[1, 16, 81]]
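To make the naming and the arithmetic concrete, here is a minimal plain-Clojure sketch of the same idea on an ordinary vector. The function name `powers-sketch` is hypothetical, and this is not Tablemath's implementation; it only mirrors the column names and values seen in the output above.

```clojure
;; Hypothetical helper: powers of a numeric sequence, paired with names
;; that follow the :x, :x2, :x3, ... pattern seen in the output above.
(defn powers-sketch [col-name xs degree]
  (vec (for [d (range 1 (inc degree))]
         [(if (= d 1)
            col-name
            (keyword (str (name col-name) d)))
          ;; x^d computed by repeated multiplication
          (mapv (fn [x] (reduce * (repeat d x))) xs)])))

(powers-sketch :x [1 2 3] 4)
```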
one-hot
[column]
[column {:keys [values include-first], :or {values (distinct column), include-first false}}]
Given a column, create a vector of integer binary columns, each encoding the presence or absence of one of its values.
E.g., if the column name is :x, and one of the values is :A, then a resulting binary column will have 1 in all the rows where column has :A.
The sequence of values to generate the binary columns is defined as follows: either the value provided for the :values key if present, or the distinct values in column in their order of appearance. If the value of the option key :include-first is false (which is the default), then the first value is omitted. This is handy for avoiding multicollinearity in linear regression.
Supported options:
- :values - the values to encode as columns - default nil
- :include-first - should the first value be included - default false
Examples
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x}))

[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]

(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]})

[#tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]

(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]
             :include-first true})

[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0] #tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
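The selection rule for the encoded values (explicit :values or order of appearance, with the first value dropped unless :include-first is true) can be sketched in plain Clojure. The name `one-hot-sketch` is hypothetical, the labels are plain strings chosen for illustration, and this is not Tablemath's code.

```clojure
;; Hypothetical helper: one-hot encoding of a plain sequence.
;; Returns a vector of [label indicator-vector] pairs.
(defn one-hot-sketch
  ([col-name xs] (one-hot-sketch col-name xs {}))
  ([col-name xs {:keys [values include-first]
                 :or {include-first false}}]
   (let [;; explicit values if given, else distinct values in order of appearance
         candidates (or values (distinct xs))
         ;; drop the first value unless asked to keep it
         encoded (if include-first candidates (rest candidates))]
     (vec (for [v encoded]
            [(str col-name "=" v)
             (mapv #(if (= % v) 1 0) xs)])))))

(one-hot-sketch :x [:B :A :A :B :B :C])
(one-hot-sketch :x [:B :A :A :B :B :C] {:values [:A :B :C]})
```

With the default options, :B appears first and is therefore the omitted category; passing :values changes which category plays that role.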
with
[m expr]
Evaluate expression expr in the context of destructuring all the keys of map m.
Examples
(tm/with {:x 3 :y 9}
         '(+ x y))

12

(tm/with (tc/dataset {:x (range 4)
                      :y 9})
         '(tcc/+ x y))

#tech.v3.dataset.column<int64>[4]
null
[9, 10, 11, 12]
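One way to picture "evaluating in the context of a map" is to bind each key to a local symbol and then evaluate the quoted expression. The sketch below does this with `eval` for a plain map with keyword keys; `with-sketch` is a hypothetical name, and Tablemath's actual mechanism may differ.

```clojure
;; Hypothetical helper: bind every keyword key of m to a local symbol of
;; the same name, then evaluate expr inside those bindings.
(defn with-sketch [m expr]
  (let [bindings (vec (mapcat (fn [[k v]]
                                ;; quote the value so it is used as data
                                [(symbol (name k)) (list 'quote v)])
                              m))]
    (eval (list 'clojure.core/let bindings expr))))

(with-sketch {:x 3 :y 9} '(+ x y))
```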
columns-with
[dataset specs]
Compute a sequence of named columns by a given sequence of specs in the context of a given dataset.
Each spec is one of the following:
- a keyword or string - in that case, we just take the corresponding column of the original dataset.
- a vector of two elements [nam expr], where the first is a string or a keyword. In that case, nam is interpreted as a name or a name-prefix for the resulting columns, and expr is handled as an expression as in the third case below.
- any other Clojure form - in that case, we treat it as an expression, and evaluate it while destructuring the column names of dataset as well as all the columns created by previous specs; the evaluation is expected to return one of the following:
  - a column (or the data to create a column, e.g., a vector of numbers)
  - a sequential of columns
  - a map from column names to columns
In any case, the result of the spec is turned into a sequence of named columns, which is concatenated to the columns from the previous specs. Some default naming mechanisms are invoked if column names are missing.
Columns of strings and keywords that have at most 20 distinct values are one-hot-encoded by default.
Eventually, the sequence of all resulting columns is returned.
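The sequential dependence between specs can be pictured as a fold in which each step sees the original columns plus everything computed so far. The sketch below is a simplification under stated assumptions: it uses plain maps and vectors instead of datasets, functions instead of quoted expressions, and a hypothetical name `columns-with-sketch`. It is not how Tablemath implements columns-with.

```clojure
;; Hypothetical helper: each spec is a [name f] pair, where f receives a map
;; of all columns known so far (originals plus earlier results).
(defn columns-with-sketch [data specs]
  (reduce (fn [known [nam f]]
            (assoc known nam (f known)))
          data
          specs))

(columns-with-sketch
 {:x [0 1 2] :y [2 1 0]}
 [[:z (fn [{:keys [x y]}] (mapv + x y))]
  ;; :z1000 depends on :z, which was created by the previous spec
  [:z1000 (fn [{:keys [z]}] (mapv #(* 1000 %) z))]])
```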
Examples
Note the naming of the resulting columns, and note they can sequentially depend on each other.
(tm/columns-with (tc/dataset {"v" [4 5 6]
                              :w [:A :B :C]
                              :x (range 3)
                              :y (reverse (range 3))})
                 [:v
                  :w
                  :x
                  '(tcc/+ x y)
                  [:z '(tcc/+ x y)]
                  [:z1000 '(tcc/* z 1000)]
                  '((juxt tcc/+ tcc/*) x y)
                  [:p '((juxt tcc/+ tcc/*) x y)]
                  '{:a (tcc/+ x y)
                    :b (tcc/* x y)}
                  [:p '{:a (tcc/+ x y)
                        :b (tcc/* x y)}]
                  '[(tcc/column (tcc/+ x y) {:name :c})
                    (tcc/column (tcc/* x y) {:name :d})]
                  [:p '[(tcc/column (tcc/+ x y) {:name :c})
                        (tcc/column (tcc/* x y) {:name :d})]]])

(#tech.v3.dataset.column<int64>[3]
:w=:B
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:w=:C
[0, 0, 1] #tech.v3.dataset.column<int64>[3]
:x
[0, 1, 2] #tech.v3.dataset.column<int64>[3]
(tcc/+ x y)
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:z
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:z1000
[2000, 2000, 2000] #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_0
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_1
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:p_0
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:p_1
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:a
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:b
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:pa
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:pb
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:c
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:d
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:pc
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:pd
[0, 1, 0])
design
[dataset target-specs feature-specs]
Given a dataset and sequences target-specs, feature-specs, generate a new dataset from the columns generated by columns-with from these two sequences. The columns from target-specs will be marked as targets for modelling (e.g., regression, classification).
(Inspired by metamorph.ml.design-matrix but adapted for columnwise computation.)
Examples
(tm/design (tc/dataset {"v" [4 5 6]
                        :w [:A :B :C]
                        :x (range 3)
                        :y (reverse (range 3))})
           [:y]
           [:v
            :w
            :x
            '(tcc/+ x y)
            [:z '(tcc/+ x y)]
            [:z1000 '(tcc/* z 1000)]
            '((juxt tcc/+ tcc/*) x y)
            [:p '((juxt tcc/+ tcc/*) x y)]
            '{:a (tcc/+ x y)
              :b (tcc/* x y)}
            [:p '{:a (tcc/+ x y)
                  :b (tcc/* x y)}]
            '[(tcc/column (tcc/+ x y) {:name :c})
              (tcc/column (tcc/* x y) {:name :d})]
            [:p '[(tcc/column (tcc/+ x y) {:name :c})
                  (tcc/column (tcc/* x y) {:name :d})]]])

_unnamed [3 19]:
:y | :w=:B | :w=:C | :x | (tcc/+ x y) | :z | :z1000 | ((juxt tcc/+ tcc/*) x y)_0 | ((juxt tcc/+ tcc/*) x y)_1 | :p_0 | :p_1 | :a | :b | :pa | :pb | :c | :d | :pc | :pd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0 | 0 | 0 | 2 | 2 | 2000 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
1 | 1 | 0 | 1 | 2 | 2 | 2000 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 |
0 | 0 | 1 | 2 | 2 | 2 | 2000 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
lm
[dataset]
[dataset options]
Compute a linear regression model for dataset. The first column marked as target is the target. All the columns unmarked as target are the features. The resulting model is of type fastmath.ml.regression.LMData, generated by Fastmath. It can be summarized by summary.
See fastmath.ml.regression.lm for options.
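For intuition about what a linear model with one feature and an intercept estimates, here is a minimal plain-Clojure sketch of the textbook closed-form least-squares solution (slope = covariance of x and y divided by variance of x). The name `simple-ols` is hypothetical; it is an illustration only and does not reflect how Fastmath computes its models.

```clojure
;; Hypothetical helper: ordinary least squares for a single feature.
(defn simple-ols [xs ys]
  (let [n (count xs)
        mean (fn [vs] (/ (reduce + vs) (double n)))
        mx (mean xs)
        my (mean ys)
        ;; sum of cross-deviations and sum of squared deviations
        sxy (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx (reduce + (map (fn [x] (let [d (- x mx)] (* d d))) xs))
        slope (/ sxy sxx)]
    {:intercept (- my (* slope mx))
     :slope slope}))

;; Noise-free data generated as y = 2x - 3
(let [xs (range 9)]
  (simple-ols xs (map #(- (* 2 %) 3) xs)))
```

On noise-free data the sketch recovers the generating slope and intercept; with noise added, as in the examples that follow, the estimates only approximate them.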
Examples
Linear relationship
(def linear-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ (* 2 x)
                           -3
                           (* 3 (rand)))))))

(-> linear-toydata
    plotly/layer-point)
Note how the coefficients fit the way we generated the data:
(-> linear-toydata
    (tm/design [:y]
               [:x])
    tm/lm
    tm/summary)

Residuals:

| :min | :q1 | :median | :q3 | :max |
|-----------+-----------+----------+----------+----------|
| -1.065875 | -0.558432 | 0.158398 | 0.528058 | 0.669637 |

Coefficients:

| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+-----------+----------+----------------------|
| Intercept | -0.546876 | 0.392459 | -1.39346 | 0.206125 | [-1.474893 0.381142] |
| :x | 1.846629 | 0.082433 | 22.401628 | 0.0 | [1.651706 2.041552] |

F-statistic: 501.83291766778524 on degrees of freedom: {:residual 7, :model 1, :intercept 1}
p-value: 8.935090556327907E-8

R2: 0.9862430283950885
Adjusted R2: 0.984277746737244
Residual standard error: 0.6385217752507847 on 7 degrees of freedom
AIC: 21.204272737197513
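The relation between R2 and Adjusted R2 in this summary can be checked by hand. The sketch below assumes the usual definition, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), with n = 9 rows (from `(range 9)`) and p = 1 feature; the function name `adjusted-r2` is hypothetical and not part of Tablemath or Fastmath.

```clojure
;; Hypothetical helper: adjusted R2 from R2, row count n, and feature count p
;; (intercept not counted in p).
(defn adjusted-r2 [r2 n p]
  (- 1.0 (* (- 1.0 r2)
            (/ (- n 1.0) (- n p 1.0)))))

;; R2 taken from the linear model summary: n = 9, p = 1
(adjusted-r2 0.9862430283950885 9 1)
```

Under that assumption the result agrees with the Adjusted R2 reported in the summary up to floating-point rounding.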
Cubic relationship
(def cubic-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ 50
                           (* 4 x)
                           (* -9 x x)
                           (* x x x)
                           (* 3 (rand)))))))

(-> cubic-toydata
    plotly/layer-point)
Note how the coefficients fit the way we generated the data:
(-> cubic-toydata
    (tm/design [:y]
               ['(tm/polynomial x 3)])
    tm/lm
    tm/summary)

Residuals:

| :min | :q1 | :median | :q3 | :max |
|-----------+-----------+-----------+----------+----------|
| -1.416885 | -0.361248 | -0.159624 | 0.421575 | 1.340384 |

Coefficients:

| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+------------+----------+------------------------|
| Intercept | 50.681655 | 0.891266 | 56.864814 | 0.0 | [48.390584 52.972726] |
| :x | 5.152666 | 1.028649 | 5.009157 | 0.004073 | [2.508439 7.796894] |
| :x2 | -9.293254 | 0.310576 | -29.922661 | 1.0E-6 | [-10.091615 -8.494894] |
| :x3 | 1.019036 | 0.025475 | 40.001202 | 0.0 | [0.95355 1.084522] |

F-statistic: 2962.49805122806 on degrees of freedom: {:residual 5, :model 3, :intercept 1}
p-value: 1.5268940778412343E-8

R2: 0.9994377280531662
Adjusted R2: 0.9991003648850659
Residual standard error: 0.9618675995935431 on 5 degrees of freedom
AIC: 29.551001186892037
Categorical relationship
(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

(def categorical-toydata
  (-> {:t (range 18)
       :day-of-week (->> days-of-week
                         (repeat 3)
                         (apply concat))}
      tc/dataset
      (tc/map-columns :traffic
                      [:day-of-week]
                      (fn [dow]
                        (+ (case dow
                             :Sat 50
                             :Sun 50
                             60)
                           (* 5 (rand)))))))

(-> categorical-toydata
    (plotly/layer-point {:=x :t
                         :=y :traffic
                         :=color :day-of-week
                         :=mark-size 10})
    (plotly/layer-line {:=x :t
                        :=y :traffic}))
A model with all days except for one, dropping one category to avoid multicollinearity (note that the dropped category is Monday, the first value in order of appearance):
(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week)])
    tm/lm
    tm/summary)

Residuals:

| :min | :q1 | :median | :q3 | :max |
|-----------+-----------+----------+----------+----------|
| -2.384111 | -0.945412 | 0.201007 | 1.030651 | 2.271914 |

Coefficients:

| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
| Intercept | 61.515092 | 0.964584 | 63.773679 | 0.0 | [59.392056 63.638128] |
| :day-of-week=:Tue | 1.643425 | 1.364128 | 1.204744 | 0.253582 | [-1.359001 4.645851] |
| :day-of-week=:Wed | 0.208831 | 1.364128 | 0.153088 | 0.881101 | [-2.793595 3.211257] |
| :day-of-week=:Thu | 0.943699 | 1.364128 | 0.691796 | 0.503406 | [-2.058728 3.946125] |
| :day-of-week=:Fri | 1.181772 | 1.525142 | 0.774861 | 0.454756 | [-2.175042 4.538587] |
| :day-of-week=:Sat | -7.688267 | 1.525142 | -5.041018 | 3.77E-4 | [-11.045081 -4.331452] |
| :day-of-week=:Sun | -9.537227 | 1.525142 | -6.253338 | 6.2E-5 | [-12.894041 -6.180412] |

F-statistic: 16.87587496496225 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 5.799260079852875E-5

R2: 0.9020090372557159
Adjusted R2: 0.8485594212133791
Residual standard error: 1.670709092712173 on 11 degrees of freedom
AIC: 76.69414360164832
A model with all days except for one, dropping one category to avoid multicollinearity, and specifying the order of encoded values:
(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week})])
    tm/lm
    tm/summary)

Residuals:

| :min | :q1 | :median | :q3 | :max |
|-----------+-----------+----------+----------+----------|
| -2.384111 | -0.945412 | 0.201007 | 1.030651 | 2.271914 |

Coefficients:

| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
| Intercept | 61.515092 | 0.964584 | 63.773679 | 0.0 | [59.392056 63.638128] |
| :day-of-week=:Tue | 1.643425 | 1.364128 | 1.204744 | 0.253582 | [-1.359001 4.645851] |
| :day-of-week=:Wed | 0.208831 | 1.364128 | 0.153088 | 0.881101 | [-2.793595 3.211257] |
| :day-of-week=:Thu | 0.943699 | 1.364128 | 0.691796 | 0.503406 | [-2.058728 3.946125] |
| :day-of-week=:Fri | 1.181772 | 1.525142 | 0.774861 | 0.454756 | [-2.175042 4.538587] |
| :day-of-week=:Sat | -7.688267 | 1.525142 | -5.041018 | 3.77E-4 | [-11.045081 -4.331452] |
| :day-of-week=:Sun | -9.537227 | 1.525142 | -6.253338 | 6.2E-5 | [-12.894041 -6.180412] |

F-statistic: 16.87587496496225 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 5.799260079852875E-5

R2: 0.9020090372557159
Adjusted R2: 0.8485594212133791
Residual standard error: 1.670709092712173 on 11 degrees of freedom
AIC: 76.69414360164832
A model with all days and no intercept. Dropping the intercept avoids multicollinearity and makes the coefficients easier to interpret.
Note how the coefficients fit the way we generated the data:
(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week
                              :include-first true})])
    (tm/lm {:intercept? false})
    tm/summary)

Residuals:

| :min | :q1 | :median | :q3 | :max |
|-----------+-----------+----------+----------+----------|
| -2.384111 | -0.945412 | 0.201007 | 1.030651 | 2.271914 |

Coefficients:

| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-------------------+-----------+----------+-----------+----------+-----------------------|
| :day-of-week=:Mon | 61.515092 | 0.964584 | 63.773679 | 0.0 | [59.392056 63.638128] |
| :day-of-week=:Tue | 63.158517 | 0.964584 | 65.477444 | 0.0 | [61.035481 65.281553] |
| :day-of-week=:Wed | 61.723923 | 0.964584 | 63.990177 | 0.0 | [59.600887 63.846959] |
| :day-of-week=:Thu | 62.45879 | 0.964584 | 64.752026 | 0.0 | [60.335755 64.581826] |
| :day-of-week=:Fri | 62.696864 | 1.18137 | 53.071331 | 0.0 | [60.096687 65.297041] |
| :day-of-week=:Sat | 53.826825 | 1.18137 | 45.563064 | 0.0 | [51.226648 56.427002] |
| :day-of-week=:Sun | 51.977865 | 1.18137 | 43.997966 | 0.0 | [49.377688 54.578043] |

F-statistic: 3352.9036264916203 on degrees of freedom: {:residual 11, :model 7, :intercept 0}
p-value: 0.0

R2: 0.9995315426271968
Adjusted R2: 0.9992334333899584
Residual standard error: 1.6707090927121728 on 11 degrees of freedom
AIC: 76.69414360164832
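The two parameterizations describe the same fit. With a full set of indicator columns and no intercept, each least-squares coefficient is the mean of its group; with an intercept and one category dropped, the intercept is the baseline group's mean and the other coefficients are offsets from it. For example, in the summaries above, 61.515092 + 1.643425 = 63.158517, the no-intercept coefficient for Tuesday. The plain-Clojure sketch below illustrates this on made-up numbers; `group-means` and `offsets-from-baseline` are hypothetical names, not Tablemath functions.

```clojure
;; Hypothetical helper: mean of ys within each group.
(defn group-means [groups ys]
  (->> (map vector groups ys)
       (group-by first)
       (map (fn [[g pairs]]
              [g (/ (reduce + (map second pairs))
                    (double (count pairs)))]))
       (into {})))

;; Hypothetical helper: difference of each group mean from the baseline's mean.
(defn offsets-from-baseline [means baseline]
  (let [base (means baseline)]
    (into {}
          (for [[g m] means
                :when (not= g baseline)]
            [g (- m base)]))))

;; Made-up illustrative data, not taken from categorical-toydata
(let [means (group-means [:Mon :Tue :Mon :Tue :Sat]
                         [60 62 62 64 50])]
  {:no-intercept-coefficients means
   :intercept (means :Mon)
   :offsets (offsets-from-baseline means :Mon)})
```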