2  API reference

Setup

In this notebook, we will use Tablecloth and Tableplot for code examples, alongside Tablemath.

(ns tablemath-book.reference
  (:require [scicloj.tablemath.v1.api :as tm]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [scicloj.tableplot.v1.plotly :as plotly]
            [tablemath-book.utils :as utils]))

Reference

polynomial

[column degree]

Given a column and an integer degree, return a vector of columns with all its powers up to that degree, named appropriately.

Examples

(-> [1 2 3]
    (tcc/column {:name :x})
    (tm/polynomial 4))
[#tech.v3.dataset.column<int64>[3]
:x
[1, 2, 3] #tech.v3.dataset.column<int64>[3]
:x2
[1, 4, 9] #tech.v3.dataset.column<int64>[3]
:x3
[1, 8, 27] #tech.v3.dataset.column<int64>[3]
:x4
[1, 16, 81]]
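
To make the naming and the arithmetic explicit, here is a minimal plain-Clojure sketch of the same idea. It works on vectors rather than dataset columns, and `powers-sketch` is a hypothetical helper written for illustration; it is not how Tablemath implements `polynomial`.

```clojure
;; Plain-Clojure sketch; `powers-sketch` is a hypothetical helper,
;; not part of Tablemath.
(defn powers-sketch
  "Return [name values] pairs for the powers 1..degree of xs."
  [col-name xs degree]
  (vec
   (for [d (range 1 (inc degree))]
     [;; the first power keeps the original name; higher powers get a suffix
      (if (= d 1)
        col-name
        (keyword (str (name col-name) d)))
      ;; x^d as a product of d copies of x (exact for integers)
      (mapv (fn [x] (reduce * (repeat d x))) xs)])))

(powers-sketch :x [1 2 3] 4)
;; => [[:x [1 2 3]] [:x2 [1 4 9]] [:x3 [1 8 27]] [:x4 [1 16 81]]]
```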

one-hot

[column]

[column {:keys [values include-first], :or {values (distinct column), include-first false}}]

Given a column, create a vector of integer binary columns, each encoding the presence or absence of one of its values.

E.g., if the column name is :x, and one of the values is :A, then a resulting binary column will have 1 in all the rows where column has :A.

The sequence of values to generate the binary columns is defined as follows: either the value provided for the :values key if present, or the distinct values in column in their order of appearance. If the value of the option key :include-first is false (which is the default), then the first value is omitted. This is handy for avoiding multicollinearity in linear regression.

Supported options:
    • :values - the values to encode as columns - default nil
    • :include-first - should the first value be included - default false

Examples

(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x}))
[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]})
[#tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]
             :include-first true})
[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0] #tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
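
The rule for choosing which values get a binary column can be sketched in plain Clojure. This is an illustration only, operating on vectors; `one-hot-sketch` is a hypothetical name and the code is not Tablemath's implementation.

```clojure
;; Plain-Clojure sketch of the value-selection rule described above.
;; `one-hot-sketch` is hypothetical and works on vectors, not columns.
(defn one-hot-sketch
  ([xs] (one-hot-sketch xs {}))
  ([xs {:keys [values include-first]
        :or {include-first false}}]
   (let [;; explicit :values win; otherwise use order of appearance
         all-values (or values (distinct xs))
         ;; drop the first value unless :include-first is true
         encoded (if include-first all-values (rest all-values))]
     (mapv (fn [v]
             [v (mapv #(if (= % v) 1 0) xs)])
           encoded))))

(one-hot-sketch [:B :A :A :B :B :C])
;; => [[:A [0 1 1 0 0 0]] [:C [0 0 0 0 0 1]]]

(one-hot-sketch [:B :A :A :B :B :C] {:values [:A :B :C]})
;; => [[:B [1 0 0 1 1 0]] [:C [0 0 0 0 0 1]]]
```

In the first call :B is dropped because it appears first; in the second, :A is dropped because it comes first in :values.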

with

[m expr]

Evaluate expression expr in the context of destructuring all the keys of map m.

Examples

(tm/with {:x 3 :y 9}
         '(+ x y))
12
(tm/with (tc/dataset {:x (range 4)
                      :y 9})
         '(tcc/+ x y))
#tech.v3.dataset.column<int64>[4]
null
[9, 10, 11, 12]
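
One way to picture "destructuring all the keys" is to wrap the expression in a function whose argument destructures the map, and then call it. The sketch below does that with `eval`; `with-sketch` is a hypothetical name, it handles keyword keys only, and it is an assumption about the general technique rather than a description of Tablemath's internals.

```clojure
;; Plain-Clojure sketch; `with-sketch` is hypothetical, not Tablemath's code.
;; Only keyword keys are handled here.
(defn with-sketch
  [m expr]
  (let [;; one symbol per key, e.g. :x -> x
        syms (mapv (comp symbol name) (keys m))
        ;; builds the form (fn [{:keys [x y]}] <expr>) and compiles it
        f (eval (list 'fn [{:keys syms}] expr))]
    (f m)))

(with-sketch {:x 3 :y 9} '(+ x y))
;; => 12
```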

columns-with

[dataset specs]

Compute a sequence of named columns by a given sequence of specs in the context of a given dataset.

Each spec is one of the following:

    1. a keyword or string - in that case, we just take the corresponding column of the original dataset.
    2. a vector of two elements [nam expr], where the first is a string or a keyword. In that case, nam is interpreted as a name or a name-prefix for the resulting columns, and expr is handled as an expression as in (3).
    3. any other Clojure form - in that case, we treat it as an expression, and evaluate it while destructuring the column names of dataset as well as all the columns created by previous specs; the evaluation is expected to return one of the following:
    • a column (or the data to create a column (e.g., a vector of numbers))
    • a sequential of columns
    • a map from column names to columns

In any case, the result of the spec is turned into a sequence of named columns, which is concatenated to the columns from the previous specs. Some default naming mechanisms are invoked if column names are missing.

Columns of strings and keywords that have at most 20 distinct values are one-hot-encoded by default.

Eventually, the sequence of all resulting columns is returned.
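
The three kinds of spec can be told apart by shape alone. The following plain-Clojure sketch classifies a spec according to the rules listed above; `spec-kind` and the returned keywords are hypothetical names used for illustration, not part of the Tablemath API.

```clojure
;; Plain-Clojure sketch of the spec classification described above.
;; `spec-kind` and its result keywords are hypothetical.
(defn spec-kind
  [spec]
  (let [name-like? (fn [x] (or (keyword? x) (string? x)))]
    (cond
      ;; (1) a keyword or string names an existing column
      (name-like? spec) :column-name
      ;; (2) a two-element vector whose first element is a name
      (and (vector? spec)
           (= 2 (count spec))
           (name-like? (first spec))) :named-expression
      ;; (3) anything else is an expression to evaluate
      :else :expression)))

(spec-kind :x)                   ;; => :column-name
(spec-kind [:z '(tcc/+ x y)])    ;; => :named-expression
(spec-kind '(tcc/+ x y))         ;; => :expression
;; a two-element vector that does not start with a name is an expression
(spec-kind '[(tcc/+ x y) (tcc/* x y)]) ;; => :expression
```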

Examples

Note the naming of the resulting columns, and note they can sequentially depend on each other.

(tm/columns-with (tc/dataset {"v" [4 5 6]
                              :w [:A :B :C]
                              :x (range 3)
                              :y (reverse (range 3))})
                 [:v
                  :w
                  :x
                  '(tcc/+ x y)
                  [:z '(tcc/+ x y)]
                  [:z1000 '(tcc/* z 1000)]
                  '((juxt tcc/+ tcc/*) x y)
                  [:p '((juxt tcc/+ tcc/*) x y)]
                  '{:a (tcc/+ x y)
                    :b (tcc/* x y)}
                  [:p '{:a (tcc/+ x y)
                        :b (tcc/* x y)}]
                  '[(tcc/column (tcc/+ x y) {:name :c})
                    (tcc/column (tcc/* x y) {:name :d})]
                  [:p '[(tcc/column (tcc/+ x y) {:name :c})
                        (tcc/column (tcc/* x y) {:name :d})]]])
(#tech.v3.dataset.column<int64>[3]
:w=:B
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:w=:C
[0, 0, 1] #tech.v3.dataset.column<int64>[3]
:x
[0, 1, 2] #tech.v3.dataset.column<int64>[3]
(tcc/+ x y)
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:z
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:z1000
[2000, 2000, 2000] #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_0
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_1
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:p_0
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:p_1
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:a
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:b
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:pa
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:pb
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:c
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:d
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:pc
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:pd
[0, 1, 0])

design

[dataset target-specs feature-specs]

Given a dataset and sequences target-specs, feature-specs, generate a new dataset from the columns generated by columns-with from these two sequences. The columns from target-specs will be marked as targets for modelling (e.g., regression, classification).

(Inspired by metamorph.ml.design-matrix but adapted for columnwise computation.)

Examples

(tm/design (tc/dataset {"v" [4 5 6]
                        :w [:A :B :C]
                        :x (range 3)
                        :y (reverse (range 3))})
           [:y]
           [:v
            :w
            :x
            '(tcc/+ x y)
            [:z '(tcc/+ x y)]
            [:z1000 '(tcc/* z 1000)]
            '((juxt tcc/+ tcc/*) x y)
            [:p '((juxt tcc/+ tcc/*) x y)]
            '{:a (tcc/+ x y)
              :b (tcc/* x y)}
            [:p '{:a (tcc/+ x y)
                  :b (tcc/* x y)}]
            '[(tcc/column (tcc/+ x y) {:name :c})
              (tcc/column (tcc/* x y) {:name :d})]
            [:p '[(tcc/column (tcc/+ x y) {:name :c})
                  (tcc/column (tcc/* x y) {:name :d})]]])

_unnamed [3 19]:

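| :y | :w=:B | :w=:C | :x | (tcc/+ x y) | :z | :z1000 | ((juxt tcc/+ tcc/*) x y)_0 | ((juxt tcc/+ tcc/*) x y)_1 | :p_0 | :p_1 | :a | :b | :pa | :pb | :c | :d | :pc | :pd |
|----+-------+-------+----+-------------+----+--------+----------------------------+----------------------------+------+------+----+----+-----+-----+----+----+-----+-----|
|  2 |     0 |     0 |  0 |           2 |  2 |   2000 |                          2 |                          0 |    2 |    0 |  2 |  0 |   2 |   0 |  2 |  0 |   2 |   0 |
|  1 |     1 |     0 |  1 |           2 |  2 |   2000 |                          2 |                          1 |    2 |    1 |  2 |  1 |   2 |   1 |  2 |  1 |   2 |   1 |
|  0 |     0 |     1 |  2 |           2 |  2 |   2000 |                          2 |                          0 |    2 |    0 |  2 |  0 |   2 |   0 |  2 |  0 |   2 |   0 |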

lm

[dataset]

[dataset options]

Compute a linear regression model for dataset. The first column marked as target is the target. All the columns not marked as target are the features. The resulting model is of type fastmath.ml.regression.LMData, as generated by Fastmath. It can be summarized by summary.

See fastmath.ml.regression.lm for options.

Examples

Linear relationship

(def linear-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ (* 2 x)
                           -3
                           (* 3 (rand)))))))
(-> linear-toydata
    plotly/layer-point)

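Note how the coefficients fit the way we generated the data: the slope estimate is close to 2, and the intercept estimate is close to -1.5, which is -3 plus the mean 1.5 of the noise term (* 3 (rand)):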

(-> linear-toydata
    (tm/design [:y]
               [:x])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |   :median |     :q3 |     :max |
|-----------+-----------+-----------+---------+----------|
| -1.443772 | -0.350436 | -0.244968 | 0.76477 | 1.198845 |

Coefficients:

|     :name | :estimate |  :stderr |  :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+-----------+----------+----------------------|
| Intercept | -1.009014 | 0.523742 | -1.926548 | 0.095406 | [-2.247467 0.229439] |
|        :x |  1.846975 | 0.110008 | 16.789489 |   1.0E-6 |  [1.586848 2.107102] |

F-statistic: 281.8869508632346 on degrees of freedom: {:residual 7, :model 1, :intercept 1}
p-value: 6.506718930321398E-7

R2: 0.975769068214805
Adjusted R2: 0.9723075065312058
Residual standard error: 0.8521167114773845 on 7 degrees of freedom
AIC: 26.398491770973536
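
For a single feature with an intercept, ordinary least squares has a closed form: the slope is the ratio of the covariance of x and y to the variance of x, and the intercept is the mean of y minus the slope times the mean of x. The plain-Clojure sketch below applies it to a noise-free version of the data above. The helpers `mean` and `simple-ols` are hypothetical and only illustrate the estimates; Fastmath's fitting procedure is more general.

```clojure
;; Plain-Clojure sketch of simple least squares; not Fastmath's implementation.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn simple-ols
  "Closed-form intercept and slope for one feature."
  [xs ys]
  (let [mx (mean xs)
        my (mean ys)
        ;; sum of cross-deviations and sum of squared deviations of x
        sxy (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))
        slope (/ sxy sxx)]
    {:slope slope
     :intercept (- my (* slope mx))}))

;; Same generating rule as linear-toydata, without the random term.
(let [xs (range 9)
      ys (map (fn [x] (+ (* 2 x) -3)) xs)]
  (simple-ols xs ys))
;; => {:slope 2.0, :intercept -3.0}
```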

Cubic relationship

(def cubic-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ 50
                           (* 4 x)
                           (* -9 x x)
                           (* x x x)
                           (* 3 (rand)))))))
(-> cubic-toydata
    plotly/layer-point)

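Note how the coefficients fit the way we generated the data: the estimates for :x, :x2, and :x3 are close to 4, -9, and 1, and the intercept estimate is close to 51.5, which is 50 plus the mean 1.5 of the noise term (* 3 (rand)):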

(-> cubic-toydata
    (tm/design [:y]
               ['(tm/polynomial x 3)])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -1.200883 | -0.416095 | 0.089125 | 0.557187 | 0.843251 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | 51.854441 | 0.761092 |   68.13167 |      0.0 | [49.897993 53.810889] |
|        :x |  4.161807 |  0.87841 |   4.737889 |  0.00516 |   [1.903784 6.419831] |
|       :x2 | -9.051775 | 0.265215 | -34.130006 |      0.0 |  [-9.73353 -8.370019] |
|       :x3 |  1.004353 | 0.021754 |  46.167917 |      0.0 |   [0.948432 1.060275] |

F-statistic: 4069.899285168728 on degrees of freedom: {:residual 5, :model 3, :intercept 1}
p-value: 6.905335636631094E-9

R2: 0.99959065708713
Adjusted R2: 0.999345051339408
Residual standard error: 0.8213817632386143 on 5 degrees of freedom
AIC: 26.709002578008814

Categorical relationship

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])
(def categorical-toydata
  (-> {:t (range 18)
       :day-of-week (->> days-of-week
                         (repeat 3)
                         (apply concat))}
      tc/dataset
      (tc/map-columns :traffic
                      [:day-of-week]
                      (fn [dow]
                        (+ (case dow
                             :Sat 50
                             :Sun 50
                             60)
                           (* 5 (rand)))))))
(-> categorical-toydata
    (plotly/layer-point {:=x :t
                         :=y :traffic
                         :=color :day-of-week
                         :=mark-size 10})
    (plotly/layer-line {:=x :t
                        :=y :traffic}))

A model with all days except for one, dropping one category to avoid multicollinearity (note that the omitted category is Monday, the first value in the order of appearance):

(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week)])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.089684 | -0.725653 | 0.027399 | 0.743488 | 1.896184 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |   :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
|         Intercept | 61.711979 | 0.785093 | 78.604691 |      0.0 |  [59.984002 63.439957] |
| :day-of-week=:Tue |  2.061152 | 1.110289 |  1.856411 | 0.090358 |   [-0.382577 4.504882] |
| :day-of-week=:Wed |  0.460481 | 1.110289 |   0.41474 | 0.686306 |    [-1.983249 2.90421] |
| :day-of-week=:Thu |  0.560154 | 1.110289 |  0.504512 | 0.623855 |   [-1.883575 3.003884] |
| :day-of-week=:Fri |  2.637936 | 1.241341 |   2.12507 | 0.057066 |   [-0.094236 5.370109] |
| :day-of-week=:Sat | -8.227844 | 1.241341 | -6.628191 |   3.7E-5 | [-10.960016 -5.495671] |
| :day-of-week=:Sun |  -9.23519 | 1.241341 | -7.439689 |   1.3E-5 | [-11.967362 -6.503017] |

F-statistic: 28.03879101493394 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 4.726292776258134E-6

R2: 0.9386272863637274
Adjusted R2: 0.9051512607439424
Residual standard error: 1.359820669918769 on 11 degrees of freedom
AIC: 69.28191236879977

A model with all days except for one, dropping one category to avoid multicollinearity, and specifying the order of encoded values:

(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week})])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.089684 | -0.725653 | 0.027399 | 0.743488 | 1.896184 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |   :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
|         Intercept | 61.711979 | 0.785093 | 78.604691 |      0.0 |  [59.984002 63.439957] |
| :day-of-week=:Tue |  2.061152 | 1.110289 |  1.856411 | 0.090358 |   [-0.382577 4.504882] |
| :day-of-week=:Wed |  0.460481 | 1.110289 |   0.41474 | 0.686306 |    [-1.983249 2.90421] |
| :day-of-week=:Thu |  0.560154 | 1.110289 |  0.504512 | 0.623855 |   [-1.883575 3.003884] |
| :day-of-week=:Fri |  2.637936 | 1.241341 |   2.12507 | 0.057066 |   [-0.094236 5.370109] |
| :day-of-week=:Sat | -8.227844 | 1.241341 | -6.628191 |   3.7E-5 | [-10.960016 -5.495671] |
| :day-of-week=:Sun |  -9.23519 | 1.241341 | -7.439689 |   1.3E-5 | [-11.967362 -6.503017] |

F-statistic: 28.03879101493394 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 4.726292776258134E-6

R2: 0.9386272863637274
Adjusted R2: 0.9051512607439424
Residual standard error: 1.359820669918769 on 11 degrees of freedom
AIC: 69.28191236879977

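A model with all days and no intercept, dropping the intercept to avoid multicollinearity and have an easier interpretation of the coefficients.

Note how the coefficients fit the way we generated the data: each estimate is the mean traffic of its day, close to 62.5 on weekdays (60 plus the mean 2.5 of the noise term) and close to 52.5 on Saturday and Sunday: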

(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week
                              :include-first true})])
    (tm/lm {:intercept? false})
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.089684 | -0.725653 | 0.027399 | 0.743488 | 1.896184 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |  :confidence-interval |
|-------------------+-----------+----------+-----------+----------+-----------------------|
| :day-of-week=:Mon | 61.711979 | 0.785093 | 78.604691 |      0.0 | [59.984002 63.439957] |
| :day-of-week=:Tue | 63.773132 | 0.785093 | 81.230053 |      0.0 | [62.045154 65.501109] |
| :day-of-week=:Wed |  62.17246 | 0.785093 | 79.191222 |      0.0 | [60.444483 63.900438] |
| :day-of-week=:Thu | 62.272133 | 0.785093 | 79.318179 |      0.0 | [60.544156 64.000111] |
| :day-of-week=:Fri | 64.349916 | 0.961538 | 66.923915 |      0.0 | [62.233584 66.466247] |
| :day-of-week=:Sat | 53.484136 | 0.961538 | 55.623504 |      0.0 | [51.367804 55.600468] |
| :day-of-week=:Sun |  52.47679 | 0.961538 | 54.575864 |      0.0 | [50.360458 54.593121] |

F-statistic: 5127.2787830847365 on degrees of freedom: {:residual 11, :model 7, :intercept 0}
p-value: 0.0

R2: 0.9996936099697633
Adjusted R2: 0.9994986344959763
Residual standard error: 1.359820669918769 on 11 degrees of freedom
AIC: 69.28191236879977
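
The two parameterizations describe the same fit. With Monday as the omitted baseline, the intercept is Monday's mean and each other coefficient is that day's offset from Monday, so intercept plus offset should reproduce the per-day estimate of the no-intercept model. The plain-Clojure sketch below checks this on the estimates printed in the summaries above, up to the rounding of the printed values. The var names and the `close?` helper are hypothetical.

```clojure
;; Estimates copied from the summaries above (rounded as printed).
(def baseline-intercept 61.711979)

(def offsets-from-monday
  {:Tue 2.061152 :Wed 0.460481 :Thu 0.560154
   :Fri 2.637936 :Sat -8.227844 :Sun -9.23519})

(def per-day-estimates
  {:Mon 61.711979 :Tue 63.773132 :Wed 62.17246 :Thu 62.272133
   :Fri 64.349916 :Sat 53.484136 :Sun 52.47679})

(defn close?
  "True when a and b agree up to the rounding of the printed output."
  [^double a ^double b]
  (< (Math/abs (- a b)) 1e-5))

;; intercept + offset should match the no-intercept coefficient of each day
(every? (fn [[day offset]]
          (close? (+ baseline-intercept offset)
                  (per-day-estimates day)))
        offsets-from-monday)
;; => true
```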