2  API reference

Setup

In this notebook, we will use Tablecloth and Tableplot for code examples, alongside Tablemath.

(ns tablemath-book.reference
  (:require [scicloj.tablemath.v1.api :as tm]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [scicloj.tableplot.v1.plotly :as plotly]
            [tablemath-book.utils :as utils]))

Reference

polynomial

[column degree]

Given a column and an integer degree, return a vector of columns with all its powers up to that degree, named appropriately.

Examples

(-> [1 2 3]
    (tcc/column {:name :x})
    (tm/polynomial 4))
[#tech.v3.dataset.column<int64>[3]
:x
[1, 2, 3] #tech.v3.dataset.column<int64>[3]
:x2
[1, 4, 9] #tech.v3.dataset.column<int64>[3]
:x3
[1, 8, 27] #tech.v3.dataset.column<int64>[3]
:x4
[1, 16, 81]]
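
The naming and values shown above can be mimicked on ordinary vectors. The following is a minimal plain-Clojure sketch, not Tablemath's implementation; the helper name `powers-of` and its return shape (a vector of name/values pairs) are assumptions made for illustration.

```clojure
;; Hypothetical helper: works on plain vectors instead of columns.
(defn powers-of
  "Return a vector of [name values] pairs holding the powers of xs
  from 1 up to degree. The first pair keeps base-name; the others
  append the exponent to it."
  [base-name xs degree]
  (mapv (fn [d]
          [(if (= d 1)
             base-name
             (keyword (str (name base-name) d)))
           ;; repeated multiplication keeps integer inputs as integers
           (mapv (fn [x] (reduce * (repeat d x))) xs)])
        (range 1 (inc degree))))

(powers-of :x [1 2 3] 4)
;; => [[:x [1 2 3]] [:x2 [1 4 9]] [:x3 [1 8 27]] [:x4 [1 16 81]]]
```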

one-hot

[column]

[column {:keys [values include-first], :or {values (distinct column), include-first false}}]

Given a column, create a vector of integer binary columns, each encoding the presence or absence of one of its values.

E.g., if the column name is :x, and one of the values is :A, then a resulting binary column will have 1 in all the rows where column has :A.

The sequence of values used to generate the binary columns is defined as follows: either the value provided for the :values key, if present, or the distinct values in column in their order of appearance. If the value of the option key :include-first is false (which is the default), then the first value is omitted. This is handy for avoiding multicollinearity in linear regression.

Supported options:
    • :values - the values to encode as columns - default nil
    • :include-first - should the first value be included - default false

Examples

(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x}))
[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]})
[#tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
(tm/one-hot (tcc/column [:B :A :A :B :B :C]
                        {:name :x})
            {:values [:A :B :C]
             :include-first true})
[#tech.v3.dataset.column<int64>[6]
:x=:A
[0, 1, 1, 0, 0, 0] #tech.v3.dataset.column<int64>[6]
:x=:B
[1, 0, 0, 1, 1, 0] #tech.v3.dataset.column<int64>[6]
:x=:C
[0, 0, 0, 0, 0, 1]]
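
The encoding rules described above, and the reason for dropping the first value, can be illustrated on ordinary vectors. This is a minimal plain-Clojure sketch under stated assumptions, not Tablemath's implementation; `one-hot-sketch` is a hypothetical helper that returns value/indicator pairs rather than named columns.

```clojure
;; Hypothetical helper on plain vectors; not Tablemath's implementation.
(defn one-hot-sketch
  [xs {:keys [values include-first]
       :or {include-first false}}]
  (let [;; default: distinct values in their order of appearance
        all-values (or values (distinct xs))
        ;; unless asked otherwise, skip the first value
        encoded (if include-first all-values (rest all-values))]
    (mapv (fn [v]
            [v (mapv (fn [x] (if (= x v) 1 0)) xs)])
          encoded)))

(def xs [:B :A :A :B :B :C])

(one-hot-sketch xs {})
;; => [[:A [0 1 1 0 0 0]] [:C [0 0 0 0 0 1]]]

(def full
  (one-hot-sketch xs {:values [:A :B :C]
                      :include-first true}))

;; With every value encoded, the indicators of each row sum to 1.
;; That sum is identical to an intercept column, which is the
;; multicollinearity that dropping the first value avoids.
(apply mapv + (map second full))
;; => [1 1 1 1 1 1]
```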

with

[m expr]

Evaluate expression expr in the context of destructuring all the keys of map m.

Examples

(tm/with {:x 3 :y 9}
         '(+ x y))
12
(tm/with (tc/dataset {:x (range 4)
                      :y 9})
         '(tcc/+ x y))
#tech.v3.dataset.column<int64>[4]
null
[9, 10, 11, 12]
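
One way to picture "evaluating in the context of destructuring" is to wrap the expression in a `let` form that destructures the map's keys and then evaluate that form. The sketch below does this for plain maps with keyword keys; `with-sketch` is a hypothetical name, and this is not necessarily how Tablemath implements `with` (in particular, embedding the map directly in the evaluated form is only safe for plain data, not for datasets).

```clojure
;; Hypothetical sketch: build (let [{:keys [x y ...]} m] expr) and eval it.
;; Assumes m is a plain map with simple keyword keys.
(defn with-sketch
  [m expr]
  (let [syms (mapv (comp symbol name) (keys m))]
    (eval `(let [{:keys ~syms} ~m]
             ~expr))))

(with-sketch {:x 3 :y 9}
             '(+ x y))
;; => 12
```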

columns-with

[dataset specs]

Compute a sequence of named columns by a given sequence of specs in the context of a given dataset.

Each spec is one of the following:

    1. a keyword or string - in that case, we just take the corresponding column of the original dataset.
    2. a vector of two elements [nam expr], where the first is a string or a keyword. In that case, nam is interpreted as a name or a name-prefix for the resulting columns, and expr is handled as an expression as in (3).
    3. any other Clojure form - in that case, we treat it as an expression, and evaluate it while destructuring the column names of dataset as well as all the columns created by previous specs; the evaluation is expected to return one of the following:
    • a column (or the data to create a column (e.g., a vector of numbers))
    • a sequential of columns
    • a map from column names to columns

In any case, the result of the spec is turned into a sequence of named columns, which is concatenated to the columns from the previous specs. Some default naming mechanisms are invoked if column names are missing.

Columns of strings and keywords that have at most 20 distinct values are one-hot-encoded by default.

Eventually, the sequence of all resulting columns is returned.

Examples

Note the naming of the resulting columns, and note they can sequentially depend on each other.

(tm/columns-with (tc/dataset {"v" [4 5 6]
                              :w [:A :B :C]
                              :x (range 3)
                              :y (reverse (range 3))})
                 [:v
                  :w
                  :x
                  '(tcc/+ x y)
                  [:z '(tcc/+ x y)]
                  [:z1000 '(tcc/* z 1000)]
                  '((juxt tcc/+ tcc/*) x y)
                  [:p '((juxt tcc/+ tcc/*) x y)]
                  '{:a (tcc/+ x y)
                    :b (tcc/* x y)}
                  [:p '{:a (tcc/+ x y)
                        :b (tcc/* x y)}]
                  '[(tcc/column (tcc/+ x y) {:name :c})
                    (tcc/column (tcc/* x y) {:name :d})]
                  [:p '[(tcc/column (tcc/+ x y) {:name :c})
                        (tcc/column (tcc/* x y) {:name :d})]]])
(#tech.v3.dataset.column<int64>[3]
:w=:B
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:w=:C
[0, 0, 1] #tech.v3.dataset.column<int64>[3]
:x
[0, 1, 2] #tech.v3.dataset.column<int64>[3]
(tcc/+ x y)
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:z
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:z1000
[2000, 2000, 2000] #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_0
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
((juxt tcc/+ tcc/*) x y)_1
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:p_0
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:p_1
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:a
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:b
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:pa
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:pb
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:c
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:d
[0, 1, 0] #tech.v3.dataset.column<int64>[3]
:pc
[2, 2, 2] #tech.v3.dataset.column<int64>[3]
:pd
[0, 1, 0])
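
The sequential dependency between specs (as with :z and :z1000 above) amounts to threading a growing environment of columns through the specs. The following plain-Clojure sketch shows that idea only; it is not Tablemath's implementation. As a simplifying assumption, each spec here is a pair of a name and a function of the environment, rather than a quoted expression, and `columns-with-sketch` is a hypothetical name.

```clojure
;; Hypothetical sketch: each spec is [name f], where f receives a map of
;; all columns known so far (the original ones plus earlier results).
(defn columns-with-sketch
  [columns specs]
  (:result
   (reduce (fn [{:keys [env result]} [nam f]]
             (let [col (f env)]
               {:env (assoc env nam col)         ; visible to later specs
                :result (conj result [nam col])})) ; accumulated output
           {:env columns
            :result []}
           specs)))

(columns-with-sketch
 {:x [0 1 2]
  :y [2 1 0]}
 [[:z (fn [{:keys [x y]}] (mapv + x y))]
  [:z1000 (fn [{:keys [z]}] (mapv #(* 1000 %) z))]])
;; => [[:z [2 2 2]] [:z1000 [2000 2000 2000]]]
```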

design

[dataset target-specs feature-specs]

Given a dataset and sequences target-specs, feature-specs, generate a new dataset from the columns generated by columns-with from these two sequences. The columns from target-specs will be marked as targets for modelling (e.g., regression, classification).

(Inspired by metamorph.ml.design-matrix but adapted for columnwise computation.)

Examples

(tm/design (tc/dataset {"v" [4 5 6]
                        :w [:A :B :C]
                        :x (range 3)
                        :y (reverse (range 3))})
           [:y]
           [:v
            :w
            :x
            '(tcc/+ x y)
            [:z '(tcc/+ x y)]
            [:z1000 '(tcc/* z 1000)]
            '((juxt tcc/+ tcc/*) x y)
            [:p '((juxt tcc/+ tcc/*) x y)]
            '{:a (tcc/+ x y)
              :b (tcc/* x y)}
            [:p '{:a (tcc/+ x y)
                  :b (tcc/* x y)}]
            '[(tcc/column (tcc/+ x y) {:name :c})
              (tcc/column (tcc/* x y) {:name :d})]
            [:p '[(tcc/column (tcc/+ x y) {:name :c})
                  (tcc/column (tcc/* x y) {:name :d})]]])

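_unnamed [3 19]:

| :y | :w=:B | :w=:C | :x | (tcc/+ x y) | :z | :z1000 | ((juxt tcc/+ tcc/*) x y)_0 | ((juxt tcc/+ tcc/*) x y)_1 | :p_0 | :p_1 | :a | :b | :pa | :pb | :c | :d | :pc | :pd |
|----+-------+-------+----+-------------+----+--------+----------------------------+----------------------------+------+------+----+----+-----+-----+----+----+-----+-----|
|  2 |     0 |     0 |  0 |           2 |  2 |   2000 |                          2 |                          0 |    2 |    0 |  2 |  0 |   2 |   0 |  2 |  0 |   2 |   0 |
|  1 |     1 |     0 |  1 |           2 |  2 |   2000 |                          2 |                          1 |    2 |    1 |  2 |  1 |   2 |   1 |  2 |  1 |   2 |   1 |
|  0 |     0 |     1 |  2 |           2 |  2 |   2000 |                          2 |                          0 |    2 |    0 |  2 |  0 |   2 |   0 |  2 |  0 |   2 |   0 |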

lm

[dataset]

[dataset options]

Compute a linear regression model for dataset. The first column marked as target is the target. All columns not marked as targets are the features. The resulting model is of type fastmath.ml.regression.LMData, as generated by Fastmath. It can be summarized by summary.

See fastmath.ml.regression.lm for options.
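
As a reminder of what such a model estimates, the single-feature case of ordinary least squares has a closed form: the slope is the ratio of the centered cross-products to the centered squares, and the intercept follows from the means. The sketch below implements that formula in plain Clojure; it is an illustration only, not how Fastmath computes the fit, and `mean` and `simple-ols` are hypothetical helper names.

```clojure
;; Hypothetical helpers illustrating simple least squares.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn simple-ols
  "Return {:intercept a :slope b} minimizing the sum of squared
  residuals of the line y = a + b*x."
  [xs ys]
  (let [mx (mean xs)
        my (mean ys)
        sxy (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))
        slope (/ sxy sxx)]
    {:intercept (- my (* slope mx))
     :slope slope}))

;; Noise-free data on the line y = 2x - 3
(simple-ols (range 9)
            (map #(- (* 2 %) 3) (range 9)))
;; => {:intercept -3.0, :slope 2.0}
```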

Examples

Linear relationship

(def linear-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ (* 2 x)
                           -3
                           (* 3 (rand)))))))
(-> linear-toydata
    plotly/layer-point)

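Note how the coefficients fit the way we generated the data: the true slope is 2, and the true intercept is -1.5, since the noise term (* 3 (rand)) has mean 1.5 on top of the constant -3: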

(-> linear-toydata
    (tm/design [:y]
               [:x])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -1.065875 | -0.558432 | 0.158398 | 0.528058 | 0.669637 |

Coefficients:

|     :name | :estimate |  :stderr |  :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+-----------+----------+----------------------|
| Intercept | -0.546876 | 0.392459 |  -1.39346 | 0.206125 | [-1.474893 0.381142] |
|        :x |  1.846629 | 0.082433 | 22.401628 |      0.0 |  [1.651706 2.041552] |

F-statistic: 501.83291766778524 on degrees of freedom: {:residual 7, :model 1, :intercept 1}
p-value: 8.935090556327907E-8

R2: 0.9862430283950885
Adjusted R2: 0.984277746737244
Residual standard error: 0.6385217752507847 on 7 degrees of freedom
AIC: 21.204272737197513

Cubic relationship

(def cubic-toydata
  (-> {:x (range 9)}
      tc/dataset
      (tc/map-columns :y
                      [:x]
                      (fn [x]
                        (+ 50
                           (* 4 x)
                           (* -9 x x)
                           (* x x x)
                           (* 3 (rand)))))))
(-> cubic-toydata
    plotly/layer-point)

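Note how the coefficients fit the way we generated the data: the true coefficients of :x, :x2, and :x3 are 4, -9, and 1, and the true intercept is 51.5 (the constant 50 plus the mean 1.5 of the noise term):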

(-> cubic-toydata
    (tm/design [:y]
               ['(tm/polynomial x 3)])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |   :median |      :q3 |     :max |
|-----------+-----------+-----------+----------+----------|
| -1.416885 | -0.361248 | -0.159624 | 0.421575 | 1.340384 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |   :confidence-interval |
|-----------+-----------+----------+------------+----------+------------------------|
| Intercept | 50.681655 | 0.891266 |  56.864814 |      0.0 |  [48.390584 52.972726] |
|        :x |  5.152666 | 1.028649 |   5.009157 | 0.004073 |    [2.508439 7.796894] |
|       :x2 | -9.293254 | 0.310576 | -29.922661 |   1.0E-6 | [-10.091615 -8.494894] |
|       :x3 |  1.019036 | 0.025475 |  40.001202 |      0.0 |     [0.95355 1.084522] |

F-statistic: 2962.49805122806 on degrees of freedom: {:residual 5, :model 3, :intercept 1}
p-value: 1.5268940778412343E-8

R2: 0.9994377280531662
Adjusted R2: 0.9991003648850659
Residual standard error: 0.9618675995935431 on 5 degrees of freedom
AIC: 29.551001186892037

Categorical relationship

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])
(def categorical-toydata
  (-> {:t (range 18)
       :day-of-week (->> days-of-week
                         (repeat 3)
                         (apply concat))}
      tc/dataset
      (tc/map-columns :traffic
                      [:day-of-week]
                      (fn [dow]
                        (+ (case dow
                             :Sat 50
                             :Sun 50
                             60)
                           (* 5 (rand)))))))
(-> categorical-toydata
    (plotly/layer-point {:=x :t
                         :=y :traffic
                         :=color :day-of-week
                         :=mark-size 10})
    (plotly/layer-line {:=x :t
                        :=y :traffic}))

A model with all days except for one, dropping one category to avoid multicollinearity (note that Monday is the dropped category, being the first value in order of appearance):

(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week)])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.384111 | -0.945412 | 0.201007 | 1.030651 | 2.271914 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |   :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
|         Intercept | 61.515092 | 0.964584 | 63.773679 |      0.0 |  [59.392056 63.638128] |
| :day-of-week=:Tue |  1.643425 | 1.364128 |  1.204744 | 0.253582 |   [-1.359001 4.645851] |
| :day-of-week=:Wed |  0.208831 | 1.364128 |  0.153088 | 0.881101 |   [-2.793595 3.211257] |
| :day-of-week=:Thu |  0.943699 | 1.364128 |  0.691796 | 0.503406 |   [-2.058728 3.946125] |
| :day-of-week=:Fri |  1.181772 | 1.525142 |  0.774861 | 0.454756 |   [-2.175042 4.538587] |
| :day-of-week=:Sat | -7.688267 | 1.525142 | -5.041018 |  3.77E-4 | [-11.045081 -4.331452] |
| :day-of-week=:Sun | -9.537227 | 1.525142 | -6.253338 |   6.2E-5 | [-12.894041 -6.180412] |

F-statistic: 16.87587496496225 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 5.799260079852875E-5

R2: 0.9020090372557159
Adjusted R2: 0.8485594212133791
Residual standard error: 1.670709092712173 on 11 degrees of freedom
AIC: 76.69414360164832

A model with all days except for one, dropping one category to avoid multicollinearity, and specifying the order of encoded values:

(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week})])
    tm/lm
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.384111 | -0.945412 | 0.201007 | 1.030651 | 2.271914 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |   :confidence-interval |
|-------------------+-----------+----------+-----------+----------+------------------------|
|         Intercept | 61.515092 | 0.964584 | 63.773679 |      0.0 |  [59.392056 63.638128] |
| :day-of-week=:Tue |  1.643425 | 1.364128 |  1.204744 | 0.253582 |   [-1.359001 4.645851] |
| :day-of-week=:Wed |  0.208831 | 1.364128 |  0.153088 | 0.881101 |   [-2.793595 3.211257] |
| :day-of-week=:Thu |  0.943699 | 1.364128 |  0.691796 | 0.503406 |   [-2.058728 3.946125] |
| :day-of-week=:Fri |  1.181772 | 1.525142 |  0.774861 | 0.454756 |   [-2.175042 4.538587] |
| :day-of-week=:Sat | -7.688267 | 1.525142 | -5.041018 |  3.77E-4 | [-11.045081 -4.331452] |
| :day-of-week=:Sun | -9.537227 | 1.525142 | -6.253338 |   6.2E-5 | [-12.894041 -6.180412] |

F-statistic: 16.87587496496225 on degrees of freedom: {:residual 11, :model 6, :intercept 1}
p-value: 5.799260079852875E-5

R2: 0.9020090372557159
Adjusted R2: 0.8485594212133791
Residual standard error: 1.670709092712173 on 11 degrees of freedom
AIC: 76.69414360164832

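A model with all days and no intercept, dropping the intercept to avoid multicollinearity and have an easier interpretation of the coefficients:

Note how the coefficients fit the way we generated the data: each one estimates the expected traffic on its day, which is 62.5 on weekdays and 52.5 on weekends (the base value plus the mean 2.5 of the noise term):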

(-> categorical-toydata
    (tm/design [:traffic]
               ['(tm/one-hot day-of-week
                             {:values days-of-week
                              :include-first true})])
    (tm/lm {:intercept? false})
    tm/summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.384111 | -0.945412 | 0.201007 | 1.030651 | 2.271914 |

Coefficients:

|             :name | :estimate |  :stderr |  :t-value | :p-value |  :confidence-interval |
|-------------------+-----------+----------+-----------+----------+-----------------------|
| :day-of-week=:Mon | 61.515092 | 0.964584 | 63.773679 |      0.0 | [59.392056 63.638128] |
| :day-of-week=:Tue | 63.158517 | 0.964584 | 65.477444 |      0.0 | [61.035481 65.281553] |
| :day-of-week=:Wed | 61.723923 | 0.964584 | 63.990177 |      0.0 | [59.600887 63.846959] |
| :day-of-week=:Thu |  62.45879 | 0.964584 | 64.752026 |      0.0 | [60.335755 64.581826] |
| :day-of-week=:Fri | 62.696864 |  1.18137 | 53.071331 |      0.0 | [60.096687 65.297041] |
| :day-of-week=:Sat | 53.826825 |  1.18137 | 45.563064 |      0.0 | [51.226648 56.427002] |
| :day-of-week=:Sun | 51.977865 |  1.18137 | 43.997966 |      0.0 | [49.377688 54.578043] |

F-statistic: 3352.9036264916203 on degrees of freedom: {:residual 11, :model 7, :intercept 0}
p-value: 0.0

R2: 0.9995315426271968
Adjusted R2: 0.9992334333899584
Residual standard error: 1.6707090927121728 on 11 degrees of freedom
AIC: 76.69414360164832
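
The models with and without an intercept are two parametrizations of the same fit: their residuals, residual standard error, and AIC coincide in the summaries above. As a quick arithmetic check, each per-day coefficient of the no-intercept model should equal the intercept of the earlier model plus that day's offset. The sketch below copies the printed estimates by hand (so it only holds up to their rounding); `close?` is a hypothetical helper.

```clojure
;; Estimates copied from the summaries above.
(def intercept 61.515092)

;; offsets relative to Monday, from the model with an intercept
(def offsets
  {:Tue 1.643425 :Wed 0.208831 :Thu 0.943699
   :Fri 1.181772 :Sat -7.688267 :Sun -9.537227})

;; per-day estimates, from the model without an intercept
(def per-day
  {:Mon 61.515092 :Tue 63.158517 :Wed 61.723923 :Thu 62.45879
   :Fri 62.696864 :Sat 53.826825 :Sun 51.977865})

;; tolerance chosen to absorb the rounding of the printed values
(defn close? [a b]
  (< (Math/abs (- a b)) 1e-5))

(and (close? intercept (:Mon per-day))
     (every? (fn [[day offset]]
               (close? (+ intercept offset) (day per-day)))
             offsets))
;; => true
```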