13 Ordinary least squares with interactions
author: Carsten Behring, Daniel Slutsky
(ns noj-book.interactions-ols
  (:require [fastmath.stats :as fmstats]
            [scicloj.kindly.v4.api :as kindly]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.metamorph.core :as mm]
            [scicloj.metamorph.ml :as ml]
            [scicloj.metamorph.ml.loss :as loss]
            [scicloj.metamorph.ml.regression]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tablecloth.pipeline :as tcpipe]
            [tech.v3.dataset.modelling :as modelling]
            [scicloj.ml.tribuo]
            [scicloj.metamorph.ml.design-matrix :as dm]))
This example shows how to model interactions in linear regression with metamorph.ml.
Taking ideas from: Interaction Effect in Multiple Regression: Essentials by Alboukadel Kassambara
First we load the data:
(def marketing
  (tc/dataset "https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz"
              {:key-fn keyword}))
and do some preprocessing to set up the regression:
(def preprocessed-data
  (-> marketing
      (tc/drop-columns [:newspaper])
      (modelling/set-inference-target :sales)))
13.1 Additive model
First we build an additive model, whose model equation is \[sales = b0 + b1 * youtube + b2 * facebook\]
(def linear-model-config {:model-type :fastmath/ols})

(def additive-pipeline
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model linear-model-config)))
We evaluate it,
(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (tc/split->seq preprocessed-data :holdout {:seed 112723})
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))
and print the resulting model (note that the :sales term is the intercept b0):
(-> evaluations flatten first :fit-ctx :model ml/tidy)
_unnamed [3 5]:
:term | :statistic | :estimate | :p.value | :std.error |
---|---|---|---|---|
:sales | 6.93059345 | 3.23892397 | 1.75340853E-10 | 0.46733718 |
:youtube | 26.78112104 | 0.04746972 | 0.00000000E+00 | 0.00177251 |
:facebook | 17.44850145 | 0.18475974 | 0.00000000E+00 | 0.01058886 |
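To make the fitted equation concrete, the estimates from the table above can be plugged back into \(sales = b0 + b1 * youtube + b2 * facebook\). The following is a minimal sketch in plain Clojure; `additive-coefs` and `predict-additive` are hypothetical names introduced only for illustration and are not part of metamorph.ml. The budgets used (276.12 for youtube, 45.36 for facebook) are those of the first row of the marketing data, whose observed sales value is 26.52.

```clojure
;; Estimates copied from the tidy table above.
;; The :sales row of that table is the intercept b0.
(def additive-coefs
  {:intercept 3.23892397
   :youtube   0.04746972
   :facebook  0.18475974})

(defn predict-additive
  "Hypothetical helper: evaluates b0 + b1*youtube + b2*facebook
  for a youtube budget `yt` and a facebook budget `fb`."
  [{:keys [intercept youtube facebook]} yt fb]
  (+ intercept (* youtube yt) (* facebook fb)))

;; Prediction for the first row of the marketing data.
(predict-additive additive-coefs 276.12 45.36)
;; => roughly 24.73 (observed sales: 26.52)
```

The gap between the prediction and the observed value is the residual for that row; RMSE aggregates such residuals over the test split.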
We have the following metrics:
\(RMSE\)
(-> evaluations flatten first :test-transform :metric)
1.772159024927988
\(R^2\)
(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.9094193687523886
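For readers who want to see what these two numbers measure, here is a minimal sketch of the textbook definitions in plain Clojure. The functions `rmse` and `r2` below are illustrative re-implementations, not the library functions; that `loss/rmse` and `fmstats/r2-determination` compute exactly these formulas is an assumption. The input vectors are toy values, not taken from the marketing data.

```clojure
(defn- squared-errors
  "Element-wise squared differences between two sequences."
  [actual predicted]
  (map (fn [y y-hat] (let [e (- y y-hat)] (* e e))) actual predicted))

(defn rmse
  "Root mean squared error: sqrt(mean((y - y-hat)^2))."
  [actual predicted]
  (Math/sqrt (double (/ (reduce + (squared-errors actual predicted))
                        (count actual)))))

(defn r2
  "Coefficient of determination: 1 - SS_res / SS_tot."
  [actual predicted]
  (let [mean   (/ (reduce + actual) (count actual))
        ss-res (reduce + (squared-errors actual predicted))
        ss-tot (reduce + (map (fn [y] (let [d (- y mean)] (* d d))) actual))]
    (- 1.0 (double (/ ss-res ss-tot)))))

;; Toy data: the last prediction is off by 2.
(rmse [1.0 2.0 3.0] [1.0 2.0 5.0]) ;; sqrt(4/3), about 1.155
(r2   [1.0 2.0 3.0] [1.0 2.0 3.0]) ;; => 1.0 for a perfect fit
```

A lower RMSE and an \(R^2\) closer to 1 both indicate a better fit on the held-out data.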
13.2 Interaction effects
Now we add interaction effects to it, resulting in this model equation: \[sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\]
(def pipe-interaction
  (mm/pipeline
   (tcpipe/add-column :youtube*facebook (fn [ds] (tcc/* (ds :youtube) (ds :facebook))))
   {:metamorph/id :model} (ml/model linear-model-config)))
Again we evaluate the model,
(def evaluations
  (ml/evaluate-pipelines
   [pipe-interaction]
   (tc/split->seq preprocessed-data :holdout {:seed 112723})
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))
and print it and the performance metrics:
(-> evaluations flatten first :fit-ctx :model ml/tidy)
_unnamed [4 5]:
:term | :statistic | :estimate | :p.value | :std.error |
---|---|---|---|---|
:sales | 20.25471327 | 8.16387196 | 0.00000000E+00 | 0.40306036 |
:youtube | 9.28964322 | 0.01881844 | 4.44089210E-16 | 0.00202574 |
:facebook | 1.84214022 | 0.02152468 | 6.77510166E-02 | 0.01168460 |
:youtube*facebook | 16.34505330 | 0.00093206 | 0.00000000E+00 | 0.00005702 |
As the product term youtube*facebook is statistically significant as well, this suggests that there is indeed an interaction between the two predictor variables youtube and facebook.
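One way to read the interaction coefficient is through the marginal effect of one predictor, which now depends on the other. As a worked sketch using the estimates printed above (and the same b0 to b3 notation as the model equation):

\[\frac{\partial\, sales}{\partial\, youtube} = b1 + b3 * facebook \approx 0.01881844 + 0.00093206 * facebook\]

With no facebook budget the estimated effect of one extra unit of youtube budget is about 0.019 units of sales. At a facebook budget of 45.36 it is about \(0.01881844 + 0.00093206 * 45.36 \approx 0.061\), roughly three times as large. These numbers only illustrate the fitted equation; they are not an additional result of the analysis.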
\(RMSE\)
(-> evaluations flatten first :test-transform :metric)
0.933077510748531
\(R^2\)
(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.9747551116991899
\(RMSE\) and \(R^2\) of the interaction model are both better than those of the additive model (\(RMSE\) 0.93 vs. 1.77, \(R^2\) 0.975 vs. 0.909).
These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.
13.3 Using a design matrix
Since metamorph.ml 0.9.0 we have a simpler way to express the same interactions as before.
We can express the same formula \[sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\] by specifying a design matrix.
(require '[scicloj.metamorph.ml.design-matrix :as dm])

(def dm
  (dm/create-design-matrix
   preprocessed-data
   [:sales] ;; target
   [[:youtube '(identity :youtube)]               ;; youtube stays as-is
    [:facebook '(identity :facebook)]             ;; facebook stays as-is
    [:youtube*facebook '(* :youtube :facebook)]   ;; new term is created
    ]))
The result of the create-design-matrix function is directly "ready" to be used without any further preprocessing:
- all specified terms are present, and only those
- all columns are numeric
- the target is "marked" as such
dm
https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz [200 4]:
:youtube | :facebook | :sales | :youtube*facebook |
---|---|---|---|
276.12 | 45.36 | 26.52 | 12524.8032 |
53.40 | 47.16 | 12.48 | 2518.3440 |
20.64 | 55.08 | 11.16 | 1136.8512 |
181.80 | 49.56 | 22.20 | 9010.0080 |
216.96 | 12.96 | 15.48 | 2811.8016 |
10.44 | 58.68 | 8.64 | 612.6192 |
69.00 | 39.36 | 14.16 | 2715.8400 |
144.24 | 23.52 | 15.84 | 3392.5248 |
10.32 | 2.52 | 5.76 | 26.0064 |
239.76 | 3.12 | 12.72 | 748.0512 |
… | … | … | … |
22.44 | 14.52 | 8.04 | 325.8288 |
47.40 | 49.32 | 12.96 | 2337.7680 |
90.60 | 12.96 | 11.88 | 1174.1760 |
20.64 | 4.92 | 7.08 | 101.5488 |
200.16 | 50.40 | 23.52 | 10088.0640 |
179.64 | 42.72 | 20.76 | 7674.2208 |
45.84 | 4.44 | 9.12 | 203.5296 |
113.04 | 5.88 | 11.64 | 664.6752 |
212.40 | 11.16 | 15.36 | 2370.3840 |
340.32 | 50.40 | 30.60 | 17152.1280 |
278.52 | 10.32 | 16.08 | 2874.3264 |
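To illustrate what the term specifications above amount to, here is a minimal sketch in plain Clojure that rebuilds the first two rows of the printed design matrix by hand. It does not use or reproduce the metamorph.ml implementation; `rows`, `term-specs`, and `design-row` are hypothetical names, and the input values are copied from the table above.

```clojure
;; Two rows copied from the printed dataset above.
(def rows
  [{:youtube 276.12 :facebook 45.36 :sales 26.52}
   {:youtube 53.40  :facebook 47.16 :sales 12.48}])

;; Hypothetical term specs: output column name -> function of one row.
(def term-specs
  [[:youtube          :youtube]
   [:facebook         :facebook]
   [:youtube*facebook (fn [row] (* (:youtube row) (:facebook row)))]])

(defn design-row
  "Keeps the target :sales and computes every specified term for one row."
  [row]
  (into {:sales (:sales row)}
        (map (fn [[term f]] [term (f row)]))
        term-specs))

(mapv design-row rows)
;; The :youtube*facebook values come out close to 12524.8032 and 2518.344,
;; matching the last column of the table (up to floating-point rounding).
```

The design-matrix approach moves this per-row computation out of the pipeline, which is why the pipeline that follows only needs the model step.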
With such a numeric dataset, the pipeline is "minimal" and contains only the model:
(def pipe-mode-only
  (mm/pipeline
   {:metamorph/id :model} (ml/model linear-model-config)))
(def evaluations-dm
  (ml/evaluate-pipelines
   [pipe-mode-only]
   (tc/split->seq dm :holdout {:seed 112723})
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))
We get the same metrics as before, since it is the same model specification:
\(RMSE\)
(-> evaluations-dm flatten first :test-transform :metric)
0.933077510748531
\(R^2\)
(-> evaluations-dm flatten first :test-transform :other-metrices first :metric)
0.9747551116991899