5  Ordinary least squares with interactions

author: Carsten Behring, Daniel Slutsky

(ns noj-book.interactions-ols
  (:require [fastmath.stats :as fmstats]
            [scicloj.kindly.v4.api :as kindly]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.metamorph.core :as mm]
            [scicloj.metamorph.ml :as ml]
            [scicloj.metamorph.ml.loss :as loss]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tablecloth.pipeline :as tcpipe]
            [tech.v3.dataset.modelling :as modelling]
            [scicloj.ml.smile.regression]))

This example shows how to model interactions in linear regression with metamorph.ml.

Taking ideas from: Interaction Effect in Multiple Regression: Essentials by Alboukadel Kassambara

First we load the data:

(def marketing
  (tc/dataset "https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz"
              {:key-fn keyword}))
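As a quick check, we can peek at the first rows (a sketch; the column set :youtube, :facebook, :newspaper, :sales is assumed from the datarium marketing data used throughout this chapter):

```clojure
;; Peek at the first rows; the dataset should contain the advertising
;; budgets :youtube, :facebook, :newspaper and the target :sales.
(tc/head marketing)
```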

and do some preprocessing to set up the regression:

(def preprocessed-data
  (-> marketing
      (tc/drop-columns [:newspaper])
      (modelling/set-inference-target :sales)))
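A sanity check that the preprocessing did what we intended (a sketch; `inference-target-column-names` is assumed to be available in `tech.v3.dataset.modelling`, required above as `modelling`):

```clojure
;; :newspaper should be gone from the columns ...
(tc/column-names preprocessed-data)
;; ... and :sales should be marked as the inference target.
(modelling/inference-target-column-names preprocessed-data)
```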

5.1 Additive model

First we build an additive model, whose model equation is \[sales = b_0 + b_1 \cdot youtube + b_2 \cdot facebook\]

(def additive-pipeline
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))

We evaluate it,

(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (tc/split->seq preprocessed-data :holdout)
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))

and print the result:

(-> evaluations flatten first :fit-ctx :model ml/thaw-model)
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -5.4674     -0.9787      0.2357      1.1127      2.7604

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept           3.6987     0.3814     9.6974     0.0000 ***
youtube             0.0436     0.0015    29.3653     0.0000 ***
facebook            0.1988     0.0083    23.8933     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.7036 on 130 degrees of freedom
Multiple R-squared: 0.9203,    Adjusted R-squared: 0.9191
F-statistic: 750.8533 on 3 and 130 DF,  p-value: 3.842e-72

We have the following metrics:

\(RMSE\)

(-> evaluations flatten first :test-transform :metric)
2.5729902897341193

\(R^2\)

(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.8535450866841606

5.2 Interaction effects

Now we add interaction effects to it, resulting in this model equation: \[sales = b_0 + b_1 \cdot youtube + b_2 \cdot facebook + b_3 \cdot (youtube \cdot facebook)\]
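The interaction feature is simply the element-wise product of the two predictor columns. As a minimal sketch of the `tcc/*` operation used in the pipeline below:

```clojure
;; Element-wise product of two columns, as used to build the
;; :youtube*facebook interaction feature:
(tcc/* (tcc/column [1 2 3]) (tcc/column [10 20 30]))
;; => a column with the values [10 40 90]
```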

(def pipe-interaction
  (mm/pipeline
   (tcpipe/add-column :youtube*facebook (fn [ds] (tcc/* (ds :youtube) (ds :facebook))))
   {:metamorph/id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))

Again we evaluate the model,

(def evaluations
  (ml/evaluate-pipelines
   [pipe-interaction]
   (tc/split->seq preprocessed-data :holdout)
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))

and print the model summary and the performance metrics:

(-> evaluations flatten first :fit-ctx :model ml/thaw-model)
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -7.8412     -0.3875      0.1922      0.6872      1.7406

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept           8.2403     0.3909    21.0826     0.0000 ***
youtube             0.0186     0.0020     9.2254     0.0000 ***
facebook            0.0309     0.0112     2.7531     0.0068 **
youtube*facebook     0.0009     0.0001    15.8956     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1732 on 129 degrees of freedom
Multiple R-squared: 0.9661,    Adjusted R-squared: 0.9653
F-statistic: 1225.0733 on 4 and 129 DF,  p-value: 1.440e-94

As the interaction term youtube*facebook is also statistically significant, it suggests that there is indeed an interaction between the two predictor variables youtube and facebook.

\(RMSE\)

(-> evaluations flatten first :test-transform :metric)
1.0639476851234684

\(R^2\)

(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.9715836303642873

\(RMSE\) and \(R^2\) of the interaction model are slightly better.

These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.
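To compare the two models on identical splits, both pipelines can also be evaluated in a single call. This is a sketch: the `:return-best-pipeline-only` option is assumed from the metamorph.ml evaluation API, which by default keeps only the best-performing pipeline.

```clojure
;; Evaluate both pipelines on the same holdout split, keeping the
;; results of every pipeline instead of only the best one:
(def both-evaluations
  (ml/evaluate-pipelines
   [additive-pipeline pipe-interaction]
   (tc/split->seq preprocessed-data :holdout)
   loss/rmse
   :loss
   {:return-best-pipeline-only false}))

;; Test RMSE per pipeline, in the order of the pipeline vector:
(map #(-> % first :test-transform :metric) both-evaluations)
```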

source: notebooks/noj_book/interactions_ols.clj