5  Ordinary least squares with interactions

author: Carsten Behring, Daniel Slutsky

(ns noj-book.interactions-ols
  (:require [fastmath.stats :as fmstats]
            [scicloj.kindly.v4.api :as kindly]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.metamorph.core :as mm]
            [scicloj.metamorph.ml :as ml]
            [scicloj.metamorph.ml.loss :as loss]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tablecloth.pipeline :as tcpipe]
            [tech.v3.dataset.modelling :as modelling]
            [scicloj.ml.smile.regression]))

This example shows how to model interactions in linear regression with metamorph.ml.

Taking ideas from: Interaction Effect in Multiple Regression: Essentials by Alboukadel Kassambara

First we load the data:

(def marketing
  (tc/dataset "https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz"
              {:key-fn keyword}))

and do some preprocessing to set up the regression:

(def preprocessed-data
  (-> marketing
      (tc/drop-columns [:newspaper])
      (modelling/set-inference-target :sales)))
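To get a feel for the data before modelling, we can peek at the first rows and the column names (a quick sanity check; run it in the notebook to see the actual values):

```clojure
;; Inspect the first few rows and the columns of the marketing dataset.
(tc/head marketing 5)
(tc/column-names marketing)
```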

5.1 Additive model

First we build an additive model, whose model equation is \[sales = b0 + b1 * youtube + b2 * facebook\]

(def additive-pipeline
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))

We evaluate it,

(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (tc/split->seq preprocessed-data :holdout)
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))

and print the result:

(-> evaluations flatten first :fit-ctx :model ml/thaw-model)
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -5.7074     -0.9998      0.3252      1.1840      3.6785

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept           3.3838     0.3901     8.6751     0.0000 ***
youtube             0.0438     0.0016    27.2563     0.0000 ***
facebook            0.2107     0.0090    23.4578     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.8727 on 130 degrees of freedom
Multiple R-squared: 0.9216,    Adjusted R-squared: 0.9204
F-statistic: 764.3823 on 3 and 130 DF,  p-value: 1.319e-72

We have the following metrics:

\(RMSE\)

(-> evaluations flatten first :test-transform :metric)
2.4136890070958295

\(R^2\)

(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.8114211930370409

5.2 Interaction effects

Now we add interaction effects to it, resulting in this model equation: \[sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\]

(def pipe-interaction
  (mm/pipeline
   (tcpipe/add-column :youtube*facebook
                      (fn [ds] (tcc/* (ds :youtube) (ds :facebook))))
   {:metamorph/id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))
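The interaction column added by tcpipe/add-column is simply the elementwise product of the two predictor columns; tcc/* performs this over dataset columns. The same idea in plain Clojure, with small hypothetical integer vectors:

```clojure
;; Elementwise product of two "columns" -- this is what the
;; :youtube*facebook interaction term contains, row by row.
(mapv * [2 3 4] [5 6 7])
;; => [10 18 28]
```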

Again we evaluate the model,

(def evaluations
  (ml/evaluate-pipelines
   [pipe-interaction]
   (tc/split->seq preprocessed-data :holdout)
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))

and print it and the performance metrics:

(-> evaluations flatten first :fit-ctx :model ml/thaw-model)
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -7.1135     -0.5183      0.2392      0.7820      1.8570

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept           8.2880     0.4010    20.6681     0.0000 ***
youtube             0.0177     0.0020     8.8557     0.0000 ***
facebook            0.0145     0.0126     1.1561     0.2498 
youtube*facebook     0.0010     0.0001    15.9127     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2082 on 129 degrees of freedom
Multiple R-squared: 0.9613,    Adjusted R-squared: 0.9604
F-statistic: 1068.5077 on 4 and 129 DF,  p-value: 7.058e-91

Since the interaction term youtube*facebook is statistically significant as well, this suggests that there is indeed an interaction between the two predictor variables youtube and facebook.

\(RMSE\)

(-> evaluations flatten first :test-transform :metric)
1.0326921147850543

\(R^2\)

(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.9797845745830085

\(RMSE\) and \(R^2\) of the interaction model are clearly better.

These results suggest that the model with the interaction term is better than the model containing only main effects. So, for this specific data, we should go with the interaction model.
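As a rough comparison, the relative improvement in \(RMSE\) can be computed directly from the holdout metrics printed above (the exact numbers depend on the random holdout split):

```clojure
;; Relative RMSE reduction from the additive to the interaction model,
;; using the (split-dependent) holdout metrics shown above.
(let [rmse-additive    2.4137
      rmse-interaction 1.0327]
  (/ (- rmse-additive rmse-interaction) rmse-additive))
;; => roughly 0.57, i.e. the interaction model cuts RMSE by about 57%
```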

source: notebooks/noj_book/interactions_ols.clj