13  Ordinary least squares with interactions

author: Carsten Behring, Daniel Slutsky

(ns noj-book.interactions-ols
  (:require [fastmath.stats :as fmstats]
            [scicloj.kindly.v4.api :as kindly]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.metamorph.core :as mm]
            [scicloj.metamorph.ml :as ml]
            [scicloj.metamorph.ml.loss :as loss]
            [scicloj.metamorph.ml.regression]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tablecloth.pipeline :as tcpipe]
            [tech.v3.dataset.modelling :as modelling]
            [scicloj.ml.tribuo]
            [scicloj.metamorph.ml.design-matrix :as dm]))

This example shows how to model interactions in linear regression with metamorph.ml.

Taking ideas from: Interaction Effect in Multiple Regression: Essentials by Alboukadel Kassambara

First we load the data:

(def marketing
  (tc/dataset "https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz"
              {:key-fn keyword}))

and do some preprocessing to set up the regression:

(def preprocessed-data
  (-> marketing
      (tc/drop-columns [:newspaper])
      (modelling/set-inference-target :sales)))

13.1 Additive model

First we build an additive model, whose model equation is \[sales = b0 + b1 * youtube + b2 * facebook\]

(def linear-model-config {:model-type :fastmath/ols})
(def additive-pipeline
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model linear-model-config)))

We evaluate it,

(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (tc/split->seq preprocessed-data
                  :holdout
                  {:seed 112723})
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))

and print the resulting model (note that the :sales term is the intercept b0):

(-> evaluations flatten first :fit-ctx :model ml/tidy)

_unnamed [3 5]:

| :term | :statistic | :estimate | :p.value | :std.error |
|---|---|---|---|---|
| :sales | 6.93059345 | 3.23892397 | 1.75340853E-10 | 0.46733718 |
| :youtube | 26.78112104 | 0.04746972 | 0.00000000E+00 | 0.00177251 |
| :facebook | 17.44850145 | 0.18475974 | 0.00000000E+00 | 0.01058886 |
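
As a short worked check on how to read this table: the :statistic column is the t-statistic of each coefficient, that is, the estimate divided by its standard error. Using the rounded values printed above:

\[t_{youtube} = \frac{0.04746972}{0.00177251} \approx 26.78, \qquad t_{facebook} = \frac{0.18475974}{0.01058886} \approx 17.45\]

Large absolute values of this statistic correspond to small values in the :p.value column.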

We have the following metrics:

\(RMSE\)

(-> evaluations flatten first :test-transform :metric)
1.772159024927988

\(R^2\)

(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.9094193687523886
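
To make the two metrics concrete, here is a minimal sketch in plain Clojure of how RMSE and \(R^2\) are commonly defined. The helper names `sum-of-squares`, `rmse` and `r2` and the toy numbers are assumptions made for illustration. The actual implementations behind `loss/rmse` and `fmstats/r2-determination` may differ in detail.

```clojure
(defn sum-of-squares
  "Sum of squared element-wise differences, pairing elements up to the shorter sequence."
  [xs ys]
  (reduce + (map (fn [x y]
                   (let [d (- (double x) (double y))]
                     (* d d)))
                 xs ys)))

(defn rmse
  "Root mean squared error of `predicted` against `actual`."
  [actual predicted]
  (Math/sqrt (double (/ (sum-of-squares actual predicted)
                        (count actual)))))

(defn r2
  "Coefficient of determination: 1 - SS_res / SS_tot."
  [actual predicted]
  (let [mean   (/ (reduce + actual) (count actual))
        ss-res (sum-of-squares actual predicted)
        ;; total sum of squares: deviations of `actual` from its own mean
        ss-tot (sum-of-squares actual (repeat mean))]
    (- 1.0 (/ ss-res ss-tot))))

;; Toy values, made up for illustration (not taken from the marketing data)
(r2 [3.0 5.0 7.0] [2.0 5.0 8.0])
;; => 0.75
```

A lower RMSE and an \(R^2\) closer to 1 both indicate that predictions are closer to the observed values.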

13.2 Interaction effects

Now we add interaction effects to it, resulting in this model equation: \[sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\]

(def pipe-interaction
  (mm/pipeline
   (tcpipe/add-column :youtube*facebook (fn [ds] (tcc/* (ds :youtube) (ds :facebook))))
   {:metamorph/id :model} (ml/model linear-model-config)))

Again we evaluate the model,

(def evaluations
  (ml/evaluate-pipelines
   [pipe-interaction]
   (tc/split->seq preprocessed-data
                  :holdout
                  {:seed 112723})
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))

and print the model and its performance metrics:

(-> evaluations flatten first :fit-ctx :model ml/tidy)

_unnamed [4 5]:

| :term | :statistic | :estimate | :p.value | :std.error |
|---|---|---|---|---|
| :sales | 20.25471327 | 8.16387196 | 0.00000000E+00 | 0.40306036 |
| :youtube | 9.28964322 | 0.01881844 | 4.44089210E-16 | 0.00202574 |
| :facebook | 1.84214022 | 0.02152468 | 6.77510166E-02 | 0.01168460 |
| :youtube*facebook | 16.34505330 | 0.00093206 | 0.00000000E+00 | 0.00005702 |

The coefficient of the product term youtube*facebook is also statistically significant (its p-value is close to 0), which suggests that there is indeed an interaction between the two predictor variables youtube and facebook.
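
One way to read the interaction term, sketched here with the rounded estimates from the table above: grouping the youtube terms of the model equation shows that the effect of youtube on sales is no longer constant, but depends on the value of facebook.

\[sales = b0 + (b1 + b3 * facebook) * youtube + b2 * facebook\]

With b1 ≈ 0.0188 and b3 ≈ 0.00093, one additional unit of youtube is associated with about 0.0188 more sales when facebook = 0, and with about 0.0188 + 0.00093 * 50 ≈ 0.065 more sales when facebook = 50.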

\(RMSE\)

(-> evaluations flatten first :test-transform :metric)
0.933077510748531

\(R^2\)

(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.9747551116991899

\(RMSE\) and \(R^2\) of the interaction model are better: \(RMSE\) drops from about 1.77 to about 0.93, and \(R^2\) rises from about 0.909 to about 0.975.

These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.

13.3 Use a design matrix

Since metamorph.ml 0.9.0 we have a simpler way to express the same interactions as before.

We can express the same formula \[sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\] by specifying a design matrix.

(require '[scicloj.metamorph.ml.design-matrix :as dm])
(def dm
  (dm/create-design-matrix 
   preprocessed-data
   [:sales]                                         ;; target
   [
    [:youtube '(identity :youtube)]                  ;; youtube stays as-is
    [:facebook '(identity :facebook)]                ;; facebook stays as-is
    [:youtube*facebook '(* :youtube :facebook)]       ;; new term is created
    ]))

The result of the create-design-matrix function is directly “ready” to be used without any further preprocessing:

- only specified terms are present
- all numeric
- the target (:sales) is “marked” as such
- all present terms are added

dm

https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz [200 4]:

| :youtube | :facebook | :sales | :youtube*facebook |
|---|---|---|---|
| 276.12 | 45.36 | 26.52 | 12524.8032 |
| 53.40 | 47.16 | 12.48 | 2518.3440 |
| 20.64 | 55.08 | 11.16 | 1136.8512 |
| 181.80 | 49.56 | 22.20 | 9010.0080 |
| 216.96 | 12.96 | 15.48 | 2811.8016 |
| 10.44 | 58.68 | 8.64 | 612.6192 |
| 69.00 | 39.36 | 14.16 | 2715.8400 |
| 144.24 | 23.52 | 15.84 | 3392.5248 |
| 10.32 | 2.52 | 5.76 | 26.0064 |
| 239.76 | 3.12 | 12.72 | 748.0512 |
| … | … | … | … |
| 22.44 | 14.52 | 8.04 | 325.8288 |
| 47.40 | 49.32 | 12.96 | 2337.7680 |
| 90.60 | 12.96 | 11.88 | 1174.1760 |
| 20.64 | 4.92 | 7.08 | 101.5488 |
| 200.16 | 50.40 | 23.52 | 10088.0640 |
| 179.64 | 42.72 | 20.76 | 7674.2208 |
| 45.84 | 4.44 | 9.12 | 203.5296 |
| 113.04 | 5.88 | 11.64 | 664.6752 |
| 212.40 | 11.16 | 15.36 | 2370.3840 |
| 340.32 | 50.40 | 30.60 | 17152.1280 |
| 278.52 | 10.32 | 16.08 | 2874.3264 |
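
The :youtube*facebook column is simply the row-wise product of the two original columns. As a rough mental model only, the term specifications can be thought of as named functions applied to each row. The sketch below uses plain Clojure maps. The names `term-fns` and `expand-row` are hypothetical and this is not how create-design-matrix is implemented.

```clojure
;; Hypothetical stand-in for the term specs: each term name maps to a
;; function of one row, where a row is a plain map.
(def term-fns
  {:youtube          :youtube                       ; stays as-is
   :facebook         :facebook                      ; stays as-is
   :youtube*facebook (fn [row]                      ; new interaction term
                       (* (:youtube row) (:facebook row)))})

(defn expand-row
  "Applies every term function to `row`, returning a map of term name to value."
  [row]
  (into {} (map (fn [[term f]] [term (f row)])) term-fns))

;; Toy row, made up for illustration
(expand-row {:youtube 10.0 :facebook 2.5})
;; => {:youtube 10.0, :facebook 2.5, :youtube*facebook 25.0}
```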

With such a numeric dataset, the pipeline is “minimal” and contains only the model:

(def pipe-mode-only
  (mm/pipeline
   {:metamorph/id :model} (ml/model linear-model-config)))
(def evaluations-dm
  (ml/evaluate-pipelines
   [pipe-mode-only]
   (tc/split->seq dm
                  :holdout
                  {:seed 112723})
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))

We get the same metrics as before, since it is the same model specification:

\(RMSE\)

(-> evaluations-dm flatten first :test-transform :metric)
0.933077510748531

\(R^2\)

(-> evaluations-dm flatten first :test-transform :other-metrices first :metric)
0.9747551116991899
source: notebooks/noj_book/interactions_ols.clj