5 Ordinary least squares with interactions
author: Carsten Behring, Daniel Slutsky
(ns noj-book.interactions-ols
  (:require [fastmath.stats :as fmstats]
            [scicloj.kindly.v4.api :as kindly]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.metamorph.core :as mm]
            [scicloj.metamorph.ml :as ml]
            [scicloj.metamorph.ml.loss :as loss]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tablecloth.pipeline :as tcpipe]
            [tech.v3.dataset.modelling :as modelling]
            [scicloj.ml.smile.regression]))
This example shows how to model interactions in linear regression with metamorph.ml.
It takes ideas from: Interaction Effect in Multiple Regression: Essentials by Alboukadel Kassambara
First we load the data:
(def marketing
  (tc/dataset "https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz"
              {:key-fn keyword}))
and do some preprocessing to set up the regression:
(def preprocessed-data
  (-> marketing
      (tc/drop-columns [:newspaper])
      (modelling/set-inference-target [:sales])))
5.1 Additive model
First we build an additive model, whose model equation is \[sales = b0 + b1 * youtube + b2 * facebook\]
(def additive-pipeline
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))
We evaluate it,
(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (tc/split->seq preprocessed-data :holdout)
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))
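Conceptually, a :holdout evaluation splits the data once into a train and a test part, fits the pipeline on the train part, and reports the loss on the held-out test part. A minimal sketch of that idea, written in Python rather than Clojure purely for illustration (synthetic data, numpy only; none of these names come from metamorph.ml):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

# Holdout split: train on two thirds of the rows, test on the rest.
idx = rng.permutation(n)
train, test = idx[: 2 * n // 3], idx[2 * n // 3:]

# Fit ordinary least squares on the training rows only.
X_train = np.column_stack([np.ones(len(train)), x[train]])
beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# RMSE on the held-out rows -- the role loss/rmse plays above.
X_test = np.column_stack([np.ones(len(test)), x[test]])
rmse = float(np.sqrt(np.mean((y[test] - X_test @ beta) ** 2)))
print(rmse)
```

Because the test rows were never seen during fitting, this RMSE estimates how the model generalizes, which is why it is the metric reported below.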
and print the result:
(-> evaluations flatten first :fit-ctx :model ml/thaw-model)
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -5.7074     -0.9998      0.3252      1.1840      3.6785

Coefficients:
            Estimate Std. Error    t value   Pr(>|t|)
Intercept     3.3838     0.3901     8.6751     0.0000 ***
youtube       0.0438     0.0016    27.2563     0.0000 ***
facebook      0.2107     0.0090    23.4578     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.8727 on 130 degrees of freedom
Multiple R-squared: 0.9216,    Adjusted R-squared: 0.9204
F-statistic: 764.3823 on 3 and 130 DF,  p-value: 1.319e-72
We have the following metrics:
\(RMSE\)
(-> evaluations flatten first :test-transform :metric)
2.4136890070958295
\(R^2\)
(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.8114211930370409
5.2 Interaction effects
Now we add interaction effects to it, resulting in this model equation: \[sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\]
(def pipe-interaction
  (mm/pipeline
   (tcpipe/add-column :youtube*facebook (fn [ds] (tcc/* (ds :youtube) (ds :facebook))))
   {:metamorph/id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))
Again we evaluate the model,
(def evaluations
  (ml/evaluate-pipelines
   [pipe-interaction]
   (tc/split->seq preprocessed-data :holdout)
   loss/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fmstats/r2-determination}]}))
and print it and the performance metrics:
(-> evaluations flatten first :fit-ctx :model ml/thaw-model)
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -7.1135     -0.5183      0.2392      0.7820      1.8570

Coefficients:
                   Estimate Std. Error    t value   Pr(>|t|)
Intercept            8.2880     0.4010    20.6681     0.0000 ***
youtube              0.0177     0.0020     8.8557     0.0000 ***
facebook             0.0145     0.0126     1.1561     0.2498
youtube*facebook     0.0010     0.0001    15.9127     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.2082 on 129 degrees of freedom
Multiple R-squared: 0.9613,    Adjusted R-squared: 0.9604
F-statistic: 1068.5077 on 4 and 129 DF,  p-value: 7.058e-91
As the interaction term youtube*facebook is also statistically significant, this suggests that there is indeed an interaction between the two predictor variables youtube and facebook.
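One way to read the interaction coefficient b3: with the interaction term present, the marginal effect of youtube on sales is no longer constant but depends on the level of facebook,

\[\frac{\partial\, sales}{\partial\, youtube} = b1 + b3 * facebook\]

so each additional unit of facebook spend increases the slope of youtube by b3 (here about 0.001 per unit).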
\(RMSE\)
(-> evaluations flatten first :test-transform :metric)
1.0326921147850543
\(R^2\)
(-> evaluations flatten first :test-transform :other-metrices first :metric)
0.9797845745830085
\(RMSE\) and \(R^2\) of the interaction model are clearly better: the holdout RMSE drops from about 2.41 to about 1.03.
These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.
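The same comparison can be reproduced in a few lines outside the notebook. The sketch below, in Python rather than Clojure and purely for illustration, fits both model shapes on synthetic data generated with a genuine interaction (the coefficients loosely mimic the fitted values above and are not the real marketing data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
youtube = rng.uniform(0.0, 300.0, n)
facebook = rng.uniform(0.0, 50.0, n)

# Synthetic sales with a true interaction term (illustrative values only).
sales = (3.0 + 0.04 * youtube + 0.2 * facebook
         + 0.001 * youtube * facebook
         + rng.normal(0.0, 1.0, n))

def r2(columns, y):
    """Fit OLS on the given predictor columns and return R^2."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean())))

r2_additive = r2([youtube, facebook], sales)
r2_interaction = r2([youtube, facebook, youtube * facebook], sales)
print(r2_additive, r2_interaction)
```

When the data-generating process really contains an interaction, the additive model leaves the youtube * facebook structure in the residuals and its \(R^2\) suffers, mirroring what the holdout metrics above show for the marketing data.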