14 Ordinary least squares with interactions
author: Carsten Behring, Daniel Slutsky
(ns noj-book.interactions-ols
  (:require [fastmath.stats :as fmstats]
            [scicloj.kindly.v4.api :as kindly]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.metamorph.core :as mm]
            [scicloj.metamorph.ml :as ml]
            [scicloj.metamorph.ml.loss :as loss]
            [scicloj.metamorph.ml.regression]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tablecloth.pipeline :as tcpipe]
            [tech.v3.dataset.modelling :as modelling]
            [scicloj.ml.tribuo]
            [scicloj.metamorph.ml.design-matrix :as dm]))
This example shows how to model interactions in linear regression with metamorph.ml.
Taking ideas from: Interaction Effect in Multiple Regression: Essentials by Alboukadel Kassambara
First we load the data:
(def marketing
  (tc/dataset "https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz"
              {:key-fn keyword}))
and do some preprocessing to set up the regression:
(def preprocessed-data
  (-> marketing
      (tc/drop-columns [:newspaper])
      (modelling/set-inference-target :sales)))
14.1 Additive model
First we build an additive model, whose model equation is \[sales = b0 + b1 * youtube + b2 * facebook\]
(def linear-model-config {:model-type :fastmath/ols})

(def additive-pipeline
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model linear-model-config)))
We evaluate it,
(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (tc/split->seq preprocessed-data :holdout {:seed 112723})
   loss/rmse
   :loss
   {:other-metrics [{:name :r2
                     :metric-fn fmstats/r2-determination}]}))
and print the resulting model (note that the :sales term is the intercept b0):
(-> evaluations flatten first :fit-ctx :model ml/tidy)
_unnamed [3 5]:
:term | :statistic | :estimate | :p.value | :std.error |
---|---|---|---|---|
:sales | 6.93059345 | 3.23892397 | 1.75340853E-10 | 0.46733718 |
:youtube | 26.78112104 | 0.04746972 | 0.00000000E+00 | 0.00177251 |
:facebook | 17.44850145 | 0.18475974 | 0.00000000E+00 | 0.01058886 |
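To read the table, the estimates can be plugged back into the model equation. The following is a minimal sketch in plain Clojure and not part of the original pipeline: `predict-additive` is a hypothetical helper, the budgets 100 and 20 are made-up inputs, and the coefficients are copied from the table above.

```clojure
;; Coefficients copied from the tidy output above
;; (the :sales row is the intercept b0).
(def additive-coefs
  {:intercept 3.23892397
   :youtube   0.04746972
   :facebook  0.18475974})

;; Hypothetical helper: evaluates sales = b0 + b1 * youtube + b2 * facebook.
(defn predict-additive [coefs yt fb]
  (+ (:intercept coefs)
     (* (:youtube coefs) yt)
     (* (:facebook coefs) fb)))

;; Illustrative budgets (assumed values, not rows from the dataset).
(predict-additive additive-coefs 100.0 20.0)
```

In this additive form, one extra unit of youtube budget shifts the prediction by the same b1 regardless of the facebook budget.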
We have the following metrics:
\(RMSE\):
(-> evaluations flatten first :test-transform :metric)
1.772159024927988
\(R^2\):
(-> evaluations flatten first :test-transform :other-metrics first :metric)
0.9094193687523886
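For reference, both metrics can be written out by hand. This is only a sketch, assuming the usual textbook definitions (RMSE as the square root of the mean squared error, \(R^2\) as one minus the ratio of the residual to the total sum of squares). The names `rmse-sketch` and `r2-sketch` are hypothetical; the pipeline itself relies on `loss/rmse` and `fmstats/r2-determination`.

```clojure
;; Squared difference between an actual and a predicted value.
(defn sq-diff [a p]
  (let [d (- a p)] (* d d)))

;; Root mean squared error over two equally long sequences.
(defn rmse-sketch [actual predicted]
  (Math/sqrt (double (/ (reduce + (map sq-diff actual predicted))
                        (count actual)))))

;; Coefficient of determination: 1 - SS_res / SS_tot.
(defn r2-sketch [actual predicted]
  (let [mean   (/ (reduce + actual) (count actual))
        ss-res (reduce + (map sq-diff actual predicted))
        ss-tot (reduce + (map #(sq-diff % mean) actual))]
    (- 1.0 (/ ss-res ss-tot))))

;; Toy vectors, made up for illustration (not taken from the marketing data).
(rmse-sketch [1.0 2.0 3.0] [1.0 2.0 4.0])
(r2-sketch   [1.0 2.0 3.0] [1.0 2.0 4.0])
```

A lower RMSE and an \(R^2\) closer to 1 both indicate a better fit on the held-out data.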
14.2 Interaction effects
We add a new column with an interaction:
(def pipe-interaction
  (mm/pipeline
   (tcpipe/add-column :youtube*facebook (fn [ds] (tcc/* (ds :youtube) (ds :facebook))))
   {:metamorph/id :model}
   (ml/model linear-model-config)))
Again we evaluate the model,
(def evaluations
  (ml/evaluate-pipelines
   [pipe-interaction]
   (tc/split->seq preprocessed-data :holdout {:seed 112723})
   loss/rmse
   :loss
   {:other-metrics [{:name :r2
                     :metric-fn fmstats/r2-determination}]}))
and print it and the performance metrics:
(-> evaluations flatten first :fit-ctx :model ml/tidy)
_unnamed [4 5]:
:term | :statistic | :estimate | :p.value | :std.error |
---|---|---|---|---|
:sales | 20.25471327 | 8.16387196 | 0.00000000E+00 | 0.40306036 |
:youtube | 9.28964322 | 0.01881844 | 4.44089210E-16 | 0.00202574 |
:facebook | 1.84214022 | 0.02152468 | 6.77510166E-02 | 0.01168460 |
:youtube*facebook | 16.34505330 | 0.00093206 | 0.00000000E+00 | 0.00005702 |
As the product term youtube*facebook is also statistically significant, this suggests that there is indeed an interaction between the two predictor variables youtube and facebook.
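One way to read the interaction coefficient: in a model of the form \(sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\), the effect of one extra unit of youtube is \(b1 + b3 * facebook\), so it depends on the facebook budget. A small sketch with the estimates from the table above; `youtube-slope` and the facebook values are illustrative assumptions.

```clojure
;; Coefficients copied from the tidy output of the interaction model.
(def interaction-coefs
  {:intercept        8.16387196
   :youtube          0.01881844
   :facebook         0.02152468
   :youtube*facebook 0.00093206})

;; Hypothetical helper: marginal effect of youtube at a given facebook
;; budget, i.e. b1 + b3 * facebook.
(defn youtube-slope [coefs fb]
  (+ (:youtube coefs)
     (* (:youtube*facebook coefs) fb)))

;; Slopes at three assumed facebook budgets.
(map #(youtube-slope interaction-coefs %) [0.0 25.0 50.0])
```

With a positive b3, the slope for youtube grows as the facebook budget increases, which is what the additive model cannot express.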
\(RMSE\)
(-> evaluations flatten first :test-transform :metric)
0.933077510748531
\(R^2\)
(-> evaluations flatten first :test-transform :other-metrics first :metric)
0.9747551116991899
\(RMSE\) and \(R^2\) of the interaction model are both better: \(RMSE\) drops from about 1.77 to about 0.93, and \(R^2\) rises from about 0.909 to about 0.975.
These results suggest that the model with the interaction term is better than the model that contains only main effects. So, for this specific data, we should go for the model with the interaction term.
14.3 Using a design matrix
Since metamorph.ml 0.9.0 we have a simpler way to express the same interactions as before.
We can express the same formula \[sales = b0 + b1 * youtube + b2 * facebook + b3 * (youtube * facebook)\] by specifying a design matrix.
(require '[scicloj.metamorph.ml.design-matrix :as dm])

(def dm
  (dm/create-design-matrix
   preprocessed-data
   [:sales]                                      ;; target
   [[:youtube '(identity :youtube)]              ;; youtube stays as-is
    [:facebook '(identity :facebook)]            ;; facebook stays as-is
    [:youtube*facebook '(* :youtube :facebook)]  ;; new term is created
    ]))
The result of the create-design-matrix function is directly “ready” to be used without any further preprocessing:
- only specified terms are present
- all numeric
- the target is “marked” as such
- all present terms are added
dm
https://github.com/scicloj/datarium-CSV/raw/main/data/marketing.csv.gz [200 4]:
:youtube | :facebook | :sales | :youtube*facebook |
---|---|---|---|
276.12 | 45.36 | 26.52 | 12524.8032 |
53.40 | 47.16 | 12.48 | 2518.3440 |
20.64 | 55.08 | 11.16 | 1136.8512 |
181.80 | 49.56 | 22.20 | 9010.0080 |
216.96 | 12.96 | 15.48 | 2811.8016 |
10.44 | 58.68 | 8.64 | 612.6192 |
69.00 | 39.36 | 14.16 | 2715.8400 |
144.24 | 23.52 | 15.84 | 3392.5248 |
10.32 | 2.52 | 5.76 | 26.0064 |
239.76 | 3.12 | 12.72 | 748.0512 |
… | … | … | … |
22.44 | 14.52 | 8.04 | 325.8288 |
47.40 | 49.32 | 12.96 | 2337.7680 |
90.60 | 12.96 | 11.88 | 1174.1760 |
20.64 | 4.92 | 7.08 | 101.5488 |
200.16 | 50.40 | 23.52 | 10088.0640 |
179.64 | 42.72 | 20.76 | 7674.2208 |
45.84 | 4.44 | 9.12 | 203.5296 |
113.04 | 5.88 | 11.64 | 664.6752 |
212.40 | 11.16 | 15.36 | 2370.3840 |
340.32 | 50.40 | 30.60 | 17152.1280 |
278.52 | 10.32 | 16.08 | 2874.3264 |
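To make the term specifications concrete, here is a plain-Clojure sketch of what the three terms compute for a single row. It is only an illustration and says nothing about how `create-design-matrix` is implemented; `term-fns` and `expand-row` are hypothetical names, and the input row is the first one shown in the table above.

```clojure
;; One function per term, mirroring the three term specifications:
;; two identity terms and one product term.
(def term-fns
  {:youtube          (fn [row] (:youtube row))
   :facebook         (fn [row] (:facebook row))
   :youtube*facebook (fn [row] (* (:youtube row) (:facebook row)))})

;; Hypothetical helper: keeps the target and evaluates every term on the row.
(defn expand-row [row]
  (into {:sales (:sales row)}
        (map (fn [[term f]] [term (f row)]))
        term-fns))

;; First row of the table above.
(expand-row {:youtube 276.12 :facebook 45.36 :sales 26.52})
```

The product term reproduces, up to floating-point rounding, the value in the :youtube*facebook column for that row.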
With such a numeric dataset, the pipeline is “minimal” and contains only the model:
(def pipe-mode-only
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model linear-model-config)))
(def evaluations-dm
  (ml/evaluate-pipelines
   [pipe-mode-only]
   (tc/split->seq dm :holdout {:seed 112723})
   loss/rmse
   :loss
   {:other-metrics [{:name :r2
                     :metric-fn fmstats/r2-determination}]}))
We get the same metrics as before, since it is the same model specification:
\(RMSE\)
(-> evaluations-dm flatten first :test-transform :metric)
0.933077510748531
\(R^2\)
(-> evaluations-dm flatten first :test-transform :other-metrics first :metric)
0.9747551116991899