8 Introduction to Linear Regression

last update: 2024-12-29

In this tutorial, we introduce the fundamentals of linear regression, guided by the In Depth: Linear Regression chapter of the Python Data Science Handbook by Jake VanderPlas.

8.1 Setup

(ns noj-book.linear-regression-intro
  (:require
   [tech.v3.dataset :as ds]
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.datatype.datetime :as datetime]
   [tech.v3.dataset.modelling :as ds-mod]
   [fastmath.ml.regression :as reg]
   [scicloj.kindly.v4.kind :as kind]
   [fastmath.random :as rand]
   [scicloj.tableplot.v1.plotly :as plotly]))

8.2 Simple Linear Regression

We begin with the classic straight-line model: for data points \((x, y)\), we assume there is a linear relationship allowing us to predict \(y\) as \[y = ax + b.\] In this formulation, \(a\) is the slope and \(b\) is the intercept, the point where our line would cross the \(y\) axis.

To illustrate, we’ll use Fastmath and Tablecloth to create synthetic data in which the relationship is known to hold with \(a=2\) and \(b=-5\).

For each row in the dataset below, we draw \(x\) uniformly from 0 to 10 and compute \(y = ax + b\) plus an extra random noise term (drawn from a standard Gaussian distribution). This noise is added independently for every row.

(def simple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a 2
        b -5]
    (-> {:x (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x]
                        (fn [x]
                          (+ (* a x)
                             b
                             (rand/grandom rng)))))))
simple-linear-data

_unnamed [50 2]:

|         :x |          :y |
|------------+-------------|
| 4.40189362 |  4.64551891 |
| 9.99561119 | 13.48501490 |
| 3.21222663 |  0.54057294 |
| 7.99374199 |  9.70048112 |
| 3.84180498 |  2.53116268 |
| 3.49912119 |  0.70851472 |
| 8.50561714 | 10.34257243 |
| 1.98090017 | -0.48385090 |
| 7.45241165 |  9.69972767 |
| 0.13574839 | -5.08259141 |
|        ... |         ... |
| 1.41390860 | -0.33601006 |
| 1.12722397 | -2.26532006 |
| 8.39029312 | 11.40563831 |
| 9.41655540 | 12.60471241 |
| 7.72797155 |  9.80747902 |
| 4.99533463 |  4.50057343 |
| 2.14969826 | -0.26775764 |
| 5.91589880 |  6.28128141 |
| 5.72228336 |  5.11411693 |
| 7.92238522 | 13.41798763 |
| 5.02370310 |  5.98304872 |

Let’s plot these points using Tableplot’s Plotly API.

(-> simple-linear-data
    plotly/layer-point)

8.2.1 Regression using Fastmath

We can now fit a linear model to the data using the Fastmath library.

(def simple-linear-data-model
  (reg/lm
   ;; ys - a "column" sequence of `y` values:
   (simple-linear-data :y)
   ;; xss - a sequence of "rows", each containing `x` values:
   ;; (one `x` per row, in our case):
   (-> simple-linear-data
       (tc/select-columns [:x])
       tc/rows)
   ;; options
   {:names ["x"]}))
(type simple-linear-data-model)
fastmath.ml.regression.LMData
simple-linear-data-model
{:model :ols,
 :intercept? true,
 :offset? false,
 :transformer nil,
 :xtxinv
 #object[org.apache.commons.math3.linear.Array2DRowRealMatrix 0x7e48cd92 "Array2DRowRealMatrix{{0.0920441012,-0.0135677872},{-0.0135677872,0.0025551689}}"],
 :intercept -4.298390892717667,
 :beta [1.8192131533035178],
 :coefficients
 [{:estimate -4.298390892717667,
   :stderr 0.32452927724774444,
   :t-value -13.245001896812816,
   :p-value 1.2012148033976573E-17,
   :confidence-interval [-4.950900737418651 -3.645881048016683]}
  {:estimate 1.8192131533035178,
   :stderr 0.054071162964162166,
   :t-value 33.64479425954412,
   :p-value 0.0,
   :confidence-interval [1.7104957936626095 1.9279305129444262]}],
 :offset
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 :weights
 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
 :residuals
 {:weighted
  [0.9359270360850536
   -0.4007415611019258
   -1.004761100167602
   -0.5434485594889509
   -0.15950858036297388
   -1.3587416822991665
   -0.8325672610282719
   0.21086035275594006
   0.44059325818501627
   -1.0311557695325497
   1.7917691500388013
   -0.4396936043475561
   -1.0806302566782673
   -0.43371440967007124
   -1.510403625068255
   0.5249346923412235
   0.24209324404804988
   0.23865707454518326
   -0.5584918585990319
   -1.580902643943686
   -0.5819813347862111
   0.33715143154261007
   0.21161264899772547
   -0.6231348153119174
   0.2732769438200582
   1.5349269628918978
   -1.2159354614728644
   -0.04010968273134341
   2.767888094667245
   -2.124211372086929
   0.32810281424294807
   -0.6332386369004745
   1.9693712906221492
   0.5212229139226014
   -0.7449557489477323
   0.15354280631142458
   -0.20137477828508388
   0.531699471117336
   -0.6434689906766744
   1.3901797097205533
   -0.017589840565357395
   0.4402975956124511
   -0.2276181455866748
   0.04704241182193947
   -0.28861413232989275
   0.11987390714443552
   -0.18260861209516133
   -0.9975453361660085
   3.3038711284501048
   1.1422528613458782],
  :raw
  [0.9359270360850536
   -0.4007415611019258
   -1.004761100167602
   -0.5434485594889509
   -0.15950858036297388
   -1.3587416822991665
   -0.8325672610282719
   0.21086035275594006
   0.44059325818501627
   -1.0311557695325497
   1.7917691500388013
   -0.4396936043475561
   -1.0806302566782673
   -0.43371440967007124
   -1.510403625068255
   0.5249346923412235
   0.24209324404804988
   0.23865707454518326
   -0.5584918585990319
   -1.580902643943686
   -0.5819813347862111
   0.33715143154261007
   0.21161264899772547
   -0.6231348153119174
   0.2732769438200582
   1.5349269628918978
   -1.2159354614728644
   -0.04010968273134341
   2.767888094667245
   -2.124211372086929
   0.32810281424294807
   -0.6332386369004745
   1.9693712906221492
   0.5212229139226014
   -0.7449557489477323
   0.15354280631142458
   -0.20137477828508388
   0.531699471117336
   -0.6434689906766744
   1.3901797097205533
   -0.017589840565357395
   0.4402975956124511
   -0.2276181455866748
   0.04704241182193947
   -0.28861413232989275
   0.11987390714443552
   -0.18260861209516133
   -0.9975453361660085
   3.3038711284501048
   1.1422528613458782]},
 :fitted
 (3.7095918724477697
  13.885756460886071
  1.5453340426188573
  10.2439296780327
  2.6906712615659965
  2.0672563995278104
  11.175139688469837
  -0.6947112509488793
  9.259134407726055
  -4.051435642669882
  3.5766367210898693
  6.763566529026899
  10.575271389500646
  1.109663204254967
  12.900872708799547
  7.0144895796743585
  1.6350203024132455
  -2.511486167362249
  6.805397591667377
  8.180616113285144
  2.5227744818401643
  8.856094695566295
  12.698107168053916
  -3.790898005676445
  2.279672092825055
  1.052139203978192
  7.221872654873873
  0.3622949107589317
  5.871717898609497
  10.532683894389045
  5.945565486071316
  -2.124774162862575
  9.82483572763547
  6.349241052590405
  -1.506857966336237
  13.889892550361203
  12.426292875202261
  7.946820359323296
  0.321459701876309
  -1.7261897685602543
  -2.2477302224928177
  10.965340713692253
  12.83233055825675
  9.760436606315249
  4.789187562678973
  -0.3876315471377261
  6.46389001766607
  6.111662268804511
  10.11411649732894
  4.840795861996088),
 :df {:residual 48, :model 1, :intercept 1},
 :observations 50,
 :names ["Intercept" "x"],
 :r-squared 0.9593210748673777,
 :adjusted-r-squared 0.9584735972604481,
 :sigma2 1.144225978228796,
 :sigma 1.0696849901858005,
 :tss 1350.1548228209472,
 :rss 54.922846954982205,
 :regss 1295.231975865965,
 :msreg 1295.231975865965,
 :qt 2.010634757624228,
 :f-statistic 1131.9721807670533,
 :p-value 0.0,
 :ll
 {:log-likelihood -73.29458696890799,
  :aic 152.58917393781599,
  :bic 158.32524295410042,
  :aic-rss 8.69532061734871,
  :bic-rss 12.519366628205002},
 :analysis #<Delay@4570c409: :not-delivered>}
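
Since the model is a map-like record, individual fields can be picked out with plain map functions, using the keys shown in the printout above. For example:

(select-keys simple-linear-data-model
             [:intercept :beta :r-squared])
{:intercept -4.298390892717667,
 :beta [1.8192131533035178],
 :r-squared 0.9593210748673777}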

Printing the model gives a tabular summary. We'll capture the printed output and display it via Kindly for cleaner formatting.

(kind/code
 (with-out-str
   (println
    simple-linear-data-model)))
Residuals:

|      :min |       :q1 |   :median |      :q3 |     :max |
|-----------+-----------+-----------+----------+----------|
| -2.124211 | -0.635796 | -0.099809 | 0.440372 | 3.303871 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -4.298391 | 0.324529 | -13.245002 |      0.0 | [-4.950901 -3.645881] |
|         x |  1.819213 | 0.054071 |  33.644794 |      0.0 |   [1.710496 1.927931] |

F-statistic: 1131.9721807670533 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.9593210748673777
Adjusted R2: 0.9584735972604481
Residual standard error: 1.0696849901858005 on 48 degrees of freedom
AIC: 152.58917393781599

As you can see, the estimated intercept and slope (the coefficient of \(x\)) are close to the true values \(b=-5\) and \(a=2\) that we used to generate the data.

8.2.2 Dataset ergonomics

Below are a couple of helper functions that simplify how we use regression with datasets and display model summaries. We have similar ideas under development in the Tablemath library, but it is still in an experimental stage and not part of Noj yet.

(defn lm
  "Compute a linear regression model for `dataset`.
  The first column marked as target is the target.
  All the columns unmarked as target are the features.
  The resulting model is of type `fastmath.ml.regression.LMData`,
  created via [Fastmath](https://github.com/generateme/fastmath).
  
  See [fastmath.ml.regression.lm](https://generateme.github.io/fastmath/clay/ml.html#lm)
  for `options`."
  ([dataset]
   (lm dataset nil))
  ([dataset options]
   (let [inference-column-name (-> dataset
                                   ds-mod/inference-target-column-names
                                   first)
         ds-without-target (-> dataset
                               (tc/drop-columns [inference-column-name]))]
     (reg/lm
      ;; ys
      (get dataset inference-column-name)
      ;; xss
      (tc/rows ds-without-target)
      ;; options
      (merge {:names (-> ds-without-target
                         tc/column-names
                         vec)}
             options)))))
(defn summary
  "Generate a summary of a linear model."
  [lmdata]
  (kind/code
   (with-out-str
     (println
      lmdata))))
(-> simple-linear-data
    (ds-mod/set-inference-target :y)
    lm
    summary)
Residuals:

|      :min |       :q1 |   :median |      :q3 |     :max |
|-----------+-----------+-----------+----------+----------|
| -2.124211 | -0.635796 | -0.099809 | 0.440372 | 3.303871 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -4.298391 | 0.324529 | -13.245002 |      0.0 | [-4.950901 -3.645881] |
|        :x |  1.819213 | 0.054071 |  33.644794 |      0.0 |   [1.710496 1.927931] |

F-statistic: 1131.9721807670533 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.9593210748673777
Adjusted R2: 0.9584735972604481
Residual standard error: 1.0696849901858005 on 48 degrees of freedom
AIC: 152.58917393781599

8.2.3 Prediction

Once we have a linear model, we can generate new predictions. For instance, let’s predict \(y\) when \(x=3\):

(simple-linear-data-model [3])
1.1592485671928863
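
The prediction is just the linear formula applied to the fitted coefficients. As a quick sanity check, we can reproduce it by hand from the model's :intercept and :beta fields:

(+ (:intercept simple-linear-data-model)
   (* (first (:beta simple-linear-data-model))
      3))

This gives the same value, roughly 1.16, which is close to the true value \(2 \cdot 3 - 5 = 1\) of our noise-free generating formula.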

8.2.4 Displaying the regression line

We can visualize the fitted line by adding a smooth layer to our scatter plot. Tableplot makes this convenient:

(-> simple-linear-data
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=name "prediction"}))

Alternatively, we can build the regression line explicitly. We’ll obtain predictions and then plot them:

(-> simple-linear-data
    (tc/map-columns :prediction
                    [:x]
                    simple-linear-data-model)
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=y :prediction
                          :=name "prediction"}))

8.3 Multiple linear regression

We can easily extend these ideas to models with multiple predictors.

(def multiple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a0 2
        a1 -3
        b -5]
    (-> {:x0 (repeatedly n #(rand/frandom rng 0 10))
         :x1 (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x0 :x1]
                        (fn [x0 x1]
                          (+ (* a0 x0)
                             (* a1 x1)
                             b
                             (rand/grandom rng)))))))
(def multiple-linear-data-model
  (-> multiple-linear-data
      (ds-mod/set-inference-target :y)
      lm))
(summary multiple-linear-data-model)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.177656 | -0.893365 | -0.03122 | 0.782167 | 2.296723 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -5.037096 | 0.381998 | -13.186179 |      0.0 | [-5.805577 -4.268615] |
|       :x0 |  1.966706 | 0.057041 |  34.478778 |      0.0 |   [1.851954 2.081458] |
|       :x1 | -2.978405 | 0.051873 | -57.417725 |      0.0 | [-3.082759 -2.874051] |

F-statistic: 2002.3339563115835 on degrees of freedom: {:residual 47, :model 2, :intercept 1}
p-value: 0.0

R2: 0.988399839025906
Adjusted R2: 0.9879062151546679
Residual standard error: 1.0969985919747627 on 47 degrees of freedom
AIC: 156.05787291144372
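
As in the simple case, the model can be applied as a function, now passing one value per feature. For example, at \(x_0=3\), \(x_1=4\), the noise-free formula gives \(2 \cdot 3 - 3 \cdot 4 - 5 = -11\):

(multiple-linear-data-model [3 4])

Given the estimated coefficients above, this should return a value close to -11.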

Visualizing multiple dimensions is more involved. In the case of two features, we can use a 3D scatterplot and a 3D surface. Let us do that using Tableplot’s Plotly API.

(-> multiple-linear-data
    (plotly/layer-point {:=coordinates :3d
                         :=x :x0
                         :=y :x1
                         :=z :y})
    (plotly/layer-surface {:=dataset (let [xs (range 11)
                                           ys (range 11)]
                                       (tc/dataset
                                        {:x xs
                                         :y ys
                                         :z (for [y ys]
                                              (for [x xs]
                                                (multiple-linear-data-model
                                                 [x y])))}))
                           :=mark-opacity 0.5}))

8.4 Coming soon: Polynomial regression 🛠

8.5 Coming soon: One-hot encoding 🛠

8.6 Coming soon: Regularization 🛠

8.7 Example: Predicting Bicycle Traffic

As in the Python Data Science Handbook, we’ll try predicting the daily number of bicycle trips across the Fremont Bridge in Seattle. The features will include weather, season, day of week, and related factors.

8.7.1 Reading and parsing data

(def column-name-mapping
  {"Fremont Bridge Sidewalks, south of N 34th St" :total
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk" :west
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk" :east
   "Date" :datetime})
(column-name-mapping
 "Fremont Bridge Sidewalks, south of N 34th St")
:total
(def counts
  (tc/dataset "data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz"
              {:key-fn column-name-mapping
               :parser-fn {"Date" [:local-date-time "MM/dd/yyyy hh:mm:ss a"]}}))
counts

data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz [106608 4]:

| :datetime        | :total | :west | :east |
|------------------+--------+-------+-------|
| 2012-10-02T13:00 |     55 |     7 |    48 |
| 2012-10-02T14:00 |    130 |    55 |    75 |
| 2012-10-02T15:00 |    152 |    81 |    71 |
| 2012-10-02T16:00 |    278 |   167 |   111 |
| 2012-10-02T17:00 |    563 |   393 |   170 |
| 2012-10-02T18:00 |    381 |   236 |   145 |
| 2012-10-02T19:00 |    175 |   104 |    71 |
| 2012-10-02T20:00 |     86 |    51 |    35 |
| 2012-10-02T21:00 |     63 |    35 |    28 |
| 2012-10-02T22:00 |     42 |    27 |    15 |
| ...              |    ... |   ... |   ... |
| 2024-11-30T13:00 |    147 |    62 |    85 |
| 2024-11-30T14:00 |    154 |    73 |    81 |
| 2024-11-30T15:00 |    118 |    57 |    61 |
| 2024-11-30T16:00 |     88 |    51 |    37 |
| 2024-11-30T17:00 |     46 |    11 |    35 |
| 2024-11-30T18:00 |     46 |    18 |    28 |
| 2024-11-30T19:00 |     69 |    14 |    55 |
| 2024-11-30T20:00 |     18 |     8 |    10 |
| 2024-11-30T21:00 |     49 |    15 |    34 |
| 2024-11-30T22:00 |     14 |     4 |    10 |
| 2024-11-30T23:00 |     10 |     5 |     5 |
(def weather
  (tc/dataset "data/seattle-bikes-and-weather/BicycleWeather.csv.gz"
              {:key-fn keyword}))
weather

data/seattle-bikes-and-weather/BicycleWeather.csv.gz [1340 26]:

| :STATION | :STATION_NAME | :DATE | :PRCP | :SNWD | :SNOW | :TMAX | :TMIN | :AWND | :WDF2 | :WDF5 | :WSF2 | :WSF5 | :FMTM | :WT14 | :WT01 | :WT17 | :WT05 | :WT02 | :WT22 | :WT04 | :WT13 | :WT16 | :WT08 | :WT18 | :WT03 |
|----------+---------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------|
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120101 | 0 | 0 | 0 | 128 | 50 | 47 | 100 | 90 | 89 | 112 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120102 | 109 | 0 | 0 | 106 | 28 | 45 | 180 | 200 | 130 | 179 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120103 | 8 | 0 | 0 | 117 | 72 | 23 | 180 | 170 | 54 | 67 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120104 | 203 | 0 | 0 | 122 | 56 | 47 | 180 | 190 | 107 | 148 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120105 | 13 | 0 | 0 | 89 | 28 | 61 | 200 | 220 | 107 | 165 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120106 | 25 | 0 | 0 | 44 | 22 | 22 | 180 | 180 | 45 | 63 | -9999 | 1 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120107 | 0 | 0 | 0 | 72 | 28 | 23 | 170 | 180 | 54 | 63 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120108 | 0 | 0 | 0 | 100 | 28 | 20 | 160 | 200 | 45 | 63 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120109 | 43 | 0 | 0 | 94 | 50 | 34 | 200 | 200 | 67 | 89 | -9999 | 1 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120110 | 10 | 0 | 0 | 61 | 6 | 34 | 20 | 30 | 89 | 107 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150822 | 0 | 0 | 0 | 267 | 122 | 25 | 20 | 20 | 63 | 76 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150823 | 0 | 0 | 0 | 278 | 139 | 18 | 10 | 10 | 67 | 81 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150824 | 0 | 0 | 0 | 239 | 122 | 23 | 190 | 190 | 54 | 67 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150825 | 0 | 0 | 0 | 256 | 122 | 34 | 350 | 360 | 63 | 76 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150826 | 0 | 0 | 0 | 283 | 139 | 17 | 30 | 40 | 58 | 67 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150827 | 0 | 0 | 0 | 294 | 144 | 21 | 230 | 200 | 45 | 63 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150828 | 5 | 0 | 0 | 233 | 156 | 26 | 230 | 240 | 81 | 103 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150829 | 325 | 0 | 0 | 222 | 133 | 58 | 210 | 210 | 157 | 206 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150830 | 102 | 0 | 0 | 200 | 128 | 47 | 200 | 200 | 89 | 112 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150831 | 0 | 0 | 0 | 189 | 161 | 58 | 210 | 210 | 112 | 134 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150901 | 58 | 0 | 0 | 194 | 139 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |

8.7.2 Preprocessing

The bike counts come in as hourly data, but our weather information is daily. We'll need to aggregate the hourly counts into daily totals before combining the datasets.

In the Python handbook, one does:

daily = counts.resample('d').sum()

Since Tablecloth’s time series features are still evolving, we’ll be a bit more explicit:

(def daily-totals
  (-> counts
      (tc/group-by (fn [{:keys [datetime]}]
                     {:date (datetime/local-date-time->local-date
                             datetime)}))
      (tc/aggregate-columns [:total :west :east]
                            tcc/sum)))
daily-totals

_unnamed [4443 4]:

| :date      | :total |  :west |  :east |
|------------+--------+--------+--------|
| 2012-10-02 | 1938.0 | 1165.0 |  773.0 |
| 2012-10-03 | 3521.0 | 1761.0 | 1760.0 |
| 2012-10-04 | 3475.0 | 1767.0 | 1708.0 |
| 2012-10-05 | 3148.0 | 1590.0 | 1558.0 |
| 2012-10-06 | 2006.0 |  926.0 | 1080.0 |
| 2012-10-07 | 2142.0 |  951.0 | 1191.0 |
| 2012-10-08 | 3537.0 | 1708.0 | 1829.0 |
| 2012-10-09 | 3501.0 | 1742.0 | 1759.0 |
| 2012-10-10 | 3235.0 | 1587.0 | 1648.0 |
| 2012-10-11 | 3047.0 | 1468.0 | 1579.0 |
| ...        |    ... |    ... |    ... |
| 2024-11-20 | 2300.0 |  787.0 | 1513.0 |
| 2024-11-21 | 2382.0 |  775.0 | 1607.0 |
| 2024-11-22 | 1473.0 |  536.0 |  937.0 |
| 2024-11-23 | 1453.0 |  652.0 |  801.0 |
| 2024-11-24 |  727.0 |  311.0 |  416.0 |
| 2024-11-25 | 1483.0 |  486.0 |  997.0 |
| 2024-11-26 | 2173.0 |  727.0 | 1446.0 |
| 2024-11-27 | 1522.0 |  548.0 |  974.0 |
| 2024-11-28 |  631.0 |  261.0 |  370.0 |
| 2024-11-29 |  833.0 |  368.0 |  465.0 |
| 2024-11-30 | 1178.0 |  509.0 |  669.0 |

8.7.3 Prediction by day-of-week

Next, we’ll explore a simple regression by day of week.

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

We’ll convert the numeric day-of-week to the corresponding keyword:

(def idx->day-of-week
  (comp days-of-week dec))

For example,

(idx->day-of-week 1)
:Mon
(idx->day-of-week 7)
:Sun

Now, let’s build our dataset:

(def totals-with-day-of-week
  (-> daily-totals
      (tc/add-column :day-of-week
                     (fn [ds]
                       (map idx->day-of-week
                            (datetime/long-temporal-field
                             :day-of-week
                             (:date ds)))))
      (tc/select-columns [:total :day-of-week])))
totals-with-day-of-week

_unnamed [4443 2]:

| :total | :day-of-week |
|--------+--------------|
| 1938.0 | :Tue         |
| 3521.0 | :Wed         |
| 3475.0 | :Thu         |
| 3148.0 | :Fri         |
| 2006.0 | :Sat         |
| 2142.0 | :Sun         |
| 3537.0 | :Mon         |
| 3501.0 | :Tue         |
| 3235.0 | :Wed         |
| 3047.0 | :Thu         |
| ...    | ...          |
| 2300.0 | :Wed         |
| 2382.0 | :Thu         |
| 1473.0 | :Fri         |
| 1453.0 | :Sat         |
|  727.0 | :Sun         |
| 1483.0 | :Mon         |
| 2173.0 | :Tue         |
| 1522.0 | :Wed         |
|  631.0 | :Thu         |
|  833.0 | :Fri         |
| 1178.0 | :Sat         |
Next, we one-hot encode :day-of-week into seven 0/1 indicator columns, one per day:

(def totals-with-one-hot-days-of-week
  (-> (reduce (fn [dataset day-of-week]
                (-> dataset
                    (tc/add-column day-of-week
                                   #(-> (:day-of-week %)
                                        (tcc/eq day-of-week)
                                        ;; convert booleans to 0/1
                                        (tcc/* 1)))))
              totals-with-day-of-week
              days-of-week)
      (tc/drop-columns [:day-of-week])
      (ds-mod/set-inference-target :total)))
(-> totals-with-one-hot-days-of-week
    (tc/select-columns ds-mod/inference-column?))

_unnamed [0 0]
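
We can also list the resulting column names directly; they should be the :total target plus one 0/1 column per day of week:

(tc/column-names totals-with-one-hot-days-of-week)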

Since the seven binary columns sum to 1 in every row, they would be collinear with an intercept term, so we fit the model without an intercept. This way, each coefficient directly reflects the expected bike count for its day of the week.
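
To see this concretely, here is a small check: summing the seven 0/1 columns in each row should always yield exactly 1, leaving a single distinct value across the whole dataset.

(->> (-> totals-with-one-hot-days-of-week
         (tc/drop-columns [:total])
         tc/rows)
     (map #(reduce + %))
     distinct)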

(def days-of-week-model
  (lm totals-with-one-hot-days-of-week
      {:intercept? false}))

Let’s take a look at the results:

(summary days-of-week-model)
Residuals:

|         :min |         :q1 |    :median |        :q3 |        :max |
|--------------+-------------+------------+------------+-------------|
| -3034.944882 | -841.755906 | -90.179528 | 828.820472 | 3745.244094 |

Coefficients:

| :name |   :estimate |   :stderr |  :t-value | :p-value |      :confidence-interval |
|-------+-------------+-----------+-----------+----------+---------------------------|
|  :Mon | 2785.741325 | 44.521658 | 62.570476 |      0.0 | [2698.456664 2873.025986] |
|  :Tue | 3115.527559 | 44.486587 | 70.032964 |      0.0 | [3028.311653 3202.743465] |
|  :Wed | 3109.944882 | 44.486587 | 69.907472 |      0.0 | [3022.728976 3197.160788] |
|  :Thu | 2961.179528 | 44.486587 | 66.563423 |      0.0 | [2873.963622 3048.395433] |
|  :Fri | 2623.755906 | 44.486587 | 58.978583 |      0.0 |     [2536.54 2710.971811] |
|  :Sat | 1688.029921 | 44.486587 | 37.944693 |      0.0 | [1600.814015 1775.245827] |
|  :Sun | 1576.178233 | 44.521658 | 35.402506 |      0.0 | [1488.893572 1663.462894] |

F-statistic: 3472.719277560978 on degrees of freedom: {:residual 4436, :model 7, :intercept 0}
p-value: 0.0

R2: 0.8456776967289252
Adjusted R2: 0.8454341764126724
Residual standard error: 1121.0266948292917 on 4436 degrees of freedom
AIC: 75015.17638480052

We can clearly see weekend versus weekday differences.
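
To make the comparison concrete, we can pair each day with its estimated coefficient, assuming, as in the printouts above, that the model's :names field aligns with its :coefficients:

(zipmap (:names days-of-week-model)
        (map :estimate (:coefficients days-of-week-model)))

The weekend estimates (roughly 1600-1700 daily crossings) are about half the weekday ones (roughly 2600-3100).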

8.7.4 Coming soon: more predictors for the bike counts 🛠

source: notebooks/noj_book/linear_regression_intro.clj