10  Introduction to Linear Regression

last update: 2024-12-29

In this tutorial, we introduce the fundamentals of linear regression, guided by the In Depth: Linear Regression chapter of the Python Data Science Handbook by Jake VanderPlas.

10.1 Setup

(ns noj-book.linear-regression-intro
  (:require
   [tech.v3.dataset :as ds]
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.datatype.datetime :as datetime]
   [tech.v3.dataset.modelling :as ds-mod]
   [fastmath.ml.regression :as reg]
   [scicloj.kindly.v4.kind :as kind]
   [fastmath.random :as rand]
   [scicloj.tableplot.v1.plotly :as plotly]))

10.2 Simple Linear Regression

We begin with the classic straight-line model: for data points \((x, y)\), we assume there is a linear relationship allowing us to predict \(y\) as \[y = ax + b.\] In this formulation, \(a\) is the slope and \(b\) is the intercept, the point where our line would cross the \(y\) axis.

To illustrate, we’ll use Fastmath and Tablecloth to create synthetic data in which the relationship is known to hold with \(a=2\) and \(b=-5\).

For each row in the dataset below, we draw \(x\) uniformly from 0 to 10 and compute \(y = ax + b\) plus an extra random noise term (drawn from a standard Gaussian distribution). This noise is added independently for every row.

(def simple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a 2
        b -5]
    (-> {:x (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x]
                        (fn [x]
                          (+ (* a x)
                             b
                             (rand/grandom rng)))))))
simple-linear-data

_unnamed [50 2]:

:x :y
6.57609510 9.81891552
2.99598503 2.96246278
4.53403425 4.15025195
8.95082378 13.86513537
4.93380737 6.12487310
2.74368882 -0.28274251
3.32232666 2.47099769
2.29794025 0.57144438
7.45839643 7.95909802
0.12831211 -3.44401887
0.34915030 -4.20917377
7.61502552 11.46983175
2.81602859 -0.70006338
9.22632980 12.28862659
8.06345844 9.63705836
6.67792606 8.03790305
4.09426975 2.27687813
4.14835215 1.39010540
5.90655327 5.94909864
5.57058048 6.34198473
9.29507732 14.13070365

Let’s plot these points using Tableplot’s Plotly API.

(-> simple-linear-data
    plotly/layer-point)

10.2.1 Regression using Fastmath

We can now fit a linear model to the data using the Fastmath library.

(def simple-linear-data-model
  (reg/lm
   ;; ys - a "column" sequence of `y` values:
   (simple-linear-data :y)
   ;; xss - a sequence of "rows", each containing `x` values:
   ;; (one `x` per row, in our case):
   (-> simple-linear-data
       (tc/select-columns [:x])
       tc/rows)
   ;; options
   {:names ["x"]}))
(type simple-linear-data-model)
fastmath.ml.regression.LMData
simple-linear-data-model
{:model :ols,
 :intercept? true,
 :offset? false,
 :transformer nil,
 :xtxinv
 #object[org.apache.commons.math3.linear.BlockRealMatrix 0x239cc4cc "BlockRealMatrix{{0.1101479967,-0.0164666125},{-0.0164666125,0.0030078242}}"],
 :intercept -4.628034907513457,
 :beta [1.978053629698538],
 :coefficients
 [{:estimate -4.628034907513457,
   :stderr 0.3472982554988266,
   :t-value -13.32582250050805,
   :p-value 9.536350013925433E-18,
   :confidence-interval [-5.326324851281657 -3.9297449637452564]}
  {:estimate 1.978053629698538,
   :stderr 0.05739056840257152,
   :t-value 34.46652794625234,
   :p-value 0.0,
   :confidence-interval [1.862662158108517 2.093445101288559]}],
 :offset
 (0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0),
 :weights
 (1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0),
 :residuals
 {:weighted
  (1.4390816353686553
   1.664278626295986
   -0.1902760499909819
   0.7879608074747413
   0.9935724264739703
   -1.0818712302747977
   0.5272922919586902
   0.6540302225642112
   -2.1659752093780895
   0.9302078042227819
   0.19205322643183287
   0.07472696650295507
   0.04172168162458867
   -0.5987389482319276
   -0.46354475709306886
   1.7678105506520376
   0.44316199021818825
   0.2950100836469467
   1.6882393429462397
   -0.5838637669430167
   0.8059474053660232
   0.65864673506527
   -0.9901572546126012
   -0.542458998608609
   0.6053052116158657
   -0.16929756855717226
   0.49388777209914103
   -0.2430591623499221
   2.016451355822311
   -0.9644473520835293
   -0.26957654070522663
   1.4881494225008423
   0.3554571054906184
   -1.5628127818493867
   0.6859127773398193
   0.8757275970681571
   -0.026154126840476577
   -0.9733132525364994
   -0.054280726612699226
   -0.2717768852359548
   1.0349377853990376
   -1.6422840591191028
   -1.3335136596999515
   -1.684859969737655
   -0.5433579301250688
   -1.1937721037336866
   -2.1875227151053
   -1.1063455886669447
   -0.048887301300851504
   0.37257711524340387),
  :raw
  (1.4390816353686553
   1.664278626295986
   -0.1902760499909819
   0.7879608074747413
   0.9935724264739703
   -1.0818712302747977
   0.5272922919586902
   0.6540302225642112
   -2.1659752093780895
   0.9302078042227819
   0.19205322643183287
   0.07472696650295507
   0.04172168162458867
   -0.5987389482319276
   -0.46354475709306886
   1.7678105506520376
   0.44316199021818825
   0.2950100836469467
   1.6882393429462397
   -0.5838637669430167
   0.8059474053660232
   0.65864673506527
   -0.9901572546126012
   -0.542458998608609
   0.6053052116158657
   -0.16929756855717226
   0.49388777209914103
   -0.2430591623499221
   2.016451355822311
   -0.9644473520835293
   -0.26957654070522663
   1.4881494225008423
   0.3554571054906184
   -1.5628127818493867
   0.6859127773398193
   0.8757275970681571
   -0.026154126840476577
   -0.9733132525364994
   -0.054280726612699226
   -0.2717768852359548
   1.0349377853990376
   -1.6422840591191028
   -1.3335136596999515
   -1.684859969737655
   -0.5433579301250688
   -1.1937721037336866
   -2.1875227151053
   -1.1063455886669447
   -0.048887301300851504
   0.37257711524340387),
  :loocv
  (1.4739394420768743
   1.7308804049302067
   -0.19468784641711312
   0.8350112682109103
   1.0147602495975467
   -1.1298112215810396
   0.5458133722549379
   0.6887082805440684
   -2.237201493873601
   1.040468062971258
   0.195986698046567
   0.08182934680436647
   0.04280990300622217
   -0.6181613780483527
   -0.47482982354881714
   1.805400754675343
   0.46403472752550784
   0.31241455696113746
   1.7429150984892534
   -0.6002391930943249
   0.8621083718760855
   0.6821073122235489
   -1.0247840780670159
   -0.5580268519371991
   0.6472594725236961
   -0.17512423046519315
   0.5339095267120939
   -0.25330645654588346
   2.058182851764122
   -0.9893193751859812
   -0.27630948569104363
   1.525862688971529
   0.37107040107412104
   -1.6944484966988618
   0.699986802073636
   0.8937699742700723
   -0.027323833696372465
   -1.0041570138484797
   -0.05693350735142049
   -0.30164452658107915
   1.0711204530009715
   -1.712959376582249
   -1.4221668838478154
   -1.7553534129548423
   -0.5569219668264713
   -1.2253000314395983
   -2.2442817316135786
   -1.1295709565733174
   -0.04988641204641125
   0.3980110188331572)},
 :fitted
 (8.379833882626812
  1.2981841578316322
  4.340528002162476
  13.077174566971525
  5.131300674975324
  0.7991287251967059
  1.943705401652835
  -0.08258584684016501
  10.125073232041544
  -4.374226670811692
  6.502966821827635
  -3.120456048647405
  3.545752348352333
  10.055242718814686
  8.414524183781275
  7.234422937393091
  11.90155073594483
  13.016601654708015
  10.046900031038746
  9.278674780643513
  13.864211187067529
  1.8738074489487602
  10.436304061371807
  9.406323091267456
  13.836539307590936
  2.0459809058871086
  -2.254402290873929
  1.0427526576293147
  5.601926359323259
  8.786929693017903
  3.8174570415674554
  3.724150987802969
  11.559903287575736
  -2.4616077265978116
  6.572562475896073
  6.693981831169842
  0.7539041890591394
  9.934619432334912
  12.082754480363882
  -3.9373968892368127
  10.434893963211827
  0.9422206761032168
  13.62214024902984
  11.321918332968838
  8.58126098132556
  3.4706502373889316
  3.577628112443663
  7.055444224117445
  6.390872035389697
  13.758126531382363),
 :df {:residual 48, :model 1, :intercept 1},
 :observations 50,
 :names ["Intercept" "x"],
 :cv 1.0635272269106373,
 :r-squared 0.96116321192401,
 :adjusted-r-squared 0.9603541121724268,
 :sigma2 1.0950365139474763,
 :sigma 1.046439923716348,
 :tss 1353.4011248982256,
 :rss 52.561752669478864,
 :regss 1300.8393722287467,
 :msreg 1300.8393722287467,
 :qt 2.010634757624228,
 :f-statistic 1187.9415486698024,
 :p-value 0.0,
 :ll
 {:log-likelihood -72.19606951680854,
  :aic 150.3921390336171,
  :bic 156.12820804990156,
  :aic-rss 6.498285713149814,
  :bic-rss 10.322331724006125},
 :analysis #<Delay@1b977bdd: :not-delivered>,
 :decomposition :cholesky,
 :augmentation nil,
 :hat
 (0.023649415785428788
  0.03847855602531134
  0.02266087230057018
  0.05634709677269267
  0.020879634506751205
  0.042431859757203955
  0.033932991087651215
  0.050352317460829916
  0.03183722373266742
  0.10597178584569564
  0.020070094827555625
  0.08679502621952269
  0.025419851604787196
  0.03141967535685451
  0.023766549395328986
  0.02082097502505193
  0.04498098109730851
  0.05570954658286223
  0.031370291984048046
  0.027281501007774052
  0.06514374334150949
  0.034394261339614665
  0.033789384706024175
  0.027898036222712357
  0.06481830346066418
  0.03327159178683135
  0.07495980612935232
  0.040454137394264233
  0.020275893323103693
  0.025140539775415114
  0.024367404430499527
  0.024716029000031805
  0.04207637024755283
  0.07768646559982725
  0.020106128704318244
  0.020186824038981904
  0.04280903144463149
  0.03071607416629995
  0.046594367045526414
  0.09901602287849323
  0.03378020417830735
  0.04125919063192249
  0.0623367237380775
  0.04015911707405023
  0.024355363065843717
  0.02573078176523816
  0.025290504177241005
  0.020561229705152534
  0.02002771305000309
  0.0639025112026233),
 :effective-dimension 2.00000000000001}

Printing the model gives a tabular summary: We’ll capture the printed output and display it via Kindly for cleaner formatting.

(kind/code
 (with-out-str
   (println
    simple-linear-data-model)))
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.187523 | -0.690166 | 0.007784 | 0.711425 | 2.016451 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -4.628035 | 0.347298 | -13.325823 |      0.0 | [-5.326325 -3.929745] |
|         x |  1.978054 | 0.057391 |  34.466528 |      0.0 |   [1.862662 2.093445] |

F-statistic: 1187.9415486698024 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.96116321192401
Adjusted R2: 0.9603541121724268
Residual standard error: 1.046439923716348 on 48 degrees of freedom
AIC: 150.3921390336171

As you can see, the estimated coefficients match our intercept \(b\) and slope \(a\) (the coefficient of \(x\)).

10.2.2 Dataset ergonomics

Below are a couple of helper functions that simplify how we use regression with datasets and display model summaries. We have similar ideas under development in the Tablemath library, but it is still in an experimental stage and not part of Noj yet.

(defn lm
  "Compute a linear regression model for `dataset`.
  The first column marked as target is the target.
  All the columns unmarked as target are the features.
  The resulting model is of type `fastmath.ml.regression.LMData`,
  created via [Fastmath](https://github.com/generateme/fastmath).
  
  See [fastmath.ml.regression.lm](https://generateme.github.io/fastmath/clay/ml.html#lm)
  for `options`."
  ([dataset]
   (lm dataset nil))
  ([dataset options]
   (let [inference-column-name (-> dataset
                                   ds-mod/inference-target-column-names
                                   first)
         ds-without-target (-> dataset
                               (tc/drop-columns [inference-column-name]))]
     (reg/lm
      ;; ys
      (get dataset inference-column-name)
      ;; xss
      (tc/rows ds-without-target)
      ;; options
      (merge {:names (-> ds-without-target
                         tc/column-names
                         vec)}
             options)))))
(defn summary
  "Generate a summary of a linear model."
  [lmdata]
  (kind/code
   (with-out-str
     (println
      lmdata))))
(-> simple-linear-data
    (ds-mod/set-inference-target :y)
    lm
    summary)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.187523 | -0.690166 | 0.007784 | 0.711425 | 2.016451 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -4.628035 | 0.347298 | -13.325823 |      0.0 | [-5.326325 -3.929745] |
|        :x |  1.978054 | 0.057391 |  34.466528 |      0.0 |   [1.862662 2.093445] |

F-statistic: 1187.9415486698024 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.96116321192401
Adjusted R2: 0.9603541121724268
Residual standard error: 1.046439923716348 on 48 degrees of freedom
AIC: 150.3921390336171

10.2.3 Prediction

Once we have a linear model, we can generate new predictions. For instance, let’s predict \(y\) when \(x=3\):

(simple-linear-data-model [3])
1.3061259815821575

10.2.4 Displaying the regression line

We can visualize the fitted line by adding a smooth layer to our scatter plot. Tableplot makes this convenient:

(-> simple-linear-data
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=name "prediction"}))

Alternatively, we can build the regression line explicitly. We’ll obtain predictions and then plot them:

(-> simple-linear-data
    (tc/map-columns :prediction
                    [:x]
                    simple-linear-data-model)
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=y :prediction
                          :=name "prediction"}))

10.3 Multiple linear regression

We can easily extend these ideas to multiple linear predictors.

(def multiple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a0 2
        a1 -3
        b -5]
    (-> {:x0 (repeatedly n #(rand/frandom rng 0 10))
         :x1 (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x0 :x1]
                        (fn [x0 x1]
                          (+ (* a0 x0)
                             (* a1 x1)
                             b
                             (rand/grandom rng)))))))
(def multiple-linear-data-model
  (-> multiple-linear-data
      (ds-mod/set-inference-target :y)
      lm))
(summary multiple-linear-data-model)
Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.640647 | -0.444209 | 0.128183 | 0.543156 | 2.084828 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -4.981224 | 0.377394 | -13.198995 |      0.0 | [-5.740444 -4.222005] |
|       :x0 |  1.989129 | 0.045853 |  43.380949 |      0.0 |   [1.896885 2.081372] |
|       :x1 |  -2.96822 | 0.050025 | -59.334343 |      0.0 | [-3.068858 -2.867582] |

F-statistic: 2947.1974383612587 on degrees of freedom: {:residual 47, :model 2, :intercept 1}
p-value: 0.0

R2: 0.9920893997158582
Adjusted R2: 0.9917527784271714
Residual standard error: 0.9895714507851007 on 47 degrees of freedom
AIC: 145.7517523778554

Visualizing multiple dimensions is more involved. In the case of two features, we can use a 3D scatterplot and a 3D surface. Let us do that using Tableplot’s Plotly API.

(-> multiple-linear-data
    (plotly/layer-point {:=coordinates :3d
                         :=x :x0
                         :=y :x1
                         :=z :y})
    (plotly/layer-surface {:=dataset (let [xs (range 11)
                                           ys (range 11)]
                                       (tc/dataset
                                        {:x xs
                                         :y ys
                                         :z (for [y ys]
                                              (for [x xs]
                                                (multiple-linear-data-model
                                                 [x y])))}))
                           :=mark-opacity 0.5}))

10.4 Coming soon: Polynomial regression 🛠

10.5 Coming soon: One-hot encoding 🛠

10.6 Coming soon: Regularization 🛠

10.7 Example: Predicting Bicycle Traffic

As in the Python Data Science Handbook, we’ll try predicting the daily number of bicycle trips across the Fremont Bridge in Seattle. The features will include weather, season, day of week, and related factors.

10.7.1 Reading and parsing data

(def column-name-mapping
  {"Fremont Bridge Sidewalks, south of N 34th St" :total
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk" :west
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk" :east
   "Date" :datetime})
(column-name-mapping
 "Fremont Bridge Sidewalks, south of N 34th St")
:total
(def counts
  (tc/dataset "data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz"
              {:key-fn column-name-mapping
               :parser-fn {"Date" [:local-date-time "MM/dd/yyyy hh:mm:ss a"]}}))
counts

data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz [106608 4]:

:datetime :total :west :east
2012-10-02T13:00 55 7 48
2012-10-02T14:00 130 55 75
2012-10-02T15:00 152 81 71
2012-10-02T16:00 278 167 111
2012-10-02T17:00 563 393 170
2012-10-02T18:00 381 236 145
2012-10-02T19:00 175 104 71
2012-10-02T20:00 86 51 35
2012-10-02T21:00 63 35 28
2012-10-02T22:00 42 27 15
2024-11-30T13:00 147 62 85
2024-11-30T14:00 154 73 81
2024-11-30T15:00 118 57 61
2024-11-30T16:00 88 51 37
2024-11-30T17:00 46 11 35
2024-11-30T18:00 46 18 28
2024-11-30T19:00 69 14 55
2024-11-30T20:00 18 8 10
2024-11-30T21:00 49 15 34
2024-11-30T22:00 14 4 10
2024-11-30T23:00 10 5 5
(def weather
  (tc/dataset "data/seattle-bikes-and-weather/BicycleWeather.csv.gz"
              {:key-fn keyword}))
weather

data/seattle-bikes-and-weather/BicycleWeather.csv.gz [1340 26]:

:STATION :STATION_NAME :DATE :PRCP :SNWD :SNOW :TMAX :TMIN :AWND :WDF2 :WDF5 :WSF2 :WSF5 :FMTM :WT14 :WT01 :WT17 :WT05 :WT02 :WT22 :WT04 :WT13 :WT16 :WT08 :WT18 :WT03
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120101 0 0 0 128 50 47 100 90 89 112 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120102 109 0 0 106 28 45 180 200 130 179 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120103 8 0 0 117 72 23 180 170 54 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120104 203 0 0 122 56 47 180 190 107 148 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120105 13 0 0 89 28 61 200 220 107 165 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120106 25 0 0 44 22 22 180 180 45 63 -9999 1 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120107 0 0 0 72 28 23 170 180 54 63 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120108 0 0 0 100 28 20 160 200 45 63 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120109 43 0 0 94 50 34 200 200 67 89 -9999 1 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120110 10 0 0 61 6 34 20 30 89 107 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150822 0 0 0 267 122 25 20 20 63 76 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150823 0 0 0 278 139 18 10 10 67 81 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150824 0 0 0 239 122 23 190 190 54 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150825 0 0 0 256 122 34 350 360 63 76 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150826 0 0 0 283 139 17 30 40 58 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150827 0 0 0 294 144 21 230 200 45 63 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150828 5 0 0 233 156 26 230 240 81 103 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150829 325 0 0 222 133 58 210 210 157 206 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150830 102 0 0 200 128 47 200 200 89 112 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150831 0 0 0 189 161 58 210 210 112 134 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150901 58 0 0 194 139 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999

10.7.2 Preprocessing

The bike counts come in hourly data, but our weather information is daily. We’ll need to aggregate the hourly counts into daily totals before combining the datasets.

In the Python handbook, one does:

daily = counts.resample('d').sum()

Since Tablecloth’s time series features are still evolving, we’ll be a bit more explicit:

(def daily-totals
  (-> counts
      (tc/group-by (fn [{:keys [datetime]}]
                     {:date (datetime/local-date-time->local-date
                             datetime)}))
      (tc/aggregate-columns [:total :west :east]
                            tcc/sum)))
daily-totals

_unnamed [4443 4]:

:date :total :west :east
2012-10-02 1938.0 1165.0 773.0
2012-10-03 3521.0 1761.0 1760.0
2012-10-04 3475.0 1767.0 1708.0
2012-10-05 3148.0 1590.0 1558.0
2012-10-06 2006.0 926.0 1080.0
2012-10-07 2142.0 951.0 1191.0
2012-10-08 3537.0 1708.0 1829.0
2012-10-09 3501.0 1742.0 1759.0
2012-10-10 3235.0 1587.0 1648.0
2012-10-11 3047.0 1468.0 1579.0
2024-11-20 2300.0 787.0 1513.0
2024-11-21 2382.0 775.0 1607.0
2024-11-22 1473.0 536.0 937.0
2024-11-23 1453.0 652.0 801.0
2024-11-24 727.0 311.0 416.0
2024-11-25 1483.0 486.0 997.0
2024-11-26 2173.0 727.0 1446.0
2024-11-27 1522.0 548.0 974.0
2024-11-28 631.0 261.0 370.0
2024-11-29 833.0 368.0 465.0
2024-11-30 1178.0 509.0 669.0

10.7.3 Prediction by day-of-week

Next, we’ll explore a simple regression by day of week.

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

We’ll convert the numeric day-of-week to the corresponding keyword:

(def idx->day-of-week
  (comp days-of-week dec))

For example,

(idx->day-of-week 1)
:Mon
(idx->day-of-week 7)
:Sun

Now, let’s build our dataset:

(def totals-with-day-of-week
  (-> daily-totals
      (tc/add-column :day-of-week
                     (fn [ds]
                       (map idx->day-of-week
                            (datetime/long-temporal-field
                             :day-of-week
                             (:date ds)))))
      (tc/select-columns [:total :day-of-week])))
totals-with-day-of-week

_unnamed [4443 2]:

:total :day-of-week
1938.0 :Tue
3521.0 :Wed
3475.0 :Thu
3148.0 :Fri
2006.0 :Sat
2142.0 :Sun
3537.0 :Mon
3501.0 :Tue
3235.0 :Wed
3047.0 :Thu
2300.0 :Wed
2382.0 :Thu
1473.0 :Fri
1453.0 :Sat
727.0 :Sun
1483.0 :Mon
2173.0 :Tue
1522.0 :Wed
631.0 :Thu
833.0 :Fri
1178.0 :Sat
(def totals-with-one-hot-days-of-week
  (-> (reduce (fn [dataset day-of-week]
                (-> dataset
                    (tc/add-column day-of-week
                                   #(-> (:day-of-week %)
                                        (tcc/eq day-of-week)
                                        ;; convert booleans to 0/1
                                        (tcc/* 1)))))
              totals-with-day-of-week
              days-of-week)
      (tc/drop-columns [:day-of-week])
      (ds-mod/set-inference-target :total)))
(-> totals-with-one-hot-days-of-week
    (tc/select-columns ds-mod/inference-column?))

_unnamed [0 0]

Since the binary columns sum to 1, they’re collinear, and we won’t use an intercept. This way, each coefficient directly reflects the expected bike count for that day of week.

(def days-of-week-model
  (lm totals-with-one-hot-days-of-week
      {:intercept? false}))

Let’s take a look at the results:

(-> days-of-week-model
    println
    with-out-str
    kind/code)
Residuals:

|         :min |         :q1 |    :median |        :q3 |        :max |
|--------------+-------------+------------+------------+-------------|
| -3034.944882 | -841.755906 | -90.179528 | 828.820472 | 3745.244094 |

Coefficients:

| :name |   :estimate |   :stderr |  :t-value | :p-value |      :confidence-interval |
|-------+-------------+-----------+-----------+----------+---------------------------|
|  :Mon | 2785.741325 | 44.521658 | 62.570476 |      0.0 | [2698.456664 2873.025986] |
|  :Tue | 3115.527559 | 44.486587 | 70.032964 |      0.0 | [3028.311653 3202.743465] |
|  :Wed | 3109.944882 | 44.486587 | 69.907472 |      0.0 | [3022.728976 3197.160788] |
|  :Thu | 2961.179528 | 44.486587 | 66.563423 |      0.0 | [2873.963622 3048.395433] |
|  :Fri | 2623.755906 | 44.486587 | 58.978583 |      0.0 |     [2536.54 2710.971811] |
|  :Sat | 1688.029921 | 44.486587 | 37.944693 |      0.0 | [1600.814015 1775.245827] |
|  :Sun | 1576.178233 | 44.521658 | 35.402506 |      0.0 | [1488.893572 1663.462894] |

F-statistic: 3472.719277560978 on degrees of freedom: {:residual 4436, :model 7, :intercept 0}
p-value: 0.0

R2: 0.8456776967289252
Adjusted R2: 0.8454341764126724
Residual standard error: 1121.0266948292917 on 4436 degrees of freedom
AIC: 75015.17638480052

We can clearly see weekend versus weekday differences.

10.7.4 Coming soon: more predictors for the bike counts 🛠

source: notebooks/noj_book/linear_regression_intro.clj