9  Introduction to Linear Regression

last update: 2024-12-29

In this tutorial, we introduce the fundamentals of linear regression, guided by the In Depth: Linear Regression chapter of the Python Data Science Handbook by Jake VanderPlas.

9.1 Setup

(ns noj-book.linear-regression-intro
  (:require
   [tech.v3.dataset :as ds]
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.datatype.datetime :as datetime]
   [tech.v3.dataset.modelling :as ds-mod]
   [fastmath.ml.regression :as reg]
   [scicloj.kindly.v4.kind :as kind]
   [fastmath.random :as rand]
   [scicloj.tableplot.v1.plotly :as plotly]))

9.2 Simple Linear Regression

We begin with the classic straight-line model: for data points \((x, y)\), we assume a linear relationship that lets us predict \(y\) as \[y = ax + b.\] Here, \(a\) is the slope and \(b\) is the intercept, the value of \(y\) at which the line crosses the \(y\) axis.
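For example, with \(a=2\) and \(b=-5\), an input of \(x=3\) yields the prediction \(y = 2 \cdot 3 - 5 = 1\).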

To illustrate, we’ll use Fastmath and Tablecloth to create synthetic data in which the relationship is known to hold with \(a=2\) and \(b=-5\).

For each row in the dataset below, we draw \(x\) uniformly from 0 to 10 and compute \(y = ax + b\) plus an extra random noise term (drawn from a standard Gaussian distribution). This noise is added independently for every row.

(def simple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a 2
        b -5]
    (-> {:x (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x]
                        (fn [x]
                          (+ (* a x)
                             b
                             (rand/grandom rng)))))))
simple-linear-data

_unnamed [50 2]:

|         :x |          :y |
|------------+-------------|
| 2.38517475 |  0.04942258 |
| 4.23743963 |  2.23159042 |
| 0.03171802 | -5.31679797 |
| 8.23886776 | 12.92243798 |
| 8.85289383 | 13.04377334 |
| 1.00212038 | -2.65984090 |
| 9.38846302 | 14.20798133 |
| 5.77733755 |  6.41680105 |
| 1.27384186 | -3.98953499 |
| 1.63633645 | -1.21239164 |
|          … |           … |
| 1.41956210 | -1.37584393 |
| 9.73743629 | 15.72954955 |
| 2.23365498 | -0.50168044 |
| 8.89793396 | 10.49822926 |
| 5.07024956 |  3.29915722 |
| 3.31702900 |  0.90552579 |
| 9.26859093 | 13.61544128 |
| 3.36653757 |  2.21968780 |
| 8.02170277 | 10.86430092 |
| 5.47477055 |  6.54468116 |
| 2.96182442 |  0.93830277 |

Let’s plot these points using Tableplot’s Plotly API.

(-> simple-linear-data
    plotly/layer-point)

9.2.1 Regression using Fastmath

We can now fit a linear model to the data using the Fastmath library.

(def simple-linear-data-model
  (reg/lm
   ;; ys - a "column" sequence of `y` values:
   (simple-linear-data :y)
   ;; xss - a sequence of "rows", each containing `x` values:
   ;; (one `x` per row, in our case):
   (-> simple-linear-data
       (tc/select-columns [:x])
       tc/rows)
   ;; options
   {:names ["x"]}))
(type simple-linear-data-model)
fastmath.ml.regression.LMData
simple-linear-data-model
{:model :ols,
 :intercept? true,
 :offset? false,
 :transformer nil,
 :xtxinv
 #object[org.apache.commons.math3.linear.Array2DRowRealMatrix 0x7088809e "Array2DRowRealMatrix{{0.0667061806,-0.0104066882},{-0.0104066882,0.0023187329}}"],
 :intercept -5.180532100953919,
 :beta [2.0260242526093],
 :coefficients
 [{:estimate -5.180532100953919,
   :stderr 0.24583919706042817,
   :t-value -21.072848280091502,
   :p-value 6.632285526004838E-26,
   :confidence-interval [-5.674824935350048 -4.6862392665577906]}
  {:estimate 2.0260242526093,
   :stderr 0.04583460102211224,
   :t-value 44.20294291711788,
   :p-value 0.0,
   :confidence-interval [1.9338676106924024 2.118180894526198]}],
 :offset
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 :weights
 [1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0],
 :residuals
 {:weighted
  [0.397532790731087
   -1.1730329422912393
   -0.20052733373988918
   1.4108241898650107
   0.28812783307271417
   0.49037101290730334
   0.36725966157419165
   -0.10769284216719299
   -1.3898373832991608
   0.6528831409098683
   -0.6735736270752675
   1.2244003648664892
   -0.9869231348429306
   0.15478747180468755
   -0.2079856609573456
   0.17504437549823448
   0.8524989713986129
   0.6799524574854169
   1.3296832066541393
   -0.35392079668397436
   -1.813912175751509
   0.21644118860665218
   -1.3449858344261827
   0.7035908715209525
   -0.5061510643035754
   1.2769491322054085
   -0.8480462351940101
   -0.9914992055466665
   -0.9551320159660595
   0.03110927828036303
   0.15600264906164796
   1.798623577891279
   0.9898533985516562
   0.5859030775164564
   -0.566947703774689
   -0.6342953919498568
   1.548364247782307
   -0.924134282573509
   -0.28077437321785936
   0.928620925733231
   1.1817995562887535
   0.1534125105644646
   -2.3486686407107555
   -1.7927592534127457
   -0.6343233077427036
   0.01758337600680271
   0.5795331386980758
   -0.2073313298051005
   0.6331953551866034
   0.11810677476984577],
  :raw
  [0.397532790731087
   -1.1730329422912393
   -0.20052733373988918
   1.4108241898650107
   0.28812783307271417
   0.49037101290730334
   0.36725966157419165
   -0.10769284216719299
   -1.3898373832991608
   0.6528831409098683
   -0.6735736270752675
   1.2244003648664892
   -0.9869231348429306
   0.15478747180468755
   -0.2079856609573456
   0.17504437549823448
   0.8524989713986129
   0.6799524574854169
   1.3296832066541393
   -0.35392079668397436
   -1.813912175751509
   0.21644118860665218
   -1.3449858344261827
   0.7035908715209525
   -0.5061510643035754
   1.2769491322054085
   -0.8480462351940101
   -0.9914992055466665
   -0.9551320159660595
   0.03110927828036303
   0.15600264906164796
   1.798623577891279
   0.9898533985516562
   0.5859030775164564
   -0.566947703774689
   -0.6342953919498568
   1.548364247782307
   -0.924134282573509
   -0.28077437321785936
   0.928620925733231
   1.1817995562887535
   0.1534125105644646
   -2.3486686407107555
   -1.7927592534127457
   -0.6343233077427036
   0.01758337600680271
   0.5795331386980758
   -0.2073313298051005
   0.6331953551866034
   0.11810677476984577]},
 :fitted
 (-0.3481102081457621
  3.4046233632882723
  -5.116270631960284
  11.511613794248433
  12.755645503075694
  -3.1502119158870334
  13.840721672949499
  6.52449389311943
  -2.599697602839053
  -1.8652747763175754
  11.061181158139043
  2.8158591419229646
  14.688380854389827
  2.8800679593578353
  -0.7281568879809477
  1.9082829696498527
  -2.9733497750143165
  -0.02012964038548759
  -1.3268220176031988
  -4.925146681247773
  -0.684572024243618
  -2.4971140107754737
  4.7730779511046615
  -2.733695819896062
  -2.384608739656539
  7.351462461090571
  4.51449890038223
  2.766666162614424
  6.532211935375831
  12.442093394584333
  4.592592271992715
  12.176453238485507
  -3.898077796357418
  7.924353928134603
  1.3372661879229497
  -1.702831433877301
  7.805133409584921
  2.2944990596551946
  12.567330682087935
  -2.3044648555051586
  14.547749990053894
  -0.6550929478370824
  12.846897900042851
  5.09191646931276
  1.5398490982942095
  13.597857904913873
  1.6401546651180094
  11.071632251032991
  5.911485802631032
  0.8201960000899708),
 :df {:residual 48, :model 1, :intercept 1},
 :observations 50,
 :names ["Intercept" "x"],
 :r-squared 0.9760227803072645,
 :adjusted-r-squared 0.9755232548969992,
 :sigma2 0.9060166583250343,
 :sigma 0.951849073291052,
 :tss 1813.7548955593802,
 :rss 43.48879959960165,
 :regss 1770.2660959597786,
 :msreg 1770.2660959597786,
 :qt 2.010634757624228,
 :f-statistic 1953.9001625339806,
 :p-value 0.0,
 :ll
 {:log-likelihood -67.45893713632807,
  :aic 140.91787427265615,
  :bic 146.65394328894058,
  :aic-rss -2.9759790478111174,
  :bic-rss 0.8480669630451745},
 :analysis #<Delay@24575f47: :not-delivered>}

Printing the model gives a tabular summary. We'll capture the printed output and display it via Kindly for cleaner formatting.

(kind/code
 (with-out-str
   (println
    simple-linear-data-model)))
Residuals:

|      :min |       :q1 | :median |     :q3 |     :max |
|-----------+-----------+---------+---------+----------|
| -2.348669 | -0.644136 | 0.13576 | 0.65965 | 1.798624 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -5.180532 | 0.245839 | -21.072848 |      0.0 | [-5.674825 -4.686239] |
|         x |  2.026024 | 0.045835 |  44.202943 |      0.0 |   [1.933868 2.118181] |

F-statistic: 1953.9001625339806 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.9760227803072645
Adjusted R2: 0.9755232548969992
Residual standard error: 0.951849073291052 on 48 degrees of freedom
AIC: 140.91787427265615

As you can see, the estimated coefficients match our intercept \(b\) and slope \(a\) (the coefficient of \(x\)).
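
If we need these estimates programmatically rather than as printed output, we can read them straight off the model; a minimal sketch using the :intercept and :beta keys visible in the map above:

[(:intercept simple-linear-data-model)
 (:beta simple-linear-data-model)]
;; => [-5.180532100953919 [2.0260242526093]]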

9.2.2 Dataset ergonomics

Below are a couple of helper functions that simplify how we use regression with datasets and display model summaries. We have similar ideas under development in the Tablemath library, but it is still in an experimental stage and not part of Noj yet.

(defn lm
  "Compute a linear regression model for `dataset`.
  The first column marked as target is the target.
  All the columns unmarked as target are the features.
  The resulting model is of type `fastmath.ml.regression.LMData`,
  created via [Fastmath](https://github.com/generateme/fastmath).
  
  See [fastmath.ml.regression.lm](https://generateme.github.io/fastmath/clay/ml.html#lm)
  for `options`."
  ([dataset]
   (lm dataset nil))
  ([dataset options]
   (let [inference-column-name (-> dataset
                                   ds-mod/inference-target-column-names
                                   first)
         ds-without-target (-> dataset
                               (tc/drop-columns [inference-column-name]))]
     (reg/lm
      ;; ys
      (get dataset inference-column-name)
      ;; xss
      (tc/rows ds-without-target)
      ;; options
      (merge {:names (-> ds-without-target
                         tc/column-names
                         vec)}
             options)))))
(defn summary
  "Generate a summary of a linear model."
  [lmdata]
  (kind/code
   (with-out-str
     (println
      lmdata))))
(-> simple-linear-data
    (ds-mod/set-inference-target :y)
    lm
    summary)
Residuals:

|      :min |       :q1 | :median |     :q3 |     :max |
|-----------+-----------+---------+---------+----------|
| -2.348669 | -0.644136 | 0.13576 | 0.65965 | 1.798624 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -5.180532 | 0.245839 | -21.072848 |      0.0 | [-5.674825 -4.686239] |
|        :x |  2.026024 | 0.045835 |  44.202943 |      0.0 |   [1.933868 2.118181] |

F-statistic: 1953.9001625339806 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.9760227803072645
Adjusted R2: 0.9755232548969992
Residual standard error: 0.951849073291052 on 48 degrees of freedom
AIC: 140.91787427265615

9.2.3 Prediction

Once we have a linear model, we can generate new predictions. For instance, let’s predict \(y\) when \(x=3\):

(simple-linear-data-model [3])
0.8975406568739812
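
Since the model can be called as a function of a feature row, we can also map it over several rows at once; a small sketch with hypothetical inputs:

;; one prediction per row of feature values
(map simple-linear-data-model
     [[0] [3] [10]])
;; => (≈ -5.18, ≈ 0.90, ≈ 15.08), close to (+ (* 2 x) -5) for each x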

9.2.4 Displaying the regression line

We can visualize the fitted line by adding a smooth layer to our scatter plot. Tableplot makes this convenient:

(-> simple-linear-data
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=name "prediction"}))

Alternatively, we can build the regression line explicitly. We’ll obtain predictions and then plot them:

(-> simple-linear-data
    (tc/map-columns :prediction
                    [:x]
                    simple-linear-data-model)
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=y :prediction
                          :=name "prediction"}))

9.3 Multiple linear regression

We can easily extend these ideas to multiple linear predictors.

(def multiple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a0 2
        a1 -3
        b -5]
    (-> {:x0 (repeatedly n #(rand/frandom rng 0 10))
         :x1 (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x0 :x1]
                        (fn [x0 x1]
                          (+ (* a0 x0)
                             (* a1 x1)
                             b
                             (rand/grandom rng)))))))
(def multiple-linear-data-model
  (-> multiple-linear-data
      (ds-mod/set-inference-target :y)
      lm))
(summary multiple-linear-data-model)
Residuals:

|    :min |       :q1 | :median |    :q3 |     :max |
|---------+-----------+---------+--------+----------|
| -1.9375 | -0.842893 | 0.03606 | 0.8495 | 1.888152 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -5.688223 | 0.518933 | -10.961377 |      0.0 | [-6.732182 -4.644264] |
|       :x0 |   2.03282 | 0.060823 |   33.42169 |      0.0 |    [1.91046 2.155181] |
|       :x1 | -2.939436 | 0.062404 | -47.103379 |      0.0 | [-3.064976 -2.813895] |

F-statistic: 1659.3527178903253 on degrees of freedom: {:residual 47, :model 2, :intercept 1}
p-value: 0.0

R2: 0.9860356169317893
Adjusted R2: 0.985441387865057
Residual standard error: 1.0995296643474624 on 47 degrees of freedom
AIC: 156.2883341846529
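
As in the simple case, the fitted model can be applied to a new feature row, now with one value per predictor. A small sketch with hypothetical inputs \(x_0=3\), \(x_1=4\):

(multiple-linear-data-model [3 4])
;; => ≈ -11.35, close to the true value (+ (* 2 3) (* -3 4) -5) = -11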

Visualizing multiple dimensions is more involved. In the case of two features, we can use a 3D scatterplot and a 3D surface. Let us do that using Tableplot’s Plotly API.

(-> multiple-linear-data
    ;; the data points, as a 3D scatterplot
    (plotly/layer-point {:=coordinates :3d
                         :=x :x0
                         :=y :x1
                         :=z :y})
    ;; the fitted plane, evaluated over a grid of (x0, x1) values
    (plotly/layer-surface {:=dataset (let [xs (range 11)
                                           ys (range 11)]
                                       (tc/dataset
                                        {:x xs
                                         :y ys
                                         ;; a matrix of predictions:
                                         ;; one row per y value, one column per x value
                                         :z (for [y ys]
                                              (for [x xs]
                                                (multiple-linear-data-model
                                                 [x y])))}))
                           :=mark-opacity 0.5}))

9.4 Coming soon: Polynomial regression 🛠

9.5 Coming soon: One-hot encoding 🛠

9.6 Coming soon: Regularization 🛠

9.7 Example: Predicting Bicycle Traffic

As in the Python Data Science Handbook, we’ll try predicting the daily number of bicycle trips across the Fremont Bridge in Seattle. The features will include weather, season, day of week, and related factors.

9.7.1 Reading and parsing data

(def column-name-mapping
  {"Fremont Bridge Sidewalks, south of N 34th St" :total
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk" :west
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk" :east
   "Date" :datetime})
(column-name-mapping
 "Fremont Bridge Sidewalks, south of N 34th St")
:total
(def counts
  (tc/dataset "data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz"
              {:key-fn column-name-mapping
               :parser-fn {"Date" [:local-date-time "MM/dd/yyyy hh:mm:ss a"]}}))
counts

data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz [106608 4]:

|        :datetime | :total | :west | :east |
|------------------+--------+-------+-------|
| 2012-10-02T13:00 |     55 |     7 |    48 |
| 2012-10-02T14:00 |    130 |    55 |    75 |
| 2012-10-02T15:00 |    152 |    81 |    71 |
| 2012-10-02T16:00 |    278 |   167 |   111 |
| 2012-10-02T17:00 |    563 |   393 |   170 |
| 2012-10-02T18:00 |    381 |   236 |   145 |
| 2012-10-02T19:00 |    175 |   104 |    71 |
| 2012-10-02T20:00 |     86 |    51 |    35 |
| 2012-10-02T21:00 |     63 |    35 |    28 |
| 2012-10-02T22:00 |     42 |    27 |    15 |
|                … |      … |     … |     … |
| 2024-11-30T13:00 |    147 |    62 |    85 |
| 2024-11-30T14:00 |    154 |    73 |    81 |
| 2024-11-30T15:00 |    118 |    57 |    61 |
| 2024-11-30T16:00 |     88 |    51 |    37 |
| 2024-11-30T17:00 |     46 |    11 |    35 |
| 2024-11-30T18:00 |     46 |    18 |    28 |
| 2024-11-30T19:00 |     69 |    14 |    55 |
| 2024-11-30T20:00 |     18 |     8 |    10 |
| 2024-11-30T21:00 |     49 |    15 |    34 |
| 2024-11-30T22:00 |     14 |     4 |    10 |
| 2024-11-30T23:00 |     10 |     5 |     5 |
(def weather
  (tc/dataset "data/seattle-bikes-and-weather/BicycleWeather.csv.gz"
              {:key-fn keyword}))
weather

data/seattle-bikes-and-weather/BicycleWeather.csv.gz [1340 26]:

:STATION :STATION_NAME :DATE :PRCP :SNWD :SNOW :TMAX :TMIN :AWND :WDF2 :WDF5 :WSF2 :WSF5 :FMTM :WT14 :WT01 :WT17 :WT05 :WT02 :WT22 :WT04 :WT13 :WT16 :WT08 :WT18 :WT03
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120101 0 0 0 128 50 47 100 90 89 112 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120102 109 0 0 106 28 45 180 200 130 179 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120103 8 0 0 117 72 23 180 170 54 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120104 203 0 0 122 56 47 180 190 107 148 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120105 13 0 0 89 28 61 200 220 107 165 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120106 25 0 0 44 22 22 180 180 45 63 -9999 1 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120107 0 0 0 72 28 23 170 180 54 63 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120108 0 0 0 100 28 20 160 200 45 63 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120109 43 0 0 94 50 34 200 200 67 89 -9999 1 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120110 10 0 0 61 6 34 20 30 89 107 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150822 0 0 0 267 122 25 20 20 63 76 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150823 0 0 0 278 139 18 10 10 67 81 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150824 0 0 0 239 122 23 190 190 54 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150825 0 0 0 256 122 34 350 360 63 76 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150826 0 0 0 283 139 17 30 40 58 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150827 0 0 0 294 144 21 230 200 45 63 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150828 5 0 0 233 156 26 230 240 81 103 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150829 325 0 0 222 133 58 210 210 157 206 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150830 102 0 0 200 128 47 200 200 89 112 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150831 0 0 0 189 161 58 210 210 112 134 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150901 58 0 0 194 139 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999

9.7.2 Preprocessing

The bike counts are reported hourly, but our weather information is daily. We'll need to aggregate the hourly counts into daily totals before combining the two datasets.

In the Python handbook, one does:

daily = counts.resample('d').sum()

Since Tablecloth’s time series features are still evolving, we’ll be a bit more explicit:

(def daily-totals
  (-> counts
      ;; group the hourly rows by their calendar date
      (tc/group-by (fn [{:keys [datetime]}]
                     {:date (datetime/local-date-time->local-date
                             datetime)}))
      ;; sum each day's hourly counts into daily totals
      (tc/aggregate-columns [:total :west :east]
                            tcc/sum)))
daily-totals

_unnamed [4443 4]:

|      :date | :total |  :west |  :east |
|------------+--------+--------+--------|
| 2012-10-02 | 1938.0 | 1165.0 |  773.0 |
| 2012-10-03 | 3521.0 | 1761.0 | 1760.0 |
| 2012-10-04 | 3475.0 | 1767.0 | 1708.0 |
| 2012-10-05 | 3148.0 | 1590.0 | 1558.0 |
| 2012-10-06 | 2006.0 |  926.0 | 1080.0 |
| 2012-10-07 | 2142.0 |  951.0 | 1191.0 |
| 2012-10-08 | 3537.0 | 1708.0 | 1829.0 |
| 2012-10-09 | 3501.0 | 1742.0 | 1759.0 |
| 2012-10-10 | 3235.0 | 1587.0 | 1648.0 |
| 2012-10-11 | 3047.0 | 1468.0 | 1579.0 |
|          … |      … |      … |      … |
| 2024-11-20 | 2300.0 |  787.0 | 1513.0 |
| 2024-11-21 | 2382.0 |  775.0 | 1607.0 |
| 2024-11-22 | 1473.0 |  536.0 |  937.0 |
| 2024-11-23 | 1453.0 |  652.0 |  801.0 |
| 2024-11-24 |  727.0 |  311.0 |  416.0 |
| 2024-11-25 | 1483.0 |  486.0 |  997.0 |
| 2024-11-26 | 2173.0 |  727.0 | 1446.0 |
| 2024-11-27 | 1522.0 |  548.0 |  974.0 |
| 2024-11-28 |  631.0 |  261.0 |  370.0 |
| 2024-11-29 |  833.0 |  368.0 |  465.0 |
| 2024-11-30 | 1178.0 |  509.0 |  669.0 |

9.7.3 Prediction by day-of-week

Next, we’ll explore a simple regression by day of week.

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

We’ll convert the numeric day-of-week to the corresponding keyword:

(def idx->day-of-week
  (comp days-of-week dec))

For example,

(idx->day-of-week 1)
:Mon
(idx->day-of-week 7)
:Sun

Now, let’s build our dataset:

(def totals-with-day-of-week
  (-> daily-totals
      (tc/add-column :day-of-week
                     (fn [ds]
                       (map idx->day-of-week
                            (datetime/long-temporal-field
                             :day-of-week
                             (:date ds)))))
      (tc/select-columns [:total :day-of-week])))
totals-with-day-of-week

_unnamed [4443 2]:

| :total | :day-of-week |
|--------+--------------|
| 1938.0 |         :Tue |
| 3521.0 |         :Wed |
| 3475.0 |         :Thu |
| 3148.0 |         :Fri |
| 2006.0 |         :Sat |
| 2142.0 |         :Sun |
| 3537.0 |         :Mon |
| 3501.0 |         :Tue |
| 3235.0 |         :Wed |
| 3047.0 |         :Thu |
|      … |            … |
| 2300.0 |         :Wed |
| 2382.0 |         :Thu |
| 1473.0 |         :Fri |
| 1453.0 |         :Sat |
|  727.0 |         :Sun |
| 1483.0 |         :Mon |
| 2173.0 |         :Tue |
| 1522.0 |         :Wed |
|  631.0 |         :Thu |
|  833.0 |         :Fri |
| 1178.0 |         :Sat |
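Next, we one-hot encode the day of week: for each of the seven days, we add a binary column that is 1 when the row falls on that day and 0 otherwise.
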
(def totals-with-one-hot-days-of-week
  (-> (reduce (fn [dataset day-of-week]
                (-> dataset
                    (tc/add-column day-of-week
                                   #(-> (:day-of-week %)
                                        (tcc/eq day-of-week)
                                        ;; convert booleans to 0/1
                                        (tcc/* 1)))))
              totals-with-day-of-week
              days-of-week)
      (tc/drop-columns [:day-of-week])
      (ds-mod/set-inference-target :total)))
(-> totals-with-one-hot-days-of-week
    (tc/select-columns ds-mod/inference-column?))

_unnamed [0 0]

Since the seven binary columns sum to 1 in every row, they are collinear with the intercept column, so we fit the model without an intercept. Each coefficient then directly estimates the expected daily bike count for its day of the week.

(def days-of-week-model
  (lm totals-with-one-hot-days-of-week
      {:intercept? false}))

Let’s take a look at the results:

(-> days-of-week-model
    println
    with-out-str
    kind/code)
Residuals:

|         :min |         :q1 |    :median |        :q3 |        :max |
|--------------+-------------+------------+------------+-------------|
| -3034.944882 | -841.755906 | -90.179528 | 828.820472 | 3745.244094 |

Coefficients:

| :name |   :estimate |   :stderr |  :t-value | :p-value |      :confidence-interval |
|-------+-------------+-----------+-----------+----------+---------------------------|
|  :Mon | 2785.741325 | 44.521658 | 62.570476 |      0.0 | [2698.456664 2873.025986] |
|  :Tue | 3115.527559 | 44.486587 | 70.032964 |      0.0 | [3028.311653 3202.743465] |
|  :Wed | 3109.944882 | 44.486587 | 69.907472 |      0.0 | [3022.728976 3197.160788] |
|  :Thu | 2961.179528 | 44.486587 | 66.563423 |      0.0 | [2873.963622 3048.395433] |
|  :Fri | 2623.755906 | 44.486587 | 58.978583 |      0.0 |     [2536.54 2710.971811] |
|  :Sat | 1688.029921 | 44.486587 | 37.944693 |      0.0 | [1600.814015 1775.245827] |
|  :Sun | 1576.178233 | 44.521658 | 35.402506 |      0.0 | [1488.893572 1663.462894] |

F-statistic: 3472.719277560978 on degrees of freedom: {:residual 4436, :model 7, :intercept 0}
p-value: 0.0

R2: 0.8456776967289252
Adjusted R2: 0.8454341764126724
Residual standard error: 1121.0266948292917 on 4436 degrees of freedom
AIC: 75015.17638480052

We can clearly see weekend versus weekday differences.
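
To work with these per-day estimates directly, e.g. for plotting, we can pull them out of the model; a minimal sketch using the :names and :coefficients keys of the LMData record:

(tc/dataset
 {:day-of-week (:names days-of-week-model)
  :expected-total (map :estimate
                       (:coefficients days-of-week-model))})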

9.7.4 Coming soon: more predictors for the bike counts 🛠

source: notebooks/noj_book/linear_regression_intro.clj