10 Introduction to Linear Regression

last update: 2024-12-29

In this tutorial, we introduce the fundamentals of linear regression, guided by the "In Depth: Linear Regression" chapter of the Python Data Science Handbook by Jake VanderPlas.

10.1 Setup

(ns noj-book.linear-regression-intro
  (:require
   [tech.v3.dataset :as ds]
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.datatype.datetime :as datetime]
   [tech.v3.dataset.modelling :as ds-mod]
   [fastmath.ml.regression :as reg]
   [scicloj.kindly.v4.kind :as kind]
   [fastmath.random :as rand]
   [scicloj.tableplot.v1.plotly :as plotly]))

10.2 Simple Linear Regression

We begin with the classic straight-line model: for data points \((x, y)\), we assume there is a linear relationship allowing us to predict \(y\) as \[y = ax + b.\] In this formulation, \(a\) is the slope and \(b\) is the intercept, the value of \(y\) at which the line crosses the \(y\) axis.

To illustrate, we’ll use Fastmath and Tablecloth to create synthetic data in which the relationship is known to hold with \(a=2\) and \(b=-5\).

For each row in the dataset below, we draw \(x\) uniformly from 0 to 10 and compute \(y = ax + b\) plus an extra random noise term (drawn from a standard Gaussian distribution). This noise is added independently for every row.

(def simple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a 2
        b -5]
    (-> {:x (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x]
                        (fn [x]
                          (+ (* a x)
                             b
                             (rand/grandom rng)))))))

simple-linear-data

_unnamed [50 2]:

:x	:y
7.97690344	10.27676182
9.33682251	12.55247210
3.48480940	2.78219772
0.32233775	-2.10696103
6.76769972	7.79522824
9.57730865	13.66814034
3.23972511	4.12318981
4.76181126	4.28776110
8.59562969	11.43073247
7.27913380	11.00014082
…	…
9.83472061	13.89653018
7.60345840	10.31939758
0.71324766	-2.12434599
9.96650028	15.69500004
4.46464539	3.11673576
4.67479134	5.70488322
0.68129003	-4.50180721
8.77676392	13.14855351
3.94394875	4.45846826
0.21207690	-4.78543376
4.67980385	3.94169960

Let’s plot these points using Tableplot’s Plotly API.

(-> simple-linear-data
    plotly/layer-point)

10.2.1 Regression using Fastmath

We can now fit a linear model to the data using the Fastmath library.

(def simple-linear-data-model
  (reg/lm
   ;; ys - a "column" sequence of `y` values:
   (simple-linear-data :y)
   ;; xss - a sequence of "rows", each containing `x` values:
   ;; (one `x` per row, in our case):
   (-> simple-linear-data
       (tc/select-columns [:x])
       tc/rows)
   ;; options
   {:names ["x"]}))

(type simple-linear-data-model)

fastmath.ml.regression.LMData

simple-linear-data-model

{:model :ols,
 :intercept? true,
 :offset? false,
 :transformer nil,
 :xtxinv
 #object[org.apache.commons.math3.linear.BlockRealMatrix 0x15773356 "BlockRealMatrix{{0.0799485475,-0.0116715265},{-0.0116715265,0.0022723575}}"],
 :intercept -5.013951299957048,
 :beta [2.0313776572955815],
 :coefficients
 [{:estimate -5.013951299957048,
   :stderr 0.3066985108771736,
   :t-value -16.348143607273762,
   :p-value 2.9857631516674843E-21,
   :confidence-interval [-5.630609986038285 -4.397292613875811]}
  {:estimate 2.0313776572955815,
   :stderr 0.05170644803846648,
   :t-value 39.28673761895922,
   :p-value 0.0,
   :confidence-interval [1.9274148756761498 2.1353404389150135]}],
 :offset
 (0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0
  0.0),
 :weights
 (1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0),
 :residuals
 {:weighted
  (-0.9133902951828023
   -1.400189240890926
   0.7171850689738104
   2.2522005779710943
   -0.9385744613364952
   -0.7730391812858866
   2.5560359006607554
   -0.37132459956455044
   -1.016286335526102
   1.2274223606948027
   1.431705716746059
   0.4106140552750013
   -0.9722144654173288
   -0.00864376376902154
   0.4829830983425847
   0.8235481653906476
   1.0001042010852128
   -0.14685072150970147
   -1.5220360198746938
   0.3626714554710073
   2.0339249709116896
   -0.814564566324202
   -1.2480920116124272
   0.004605713871298267
   0.6649123901396532
   -1.5126802164071016
   -1.823241893598048
   -0.7609465728110054
   -0.8854420022800573
   -0.49406706529077127
   1.172604324905942
   1.3638727611992927
   -0.532913217275885
   1.111692061243109
   -0.3825808086401601
   -0.23838865252963704
   -0.7527910576596278
   -1.0428942104294805
   -0.24251471782311818
   -1.0675502328788422
   -0.11214664483804349
   1.440729952674464
   0.4632253447250392
   -0.9386938246879204
   1.2225678446937653
   -0.871813255549573
   0.33358269188962275
   1.4607701931260184
   -0.2022907379059582
   -0.5507980770915166),
  :raw
  (-0.9133902951828023
   -1.400189240890926
   0.7171850689738104
   2.2522005779710943
   -0.9385744613364952
   -0.7730391812858866
   2.5560359006607554
   -0.37132459956455044
   -1.016286335526102
   1.2274223606948027
   1.431705716746059
   0.4106140552750013
   -0.9722144654173288
   -0.00864376376902154
   0.4829830983425847
   0.8235481653906476
   1.0001042010852128
   -0.14685072150970147
   -1.5220360198746938
   0.3626714554710073
   2.0339249709116896
   -0.814564566324202
   -1.2480920116124272
   0.004605713871298267
   0.6649123901396532
   -1.5126802164071016
   -1.823241893598048
   -0.7609465728110054
   -0.8854420022800573
   -0.49406706529077127
   1.172604324905942
   1.3638727611992927
   -0.532913217275885
   1.111692061243109
   -0.3825808086401601
   -0.23838865252963704
   -0.7527910576596278
   -1.0428942104294805
   -0.24251471782311818
   -1.0675502328788422
   -0.11214664483804349
   1.440729952674464
   0.4632253447250392
   -0.9386938246879204
   1.2225678446937653
   -0.871813255549573
   0.33358269188962275
   1.4607701931260184
   -0.2022907379059582
   -0.5507980770915166),
  :loocv
  (-0.9498015261381599
   -1.4897123408316775
   0.7364791519329992
   2.4286683540984835
   -0.9636760496084196
   -0.8266176905444703
   2.630136647343255
   -0.37902591033555666
   -1.0666236095220296
   1.2659502857634348
   1.4716971697325016
   0.42508491394572684
   -0.9920981195529521
   -0.008849622720030868
   0.4964617435782756
   0.8430037134620346
   1.0713054357072833
   -0.1503327090944281
   -1.6433406680771299
   0.3858455957462735
   2.0950175791697347
   -0.8702902226335034
   -1.3035596651146915
   0.004720414097183885
   0.6827750526799393
   -1.5852610696596774
   -1.9261966717113073
   -0.7799815553322498
   -0.903558118454171
   -0.5204565696204568
   1.2104952751361338
   1.4510317255105758
   -0.5622007844963817
   1.1363239727339187
   -0.3936449224109432
   -0.24432469977397667
   -0.7765625603022648
   -1.0786097553659422
   -0.24966061838997283
   -1.148104071910613
   -0.11607358402022609
   1.5399902859065124
   0.4997122519686498
   -0.9588538498490189
   1.2481346414976087
   -0.9325202160859831
   0.3511823278652585
   1.4955119129903225
   -0.2187163885677346
   -0.5623105701072049)},
 :fitted
 (11.190152119554517
  13.9526613365153
  2.0650126523964945
  -4.359161603369814
  8.733802699439359
  14.441179518397119
  1.567153910197967
  4.6590856945701695
  12.447018806915962
  9.772718459108024
  9.02913604780838
  10.469593225642388
  5.143580161056823
  7.853640625609774
  9.023022015171334
  7.784364722541174
  14.60530913597505
  7.8160381625283115
  -4.465875059181402
  -3.109431828630323
  9.498526567699743
  -3.522120556062981
  -0.9794997816527502
  2.625843412302104
  8.764921110798324
  -1.4229766154492482
  -2.3739656213018177
  2.5917222282601777
  5.720406981635469
  12.886956001332878
  9.95016488921671
  -3.110095344529622
  -2.214443322519533
  3.674818460458688
  9.256707471722825
  2.626828999787559
  1.0301336847137117
  10.299564612696965
  9.376845500859034
  14.964080416135188
  10.43154422120388
  -3.565075945769995
  15.231774694912085
  4.055429584387445
  4.482315372633323
  -3.629993953902906
  12.814970822395166
  2.997698063632807
  -4.583143018814509
  4.4924976779376085),
 :df {:residual 48, :model 1, :intercept 1},
 :observations 50,
 :names ["Intercept" "x"],
 :cv 1.1113924154551942,
 :r-squared 0.9698387836375214,
 :adjusted-r-squared 0.9692104249633031,
 :sigma2 1.1765564171744207,
 :sigma 1.0846918535576917,
 :tss 1872.428066085171,
 :rss 56.47470802437219,
 :regss 1815.9533580607988,
 :msreg 1815.9533580607988,
 :qt 2.010634757624228,
 :f-statistic 1543.447752740946,
 :p-value 0.0,
 :ll
 {:log-likelihood -73.99117383310164,
  :aic 153.98234766620328,
  :bic 159.7184166824877,
  :aic-rss 10.088494345736,
  :bic-rss 13.912540356592292},
 :analysis #<Delay@7c453542: :not-delivered>,
 :decomposition :cholesky,
 :augmentation nil,
 :hat
 (0.03833562060423677
  0.06009421919051326
  0.026197731339099795
  0.07266030202501395
  0.026047745279260646
  0.06481655288951424
  0.02817372502578912
  0.020318692102574695
  0.04719309937128097
  0.030433995317120814
  0.027173697013844023
  0.03404227766260634
  0.020042023811700227
  0.02326189008525432
  0.027149413645738023
  0.023078840295361593
  0.06646212391806165
  0.023161876119318144
  0.07381588647980979
  0.06006065776245149
  0.029160904837016548
  0.06403111842469658
  0.042550912694421274
  0.024298763524591316
  0.02616185589993938
  0.0457847950988649
  0.0534497746908629
  0.02440440083629409
  0.02004975197954859
  0.050704527274831365
  0.03130202241056286
  0.06006689087422569
  0.05209449724751351
  0.021676838720164536
  0.028106837255816954
  0.024295731253659712
  0.030611188148684652
  0.033112573624317285
  0.02862245801094941
  0.07016248875218896
  0.033831463164765554
  0.06445516841271451
  0.07301583481267074
  0.02102512824480274
  0.02048400545406419
  0.06509988683270819
  0.05011538047093404
  0.023230654040620087
  0.07510022805945117
  0.020473549009568534),
 :effective-dimension 2.0}
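Several entries in this map are tied together by standard least-squares identities. The sketch below is plain Clojure, with values copied from the printed map above, and checks three of them: the t-value is the estimate divided by its standard error, the confidence interval is the estimate minus or plus `:qt` times the standard error, and a leave-one-out residual is the raw residual divided by one minus the corresponding hat value. The helper `close?` is an illustrative name of ours, not part of Fastmath.

```clojure
;; Values copied from the printed model map above.
(def estimate 2.0313776572955815)      ; slope estimate
(def stderr 0.05170644803846648)       ; its standard error
(def qt 2.010634757624228)             ; the :qt entry (a t quantile)
(def raw-residual -0.9133902951828023) ; first :raw residual
(def hat-value 0.03833562060423677)    ; first :hat value

(defn close?
  "True when a and b differ by less than 1e-6 (hypothetical helper)."
  [a b]
  (< (Math/abs (double (- a b))) 1e-6))

;; t-value = estimate / standard error
(close? (/ estimate stderr) 39.28673761895922)

;; confidence interval = estimate -/+ qt * standard error
(close? (- estimate (* qt stderr)) 1.9274148756761498)
(close? (+ estimate (* qt stderr)) 2.1353404389150135)

;; leave-one-out residual = raw residual / (1 - hat value)
(close? (/ raw-residual (- 1 hat-value)) -0.9498015261381599)
```

Each check compares a recomputed value with the one Fastmath reported, up to a small tolerance.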

Printing the model gives a tabular summary. We’ll capture the printed output and display it via Kindly for cleaner formatting.

(kind/code
 (with-out-str
   (println
    simple-linear-data-model)))

Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -1.823242 | -0.892429 | -0.22034 | 0.867687 | 2.556036 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+------------+----------+----------------------|
| Intercept | -5.013951 | 0.306699 | -16.348144 |      0.0 | [-5.63061 -4.397293] |
|         x |  2.031378 | 0.051706 |  39.286738 |      0.0 |   [1.927415 2.13534] |

F-statistic: 1543.447752740946 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.9698387836375214
Adjusted R2: 0.9692104249633031
Residual standard error: 1.0846918535576917 on 48 degrees of freedom
AIC: 153.98234766620328

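As you can see, the estimated intercept (about \(-5.01\)) and slope (about \(2.03\), the coefficient of \(x\)) are close to the true values \(b=-5\) and \(a=2\), and both true values fall inside the reported confidence intervals.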
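For a single predictor, the least-squares estimates have a well-known closed form: the slope is the summed cross-deviations of \(x\) and \(y\) divided by the summed squared deviations of \(x\), and the intercept follows from the two means. The sketch below implements that formula in plain Clojure on a tiny noise-free example. The names `mean` and `simple-ols` are illustrative, and this is not a description of how Fastmath computes its fit internally.

```clojure
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn simple-ols
  "Closed-form least squares for y = a*x + b.
  Returns {:slope a, :intercept b}."
  [xs ys]
  (let [mx (mean xs)
        my (mean ys)
        ;; sum of (x - mean x) * (y - mean y)
        sxy (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        ;; sum of (x - mean x)^2
        sxx (reduce + (map (fn [x] (* (- x mx) (- x mx))) xs))
        slope (/ sxy sxx)]
    {:slope slope
     :intercept (- my (* slope mx))}))

;; Noise-free points on the line y = 2x - 5
(simple-ols [0 1 2 3] [-5 -3 -1 1])
;; => {:slope 2.0, :intercept -5.0}
```

With noisy data such as `simple-linear-data`, the same formula would return estimates near, but not exactly at, the true parameters.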

10.2.2 Dataset ergonomics

Below are a couple of helper functions that simplify how we use regression with datasets and display model summaries. We have similar ideas under development in the Tablemath library, but it is still in an experimental stage and not part of Noj yet.

(defn lm
  "Compute a linear regression model for `dataset`.
  The first column marked as an inference target is used as the target.
  All the remaining columns are used as features.
  The resulting model is of type `fastmath.ml.regression.LMData`,
  created via [Fastmath](https://github.com/generateme/fastmath).
  
  See [fastmath.ml.regression.lm](https://generateme.github.io/fastmath/clay/ml.html#lm)
  for `options`."
  ([dataset]
   (lm dataset nil))
  ([dataset options]
   (let [inference-column-name (-> dataset
                                   ds-mod/inference-target-column-names
                                   first)
         ds-without-target (-> dataset
                               (tc/drop-columns [inference-column-name]))]
     (reg/lm
      ;; ys
      (get dataset inference-column-name)
      ;; xss
      (tc/rows ds-without-target)
      ;; options
      (merge {:names (-> ds-without-target
                         tc/column-names
                         vec)}
             options)))))

(defn summary
  "Generate a summary of a linear model."
  [lmdata]
  (kind/code
   (with-out-str
     (println
      lmdata))))

(-> simple-linear-data
    (ds-mod/set-inference-target :y)
    lm
    summary)

Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -1.823242 | -0.892429 | -0.22034 | 0.867687 | 2.556036 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+------------+----------+----------------------|
| Intercept | -5.013951 | 0.306699 | -16.348144 |      0.0 | [-5.63061 -4.397293] |
|        :x |  2.031378 | 0.051706 |  39.286738 |      0.0 |   [1.927415 2.13534] |

F-statistic: 1543.447752740946 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0

R2: 0.9698387836375214
Adjusted R2: 0.9692104249633031
Residual standard error: 1.0846918535576917 on 48 degrees of freedom
AIC: 153.98234766620328
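The headline figures in this summary follow from the sums of squares stored in the model map shown earlier (`:rss`, `:tss`) and the degrees of freedom. A minimal sketch in plain Clojure, using the standard definitions and values copied from that map; the var names are ours:

```clojure
;; Values copied from the model map printed earlier.
(def rss 56.47470802437219)   ; residual sum of squares
(def tss 1872.428066085171)   ; total sum of squares
(def n 50)                    ; number of observations
(def df-residual 48)          ; n minus the two estimated coefficients

;; R2 = 1 - RSS / TSS
(def r-squared (- 1 (/ rss tss)))

;; Adjusted R2 rescales the unexplained share by (n - 1) / residual df
(def adjusted-r-squared
  (- 1 (* (- 1 r-squared) (/ (dec n) (double df-residual)))))

;; Residual standard error = sqrt(RSS / residual df)
(def sigma (Math/sqrt (/ rss df-residual)))

[r-squared adjusted-r-squared sigma]
```

The three numbers should agree with R2, Adjusted R2, and the residual standard error in the summary, up to floating-point rounding.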

10.2.3 Prediction

Once we have a linear model, we can generate new predictions. For instance, let’s predict \(y\) when \(x=3\):

(simple-linear-data-model [3])

1.0801816719296964
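Calling the model as a function amounts to evaluating the fitted line. As a quick check in plain Clojure, using the `:intercept` and `:beta` values from the model map printed earlier (the var names are ours):

```clojure
;; Values copied from the printed model map.
(def intercept -5.013951299957048)
(def slope 2.0313776572955815)

;; Fitted line evaluated at x = 3
(def prediction-at-3
  (+ intercept (* slope 3)))

prediction-at-3
```

The result should agree with the prediction above up to floating-point rounding.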

10.2.4 Displaying the regression line

We can visualize the fitted line by adding a smooth layer to our scatter plot. Tableplot makes this convenient:

(-> simple-linear-data
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=name "prediction"}))

Alternatively, we can build the regression line explicitly. We’ll obtain predictions and then plot them:

(-> simple-linear-data
    (tc/map-columns :prediction
                    [:x]
                    simple-linear-data-model)
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=y :prediction
                          :=name "prediction"}))

10.3 Multiple linear regression

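The same approach extends to several predictors. Below, \(y\) depends linearly on two features, \(x_0\) and \(x_1\), with true coefficients \(a_0=2\) and \(a_1=-3\) and intercept \(b=-5\).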

(def multiple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a0 2
        a1 -3
        b -5]
    (-> {:x0 (repeatedly n #(rand/frandom rng 0 10))
         :x1 (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x0 :x1]
                        (fn [x0 x1]
                          (+ (* a0 x0)
                             (* a1 x1)
                             b
                             (rand/grandom rng)))))))

(def multiple-linear-data-model
  (-> multiple-linear-data
      (ds-mod/set-inference-target :y)
      lm))

(summary multiple-linear-data-model)

Residuals:

|      :min |       :q1 |  :median |      :q3 |     :max |
|-----------+-----------+----------+----------+----------|
| -2.172344 | -0.717733 | 0.032101 | 0.714545 | 2.011173 |

Coefficients:

|     :name | :estimate |  :stderr |   :t-value | :p-value |  :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -5.074428 |  0.42848 | -11.842872 |      0.0 | [-5.936418 -4.212439] |
|       :x0 |  1.921843 | 0.051446 |  37.356694 |      0.0 |   [1.818348 2.025339] |
|       :x1 | -2.958076 | 0.048685 | -60.759489 |      0.0 | [-3.056018 -2.860135] |

F-statistic: 2925.9909328276863 on degrees of freedom: {:residual 47, :model 2, :intercept 1}
p-value: 0.0

R2: 0.9920325233963441
Adjusted R2: 0.9916934818387417
Residual standard error: 0.9784374924014486 on 47 degrees of freedom
AIC: 144.62024561182105
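With two predictors, the model has the form \(y = a_0 x_0 + a_1 x_1 + b\), so a prediction is the intercept plus the dot product of the coefficients with the feature values. A minimal sketch in plain Clojure; `predict-linear` is an illustrative name, not a Fastmath function:

```clojure
(defn predict-linear
  "Intercept plus the dot product of `coefficients` and `features`."
  [intercept coefficients features]
  (reduce + intercept (map * coefficients features)))

;; The true parameters used to generate the data: a0 = 2, a1 = -3, b = -5
(predict-linear -5 [2 -3] [1 1])
;; => -6

;; The rounded estimates from the summary table above
(predict-linear -5.074428 [1.921843 -2.958076] [1 1])
```

The second call uses the rounded estimates, so its result lands near, not exactly on, the noise-free value of -6.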

Visualizing multiple dimensions is more involved. In the case of two features, we can use a 3D scatterplot and a 3D surface. Let us do that using Tableplot’s Plotly API.

(-> multiple-linear-data
    (plotly/layer-point {:=coordinates :3d
                         :=x :x0
                         :=y :x1
                         :=z :y})
    (plotly/layer-surface {:=dataset (let [xs (range 11)
                                           ys (range 11)]
                                       (tc/dataset
                                        {:x xs
                                         :y ys
                                         :z (for [y ys]
                                              (for [x xs]
                                                (multiple-linear-data-model
                                                 [x y])))}))
                           :=mark-opacity 0.5}))
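The `:z` value passed to the surface layer is a grid: the outer `for` walks over the `y` values and the inner one over the `x` values, so each inner sequence is one row of heights at a fixed `y`. A small sketch of that shape in plain Clojure, with a hypothetical plane standing in for the fitted model:

```clojure
(defn plane
  "Hypothetical stand-in for the fitted model: z = 2x - 3y - 5."
  [x y]
  (+ (* 2 x) (* -3 y) -5))

(def grid
  (let [xs (range 3)
        ys (range 2)]
    ;; one row per y value, one column per x value
    (vec (for [y ys]
           (vec (for [x xs]
                  (plane x y)))))))

grid
;; => [[-5 -3 -1] [-8 -6 -4]]
```

Swapping the two `for` forms would transpose the grid, which matters whenever the `x` and `y` ranges differ in length.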

10.4 Coming soon: Polynomial regression 🛠

10.5 Coming soon: One-hot encoding 🛠

10.6 Coming soon: Regularization 🛠

10.7 Example: Predicting Bicycle Traffic

As in the Python Data Science Handbook, we’ll try predicting the daily number of bicycle trips across the Fremont Bridge in Seattle. The features will include weather, season, day of week, and related factors.

10.7.1 Reading and parsing data

(def column-name-mapping
  {"Fremont Bridge Sidewalks, south of N 34th St" :total
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk" :west
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk" :east
   "Date" :datetime})

(column-name-mapping
 "Fremont Bridge Sidewalks, south of N 34th St")

:total

(def counts
  (tc/dataset "data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz"
              {:key-fn column-name-mapping
               :parser-fn {"Date" [:local-date-time "MM/dd/yyyy hh:mm:ss a"]}}))

counts

data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz [106608 4]:

:datetime	:total	:west	:east
2012-10-02T13:00	55	7	48
2012-10-02T14:00	130	55	75
2012-10-02T15:00	152	81	71
2012-10-02T16:00	278	167	111
2012-10-02T17:00	563	393	170
2012-10-02T18:00	381	236	145
2012-10-02T19:00	175	104	71
2012-10-02T20:00	86	51	35
2012-10-02T21:00	63	35	28
2012-10-02T22:00	42	27	15
…	…	…	…
2024-11-30T13:00	147	62	85
2024-11-30T14:00	154	73	81
2024-11-30T15:00	118	57	61
2024-11-30T16:00	88	51	37
2024-11-30T17:00	46	11	35
2024-11-30T18:00	46	18	28
2024-11-30T19:00	69	14	55
2024-11-30T20:00	18	8	10
2024-11-30T21:00	49	15	34
2024-11-30T22:00	14	4	10
2024-11-30T23:00	10	5	5
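The `:parser-fn` option above gives a date pattern for the "Date" column. Assuming the pattern letters have their usual java.time `DateTimeFormatter` meaning (an assumption on our part about how the dataset reader interprets the string), the parsing step can be sketched in plain Clojure. The input string below is a made-up example in that format, and `parse-timestamp` is an illustrative name:

```clojure
(import '[java.time LocalDateTime]
        '[java.time.format DateTimeFormatter]
        '[java.util Locale])

(def timestamp-format
  ;; Same pattern string as in the :parser-fn option. Locale/ENGLISH is
  ;; pinned so that the AM/PM marker does not depend on the machine.
  (DateTimeFormatter/ofPattern "MM/dd/yyyy hh:mm:ss a" Locale/ENGLISH))

(defn parse-timestamp
  "Parse a string such as \"10/02/2012 01:00:00 PM\" into a LocalDateTime."
  [^String s]
  (LocalDateTime/parse s timestamp-format))

;; A hypothetical raw value in that format
(str (parse-timestamp "10/02/2012 01:00:00 PM"))
;; => "2012-10-02T13:00"
```

Here `hh` is the 12-hour clock hour and `a` is the AM/PM marker, which is why 01:00:00 PM becomes 13:00.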

(def weather
  (tc/dataset "data/seattle-bikes-and-weather/BicycleWeather.csv.gz"
              {:key-fn keyword}))

weather

data/seattle-bikes-and-weather/BicycleWeather.csv.gz [1340 26]:

:STATION	:STATION_NAME	:DATE	:PRCP	:SNWD	:SNOW	:TMAX	:TMIN	:AWND	:WDF2	:WDF5	:WSF2	:WSF5	:FMTM	:WT14	:WT01	:WT17	:WT05	:WT02	:WT22	:WT04	:WT13	:WT16	:WT08	:WT18	:WT03
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120101	0	0	0	128	50	47	100	90	89	112	-9999	1	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120102	109	0	0	106	28	45	180	200	130	179	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	1	1	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120103	8	0	0	117	72	23	180	170	54	67	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	1	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120104	203	0	0	122	56	47	180	190	107	148	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	1	1	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120105	13	0	0	89	28	61	200	220	107	165	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	-9999	1	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120106	25	0	0	44	22	22	180	180	45	63	-9999	1	1	-9999	-9999	-9999	-9999	-9999	-9999	1	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120107	0	0	0	72	28	23	170	180	54	63	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	1	1	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120108	0	0	0	100	28	20	160	200	45	63	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120109	43	0	0	94	50	34	200	200	67	89	-9999	1	1	-9999	-9999	-9999	-9999	-9999	1	1	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20120110	10	0	0	61	6	34	20	30	89	107	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	-9999	1	-9999	-9999	-9999
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150822	0	0	0	267	122	25	20	20	63	76	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	1	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150823	0	0	0	278	139	18	10	10	67	81	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	1	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150824	0	0	0	239	122	23	190	190	54	67	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150825	0	0	0	256	122	34	350	360	63	76	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150826	0	0	0	283	139	17	30	40	58	67	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150827	0	0	0	294	144	21	230	200	45	63	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150828	5	0	0	233	156	26	230	240	81	103	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150829	325	0	0	222	133	58	210	210	157	206	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150830	102	0	0	200	128	47	200	200	89	112	-9999	-9999	1	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150831	0	0	0	189	161	58	210	210	112	134	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999
GHCND:USW00024233	SEATTLE TACOMA INTERNATIONAL AIRPORT WA US	20150901	58	0	0	194	139	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999	-9999

10.7.2 Preprocessing

The bike counts are recorded hourly, but our weather information is daily. We’ll need to aggregate the hourly counts into daily totals before combining the datasets.

In the Python handbook, one does:

daily = counts.resample('d').sum()

Since Tablecloth’s time series features are still evolving, we’ll be a bit more explicit:

(def daily-totals
  (-> counts
      (tc/group-by (fn [{:keys [datetime]}]
                     {:date (datetime/local-date-time->local-date
                             datetime)}))
      (tc/aggregate-columns [:total :west :east]
                            tcc/sum)))

daily-totals

_unnamed [4443 4]:

:date	:total	:west	:east
2012-10-02	1938.0	1165.0	773.0
2012-10-03	3521.0	1761.0	1760.0
2012-10-04	3475.0	1767.0	1708.0
2012-10-05	3148.0	1590.0	1558.0
2012-10-06	2006.0	926.0	1080.0
2012-10-07	2142.0	951.0	1191.0
2012-10-08	3537.0	1708.0	1829.0
2012-10-09	3501.0	1742.0	1759.0
2012-10-10	3235.0	1587.0	1648.0
2012-10-11	3047.0	1468.0	1579.0
…	…	…	…
2024-11-20	2300.0	787.0	1513.0
2024-11-21	2382.0	775.0	1607.0
2024-11-22	1473.0	536.0	937.0
2024-11-23	1453.0	652.0	801.0
2024-11-24	727.0	311.0	416.0
2024-11-25	1483.0	486.0	997.0
2024-11-26	2173.0	727.0	1446.0
2024-11-27	1522.0	548.0	974.0
2024-11-28	631.0	261.0	370.0
2024-11-29	833.0	368.0	465.0
2024-11-30	1178.0	509.0	669.0
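The same group-and-sum idea can be sketched without Tablecloth, using plain Clojure maps and java.time. The rows below are made up for illustration, and `daily-sums` is a hypothetical helper, not part of any library used in this chapter:

```clojure
(import '[java.time LocalDateTime])

(def hourly-sample
  ;; hypothetical hourly records, not taken from the Fremont data
  [{:datetime (LocalDateTime/of 2020 1 1 8 0) :total 10}
   {:datetime (LocalDateTime/of 2020 1 1 9 0) :total 20}
   {:datetime (LocalDateTime/of 2020 1 2 8 0) :total 5}])

(defn daily-sums
  "Group rows by calendar date and sum :total within each date."
  [rows]
  (->> rows
       (group-by (fn [row]
                   (.toLocalDate ^LocalDateTime (:datetime row))))
       (map (fn [[date day-rows]]
              {:date date
               :total (reduce + (map :total day-rows))}))
       (sort-by :date)))

(map :total (daily-sums hourly-sample))
;; => (30 5)
```

The Tablecloth version above does the same thing for three columns at once and returns a dataset rather than a sequence of maps.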

10.7.3 Prediction by day-of-week

Next, we’ll explore a simple regression by day of week.

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

We’ll convert the numeric day-of-week to the corresponding keyword:

(def idx->day-of-week
  (comp days-of-week dec))

For example,

(idx->day-of-week 1)

:Mon

(idx->day-of-week 7)

:Sun
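A Clojure vector can be called as a function of an index, which is why composing `days-of-week` with `dec` turns the numbers 1 through 7 into `:Mon` through `:Sun`. That numbering is the ISO-8601 one used by java.time, where Monday is 1 and Sunday is 7. A minimal sketch using java.time directly; `date->day-of-week` is an illustrative name:

```clojure
(import '[java.time LocalDate])

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

(def idx->day-of-week
  (comp days-of-week dec))

(defn date->day-of-week
  "Keyword for the ISO day of week of a java.time.LocalDate."
  [^LocalDate date]
  (idx->day-of-week (.getValue (.getDayOfWeek date))))

;; The first date in the daily totals
(date->day-of-week (LocalDate/of 2012 10 2))
;; => :Tue
```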

Now, let’s build our dataset:

(def totals-with-day-of-week
  (-> daily-totals
      (tc/add-column :day-of-week
                     (fn [ds]
                       (map idx->day-of-week
                            (datetime/long-temporal-field
                             :day-of-week
                             (:date ds)))))
      (tc/select-columns [:total :day-of-week])))

totals-with-day-of-week

_unnamed [4443 2]:

:total	:day-of-week
1938.0	:Tue
3521.0	:Wed
3475.0	:Thu
3148.0	:Fri
2006.0	:Sat
2142.0	:Sun
3537.0	:Mon
3501.0	:Tue
3235.0	:Wed
3047.0	:Thu
…	…
2300.0	:Wed
2382.0	:Thu
1473.0	:Fri
1453.0	:Sat
727.0	:Sun
1483.0	:Mon
2173.0	:Tue
1522.0	:Wed
631.0	:Thu
833.0	:Fri
1178.0	:Sat

(def totals-with-one-hot-days-of-week
  (-> (reduce (fn [dataset day-of-week]
                (-> dataset
                    (tc/add-column day-of-week
                                   #(-> (:day-of-week %)
                                        (tcc/eq day-of-week)
                                        ;; convert booleans to 0/1
                                        (tcc/* 1)))))
              totals-with-day-of-week
              days-of-week)
      (tc/drop-columns [:day-of-week])
      (ds-mod/set-inference-target :total)))

(-> totals-with-one-hot-days-of-week
    (tc/select-columns ds-mod/inference-column?))

_unnamed [0 0]
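The indicator columns built above follow the usual one-hot pattern: each day keyword becomes seven 0/1 values, exactly one of which is 1. A sketch of the same encoding for a single value in plain Clojure; `one-hot` is an illustrative name:

```clojure
(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

(defn one-hot
  "Map a day keyword to a map of seven 0/1 indicators."
  [day]
  (into {}
        (for [d days-of-week]
          [d (if (= d day) 1 0)])))

(one-hot :Wed)
;; => {:Mon 0, :Tue 0, :Wed 1, :Thu 0, :Fri 0, :Sat 0, :Sun 0}

;; The indicators of any single day add up to 1
(reduce + (vals (one-hot :Sat)))
;; => 1
```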

Since the seven binary columns sum to 1 in every row, together they are collinear with a constant intercept column, so we fit the model without an intercept. This way, each coefficient directly estimates the expected daily bike count for that day of the week.

(def days-of-week-model
  (lm totals-with-one-hot-days-of-week
      {:intercept? false}))

Let’s take a look at the results:

(-> days-of-week-model
    println
    with-out-str
    kind/code)

Residuals:

|         :min |         :q1 |    :median |        :q3 |        :max |
|--------------+-------------+------------+------------+-------------|
| -3034.944882 | -841.755906 | -90.179528 | 828.820472 | 3745.244094 |

Coefficients:

| :name |   :estimate |   :stderr |  :t-value | :p-value |      :confidence-interval |
|-------+-------------+-----------+-----------+----------+---------------------------|
|  :Mon | 2785.741325 | 44.521658 | 62.570476 |      0.0 | [2698.456664 2873.025986] |
|  :Tue | 3115.527559 | 44.486587 | 70.032964 |      0.0 | [3028.311653 3202.743465] |
|  :Wed | 3109.944882 | 44.486587 | 69.907472 |      0.0 | [3022.728976 3197.160788] |
|  :Thu | 2961.179528 | 44.486587 | 66.563423 |      0.0 | [2873.963622 3048.395433] |
|  :Fri | 2623.755906 | 44.486587 | 58.978583 |      0.0 |     [2536.54 2710.971811] |
|  :Sat | 1688.029921 | 44.486587 | 37.944693 |      0.0 | [1600.814015 1775.245827] |
|  :Sun | 1576.178233 | 44.521658 | 35.402506 |      0.0 | [1488.893572 1663.462894] |

F-statistic: 3472.719277560978 on degrees of freedom: {:residual 4436, :model 7, :intercept 0}
p-value: 0.0

R2: 0.8456776967289252
Adjusted R2: 0.8454341764126724
Residual standard error: 1121.0266948292917 on 4436 degrees of freedom
AIC: 75015.17638480052

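The weekday estimates range from about 2,624 trips per day (Friday) to about 3,116 (Tuesday), while Saturday and Sunday come in at about 1,688 and 1,576, so weekend traffic is clearly lower than weekday traffic.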
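Because every row has exactly one indicator set to 1, the least-squares estimate for each day in a model without an intercept is simply the mean of the target over the rows for that day. A minimal sketch of that group-mean computation in plain Clojure, on made-up numbers; `mean-by-day` is an illustrative name:

```clojure
(def sample-rows
  ;; hypothetical daily totals, not taken from the Fremont data
  [{:day :Mon :total 3000.0}
   {:day :Mon :total 2600.0}
   {:day :Sat :total 1700.0}
   {:day :Sat :total 1500.0}])

(defn mean-by-day
  "Mean of :total for each distinct :day."
  [rows]
  (into {}
        (for [[day day-rows] (group-by :day rows)]
          [day (/ (reduce + (map :total day-rows))
                  (count day-rows))])))

(mean-by-day sample-rows)
;; => {:Mon 2800.0, :Sat 1600.0}
```

Applied to the real data, the same computation should reproduce the `:estimate` column of the table above, up to floating-point rounding.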

10.7.4 Coming soon: more predictors for the bike counts 🛠

source: notebooks/noj_book/linear_regression_intro.clj