10 Introduction to Linear Regression
last update: 2024-12-29
In this tutorial, we introduce the fundamentals of linear regression, guided by the In Depth: Linear Regression chapter of the Python Data Science Handbook by Jake VanderPlas.
10.1 Setup
(ns noj-book.linear-regression-intro
  (:require
   [tech.v3.dataset :as ds]
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.datatype.datetime :as datetime]
   [tech.v3.dataset.modelling :as ds-mod]
   [fastmath.ml.regression :as reg]
   [scicloj.kindly.v4.kind :as kind]
   [fastmath.random :as rand]
   [scicloj.tableplot.v1.plotly :as plotly]))

10.2 Simple Linear Regression
We begin with the classic straight-line model: for data points \((x, y)\), we assume there is a linear relationship allowing us to predict \(y\) as \[y = ax + b.\] In this formulation, \(a\) is the slope and \(b\) is the intercept, the point where our line would cross the \(y\) axis.
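As a warm-up, the straight-line model can be expressed as a tiny Clojure function. A minimal sketch (the `line` helper is our own, not part of any library used below):

```clojure
(defn line
  "Return a function computing y = ax + b."
  [a b]
  (fn [x]
    (+ (* a x) b)))

;; With a = 2 and b = -5, x = 3 gives y = 1:
((line 2 -5) 3)
;; => 1
```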
To illustrate, we’ll use Fastmath and Tablecloth to create synthetic data in which the relationship is known to hold with \(a=2\) and \(b=-5\).
For each row in the dataset below, we draw \(x\) uniformly from 0 to 10 and compute \(y = ax + b\) plus an extra random noise term, drawn independently for every row from a standard Gaussian distribution. For example, without noise, \(x=3\) would give \(y = 2\cdot 3 - 5 = 1\).
(def simple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a 2
        b -5]
    (-> {:x (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x]
                        (fn [x]
                          (+ (* a x)
                             b
                             (rand/grandom rng)))))))

simple-linear-data

_unnamed [50 2]:
| :x | :y |
|---|---|
| 3.48859739 | 1.08560681 |
| 0.48541188 | -4.07867530 |
| 8.93892097 | 14.11140130 |
| 7.93593836 | 10.89734818 |
| 6.79865551 | 9.01246325 |
| 7.05022621 | 8.12570237 |
| 3.84797812 | 2.99742109 |
| 1.95390701 | -1.14582381 |
| 8.07666588 | 13.35579347 |
| 8.58096504 | 10.99638794 |
| … | … |
| 3.17139387 | 1.43395259 |
| 4.99531937 | 4.96595427 |
| 9.09207916 | 12.42714471 |
| 8.42910576 | 10.15793283 |
| 6.37899351 | 7.97876469 |
| 5.93604755 | 5.46140711 |
| 6.86225224 | 8.71118456 |
| 5.55634880 | 6.14850255 |
| 2.90768981 | 0.92595381 |
| 6.86996031 | 9.85371056 |
| 6.47757578 | 8.27424061 |
Let’s plot these points using Tableplot’s Plotly API.
(-> simple-linear-data
    plotly/layer-point)

10.2.1 Regression using Fastmath
We can now fit a linear model to the data using the Fastmath library.
(def simple-linear-data-model
  (reg/lm
   ;; ys - a "column" sequence of `y` values:
   (simple-linear-data :y)
   ;; xss - a sequence of "rows", each containing `x` values
   ;; (one `x` per row, in our case):
   (-> simple-linear-data
       (tc/select-columns [:x])
       tc/rows)
   ;; options
   {:names ["x"]}))

(type simple-linear-data-model)

fastmath.ml.regression.LMData

simple-linear-data-model

{:model :ols,
:intercept? true,
:offset? false,
:transformer nil,
:xtxinv
#object[org.apache.commons.math3.linear.BlockRealMatrix 0x9d78427 "BlockRealMatrix{{0.0973576238,-0.0143433434},{-0.0143433434,0.0026594858}}"],
:intercept -4.9052311451903545,
:beta [1.9507813575717319],
:coefficients
[{:estimate -4.9052311451903545,
:stderr 0.3096209952294687,
:t-value -15.842695491482909,
:p-value 1.0700555696919785E-20,
:confidence-interval [-5.527765879888929 -4.28269641049178]}
{:estimate 1.9507813575717319,
:stderr 0.05117339798525415,
:t-value 38.12100494350323,
:p-value 0.0,
:confidence-interval [1.8478903449168422 2.0536723702266215]}],
:offset
(0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0),
:weights
(1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0),
:residuals
{:weighted
(-0.8146528016351406
-0.12037660999682886
1.5787520509647823
0.32129871852937697
0.6550039716881173
-0.7225163456889181
0.3960882600520459
-0.052238036881430805
2.5052153848219927
-0.8379675458414706
-1.2912891912443083
-0.6727643053993022
0.7952809972951282
-0.11646293690949072
-1.1915764566456084
-0.4624177689637605
1.1492885514199793
-0.610470852752858
1.860372406148107
-0.34945739991156266
-1.311106630684344
0.9151607185455184
-0.275609857770605
-0.8069174439310363
-1.7552727585787924
0.7002131212331131
-0.4950128593848473
1.9325811854163994
0.7623623077074146
-0.7066619685381852
0.7736011816324582
-0.8560121439395614
-0.7664565890475838
0.32215362692215915
-0.7643267127708198
1.1118376746460648
-1.9074493112822586
-0.00790058966717555
0.8912084704361929
0.15248769519294192
0.12640951802971667
-0.4042826736282823
-1.380178398885846
0.43997421760270505
-1.213292647916278
0.22966197329011706
0.21451203558504872
0.1589178832243482
1.3571512078030334
0.5431376797099183),
:raw
(-0.8146528016351406
-0.12037660999682886
1.5787520509647823
0.32129871852937697
0.6550039716881173
-0.7225163456889181
0.3960882600520459
-0.052238036881430805
2.5052153848219927
-0.8379675458414706
-1.2912891912443083
-0.6727643053993022
0.7952809972951282
-0.11646293690949072
-1.1915764566456084
-0.4624177689637605
1.1492885514199793
-0.610470852752858
1.860372406148107
-0.34945739991156266
-1.311106630684344
0.9151607185455184
-0.275609857770605
-0.8069174439310363
-1.7552727585787924
0.7002131212331131
-0.4950128593848473
1.9325811854163994
0.7623623077074146
-0.7066619685381852
0.7736011816324582
-0.8560121439395614
-0.7664565890475838
0.32215362692215915
-0.7643267127708198
1.1118376746460648
-1.9074493112822586
-0.00790058966717555
0.8912084704361929
0.15248769519294192
0.12640951802971667
-0.4042826736282823
-1.380178398885846
0.43997421760270505
-1.213292647916278
0.22966197329011706
0.21451203558504872
0.1589178832243482
1.3571512078030334
0.5431376797099183),
:loocv
(-0.8395436712975999
-0.13142403638812628
1.6678731312884962
0.3337107262756963
0.6719731211994787
-0.7427958360262121
0.40680794024941325
-0.05507203057851599
2.6072903577612627
-0.8793165074253352
-1.3396950154087999
-0.7341617611389729
0.8504061015389626
-0.12007428712654346
-1.2616003236053106
-0.4951471169530465
1.1787959567973785
-0.6425223015984393
1.9370611917542246
-0.36018320616443394
-1.3667564032623383
0.9365169769813158
-0.2826525067420349
-0.8290171819593157
-1.8101288693490762
0.7628828911176664
-0.5338975631591838
1.9837497202158507
0.796438965125828
-0.7505814435252348
0.8444812200189481
-0.8772368730691381
-0.7918879174922124
0.33651527906408657
-0.7823953219707781
1.1457623866283695
-1.9914613360839173
-0.008170744597601103
0.9199579419608932
0.15771259601187043
0.1290447648458839
-0.42844018382914206
-1.4444726006858617
0.4501402088469974
-1.2390443057270368
0.23572938198327448
0.21890562965795246
0.16492625983341427
1.3930919461226192
0.5559960717547276)},
:fitted
(1.9002596132174014
-3.9582986942596325
12.532649249122683
10.57604945903313
8.357459280169966
8.84821871496077
2.6013328260552084
-1.0935857699194007
10.850578081525002
11.83435548894087
10.420431737268755
-3.9261054606223658
13.624458809337044
9.412982472024991
-1.5118106063981291
13.737843361928908
2.93255465504805
12.155118092477219
10.910434915363
9.356560885395357
0.1712256489273276
7.6189267104188705
8.268206106695173
8.702425818816414
9.455911434566541
-3.8144262064179184
-3.0788816878027747
2.736531151909924
-0.09426277589477028
13.039544020516862
-3.9488652524096977
8.065921858543268
9.779463723014311
-0.08062503982062807
3.5117682171853026
1.9078231103880476
-0.018562689835587065
1.2923017851755674
1.6034888580423017
1.281464896473313
4.839544750007246
12.831427386771557
11.538111229817742
7.538790476529554
6.674699760843557
8.481522586607484
5.933990511301625
0.7670359293672089
8.496559351060013
7.731102926665308),
:df {:residual 48, :model 1, :intercept 1},
:observations 50,
:names ["Intercept" "x"],
:cv 1.011234785455179,
:r-squared 0.9680258142076104,
:adjusted-r-squared 0.9673596853369356,
:sigma2 0.9846703003839242,
:sigma 0.9923055478953668,
:tss 1478.1979039378052,
:rss 47.26417441842836,
:regss 1430.9337295193768,
:msreg 1430.9337295193768,
:qt 2.010634757624228,
:f-statistic 1453.2110179025954,
:p-value 0.0,
:ll
{:log-likelihood -69.54016644063114,
:aic 145.08033288126228,
:bic 150.8164018975467,
:aic-rss 1.1864795607950285,
:bic-rss 5.010525571651321},
:analysis #<Delay@498017ec: :not-delivered>,
:decomposition :cholesky,
:augmentation nil,
:hat
(0.029648093974656325
0.08405940568338455
0.05343396847868426
0.03719391307807442
0.025252720646134413
0.0273015670709528
0.026350715255939027
0.05145976400933276
0.03914982949076533
0.047023979687286444
0.036131973029489006
0.08362932937888126
0.06482209399024248
0.030075966332798736
0.05550400205953746
0.06610024953934973
0.025031817599346542
0.04988379199577221
0.03959027517177574
0.029778751672210536
0.04071667229446512
0.02280392022858478
0.024916279896493018
0.026657756327858644
0.030305085841766728
0.08214861103089953
0.07283176859667147
0.02579384600688622
0.04278627604945172
0.05851393658331623
0.08393323226880006
0.02419498060463299
0.032114808021273644
0.04267756335424035
0.023093963745137454
0.02960885466151044
0.042186118946633946
0.033063685591747397
0.031250854211248714
0.033129255056680865
0.020421183449901723
0.056384790952506786
0.04451050284338219
0.022584055022171378
0.020783484248086334
0.025738873288133065
0.020070722163559113
0.036430684932374835
0.025799257844838875
0.023126767792132272),
:effective-dimension 2.0000000000000004}

Printing the model gives a tabular summary. We'll capture the printed output and display it via Kindly for cleaner formatting.
(kind/code
 (with-out-str
   (println
    simple-linear-data-model)))

Residuals:
| :min | :q1 | :median | :q3 | :max |
|-----------+-----------+-----------+---------+----------|
| -1.907449 | -0.764859 | -0.030069 | 0.71575 | 2.505215 |
Coefficients:
| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -4.905231 | 0.309621 | -15.842695 | 0.0 | [-5.527766 -4.282696] |
| x | 1.950781 | 0.051173 | 38.121005 | 0.0 | [1.84789 2.053672] |
F-statistic: 1453.2110179025954 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0
R2: 0.9680258142076104
Adjusted R2: 0.9673596853369356
Residual standard error: 0.9923055478953668 on 48 degrees of freedom
AIC: 145.08033288126228
As you can see, the estimated coefficients closely match our intercept \(b=-5\) and slope \(a=2\) (the coefficient of \(x\)).
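Beyond the printed summary, the fitted model behaves like a Clojure map, so the numbers can be pulled out programmatically. A minimal sketch, relying on the `:intercept`, `:beta`, and `:coefficients` keys visible in the model printout above:

```clojure
;; The fitted intercept (close to b = -5):
(:intercept simple-linear-data-model)

;; The vector of slopes (a single element here, close to a = 2):
(:beta simple-linear-data-model)

;; Per-coefficient estimates and standard errors:
(map (juxt :estimate :stderr)
     (:coefficients simple-linear-data-model))
```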
10.2.2 Dataset ergonomics
Below are a couple of helper functions that simplify how we use regression with datasets and display model summaries. We have similar ideas under development in the Tablemath library, but it is still in an experimental stage and not part of Noj yet.
(defn lm
  "Compute a linear regression model for `dataset`.
  The first column marked as target is the target.
  All the columns unmarked as target are the features.
  The resulting model is of type `fastmath.ml.regression.LMData`,
  created via [Fastmath](https://github.com/generateme/fastmath).
  See [fastmath.ml.regression.lm](https://generateme.github.io/fastmath/clay/ml.html#lm)
  for `options`."
  ([dataset]
   (lm dataset nil))
  ([dataset options]
   (let [inference-column-name (-> dataset
                                   ds-mod/inference-target-column-names
                                   first)
         ds-without-target (-> dataset
                               (tc/drop-columns [inference-column-name]))]
     (reg/lm
      ;; ys
      (get dataset inference-column-name)
      ;; xss
      (tc/rows ds-without-target)
      ;; options
      (merge {:names (-> ds-without-target
                         tc/column-names
                         vec)}
             options)))))

(defn summary
  "Generate a summary of a linear model."
  [lmdata]
  (kind/code
   (with-out-str
     (println
      lmdata))))

(-> simple-linear-data
    (ds-mod/set-inference-target :y)
    lm
    summary)

Residuals:
| :min | :q1 | :median | :q3 | :max |
|-----------+-----------+-----------+---------+----------|
| -1.907449 | -0.764859 | -0.030069 | 0.71575 | 2.505215 |
Coefficients:
| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -4.905231 | 0.309621 | -15.842695 | 0.0 | [-5.527766 -4.282696] |
| :x | 1.950781 | 0.051173 | 38.121005 | 0.0 | [1.84789 2.053672] |
F-statistic: 1453.2110179025954 on degrees of freedom: {:residual 48, :model 1, :intercept 1}
p-value: 0.0
R2: 0.9680258142076104
Adjusted R2: 0.9673596853369356
Residual standard error: 0.9923055478953668 on 48 degrees of freedom
AIC: 145.08033288126228
10.2.3 Prediction
Once we have a linear model, we can generate new predictions. For instance, let’s predict \(y\) when \(x=3\):
(simple-linear-data-model [3])

0.9471129275248416

10.2.4 Displaying the regression line
We can visualize the fitted line by adding a smooth layer to our scatter plot. Tableplot makes this convenient:
(-> simple-linear-data
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=name "prediction"}))

Alternatively, we can build the regression line explicitly. We'll obtain predictions and then plot them:
(-> simple-linear-data
    (tc/map-columns :prediction
                    [:x]
                    simple-linear-data-model)
    (plotly/layer-point {:=name "data"})
    (plotly/layer-smooth {:=y :prediction
                          :=name "prediction"}))

10.3 Multiple linear regression
We can easily extend these ideas to multiple linear predictors. Below, we assume \(y = a_0 x_0 + a_1 x_1 + b\) with \(a_0=2\), \(a_1=-3\), and \(b=-5\).
(def multiple-linear-data
  (let [rng (rand/rng 1234)
        n 50
        a0 2
        a1 -3
        b -5]
    (-> {:x0 (repeatedly n #(rand/frandom rng 0 10))
         :x1 (repeatedly n #(rand/frandom rng 0 10))}
        tc/dataset
        (tc/map-columns :y
                        [:x0 :x1]
                        (fn [x0 x1]
                          (+ (* a0 x0)
                             (* a1 x1)
                             b
                             (rand/grandom rng)))))))

(def multiple-linear-data-model
  (-> multiple-linear-data
      (ds-mod/set-inference-target :y)
      lm))

(summary multiple-linear-data-model)

Residuals:
| :min | :q1 | :median | :q3 | :max |
|-----------+---------+----------+----------+----------|
| -2.514773 | -0.7368 | 0.032963 | 0.681916 | 2.306532 |
Coefficients:
| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-----------+-----------+----------+------------+----------+-----------------------|
| Intercept | -5.200192 | 0.388033 | -13.401408 | 0.0 | [-5.980815 -4.41957] |
| :x0 | 1.959646 | 0.05302 | 36.960802 | 0.0 | [1.852985 2.066308] |
| :x1 | -2.924595 | 0.05133 | -56.976684 | 0.0 | [-3.027857 -2.821333] |
F-statistic: 2391.536381491289 on degrees of freedom: {:residual 47, :model 2, :intercept 1}
p-value: 0.0
R2: 0.9902692977297971
Adjusted R2: 0.9898552252927671
Residual standard error: 1.0796263676757927 on 47 degrees of freedom
AIC: 154.46158567694772
Visualizing multiple dimensions is more involved. In the case of two features, we can use a 3D scatterplot and a 3D surface. Let us do that using Tableplot’s Plotly API.
(-> multiple-linear-data
    (plotly/layer-point {:=coordinates :3d
                         :=x :x0
                         :=y :x1
                         :=z :y})
    (plotly/layer-surface {:=dataset (let [xs (range 11)
                                           ys (range 11)]
                                       (tc/dataset
                                        {:x xs
                                         :y ys
                                         :z (for [y ys]
                                              (for [x xs]
                                                (multiple-linear-data-model
                                                 [x y])))}))
                           :=mark-opacity 0.5}))

10.4 Coming soon: Polynomial regression 🛠
10.5 Coming soon: One-hot encoding 🛠
10.6 Coming soon: Regularization 🛠
10.7 Example: Predicting Bicycle Traffic
As in the Python Data Science Handbook, we’ll try predicting the daily number of bicycle trips across the Fremont Bridge in Seattle. The features will include weather, season, day of week, and related factors.
10.7.1 Reading and parsing data
(def column-name-mapping
  {"Fremont Bridge Sidewalks, south of N 34th St" :total
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk" :west
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk" :east
   "Date" :datetime})

(column-name-mapping
 "Fremont Bridge Sidewalks, south of N 34th St")

:total

(def counts
  (tc/dataset "data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz"
              {:key-fn column-name-mapping
               :parser-fn {"Date" [:local-date-time "MM/dd/yyyy hh:mm:ss a"]}}))

counts

data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz [106608 4]:
| :datetime | :total | :west | :east |
|---|---|---|---|
| 2012-10-02T13:00 | 55 | 7 | 48 |
| 2012-10-02T14:00 | 130 | 55 | 75 |
| 2012-10-02T15:00 | 152 | 81 | 71 |
| 2012-10-02T16:00 | 278 | 167 | 111 |
| 2012-10-02T17:00 | 563 | 393 | 170 |
| 2012-10-02T18:00 | 381 | 236 | 145 |
| 2012-10-02T19:00 | 175 | 104 | 71 |
| 2012-10-02T20:00 | 86 | 51 | 35 |
| 2012-10-02T21:00 | 63 | 35 | 28 |
| 2012-10-02T22:00 | 42 | 27 | 15 |
| … | … | … | … |
| 2024-11-30T13:00 | 147 | 62 | 85 |
| 2024-11-30T14:00 | 154 | 73 | 81 |
| 2024-11-30T15:00 | 118 | 57 | 61 |
| 2024-11-30T16:00 | 88 | 51 | 37 |
| 2024-11-30T17:00 | 46 | 11 | 35 |
| 2024-11-30T18:00 | 46 | 18 | 28 |
| 2024-11-30T19:00 | 69 | 14 | 55 |
| 2024-11-30T20:00 | 18 | 8 | 10 |
| 2024-11-30T21:00 | 49 | 15 | 34 |
| 2024-11-30T22:00 | 14 | 4 | 10 |
| 2024-11-30T23:00 | 10 | 5 | 5 |
(def weather
  (tc/dataset "data/seattle-bikes-and-weather/BicycleWeather.csv.gz"
              {:key-fn keyword}))

weather

data/seattle-bikes-and-weather/BicycleWeather.csv.gz [1340 26]:
| :STATION | :STATION_NAME | :DATE | :PRCP | :SNWD | :SNOW | :TMAX | :TMIN | :AWND | :WDF2 | :WDF5 | :WSF2 | :WSF5 | :FMTM | :WT14 | :WT01 | :WT17 | :WT05 | :WT02 | :WT22 | :WT04 | :WT13 | :WT16 | :WT08 | :WT18 | :WT03 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120101 | 0 | 0 | 0 | 128 | 50 | 47 | 100 | 90 | 89 | 112 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120102 | 109 | 0 | 0 | 106 | 28 | 45 | 180 | 200 | 130 | 179 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120103 | 8 | 0 | 0 | 117 | 72 | 23 | 180 | 170 | 54 | 67 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120104 | 203 | 0 | 0 | 122 | 56 | 47 | 180 | 190 | 107 | 148 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120105 | 13 | 0 | 0 | 89 | 28 | 61 | 200 | 220 | 107 | 165 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120106 | 25 | 0 | 0 | 44 | 22 | 22 | 180 | 180 | 45 | 63 | -9999 | 1 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120107 | 0 | 0 | 0 | 72 | 28 | 23 | 170 | 180 | 54 | 63 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120108 | 0 | 0 | 0 | 100 | 28 | 20 | 160 | 200 | 45 | 63 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120109 | 43 | 0 | 0 | 94 | 50 | 34 | 200 | 200 | 67 | 89 | -9999 | 1 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | 1 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20120110 | 10 | 0 | 0 | 61 | 6 | 34 | 20 | 30 | 89 | 107 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150822 | 0 | 0 | 0 | 267 | 122 | 25 | 20 | 20 | 63 | 76 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150823 | 0 | 0 | 0 | 278 | 139 | 18 | 10 | 10 | 67 | 81 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | 1 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150824 | 0 | 0 | 0 | 239 | 122 | 23 | 190 | 190 | 54 | 67 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150825 | 0 | 0 | 0 | 256 | 122 | 34 | 350 | 360 | 63 | 76 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150826 | 0 | 0 | 0 | 283 | 139 | 17 | 30 | 40 | 58 | 67 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150827 | 0 | 0 | 0 | 294 | 144 | 21 | 230 | 200 | 45 | 63 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150828 | 5 | 0 | 0 | 233 | 156 | 26 | 230 | 240 | 81 | 103 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150829 | 325 | 0 | 0 | 222 | 133 | 58 | 210 | 210 | 157 | 206 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150830 | 102 | 0 | 0 | 200 | 128 | 47 | 200 | 200 | 89 | 112 | -9999 | -9999 | 1 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150831 | 0 | 0 | 0 | 189 | 161 | 58 | 210 | 210 | 112 | 134 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
| GHCND:USW00024233 | SEATTLE TACOMA INTERNATIONAL AIRPORT WA US | 20150901 | 58 | 0 | 0 | 194 | 139 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 | -9999 |
10.7.2 Preprocessing
The bike counts come in hourly data, but our weather information is daily. We’ll need to aggregate the hourly counts into daily totals before combining the datasets.
In the Python handbook, one does:
daily = counts.resample('d').sum()

Since Tablecloth's time series features are still evolving, we'll be a bit more explicit:
(def daily-totals
  (-> counts
      (tc/group-by (fn [{:keys [datetime]}]
                     {:date (datetime/local-date-time->local-date
                             datetime)}))
      (tc/aggregate-columns [:total :west :east]
                            tcc/sum)))

daily-totals

_unnamed [4443 4]:
| :date | :total | :west | :east |
|---|---|---|---|
| 2012-10-02 | 1938.0 | 1165.0 | 773.0 |
| 2012-10-03 | 3521.0 | 1761.0 | 1760.0 |
| 2012-10-04 | 3475.0 | 1767.0 | 1708.0 |
| 2012-10-05 | 3148.0 | 1590.0 | 1558.0 |
| 2012-10-06 | 2006.0 | 926.0 | 1080.0 |
| 2012-10-07 | 2142.0 | 951.0 | 1191.0 |
| 2012-10-08 | 3537.0 | 1708.0 | 1829.0 |
| 2012-10-09 | 3501.0 | 1742.0 | 1759.0 |
| 2012-10-10 | 3235.0 | 1587.0 | 1648.0 |
| 2012-10-11 | 3047.0 | 1468.0 | 1579.0 |
| … | … | … | … |
| 2024-11-20 | 2300.0 | 787.0 | 1513.0 |
| 2024-11-21 | 2382.0 | 775.0 | 1607.0 |
| 2024-11-22 | 1473.0 | 536.0 | 937.0 |
| 2024-11-23 | 1453.0 | 652.0 | 801.0 |
| 2024-11-24 | 727.0 | 311.0 | 416.0 |
| 2024-11-25 | 1483.0 | 486.0 | 997.0 |
| 2024-11-26 | 2173.0 | 727.0 | 1446.0 |
| 2024-11-27 | 1522.0 | 548.0 | 974.0 |
| 2024-11-28 | 631.0 | 261.0 | 370.0 |
| 2024-11-29 | 833.0 | 368.0 | 465.0 |
| 2024-11-30 | 1178.0 | 509.0 | 669.0 |
10.7.3 Prediction by day-of-week
Next, we’ll explore a simple regression by day of week.
(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

We'll convert the numeric day-of-week to the corresponding keyword:

(def idx->day-of-week
  (comp days-of-week dec))

For example,

(idx->day-of-week 1)

:Mon

(idx->day-of-week 7)

:Sun

Now, let's build our dataset:
(def totals-with-day-of-week
  (-> daily-totals
      (tc/add-column :day-of-week
                     (fn [ds]
                       (map idx->day-of-week
                            (datetime/long-temporal-field
                             :day-of-week
                             (:date ds)))))
      (tc/select-columns [:total :day-of-week])))

totals-with-day-of-week

_unnamed [4443 2]:
| :total | :day-of-week |
|---|---|
| 1938.0 | :Tue |
| 3521.0 | :Wed |
| 3475.0 | :Thu |
| 3148.0 | :Fri |
| 2006.0 | :Sat |
| 2142.0 | :Sun |
| 3537.0 | :Mon |
| 3501.0 | :Tue |
| 3235.0 | :Wed |
| 3047.0 | :Thu |
| … | … |
| 2300.0 | :Wed |
| 2382.0 | :Thu |
| 1473.0 | :Fri |
| 1453.0 | :Sat |
| 727.0 | :Sun |
| 1483.0 | :Mon |
| 2173.0 | :Tue |
| 1522.0 | :Wed |
| 631.0 | :Thu |
| 833.0 | :Fri |
| 1178.0 | :Sat |
(def totals-with-one-hot-days-of-week
  (-> (reduce (fn [dataset day-of-week]
                (-> dataset
                    (tc/add-column day-of-week
                                   #(-> (:day-of-week %)
                                        (tcc/eq day-of-week)
                                        ;; convert booleans to 0/1
                                        (tcc/* 1)))))
              totals-with-day-of-week
              days-of-week)
      (tc/drop-columns [:day-of-week])
      (ds-mod/set-inference-target :total)))

(-> totals-with-one-hot-days-of-week
    (tc/select-columns ds-mod/inference-column?))

_unnamed [0 0]
Since the seven binary columns sum to 1 in every row, they are collinear with an intercept column, so we fit the model without an intercept. This way, each coefficient directly reflects the expected bike count for its day of the week.
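To make the collinearity concrete, we can verify that the seven indicator columns sum to 1 in every row. A minimal sketch using Tablecloth's row access (a plain sanity check, not needed for the model itself):

```clojure
;; Sum the 0/1 indicators across each row; every date belongs to
;; exactly one day of the week, so each row sum should be 1.
(->> (-> totals-with-one-hot-days-of-week
         (tc/select-columns days-of-week)
         tc/rows)
     (map #(reduce + %))
     distinct)
;; => (1)
```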
(def days-of-week-model
  (lm totals-with-one-hot-days-of-week
      {:intercept? false}))

Let's take a look at the results:
(-> days-of-week-model
    println
    with-out-str
    kind/code)

Residuals:
| :min | :q1 | :median | :q3 | :max |
|--------------+-------------+------------+------------+-------------|
| -3034.944882 | -841.755906 | -90.179528 | 828.820472 | 3745.244094 |
Coefficients:
| :name | :estimate | :stderr | :t-value | :p-value | :confidence-interval |
|-------+-------------+-----------+-----------+----------+---------------------------|
| :Mon | 2785.741325 | 44.521658 | 62.570476 | 0.0 | [2698.456664 2873.025986] |
| :Tue | 3115.527559 | 44.486587 | 70.032964 | 0.0 | [3028.311653 3202.743465] |
| :Wed | 3109.944882 | 44.486587 | 69.907472 | 0.0 | [3022.728976 3197.160788] |
| :Thu | 2961.179528 | 44.486587 | 66.563423 | 0.0 | [2873.963622 3048.395433] |
| :Fri | 2623.755906 | 44.486587 | 58.978583 | 0.0 | [2536.54 2710.971811] |
| :Sat | 1688.029921 | 44.486587 | 37.944693 | 0.0 | [1600.814015 1775.245827] |
| :Sun | 1576.178233 | 44.521658 | 35.402506 | 0.0 | [1488.893572 1663.462894] |
F-statistic: 3472.719277560978 on degrees of freedom: {:residual 4436, :model 7, :intercept 0}
p-value: 0.0
R2: 0.8456776967289252
Adjusted R2: 0.8454341764126724
Residual standard error: 1121.0266948292917 on 4436 degrees of freedom
AIC: 75015.17638480052
We can clearly see weekend versus weekday differences.
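Since each coefficient here is just the expected daily total for its day, we can collect the estimates into a small dataset for side-by-side comparison. A minimal sketch, assuming the coefficient order follows `days-of-week`, as in the table above:

```clojure
;; Pair each day keyword with its estimated mean daily total.
(-> {:day days-of-week
     :expected-total (map :estimate
                          (:coefficients days-of-week-model))}
    tc/dataset)
```

The weekday estimates cluster around 2600 to 3100 trips, while :Sat and :Sun sit near 1700 and 1600.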