8  Intro to Linear Regression - DRAFT πŸ› 

Here we offer an intro to linear regression following the In Depth: Linear Regression section of the Python Data Science Handbook by Jake VanderPlas.

8.1 Setup

(ns noj-book.linear-regression-intro
  (:require
   [tech.v3.dataset :as ds]
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.datatype.datetime :as datetime]
   [tech.v3.dataset.modelling :as ds-mod]
   [fastmath.ml.regression :as reg]
   [scicloj.kindly.v4.kind :as kind]))

8.2 Reading and parsing data

(def column-name-mapping
  {"Fremont Bridge Sidewalks, south of N 34th St" :total
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist West Sidewalk" :west
   "Fremont Bridge Sidewalks, south of N 34th St Cyclist East Sidewalk" :east
   "Date" :datetime})
(column-name-mapping
 "Fremont Bridge Sidewalks, south of N 34th St")
:total
(def counts
  (tc/dataset "data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz"
              {:key-fn column-name-mapping
               :parser-fn {"Date" [:local-date-time "MM/dd/yyyy hh:mm:ss a"]}}))
counts

data/seattle-bikes-and-weather/Fremont_Bridge_Bicycle_Counter.csv.gz [106608 4]:

:datetime :total :west :east
2012-10-02T13:00 55 7 48
2012-10-02T14:00 130 55 75
2012-10-02T15:00 152 81 71
2012-10-02T16:00 278 167 111
2012-10-02T17:00 563 393 170
2012-10-02T18:00 381 236 145
2012-10-02T19:00 175 104 71
2012-10-02T20:00 86 51 35
2012-10-02T21:00 63 35 28
2012-10-02T22:00 42 27 15
… … … …
2024-11-30T13:00 147 62 85
2024-11-30T14:00 154 73 81
2024-11-30T15:00 118 57 61
2024-11-30T16:00 88 51 37
2024-11-30T17:00 46 11 35
2024-11-30T18:00 46 18 28
2024-11-30T19:00 69 14 55
2024-11-30T20:00 18 8 10
2024-11-30T21:00 49 15 34
2024-11-30T22:00 14 4 10
2024-11-30T23:00 10 5 5
(def weather
  (tc/dataset "data/seattle-bikes-and-weather/BicycleWeather.csv.gz"
              {:key-fn keyword}))
weather

data/seattle-bikes-and-weather/BicycleWeather.csv.gz [1340 26]:

:STATION :STATION_NAME :DATE :PRCP :SNWD :SNOW :TMAX :TMIN :AWND :WDF2 :WDF5 :WSF2 :WSF5 :FMTM :WT14 :WT01 :WT17 :WT05 :WT02 :WT22 :WT04 :WT13 :WT16 :WT08 :WT18 :WT03
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120101 0 0 0 128 50 47 100 90 89 112 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120102 109 0 0 106 28 45 180 200 130 179 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120103 8 0 0 117 72 23 180 170 54 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120104 203 0 0 122 56 47 180 190 107 148 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120105 13 0 0 89 28 61 200 220 107 165 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120106 25 0 0 44 22 22 180 180 45 63 -9999 1 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120107 0 0 0 72 28 23 170 180 54 63 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120108 0 0 0 100 28 20 160 200 45 63 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120109 43 0 0 94 50 34 200 200 67 89 -9999 1 1 -9999 -9999 -9999 -9999 -9999 1 1 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20120110 10 0 0 61 6 34 20 30 89 107 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999 -9999
… … … … … … … … … … … … … … … … … … … … … … … … … …
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150822 0 0 0 267 122 25 20 20 63 76 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150823 0 0 0 278 139 18 10 10 67 81 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 1 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150824 0 0 0 239 122 23 190 190 54 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150825 0 0 0 256 122 34 350 360 63 76 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150826 0 0 0 283 139 17 30 40 58 67 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150827 0 0 0 294 144 21 230 200 45 63 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150828 5 0 0 233 156 26 230 240 81 103 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150829 325 0 0 222 133 58 210 210 157 206 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150830 102 0 0 200 128 47 200 200 89 112 -9999 -9999 1 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150831 0 0 0 189 161 58 210 210 112 134 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999
GHCND:USW00024233 SEATTLE TACOMA INTERNATIONAL AIRPORT WA US 20150901 58 0 0 194 139 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999 -9999

8.3 Preprocessing

Our bike counts data are hourly, but the weather data is daily. To join them, we will need to convert the bike hourly counts to daily counts.

In the Python book, this is done as follows in Pandas:

daily = counts.resample('d').sum()

Tablecloth’s full support for time series is still under construction. For now, we will have to be a bit more verbose:

(def daily-totals
  (-> counts
      (tc/group-by (fn [{:keys [datetime]}]
                     {:date (datetime/local-date-time->local-date
                             datetime)}))
      (tc/aggregate-columns [:total :west :east]
                            tcc/sum)))
daily-totals

_unnamed [4443 4]:

:date :total :west :east
2012-10-02 1938.0 1165.0 773.0
2012-10-03 3521.0 1761.0 1760.0
2012-10-04 3475.0 1767.0 1708.0
2012-10-05 3148.0 1590.0 1558.0
2012-10-06 2006.0 926.0 1080.0
2012-10-07 2142.0 951.0 1191.0
2012-10-08 3537.0 1708.0 1829.0
2012-10-09 3501.0 1742.0 1759.0
2012-10-10 3235.0 1587.0 1648.0
2012-10-11 3047.0 1468.0 1579.0
… … … …
2024-11-20 2300.0 787.0 1513.0
2024-11-21 2382.0 775.0 1607.0
2024-11-22 1473.0 536.0 937.0
2024-11-23 1453.0 652.0 801.0
2024-11-24 727.0 311.0 416.0
2024-11-25 1483.0 486.0 997.0
2024-11-26 2173.0 727.0 1446.0
2024-11-27 1522.0 548.0 974.0
2024-11-28 631.0 261.0 370.0
2024-11-29 833.0 368.0 465.0
2024-11-30 1178.0 509.0 669.0

8.4 Prediction by day-of-week

Let us prepare the data for regression on the day of week.

(def days-of-week
  [:Mon :Tue :Wed :Thu :Fri :Sat :Sun])

We will convert numbers to days-of-week keywords:

(def idx->day-of-week
  (comp days-of-week dec))

E.g.,

(idx->day-of-week 1)
:Mon
(idx->day-of-week 7)
:Sun

Now, let us prepare the data:

(def totals-with-day-of-week
  (-> daily-totals
      (tc/add-column :day-of-week
                     (fn [ds]
                       (map idx->day-of-week
                            (datetime/long-temporal-field
                             :day-of-week
                             (:date ds)))))
      (tc/select-columns [:total :day-of-week])))
totals-with-day-of-week

_unnamed [4443 2]:

:total :day-of-week
1938.0 :Tue
3521.0 :Wed
3475.0 :Thu
3148.0 :Fri
2006.0 :Sat
2142.0 :Sun
3537.0 :Mon
3501.0 :Tue
3235.0 :Wed
3047.0 :Thu
… …
2300.0 :Wed
2382.0 :Thu
1473.0 :Fri
1453.0 :Sat
727.0 :Sun
1483.0 :Mon
2173.0 :Tue
1522.0 :Wed
631.0 :Thu
833.0 :Fri
1178.0 :Sat
(def totals-with-one-hot-days-of-week
  (-> (reduce (fn [dataset day-of-week]
                (-> dataset
                    (tc/add-column day-of-week
                                   #(-> (:day-of-week %)
                                        (tcc/eq day-of-week)
                                        ;; turn booleans into 0s and 1s
                                        (tcc/* 1)))))
              totals-with-day-of-week
              days-of-week)
      (tc/drop-columns [:day-of-week])
      (ds-mod/set-inference-target :total)))
(-> totals-with-one-hot-days-of-week
    (tc/select-columns ds-mod/inference-column?))

_unnamed [0 0]

Let us compute the linear regression model using Fastmath. We will use this wrapper function that handles a dataset (a concept which is unknown to Fastmath):

(defn lm [dataset options]
  (let [inference-column-name (-> dataset
                                  ds-mod/inference-target-column-names
                                  first)
        ds-without-target (-> dataset
                              (tc/drop-columns [inference-column-name]))]
    (reg/lm
     ;; ys
     (get dataset inference-column-name)
     ;; xss
     (tc/rows ds-without-target)
     ;; options
     (merge {:names (-> ds-without-target
                        tc/column-names
                        vec)}
            options))))

The binary columns are collinear (sum up to 1), but we will avoide the intercept. This way, the interpretation of each coefficient is the expected bike count for the corresponding day of week.

(def days-of-week-model
  (lm totals-with-one-hot-days-of-week
      {:intercept? false}))

Here are the regression results:

(-> days-of-week-model
    println
    with-out-str
    kind/code)
Residuals:

|         :min |         :q1 |    :median |        :q3 |        :max |
|--------------+-------------+------------+------------+-------------|
| -3034.944882 | -841.755906 | -90.179528 | 828.820472 | 3745.244094 |

Coefficients:

| :name |   :estimate |   :stderr |  :t-value | :p-value |      :confidence-interval |
|-------+-------------+-----------+-----------+----------+---------------------------|
|  :Mon | 2785.741325 | 44.521658 | 62.570476 |      0.0 | [2698.456664 2873.025986] |
|  :Tue | 3115.527559 | 44.486587 | 70.032964 |      0.0 | [3028.311653 3202.743465] |
|  :Wed | 3109.944882 | 44.486587 | 69.907472 |      0.0 | [3022.728976 3197.160788] |
|  :Thu | 2961.179528 | 44.486587 | 66.563423 |      0.0 | [2873.963622 3048.395433] |
|  :Fri | 2623.755906 | 44.486587 | 58.978583 |      0.0 |     [2536.54 2710.971811] |
|  :Sat | 1688.029921 | 44.486587 | 37.944693 |      0.0 | [1600.814015 1775.245827] |
|  :Sun | 1576.178233 | 44.521658 | 35.402506 |      0.0 | [1488.893572 1663.462894] |

F-statistic: 3472.719277560978 on degrees of freedom: {:residual 4436, :model 7, :intercept 0}
p-value: 0.0

R2: 0.8456776967289252
Adjusted R2: 0.8454341764126724
Residual standard error: 1121.0266948292917 on 4436 degrees of freedom
AIC: 75015.17638480052

We can see the difference between weekends and weekdays.

source: notebooks/noj_book/linear_regression_intro.clj