7 Intro to statistics - DRAFT 🛠
In this tutorial, we will demonstrate some basic methods of statistics.
7.1 Setup and data
We will use the Chicago bike trips dataset, which is read and processed in the Intro to Table Processing with Tablecloth tutorial.
(ns noj-book.statistics-intro
  (:require [tablecloth.api :as tc]
            [noj-book.tablecloth-table-processing]
            [scicloj.tableplot.v1.plotly :as plotly]))

(def preprocessed-trips
  noj-book.tablecloth-table-processing/preprocessed-trips)
7.2 Checking basic statistics of variables
(-> preprocessed-trips
    (tc/select-columns [:hour :duration-in-seconds])
    tc/info)
data/chicago-bikes/202304_divvy_tripdata.csv.gz: descriptive-stats [2 11]:
:col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
---|---|---|---|---|---|---|---|---|---|---|
:hour | :int64 | 426590 | 0 | 0.0 | 14.23275276 | 23.0 | 4.84054550 | -0.58147195 | 8 | 8 |
:duration-in-seconds | :int64 | 426590 | 0 | -536.0 | 1032.64061511 | 1103729.0 | 8532.49143608 | 78.01094578 | 249 | 123 |
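To put the extremes of `:duration-in-seconds` into more familiar units, here is a minimal sketch in plain Clojure. The two values are copied from the `:min` and `:max` columns of the table above; the helper names `seconds->minutes` and `seconds->days` are introduced here for illustration only and are not part of the tutorial's code.

```clojure
;; Extreme values copied from the :min and :max columns above (in seconds).
(def min-duration-seconds -536.0)
(def max-duration-seconds 1103729.0)

;; Illustrative helpers (hypothetical names) for unit conversion.
(defn seconds->minutes [s] (/ s 60.0))
(defn seconds->days [s] (/ s 60.0 60.0 24.0))

(seconds->minutes min-duration-seconds) ; roughly -8.9 minutes
(seconds->days max-duration-seconds)    ; roughly 12.8 days
```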
We see that the duration in seconds has some unreasonable values: the minimum is negative, and the maximum of 1,103,729 seconds corresponds to a trip of more than twelve days.
Let us check how frequent such values are.
(defn duration-diagnostics [{:keys [duration-in-minutes]}]
  {:negative-duration (neg? duration-in-minutes)
   :unreasonably-long-duration (> duration-in-minutes (* 2 60))})

(-> preprocessed-trips
    (tc/group-by duration-diagnostics)
    (tc/aggregate {:trips tc/row-count}))
_unnamed [3 3]:
:negative-duration | :unreasonably-long-duration | :trips |
---|---|---|
false | false | 423924 |
false | true | 2662 |
true | false | 4 |
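The grouping above uses a function that maps each row to a small map of flags, so each distinct combination of flags becomes one group. As a rough analogy in plain Clojure (not how Tablecloth implements it), the same counts can be sketched with `frequencies` over a few made-up rows. The rows in `sample-rows` are hypothetical and are not taken from the dataset.

```clojure
;; Hypothetical sample rows; the durations are made up for illustration.
(def sample-rows
  [{:duration-in-minutes 12.5}
   {:duration-in-minutes -3.0}
   {:duration-in-minutes 500.0}
   {:duration-in-minutes 45.0}])

(defn duration-diagnostics [{:keys [duration-in-minutes]}]
  {:negative-duration (neg? duration-in-minutes)
   :unreasonably-long-duration (> duration-in-minutes (* 2 60))})

;; Count how many rows fall under each distinct diagnostics map.
(def diagnostic-counts
  (frequencies (map duration-diagnostics sample-rows)))

diagnostic-counts
;; two rows have both flags false, one is negative, one is too long
```

A trip cannot be both negative and longer than two hours, which is why at most three of the four flag combinations can appear.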
7.3 Data cleaning
Let us keep only trips of reasonable duration: at least zero minutes and at most two hours.
(def clean-trips
  (-> preprocessed-trips
      (tc/select-rows (fn [{:keys [duration-in-minutes]}]
                        (<= 0
                            duration-in-minutes
                            (* 2 60))))))
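The row predicate relies on Clojure's variadic `<=`, which checks that its arguments are in non-decreasing order, so both bounds are inclusive. A minimal sketch on plain numbers follows; `reasonable-duration?` and the sample values are introduced here for illustration only.

```clojure
;; Upper bound in minutes, matching the two-hour limit used above.
(def max-minutes (* 2 60))

;; Illustrative predicate (hypothetical name): true when 0 <= m <= 120.
(defn reasonable-duration? [m]
  (<= 0 m max-minutes))

;; Made-up durations in minutes, including both boundary values.
(def kept
  (filterv reasonable-duration? [-3.0 0 12.5 120 120.5 500.0]))

kept
;; => [0 12.5 120]
```

Since this condition is the complement of the two diagnostic flags, `clean-trips` should contain the 423924 rows counted earlier with both flags false, assuming no durations are missing.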
7.4 Visually exploring the distribution of variables
The distribution of start hour:
(-> clean-trips
    (tc/group-by [:hour])
    (tc/aggregate {:n tc/row-count})
    (plotly/layer-bar {:=x :hour
                       :=y :n}))
The distribution of trip duration. Here we use a histogram: the values are binned, and the trips in each bin are counted.
(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100}))
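To make "binning the values and counting" concrete, here is a minimal sketch of equal-width binning in plain Clojure. It is an illustration under simple assumptions (fixed range, equal-width bins, the upper bound falling into the last bin); the plotting library may choose its bin edges differently. The function `histogram-counts` and the sample values are hypothetical.

```clojure
(defn histogram-counts
  "Counts values per equal-width bin over [lo, hi].
   A value equal to hi is placed in the last bin."
  [values lo hi nbins]
  (let [width (/ (- hi lo) (double nbins))
        bin-index (fn [v]
                    (min (dec nbins)
                         (int (/ (- v lo) width))))]
    (reduce (fn [counts v] (update counts (bin-index v) inc))
            (vec (repeat nbins 0))
            values)))

;; Made-up durations in minutes, binned into 4 bins of width 30.
(def example-counts
  (histogram-counts [1 2 3 14 15 29 30 58 119 120] 0 120 4))

example-counts
;; => [6 2 0 2]
```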
The distribution of trip duration in different parts of the day:
(-> clean-trips
    (tc/map-columns :day-part
                    [:hour]
                    ;; cond returns the first matching branch, so the
                    ;; shared boundary hours 12 and 18 go to the earlier part.
                    (fn [hour]
                      (cond (<= 6 hour 12) :morning
                            (<= 12 hour 18) :afternoon
                            (<= 18 hour 23) :evening
                            :else :night)))
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :day-part
                             :=mark-opacity 0.8}))
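The ranges in the `cond` above overlap at hours 12 and 18. Because `cond` returns the result of the first test that succeeds, hour 12 is labelled `:morning` and hour 18 is labelled `:afternoon`. The sketch below extracts the same logic into a standalone function, named `day-part` here for illustration, and applies it to a few hours.

```clojure
;; Same branching as in the pipeline above, as a standalone function.
(defn day-part [hour]
  (cond (<= 6 hour 12) :morning
        (<= 12 hour 18) :afternoon
        (<= 18 hour 23) :evening
        :else :night))

(def example-parts
  (mapv day-part [0 5 6 12 13 18 19 23]))

example-parts
;; => [:night :night :morning :morning :afternoon :afternoon :evening :evening]
```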
TODO: Use density estimates rather than histograms here.
The distribution of trip duration for different bike types:
(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :rideable-type
                             :=mark-opacity 0.8}))