8 Distributions of single variables - DRAFT 🛠
author: Cvetomir Dimov and Daniel Slutsky
last change: 2025-02-24
In this tutorial, we will demonstrate how to visualize and summarize variables.
8.1 Setup and data
We will use the Chicago bike trips dataset, which is read and processed in the Intro to Table Processing with Tablecloth tutorial.
(ns noj-book.statistics-intro
  (:require [tablecloth.api :as tc]
            [noj-book.tablecloth-table-processing]
            [scicloj.tableplot.v1.plotly :as plotly]
            [fastmath.stats :as stats]))
(def preprocessed-trips
  noj-book.tablecloth-table-processing/preprocessed-trips)
8.2 Checking basic statistics of variables in a dataset
(-> preprocessed-trips
    (tc/select-columns [:hour :duration-in-seconds])
    tc/info)
202304-divvy-tripdata.zip: descriptive-stats [2 11]:
:col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
---|---|---|---|---|---|---|---|---|---|---|
:hour | :int64 | 426590 | 0 | 0.0 | 14.23275276 | 23.0 | 4.84054550 | -0.58147195 | 8 | 8 |
:duration-in-seconds | :int64 | 426590 | 0 | -536.0 | 1032.64061511 | 1103729.0 | 8532.49143608 | 78.01094578 | 249 | 123 |
We see that the duration in seconds has some unreasonable values: negative durations, and trips lasting many hours (the maximum, 1103729 seconds, is more than 12 days).
Let us check how frequent such values are.
(defn duration-diagnostics [{:keys [duration-in-minutes]}]
  {:negative-duration (neg? duration-in-minutes)
   :unreasonably-long-duration (> duration-in-minutes (* 2 60))})
(-> preprocessed-trips
    (tc/group-by duration-diagnostics)
    (tc/aggregate {:trips tc/row-count}))
_unnamed [3 3]:
:negative-duration | :unreasonably-long-duration | :trips |
---|---|---|
false | false | 423924 |
false | true | 2662 |
true | false | 4 |
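The grouping step above can be pictured without any table library. The following is a minimal sketch in plain Clojure, assuming a small hypothetical collection `toy-trips` (not part of the dataset): each row is mapped to its diagnostics map, and `frequencies` counts how many rows share each combination of flags.

```clojure
;; Same diagnostics function as in the tutorial.
(defn duration-diagnostics [{:keys [duration-in-minutes]}]
  {:negative-duration (neg? duration-in-minutes)
   :unreasonably-long-duration (> duration-in-minutes (* 2 60))})

;; Hypothetical toy rows, for illustration only.
(def toy-trips
  [{:duration-in-minutes 5.0}
   {:duration-in-minutes -1.5}
   {:duration-in-minutes 300.0}
   {:duration-in-minutes 12.0}])

;; Count the rows that fall into each combination of flags.
(def toy-counts
  (frequencies (map duration-diagnostics toy-trips)))
```

Here `toy-counts` has one entry per combination of flags that actually occurs, which mirrors the rows of the aggregated table.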
8.3 Data cleaning
Let us keep only trips of reasonable duration: between 0 and 40 minutes.
(def clean-trips
  (-> preprocessed-trips
      (tc/select-rows (fn [{:keys [duration-in-minutes]}]
                        (<= 0 duration-in-minutes 40)))))
8.4 Visually exploring the distribution of variables
The distribution of start hour:
(-> clean-trips
    (tc/group-by [:hour])
    (tc/aggregate {:n tc/row-count})
    (plotly/layer-bar {:=x :hour
                       :=y :n}))
The distribution of trip duration: let us use a histogram, which bins the values and counts how many fall into each bin.
(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100}))
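To make "binning the values and counting" concrete, here is a minimal sketch in plain Clojure. It assumes a fixed bin width, which is a simplification: the helper `histogram-counts` and its `bin-width` parameter are hypothetical, and the plotting layer chooses its own bins.

```clojure
(defn histogram-counts
  "Map from bin index to the number of values in that bin,
  where bin i covers [i*bin-width, (i+1)*bin-width)."
  [values bin-width]
  (->> values
       (map #(long (Math/floor (/ (double %) (double bin-width)))))
       frequencies
       (into (sorted-map))))

;; Hypothetical durations in minutes, binned into 1-minute bins.
(def toy-histogram
  (histogram-counts [0.5 1.2 1.9 4.0 4.1 39.9] 1.0))
;; => {0 1, 1 2, 4 2, 39 1}
```

A histogram plot then draws one bar per bin, with the bar height given by the count.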
The distribution of trip duration in different parts of the day:
(-> clean-trips
    (tc/map-columns :day-part
                    [:hour]
                    ;; cond returns the first matching branch,
                    ;; so hour 12 is :morning and hour 18 is :afternoon.
                    (fn [hour]
                      (cond (<= 6 hour 12) :morning
                            (<= 12 hour 18) :afternoon
                            (<= 18 hour 23) :evening
                            :else :night)))
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :day-part
                             :=mark-opacity 0.8}))
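The hour ranges in the `cond` above overlap at 12 and 18, so it may help to see which label those boundary hours receive. The sketch below pulls the same logic into a named function; the name `hour->day-part` is introduced here only for illustration.

```clojure
(defn hour->day-part
  "Label an hour of the day (0-23). cond returns the first
  matching branch, which decides the overlapping boundaries."
  [hour]
  (cond (<= 6 hour 12) :morning
        (<= 12 hour 18) :afternoon
        (<= 18 hour 23) :evening
        :else :night))

(def boundary-labels
  (mapv hour->day-part [3 6 12 18 23]))
;; => [:night :morning :morning :afternoon :evening]
```

Hours 0 to 5 fall through to `:night`, while 12 and 18 are claimed by the earlier branch.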
TODO: Use density estimates rather than histograms here.
The distribution of trip duration for different bike types:
(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :rideable-type
                             :=mark-opacity 0.8}))
8.5 Describing the distribution of a continuous variable
The fastmath library contains functions for computing the most common measures of sample central tendency and variability (see the Statistics section). We will demonstrate a few.
(-> clean-trips
    (tc/aggregate {:n tc/row-count
                   :mean-duration #(-> % :duration-in-seconds stats/mean)
                   :min-duration #(-> % :duration-in-seconds stats/minimum)
                   :max-duration #(-> % :duration-in-seconds stats/maximum)
                   :median-duration #(-> % :duration-in-seconds stats/median)
                   :q1-duration #(-> % :duration-in-seconds stats/stats-map :Q1)
                   :percentile10-duration #(-> % :duration-in-seconds (stats/percentile 10))
                   :sd-duration #(-> % :duration-in-seconds stats/stddev)
                   :cov-duration #(-> % :duration-in-seconds stats/variation)
                   :mad-duration #(-> % :duration-in-seconds stats/mad)
                   :iqr-duration #(-> % :duration-in-seconds stats/iqr)}))
_unnamed [1 11]:
:percentile10-duration | :n | :mean-duration | :max-duration | :min-duration | :sd-duration | :median-duration | :mad-duration | :q1-duration | :iqr-duration | :cov-duration |
---|---|---|---|---|---|---|---|---|---|---|
169.0 | 404180 | 648.17606512 | 2400.0 | 0.0 | 493.51028516 | 506.0 | 257.0 | 291.0 | 579.0 | 0.76138307 |
Note that the function stats-map outputs many of the most commonly used statistics, including the ones computed above.
(-> clean-trips
    :duration-in-minutes
    stats/stats-map
    (select-keys [:Size :Min :Max :Range :Mean :Median :Mode :Q1 :Q3
                  :SD :Variance :MAD :SEM :IQR :Kurtosis :Skewness]))
{:MAD 4.283333333333333,
:Skewness 1.2669327507375887,
:Max 40.0,
:Variance 67.65344487828531,
:Size 404180,
:Mode 0.06666666666666667,
:Mean 10.802934418658129,
:Q1 4.85,
:Q3 14.5,
:Min 0.0,
:Range 40.0,
:SD 8.225171419386061,
:IQR 9.65,
:SEM 0.01293771404646492,
:Kurtosis 1.2317123730642425,
:Median 8.433333333333334}
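Several of these statistics are related by simple formulas, and the printed values can be used to check them. The sketch below copies a few numbers from the output above into a plain map and tests four relations: range = max - min, IQR = Q3 - Q1, variance = SD squared, and SEM = SD / sqrt(n). The helper `close?` and its tolerance are assumptions made for this illustration and are not part of fastmath.

```clojure
;; Values copied from the stats-map output above.
(def summary
  {:Size 404180
   :Min 0.0 :Max 40.0 :Range 40.0
   :Q1 4.85 :Q3 14.5 :IQR 9.65
   :SD 8.225171419386061
   :Variance 67.65344487828531
   :SEM 0.01293771404646492})

(defn close?
  "True when a and b differ by less than a small tolerance."
  [a b]
  (< (Math/abs (double (- a b))) 1e-6))

;; Each entry is true when the printed values satisfy the relation.
(def checks
  {:range    (close? (- (:Max summary) (:Min summary))
                     (:Range summary))
   :iqr      (close? (- (:Q3 summary) (:Q1 summary))
                     (:IQR summary))
   :variance (close? (* (:SD summary) (:SD summary))
                     (:Variance summary))
   :sem      (close? (/ (:SD summary)
                        (Math/sqrt (double (:Size summary))))
                     (:SEM summary))})
```

All four checks hold for the values printed above, which is a quick way to confirm how the reported quantities fit together.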
Box plots and violin plots can be used to visualize several of these summary statistics.
(-> clean-trips
    (plotly/layer-boxplot {:=y :duration-in-minutes}))
These plots can also be drawn per group to compare distributions across groups.
(-> clean-trips
    (plotly/layer-violin {:=x :rideable-type
                          :=y :duration-in-minutes
                          :=box-visible true
                          :=color :rideable-type}))