8  Distributions of single variables - DRAFT 🛠

author: Cvetomir Dimov and Daniel Slutsky

last change: 2025-02-24

In this tutorial, we will demonstrate how to visualize and summarize variables.

8.1 Setup and data

We will use the Chicago bike trips dataset, which is read and preprocessed in the Intro to Table Processing with Tablecloth chapter.

(ns noj-book.statistics-intro
  (:require [tablecloth.api :as tc]
            [noj-book.tablecloth-table-processing]
            [scicloj.tableplot.v1.plotly :as plotly]
            [fastmath.stats :as stats]))
(def preprocessed-trips
  noj-book.tablecloth-table-processing/preprocessed-trips)

8.2 Checking basic statistics of variables in a dataset

(-> preprocessed-trips
    (tc/select-columns [:hour :duration-in-seconds])
    tc/info)

202304-divvy-tripdata.zip: descriptive-stats [2 11]:

:col-name :datatype :n-valid :n-missing :min :mean :max :standard-deviation :skew :first :last
:hour :int64 426590 0 0.0 14.23275276 23.0 4.84054550 -0.58147195 8 8
:duration-in-seconds :int64 426590 0 -536.0 1032.64061511 1103729.0 8532.49143608 78.01094578 249 123

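We see that the duration in seconds has some unreasonable values: the minimum is negative (-536 seconds), and the maximum is 1,103,729 seconds, which is more than 12 days.

Let us check how frequent such values are, counting trips with a negative duration and trips longer than two hours.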

(defn duration-diagnostics [{:keys [duration-in-minutes]}]
  {:negative-duration (neg? duration-in-minutes)
   :unreasonably-long-duration (> duration-in-minutes (* 2 60))})
(-> preprocessed-trips
    (tc/group-by duration-diagnostics)
    (tc/aggregate {:trips tc/row-count}))

_unnamed [3 3]:

:negative-duration :unreasonably-long-duration :trips
false false 423924
false true 2662
true false 4

8.3 Data cleaning

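Let us keep only trips of reasonable duration: those lasting between 0 and 40 minutes, inclusive. Note that this cutoff is stricter than the two-hour threshold used in the diagnostics above.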

(def clean-trips
  (-> preprocessed-trips
      (tc/select-rows (fn [{:keys [duration-in-minutes]}]
                        (<= 0
                            duration-in-minutes
                            40)))))

8.4 Visually exploring the distribution of variables

The distribution of start hour:

(-> clean-trips
    (tc/group-by [:hour])
    (tc/aggregate {:n tc/row-count})
    (plotly/layer-bar {:=x :hour
                       :=y :n}))

The distribution of trip duration. Let us use a histogram, which splits the range of values into bins and counts how many values fall into each bin.

(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100}))
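To make the idea of binning and counting concrete, here is a minimal sketch in plain Clojure. It is not how the plotting library computes its bins; the function name `bin-counts`, the equal-width binning rule, and the sample values are all assumptions made for illustration.

```clojure
;; Hypothetical durations in minutes (illustrative values only).
(def sample-durations [0.5 1.2 3.7 4.0 4.9 7.5 9.99 10.0])

(defn bin-counts
  "Count values in `nbins` equal-width bins spanning [lo, hi].
  A value equal to `hi` is placed in the last bin."
  [values lo hi nbins]
  (let [width (/ (- hi lo) nbins)
        bin-of (fn [v]
                 (min (dec nbins)
                      (long (Math/floor (double (/ (- v lo) width))))))]
    (reduce (fn [counts v] (update counts (bin-of v) inc))
            (vec (repeat nbins 0))
            values)))

;; Five bins of width 2.0 over the range [0, 10].
(def counts (bin-counts sample-durations 0.0 10.0 5))
```

Each position of `counts` corresponds to the height of one bar; the counts always sum to the number of values.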

The distribution of trip duration in different parts of the day:

(-> clean-trips
    (tc/map-columns :day-part
                    [:hour]
                    (fn [hour]
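                      ;; cond returns the first matching branch, so the shared
                      ;; boundaries resolve as 12 -> :morning and 18 -> :afternoon.
                      (cond (<= 6 hour 12) :morning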
                            (<= 12 hour 18) :afternoon
                            (<= 18 hour 23) :evening
                            :else :night)))
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :day-part
                             :=mark-opacity 0.8}))

TODO: Use density estimates rather than histograms here.

The distribution of trip duration for different bike types:

(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :rideable-type
                             :=mark-opacity 0.8}))

8.5 Describing the distribution of a continuous variable

The fastmath library contains functions for computing the most common measures of sample central tendency and variability (see the Statistics section). We will demonstrate a few of them on the trip duration in seconds.

(-> clean-trips
    (tc/aggregate {:n tc/row-count
                   :mean-duration #(-> % :duration-in-seconds stats/mean)
                   :min-duration #(-> % :duration-in-seconds stats/minimum)
                   :max-duration #(-> % :duration-in-seconds stats/maximum)
                   :median-duration #(-> % :duration-in-seconds stats/median)
                   :q1-duration #(-> % :duration-in-seconds stats/stats-map :Q1)
                   :percentile10-duration #(-> % :duration-in-seconds (stats/percentile 10))
                   :sd-duration #(-> % :duration-in-seconds stats/stddev)
                   :cov-duration #(-> % :duration-in-seconds stats/variation)
                   :mad-duration #(-> % :duration-in-seconds stats/mad)
                   :iqr-duration #(-> % :duration-in-seconds stats/iqr)}))

_unnamed [1 11]:

:percentile10-duration :n :mean-duration :max-duration :min-duration :sd-duration :median-duration :mad-duration :q1-duration :iqr-duration :cov-duration
169.0 404180 648.17606512 2400.0 0.0 493.51028516 506.0 257.0 291.0 579.0 0.76138307

Note that the function stats-map returns many of the most commonly used statistics in a single map, including most of the ones computed above. Here we apply it to the duration in minutes.

(-> clean-trips
    :duration-in-minutes
    stats/stats-map
    (select-keys [:Size :Min :Max :Range :Mean :Median :Mode :Q1 :Q3
                  :SD :Variance :MAD :SEM :IQR :Kurtosis :Skewness]))
{:MAD 4.283333333333333,
 :Skewness 1.2669327507375887,
 :Max 40.0,
 :Variance 67.65344487828531,
 :Size 404180,
 :Mode 0.06666666666666667,
 :Mean 10.802934418658129,
 :Q1 4.85,
 :Q3 14.5,
 :Min 0.0,
 :Range 40.0,
 :SD 8.225171419386061,
 :IQR 9.65,
 :SEM 0.01293771404646492,
 :Kurtosis 1.2317123730642425,
 :Median 8.433333333333334}
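Several of these statistics are simple functions of the others, which gives a quick sanity check on the output. The sketch below recomputes a few of them from values copied from the outputs above, assuming the usual textbook definitions (range = max - min, IQR = Q3 - Q1, variance = SD squared, SEM = SD / sqrt(n), coefficient of variation = SD / mean). The names `summary` and `derived` are ours.

```clojure
;; Values copied from the stats-map output above (duration in minutes).
(def summary
  {:Size 404180
   :Min 0.0
   :Max 40.0
   :Q1 4.85
   :Q3 14.5
   :SD 8.225171419386061})

;; Statistics derived from the values above using the textbook definitions.
(def derived
  {:Range (- (:Max summary) (:Min summary))
   :IQR (- (:Q3 summary) (:Q1 summary))
   :Variance (* (:SD summary) (:SD summary))
   :SEM (/ (:SD summary) (Math/sqrt (double (:Size summary))))})

;; Coefficient of variation from the table in seconds: SD divided by mean.
(def coefficient-of-variation
  (/ 493.51028516 648.17606512))
```

Up to floating-point rounding, the derived values agree with the :Range, :IQR, :Variance and :SEM entries printed above, and the ratio agrees with :cov-duration in the earlier table.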

Box plots and violin plots can be used to visualize several of these summary statistics.

(-> clean-trips
    (plotly/layer-boxplot
     {:=y :duration-in-minutes}))
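A common convention for box plots, due to Tukey, draws the whiskers out to the most extreme points within 1.5 IQR of the quartiles and marks anything beyond as an outlier. We assume that convention here for illustration; whether the plot above uses exactly this rule depends on the plotting library's defaults. The helper `tukey-fences` is a hypothetical name.

```clojure
(defn tukey-fences
  "Return the 1.5 * IQR fences for the given quartiles."
  [q1 q3]
  (let [iqr (- q3 q1)]
    {:lower (- q1 (* 1.5 iqr))
     :upper (+ q3 (* 1.5 iqr))}))

;; Q1 and Q3 of duration-in-minutes, copied from the stats-map output above.
(def fences (tukey-fences 4.85 14.5))
```

With these quartiles the lower fence is about -9.625 minutes, which lies below the minimum of 0, and the upper fence is about 28.975 minutes, so under this rule only trips longer than roughly 29 minutes would be flagged as outliers.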

These can also be produced by group to compare the distributions across different groups.

(-> clean-trips
    (plotly/layer-violin
     {:=x :rideable-type 
      :=y :duration-in-minutes
      :=box-visible true
      :=color :rideable-type}))

8.6 Robust statistics

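Outliers can have a significant influence on some summary values, such as the mean or the standard deviation. This is why we cleaned the bike trips data in the first place. Outliers can also be removed systematically by trimming: dropping a fixed fraction of the most extreme values at each end of the distribution. With its default setting, stats/trim cuts at the 0.2 and 0.8 quantiles, which is why roughly 60% of the 426,590 trips remain in the output below.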

(-> preprocessed-trips
    :duration-in-minutes
    stats/trim
    stats/stats-map)
{:MAD 2.8500000000000014,
 :UOF 30.61666666666668,
 :Skewness 0.5819050204623866,
 :Max 18.91666666666667,
 :Variance 15.080367496645339,
 :Size 256246,
 :LAV 4.35,
 :UIF 21.541666666666675,
 :Mode 5.25,
 :Mean 9.701546040393245,
 :Q1 6.416666666666667,
 :Q3 12.46666666666667,
 :Min 4.35,
 :LIF -2.6583333333333377,
 :Range 14.566666666666672,
 :Total 2485982.3666666076,
 :SD 3.8833448851016747,
 :IQR 6.050000000000003,
 :Outliers (),
 :UAV 18.91666666666667,
 :LOF -11.733333333333341,
 :SEM 0.007671449228984798,
 :Kurtosis -0.6971061581307603,
 :Median 8.916666666666668}
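To illustrate what trimming does, here is a minimal plain-Clojure sketch that keeps only the values between a lower and an upper quantile. It is not the fastmath implementation: the nearest-rank quantile rule, the function names, and the sample values are assumptions chosen to keep the example short.

```clojure
(defn nearest-rank-quantile
  "Quantile `q` (0 < q <= 1) of `values` by the nearest-rank method."
  [values q]
  (let [sorted (vec (sort values))
        rank (long (Math/ceil (double (* q (count sorted)))))]
    (nth sorted (dec (max 1 rank)))))

(defn trim-sketch
  "Keep only the values between the q and (1 - q) quantiles, inclusive."
  [values q]
  (let [lo (nearest-rank-quantile values q)
        hi (nearest-rank-quantile values (- 1.0 q))]
    (filter #(<= lo % hi) values)))

;; Hypothetical sample with one extreme value at each end.
(def sample [-9.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 900.0])

(def trimmed (trim-sketch sample 0.2))
```

The mean of `sample` is 92.7, dominated by the single value 900.0, while the mean of `trimmed` is 4.0.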

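A similar approach is winsorizing the data: instead of removing the extreme values, we replace them with the most extreme remaining values, so the sample size is unchanged. In the output below, :Size stays at 426590, while :Min and :Max match those of the trimmed data.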

(-> preprocessed-trips
    :duration-in-minutes
    stats/winsor
    stats/stats-map)
{:MAD 4.566666666666668,
 :UOF 49.61666666666669,
 :Skewness 0.4229671561020442,
 :Max 18.91666666666667,
 :Variance 31.141867536020854,
 :Size 426590,
 :LAV 4.35,
 :UIF 32.89166666666668,
 :Mode 4.35,
 :Mean 10.475397219814083,
 :Q1 5.016666666666667,
 :Q3 16.16666666666667,
 :Min 4.35,
 :LIF -11.708333333333343,
 :Range 14.566666666666672,
 :Total 4468699.700000489,
 :SD 5.580489901076863,
 :IQR 11.150000000000006,
 :Outliers (),
 :UAV 18.91666666666667,
 :LOF -28.43333333333335,
 :SEM 0.0085441131523983,
 :Kurtosis -1.3714377426823943,
 :Median 8.916666666666668}
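Winsorizing can be sketched in plain Clojure as clamping every value to the range between two quantiles. As before, this is an illustration under our own assumptions (nearest-rank quantiles, hypothetical function names, made-up sample values), not the fastmath implementation.

```clojure
(defn nearest-rank-quantile
  "Quantile `q` (0 < q <= 1) of `values` by the nearest-rank method."
  [values q]
  (let [sorted (vec (sort values))
        rank (long (Math/ceil (double (* q (count sorted)))))]
    (nth sorted (dec (max 1 rank)))))

(defn winsorize-sketch
  "Clamp every value to the range between the q and (1 - q) quantiles."
  [values q]
  (let [lo (nearest-rank-quantile values q)
        hi (nearest-rank-quantile values (- 1.0 q))]
    (map #(-> % (max lo) (min hi)) values)))

;; Hypothetical sample with one extreme value at each end.
(def sample [-9.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 900.0])

(def winsorized (winsorize-sketch sample 0.2))
```

Unlike trimming, the result has as many elements as the input: the values -9.0, 8.0 and 900.0 are replaced by the nearest cut point rather than dropped.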

Note that summary statistics such as the median, the quartiles, and the MAD are robust to outliers as well.

8.7 Significance testing

We can test whether the mean bike trip duration differs significantly from a given value with a one-sample Student's t-test. By default, the significance level is :alpha = 0.05 and the hypothesized mean is :mu = 0.

(-> clean-trips
    :duration-in-minutes
    (stats/t-test-one-sample))
{:stat 834.9956089507258,
 :confidence-interval [10.777576889155512 10.828291948161157],
 :n 404180,
 :p-value 0.0,
 :df 404179,
 :level 0.95,
 :test-type :two-sided,
 :estimate 10.802934418658335,
 :alpha 0.05,
 :t 834.9956089507258,
 :mu 0.0,
 :stderr 0.01293771404646492}

:alpha and :mu can be changed as follows:

(-> clean-trips
    :duration-in-minutes
    (stats/t-test-one-sample {:alpha 0.01 :mu 10.75}))
{:stat 4.0914815761288885,
 :confidence-interval [10.769608918318076 10.836259918998595],
 :n 404180,
 :p-value 4.287090047250253E-5,
 :df 404179,
 :level 0.99,
 :test-type :two-sided,
 :estimate 10.802934418658335,
 :alpha 0.01,
 :t 4.0914815761288885,
 :mu 10.75,
 :stderr 0.01293771404646492}
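The :t value in both outputs follows from the standard one-sample formula: the difference between the estimate and :mu, divided by the standard error. The sketch below recomputes it from the printed :estimate and :stderr values; `t-statistic` is a hypothetical helper, not a fastmath function.

```clojure
(defn t-statistic
  "One-sample t statistic: (estimate - mu) / stderr."
  [estimate mu stderr]
  (/ (- estimate mu) stderr))

;; Values copied from the test outputs above.
(def estimate 10.802934418658335)
(def stderr 0.01293771404646492)

(def t-mu-0 (t-statistic estimate 0.0 stderr))
(def t-mu-10-75 (t-statistic estimate 10.75 stderr))
```

Up to rounding, `t-mu-0` and `t-mu-10-75` reproduce the two :t values printed above. With more than 400,000 trips the standard error is tiny, so even the small gap between 10.75 and the sample mean of about 10.80 minutes yields a very small p-value.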

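We can compare the means of two groups with a two-sample t-test. Let us compare the mean duration of trips on an electric bike (the first sample) with that of trips on a classic bike (the second sample). As the :equal-variances? false entry in the output shows, the default does not assume equal variances, which corresponds to Welch's t-test. The negative :estimate means that electric-bike trips are about two minutes shorter on average.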

(stats/t-test-two-samples
 (-> clean-trips
     (tc/select-rows #(= "electric_bike" (:rideable-type %)))
     :duration-in-minutes)
 (-> clean-trips
     (tc/select-rows #(= "classic_bike" (:rideable-type %)))
     :duration-in-minutes))
{:estimated-mu [9.8756963069081 11.888455175200976],
 :stat -74.81863109560089,
 :confidence-interval [-2.065485716197491 -1.9600320203882604],
 :n [240124 158361],
 :p-value 0.0,
 :df 304287.46411660506,
 :level 0.95,
 :test-type :two-sided,
 :nx 240124,
 :equal-variances? false,
 :sides :two-sided,
 :estimate -2.0127588682928756,
 :alpha 0.05,
 :t -74.81863109560089,
 :mu 0.0,
 :ny 158361,
 :stderr 0.026901840341358767,
 :paired? false}
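For reference, here is a minimal plain-Clojure sketch of the unequal-variances (Welch) statistic and its Welch-Satterthwaite degrees of freedom, applied to two tiny made-up samples. The function names and the data are ours; this is the textbook formula, not a copy of the fastmath implementation.

```clojure
(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn sample-variance
  "Unbiased sample variance (divides by n - 1)."
  [xs]
  (let [m (mean xs)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec (count xs)))))

(defn welch-sketch
  "Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."
  [xs ys]
  (let [nx (count xs)
        ny (count ys)
        vx-n (/ (sample-variance xs) nx)   ; variance of the mean of xs
        vy-n (/ (sample-variance ys) ny)   ; variance of the mean of ys
        stderr (Math/sqrt (double (+ vx-n vy-n)))
        estimate (- (mean xs) (mean ys))]
    {:estimate estimate
     :stderr stderr
     :t (/ estimate stderr)
     :df (/ (* (+ vx-n vy-n) (+ vx-n vy-n))
            (+ (/ (* vx-n vx-n) (dec nx))
               (/ (* vy-n vy-n) (dec ny))))}))

;; Hypothetical samples (illustrative values only).
(def result (welch-sketch [1.0 2.0 3.0 4.0 5.0] [2.0 4.0 6.0 8.0]))
```

The degrees of freedom are generally not an integer under this formula, which matches the fractional :df in the output above.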
source: notebooks/noj_book/statistics_intro.clj