8 Distributions of single variables - DRAFT 🛠

author: Cvetomir Dimov and Daniel Slutsky

last change: 2025-02-24

In this tutorial, we will demonstrate how to visualize and summarize variables.

8.1 Setup and data

We will use the Chicago bike trips dataset, which is read and processed in the Intro to Table Processing with Tablecloth tutorial.

(ns noj-book.statistics-intro
  (:require [tablecloth.api :as tc]
            [noj-book.tablecloth-table-processing]
            [scicloj.tableplot.v1.plotly :as plotly]
            [fastmath.stats :as stats]))

(def preprocessed-trips
  noj-book.tablecloth-table-processing/preprocessed-trips)

8.2 Checking basic statistics of variables in a dataset

(-> preprocessed-trips
    (tc/select-columns [:hour :duration-in-seconds])
    tc/info)

202304-divvy-tripdata.zip: descriptive-stats [2 11]:

:col-name	:datatype	:n-valid	:n-missing	:min	:mean	:max	:standard-deviation	:skew	:first	:last
:hour	:int64	426590	0	0.0	14.23275276	23.0	4.84054550	-0.58147195	8	8
:duration-in-seconds	:int64	426590	0	-536.0	1032.64061511	1103729.0	8532.49143608	78.01094578	249	123

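We see that the duration in seconds has some unreasonable values: the minimum is negative (-536 seconds), and the maximum of 1,103,729 seconds corresponds to a trip of more than 12 days.

Let us check how frequent such values are, counting any trip longer than two hours as unreasonably long.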

(defn duration-diagnostics [{:keys [duration-in-minutes]}]
  {:negative-duration (neg? duration-in-minutes)
   :unreasonably-long-duration (> duration-in-minutes (* 2 60))})

(-> preprocessed-trips
    (tc/group-by duration-diagnostics)
    (tc/aggregate {:trips tc/row-count}))

_unnamed [3 3]:

:negative-duration	:unreasonably-long-duration	:trips
false	false	423924
false	true	2662
true	false	4
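
To put these counts in proportion, here is a minimal sketch that turns them into shares of all trips. The counts are copied from the table above; the names trip-counts, total-trips, and trip-shares are ours and purely illustrative.

```clojure
;; Counts copied from the aggregated table above.
(def trip-counts
  {:reasonable 423924
   :unreasonably-long 2662
   :negative 4})

;; Total number of trips across the three categories.
(def total-trips
  (reduce + (vals trip-counts)))

;; Share of each category as a fraction of all trips.
(def trip-shares
  (into {}
        (map (fn [[category n]]
               [category (double (/ n total-trips))]))
        trip-counts))
```

Unreasonably long trips make up well under 1% of the data, and negative durations are rarer still.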

8.3 Data cleaning

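Let us keep only trips lasting between 0 and 40 minutes. Note that this upper bound is stricter than the two-hour threshold used in the diagnostics above.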

(def clean-trips
  (-> preprocessed-trips
      (tc/select-rows (fn [{:keys [duration-in-minutes]}]
                        (<= 0
                            duration-in-minutes
                            40)))))
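
The filter relies on Clojure's chained comparison. As a small sketch on made-up durations (not taken from the dataset), we can see that both bounds are inclusive:

```clojure
;; Hypothetical durations in minutes, for illustration only.
(def example-durations [-8.9 0.0 4.2 40.0 40.5 125.0])

;; (<= 0 x 40) holds only when 0 <= x and x <= 40,
;; so the values 0.0 and 40.0 are both kept.
(def kept-durations
  (filterv #(<= 0 % 40) example-durations))
```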

8.4 Visually exploring the distribution of variables

The distribution of start hour:

(-> clean-trips
    (tc/group-by [:hour])
    (tc/aggregate {:n tc/row-count})
    (plotly/layer-bar {:=x :hour
                       :=y :n}))

The distribution of trip duration. We use a histogram, which splits the range of values into bins and counts the trips falling into each bin.

(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100}))
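
To make the idea of binning and counting concrete, here is a hand-rolled sketch on a few made-up durations. The bin width of 4 and the helper bin-start are our own choices for illustration; the plotting library picks its own bin edges, which may differ.

```clojure
;; Hypothetical durations in minutes, for illustration only.
(def example-durations [0.5 1.2 3.9 4.0 4.1 7.5 11.0 11.9])

;; Width of each bin, chosen arbitrarily for this sketch.
(def bin-width 4)

;; Left edge of the bin [k*width, (k+1)*width) that contains x.
(defn bin-start [x]
  (* bin-width (long (Math/floor (/ x bin-width)))))

;; Number of values per bin, keyed by the left edge of the bin.
(def bin-counts
  (into (sorted-map)
        (frequencies (map bin-start example-durations))))
```

Here the bins starting at 0, 4, and 8 contain 3, 3, and 2 values respectively; a histogram draws one bar per bin with these heights.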

The distribution of trip duration in different parts of the day:

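(-> clean-trips
    ;; Label each trip by the part of the day in which it started.
    ;; `cond` returns the first matching branch, so hour 12 counts as
    ;; :morning and hour 18 counts as :afternoon.
    (tc/map-columns :day-part
                    [:hour]
                    (fn [hour]
                      (cond (<= 6 hour 12) :morning
                            (<= 12 hour 18) :afternoon
                            (<= 18 hour 23) :evening
                            :else :night)))
    ;; Overlay one semi-transparent histogram per part of the day.
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :day-part
                             :=mark-opacity 0.8}))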
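
The ranges in the cond above overlap at hours 12 and 18. As a quick sketch, we can extract the same logic into a standalone function (the name day-part is ours) and check how the boundary hours are classified:

```clojure
;; Same classification as in the pipeline above, as a plain function.
(defn day-part [hour]
  (cond (<= 6 hour 12) :morning
        (<= 12 hour 18) :afternoon
        (<= 18 hour 23) :evening
        :else :night))

;; Classify a few hours, including the boundaries 6, 12, 18, and 23.
(def boundary-labels
  (mapv day-part [0 5 6 12 13 18 19 23]))
```

Because cond stops at the first true test, hour 12 is labelled :morning and hour 18 is labelled :afternoon.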

TODO: Use density estimates rather than histograms here.

The distribution of trip duration for different bike types:

(-> clean-trips
    (plotly/layer-histogram {:=x :duration-in-minutes
                             :=histogram-nbins 100
                             :=color :rideable-type
                             :=mark-opacity 0.8}))

8.5 Describing the distribution of a continuous variable

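The fastmath library contains functions for computing the most common measures of sample central tendency and variability (see the Statistics section). We will demonstrate a few of them on the trip duration in seconds. In the code below, :cov-duration is the coefficient of variation, computed with stats/variation.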

(-> clean-trips
    (tc/aggregate {:n tc/row-count
                   :mean-duration #(-> % :duration-in-seconds stats/mean)
                   :min-duration #(-> % :duration-in-seconds stats/minimum)
                   :max-duration #(-> % :duration-in-seconds stats/maximum)
                   :median-duration #(-> % :duration-in-seconds stats/median)
                   :q1-duration #(-> % :duration-in-seconds stats/stats-map :Q1)
                   :percentile10-duration #(-> % :duration-in-seconds (stats/percentile 10))
                   :sd-duration #(-> % :duration-in-seconds stats/stddev)
                   :cov-duration #(-> % :duration-in-seconds stats/variation)
                   :mad-duration #(-> % :duration-in-seconds stats/mad)
                   :iqr-duration #(-> % :duration-in-seconds stats/iqr)}))

_unnamed [1 11]:

:percentile10-duration	:n	:mean-duration	:max-duration	:min-duration	:sd-duration	:median-duration	:mad-duration	:q1-duration	:iqr-duration	:cov-duration
169.0	404180	648.17606512	2400.0	0.0	493.51028516	506.0	257.0	291.0	579.0	0.76138307
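
As a sanity check, the reported coefficient of variation should equal the standard deviation divided by the mean. The sketch below recomputes it from the values printed in the table; that stats/variation uses exactly this definition is our assumption, which the numbers support.

```clojure
;; Values copied from the output table above (in seconds).
(def sd-duration 493.51028516)
(def mean-duration 648.17606512)

;; Coefficient of variation: standard deviation relative to the mean.
(def coefficient-of-variation
  (/ sd-duration mean-duration))
```

The result agrees with the :cov-duration column (about 0.761), so the spread of trip durations is roughly three quarters of their mean.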

Note that the function stats-map outputs many of the most commonly used statistics, including the ones computed above. Here we apply it to the duration in minutes rather than in seconds.

(-> clean-trips
    :duration-in-minutes
    stats/stats-map
    (select-keys [:Size :Min :Max :Range :Mean :Median :Mode :Q1 :Q3
                  :SD :Variance :MAD :SEM :IQR :Kurtosis :Skewness]))

{:MAD 4.283333333333333,
 :Skewness 1.2669327507375887,
 :Max 40.0,
 :Variance 67.65344487828531,
 :Size 404180,
 :Mode 0.06666666666666667,
 :Mean 10.802934418658129,
 :Q1 4.85,
 :Q3 14.5,
 :Min 0.0,
 :Range 40.0,
 :SD 8.225171419386061,
 :IQR 9.65,
 :SEM 0.01293771404646492,
 :Kurtosis 1.2317123730642425,
 :Median 8.433333333333334}
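
Several of these entries are tied together by simple identities. The following sketch recomputes three of them from the printed values; the names are ours, and the identities (variance as squared SD, SEM as SD over the square root of the sample size, IQR as Q3 minus Q1) are the standard textbook definitions rather than something taken from the fastmath documentation.

```clojure
;; Values copied from the stats-map output above (in minutes).
(def summary
  {:Size 404180
   :SD 8.225171419386061
   :Q1 4.85
   :Q3 14.5})

;; Variance is the square of the standard deviation.
(def variance
  (* (:SD summary) (:SD summary)))

;; Standard error of the mean: SD divided by sqrt(n).
(def sem
  (/ (:SD summary) (Math/sqrt (:Size summary))))

;; Interquartile range: distance between the third and first quartiles.
(def iqr
  (- (:Q3 summary) (:Q1 summary)))
```

All three agree with the :Variance, :SEM, and :IQR entries printed above, up to floating-point rounding.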

Box plots and violin plots can be used to visualize several of these summary statistics.

(-> clean-trips
    (plotly/layer-boxplot
     {:=y :duration-in-minutes}))

These can also be produced by group to compare the distributions across different groups.

(-> clean-trips
    (plotly/layer-violin
     {:=x :rideable-type 
      :=y :duration-in-minutes
      :=box-visible true
      :=color :rideable-type}))

8.6 Robust statistics

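Outliers can have a significant influence on some summary values, such as the mean or the standard deviation. This is why we cleaned the bike trips data in the first place. Outliers can also be removed systematically by trimming, that is, by discarding a fixed fraction of the most extreme values at both ends of the distribution. In the example below, stats/trim with its default settings keeps 256,246 of the 426,590 raw trips, roughly the middle 60%.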

(-> preprocessed-trips
    :duration-in-minutes
    stats/trim
    stats/stats-map)

{:MAD 2.8500000000000014,
 :UOF 30.61666666666668,
 :Skewness 0.5819050204623866,
 :Max 18.91666666666667,
 :Variance 15.080367496645339,
 :Size 256246,
 :LAV 4.35,
 :UIF 21.541666666666675,
 :Mode 5.25,
 :Mean 9.701546040393245,
 :Q1 6.416666666666667,
 :Q3 12.46666666666667,
 :Min 4.35,
 :LIF -2.6583333333333377,
 :Range 14.566666666666672,
 :Total 2485982.3666666076,
 :SD 3.8833448851016747,
 :IQR 6.050000000000003,
 :Outliers (),
 :UAV 18.91666666666667,
 :LOF -11.733333333333341,
 :SEM 0.007671449228984798,
 :Kurtosis -0.6971061581307603,
 :Median 8.916666666666668}
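
The full stats-map output also contains the keys :LIF, :UIF, :LOF, and :UOF. The numbers above are consistent with reading them as Tukey's lower and upper inner fences (1.5 IQR beyond the quartiles) and outer fences (3 IQR beyond the quartiles). This reading is inferred from the output rather than taken from the fastmath documentation, so treat the sketch below as a check of that assumption.

```clojure
;; Quartiles and IQR copied from the trimmed output above.
(def q1 6.416666666666667)
(def q3 12.46666666666667)
(def iqr 6.050000000000003)

;; Inner fences: 1.5 IQR below Q1 and above Q3.
(def lower-inner-fence (- q1 (* 1.5 iqr)))
(def upper-inner-fence (+ q3 (* 1.5 iqr)))

;; Outer fences: 3 IQR below Q1 and above Q3.
(def lower-outer-fence (- q1 (* 3.0 iqr)))
(def upper-outer-fence (+ q3 (* 3.0 iqr)))
```

The four results match :LIF, :UIF, :LOF, and :UOF above. Since every trimmed value lies within the inner fences, :Outliers is empty.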

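A similar approach is winsorizing the data. Instead of discarding the extreme values, it replaces them with the most extreme remaining values, so the sample size stays the same (426,590 in the output below) while the minimum and maximum match those of the trimmed data.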

(-> preprocessed-trips
    :duration-in-minutes
    stats/winsor
    stats/stats-map)

{:MAD 4.566666666666668,
 :UOF 49.61666666666669,
 :Skewness 0.4229671561020442,
 :Max 18.91666666666667,
 :Variance 31.141867536020854,
 :Size 426590,
 :LAV 4.35,
 :UIF 32.89166666666668,
 :Mode 4.35,
 :Mean 10.475397219814083,
 :Q1 5.016666666666667,
 :Q3 16.16666666666667,
 :Min 4.35,
 :LIF -11.708333333333343,
 :Range 14.566666666666672,
 :Total 4468699.700000489,
 :SD 5.580489901076863,
 :IQR 11.150000000000006,
 :Outliers (),
 :UAV 18.91666666666667,
 :LOF -28.43333333333335,
 :SEM 0.0085441131523983,
 :Kurtosis -1.3714377426823943,
 :Median 8.916666666666668}
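
The difference between trimming and winsorizing is easiest to see on a tiny example. In this sketch the cut-off bounds are fixed by hand; fastmath derives them from quantiles of the data, which we do not reproduce here. All names and values are illustrative.

```clojure
;; Hypothetical sample with one extreme value at each end.
(def example-values [1 5 6 7 8 9 50])

;; Hand-picked cut-off bounds for this sketch.
(def lower-bound 5)
(def upper-bound 9)

;; Trimming discards everything outside the bounds.
(def trimmed
  (filterv #(<= lower-bound % upper-bound) example-values))

;; Winsorizing clamps values outside the bounds to the nearest bound.
(def winsorized
  (mapv #(-> % (max lower-bound) (min upper-bound)) example-values))
```

Trimming leaves five values, while winsorizing keeps all seven and turns 1 into 5 and 50 into 9.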

Note that summary statistics such as the median, the quartiles, and the MAD are robust to outliers as well.
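
A small sketch illustrates this robustness. The helper functions mean, median, and mad are written by hand here for illustration (mad is the unscaled median absolute deviation from the median); they are not the fastmath implementations, and the data are made up.

```clojure
;; Arithmetic mean.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

;; Median: middle value, or average of the two middle values.
(defn median [xs]
  (let [sorted (vec (sort xs))
        n (count sorted)
        mid (quot n 2)]
    (if (odd? n)
      (double (nth sorted mid))
      (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2.0))))

;; Median absolute deviation from the median (unscaled).
(defn mad [xs]
  (let [m (median xs)]
    (median (map #(Math/abs (- % m)) xs))))

;; Two hypothetical samples that differ only in their last value.
(def typical [8 9 10 11 12])
(def with-outlier [8 9 10 11 1000])
```

Replacing 12 with 1000 moves the mean from 10.0 to 207.6, while the median stays at 10.0 and the MAD stays at 1.0.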

8.7 Significance testing

We can test whether the mean bike trip duration differs significantly from a given value with a one-sample Student's t-test. By default, the significance level is :alpha 0.05 and the reference value is :mu 0.

(-> clean-trips
    :duration-in-minutes
    (stats/t-test-one-sample))

{:stat 834.9956089507258,
 :confidence-interval [10.777576889155512 10.828291948161157],
 :n 404180,
 :p-value 0.0,
 :df 404179,
 :level 0.95,
 :test-type :two-sided,
 :estimate 10.802934418658335,
 :alpha 0.05,
 :t 834.9956089507258,
 :mu 0.0,
 :stderr 0.01293771404646492}

:alpha and :mu can be changed as follows:

(-> clean-trips
    :duration-in-minutes
    (stats/t-test-one-sample {:alpha 0.01 :mu 10.75}))

{:stat 4.0914815761288885,
 :confidence-interval [10.769608918318076 10.836259918998595],
 :n 404180,
 :p-value 4.287090047250253E-5,
 :df 404179,
 :level 0.99,
 :test-type :two-sided,
 :estimate 10.802934418658335,
 :alpha 0.01,
 :t 4.0914815761288885,
 :mu 10.75,
 :stderr 0.01293771404646492}
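
The reported t statistic can be reproduced by hand as the difference between the sample mean and :mu, divided by the standard error. The sketch below uses the :estimate and :stderr values printed above; the function name t-statistic is ours.

```clojure
;; Values copied from the test output above.
(def estimate 10.802934418658335)
(def stderr 0.01293771404646492)

;; One-sample t statistic for a hypothesised mean mu.
(defn t-statistic [mu]
  (/ (- estimate mu) stderr))

;; The two reference values used in the tests above.
(def t-for-default-mu (t-statistic 0.0))
(def t-for-custom-mu (t-statistic 10.75))
```

These reproduce the :t values of the two tests (about 835.0 and 4.09). With more than 400,000 observations the standard error is so small that even a difference of about 0.05 minutes from 10.75 is statistically significant.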

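We can compare the means of two groups with a two-sample t-test. Let us compare the mean duration of trips on electric bikes with that of trips on classic bikes. As the :equal-variances? false entry in the output shows, the test does not assume equal variances by default, which makes it Welch's variant of the t-test. The negative :estimate means that electric bike trips are about 2 minutes shorter on average.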

(stats/t-test-two-samples
 (-> clean-trips
     (tc/select-rows #(= "electric_bike" (:rideable-type %)))
     :duration-in-minutes)
 (-> clean-trips
     (tc/select-rows #(= "classic_bike" (:rideable-type %)))
     :duration-in-minutes))

{:estimated-mu [9.8756963069081 11.888455175200976],
 :stat -74.81863109560089,
 :confidence-interval [-2.065485716197491 -1.9600320203882604],
 :n [240124 158361],
 :p-value 0.0,
 :df 304287.46411660506,
 :level 0.95,
 :test-type :two-sided,
 :nx 240124,
 :equal-variances? false,
 :sides :two-sided,
 :estimate -2.0127588682928756,
 :alpha 0.05,
 :t -74.81863109560089,
 :mu 0.0,
 :ny 158361,
 :stderr 0.026901840341358767,
 :paired? false}
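
For reference, the standard textbook form of the unequal-variances (Welch) statistic and its approximate degrees of freedom is given below, where x and y denote the two samples. We assume, but have not verified against the fastmath source, that this is the computation behind the output; it does explain why :df is not an integer.

```latex
t = \frac{\bar{x} - \bar{y}}
         {\sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}\right)^{2}}
     {\dfrac{\left(s_x^2 / n_x\right)^{2}}{n_x - 1}
      + \dfrac{\left(s_y^2 / n_y\right)^{2}}{n_y - 1}}
```

In the output above, the numerator is :estimate (about -2.013), the denominator is :stderr (about 0.0269), and their ratio gives the reported :t of about -74.82.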

source: notebooks/noj_book/statistics_intro.clj