9  Working with distributions

Author: Cvetomir Dimov

We often need to fit distributions to describe the values of variables in our data. In Clojure, the workhorse for this kind of work is the fitdistr package, which is based on the similarly named R package but built on top of the Clojure library fastmath. In this chapter, we show how to fit distributions with fitdistr.

First, let us load all necessary libraries.

(ns noj-book.distribution-fitting
  (:require [tablecloth.api :as tc]
            [fitdistr.core :as fd]
            [fitdistr.distributions :as fdd]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.tableplot.v1.plotly :as plotly]))

9.1 Using distributions

Distributions are generated with the function distribution. Distribution parameter values are specified in a map.

(def a-normal-distribution (fd/distribution :normal {:mu 4 :sd 9}))

The probability density function (pdf) and the cumulative distribution function (cdf) of the distribution can be evaluated at any value.

(fd/pdf a-normal-distribution 22)
0.005998996279243116
(fd/cdf a-normal-distribution 22)
0.9772498680518208
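
As a sanity check on the density above, the normal pdf can also be computed by hand from its closed-form expression. The sketch below uses only Clojure and Java's Math class; normal-pdf is a hypothetical helper introduced for illustration and is not part of fitdistr.

```clojure
;; Hand-computed normal density, for comparison with fd/pdf.
;; normal-pdf is a hypothetical helper, not a fitdistr function.
(defn normal-pdf
  "Density at x of a normal distribution with mean mu and standard deviation sd."
  [mu sd x]
  (let [z (/ (- x mu) sd)]                     ; standardized distance from the mean
    (/ (Math/exp (* -0.5 z z))                 ; exp(-z^2 / 2)
       (* sd (Math/sqrt (* 2 Math/PI))))))     ; sd * sqrt(2 * pi)

;; Same parameters and evaluation point as a-normal-distribution above.
(normal-pdf 4 9 22)
```

The value should agree with the fd/pdf result above (about 0.006) up to floating-point rounding.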

Note that the function probability returns the density (pdf) for continuous distributions and the probability mass for discrete ones. For a continuous distribution like the normal, its output is therefore the same as that of pdf.

(fd/probability a-normal-distribution 4)
0.044326920044603625
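
For a discrete distribution, the same function returns an actual probability. The following is a minimal sketch, assuming the :binomial parameter names :trials and :p listed later in this chapter; the name a-binomial-distribution is introduced here for illustration only.

```clojure
(require '[fitdistr.core :as fd])

;; A discrete example: the number of heads in 10 fair coin flips.
(def a-binomial-distribution
  (fd/distribution :binomial {:trials 10 :p 0.5}))

;; Probability of exactly 5 heads.
;; Analytically this is C(10,5) / 2^10 = 252/1024, roughly 0.246.
(fd/probability a-binomial-distribution 5)
```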

Distributions can be sampled with the ->seq function, which takes a distribution and sample size as inputs, and returns a sequence of random values.

(fd/->seq a-normal-distribution 4)
(0.8502080509548167
 -20.706679039107993
 5.1432446012264785
 9.300730144967122)
(-> (tc/dataset {:some-data (fd/->seq a-normal-distribution 10000)})
    (plotly/layer-histogram
     {:=x :some-data
      :=histnorm "count"
      :=histogram-nbins 100}))
(-> (tc/dataset {:x-data (fd/->seq a-normal-distribution 1000)
                 :y-data (fd/->seq (fd/distribution :gamma {:scale 2 :shape 2}) 1000)})
    (plotly/layer-point
     {:=x :x-data
      :=y :y-data}))
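
Besides plotting, a quick numerical check is to compare summary statistics of a sample with the parameters of the distribution it was drawn from. The sketch below is only illustrative: sample-mean and sample-sd are hypothetical helpers written in plain Clojure, not fitdistr functions.

```clojure
(require '[fitdistr.core :as fd])

;; Hypothetical helpers for illustration; not part of fitdistr.
(defn sample-mean [xs]
  (/ (reduce + xs) (count xs)))

(defn sample-sd [xs]
  (let [m (sample-mean xs)
        squared-deviations (map #(let [d (- % m)] (* d d)) xs)]
    ;; Unbiased estimate: divide by n - 1.
    (Math/sqrt (/ (reduce + squared-deviations) (dec (count xs))))))

;; Draw a large sample from the same normal distribution as above.
(def a-sample
  (fd/->seq (fd/distribution :normal {:mu 4 :sd 9}) 10000))

;; Both values vary from run to run but should be close to 4 and 9.
{:mean (sample-mean a-sample)
 :sd   (sample-sd a-sample)}
```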

9.2 Fitting distributions

The function fit fits a distribution to data and is called as (fit method distribution data params). Multiple methods are supported, including :mle (maximum likelihood estimation), :mme (method of moments), and :ad (Anderson-Darling); see fit’s reference for the full list. Let us fit a normal distribution to a sample from the similarly shaped logistic distribution:

(def fitted-distribution
  (fd/fit :mle :normal (fd/->seq (fd/distribution :logistic {:mu 3 :s 4}) 1000)))

The fitted distribution contains goodness-of-fit statistics, the estimated parameter values, and the distribution name.

fitted-distribution
{:stats
 {:mle -3377.651804257699,
  :aic 6759.303608515398,
  :bic 6769.1191190733625},
 :params {:mu 2.829171391243558, :sd 7.090198030074224},
 :distribution-name :normal,
 :distribution
 #object[org.apache.commons.math3.distribution.NormalDistribution 0x43686eff "org.apache.commons.math3.distribution.NormalDistribution@43686eff"],
 :method :mle}

It can be converted back to a distribution and used in all the ways listed above.

(fd/->distribution fitted-distribution)
#object[org.apache.commons.math3.distribution.NormalDistribution 0x43686eff "org.apache.commons.math3.distribution.NormalDistribution@43686eff"]
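
One way to use the goodness-of-fit statistics is to fit several candidate distributions to the same data and compare their AIC values, where lower is better. The sketch below assumes that :logistic can be fitted with :mle in the same way as :normal; the names logistic-sample, candidate-fits, aic-by-candidate, and fitted-cdf-at-3 are introduced here for illustration.

```clojure
(require '[fitdistr.core :as fd])

;; One shared sample, so that the candidate fits are comparable.
(def logistic-sample
  (fd/->seq (fd/distribution :logistic {:mu 3 :s 4}) 1000))

;; Fit each candidate distribution by maximum likelihood.
(def candidate-fits
  (into {}
        (for [candidate [:normal :logistic]]
          [candidate (fd/fit :mle candidate logistic-sample)])))

;; Extract the AIC of each fit; the actual values depend on the random sample.
(def aic-by-candidate
  (into {}
        (map (fn [[candidate fitted]]
               [candidate (get-in fitted [:stats :aic])]))
        candidate-fits))

;; A fitted distribution can be converted back and queried like any other.
(def fitted-cdf-at-3
  (fd/cdf (fd/->distribution (:normal candidate-fits)) 3))

aic-by-candidate
```

Since the sample comes from a logistic distribution, the logistic fit would usually be expected to have the lower AIC, but with random data this is not guaranteed on every run.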

9.3 Supported distributions

Currently, fitdistr supports a large number of distributions.

(-> fdd/distribution-data
    methods
    keys
    count)
41

Here is a full list of them:

(-> fdd/distribution-data
    methods
    keys
    sort)
(:bb
 :bernoulli
 :beta
 :binomial
 :cauchy
 :chi
 :chi-squared
 :chi-squared-noncentral
 :erlang
 :exponential
 :f
 :fatigue-life
 :frechet
 :gamma
 :geometric
 :gumbel
 :half-normal
 :hyperbolic-secant
 :inverse-gamma
 :inverse-gaussian
 :johnson-sb
 :johnson-sl
 :johnson-su
 :laplace
 :levy
 :log-logistic
 :log-normal
 :logarithmic
 :logistic
 :nakagami
 :negative-binomial
 :normal
 :pareto
 :pascal
 :pearson-6
 :poisson
 :power
 :rayleigh
 :t
 :triangular
 :weibull)
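
The same lookup can be used to check programmatically whether a given distribution is supported before trying to fit it. This is a small sketch built on the methods call shown above; supported-distribution? is a hypothetical helper, not a fitdistr function.

```clojure
(require '[fitdistr.distributions :as fdd])

;; Hypothetical helper: true when fitdistr has distribution data
;; registered for the given keyword.
(defn supported-distribution? [distribution-name]
  (contains? (methods fdd/distribution-data) distribution-name))

(supported-distribution? :weibull)
(supported-distribution? :not-a-distribution)
```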

Their parameter names follow the usual conventions for each distribution. A complete list can be found in the documentation. The following two sections list the most common ones.

9.3.1 Commonly used discrete probability distributions

Distribution                      fitdistr name         Parameters
Binomial distribution             :binomial             :p, :trials
Bernoulli distribution            :bernoulli            :p
Geometric distribution            :geometric            :p
Poisson distribution              :poisson              :p
Negative binomial distribution    :negative-binomial    :r, :p

9.3.2 Commonly used continuous probability distributions

Distribution                      fitdistr name         Parameters
Normal distribution               :normal               :mu, :sd
Student’s t distribution          :t                    :degrees-of-freedom
Beta distribution                 :beta                 :alpha, :beta
Logistic distribution             :logistic             :mu, :s
Chi-squared distribution          :chi-squared          :degrees-of-freedom
Gamma distribution                :gamma                :scale, :shape
Weibull distribution              :weibull              :alpha, :beta
Log-normal distribution           :log-normal           :scale, :shape
source: notebooks/noj_book/distribution_fitting.clj