9 Working with distributions
Author: Cvetomir Dimov
We often need to fit distributions to describe the values of a variable in our data. In Clojure, the workhorse for this task is the fitdistr library, which is modeled on the similarly named R package but built on top of the Clojure library fastmath. In this chapter, we show how to create, use, and fit distributions with fitdistr.
First, let us load all necessary libraries.
(ns noj-book.distribution-fitting
  (:require [tablecloth.api :as tc]
            [fitdistr.core :as fd]
            [fitdistr.distributions :as fdd]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.tableplot.v1.plotly :as plotly]))
9.1 Using distributions
Distributions are created with the function distribution. Parameter values are specified in a map.
(def a-normal-distribution (fd/distribution :normal {:mu 4 :sd 9}))
The probability density function (pdf) and the cumulative distribution function (cdf) of the distribution can be evaluated at any value.
(fd/pdf a-normal-distribution 22)
0.005998996279243116
(fd/cdf a-normal-distribution 22)
0.9772498680518208
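The pdf value above can be checked by hand. The following is a minimal sketch in plain Clojure, assuming the textbook formula for the normal density; normal-pdf is a hypothetical helper written for this illustration and is not part of fitdistr.

```clojure
;; Hypothetical helper (not part of fitdistr): the textbook normal density
;; f(x) = exp(-z^2 / 2) / (sd * sqrt(2 * pi)), where z = (x - mu) / sd.
(defn normal-pdf [x mu sd]
  (let [z (/ (- x mu) sd)]
    (/ (Math/exp (* -0.5 z z))
       (* sd (Math/sqrt (* 2 Math/PI))))))

;; Same parameters as a-normal-distribution: mu = 4, sd = 9.
;; Here z = (22 - 4) / 9 = 2, so the result is approximately 0.005999.
(normal-pdf 22 4 9)
```

Because 22 lies exactly two standard deviations above the mean, the cdf value 0.9772... is the familiar standard normal probability of falling below z = 2.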
Note that probability returns the density (pdf) for continuous distributions and the probability mass for discrete ones. For a continuous distribution such as the normal, its output is therefore the same as that of pdf.
(fd/probability a-normal-distribution 4)
0.044326920044603625
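For a discrete distribution, the quantity in question is a probability mass rather than a density. As a rough illustration, here is a sketch of the binomial probability mass function in plain Clojure; choose and binomial-pmf are hypothetical helpers for this example and do not come from fitdistr.

```clojure
;; Hypothetical helpers (not part of fitdistr).

;; Binomial coefficient C(n, k), computed with exact integer/ratio arithmetic.
(defn choose [n k]
  (reduce (fn [acc i] (/ (* acc (- n i)) (inc i)))
          1
          (range k)))

;; P(X = k) for X ~ Binomial(trials, p).
(defn binomial-pmf [k trials p]
  (* (choose trials k)
     (Math/pow p k)
     (Math/pow (- 1.0 p) (- trials k))))

;; Probability of exactly 5 successes in 10 fair trials: 252 / 1024.
(binomial-pmf 5 10 0.5)
;; => 0.24609375
```

Unlike a density, this value is itself a probability, and summing it over all k from 0 to trials gives 1.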
Distributions can be sampled with the ->seq function, which takes a distribution and a sample size as inputs and returns a sequence of random values.
(fd/->seq a-normal-distribution 4)
(0.8502080509548167
20.706679039107993
-5.1432446012264785
9.300730144967122)
(-> (tc/dataset {:some-data (fd/->seq a-normal-distribution 10000)})
    (plotly/layer-histogram
     {:=x :some-data
      :=histnorm "count"
      :=histogram-nbins 100}))

(-> (tc/dataset {:x-data (fd/->seq a-normal-distribution 1000)
                 :y-data (fd/->seq (fd/distribution :gamma {:scale 2 :shape 2}) 1000)})
    (plotly/layer-point
     {:=x :x-data
      :=y :y-data}))
9.2 Fitting distributions
The function fit fits a distribution to data. It is called as (fit method distribution data params). Multiple methods are supported, including :mle (maximum likelihood estimation), :mme (method of moments), and :ad (Anderson-Darling); see the reference for fit for the full list. Let us fit a normal distribution to a sample from the similarly shaped logistic distribution:
(def fitted-distribution
  (fd/fit :mle :normal (fd/->seq (fd/distribution :logistic {:mu 3 :s 4}) 1000)))
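To build intuition for what maximum likelihood estimation produces in the normal case, note that the estimates have a textbook closed form: the sample mean, and the square root of the mean squared deviation (dividing by n, not n - 1). The sketch below implements that closed form in plain Clojure. It is an illustration only; normal-mle is a hypothetical helper, and fitdistr may well arrive at its estimates differently, for example by numerical optimization.

```clojure
;; Hypothetical helper (not part of fitdistr): closed-form maximum
;; likelihood estimates for a normal distribution.
(defn normal-mle [xs]
  (let [n      (count xs)
        mu     (/ (reduce + xs) (double n))
        sq-dev (map #(let [d (- % mu)] (* d d)) xs)
        ;; The ML estimate divides by n, so it is slightly smaller
        ;; than the usual (n - 1) sample standard deviation.
        sd     (Math/sqrt (/ (reduce + sq-dev) n))]
    {:mu mu :sd sd}))

(normal-mle [2 4 4 4 5 5 7 9])
;; => {:mu 5.0, :sd 2.0}
```

Applied to the same sample that is passed to fit, this helper should give values close to the fitted :mu and :sd, up to numerical tolerance.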
The fitted distribution contains goodness-of-fit statistics, the estimated parameter values, and the distribution name.
fitted-distribution
{:stats
 {:mle -3377.651804257699,
  :aic 6759.303608515398,
  :bic 6769.1191190733625},
 :params {:mu 2.829171391243558, :sd 7.090198030074224},
 :distribution-name :normal,
 :distribution
 #object[org.apache.commons.math3.distribution.NormalDistribution 0x43686eff "org.apache.commons.math3.distribution.NormalDistribution@43686eff"],
 :method :mle}
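The values under :stats are related to each other. Assuming that :mle holds the maximized log-likelihood and that :aic and :bic follow the usual definitions (AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L), the reported numbers can be reproduced with k = 2 estimated parameters and n = 1000 observations. The helpers aic and bic below are hypothetical and written only for this check.

```clojure
;; Hypothetical helpers (not part of fitdistr), using the standard
;; definitions of the two information criteria.
(defn aic [log-lik k]
  (- (* 2 k) (* 2 log-lik)))

(defn bic [log-lik k n]
  (- (* k (Math/log n)) (* 2 log-lik)))

;; Log-likelihood taken from the :stats map above;
;; k = 2 parameters (:mu and :sd), n = 1000 sampled values.
(aic -3377.651804257699 2)       ; approximately 6759.3036
(bic -3377.651804257699 2 1000)  ; approximately 6769.1191
```

Lower values of either criterion indicate a better trade-off between fit and model complexity, which makes them convenient for comparing candidate distributions fitted to the same data.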
It can be converted back to a distribution and used in all the ways listed above.
(fd/->distribution fitted-distribution)
#object[org.apache.commons.math3.distribution.NormalDistribution 0x43686eff "org.apache.commons.math3.distribution.NormalDistribution@43686eff"]
9.3 Supported distributions
Currently, fitdistr supports a large number of distributions.
(-> fdd/distribution-data
    methods
    keys
    count)
41
Here is a full list of them:
(-> fdd/distribution-data
    methods
    keys
    sort)
(:bb
:bernoulli
:beta
:binomial
:cauchy
:chi
:chi-squared
:chi-squared-noncentral
:erlang
:exponential
:f
:fatigue-life
:frechet
:gamma
:geometric
:gumbel
:half-normal
:hyperbolic-secant
:inverse-gamma
:inverse-gaussian
:johnson-sb
:johnson-sl
:johnson-su
:laplace
:levy
:log-logistic
:log-normal
:logarithmic
:logistic
:nakagami
:negative-binomial
:normal
:pareto
:pascal
:pearson-6
:poisson
:power
:rayleigh
:t
:triangular
:weibull)
Parameter names mostly follow the conventional notation for each distribution. A complete list can be found in the documentation. The following two sections list the most common ones.
9.3.1 Commonly used discrete probability distributions
distribution | fitdistr name | parameters |
---|---|---|
Binomial distribution | :binomial | :p :trials |
Bernoulli distribution | :bernoulli | :p |
Geometric distribution | :geometric | :p |
Poisson distribution | :poisson | :p |
Negative binomial distribution | :negative-binomial | :r :p |
9.3.2 Commonly used continuous probability distributions
distribution | fitdistr name | parameters |
---|---|---|
Normal distribution | :normal | :mu :sd |
Student’s t distribution | :t | :degrees-of-freedom |
Beta distribution | :beta | :alpha :beta |
Logistic distribution | :logistic | :mu :s |
Chi-squared distribution | :chi-squared | :degrees-of-freedom |
Gamma distribution | :gamma | :scale :shape |
Weibull distribution | :weibull | :alpha :beta |
Log-normal distribution | :log-normal | :scale :shape |