2 Intro

Here is our introduction to using PyMC from Clojure.

We follow the linear regression example from the Introductory Overview of PyMC.

2.1 Setup

Relevant Clojure namespaces:

(ns intro
  (:require [libpython-clj2.require :refer [require-python]]
            [libpython-clj2.python :refer [py. py.. py.-] :as py]
            [fastmath.random :as random]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tech.v3.datatype :as dtype]
            [scicloj.tableplot.v1.plotly :as plotly]
            [scicloj.kind-pyplot.v1.api :as pyplot]
            [scicloj.kindly.v4.kind :as kind]))

Relevant Python modules:

(require-python '[builtins :as python]
                'operator
                '[arviz :as az]
                '[arviz.style :as az.style]
                '[pandas :as pd]
                '[matplotlib.pyplot :as plt]
                '[numpy :as np]
                '[numpy.random :as np.random]
                '[pymc :as pm])
:ok

Some convenience functions to access Python idioms:

;; Pythonic `obj[entry]` lookup:
(defn brackets [obj entry]
  (py. obj __getitem__ entry))
;; Pythonic bare `:` (a full slice):
(def colon
  (python/slice nil nil))
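In Python terms, `brackets` and `colon` stand in for plain subscripting and the bare `:` slice. A minimal Python sketch of the idioms they emulate:

```python
# Plain-Python equivalents of the `brackets` and `colon` helpers above.
xs = [10, 20, 30, 40]

# `brackets` is subscripting via the __getitem__ protocol:
assert xs.__getitem__(2) == xs[2] == 30

# `colon` is a bare slice, i.e. slice(None, None) -- it selects everything:
everything = slice(None, None)
assert xs[everything] == xs[:] == [10, 20, 30, 40]
```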

Theme for ArViZ visualizations:

(az.style/use "arviz-darkgrid")
nil

2.2 Synthetic data

(def random-seed 8927)
(def dataset-size 101)
(def true-parameter-values
  {:alpha 1
   :sigma 1
   :beta [1 2.5]})

We will generate a dataset by the following recipe:

(defn gen-dataset [{:keys [size random-seed
                           alpha sigma beta]}]
  (let [rng (random/rng :isaac random-seed)]
    (-> {:x1 (take size (random/->seq rng))
         :x2 (-> (take size (random/->seq rng))
                 (tcc/* 0.2))}
        tc/dataset
        (tc/add-column :y
                       #(-> (tcc/+ alpha
                                   (tcc/* (beta 0) (:x1 %))
                                   (tcc/* (beta 1) (:x2 %))
                                   (tcc/* sigma
                                          (dtype/make-reader
                                           :float32 size (rand)))))))))
(def dataset
  (gen-dataset (merge {:random-seed random-seed
                       :size dataset-size}
                      true-parameter-values)))
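For comparison, here is the same recipe sketched directly in NumPy. This is only an illustration: the Clojure version draws from fastmath's Isaac RNG, so the concrete numbers will not match the table below.

```python
import numpy as np

# NumPy sketch of gen-dataset: same recipe, different RNG.
rng = np.random.default_rng(8927)
size = 101
alpha, sigma, beta = 1, 1, [1, 2.5]

x1 = rng.random(size)            # uniform draws, like random/->seq above
x2 = rng.random(size) * 0.2
y = alpha + beta[0] * x1 + beta[1] * x2 + sigma * rng.random(size)

assert x1.shape == x2.shape == y.shape == (size,)
```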
(tc/head dataset)

_unnamed [5 3]:

:x1 :x2 :y
0.55055707 0.12467207 2.45793549
0.33115609 0.12809006 2.47644528
0.75138350 0.16112505 2.79757416
0.84946778 0.04481401 2.02392108
0.48675832 0.13809987 2.70167490

Let us visualize our dataset:

(->> [:x1 :x2]
     (mapv (fn [x]
             (-> dataset
                 (plotly/layer-point
                  {:=x x
                   :=y :y}))))
     kind/fragment)

2.3 Using PyMC

pm/__version__
"5.17.0"

Let us define a Bayesian model for our data:
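Written out, the code below encodes the same model specification as the PyMC overview:

```latex
\begin{aligned}
\alpha &\sim \mathcal{N}(0,\, 10) \\
\beta_0, \beta_1 &\sim \mathcal{N}(0,\, 10) \\
\sigma &\sim \operatorname{HalfNormal}(1) \\
\mu &= \alpha + \beta_0 x_1 + \beta_1 x_2 \\
y &\sim \mathcal{N}(\mu,\, \sigma)
\end{aligned}
```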

(def basic-model (pm/Model))
(py/with [_ basic-model]
         (let [{:keys [x1 x2 y]} (-> dataset
                                     (update-vals np/array))
               alpha (pm/Normal "alpha"
                                :mu 0
                                :sigma 10)
               beta (pm/Normal "beta"
                               :mu 0
                               :sigma 10
                               :shape 2)
               sigma (pm/HalfNormal "sigma"
                                    :sigma 1)
               mu (operator/add alpha
                                (operator/mul (brackets beta 0)
                                              x1)
                                (operator/mul (brackets beta 1)
                                              x2))
               y_obs (pm/Normal "y_obs"
                                :mu mu
                                :sigma sigma
                                :observed y)]))
nil

Now we can sample from the posterior:

(def idata
  (py/with [_ basic-model]
           (pm/sample)))

Here is the resulting structure:

(-> idata
    (py.- posterior)
    (py.- alpha)
    (py. sel :draw (python/slice 0 4)))
<xarray.DataArray 'alpha' (chain: 2, draw: 5)> Size: 80B
array([[-9.08933908, -0.65013018, -3.24050702, 19.30412319, 18.01323153],
       [-5.75061134,  0.45468873, -8.74273953,  0.37804316,  3.01527021]])
Coordinates:
  * chain    (chain) int64 16B 0 1
  * draw     (draw) int64 40B 0 1 2 3 4
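Note that `(python/slice 0 4)` returns five draws, not four: xarray's label-based `.sel`, like pandas label indexing, includes both endpoints of a slice. A small pandas illustration of this inclusive convention:

```python
import pandas as pd

s = pd.Series(range(10))

# Positional slicing excludes the stop index:
assert list(s.iloc[0:4]) == [0, 1, 2, 3]

# Label-based slicing is inclusive on both ends,
# which is why draw=slice(0, 4) above yields five draws:
assert list(s.loc[0:4]) == [0, 1, 2, 3, 4]
```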

Alternatively, we could use the Slice sampling algorithm instead of the default NUTS.

(def slice-idata
  (py/with [_ basic-model]
           (let [step (pm/Slice)]
             (pm/sample 5000 :step step))))
slice-idata
Inference data with groups:
    > posterior
    > sample_stats

2.4 Posterior analysis

Let us plot our sampling using ArViZ:

(pyplot/pyplot
 #(az/plot_trace idata :combined true))
source: projects/stats/pymc/notebooks/intro.clj