2 Intro

Here is our introduction to using PyMC from Clojure.

We follow the linear regression example from the Introductory Overview of PyMC.

2.1 Setup

Relevant Clojure namespaces:

(ns intro
  (:require [libpython-clj2.require :refer [require-python]]
            [libpython-clj2.python :refer [py. py.. py.-] :as py]
            [fastmath.random :as random]
            [tablecloth.api :as tc]
            [tablecloth.column.api :as tcc]
            [tech.v3.datatype :as dtype]
            [scicloj.tableplot.v1.plotly :as plotly]
            [scicloj.kind-pyplot.v1.api :as pyplot]
            [scicloj.kindly.v4.kind :as kind]))

Relevant Python modules:

(require-python '[builtins :as python]
                'operator
                '[arviz :as az]
                '[arviz.style :as az.style]
                '[pandas :as pd]
                '[matplotlib.pyplot :as plt]
                '[numpy :as np]
                '[numpy.random :as np.random]
                '[pymc :as pm])
:ok

Some convenience functions to access Python idioms:

;; Pythonic `obj[entry]` lookup:
(defn brackets [obj entry]
  (py. obj __getitem__ entry))
;; Pythonic bare `:` (a full slice):
(def colon
  (python/slice nil nil))
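In Python terms, `brackets` and `colon` stand in for plain subscripting and the bare `:` slice. A minimal Python sketch of the idioms they emulate:

```python
# Plain-Python equivalents of the `brackets` and `colon` helpers above.
xs = [10, 20, 30, 40]

# `brackets` is subscripting via the __getitem__ protocol:
assert xs.__getitem__(2) == xs[2] == 30

# `colon` is a bare slice, i.e. slice(None, None) -- it selects everything:
everything = slice(None, None)
assert xs[everything] == xs[:] == [10, 20, 30, 40]
```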

Theme for ArViZ visualizations:

(az.style/use "arviz-darkgrid")
nil

2.2 Synthetic data

(def random-seed 8927)
(def dataset-size 101)
(def true-parameter-values
  {:alpha 1
   :sigma 1
   :beta [1 2.5]})

We will generate a dataset by the following recipe:

(defn gen-dataset [{:keys [size random-seed
                           alpha sigma beta]}]
  (let [rng (random/rng :isaac random-seed)]
    (-> {:x1 (take size (random/->seq rng))
         :x2 (-> (take size (random/->seq rng))
                 (tcc/* 0.2))}
        tc/dataset
        (tc/add-column :y
                       #(-> (tcc/+ alpha
                                   (tcc/* (beta 0) (:x1 %))
                                   (tcc/* (beta 1) (:x2 %))
                                   (tcc/* sigma
                                          (dtype/make-reader
                                           :float32 size (rand)))))))))
(def dataset
  (gen-dataset (merge {:random-seed random-seed
                       :size dataset-size}
                      true-parameter-values)))
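For comparison, here is the same recipe sketched directly in NumPy. This is only an illustration: the Clojure version draws from fastmath's Isaac RNG, so the concrete numbers will not match the table below.

```python
import numpy as np

# NumPy sketch of gen-dataset: same recipe, different RNG.
rng = np.random.default_rng(8927)
size = 101
alpha, sigma, beta = 1, 1, [1, 2.5]

x1 = rng.random(size)            # uniform draws, like random/->seq above
x2 = rng.random(size) * 0.2
y = alpha + beta[0] * x1 + beta[1] * x2 + sigma * rng.random(size)

assert x1.shape == x2.shape == y.shape == (size,)
```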
(tc/head dataset)

_unnamed [5 3]:

:x1 :x2 :y
0.55055707 0.12467207 2.45793549
0.33115609 0.12809006 2.47644528
0.75138350 0.16112505 2.79757416
0.84946778 0.04481401 2.02392108
0.48675832 0.13809987 2.70167490

Let us visualize our dataset:

(->> [:x1 :x2]
     (mapv (fn [x]
             (-> dataset
                 (plotly/layer-point
                  {:=x x
                   :=y :y}))))
     kind/fragment)

2.3 Using PyMC

pm/__version__
"5.17.0"

Let us define a Bayesian model for our data:
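Written out, the code below encodes the same model specification as the PyMC overview:

```latex
\begin{aligned}
\alpha &\sim \mathcal{N}(0,\, 10) \\
\beta_0, \beta_1 &\sim \mathcal{N}(0,\, 10) \\
\sigma &\sim \operatorname{HalfNormal}(1) \\
\mu &= \alpha + \beta_0 x_1 + \beta_1 x_2 \\
y &\sim \mathcal{N}(\mu,\, \sigma)
\end{aligned}
```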

(def basic-model (pm/Model))
(py/with [_ basic-model]
         (let [{:keys [x1 x2 y]} (-> dataset
                                     (update-vals np/array))
               alpha (pm/Normal "alpha"
                                :mu 0
                                :sigma 10)
               beta (pm/Normal "beta"
                               :mu 0
                               :sigma 10
                               :shape 2)
               sigma (pm/HalfNormal "sigma"
                                    :sigma 1)
               mu (operator/add alpha
                                (operator/mul (brackets beta 0)
                                              x1)
                                (operator/mul (brackets beta 1)
                                              x2))
               y_obs (pm/Normal "y_obs"
                                :mu mu
                                :sigma sigma
                                :observed y)]))
nil

Now we can sample from the posterior:

(def idata
  (py/with [_ basic-model]
           (pm/sample)))

Here is the resulting structure:

(-> idata
    (py.- posterior)
    (py.- alpha)
    (py. sel :draw (python/slice 0 4)))
<xarray.DataArray 'alpha' (chain: 2, draw: 5)> Size: 80B
array([[-9.08933908, -0.65013018, -3.24050702, 19.30412319, 18.01323153],
       [-5.75061134,  0.45468873, -8.74273953,  0.37804316,  3.01527021]])
Coordinates:
  * chain    (chain) int64 16B 0 1
  * draw     (draw) int64 40B 0 1 2 3 4
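Note that `(python/slice 0 4)` returns five draws, not four: xarray's label-based `.sel`, like pandas label indexing, includes both endpoints of a slice. A small pandas illustration of this inclusive convention:

```python
import pandas as pd

s = pd.Series(range(10))

# Positional slicing excludes the stop index:
assert list(s.iloc[0:4]) == [0, 1, 2, 3]

# Label-based slicing is inclusive on both ends,
# which is why draw=slice(0, 4) above yields five draws:
assert list(s.loc[0:4]) == [0, 1, 2, 3, 4]
```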

Alternatively, we could use the Slice sampling algorithm instead of the default NUTS.

(def slice-idata
  (py/with [_ basic-model]
           (let [step (pm/Slice)]
             (pm/sample 5000 :step step))))
slice-idata
Inference data with groups:
    > posterior
    > sample_stats

2.4 Posterior analysis

Let us plot our sampling using ArViZ:

(pyplot/pyplot
 #(az/plot_trace idata :combined true))
source: projects/stats/pymc/notebooks/intro.clj