8  Statistics (experimental 🛠)

author: Daniel Slutsky

(ns noj-book.statistics
  (:require [scicloj.noj.v1.stats :as stats]
            [tablecloth.api :as tc]))

8.1 Example data

(def iris
  (-> "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"
      (tc/dataset {:key-fn keyword})
      (tc/rename-columns {:Sepal.Length :sepal-length
                          :Sepal.Width :sepal-width
                          :Petal.Length :petal-length
                          :Petal.Width :petal-width
                          :Species :species})))

8.2 Multivariate regression

The stats/regression-model function computes a regression model (using scicloj.ml) and adds some relevant information, such as the R^2 measure.

(-> iris
    (stats/regression-model
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/elastic-net})
    (dissoc :model-data))
{:feature-columns [:sepal-width :petal-length :petal-width],
 :target-columns [:sepal-length],
 :explained #function[clojure.lang.AFunction/1],
 :R2 0.8582120394596505,
 :id #uuid "a5f55f73-e805-464d-a5b9-c23889c96e23",
 :predictions #tech.v3.dataset.column<float64>[150]
:sepal-length
[5.022, 4.724, 4.775, 4.851, 5.081, 5.360, 4.911, 5.030, 4.664, 4.903, 5.209, 5.098, 4.775, 4.572, 5.184, 5.522, 5.089, 4.970, 5.352, 5.217...],
 :predict
 #function[scicloj.noj.v1.stats/regression-model/predict--66315],
 :options {:model-type :smile.regression/elastic-net}}
(-> iris
    (stats/regression-model
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/ordinary-least-square})
    (dissoc :model-data))
{:feature-columns [:sepal-width :petal-length :petal-width],
 :target-columns [:sepal-length],
 :explained #function[clojure.lang.AFunction/1],
 :R2 0.8586117200663171,
 :id #uuid "2bde1b4a-8df0-4b06-8e90-b300d63d29e7",
 :predictions #tech.v3.dataset.column<float64>[150]
:sepal-length
[5.015, 4.690, 4.749, 4.826, 5.080, 5.377, 4.895, 5.021, 4.625, 4.882, 5.216, 5.092, 4.746, 4.533, 5.199, 5.561, 5.094, 4.960, 5.368, 5.226...],
 :predict
 #function[scicloj.noj.v1.stats/regression-model/predict--66315],
 :options {:model-type :smile.regression/ordinary-least-square}}

The stats/linear-regression-model convenience function specifically uses the :smile.regression/ordinary-least-square model type.

(-> iris
    (stats/linear-regression-model
     :sepal-length
     [:sepal-width :petal-length :petal-width])
    (dissoc :model-data))
{:feature-columns [:sepal-width :petal-length :petal-width],
 :target-columns [:sepal-length],
 :explained #function[clojure.lang.AFunction/1],
 :R2 0.8586117200663171,
 :id #uuid "fef57df1-4364-4832-8dc8-dd5491047526",
 :predictions #tech.v3.dataset.column<float64>[150]
:sepal-length
[5.015, 4.690, 4.749, 4.826, 5.080, 5.377, 4.895, 5.021, 4.625, 4.882, 5.216, 5.092, 4.746, 4.533, 5.199, 5.561, 5.094, 4.960, 5.368, 5.226...],
 :predict
 #function[scicloj.noj.v1.stats/regression-model/predict--66315],
 :options {:model-type :smile.regression/ordinary-least-square}}
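
To make the :R2 values above easier to interpret, here is a minimal sketch of the usual coefficient of determination, R^2 = 1 - SS_res / SS_tot, in plain Clojure. The helper names mean and r-squared are hypothetical, and whether stats/regression-model computes :R2 with exactly this formula is an assumption.

```clojure
(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn r-squared
  "Coefficient of determination for observed values ys and predictions y-hats.
  Hypothetical helper for illustration, not part of the noj API."
  [ys y-hats]
  (let [y-bar  (mean ys)
        ;; residual sum of squares: squared gaps between observed and predicted
        ss-res (reduce + (map (fn [y y-hat]
                                (let [d (- y y-hat)] (* d d)))
                              ys y-hats))
        ;; total sum of squares: squared gaps between observed and their mean
        ss-tot (reduce + (map (fn [y]
                                (let [d (- y y-bar)] (* d d)))
                              ys))]
    (- 1.0 (/ ss-res ss-tot))))
```

Under this definition, perfect predictions give 1.0 and always predicting the mean of the observed values gives 0.0, so a value near 0.86 means the features account for most, but not all, of the variation in the target.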

8.3 Adding regression predictions to a dataset

The stats/add-predictions function models a target column using feature columns and adds a new column with the model's predictions (here, :sepal-length-prediction).

(-> iris
    (stats/add-predictions
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/ordinary-least-square}))

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 7]:

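| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species | :sepal-length-prediction |
|----------:|--------------:|-------------:|--------------:|-------------:|-----------|-------------------------:|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 5.01541576 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 4.68999718 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 4.74925142 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 4.82599409 |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 5.08049948 |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | 5.37719368 |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | 4.89468378 |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa | 5.02124524 |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | 4.62491347 |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa | 4.88164236 |
| … | … | … | … | … | … | … |
| 140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica | 6.53429168 |
| 141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica | 6.50917327 |
| 142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica | 6.21025556 |
| 143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 6.17251376 |
| 144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica | 6.84264484 |
| 145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica | 6.65460564 |
| 146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 6.21608504 |
| 147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 5.97143313 |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 6.38302984 |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 6.61824630 |
| 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 6.42341317 |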
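
One way to read the prediction column is through residuals, that is, observed value minus predicted value. The sketch below uses plain Clojure on the first three rows of the output above; the names sample-rows, residual, and residuals are hypothetical and not part of the noj API.

```clojure
;; [observed predicted] pairs copied from rows 1-3 of the output above.
(def sample-rows
  [[5.1 5.01541576]
   [4.9 4.68999718]
   [4.7 4.74925142]])

(defn residual
  "Observed minus predicted for one [observed predicted] pair."
  [[observed predicted]]
  (- observed predicted))

;; One residual per row, in row order.
(def residuals
  (mapv residual sample-rows))
```

A positive residual means the model predicted less than the observed sepal length for that row (rows 1 and 2), and a negative one means it predicted more (row 3).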

The function also attaches the model’s information to the metadata of the new column, where it can be read with meta.

(-> iris
    (stats/add-predictions
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/ordinary-least-square})
    :sepal-length-prediction
    meta
    (update :model
            dissoc :model-data :predict :predictions))
{:name :sepal-length-prediction,
 :datatype :float64,
 :n-elems 150,
 :column-type :prediction,
 :model
 {:feature-columns [:sepal-width :petal-length :petal-width],
  :target-columns [:sepal-length],
  :explained #function[clojure.lang.AFunction/1],
  :R2 0.8586117200663171,
  :id #uuid "7ab519fe-f93c-40f9-9e77-89c682cad32e",
  :options {:model-type :smile.regression/ordinary-least-square}}}
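
The metadata lookup above relies on Clojure's general metadata mechanism. As a minimal sketch of that mechanism, the example below attaches a similar map to a plain vector, which stands in for the dataset column; the names prediction-column, model-r2, and column-type are hypothetical, and the map is abbreviated from the output above.

```clojure
;; A plain vector standing in for the prediction column (an assumption made
;; for illustration; the real value is a dataset column).
(def prediction-column
  (with-meta [5.015 4.690 4.749]
    {:name :sepal-length-prediction
     :column-type :prediction
     :model {:R2 0.8586117200663171
             :feature-columns [:sepal-width :petal-length :petal-width]}}))

;; Metadata is read with meta and navigated like any other map.
(def model-r2
  (-> prediction-column meta :model :R2))

(def column-type
  (-> prediction-column meta :column-type))
```

Metadata does not take part in equality, so the vector still compares equal to [5.015 4.690 4.749]; the model information travels with the value without changing its contents.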
source: notebooks/noj_book/statistics.clj