3 Introduction to Unsupervised Machine Learning with metamorph.ml

This tutorial introduces unsupervised machine learning using the metamorph.ml library. We’ll cover:

Clustering algorithms (K-means, hierarchical clustering)
Dimensionality reduction (PCA)
Feature scaling and preprocessing
Text feature extraction with TF-IDF
Evaluation techniques for unsupervised learning
Building complete unsupervised ML pipelines

Unlike supervised learning, unsupervised learning works with unlabeled data to discover hidden patterns, group similar observations, or reduce dimensionality for visualization and preprocessing.

Disclaimer: (created with the help of Claude Code)

(ns unsupervised-ml-intro
  (:require
   [clojure.string :as str]
   [tablecloth.api :as tc]
   [tablecloth.pipeline :as tc-mm]
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.column-filters :as cf]
   [tech.v3.datatype.functional :as dfn]
   [scicloj.metamorph.core :as mm]
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.preprocessing :as prep]
   [scicloj.metamorph.ml.rdatasets :as rdatasets]
   [scicloj.ml.smile.projections :as projections]
   [scicloj.ml.smile.clustering :as clustering]
   [scicloj.ml.smile.manifold]))

3.1 Part 1: Clustering with K-Means

Clustering groups similar data points together without predefined labels. K-means is one of the most popular clustering algorithms.

3.1.1 1.1 Loading and Exploring Data

We’ll use the Iris dataset, but ignore the species labels to treat it as an unsupervised problem.

(def iris-ds
  (-> (rdatasets/datasets-iris)
      (ds/drop-columns [:rownames :species])))  ; Remove labels for unsupervised learning

Iris dataset (unlabeled):

(tc/head iris-ds 5)

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [5 4]:

:sepal-length	:sepal-width	:petal-length	:petal-width
5.1	3.5	1.4	0.2
4.9	3.0	1.4	0.2
4.7	3.2	1.3	0.2
4.6	3.1	1.5	0.2
5.0	3.6	1.4	0.2

Dataset shape: 150 rows × 4 columns

View column statistics:

(ds/descriptive-stats iris-ds)

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:

:col-name	:datatype	:n-valid	:min	:mean	:max	:standard-deviation	:skew	:first	:last
:sepal-length	:float64	150	4.3	5.84333333	7.9	0.82806613	0.31491096	5.1	5.9
:sepal-width	:float64	150	2.0	3.05733333	4.4	0.43586628	0.31896566	3.5	3.0
:petal-length	:float64	150	1.0	3.75800000	6.9	1.76529823	-0.27488418	1.4	5.1
:petal-width	:float64	150	0.1	1.19933333	2.5	0.76223767	-0.10296675	0.2	1.8

3.1.2 1.2 Data Preprocessing

Before clustering, we should standardize features so that features with larger scales don’t dominate the distance calculations.

(def numeric-cols (tc/column-names (cf/numeric iris-ds)))

(def preprocessing-pipeline
  (mm/pipeline
   (prep/std-scale numeric-cols {:mean? true :stddev? true})))

Apply preprocessing in fit mode:

(def fitted-preproc-ctx
  (preprocessing-pipeline
   {:metamorph/data iris-ds
    :metamorph/mode :fit}))

(def scaled-iris
  (:metamorph/data fitted-preproc-ctx))

Scaled data (first 5 rows):

(tc/head scaled-iris 5)

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [5 4]:

:sepal-length	:sepal-width	:petal-length	:petal-width
-0.89767388	1.01560199	-1.33575163	-1.31105215
-1.13920048	-0.13153881	-1.33575163	-1.31105215
-1.38072709	0.32731751	-1.39239929	-1.31105215
-1.50149039	0.09788935	-1.27910398	-1.31105215
-1.01843718	1.24503015	-1.33575163	-1.31105215

Check that scaling worked (mean ≈ 0, std ≈ 1):

Scaled data statistics:

(ds/descriptive-stats scaled-iris)

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:

:col-name	:datatype	:n-valid	:min	:mean	:max	:standard-deviation	:skew	:first	:last
:sepal-length	:float64	150	-1.86378030	-4.43719136E-16	2.48369858	1.0	0.31491096	-0.89767388	0.06843254
:sepal-width	:float64	150	-2.42582042	-8.17679258E-16	3.08045544	1.0	0.31896566	1.01560199	-0.13153881
:petal-length	:float64	150	-1.56234224	-2.58311890E-16	1.77986923	1.0	-0.27488418	-1.33575163	0.76021149
:petal-width	:float64	150	-1.44224482	-3.70074342E-17	1.70637941	1.0	-0.10296675	-1.31105215	0.78803068
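
As a quick sanity check, the z-score formula can be applied by hand to one value. This is a minimal sketch in plain Clojure; `z-score` is a hypothetical helper, not part of metamorph.ml, and the mean and standard deviation are copied from the unscaled statistics table earlier in this section.

```clojure
;; Z-score of a single value, given the column mean and standard deviation.
(defn z-score [x mean sd]
  (/ (- x mean) sd))

;; First :sepal-length value (5.1) with the unscaled column statistics.
(z-score 5.1 5.84333333 0.82806613)
;; => approximately -0.8977, which agrees with the first scaled :sepal-length value
```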

3.1.3 1.3 K-Means Clustering

K-means partitions data into K clusters by minimizing within-cluster variance.

(def kmeans-pipeline
  (mm/pipeline
   ;; Standardize features
   (prep/std-scale numeric-cols {:mean? true :stddev? true})
   ;; K-means clustering
   {:metamorph/id :model}
   (ml/model {:model-type :fastmath.cluster/k-means
              :clustering-method-args [3 100 1e-4]})))

The values in :clustering-method-args are positional:

3 -> k, the number of clusters
100 -> maximum number of iterations
1e-4 -> convergence tolerance

Fit the clustering model:

(def kmeans-result
  (kmeans-pipeline
   {:metamorph/data iris-ds
    :metamorph/mode :fit}))

Extract the trained model:

(def kmeans-model
  (-> kmeans-result :model :model-data))

^kind/println
(-> kmeans-model :obj str)

Cluster distortion: 139.09920
Cluster size of 150 data points:
Cluster    1     50 (33.3%)
Cluster    2     44 (29.3%)
Cluster    3     56 (37.3%)

K-means clustering complete!
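
To make the "minimizing within-cluster variance" objective concrete, here is a minimal sketch of one iteration of Lloyd's algorithm, the standard K-means heuristic, in plain Clojure. It is an illustration only and makes no claim about how the fastmath model used in the pipeline is implemented; all helper names (`sq-dist`, `nearest`, `centroid`, `lloyd-step`) are hypothetical.

```clojure
;; Squared Euclidean distance between two points (vectors of numbers).
(defn sq-dist [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

;; Index of the centroid closest to point p.
(defn nearest [centroids p]
  (apply min-key #(sq-dist (nth centroids %) p) (range (count centroids))))

;; Component-wise mean of a non-empty collection of points.
(defn centroid [points]
  (let [n (count points)]
    (mapv #(/ % n) (apply map + points))))

;; One Lloyd iteration: assign every point to its nearest centroid,
;; then move each centroid to the mean of its assigned points.
;; A centroid with no assigned points is left where it is.
(defn lloyd-step [centroids points]
  (let [groups (group-by #(nearest centroids %) points)]
    (mapv (fn [i]
            (if-let [members (groups i)]
              (centroid members)
              (nth centroids i)))
          (range (count centroids)))))

(lloyd-step [[1.0 1.0] [9.0 9.0]]
            [[0.0 0.0] [0.0 2.0] [10.0 10.0] [10.0 12.0]])
;; => [[0.0 1.0] [10.0 11.0]]
```

Repeating `lloyd-step` until the centroids stop moving (or a maximum iteration count is reached) corresponds to the maximum-iterations and tolerance settings passed to the model.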

3.1.4 1.4 Analyzing Cluster Assignments

Get the cluster assignments (which cluster each point belongs to):

(def cluster-assignments
  (-> kmeans-model :clustering))

Number of unique clusters found: 3

Add cluster assignments to the original data:

(def iris-with-clusters
  (tc/add-column iris-ds :cluster cluster-assignments))

Data with cluster assignments:

(tc/head iris-with-clusters 10)

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [10 5]:

:sepal-length	:sepal-width	:petal-length	:petal-width
5.1	3.5	1.4	0.2
4.9	3.0	1.4	0.2
4.7	3.2	1.3	0.2
4.6	3.1	1.5	0.2
5.0	3.6	1.4	0.2
5.4	3.9	1.7	0.4
4.6	3.4	1.4	0.3
5.0	3.4	1.5	0.2
4.4	2.9	1.4	0.2
4.9	3.1	1.5	0.1

View cluster sizes:

(-> iris-with-clusters
    (tc/group-by [:cluster])
    (tc/aggregate {:count tc/row-count})
    (tc/order-by [:cluster]))

_unnamed [3 2]:

:cluster	:count
0	50
1	44
2	56

3.1.5 1.5 Cluster Statistics

Examine the characteristics of each cluster:

(-> iris-with-clusters
    (tc/group-by [:cluster])
    (tc/aggregate {:mean-sepal-length #(dfn/mean (% :sepal-length))
                   :mean-sepal-width #(dfn/mean (% :sepal-width))
                   :mean-petal-length #(dfn/mean (% :petal-length))
                   :mean-petal-width #(dfn/mean (% :petal-width))
                   :count tc/row-count})
    (tc/order-by [:cluster]))

_unnamed [3 6]:

:cluster	:mean-sepal-length	:mean-sepal-width	:mean-petal-length	:mean-petal-width	:count
0	5.00600000	3.42800000	1.46200000	0.24600000	50
1	6.80681818	3.12045455	5.52272727	1.98181818	44
2	5.83392857	2.67678571	4.42142857	1.43571429	56

3.2 Part 2: Dimensionality Reduction with PCA

Principal Component Analysis (PCA) reduces the number of features while retaining most of the variance in the data. It’s useful for:

Visualization (reducing to 2-3 dimensions)
Preprocessing before modeling
Noise reduction

3.2.1 2.1 Applying PCA

(def pca-pipeline
  (mm/pipeline
   ;; Standardize features (recommended: covariance-based PCA is sensitive to feature scale)
   (prep/std-scale numeric-cols {:mean? true :stddev? true})
   ;; PCA
   {:metamorph/id :pca}
   (projections/reduce-dimensions :pca-cov 2 numeric-cols {})))          ; Reduce to 2 dimensions

Fit PCA:

(def pca-result (mm/fit-pipe iris-ds pca-pipeline))

(def pca-transformed
  (-> pca-result
      :metamorph/data
      (tc/select-columns ["pca-cov-0" "pca-cov-1"])))

PCA transformation complete!

Original dimensions: 4

Reduced dimensions: 2

View the transformed data:

(tc/head pca-transformed 10)

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [10 2]:

pca-cov-0	pca-cov-1
-2.25714118	-0.47842383
-2.07401302	0.67188269
-2.35633511	0.34076642
-2.29170679	0.59539986
-2.38186270	-0.64467566
-2.06870061	-1.48420530
-2.43586845	-0.04748512
-2.22539189	-0.22240300
-2.32684533	1.11160370
-2.17703491	0.46744757

3.2.2 2.2 Explained Variance

PCA components capture different amounts of variance. The first component captures the most variance, the second captures the second most, etc.

(def pca-model
  (-> pca-result :pca :fit-result :model))

The PCA model contains information about explained variance and component loadings.

(-> pca-model bean keys)

(:center
 :class
 :cumulativeVarianceProportion
 :loadings
 :projection
 :variance
 :varianceProportion)

For example, the cumulative variance proportion:

(-> pca-model .getCumulativeVarianceProportion vec)

[0.7296244541329989 0.9581320720000164 0.9948212908928452 1.0]
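
The cumulative proportions can be turned into a rule for choosing the number of components. The sketch below is plain Clojure working on the vector printed above; `per-component` and `components-for` are hypothetical helpers, and the 95% threshold is only a common convention, not a recommendation from the library.

```clojure
;; Cumulative variance proportions as reported by the fitted PCA model.
(def cumulative
  [0.7296244541329989 0.9581320720000164 0.9948212908928452 1.0])

;; Variance proportion contributed by each individual component.
(defn per-component [cum]
  (mapv - cum (cons 0.0 cum)))

;; Smallest number of leading components whose cumulative
;; proportion reaches the given threshold.
(defn components-for [cum threshold]
  (inc (count (take-while #(< % threshold) cum))))

(components-for cumulative 0.95)
;; => 2
(components-for cumulative 0.99)
;; => 3
```

With this data, the first two components already explain about 95.8% of the variance, which is why reducing to two dimensions loses little information here.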

3.2.3 2.3 Combining PCA with Clustering

A common pattern is to use PCA for dimensionality reduction, then cluster in the reduced space:

(def pca-kmeans-pipeline
  (mm/pipeline
   ;; Standardize
   (prep/std-scale numeric-cols {:mean? true :stddev? true})
   ;; Reduce dimensions
   {:metamorph/id :pca}
   (projections/reduce-dimensions :pca-cov 2 numeric-cols {})

   (tc-mm/select-columns ["pca-cov-0" "pca-cov-1"])

   ;; Cluster in reduced space
   {:metamorph/id :kmeans}
   (clustering/cluster :k-means [3 100 1e-4] :clustering)))

(def pca-kmeans-result
  (mm/fit-pipe iris-ds pca-kmeans-pipeline))

PCA + K-means pipeline complete!

Get cluster assignments from the combined pipeline:

(def pca-clusters
  (-> pca-kmeans-result :metamorph/data :clustering seq))

(def iris-pca-clusters
  (tc/add-column iris-ds :pca-cluster pca-clusters))

(-> iris-pca-clusters
    (tc/group-by [:pca-cluster])
    (tc/aggregate {:count tc/row-count})
    (tc/order-by [:pca-cluster]))

_unnamed [3 2]:

:pca-cluster	:count
0	47
1	50
2	53

3.3 Part 3: Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of clusters, allowing exploration at different granularities.

(def hclust-pipeline
  (mm/pipeline
   (prep/std-scale numeric-cols {:mean? true :stddev? true})
   {:metamorph/id :model}
   (ml/model {:model-type :smile.clustering/hierarchical
              :k 3                   ; Cut tree to get 3 clusters
              :linkage :complete}))) ; Linkage method: :single, :complete, :average, :ward

(def hclust-result
  (hclust-pipeline
   {:metamorph/data iris-ds
    :metamorph/mode :fit}))

(def hclust-assignments
  (:cluster-id (:metamorph/data hclust-result)))

(def iris-hclust
  (tc/add-column iris-ds :hclust-cluster hclust-assignments))

Hierarchical clustering results:

(-> iris-hclust
    (tc/group-by [:hclust-cluster])
    (tc/aggregate {:count tc/row-count
                   :mean-petal-length #(dfn/mean (% :petal-length))})
    (tc/order-by [:hclust-cluster]))

3.4 Part 4: Feature Engineering and Preprocessing

Proper preprocessing is crucial for unsupervised learning.

3.4.1 4.1 Standard Scaling (Z-score Normalization)

(def std-scaled-ds
  (-> (mm/pipeline
       (prep/std-scale numeric-cols {:mean? true :stddev? true}))
      (apply [{:metamorph/data iris-ds
               :metamorph/mode :fit}])
      :metamorph/data))

Standard scaling: Transforms features to have mean=0 and std=1

(ds/descriptive-stats std-scaled-ds [:mean :standard-deviation])

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:

:col-name	:datatype	:n-valid	:min	:mean	:max	:standard-deviation	:skew	:first	:last
:sepal-length	:float64	150	-1.86378030	-4.43719136E-16	2.48369858	1.0	0.31491096	-0.89767388	0.06843254
:sepal-width	:float64	150	-2.42582042	-8.17679258E-16	3.08045544	1.0	0.31896566	1.01560199	-0.13153881
:petal-length	:float64	150	-1.56234224	-2.58311890E-16	1.77986923	1.0	-0.27488418	-1.33575163	0.76021149
:petal-width	:float64	150	-1.44224482	-3.70074342E-17	1.70637941	1.0	-0.10296675	-1.31105215	0.78803068

3.4.2 4.2 Min-Max Scaling

(def minmax-scaled-ds
  (-> (mm/pipeline
       (prep/min-max-scale numeric-cols {:min -1 :max 1}))
      (apply [{:metamorph/data iris-ds
               :metamorph/mode :fit}])
      :metamorph/data))

Min-max scaling: Transforms features to a specific range (here: -1 to 1)

(ds/descriptive-stats minmax-scaled-ds [:min :max])

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:

:col-name	:datatype	:n-valid	:min	:mean	:max	:standard-deviation	:skew	:first	:last
:sepal-length	:float64	150	-1.0	-0.14259259	1.0	0.46003674	0.31491096	-0.55555556	-0.11111111
:sepal-width	:float64	150	-1.0	-0.11888889	1.0	0.36322190	0.31896566	0.25000000	-0.16666667
:petal-length	:float64	150	-1.0	-0.06508475	1.0	0.59840618	-0.27488418	-0.86440678	0.38983051
:petal-width	:float64	150	-1.0	-0.08388889	1.0	0.63519806	-0.10296675	-0.91666667	0.41666667
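
The min-max transformation itself is a single linear rescaling. The sketch below, in plain Clojure, reproduces one value from the table; `min-max-scale` is a hypothetical helper (not the library function), and the column minimum and maximum are taken from the unscaled descriptive statistics of :sepal-length.

```clojure
;; Linearly map x from [col-min, col-max] to the target range [lo, hi].
(defn min-max-scale [x col-min col-max lo hi]
  (+ lo (/ (* (- x col-min) (- hi lo))
           (- col-max col-min))))

;; First :sepal-length value (5.1); the column ranges from 4.3 to 7.9.
(min-max-scale 5.1 4.3 7.9 -1.0 1.0)
;; => approximately -0.5556, matching the :first value reported for :sepal-length
```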

3.4.3 4.3 Robust Scaling for Outliers

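Standard scaling relies on the mean and standard deviation, and both are pulled strongly by extreme values. When the data contains outliers, consider quantile-based scaling (for example, centering on the median and dividing by the interquartile range) or removing the outliers first.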
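
One way to sketch quantile-based scaling is to center on the median and divide by the interquartile range (IQR). The code below is plain Clojure and does not use any metamorph.ml function; `quantile` and `robust-scale` are hypothetical helpers, and the linear-interpolation quantile is just one of several common definitions.

```clojure
;; Quantile with linear interpolation between neighboring order statistics.
(defn quantile [xs q]
  (let [sorted (vec (sort xs))
        pos    (* (double q) (dec (count sorted)))
        lo     (long (Math/floor pos))
        hi     (min (inc lo) (dec (count sorted)))
        frac   (- pos lo)]
    (+ (* (- 1.0 frac) (nth sorted lo))
       (* frac (nth sorted hi)))))

;; Center on the median and scale by the IQR.
;; Assumes the IQR is non-zero (i.e. the column is not almost constant).
(defn robust-scale [xs]
  (let [median (quantile xs 0.5)
        iqr    (- (quantile xs 0.75) (quantile xs 0.25))]
    (mapv #(/ (- % median) iqr) xs)))

(robust-scale [1.0 2.0 3.0 4.0 100.0])
;; => [-1.0 -0.5 0.0 0.5 48.5]
```

The outlier (100.0) remains far from the rest, but it no longer distorts the scale of the other four values, which is the point of using the median and IQR.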

3.5 Part 5: Text Clustering with TF-IDF

Unsupervised learning is commonly used for text data. Let’s create a simple example of text clustering.

Create a small text dataset:

(def documents-ds
  (tc/dataset {:doc-id (range 6)
               :text ["machine learning is fascinating"
                      "deep learning uses neural networks"
                      "I love pizza and pasta"
                      "Italian food is delicious"
                      "supervised learning needs labels"
                      "My favorite food is sushi"]}))

Text documents:

documents-ds

_unnamed [6 2]:

:doc-id	:text
0	machine learning is fascinating
1	deep learning uses neural networks
2	I love pizza and pasta
3	Italian food is delicious
4	supervised learning needs labels
5	My favorite food is sushi

3.5.1 5.1 Tokenization and TF-IDF

First, we need to tokenize text and compute TF-IDF features:

(require '[scicloj.metamorph.ml.text :as text])

Tokenize documents:

(defn simple-tokenize [text]
  (-> text
      str/lower-case
      (str/split #"\s+")))

(def tokenized-docs
  (tc/add-column documents-ds
                 :tokens
                 (map simple-tokenize (:text documents-ds))))

Tokenized documents:

(tc/select-columns tokenized-docs [:doc-id :tokens])

_unnamed [6 2]:

:doc-id	:tokens
0	[machine learning is fascinating]
1	[deep learning uses neural networks]
2	[i love pizza and pasta]
3	[italian food is delicious]
4	[supervised learning needs labels]
5	[my favorite food is sushi]

Convert to tidy text format (one row per token):

(def tidy-docs
  (tc/dataset
   (mapcat (fn [row]
             (map (fn [token]
                    {:doc-id (:doc-id row)
                     :token token})
                  (:tokens row)))
           (ds/mapseq-reader tokenized-docs))))

Tidy text format (sample):

(tc/head tidy-docs 10)

_unnamed [10 2]:

:doc-id	:token
0	machine
0	learning
0	is
0	fascinating
1	deep
1	learning
1	uses
1	neural
1	networks
2	i

Compute term frequencies:

(def term-counts
  (-> tidy-docs
      (tc/group-by [:doc-id :token])
      (tc/aggregate {:n tc/row-count})))

Term counts (sample):

(tc/head term-counts 10)

_unnamed [10 3]:

:doc-id	:token	:n
0	machine	1
0	learning	1
0	is	1
0	fascinating	1
1	deep	1
1	learning	1
1	uses	1
1	neural	1
1	networks	1
2	i	1

3.5.2 5.2 Document Similarity

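After TF-IDF vectorization, we can cluster documents based on how similar their term-weight vectors are. Note that this captures lexical overlap (shared words) rather than true semantic similarity: two documents about the same topic that share no words will look unrelated.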

Note: Full TF-IDF clustering requires converting the term-document matrix to a format suitable for clustering. The scicloj.metamorph.ml.text namespace provides functions for this.
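
To illustrate the idea without relying on a specific library API, here is a minimal plain-Clojure sketch that computes TF-IDF weights for the six example documents and compares them with cosine similarity. It does not use scicloj.metamorph.ml.text; every helper name is hypothetical, and the weighting (relative term frequency times the natural log of inverse document frequency) is one common TF-IDF variant among several.

```clojure
(require '[clojure.string :as str])

(def docs
  ["machine learning is fascinating"
   "deep learning uses neural networks"
   "I love pizza and pasta"
   "Italian food is delicious"
   "supervised learning needs labels"
   "My favorite food is sushi"])

(defn tokenize [s]
  (str/split (str/lower-case s) #"\s+"))

(def tokenized (mapv tokenize docs))
(def n-docs (count tokenized))

;; Number of documents in which each term occurs.
(def doc-freq (frequencies (mapcat distinct tokenized)))

;; Sparse TF-IDF vector for one document, as a map of term -> weight.
(defn tf-idf [tokens]
  (let [n (count tokens)]
    (into {}
          (map (fn [[term c]]
                 [term (* (/ (double c) n)
                          (Math/log (/ (double n-docs) (doc-freq term))))]))
          (frequencies tokens))))

(defn dot [a b]
  (reduce + 0.0 (map (fn [[k v]] (* v (get b k 0.0))) a)))

;; Cosine similarity between two sparse vectors (0.0 if either is all zeros).
(defn cosine [a b]
  (let [na (Math/sqrt (dot a a))
        nb (Math/sqrt (dot b b))]
    (if (or (zero? na) (zero? nb))
      0.0
      (/ (dot a b) (* na nb)))))

(def vectors (mapv tf-idf tokenized))

;; Documents 0 and 1 share the term "learning", so similarity is positive.
(cosine (vectors 0) (vectors 1))
;; Documents 0 and 2 share no terms, so similarity is exactly 0.0.
(cosine (vectors 0) (vectors 2))
```

Documents 0 and 3 also get a small positive similarity purely because both contain "is", which shows why stop-word removal is usually applied before clustering text.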

3.6 Part 6: Evaluation Metrics for Unsupervised Learning

Unlike supervised learning, we don’t have true labels to evaluate against. Instead, we use intrinsic quality measures.

3.6.1 6.1 Within-Cluster Sum of Squares (WCSS)

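WCSS measures cluster compactness: it is the sum of squared distances from each point to the centroid of its cluster. For a fixed K, lower is better. Because WCSS tends to keep decreasing as K grows, it should not be used on its own to compare solutions with different K. A Euclidean distance helper is the basic building block: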

(defn euclidean-distance [point1 point2]
  (dfn/sqrt
   (dfn/sum
    (dfn/pow
     (dfn/- point1 point2)
     2))))
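
The distance helper is only a building block. As a sketch of the full computation, the plain-Clojure code below derives each cluster's centroid from the labeled points and sums the squared distances. It is self-contained and does not depend on dfn; `sq-dist`, `centroid`, and `wcss` are hypothetical names.

```clojure
;; Squared Euclidean distance between two points.
(defn sq-dist [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

;; Component-wise mean of a non-empty collection of points.
(defn centroid [points]
  (let [n (count points)]
    (mapv #(/ % n) (apply map + points))))

;; Within-cluster sum of squares for points with integer cluster labels.
(defn wcss [points labels]
  (->> (map vector labels points)
       (group-by first)
       vals
       (map (fn [pairs]
              (let [pts (map second pairs)
                    c   (centroid pts)]
                (reduce + (map #(sq-dist % c) pts)))))
       (reduce +)))

(wcss [[0.0 0.0] [0.0 2.0] [10.0 10.0] [10.0 12.0]]
      [0 0 1 1])
;; => 4.0
```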

3.6.2 6.2 Silhouette Score

Silhouette score measures how similar a point is to its own cluster compared to other clusters. Ranges from -1 to 1, higher is better.

Common evaluation approaches:

Elbow method: Plot WCSS vs. number of clusters, look for the ‘elbow’
Silhouette analysis: Compute silhouette score for each point
Gap statistic: Compare within-cluster dispersion to null reference
Domain validation: Check if clusters make sense in your domain
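
The silhouette definition can be sketched directly in plain Clojure. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster; the score is `(b - a) / max(a, b)`. The helper names are hypothetical, the code is quadratic in the number of points, and it assumes at least two clusters.

```clojure
(defn dist [a b]
  (Math/sqrt (double (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))))

(defn mean [xs]
  (/ (reduce + xs) (count xs)))

;; Mean silhouette score over all points.
;; Points in a singleton cluster get a score of 0.0 by convention.
(defn silhouette [points labels]
  (let [idx        (range (count points))
        by-cluster (group-by #(nth labels %) idx)]
    (mean
     (for [i idx
           :let [own   (nth labels i)
                 p     (nth points i)
                 peers (remove #{i} (by-cluster own))]]
       (if (empty? peers)
         0.0
         (let [a (mean (map #(dist p (nth points %)) peers))
               b (apply min
                        (for [[c members] by-cluster
                              :when (not= c own)]
                          (mean (map #(dist p (nth points %)) members))))]
           (/ (- b a) (max a b))))))))

;; Two tight, well-separated clusters give a score close to 1.
(silhouette [[0.0 0.0] [0.0 1.0] [10.0 0.0] [10.0 1.0]]
            [0 0 1 1])
```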

3.7 Part 7: The Elbow Method for Choosing K

The elbow method helps determine the optimal number of clusters.

(defn fit-kmeans-for-k [ds k]
  (let [pipeline (mm/pipeline
                  (prep/std-scale numeric-cols {:mean? true :stddev? true})
                  {:metamorph/id :model}
                  (ml/model {:model-type :fastmath.cluster/k-means
                             :clustering-method-args [k 100]}))
        result (pipeline {:metamorph/data ds
                          :metamorph/mode :fit})]
    {:k k
     :model (-> result :model :model-data)
     :result result}))

Try different values of K:

(def k-values [2 3 4 5 6 7 8])

Testing different values of K…

(def elbow-results
  (mapv #(fit-kmeans-for-k iris-ds %) k-values))

Tried K values: [2 3 4 5 6 7 8]

To find the optimal K, plot WCSS vs K and look for an ‘elbow’ where the rate of decrease slows down.

elbow-results
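
Once a WCSS value is available for each K, the elbow can also be located numerically instead of by eye. The sketch below, in plain Clojure, picks the K with the largest second difference, i.e. where the curve bends most sharply. `elbow-k` is a hypothetical helper, and the WCSS values in the example are made-up numbers for illustration only, not results from the models fitted above.

```clojure
;; Given K values and the matching WCSS values (at least three of each),
;; return the K at which the decrease in WCSS slows down the most.
(defn elbow-k [ks wcss-values]
  (let [second-diffs (map (fn [[a b c]] (+ a (* -2 b) c))
                          (partition 3 1 wcss-values))
        [best-idx _] (apply max-key second
                            (map-indexed vector second-diffs))]
    ;; second-diffs[i] describes the bend at position i + 1
    (nth ks (inc best-idx))))

;; Hypothetical WCSS values for K = 2..6.
(elbow-k [2 3 4 5 6] [100.0 40.0 30.0 24.0 20.0])
;; => 3
```

This heuristic is sensitive to noise in the WCSS values, so it is best used alongside a plot and the other evaluation approaches rather than as the sole criterion.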

3.8 Part 8: Complete Unsupervised Workflow

Here’s a complete workflow combining preprocessing, dimensionality reduction, and clustering:

(defn unsupervised-workflow
  "Complete unsupervised learning workflow"
  [dataset n-components n-clusters]

  (let [;; Build the pipeline
        pipeline (mm/pipeline
                  ;; Step 1: Standardize features
                  (prep/std-scale numeric-cols {:mean? true :stddev? true})

                  ;; Step 2: Dimensionality reduction with PCA
                  {:metamorph/id :pca}
                  (projections/reduce-dimensions :pca-cov n-components numeric-cols {})

                  (tc-mm/drop-columns [:sepal-length :sepal-width :petal-length :petal-width])
                  ;; Step 3: Cluster in reduced space
                  {:metamorph/id :kmeans}
                  (clustering/cluster :k-means [n-clusters 100 1e-4] :clustering))

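        ;; Fit the pipeline
        fitted (mm/fit-pipe dataset pipeline)

        ;; Extract the fitted models from the context
        ;; (the PCA model is read the same way as in section 2.2)
        pca-model (-> fitted :pca :fit-result :model)
        kmeans-model (-> fitted :kmeans :model-data)

        ;; Cluster assignments computed during fitting
        cluster-assignments (-> fitted :kmeans :clustering)]
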
    {:pipeline pipeline
     :cluster-assignments cluster-assignments
     :pca-model pca-model
     :kmeans-model kmeans-model
     :fitted-ctx fitted}))

(def workflow-result
  (unsupervised-workflow iris-ds 2 3))

workflow-result

(-> workflow-result keys)

(:pipeline :cluster-assignments :pca-model :kmeans-model :fitted-ctx)

Complete workflow executed!

Add clusters to original data:

(def iris-final
  (-> iris-ds
      (tc/add-column :cluster (:cluster-assignments workflow-result))))

Final clustered data:

iris-final

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [150 5]:

:sepal-length	:sepal-width	:petal-length	:petal-width	:cluster
5.1	3.5	1.4	0.2	1
4.9	3.0	1.4	0.2	1
4.7	3.2	1.3	0.2	1
4.6	3.1	1.5	0.2	1
5.0	3.6	1.4	0.2	1
5.4	3.9	1.7	0.4	1
4.6	3.4	1.4	0.3	1
5.0	3.4	1.5	0.2	1
4.4	2.9	1.4	0.2	1
4.9	3.1	1.5	0.1	1
…	…	…	…	…
6.9	3.1	5.4	2.1	0
6.7	3.1	5.6	2.4	0
6.9	3.1	5.1	2.3	0
5.8	2.7	5.1	1.9	2
6.8	3.2	5.9	2.3	0
6.7	3.3	5.7	2.5	0
6.7	3.0	5.2	2.3	0
6.3	2.5	5.0	1.9	2
6.5	3.0	5.2	2.0	0
6.2	3.4	5.4	2.3	0
5.9	3.0	5.1	1.8	0

Cluster statistics:

(-> iris-final
    (tc/group-by [:cluster])
    (tc/aggregate {:count tc/row-count
                   :avg-sepal-length #(dfn/mean (% :sepal-length))
                   :avg-petal-length #(dfn/mean (% :petal-length))
                   :avg-petal-width #(dfn/mean (% :petal-width))})
    (tc/order-by [:cluster]))

_unnamed [3 5]:

:cluster	:count	:avg-sepal-length	:avg-petal-length	:avg-petal-width
0	55	6.69636364	5.41818182	1.93818182
1	49	5.01632653	1.46530612	0.24489796
2	46	5.70434783	4.21521739	1.33260870

3.9 Part 9: Applying Models to New Data

Once trained, unsupervised models can transform new data using the learned patterns.

Create some new data (using a sample from the original):

(def new-data
  (tc/random iris-ds 80))

New data to transform:

(tc/head new-data 5)

https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [5 4]:

:sepal-length	:sepal-width	:petal-length	:petal-width
5.4	3.7	1.5	0.2
6.1	3.0	4.9	1.8
5.1	3.8	1.9	0.4
5.1	3.5	1.4	0.3
7.2	3.6	6.1	2.5

Apply the trained pipeline:

(def new-data-transformed
  (-> (mm/transform-pipe
       new-data
       (:pipeline workflow-result)
       (:fitted-ctx workflow-result))
      :metamorph/data))

Transformed new data with cluster assignments:

(-> new-data-transformed :clustering frequencies)

{1 20, 0 34, 2 26}
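
The key point of transform mode is that parameters learned during fitting are reused unchanged on new data. The plain-Clojure sketch below shows that pattern for a standard scaler only; it is a conceptual illustration, not the metamorph implementation, and `fit-scaler` and `apply-scaler` are hypothetical names.

```clojure
;; "Fit": learn the mean and sample standard deviation from training values.
(defn fit-scaler [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) n)
        variance (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs))
                    (dec n))]
    {:mean mean :sd (Math/sqrt (double variance))}))

;; "Transform": apply the stored statistics to any values, old or new.
(defn apply-scaler [{:keys [mean sd]} xs]
  (mapv #(/ (- % mean) sd) xs))

(def scaler (fit-scaler [1.0 2.0 3.0]))
;; scaler => {:mean 2.0, :sd 1.0}

;; New data is scaled with the training statistics, not its own.
(apply-scaler scaler [4.0])
;; => [2.0]
```

Recomputing the statistics on the new data instead would place it in a different coordinate system from the one the clusters were learned in, which is why the fitted context is passed to the transform step.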

3.10 Part 10: Advanced Topics

3.10.1 10.1 DBSCAN (Density-Based Clustering)

DBSCAN can find clusters of arbitrary shape and identify outliers.

(def dbscan-pipeline
  (mm/pipeline
   (prep/std-scale numeric-cols {:mean? true :stddev? true})
   {:metamorph/id :model}
   (ml/model {:model-type :smile.clustering/dbscan
              :min-pts 5       ; Minimum points for a cluster
              :radius 0.5})))  ; Neighborhood radius

(def dbscan-result
  (dbscan-pipeline
   {:metamorph/data iris-ds
    :metamorph/mode :fit}))

DBSCAN clustering: Can detect outliers, which receive a dedicated noise label instead of a cluster id (the exact value depends on the implementation)

(def dbscan-clusters
  (:cluster-id (:metamorph/data dbscan-result)))

(-> (tc/add-column iris-ds :dbscan-cluster dbscan-clusters)
    (tc/group-by [:dbscan-cluster])
    (tc/aggregate {:count tc/row-count})
    (tc/order-by [:dbscan-cluster]))
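
The two DBSCAN parameters interact through the notion of a core point: a point with at least min-pts points inside its radius neighborhood. The plain-Clojure sketch below shows only that test, not the full algorithm; `dist` and `core-point?` are hypothetical helpers, and counting the point itself as its own neighbor is an assumption (conventions differ between implementations).

```clojure
(defn dist [a b]
  (Math/sqrt (double (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))))

;; True if p has at least min-pts points (including itself, when p is
;; part of points) within the given radius.
(defn core-point? [points radius min-pts p]
  (>= (count (filter #(<= (dist p %) radius) points))
      min-pts))

(def pts [[0.0 0.0] [0.1 0.0] [0.0 0.1] [0.1 0.1] [5.0 5.0]])

(core-point? pts 0.5 4 [0.0 0.0])
;; => true  (four points lie within radius 0.5)
(core-point? pts 0.5 4 [5.0 5.0])
;; => false (only the point itself is within radius 0.5)
```

Points that are neither core points nor within the radius of a core point are the ones DBSCAN reports as outliers.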

3.10.2 10.2 Different Linkage Methods in Hierarchical Clustering

Hierarchical clustering linkage methods:

:single - Minimum distance between clusters (can create long chains)
:complete - Maximum distance between clusters (creates tight clusters)
:average - Average distance between all pairs
:ward - Minimizes within-cluster variance (often best results)

(defn try-linkage [linkage-method]
  (let [pipeline (mm/pipeline
                  (prep/std-scale numeric-cols {:mean? true :stddev? true})
                  {:metamorph/id :model}
                  (ml/model {:model-type :smile.clustering/hierarchical
                             :k 3
                             :linkage linkage-method}))
        result (pipeline {:metamorph/data iris-ds
                          :metamorph/mode :fit})]
    {:linkage linkage-method
     :clusters (:cluster-id (:metamorph/data result))}))

(def linkage-comparison
  (map try-linkage [:single :complete :average :ward]))

Compared 4 different linkage methods
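
The first three linkage definitions listed in this section differ only in how they summarize the pairwise distances between two clusters. The plain-Clojure sketch below makes that explicit; the function names are hypothetical and independent of any clustering library, and Ward linkage is omitted because it is defined through the increase in within-cluster variance rather than pairwise distances.

```clojure
(defn dist [a b]
  (Math/sqrt (double (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))))

;; All distances between a point of cluster a and a point of cluster b.
(defn pairwise [a b]
  (for [p a, q b] (dist p q)))

(defn single-linkage [a b]   (apply min (pairwise a b)))
(defn complete-linkage [a b] (apply max (pairwise a b)))
(defn average-linkage [a b]
  (let [ds (pairwise a b)]
    (/ (reduce + ds) (count ds))))

(def cluster-a [[0.0 0.0] [0.0 1.0]])
(def cluster-b [[3.0 0.0] [4.0 0.0]])

(single-linkage cluster-a cluster-b)
;; => 3.0
(complete-linkage cluster-a cluster-b)
;; => approximately 4.123 (the square root of 17)
(average-linkage cluster-a cluster-b)
;; lies between the single and complete values
```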

3.11 Part 11: Best Practices for Unsupervised Learning

3.11.1 Best Practices

Always scale your features - Most algorithms are sensitive to feature scales
- Use standard scaling (mean=0, std=1) for most cases
- Use min-max scaling when you need specific ranges
- Use robust scaling when you have outliers
Try multiple algorithms - Different algorithms work better for different data
- K-means: Fast, works well with spherical clusters
- Hierarchical: Good for exploring different granularities
- DBSCAN: Can find arbitrary shapes and outliers
Validate results - Without labels, validation requires creativity
- Visual inspection (especially after PCA to 2D/3D)
- Domain expertise: Do the clusters make sense?
- Stability: Do results change much with different random seeds?
- Multiple metrics: Use several quality measures
Use dimensionality reduction carefully
- PCA is great for visualization and noise reduction
- But it can remove important information
- Try clustering with and without PCA
Preprocess appropriately for your data type
- Numerical: Scaling, outlier handling
- Categorical: One-hot encoding
- Text: TF-IDF, embeddings
- Mixed: Handle each type appropriately
Experiment with hyperparameters
- Number of clusters (K)
- Distance metrics
- Linkage methods (for hierarchical)
- PCA components
- Use the elbow method and silhouette analysis

3.12 Summary

In this tutorial, we covered:

K-means clustering - Partitioning data into K groups
Hierarchical clustering - Building cluster trees
DBSCAN - Density-based clustering with outlier detection
PCA - Dimensionality reduction for visualization and preprocessing
Feature scaling - Standard and min-max scaling
Text processing - TF-IDF for document clustering
Evaluation - Methods for assessing cluster quality
Complete workflows - End-to-end unsupervised learning pipelines
Best practices - Guidelines for successful unsupervised learning

3.13 Next Steps

Explore other Smile clustering algorithms (X-Means, G-Means)
Try t-SNE or UMAP for non-linear dimensionality reduction
Combine unsupervised and supervised learning (semi-supervised)
Use clustering for feature engineering in supervised tasks
Apply to real-world problems: customer segmentation, anomaly detection, etc.

Tutorial complete! You now have a solid foundation for unsupervised machine learning with metamorph.ml.

source: notebooks/unsupervised-ml-intro.clj