3 Introduction to Unsupervised Machine Learning with metamorph.ml
This tutorial introduces unsupervised machine learning using the metamorph.ml library. We’ll cover:
- Clustering algorithms (K-means, hierarchical clustering)
- Dimensionality reduction (PCA)
- Feature scaling and preprocessing
- Text feature extraction with TF-IDF
- Evaluation techniques for unsupervised learning
- Building complete unsupervised ML pipelines
Unlike supervised learning, unsupervised learning works with unlabeled data to discover hidden patterns, group similar observations, or reduce dimensionality for visualization and preprocessing.
Disclaimer: (created with the help of Claude Code)
(ns unsupervised-ml-intro
(:require
[clojure.string :as str]
[tablecloth.api :as tc]
[tablecloth.pipeline :as tc-mm]
[tech.v3.dataset :as ds]
[tech.v3.dataset.column-filters :as cf]
[tech.v3.datatype.functional :as dfn]
[scicloj.metamorph.core :as mm]
[scicloj.metamorph.ml :as ml]
[scicloj.metamorph.ml.preprocessing :as prep]
[scicloj.metamorph.ml.rdatasets :as rdatasets]
[scicloj.ml.smile.projections :as projections]
[scicloj.ml.smile.clustering :as clustering]
[scicloj.ml.smile.manifold]))
3.1 Part 1: Clustering with K-Means
Clustering groups similar data points together without predefined labels. K-means is one of the most popular clustering algorithms.
3.1.1 1.1 Loading and Exploring Data
We’ll use the Iris dataset, but ignore the species labels to treat it as an unsupervised problem.
(def iris-ds
(-> (rdatasets/datasets-iris)
(ds/drop-columns [:rownames :species]))) ; Remove labels for unsupervised learning
Iris dataset (unlabeled):
(tc/head iris-ds 5)
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [5 4]:
| :sepal-length | :sepal-width | :petal-length | :petal-width |
|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 |
| 4.9 | 3.0 | 1.4 | 0.2 |
| 4.7 | 3.2 | 1.3 | 0.2 |
| 4.6 | 3.1 | 1.5 | 0.2 |
| 5.0 | 3.6 | 1.4 | 0.2 |
Dataset shape: 150 rows × 4 columns
View column statistics:
(ds/descriptive-stats iris-ds)
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|
| :sepal-length | :float64 | 150 | 0 | 4.3 | 5.84333333 | 7.9 | 0.82806613 | 0.31491096 | 5.1 | 5.9 |
| :sepal-width | :float64 | 150 | 0 | 2.0 | 3.05733333 | 4.4 | 0.43586628 | 0.31896566 | 3.5 | 3.0 |
| :petal-length | :float64 | 150 | 0 | 1.0 | 3.75800000 | 6.9 | 1.76529823 | -0.27488418 | 1.4 | 5.1 |
| :petal-width | :float64 | 150 | 0 | 0.1 | 1.19933333 | 2.5 | 0.76223767 | -0.10296675 | 0.2 | 1.8 |
3.1.2 1.2 Data Preprocessing
Before clustering, we should standardize features so that features with larger scales don’t dominate the distance calculations.
(def numeric-cols (tc/column-names (cf/numeric iris-ds)))
(def preprocessing-pipeline
(mm/pipeline
(prep/std-scale numeric-cols {:mean? true :stddev? true})))
Apply preprocessing in fit mode:
(def fitted-preproc-ctx
(preprocessing-pipeline
{:metamorph/data iris-ds
:metamorph/mode :fit}))
(def scaled-iris
(:metamorph/data fitted-preproc-ctx))
Scaled data (first 5 rows):
(tc/head scaled-iris 5)
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [5 4]:
| :sepal-length | :sepal-width | :petal-length | :petal-width |
|---|---|---|---|
| -0.89767388 | 1.01560199 | -1.33575163 | -1.31105215 |
| -1.13920048 | -0.13153881 | -1.33575163 | -1.31105215 |
| -1.38072709 | 0.32731751 | -1.39239929 | -1.31105215 |
| -1.50149039 | 0.09788935 | -1.27910398 | -1.31105215 |
| -1.01843718 | 1.24503015 | -1.33575163 | -1.31105215 |
Check that scaling worked (mean ≈ 0, std ≈ 1):
Scaled data statistics:
(ds/descriptive-stats scaled-iris)
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|
| :sepal-length | :float64 | 150 | 0 | -1.86378030 | -4.43719136E-16 | 2.48369858 | 1.0 | 0.31491096 | -0.89767388 | 0.06843254 |
| :sepal-width | :float64 | 150 | 0 | -2.42582042 | -8.17679258E-16 | 3.08045544 | 1.0 | 0.31896566 | 1.01560199 | -0.13153881 |
| :petal-length | :float64 | 150 | 0 | -1.56234224 | -2.58311890E-16 | 1.77986923 | 1.0 | -0.27488418 | -1.33575163 | 0.76021149 |
| :petal-width | :float64 | 150 | 0 | -1.44224482 | -3.70074342E-17 | 1.70637941 | 1.0 | -0.10296675 | -1.31105215 | 0.78803068 |
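To make the scaling step concrete, here is a minimal sketch of the z-score computation in plain Clojure. The helper names (`mean`, `sample-stddev`, `z-scores`) are illustrative and not part of metamorph.ml. The sketch assumes the sample standard deviation (dividing by n - 1), which appears consistent with the statistics shown above.

```clojure
;; Illustrative helpers, not library functions.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample-stddev
  "Sample standard deviation (divides by n - 1)."
  [xs]
  (let [m (mean xs)
        n (count xs)]
    (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
                  (dec n)))))

(defn z-scores
  "Subtract the mean and divide by the standard deviation."
  [xs]
  (let [m (mean xs)
        s (sample-stddev xs)]
    (mapv #(/ (- % m) s) xs)))

(z-scores [1.0 2.0 3.0])
;; => [-1.0 0.0 1.0]
```

Applying the same formula to the first sepal length, (5.1 - 5.8433) / 0.8281, gives roughly -0.898, in line with the first row of the scaled table.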
3.1.3 1.3 K-Means Clustering
K-means partitions data into K clusters by minimizing within-cluster variance.
(def kmeans-pipeline
(mm/pipeline
;; Standardize features
(prep/std-scale numeric-cols {:mean? true :stddev? true})
;; K-means clustering
{:metamorph/id :model}
(ml/model {:model-type :fastmath.cluster/k-means
:clustering-method-args [3 100 1e-4]})))
The values in :clustering-method-args are positional:
- :k 3 -> Number of clusters
- :max-iter 100 -> Maximum iterations
- :tolerance 1e-4 -> Convergence tolerance
Fit the clustering model:
(def kmeans-result
(kmeans-pipeline
{:metamorph/data iris-ds
:metamorph/mode :fit}))
Extract the trained model:
(def kmeans-model
(-> kmeans-result :model :model-data))
^kind/println
(-> kmeans-model :obj str)
Cluster distortion: 139.96219
Cluster size of 150 data points:
Cluster 1 49 (32.7%)
Cluster 2 55 (36.7%)
Cluster 3 46 (30.7%)
K-means clustering complete!
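For intuition about what the model does internally, the following is a minimal sketch of Lloyd's algorithm, the classic K-means iteration, in plain Clojure. It is not how fastmath or Smile implement it. The function names, the toy points, and the fixed initial centroids are all assumptions made for illustration, and real implementations choose initial centroids more carefully.

```clojure
(defn sq-dist
  "Squared Euclidean distance between two points."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn nearest
  "Index of the centroid closest to point p."
  [centroids p]
  (apply min-key #(sq-dist (nth centroids %) p) (range (count centroids))))

(defn centroid
  "Component-wise mean of a collection of points."
  [points]
  (let [n (double (count points))]
    (apply mapv (fn [& xs] (/ (reduce + xs) n)) points)))

(defn lloyd-step
  "Assign each point to its nearest centroid, then recompute the centroids.
  A centroid that receives no points is left where it was."
  [centroids points]
  (let [groups (group-by #(nearest centroids %) points)]
    (mapv (fn [i]
            (if-let [members (groups i)]
              (centroid members)
              (nth centroids i)))
          (range (count centroids)))))

(defn k-means
  "Iterate until the centroids stop changing or max-iter is reached."
  [points init-centroids max-iter]
  (loop [cs init-centroids
         i 0]
    (let [next-cs (lloyd-step cs points)]
      (if (or (= cs next-cs) (>= (inc i) max-iter))
        next-cs
        (recur next-cs (inc i))))))

(def toy-points [[1.0 1.0] [1.0 2.0] [8.0 8.0] [9.0 8.0]])

(def toy-centroids
  (k-means toy-points [[0.0 0.0] [10.0 10.0]] 100))
;; toy-centroids => [[1.0 1.5] [8.5 8.0]]
```

Each step can only lower (or keep) the within-cluster variance, which is why the iteration converges, although possibly to a local optimum that depends on the starting centroids.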
3.1.4 1.4 Analyzing Cluster Assignments
Get the cluster assignments (which cluster each point belongs to):
(def cluster-assignments
(-> kmeans-model :clustering))
Number of unique clusters found: 3
Add cluster assignments to the original data:
(def iris-with-clusters
(tc/add-column iris-ds :cluster cluster-assignments))
Data with cluster assignments:
(tc/head iris-with-clusters 10)
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [10 5]:
| :sepal-length | :sepal-width | :petal-length | :petal-width | :cluster |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 4.9 | 3.1 | 1.5 | 0.1 | 0 |
View cluster sizes:
(-> iris-with-clusters
(tc/group-by [:cluster])
(tc/aggregate {:count tc/row-count})
(tc/order-by [:cluster]))
_unnamed [3 2]:
| :cluster | :count |
|---|---|
| 0 | 49 |
| 1 | 55 |
| 2 | 46 |
3.1.5 1.5 Cluster Statistics
Examine the characteristics of each cluster:
(-> iris-with-clusters
(tc/group-by [:cluster])
(tc/aggregate {:mean-sepal-length #(dfn/mean (% :sepal-length))
:mean-sepal-width #(dfn/mean (% :sepal-width))
:mean-petal-length #(dfn/mean (% :petal-length))
:mean-petal-width #(dfn/mean (% :petal-width))
:count tc/row-count})
(tc/order-by [:cluster]))
_unnamed [3 6]:
| :cluster | :mean-sepal-length | :mean-sepal-width | :mean-petal-length | :mean-petal-width | :count |
|---|---|---|---|---|---|
| 0 | 5.01632653 | 3.45102041 | 1.46530612 | 0.24489796 | 49 |
| 1 | 6.69636364 | 3.06000000 | 5.41818182 | 1.93818182 | 55 |
| 2 | 5.70434783 | 2.63478261 | 4.21521739 | 1.33260870 | 46 |
3.2 Part 2: Dimensionality Reduction with PCA
Principal Component Analysis (PCA) reduces the number of features while retaining most of the variance in the data. It’s useful for:
- Visualization (reducing to 2-3 dimensions)
- Preprocessing before modeling
- Noise reduction
3.2.1 2.1 Applying PCA
(def pca-pipeline
(mm/pipeline
;; Standardize features (strongly recommended for PCA when features have different scales)
(prep/std-scale numeric-cols {:mean? true :stddev? true})
;; PCA
{:metamorph/id :pca}
(projections/reduce-dimensions :pca-cov 2 numeric-cols {}))) ; Reduce to 2 dimensions
Fit PCA:
(def pca-result (mm/fit-pipe iris-ds pca-pipeline))
(def pca-transformed
(-> pca-result
:metamorph/data
(tc/select-columns ["pca-cov-0" "pca-cov-1"])))
PCA transformation complete!
Original dimensions: 4
Reduced dimensions: 2
View the transformed data:
(tc/head pca-transformed 10)
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [10 2]:
| pca-cov-0 | pca-cov-1 |
|---|---|
| -2.25714118 | -0.47842383 |
| -2.07401302 | 0.67188269 |
| -2.35633511 | 0.34076642 |
| -2.29170679 | 0.59539986 |
| -2.38186270 | -0.64467566 |
| -2.06870061 | -1.48420530 |
| -2.43586845 | -0.04748512 |
| -2.22539189 | -0.22240300 |
| -2.32684533 | 1.11160370 |
| -2.17703491 | 0.46744757 |
3.2.2 2.2 Explained Variance
PCA components capture different amounts of variance. The first component captures the most variance, the second captures the second most, etc.
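As a small worked example of how explained variance is derived, the sketch below turns per-component variances into variance proportions and their running total. The function names and the toy variances are hypothetical and are not taken from the Iris model.

```clojure
(defn variance-proportions
  "Share of the total variance captured by each component."
  [variances]
  (let [total (reduce + variances)]
    (mapv #(/ % total) variances)))

(defn cumulative
  "Running total of a sequence of proportions."
  [xs]
  (vec (reductions + xs)))

;; Hypothetical component variances, sorted from largest to smallest.
(def toy-variances [2.0 1.0 1.0])

(variance-proportions toy-variances)
;; => [0.5 0.25 0.25]

(cumulative (variance-proportions toy-variances))
;; => [0.5 0.75 1.0]
```

A common rule of thumb is to keep the smallest number of components whose cumulative proportion passes a chosen threshold, such as 0.9 or 0.95.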
(def pca-model
(-> pca-result :pca :fit-result :model))
The PCA model contains information about explained variance and component loadings.
(-> pca-model bean keys)
(:center
:class
:cumulativeVarianceProportion
:loadings
:projection
:variance
:varianceProportion)
For example, the cumulative variance proportion:
(-> pca-model .getCumulativeVarianceProportion vec)
[0.7296244541329989 0.9581320720000164 0.9948212908928452 1.0]
Here the first two components together explain about 95.8% of the variance.
3.2.3 2.3 Combining PCA with Clustering
A common pattern is to use PCA for dimensionality reduction, then cluster in the reduced space:
(def pca-kmeans-pipeline
(mm/pipeline
;; Standardize
(prep/std-scale numeric-cols {:mean? true :stddev? true})
;; Reduce dimensions
{:metamorph/id :pca}
(projections/reduce-dimensions :pca-cov 2 numeric-cols {})
(tc-mm/select-columns ["pca-cov-0" "pca-cov-1"])
;; Cluster in reduced space
{:metamorph/id :kmeans}
(clustering/cluster :k-means [3 100 1e-4] :clustering)))
(def pca-kmeans-result
(mm/fit-pipe iris-ds pca-kmeans-pipeline))
PCA + K-means pipeline complete!
Get cluster assignments from the combined pipeline:
(def pca-clusters
(-> pca-kmeans-result :metamorph/data :clustering seq))
(def iris-pca-clusters
(tc/add-column iris-ds :pca-cluster pca-clusters))
(-> iris-pca-clusters
(tc/group-by [:pca-cluster])
(tc/aggregate {:count tc/row-count})
(tc/order-by [:pca-cluster]))
_unnamed [3 2]:
| :pca-cluster | :count |
|---|---|
| 0 | 35 |
| 1 | 100 |
| 2 | 15 |
3.3 Part 3: Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) of clusters, allowing exploration at different granularities.
(def hclust-pipeline
  (mm/pipeline
   (prep/std-scale cf/numeric {:mean? true :stddev? true})
   {:metamorph/id :model}
   (ml/model {:model-type :smile.clustering/hierarchical
              :k 3                    ; Cut tree to get 3 clusters
              :linkage :complete})))  ; Linkage method: :single, :complete, :average, :ward
(def hclust-result
  (hclust-pipeline {:metamorph/data iris-ds
                    :metamorph/mode :fit}))
(def hclust-assignments
  (:cluster-id (:metamorph/data hclust-result)))
(def iris-hclust
  (tc/add-column iris-ds :hclust-cluster hclust-assignments))
Hierarchical clustering results:
(-> iris-hclust
    (tc/group-by [:hclust-cluster])
    (tc/aggregate {:count tc/row-count
                   :mean-petal-length #(dfn/mean (% :petal-length))})
    (tc/order-by [:hclust-cluster]))
3.4 Part 4: Feature Engineering and Preprocessing
Proper preprocessing is crucial for unsupervised learning.
3.4.1 4.1 Standard Scaling (Z-score Normalization)
(def std-scaled-ds
(-> (mm/pipeline
(prep/std-scale numeric-cols {:mean? true :stddev? true}))
(apply [{:metamorph/data iris-ds
:metamorph/mode :fit}])
:metamorph/data))
Standard scaling: Transforms features to have mean=0 and std=1
(ds/descriptive-stats std-scaled-ds [:mean :standard-deviation])
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|
| :sepal-length | :float64 | 150 | 0 | -1.86378030 | -4.43719136E-16 | 2.48369858 | 1.0 | 0.31491096 | -0.89767388 | 0.06843254 |
| :sepal-width | :float64 | 150 | 0 | -2.42582042 | -8.17679258E-16 | 3.08045544 | 1.0 | 0.31896566 | 1.01560199 | -0.13153881 |
| :petal-length | :float64 | 150 | 0 | -1.56234224 | -2.58311890E-16 | 1.77986923 | 1.0 | -0.27488418 | -1.33575163 | 0.76021149 |
| :petal-width | :float64 | 150 | 0 | -1.44224482 | -3.70074342E-17 | 1.70637941 | 1.0 | -0.10296675 | -1.31105215 | 0.78803068 |
3.4.2 4.2 Min-Max Scaling
(def minmax-scaled-ds
(-> (mm/pipeline
(prep/min-max-scale numeric-cols {:min -1 :max 1}))
(apply [{:metamorph/data iris-ds
:metamorph/mode :fit}])
:metamorph/data))
Min-max scaling: Transforms features to a specific range (here: -1 to 1)
(ds/descriptive-stats minmax-scaled-ds [:min :max])
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html: descriptive-stats [4 11]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|
| :sepal-length | :float64 | 150 | 0 | -1.0 | -0.14259259 | 1.0 | 0.46003674 | 0.31491096 | -0.55555556 | -0.11111111 |
| :sepal-width | :float64 | 150 | 0 | -1.0 | -0.11888889 | 1.0 | 0.36322190 | 0.31896566 | 0.25000000 | -0.16666667 |
| :petal-length | :float64 | 150 | 0 | -1.0 | -0.06508475 | 1.0 | 0.59840618 | -0.27488418 | -0.86440678 | 0.38983051 |
| :petal-width | :float64 | 150 | 0 | -1.0 | -0.08388889 | 1.0 | 0.63519806 | -0.10296675 | -0.91666667 | 0.41666667 |
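The arithmetic behind min-max scaling can be sketched in a few lines of plain Clojure. The function below is an illustrative stand-in, not the library implementation, and it assumes the column is not constant (otherwise the span is zero and the division is undefined).

```clojure
(defn min-max-scale
  "Linearly map xs so that its minimum becomes lo and its maximum becomes hi.
  Assumes xs contains at least two distinct values."
  [xs lo hi]
  (let [mn (apply min xs)
        mx (apply max xs)
        span (- mx mn)]
    (mapv #(+ lo (* (- hi lo) (/ (- % mn) span))) xs)))

(min-max-scale [2.0 4.0 6.0] -1.0 1.0)
;; => [-1.0 0.0 1.0]
```

For the first sepal length (5.1, with column minimum 4.3 and maximum 7.9) the same formula gives about -0.556, which agrees with the :first value in the table above.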
3.4.3 4.3 Robust Scaling for Outliers
When data has outliers, standard scaling can be affected. Consider using quantile-based scaling or removing outliers first.
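One common quantile-based option is to center each feature on its median and divide by its interquartile range (IQR). The sketch below is a plain-Clojure illustration under stated assumptions: `quantile` and `robust-scale` are hypothetical helper names, the quantile uses linear interpolation (one of several conventions), and the IQR is assumed to be non-zero.

```clojure
(defn quantile
  "Quantile q (between 0 and 1) of an already sorted vector,
  using linear interpolation between neighbouring values."
  [sorted q]
  (let [pos (* q (dec (count sorted)))
        lo (long (Math/floor pos))
        hi (long (Math/ceil pos))
        frac (- pos lo)]
    (+ (* (- 1.0 frac) (nth sorted lo))
       (* frac (nth sorted hi)))))

(defn robust-scale
  "Center on the median and divide by the interquartile range."
  [xs]
  (let [s (vec (sort xs))
        med (quantile s 0.5)
        iqr (- (quantile s 0.75) (quantile s 0.25))]
    (mapv #(/ (- % med) iqr) xs)))

;; The outlier 100.0 does not affect the median or the IQR,
;; so the other values keep a sensible spread.
(robust-scale [1.0 2.0 3.0 4.0 100.0])
;; => [-1.0 -0.5 0.0 0.5 48.5]
```

With standard scaling the same outlier would inflate the standard deviation and squeeze the remaining values close together.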
3.5 Part 5: Text Clustering with TF-IDF
Unsupervised learning is commonly used for text data. Let’s create a simple example of text clustering.
Create a small text dataset:
(def documents-ds
(tc/dataset {:doc-id (range 6)
:text ["machine learning is fascinating"
"deep learning uses neural networks"
"I love pizza and pasta"
"Italian food is delicious"
"supervised learning needs labels"
"My favorite food is sushi"]}))
Text documents:
documents-ds
_unnamed [6 2]:
| :doc-id | :text |
|---|---|
| 0 | machine learning is fascinating |
| 1 | deep learning uses neural networks |
| 2 | I love pizza and pasta |
| 3 | Italian food is delicious |
| 4 | supervised learning needs labels |
| 5 | My favorite food is sushi |
3.5.1 5.1 Tokenization and TF-IDF
First, we need to tokenize text and compute TF-IDF features:
(require '[scicloj.metamorph.ml.text :as text])
Tokenize documents:
(defn simple-tokenize [text]
(-> text
str/lower-case
(str/split #"\s+")))
(def tokenized-docs
(tc/add-column documents-ds
:tokens
(map simple-tokenize (:text documents-ds))))
Tokenized documents:
(tc/select-columns tokenized-docs [:doc-id :tokens])
_unnamed [6 2]:
| :doc-id | :tokens |
|---|---|
| 0 | [machine learning is fascinating] |
| 1 | [deep learning uses neural networks] |
| 2 | [i love pizza and pasta] |
| 3 | [italian food is delicious] |
| 4 | [supervised learning needs labels] |
| 5 | [my favorite food is sushi] |
Convert to tidy text format (one row per token):
(def tidy-docs
(tc/dataset
(mapcat (fn [row]
(map (fn [token]
{:doc-id (:doc-id row)
:token token})
(:tokens row)))
(ds/mapseq-reader tokenized-docs))))
Tidy text format (sample):
(tc/head tidy-docs 10)
_unnamed [10 2]:
| :doc-id | :token |
|---|---|
| 0 | machine |
| 0 | learning |
| 0 | is |
| 0 | fascinating |
| 1 | deep |
| 1 | learning |
| 1 | uses |
| 1 | neural |
| 1 | networks |
| 2 | i |
Compute term frequencies:
(def term-counts
(-> tidy-docs
(tc/group-by [:doc-id :token])
(tc/aggregate {:n tc/row-count})))
Term counts (sample):
(tc/head term-counts 10)
_unnamed [10 3]:
| :doc-id | :token | :n |
|---|---|---|
| 0 | machine | 1 |
| 0 | learning | 1 |
| 0 | is | 1 |
| 0 | fascinating | 1 |
| 1 | deep | 1 |
| 1 | learning | 1 |
| 1 | uses | 1 |
| 1 | neural | 1 |
| 1 | networks | 1 |
| 2 | i | 1 |
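The step from term counts to TF-IDF weights can be sketched in plain Clojure. This is an illustration only: it uses one common variant (tf = count / document length, idf = ln(N / document frequency)), which is not necessarily what scicloj.metamorph.ml.text computes, and the names `tokenize`, `tf-idf`, and `weights` are hypothetical.

```clojure
(require '[clojure.string :as str])

(def docs ["machine learning is fascinating"
           "deep learning uses neural networks"
           "I love pizza and pasta"
           "Italian food is delicious"
           "supervised learning needs labels"
           "My favorite food is sushi"])

(defn tokenize [text]
  (str/split (str/lower-case text) #"\s+"))

(defn tf-idf
  "Returns one {token weight} map per document.
  tf = count / document length, idf = ln(N / document frequency)."
  [token-lists]
  (let [n (count token-lists)
        ;; document frequency: in how many documents each token occurs
        df (frequencies (mapcat distinct token-lists))]
    (mapv (fn [tokens]
            (let [len (count tokens)]
              (into {}
                    (map (fn [[tok c]]
                           [tok (* (/ c (double len))
                                   (Math/log (/ n (double (df tok)))))])
                         (frequencies tokens)))))
          token-lists)))

(def weights (tf-idf (map tokenize docs)))

;; In document 0, "machine" occurs in only one document and gets a
;; higher weight than "learning", which occurs in three.
(select-keys (first weights) ["machine" "learning"])
```

Tokens that appear in every document would receive a weight of zero under this variant, which is the intended effect: they carry no information for telling documents apart.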
3.5.2 5.2 Document Similarity
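After TF-IDF vectorization, we can cluster documents by the similarity of their TF-IDF vectors. This captures shared vocabulary rather than deeper meaning: documents 0, 1, and 4 all contain “learning”, while documents 3 and 5 share “food”.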
Note: Full TF-IDF clustering requires converting the term-document matrix to a format suitable for clustering. The scicloj.metamorph.ml.text namespace provides functions for this.
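A typical way to compare two TF-IDF vectors is cosine similarity. The sketch below represents each document as a sparse map from token to weight. That representation, the helper names, and the toy weights are assumptions for illustration rather than the format used by the library.

```clojure
(defn dot
  "Dot product of two sparse vectors stored as {token weight} maps."
  [a b]
  (reduce + (map (fn [[k v]] (* v (get b k 0.0))) a)))

(defn norm [a]
  (Math/sqrt (dot a a)))

(defn cosine-similarity
  "1.0 for vectors pointing the same way, 0.0 when no tokens are shared."
  [a b]
  (let [denom (* (norm a) (norm b))]
    (if (zero? denom)
      0.0
      (/ (dot a b) denom))))

;; Hypothetical weights, chosen only to show the two extremes.
(cosine-similarity {"learning" 1.0} {"learning" 2.0})
;; => 1.0
(cosine-similarity {"learning" 1.0} {"pizza" 1.0})
;; => 0.0
```

A clustering algorithm can then work on these similarities, or on the dense matrix obtained by giving every token its own column.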
3.6 Part 6: Evaluation Metrics for Unsupervised Learning
Unlike supervised learning, we don’t have true labels to evaluate against. Instead, we use intrinsic quality measures.
3.6.1 6.1 Within-Cluster Sum of Squares (WCSS)
WCSS measures cluster compactness. Lower is better.
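As a plain-Clojure sketch of the definition: for every cluster, sum the squared distances from each member to the cluster centroid, then add those sums together. The helper names and the toy data are hypothetical.

```clojure
(defn sq-dist
  "Squared Euclidean distance between two points."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn centroid
  "Component-wise mean of a collection of points."
  [points]
  (let [n (double (count points))]
    (apply mapv (fn [& xs] (/ (reduce + xs) n)) points)))

(defn wcss
  "Within-cluster sum of squares for points and their cluster labels."
  [points labels]
  (->> (map vector labels points)
       (group-by first)
       vals
       (map (fn [pairs]
              (let [members (map second pairs)
                    c (centroid members)]
                (reduce + (map #(sq-dist % c) members)))))
       (reduce +)))

;; Two clusters with centroids [0 1] and [6 5]; each member is at
;; squared distance 1 from its centroid.
(wcss [[0.0 0.0] [0.0 2.0] [5.0 5.0] [7.0 5.0]] [0 0 1 1])
;; => 4.0
```

WCSS always decreases as the number of clusters grows, so it is useful for comparing runs with the same K or for spotting where the improvement levels off, not as an absolute score.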
(defn euclidean-distance [point1 point2]
(dfn/sqrt
(dfn/sum
(dfn/pow
(dfn/- point1 point2)
2))))
3.6.2 6.2 Silhouette Score
Silhouette score measures how similar a point is to its own cluster compared to other clusters. Ranges from -1 to 1, higher is better.
Common evaluation approaches:
- Elbow method: Plot WCSS vs. number of clusters, look for the ‘elbow’
- Silhouette analysis: Compute silhouette score for each point
- Gap statistic: Compare within-cluster dispersion to null reference
- Domain validation: Check if clusters make sense in your domain
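The silhouette score mentioned above can be written out directly from its definition. For a point, let a be the mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster; the score is (b - a) / max(a, b). The sketch below is a plain-Clojure illustration with hypothetical names. It assumes at least two clusters and assigns 0.0 to points in singleton clusters.

```clojure
(defn dist
  "Euclidean distance between two points."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn silhouette-scores
  "One silhouette score per point. Assumes at least two distinct labels."
  [points labels]
  (let [idx (range (count points))
        by-label (group-by #(nth labels %) idx)]
    (mapv (fn [i]
            (let [p (nth points i)
                  own (remove #{i} (by-label (nth labels i)))]
              (if (empty? own)
                0.0
                (let [a (mean (map #(dist p (nth points %)) own))
                      b (apply min
                               (for [[l members] by-label
                                     :when (not= l (nth labels i))]
                                 (mean (map #(dist p (nth points %)) members))))]
                  (/ (- b a) (max a b))))))
          idx)))

;; Two tight, well separated clusters: every score is close to 1.
(def toy-scores
  (silhouette-scores [[0.0 0.0] [0.0 1.0] [10.0 0.0] [10.0 1.0]]
                     [0 0 1 1]))
```

Averaging the per-point scores gives a single number for the whole clustering. This version computes all pairwise distances, so it is only practical for small datasets.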
3.7 Part 7: The Elbow Method for Choosing K
The elbow method helps determine the optimal number of clusters.
(defn fit-kmeans-for-k [ds k]
(let [pipeline (mm/pipeline
(prep/std-scale numeric-cols {:mean? true :stddev? true})
{:metamorph/id :model}
(ml/model {:model-type :fastmath.cluster/k-means
:clustering-method-args [k 100]}))
result (pipeline {:metamorph/data ds
:metamorph/mode :fit})]
{:k k
:model (-> result :model :model-data)
:result result}))
Try different values of K:
(def k-values [2 3 4 5 6 7 8])
Testing different values of K…
(def elbow-results
(mapv #(fit-kmeans-for-k iris-ds %) k-values))
Tried K values: [2 3 4 5 6 7 8]
To find the optimal K, plot WCSS vs K and look for an ‘elbow’ where the rate of decrease slows down.
elbow-results
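Reading the elbow off a plot is subjective. One simple heuristic, shown here as a sketch, is to pick the K where the curve bends most sharply, that is, where the second difference of WCSS is largest. The function name and the WCSS values are hypothetical, the K values are assumed to be evenly spaced, and this is only one of several possible heuristics.

```clojure
(defn elbow-k
  "Given a map of K -> WCSS, return the K with the largest second
  difference, i.e. where the decrease in WCSS slows down the most.
  Needs at least three evenly spaced K values."
  [wcss-by-k]
  (let [ks (sort (keys wcss-by-k))]
    (->> (partition 3 1 ks)
         (map (fn [[a b c]]
                [b (+ (wcss-by-k a)
                      (* -2 (wcss-by-k b))
                      (wcss-by-k c))]))
         (apply max-key second)
         first)))

;; Hypothetical WCSS values: a large drop up to K = 3, little after.
(elbow-k {1 100.0, 2 60.0, 3 20.0, 4 17.0, 5 15.0})
;; => 3
```

The result is best treated as a suggestion to check against a plot and against domain knowledge.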
3.8 Part 8: Complete Unsupervised Workflow
Here’s a complete workflow combining preprocessing, dimensionality reduction, and clustering:
(defn unsupervised-workflow
"Complete unsupervised learning workflow"
[dataset n-components n-clusters]
(let [;; Build the pipeline
pipeline (mm/pipeline
;; Step 1: Standardize features
(prep/std-scale numeric-cols {:mean? true :stddev? true})
;; Step 2: Dimensionality reduction with PCA
{:metamorph/id :pca}
(projections/reduce-dimensions :pca-cov n-components numeric-cols {})
(tc-mm/drop-columns [:sepal-length :sepal-width :petal-length :petal-width])
;; Step 3: Cluster in reduced space
{:metamorph/id :kmeans}
(clustering/cluster :k-means [n-clusters 100 1e-4] :clustering))
;; Fit the pipeline
fitted (mm/fit-pipe dataset pipeline)
;; Extract the fitted models and cluster assignments from the context
pca-model (-> fitted :pca :model-data)
kmeans-model (-> fitted :kmeans :model-data)
cluster-assignments (-> fitted :kmeans :clustering)]
{:pipeline pipeline
:cluster-assignments cluster-assignments
:pca-model pca-model
:kmeans-model kmeans-model
:fitted-ctx fitted}))
(def workflow-result
(unsupervised-workflow iris-ds 2 3))
workflow-result
(-> workflow-result keys)
(:pipeline :cluster-assignments :pca-model :kmeans-model :fitted-ctx)
Complete workflow executed!
Add clusters to original data:
(def iris-final
(-> iris-ds
(tc/add-column :cluster (:cluster-assignments workflow-result))))
Final clustered data:
iris-final
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [150 5]:
| :sepal-length | :sepal-width | :petal-length | :petal-width | :cluster |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 4.9 | 3.1 | 1.5 | 0.1 | 0 |
| … | … | … | … | … |
| 6.9 | 3.1 | 5.4 | 2.1 | 1 |
| 6.7 | 3.1 | 5.6 | 2.4 | 1 |
| 6.9 | 3.1 | 5.1 | 2.3 | 1 |
| 5.8 | 2.7 | 5.1 | 1.9 | 2 |
| 6.8 | 3.2 | 5.9 | 2.3 | 1 |
| 6.7 | 3.3 | 5.7 | 2.5 | 1 |
| 6.7 | 3.0 | 5.2 | 2.3 | 1 |
| 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 6.5 | 3.0 | 5.2 | 2.0 | 1 |
| 6.2 | 3.4 | 5.4 | 2.3 | 1 |
| 5.9 | 3.0 | 5.1 | 1.8 | 1 |
Cluster statistics:
(-> iris-final
(tc/group-by [:cluster])
(tc/aggregate {:count tc/row-count
:avg-sepal-length #(dfn/mean (% :sepal-length))
:avg-petal-length #(dfn/mean (% :petal-length))
:avg-petal-width #(dfn/mean (% :petal-width))})
(tc/order-by [:cluster]))
_unnamed [3 5]:
| :cluster | :count | :avg-sepal-length | :avg-petal-length | :avg-petal-width |
|---|---|---|---|---|
| 0 | 49 | 5.01632653 | 1.46530612 | 0.24489796 |
| 1 | 55 | 6.69636364 | 5.41818182 | 1.93818182 |
| 2 | 46 | 5.70434783 | 4.21521739 | 1.33260870 |
3.9 Part 9: Applying Models to New Data
Once trained, unsupervised models can transform new data using the learned patterns.
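For K-means, "using the learned patterns" amounts to assigning each new point to the nearest centroid found during fitting, without moving the centroids. A minimal plain-Clojure sketch, with hypothetical centroids and function names:

```clojure
(defn sq-dist
  "Squared Euclidean distance between two points."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn assign-cluster
  "Index of the learned centroid closest to point p."
  [centroids p]
  (apply min-key #(sq-dist (nth centroids %) p) (range (count centroids))))

;; Hypothetical centroids standing in for a fitted model.
(def learned-centroids [[1.0 1.5] [8.5 8.0]])

(mapv #(assign-cluster learned-centroids %)
      [[0.0 0.0] [9.0 9.0] [4.0 4.0]])
;; => [0 1 0]
```

The same principle applies to the earlier pipeline steps: new data should be scaled and projected with the statistics learned in fit mode, not with statistics recomputed from the new data.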
Create some new data (using a sample from the original):
(def new-data
(tc/random iris-ds 80))
New data to transform:
(tc/head new-data 5)
https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [5 4]:
| :sepal-length | :sepal-width | :petal-length | :petal-width |
|---|---|---|---|
| 6.1 | 2.8 | 4.0 | 1.3 |
| 5.4 | 3.0 | 4.5 | 1.5 |
| 6.4 | 2.7 | 5.3 | 1.9 |
| 5.5 | 3.5 | 1.3 | 0.2 |
| 6.5 | 2.8 | 4.6 | 1.5 |
Apply the trained pipeline:
(def new-data-transformed
(-> (mm/transform-pipe
new-data
(:pipeline workflow-result)
(:fitted-ctx workflow-result))
:metamorph/data))
Transformed new data with cluster assignments:
(-> new-data-transformed :clustering frequencies)
{2 32, 1 26, 0 22}
3.10 Part 10: Advanced Topics
3.10.1 10.1 DBSCAN (Density-Based Clustering)
DBSCAN can find clusters of arbitrary shape and identify outliers.
(def dbscan-pipeline
  (mm/pipeline
   (prep/std-scale cf/numeric {:mean? true :stddev? true})
   {:metamorph/id :model}
   (ml/model {:model-type :smile.clustering/dbscan
              :min-pts 5       ; Minimum points for a cluster
              :radius 0.5})))  ; Neighborhood radius
(def dbscan-result
  (dbscan-pipeline {:metamorph/data iris-ds
                    :metamorph/mode :fit}))
DBSCAN clustering: Can detect outliers (labeled as cluster -1)
(def dbscan-clusters
  (:cluster-id (:metamorph/data dbscan-result)))
(-> (tc/add-column iris-ds :dbscan-cluster dbscan-clusters)
    (tc/group-by [:dbscan-cluster])
    (tc/aggregate {:count tc/row-count})
    (tc/order-by [:dbscan-cluster]))
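To illustrate how the two parameters interact, the sketch below implements only the point classification at the heart of DBSCAN, not the full algorithm. A point is a core point if at least min-pts points (itself included) lie within the radius, a border point if it is not core but lies within the radius of a core point, and noise otherwise. The function names and toy data are hypothetical.

```clojure
(defn dist
  "Euclidean distance between two points."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn classify-points
  "Label every point as :core, :border, or :noise."
  [points radius min-pts]
  (let [neighbors (fn [p] (filter #(<= (dist p %) radius) points))
        core? (fn [p] (>= (count (neighbors p)) min-pts))]
    (mapv (fn [p]
            (cond
              (core? p) :core
              (some core? (neighbors p)) :border
              :else :noise))
          points)))

(classify-points [[0.0 0.0] [0.0 1.0] [1.0 0.0] [2.0 0.0] [10.0 10.0]]
                 1.0
                 3)
;; => [:core :border :core :border :noise]
```

The full algorithm then grows clusters by connecting core points that lie within the radius of each other and attaching border points to them. A larger radius or a smaller min-pts yields fewer noise points.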
3.10.2 10.2 Different Linkage Methods in Hierarchical Clustering
Hierarchical clustering linkage methods:
- :single - Minimum distance between clusters (can create long chains)
- :complete - Maximum distance between clusters (creates tight clusters)
- :average - Average distance between all pairs
- :ward - Minimizes within-cluster variance (often best results)
(defn try-linkage [linkage-method]
  (let [pipeline (mm/pipeline
                  (prep/std-scale cf/numeric {:mean? true :stddev? true})
                  {:metamorph/id :model}
                  (ml/model {:model-type :smile.clustering/hierarchical
                             :k 3
                             :linkage linkage-method}))
        result (pipeline {:metamorph/data iris-ds
                          :metamorph/mode :fit})]
    {:linkage linkage-method
     :clusters (:cluster-id (:metamorph/data result))}))
(def linkage-comparison
  (map try-linkage [:single :complete :average :ward]))
Compared 4 different linkage methods
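The first three linkage methods differ only in how they summarize the pairwise distances between two clusters, which the following plain-Clojure sketch makes explicit. Ward linkage is left out because it is defined through the increase in within-cluster variance, not through pairwise distances. The names and toy clusters are hypothetical.

```clojure
(defn dist
  "Euclidean distance between two points."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn linkage-distance
  "Distance between clusters c1 and c2 under the given linkage method."
  [method c1 c2]
  (let [ds (for [p c1, q c2] (dist p q))]
    (case method
      :single   (apply min ds)
      :complete (apply max ds)
      :average  (/ (reduce + ds) (double (count ds))))))

(def cluster-a [[0.0 0.0] [1.0 0.0]])
(def cluster-b [[3.0 0.0] [5.0 0.0]])

(linkage-distance :single cluster-a cluster-b)   ;; => 2.0
(linkage-distance :complete cluster-a cluster-b) ;; => 5.0
(linkage-distance :average cluster-a cluster-b)  ;; => 3.5
```

At each step, agglomerative clustering merges the pair of clusters with the smallest linkage distance, so the choice of method directly shapes the resulting tree.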
3.11 Part 11: Best Practices for Unsupervised Learning
3.11.1 Best Practices
Always scale your features - Most algorithms are sensitive to feature scales
- Use standard scaling (mean=0, std=1) for most cases
- Use min-max scaling when you need specific ranges
- Use robust scaling when you have outliers
Try multiple algorithms - Different algorithms work better for different data
- K-means: Fast, works well with spherical clusters
- Hierarchical: Good for exploring different granularities
- DBSCAN: Can find arbitrary shapes and outliers
Validate results - Without labels, validation requires creativity
- Visual inspection (especially after PCA to 2D/3D)
- Domain expertise: Do the clusters make sense?
- Stability: Do results change much with different random seeds?
- Multiple metrics: Use several quality measures
Use dimensionality reduction carefully
- PCA is great for visualization and noise reduction
- But it can remove important information
- Try clustering with and without PCA
Preprocess appropriately for your data type
- Numerical: Scaling, outlier handling
- Categorical: One-hot encoding
- Text: TF-IDF, embeddings
- Mixed: Handle each type appropriately
Experiment with hyperparameters
- Number of clusters (K)
- Distance metrics
- Linkage methods (for hierarchical)
- PCA components
- Use the elbow method and silhouette analysis
3.12 Summary
In this tutorial, we covered:
- K-means clustering - Partitioning data into K groups
- Hierarchical clustering - Building cluster trees
- DBSCAN - Density-based clustering with outlier detection
- PCA - Dimensionality reduction for visualization and preprocessing
- Feature scaling - Standard and min-max scaling
- Text processing - TF-IDF for document clustering
- Evaluation - Methods for assessing cluster quality
- Complete workflows - End-to-end unsupervised learning pipelines
- Best practices - Guidelines for successful unsupervised learning
3.13 Next Steps
- Explore other Smile clustering algorithms (X-Means, G-Means)
- Try t-SNE or UMAP for non-linear dimensionality reduction
- Combine unsupervised and supervised learning (semi-supervised)
- Use clustering for feature engineering in supervised tasks
- Apply to real-world problems: customer segmentation, anomaly detection, etc.
Tutorial complete! You now have a solid foundation for unsupervised machine learning with metamorph.ml.
source: notebooks/unsupervised-ml-intro.clj