This is part of the Scicloj Clojure Data Tutorials.

(comment
  (require '[scicloj.clay.v2.api :as clay])
  (clay/start!)
  (clay/make! {:source-path "notebooks/index.clj"
               :show false}))

The following code shows how to perform text classification on a Kaggle dataset and produce a submission file that is ready to be uploaded to Kaggle for scoring.

It makes use of the tidy text / TFIDF functionality present in metamorph.ml and the ability of the xgboost model to handle tidy text data as input.
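
The code on this page uses several namespace aliases (str, io, csv, text, tc, tcc, ds-mod, ml) whose require forms are not shown. A plausible set of requires is sketched here; the namespace names are assumptions based on the usual Scicloj libraries and should be checked against the notebook source.

```clojure
;; Assumed requires for the aliases used in this notebook.
;; The namespace names are a best guess and are not taken from this page.
(require '[clojure.string :as str]
         '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.ml.text :as text]
         '[tablecloth.api :as tc]
         '[tablecloth.column.api :as tcc]
         '[tech.v3.dataset.modelling :as ds-mod])
```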

First we need a function to tokenize a line of text. The simplest such function is:

(defn- tokenize-fn [text]
  (str/split text #" "))

#'index/tokenize-fn

It does not do any text normalization, which is usually needed in NLP tasks to obtain a model that generalizes well.
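
As an illustration of what such normalization could look like, here is a minimal sketch of an alternative tokenizer. The name normalizing-tokenize-fn and the specific cleaning rules (lower-casing, keeping only letters, digits, # and @) are assumptions made for this example; the rest of the tutorial does not use it.

```clojure
(require '[clojure.string :as str])

(defn normalizing-tokenize-fn
  "Hypothetical alternative to tokenize-fn: lower-cases the text, splits on
  runs of whitespace and strips every character that is not a letter, a
  digit, # or @. Tokens that end up empty are dropped."
  [text]
  (->> (str/split (str/lower-case text) #"\s+")
       (map #(str/replace % #"[^\p{L}\p{N}#@]" ""))
       (remove str/blank?)
       vec))

(normalizing-tokenize-fn "Forest FIRE near La Ronge, Sask.")
;; => ["forest" "fire" "near" "la" "ronge" "sask"]
```

With rules like these, "FIRE", "fire" and "fire." collapse into a single token, which shrinks the vocabulary at the cost of no longer representing the original text exactly.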

The following reads a file from disk line by line and converts it on the fly to the tidy text representation, in which each token is a row in a dataset.

line-parse-fn needs to split an input line into [text meta]; the text is then passed to tokenize-fn and split into tokens. In this data, the text is in the fourth field and the label in the fifth (zero-based indices 3 and 4). We ignore all other columns for now:

(defn- line-parse-fn [line]
  [(nth line 3)
   (Integer/parseInt (nth line 4))])

#'index/line-parse-fn
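
To make the [text meta] contract concrete, the following sketch applies line-parse-fn to a single parsed CSV row. The row contents are made up for illustration; only the positions of the text and the label matter.

```clojure
;; Same definition as above, repeated so that the snippet runs on its own.
(defn- line-parse-fn [line]
  [(nth line 3)
   (Integer/parseInt (nth line 4))])

;; Hypothetical row, as a CSV reader would return it: a vector of strings.
(def example-row ["42" "" "" "Forest fire near the lake" "1"])

(line-parse-fn example-row)
;; => ["Forest fire near the lake" 1]
```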

This triggers the parsing and produces a seq of “long” datasets (just one for our small text) and the vocabulary obtained during parsing.

(def tidy-train
  (text/->tidy-text (csv/read-csv (io/reader "train.csv"))
                    seq
                    line-parse-fn
                    tokenize-fn
                    :skip-lines 1))

(def tidy-train-ds 
  (-> tidy-train :datasets first))

The combination of the columns :document, :token-pos and :token-idx, together with the vocabulary table, is an exact representation of the text, unless we normalize it as part of the tokenize-fn.

:meta holds any other information of a row to be kept, usually the “label” in the case of training data.

tidy-train-ds

_unnamed [113650 4]:

:token-idx	:token-pos	:document	:meta
1	0	0	1
2	1	0	1
3	2	0	1
4	3	0	1
5	4	0	1
6	5	0	1
7	6	0	1
8	7	0	1
9	8	0	1
10	9	0	1
…	…	…	…
5529	2	7612	1
12372	3	7612	1
25359	4	7612	1
30	5	7612	1
2552	6	7612	1
44	7	7612	1
25361	8	7612	1
69	9	7612	1
11698	10	7612	1
3844	11	7612	1
32017	12	7612	1
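
The structure of this table can be reproduced on a toy input with plain Clojure. The function toy-tidy-text below is a conceptual sketch written for this explanation, not the metamorph.ml implementation: it assigns token indices in order of first appearance, starting at 0, and emits one row per token.

```clojure
(require '[clojure.string :as str])

(defn toy-tidy-text
  "Conceptual sketch only: turns a seq of [text meta] pairs into tidy rows
  (one per token) plus a token->index vocabulary map."
  [docs]
  (reduce
   (fn [acc [doc-idx [text meta-info]]]
     (reduce
      (fn [{:keys [vocab] :as acc} [pos token]]
        ;; give unseen tokens the next free index
        (let [vocab (if (contains? vocab token)
                      vocab
                      (assoc vocab token (count vocab)))]
          (-> acc
              (assoc :vocab vocab)
              (update :rows conj {:token-idx (vocab token)
                                  :token-pos pos
                                  :document  doc-idx
                                  :meta      meta-info}))))
      acc
      (map-indexed vector (str/split text #" "))))
   {:vocab {} :rows []}
   (map-indexed vector docs)))

(toy-tidy-text [["fire in town" 1] ["no fire" 0]])
```

For the two toy documents this yields five rows and a four-entry vocabulary; the second occurrence of "fire" reuses index 0 and only differs in :token-pos and :document.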

The lookup table allows us to convert from :token-idx to words and back if needed.

(def train--token-lookup-table (:token-lookup-table tidy-train))

(map str (take 20 train--token-lookup-table))

("attack.=>2828"
 "Ercjmnea:=>22642"
 "#failure.\n#annonymous=>23860"
 "concluded.=>27252"
 "criminal=>16677"
 "http://t.co/gzTolLl5WoÛ_=>21789"
 "FOR:=>31985"
 "http://t.co/qp6q8RS8ON=>30308"
 "#watch=>12433"
 "exercised=>27047"
 "http://t.co/zCKXtFc9PT=>7775"
 "HEAR=>7573"
 "@fa07af174a71408=>26138"
 "ended=>1477"
 "@Coach_Keith44=>27742"
 "http://t.co/Q0X7e84R4e=>26341"
 "http://t.co/dVONWIv3l1=>9946"
 "plummeting=>22166"
 "heated=>15565"
 "architect=>14412")

As we can see, the tokens are not cleaned or standardized at all. This also results in a large vocabulary, whose size is:

(count train--token-lookup-table)
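
The printed entries suggest that the lookup table maps each token to its index. Going in the other direction, from an index back to a token, needs the inverted mapping. The following sketch shows the idea on a small Clojure map with made-up values; whether the real table can be inverted in exactly this way depends on its concrete type, which is not shown here.

```clojure
(require '[clojure.set :as set])

;; Toy vocabulary standing in for the real lookup table (values are made up).
(def toy-token->idx {"fire" 1 "flood" 2 "storm" 3})

;; Invert once, then look up indices to recover the tokens.
(def toy-idx->token (set/map-invert toy-token->idx))

(map toy-idx->token [3 1 2])
;; => ("storm" "fire" "flood")
```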

Now we convert the text into a bag-of-words format, which loses any word order, and calculate a metric that is known to work well for text classification, the so-called TFIDF score.
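
To make the score concrete, here is a minimal sketch of a textbook TFIDF computation on tokenized toy documents. It assumes tf = (count of the token in the document) / (number of tokens in the document) and idf = ln(number of documents / number of documents containing the token). The exact weighting used by the library function may differ, so the numbers are illustrative only.

```clojure
(defn toy-tfidf
  "Textbook TFIDF for a seq of tokenized documents. Returns one map per
  (document, token) pair, mirroring the long format used in this tutorial."
  [docs]
  (let [n-docs   (count docs)
        ;; in how many documents does each token occur?
        doc-freq (frequencies (mapcat distinct docs))]
    (for [[doc-idx tokens] (map-indexed vector docs)
          [token cnt]      (frequencies tokens)]
      (let [tf  (/ (double cnt) (count tokens))
            idf (Math/log (/ (double n-docs) (doc-freq token)))]
        {:document    doc-idx
         :token       token
         :token-count cnt
         :tf          tf
         :tfidf       (* tf idf)}))))

(toy-tfidf [["fire" "fire" "town"] ["fire" "flood"]])
```

In this toy corpus "fire" occurs in every document, so its idf and therefore its TFIDF is 0, while "town" and "flood" each occur in only one document and receive a positive score.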

(def train-tfidf
  (text/->tfidf tidy-train-ds))

The resulting table conceptually represents three “sparse matrices”, where :document and :token-idx are the x,y coordinates and the matrix cell values are :token-count, term frequency (:tf) or TFIDF.

Rows that are not present (the large majority) are 0 values. A subset of machine learning algorithms can deal with sparse matrices without the need to convert them into dense matrices first, which is in most cases impossible due to the memory consumption. The train-tfidf dataset therefore represents 3 sparse matrices with dimensions

(tcc/reduce-max (:document train-tfidf))

times

(tcc/reduce-max (:token-idx train-tfidf))

times 3:

(* (tcc/reduce-max (:document train-tfidf))
   (tcc/reduce-max (:token-idx train-tfidf))
   3)

731140212

while only having shape:

(tc/shape train-tfidf)

[109209 6]

This is because most matrix elements are 0, as any text does “not contain” most words.
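
The relation between the long table and a matrix can be sketched on toy data. Each row is a (document, token-idx, value) coordinate, and every cell without a row is implicitly 0. The helper sparse->dense and the sample rows are hypothetical and exist only for this illustration; with the real data, such a conversion is exactly what we want to avoid.

```clojure
;; Three stored cells of a 3 x 4 matrix (made-up values).
(def sparse-rows
  [{:document 0 :token-idx 1 :tfidf 0.5}
   {:document 0 :token-idx 3 :tfidf 0.2}
   {:document 2 :token-idx 0 :tfidf 0.7}])

(defn sparse->dense
  "Expands coordinate rows into a dense vector-of-vectors matrix,
  filling every cell that has no row with 0.0."
  [rows n-docs n-tokens]
  (reduce (fn [m {:keys [document token-idx tfidf]}]
            (assoc-in m [document token-idx] tfidf))
          (vec (repeat n-docs (vec (repeat n-tokens 0.0))))
          rows))

(sparse->dense sparse-rows 3 4)
;; => [[0.0 0.5 0.0 0.2] [0.0 0.0 0.0 0.0] [0.7 0.0 0.0 0.0]]
```

Here 3 stored rows stand in for 12 cells. In the training data above, 109209 rows stand in for 7612 times 32017 cells per matrix, so well under 0.1 % of the cells are actually stored.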

As TFIDF (and its variants) is one of the most common numeric representations of text, sparse matrices and models supporting them are a prerequisite for NLP.

Only in the last few years have “dense text representations” based on “embeddings” become available; they are not discussed here.

Now we get the data ready for training.

(def train-ds
  (-> train-tfidf
      (tc/rename-columns {:meta :label})
      (tc/select-columns [:document :token-idx :tfidf :label]) ;; we only need those
      (ds-mod/set-inference-target [:label])))

train-ds

_unnamed [109209 4]:

:document	:token-idx	:tfidf	:label
6144	27206	0.32346299	0
6144	27205	0.32346299	0
6144	24	0.11381646	0
6144	437	0.11094396	0
6144	238	0.15503055	0
6144	2942	0.24820548	0
6144	27204	0.32346299	0
6144	14277	0.27329135	0
6144	26	0.05693115	0
6144	27203	0.32346299	0
…	…	…	…
4093	19883	0.21564199	1
4094	19415	0.52525669	1
4094	12016	0.56803268	1
4094	19884	0.77631116	1
4094	87	0.16245157	1
4094	74	0.49331650	1
4095	1982	0.54708558	1
4095	19886	0.77631116	1
4095	486	0.44566867	1
4095	19885	0.71610516	1
4095	4	0.11680283	1

(def n-sparse-columns (inc (tcc/reduce-max (train-ds :token-idx))))

The model used comes from the library scicloj.ml.xgboost, which wraps the well-known xgboost model so that it works with tidy text data.

We use the :tfidf column as the “feature”.

(require '[scicloj.ml.xgboost])

This registers the model under the key :xgboost/classification.

(def model
  (ml/train train-ds {:model-type :xgboost/classification
                      :sparse-column :tfidf
                      :seed 123
                      :num-class 2
                      :n-sparse-columns n-sparse-columns}))

Now we have a trained model, which we can use for prediction on the test data. This time we do parsing and tfidf in one go.

Important here:

We pass the vocabulary obtained before, to be sure that :token-idx maps to the same words in both datasets. “New tokens” that were not seen during training are not added to the vocabulary; they are mapped to a special token, “[UNKNOWN]”.
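
The effect of reusing the training vocabulary can be sketched with a plain map. The vocabulary contents, the index 0 for “[UNKNOWN]” and the helper tokens->indices are assumptions made for this example and say nothing about how the library stores the token internally.

```clojure
;; Toy training vocabulary; reserving index 0 for "[UNKNOWN]" is an
;; assumption of this sketch.
(def toy-vocab {"[UNKNOWN]" 0 "fire" 1 "flood" 2})

(defn tokens->indices
  "Maps each token to its training index. Tokens that were not seen
  during training all share the index of \"[UNKNOWN]\"."
  [vocab tokens]
  (let [unknown-idx (vocab "[UNKNOWN]")]
    (mapv #(get vocab % unknown-idx) tokens)))

(tokens->indices toy-vocab ["fire" "earthquake" "flood"])
;; => [1 0 2]
```

Because the vocabulary is fixed, the index space of the test data never grows beyond what the model saw during training.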

(def tfidf-test-ds
  (->
   (text/->tidy-text (csv/read-csv (io/reader "test.csv"))
                     seq
                     (fn [line]
                       [(nth line 3) {:id (first line)}])
                     tokenize-fn
                     :skip-lines 1
                     :new-token-behaviour :as-unknown
                     :token->index-map train--token-lookup-table)
   :datasets
   first
   text/->tfidf
   (tc/select-columns [:document :token-idx :tfidf :meta])
   ;; pull the :id that Kaggle expects out of the :meta map
   (tc/add-column
    :id (fn [df] (map :id (:meta df))))
   (tc/drop-columns [:meta])))

This gives the dataset which can be passed into the predict function of metamorph.ml:

tfidf-test-ds

_unnamed [39633 4]:

:document	:token-idx	:tfidf	:id
3072	0	0.01384913	10170
3072	15162	0.11533672	10170
3072	7	0.07246824	10170
3072	18598	0.15981558	10170
3072	56	0.03792239	10170
3072	2500	0.13470392	10170
3072	7796	0.18492721	10170
3072	598	0.09596953	10170
3072	16087	0.15323985	10170
3072	26	0.03646101	10170
…	…	…	…
2047	24	0.04779571	6886
2047	16157	0.21689257	6886
2047	1115	0.18646623	6886
2047	7836	0.17140527	6886
2047	214	0.07519423	6886
2047	22814	0.22947052	6886
2047	22812	0.22947052	6886
2047	22811	0.25097266	6886
2047	155	0.11138390	6886
2047	22726	0.20104621	6886
2047	22816	0.25097266	6886

(def prediction
  (ml/predict tfidf-test-ds model))

The raw predictions contain the “document” each prediction is about. We can use this to match the predictions with the input “ids” in order to produce the format required by Kaggle.

prediction

:_unnamed [3263 4]:

0	1	:label	:document
0.25788912	0.74211085	1.0	3072
0.62455165	0.37544832	0.0	1024
0.57005841	0.42994162	0.0	0
0.62455165	0.37544832	0.0	1
0.56063342	0.43936658	0.0	3073
0.63305563	0.36694437	0.0	2
0.48976091	0.51023906	1.0	3074
0.62455165	0.37544832	0.0	3
0.19340244	0.80659753	1.0	3075
0.22445914	0.77554083	1.0	4
…	…	…	…
0.49459830	0.50540173	1.0	2037
0.71280539	0.28719464	0.0	2038
0.82480311	0.17519693	0.0	2039
0.41419104	0.58580893	1.0	2040
0.67722660	0.32277343	0.0	2041
0.41419104	0.58580893	1.0	2042
0.21795417	0.78204578	1.0	2043
0.37909898	0.62090099	1.0	2044
0.33443969	0.66556036	1.0	2045
0.87839675	0.12160332	0.0	2046
0.42923918	0.57076079	1.0	2047

(->
 (tc/right-join prediction tfidf-test-ds :document)
 (tc/unique-by [:id :label])
 (tc/select-columns [:id :label])
 (tc/update-columns {:label (partial map int)})
 (tc/rename-columns {:label :target})
 (tc/write-csv! "submission.csv"))
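
The logic of this pipeline is easier to see on toy data: the predictions hold one row per document, the test dataset holds many rows per document (one per token), and the join followed by the de-duplication reduces this to one row per id. The following plain-Clojure sketch mirrors that idea with made-up values; it is an illustration, not a replacement for the tablecloth code above.

```clojure
;; Toy stand-ins: one prediction per document, several token rows per document.
(def toy-predictions [{:document 0 :label 1.0}
                      {:document 1 :label 0.0}])
(def toy-test-rows   [{:document 0 :id 10}
                      {:document 0 :id 10}
                      {:document 1 :id 11}])

;; Every row of a document carries the same :id, so a map is enough.
(def document->id
  (into {} (map (juxt :document :id)) toy-test-rows))

(def toy-submission
  (mapv (fn [{:keys [document label]}]
          {:id (document->id document) :target (int label)})
        toy-predictions))

toy-submission
;; => [{:id 10, :target 1} {:id 11, :target 0}]
```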

The produced CSV file can be uploaded to Kaggle for scoring.

(->>
 (io/reader "submission.csv")
 line-seq
 (take 10))

("id,target"
 "10170,1"
 "3358,0"
 "0,0"
 "2,0"
 "10178,0"
 "3,0"
 "10180,1"
 "9,0"
 "10181,1")

source: projects/ml/text-classification/notebooks/index.clj