Predict real vs. fake disaster tweets
Posted September 10, 2022 by behrica ‐ 3 min read
Predict real vs. fake disaster tweets with dvc, Clojure and python
Any a little bit more serious machine learning, like participating in a Kaggle competition, requires in my view a form of ML experiment tracking.
As ML requires a lot of different trying of code, models and hyper-parameters, we need to have a tools which keeps track of this.
These type of tools can be programming language independent, as they see the code (whatever code) such as one of the assets to track (among hyper-parameters, performance metrics and data)
One such open source tool is DVC - Data Version Control
In my view this tools is very relevant for ML in Clojure, due to its main characteristics:
A DVC pipeline is based on steps , where each step is a shell script (so can be written in Clojure as well)
It therefore allows a different form of interop
, in which certain steps
are written in Python / R (the modeling step for example) while
preprocessing is done in Clojure.
This makes a lot of sense, as the pure modeling is often a few lines only
and is tricky to use via interop (due to multithreading, GPU usage, long
running). We can reduce the python code to the concrete single-line
call to “train(…)” if we want.
The different steps can then communicate via data files on disk. (arrow is a good format for this, as supported by Clojure via tablecloth/dataset, python and R.
DVC takes care of these inter-step-dependencies, and only reruns what has changed. The details of this are in the DVC documentation.
In the following DVC pipeline I have 3 steps, 2 in Clojure, one in Python.
- preprocess: Done in pure Clojure using tablecloth
- train: Done in pure python with simpletransformers, reading the data files produced in preprocess and train the model
- predict_kaggle: Done in Clojure, but using
simpletransformers
python library via libpython-clj
stages:
preprocess:
cmd: clj preprocess.clj
deps:
- preprocess.clj
- train.csv
- test.csv
outs:
- train.arrow
- test.arrow
train:
cmd: python train.py
deps:
- train.py
- train.arrow
- test.arrow
outs:
- outputs
metrics:
- eval.json
params:
- train.num_train_epochs
- train.model_type
- train.model_name
- train.train_batch_size
predict_kaggle:
cmd: clj predict_kaggle.clj
deps:
- outputs
- predict_kaggle.clj
outs:
- kaggle_submission.csv
The reason why I did step 2 in pure python is stability.
I did not want to use libpython-clj
for eventually very long running training runs.
It has some issues with simpletransformers based on PyTorch
Using a pre-trained roberta model from the huggingface hub, and a bit of preprocessing gave me a score of 0.82 and position 119 in the leader board.
So even though that Kaggle does not support Clojure (in the web notebook interface), we can use Clojure to participate in Kaggle competitions.
We could of course implement the “train” step in Clojure as well, in case we want to use a model from:
- Clojure (using scicloj.ml for example)
- Java (via Clojure interop)
- python (using libpython-clj)
- R (using clojisr)
I suggest to use dvc and a multi-step pipeline even in a pure JVM (Clojure + Java) setup. It nicely structures the work and is able to track all assets.
References: