{on hold} ds4clj: a data science course for Clojure devs

Following broad support on Clojureverse and Reddit conversations and elsewhere, we have been planning a data-science course for Clojure devs during 2023.

For various reasons of community priorities, this project was delayed. As of May 2024, we are reconsidering a revised version, which is probably a semi-structured series of talks alongside the more hands-on real-world-data group.

The content below is the old 2022-2023 draft.

exploration meetings preparing for the course

2022-07-24, initial brainstorming - event
2022-08-06, prep meeting about R & Tidyverse - event
2022-09-10, NLP study session 1 - summary & video
2022-10-30, NLP study session 2: Predict real vs. fake disaster tweets with DVC, Clojure and Python - summary & video
2022-11-05, an intro to table processing with Tablecloth - a session of the data-recur group - summary & video

goals

provide Clojure devs with basic theory, practices, and tools for common data science tasks;
- also welcome open-minded people who are new to Clojure;
create learning resources for future use;
encourage Clojurians to become active contributors to the emerging stack.

requirements

at least one of the following:
- basic knowlede of Clojure
- a very open mind towards new programming languages
being ready to put a few hours a week into learning and practice, for a few months

chat

We’ll use a dedicated stream at the Clojurians Zulip chat.
Stream: ds4clj
It is recommended to follow a few of the other chat streams.

time

We’ll have a class once a month.
We may have practice meetings in between.
Each class is 3 hours, containing a lecture, a short break, and then discussion and Q&A.

recording

The meetings will be recorded and shared at the Clojurians Zulip chat.
Some parts of the recordings (e.g. the lecture) will be shared publicly.

homework

Homework will be composed of:

exercises on class topics
personal projects (as individuals or in small groups) – for example:
- exploring datasets
- reproducing previously publushed research
- contributing to the stack of relevant libraries
- contributing to documentation

recommended resources

📖 [Clj4BT] Clojure for the Brave and True by Daniel Higginbotham

This is a great intro to Clojure.

It is useful for those who need a refresh with the language, or are new to it.

Note: Chapter 2 suggests a specific development environment using Emacs (and is also a bit outdated). Emacs is wonderful, but it is not required for learning Clojure. Please reach out for help you wish to learn the book with another environment.

📖 [R4DS] R for Data Science by Wickham and Grolemund

This is a good intro to the R language and its use in basic data-science tasks. It uses the Tidyverse collection of R packages and the so-called “tidy” approach, which is common in today’s R community.

We will use parts of it a basic intro to R. Knowing some R would make participants more independent in approaching study resources on their own. Python could have been another option, but we prefer R, since its ecosystem is arguably more in harmony with the functional approach and with expressing statistical ideas.

📖 [Clj4DS] [Clojure for Data Science](https://www.packtpub.com/product/clojure-for-data-science/9781784397180** by Henry Garner

This is an excellent intro to data science topics, but it uses Clojure libraries which are not actively developed anymore.

It will be used for a few of the case studies, that we will adapt to this course.

list of topics (tentative)

language

(mostly self learning)

Clojure
R

from today’s brainstorming:

theory & methods

(very basic intros)

hello world: a typical workflow
reshaping data (the “tidy” notion)
correctness: testing, reproducibility
descriptive statistics
frequentist statistical inference
supervised learning: principles & workflow, regression, classification
working with tree-structured data
probabilistic modelling through Bayesian statistics
unsuprevised learning: clustering, dimension reduction
linear algebra
deep networks
nlp
async data streams
graph data

libraries & tools

(some introduced briefly, some more thoroughly)

tables: tablecloth, tech.ml.dataset
arrays: dtype-next
transducers: xforms, injest
correctness (schemas): malli
data vis & notebook tooling: portal, oz, clerk, clay
data vis grammars: hanami, cljplot?
math stats: fastmath, kixi.stats
machine learning algorithms & pipelines: scicloj.ml
interoperation w/ other languages: libpython-clj, clojisr
Bayesian statistics: inferme, clj-stan
tree data: clojure.walk, core.match, specter, meander, tupelo
data ingestion: jdbc-next?, some web scraping, arrow?
linear algebra, deep learning: neanderthal, deep-diamond
parallel computing: geni, clojask
nlp: datalinguist?, spacy through interop
graph data: loom, asami, neo4j?

course plan (very tentative)

month	topic	libraries	homework
	core topics

1	common workflow	tablecloth, fastmath, hanami,	learn some R and Clojure
		scicloj.ml

2	descriptive stats,	fastmath, kixi.stats	apply to a real-world dataset
	frequentist stats

3	data visualization	hanami, cljplot?	apply to a real-world dataset

4	arrays & tables	dtype-next, tech.ml.dataset,	apply to a real-world dataset,
		tablecloth	run some speed comparisons

5	basic supervised learning workflow,	scicloj.ml, malli	reproduce some kaggle notebooks
	reproducibility, tests

6	probability, Bayesian stats	inferme, clj-stan	reading in Statistical Rethinking,
			reproducing some examples

7	advanced supervised learning workflows	scicloj.ml	reproduce some kaggle notebooks,
			explore variations & improvements

8	unsupervised learning	fastmath, scicloj.ml	reproduce some kaggle notebooks,
			explore variations & improvements

9	python and R interop	libpython-clj, clojisr	go through some tutorials by
			interop

	specialized topics

10	working with tree-sructured data,	clojure.walk, specter, meander,	scrape & analyse some
	web scraping	hickory	unstructured data

11	linear algebra, basic image processing	neanderthal	process some images

12	deep learning	deep diamond	reproduce some tutorials

13	async data streams	xforms, kixi.stats	analyse some user events

14	natural language processing	datalinguist,	analyse some texts,
		spaCy through interop	write rules to capture intents

15	graph data	loom, asami, neo4j?	analyse some graph data

16	big datasets	geni, clojask, tech.ml.dataset	query and analyse a big dataset,
			run some speed comparisons

{on hold} ds4clj: a data science course for Clojure devs

exploration meetings preparing for the course #

goals #

requirements #

chat #

time #

recording #

homework #

recommended resources #

list of topics (tentative) #

language #

theory & methods #

libraries & tools #

course plan (very tentative) #