ds4clj: a data science course for Clojure devs


  • Following broad support in recent Clojureverse and Reddit conversations and elsewhere, we will begin a data science course for Clojure devs in ⭐ Fall 2022 ⭐.

The details below are tentative and will probably change.

meetings

  • 2022-07-24, initial brainstorming - event
  • 2022-08-06, prep meeting about R & Tidyverse - event

goals

  • provide Clojure devs with basic theory, practices, and tools for common data science tasks;
    • also welcome open-minded people who are new to Clojure;
  • create learning resources for future use;
  • encourage Clojurians to become active contributors to the emerging stack.

requirements

  • at least one of the following:

    • basic knowledge of Clojure
    • a very open mind towards new programming languages
  • being ready to put a few hours a week into learning and practice, for a few months

chat

time

  • We’ll have a class once a month.
  • We may have practice meetings in between.
  • Each class is 3 hours long and consists of a lecture, a short break, and then discussion and Q&A.

recording

  • The meetings will be recorded and shared on the Clojurians Zulip chat.
  • Some parts of the recordings (e.g. the lecture) will be shared publicly.

homework

Homework will be composed of:

  • exercises on class topics
  • personal projects (as individuals or in small groups) – for example:
    • exploring datasets
    • reproducing previously published research
    • contributing to the stack of relevant libraries
    • contributing to documentation

📖 [Clj4BT] Clojure for the Brave and True by Daniel Higginbotham

This is a great intro to Clojure.

It is useful for those who need a refresher on the language, or are new to it.

Note: Chapter 2 suggests a specific development environment using Emacs (and is also a bit outdated). Emacs is wonderful, but it is not required for learning Clojure. Please reach out for help if you wish to work through the book with another environment.

📖 [R4DS] R for Data Science by Wickham and Grolemund

This is a good intro to the R language and its use in basic data-science tasks. It uses the Tidyverse collection of R packages and the so-called “tidy” approach, which is common in today’s R community.

We will use parts of it as a basic intro to R. Knowing some R will make participants more independent in approaching study resources on their own. Python could have been another option, but we prefer R, since its ecosystem is arguably more in harmony with the functional approach and with expressing statistical ideas.

📖 [Clj4DS] [Clojure for Data Science](https://www.packtpub.com/product/clojure-for-data-science/9781784397180) by Henry Garner

This is an excellent intro to data science topics, but it uses Clojure libraries that are no longer actively developed.

We will adapt a few of its case studies to this course.

list of topics (tentative)

language

(mostly self learning)

  • Clojure
  • R

from today’s brainstorming:

theory & methods

(very basic intros)

  • hello world: a typical workflow
  • reshaping data (the “tidy” notion)
  • correctness: testing, reproducibility
  • descriptive statistics
  • frequentist statistical inference
  • supervised learning: principles & workflow, regression, classification
  • working with tree-structured data
  • probabilistic modelling through Bayesian statistics
  • unsupervised learning: clustering, dimension reduction
  • linear algebra
  • deep networks
  • nlp
  • async data streams
  • graph data

libraries & tools

(some introduced briefly, some more thoroughly)

  • tables: tablecloth, tech.ml.dataset
  • arrays: dtype-next
  • transducers: xforms, injest
  • correctness (schemas): malli
  • data vis & notebook tooling: portal, oz, clerk, clay
  • data vis grammars: hanami, cljplot?
  • math stats: fastmath, kixi.stats
  • machine learning algorithms & pipelines: scicloj.ml
  • interoperation w/ other languages: libpython-clj, clojisr
  • Bayesian statistics: inferme, clj-stan
  • tree data: clojure.walk, core.match, specter, meander, tupelo
  • data ingestion: next.jdbc?, some web scraping, arrow?
  • linear algebra, deep learning: neanderthal, deep-diamond
  • parallel computing: geni, clojask
  • nlp: datalinguist?, spaCy through interop
  • graph data: loom, asami, neo4j?

course plan (very tentative)

core topics

| month | topic | libraries | homework |
|---|---|---|---|
| 1 | common workflow | tablecloth, fastmath, hanami, scicloj.ml | learn some R and Clojure |
| 2 | descriptive stats, frequentist stats | fastmath, kixi.stats | apply to a real-world dataset |
| 3 | data visualization | hanami, cljplot? | apply to a real-world dataset |
| 4 | arrays & tables | dtype-next, tech.ml.dataset, tablecloth | apply to a real-world dataset, run some speed comparisons |
| 5 | basic supervised learning workflow, reproducibility, tests | scicloj.ml, malli | reproduce some Kaggle notebooks |
| 6 | probability, Bayesian stats | inferme, clj-stan | reading in Statistical Rethinking, reproducing some examples |
| 7 | advanced supervised learning workflows | scicloj.ml | reproduce some Kaggle notebooks, explore variations & improvements |
| 8 | unsupervised learning | fastmath, scicloj.ml | reproduce some Kaggle notebooks, explore variations & improvements |
| 9 | Python and R interop | libpython-clj, clojisr | go through some tutorials via interop |

specialized topics

| month | topic | libraries | homework |
|---|---|---|---|
| 10 | working with tree-structured data, web scraping | clojure.walk, specter, meander, hickory | scrape & analyse some unstructured data |
| 11 | linear algebra, basic image processing | neanderthal | process some images |
| 12 | deep learning | deep-diamond | reproduce some tutorials |
| 13 | async data streams | xforms, kixi.stats | analyse some user events |
| 14 | natural language processing | datalinguist, spaCy through interop | analyse some texts, write rules to capture intents |
| 15 | graph data | loom, asami, neo4j? | analyse some graph data |
| 16 | big datasets | geni, clojask, tech.ml.dataset | query and analyse a big dataset, run some speed comparisons |