The Clojure community has been very active lately on the data science front, but we felt we needed more organization and more open discussion about these topics. This is the Clojureverse thread that started it all. Here we try to collect and record the current state of things, and I would like to stress that this is owned by the community!
The structure of this:
Generic computation libraries. Here we should strive for the best: both GPU and CPU capability, multidimensional arrays, broadcasting, etc.
Many libraries are popping up at various levels of maturity; some of them are:
We probably don't need more libraries in this realm. What would be great next is:
Plotting is important for both analysis and presentation of results. Thanks to ClojureScript we may well have an edge over other languages here.
There are many libraries here as well; some of them are:
There's a lot of active development in this realm. What would be helpful:
Deal with coordinates on a map.
There are a few libraries in this realm, mostly dated:
This is another area where Clojure could shine thanks to its concurrency model. The fact that it would be easy to work with Spark or Onyx is certainly a plus if you have big data, while for smaller workloads parallel Clojure might be enough to speed up pipelines considerably.
Today's data scientists are used to working with tabular data, so we have to support it.
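As a rough illustration of what "parallel Clojure" can mean for a small pipeline, here is a minimal sketch using only `clojure.core/pmap`. The `slow-clean` function and the `records` data are made up for the example; any reasonably expensive, side-effect-free step could take their place.

```clojure
(require '[clojure.string :as str])

(defn slow-clean
  "Hypothetical per-record step: trims and upper-cases a string.
  The sleep stands in for real work (parsing, feature extraction, ...)."
  [s]
  (Thread/sleep 10)
  (str/upper-case (str/trim s)))

;; Made-up input data for the sketch.
(def records ["  alpha " "beta" " gamma  " "delta "])

;; Sequential version: one record at a time.
(def sequential (mapv slow-clean records))

;; Parallel version: pmap runs the function on several threads
;; and still returns the results in input order.
(def parallel (vec (pmap slow-clean records)))
```

Both versions produce the same vector; `pmap` only pays off when each step costs noticeably more than the coordination overhead, so it is worth measuring before switching.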
The picture has improved lately, but there still isn't consensus.
Graphs can solve many problems elegantly and efficiently; most of the time a well-designed and well-built graph can replace much more complex solutions.
The state of things is pretty good, which makes sense considering the native Clojure data structures and the nature of graphs.
Graphs are mostly a solved problem, but they have only recently started to be used extensively, and there is still a lot of room for improvement in distributing graphs.
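To make the point about native data structures concrete, here is a minimal sketch that represents a directed graph as a plain map from node to a set of neighbours and walks it breadth-first. The `graph` data and the `reachable` function are hypothetical names used only for this example; they are not taken from any of the graph libraries.

```clojure
;; A small directed graph as a map of node -> set of neighbours.
(def graph
  {:a #{:b :c}
   :b #{:d}
   :c #{:d}
   :d #{}})

(defn reachable
  "Returns the set of nodes reachable from start (start included),
  visiting nodes breadth-first."
  [g start]
  (loop [visited #{start}
         queue   (conj clojure.lang.PersistentQueue/EMPTY start)]
    (if (empty? queue)
      visited
      (let [node   (peek queue)
            ;; Neighbours we have not seen yet; a set works as a predicate.
            unseen (remove visited (get g node #{}))]
        (recur (into visited unseen)
               (into (pop queue) unseen))))))

(reachable graph :a) ;; => #{:a :b :c :d}
```

Because the graph is just a map, the usual core functions (`assoc`, `update`, `merge-with`) are enough to build and modify it.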
Very important as the base for ML systems, simulations and data analytics.
There are already many examples:
The main building blocks are all here; what we are missing is:
General modeling. The aim should be to have something simple, usable, and reliable, with a consistent interface.
There has been some movement in this area lately:
We can still decide whether we want to pursue the R model (many small libraries) or the scikit-learn way (one big framework with batteries included); the important thing is to have a common interface to algorithms and utilities.
Such an interface would be the opposite of what happens in the R world, where developers and researchers are freer to deliver their ideas (R is usually the first language to get implementations of new algorithms), but at the same time the cognitive overhead for users is pretty high.
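One way such a common interface could look is a small protocol that every algorithm implements. The sketch below is purely illustrative: the `Model` protocol, its `fit` and `predict` functions, and the `MeanRegressor` record are hypothetical names, not an existing library API or a community decision.

```clojure
(defprotocol Model
  (fit [model xs ys] "Returns a new model trained on inputs xs and targets ys.")
  (predict [model xs] "Returns one prediction per input in xs."))

;; A deliberately trivial algorithm: always predicts the mean of the targets.
(defrecord MeanRegressor [mean]
  Model
  (fit [model _xs ys]
    (assoc model :mean (/ (reduce + ys) (count ys))))
  (predict [_model xs]
    (mapv (constantly mean) xs)))

;; Any other algorithm implementing Model could be swapped in here.
(def trained (fit (->MeanRegressor nil) [[1] [2] [3]] [2.0 4.0 6.0]))

(predict trained [[10] [20]]) ;; => [4.0 4.0]
```

With a shared protocol, it matters much less whether algorithms live in many small libraries or in one big framework, since user code only depends on `fit` and `predict`.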
Natural Language Processing is at the bleeding edge right now, but Clojure is lagging behind.
There are two main libraries dealing with these things at the moment, and one is currently looking for maintainers:
It might very well be that all we need is a couple of very thorough and dedicated libraries, but we're not there yet.
clojurenlp is currently looking for maintainers; get in touch with them if you're interested.
Before doing anything with CNNs you have to read, process and transform images. The state of things here is much better than for many of the other sections!
We're basically ready to do anything we want with images!
Important for computer vision, NLP and other problems.
We're pretty much covered, especially thanks to Carin Meier's work. What could really be improved are docs, examples and tutorials.
Disclaimer: None of the lists should be considered complete; they are just some examples. Everything can be amended by the community: if you think something is missing, wrong, misplaced or anything else, just let the community know!