The Clojure community has been very active on the data science front lately, but we felt we needed more organization and more open discussion about these themes. This is the Clojureverse thread that started it all. Here we try to collect and record the current state of things, and I would like to stress that this is owned by the community!
I really like how the Nim community is dealing with the same sorts of problems we're facing, so I'll try the same thing here to foster discussion. We might want to move these things into their own topics in the future, or onto other platforms, but that's not the point right now.
The structure of this:
Generic computation libraries. Here we should strive for the best: both GPU and CPU capability, multidimensional arrays, broadcasting, etc.
There are many libraries popping up at various levels of maturity. Some of them are:
I think we can all agree that this degree of spread is not good: each of these libraries represents time and resources that could have been spent on moving other parts of the stack forward. We should settle on one or two of them and move on.
Plotting is important for both analysis and presentation of results. Thanks to ClojureScript we may well have an edge over other languages here.
Here there are many libraries as well, **some** of them are:
In this area taste is really important, so it's more natural to see effort spread over different libraries. What we should do is work on what is already available and make the plotting experience seamless:
(bar my-data) ;=> nil
The result would be a bar chart with reasonable defaults.
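To make the idea a bit more concrete, here is a minimal sketch of what the data-building half of such a call could look like. It assumes Vega-Lite as the rendering target, which is my choice for illustration and not something this post settles on; `bar-spec` and its default keys `:category` and `:value` are hypothetical names. Unlike `(bar my-data)` above, it only returns the spec as plain data and leaves the actual rendering to whatever viewer is in use.

```clojure
;; Hypothetical helper: builds a Vega-Lite bar chart spec as plain Clojure data.
;; `rows` is a sequence of maps; `x` and `y` name the keys to plot.
;; The defaults (:category, :value) are assumptions made for this sketch.
(defn bar-spec
  ([rows] (bar-spec rows {}))
  ([rows {:keys [x y] :or {x :category y :value}}]
   {:data     {:values (vec rows)}
    :mark     "bar"
    :encoding {:x {:field (name x) :type "nominal"}
               :y {:field (name y) :type "quantitative"}}}))

;; Usage: reasonable defaults with no options, explicit keys when needed.
(def my-data [{:category "a" :value 3}
              {:category "b" :value 5}])

(bar-spec my-data)
(bar-spec [{:lang "clj" :users 10}] {:x :lang :y :users})
```

Keeping the spec as data means the same function could feed a notebook, a REPL viewer or a ClojureScript page, and a side-effecting `bar` would then be a thin wrapper around it.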
Deal with coordinates on a map.
Not much that I'm aware of:
Today's data scientists are used to working with tabular data, so we have to deal with it.
Not good: there are lots of stubs here and there, but nothing has ever caught on. Some examples:
Here I would move towards wrapping Arrow, which has the potential to become the standard in the near future, but anything that works is very welcome!
Very important as the base for ML systems and evaluation of models.
There are already many examples:
What is missing here is the tooling: we need more abstractions over basic functionality, for instance a function to get the ROC-AUC score for model validation.
Also better docs and examples of what is achievable with these libraries.
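As an illustration of the kind of small utility meant by the ROC-AUC example above, here is a sketch in plain Clojure with no dependencies. It is not taken from any of the existing libraries; the name `roc-auc` and the convention of `1` for positive and `0` for negative labels are assumptions of this sketch. It uses the rank interpretation of AUC: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half.

```clojure
(defn roc-auc
  "Area under the ROC curve for binary labels (1 = positive, 0 = negative)
  and real-valued scores. Compares every positive with every negative, so it
  is O(n_pos * n_neg): fine for a sketch, too slow for large datasets.
  Returns nil when either class is absent, since AUC is undefined then."
  [labels scores]
  (let [pairs (map vector labels scores)
        pos   (for [[l s] pairs :when (= 1 l)] s)
        neg   (for [[l s] pairs :when (= 0 l)] s)]
    (when (and (seq pos) (seq neg))
      (/ (reduce + (for [p pos, n neg]
                     (cond (> p n)  1.0   ; positive ranked above negative
                           (== p n) 0.5   ; tie counts as half
                           :else    0.0)))
         (* (count pos) (count neg))))))

;; Usage: three of the four positive/negative pairs are ordered correctly.
(roc-auc [0 0 1 1] [0.1 0.4 0.35 0.8]) ;=> 0.75
```

A real implementation would sort the scores once instead of comparing all pairs, and would sit next to the other metrics behind a consistent naming scheme.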
General modeling: the aim should be to have something simple, usable and reliable, with a consistent interface.
Something is moving lately in this area:
As stated earlier, whether we pursue the R model (many small libraries) or the scikit-learn way (one big framework with batteries included), the important thing is to have a common interface to algorithms and utilities. This would be the opposite of what happens in the R world.
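One way a common interface could look in Clojure is a small protocol that every algorithm implements, regardless of which library it lives in. The sketch below is only meant to start the discussion: the protocol name `Model`, the `fit`/`predict` pair (borrowed from scikit-learn's vocabulary) and the toy `MeanRegressor` are all hypothetical and do not describe any existing library.

```clojure
;; Hypothetical shared interface: any algorithm that implements it can be
;; trained and queried in the same way.
(defprotocol Model
  (fit [model xs ys] "Returns a new, trained model; the input model is not mutated.")
  (predict [model xs] "Returns one prediction per row of xs."))

;; A deliberately trivial model: always predicts the mean of the training targets.
(defrecord MeanRegressor [mean]
  Model
  (fit [model _xs ys]
    (assoc model :mean (/ (reduce + ys) (double (count ys)))))
  (predict [_model xs]
    (repeat (count xs) mean)))

;; Usage: the same two calls would work for any other Model implementation.
(-> (->MeanRegressor nil)
    (fit [[1] [2] [3]] [10 20 30])
    (predict [[4] [5]]))
;=> (20.0 20.0)
```

With something like this in place, utilities such as cross-validation or metrics could be written once against the protocol, whether the algorithms ship in many small libraries or in one big framework.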
Important for computer vision, NLP and other problems.
We're pretty much covered, especially thanks to Carin Meier's work. What can really be improved are the docs, examples and tutorials.
Just build on what's already there.
Disclaimer: None of the lists are to be considered complete; they are just some examples. Of course these are my opinions, but everything is amendable by the community, and I would really love to have a productive discussion about these topics. If you think something is missing, wrong, misplaced or anything else, just let the community know! Yeah, I know about Incanter. I didn't mention it on purpose, but if someone thinks that it is current and useful we can surely discuss it :smile: