Launching Columns for Tablecloth

Posted April 13, 2024 by Ethan Miller ‐ 4 min read

This is a cross-post of a recent post by Ethan Miller at his blog, announcing the release of a substantial addition to the Tablecloth library.


The New Column API

Today we at scicloj deployed a new Column API (tablecloth.column.api) into the data processing library Tablecloth (available as of version 7.029.1). This new API adds a new primitive to the Tablecloth system: the column. Here’s how we use it:

(require '[tablecloth.column.api :as tcc])

(tcc/column [1 2 3 4 5])
;; => #tech.v3.dataset.column<int64>[5]
null
[1, 2, 3, 4, 5]

The new column is the same as the columns that comprise a dataset. It is a one-dimensional typed sequence of values. Underneath the hood, the column is just the column defined in tech.ml.dataset, the library that backs Tablecloth.

The difference is that now when you are using Tablecloth you have the option of interacting directly with a column using an API that provides a set of operations that always take and return a column.

Basic Usage

Let’s go through a simple example. Let’s say we have some test scores that we need to analyze:

(def test-scores (tcc/column [85 92 78 88 95 83 80 90]))

test-scores
;; => #tech.v3.dataset.column<int64>[8]
null
[85, 92, 78, 88, 95, 83, 80, 90]

Now that we have these values in a column, we can easily perform operations on them:

(tcc/mean test-scores)
;; => 86.375

(tcc/standard-deviation test-scores)
;; => 5.926634795564849

There are a many operations that one can perform. At the moment, the available operations are those that you would have previously accessed by importing the tech.v3.datatype.functional namespace from dtype-next.

To get a fuller picture of the Column API and how it works, please consult the Column API section in the Tablecloth documentation.

Easier Column Operations on the Dataset

The changes we’ve deployed also improve the expressive power of Tablecloth’s standard Dataset API. Previously, if you needed to do something simple like a group by and aggregation on a column in a dataset, the code could become unnecessarily verbose:

(defonce stocks
  (tc/dataset "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv" {:key-fn keyword}))


(tc/column-names stocks)

(-> stocks
    (tc/group-by [:symbol])
    (tc/aggregate (fn [ds]
                    (-> ds
                        :price
                        tech.v3.datatype.functional/mean))))
;; => _unnamed [5 2]:

| :symbol |      summary |
|---------|-------------:|
|    MSFT |  24.73674797 |
|    AMZN |  47.98707317 |
|     IBM |  91.26121951 |
|    GOOG | 415.87044118 |
|    AAPL |  64.73048780 |

With the new column operations within for datasets, you can now simply write:

(-> stocks
    (tc/group-by [:symbol])
    (tc/mean [:price]))

The same set operations available to be run on the column can be called on columns in the datasest. However, when operating a dataset, functions that would return a scalar value act as aggregator functions, as seen above.

Functions that would return a new column allow the user to specify a target column to be added to the dataset, as in this example where we first use the method above to add a column with the mean back to stocks:

(def stocks-with-mean
  (-> stocks
      (tc/group-by [:symbol])
      (tc/mean [:price])
      (tc/rename-columns {"summary" :mean-price})
      (tc/inner-join stocks :symbol)))


stocks-with-mean
;; => inner-join [560 4]:
;;    | :symbol | :mean-price |      :date | :price |
;;    |---------|------------:|------------|-------:|
;;    |    MSFT | 24.73674797 | 2000-01-01 |  39.81 |
;;    |    MSFT | 24.73674797 | 2000-02-01 |  36.35 |
;;    |    MSFT | 24.73674797 | 2000-03-01 |  43.22 |
;;    |    MSFT | 24.73674797 | 2000-04-01 |  28.37 |

Then we use a dataset column operation that returns a column – column division, in this case – to add a new column holding the relative daily price of the stock:

(-> stocks-with-mean
    (tc// :relative-daily-price [:price :mean-price]))
;; => inner-join [560 5]:
;;    | :symbol | :mean-price |      :date | :price | :relative-daily-price |
;;    |---------|------------:|------------|-------:|----------------------:|
;;    |    MSFT | 24.73674797 | 2000-01-01 |  39.81 |            1.60934655 |
;;    |    MSFT | 24.73674797 | 2000-02-01 |  36.35 |            1.46947368 |
;;    |    MSFT | 24.73674797 | 2000-03-01 |  43.22 |            1.74719814 |
;;    |    MSFT | 24.73674797 | 2000-04-01 |  28.37 |            1.14687670 |

For more information, on these operations, please consult the documentation here.

Thanks to Clojurist Together

This contribution to Tablecloth was supported by Clojurists Together through their Quarterly Fellowships for open source development.