(notespace)

Fri Jul 16 14:43:23 IDT 2021


Operating on Data in Dataset

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in Computation on NumPy Arrays: Universal Functions are key to this.

Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc. This means that keeping the context of data and combining data from different sources (both potentially error-prone tasks with raw NumPy arrays) become essentially foolproof with Pandas. We will additionally see that there are well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures.

(require
  '[tech.v3.dataset :as ds]
  '[tech.v3.datatype :as dtype]
  '[tech.v3.datatype.functional :as dfn]
  '[tablecloth.api :as tablecloth]
  '[fastmath.random :as fm.rand])

(def DS
  (tablecloth/dataset
   (zipmap [:A :B :C :D]
           (repeatedly 4 (fn [] (repeatedly 3 #(fm.rand/frand 10)))))))

^kind/dataset
DS

_unnamed [3 4]:

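|         :A |         :B |         :C |         :D |
|-----------:|-----------:|-----------:|-----------:|
| 3.09400129 | 4.33038664 | 8.30968666 | 4.57290792 |
| 7.32699347 | 8.63224316 | 0.67728257 | 7.74195385 |
| 7.00601959 | 7.86705065 | 8.69164276 | 9.59445572 |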

If we apply an element-wise function (the counterpart of a NumPy ufunc) to this dataset with ds/update-elemwise, the result is another dataset with the column names preserved:

^kind/dataset
(ds/update-elemwise DS dfn/exp)

_unnamed [3 4]:

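|            :A |            :B |            :C |             :D |
|--------------:|--------------:|--------------:|---------------:|
|   22.06519087 |   75.97365524 | 4063.03967449 |    96.82526136 |
| 1520.80254620 | 5609.64746699 |    1.96852114 |  2302.96764301 |
| 1103.25435138 | 2609.85683869 | 5952.95349580 | 14683.14797648 |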

Or, for a slightly more complex calculation:

^kind/dataset
(ds/update-elemwise DS #(dfn// (dfn/* % Math/PI)))

_unnamed [3 4]:

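|         :A |         :B |         :C |         :D |
|-----------:|-----------:|-----------:|-----------:|
| 0.10287969 | 0.07350611 | 0.03830588 | 0.06960776 |
| 0.04344345 | 0.03687453 | 0.46998092 | 0.04111493 |
| 0.04543377 | 0.04046115 | 0.03662252 | 0.03317644 |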

Any of the ufuncs discussed in Computation on NumPy Arrays: Universal Functions can be used in a similar manner.

UFuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

Index alignment in Dataset

A similar type of alignment takes place for both columns and indices when performing operations on DataFrames:

(def A
  (tablecloth/dataset
   (zipmap [:A :B]
           (repeatedly 2 (fn [] (repeatedly 2 #(fm.rand/irand 20)))))))

^kind/dataset
A

_unnamed [2 2]:

| :A | :B |
|---:|---:|
|  6 | 15 |
|  8 | 13 |

(def B
  (tablecloth/dataset
   (zipmap [:B :A :C]
           (repeatedly 3 (fn [] (repeatedly 3 #(fm.rand/irand 20)))))))

^kind/dataset
B

_unnamed [3 3]:

| :B | :A | :C |
|---:|---:|---:|
|  5 |  2 |  1 |
| 11 | 19 | 10 |
| 19 | 17 |  7 |
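
The alignment rule described in this section can be sketched without any dataset library. The block below is a minimal illustration in plain Clojure, not the tech.v3.dataset or tablecloth API. It assumes a hypothetical representation in which each table is a map from [row column] to value, and the names a-cells, b-cells, and aligned-add are made up for this sketch. A cell present in only one table comes out as nil, which plays the role of NaN here.

```clojure
;; Hypothetical representation: [row-index column-name] -> value.
;; The values are copied from datasets A and B printed above.
(def a-cells
  {[0 :A] 6, [0 :B] 15,
   [1 :A] 8, [1 :B] 13})

(def b-cells
  {[0 :B] 5,  [0 :A] 2,  [0 :C] 1,
   [1 :B] 11, [1 :A] 19, [1 :C] 10,
   [2 :B] 19, [2 :A] 17, [2 :C] 7})

(defn aligned-add
  "Adds two cell maps over the union of their keys.
  A cell missing from either side yields nil (standing in for NaN)."
  [x y]
  (into (sorted-map)
        (for [k (into (set (keys x)) (keys y))]
          [k (when (and (contains? x k) (contains? y k))
               (+ (x k) (y k)))])))

(def a+b (aligned-add a-cells b-cells))

(a+b [0 :A]) ; => 8   (6 + 2)
(a+b [1 :B]) ; => 24  (13 + 11)
(a+b [0 :C]) ; => nil (column :C exists only in B)
(a+b [2 :A]) ; => nil (row 2 exists only in B)
```

Because each key carries both the row position and the column name, the order in which B lists its columns (:B :A :C) has no effect on the result.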

Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we'll fill with the mean of all values in A (computed by first stacking the rows of A):

fill = A.stack().mean()
A.add(B, fill_value=fill)
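
For readers following along in Clojure, the fill_value behaviour of the Pandas snippet above can be approximated in plain Clojure. This is only a sketch, not the tech.v3.dataset or tablecloth API: it assumes a hypothetical map from [row column] to value for each table, and the names a-cells, b-cells, fill, and add-with-fill are made up for illustration. It covers only cells that exist in at least one table, whereas Pandas additionally leaves a cell that is missing from both inputs as NaN.

```clojure
;; Hypothetical representation: [row-index column-name] -> value.
;; The values are copied from datasets A and B printed above.
(def a-cells
  {[0 :A] 6, [0 :B] 15,
   [1 :A] 8, [1 :B] 13})

(def b-cells
  {[0 :B] 5,  [0 :A] 2,  [0 :C] 1,
   [1 :B] 11, [1 :A] 19, [1 :C] 10,
   [2 :B] 19, [2 :A] 17, [2 :C] 7})

;; Mean of all values in A, the counterpart of A.stack().mean().
(def fill
  (/ (reduce + (vals a-cells))
     (double (count a-cells))))

(defn add-with-fill
  "Adds two cell maps over the union of their keys, substituting
  `fill-value` for a cell that is present on only one side."
  [x y fill-value]
  (into (sorted-map)
        (for [k (into (set (keys x)) (keys y))]
          [k (+ (get x k fill-value) (get y k fill-value))])))

(def filled-sum (add-with-fill a-cells b-cells fill))

fill                ; => 10.5
(filled-sum [0 :A]) ; => 8    (6 + 2, both present)
(filled-sum [0 :C]) ; => 11.5 (10.5 + 1, A has no column :C)
(filled-sum [2 :A]) ; => 27.5 (10.5 + 17, A has no row 2)
```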

The following table lists Python operators and their equivalent Pandas object methods:

+ 	add()
- 	sub(), subtract()
* 	mul(), multiply()
/ 	truediv(), div(), divide()
// 	floordiv()
% 	mod()
** 	pow()

UFuncs: Operations Between DataFrame and Series

When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-dimensional NumPy array. Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:
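
The row-wise subtraction described above can be sketched in plain Clojure. This is a minimal illustration, not the tech.v3.dataset or tablecloth API: the table is assumed to be a vector of row vectors, the values are illustrative, and subtract-row is a hypothetical helper name.

```clojure
;; Hypothetical 3x4 table stored as a vector of rows (illustrative values).
(def rows
  [[3 8 2 4]
   [2 6 4 8]
   [6 1 3 8]])

(defn subtract-row
  "Subtracts `row` from every row of `table`, element by element."
  [table row]
  (mapv #(mapv - % row) table))

(subtract-row rows (first rows))
;; => [[0 0 0 0] [-1 -2 2 4] [3 -7 1 4]]
```

The first row is applied to every row of the table, which is the same row-wise behaviour NumPy broadcasting gives for a two-dimensional array minus one of its rows.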