scicloj.tcutils.api

between

(between ds col-name low high)
(between ds col-selector low high {:keys [missing-default]})

Detect where values fall in a specified range in a numeric column. This is a shortcut for (< low x high), so both bounds are exclusive.

Usage

(between ds col-name low high)

(between ds col-name low high {:missing-default val})

Arguments

  • ds - A tech.ml.dataset (i.e. a tablecloth dataset)
  • col-name - Name of the column to use in the comparison
  • low - Lower bound for values of col-name
  • high - Upper bound for values of col-name
  • options - optional. Options map containing the key :missing-default, which specifies the value to use when the value of (col-name row) is nil. An error is thrown if the column contains any missing values and this option is not provided.

Returns

A dataset containing only the rows whose value in column col-name lies between low and high.
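
To make the strict bounds and the role of :missing-default concrete, here is a minimal sketch of the same filter over a plain vector of row maps. It only illustrates the behaviour documented above and is not the library's implementation; the helper name between-rows and the sample data are assumptions made for this example.

```clojure
;; Hypothetical helper (not part of tcutils): mimics the documented
;; semantics of `between` on a plain vector of row maps.
(defn between-rows
  ([rows col-name low high]
   (between-rows rows col-name low high {}))
  ([rows col-name low high {:keys [missing-default] :as opts}]
   (filterv (fn [row]
              (let [v (get row col-name)
                    x (if (nil? v)
                        ;; Substitute the default for a missing value,
                        ;; or fail when no default was supplied.
                        (if (contains? opts :missing-default)
                          missing-default
                          (throw (ex-info "Missing value and no :missing-default given"
                                          {:column col-name})))
                        v)]
                ;; Strict comparison, as in (< low x high)
                (< low x high)))
            rows)))

;; Sample data (assumed): one row has a missing value.
(def rows [{:x 1} {:x 5} {:x 10} {:x nil}])

(between-rows rows :x 1 10 {:missing-default 0})
;; => [{:x 5}]
```

The rows with values 1 and 10 are dropped because the bounds are exclusive, and the nil row is compared as 0. Calling the three-argument form on the same data throws, since the column has a missing value and no default.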

clean-column-names

(clean-column-names ds)

Convert the column names of a dataset into ASCII-only, kebab-cased keywords. Throws an error if any column would be left with no name, e.g. one whose name consisted entirely of non-ASCII characters.

Usage

(clean-column-names ds)

Arguments

  • ds - A tech.ml.dataset (i.e. a tablecloth dataset)

Returns

A dataset with the column names converted to ASCII-only, kebab-cased keywords.
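
As a rough illustration of this kind of cleaning, the sketch below converts individual names with plain Clojure. It is a simplified stand-in, not the library's implementation: the helper name clean-name is hypothetical, non-ASCII characters are simply dropped here (the library may transliterate them instead), and camelCase splitting is not handled.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of tcutils): a simplified version of the
;; name cleaning described above, applied to one column name.
(defn clean-name [col-name]
  (let [cleaned (-> (name col-name)                   ; accepts strings or keywords
                    (str/replace #"[^\x00-\x7F]" "")  ; drop non-ASCII characters
                    str/lower-case
                    (str/replace #"[^a-z0-9]+" "-")   ; runs of separators become one hyphen
                    (str/replace #"^-+|-+$" ""))]     ; trim leading/trailing hyphens
    (if (str/blank? cleaned)
      ;; Mirrors the documented error for names that would end up empty.
      (throw (ex-info "Column would be left with no name" {:column col-name}))
      (keyword cleaned))))

(mapv clean-name ["First Name" "Total_Cost (USD)" "Preis €"])
;; => [:first-name :total-cost-usd :preis]
```

A name made up entirely of non-ASCII characters, such as "価格", leaves nothing behind after cleaning, so this sketch throws for it, in line with the error described above.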

cumsum

(cumsum ds column-name)
(cumsum ds new-column-name column-name)

Compute the cumulative sum of a column.

Usage

(cumsum ds column-name)

(cumsum ds new-column-name column-name)

Arguments

  • ds - A tech.ml.dataset (i.e. a tablecloth dataset)
  • new-column-name - optional. Name for the column where the newly computed values will go. When omitted, the new column name defaults to the keyword <old-column-name>-cumulative-sum
  • column-name - Name of the column to use to compute the cumulative sum

Returns

A dataset with the additional column containing the cumulative sum.
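
The running total and the default column name can be sketched with plain Clojure over a vector of row maps. This is an illustration of the documented behaviour under assumed sample data, not the library's implementation; the helper name add-running-sum is hypothetical.

```clojure
;; Hypothetical helper (not part of tcutils): adds a running total of
;; `column-name` to each row, using the documented default column name.
(defn add-running-sum [rows column-name]
  (let [new-name (keyword (str (name column-name) "-cumulative-sum"))
        ;; reductions yields every intermediate sum: 1, 1+2, 1+2+3, ...
        totals   (reductions + (map #(get % column-name) rows))]
    (mapv #(assoc %1 new-name %2) rows totals)))

(add-running-sum [{:sales 1} {:sales 2} {:sales 3}] :sales)
;; => [{:sales 1, :sales-cumulative-sum 1}
;;     {:sales 2, :sales-cumulative-sum 3}
;;     {:sales 3, :sales-cumulative-sum 6}]
```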

duplicate-rows

(duplicate-rows ds)

Filter a dataset for only duplicated rows.

Usage

(duplicate-rows ds)

Arguments

  • ds - A tech.ml.dataset (i.e. a tablecloth dataset)

Returns

A dataset containing only rows that are exact duplicates.
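
One way to picture this filter is over a plain vector of row maps, as sketched below. The helper name duplicated-rows is hypothetical, and the choice to keep every copy of a repeated row (rather than only the second and later copies) is an assumption of this sketch that the description above does not settle.

```clojure
;; Hypothetical helper (not part of tcutils): keeps rows that occur
;; more than once, comparing whole rows for equality.
(defn duplicated-rows [rows]
  (let [counts (frequencies rows)]   ; row -> number of occurrences
    (filterv #(> (counts %) 1) rows)))

(duplicated-rows [{:a 1 :b 2} {:a 1 :b 3} {:a 1 :b 2}])
;; => [{:a 1, :b 2} {:a 1, :b 2}]
```

The row {:a 1, :b 3} is dropped even though it shares a value with the others, because only rows that match in every column count as exact duplicates.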

lag

(lag ds column-name lag-size)
(lag ds new-column-name column-name lag-size)

Compute previous (lagged) values from one column and place them in a new column. This can be used, for example, to compare each value with the value lag-size rows before it.

Usage

(lag ds column-name lag-size)

(lag ds new-column-name column-name lag-size)

Arguments

  • ds - A tech.ml.dataset (i.e. a tablecloth dataset)
  • new-column-name - optional. Name for the column where the newly computed values will go. When omitted, the new column name defaults to the keyword <old-column-name>-lag-<lag-size>
  • column-name - Name of the column to use to compute the lagged values
  • lag-size - positive integer indicating how many rows to skip over to compute the lag

Returns

A dataset with the new column populated with the lagged values.
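
The shift that a lag performs can be sketched on a plain vector. This is an illustration only, not the library's implementation: the helper name lag-values is hypothetical, and filling the first lag-size positions with nil (standing in for missing values) is an assumption.

```clojure
;; Hypothetical helper (not part of tcutils): shifts values down by
;; `lag-size` positions, padding the start with nil.
(defn lag-values [xs lag-size]
  (vec (take (count xs)
             (concat (repeat lag-size nil) xs))))

(def prices [10 20 30 40])   ; sample data (assumed)

(lag-values prices 1)
;; => [nil 10 20 30]

;; Comparing each value with the one before it:
(mapv (fn [current previous]
        (when previous (- current previous)))
      prices
      (lag-values prices 1))
;; => [nil 10 10 10]
```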

lead

(lead ds column-name lead-size)
(lead ds new-column-name column-name lead-size)

Compute next (lead) values from one column and place them in a new column. This can be used, for example, to compare each value with the value lead-size rows after it.

Usage

(lead ds column-name lead-size)

(lead ds new-column-name column-name lead-size)

Arguments

  • ds - A tech.ml.dataset (i.e. a tablecloth dataset)
  • new-column-name - optional. Name for the column where the newly computed values will go. When omitted, the new column name defaults to the keyword <old-column-name>-lead-<lead-size>
  • column-name - Name of the column to use to compute the lead values
  • lead-size - positive integer indicating how many rows to skip over to compute the lead

Returns

A dataset with the new column populated with the lead values.
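
A lead is the mirror image of a lag: values shift up instead of down. The sketch below shows this on a plain vector. It is an illustration only, not the library's implementation: the helper name lead-values is hypothetical, and filling the last lead-size positions with nil (standing in for missing values) is an assumption.

```clojure
;; Hypothetical helper (not part of tcutils): shifts values up by
;; `lead-size` positions, padding the end with nil.
(defn lead-values [xs lead-size]
  (vec (take (count xs)
             (concat (drop lead-size xs) (repeat nil)))))

(lead-values [10 20 30 40] 1)
;; => [20 30 40 nil]

(lead-values [10 20 30 40] 2)
;; => [30 40 nil nil]
```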