Data Transformation Example

(ns book.part-0-end-to-end.2-data-preparation-example
  (:require
   [clojure.string :as str]
   [tablecloth.api :as tc]))

Now that we have a “clean” dataset to work with, we’ll go through the necessary steps to prepare it for use in training a model. In the previous chapter, we tidied up our dataset in many important ways: - Structured format: We organized the data in a consistent, tabular format, with one column per variable and one row per observation

A “machine learning-ready” dataset has a few key characteristics that distinguish it from a more typical dataset you might find online, even one that is well-organized and clean:

A dataset suitable for machine learning has several key characteristics that distinguish it from a typical dataset you might find online:

Structured format:

Typically organized in a tabular format (rows and columns) Consistent structure across all entries Clear delineation between features (input variables) and target variables (what you’re trying to predict)

Sufficient volume:

Generally larger than conventional datasets Enough samples to capture patterns and variability in the data

Quality and cleanliness:

Minimal errors or inconsistencies Properly labeled data, especially for supervised learning tasks

Representative:

Covers the full range of scenarios or cases the model might encounter Balanced representation of different classes or outcomes

Relevant features:

Contains variables that are likely to be predictive of the target variable Excludes irrelevant or redundant information

Appropriate data types:

Numerical data for continuous variables Categorical data properly encoded (e.g., one-hot encoding) Consistent data types within each feature

Normalized or standardized:

Numerical features often scaled to a common range Helps prevent certain features from dominating due to their scale

Time-aware (for time-series data):

Proper chronological ordering No data leakage from future to past

Split-ready:

Can be easily divided into training, validation, and test sets These splits maintain the statistical properties of the whole dataset

Metadata and documentation:

Clear descriptions of each feature Information on data collection methods and any preprocessing steps

Ethical considerations:

Respects privacy and data protection regulations Unbiased representation of different demographic groups

Versioning:

Often includes version control to track changes and updates

Noise and outliers:

Contains a realistic level of noise to prevent overfitting Outliers are identified and handled appropriately

(def housing-data
  (tc/dataset "data/end_to_end/housing-data.nippy"))

Select relevant features

List all column names

(tc/column-names housing-data)
(:assessment-account-number
 :municipal-unit
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-suffix
 :civic-city-name
 :sale-price
 :sale-date
 :parcels-in-sale
 :living-units
 :year-built
 :square-foot-living-area
 :style
 :bedrooms
 :bathrooms
 :under-construction
 :construction-grade
 :finished-basement
 :garage
 :square-feet)

Remove all the address columns, these won’t be useful for training our model.

(defn remove-address-columns [ds]
  (-> ds
      (tc/drop-columns #(str/includes? (name %) "civic"))))

(tc/select-columns #(not (str/starts-with? (name %) “civic”)))

(remove-address-columns housing-data)

data/end_to_end/housing-data.nippy [363075 16]:

:assessment-account-number :municipal-unit :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :style :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 225000 June 28 2018 2 1 1987 988 1 Storey 1 N Average N N 107680
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 380000 July 22 2022 2 1 1987 988 1 Storey 1 N Average N N 107680
8502838 MUNICIPALITY OF THE DISTRICT OF ARGYLE 0 January 19 2023 0 2020 640 2 Storey 1 0 N Low N N
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 130000 June 28 2022 1 0 1632 2 Storey 4 1 Y Average N N 7562
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 85000 August 29 2019 1 0 1632 2 Storey 4 1 Y Average N N 7562
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 8500 October 09 2020 2 1 340 1 Storey 1 1 N Low N N 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 0 February 10 2023 2 1 340 1 Storey 1 1 N Low N N 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 0 June 13 2023 2 1 340 1 Storey 1 1 N Low N N 5400
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 0 September 24 2020 1 1 1910 1170 1.5 Storey 4 1 N Average N N 1320304
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 0 October 08 2020 1 1 1910 1170 1.5 Storey 4 1 N Average N N 1320304
10863791 MUNI. OF THE DISTRICT OF LUNENBURG 90000 February 26 2021 1 1 2022 1106 1 Storey 1 2 N Average Y Y 118483
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 42000 June 30 2015 1 1 2017 1170 1.5 Storey 2 2 N Average N Y 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 69000 June 23 2011 1 1 2017 1170 1.5 Storey 2 2 N Average N Y 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 591500 December 22 2021 2 1 2017 1170 1.5 Storey 2 2 N Average N Y 51687
10751993 WEST HANTS REGIONAL MUNICIPALITY 44900 April 29 2022 1 1 2022 1230 1 Storey 3 3 N Average N N 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 0 October 25 2018 2 1 2022 1230 1 Storey 3 3 N Average N N 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 37000 June 04 2020 2 1 2022 1230 1 Storey 3 3 N Average N N 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 51000 July 19 2021 2 1 2022 1230 1 Storey 3 3 N Average N N 48352
7667817 HALIFAX REGIONAL MUNICIPALITY (HRM) 0 November 15 2019 1 1 1996 1050 1.5 Storey 0 N Low N N 332798
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 767575 May 17 2022 9 1 2023 2412 Split Entry 4 3 N Good Y Y 143400
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 0 January 31 2022 9 1 2023 2412 Split Entry 4 3 N Good Y Y 143400

Handle missing values

Most machine learning algorithms can’t handle missing values, so we need to decide how to deal with them before training our model.

TODO: write about missing values

Get the columns that have missing values

(-> housing-data remove-address-columns tc/info (tc/select-rows #(pos? (:n-missing %))))

(-> housing-data (tc/select-rows #((some-fn nil? zero?) (:sale-price %))) (tc/group-by [:municipal-unit]) (tc/aggregate {:count tc/row-count}) (tc/order-by :count :desc))

A quick survey shows us that the rows with missing values are almost all in the HRM, which makes it easier to feel ok about dropping all of this data. There will be a huge overabundance of data from the HRM, since that’s where most of the houses are. If we were losing all of our data from a given small municipality, for example, we’d want to figure something out.

Deal with missing values. I’m just going to drop them, there are techniques we could use to fill in the missing values, but we have enough data here that it’s not worth the downsides.

(defn drop-rows-missing-values [ds] (let [columns-with-missing-values (-> ds tc/info (tc/select-rows #(pos? (:n-missing %))) :col-name)] (reduce (fn [ds col-name] (tc/drop-rows ds #((some-fn nil? zero?) (col-name %)))) ds columns-with-missing-values)))

(-> housing-data remove-address-columns drop-rows-missing-values tc/row-count)

(->> housing-data :year-built frequencies (sort-by first))

There are different approaches to cleaning up data. One is to go by category, e.g. drop all the missing values, convert all the non-numeric values, etc. But realistically we want to be more careful and clean up each column within the context of the other values in the column.

(-> housing-data )

(-> housing-data (tc/select-rows #(nil? (:construction-grade %))) )

(def date-format (DateTimeFormatter/ofPattern “MMMM dd yyyy”))

(defn- to-date [s] )

(defn to-year [s] ((memfn getYear) (LocalDate/parse s date-format)))

(defn- y-n-to-int [v] (if (= v “Y”) 1 0))

We can use a simple linear model to fill in the missing values for construction grade

(def clean-housing-data (-> housing-data ;; Clean up missing values (tc/drop-missing columns-to-drop)

Drop rows with a sale price of zero. These are not useful for our purposes. (tc/drop-rows #(= 0 (:sale-price %)))

Fill in missing :construction-grade values: ()

Convert the sale-date to just a year, this is granular enough for our purposes (tc/convert-types :sale-date [[:int to-year]])) )

(def non-numeric-columns (let [numerical-cols (tc/column-names clean-housing-data :type/numerical)] (tc/column-names clean-housing-data (complement (set numerical-cols)))))

(def construction-grade-map {“Low” 0 “Fair” 1 “Average” 2 “Good” 3 “Manufactured Home” 3 “Very Good” 4 “Excellent” 5})

(defn update-construction-grade [v] (get construction-grade-map v))

(def training-data (-> clean-housing-data ;; Deal with non-numeric columns. “Y”/“N” columns can be converted to 0s and 1s: (tc/update-columns [:garage :finished-basement :under-construction] (partial map y-n-to-int))

;; Remove columns that describe the address, we don’t care about these for the purposes of our model (tc/drop-columns #(str/includes? (name %) “civic”))

  (tc/update-columns :construction-grade (partial map update-construction-grade))
  :style
  frequencies

  ))

Making the style column numeric

(->> clean-housing-data :style frequencies (sort-by second))

(tc/drop-columns (tc/info clean-housing-data) [:first :last :mode :standard-deviation :skew :mean])

clean-housing-data

(-> clean-housing-data (haclo/layer-point #:haclo{:x :square-foot-living-area :y :sale-price :xscale {:type “log”}}))

(haclo/layer-point #:haclo{:x :square-foot-living-area :y :sale-price})

(math/correlation-table clean-housing-data)

(-> housing-data (tc/drop-missing :sale-price) (tc/drop-rows #(= 0 (:sale-price %)))

(tc/select-rows #(< 10000000 (:sale-price %))))

There are a handful of properties that sold for 10M

(-> housing-data (tc/select-rows #(> 10000000 (:sale-price %))))

source: src/book/part_0_end_to_end/2_data_preparation_example.clj