Data Transformation Example

(ns book.part-0-end-to-end.2-data-preparation-example
  (:require
   [clojure.string :as str]
   [tablecloth.api :as tc]))

Now that we have a “clean” dataset to work with, we’ll go through the necessary steps to prepare it for use in training a model. In the previous chapter, we tidied up our dataset in many important ways: - Structured format: We organized the data in a consistent, tabular format, with one column per variable and one row per observation

A “machine learning-ready” dataset has a few key characteristics that distinguish it from a more typical dataset you might find online, even one that is well-organized and clean:

Properly handle missing values: We need to either remove, impute, or otherwise mark missing values in such way that they can be handled by our algorithms. The strategy for handling missing values needs to be consistent across the dataset.
Correcting datatypes: Most algorithms require numeric values. We need to make sure our continuous variables have a valid numeric type
Normalizing scales
Handling outliers

A dataset suitable for machine learning has several key characteristics that distinguish it from a typical dataset you might find online:

Structured format:

Typically organized in a tabular format (rows and columns) Consistent structure across all entries Clear delineation between features (input variables) and target variables (what you’re trying to predict)

Sufficient volume:

Generally larger than conventional datasets Enough samples to capture patterns and variability in the data

Quality and cleanliness:

Minimal errors or inconsistencies Properly labeled data, especially for supervised learning tasks

Representative:

Covers the full range of scenarios or cases the model might encounter Balanced representation of different classes or outcomes

Relevant features:

Contains variables that are likely to be predictive of the target variable Excludes irrelevant or redundant information

Appropriate data types:

Numerical data for continuous variables Categorical data properly encoded (e.g., one-hot encoding) Consistent data types within each feature

Normalized or standardized:

Numerical features often scaled to a common range Helps prevent certain features from dominating due to their scale

Time-aware (for time-series data):

Proper chronological ordering No data leakage from future to past

Split-ready:

Can be easily divided into training, validation, and test sets These splits maintain the statistical properties of the whole dataset

Metadata and documentation:

Clear descriptions of each feature Information on data collection methods and any preprocessing steps

Ethical considerations:

Respects privacy and data protection regulations Unbiased representation of different demographic groups

Versioning:

Often includes version control to track changes and updates

Noise and outliers:

Contains a realistic level of noise to prevent overfitting Outliers are identified and handled appropriately

(def housing-data
  (tc/dataset "data/end_to_end/housing-data.nippy"))

Select relevant features

List all column names

(tc/column-names housing-data)

(:assessment-account-number
 :municipal-unit
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-suffix
 :civic-city-name
 :sale-price
 :sale-date
 :parcels-in-sale
 :living-units
 :year-built
 :square-foot-living-area
 :style
 :bedrooms
 :bathrooms
 :under-construction
 :construction-grade
 :finished-basement
 :garage
 :square-feet)

Remove all the address columns, these won’t be useful for training our model.

(defn remove-address-columns [ds]
  (-> ds
      (tc/drop-columns #(str/includes? (name %) "civic"))))

(tc/select-columns #(not (str/starts-with? (name %) “civic”)))

(remove-address-columns housing-data)

data/end_to_end/housing-data.nippy [363075 16]:

:assessment-account-number	:municipal-unit	:sale-price	:sale-date	:parcels-in-sale	:living-units	:year-built	:square-foot-living-area	:style	:bedrooms	:bathrooms	:under-construction	:construction-grade	:finished-basement	:garage	:square-feet
911593	HALIFAX REGIONAL MUNICIPALITY (HRM)	225000	June 28 2018	2	1	1987	988	1 Storey		1	N	Average	N	N	107680
911593	HALIFAX REGIONAL MUNICIPALITY (HRM)	380000	July 22 2022	2	1	1987	988	1 Storey		1	N	Average	N	N	107680
8502838	MUNICIPALITY OF THE DISTRICT OF ARGYLE	0	January 19 2023		0	2020	640	2 Storey	1	0	N	Low	N	N
3698858	MUNICIPALITY OF THE COUNTY OF KINGS	130000	June 28 2022	1	0		1632	2 Storey	4	1	Y	Average	N	N	7562
3698858	MUNICIPALITY OF THE COUNTY OF KINGS	85000	August 29 2019	1	0		1632	2 Storey	4	1	Y	Average	N	N	7562
3944026	MUNI. OF THE DISTRICT OF GUYSBOROUGH	8500	October 09 2020	2	1		340	1 Storey	1	1	N	Low	N	N	5400
3944026	MUNI. OF THE DISTRICT OF GUYSBOROUGH	0	February 10 2023	2	1		340	1 Storey	1	1	N	Low	N	N	5400
3944026	MUNI. OF THE DISTRICT OF GUYSBOROUGH	0	June 13 2023	2	1		340	1 Storey	1	1	N	Low	N	N	5400
3885003	MUNICIPALITY OF THE COUNTY OF INVERNESS	0	September 24 2020	1	1	1910	1170	1.5 Storey	4	1	N	Average	N	N	1320304
3885003	MUNICIPALITY OF THE COUNTY OF INVERNESS	0	October 08 2020	1	1	1910	1170	1.5 Storey	4	1	N	Average	N	N	1320304
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
10863791	MUNI. OF THE DISTRICT OF LUNENBURG	90000	February 26 2021	1	1	2022	1106	1 Storey	1	2	N	Average	Y	Y	118483
2745755	HALIFAX REGIONAL MUNICIPALITY (HRM)	42000	June 30 2015	1	1	2017	1170	1.5 Storey	2	2	N	Average	N	Y	51687
2745755	HALIFAX REGIONAL MUNICIPALITY (HRM)	69000	June 23 2011	1	1	2017	1170	1.5 Storey	2	2	N	Average	N	Y	51687
2745755	HALIFAX REGIONAL MUNICIPALITY (HRM)	591500	December 22 2021	2	1	2017	1170	1.5 Storey	2	2	N	Average	N	Y	51687
10751993	WEST HANTS REGIONAL MUNICIPALITY	44900	April 29 2022	1	1	2022	1230	1 Storey	3	3	N	Average	N	N	48352
10751993	WEST HANTS REGIONAL MUNICIPALITY	0	October 25 2018	2	1	2022	1230	1 Storey	3	3	N	Average	N	N	48352
10751993	WEST HANTS REGIONAL MUNICIPALITY	37000	June 04 2020	2	1	2022	1230	1 Storey	3	3	N	Average	N	N	48352
10751993	WEST HANTS REGIONAL MUNICIPALITY	51000	July 19 2021	2	1	2022	1230	1 Storey	3	3	N	Average	N	N	48352
7667817	HALIFAX REGIONAL MUNICIPALITY (HRM)	0	November 15 2019	1	1	1996	1050	1.5 Storey		0	N	Low	N	N	332798
10986400	HALIFAX REGIONAL MUNICIPALITY (HRM)	767575	May 17 2022	9	1	2023	2412	Split Entry	4	3	N	Good	Y	Y	143400
10986400	HALIFAX REGIONAL MUNICIPALITY (HRM)	0	January 31 2022	9	1	2023	2412	Split Entry	4	3	N	Good	Y	Y	143400

Handle missing values

Most machine learning algorithms can’t handle missing values, so we need to decide how to deal with them before training our model.

TODO: write about missing values

Get the columns that have missing values

(-> housing-data remove-address-columns tc/info (tc/select-rows #(pos? (:n-missing %))))

(-> housing-data (tc/select-rows #((some-fn nil? zero?) (:sale-price %))) (tc/group-by [:municipal-unit]) (tc/aggregate {:count tc/row-count}) (tc/order-by :count :desc))

A quick survey shows us that the rows with missing values are almost all in the HRM, which makes it easier to feel ok about dropping all of this data. There will be a huge overabundance of data from the HRM, since that’s where most of the houses are. If we were losing all of our data from a given small municipality, for example, we’d want to figure something out.

Deal with missing values. I’m just going to drop them, there are techniques we could use to fill in the missing values, but we have enough data here that it’s not worth the downsides.

(defn drop-rows-missing-values [ds] (let [columns-with-missing-values (-> ds tc/info (tc/select-rows #(pos? (:n-missing %))) :col-name)] (reduce (fn [ds col-name] (tc/drop-rows ds #((some-fn nil? zero?) (col-name %)))) ds columns-with-missing-values)))

(-> housing-data remove-address-columns drop-rows-missing-values tc/row-count)

(->> housing-data :year-built frequencies (sort-by first))

There are different approaches to cleaning up data. One is to go by category, e.g. drop all the missing values, convert all the non-numeric values, etc. But realistically we want to be more careful and clean up each column within the context of the other values in the column.

(-> housing-data )

(-> housing-data (tc/select-rows #(nil? (:construction-grade %))) )

(def date-format (DateTimeFormatter/ofPattern “MMMM dd yyyy”))

(defn- to-date [s] )

(defn to-year [s] ((memfn getYear) (LocalDate/parse s date-format)))

(defn- y-n-to-int [v] (if (= v “Y”) 1 0))

We can use a simple linear model to fill in the missing values for construction grade

(def clean-housing-data (-> housing-data ;; Clean up missing values (tc/drop-missing columns-to-drop)

Drop rows with a sale price of zero. These are not useful for our purposes. (tc/drop-rows #(= 0 (:sale-price %)))

Fill in missing :construction-grade values: ()

Convert the sale-date to just a year, this is granular enough for our purposes (tc/convert-types :sale-date [[:int to-year]])) )

(def non-numeric-columns (let [numerical-cols (tc/column-names clean-housing-data :type/numerical)] (tc/column-names clean-housing-data (complement (set numerical-cols)))))

(def construction-grade-map {“Low” 0 “Fair” 1 “Average” 2 “Good” 3 “Manufactured Home” 3 “Very Good” 4 “Excellent” 5})

(defn update-construction-grade [v] (get construction-grade-map v))

(def training-data (-> clean-housing-data ;; Deal with non-numeric columns. “Y”/“N” columns can be converted to 0s and 1s: (tc/update-columns [:garage :finished-basement :under-construction] (partial map y-n-to-int))

;; Remove columns that describe the address, we don’t care about these for the purposes of our model (tc/drop-columns #(str/includes? (name %) “civic”))

  (tc/update-columns :construction-grade (partial map update-construction-grade))
  :style
  frequencies

  ))

Making the style column numeric

(->> clean-housing-data :style frequencies (sort-by second))

(tc/drop-columns (tc/info clean-housing-data) [:first :last :mode :standard-deviation :skew :mean])

clean-housing-data

(-> clean-housing-data (haclo/layer-point #:haclo{:x :square-foot-living-area :y :sale-price :xscale {:type “log”}}))

(haclo/layer-point #:haclo{:x :square-foot-living-area :y :sale-price})

(math/correlation-table clean-housing-data)

(-> housing-data (tc/drop-missing :sale-price) (tc/drop-rows #(= 0 (:sale-price %)))

(tc/select-rows #(< 10000000 (:sale-price %))))

There are a handful of properties that sold for 10M

(-> housing-data (tc/select-rows #(> 10000000 (:sale-price %))))

source: src/book/part_0_end_to_end/2_data_preparation_example.clj