Data Import Example

TODO: Summary

To build a model that predicts house prices, we need data about recent sales of similar houses. Luckily the organization that tracks all of this data in my province has released it for public use under a very permissive open data license, so we’ll use that data to find what we’re looking for.

First we need the data about actual house sales. That’s available in this Parcel Sales History dataset. Standard tabular file formats like .csv just work in tablecloth, so there’s not much to worry about here with importing the data:

(def sales-history
  (tc/dataset "data/end_to_end/parcel_sales_history.csv"))

We can get a quick summary of the dataset with tc/info:

(tc/info sales-history)

data/end_to_end/parcel_sales_history.csv: descriptive-stats [14 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
Municipal Unit :string 553234 0 HALIFAX REGIONAL MUNICIPALITY (HRM) HALIFAX REGIONAL MUNICIPALITY (HRM) MUNI. OF THE DISTRICT OF LUNENBURG
Assessment Account Number :int32 553234 0 396.000000 5464364.27661351 1.10765560E+07 3559869.76102067 0.18744869 10587948 3933504
Civic Number :int32 407909 145325 0.000000 1249.59833198 5.19470000E+04 2989.77914396 6.12767133 75 918
Civic Additional :string 1365 551869 A
Civic Direction :string 2224 551010
Civic Street Name :string 550304 2930 HIGHWAY 1 CENTRAL NO 10
Civic Suffix :string 497196 56038 RD AVE HWY
Civic City Name :string 552342 892 HALIFAX HALIFAX COOKVILLE
Sale Price :int32 550742 2492 0.000000 216081.15903999 3.70000000E+08 2006809.13640543 89.31325733 0 0
Sale Date :string 553234 0 September 28 2012 January 08 2021 April 26 2024
Parcels In Sale :int16 545757 7477 0.000000 1.89112554 1.30000000E+02 3.23739061 15.40358232 1 1
Y Map Coordinate :float64 526436 26798 43.398415 45.03052545 4.70250420E+01 0.67659690 0.20517924 44.65 44.41
X Map Coordinate :float64 526436 26798 -66.392040 -63.44632344 -5.96961680E+01 1.60085428 0.61790586 -63.64 -64.55
Map Coordinates :string 526436 26798 (44.651645, -63.639508) (44.413596, -64.552656)

The column names are capitalized strings, which is fine but it’s easier and idiomatic to work with kebab-cased keywords as column-names in Clojure, so we’ll update our import to transform them on load. We can use the to-clean-keyword helper from tcutils to do this. It’s similar to the make_clean_names helper from R’s janitor package, which is commonly used when datasets are loaded to, well, clean up the column names. You can see the full implementation here, but basically it removes any non-ASCII characters in strings, and then kebab-case and keywordizes them.

Tablecloth also gives each dataset a name, which defaults to the file name. We’ll set ours to something more succinct here:

(def sales-history
  (tc/dataset "data/end_to_end/parcel_sales_history.csv"
              {:key-fn tcustr/to-clean-keyword
               :dataset-name "sales-history"}))
(tc/info sales-history)

sales-history: descriptive-stats [14 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:municipal-unit :string 553234 0 HALIFAX REGIONAL MUNICIPALITY (HRM) HALIFAX REGIONAL MUNICIPALITY (HRM) MUNI. OF THE DISTRICT OF LUNENBURG
:assessment-account-number :int32 553234 0 396.000000 5464364.27661351 1.10765560E+07 3559869.76102067 0.18744869 10587948 3933504
:civic-number :int32 407909 145325 0.000000 1249.59833198 5.19470000E+04 2989.77914396 6.12767133 75 918
:civic-additional :string 1365 551869 A
:civic-direction :string 2224 551010
:civic-street-name :string 550304 2930 HIGHWAY 1 CENTRAL NO 10
:civic-suffix :string 497196 56038 RD AVE HWY
:civic-city-name :string 552342 892 HALIFAX HALIFAX COOKVILLE
:sale-price :int32 550742 2492 0.000000 216081.15903999 3.70000000E+08 2006809.13640543 89.31325733 0 0
:sale-date :string 553234 0 September 28 2012 January 08 2021 April 26 2024
:parcels-in-sale :int16 545757 7477 0.000000 1.89112554 1.30000000E+02 3.23739061 15.40358232 1 1
:y-map-coordinate :float64 526436 26798 43.398415 45.03052545 4.70250420E+01 0.67659690 0.20517924 44.65 44.41
:x-map-coordinate :float64 526436 26798 -66.392040 -63.44632344 -5.96961680E+01 1.60085428 0.61790586 -63.64 -64.55
:map-coordinates :string 526436 26798 (44.651645, -63.639508) (44.413596, -64.552656)

Here we can see that this dataset includes information about the sale of the house (e.g. address, sale date, sale price, etc), but no features of the house itself (e.g. number of bedrooms, bathrooms, garages, etc.). We could build a model that predicts sale price based solely on a house’s location, but basic common sense tells us there’s more that goes into the value of the home. It’s not obvious exactly what the biggest determinants of price are, but anyone who has shopped for a house before knows that things like the age of the home, its size and features matter. To get this information, we’ll need a second dataset: Residential Dwelling Characteristics.

We can load it the same way:

(def characteristics
  (tc/dataset "data/end_to_end/residential_dwelling_characteristics.csv"
              {:key-fn tcustr/to-clean-keyword
               :dataset-name "characteristics"}))

Upon inspection we can see that this dataset includes the information about houses that we’re looking for, and it also as a column for “Assessment Account Number”, which we should be able to use to join it up to our other one.

(tc/info characteristics)

characteristics: descriptive-stats [21 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:municipal-unit :string 379101 0 HALIFAX REGIONAL MUNICIPALITY (HRM) MUNICIPALITY OF THE DISTRICT OF CLARE HALIFAX REGIONAL MUNICIPALITY (HRM)
:assessment-account-number :int32 379101 0 485.000000 4599930.15107056 1.10765990E+07 3141825.67393673 0.49704026 1286048 5728266
:civic-number :int32 374134 4967 0.000000 1245.77175825 9.05984000E+05 3355.53596798 56.94117266 358 20
:civic-additional :string 1179 377922
:civic-direction :string 1341 377760
:civic-street-name :string 378551 550 HIGHWAY 1 LITTLE BROOK RITCEY
:civic-street-suffix :string 338432 40669 RD RD CRT
:civic-city-name :string 378609 492 HALIFAX LITTLE BROOK STATION PORTERS LAKE
:living-units :int16 379101 0 0.000000 1.19882828 5.71000000E+02 7.17163974 66.61425743 1 1
:year-built :int16 329438 49663 200.000000 1978.24914551 2.04800000E+03 29.17932661 -2.12051956 1976 1985
:square-foot-living-area :int16 366348 12753 0.000000 1537.18592977 2.39310000E+04 750.57258849 1.95137428 2392 2338
:style :string 366445 12656 1 Storey 2 Storey 1.5 Storey
:bedrooms :int16 326866 52235 0.000000 3.01704368 2.40000000E+01 1.02378973 1.33747502 4 4
:bathrooms :int16 379101 0 0.000000 1.68568007 2.50000000E+01 0.95784940 1.28420156 2 4
:under-construction :string 379101 0 N N N
:construction-grade :string 366269 12832 Average Average Average
:finished-basement :string 379101 0 N N Y
:garage :string 379101 0 N N Y
:y-map-coordinate :float64 354986 24115 43.398415 45.04529417 4.70260480E+01 0.66341981 0.29089356 44.29 44.74
:x-map-coordinate :float64 354986 24115 -66.387871 -63.36304830 -5.97949100E+01 1.56388011 0.69450136 -66.10 -63.30
:map-coordinates :string 354986 24115 (44.293843, -66.10439) (44.743576, -63.302604)

There’s one other piece of information that I want for our model. In Nova Scotia, it’s not uncommon to own relatively large plots of land, and this can make house prices seem surprising. A small or very old house might seem like it’s priced too high, for example, if you don’t notice the size of the lot that it’s on. We can get this information, again indexed by the Assessment Account Number, from the Parcel Land Sizes dataset:

(def land-sizes
  (tc/dataset "data/end_to_end/parcel_land_sizes.csv"
              {:key-fn tcustr/to-clean-keyword
               :dataset-name "land-sizes"}))
(tc/info land-sizes)

land-sizes: descriptive-stats [13 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:municipal-unit :string 647472 0 HALIFAX REGIONAL MUNICIPALITY (HRM) HALIFAX REGIONAL MUNICIPALITY (HRM) WEST HANTS REGIONAL MUNICIPALITY
:assessment-account-number :int32 647472 0 35.000000 5327202.15206990 1.10797510E+07 3.46299091E+06 0.22692123 911593 11000738
:civic-number :int32 436880 210592 0.000000 1299.84449734 9.05984000E+05 3.38721369E+03 48.31175816 445
:civic-additional :string 1462 646010
:civic-direction :string 2603 644869
:civic-street-name :string 638079 9393 WEST PORTERS LAKE SOUTH RAWDON
:civic-street-suffix :string 574720 72752 RD RD RD
:civic-city-name :string 643321 4151 HALIFAX PORTERS LAKE HILLSVALE
:acreage :float64 644235 3237 -80.010000 22.77548066 3.23921000E+05 8.40222965E+02 177.93796144 2.470 1.850
:square-feet :int64 606490 40982 -1056000.000000 951048.04373197 8.15726340E+09 3.05072574E+07 143.20927713 107680 80642
:y-map-coordinate :float64 604633 42839 43.391144 45.07207757 4.72261700E+01 6.97087997E-01 0.11462748 44.72
:x-map-coordinate :float64 604633 42839 -66.392040 -63.35237711 -5.96961680E+01 1.66378362E+00 0.51926351 -63.31
:map-coordinates :string 604633 42839 (44.719344, -63.308412)

Since loading the data is so easy here, we’ll take a moment to introduce tablecloth. Tablecloth is the main data manipulation library in the Clojure data science ecosystem. It provides a dataframe-like abstraction for working with data and similar operations to those found in popular data analysis libraries in other languages, like pandas in Python or dplyr in R. It provides a familiar interface for data scientists used to working with tabular data in other ecosystems while leveraging Clojure’s strengths in functional programming and immutability. It’s built on top of tech.ml.dataset, Clojure’s main library for high-performance data processing.

We’ll use tablecloth throughout the rest of this section to clean up our data.

We want a single dataset that contains only information relevant to building our model. Right now the relevant information is spread across our three datasets. To see what information we care about, first we can inspect the column names from each dataset:

(tc/column-names sales-history)
(:municipal-unit
 :assessment-account-number
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-suffix
 :civic-city-name
 :sale-price
 :sale-date
 :parcels-in-sale
 :y-map-coordinate
 :x-map-coordinate
 :map-coordinates)
(tc/column-names characteristics)
(:municipal-unit
 :assessment-account-number
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-street-suffix
 :civic-city-name
 :living-units
 :year-built
 :square-foot-living-area
 :style
 :bedrooms
 :bathrooms
 :under-construction
 :construction-grade
 :finished-basement
 :garage
 :y-map-coordinate
 :x-map-coordinate
 :map-coordinates)
(tc/column-names land-sizes)
(:municipal-unit
 :assessment-account-number
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-street-suffix
 :civic-city-name
 :acreage
 :square-feet
 :y-map-coordinate
 :x-map-coordinate
 :map-coordinates)

There are methods we can use to determine which features are most relevant for our target, but for now we can use common sense to omit a few, like the map coordinates. We’ll use tablecloth to join all the datasets together by assessment account number:

(def housing-data
  (-> sales-history
      (tc/select-columns [:assessment-account-number
                          :municipal-unit
                          :civic-number
                          :civic-additional
                          :civic-direction
                          :civic-street-name
                          :civic-suffix
                          :civic-city-name
                          :sale-price
                          :sale-date
                          :parcels-in-sale])
      (tc/inner-join (tc/select-columns characteristics
                                        [:assessment-account-number
                                         :living-units
                                         :year-built
                                         :square-foot-living-area
                                         :style
                                         :bedrooms
                                         :bathrooms
                                         :under-construction
                                         :construction-grade
                                         :finished-basement
                                         :garage])
                     :assessment-account-number)
      (tc/inner-join (tc/select-columns land-sizes
                                        [:assessment-account-number
                                         :square-feet])
                     :assessment-account-number)))

Now we have a single, combined dataset to work with. We’ll do some basic data cleaning and integrity checks before saving it to work with in the next steps.

Data cleaning

Correcting datatypes

First, we can check that the datatypes of each column make sense:

(-> housing-data
    tc/info
    (tc/select-columns [:col-name :datatype]))

inner-join: descriptive-stats [22 2]:

:col-name :datatype
:assessment-account-number :int32
:municipal-unit :string
:civic-number :int32
:civic-additional :string
:civic-direction :string
:civic-street-name :string
:civic-suffix :string
:civic-city-name :string
:sale-price :int32
:sale-date :string
:parcels-in-sale :int16
:living-units :int16
:year-built :int16
:square-foot-living-area :int16
:style :string
:bedrooms :int16
:bathrooms :int16
:under-construction :string
:construction-grade :string
:finished-basement :string
:garage :string
:square-feet :int64

These mostly look good except for :sale-date, which should probably be either a number (if it’s a year) or some date format. We’ll also double check some of the string columns to make sure their values are coherent.

First, inspecting the :sale-date column:

(:sale-date housing-data)
#tech.v3.dataset.column<string>[363075]
:sale-date
[June 28 2018, July 22 2022, January 19 2023, June 28 2022, August 29 2019, October 09 2020, February 10 2023, June 13 2023, September 24 2020, October 08 2020, January 09 2017, May 11 2018, August 31 2022, October 15 2020, November 01 2021, July 16 2013, September 25 2014, September 25 2014, July 02 2019, March 10 2022...]

We can see it’s strings that should be full dates. Tablecloth makes it easy to parse the strings into a supported date format. First we’ll define a helper to parse these strings as dates using some of Java’s time utils:

(def date-format (DateTimeFormatter/ofPattern "MMMM dd yyyy"))
(defn- to-date [s]
  (LocalDate/parse s date-format))
#'book.part-0-end-to-end.1-data-import-and-cleanup-example/to-date

Then use tablecloth’s convert-types to efficiently process the whole column:

(-> housing-data
    (tc/convert-types :sale-date [[:packed-local-date to-date]]))

inner-join [363075 22]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :style :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE 225000 2018-06-28 2 1 1987 988 1 Storey 1 N Average N N 107680
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE 380000 2022-07-22 2 1 1987 988 1 Storey 1 N Average N N 107680
8502838 MUNICIPALITY OF THE DISTRICT OF ARGYLE 1125 DU TITTLE CH SURETTES ISLAND 0 2023-01-19 0 2020 640 2 Storey 1 0 N Low N N
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING 130000 2022-06-28 1 0 1632 2 Storey 4 1 Y Average N N 7562
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING 85000 2019-08-29 1 0 1632 2 Storey 4 1 Y Average N N 7562
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 8500 2020-10-09 2 1 340 1 Storey 1 1 N Low N N 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 0 2023-02-10 2 1 340 1 Storey 1 1 N Low N N 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 0 2023-06-13 2 1 340 1 Storey 1 1 N Low N N 5400
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR 0 2020-09-24 1 1 1910 1170 1.5 Storey 4 1 N Average N N 1320304
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR 0 2020-10-08 1 1 1910 1170 1.5 Storey 4 1 N Average N N 1320304
10863791 MUNI. OF THE DISTRICT OF LUNENBURG 320 EVELYN DR PARKDALE 90000 2021-02-26 1 1 2022 1106 1 Storey 1 2 N Average Y Y 118483
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 42000 2015-06-30 1 1 2017 1170 1.5 Storey 2 2 N Average N Y 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 69000 2011-06-23 1 1 2017 1170 1.5 Storey 2 2 N Average N Y 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 591500 2021-12-22 2 1 2017 1170 1.5 Storey 2 2 N Average N Y 51687
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 44900 2022-04-29 1 1 2022 1230 1 Storey 3 3 N Average N N 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 0 2018-10-25 2 1 2022 1230 1 Storey 3 3 N Average N N 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 37000 2020-06-04 2 1 2022 1230 1 Storey 3 3 N Average N N 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 51000 2021-07-19 2 1 2022 1230 1 Storey 3 3 N Average N N 48352
7667817 HALIFAX REGIONAL MUNICIPALITY (HRM) 56 LAKE RD PLEASANT VALLEY 0 2019-11-15 1 1 1996 1050 1.5 Storey 0 N Low N N 332798
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE 767575 2022-05-17 9 1 2023 2412 Split Entry 4 3 N Good Y Y 143400
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE 0 2022-01-31 9 1 2023 2412 Split Entry 4 3 N Good Y Y 143400

Now the sale date values can be worked with as dates, which is much better than strings. Next we’ll look at the remaining string columns (they’re always suspicious).

(-> housing-data
    (tc/convert-types :sale-date [[:packed-local-date to-date]])
    (tc/select-columns :type/string))

inner-join [363075 11]:

:municipal-unit :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :style :under-construction :construction-grade :finished-basement :garage
HALIFAX REGIONAL MUNICIPALITY (HRM) WEST PORTERS LAKE RD PORTERS LAKE 1 Storey N Average N N
HALIFAX REGIONAL MUNICIPALITY (HRM) WEST PORTERS LAKE RD PORTERS LAKE 1 Storey N Average N N
MUNICIPALITY OF THE DISTRICT OF ARGYLE DU TITTLE CH SURETTES ISLAND 2 Storey N Low N N
MUNICIPALITY OF THE COUNTY OF KINGS MAIN ST CANNING 2 Storey Y Average N N
MUNICIPALITY OF THE COUNTY OF KINGS MAIN ST CANNING 2 Storey Y Average N N
MUNI. OF THE DISTRICT OF GUYSBOROUGH HIGHWAY 316 CHARLOS COVE 1 Storey N Low N N
MUNI. OF THE DISTRICT OF GUYSBOROUGH HIGHWAY 316 CHARLOS COVE 1 Storey N Low N N
MUNI. OF THE DISTRICT OF GUYSBOROUGH HIGHWAY 316 CHARLOS COVE 1 Storey N Low N N
MUNICIPALITY OF THE COUNTY OF INVERNESS MABOU HARBOUR RD MABOU HARBOUR 1.5 Storey N Average N N
MUNICIPALITY OF THE COUNTY OF INVERNESS MABOU HARBOUR RD MABOU HARBOUR 1.5 Storey N Average N N
MUNI. OF THE DISTRICT OF LUNENBURG EVELYN DR PARKDALE 1 Storey N Average Y Y
HALIFAX REGIONAL MUNICIPALITY (HRM) KETCH HARBOUR RD KETCH HARBOUR 1.5 Storey N Average N Y
HALIFAX REGIONAL MUNICIPALITY (HRM) KETCH HARBOUR RD KETCH HARBOUR 1.5 Storey N Average N Y
HALIFAX REGIONAL MUNICIPALITY (HRM) KETCH HARBOUR RD KETCH HARBOUR 1.5 Storey N Average N Y
WEST HANTS REGIONAL MUNICIPALITY HIGHWAY 1 ARDOISE 1 Storey N Average N N
WEST HANTS REGIONAL MUNICIPALITY HIGHWAY 1 ARDOISE 1 Storey N Average N N
WEST HANTS REGIONAL MUNICIPALITY HIGHWAY 1 ARDOISE 1 Storey N Average N N
WEST HANTS REGIONAL MUNICIPALITY HIGHWAY 1 ARDOISE 1 Storey N Average N N
HALIFAX REGIONAL MUNICIPALITY (HRM) LAKE RD PLEASANT VALLEY 1.5 Storey N Low N N
HALIFAX REGIONAL MUNICIPALITY (HRM) MIDNIGHT RUN MIDDLE SACKVILLE Split Entry N Good Y Y
HALIFAX REGIONAL MUNICIPALITY (HRM) MIDNIGHT RUN MIDDLE SACKVILLE Split Entry N Good Y Y

All of the civic address columns are fine; we won’t use those in our model anyway. Some of the other ones – :under-construction, :finished-basement, and :garage – are boolean columns that use “Y”/“N” right now, which is ok. It’s clear and human readable, but we can easily convert these values to actual booleans, which are equally compact and human readable but more machine-friendly. This also has the added benefit of ensuring there are no bogus values, since each value must be only either true or false. Booleans are one of a few datatypes that have a built-in coercer, so we don’t need to write any special helpers to do it ourselves. We can add this to our previous type conversion using the syntax for multiple transformations:

(-> housing-data
    (tc/convert-types {:sale-date [:packed-local-date to-date]
                       :under-construction :boolean
                       :finished-basement :boolean
                       :garage :boolean}))

inner-join [363075 22]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :style :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE 225000 2018-06-28 2 1 1987 988 1 Storey 1 false Average false false 107680
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE 380000 2022-07-22 2 1 1987 988 1 Storey 1 false Average false false 107680
8502838 MUNICIPALITY OF THE DISTRICT OF ARGYLE 1125 DU TITTLE CH SURETTES ISLAND 0 2023-01-19 0 2020 640 2 Storey 1 0 false Low false false
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING 130000 2022-06-28 1 0 1632 2 Storey 4 1 true Average false false 7562
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING 85000 2019-08-29 1 0 1632 2 Storey 4 1 true Average false false 7562
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 8500 2020-10-09 2 1 340 1 Storey 1 1 false Low false false 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 0 2023-02-10 2 1 340 1 Storey 1 1 false Low false false 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 0 2023-06-13 2 1 340 1 Storey 1 1 false Low false false 5400
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR 0 2020-09-24 1 1 1910 1170 1.5 Storey 4 1 false Average false false 1320304
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR 0 2020-10-08 1 1 1910 1170 1.5 Storey 4 1 false Average false false 1320304
10863791 MUNI. OF THE DISTRICT OF LUNENBURG 320 EVELYN DR PARKDALE 90000 2021-02-26 1 1 2022 1106 1 Storey 1 2 false Average true true 118483
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 42000 2015-06-30 1 1 2017 1170 1.5 Storey 2 2 false Average false true 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 69000 2011-06-23 1 1 2017 1170 1.5 Storey 2 2 false Average false true 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 591500 2021-12-22 2 1 2017 1170 1.5 Storey 2 2 false Average false true 51687
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 44900 2022-04-29 1 1 2022 1230 1 Storey 3 3 false Average false false 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 0 2018-10-25 2 1 2022 1230 1 Storey 3 3 false Average false false 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 37000 2020-06-04 2 1 2022 1230 1 Storey 3 3 false Average false false 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 51000 2021-07-19 2 1 2022 1230 1 Storey 3 3 false Average false false 48352
7667817 HALIFAX REGIONAL MUNICIPALITY (HRM) 56 LAKE RD PLEASANT VALLEY 0 2019-11-15 1 1 1996 1050 1.5 Storey 0 false Low false false 332798
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE 767575 2022-05-17 9 1 2023 2412 Split Entry 4 3 false Good true true 143400
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE 0 2022-01-31 9 1 2023 2412 Split Entry 4 3 false Good true true 143400

Great. Now all of the datatypes in our dataset make sense. This was the lowest-hanging fruit in cleaning up this data. Next we’ll move on reshaping the dataset to make it tidy.

Reshaping data

Next, we’ll look at the :style column. This column is not useful in it’s current state. It contains a mix of information about the type of building and the number of stories, written as a string. This will require idiosyncratic and error-prone parsing by anyone who wants to do any analysis that involves this column. These are the distinct values the column contains:

(-> housing-data
    :style
    distinct)
("1 Storey"
 "2 Storey"
 "1.5 Storey"
 nil
 "Split Level"
 "2 Storey Townhouse"
 "Manufactured Home"
 "2 Storey Duplex"
 "Split Entry"
 "2 Storey Semi Detached"
 "2.5 Storey"
 "1.5 Storey Semi Detached"
 "1 Storey Duplex"
 "1 Storey Townhouse"
 "1 Storey Semi Detached"
 "3 Storey Triplex"
 "Additon"
 "1.5 Storey Duplex"
 "1.5 Storey Townhouse"
 "3 Storey"
 "2 Storey Triplex"
 "2.5 Storey Semi Detached"
 "1.5 Storey Triplex"
 "3 Storey Townhouse"
 "1 Storey Triplex"
 "1 Storey Quadruplex"
 "1.5 Storey Quadruplex"
 "3 Storey Semi Detached"
 "2 Storey Quadruplex"
 "2.5 Storey Townhouse"
 "3 Storey Duplex"
 "3 Storey Quadruplex"
 "2.5 Storey Duplex"
 "2.5 Storey Triplex"
 "2.5 Storey Quadruplex")

The first thing to notice here is that two of these values are not like the others. “Manufactured Home” and “Additon” are ambiguous styles. Since we’re not given any information about this column in the dataset metadata, we have to make some assumptions about what these values indicate. Maybe the dataset creators didn’t have enough information about these homes to describe them accurately, maybe they considered them somehow fundamentally different from the other house types and in a category of their own, or maybe something else. Either way, we’ll take these out-of-place values to indicate that these rows should be treated differently than the rest of the dataset.

If we were interested in carrying out this different treatment, we could move these rows to a separate dataset, but for our purposes here we can just drop the rows, building on the transformations from the previous step:

(-> housing-data
    (tc/convert-types {:sale-date [:packed-local-date to-date]
                       :under-construction :boolean
                       :finished-basement :boolean
                       :garage :boolean})
    (tc/drop-rows #(#{"Manufactured Home" "Additon"} (:style %))))

inner-join [351383 22]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :style :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE 225000 2018-06-28 2 1 1987 988 1 Storey 1 false Average false false 107680
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE 380000 2022-07-22 2 1 1987 988 1 Storey 1 false Average false false 107680
8502838 MUNICIPALITY OF THE DISTRICT OF ARGYLE 1125 DU TITTLE CH SURETTES ISLAND 0 2023-01-19 0 2020 640 2 Storey 1 0 false Low false false
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING 130000 2022-06-28 1 0 1632 2 Storey 4 1 true Average false false 7562
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING 85000 2019-08-29 1 0 1632 2 Storey 4 1 true Average false false 7562
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 8500 2020-10-09 2 1 340 1 Storey 1 1 false Low false false 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 0 2023-02-10 2 1 340 1 Storey 1 1 false Low false false 5400
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE 0 2023-06-13 2 1 340 1 Storey 1 1 false Low false false 5400
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR 0 2020-09-24 1 1 1910 1170 1.5 Storey 4 1 false Average false false 1320304
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR 0 2020-10-08 1 1 1910 1170 1.5 Storey 4 1 false Average false false 1320304
10863791 MUNI. OF THE DISTRICT OF LUNENBURG 320 EVELYN DR PARKDALE 90000 2021-02-26 1 1 2022 1106 1 Storey 1 2 false Average true true 118483
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 42000 2015-06-30 1 1 2017 1170 1.5 Storey 2 2 false Average false true 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 69000 2011-06-23 1 1 2017 1170 1.5 Storey 2 2 false Average false true 51687
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR 591500 2021-12-22 2 1 2017 1170 1.5 Storey 2 2 false Average false true 51687
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 44900 2022-04-29 1 1 2022 1230 1 Storey 3 3 false Average false false 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 0 2018-10-25 2 1 2022 1230 1 Storey 3 3 false Average false false 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 37000 2020-06-04 2 1 2022 1230 1 Storey 3 3 false Average false false 48352
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE 51000 2021-07-19 2 1 2022 1230 1 Storey 3 3 false Average false false 48352
7667817 HALIFAX REGIONAL MUNICIPALITY (HRM) 56 LAKE RD PLEASANT VALLEY 0 2019-11-15 1 1 1996 1050 1.5 Storey 0 false Low false false 332798
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE 767575 2022-05-17 9 1 2023 2412 Split Entry 4 3 false Good true true 143400
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE 0 2022-01-31 9 1 2023 2412 Split Entry 4 3 false Good true true 143400

Removing observations about different types of units from our dataset is one step in the right direction. To move another step closer to a tidy dataset, where each variable is a column and each observation is a row, the next thing we need to do is split up the two pieces of information in this column into two new ones. To do this we can use tablecloth’s reshaping tools, specifically separate-column in this case.

In order to properly handle all the possible values in the column, we’ll write a new helper to parse the number of stories and building type from the list of possible :style strings.

This also introduces a complicated but extremely common problem. We have to start inferring more things about the data. In other words, up to this point we have only been working with the raw data we found from the source. Splitting up this style column into multiple ones means making some assumptions, like what kind of building type do we assign to values that only have a number of stories? Given the context, we can assume it’s a detached home, but at this point we’re encoding some of our own knowledge into the dataset and it’s worth being explicit about it. We’ll also assume that a “Split Level” or “Split Entry” home has 2 stories.

Based on this understanding of the data, we’ll start with a little function to extract the number of stories and building type from any possible value in the :style column:

(defn parse-building-type [val]
  (case val
    nil nil
    "Split Level" [2 "Detached"]
    "Split Entry" [2 "Detached"]
    (let [[_ stories building-type] (re-matches #"(\d+\.?\d?) Storey\s?(.+)?" val)]
      [(parse-double stories) (or building-type "Detached")])))

Then we can use this helper with tablecloth’s separate-column to generate our two new columns. We’ll also bind this intermediate dataset to a new symbol to facilitate working with it in future steps:

(def partially-cleaned-housing-data
  (-> housing-data
      (tc/convert-types {:sale-date [:packed-local-date to-date]
                         :under-construction :boolean
                         :finished-basement :boolean
                         :garage :boolean})
      (tc/drop-rows #(#{"Manufactured Home" "Additon"} (:style %)))
      (tc/separate-column :style [:stories :building-type] parse-building-type)))

Data Integrity

Now that we’ve fixed up the structural or technical issues with the data, there is a whole class of new problems to deal with that have much less obvious solutions. For these next types of issues, we’ll need to actually look at the data and make some judgment calls about how to handle apparent errors.

Handling missing values

Most

Get columns with missing values

(-> partially-cleaned-housing-data
    tc/info
    (tc/select-rows (comp pos? :n-missing))
    :col-name)
#tech.v3.dataset.column<keyword>[15]
:col-name
[:civic-number, :civic-additional, :civic-direction, :civic-street-name, :civic-suffix, :civic-city-name, :sale-price, :parcels-in-sale, :year-built, :square-foot-living-area, :stories, :building-type, :bedrooms, :construction-grade, :square-feet]

Drop any numeric column with value of 0:

(def no-missing-values
  (let [cols-missing-values [:sale-price
                             :parcels-in-sale
                             :year-built
                             :square-foot-living-area
                             :stories
                             :building-type
                             :bedrooms
                             :construction-grade
                             :square-feet]
        numeric-cols (-> partially-cleaned-housing-data
                         (tc/select-columns :type/numerical)
                         tc/column-names)]
    (-> partially-cleaned-housing-data
        (tc/drop-rows #(some (fn [col-name]
                               (nil? (col-name %)))
                             cols-missing-values))
        (tc/drop-rows #(some (fn [col-name]
                               (zero? (col-name % 0)))
                             numeric-cols)))))

Drop missing sale price, also drop 0 sale price, these don’t provide the information we’re looking for

Check for duplicate values

Removing fully duplicate rows

The first thing we can do is check for entirely duplicate rows. The easiest way to do this is to use scicloj.tcutils.api/duplicate-rows

(tcu/duplicate-rows no-missing-values)

inner-join [0 23]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :stories :building-type :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet

In this case it turns out there are none, which is good. But it’s still possible we don’t want to include every row in the final dataset.

Selecting relevant rows

One key feature of tidy datasets, as mentioned above, is that they contain values collected about only a single type of observational unit. We’re interested in building a model to predict the price of single family homes in Nova Scotia, so we’ll want a dataset that contains only data about those types of homes. One clue that a dataset contains values about multiple types of things is values that get repeated in many rows.

We can inspect our dataset to see how many rows appear to be duplicates by grouping by the assessment account number and sale date. There should only be one sale per property per day, so we will want to select any rows that have more than one and investigate these rows:

(def potentially-duplicate-row-counts
  (-> no-missing-values
      (tc/group-by [:assessment-account-number :sale-date])
      (tc/aggregate {:count tc/row-count})
      (tc/select-rows #(< 1 (:count %)))))

(-> partially-cleaned-housing-data (tc/map-columns :sale-year :sale-date (memfn getYear)) (ploclo/layer-histogram {:=x :sale-year}))

TODO Explain

(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])))

inner-join [2184 24]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :stories :building-type :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet :count
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 155000 2014-08-06 1 2 2023 668 1.0 Detached 1 1 false Average false false 67954 2
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 155000 2014-08-06 1 2 1977 1390 1.0 Detached 3 2 false Average false true 67954 2
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 170000 2018-11-02 1 2 2023 668 1.0 Detached 1 1 false Average false false 67954 2
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 170000 2018-11-02 1 2 1977 1390 1.0 Detached 3 2 false Average false true 67954 2
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 93000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 1000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 2022 1008 1.0 Detached 1 2 true Average true true 15893 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 1903 864 1.5 Detached 3 1 false Average false true 15893 2
44296 HALIFAX REGIONAL MUNICIPALITY (HRM) 6555 CHESTER AVE HALIFAX 1200 2017-01-07 1 2 1943 1710 2.0 Detached 4 2 false Average false false 4343 2
44296 HALIFAX REGIONAL MUNICIPALITY (HRM) 6555 CHESTER AVE HALIFAX 315000 2017-01-07 1 2 1943 1710 2.0 Detached 4 2 false Average false false 4343 2
10939399 TOWN OF BERWICK 12 GALA DR BERWICK 70000 2022-01-10 2 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 6738 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 70000 2022-06-30 2 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 337283 2022-06-30 1 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10944831 MUNI. OF THE DISTRICT OF EAST HANTS 426 ESS RD UPPER NINE MILE RIVER 20000 2021-12-06 1 2 2023 1264 1.0 Detached 1 1 false Average false true 437725 2
10944831 MUNI. OF THE DISTRICT OF EAST HANTS 426 ESS RD UPPER NINE MILE RIVER 20000 2021-12-06 1 2 2022 1344 1.0 Detached 3 2 false Average false true 437725 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 330955 2022-02-24 1 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 240000 2022-02-24 4 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2
10991625 MUNICIPALITY OF THE COUNTY OF INVERNESS 2581 SHORE RD MARGAREE HARBOUR 600000 2022-06-03 1 2 1970 2380 1.5 Detached 4 3 false Average true false 125986 2
10991625 MUNICIPALITY OF THE COUNTY OF INVERNESS 2581 SHORE RD MARGAREE HARBOUR 600000 2022-06-03 1 2 1985 552 1.0 Detached 1 1 false Average false false 125986 2
11021141 MUNI. OF THE DISTRICT OF EAST HANTS 57 BRIANNA DR LANTZ 600000 2023-04-28 4 2 2023 2005 2.0 Detached 3 3 true Average false true 11320 2
11021141 MUNI. OF THE DISTRICT OF EAST HANTS 57 BRIANNA DR LANTZ 600000 2023-04-28 4 2 2023 2029 2.0 Detached 3 3 true Average false true 11320 2

Remove rows with more than one living unit, these aren’t in the category of thing that we want to model:

(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))))

inner-join [686 24]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :stories :building-type :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet :count
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 93000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 1000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 2022 1008 1.0 Detached 1 2 true Average true true 15893 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 1903 864 1.5 Detached 3 1 false Average false true 15893 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 459000 2014-01-06 1 1 1957 792 1.0 Detached 3 3 false Low false false 105851 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 459000 2014-01-06 1 1 1957 2558 1.0 Detached 3 2 false Average false true 105851 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 600000 2017-04-03 1 1 1957 792 1.0 Detached 3 3 false Low false false 105851 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 600000 2017-04-03 1 1 1957 2558 1.0 Detached 3 2 false Average false true 105851 2
50997 MUNICIPALITY OF THE COUNTY OF CUMBERLAND 10271 DURHAM ST PUGWASH 14000 2019-02-01 1 1 2022 448 1.0 Detached 1 1 false Fair false false 6400 2
50997 MUNICIPALITY OF THE COUNTY OF CUMBERLAND 10271 DURHAM ST PUGWASH 14000 2019-02-01 1 1 2023 704 1.0 Detached 1 1 false Fair false false 6400 2
10904447 MUNICIPALITY OF THE COUNTY OF ANTIGONISH 51 ANCHOR LANE HARBOUR CENTRE 215000 2020-10-20 1 1 2021 3088 2.0 Detached 1 1 false Good false true 383328 2
10905206 HALIFAX REGIONAL MUNICIPALITY (HRM) 166 TALUS AVE BEDFORD 1015000 2021-12-21 1 1 2021 4528 2.0 Detached 5 5 false Good true false 6458 2
10905206 HALIFAX REGIONAL MUNICIPALITY (HRM) 166 TALUS AVE BEDFORD 930355 2021-12-21 1 1 2021 4528 2.0 Detached 5 5 false Good true false 6458 2
10905311 HALIFAX REGIONAL MUNICIPALITY (HRM) 199 TALUS AVE BEDFORD 699943 2021-12-22 1 1 2021 3012 2.0 Detached 4 5 false Good true true 6364 2
10905311 HALIFAX REGIONAL MUNICIPALITY (HRM) 199 TALUS AVE BEDFORD 880000 2021-12-22 1 1 2021 3012 2.0 Detached 4 5 false Good true true 6364 2
10939399 TOWN OF BERWICK 12 GALA DR BERWICK 361828 2022-01-10 1 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 6738 2
10939399 TOWN OF BERWICK 12 GALA DR BERWICK 70000 2022-01-10 2 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 6738 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 70000 2022-06-30 2 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 337283 2022-06-30 1 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 330955 2022-02-24 1 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 240000 2022-02-24 4 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2

Leaves us with 686 rows to explain

(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))))

inner-join [686 24]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :stories :building-type :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet :count
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 93000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 1000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 2022 1008 1.0 Detached 1 2 true Average true true 15893 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 1903 864 1.5 Detached 3 1 false Average false true 15893 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 459000 2014-01-06 1 1 1957 792 1.0 Detached 3 3 false Low false false 105851 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 459000 2014-01-06 1 1 1957 2558 1.0 Detached 3 2 false Average false true 105851 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 600000 2017-04-03 1 1 1957 792 1.0 Detached 3 3 false Low false false 105851 2
47244 MUNI. OF THE DISTRICT OF LUNENBURG 28 CONRADIN DR CONQUERALL BANK 600000 2017-04-03 1 1 1957 2558 1.0 Detached 3 2 false Average false true 105851 2
50997 MUNICIPALITY OF THE COUNTY OF CUMBERLAND 10271 DURHAM ST PUGWASH 14000 2019-02-01 1 1 2022 448 1.0 Detached 1 1 false Fair false false 6400 2
50997 MUNICIPALITY OF THE COUNTY OF CUMBERLAND 10271 DURHAM ST PUGWASH 14000 2019-02-01 1 1 2023 704 1.0 Detached 1 1 false Fair false false 6400 2
10904447 MUNICIPALITY OF THE COUNTY OF ANTIGONISH 51 ANCHOR LANE HARBOUR CENTRE 215000 2020-10-20 1 1 2021 3088 2.0 Detached 1 1 false Good false true 383328 2
10905206 HALIFAX REGIONAL MUNICIPALITY (HRM) 166 TALUS AVE BEDFORD 1015000 2021-12-21 1 1 2021 4528 2.0 Detached 5 5 false Good true false 6458 2
10905206 HALIFAX REGIONAL MUNICIPALITY (HRM) 166 TALUS AVE BEDFORD 930355 2021-12-21 1 1 2021 4528 2.0 Detached 5 5 false Good true false 6458 2
10905311 HALIFAX REGIONAL MUNICIPALITY (HRM) 199 TALUS AVE BEDFORD 699943 2021-12-22 1 1 2021 3012 2.0 Detached 4 5 false Good true true 6364 2
10905311 HALIFAX REGIONAL MUNICIPALITY (HRM) 199 TALUS AVE BEDFORD 880000 2021-12-22 1 1 2021 3012 2.0 Detached 4 5 false Good true true 6364 2
10939399 TOWN OF BERWICK 12 GALA DR BERWICK 361828 2022-01-10 1 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 6738 2
10939399 TOWN OF BERWICK 12 GALA DR BERWICK 70000 2022-01-10 2 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 6738 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 70000 2022-06-30 2 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 337283 2022-06-30 1 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 330955 2022-02-24 1 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 240000 2022-02-24 4 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2

Many reasons why these remaining rows are wrong, deal with them case-by-case

Multiple sale prices and multiple number of parcels

TODO: Finish this – revisit sanity of explaining every possible error in the data. How much cleaning is necessary to do a meaningful analysis? TO DROP:

(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))
      (tc/select-columns [:assessment-account-number :sale-date :sale-price :parcels-in-sale])
      (tc/fold-by [:assessment-account-number :sale-date])
      (tc/select-rows #(and (not (apply = (:sale-price %)))
                            (not (apply = (:parcels-in-sale %)))))
      (tc/unroll [:sale-price :parcels-in-sale])
      :assessment-account-number))
#tech.v3.dataset.column<int64>[14]
:assessment-account-number
[931772, 931772, 2216272, 2216272, 4009185, 4009185, 10286433, 10286433, 10939399, 10939399, 10939429, 10939429, 10969220, 10969220]
(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))
      ;; (tc/select-columns [:assessment-account-number :sale-date :sale-price])
      (tc/fold-by [:assessment-account-number :sale-date])
      (tc/select-rows #(not (apply = (:sale-price %))))
      (tc/unroll [:sale-price])))

_unnamed [301 24]:

:assessment-account-number :sale-date :building-type :civic-street-name :square-foot-living-area :civic-city-name :civic-additional :civic-direction :under-construction :civic-suffix :parcels-in-sale :garage :stories :construction-grade :year-built :finished-basement :civic-number :bedrooms :count :square-feet :bathrooms :living-units :municipal-unit :sale-price
18341 2016-08-01 [Detached Detached] [MAIN MAIN] [972 972] [KENTVILLE KENTVILLE] [] [] [false false] [ST ST] [1 1] [false false] [1.5 1.5] [Average Average] [1888 1888] [false false] [242 242] [3 3] [2 2] [4413 4413] [2 2] [1 1] [TOWN OF KENTVILLE TOWN OF KENTVILLE] 93000
18341 2016-08-01 [Detached Detached] [MAIN MAIN] [972 972] [KENTVILLE KENTVILLE] [] [] [false false] [ST ST] [1 1] [false false] [1.5 1.5] [Average Average] [1888 1888] [false false] [242 242] [3 3] [2 2] [4413 4413] [2 2] [1 1] [TOWN OF KENTVILLE TOWN OF KENTVILLE] 1000
53155 2017-12-05 [Detached Detached] [ALLEN ALLEN] [600 600] [PORT GREVILLE PORT GREVILLE] [] [] [false false] [RD RD] [1 1] [false false] [1.0 1.0] [Low Low] [1973 1973] [false false] [78 78] [1 1] [2 2] [21780 21780] [1 1] [1 1] [MUNICIPALITY OF THE COUNTY OF CUMBERLAND 1200
MUNICIPALITY OF THE COUNTY OF CUMBERLAND]
53155 2017-12-05 [Detached Detached] [ALLEN ALLEN] [600 600] [PORT GREVILLE PORT GREVILLE] [] [] [false false] [RD RD] [1 1] [false false] [1.0 1.0] [Low Low] [1973 1973] [false false] [78 78] [1 1] [2 2] [21780 21780] [1 1] [1 1] [MUNICIPALITY OF THE COUNTY OF CUMBERLAND 41310
MUNICIPALITY OF THE COUNTY OF CUMBERLAND]
87246 2017-06-15 [Semi Detached Semi Detached] [WOOD WOOD] [720 720] [NEW WATERFORD NEW WATERFORD] [] [] [false false] [AVE AVE] [1 1] [false false] [1.5 1.5] [Average Average] [1910 1910] [false false] [3516 3516] [2 2] [2 2] [4932 4932] [1 1] [1 1] [CAPE BRETON REGIONAL MUNICIPALITY (CBRM) 39000
CAPE BRETON REGIONAL MUNICIPALITY (CBRM)]
87246 2017-06-15 [Semi Detached Semi Detached] [WOOD WOOD] [720 720] [NEW WATERFORD NEW WATERFORD] [] [] [false false] [AVE AVE] [1 1] [false false] [1.5 1.5] [Average Average] [1910 1910] [false false] [3516 3516] [2 2] [2 2] [4932 4932] [1 1] [1 1] [CAPE BRETON REGIONAL MUNICIPALITY (CBRM) 24000
CAPE BRETON REGIONAL MUNICIPALITY (CBRM)]
97004 2024-04-01 [Detached Detached] [MILL VILLAGE EAST MILL VILLAGE EAST] [680 680] [CHARLESTON CHARLESTON] [] [] [false false] [RD RD] [1 1] [false false] [1.0 1.0] [Average Average] [1932 1932] [false false] [606 606] [3 3] [2 2] [10890 10890] [1 1] [1 1] [QUEENS, REGION OF QUEENS MUNICIPALITY 138434
QUEENS, REGION OF QUEENS MUNICIPALITY]
97004 2024-04-01 [Detached Detached] [MILL VILLAGE EAST MILL VILLAGE EAST] [680 680] [CHARLESTON CHARLESTON] [] [] [false false] [RD RD] [1 1] [false false] [1.0 1.0] [Average Average] [1932 1932] [false false] [606 606] [3 3] [2 2] [10890 10890] [1 1] [1 1] [QUEENS, REGION OF QUEENS MUNICIPALITY 34609
QUEENS, REGION OF QUEENS MUNICIPALITY]
133949 2012-01-24 [Detached Detached] [HIGHWAY 8 HIGHWAY 8] [627 627] [KEMPT KEMPT] [] [] [false false] [] [1 1] [false false] [1.5 1.5] [Average Average] [1943 1943] [false false] [12153 12153] [3 3] [2 2] [130680 130680] [1 1] [1 1] [QUEENS, REGION OF QUEENS MUNICIPALITY 700
QUEENS, REGION OF QUEENS MUNICIPALITY]
133949 2012-01-24 [Detached Detached] [HIGHWAY 8 HIGHWAY 8] [627 627] [KEMPT KEMPT] [] [] [false false] [] [1 1] [false false] [1.5 1.5] [Average Average] [1943 1943] [false false] [12153 12153] [3 3] [2 2] [130680 130680] [1 1] [1 1] [QUEENS, REGION OF QUEENS MUNICIPALITY 625
QUEENS, REGION OF QUEENS MUNICIPALITY]
10679354 2017-05-03 [Detached Detached] [TRANSOM TRANSOM] [3679 3679] [HALIFAX HALIFAX] [] [] [false false] [DR DR] [1 1] [false false] [2.0 2.0] [Good Good] [2016 2016] [true true] [321 321] [4 4] [2 2] [5379 5379] [4 4] [1 1] [HALIFAX REGIONAL MUNICIPALITY (HRM) 570000
HALIFAX REGIONAL MUNICIPALITY (HRM)]
10905206 2021-12-21 [Detached Detached] [TALUS TALUS] [4528 4528] [BEDFORD BEDFORD] [] [] [false false] [AVE AVE] [1 1] [false false] [2.0 2.0] [Good Good] [2021 2021] [true true] [166 166] [5 5] [2 2] [6458 6458] [5 5] [1 1] [HALIFAX REGIONAL MUNICIPALITY (HRM) 1015000
HALIFAX REGIONAL MUNICIPALITY (HRM)]
10905206 2021-12-21 [Detached Detached] [TALUS TALUS] [4528 4528] [BEDFORD BEDFORD] [] [] [false false] [AVE AVE] [1 1] [false false] [2.0 2.0] [Good Good] [2021 2021] [true true] [166 166] [5 5] [2 2] [6458 6458] [5 5] [1 1] [HALIFAX REGIONAL MUNICIPALITY (HRM) 930355
HALIFAX REGIONAL MUNICIPALITY (HRM)]
10905311 2021-12-22 [Detached Detached] [TALUS TALUS] [3012 3012] [BEDFORD BEDFORD] [] [] [false false] [AVE AVE] [1 1] [true true] [2.0 2.0] [Good Good] [2021 2021] [true true] [199 199] [4 4] [2 2] [6364 6364] [5 5] [1 1] [HALIFAX REGIONAL MUNICIPALITY (HRM) 699943
HALIFAX REGIONAL MUNICIPALITY (HRM)]
10905311 2021-12-22 [Detached Detached] [TALUS TALUS] [3012 3012] [BEDFORD BEDFORD] [] [] [false false] [AVE AVE] [1 1] [true true] [2.0 2.0] [Good Good] [2021 2021] [true true] [199 199] [4 4] [2 2] [6364 6364] [5 5] [1 1] [HALIFAX REGIONAL MUNICIPALITY (HRM) 880000
HALIFAX REGIONAL MUNICIPALITY (HRM)]
10939399 2022-01-10 [Semi Detached Semi Detached] [GALA GALA] [1462 1462] [BERWICK BERWICK] [] [] [false false] [DR DR] [1 2] [true true] [1.0 1.0] [Average Average] [2021 2021] [false false] [12 12] [3 3] [2 2] [6738 6738] [2 2] [1 1] [TOWN OF BERWICK TOWN OF BERWICK] 361828
10939399 2022-01-10 [Semi Detached Semi Detached] [GALA GALA] [1462 1462] [BERWICK BERWICK] [] [] [false false] [DR DR] [1 2] [true true] [1.0 1.0] [Average Average] [2021 2021] [false false] [12 12] [3 3] [2 2] [6738 6738] [2 2] [1 1] [TOWN OF BERWICK TOWN OF BERWICK] 70000
10939429 2022-06-30 [Semi Detached Semi Detached] [GALA GALA] [1462 1462] [BERWICK BERWICK] [] [] [false false] [DR DR] [2 1] [true true] [1.0 1.0] [Average Average] [2021 2021] [false false] [13 13] [3 3] [2 2] [7693 7693] [2 2] [1 1] [TOWN OF BERWICK TOWN OF BERWICK] 70000
10939429 2022-06-30 [Semi Detached Semi Detached] [GALA GALA] [1462 1462] [BERWICK BERWICK] [] [] [false false] [DR DR] [2 1] [true true] [1.0 1.0] [Average Average] [2021 2021] [false false] [13 13] [3 3] [2 2] [7693 7693] [2 2] [1 1] [TOWN OF BERWICK TOWN OF BERWICK] 337283
10969220 2022-02-24 [Semi Detached Semi Detached] [ELIZABETH ELIZABETH] [1302 1302] [GARLANDS CROSSING GARLANDS CROSSING] [] [] [false false] [AVE AVE] [1 4] [true true] [1.0 1.0] [Average Average] [2021 2021] [false false] [36 36] [2 2] [2 2] [5016 5016] [2 2] [1 1] [WEST HANTS REGIONAL MUNICIPALITY WEST HANTS REGIONAL MUNICIPALITY] 330955
10969220 2022-02-24 [Semi Detached Semi Detached] [ELIZABETH ELIZABETH] [1302 1302] [GARLANDS CROSSING GARLANDS CROSSING] [] [] [false false] [AVE AVE] [1 4] [true true] [1.0 1.0] [Average Average] [2021 2021] [false false] [36 36] [2 2] [2 2] [5016 5016] [2 2] [1 1] [WEST HANTS REGIONAL MUNICIPALITY WEST HANTS REGIONAL MUNICIPALITY] 240000

We want to select the rows that indicate potentially duplicate sales in our original dataset and see what’s going on, so first we’ll create a new ID column that combines the assessment account number and sale date to help us identify potentially duplicate rows, then we’ll select the rows from the original dataset that match these unique ID values and examine these rows to understand what’s going on with the potential duplicates.

To efficiently filter our original dataset by the new ID column we’re adding, we can use an inner-join, to only include rows from the target dataset that have an id in our source dataset. First we’ll create the same composite ID column on our original dataset, then join it with the one the one containing our IDs of interest to efficiently filter by these IDs.

(def duplicate-rows
  (let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                          (tc/join-columns :id
                                                           [:assessment-account-number :sale-date]
                                                           {:result-type :seq}))]
    (-> partially-cleaned-housing-data
        (tc/map-columns :id [:assessment-account-number :sale-date] vector)
        (tc/inner-join potentially-duplicate-row-ids :id)
        (tc/drop-columns :id)
        (tc/order-by :assessment-account-number))))

We can check that this did what we expected by summing up the counts from our potentially duplicate sales dataset. It should equal the number of rows in our filtered original dataset:

(reduce + (:count potentially-duplicate-row-counts))
2184
(tc/row-count duplicate-rows)
2294

Now we need to handle these duplicate rows. We can just blindly delete them, but it’s better to try to understand what’s going on, in case there’s some valid reason for them to be duplicated, or something else we’re misunderstanding about our data. Eyeballing the data reveals some interesting things, like that some rows have a living-units value that isn’t 1 and some have multiple different year-built values for the same unit, but some visualisations will help to reveal more about the data.

duplicate-rows

inner-join [2294 24]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :sale-price :sale-date :parcels-in-sale :living-units :year-built :square-foot-living-area :stories :building-type :bedrooms :bathrooms :under-construction :construction-grade :finished-basement :garage :square-feet :count
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 170000 2018-11-02 1 2 1977 1390 1.0 Detached 3 2 false Average false true 67954 2
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 170000 2018-11-02 1 2 2023 668 1.0 Detached 1 1 false Average false false 67954 2
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 155000 2014-08-06 1 2 1977 1390 1.0 Detached 3 2 false Average false true 67954 2
9075 MUNI. OF THE DISTRICT OF LUNENBURG 491 OAKLAND RD INDIAN POINT 155000 2014-08-06 1 2 2023 668 1.0 Detached 1 1 false Average false false 67954 2
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 1000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
18341 TOWN OF KENTVILLE 242 MAIN ST KENTVILLE 93000 2016-08-01 1 1 1888 972 1.5 Detached 3 2 false Average false false 4413 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 1903 864 1.5 Detached 3 1 false Average false true 15893 2
43397 MUNI. OF THE DISTRICT OF EAST HANTS 775 HIGHWAY 1 MOUNT UNIACKE 123000 2015-10-29 2 1 2022 1008 1.0 Detached 1 2 true Average true true 15893 2
44296 HALIFAX REGIONAL MUNICIPALITY (HRM) 6555 CHESTER AVE HALIFAX 315000 2017-01-07 1 2 1943 1710 2.0 Detached 4 2 false Average false false 4343 2
44296 HALIFAX REGIONAL MUNICIPALITY (HRM) 6555 CHESTER AVE HALIFAX 1200 2017-01-07 1 2 1943 1710 2.0 Detached 4 2 false Average false false 4343 2
10939399 TOWN OF BERWICK 12 GALA DR BERWICK 361828 2022-01-10 1 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 6738 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 337283 2022-06-30 1 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10939429 TOWN OF BERWICK 13 GALA DR BERWICK 70000 2022-06-30 2 1 2021 1462 1.0 Semi Detached 3 2 false Average false true 7693 2
10944831 MUNI. OF THE DISTRICT OF EAST HANTS 426 ESS RD UPPER NINE MILE RIVER 20000 2021-12-06 1 2 2022 1344 1.0 Detached 3 2 false Average false true 437725 2
10944831 MUNI. OF THE DISTRICT OF EAST HANTS 426 ESS RD UPPER NINE MILE RIVER 20000 2021-12-06 1 2 2023 1264 1.0 Detached 1 1 false Average false true 437725 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 240000 2022-02-24 4 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2
10969220 WEST HANTS REGIONAL MUNICIPALITY 36 ELIZABETH AVE GARLANDS CROSSING 330955 2022-02-24 1 1 2021 1302 1.0 Semi Detached 2 2 false Average false true 5016 2
10991625 MUNICIPALITY OF THE COUNTY OF INVERNESS 2581 SHORE RD MARGAREE HARBOUR 600000 2022-06-03 1 2 1985 552 1.0 Detached 1 1 false Average false false 125986 2
10991625 MUNICIPALITY OF THE COUNTY OF INVERNESS 2581 SHORE RD MARGAREE HARBOUR 600000 2022-06-03 1 2 1970 2380 1.5 Detached 4 3 false Average true false 125986 2
11021141 MUNI. OF THE DISTRICT OF EAST HANTS 57 BRIANNA DR LANTZ 600000 2023-04-28 4 2 2023 2005 2.0 Detached 3 3 true Average false true 11320 2
11021141 MUNI. OF THE DISTRICT OF EAST HANTS 57 BRIANNA DR LANTZ 600000 2023-04-28 4 2 2023 2029 2.0 Detached 3 3 true Average false true 11320 2

We’ll use hanamicloth for our data visualisation. First we can check out the distribution of living unit sizes:

(-> duplicate-rows (ploclo/layer-histogram {:=x :living-units}) )

(-> partially-cleaned-housing-data
    (tc/unique-by [:assessment-account-number :sale-date])
    tc/row-count
 )
345716

clean up and steps to turn this data into tidy data before saving it to work with in the next steps.

First, we can

If we inspect this new, combined dataset we can see there are quite a few sparse columns:

(-> housing-data
    tc/info
    (tc/select-rows #(< 0 (:n-missing %)))
    (tc/select-columns [:col-name :n-missing]))

inner-join: descriptive-stats [14 2]:

:col-name :n-missing
:civic-number 3456
:civic-additional 361858
:civic-direction 361775
:civic-street-name 94
:civic-suffix 36727
:civic-city-name 98
:sale-price 1116
:parcels-in-sale 3774
:year-built 47337
:square-foot-living-area 17512
:style 17407
:bedrooms 36486
:construction-grade 17635
:square-feet 13451

Some of these were columns that were in more than one of the original dataset and in combining them, by taking only the value from the first dataset, we may have lost some information. We can tell this by checking the descriptive stats of one of the other datasets, which shows different numbers of missing values:

(-> characteristics
    tc/info
    (tc/select-rows #(< 0 (:n-missing %)))
    (tc/select-columns [:col-name :n-missing]))

characteristics: descriptive-stats [14 2]:

:col-name :n-missing
:civic-number 4967
:civic-additional 377922
:civic-direction 377760
:civic-street-name 550
:civic-street-suffix 40669
:civic-city-name 492
:year-built 49663
:square-foot-living-area 12753
:style 12656
:bedrooms 52235
:construction-grade 12832
:y-map-coordinate 24115
:x-map-coordinate 24115
:map-coordinates 24115

In this case there are more missing values in the characteristics dataset than in the combined one, so it may be the case that there isn’t actually any extra information in the characteristics dataset, but to be sure we can check. To do that we’ll combine the three datasets, then check the columns we care about for any rows where the value from the first dataset is nil, but either of the other two have a value.

First, to simplify some of this code we’ll define the list of columns the datasets share:

(def common-columns
  [:assessment-account-number
   :municipal-unit
   :civic-number
   :civic-additional
   :civic-direction
   :civic-street-name
   :civic-suffix
   :civic-city-name])

Then we can join the datasets together, keeping the duplicated columns this time. Tablecloth will default to disambiguating the duplicated column names by using the dataset names, which works fine in this case:

(def combined-potentially-duplicated-columns
  (-> sales-history
      (tc/select-columns common-columns)
      (tc/inner-join (tc/select-columns characteristics common-columns)
                     :assessment-account-number)
      (tc/inner-join (tc/select-columns land-sizes common-columns)
                     :assessment-account-number)))

Now we can check whether any of the other datasets have information for the missing values. For the sake of demonstration we’ll check the :municipal-unit column first:

(-> combined-potentially-duplicated-columns
    (tc/select-rows (fn [row]
                      (let [v1 (:municipal-unit row)
                            v2 (:characteristics.municipal-unit row)
                            v3 (:land-sizes.municipal-unit row)]
                        (and (nil? v1)
                             (not (nil? (or v2 v3))))))))

inner-join [0 20]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :characteristics.municipal-unit :characteristics.civic-number :characteristics.civic-additional :characteristics.civic-direction :characteristics.civic-street-name :characteristics.civic-city-name :land-sizes.municipal-unit :land-sizes.civic-number :land-sizes.civic-additional :land-sizes.civic-direction :land-sizes.civic-street-name :land-sizes.civic-city-name

It turns out there are actually no rows that have a missing value for :municipal-unit in the sales-history dataset but do have a value in either of the other two datasets. To check the rest of the columns more quickly we can extract our predicate into a helper and have it accept a column name as an argument. This function will take a row and a column name and return true if our primary dataset has no value in the given row and column and either of the secondary datasets contain do have a value:

(defn- non-nil-in-secondary-datasets? [row col-name]
  (let [v1 (col-name row)
        v2 ((->> col-name name (str "characteristics.") keyword) row)
        v3 ((->> col-name name (str "land-sizes.") keyword) row)]
    (and (nil? v1)
         (not (nil? (or v2 v3))))))
#'book.part-0-end-to-end.1-data-import-and-cleanup-example/non-nil-in-secondary-datasets?

And it order to check all the column names at ones, we can run this check over a given list of column names we want to check:

(defn- check-for-values-in-secondary-datasets [ds column-names] (reduce (fn [ds col-namel] () ds column-names)))

(defn secondary-datsets-have-missing-data [column-names row] (some identity (reduce (fn [row col-name] (non-nil-in-secondary-datasets? row col-name)) row column-names)))

(-> combined-potentially-duplicated-columns (tc/select-rows (partial secondary-datsets-have-missing-data common-columns)))

Now we’ll go through and try to fill in as much information as possible. We can start by filtering out any rows that do not have missing values in our base dataset in the columns we’re trying to fill in. To keep our processing pipeline neat, we’ll write a simple helper that will return true for rows that fit our criteria:

(defn resolve [v1 v2 v3] ;; (some identity args) (or v1 v2 v3) )

(defn consolidate-columns [ds column-names] (reduce (fn [ds col-name] (let [characteristics-col-name (->> col-name name (str “characteristics.”) keyword) land-sizes-col-name (->> col-name name (str “land-sizes.”) keyword)] (-> ds (tc/map-columns col-name [col-name characteristics-col-name land-sizes-col-name] resolve) (tc/drop-columns [characteristics-col-name land-sizes-col-name])))) ds column-names))

(-> combined-potentially-duplicated-columns ;; (tc/head 4) (consolidate-columns common-columns) tc/info (tc/select-rows #(< 0 (:n-missing %))) (tc/select-columns [:col-name :n-missing]))

(defn no-missing-info)

(-> combined-potentially-duplicated-columns (tc/head 5) (tc/drop-rows (fn [row] (every? identity (map (fn [col-name] (not (nil? (col-name row)))) )))) ;; tc/row-count )

that do not have any differences across the three datasets. This is going to be verbose to inline so we’ll write a small helper:

looking for rows where there is a missing value in the main dataset but there is a value for that column in another dataset

filter out rows where all three datasets agree

(defn datasets-concur? [row col]
  (println "*******************")
  (println (col row))
  (println ((->> col name (str "characteristic.") keyword) row))
  (println ((->> col name (str "land-sizes.") keyword) row))
  (println "*******************")

  (= (col row)
     ((->> col name (str "characteristic.") keyword) row)
     ((->> col name (str "land-sizes.") keyword) row)))
(defn datasets-same? [row col]
  (let [v1 (col row)
        v2 ((->> col name (str "characteristic.") keyword) row)
        v3 ((->> col name (str "land-sizes.") keyword) row)]
    ;; either all values are the same
    ;; (or
    ;;  (-> [v1 v2 v3] set count (= 1)))
    (->> [v1 v2 v3] set (remove nil?) count (= 1))))
(defn no-discrepancies [row]
  (every? identity
          (map (partial datasets-concur? row)
           [:municipal-unit
            :civic-number
            :civic-additional
            :civic-direction
            :civic-street-name
            :civic-suffix
            :civic-city-name])))
(defn has-discrepancy? [row col-name]
  ;; returns true if values differ and are not nil, otherwise false
  (let [v1 (col-name row)
        v2 ((->> col-name name (str "characteristic.") keyword) row)
        v3 ((->> col-name name (str "land-sizes.") keyword) row)]

    (->> [v1 v2 v3] set (remove nil?) count (<= 2))))
(defn with-discrepancies [row]
  (some identity (map (partial has-discrepancy? row) common-columns))
  )
(-> combined-potentially-duplicated-columns
    ;; (tc/head 100)
    ;; (tc/drop-rows no-discrepancies)
    (tc/select-rows with-discrepancies)

    ;; (tc/select-rows #(not= (:municipal-unit %)
    ;;                        ((->> :municipal-unit
    ;;                              name
    ;;                              (str "characteristic.")
    ;;                              keyword) %)
    ;;                        ((->> :municipal-unit
    ;;                              name
    ;;                              (str "land-sizes.")
    ;;                              keyword) %)))
    ;; tc/row-count

    )

inner-join [4 20]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :characteristics.municipal-unit :characteristics.civic-number :characteristics.civic-additional :characteristics.civic-direction :characteristics.civic-street-name :characteristics.civic-city-name :land-sizes.municipal-unit :land-sizes.civic-number :land-sizes.civic-additional :land-sizes.civic-direction :land-sizes.civic-street-name :land-sizes.civic-city-name
10823821 MUNI. OF THE DISTRICT OF EAST HANTS 257 MEEK ARM TRAIL EAST UNIACKE MUNI. OF THE DISTRICT OF EAST HANTS 0 MEEK ARM EAST UNIACKE MUNI. OF THE DISTRICT OF EAST HANTS 0 MEEK ARM EAST UNIACKE
10823821 MUNI. OF THE DISTRICT OF EAST HANTS 257 MEEK ARM TRAIL EAST UNIACKE MUNI. OF THE DISTRICT OF EAST HANTS 0 MEEK ARM EAST UNIACKE MUNI. OF THE DISTRICT OF EAST HANTS 0 MEEK ARM EAST UNIACKE
733814 MUNICIPALITY OF THE COUNTY OF INVERNESS 15308 CABOT TRAIL CHETICAMP MUNICIPALITY OF THE COUNTY OF INVERNESS 15308 CABOT CH�TICAMP MUNICIPALITY OF THE COUNTY OF INVERNESS 15308 CABOT CH�TICAMP
10823714 MUNI. OF THE DISTRICT OF EAST HANTS 173 MEEK ARM TRAIL EAST UNIACKE MUNI. OF THE DISTRICT OF EAST HANTS 0 MEEK ARM EAST UNIACKE MUNI. OF THE DISTRICT OF EAST HANTS 0 MEEK ARM EAST UNIACKE
combined-potentially-duplicated-columns

inner-join [363075 20]:

:assessment-account-number :municipal-unit :civic-number :civic-additional :civic-direction :civic-street-name :civic-suffix :civic-city-name :characteristics.municipal-unit :characteristics.civic-number :characteristics.civic-additional :characteristics.civic-direction :characteristics.civic-street-name :characteristics.civic-city-name :land-sizes.municipal-unit :land-sizes.civic-number :land-sizes.civic-additional :land-sizes.civic-direction :land-sizes.civic-street-name :land-sizes.civic-city-name
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE PORTERS LAKE HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE PORTERS LAKE
911593 HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE RD PORTERS LAKE HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE PORTERS LAKE HALIFAX REGIONAL MUNICIPALITY (HRM) 445 WEST PORTERS LAKE PORTERS LAKE
8502838 MUNICIPALITY OF THE DISTRICT OF ARGYLE 1125 DU TITTLE CH SURETTES ISLAND MUNICIPALITY OF THE DISTRICT OF ARGYLE 1125 DU TITTLE SURETTES ISLAND MUNICIPALITY OF THE DISTRICT OF ARGYLE 1125 DU TITTLE SURETTES ISLAND
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN CANNING MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN CANNING
3698858 MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN ST CANNING MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN CANNING MUNICIPALITY OF THE COUNTY OF KINGS 9739 MAIN CANNING
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE
3944026 MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE MUNI. OF THE DISTRICT OF GUYSBOROUGH 5231 HIGHWAY 316 CHARLOS COVE
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR MABOU HARBOUR MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR MABOU HARBOUR
3885003 MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR RD MABOU HARBOUR MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR MABOU HARBOUR MUNICIPALITY OF THE COUNTY OF INVERNESS 1394 MABOU HARBOUR MABOU HARBOUR
10863791 MUNI. OF THE DISTRICT OF LUNENBURG 320 EVELYN DR PARKDALE MUNI. OF THE DISTRICT OF LUNENBURG 320 EVELYN PARKDALE MUNI. OF THE DISTRICT OF LUNENBURG 320 EVELYN PARKDALE
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR KETCH HARBOUR HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR KETCH HARBOUR
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR KETCH HARBOUR HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR KETCH HARBOUR
2745755 HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR RD KETCH HARBOUR HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR KETCH HARBOUR HALIFAX REGIONAL MUNICIPALITY (HRM) 1178 KETCH HARBOUR KETCH HARBOUR
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE
10751993 WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE WEST HANTS REGIONAL MUNICIPALITY 7289 HIGHWAY 1 ARDOISE
7667817 HALIFAX REGIONAL MUNICIPALITY (HRM) 56 LAKE RD PLEASANT VALLEY HALIFAX REGIONAL MUNICIPALITY (HRM) 56 LAKE PLEASANT VALLEY HALIFAX REGIONAL MUNICIPALITY (HRM) 56 LAKE PLEASANT VALLEY
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT MIDDLE SACKVILLE HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT MIDDLE SACKVILLE
10986400 HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT RUN MIDDLE SACKVILLE HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT MIDDLE SACKVILLE HALIFAX REGIONAL MUNICIPALITY (HRM) 538 MIDNIGHT MIDDLE SACKVILLE
(->
  (tc/info characteristics)
 (tc/select-rows #(#{:municipal-unit
                     :civic-number
                     :civic-additional
                     :civic-direction
                     :civic-street-name
                     :civic-suffix
                     :civic-city-name}
                    (:col-name %)))
(tc/select-columns [:col-name :n-missing]))

characteristics: descriptive-stats [6 2]:

:col-name :n-missing
:municipal-unit 0
:civic-number 4967
:civic-additional 377922
:civic-direction 377760
:civic-street-name 550
:civic-city-name 492

(-> housing-data (haclo/layer-point #:haclo{:x :square-foot-living-area :y :sale-price :xscale {:type “log”}}))

Contains information licensed under the Open Data & Information Government Licence – PVSC & Participating Municipalities

Now that our dataset is assembled and cleaned up, I’ll write it to a new file for easier access in the next steps.

(tc/write-nippy! housing-data "data/end_to_end/housing-data.nippy")
nil
source: src/book/part_0_end_to_end/1_data_import_and_cleanup_example.clj