Data Import Example
TODO: Summary
To build a model that predicts house prices, we need data about recent sales of similar houses. Luckily the organization that tracks all of this data in my province has released it for public use under a very permissive open data license, so we’ll use that data to find what we’re looking for.
First we need the data about actual house sales. That’s available in this Parcel Sales History dataset. Standard tabular file formats like .csv just work in tablecloth, so there’s not much to worry about here with importing the data:
(def sales-history
  (tc/dataset "data/end_to_end/parcel_sales_history.csv"))We can get a quick summary of the dataset with tc/info:
(tc/info sales-history)data/end_to_end/parcel_sales_history.csv: descriptive-stats [14 12]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| Municipal Unit | :string | 553234 | 0 | HALIFAX REGIONAL MUNICIPALITY (HRM) | HALIFAX REGIONAL MUNICIPALITY (HRM) | MUNI. OF THE DISTRICT OF LUNENBURG | |||||
| Assessment Account Number | :int32 | 553234 | 0 | 396.000000 | 5464364.27661351 | 1.10765560E+07 | 3559869.76102067 | 0.18744869 | 10587948 | 3933504 | |
| Civic Number | :int32 | 407909 | 145325 | 0.000000 | 1249.59833198 | 5.19470000E+04 | 2989.77914396 | 6.12767133 | 75 | 918 | |
| Civic Additional | :string | 1365 | 551869 | A | |||||||
| Civic Direction | :string | 2224 | 551010 | ||||||||
| Civic Street Name | :string | 550304 | 2930 | HIGHWAY 1 | CENTRAL | NO 10 | |||||
| Civic Suffix | :string | 497196 | 56038 | RD | AVE | HWY | |||||
| Civic City Name | :string | 552342 | 892 | HALIFAX | HALIFAX | COOKVILLE | |||||
| Sale Price | :int32 | 550742 | 2492 | 0.000000 | 216081.15903999 | 3.70000000E+08 | 2006809.13640543 | 89.31325733 | 0 | 0 | |
| Sale Date | :string | 553234 | 0 | September 28 2012 | January 08 2021 | April 26 2024 | |||||
| Parcels In Sale | :int16 | 545757 | 7477 | 0.000000 | 1.89112554 | 1.30000000E+02 | 3.23739061 | 15.40358232 | 1 | 1 | |
| Y Map Coordinate | :float64 | 526436 | 26798 | 43.398415 | 45.03052545 | 4.70250420E+01 | 0.67659690 | 0.20517924 | 44.65 | 44.41 | |
| X Map Coordinate | :float64 | 526436 | 26798 | -66.392040 | -63.44632344 | -5.96961680E+01 | 1.60085428 | 0.61790586 | -63.64 | -64.55 | |
| Map Coordinates | :string | 526436 | 26798 | (44.651645, -63.639508) | (44.413596, -64.552656) | 
The column names are capitalized strings, which is fine but it’s easier and idiomatic to work with kebab-cased keywords as column-names in Clojure, so we’ll update our import to transform them on load. We can use the to-clean-keyword helper from tcutils to do this. It’s similar to the make_clean_names helper from R’s janitor package, which is commonly used when datasets are loaded to, well, clean up the column names. You can see the full implementation here, but basically it removes any non-ASCII characters in strings, and then kebab-case and keywordizes them.
Tablecloth also gives each dataset a name, which defaults to the file name. We’ll set ours to something more succinct here:
(def sales-history
  (tc/dataset "data/end_to_end/parcel_sales_history.csv"
              {:key-fn tcustr/to-clean-keyword
               :dataset-name "sales-history"}))(tc/info sales-history)sales-history: descriptive-stats [14 12]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| :municipal-unit | :string | 553234 | 0 | HALIFAX REGIONAL MUNICIPALITY (HRM) | HALIFAX REGIONAL MUNICIPALITY (HRM) | MUNI. OF THE DISTRICT OF LUNENBURG | |||||
| :assessment-account-number | :int32 | 553234 | 0 | 396.000000 | 5464364.27661351 | 1.10765560E+07 | 3559869.76102067 | 0.18744869 | 10587948 | 3933504 | |
| :civic-number | :int32 | 407909 | 145325 | 0.000000 | 1249.59833198 | 5.19470000E+04 | 2989.77914396 | 6.12767133 | 75 | 918 | |
| :civic-additional | :string | 1365 | 551869 | A | |||||||
| :civic-direction | :string | 2224 | 551010 | ||||||||
| :civic-street-name | :string | 550304 | 2930 | HIGHWAY 1 | CENTRAL | NO 10 | |||||
| :civic-suffix | :string | 497196 | 56038 | RD | AVE | HWY | |||||
| :civic-city-name | :string | 552342 | 892 | HALIFAX | HALIFAX | COOKVILLE | |||||
| :sale-price | :int32 | 550742 | 2492 | 0.000000 | 216081.15903999 | 3.70000000E+08 | 2006809.13640543 | 89.31325733 | 0 | 0 | |
| :sale-date | :string | 553234 | 0 | September 28 2012 | January 08 2021 | April 26 2024 | |||||
| :parcels-in-sale | :int16 | 545757 | 7477 | 0.000000 | 1.89112554 | 1.30000000E+02 | 3.23739061 | 15.40358232 | 1 | 1 | |
| :y-map-coordinate | :float64 | 526436 | 26798 | 43.398415 | 45.03052545 | 4.70250420E+01 | 0.67659690 | 0.20517924 | 44.65 | 44.41 | |
| :x-map-coordinate | :float64 | 526436 | 26798 | -66.392040 | -63.44632344 | -5.96961680E+01 | 1.60085428 | 0.61790586 | -63.64 | -64.55 | |
| :map-coordinates | :string | 526436 | 26798 | (44.651645, -63.639508) | (44.413596, -64.552656) | 
Here we can see that this dataset includes information about the sale of the house (e.g. address, sale date, sale price, etc), but no features of the house itself (e.g. number of bedrooms, bathrooms, garages, etc.). We could build a model that predicts sale price based solely on a house’s location, but basic common sense tells us there’s more that goes into the value of the home. It’s not obvious exactly what the biggest determinants of price are, but anyone who has shopped for a house before knows that things like the age of the home, its size and features matter. To get this information, we’ll need a second dataset: Residential Dwelling Characteristics.
We can load it the same way:
(def characteristics
  (tc/dataset "data/end_to_end/residential_dwelling_characteristics.csv"
              {:key-fn tcustr/to-clean-keyword
               :dataset-name "characteristics"}))Upon inspection we can see that this dataset includes the information about houses that we’re looking for, and it also as a column for “Assessment Account Number”, which we should be able to use to join it up to our other one.
(tc/info characteristics)characteristics: descriptive-stats [21 12]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| :municipal-unit | :string | 379101 | 0 | HALIFAX REGIONAL MUNICIPALITY (HRM) | MUNICIPALITY OF THE DISTRICT OF CLARE | HALIFAX REGIONAL MUNICIPALITY (HRM) | |||||
| :assessment-account-number | :int32 | 379101 | 0 | 485.000000 | 4599930.15107056 | 1.10765990E+07 | 3141825.67393673 | 0.49704026 | 1286048 | 5728266 | |
| :civic-number | :int32 | 374134 | 4967 | 0.000000 | 1245.77175825 | 9.05984000E+05 | 3355.53596798 | 56.94117266 | 358 | 20 | |
| :civic-additional | :string | 1179 | 377922 | ||||||||
| :civic-direction | :string | 1341 | 377760 | ||||||||
| :civic-street-name | :string | 378551 | 550 | HIGHWAY 1 | LITTLE BROOK | RITCEY | |||||
| :civic-street-suffix | :string | 338432 | 40669 | RD | RD | CRT | |||||
| :civic-city-name | :string | 378609 | 492 | HALIFAX | LITTLE BROOK STATION | PORTERS LAKE | |||||
| :living-units | :int16 | 379101 | 0 | 0.000000 | 1.19882828 | 5.71000000E+02 | 7.17163974 | 66.61425743 | 1 | 1 | |
| :year-built | :int16 | 329438 | 49663 | 200.000000 | 1978.24914551 | 2.04800000E+03 | 29.17932661 | -2.12051956 | 1976 | 1985 | |
| :square-foot-living-area | :int16 | 366348 | 12753 | 0.000000 | 1537.18592977 | 2.39310000E+04 | 750.57258849 | 1.95137428 | 2392 | 2338 | |
| :style | :string | 366445 | 12656 | 1 Storey | 2 Storey | 1.5 Storey | |||||
| :bedrooms | :int16 | 326866 | 52235 | 0.000000 | 3.01704368 | 2.40000000E+01 | 1.02378973 | 1.33747502 | 4 | 4 | |
| :bathrooms | :int16 | 379101 | 0 | 0.000000 | 1.68568007 | 2.50000000E+01 | 0.95784940 | 1.28420156 | 2 | 4 | |
| :under-construction | :string | 379101 | 0 | N | N | N | |||||
| :construction-grade | :string | 366269 | 12832 | Average | Average | Average | |||||
| :finished-basement | :string | 379101 | 0 | N | N | Y | |||||
| :garage | :string | 379101 | 0 | N | N | Y | |||||
| :y-map-coordinate | :float64 | 354986 | 24115 | 43.398415 | 45.04529417 | 4.70260480E+01 | 0.66341981 | 0.29089356 | 44.29 | 44.74 | |
| :x-map-coordinate | :float64 | 354986 | 24115 | -66.387871 | -63.36304830 | -5.97949100E+01 | 1.56388011 | 0.69450136 | -66.10 | -63.30 | |
| :map-coordinates | :string | 354986 | 24115 | (44.293843, -66.10439) | (44.743576, -63.302604) | 
There’s one other piece of information that I want for our model. In Nova Scotia, it’s not uncommon to own relatively large plots of land, and this can make house prices seem surprising. A small or very old house might seem like it’s priced too high, for example, if you don’t notice the size of the lot that it’s on. We can get this information, again indexed by the Assessment Account Number, from the Parcel Land Sizes dataset:
(def land-sizes
  (tc/dataset "data/end_to_end/parcel_land_sizes.csv"
              {:key-fn tcustr/to-clean-keyword
               :dataset-name "land-sizes"}))(tc/info land-sizes)land-sizes: descriptive-stats [13 12]:
| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| :municipal-unit | :string | 647472 | 0 | HALIFAX REGIONAL MUNICIPALITY (HRM) | HALIFAX REGIONAL MUNICIPALITY (HRM) | WEST HANTS REGIONAL MUNICIPALITY | |||||
| :assessment-account-number | :int32 | 647472 | 0 | 35.000000 | 5327202.15206990 | 1.10797510E+07 | 3.46299091E+06 | 0.22692123 | 911593 | 11000738 | |
| :civic-number | :int32 | 436880 | 210592 | 0.000000 | 1299.84449734 | 9.05984000E+05 | 3.38721369E+03 | 48.31175816 | 445 | ||
| :civic-additional | :string | 1462 | 646010 | ||||||||
| :civic-direction | :string | 2603 | 644869 | ||||||||
| :civic-street-name | :string | 638079 | 9393 | WEST PORTERS LAKE | SOUTH RAWDON | ||||||
| :civic-street-suffix | :string | 574720 | 72752 | RD | RD | RD | |||||
| :civic-city-name | :string | 643321 | 4151 | HALIFAX | PORTERS LAKE | HILLSVALE | |||||
| :acreage | :float64 | 644235 | 3237 | -80.010000 | 22.77548066 | 3.23921000E+05 | 8.40222965E+02 | 177.93796144 | 2.470 | 1.850 | |
| :square-feet | :int64 | 606490 | 40982 | -1056000.000000 | 951048.04373197 | 8.15726340E+09 | 3.05072574E+07 | 143.20927713 | 107680 | 80642 | |
| :y-map-coordinate | :float64 | 604633 | 42839 | 43.391144 | 45.07207757 | 4.72261700E+01 | 6.97087997E-01 | 0.11462748 | 44.72 | ||
| :x-map-coordinate | :float64 | 604633 | 42839 | -66.392040 | -63.35237711 | -5.96961680E+01 | 1.66378362E+00 | 0.51926351 | -63.31 | ||
| :map-coordinates | :string | 604633 | 42839 | (44.719344, -63.308412) | 
Since loading the data is so easy here, we’ll take a moment to introduce tablecloth. Tablecloth is the main data manipulation library in the Clojure data science ecosystem. It provides a dataframe-like abstraction for working with data and similar operations to those found in popular data analysis libraries in other languages, like pandas in Python or dplyr in R. It provides a familiar interface for data scientists used to working with tabular data in other ecosystems while leveraging Clojure’s strengths in functional programming and immutability. It’s built on top of tech.ml.dataset, Clojure’s main library for high-performance data processing.
- We’ll use tablecloth throughout the rest of this section to clean up our data.
- 
We want a single dataset that contains only information relevant to building our model. Right now the relevant information is spread across our three datasets. To see what information we care about, first we can inspect the column names from each dataset: 
(tc/column-names sales-history)(:municipal-unit
 :assessment-account-number
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-suffix
 :civic-city-name
 :sale-price
 :sale-date
 :parcels-in-sale
 :y-map-coordinate
 :x-map-coordinate
 :map-coordinates)(tc/column-names characteristics)(:municipal-unit
 :assessment-account-number
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-street-suffix
 :civic-city-name
 :living-units
 :year-built
 :square-foot-living-area
 :style
 :bedrooms
 :bathrooms
 :under-construction
 :construction-grade
 :finished-basement
 :garage
 :y-map-coordinate
 :x-map-coordinate
 :map-coordinates)(tc/column-names land-sizes)(:municipal-unit
 :assessment-account-number
 :civic-number
 :civic-additional
 :civic-direction
 :civic-street-name
 :civic-street-suffix
 :civic-city-name
 :acreage
 :square-feet
 :y-map-coordinate
 :x-map-coordinate
 :map-coordinates)There are methods we can use to determine which features are most relevant for our target, but for now we can use common sense to omit a few, like the map coordinates. We’ll use tablecloth to join all the datasets together by assessment account number:
(def housing-data
  (-> sales-history
      (tc/select-columns [:assessment-account-number
                          :municipal-unit
                          :civic-number
                          :civic-additional
                          :civic-direction
                          :civic-street-name
                          :civic-suffix
                          :civic-city-name
                          :sale-price
                          :sale-date
                          :parcels-in-sale])
      (tc/inner-join (tc/select-columns characteristics
                                        [:assessment-account-number
                                         :living-units
                                         :year-built
                                         :square-foot-living-area
                                         :style
                                         :bedrooms
                                         :bathrooms
                                         :under-construction
                                         :construction-grade
                                         :finished-basement
                                         :garage])
                     :assessment-account-number)
      (tc/inner-join (tc/select-columns land-sizes
                                        [:assessment-account-number
                                         :square-feet])
                     :assessment-account-number)))Now we have a single, combined dataset to work with. We’ll do some basic data cleaning and integrity checks before saving it to work with in the next steps.
Data cleaning
Correcting datatypes
First, we can check that the datatypes of each column make sense:
(-> housing-data
    tc/info
    (tc/select-columns [:col-name :datatype]))inner-join: descriptive-stats [22 2]:
| :col-name | :datatype | 
|---|---|
| :assessment-account-number | :int32 | 
| :municipal-unit | :string | 
| :civic-number | :int32 | 
| :civic-additional | :string | 
| :civic-direction | :string | 
| :civic-street-name | :string | 
| :civic-suffix | :string | 
| :civic-city-name | :string | 
| :sale-price | :int32 | 
| :sale-date | :string | 
| :parcels-in-sale | :int16 | 
| :living-units | :int16 | 
| :year-built | :int16 | 
| :square-foot-living-area | :int16 | 
| :style | :string | 
| :bedrooms | :int16 | 
| :bathrooms | :int16 | 
| :under-construction | :string | 
| :construction-grade | :string | 
| :finished-basement | :string | 
| :garage | :string | 
| :square-feet | :int64 | 
These mostly look good except for :sale-date, which should probably be either a number (if it’s a year) or some date format. We’ll also double check some of the string columns to make sure their values are coherent.
First, inspecting the :sale-date column:
(:sale-date housing-data)#tech.v3.dataset.column<string>[363075]
:sale-date
[June 28 2018, July 22 2022, January 19 2023, June 28 2022, August 29 2019, October 09 2020, February 10 2023, June 13 2023, September 24 2020, October 08 2020, January 09 2017, May 11 2018, August 31 2022, October 15 2020, November 01 2021, July 16 2013, September 25 2014, September 25 2014, July 02 2019, March 10 2022...]We can see it’s strings that should be full dates. Tablecloth makes it easy to parse the strings into a supported date format. First we’ll define a helper to parse these strings as dates using some of Java’s time utils:
(def date-format (DateTimeFormatter/ofPattern "MMMM dd yyyy"))(defn- to-date [s]
  (LocalDate/parse s date-format))#'book.part-0-end-to-end.1-data-import-and-cleanup-example/to-dateThen use tablecloth’s convert-types to efficiently process the whole column:
(-> housing-data
    (tc/convert-types :sale-date [[:packed-local-date to-date]]))inner-join [363075 22]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :style | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | 225000 | 2018-06-28 | 2 | 1 | 1987 | 988 | 1 Storey | 1 | N | Average | N | N | 107680 | |||
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | 380000 | 2022-07-22 | 2 | 1 | 1987 | 988 | 1 Storey | 1 | N | Average | N | N | 107680 | |||
| 8502838 | MUNICIPALITY OF THE DISTRICT OF ARGYLE | 1125 | DU TITTLE | CH | SURETTES ISLAND | 0 | 2023-01-19 | 0 | 2020 | 640 | 2 Storey | 1 | 0 | N | Low | N | N | ||||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | 130000 | 2022-06-28 | 1 | 0 | 1632 | 2 Storey | 4 | 1 | Y | Average | N | N | 7562 | |||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | 85000 | 2019-08-29 | 1 | 0 | 1632 | 2 Storey | 4 | 1 | Y | Average | N | N | 7562 | |||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 8500 | 2020-10-09 | 2 | 1 | 340 | 1 Storey | 1 | 1 | N | Low | N | N | 5400 | ||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 0 | 2023-02-10 | 2 | 1 | 340 | 1 Storey | 1 | 1 | N | Low | N | N | 5400 | ||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 0 | 2023-06-13 | 2 | 1 | 340 | 1 Storey | 1 | 1 | N | Low | N | N | 5400 | ||||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | 0 | 2020-09-24 | 1 | 1 | 1910 | 1170 | 1.5 Storey | 4 | 1 | N | Average | N | N | 1320304 | ||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | 0 | 2020-10-08 | 1 | 1 | 1910 | 1170 | 1.5 Storey | 4 | 1 | N | Average | N | N | 1320304 | ||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10863791 | MUNI. OF THE DISTRICT OF LUNENBURG | 320 | EVELYN | DR | PARKDALE | 90000 | 2021-02-26 | 1 | 1 | 2022 | 1106 | 1 Storey | 1 | 2 | N | Average | Y | Y | 118483 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 42000 | 2015-06-30 | 1 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | N | Average | N | Y | 51687 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 69000 | 2011-06-23 | 1 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | N | Average | N | Y | 51687 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 591500 | 2021-12-22 | 2 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | N | Average | N | Y | 51687 | ||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 44900 | 2022-04-29 | 1 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | N | Average | N | N | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 0 | 2018-10-25 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | N | Average | N | N | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 37000 | 2020-06-04 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | N | Average | N | N | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 51000 | 2021-07-19 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | N | Average | N | N | 48352 | |||
| 7667817 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 56 | LAKE | RD | PLEASANT VALLEY | 0 | 2019-11-15 | 1 | 1 | 1996 | 1050 | 1.5 Storey | 0 | N | Low | N | N | 332798 | |||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | 767575 | 2022-05-17 | 9 | 1 | 2023 | 2412 | Split Entry | 4 | 3 | N | Good | Y | Y | 143400 | ||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | 0 | 2022-01-31 | 9 | 1 | 2023 | 2412 | Split Entry | 4 | 3 | N | Good | Y | Y | 143400 | 
Now the sale date values can be worked with as dates, which is much better than strings. Next we’ll look at the remaining string columns (they’re always suspicious).
(-> housing-data
    (tc/convert-types :sale-date [[:packed-local-date to-date]])
    (tc/select-columns :type/string))inner-join [363075 11]:
| :municipal-unit | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :style | :under-construction | :construction-grade | :finished-basement | :garage | 
|---|---|---|---|---|---|---|---|---|---|---|
| HALIFAX REGIONAL MUNICIPALITY (HRM) | WEST PORTERS LAKE | RD | PORTERS LAKE | 1 Storey | N | Average | N | N | ||
| HALIFAX REGIONAL MUNICIPALITY (HRM) | WEST PORTERS LAKE | RD | PORTERS LAKE | 1 Storey | N | Average | N | N | ||
| MUNICIPALITY OF THE DISTRICT OF ARGYLE | DU TITTLE | CH | SURETTES ISLAND | 2 Storey | N | Low | N | N | ||
| MUNICIPALITY OF THE COUNTY OF KINGS | MAIN | ST | CANNING | 2 Storey | Y | Average | N | N | ||
| MUNICIPALITY OF THE COUNTY OF KINGS | MAIN | ST | CANNING | 2 Storey | Y | Average | N | N | ||
| MUNI. OF THE DISTRICT OF GUYSBOROUGH | HIGHWAY 316 | CHARLOS COVE | 1 Storey | N | Low | N | N | |||
| MUNI. OF THE DISTRICT OF GUYSBOROUGH | HIGHWAY 316 | CHARLOS COVE | 1 Storey | N | Low | N | N | |||
| MUNI. OF THE DISTRICT OF GUYSBOROUGH | HIGHWAY 316 | CHARLOS COVE | 1 Storey | N | Low | N | N | |||
| MUNICIPALITY OF THE COUNTY OF INVERNESS | MABOU HARBOUR | RD | MABOU HARBOUR | 1.5 Storey | N | Average | N | N | ||
| MUNICIPALITY OF THE COUNTY OF INVERNESS | MABOU HARBOUR | RD | MABOU HARBOUR | 1.5 Storey | N | Average | N | N | ||
| … | … | … | … | … | … | … | … | … | … | … | 
| MUNI. OF THE DISTRICT OF LUNENBURG | EVELYN | DR | PARKDALE | 1 Storey | N | Average | Y | Y | ||
| HALIFAX REGIONAL MUNICIPALITY (HRM) | KETCH HARBOUR | RD | KETCH HARBOUR | 1.5 Storey | N | Average | N | Y | ||
| HALIFAX REGIONAL MUNICIPALITY (HRM) | KETCH HARBOUR | RD | KETCH HARBOUR | 1.5 Storey | N | Average | N | Y | ||
| HALIFAX REGIONAL MUNICIPALITY (HRM) | KETCH HARBOUR | RD | KETCH HARBOUR | 1.5 Storey | N | Average | N | Y | ||
| WEST HANTS REGIONAL MUNICIPALITY | HIGHWAY 1 | ARDOISE | 1 Storey | N | Average | N | N | |||
| WEST HANTS REGIONAL MUNICIPALITY | HIGHWAY 1 | ARDOISE | 1 Storey | N | Average | N | N | |||
| WEST HANTS REGIONAL MUNICIPALITY | HIGHWAY 1 | ARDOISE | 1 Storey | N | Average | N | N | |||
| WEST HANTS REGIONAL MUNICIPALITY | HIGHWAY 1 | ARDOISE | 1 Storey | N | Average | N | N | |||
| HALIFAX REGIONAL MUNICIPALITY (HRM) | LAKE | RD | PLEASANT VALLEY | 1.5 Storey | N | Low | N | N | ||
| HALIFAX REGIONAL MUNICIPALITY (HRM) | MIDNIGHT | RUN | MIDDLE SACKVILLE | Split Entry | N | Good | Y | Y | ||
| HALIFAX REGIONAL MUNICIPALITY (HRM) | MIDNIGHT | RUN | MIDDLE SACKVILLE | Split Entry | N | Good | Y | Y | 
All of the civic address columns are fine; we won’t use those in our model anyway. Some of the other ones – :under-construction, :finished-basement, and :garage – are boolean columns that use “Y”/“N” right now, which is ok. It’s clear and human readable, but we can easily convert these values to actual booleans, which are equally compact and human readable but more machine-friendly. This also has the added benefit of ensuring there are no bogus values, since each value must be only either true or false. Booleans are one of a few datatypes that have a built-in coercer, so we don’t need to write any special helpers to do it ourselves. We can add this to our previous type conversion using the syntax for multiple transformations:
(-> housing-data
    (tc/convert-types {:sale-date [:packed-local-date to-date]
                       :under-construction :boolean
                       :finished-basement :boolean
                       :garage :boolean}))inner-join [363075 22]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :style | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | 225000 | 2018-06-28 | 2 | 1 | 1987 | 988 | 1 Storey | 1 | false | Average | false | false | 107680 | |||
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | 380000 | 2022-07-22 | 2 | 1 | 1987 | 988 | 1 Storey | 1 | false | Average | false | false | 107680 | |||
| 8502838 | MUNICIPALITY OF THE DISTRICT OF ARGYLE | 1125 | DU TITTLE | CH | SURETTES ISLAND | 0 | 2023-01-19 | 0 | 2020 | 640 | 2 Storey | 1 | 0 | false | Low | false | false | ||||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | 130000 | 2022-06-28 | 1 | 0 | 1632 | 2 Storey | 4 | 1 | true | Average | false | false | 7562 | |||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | 85000 | 2019-08-29 | 1 | 0 | 1632 | 2 Storey | 4 | 1 | true | Average | false | false | 7562 | |||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 8500 | 2020-10-09 | 2 | 1 | 340 | 1 Storey | 1 | 1 | false | Low | false | false | 5400 | ||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 0 | 2023-02-10 | 2 | 1 | 340 | 1 Storey | 1 | 1 | false | Low | false | false | 5400 | ||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 0 | 2023-06-13 | 2 | 1 | 340 | 1 Storey | 1 | 1 | false | Low | false | false | 5400 | ||||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | 0 | 2020-09-24 | 1 | 1 | 1910 | 1170 | 1.5 Storey | 4 | 1 | false | Average | false | false | 1320304 | ||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | 0 | 2020-10-08 | 1 | 1 | 1910 | 1170 | 1.5 Storey | 4 | 1 | false | Average | false | false | 1320304 | ||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10863791 | MUNI. OF THE DISTRICT OF LUNENBURG | 320 | EVELYN | DR | PARKDALE | 90000 | 2021-02-26 | 1 | 1 | 2022 | 1106 | 1 Storey | 1 | 2 | false | Average | true | true | 118483 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 42000 | 2015-06-30 | 1 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | false | Average | false | true | 51687 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 69000 | 2011-06-23 | 1 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | false | Average | false | true | 51687 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 591500 | 2021-12-22 | 2 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | false | Average | false | true | 51687 | ||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 44900 | 2022-04-29 | 1 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 0 | 2018-10-25 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 37000 | 2020-06-04 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 51000 | 2021-07-19 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 7667817 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 56 | LAKE | RD | PLEASANT VALLEY | 0 | 2019-11-15 | 1 | 1 | 1996 | 1050 | 1.5 Storey | 0 | false | Low | false | false | 332798 | |||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | 767575 | 2022-05-17 | 9 | 1 | 2023 | 2412 | Split Entry | 4 | 3 | false | Good | true | true | 143400 | ||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | 0 | 2022-01-31 | 9 | 1 | 2023 | 2412 | Split Entry | 4 | 3 | false | Good | true | true | 143400 | 
Great. Now all of the datatypes in our dataset make sense. This was the lowest-hanging fruit in cleaning up this data. Next we’ll move on reshaping the dataset to make it tidy.
Reshaping data
Next, we’ll look at the :style column. This column is not useful in it’s current state. It contains a mix of information about the type of building and the number of stories, written as a string. This will require idiosyncratic and error-prone parsing by anyone who wants to do any analysis that involves this column. These are the distinct values the column contains:
(-> housing-data
    :style
    distinct)("1 Storey"
 "2 Storey"
 "1.5 Storey"
 nil
 "Split Level"
 "2 Storey Townhouse"
 "Manufactured Home"
 "2 Storey Duplex"
 "Split Entry"
 "2 Storey Semi Detached"
 "2.5 Storey"
 "1.5 Storey Semi Detached"
 "1 Storey Duplex"
 "1 Storey Townhouse"
 "1 Storey Semi Detached"
 "3 Storey Triplex"
 "Additon"
 "1.5 Storey Duplex"
 "1.5 Storey Townhouse"
 "3 Storey"
 "2 Storey Triplex"
 "2.5 Storey Semi Detached"
 "1.5 Storey Triplex"
 "3 Storey Townhouse"
 "1 Storey Triplex"
 "1 Storey Quadruplex"
 "1.5 Storey Quadruplex"
 "3 Storey Semi Detached"
 "2 Storey Quadruplex"
 "2.5 Storey Townhouse"
 "3 Storey Duplex"
 "3 Storey Quadruplex"
 "2.5 Storey Duplex"
 "2.5 Storey Triplex"
 "2.5 Storey Quadruplex")The first thing to notice here is that two of these values are not like the others. “Manufactured Home” and “Additon” are ambiguous styles. Since we’re not given any information about this column in the dataset metadata, we have to make some assumptions about what these values indicate. Maybe the dataset creators didn’t have enough information about these homes to describe them accurately, maybe they considered them somehow fundamentally different from the other house types and in a category of their own, or maybe something else. Either way, we’ll take these out-of-place values to indicate that these rows should be treated differently than the rest of the dataset.
If we were interested in carrying out this different treatment, we could move these rows to a separate dataset, but for our purposes here we can just drop the rows, building on the transformations from the previous step:
(-> housing-data
    (tc/convert-types {:sale-date [:packed-local-date to-date]
                       :under-construction :boolean
                       :finished-basement :boolean
                       :garage :boolean})
    (tc/drop-rows #(#{"Manufactured Home" "Additon"} (:style %))))inner-join [351383 22]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :style | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | 225000 | 2018-06-28 | 2 | 1 | 1987 | 988 | 1 Storey | 1 | false | Average | false | false | 107680 | |||
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | 380000 | 2022-07-22 | 2 | 1 | 1987 | 988 | 1 Storey | 1 | false | Average | false | false | 107680 | |||
| 8502838 | MUNICIPALITY OF THE DISTRICT OF ARGYLE | 1125 | DU TITTLE | CH | SURETTES ISLAND | 0 | 2023-01-19 | 0 | 2020 | 640 | 2 Storey | 1 | 0 | false | Low | false | false | ||||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | 130000 | 2022-06-28 | 1 | 0 | 1632 | 2 Storey | 4 | 1 | true | Average | false | false | 7562 | |||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | 85000 | 2019-08-29 | 1 | 0 | 1632 | 2 Storey | 4 | 1 | true | Average | false | false | 7562 | |||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 8500 | 2020-10-09 | 2 | 1 | 340 | 1 Storey | 1 | 1 | false | Low | false | false | 5400 | ||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 0 | 2023-02-10 | 2 | 1 | 340 | 1 Storey | 1 | 1 | false | Low | false | false | 5400 | ||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | 0 | 2023-06-13 | 2 | 1 | 340 | 1 Storey | 1 | 1 | false | Low | false | false | 5400 | ||||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | 0 | 2020-09-24 | 1 | 1 | 1910 | 1170 | 1.5 Storey | 4 | 1 | false | Average | false | false | 1320304 | ||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | 0 | 2020-10-08 | 1 | 1 | 1910 | 1170 | 1.5 Storey | 4 | 1 | false | Average | false | false | 1320304 | ||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10863791 | MUNI. OF THE DISTRICT OF LUNENBURG | 320 | EVELYN | DR | PARKDALE | 90000 | 2021-02-26 | 1 | 1 | 2022 | 1106 | 1 Storey | 1 | 2 | false | Average | true | true | 118483 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 42000 | 2015-06-30 | 1 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | false | Average | false | true | 51687 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 69000 | 2011-06-23 | 1 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | false | Average | false | true | 51687 | ||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | 591500 | 2021-12-22 | 2 | 1 | 2017 | 1170 | 1.5 Storey | 2 | 2 | false | Average | false | true | 51687 | ||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 44900 | 2022-04-29 | 1 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 0 | 2018-10-25 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 37000 | 2020-06-04 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | 51000 | 2021-07-19 | 2 | 1 | 2022 | 1230 | 1 Storey | 3 | 3 | false | Average | false | false | 48352 | |||
| 7667817 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 56 | LAKE | RD | PLEASANT VALLEY | 0 | 2019-11-15 | 1 | 1 | 1996 | 1050 | 1.5 Storey | 0 | false | Low | false | false | 332798 | |||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | 767575 | 2022-05-17 | 9 | 1 | 2023 | 2412 | Split Entry | 4 | 3 | false | Good | true | true | 143400 | ||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | 0 | 2022-01-31 | 9 | 1 | 2023 | 2412 | Split Entry | 4 | 3 | false | Good | true | true | 143400 | 
Removing observations about different types of units from our dataset is one step in the right direction. To move another step closer to a tidy dataset, where each variable is a column and each observation is a row, the next thing we need to do is split up the two pieces of information in this column into two new ones. To do this we can use tablecloth’s reshaping tools, specifically separate-column in this case.
In order to properly handle all the possible values in the column, we’ll write a new helper to parse the number of stories and building type from the list of possible :style strings.
This also introduces a complicated but extremely common problem. We have to start inferring more things about the data. In other words, up to this point we have only been working with the raw data we found from the source. Splitting up this style column into multiple ones means making some assumptions, like what kind of building type do we assign to values that only have a number of stories? Given the context, we can assume it’s a detached home, but at this point we’re encoding some of our own knowledge into the dataset and it’s worth being explicit about it. We’ll also assume that a “Split Level” or “Split Entry” home has 2 stories.
Based on this understanding of the data, we’ll start with a little function to extract the number of stories and building type from any possible value in the :style column:
(defn parse-building-type [val]
  (case val
    nil nil
    "Split Level" [2 "Detached"]
    "Split Entry" [2 "Detached"]
    (let [[_ stories building-type] (re-matches #"(\d+\.?\d?) Storey\s?(.+)?" val)]
      [(parse-double stories) (or building-type "Detached")])))Then we can use this helper with tablecloth’s separate-column to generate our two new columns. We’ll also bind this intermediate dataset to a new symbol to facilitate working with it in future steps:
(def partially-cleaned-housing-data
  (-> housing-data
      (tc/convert-types {:sale-date [:packed-local-date to-date]
                         :under-construction :boolean
                         :finished-basement :boolean
                         :garage :boolean})
      (tc/drop-rows #(#{"Manufactured Home" "Additon"} (:style %)))
      (tc/separate-column :style [:stories :building-type] parse-building-type)))Data Integrity
Now that we’ve fixed up the structural or technical issues with the data, there is a whole class of new problems to deal with that have much less obvious solutions. For these next types of issues, we’ll need to actually look at the data and make some judgment calls about how to handle apparent errors.
Handling missing values
Most
Get columns with missing values
(-> partially-cleaned-housing-data
    tc/info
    (tc/select-rows (comp pos? :n-missing))
    :col-name)#tech.v3.dataset.column<keyword>[15]
:col-name
[:civic-number, :civic-additional, :civic-direction, :civic-street-name, :civic-suffix, :civic-city-name, :sale-price, :parcels-in-sale, :year-built, :square-foot-living-area, :stories, :building-type, :bedrooms, :construction-grade, :square-feet]Drop any numeric column with value of 0:
(def no-missing-values
  (let [cols-missing-values [:sale-price
                             :parcels-in-sale
                             :year-built
                             :square-foot-living-area
                             :stories
                             :building-type
                             :bedrooms
                             :construction-grade
                             :square-feet]
        numeric-cols (-> partially-cleaned-housing-data
                         (tc/select-columns :type/numerical)
                         tc/column-names)]
    (-> partially-cleaned-housing-data
        (tc/drop-rows #(some (fn [col-name]
                               (nil? (col-name %)))
                             cols-missing-values))
        (tc/drop-rows #(some (fn [col-name]
                               (zero? (col-name % 0)))
                             numeric-cols)))))Drop missing sale price, also drop 0 sale price, these don’t provide the information we’re looking for
Check for duplicate values
Removing fully duplicate rows
The first thing we can do is check for entirely duplicate rows. The easiest way to do this is to use scicloj.tcutils.api/duplicate-rows
(tcu/duplicate-rows no-missing-values)inner-join [0 23]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :stories | :building-type | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | 
|---|
In this case it turns out there are none, which is good. But it’s still possible we don’t want to include every row in the final dataset.
Selecting relevant rows
One key feature of tidy datasets, as mentioned above, is that they contain values collected about only a single type of observational unit. We’re interested in building a model to predict the price of single family homes in Nova Scotia, so we’ll want a dataset that contains only data about those types of homes. One clue that a dataset contains values about multiple types of things is values that get repeated in many rows.
We can inspect our dataset to see how many rows appear to be duplicates by grouping by the assessment account number and sale date. There should only be one sale per property per day, so we will want to select any rows that have more than one and investigate these rows:
(def potentially-duplicate-row-counts
  (-> no-missing-values
      (tc/group-by [:assessment-account-number :sale-date])
      (tc/aggregate {:count tc/row-count})
      (tc/select-rows #(< 1 (:count %)))))(-> partially-cleaned-housing-data (tc/map-columns :sale-year :sale-date (memfn getYear)) (ploclo/layer-histogram {:=x :sale-year}))
TODO Explain
(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])))inner-join [2184 24]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :stories | :building-type | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | :count | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 155000 | 2014-08-06 | 1 | 2 | 2023 | 668 | 1.0 | Detached | 1 | 1 | false | Average | false | false | 67954 | 2 | ||
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 155000 | 2014-08-06 | 1 | 2 | 1977 | 1390 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 67954 | 2 | ||
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 170000 | 2018-11-02 | 1 | 2 | 2023 | 668 | 1.0 | Detached | 1 | 1 | false | Average | false | false | 67954 | 2 | ||
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 170000 | 2018-11-02 | 1 | 2 | 1977 | 1390 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 67954 | 2 | ||
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 93000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 1000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 2022 | 1008 | 1.0 | Detached | 1 | 2 | true | Average | true | true | 15893 | 2 | |||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 1903 | 864 | 1.5 | Detached | 3 | 1 | false | Average | false | true | 15893 | 2 | |||
| 44296 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 6555 | CHESTER | AVE | HALIFAX | 1200 | 2017-01-07 | 1 | 2 | 1943 | 1710 | 2.0 | Detached | 4 | 2 | false | Average | false | false | 4343 | 2 | ||
| 44296 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 6555 | CHESTER | AVE | HALIFAX | 315000 | 2017-01-07 | 1 | 2 | 1943 | 1710 | 2.0 | Detached | 4 | 2 | false | Average | false | false | 4343 | 2 | ||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10939399 | TOWN OF BERWICK | 12 | GALA | DR | BERWICK | 70000 | 2022-01-10 | 2 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 6738 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 70000 | 2022-06-30 | 2 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 337283 | 2022-06-30 | 1 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10944831 | MUNI. OF THE DISTRICT OF EAST HANTS | 426 | ESS | RD | UPPER NINE MILE RIVER | 20000 | 2021-12-06 | 1 | 2 | 2023 | 1264 | 1.0 | Detached | 1 | 1 | false | Average | false | true | 437725 | 2 | ||
| 10944831 | MUNI. OF THE DISTRICT OF EAST HANTS | 426 | ESS | RD | UPPER NINE MILE RIVER | 20000 | 2021-12-06 | 1 | 2 | 2022 | 1344 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 437725 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 330955 | 2022-02-24 | 1 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 240000 | 2022-02-24 | 4 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | ||
| 10991625 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 2581 | SHORE | RD | MARGAREE HARBOUR | 600000 | 2022-06-03 | 1 | 2 | 1970 | 2380 | 1.5 | Detached | 4 | 3 | false | Average | true | false | 125986 | 2 | ||
| 10991625 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 2581 | SHORE | RD | MARGAREE HARBOUR | 600000 | 2022-06-03 | 1 | 2 | 1985 | 552 | 1.0 | Detached | 1 | 1 | false | Average | false | false | 125986 | 2 | ||
| 11021141 | MUNI. OF THE DISTRICT OF EAST HANTS | 57 | BRIANNA | DR | LANTZ | 600000 | 2023-04-28 | 4 | 2 | 2023 | 2005 | 2.0 | Detached | 3 | 3 | true | Average | false | true | 11320 | 2 | ||
| 11021141 | MUNI. OF THE DISTRICT OF EAST HANTS | 57 | BRIANNA | DR | LANTZ | 600000 | 2023-04-28 | 4 | 2 | 2023 | 2029 | 2.0 | Detached | 3 | 3 | true | Average | false | true | 11320 | 2 | 
Remove rows with more than one living unit, these aren’t in the category of thing that we want to model:
(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))))inner-join [686 24]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :stories | :building-type | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | :count | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 93000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 1000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 2022 | 1008 | 1.0 | Detached | 1 | 2 | true | Average | true | true | 15893 | 2 | |||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 1903 | 864 | 1.5 | Detached | 3 | 1 | false | Average | false | true | 15893 | 2 | |||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 459000 | 2014-01-06 | 1 | 1 | 1957 | 792 | 1.0 | Detached | 3 | 3 | false | Low | false | false | 105851 | 2 | ||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 459000 | 2014-01-06 | 1 | 1 | 1957 | 2558 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 105851 | 2 | ||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 600000 | 2017-04-03 | 1 | 1 | 1957 | 792 | 1.0 | Detached | 3 | 3 | false | Low | false | false | 105851 | 2 | ||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 600000 | 2017-04-03 | 1 | 1 | 1957 | 2558 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 105851 | 2 | ||
| 50997 | MUNICIPALITY OF THE COUNTY OF CUMBERLAND | 10271 | DURHAM | ST | PUGWASH | 14000 | 2019-02-01 | 1 | 1 | 2022 | 448 | 1.0 | Detached | 1 | 1 | false | Fair | false | false | 6400 | 2 | ||
| 50997 | MUNICIPALITY OF THE COUNTY OF CUMBERLAND | 10271 | DURHAM | ST | PUGWASH | 14000 | 2019-02-01 | 1 | 1 | 2023 | 704 | 1.0 | Detached | 1 | 1 | false | Fair | false | false | 6400 | 2 | ||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10904447 | MUNICIPALITY OF THE COUNTY OF ANTIGONISH | 51 | ANCHOR | LANE | HARBOUR CENTRE | 215000 | 2020-10-20 | 1 | 1 | 2021 | 3088 | 2.0 | Detached | 1 | 1 | false | Good | false | true | 383328 | 2 | ||
| 10905206 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 166 | TALUS | AVE | BEDFORD | 1015000 | 2021-12-21 | 1 | 1 | 2021 | 4528 | 2.0 | Detached | 5 | 5 | false | Good | true | false | 6458 | 2 | ||
| 10905206 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 166 | TALUS | AVE | BEDFORD | 930355 | 2021-12-21 | 1 | 1 | 2021 | 4528 | 2.0 | Detached | 5 | 5 | false | Good | true | false | 6458 | 2 | ||
| 10905311 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 199 | TALUS | AVE | BEDFORD | 699943 | 2021-12-22 | 1 | 1 | 2021 | 3012 | 2.0 | Detached | 4 | 5 | false | Good | true | true | 6364 | 2 | ||
| 10905311 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 199 | TALUS | AVE | BEDFORD | 880000 | 2021-12-22 | 1 | 1 | 2021 | 3012 | 2.0 | Detached | 4 | 5 | false | Good | true | true | 6364 | 2 | ||
| 10939399 | TOWN OF BERWICK | 12 | GALA | DR | BERWICK | 361828 | 2022-01-10 | 1 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 6738 | 2 | ||
| 10939399 | TOWN OF BERWICK | 12 | GALA | DR | BERWICK | 70000 | 2022-01-10 | 2 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 6738 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 70000 | 2022-06-30 | 2 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 337283 | 2022-06-30 | 1 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 330955 | 2022-02-24 | 1 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 240000 | 2022-02-24 | 4 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | 
Leaves us with 686 rows to explain
(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))))inner-join [686 24]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :stories | :building-type | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | :count | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 93000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 1000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 2022 | 1008 | 1.0 | Detached | 1 | 2 | true | Average | true | true | 15893 | 2 | |||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 1903 | 864 | 1.5 | Detached | 3 | 1 | false | Average | false | true | 15893 | 2 | |||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 459000 | 2014-01-06 | 1 | 1 | 1957 | 792 | 1.0 | Detached | 3 | 3 | false | Low | false | false | 105851 | 2 | ||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 459000 | 2014-01-06 | 1 | 1 | 1957 | 2558 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 105851 | 2 | ||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 600000 | 2017-04-03 | 1 | 1 | 1957 | 792 | 1.0 | Detached | 3 | 3 | false | Low | false | false | 105851 | 2 | ||
| 47244 | MUNI. OF THE DISTRICT OF LUNENBURG | 28 | CONRADIN | DR | CONQUERALL BANK | 600000 | 2017-04-03 | 1 | 1 | 1957 | 2558 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 105851 | 2 | ||
| 50997 | MUNICIPALITY OF THE COUNTY OF CUMBERLAND | 10271 | DURHAM | ST | PUGWASH | 14000 | 2019-02-01 | 1 | 1 | 2022 | 448 | 1.0 | Detached | 1 | 1 | false | Fair | false | false | 6400 | 2 | ||
| 50997 | MUNICIPALITY OF THE COUNTY OF CUMBERLAND | 10271 | DURHAM | ST | PUGWASH | 14000 | 2019-02-01 | 1 | 1 | 2023 | 704 | 1.0 | Detached | 1 | 1 | false | Fair | false | false | 6400 | 2 | ||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10904447 | MUNICIPALITY OF THE COUNTY OF ANTIGONISH | 51 | ANCHOR | LANE | HARBOUR CENTRE | 215000 | 2020-10-20 | 1 | 1 | 2021 | 3088 | 2.0 | Detached | 1 | 1 | false | Good | false | true | 383328 | 2 | ||
| 10905206 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 166 | TALUS | AVE | BEDFORD | 1015000 | 2021-12-21 | 1 | 1 | 2021 | 4528 | 2.0 | Detached | 5 | 5 | false | Good | true | false | 6458 | 2 | ||
| 10905206 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 166 | TALUS | AVE | BEDFORD | 930355 | 2021-12-21 | 1 | 1 | 2021 | 4528 | 2.0 | Detached | 5 | 5 | false | Good | true | false | 6458 | 2 | ||
| 10905311 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 199 | TALUS | AVE | BEDFORD | 699943 | 2021-12-22 | 1 | 1 | 2021 | 3012 | 2.0 | Detached | 4 | 5 | false | Good | true | true | 6364 | 2 | ||
| 10905311 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 199 | TALUS | AVE | BEDFORD | 880000 | 2021-12-22 | 1 | 1 | 2021 | 3012 | 2.0 | Detached | 4 | 5 | false | Good | true | true | 6364 | 2 | ||
| 10939399 | TOWN OF BERWICK | 12 | GALA | DR | BERWICK | 361828 | 2022-01-10 | 1 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 6738 | 2 | ||
| 10939399 | TOWN OF BERWICK | 12 | GALA | DR | BERWICK | 70000 | 2022-01-10 | 2 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 6738 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 70000 | 2022-06-30 | 2 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 337283 | 2022-06-30 | 1 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 330955 | 2022-02-24 | 1 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 240000 | 2022-02-24 | 4 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | 
Many reasons why these remaining rows are wrong, deal with them case-by-case
Multiple sale prices and multiple number of parcels
TODO: Finish this – revisit sanity of explaining every possible error in the data. How much cleaning is necessary to do a meaningful analysis? TO DROP:
(let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))
      (tc/select-columns [:assessment-account-number :sale-date :sale-price :parcels-in-sale])
      (tc/fold-by [:assessment-account-number :sale-date])
      (tc/select-rows #(and (not (apply = (:sale-price %)))
                            (not (apply = (:parcels-in-sale %)))))
      (tc/unroll [:sale-price :parcels-in-sale])
      :assessment-account-number))#tech.v3.dataset.column<int64>[14]
:assessment-account-number
[931772, 931772, 2216272, 2216272, 4009185, 4009185, 10286433, 10286433, 10939399, 10939399, 10939429, 10939429, 10969220, 10969220](let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                        (tc/join-columns :id
                                                         [:assessment-account-number :sale-date]
                                                         {:result-type :seq}))]
  (-> no-missing-values
      (tc/map-columns :id [:assessment-account-number :sale-date] vector)
      (tc/inner-join potentially-duplicate-row-ids :id)
      (tc/drop-columns :id)
      (tc/order-by [:assessment-account-number :sale-date])
      (tc/drop-rows #(> (:living-units %) 1))
      ;; (tc/select-columns [:assessment-account-number :sale-date :sale-price])
      (tc/fold-by [:assessment-account-number :sale-date])
      (tc/select-rows #(not (apply = (:sale-price %))))
      (tc/unroll [:sale-price])))_unnamed [301 24]:
| :assessment-account-number | :sale-date | :building-type | :civic-street-name | :square-foot-living-area | :civic-city-name | :civic-additional | :civic-direction | :under-construction | :civic-suffix | :parcels-in-sale | :garage | :stories | :construction-grade | :year-built | :finished-basement | :civic-number | :bedrooms | :count | :square-feet | :bathrooms | :living-units | :municipal-unit | :sale-price | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18341 | 2016-08-01 | [Detached Detached] | [MAIN MAIN] | [972 972] | [KENTVILLE KENTVILLE] | [] | [] | [false false] | [ST ST] | [1 1] | [false false] | [1.5 1.5] | [Average Average] | [1888 1888] | [false false] | [242 242] | [3 3] | [2 2] | [4413 4413] | [2 2] | [1 1] | [TOWN OF KENTVILLE TOWN OF KENTVILLE] | 93000 | 
| 18341 | 2016-08-01 | [Detached Detached] | [MAIN MAIN] | [972 972] | [KENTVILLE KENTVILLE] | [] | [] | [false false] | [ST ST] | [1 1] | [false false] | [1.5 1.5] | [Average Average] | [1888 1888] | [false false] | [242 242] | [3 3] | [2 2] | [4413 4413] | [2 2] | [1 1] | [TOWN OF KENTVILLE TOWN OF KENTVILLE] | 1000 | 
| 53155 | 2017-12-05 | [Detached Detached] | [ALLEN ALLEN] | [600 600] | [PORT GREVILLE PORT GREVILLE] | [] | [] | [false false] | [RD RD] | [1 1] | [false false] | [1.0 1.0] | [Low Low] | [1973 1973] | [false false] | [78 78] | [1 1] | [2 2] | [21780 21780] | [1 1] | [1 1] | [MUNICIPALITY OF THE COUNTY OF CUMBERLAND | 1200 | 
| MUNICIPALITY OF THE COUNTY OF CUMBERLAND] | |||||||||||||||||||||||
| 53155 | 2017-12-05 | [Detached Detached] | [ALLEN ALLEN] | [600 600] | [PORT GREVILLE PORT GREVILLE] | [] | [] | [false false] | [RD RD] | [1 1] | [false false] | [1.0 1.0] | [Low Low] | [1973 1973] | [false false] | [78 78] | [1 1] | [2 2] | [21780 21780] | [1 1] | [1 1] | [MUNICIPALITY OF THE COUNTY OF CUMBERLAND | 41310 | 
| MUNICIPALITY OF THE COUNTY OF CUMBERLAND] | |||||||||||||||||||||||
| 87246 | 2017-06-15 | [Semi Detached Semi Detached] | [WOOD WOOD] | [720 720] | [NEW WATERFORD NEW WATERFORD] | [] | [] | [false false] | [AVE AVE] | [1 1] | [false false] | [1.5 1.5] | [Average Average] | [1910 1910] | [false false] | [3516 3516] | [2 2] | [2 2] | [4932 4932] | [1 1] | [1 1] | [CAPE BRETON REGIONAL MUNICIPALITY (CBRM) | 39000 | 
| CAPE BRETON REGIONAL MUNICIPALITY (CBRM)] | |||||||||||||||||||||||
| 87246 | 2017-06-15 | [Semi Detached Semi Detached] | [WOOD WOOD] | [720 720] | [NEW WATERFORD NEW WATERFORD] | [] | [] | [false false] | [AVE AVE] | [1 1] | [false false] | [1.5 1.5] | [Average Average] | [1910 1910] | [false false] | [3516 3516] | [2 2] | [2 2] | [4932 4932] | [1 1] | [1 1] | [CAPE BRETON REGIONAL MUNICIPALITY (CBRM) | 24000 | 
| CAPE BRETON REGIONAL MUNICIPALITY (CBRM)] | |||||||||||||||||||||||
| 97004 | 2024-04-01 | [Detached Detached] | [MILL VILLAGE EAST MILL VILLAGE EAST] | [680 680] | [CHARLESTON CHARLESTON] | [] | [] | [false false] | [RD RD] | [1 1] | [false false] | [1.0 1.0] | [Average Average] | [1932 1932] | [false false] | [606 606] | [3 3] | [2 2] | [10890 10890] | [1 1] | [1 1] | [QUEENS, REGION OF QUEENS MUNICIPALITY | 138434 | 
| QUEENS, REGION OF QUEENS MUNICIPALITY] | |||||||||||||||||||||||
| 97004 | 2024-04-01 | [Detached Detached] | [MILL VILLAGE EAST MILL VILLAGE EAST] | [680 680] | [CHARLESTON CHARLESTON] | [] | [] | [false false] | [RD RD] | [1 1] | [false false] | [1.0 1.0] | [Average Average] | [1932 1932] | [false false] | [606 606] | [3 3] | [2 2] | [10890 10890] | [1 1] | [1 1] | [QUEENS, REGION OF QUEENS MUNICIPALITY | 34609 | 
| QUEENS, REGION OF QUEENS MUNICIPALITY] | |||||||||||||||||||||||
| 133949 | 2012-01-24 | [Detached Detached] | [HIGHWAY 8 HIGHWAY 8] | [627 627] | [KEMPT KEMPT] | [] | [] | [false false] | [] | [1 1] | [false false] | [1.5 1.5] | [Average Average] | [1943 1943] | [false false] | [12153 12153] | [3 3] | [2 2] | [130680 130680] | [1 1] | [1 1] | [QUEENS, REGION OF QUEENS MUNICIPALITY | 700 | 
| QUEENS, REGION OF QUEENS MUNICIPALITY] | |||||||||||||||||||||||
| 133949 | 2012-01-24 | [Detached Detached] | [HIGHWAY 8 HIGHWAY 8] | [627 627] | [KEMPT KEMPT] | [] | [] | [false false] | [] | [1 1] | [false false] | [1.5 1.5] | [Average Average] | [1943 1943] | [false false] | [12153 12153] | [3 3] | [2 2] | [130680 130680] | [1 1] | [1 1] | [QUEENS, REGION OF QUEENS MUNICIPALITY | 625 | 
| QUEENS, REGION OF QUEENS MUNICIPALITY] | |||||||||||||||||||||||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10679354 | 2017-05-03 | [Detached Detached] | [TRANSOM TRANSOM] | [3679 3679] | [HALIFAX HALIFAX] | [] | [] | [false false] | [DR DR] | [1 1] | [false false] | [2.0 2.0] | [Good Good] | [2016 2016] | [true true] | [321 321] | [4 4] | [2 2] | [5379 5379] | [4 4] | [1 1] | [HALIFAX REGIONAL MUNICIPALITY (HRM) | 570000 | 
| HALIFAX REGIONAL MUNICIPALITY (HRM)] | |||||||||||||||||||||||
| 10905206 | 2021-12-21 | [Detached Detached] | [TALUS TALUS] | [4528 4528] | [BEDFORD BEDFORD] | [] | [] | [false false] | [AVE AVE] | [1 1] | [false false] | [2.0 2.0] | [Good Good] | [2021 2021] | [true true] | [166 166] | [5 5] | [2 2] | [6458 6458] | [5 5] | [1 1] | [HALIFAX REGIONAL MUNICIPALITY (HRM) | 1015000 | 
| HALIFAX REGIONAL MUNICIPALITY (HRM)] | |||||||||||||||||||||||
| 10905206 | 2021-12-21 | [Detached Detached] | [TALUS TALUS] | [4528 4528] | [BEDFORD BEDFORD] | [] | [] | [false false] | [AVE AVE] | [1 1] | [false false] | [2.0 2.0] | [Good Good] | [2021 2021] | [true true] | [166 166] | [5 5] | [2 2] | [6458 6458] | [5 5] | [1 1] | [HALIFAX REGIONAL MUNICIPALITY (HRM) | 930355 | 
| HALIFAX REGIONAL MUNICIPALITY (HRM)] | |||||||||||||||||||||||
| 10905311 | 2021-12-22 | [Detached Detached] | [TALUS TALUS] | [3012 3012] | [BEDFORD BEDFORD] | [] | [] | [false false] | [AVE AVE] | [1 1] | [true true] | [2.0 2.0] | [Good Good] | [2021 2021] | [true true] | [199 199] | [4 4] | [2 2] | [6364 6364] | [5 5] | [1 1] | [HALIFAX REGIONAL MUNICIPALITY (HRM) | 699943 | 
| HALIFAX REGIONAL MUNICIPALITY (HRM)] | |||||||||||||||||||||||
| 10905311 | 2021-12-22 | [Detached Detached] | [TALUS TALUS] | [3012 3012] | [BEDFORD BEDFORD] | [] | [] | [false false] | [AVE AVE] | [1 1] | [true true] | [2.0 2.0] | [Good Good] | [2021 2021] | [true true] | [199 199] | [4 4] | [2 2] | [6364 6364] | [5 5] | [1 1] | [HALIFAX REGIONAL MUNICIPALITY (HRM) | 880000 | 
| HALIFAX REGIONAL MUNICIPALITY (HRM)] | |||||||||||||||||||||||
| 10939399 | 2022-01-10 | [Semi Detached Semi Detached] | [GALA GALA] | [1462 1462] | [BERWICK BERWICK] | [] | [] | [false false] | [DR DR] | [1 2] | [true true] | [1.0 1.0] | [Average Average] | [2021 2021] | [false false] | [12 12] | [3 3] | [2 2] | [6738 6738] | [2 2] | [1 1] | [TOWN OF BERWICK TOWN OF BERWICK] | 361828 | 
| 10939399 | 2022-01-10 | [Semi Detached Semi Detached] | [GALA GALA] | [1462 1462] | [BERWICK BERWICK] | [] | [] | [false false] | [DR DR] | [1 2] | [true true] | [1.0 1.0] | [Average Average] | [2021 2021] | [false false] | [12 12] | [3 3] | [2 2] | [6738 6738] | [2 2] | [1 1] | [TOWN OF BERWICK TOWN OF BERWICK] | 70000 | 
| 10939429 | 2022-06-30 | [Semi Detached Semi Detached] | [GALA GALA] | [1462 1462] | [BERWICK BERWICK] | [] | [] | [false false] | [DR DR] | [2 1] | [true true] | [1.0 1.0] | [Average Average] | [2021 2021] | [false false] | [13 13] | [3 3] | [2 2] | [7693 7693] | [2 2] | [1 1] | [TOWN OF BERWICK TOWN OF BERWICK] | 70000 | 
| 10939429 | 2022-06-30 | [Semi Detached Semi Detached] | [GALA GALA] | [1462 1462] | [BERWICK BERWICK] | [] | [] | [false false] | [DR DR] | [2 1] | [true true] | [1.0 1.0] | [Average Average] | [2021 2021] | [false false] | [13 13] | [3 3] | [2 2] | [7693 7693] | [2 2] | [1 1] | [TOWN OF BERWICK TOWN OF BERWICK] | 337283 | 
| 10969220 | 2022-02-24 | [Semi Detached Semi Detached] | [ELIZABETH ELIZABETH] | [1302 1302] | [GARLANDS CROSSING GARLANDS CROSSING] | [] | [] | [false false] | [AVE AVE] | [1 4] | [true true] | [1.0 1.0] | [Average Average] | [2021 2021] | [false false] | [36 36] | [2 2] | [2 2] | [5016 5016] | [2 2] | [1 1] | [WEST HANTS REGIONAL MUNICIPALITY WEST HANTS REGIONAL MUNICIPALITY] | 330955 | 
| 10969220 | 2022-02-24 | [Semi Detached Semi Detached] | [ELIZABETH ELIZABETH] | [1302 1302] | [GARLANDS CROSSING GARLANDS CROSSING] | [] | [] | [false false] | [AVE AVE] | [1 4] | [true true] | [1.0 1.0] | [Average Average] | [2021 2021] | [false false] | [36 36] | [2 2] | [2 2] | [5016 5016] | [2 2] | [1 1] | [WEST HANTS REGIONAL MUNICIPALITY WEST HANTS REGIONAL MUNICIPALITY] | 240000 | 
We want to select the rows that indicate potentially duplicate sales in our original dataset and see what’s going on, so first we’ll create a new ID column that combines the assessment account number and sale date to help us identify potentially duplicate rows, then we’ll select the rows from the original dataset that match these unique ID values and examine these rows to understand what’s going on with the potential duplicates.
To efficiently filter our original dataset by the new ID column we’re adding, we can use an inner-join, to only include rows from the target dataset that have an id in our source dataset. First we’ll create the same composite ID column on our original dataset, then join it with the one the one containing our IDs of interest to efficiently filter by these IDs.
(def duplicate-rows
  (let [potentially-duplicate-row-ids (-> potentially-duplicate-row-counts
                                          (tc/join-columns :id
                                                           [:assessment-account-number :sale-date]
                                                           {:result-type :seq}))]
    (-> partially-cleaned-housing-data
        (tc/map-columns :id [:assessment-account-number :sale-date] vector)
        (tc/inner-join potentially-duplicate-row-ids :id)
        (tc/drop-columns :id)
        (tc/order-by :assessment-account-number))))We can check that this did what we expected by summing up the counts from our potentially duplicate sales dataset. It should equal the number of rows in our filtered original dataset:
(reduce + (:count potentially-duplicate-row-counts))2184(tc/row-count duplicate-rows)2294Now we need to handle these duplicate rows. We can just blindly delete them, but it’s better to try to understand what’s going on, in case there’s some valid reason for them to be duplicated, or something else we’re misunderstanding about our data. Eyeballing the data reveals some interesting things, like that some rows have a living-units value that isn’t 1 and some have multiple different year-built values for the same unit, but some visualisations will help to reveal more about the data.
duplicate-rowsinner-join [2294 24]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :sale-price | :sale-date | :parcels-in-sale | :living-units | :year-built | :square-foot-living-area | :stories | :building-type | :bedrooms | :bathrooms | :under-construction | :construction-grade | :finished-basement | :garage | :square-feet | :count | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 170000 | 2018-11-02 | 1 | 2 | 1977 | 1390 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 67954 | 2 | ||
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 170000 | 2018-11-02 | 1 | 2 | 2023 | 668 | 1.0 | Detached | 1 | 1 | false | Average | false | false | 67954 | 2 | ||
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 155000 | 2014-08-06 | 1 | 2 | 1977 | 1390 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 67954 | 2 | ||
| 9075 | MUNI. OF THE DISTRICT OF LUNENBURG | 491 | OAKLAND | RD | INDIAN POINT | 155000 | 2014-08-06 | 1 | 2 | 2023 | 668 | 1.0 | Detached | 1 | 1 | false | Average | false | false | 67954 | 2 | ||
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 1000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 18341 | TOWN OF KENTVILLE | 242 | MAIN | ST | KENTVILLE | 93000 | 2016-08-01 | 1 | 1 | 1888 | 972 | 1.5 | Detached | 3 | 2 | false | Average | false | false | 4413 | 2 | ||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 1903 | 864 | 1.5 | Detached | 3 | 1 | false | Average | false | true | 15893 | 2 | |||
| 43397 | MUNI. OF THE DISTRICT OF EAST HANTS | 775 | HIGHWAY 1 | MOUNT UNIACKE | 123000 | 2015-10-29 | 2 | 1 | 2022 | 1008 | 1.0 | Detached | 1 | 2 | true | Average | true | true | 15893 | 2 | |||
| 44296 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 6555 | CHESTER | AVE | HALIFAX | 315000 | 2017-01-07 | 1 | 2 | 1943 | 1710 | 2.0 | Detached | 4 | 2 | false | Average | false | false | 4343 | 2 | ||
| 44296 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 6555 | CHESTER | AVE | HALIFAX | 1200 | 2017-01-07 | 1 | 2 | 1943 | 1710 | 2.0 | Detached | 4 | 2 | false | Average | false | false | 4343 | 2 | ||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10939399 | TOWN OF BERWICK | 12 | GALA | DR | BERWICK | 361828 | 2022-01-10 | 1 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 6738 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 337283 | 2022-06-30 | 1 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10939429 | TOWN OF BERWICK | 13 | GALA | DR | BERWICK | 70000 | 2022-06-30 | 2 | 1 | 2021 | 1462 | 1.0 | Semi Detached | 3 | 2 | false | Average | false | true | 7693 | 2 | ||
| 10944831 | MUNI. OF THE DISTRICT OF EAST HANTS | 426 | ESS | RD | UPPER NINE MILE RIVER | 20000 | 2021-12-06 | 1 | 2 | 2022 | 1344 | 1.0 | Detached | 3 | 2 | false | Average | false | true | 437725 | 2 | ||
| 10944831 | MUNI. OF THE DISTRICT OF EAST HANTS | 426 | ESS | RD | UPPER NINE MILE RIVER | 20000 | 2021-12-06 | 1 | 2 | 2023 | 1264 | 1.0 | Detached | 1 | 1 | false | Average | false | true | 437725 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 240000 | 2022-02-24 | 4 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | ||
| 10969220 | WEST HANTS REGIONAL MUNICIPALITY | 36 | ELIZABETH | AVE | GARLANDS CROSSING | 330955 | 2022-02-24 | 1 | 1 | 2021 | 1302 | 1.0 | Semi Detached | 2 | 2 | false | Average | false | true | 5016 | 2 | ||
| 10991625 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 2581 | SHORE | RD | MARGAREE HARBOUR | 600000 | 2022-06-03 | 1 | 2 | 1985 | 552 | 1.0 | Detached | 1 | 1 | false | Average | false | false | 125986 | 2 | ||
| 10991625 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 2581 | SHORE | RD | MARGAREE HARBOUR | 600000 | 2022-06-03 | 1 | 2 | 1970 | 2380 | 1.5 | Detached | 4 | 3 | false | Average | true | false | 125986 | 2 | ||
| 11021141 | MUNI. OF THE DISTRICT OF EAST HANTS | 57 | BRIANNA | DR | LANTZ | 600000 | 2023-04-28 | 4 | 2 | 2023 | 2005 | 2.0 | Detached | 3 | 3 | true | Average | false | true | 11320 | 2 | ||
| 11021141 | MUNI. OF THE DISTRICT OF EAST HANTS | 57 | BRIANNA | DR | LANTZ | 600000 | 2023-04-28 | 4 | 2 | 2023 | 2029 | 2.0 | Detached | 3 | 3 | true | Average | false | true | 11320 | 2 | 
We’ll use hanamicloth for our data visualisation. First we can check out the distribution of living unit sizes:
(-> duplicate-rows (ploclo/layer-histogram {:=x :living-units}) )
(-> partially-cleaned-housing-data
    (tc/unique-by [:assessment-account-number :sale-date])
    tc/row-count
 )345716clean up and steps to turn this data into tidy data before saving it to work with in the next steps.
First, we can
If we inspect this new, combined dataset we can see there are quite a few sparse columns:
(-> housing-data
    tc/info
    (tc/select-rows #(< 0 (:n-missing %)))
    (tc/select-columns [:col-name :n-missing]))inner-join: descriptive-stats [14 2]:
| :col-name | :n-missing | 
|---|---|
| :civic-number | 3456 | 
| :civic-additional | 361858 | 
| :civic-direction | 361775 | 
| :civic-street-name | 94 | 
| :civic-suffix | 36727 | 
| :civic-city-name | 98 | 
| :sale-price | 1116 | 
| :parcels-in-sale | 3774 | 
| :year-built | 47337 | 
| :square-foot-living-area | 17512 | 
| :style | 17407 | 
| :bedrooms | 36486 | 
| :construction-grade | 17635 | 
| :square-feet | 13451 | 
Some of these were columns that were in more than one of the original dataset and in combining them, by taking only the value from the first dataset, we may have lost some information. We can tell this by checking the descriptive stats of one of the other datasets, which shows different numbers of missing values:
(-> characteristics
    tc/info
    (tc/select-rows #(< 0 (:n-missing %)))
    (tc/select-columns [:col-name :n-missing]))characteristics: descriptive-stats [14 2]:
| :col-name | :n-missing | 
|---|---|
| :civic-number | 4967 | 
| :civic-additional | 377922 | 
| :civic-direction | 377760 | 
| :civic-street-name | 550 | 
| :civic-street-suffix | 40669 | 
| :civic-city-name | 492 | 
| :year-built | 49663 | 
| :square-foot-living-area | 12753 | 
| :style | 12656 | 
| :bedrooms | 52235 | 
| :construction-grade | 12832 | 
| :y-map-coordinate | 24115 | 
| :x-map-coordinate | 24115 | 
| :map-coordinates | 24115 | 
In this case there are more missing values in the characteristics dataset than in the combined one, so it may be the case that there isn’t actually any extra information in the characteristics dataset, but to be sure we can check. To do that we’ll combine the three datasets, then check the columns we care about for any rows where the value from the first dataset is nil, but either of the other two have a value.
First, to simplify some of this code we’ll define the list of columns the datasets share:
(def common-columns
  [:assessment-account-number
   :municipal-unit
   :civic-number
   :civic-additional
   :civic-direction
   :civic-street-name
   :civic-suffix
   :civic-city-name])Then we can join the datasets together, keeping the duplicated columns this time. Tablecloth will default to disambiguating the duplicated column names by using the dataset names, which works fine in this case:
(def combined-potentially-duplicated-columns
  (-> sales-history
      (tc/select-columns common-columns)
      (tc/inner-join (tc/select-columns characteristics common-columns)
                     :assessment-account-number)
      (tc/inner-join (tc/select-columns land-sizes common-columns)
                     :assessment-account-number)))Now we can check whether any of the other datasets have information for the missing values. For the sake of demonstration we’ll check the :municipal-unit column first:
(-> combined-potentially-duplicated-columns
    (tc/select-rows (fn [row]
                      (let [v1 (:municipal-unit row)
                            v2 (:characteristics.municipal-unit row)
                            v3 (:land-sizes.municipal-unit row)]
                        (and (nil? v1)
                             (not (nil? (or v2 v3))))))))inner-join [0 20]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :characteristics.municipal-unit | :characteristics.civic-number | :characteristics.civic-additional | :characteristics.civic-direction | :characteristics.civic-street-name | :characteristics.civic-city-name | :land-sizes.municipal-unit | :land-sizes.civic-number | :land-sizes.civic-additional | :land-sizes.civic-direction | :land-sizes.civic-street-name | :land-sizes.civic-city-name | 
|---|
It turns out there are actually no rows that have a missing value for :municipal-unit in the sales-history dataset but do have a value in either of the other two datasets. To check the rest of the columns more quickly we can extract our predicate into a helper and have it accept a column name as an argument. This function will take a row and a column name and return true if our primary dataset has no value in the given row and column and either of the secondary datasets contain do have a value:
(defn- non-nil-in-secondary-datasets? [row col-name]
  (let [v1 (col-name row)
        v2 ((->> col-name name (str "characteristics.") keyword) row)
        v3 ((->> col-name name (str "land-sizes.") keyword) row)]
    (and (nil? v1)
         (not (nil? (or v2 v3))))))#'book.part-0-end-to-end.1-data-import-and-cleanup-example/non-nil-in-secondary-datasets?And it order to check all the column names at ones, we can run this check over a given list of column names we want to check:
(defn- check-for-values-in-secondary-datasets [ds column-names] (reduce (fn [ds col-namel] () ds column-names)))
(defn secondary-datsets-have-missing-data [column-names row] (some identity (reduce (fn [row col-name] (non-nil-in-secondary-datasets? row col-name)) row column-names)))
(-> combined-potentially-duplicated-columns (tc/select-rows (partial secondary-datsets-have-missing-data common-columns)))
Now we’ll go through and try to fill in as much information as possible. We can start by filtering out any rows that do not have missing values in our base dataset in the columns we’re trying to fill in. To keep our processing pipeline neat, we’ll write a simple helper that will return true for rows that fit our criteria:
(defn resolve [v1 v2 v3] ;; (some identity args) (or v1 v2 v3) )
(defn consolidate-columns [ds column-names] (reduce (fn [ds col-name] (let [characteristics-col-name (->> col-name name (str “characteristics.”) keyword) land-sizes-col-name (->> col-name name (str “land-sizes.”) keyword)] (-> ds (tc/map-columns col-name [col-name characteristics-col-name land-sizes-col-name] resolve) (tc/drop-columns [characteristics-col-name land-sizes-col-name])))) ds column-names))
(-> combined-potentially-duplicated-columns ;; (tc/head 4) (consolidate-columns common-columns) tc/info (tc/select-rows #(< 0 (:n-missing %))) (tc/select-columns [:col-name :n-missing]))
(defn no-missing-info)
(-> combined-potentially-duplicated-columns (tc/head 5) (tc/drop-rows (fn [row] (every? identity (map (fn [col-name] (not (nil? (col-name row)))) )))) ;; tc/row-count )
that do not have any differences across the three datasets. This is going to be verbose to inline so we’ll write a small helper:
looking for rows where there is a missing value in the main dataset but there is a value for that column in another dataset
filter out rows where all three datasets agree
(defn datasets-concur? [row col]
  (println "*******************")
  (println (col row))
  (println ((->> col name (str "characteristic.") keyword) row))
  (println ((->> col name (str "land-sizes.") keyword) row))
  (println "*******************")
  (= (col row)
     ((->> col name (str "characteristic.") keyword) row)
     ((->> col name (str "land-sizes.") keyword) row)))(defn datasets-same? [row col]
  (let [v1 (col row)
        v2 ((->> col name (str "characteristic.") keyword) row)
        v3 ((->> col name (str "land-sizes.") keyword) row)]
    ;; either all values are the same
    ;; (or
    ;;  (-> [v1 v2 v3] set count (= 1)))
    (->> [v1 v2 v3] set (remove nil?) count (= 1))))(defn no-discrepancies [row]
  (every? identity
          (map (partial datasets-concur? row)
           [:municipal-unit
            :civic-number
            :civic-additional
            :civic-direction
            :civic-street-name
            :civic-suffix
            :civic-city-name])))(defn has-discrepancy? [row col-name]
  ;; returns true if values differ and are not nil, otherwise false
  (let [v1 (col-name row)
        v2 ((->> col-name name (str "characteristic.") keyword) row)
        v3 ((->> col-name name (str "land-sizes.") keyword) row)]
    (->> [v1 v2 v3] set (remove nil?) count (<= 2))))(defn with-discrepancies [row]
  (some identity (map (partial has-discrepancy? row) common-columns))
  )(-> combined-potentially-duplicated-columns
    ;; (tc/head 100)
    ;; (tc/drop-rows no-discrepancies)
    (tc/select-rows with-discrepancies)
    ;; (tc/select-rows #(not= (:municipal-unit %)
    ;;                        ((->> :municipal-unit
    ;;                              name
    ;;                              (str "characteristic.")
    ;;                              keyword) %)
    ;;                        ((->> :municipal-unit
    ;;                              name
    ;;                              (str "land-sizes.")
    ;;                              keyword) %)))
    ;; tc/row-count
    )inner-join [4 20]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :characteristics.municipal-unit | :characteristics.civic-number | :characteristics.civic-additional | :characteristics.civic-direction | :characteristics.civic-street-name | :characteristics.civic-city-name | :land-sizes.municipal-unit | :land-sizes.civic-number | :land-sizes.civic-additional | :land-sizes.civic-direction | :land-sizes.civic-street-name | :land-sizes.civic-city-name | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10823821 | MUNI. OF THE DISTRICT OF EAST HANTS | 257 | MEEK ARM | TRAIL | EAST UNIACKE | MUNI. OF THE DISTRICT OF EAST HANTS | 0 | MEEK ARM | EAST UNIACKE | MUNI. OF THE DISTRICT OF EAST HANTS | 0 | MEEK ARM | EAST UNIACKE | ||||||
| 10823821 | MUNI. OF THE DISTRICT OF EAST HANTS | 257 | MEEK ARM | TRAIL | EAST UNIACKE | MUNI. OF THE DISTRICT OF EAST HANTS | 0 | MEEK ARM | EAST UNIACKE | MUNI. OF THE DISTRICT OF EAST HANTS | 0 | MEEK ARM | EAST UNIACKE | ||||||
| 733814 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 15308 | CABOT | TRAIL | CHETICAMP | MUNICIPALITY OF THE COUNTY OF INVERNESS | 15308 | CABOT | CH�TICAMP | MUNICIPALITY OF THE COUNTY OF INVERNESS | 15308 | CABOT | CH�TICAMP | ||||||
| 10823714 | MUNI. OF THE DISTRICT OF EAST HANTS | 173 | MEEK ARM | TRAIL | EAST UNIACKE | MUNI. OF THE DISTRICT OF EAST HANTS | 0 | MEEK ARM | EAST UNIACKE | MUNI. OF THE DISTRICT OF EAST HANTS | 0 | MEEK ARM | EAST UNIACKE | 
combined-potentially-duplicated-columnsinner-join [363075 20]:
| :assessment-account-number | :municipal-unit | :civic-number | :civic-additional | :civic-direction | :civic-street-name | :civic-suffix | :civic-city-name | :characteristics.municipal-unit | :characteristics.civic-number | :characteristics.civic-additional | :characteristics.civic-direction | :characteristics.civic-street-name | :characteristics.civic-city-name | :land-sizes.municipal-unit | :land-sizes.civic-number | :land-sizes.civic-additional | :land-sizes.civic-direction | :land-sizes.civic-street-name | :land-sizes.civic-city-name | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | PORTERS LAKE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | PORTERS LAKE | ||||||
| 911593 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | RD | PORTERS LAKE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | PORTERS LAKE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 445 | WEST PORTERS LAKE | PORTERS LAKE | ||||||
| 8502838 | MUNICIPALITY OF THE DISTRICT OF ARGYLE | 1125 | DU TITTLE | CH | SURETTES ISLAND | MUNICIPALITY OF THE DISTRICT OF ARGYLE | 1125 | DU TITTLE | SURETTES ISLAND | MUNICIPALITY OF THE DISTRICT OF ARGYLE | 1125 | DU TITTLE | SURETTES ISLAND | ||||||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | CANNING | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | CANNING | ||||||
| 3698858 | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | ST | CANNING | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | CANNING | MUNICIPALITY OF THE COUNTY OF KINGS | 9739 | MAIN | CANNING | ||||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | |||||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | |||||||
| 3944026 | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | MUNI. OF THE DISTRICT OF GUYSBOROUGH | 5231 | HIGHWAY 316 | CHARLOS COVE | |||||||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | MABOU HARBOUR | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | MABOU HARBOUR | ||||||
| 3885003 | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | RD | MABOU HARBOUR | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | MABOU HARBOUR | MUNICIPALITY OF THE COUNTY OF INVERNESS | 1394 | MABOU HARBOUR | MABOU HARBOUR | ||||||
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | 
| 10863791 | MUNI. OF THE DISTRICT OF LUNENBURG | 320 | EVELYN | DR | PARKDALE | MUNI. OF THE DISTRICT OF LUNENBURG | 320 | EVELYN | PARKDALE | MUNI. OF THE DISTRICT OF LUNENBURG | 320 | EVELYN | PARKDALE | ||||||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | KETCH HARBOUR | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | KETCH HARBOUR | ||||||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | KETCH HARBOUR | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | KETCH HARBOUR | ||||||
| 2745755 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | RD | KETCH HARBOUR | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | KETCH HARBOUR | HALIFAX REGIONAL MUNICIPALITY (HRM) | 1178 | KETCH HARBOUR | KETCH HARBOUR | ||||||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | |||||||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | |||||||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | |||||||
| 10751993 | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | WEST HANTS REGIONAL MUNICIPALITY | 7289 | HIGHWAY 1 | ARDOISE | |||||||
| 7667817 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 56 | LAKE | RD | PLEASANT VALLEY | HALIFAX REGIONAL MUNICIPALITY (HRM) | 56 | LAKE | PLEASANT VALLEY | HALIFAX REGIONAL MUNICIPALITY (HRM) | 56 | LAKE | PLEASANT VALLEY | ||||||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | MIDDLE SACKVILLE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | MIDDLE SACKVILLE | ||||||
| 10986400 | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | RUN | MIDDLE SACKVILLE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | MIDDLE SACKVILLE | HALIFAX REGIONAL MUNICIPALITY (HRM) | 538 | MIDNIGHT | MIDDLE SACKVILLE | 
(->
  (tc/info characteristics)
 (tc/select-rows #(#{:municipal-unit
                     :civic-number
                     :civic-additional
                     :civic-direction
                     :civic-street-name
                     :civic-suffix
                     :civic-city-name}
                    (:col-name %)))
(tc/select-columns [:col-name :n-missing]))characteristics: descriptive-stats [6 2]:
| :col-name | :n-missing | 
|---|---|
| :municipal-unit | 0 | 
| :civic-number | 4967 | 
| :civic-additional | 377922 | 
| :civic-direction | 377760 | 
| :civic-street-name | 550 | 
| :civic-city-name | 492 | 
(-> housing-data (haclo/layer-point #:haclo{:x :square-foot-living-area :y :sale-price :xscale {:type “log”}}))
Contains information licensed under the Open Data & Information Government Licence – PVSC & Participating Municipalities
Now that our dataset is assembled and cleaned up, I’ll write it to a new file for easier access in the next steps.
(tc/write-nippy! housing-data "data/end_to_end/housing-data.nippy")nil