6 Dealing with messy data
ns chapter-2-input-output.2-2-messy-data
(:nextjournal.clerk/toc true}
{:require [tablecloth.api :as tc]
(:as fun]
[tech.v3.datatype.functional :as kind-clerk])) [scicloj.kind-clerk.api
(kind-clerk/setup!)
:ok
This is a work in progress of the code examples that will make up chapter 2, section 2 of the Clojure data cookbook
How do you know it’s messy? What do I mean by that?
6.1 Multiple types mixed in one column
Tablecloth will handle it just fine, it will just give the column the type :object
def mixed-types
(:A ["string" "more strings" 3]
(tc/dataset {:B [1 2 "whoops"]}))
:columns) (tc/info mixed-types
_unnamed :column info [2 4]:
:categorical? | :name | :datatype | :n-elems |
---|---|---|---|
true | :A | :object | 3 |
true | :B | :object | 3 |
To convert all values in a row to a given type:
:A :string) (tc/convert-types mixed-types
_unnamed [3 2]:
:A | :B |
---|---|
string | 1 |
more strings | 2 |
3 | whoops |
-> mixed-types
(:A :string)
(tc/convert-types :columns)) (tc/info
_unnamed :column info [2 4]:
:categorical? | :name | :datatype | :n-elems |
---|---|---|---|
true | :A | :string | 3 |
true | :B | :object | 3 |
If you try to convert to something that can’t be cast under the hood, it will throw an exception: e.g. (this will make your notebook fail to render) (tc/convert-types mixed-types :B :int)
6.2 Multiple formats for a thing that’s supposed to have one (e.g. phone numbers, postal codes)
You can pass any arbitrary function to update a column
def misformatted
(:phone ["123-456-5654" "(304) 342 1235" "(423)-234-2342" "1234325984" "nope"]
(tc/dataset {:postal-code ["t1n 0k2" "H9Q1L2" "H3H 8V0" "eu5h04" "just wrong"]}))
require '[clojure.string :as str]) (
nil
def phone-regex
(re-pattern
(str
(".*" ; zero or more of any character
"(\\d{3})" ; any 3 numbers
".*" ; zero or more of any character
"(\\d{3})" ; any 3 numbers
".*" ; zero or more of any character
"(\\d{4})" ; any 4 numbers
)))
defn- normalize-phone-numbers [col]
(map (fn [v]
(let [[match a b c] (re-matches phone-regex v)]
(if match
(str "(" a ")-" b "-" c)
("INVALID")))
col))
#'chapter-2-input-output.2-2-messy-data/normalize-phone-numbers
def postal-code-regex
(re-pattern
(str
("([A-Z]{1})" ; any letter
".*" ; zero or more of any character
"(\\d{1})" ; any number
".*"
"([A-Z]{1})"
".*"
"(\\d{1})"
".*"
"([A-Z]{1})"
".*"
"(\\d{1})")))
defn- normalize-postal-codes [col]
(map (fn [v]
(let [[match a b c d e f] (->> v str/upper-case (re-matches postal-code-regex))]
(if match
(str a b c " " d e f)
("INVALID")))
col))
#'chapter-2-input-output.2-2-messy-data/normalize-postal-codes
-> misformatted
(:phone normalize-phone-numbers
(tc/update-columns {:postal-code normalize-postal-codes}))
_unnamed [5 2]:
:phone | :postal-code |
---|---|
(123)-456-5654 | T1N 0K2 |
(304)-342-1235 | H9Q 1L2 |
(423)-234-2342 | H3H 8V0 |
(123)-432-5984 | INVALID |
INVALID | INVALID |
6.3 Missing values
Tablecloth has many built-in helpers for dealing with missing values.
require '[tech.v3.datatype.datetime :as dt]) (
nil
def sparse
(:A [1 2 3 nil nil 6]
(tc/dataset {:B ["test" nil "this" "is" "a" "test"]}))
Drop whole rows with any missing values:
(tc/drop-missing sparse)
_unnamed [3 2]:
:A | :B |
---|---|
1 | test |
3 | this |
6 | test |
Drop whole row with any missing values in a given column:
:A) (tc/drop-missing sparse
_unnamed [4 2]:
:A | :B |
---|---|
1 | test |
2 | |
3 | this |
6 | test |
Replace missing values Table cloth includes many strategies for replacing missing values https://scicloj.github.io/tablecloth/index.html#replace
6.4 Arbitrary values meant to indicate missing (e.g. “NONE”, “N/A”, false, etc.)
It’s not uncommon to see missing values indicated in multiple different ways, sometimes even within the same dataset. E.g. missing cells might be blank entirely, or they might be populated with some arbitrary value meant to indicate “nothing”, like “NONE”, “N/A”, false
, etc.