Columnar and Compressed Data

Warning

This is an excerpt from the work-in-progress Clojure Data Cookbook. It is under active development and subject to change in the coming months.

TODO - write intro

Summary

Nippy

Nippy is “the fastest serialization library for Clojure”. Tablecloth datasets work with nippy out of the box.

Unlike edn Edn doesn’t handle binary data, doesn’t always serialize certain datatypes in a round-trippable way (i.e. reading the file back in )

writing edn is worse than not threadsafe

writing prstr under the hood, and if any other thread tries to write to stdout at the same time, stdout is not threadsafe

https://github.com/jafingerhut/jafingerhut.github.com/blob/master/clojure-info/using-edn-safely.md https://nitor.com/fi/artikkelit/pitfalls-and-bumps-clojures-extensible-data-notation-edn

https://techascent.github.io/tech.ml.dataset/nippy-serialization-rocks.html

Tablecloth can just read nippy datasets directly. No extra dependencies are required.

(tc/dataset "data/tc-nippy.nippy")

data/tc-nippy.nippy [20 10]:

0 7 1 4 6 3 2 9 5 8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8
cell-0 cell-7 cell-1 cell-4 cell-6 cell-3 cell-2 cell-9 cell-5 cell-8

Parquet

Working with parquet files requires a pile of dependencies. There is a very long and carefully chosen list of exclusions curated to avoid a myriad of CVE issues in hadoop that are conveniently packaged up in the tech.parquet package. We’ll add this to our deps.edn:

com.techascent/tmd-parquet {:mvn/version "1.000-beta-39"}

Once this dependency is installed and the relevant namespace is required, tablecloth (thanks to tech.ml.dataset) will handle parquet files like any other file type:

(require '[tech.v3.libs.parquet :as pq])

Parquet support includes the following datatypes: - all numeric types - strings - java.time LocalDate, Instant - UUIDs (read/written as strings, the same as R’s write_parquet function)

Given a path to a single parquet file, we get a single dataset:

(pq/parquet->ds "data/columnar_data/mtcars.parquet")

_unnamed [32 12]:

model mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

We can pass the key-fn option (like with any other file type) to format column headers on read:

(pq/parquet->ds "data/columnar_data/mtcars.parquet" {:key-fn keyword})

_unnamed [32 12]:

:model :mpg :cyl :disp :hp :drat :wt :qsec :vs :am :gear :carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

We can also use the :column-allowlist or :column-blocklist options to load only certain columns. This can be useful if the dataset is very large and you know in advance that you only need a subset of the columns. Note that the column header transformations are applied before the column selection, so the column selection should use the transformed column names:

(pq/parquet->ds "data/columnar_data/mtcars.parquet"
                {:key-fn keyword
                 :column-allowlist [:mpg, :cyl]})

_unnamed [32 2]:

:mpg :cyl
21.0 6
21.0 6
22.8 4
21.4 6
18.7 8
18.1 6
14.3 8
24.4 4
22.8 4
19.2 6
15.5 8
15.2 8
13.3 8
19.2 8
27.3 4
26.0 4
30.4 4
15.8 8
19.7 6
15.0 8
21.4 4

As opposed to this, which returns nothing because there are no columns with the string names “mpg” or “cyl”, because of the key-fn option we passed.

(pq/parquet->ds "data/columnar_data/mtcars.parquet"
                {:key-fn keyword
                 :column-allowlist ["mpg", "cyl"]})

_unnamed [0 0]

We can also inspect the metadata of a parquet file. pq/parquet->metadata-seq returns a sequence of maps, each containing the metadata of one row group in the dataset.

(pq/parquet->metadata-seq "data/columnar_data/mtcars.parquet")
[{:path nil,
  :num-rows 32,
  :total-byte-size 167109,
  :columns
  [{:path ["model"],
    :total-size 454,
    :starting-pos 4,
    :original-type :utf8,
    :dictionary-page-offset 0,
    :uncompressed-size 534,
    :encodings #{:plain},
    :statistics
    {:min
     #object[org.apache.parquet.io.api.Binary$ByteArrayBackedBinary 0x72c8a46f "Binary{11 reused bytes, [65, 77, 67, 32, 74, 97, 118, 101, 108, 105, 110]}"],
     :max
     #object[org.apache.parquet.io.api.Binary$ByteArrayBackedBinary 0x1d70ea3f "Binary{10 reused bytes, [86, 111, 108, 118, 111, 32, 49, 52, 50, 69]}"],
     :num-missing 0},
    :first-data-page-offset 4,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x66d5d1d1 "optional binary model (STRING)"]}
   {:path ["mpg"],
    :total-size 180,
    :starting-pos 458,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 10.4, :max 33.9, :num-missing 0},
    :first-data-page-offset 458,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x6c395bfd "optional double mpg"]}
   {:path ["cyl"],
    :total-size 78,
    :starting-pos 638,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 4, :max 8, :num-missing 0},
    :first-data-page-offset 638,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x533fbf8d "optional int32 cyl (INTEGER(32,true))"]}
   {:path ["disp"],
    :total-size 186,
    :starting-pos 716,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 71.1, :max 472.0, :num-missing 0},
    :first-data-page-offset 716,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x9e7aaab "optional double disp"]}
   {:path ["hp"],
    :total-size 133,
    :starting-pos 902,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 153,
    :encodings #{:plain},
    :statistics {:min 52, :max 335, :num-missing 0},
    :first-data-page-offset 902,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x2e57538b "optional int32 hp (INTEGER(32,true))"]}
   {:path ["drat"],
    :total-size 192,
    :starting-pos 1035,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 2.76, :max 4.93, :num-missing 0},
    :first-data-page-offset 1035,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x535007df "optional double drat"]}
   {:path ["wt"],
    :total-size 219,
    :starting-pos 1227,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 1.513, :max 5.424, :num-missing 0},
    :first-data-page-offset 1227,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x6eb6b31c "optional double wt"]}
   {:path ["qsec"],
    :total-size 214,
    :starting-pos 1446,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 14.5, :max 22.9, :num-missing 0},
    :first-data-page-offset 1446,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x73b35955 "optional double qsec"]}
   {:path ["vs"],
    :total-size 72,
    :starting-pos 1660,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 0, :max 1, :num-missing 0},
    :first-data-page-offset 1660,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x17421410 "optional int32 vs (INTEGER(32,true))"]}
   {:path ["am"],
    :total-size 48,
    :starting-pos 1732,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 0, :max 1, :num-missing 0},
    :first-data-page-offset 1732,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x55a2f549 "optional int32 am (INTEGER(32,true))"]}
   {:path ["gear"],
    :total-size 72,
    :starting-pos 1780,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 3, :max 5, :num-missing 0},
    :first-data-page-offset 1780,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x621aaba7 "optional int32 gear (INTEGER(32,true))"]}
   {:path ["carb"],
    :total-size 96,
    :starting-pos 1852,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 153,
    :encodings #{:plain},
    :statistics {:min 1, :max 8, :num-missing 0},
    :first-data-page-offset 1852,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x54f27414 "optional int32 carb (INTEGER(32,true))"]}]}]
source: src/book/part_1_data_import/2_columnar_compressed_data.clj