Columnar and Compressed Data

Warning

This is an excerpt from the work-in-progress Clojure Data Cookbook. It is under active development and subject to change in the coming months.

TODO - write intro

Summary

Nippy

Nippy is “the fastest serialization library for Clojure”. Tablecloth datasets work with nippy out of the box.

Unlike edn Edn doesn’t handle binary data, doesn’t always serialize certain datatypes in a round-trippable way (i.e. reading the file back in )

writing edn is worse than not threadsafe

writing prstr under the hood, and if any other thread tries to write to stdout at the same time, stdout is not threadsafe

https://github.com/jafingerhut/jafingerhut.github.com/blob/master/clojure-info/using-edn-safely.md https://nitor.com/fi/artikkelit/pitfalls-and-bumps-clojures-extensible-data-notation-edn

https://techascent.github.io/tech.ml.dataset/nippy-serialization-rocks.html

Tablecloth can just read nippy datasets directly. No extra dependencies are required.

(tc/dataset "data/tc-nippy.nippy")

data/tc-nippy.nippy [20 10]:

0	7	1	4	6	3	2	9	5	8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8
cell-0	cell-7	cell-1	cell-4	cell-6	cell-3	cell-2	cell-9	cell-5	cell-8

Parquet

Working with parquet files requires a pile of dependencies. There is a very long and carefully chosen list of exclusions curated to avoid a myriad of CVE issues in hadoop that are conveniently packaged up in the tech.parquet package. We’ll add this to our deps.edn:

com.techascent/tmd-parquet {:mvn/version "1.000-beta-39"}

Once this dependency is installed and the relevant namespace is required, tablecloth (thanks to tech.ml.dataset) will handle parquet files like any other file type:

(require '[tech.v3.libs.parquet :as pq])

Parquet support includes the following datatypes: - all numeric types - strings - java.time LocalDate, Instant - UUIDs (read/written as strings, the same as R’s write_parquet function)

Given a path to a single parquet file, we get a single dataset:

(pq/parquet->ds "data/columnar_data/mtcars.parquet")

_unnamed [32 12]:

model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
…	…	…	…	…	…	…	…	…	…	…	…
Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

We can pass the key-fn option (like with any other file type) to format column headers on read:

(pq/parquet->ds "data/columnar_data/mtcars.parquet" {:key-fn keyword})

_unnamed [32 12]:

:model	:mpg	:cyl	:disp	:hp	:drat	:wt	:qsec	:vs	:am	:gear	:carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
…	…	…	…	…	…	…	…	…	…	…	…
Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

We can also use the :column-allowlist or :column-blocklist options to load only certain columns. This can be useful if the dataset is very large and you know in advance that you only need a subset of the columns. Note that the column header transformations are applied before the column selection, so the column selection should use the transformed column names:

(pq/parquet->ds "data/columnar_data/mtcars.parquet"
                {:key-fn keyword
                 :column-allowlist [:mpg, :cyl]})

_unnamed [32 2]:

:mpg	:cyl
21.0	6
21.0	6
22.8	4
21.4	6
18.7	8
18.1	6
14.3	8
24.4	4
22.8	4
19.2	6
…	…
15.5	8
15.2	8
13.3	8
19.2	8
27.3	4
26.0	4
30.4	4
15.8	8
19.7	6
15.0	8
21.4	4

As opposed to this, which returns nothing because there are no columns with the string names “mpg” or “cyl”, because of the key-fn option we passed.

(pq/parquet->ds "data/columnar_data/mtcars.parquet"
                {:key-fn keyword
                 :column-allowlist ["mpg", "cyl"]})

_unnamed [0 0]

We can also inspect the metadata of a parquet file. pq/parquet->metadata-seq returns a sequence of maps, each containing the metadata of one row group in the dataset.

(pq/parquet->metadata-seq "data/columnar_data/mtcars.parquet")

[{:path nil,
  :num-rows 32,
  :total-byte-size 167109,
  :columns
  [{:path ["model"],
    :total-size 454,
    :starting-pos 4,
    :original-type :utf8,
    :dictionary-page-offset 0,
    :uncompressed-size 534,
    :encodings #{:plain},
    :statistics
    {:min
     #object[org.apache.parquet.io.api.Binary$ByteArrayBackedBinary 0x72c8a46f "Binary{11 reused bytes, [65, 77, 67, 32, 74, 97, 118, 101, 108, 105, 110]}"],
     :max
     #object[org.apache.parquet.io.api.Binary$ByteArrayBackedBinary 0x1d70ea3f "Binary{10 reused bytes, [86, 111, 108, 118, 111, 32, 49, 52, 50, 69]}"],
     :num-missing 0},
    :first-data-page-offset 4,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x66d5d1d1 "optional binary model (STRING)"]}
   {:path ["mpg"],
    :total-size 180,
    :starting-pos 458,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 10.4, :max 33.9, :num-missing 0},
    :first-data-page-offset 458,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x6c395bfd "optional double mpg"]}
   {:path ["cyl"],
    :total-size 78,
    :starting-pos 638,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 4, :max 8, :num-missing 0},
    :first-data-page-offset 638,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x533fbf8d "optional int32 cyl (INTEGER(32,true))"]}
   {:path ["disp"],
    :total-size 186,
    :starting-pos 716,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 71.1, :max 472.0, :num-missing 0},
    :first-data-page-offset 716,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x9e7aaab "optional double disp"]}
   {:path ["hp"],
    :total-size 133,
    :starting-pos 902,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 153,
    :encodings #{:plain},
    :statistics {:min 52, :max 335, :num-missing 0},
    :first-data-page-offset 902,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x2e57538b "optional int32 hp (INTEGER(32,true))"]}
   {:path ["drat"],
    :total-size 192,
    :starting-pos 1035,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 2.76, :max 4.93, :num-missing 0},
    :first-data-page-offset 1035,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x535007df "optional double drat"]}
   {:path ["wt"],
    :total-size 219,
    :starting-pos 1227,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 1.513, :max 5.424, :num-missing 0},
    :first-data-page-offset 1227,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x6eb6b31c "optional double wt"]}
   {:path ["qsec"],
    :total-size 214,
    :starting-pos 1446,
    :original-type nil,
    :dictionary-page-offset 0,
    :uncompressed-size 281,
    :encodings #{:plain},
    :statistics {:min 14.5, :max 22.9, :num-missing 0},
    :first-data-page-offset 1446,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x73b35955 "optional double qsec"]}
   {:path ["vs"],
    :total-size 72,
    :starting-pos 1660,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 0, :max 1, :num-missing 0},
    :first-data-page-offset 1660,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x17421410 "optional int32 vs (INTEGER(32,true))"]}
   {:path ["am"],
    :total-size 48,
    :starting-pos 1732,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 0, :max 1, :num-missing 0},
    :first-data-page-offset 1732,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x55a2f549 "optional int32 am (INTEGER(32,true))"]}
   {:path ["gear"],
    :total-size 72,
    :starting-pos 1780,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 152,
    :encodings #{:plain},
    :statistics {:min 3, :max 5, :num-missing 0},
    :first-data-page-offset 1780,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x621aaba7 "optional int32 gear (INTEGER(32,true))"]}
   {:path ["carb"],
    :total-size 96,
    :starting-pos 1852,
    :original-type :int32,
    :dictionary-page-offset 0,
    :uncompressed-size 153,
    :encodings #{:plain},
    :statistics {:min 1, :max 8, :num-missing 0},
    :first-data-page-offset 1852,
    :num-values 32,
    :repetition :optional,
    :decimal-metadata nil,
    :compression-codec :snappy,
    :primitive-type
    #object[org.apache.parquet.schema.PrimitiveType 0x54f27414 "optional int32 carb (INTEGER(32,true))"]}]}]

source: src/book/part_1_data_import/2_columnar_compressed_data.clj