Columnar and Compressed Data
This is an excerpt from the work-in-progress Clojure Data Cookbook. It is under active development and subject to change in the coming months.
TODO - write intro
Summary
Nippy
Nippy is “the fastest serialization library for Clojure”. Tablecloth datasets work with nippy out of the box.
Unlike edn Edn doesn’t handle binary data, doesn’t always serialize certain datatypes in a round-trippable way (i.e. reading the file back in )
writing edn is worse than not threadsafe
writing prstr under the hood, and if any other thread tries to write to stdout at the same time, stdout is not threadsafe
https://github.com/jafingerhut/jafingerhut.github.com/blob/master/clojure-info/using-edn-safely.md https://nitor.com/fi/artikkelit/pitfalls-and-bumps-clojures-extensible-data-notation-edn
https://techascent.github.io/tech.ml.dataset/nippy-serialization-rocks.html
Tablecloth can just read nippy datasets directly. No extra dependencies are required.
"data/tc-nippy.nippy") (tc/dataset
data/tc-nippy.nippy [20 10]:
0 | 7 | 1 | 4 | 6 | 3 | 2 | 9 | 5 | 8 |
---|---|---|---|---|---|---|---|---|---|
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
cell-0 | cell-7 | cell-1 | cell-4 | cell-6 | cell-3 | cell-2 | cell-9 | cell-5 | cell-8 |
Parquet
Working with parquet files requires a pile of dependencies. There is a very long and carefully chosen list of exclusions curated to avoid a myriad of CVE issues in hadoop that are conveniently packaged up in the tech.parquet
package. We’ll add this to our deps.edn
:
:mvn/version "1.000-beta-39"} com.techascent/tmd-parquet {
Once this dependency is installed and the relevant namespace is required, tablecloth (thanks to tech.ml.dataset) will handle parquet files like any other file type:
require '[tech.v3.libs.parquet :as pq]) (
Parquet support includes the following datatypes: - all numeric types - strings - java.time LocalDate, Instant - UUIDs (read/written as strings, the same as R’s write_parquet function)
Given a path to a single parquet file, we get a single dataset:
"data/columnar_data/mtcars.parquet") (pq/parquet->ds
_unnamed [32 12]:
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
… | … | … | … | … | … | … | … | … | … | … | … |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
We can pass the key-fn
option (like with any other file type) to format column headers on read:
"data/columnar_data/mtcars.parquet" {:key-fn keyword}) (pq/parquet->ds
_unnamed [32 12]:
:model | :mpg | :cyl | :disp | :hp | :drat | :wt | :qsec | :vs | :am | :gear | :carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
… | … | … | … | … | … | … | … | … | … | … | … |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
We can also use the :column-allowlist
or :column-blocklist
options to load only certain columns. This can be useful if the dataset is very large and you know in advance that you only need a subset of the columns. Note that the column header transformations are applied before the column selection, so the column selection should use the transformed column names:
"data/columnar_data/mtcars.parquet"
(pq/parquet->ds :key-fn keyword
{:column-allowlist [:mpg, :cyl]})
_unnamed [32 2]:
:mpg | :cyl |
---|---|
21.0 | 6 |
21.0 | 6 |
22.8 | 4 |
21.4 | 6 |
18.7 | 8 |
18.1 | 6 |
14.3 | 8 |
24.4 | 4 |
22.8 | 4 |
19.2 | 6 |
… | … |
15.5 | 8 |
15.2 | 8 |
13.3 | 8 |
19.2 | 8 |
27.3 | 4 |
26.0 | 4 |
30.4 | 4 |
15.8 | 8 |
19.7 | 6 |
15.0 | 8 |
21.4 | 4 |
As opposed to this, which returns nothing because there are no columns with the string names “mpg” or “cyl”, because of the key-fn option we passed.
"data/columnar_data/mtcars.parquet"
(pq/parquet->ds :key-fn keyword
{:column-allowlist ["mpg", "cyl"]})
_unnamed [0 0]
We can also inspect the metadata of a parquet file. pq/parquet->metadata-seq
returns a sequence of maps, each containing the metadata of one row group in the dataset.
"data/columnar_data/mtcars.parquet") (pq/parquet->metadata-seq
:path nil,
[{:num-rows 32,
:total-byte-size 167109,
:columns
:path ["model"],
[{:total-size 454,
:starting-pos 4,
:original-type :utf8,
:dictionary-page-offset 0,
:uncompressed-size 534,
:encodings #{:plain},
:statistics
:min
{0x72c8a46f "Binary{11 reused bytes, [65, 77, 67, 32, 74, 97, 118, 101, 108, 105, 110]}"],
#object[org.apache.parquet.io.api.Binary$ByteArrayBackedBinary :max
0x1d70ea3f "Binary{10 reused bytes, [86, 111, 108, 118, 111, 32, 49, 52, 50, 69]}"],
#object[org.apache.parquet.io.api.Binary$ByteArrayBackedBinary :num-missing 0},
:first-data-page-offset 4,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x66d5d1d1 "optional binary model (STRING)"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["mpg"],
{:total-size 180,
:starting-pos 458,
:original-type nil,
:dictionary-page-offset 0,
:uncompressed-size 281,
:encodings #{:plain},
:statistics {:min 10.4, :max 33.9, :num-missing 0},
:first-data-page-offset 458,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x6c395bfd "optional double mpg"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["cyl"],
{:total-size 78,
:starting-pos 638,
:original-type :int32,
:dictionary-page-offset 0,
:uncompressed-size 152,
:encodings #{:plain},
:statistics {:min 4, :max 8, :num-missing 0},
:first-data-page-offset 638,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x533fbf8d "optional int32 cyl (INTEGER(32,true))"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["disp"],
{:total-size 186,
:starting-pos 716,
:original-type nil,
:dictionary-page-offset 0,
:uncompressed-size 281,
:encodings #{:plain},
:statistics {:min 71.1, :max 472.0, :num-missing 0},
:first-data-page-offset 716,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x9e7aaab "optional double disp"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["hp"],
{:total-size 133,
:starting-pos 902,
:original-type :int32,
:dictionary-page-offset 0,
:uncompressed-size 153,
:encodings #{:plain},
:statistics {:min 52, :max 335, :num-missing 0},
:first-data-page-offset 902,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x2e57538b "optional int32 hp (INTEGER(32,true))"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["drat"],
{:total-size 192,
:starting-pos 1035,
:original-type nil,
:dictionary-page-offset 0,
:uncompressed-size 281,
:encodings #{:plain},
:statistics {:min 2.76, :max 4.93, :num-missing 0},
:first-data-page-offset 1035,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x535007df "optional double drat"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["wt"],
{:total-size 219,
:starting-pos 1227,
:original-type nil,
:dictionary-page-offset 0,
:uncompressed-size 281,
:encodings #{:plain},
:statistics {:min 1.513, :max 5.424, :num-missing 0},
:first-data-page-offset 1227,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x6eb6b31c "optional double wt"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["qsec"],
{:total-size 214,
:starting-pos 1446,
:original-type nil,
:dictionary-page-offset 0,
:uncompressed-size 281,
:encodings #{:plain},
:statistics {:min 14.5, :max 22.9, :num-missing 0},
:first-data-page-offset 1446,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x73b35955 "optional double qsec"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["vs"],
{:total-size 72,
:starting-pos 1660,
:original-type :int32,
:dictionary-page-offset 0,
:uncompressed-size 152,
:encodings #{:plain},
:statistics {:min 0, :max 1, :num-missing 0},
:first-data-page-offset 1660,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x17421410 "optional int32 vs (INTEGER(32,true))"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["am"],
{:total-size 48,
:starting-pos 1732,
:original-type :int32,
:dictionary-page-offset 0,
:uncompressed-size 152,
:encodings #{:plain},
:statistics {:min 0, :max 1, :num-missing 0},
:first-data-page-offset 1732,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x55a2f549 "optional int32 am (INTEGER(32,true))"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["gear"],
{:total-size 72,
:starting-pos 1780,
:original-type :int32,
:dictionary-page-offset 0,
:uncompressed-size 152,
:encodings #{:plain},
:statistics {:min 3, :max 5, :num-missing 0},
:first-data-page-offset 1780,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x621aaba7 "optional int32 gear (INTEGER(32,true))"]}
#object[org.apache.parquet.schema.PrimitiveType :path ["carb"],
{:total-size 96,
:starting-pos 1852,
:original-type :int32,
:dictionary-page-offset 0,
:uncompressed-size 153,
:encodings #{:plain},
:statistics {:min 1, :max 8, :num-missing 0},
:first-data-page-offset 1852,
:num-values 32,
:repetition :optional,
:decimal-metadata nil,
:compression-codec :snappy,
:primitive-type
0x54f27414 "optional int32 carb (INTEGER(32,true))"]}]}] #object[org.apache.parquet.schema.PrimitiveType