6  Reading Datasets

author: Daniel Slutsky, Ken Huang

We may use various sources of datasets for our tutorials here.

(ns noj-book.datasets
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.rdatasets :as rdatasets]
            [scicloj.kindly.v4.kind :as kind]
            ))

6.1 Reading data from files

Noj provides the correct setup, so that tablecloth can be used to read files in these formats:

  • CSV and TSV
  • Nippy
  • Arrow
  • Excel (xlsx)
  • Parquet

Mostly the file type gets detected from the filename suffix (.csv, .tsv., .nippy, .xlsx, .parquet)

6.1.1 CVS and TSV

These support

  • direct read from local disk
  • as gz compressed file
  • over http (as well gz compressed)
  • wrapped in input-stream

See below for examples.

6.1.2 Arrow

For arrow suffix detection is not possible, as there is no standard suffix, so it must be given via option {:file-type arrow}

See further documentation here.

(tc/dataset "data/alldtypes.arrow-feather" {:file-type :arrow})

data/alldtypes.arrow-feather [1000 15]:

uints longs ubytes strings doubles ushorts local_times local_dates ints instants shorts bytes boolean floats text
0 0 0 0 0.0 0 13:39:59.908 2022-02-19 0 1645303199909000000 0 0 true 0.0 0
1 1 1 1 1.0 1 13:39:59.910 2022-02-19 1 1645303199911000000 1 1 false 1.0 1
2 2 2 2 2.0 2 13:39:59.910 2022-02-19 2 1645303199911000000 2 2 true 2.0 2
3 3 3 3 3.0 3 13:39:59.910 2022-02-19 3 1645303199911000000 3 3 true 3.0 3
4 4 4 4 4.0 4 13:39:59.910 2022-02-19 4 1645303199911000000 4 4 false 4.0 4
5 5 5 5 5.0 5 13:39:59.910 2022-02-19 5 1645303199911000000 5 5 false 5.0 5
6 6 6 6 6.0 6 13:39:59.910 2022-02-19 6 1645303199911000000 6 6 true 6.0 6
7 7 7 7 7.0 7 13:39:59.910 2022-02-19 7 1645303199911000000 7 7 false 7.0 7
8 8 8 8 8.0 8 13:39:59.910 2022-02-19 8 1645303199911000000 8 8 false 8.0 8
9 9 9 9 9.0 9 13:39:59.910 2022-02-19 9 1645303199911000000 9 9 true 9.0 9
989 989 221 989 989.0 989 13:39:59.923 2022-02-19 989 1645303199916000000 989 -35 989.0 989
990 990 222 990 990.0 990 13:39:59.923 2022-02-19 990 1645303199916000000 990 -34 990.0 990
991 991 223 991 991.0 991 13:39:59.923 2022-02-19 991 1645303199916000000 991 -33 991.0 991
992 992 224 992 992.0 992 13:39:59.923 2022-02-19 992 1645303199916000000 992 -32 992.0 992
993 993 225 993 993.0 993 13:39:59.923 2022-02-19 993 1645303199916000000 993 -31 993.0 993
994 994 226 994 994.0 994 13:39:59.923 2022-02-19 994 1645303199916000000 994 -30 994.0 994
995 995 227 995 995.0 995 13:39:59.923 2022-02-19 995 1645303199916000000 995 -29 995.0 995
996 996 228 996 996.0 996 13:39:59.923 2022-02-19 996 1645303199916000000 996 -28 996.0 996
997 997 229 997 997.0 997 13:39:59.923 2022-02-19 997 1645303199916000000 997 -27 997.0 997
998 998 230 998 998.0 998 13:39:59.923 2022-02-19 998 1645303199916000000 998 -26 998.0 998
999 999 231 999 999.0 999 13:39:59.923 2022-02-19 999 1645303199916000000 999 -25 999.0 999

6.1.3 Excel

Read single sheet Excel file:

(comment
 (tc/dataset "data/singleSheet.xlsx"))

see documentation here.

An Excel file can have various sheets, which can be accessed as a sequence of datasets:

(comment
 (tech.v3.libs.fastexcel/workbook->datasets "data/twoSheets.xlsx"))

see documentation here.

6.1.4 Parquet

Can be read by:

(comment (tc/dataset "data/userdata1.parquet"))

see documentation here.

6.2 Example datasets from rdatasets

One of the main collections of example dataset is the rdatasets namespace of metamorph.ml, which can fetch datasets from the Rdatasets collection.

(rdatasets/datasets-iris)

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

:rownames :sepal-length :sepal-width :petal-length :petal-width :species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
(rdatasets/ggplot2-mpg)

https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv [234 12]:

:rownames :manufacturer :model :displ :year :cyl :trans :drv :cty :hwy :fl :class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
7 audi a4 3.1 2008 6 auto(av) f 18 27 p compact
8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact
224 volkswagen new beetle 2.0 1999 4 manual(m5) f 21 29 r subcompact
225 volkswagen new beetle 2.0 1999 4 auto(l4) f 19 26 r subcompact
226 volkswagen new beetle 2.5 2008 5 manual(m5) f 20 28 r subcompact
227 volkswagen new beetle 2.5 2008 5 auto(s6) f 20 29 r subcompact
228 volkswagen passat 1.8 1999 4 manual(m5) f 21 29 p midsize
229 volkswagen passat 1.8 1999 4 auto(l5) f 18 29 p midsize
230 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
231 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
232 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
233 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
234 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize
(rdatasets/openintro-simulated_scatter)

https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv [2033 4]:

:rownames :group :x :y
1 1 -15.74380273 35.56615175
2 1 6.34665115 23.52750121
3 1 24.54001114 -1.03170877
4 1 -22.02035224 19.75964793
5 1 22.46083327 -5.85090154
6 1 16.99409017 21.23115046
7 1 18.81878514 -5.28260778
8 1 52.84813702 -20.01482909
9 1 -24.22229471 48.07070892
10 1 57.82316783 -22.72697862
2023 30 352.00000000 29.18804414
2024 30 245.00000000 32.98072746
2025 30 382.00000000 27.89430176
2026 30 240.00000000 33.50059544
2027 30 319.00000000 27.75617312
2028 30 197.00000000 38.44752778
2029 30 316.00000000 29.91404536
2030 30 263.00000000 31.85003163
2031 30 410.00000000 34.96805690
2032 30 252.00000000 33.15689569
2033 30 297.00000000 31.50992258

6.3 Example datasets from Plotly

We can also use datasets from Plotly Sample Datasets

(tc/dataset
 "https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv"
 {:key-fn keyword
  :parser-fn {:OPENDATE :string
              :date_super :string}})

https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv [2992 16]:

:storenum :OPENDATE :date_super :conversion :st :county :STREETADDR :STRCITY :STRSTATE :ZIPCODE :type_store :LAT :LON :MONTH :DAY :YEAR
1 7/1/62 3/1/97 1 5 7 2110 WEST WALNUT Rogers AR 72756 Supercenter 36.342235 -94.07141 7 1 1962
2 8/1/64 3/1/96 1 5 9 1417 HWY 62/65 N Harrison AR 72601 Supercenter 36.236984 -93.09345 8 1 1964
4 8/1/65 3/1/02 1 5 7 2901 HWY 412 EAST Siloam Springs AR 72761 Supercenter 36.179905 -94.50208 8 1 1965
8 10/1/67 3/1/93 1 5 29 1621 NORTH BUSINESS 9 Morrilton AR 72110 Supercenter 35.156491 -92.75858 10 1 1967
7 10/1/67 5 119 3801 CAMP ROBINSON RD. North Little Rock AR 72118 Wal-Mart 34.813269 -92.30229 10 1 1967
10 7/1/68 3/1/98 1 40 21 2020 SOUTH MUSKOGEE Tahlequah OK 74464 Supercenter 35.923658 -94.97185 7 1 1968
13 11/1/68 3/1/96 1 29 97 2705 GRAND AVE Carthage MO 64836 Supercenter 37.168985 -94.31164 11 1 1968
12 7/1/68 3/1/94 1 40 131 1500 LYNN RIGGS BLVD Claremore OK 74017 Supercenter 36.327143 -95.61192 7 1 1968
11 3/1/68 2/20/02 1 5 5 65 WAL-MART DRIVE Mountain Home AR 72653 Supercenter 36.329026 -92.35781 3 1 1968
9 3/1/68 3/1/00 1 29 143 1303 SOUTH MAIN Sikeston MO 63801 Supercenter 36.891163 -89.58355 3 1 1968
5370 1/31/06 1/31/06 0 8 13 2514 Main St Longmont CO 80504 Supercenter 40.160138 -105.01772 1 31 2006
3608 1/31/06 1/31/06 0 39 85 6067 N Ridge Rd Madison OH 44057 Supercenter 41.800630 -81.06021 1 31 2006
5253 1/31/06 1/31/06 0 51 550 632 Grass Field Pkwy Chesapeake VA 23322 Supercenter 36.687543 -76.22905 1 31 2006
5471 1/31/06 1/31/06 0 39 139 2485 Possum Run Rd Mansfield OH 44903 Supercenter 40.766589 -82.51869 1 31 2006
5346 1/23/06 1/23/06 0 37 1 1318 Mebane Oaks Rd Mebane NC 27302 Supercenter 36.111449 -79.27142 1 23 2006
5313 1/23/06 1/23/06 0 29 183 6100 Ronald Reagan Blvd Lake Saint Louis MO 63367 Supercenter 38.796601 -90.78525 1 23 2006
5403 1/27/06 1/27/06 0 17 19 100 S High Cross Rd Urbana IL 61802 Supercenter 40.121648 -88.17649 1 27 2006
3347 1/23/06 1/23/06 0 12 105 7450 Cypress Gardens Blvd Winter Haven FL 33884 Supercenter 27.997387 -81.68256 1 23 2006
5485 1/27/06 17 31 2500 W 95th St Evergreen Park IL 60805 Wal-Mart 41.719933 -87.70249 1 27 2006
3425 1/27/06 1/27/06 0 48 201 9598 Rowlett Rd Houston TX 77034 Supercenter 29.636430 -95.21789 1 27 2006
5193 1/31/06 6 65 12721 Moreno Beach Dr Moreno Valley CA 92555 Wal-Mart 33.922823 -117.16837 1 31 2006

6.4 Example datasets from from tech.ml.dataset (TMD)

TMD’s repo also has some datasets that we can use:

(tc/dataset
 "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
 {:key-fn keyword})

https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv [560 3]:

:symbol :date :price
MSFT 2000-01-01 39.81
MSFT 2000-02-01 36.35
MSFT 2000-03-01 43.22
MSFT 2000-04-01 28.37
MSFT 2000-05-01 25.45
MSFT 2000-06-01 32.54
MSFT 2000-07-01 28.40
MSFT 2000-08-01 28.40
MSFT 2000-09-01 24.53
MSFT 2000-10-01 28.02
AAPL 2009-05-01 135.81
AAPL 2009-06-01 142.43
AAPL 2009-07-01 163.39
AAPL 2009-08-01 168.21
AAPL 2009-09-01 185.35
AAPL 2009-10-01 188.50
AAPL 2009-11-01 199.91
AAPL 2009-12-01 210.73
AAPL 2010-01-01 192.06
AAPL 2010-02-01 204.62
AAPL 2010-03-01 223.02
source: notebooks/noj_book/datasets.clj