6 Reading Datasets
author: Daniel Slutsky, Ken Huang
We may use various sources of datasets for our tutorials here.
(ns noj-book.datasets
(:require [tablecloth.api :as tc]
[scicloj.metamorph.ml.rdatasets :as rdatasets]
[scicloj.kindly.v4.kind :as kind]
))6.1 Reading data from files
Noj provides the correct setup, so that tablecloth can be used to read files in these formats:
- CSV and TSV
- Nippy
- Arrow
- Excel (xlsx)
- Parquet
Mostly the file type gets detected from the filename suffix (.csv, .tsv., .nippy, .xlsx, .parquet)
6.1.1 CVS and TSV
These support
- direct read from local disk
- as gz compressed file
- over http (as well gz compressed)
- wrapped in input-stream
See below for examples.
6.1.2 Arrow
For arrow suffix detection is not possible, as there is no standard suffix, so it must be given via option {:file-type arrow}
See further documentation here.
(tc/dataset "data/alldtypes.arrow-feather" {:file-type :arrow})data/alldtypes.arrow-feather [1000 15]:
| uints | longs | ubytes | strings | doubles | ushorts | local_times | local_dates | ints | instants | shorts | bytes | boolean | floats | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.0 | 0 | 13:39:59.908 | 2022-02-19 | 0 | 1645303199909000000 | 0 | 0 | true | 0.0 | 0 |
| 1 | 1 | 1 | 1 | 1.0 | 1 | 13:39:59.910 | 2022-02-19 | 1 | 1645303199911000000 | 1 | 1 | false | 1.0 | 1 |
| 2 | 2 | 2 | 2 | 2.0 | 2 | 13:39:59.910 | 2022-02-19 | 2 | 1645303199911000000 | 2 | 2 | true | 2.0 | 2 |
| 3 | 3 | 3 | 3 | 3.0 | 3 | 13:39:59.910 | 2022-02-19 | 3 | 1645303199911000000 | 3 | 3 | true | 3.0 | 3 |
| 4 | 4 | 4 | 4 | 4.0 | 4 | 13:39:59.910 | 2022-02-19 | 4 | 1645303199911000000 | 4 | 4 | false | 4.0 | 4 |
| 5 | 5 | 5 | 5 | 5.0 | 5 | 13:39:59.910 | 2022-02-19 | 5 | 1645303199911000000 | 5 | 5 | false | 5.0 | 5 |
| 6 | 6 | 6 | 6 | 6.0 | 6 | 13:39:59.910 | 2022-02-19 | 6 | 1645303199911000000 | 6 | 6 | true | 6.0 | 6 |
| 7 | 7 | 7 | 7 | 7.0 | 7 | 13:39:59.910 | 2022-02-19 | 7 | 1645303199911000000 | 7 | 7 | false | 7.0 | 7 |
| 8 | 8 | 8 | 8 | 8.0 | 8 | 13:39:59.910 | 2022-02-19 | 8 | 1645303199911000000 | 8 | 8 | false | 8.0 | 8 |
| 9 | 9 | 9 | 9 | 9.0 | 9 | 13:39:59.910 | 2022-02-19 | 9 | 1645303199911000000 | 9 | 9 | true | 9.0 | 9 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 989 | 989 | 221 | 989 | 989.0 | 989 | 13:39:59.923 | 2022-02-19 | 989 | 1645303199916000000 | 989 | -35 | 989.0 | 989 | |
| 990 | 990 | 222 | 990 | 990.0 | 990 | 13:39:59.923 | 2022-02-19 | 990 | 1645303199916000000 | 990 | -34 | 990.0 | 990 | |
| 991 | 991 | 223 | 991 | 991.0 | 991 | 13:39:59.923 | 2022-02-19 | 991 | 1645303199916000000 | 991 | -33 | 991.0 | 991 | |
| 992 | 992 | 224 | 992 | 992.0 | 992 | 13:39:59.923 | 2022-02-19 | 992 | 1645303199916000000 | 992 | -32 | 992.0 | 992 | |
| 993 | 993 | 225 | 993 | 993.0 | 993 | 13:39:59.923 | 2022-02-19 | 993 | 1645303199916000000 | 993 | -31 | 993.0 | 993 | |
| 994 | 994 | 226 | 994 | 994.0 | 994 | 13:39:59.923 | 2022-02-19 | 994 | 1645303199916000000 | 994 | -30 | 994.0 | 994 | |
| 995 | 995 | 227 | 995 | 995.0 | 995 | 13:39:59.923 | 2022-02-19 | 995 | 1645303199916000000 | 995 | -29 | 995.0 | 995 | |
| 996 | 996 | 228 | 996 | 996.0 | 996 | 13:39:59.923 | 2022-02-19 | 996 | 1645303199916000000 | 996 | -28 | 996.0 | 996 | |
| 997 | 997 | 229 | 997 | 997.0 | 997 | 13:39:59.923 | 2022-02-19 | 997 | 1645303199916000000 | 997 | -27 | 997.0 | 997 | |
| 998 | 998 | 230 | 998 | 998.0 | 998 | 13:39:59.923 | 2022-02-19 | 998 | 1645303199916000000 | 998 | -26 | 998.0 | 998 | |
| 999 | 999 | 231 | 999 | 999.0 | 999 | 13:39:59.923 | 2022-02-19 | 999 | 1645303199916000000 | 999 | -25 | 999.0 | 999 |
6.1.3 Excel
Read single sheet Excel file:
(comment
(tc/dataset "data/singleSheet.xlsx"))see documentation here.
An Excel file can have various sheets, which can be accessed as a sequence of datasets:
(comment
(tech.v3.libs.fastexcel/workbook->datasets "data/twoSheets.xlsx"))see documentation here.
6.1.4 Parquet
Can be read by:
(comment (tc/dataset "data/userdata1.parquet"))see documentation here.
6.2 Example datasets from rdatasets
One of the main collections of example dataset is the rdatasets namespace of metamorph.ml, which can fetch datasets from the Rdatasets collection.
(rdatasets/datasets-iris)https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:
| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| … | … | … | … | … | … |
| 140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
| 142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
| 145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
(rdatasets/ggplot2-mpg)https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv [234 12]:
| :rownames | :manufacturer | :model | :displ | :year | :cyl | :trans | :drv | :cty | :hwy | :fl | :class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| 6 | audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
| 7 | audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
| 8 | audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
| 9 | audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
| 10 | audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 224 | volkswagen | new beetle | 2.0 | 1999 | 4 | manual(m5) | f | 21 | 29 | r | subcompact |
| 225 | volkswagen | new beetle | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | subcompact |
| 226 | volkswagen | new beetle | 2.5 | 2008 | 5 | manual(m5) | f | 20 | 28 | r | subcompact |
| 227 | volkswagen | new beetle | 2.5 | 2008 | 5 | auto(s6) | f | 20 | 29 | r | subcompact |
| 228 | volkswagen | passat | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | midsize |
| 229 | volkswagen | passat | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | midsize |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
| 231 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
| 233 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
| 234 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
(rdatasets/openintro-simulated_scatter)https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv [2033 4]:
| :rownames | :group | :x | :y |
|---|---|---|---|
| 1 | 1 | -15.74380273 | 35.56615175 |
| 2 | 1 | 6.34665115 | 23.52750121 |
| 3 | 1 | 24.54001114 | -1.03170877 |
| 4 | 1 | -22.02035224 | 19.75964793 |
| 5 | 1 | 22.46083327 | -5.85090154 |
| 6 | 1 | 16.99409017 | 21.23115046 |
| 7 | 1 | 18.81878514 | -5.28260778 |
| 8 | 1 | 52.84813702 | -20.01482909 |
| 9 | 1 | -24.22229471 | 48.07070892 |
| 10 | 1 | 57.82316783 | -22.72697862 |
| … | … | … | … |
| 2023 | 30 | 352.00000000 | 29.18804414 |
| 2024 | 30 | 245.00000000 | 32.98072746 |
| 2025 | 30 | 382.00000000 | 27.89430176 |
| 2026 | 30 | 240.00000000 | 33.50059544 |
| 2027 | 30 | 319.00000000 | 27.75617312 |
| 2028 | 30 | 197.00000000 | 38.44752778 |
| 2029 | 30 | 316.00000000 | 29.91404536 |
| 2030 | 30 | 263.00000000 | 31.85003163 |
| 2031 | 30 | 410.00000000 | 34.96805690 |
| 2032 | 30 | 252.00000000 | 33.15689569 |
| 2033 | 30 | 297.00000000 | 31.50992258 |
6.3 Example datasets from Plotly
We can also use datasets from Plotly Sample Datasets
(tc/dataset
"https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv"
{:key-fn keyword
:parser-fn {:OPENDATE :string
:date_super :string}})https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv [2992 16]:
| :storenum | :OPENDATE | :date_super | :conversion | :st | :county | :STREETADDR | :STRCITY | :STRSTATE | :ZIPCODE | :type_store | :LAT | :LON | :MONTH | :DAY | :YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7/1/62 | 3/1/97 | 1 | 5 | 7 | 2110 WEST WALNUT | Rogers | AR | 72756 | Supercenter | 36.342235 | -94.07141 | 7 | 1 | 1962 |
| 2 | 8/1/64 | 3/1/96 | 1 | 5 | 9 | 1417 HWY 62/65 N | Harrison | AR | 72601 | Supercenter | 36.236984 | -93.09345 | 8 | 1 | 1964 |
| 4 | 8/1/65 | 3/1/02 | 1 | 5 | 7 | 2901 HWY 412 EAST | Siloam Springs | AR | 72761 | Supercenter | 36.179905 | -94.50208 | 8 | 1 | 1965 |
| 8 | 10/1/67 | 3/1/93 | 1 | 5 | 29 | 1621 NORTH BUSINESS 9 | Morrilton | AR | 72110 | Supercenter | 35.156491 | -92.75858 | 10 | 1 | 1967 |
| 7 | 10/1/67 | 5 | 119 | 3801 CAMP ROBINSON RD. | North Little Rock | AR | 72118 | Wal-Mart | 34.813269 | -92.30229 | 10 | 1 | 1967 | ||
| 10 | 7/1/68 | 3/1/98 | 1 | 40 | 21 | 2020 SOUTH MUSKOGEE | Tahlequah | OK | 74464 | Supercenter | 35.923658 | -94.97185 | 7 | 1 | 1968 |
| 13 | 11/1/68 | 3/1/96 | 1 | 29 | 97 | 2705 GRAND AVE | Carthage | MO | 64836 | Supercenter | 37.168985 | -94.31164 | 11 | 1 | 1968 |
| 12 | 7/1/68 | 3/1/94 | 1 | 40 | 131 | 1500 LYNN RIGGS BLVD | Claremore | OK | 74017 | Supercenter | 36.327143 | -95.61192 | 7 | 1 | 1968 |
| 11 | 3/1/68 | 2/20/02 | 1 | 5 | 5 | 65 WAL-MART DRIVE | Mountain Home | AR | 72653 | Supercenter | 36.329026 | -92.35781 | 3 | 1 | 1968 |
| 9 | 3/1/68 | 3/1/00 | 1 | 29 | 143 | 1303 SOUTH MAIN | Sikeston | MO | 63801 | Supercenter | 36.891163 | -89.58355 | 3 | 1 | 1968 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 5370 | 1/31/06 | 1/31/06 | 0 | 8 | 13 | 2514 Main St | Longmont | CO | 80504 | Supercenter | 40.160138 | -105.01772 | 1 | 31 | 2006 |
| 3608 | 1/31/06 | 1/31/06 | 0 | 39 | 85 | 6067 N Ridge Rd | Madison | OH | 44057 | Supercenter | 41.800630 | -81.06021 | 1 | 31 | 2006 |
| 5253 | 1/31/06 | 1/31/06 | 0 | 51 | 550 | 632 Grass Field Pkwy | Chesapeake | VA | 23322 | Supercenter | 36.687543 | -76.22905 | 1 | 31 | 2006 |
| 5471 | 1/31/06 | 1/31/06 | 0 | 39 | 139 | 2485 Possum Run Rd | Mansfield | OH | 44903 | Supercenter | 40.766589 | -82.51869 | 1 | 31 | 2006 |
| 5346 | 1/23/06 | 1/23/06 | 0 | 37 | 1 | 1318 Mebane Oaks Rd | Mebane | NC | 27302 | Supercenter | 36.111449 | -79.27142 | 1 | 23 | 2006 |
| 5313 | 1/23/06 | 1/23/06 | 0 | 29 | 183 | 6100 Ronald Reagan Blvd | Lake Saint Louis | MO | 63367 | Supercenter | 38.796601 | -90.78525 | 1 | 23 | 2006 |
| 5403 | 1/27/06 | 1/27/06 | 0 | 17 | 19 | 100 S High Cross Rd | Urbana | IL | 61802 | Supercenter | 40.121648 | -88.17649 | 1 | 27 | 2006 |
| 3347 | 1/23/06 | 1/23/06 | 0 | 12 | 105 | 7450 Cypress Gardens Blvd | Winter Haven | FL | 33884 | Supercenter | 27.997387 | -81.68256 | 1 | 23 | 2006 |
| 5485 | 1/27/06 | 17 | 31 | 2500 W 95th St | Evergreen Park | IL | 60805 | Wal-Mart | 41.719933 | -87.70249 | 1 | 27 | 2006 | ||
| 3425 | 1/27/06 | 1/27/06 | 0 | 48 | 201 | 9598 Rowlett Rd | Houston | TX | 77034 | Supercenter | 29.636430 | -95.21789 | 1 | 27 | 2006 |
| 5193 | 1/31/06 | 6 | 65 | 12721 Moreno Beach Dr | Moreno Valley | CA | 92555 | Wal-Mart | 33.922823 | -117.16837 | 1 | 31 | 2006 |
6.4 Example datasets from from tech.ml.dataset (TMD)
TMD’s repo also has some datasets that we can use:
(tc/dataset
"https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
{:key-fn keyword})https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv [560 3]:
| :symbol | :date | :price |
|---|---|---|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
| MSFT | 2000-06-01 | 32.54 |
| MSFT | 2000-07-01 | 28.40 |
| MSFT | 2000-08-01 | 28.40 |
| MSFT | 2000-09-01 | 24.53 |
| MSFT | 2000-10-01 | 28.02 |
| … | … | … |
| AAPL | 2009-05-01 | 135.81 |
| AAPL | 2009-06-01 | 142.43 |
| AAPL | 2009-07-01 | 163.39 |
| AAPL | 2009-08-01 | 168.21 |
| AAPL | 2009-09-01 | 185.35 |
| AAPL | 2009-10-01 | 188.50 |
| AAPL | 2009-11-01 | 199.91 |
| AAPL | 2009-12-01 | 210.73 |
| AAPL | 2010-01-01 | 192.06 |
| AAPL | 2010-02-01 | 204.62 |
| AAPL | 2010-03-01 | 223.02 |
source: notebooks/noj_book/datasets.clj