6  Example Datasets

author: Daniel Slutsky, Ken Huang

The tutorials in this book use example datasets from several sources.

(ns noj-book.datasets
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.rdatasets :as rdatasets]))

6.1 rdatasets

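One of the main sources is the rdatasets namespace of metamorph.ml, which fetches datasets from the Rdatasets collection. Each function is named after an R package and one of its datasets: for example, rdatasets/datasets-iris fetches iris from the datasets package.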

(rdatasets/datasets-iris)

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

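| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width |  :species |
|----------:|--------------:|-------------:|--------------:|-------------:|-----------|
|         1 |           5.1 |          3.5 |           1.4 |          0.2 |    setosa |
|         2 |           4.9 |          3.0 |           1.4 |          0.2 |    setosa |
|         3 |           4.7 |          3.2 |           1.3 |          0.2 |    setosa |
|         4 |           4.6 |          3.1 |           1.5 |          0.2 |    setosa |
|         5 |           5.0 |          3.6 |           1.4 |          0.2 |    setosa |
|         6 |           5.4 |          3.9 |           1.7 |          0.4 |    setosa |
|         7 |           4.6 |          3.4 |           1.4 |          0.3 |    setosa |
|         8 |           5.0 |          3.4 |           1.5 |          0.2 |    setosa |
|         9 |           4.4 |          2.9 |           1.4 |          0.2 |    setosa |
|        10 |           4.9 |          3.1 |           1.5 |          0.1 |    setosa |
|       ... |           ... |          ... |           ... |          ... |       ... |
|       140 |           6.9 |          3.1 |           5.4 |          2.1 | virginica |
|       141 |           6.7 |          3.1 |           5.6 |          2.4 | virginica |
|       142 |           6.9 |          3.1 |           5.1 |          2.3 | virginica |
|       143 |           5.8 |          2.7 |           5.1 |          1.9 | virginica |
|       144 |           6.8 |          3.2 |           5.9 |          2.3 | virginica |
|       145 |           6.7 |          3.3 |           5.7 |          2.5 | virginica |
|       146 |           6.7 |          3.0 |           5.2 |          2.3 | virginica |
|       147 |           6.3 |          2.5 |           5.0 |          1.9 | virginica |
|       148 |           6.5 |          3.0 |           5.2 |          2.0 | virginica |
|       149 |           6.2 |          3.4 |           5.4 |          2.3 | virginica |
|       150 |           5.9 |          3.0 |           5.1 |          1.8 | virginica |

(rdatasets/ggplot2-mpg)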

https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv [234 12]:

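| :rownames | :manufacturer |     :model | :displ | :year | :cyl |     :trans | :drv | :cty | :hwy | :fl |     :class |
|----------:|---------------|------------|-------:|------:|-----:|------------|------|-----:|-----:|-----|------------|
|         1 |          audi |         a4 |    1.8 |  1999 |    4 |   auto(l5) |    f |   18 |   29 |   p |    compact |
|         2 |          audi |         a4 |    1.8 |  1999 |    4 | manual(m5) |    f |   21 |   29 |   p |    compact |
|         3 |          audi |         a4 |    2.0 |  2008 |    4 | manual(m6) |    f |   20 |   31 |   p |    compact |
|         4 |          audi |         a4 |    2.0 |  2008 |    4 |   auto(av) |    f |   21 |   30 |   p |    compact |
|         5 |          audi |         a4 |    2.8 |  1999 |    6 |   auto(l5) |    f |   16 |   26 |   p |    compact |
|         6 |          audi |         a4 |    2.8 |  1999 |    6 | manual(m5) |    f |   18 |   26 |   p |    compact |
|         7 |          audi |         a4 |    3.1 |  2008 |    6 |   auto(av) |    f |   18 |   27 |   p |    compact |
|         8 |          audi | a4 quattro |    1.8 |  1999 |    4 | manual(m5) |    4 |   18 |   26 |   p |    compact |
|         9 |          audi | a4 quattro |    1.8 |  1999 |    4 |   auto(l5) |    4 |   16 |   25 |   p |    compact |
|        10 |          audi | a4 quattro |    2.0 |  2008 |    4 | manual(m6) |    4 |   20 |   28 |   p |    compact |
|       ... |           ... |        ... |    ... |   ... |  ... |        ... |  ... |  ... |  ... | ... |        ... |
|       224 |    volkswagen | new beetle |    2.0 |  1999 |    4 | manual(m5) |    f |   21 |   29 |   r | subcompact |
|       225 |    volkswagen | new beetle |    2.0 |  1999 |    4 |   auto(l4) |    f |   19 |   26 |   r | subcompact |
|       226 |    volkswagen | new beetle |    2.5 |  2008 |    5 | manual(m5) |    f |   20 |   28 |   r | subcompact |
|       227 |    volkswagen | new beetle |    2.5 |  2008 |    5 |   auto(s6) |    f |   20 |   29 |   r | subcompact |
|       228 |    volkswagen |     passat |    1.8 |  1999 |    4 | manual(m5) |    f |   21 |   29 |   p |    midsize |
|       229 |    volkswagen |     passat |    1.8 |  1999 |    4 |   auto(l5) |    f |   18 |   29 |   p |    midsize |
|       230 |    volkswagen |     passat |    2.0 |  2008 |    4 |   auto(s6) |    f |   19 |   28 |   p |    midsize |
|       231 |    volkswagen |     passat |    2.0 |  2008 |    4 | manual(m6) |    f |   21 |   29 |   p |    midsize |
|       232 |    volkswagen |     passat |    2.8 |  1999 |    6 |   auto(l5) |    f |   16 |   26 |   p |    midsize |
|       233 |    volkswagen |     passat |    2.8 |  1999 |    6 | manual(m5) |    f |   18 |   26 |   p |    midsize |
|       234 |    volkswagen |     passat |    3.6 |  2008 |    6 |   auto(s6) |    f |   17 |   26 |   p |    midsize |

(rdatasets/openintro-simulated_scatter)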

https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv [2033 4]:

| :rownames | :group |           :x |           :y |
|----------:|-------:|-------------:|-------------:|
|         1 |      1 | -15.74380273 |  35.56615175 |
|         2 |      1 |   6.34665115 |  23.52750121 |
|         3 |      1 |  24.54001114 |  -1.03170877 |
|         4 |      1 | -22.02035224 |  19.75964793 |
|         5 |      1 |  22.46083327 |  -5.85090154 |
|         6 |      1 |  16.99409017 |  21.23115046 |
|         7 |      1 |  18.81878514 |  -5.28260778 |
|         8 |      1 |  52.84813702 | -20.01482909 |
|         9 |      1 | -24.22229471 |  48.07070892 |
|        10 |      1 |  57.82316783 | -22.72697862 |
|       ... |    ... |          ... |          ... |
|      2023 |     30 | 352.00000000 |  29.18804414 |
|      2024 |     30 | 245.00000000 |  32.98072746 |
|      2025 |     30 | 382.00000000 |  27.89430176 |
|      2026 |     30 | 240.00000000 |  33.50059544 |
|      2027 |     30 | 319.00000000 |  27.75617312 |
|      2028 |     30 | 197.00000000 |  38.44752778 |
|      2029 |     30 | 316.00000000 |  29.91404536 |
|      2030 |     30 | 263.00000000 |  31.85003163 |
|      2031 |     30 | 410.00000000 |  34.96805690 |
|      2032 |     30 | 252.00000000 |  33.15689569 |
|      2033 |     30 | 297.00000000 |  31.50992258 |
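
To check what one of these functions returned, we can inspect the result with Tablecloth. The following is a minimal sketch, assuming network access to the Rdatasets site; it uses only the tc and rdatasets aliases from the namespace declaration at the top of this chapter, and the name iris is introduced here purely for illustration.

```clojure
(require '[tablecloth.api :as tc]
         '[scicloj.metamorph.ml.rdatasets :as rdatasets])

;; Fetch once and keep a reference (`iris` is a name chosen for this sketch).
(def iris (rdatasets/datasets-iris))

;; [rows columns], the same pair shown in the printed header of the dataset.
(tc/shape iris)

;; The column names are keywords such as :sepal-length and :species.
(tc/column-names iris)

;; Count the rows that belong to each species.
(-> iris
    (tc/group-by [:species])
    (tc/aggregate {:n tc/row-count}))
```

The same three checks apply to any other dataset in this chapter, since they all arrive as Tablecloth datasets.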

6.2 Plotly

We can also use datasets from the Plotly Sample Datasets repository:

(tc/dataset
 "https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv"
 {:key-fn keyword
  :parser-fn {:OPENDATE :string
              :date_super :string}})

https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv [2992 16]:

| :storenum | :OPENDATE | :date_super | :conversion | :st | :county |               :STREETADDR |          :STRCITY | :STRSTATE | :ZIPCODE | :type_store |      :LAT |       :LON | :MONTH | :DAY | :YEAR |
|----------:|-----------|-------------|------------:|----:|--------:|---------------------------|-------------------|-----------|---------:|-------------|----------:|-----------:|-------:|-----:|------:|
|         1 |    7/1/62 |      3/1/97 |           1 |   5 |       7 |          2110 WEST WALNUT |            Rogers |        AR |    72756 | Supercenter | 36.342235 |  -94.07141 |      7 |    1 |  1962 |
|         2 |    8/1/64 |      3/1/96 |           1 |   5 |       9 |          1417 HWY 62/65 N |          Harrison |        AR |    72601 | Supercenter | 36.236984 |  -93.09345 |      8 |    1 |  1964 |
|         4 |    8/1/65 |      3/1/02 |           1 |   5 |       7 |         2901 HWY 412 EAST |    Siloam Springs |        AR |    72761 | Supercenter | 36.179905 |  -94.50208 |      8 |    1 |  1965 |
|         8 |   10/1/67 |      3/1/93 |           1 |   5 |      29 |     1621 NORTH BUSINESS 9 |         Morrilton |        AR |    72110 | Supercenter | 35.156491 |  -92.75858 |     10 |    1 |  1967 |
|         7 |   10/1/67 |             |             |   5 |     119 |    3801 CAMP ROBINSON RD. | North Little Rock |        AR |    72118 |    Wal-Mart | 34.813269 |  -92.30229 |     10 |    1 |  1967 |
|        10 |    7/1/68 |      3/1/98 |           1 |  40 |      21 |       2020 SOUTH MUSKOGEE |         Tahlequah |        OK |    74464 | Supercenter | 35.923658 |  -94.97185 |      7 |    1 |  1968 |
|        13 |   11/1/68 |      3/1/96 |           1 |  29 |      97 |            2705 GRAND AVE |          Carthage |        MO |    64836 | Supercenter | 37.168985 |  -94.31164 |     11 |    1 |  1968 |
|        12 |    7/1/68 |      3/1/94 |           1 |  40 |     131 |      1500 LYNN RIGGS BLVD |         Claremore |        OK |    74017 | Supercenter | 36.327143 |  -95.61192 |      7 |    1 |  1968 |
|        11 |    3/1/68 |     2/20/02 |           1 |   5 |       5 |         65 WAL-MART DRIVE |     Mountain Home |        AR |    72653 | Supercenter | 36.329026 |  -92.35781 |      3 |    1 |  1968 |
|         9 |    3/1/68 |      3/1/00 |           1 |  29 |     143 |           1303 SOUTH MAIN |          Sikeston |        MO |    63801 | Supercenter | 36.891163 |  -89.58355 |      3 |    1 |  1968 |
|       ... |       ... |         ... |         ... | ... |     ... |                       ... |               ... |       ... |      ... |         ... |       ... |        ... |    ... |  ... |   ... |
|      5370 |   1/31/06 |     1/31/06 |           0 |   8 |      13 |              2514 Main St |          Longmont |        CO |    80504 | Supercenter | 40.160138 | -105.01772 |      1 |   31 |  2006 |
|      3608 |   1/31/06 |     1/31/06 |           0 |  39 |      85 |           6067 N Ridge Rd |           Madison |        OH |    44057 | Supercenter | 41.800630 |  -81.06021 |      1 |   31 |  2006 |
|      5253 |   1/31/06 |     1/31/06 |           0 |  51 |     550 |      632 Grass Field Pkwy |        Chesapeake |        VA |    23322 | Supercenter | 36.687543 |  -76.22905 |      1 |   31 |  2006 |
|      5471 |   1/31/06 |     1/31/06 |           0 |  39 |     139 |        2485 Possum Run Rd |         Mansfield |        OH |    44903 | Supercenter | 40.766589 |  -82.51869 |      1 |   31 |  2006 |
|      5346 |   1/23/06 |     1/23/06 |           0 |  37 |       1 |       1318 Mebane Oaks Rd |            Mebane |        NC |    27302 | Supercenter | 36.111449 |  -79.27142 |      1 |   23 |  2006 |
|      5313 |   1/23/06 |     1/23/06 |           0 |  29 |     183 |   6100 Ronald Reagan Blvd |  Lake Saint Louis |        MO |    63367 | Supercenter | 38.796601 |  -90.78525 |      1 |   23 |  2006 |
|      5403 |   1/27/06 |     1/27/06 |           0 |  17 |      19 |       100 S High Cross Rd |            Urbana |        IL |    61802 | Supercenter | 40.121648 |  -88.17649 |      1 |   27 |  2006 |
|      3347 |   1/23/06 |     1/23/06 |           0 |  12 |     105 | 7450 Cypress Gardens Blvd |      Winter Haven |        FL |    33884 | Supercenter | 27.997387 |  -81.68256 |      1 |   23 |  2006 |
|      5485 |   1/27/06 |             |             |  17 |      31 |            2500 W 95th St |    Evergreen Park |        IL |    60805 |    Wal-Mart | 41.719933 |  -87.70249 |      1 |   27 |  2006 |
|      3425 |   1/27/06 |     1/27/06 |           0 |  48 |     201 |           9598 Rowlett Rd |           Houston |        TX |    77034 | Supercenter | 29.636430 |  -95.21789 |      1 |   27 |  2006 |
|      5193 |   1/31/06 |             |             |   6 |      65 |     12721 Moreno Beach Dr |     Moreno Valley |        CA |    92555 |    Wal-Mart | 33.922823 | -117.16837 |      1 |   31 |  2006 |
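
The options map in the call above does two things: :key-fn keyword turns the CSV header strings into keyword column names, and :parser-fn asks for the listed columns to be read as strings. The sketch below applies the same options to two rows copied from the Walmart data, read from an in-memory stream so that it runs offline. The csv->dataset helper and the names csv-text and stores are hypothetical, introduced only for this illustration, and the sketch assumes that an input stream passed together with :file-type :csv is parsed the same way as the URL.

```clojure
(require '[tablecloth.api :as tc])
(import '[java.io ByteArrayInputStream])

;; Three columns and two rows copied from the Walmart store openings data.
(def csv-text
  "storenum,OPENDATE,YEAR\n1,7/1/62,1962\n2,8/1/64,1964\n")

(defn csv->dataset
  "Hypothetical helper: parse CSV text with the given dataset options.
  Assumes tc/dataset accepts an input stream when :file-type is set."
  [^String text options]
  (tc/dataset (ByteArrayInputStream. (.getBytes text "UTF-8"))
              (assoc options :file-type :csv)))

(def stores
  (csv->dataset csv-text
                {:key-fn keyword
                 :parser-fn {:OPENDATE :string}}))

;; The header strings became keyword column names.
(tc/column-names stores)

;; The date column is kept as plain strings.
(first (tc/column stores :OPENDATE))
```

Keeping the dates as strings leaves the choice of date format to later steps; the :MONTH, :DAY and :YEAR columns already carry the same information as numbers.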

6.3 tech.ml.dataset (TMD)

The TMD repository also includes some datasets, among its test data, that we can use:

(tc/dataset
 "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
 {:key-fn keyword})

https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv [560 3]:

| :symbol |      :date | :price |
|---------|------------|-------:|
|    MSFT | 2000-01-01 |  39.81 |
|    MSFT | 2000-02-01 |  36.35 |
|    MSFT | 2000-03-01 |  43.22 |
|    MSFT | 2000-04-01 |  28.37 |
|    MSFT | 2000-05-01 |  25.45 |
|    MSFT | 2000-06-01 |  32.54 |
|    MSFT | 2000-07-01 |  28.40 |
|    MSFT | 2000-08-01 |  28.40 |
|    MSFT | 2000-09-01 |  24.53 |
|    MSFT | 2000-10-01 |  28.02 |
|     ... |        ... |    ... |
|    AAPL | 2009-05-01 | 135.81 |
|    AAPL | 2009-06-01 | 142.43 |
|    AAPL | 2009-07-01 | 163.39 |
|    AAPL | 2009-08-01 | 168.21 |
|    AAPL | 2009-09-01 | 185.35 |
|    AAPL | 2009-10-01 | 188.50 |
|    AAPL | 2009-11-01 | 199.91 |
|    AAPL | 2009-12-01 | 210.73 |
|    AAPL | 2010-01-01 | 192.06 |
|    AAPL | 2010-02-01 | 204.62 |
|    AAPL | 2010-03-01 | 223.02 |

source: notebooks/noj_book/datasets.clj