6 Example Datasets
author: Daniel Slutsky, Ken Huang
The tutorials in this book draw on datasets from several sources.
(ns noj-book.datasets
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.rdatasets :as rdatasets]))
6.1 rdatasets
One of the main sources is the rdatasets
namespace of metamorph.ml, which can fetch datasets from the Rdatasets collection.
(rdatasets/datasets-iris)
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:
:rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
… | … | … | … | … | … |
140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
(rdatasets/ggplot2-mpg)
https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv [234 12]:
:rownames | :manufacturer | :model | :displ | :year | :cyl | :trans | :drv | :cty | :hwy | :fl | :class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
6 | audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
7 | audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
8 | audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
9 | audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
10 | audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
… | … | … | … | … | … | … | … | … | … | … | … |
224 | volkswagen | new beetle | 2.0 | 1999 | 4 | manual(m5) | f | 21 | 29 | r | subcompact |
225 | volkswagen | new beetle | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | subcompact |
226 | volkswagen | new beetle | 2.5 | 2008 | 5 | manual(m5) | f | 20 | 28 | r | subcompact |
227 | volkswagen | new beetle | 2.5 | 2008 | 5 | auto(s6) | f | 20 | 29 | r | subcompact |
228 | volkswagen | passat | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | midsize |
229 | volkswagen | passat | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | midsize |
230 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
231 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
232 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
233 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
234 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
(rdatasets/openintro-simulated_scatter)
https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv [2033 4]:
:rownames | :group | :x | :y |
---|---|---|---|
1 | 1 | -15.74380273 | 35.56615175 |
2 | 1 | 6.34665115 | 23.52750121 |
3 | 1 | 24.54001114 | -1.03170877 |
4 | 1 | -22.02035224 | 19.75964793 |
5 | 1 | 22.46083327 | -5.85090154 |
6 | 1 | 16.99409017 | 21.23115046 |
7 | 1 | 18.81878514 | -5.28260778 |
8 | 1 | 52.84813702 | -20.01482909 |
9 | 1 | -24.22229471 | 48.07070892 |
10 | 1 | 57.82316783 | -22.72697862 |
… | … | … | … |
2023 | 30 | 352.00000000 | 29.18804414 |
2024 | 30 | 245.00000000 | 32.98072746 |
2025 | 30 | 382.00000000 | 27.89430176 |
2026 | 30 | 240.00000000 | 33.50059544 |
2027 | 30 | 319.00000000 | 27.75617312 |
2028 | 30 | 197.00000000 | 38.44752778 |
2029 | 30 | 316.00000000 | 29.91404536 |
2030 | 30 | 263.00000000 | 31.85003163 |
2031 | 30 | 410.00000000 | 34.96805690 |
2032 | 30 | 252.00000000 | 33.15689569 |
2033 | 30 | 297.00000000 | 31.50992258 |
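The values returned by these functions can be passed to the usual tablecloth functions. The following is a minimal sketch, assuming network access is available to fetch the CSV; the namespace name and the vars `iris` and `per-species` are made up for illustration.

```clojure
(ns noj-book.rdatasets-sketch ; hypothetical namespace, not part of the book
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.rdatasets :as rdatasets]))

;; Assumption: this call downloads the CSV from the Rdatasets site,
;; so it needs network access.
(def iris (rdatasets/datasets-iris))

;; [row-count column-count]; the printout above reports [150 6].
(tc/shape iris)

;; Column names are keywords, with :rownames as the first column.
(tc/column-names iris)

;; Drop the :rownames index column and count the rows per species.
(def per-species
  (-> iris
      (tc/drop-columns [:rownames])
      (tc/group-by [:species])
      (tc/aggregate {:n tc/row-count})))

per-species
```

The `:rownames` column comes from the CSV export of the R data frame, so dropping it before analysis is often convenient.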
6.2 Plotly
We can also use datasets from the Plotly Sample Datasets repository:
(tc/dataset "https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv"
            {:key-fn keyword
             :parser-fn {:OPENDATE :string
                         :date_super :string}})
https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv [2992 16]:
:storenum | :OPENDATE | :date_super | :conversion | :st | :county | :STREETADDR | :STRCITY | :STRSTATE | :ZIPCODE | :type_store | :LAT | :LON | :MONTH | :DAY | :YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7/1/62 | 3/1/97 | 1 | 5 | 7 | 2110 WEST WALNUT | Rogers | AR | 72756 | Supercenter | 36.342235 | -94.07141 | 7 | 1 | 1962 |
2 | 8/1/64 | 3/1/96 | 1 | 5 | 9 | 1417 HWY 62/65 N | Harrison | AR | 72601 | Supercenter | 36.236984 | -93.09345 | 8 | 1 | 1964 |
4 | 8/1/65 | 3/1/02 | 1 | 5 | 7 | 2901 HWY 412 EAST | Siloam Springs | AR | 72761 | Supercenter | 36.179905 | -94.50208 | 8 | 1 | 1965 |
8 | 10/1/67 | 3/1/93 | 1 | 5 | 29 | 1621 NORTH BUSINESS 9 | Morrilton | AR | 72110 | Supercenter | 35.156491 | -92.75858 | 10 | 1 | 1967 |
7 | 10/1/67 |  |  | 5 | 119 | 3801 CAMP ROBINSON RD. | North Little Rock | AR | 72118 | Wal-Mart | 34.813269 | -92.30229 | 10 | 1 | 1967 |
10 | 7/1/68 | 3/1/98 | 1 | 40 | 21 | 2020 SOUTH MUSKOGEE | Tahlequah | OK | 74464 | Supercenter | 35.923658 | -94.97185 | 7 | 1 | 1968 |
13 | 11/1/68 | 3/1/96 | 1 | 29 | 97 | 2705 GRAND AVE | Carthage | MO | 64836 | Supercenter | 37.168985 | -94.31164 | 11 | 1 | 1968 |
12 | 7/1/68 | 3/1/94 | 1 | 40 | 131 | 1500 LYNN RIGGS BLVD | Claremore | OK | 74017 | Supercenter | 36.327143 | -95.61192 | 7 | 1 | 1968 |
11 | 3/1/68 | 2/20/02 | 1 | 5 | 5 | 65 WAL-MART DRIVE | Mountain Home | AR | 72653 | Supercenter | 36.329026 | -92.35781 | 3 | 1 | 1968 |
9 | 3/1/68 | 3/1/00 | 1 | 29 | 143 | 1303 SOUTH MAIN | Sikeston | MO | 63801 | Supercenter | 36.891163 | -89.58355 | 3 | 1 | 1968 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
5370 | 1/31/06 | 1/31/06 | 0 | 8 | 13 | 2514 Main St | Longmont | CO | 80504 | Supercenter | 40.160138 | -105.01772 | 1 | 31 | 2006 |
3608 | 1/31/06 | 1/31/06 | 0 | 39 | 85 | 6067 N Ridge Rd | Madison | OH | 44057 | Supercenter | 41.800630 | -81.06021 | 1 | 31 | 2006 |
5253 | 1/31/06 | 1/31/06 | 0 | 51 | 550 | 632 Grass Field Pkwy | Chesapeake | VA | 23322 | Supercenter | 36.687543 | -76.22905 | 1 | 31 | 2006 |
5471 | 1/31/06 | 1/31/06 | 0 | 39 | 139 | 2485 Possum Run Rd | Mansfield | OH | 44903 | Supercenter | 40.766589 | -82.51869 | 1 | 31 | 2006 |
5346 | 1/23/06 | 1/23/06 | 0 | 37 | 1 | 1318 Mebane Oaks Rd | Mebane | NC | 27302 | Supercenter | 36.111449 | -79.27142 | 1 | 23 | 2006 |
5313 | 1/23/06 | 1/23/06 | 0 | 29 | 183 | 6100 Ronald Reagan Blvd | Lake Saint Louis | MO | 63367 | Supercenter | 38.796601 | -90.78525 | 1 | 23 | 2006 |
5403 | 1/27/06 | 1/27/06 | 0 | 17 | 19 | 100 S High Cross Rd | Urbana | IL | 61802 | Supercenter | 40.121648 | -88.17649 | 1 | 27 | 2006 |
3347 | 1/23/06 | 1/23/06 | 0 | 12 | 105 | 7450 Cypress Gardens Blvd | Winter Haven | FL | 33884 | Supercenter | 27.997387 | -81.68256 | 1 | 23 | 2006 |
5485 | 1/27/06 |  |  | 17 | 31 | 2500 W 95th St | Evergreen Park | IL | 60805 | Wal-Mart | 41.719933 | -87.70249 | 1 | 27 | 2006 |
3425 | 1/27/06 | 1/27/06 | 0 | 48 | 201 | 9598 Rowlett Rd | Houston | TX | 77034 | Supercenter | 29.636430 | -95.21789 | 1 | 27 | 2006 |
5193 | 1/31/06 |  |  | 6 | 65 | 12721 Moreno Beach Dr | Moreno Valley | CA | 92555 | Wal-Mart | 33.922823 | -117.16837 | 1 | 31 | 2006 |
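The `:parser-fn` option in the call above maps column names to the type each column should be read as. The sketch below illustrates this offline on a three-row excerpt of the table, trimmed to four of its sixteen columns and written to a temporary file. The namespace name, `excerpt-file`, and `stores` are made up for illustration, and the sketch does not check what type the default inference would have picked for these columns.

```clojure
(ns noj-book.parser-fn-sketch ; hypothetical namespace, not part of the book
  (:require [tablecloth.api :as tc]))

;; Three rows copied from the printed table, trimmed to four columns.
;; Store 7 has no :date_super value.
(def excerpt-file
  (doto (java.io.File/createTempFile "walmart-excerpt" ".csv")
    (.deleteOnExit)
    (spit (str "storenum,OPENDATE,date_super,type_store\n"
               "1,7/1/62,3/1/97,Supercenter\n"
               "2,8/1/64,3/1/96,Supercenter\n"
               "7,10/1/67,,Wal-Mart\n"))))

;; Same options as in the call above, applied to the local excerpt.
(def stores
  (tc/dataset (.getPath excerpt-file)
              {:key-fn keyword
               :parser-fn {:OPENDATE :string
                           :date_super :string}}))

;; The column metadata records the datatype the column was read as.
(:datatype (meta (:OPENDATE stores)))

;; Rows where :date_super is missing (here, the row for store 7).
(tc/select-missing stores [:date_super])
```

Keeping the dates as strings leaves values such as `7/1/62` exactly as they appear in the file; converting them to dates is a separate step that has to decide how to expand the two-digit years.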
6.3 tech.ml.dataset (TMD)
TMD’s repo also has some datasets that we can use:
(tc/dataset "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
            {:key-fn keyword})
https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv [560 3]:
:symbol | :date | :price |
---|---|---|
MSFT | 2000-01-01 | 39.81 |
MSFT | 2000-02-01 | 36.35 |
MSFT | 2000-03-01 | 43.22 |
MSFT | 2000-04-01 | 28.37 |
MSFT | 2000-05-01 | 25.45 |
MSFT | 2000-06-01 | 32.54 |
MSFT | 2000-07-01 | 28.40 |
MSFT | 2000-08-01 | 28.40 |
MSFT | 2000-09-01 | 24.53 |
MSFT | 2000-10-01 | 28.02 |
… | … | … |
AAPL | 2009-05-01 | 135.81 |
AAPL | 2009-06-01 | 142.43 |
AAPL | 2009-07-01 | 163.39 |
AAPL | 2009-08-01 | 168.21 |
AAPL | 2009-09-01 | 185.35 |
AAPL | 2009-10-01 | 188.50 |
AAPL | 2009-11-01 | 199.91 |
AAPL | 2009-12-01 | 210.73 |
AAPL | 2010-01-01 | 192.06 |
AAPL | 2010-02-01 | 204.62 |
AAPL | 2010-03-01 | 223.02 |
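The `:key-fn keyword` option used above turns the CSV header strings into keyword column names. As a minimal sketch of the difference, the following loads a four-row excerpt of the stocks table from a temporary file, once without and once with the option. The namespace name and the vars `excerpt-file`, `raw`, and `keyed` are made up for illustration.

```clojure
(ns noj-book.key-fn-sketch ; hypothetical namespace, not part of the book
  (:require [tablecloth.api :as tc]))

;; Four rows copied from the printed stocks table.
(def excerpt-file
  (doto (java.io.File/createTempFile "stocks-excerpt" ".csv")
    (.deleteOnExit)
    (spit (str "symbol,date,price\n"
               "MSFT,2000-01-01,39.81\n"
               "MSFT,2000-02-01,36.35\n"
               "AAPL,2009-05-01,135.81\n"
               "AAPL,2009-06-01,142.43\n"))))

;; Without :key-fn, column names stay as the header strings.
(def raw (tc/dataset (.getPath excerpt-file)))
(tc/column-names raw)

;; With :key-fn keyword, column names become keywords.
(def keyed (tc/dataset (.getPath excerpt-file) {:key-fn keyword}))
(tc/column-names keyed)

;; Keyword names allow the dataset to be used like a map of columns.
(:price keyed)
```

Keyword column names are the convention in the printed outputs throughout this chapter, which is why both `tc/dataset` calls pass `:key-fn keyword`.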
source: notebooks/noj_book/datasets.clj