6 Example Datasets
author: Daniel Slutsky, Ken Huang
We may use various sources of datasets for our tutorials here.
(ns noj-book.datasets
(:require [tablecloth.api :as tc]
[scicloj.metamorph.ml.rdatasets :as rdatasets]))6.1 rdatasets
One of the main sources is the rdatasets namespace of metamorph.ml, which can fetch datasets from the Rdatasets collection.
(rdatasets/datasets-iris)https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:
| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| … | … | … | … | … | … |
| 140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
| 142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
| 145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
(rdatasets/ggplot2-mpg)https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv [234 12]:
| :rownames | :manufacturer | :model | :displ | :year | :cyl | :trans | :drv | :cty | :hwy | :fl | :class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| 6 | audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
| 7 | audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
| 8 | audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
| 9 | audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
| 10 | audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 224 | volkswagen | new beetle | 2.0 | 1999 | 4 | manual(m5) | f | 21 | 29 | r | subcompact |
| 225 | volkswagen | new beetle | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | subcompact |
| 226 | volkswagen | new beetle | 2.5 | 2008 | 5 | manual(m5) | f | 20 | 28 | r | subcompact |
| 227 | volkswagen | new beetle | 2.5 | 2008 | 5 | auto(s6) | f | 20 | 29 | r | subcompact |
| 228 | volkswagen | passat | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | midsize |
| 229 | volkswagen | passat | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | midsize |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
| 231 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
| 233 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
| 234 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
(rdatasets/openintro-simulated_scatter)https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv [2033 4]:
| :rownames | :group | :x | :y |
|---|---|---|---|
| 1 | 1 | -15.74380273 | 35.56615175 |
| 2 | 1 | 6.34665115 | 23.52750121 |
| 3 | 1 | 24.54001114 | -1.03170877 |
| 4 | 1 | -22.02035224 | 19.75964793 |
| 5 | 1 | 22.46083327 | -5.85090154 |
| 6 | 1 | 16.99409017 | 21.23115046 |
| 7 | 1 | 18.81878514 | -5.28260778 |
| 8 | 1 | 52.84813702 | -20.01482909 |
| 9 | 1 | -24.22229471 | 48.07070892 |
| 10 | 1 | 57.82316783 | -22.72697862 |
| … | … | … | … |
| 2023 | 30 | 352.00000000 | 29.18804414 |
| 2024 | 30 | 245.00000000 | 32.98072746 |
| 2025 | 30 | 382.00000000 | 27.89430176 |
| 2026 | 30 | 240.00000000 | 33.50059544 |
| 2027 | 30 | 319.00000000 | 27.75617312 |
| 2028 | 30 | 197.00000000 | 38.44752778 |
| 2029 | 30 | 316.00000000 | 29.91404536 |
| 2030 | 30 | 263.00000000 | 31.85003163 |
| 2031 | 30 | 410.00000000 | 34.96805690 |
| 2032 | 30 | 252.00000000 | 33.15689569 |
| 2033 | 30 | 297.00000000 | 31.50992258 |
6.2 Plotly
We can also use datasets from Plotly Sample Datasets
(tc/dataset
"https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv"
{:key-fn keyword
:parser-fn {:OPENDATE :string
:date_super :string}})https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv [2992 16]:
| :storenum | :OPENDATE | :date_super | :conversion | :st | :county | :STREETADDR | :STRCITY | :STRSTATE | :ZIPCODE | :type_store | :LAT | :LON | :MONTH | :DAY | :YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7/1/62 | 3/1/97 | 1 | 5 | 7 | 2110 WEST WALNUT | Rogers | AR | 72756 | Supercenter | 36.342235 | -94.07141 | 7 | 1 | 1962 |
| 2 | 8/1/64 | 3/1/96 | 1 | 5 | 9 | 1417 HWY 62/65 N | Harrison | AR | 72601 | Supercenter | 36.236984 | -93.09345 | 8 | 1 | 1964 |
| 4 | 8/1/65 | 3/1/02 | 1 | 5 | 7 | 2901 HWY 412 EAST | Siloam Springs | AR | 72761 | Supercenter | 36.179905 | -94.50208 | 8 | 1 | 1965 |
| 8 | 10/1/67 | 3/1/93 | 1 | 5 | 29 | 1621 NORTH BUSINESS 9 | Morrilton | AR | 72110 | Supercenter | 35.156491 | -92.75858 | 10 | 1 | 1967 |
| 7 | 10/1/67 | 5 | 119 | 3801 CAMP ROBINSON RD. | North Little Rock | AR | 72118 | Wal-Mart | 34.813269 | -92.30229 | 10 | 1 | 1967 | ||
| 10 | 7/1/68 | 3/1/98 | 1 | 40 | 21 | 2020 SOUTH MUSKOGEE | Tahlequah | OK | 74464 | Supercenter | 35.923658 | -94.97185 | 7 | 1 | 1968 |
| 13 | 11/1/68 | 3/1/96 | 1 | 29 | 97 | 2705 GRAND AVE | Carthage | MO | 64836 | Supercenter | 37.168985 | -94.31164 | 11 | 1 | 1968 |
| 12 | 7/1/68 | 3/1/94 | 1 | 40 | 131 | 1500 LYNN RIGGS BLVD | Claremore | OK | 74017 | Supercenter | 36.327143 | -95.61192 | 7 | 1 | 1968 |
| 11 | 3/1/68 | 2/20/02 | 1 | 5 | 5 | 65 WAL-MART DRIVE | Mountain Home | AR | 72653 | Supercenter | 36.329026 | -92.35781 | 3 | 1 | 1968 |
| 9 | 3/1/68 | 3/1/00 | 1 | 29 | 143 | 1303 SOUTH MAIN | Sikeston | MO | 63801 | Supercenter | 36.891163 | -89.58355 | 3 | 1 | 1968 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 5370 | 1/31/06 | 1/31/06 | 0 | 8 | 13 | 2514 Main St | Longmont | CO | 80504 | Supercenter | 40.160138 | -105.01772 | 1 | 31 | 2006 |
| 3608 | 1/31/06 | 1/31/06 | 0 | 39 | 85 | 6067 N Ridge Rd | Madison | OH | 44057 | Supercenter | 41.800630 | -81.06021 | 1 | 31 | 2006 |
| 5253 | 1/31/06 | 1/31/06 | 0 | 51 | 550 | 632 Grass Field Pkwy | Chesapeake | VA | 23322 | Supercenter | 36.687543 | -76.22905 | 1 | 31 | 2006 |
| 5471 | 1/31/06 | 1/31/06 | 0 | 39 | 139 | 2485 Possum Run Rd | Mansfield | OH | 44903 | Supercenter | 40.766589 | -82.51869 | 1 | 31 | 2006 |
| 5346 | 1/23/06 | 1/23/06 | 0 | 37 | 1 | 1318 Mebane Oaks Rd | Mebane | NC | 27302 | Supercenter | 36.111449 | -79.27142 | 1 | 23 | 2006 |
| 5313 | 1/23/06 | 1/23/06 | 0 | 29 | 183 | 6100 Ronald Reagan Blvd | Lake Saint Louis | MO | 63367 | Supercenter | 38.796601 | -90.78525 | 1 | 23 | 2006 |
| 5403 | 1/27/06 | 1/27/06 | 0 | 17 | 19 | 100 S High Cross Rd | Urbana | IL | 61802 | Supercenter | 40.121648 | -88.17649 | 1 | 27 | 2006 |
| 3347 | 1/23/06 | 1/23/06 | 0 | 12 | 105 | 7450 Cypress Gardens Blvd | Winter Haven | FL | 33884 | Supercenter | 27.997387 | -81.68256 | 1 | 23 | 2006 |
| 5485 | 1/27/06 | 17 | 31 | 2500 W 95th St | Evergreen Park | IL | 60805 | Wal-Mart | 41.719933 | -87.70249 | 1 | 27 | 2006 | ||
| 3425 | 1/27/06 | 1/27/06 | 0 | 48 | 201 | 9598 Rowlett Rd | Houston | TX | 77034 | Supercenter | 29.636430 | -95.21789 | 1 | 27 | 2006 |
| 5193 | 1/31/06 | 6 | 65 | 12721 Moreno Beach Dr | Moreno Valley | CA | 92555 | Wal-Mart | 33.922823 | -117.16837 | 1 | 31 | 2006 |
6.3 tech.ml.dataset (TMD)
TMD’s repo also has some datasets that we can use:
(tc/dataset
"https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
{:key-fn keyword})https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv [560 3]:
| :symbol | :date | :price |
|---|---|---|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
| MSFT | 2000-06-01 | 32.54 |
| MSFT | 2000-07-01 | 28.40 |
| MSFT | 2000-08-01 | 28.40 |
| MSFT | 2000-09-01 | 24.53 |
| MSFT | 2000-10-01 | 28.02 |
| … | … | … |
| AAPL | 2009-05-01 | 135.81 |
| AAPL | 2009-06-01 | 142.43 |
| AAPL | 2009-07-01 | 163.39 |
| AAPL | 2009-08-01 | 168.21 |
| AAPL | 2009-09-01 | 185.35 |
| AAPL | 2009-10-01 | 188.50 |
| AAPL | 2009-11-01 | 199.91 |
| AAPL | 2009-12-01 | 210.73 |
| AAPL | 2010-01-01 | 192.06 |
| AAPL | 2010-02-01 | 204.62 |
| AAPL | 2010-03-01 | 223.02 |
source: notebooks/noj_book/datasets.clj