6 Example Datasets
author: Daniel Slutsky, Ken Huang
6.1 Rdatasets
For our tutorials here, let us fetch some datasets from Rdatasets:
(ns noj-book.datasets
  (:require [tablecloth.api :as tc]))

(def iris
  (-> "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"
      (tc/dataset {:key-fn keyword})
      (tc/rename-columns {:Sepal.Length :sepal-length
                          :Sepal.Width :sepal-width
                          :Petal.Length :petal-length
                          :Petal.Width :petal-width
                          :Species :species})))
iris
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:
:rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
… | … | … | … | … | … |
140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
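The renaming step above lists every column by hand. As a minimal sketch of an alternative, the mapping could be derived from the existing column names. Here `kebab-keyword`, `iris-sample`, and the namespace name are hypothetical (they are not part of the original notebook), and a tiny in-memory dataset stands in for the CSV so that no network access is needed:

```clojure
(ns noj-book.datasets-rename-sketch
  (:require [clojure.string :as str]
            [tablecloth.api :as tc]))

;; Hypothetical helper: :Sepal.Length -> :sepal-length
(defn kebab-keyword [k]
  (-> (name k)
      (str/replace "." "-")
      str/lower-case
      keyword))

;; In-memory stand-in for a few columns of the iris CSV
(def iris-sample
  (tc/dataset {:Sepal.Length [5.1 4.9]
               :Sepal.Width  [3.5 3.0]
               :Species      ["setosa" "setosa"]}))

;; Build the old-name -> new-name map from the dataset's own column names,
;; then pass it to the same map-based tc/rename-columns call used above.
(def renamed
  (tc/rename-columns
   iris-sample
   (into {}
         (map (juxt identity kebab-keyword))
         (tc/column-names iris-sample))))

(tc/column-names renamed)
```

Writing the map out explicitly, as the notebook does, keeps the final names visible at a glance; deriving it is mainly convenient when a dataset has many columns.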
(def mtcars
  (-> "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv"
      (tc/dataset {:key-fn keyword})))
mtcars
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv [32 12]:
:rownames | :mpg | :cyl | :disp | :hp | :drat | :wt | :qsec | :vs | :am | :gear | :carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
… | … | … | … | … | … | … | … | … | … | … | … |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
(def scatter
  (-> "https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv"
      (tc/dataset {:key-fn keyword})))
(tc/head scatter)
https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv [5 4]:
:rownames | :group | :x | :y |
---|---|---|---|
1 | 1 | -15.74380273 | 35.56615175 |
2 | 1 | 6.34665115 | 23.52750121 |
3 | 1 | 24.54001114 | -1.03170877 |
4 | 1 | -22.02035224 | 19.75964793 |
5 | 1 | 22.46083327 | -5.85090154 |
6.2 Plotly
We can also use datasets from Plotly Sample Datasets:
(-> "https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv"
    (tc/dataset {:key-fn keyword
                 :parser-fn {:OPENDATE :string
                             :date_super :string}})
    (tc/head))
https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv [5 16]:
:storenum | :OPENDATE | :date_super | :conversion | :st | :county | :STREETADDR | :STRCITY | :STRSTATE | :ZIPCODE | :type_store | :LAT | :LON | :MONTH | :DAY | :YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7/1/62 | 3/1/97 | 1 | 5 | 7 | 2110 WEST WALNUT | Rogers | AR | 72756 | Supercenter | 36.342235 | -94.07141 | 7 | 1 | 1962 |
2 | 8/1/64 | 3/1/96 | 1 | 5 | 9 | 1417 HWY 62/65 N | Harrison | AR | 72601 | Supercenter | 36.236984 | -93.09345 | 8 | 1 | 1964 |
4 | 8/1/65 | 3/1/02 | 1 | 5 | 7 | 2901 HWY 412 EAST | Siloam Springs | AR | 72761 | Supercenter | 36.179905 | -94.50208 | 8 | 1 | 1965 |
8 | 10/1/67 | 3/1/93 | 1 | 5 | 29 | 1621 NORTH BUSINESS 9 | Morrilton | AR | 72110 | Supercenter | 35.156491 | -92.75858 | 10 | 1 | 1967 |
7 | 10/1/67 | | | 5 | 119 | 3801 CAMP ROBINSON RD. | North Little Rock | AR | 72118 | Wal-Mart | 34.813269 | -92.30229 | 10 | 1 | 1967 |
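Since `:OPENDATE` is kept as a string here and its two-digit years are ambiguous on their own, one possible way to obtain real date values is to combine the `:YEAR`, `:MONTH`, and `:DAY` columns. The following is only a sketch under assumptions: the column name `:open-date`, the var names, and the namespace are hypothetical, and a small in-memory dataset with the first two rows of the table above stands in for the downloaded CSV:

```clojure
(ns noj-book.datasets-dates-sketch
  (:require [tablecloth.api :as tc])
  (:import [java.time LocalDate]))

;; In-memory stand-in: the first two rows shown above, reduced to a few columns
(def walmart-sample
  (tc/dataset {:storenum [1 2]
               :OPENDATE ["7/1/62" "8/1/64"]
               :MONTH    [7 8]
               :DAY      [1 1]
               :YEAR     [1962 1964]}))

;; Add a date column computed row by row from the three numeric columns.
;; :open-date is a hypothetical name chosen for this sketch.
(def walmart-dated
  (tc/map-columns walmart-sample
                  :open-date
                  [:YEAR :MONTH :DAY]
                  (fn [y m d]
                    (LocalDate/of (int y) (int m) (int d)))))

;; Inspect the new column as a plain vector
(vec (walmart-dated :open-date))
```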
6.3 tech.ml.dataset (TMD)
TMD’s repo also has some datasets that we can use:
(def stocks
  (tc/dataset
   "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
   {:key-fn keyword}))
stocks
https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv [560 3]:
:symbol | :date | :price |
---|---|---|
MSFT | 2000-01-01 | 39.81 |
MSFT | 2000-02-01 | 36.35 |
MSFT | 2000-03-01 | 43.22 |
MSFT | 2000-04-01 | 28.37 |
MSFT | 2000-05-01 | 25.45 |
MSFT | 2000-06-01 | 32.54 |
MSFT | 2000-07-01 | 28.40 |
MSFT | 2000-08-01 | 28.40 |
MSFT | 2000-09-01 | 24.53 |
MSFT | 2000-10-01 | 28.02 |
… | … | … |
AAPL | 2009-05-01 | 135.81 |
AAPL | 2009-06-01 | 142.43 |
AAPL | 2009-07-01 | 163.39 |
AAPL | 2009-08-01 | 168.21 |
AAPL | 2009-09-01 | 185.35 |
AAPL | 2009-10-01 | 188.50 |
AAPL | 2009-11-01 | 199.91 |
AAPL | 2009-12-01 | 210.73 |
AAPL | 2010-01-01 | 192.06 |
AAPL | 2010-02-01 | 204.62 |
AAPL | 2010-03-01 | 223.02 |
source: notebooks/noj_book/datasets.clj