5 Datasets

author: Daniel Slutsky, Ken Huang

5.1 Rdatasets

For our tutorials here, let us fetch some datasets from Rdatasets:

(ns noj-book.datasets
  (:require [tablecloth.api :as tc]))

;; Read the CSV straight from its URL. The :key-fn keyword option converts
;; the column names to keywords; we then rename them to kebab-case.
(def iris
  (-> "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"
      (tc/dataset {:key-fn keyword})
      (tc/rename-columns {:Sepal.Length :sepal-length
                          :Sepal.Width :sepal-width
                          :Petal.Length :petal-length
                          :Petal.Width :petal-width
                          :Species :species})))

iris

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv [150 6]:

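| :rownames | :sepal-length | :sepal-width | :petal-length | :petal-width | :species  |
|----------:|--------------:|-------------:|--------------:|-------------:|-----------|
|         1 |           5.1 |          3.5 |           1.4 |          0.2 | setosa    |
|         2 |           4.9 |          3.0 |           1.4 |          0.2 | setosa    |
|         3 |           4.7 |          3.2 |           1.3 |          0.2 | setosa    |
|         4 |           4.6 |          3.1 |           1.5 |          0.2 | setosa    |
|         5 |           5.0 |          3.6 |           1.4 |          0.2 | setosa    |
|         6 |           5.4 |          3.9 |           1.7 |          0.4 | setosa    |
|         7 |           4.6 |          3.4 |           1.4 |          0.3 | setosa    |
|         8 |           5.0 |          3.4 |           1.5 |          0.2 | setosa    |
|         9 |           4.4 |          2.9 |           1.4 |          0.2 | setosa    |
|        10 |           4.9 |          3.1 |           1.5 |          0.1 | setosa    |
|       ... |           ... |          ... |           ... |          ... | ...       |
|       140 |           6.9 |          3.1 |           5.4 |          2.1 | virginica |
|       141 |           6.7 |          3.1 |           5.6 |          2.4 | virginica |
|       142 |           6.9 |          3.1 |           5.1 |          2.3 | virginica |
|       143 |           5.8 |          2.7 |           5.1 |          1.9 | virginica |
|       144 |           6.8 |          3.2 |           5.9 |          2.3 | virginica |
|       145 |           6.7 |          3.3 |           5.7 |          2.5 | virginica |
|       146 |           6.7 |          3.0 |           5.2 |          2.3 | virginica |
|       147 |           6.3 |          2.5 |           5.0 |          1.9 | virginica |
|       148 |           6.5 |          3.0 |           5.2 |          2.0 | virginica |
|       149 |           6.2 |          3.4 |           5.4 |          2.3 | virginica |
|       150 |           5.9 |          3.0 |           5.1 |          1.8 | virginica |
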
;; mtcars is loaded the same way; its column names are kept as they are.
(def mtcars
  (-> "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv"
      (tc/dataset {:key-fn keyword})))

mtcars

https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv [32 12]:

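| :rownames         | :mpg | :cyl | :disp | :hp | :drat |   :wt | :qsec | :vs | :am | :gear | :carb |
|-------------------|-----:|-----:|------:|----:|------:|------:|------:|----:|----:|------:|------:|
| Mazda RX4         | 21.0 |    6 | 160.0 | 110 |  3.90 | 2.620 | 16.46 |   0 |   1 |     4 |     4 |
| Mazda RX4 Wag     | 21.0 |    6 | 160.0 | 110 |  3.90 | 2.875 | 17.02 |   0 |   1 |     4 |     4 |
| Datsun 710        | 22.8 |    4 | 108.0 |  93 |  3.85 | 2.320 | 18.61 |   1 |   1 |     4 |     1 |
| Hornet 4 Drive    | 21.4 |    6 | 258.0 | 110 |  3.08 | 3.215 | 19.44 |   1 |   0 |     3 |     1 |
| Hornet Sportabout | 18.7 |    8 | 360.0 | 175 |  3.15 | 3.440 | 17.02 |   0 |   0 |     3 |     2 |
| Valiant           | 18.1 |    6 | 225.0 | 105 |  2.76 | 3.460 | 20.22 |   1 |   0 |     3 |     1 |
| Duster 360        | 14.3 |    8 | 360.0 | 245 |  3.21 | 3.570 | 15.84 |   0 |   0 |     3 |     4 |
| Merc 240D         | 24.4 |    4 | 146.7 |  62 |  3.69 | 3.190 | 20.00 |   1 |   0 |     4 |     2 |
| Merc 230          | 22.8 |    4 | 140.8 |  95 |  3.92 | 3.150 | 22.90 |   1 |   0 |     4 |     2 |
| Merc 280          | 19.2 |    6 | 167.6 | 123 |  3.92 | 3.440 | 18.30 |   1 |   0 |     4 |     4 |
| ...               |  ... |  ... |   ... | ... |   ... |   ... |   ... | ... | ... |   ... |   ... |
| Dodge Challenger  | 15.5 |    8 | 318.0 | 150 |  2.76 | 3.520 | 16.87 |   0 |   0 |     3 |     2 |
| AMC Javelin       | 15.2 |    8 | 304.0 | 150 |  3.15 | 3.435 | 17.30 |   0 |   0 |     3 |     2 |
| Camaro Z28        | 13.3 |    8 | 350.0 | 245 |  3.73 | 3.840 | 15.41 |   0 |   0 |     3 |     4 |
| Pontiac Firebird  | 19.2 |    8 | 400.0 | 175 |  3.08 | 3.845 | 17.05 |   0 |   0 |     3 |     2 |
| Fiat X1-9         | 27.3 |    4 |  79.0 |  66 |  4.08 | 1.935 | 18.90 |   1 |   1 |     4 |     1 |
| Porsche 914-2     | 26.0 |    4 | 120.3 |  91 |  4.43 | 2.140 | 16.70 |   0 |   1 |     5 |     2 |
| Lotus Europa      | 30.4 |    4 |  95.1 | 113 |  3.77 | 1.513 | 16.90 |   1 |   1 |     5 |     2 |
| Ford Pantera L    | 15.8 |    8 | 351.0 | 264 |  4.22 | 3.170 | 14.50 |   0 |   1 |     5 |     4 |
| Ferrari Dino      | 19.7 |    6 | 145.0 | 175 |  3.62 | 2.770 | 15.50 |   0 |   1 |     5 |     6 |
| Maserati Bora     | 15.0 |    8 | 301.0 | 335 |  3.54 | 3.570 | 14.60 |   0 |   1 |     5 |     8 |
| Volvo 142E        | 21.4 |    4 | 121.0 | 109 |  4.11 | 2.780 | 18.60 |   1 |   1 |     4 |     2 |
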
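;; A simulated dataset from the openintro collection; we display only its
;; first rows with tc/head.
(def scatter
  (-> "https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv"
      (tc/dataset {:key-fn keyword})))

(tc/head scatter)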

https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv [5 4]:

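| :rownames | :group |           :x |          :y |
|----------:|-------:|-------------:|------------:|
|         1 |      1 | -15.74380273 | 35.56615175 |
|         2 |      1 |   6.34665115 | 23.52750121 |
|         3 |      1 |  24.54001114 | -1.03170877 |
|         4 |      1 | -22.02035224 | 19.75964793 |
|         5 |      1 |  22.46083327 | -5.85090154 |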
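
The three Rdatasets URLs used in this section share one pattern: a base URL, then csv/, then a collection name (datasets or openintro), then the dataset name with a .csv extension. As a minimal sketch, a small helper could build such URLs. The name rdatasets-url is hypothetical and introduced here only for illustration; it is not part of Tablecloth or Rdatasets, and the pattern is inferred solely from the URLs in this chapter.

```clojure
;; Hypothetical helper (not part of any library): builds an Rdatasets CSV URL
;; from a collection name and a dataset name, following the pattern of the
;; URLs used in this section.
(defn rdatasets-url
  [package dataset-name]
  (str "https://vincentarelbundock.github.io/Rdatasets/csv/"
       package "/" dataset-name ".csv"))

(rdatasets-url "openintro" "simulated_scatter")
;; => "https://vincentarelbundock.github.io/Rdatasets/csv/openintro/simulated_scatter.csv"
```

The resulting string can be passed to tc/dataset in place of a literal URL, for example (tc/dataset (rdatasets-url "datasets" "mtcars") {:key-fn keyword}).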

5.2 Plotly

We can also use datasets from the Plotly Sample Datasets repository:

(-> "https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv"
    (tc/dataset {:key-fn keyword})
    (tc/head))

https://raw.githubusercontent.com/plotly/datasets/refs/heads/master/1962_2006_walmart_store_openings.csv [5 16]:

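| :storenum |  :OPENDATE | :date_super | :conversion | :st | :county | :STREETADDR            | :STRCITY          | :STRSTATE | :ZIPCODE | :type_store |      :LAT |      :LON | :MONTH | :DAY | :YEAR |
|----------:|------------|-------------|------------:|----:|--------:|------------------------|-------------------|-----------|---------:|-------------|----------:|----------:|-------:|-----:|------:|
|         1 | 2062-07-01 |  2097-03-01 |           1 |   5 |       7 | 2110 WEST WALNUT       | Rogers            | AR        |    72756 | Supercenter | 36.342235 | -94.07141 |      7 |    1 |  1962 |
|         2 | 2064-08-01 |  2096-03-01 |           1 |   5 |       9 | 1417 HWY 62/65 N       | Harrison          | AR        |    72601 | Supercenter | 36.236984 | -93.09345 |      8 |    1 |  1964 |
|         4 | 2065-08-01 |  2002-03-01 |           1 |   5 |       7 | 2901 HWY 412 EAST      | Siloam Springs    | AR        |    72761 | Supercenter | 36.179905 | -94.50208 |      8 |    1 |  1965 |
|         8 | 2067-10-01 |  2093-03-01 |           1 |   5 |      29 | 1621 NORTH BUSINESS 9  | Morrilton         | AR        |    72110 | Supercenter | 35.156491 | -92.75858 |     10 |    1 |  1967 |
|         7 | 2067-10-01 |             |             |   5 |     119 | 3801 CAMP ROBINSON RD. | North Little Rock | AR        |    72118 | Wal-Mart    | 34.813269 | -92.30229 |     10 |    1 |  1967 |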

5.3 tech.ml.dataset (TMD)

TMD’s repo also has some datasets that we can use:

(def stocks
  (tc/dataset
   "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
   {:key-fn keyword}))
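
The definition above only binds stocks and does not display it. As a minimal sketch, one might inspect the loaded dataset as follows. It assumes the tc alias and the stocks var defined above, and it shows no output because the result depends on the file currently served at that URL.

```clojure
;; Assumes the tc alias for tablecloth.api and the stocks var defined above.
(require '[tablecloth.api :as tc])

;; Shape of the dataset: number of rows, then the column names.
(tc/row-count stocks)
(tc/column-names stocks)

;; First five rows, as in the scatter and Walmart examples.
(tc/head stocks)
```

Checking the row count and column names right after loading is a cheap way to confirm that the remote file was fetched and parsed as expected.
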
source: notebooks/noj_book/datasets.clj