This repository has been archived by the owner on Mar 16, 2022. It is now read-only.
forked from treforevans/uci_datasets

Regression datasets from the UCI repository with standardized test-train splits.


AlCatt91/uci_datasets

 
 


Regression datasets from the UCI machine learning repository prepared for benchmarking studies with test-train splits.

Setup

This repository uses Git Large File Storage (LFS), so you must install Git LFS before cloning; otherwise the cloned repository will contain only pointer files rather than the data files. The repository is about 319 MB in size.

After installing Git LFS, clone the repository and set up the Python package as follows:

git clone https://github.com/treforevans/uci_datasets.git
cd uci_datasets
python setup.py develop

Note that you must use develop in the above line, not install.
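If the data fails to load after cloning, one thing worth checking is whether the data files are still Git LFS pointers. The following is a minimal sketch, not part of this package: it relies only on the fact that an LFS pointer file is a small text file beginning with `version https://git-lfs.github.com/spec/v1`. The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line (per the LFS pointer spec).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer file."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """List files under `root` that are still LFS pointers (data not downloaded).

    Hypothetical helper: files inside the .git directory are skipped.
    """
    root = Path(root)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and ".git" not in p.relative_to(root).parts
        and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers("uci_datasets")` from the directory that contains the clone should return an empty list once the real data files are present. A non-empty list suggests that Git LFS was not installed when the repository was cloned.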

Usage

The following code gets the first test-train split (i.e., split=0) of the challenger dataset:

from uci_datasets import Dataset
data = Dataset("challenger")
x_train, y_train, x_test, y_test = data.get_split(split=0)

Each dataset comes with 10 test-train splits (as in 10-fold cross-validation); in each split, 90% of the observations are training points and 10% are test points. The split parameter of the Dataset.get_split method accepts integers from 0 to 9 (inclusive).
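Because every dataset has the same 10 splits, a benchmark result is usually reported as an average over all of them. The sketch below shows one way to do that. It assumes NumPy is installed and that `get_split` returns array-like inputs and targets, as in the example above. The names `rmse`, `mean_baseline`, and `evaluate_over_splits` are hypothetical helpers, not part of this package.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_baseline(x_train, y_train, x_test):
    """Trivial model: predict the training-set mean for every test point."""
    return np.full(len(x_test), np.mean(y_train))


def evaluate_over_splits(get_split, fit_predict, n_splits=10):
    """Run `fit_predict` on splits 0..n_splits-1; return (mean, std) of the RMSE."""
    scores = []
    for split in range(n_splits):
        x_train, y_train, x_test, y_test = get_split(split=split)
        y_pred = fit_predict(x_train, y_train, x_test)
        scores.append(rmse(y_test, y_pred))
    return float(np.mean(scores)), float(np.std(scores))


# Intended usage with this package (requires the data files to be present):
#   from uci_datasets import Dataset
#   data = Dataset("challenger")
#   mean_rmse, std_rmse = evaluate_over_splits(data.get_split, mean_baseline)
```

Passing the bound method `data.get_split` keeps the evaluation loop independent of any particular dataset, so the same function can be reused across the whole table below.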

Datasets

The table below lists the size (number of observations) and the number of input dimensions of each dataset.

Dataset name    Number of observations  Input dimension
3droad          434874                  3
autompg         392                     7
bike            17379                   17
challenger      23                      4
concreteslump   103                     7
energy          768                     8
forest          517                     12
houseelectric   2049280                 11
keggdirected    48827                   20
kin40k          40000                   8
parkinsons      5875                    20
pol             15000                   26
pumadyn32nm     8192                    32
slice           53500                   385
solar           1066                    10
stock           536                     11
yacht           308                     6
airfoil         1503                    5
autos           159                     25
breastcancer    194                     33
buzz            583250                  77
concrete        1030                    8
elevators       16599                   18
fertility       100                     9
gas             2565                    128
housing         506                     13
keggundirected  63608                   27
machine         209                     7
pendulum        630                     9
protein         45730                   9
servo           167                     4
skillcraft      3338                    19
sml             4137                    26
song            515345                  90
tamielectric    45781                   3
wine            1599                    11

Dataset information can be obtained from the all_datasets dictionary. For example, to obtain a list of all datasets with fewer than 1000 observations, execute the following:

from uci_datasets import all_datasets
[name for name, (n_observations, n_dimensions) in all_datasets.items() if n_observations < 1000]
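The same dictionary can drive more selective queries. The sketch below assumes, as the comprehension above implies, that `all_datasets` maps each dataset name to a `(n_observations, n_dimensions)` tuple. To stay self-contained it runs on a small excerpt copied from the table above rather than importing the package. `select_datasets` is a hypothetical helper, not part of this package.

```python
def select_datasets(datasets, max_observations=None, min_dimensions=None):
    """Return dataset names matching the given bounds, smallest dataset first.

    `datasets` maps name -> (n_observations, n_dimensions).
    A bound of None means "no restriction".
    """
    selected = [
        (n_observations, name)
        for name, (n_observations, n_dimensions) in datasets.items()
        if (max_observations is None or n_observations < max_observations)
        and (min_dimensions is None or n_dimensions >= min_dimensions)
    ]
    return [name for _, name in sorted(selected)]


# Excerpt of the table above, in the assumed format of `all_datasets`.
excerpt = {
    "challenger": (23, 4),
    "yacht": (308, 6),
    "energy": (768, 8),
    "airfoil": (1503, 5),
    "slice": (53500, 385),
}

print(select_datasets(excerpt, max_observations=1000))
# → ['challenger', 'yacht', 'energy']
print(select_datasets(excerpt, min_dimensions=8))
# → ['energy', 'slice']
```

With the package installed, passing `all_datasets` in place of `excerpt` should apply the same filter to every dataset in the repository.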

Papers using these datasets

The following papers use the same datasets and test-train splits present in this repository.
