Regression datasets from the UCI Machine Learning Repository, prepared with test-train splits for benchmarking studies.

Setup

This repository uses Git Large File Storage (LFS), so you must install Git LFS before cloning; otherwise the cloned repository will contain only pointer files rather than the data files. The repository is about 319 MB in size.

After installing Git LFS, clone the repository and set up the Python package as follows:

git clone https://github.com/treforevans/uci_datasets.git
cd uci_datasets
python setup.py develop

Note that you must use develop in the above line, not install.
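
If you are unsure whether Git LFS actually fetched the data, the following minimal sketch scans a directory for files that are still LFS pointers. It relies on the fact that a Git LFS pointer file begins with the text `version https://git-lfs.github.com/spec/v1`. The helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical and are not part of the uci_datasets package.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` looks like a Git LFS pointer file."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """List files under `root` that are still LFS pointers (data not fetched)."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and ".git" not in p.parts and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers("uci_datasets")` from the parent directory of the clone should return an empty list when the data files are present. A non-empty list suggests that Git LFS was not installed at clone time, in which case running `git lfs pull` inside the repository should fetch the missing files.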

Usage

The following code gets the first test-train split (i.e., split=0) of the challenger dataset:

from uci_datasets import Dataset
data = Dataset("challenger")
x_train, y_train, x_test, y_test = data.get_split(split=0)

There are 10 test-train splits for each dataset (as in 10-fold cross-validation); in each split, 90% of the dataset is used for training and 10% for testing. The split parameter of the Dataset.get_split method accepts integers from 0 to 9 (inclusive).
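
As an illustration of how the ten splits might be used together, the sketch below evaluates a trivial baseline (predicting the training-set mean) on every split and returns the test RMSE of each. The function `mean_baseline_rmse` is hypothetical and not part of the package. It assumes only that `get_split` accepts a `split` keyword and returns `x_train, y_train, x_test, y_test` as array-like objects, as in the example above.

```python
import numpy as np


def mean_baseline_rmse(get_split, n_splits=10):
    """Return the test RMSE of a predict-the-training-mean baseline per split."""
    scores = []
    for split in range(n_splits):
        # The inputs are unused because this baseline ignores the features.
        _, y_train, _, y_test = get_split(split=split)
        prediction = np.mean(np.asarray(y_train, dtype=float), axis=0)
        residual = np.asarray(y_test, dtype=float) - prediction
        scores.append(float(np.sqrt(np.mean(residual ** 2))))
    return scores


# Usage, assuming the package is installed as described in Setup:
# from uci_datasets import Dataset
# scores = mean_baseline_rmse(Dataset("challenger").get_split)
# print(np.mean(scores), np.std(scores))
```

Reporting the mean and standard deviation of the per-split scores is a common way to summarize results over fixed splits; replacing the baseline with a real model only requires changing how `prediction` is computed.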

Datasets

The table below lists the size (number of observations) and the number of input dimensions of each dataset.

Dataset name	Number of observations	Input dimension
`3droad`	434874	3
`autompg`	392	7
`bike`	17379	17
`challenger`	23	4
`concreteslump`	103	7
`energy`	768	8
`forest`	517	12
`houseelectric`	2049280	11
`keggdirected`	48827	20
`kin40k`	40000	8
`parkinsons`	5875	20
`pol`	15000	26
`pumadyn32nm`	8192	32
`slice`	53500	385
`solar`	1066	10
`stock`	536	11
`yacht`	308	6
`airfoil`	1503	5
`autos`	159	25
`breastcancer`	194	33
`buzz`	583250	77
`concrete`	1030	8
`elevators`	16599	18
`fertility`	100	9
`gas`	2565	128
`housing`	506	13
`keggundirected`	63608	27
`machine`	209	7
`pendulum`	630	9
`protein`	45730	9
`servo`	167	4
`skillcraft`	3338	19
`sml`	4137	26
`song`	515345	90
`tamielectric`	45781	3
`wine`	1599	11

Dataset information can be obtained from the all_datasets dictionary. For example, to obtain a list of all datasets with fewer than 1000 observations, execute the following:

from uci_datasets import all_datasets
[name for name, (n_observations, n_dimensions) in all_datasets.items() if n_observations < 1000]
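
The same pattern extends to filtering on both columns at once. The sketch below is a hypothetical helper (`select_datasets` is not part of the package) that assumes the dictionary maps each dataset name to a `(n_observations, n_dimensions)` tuple, as the comprehension above implies. It is demonstrated on a small stand-in dictionary, with values copied from the table, so that it runs without the package installed.

```python
def select_datasets(info, max_observations=None, max_dimensions=None):
    """Return dataset names within the given limits, smallest dataset first."""
    selected = []
    for name, (n_observations, n_dimensions) in info.items():
        if max_observations is not None and n_observations > max_observations:
            continue
        if max_dimensions is not None and n_dimensions > max_dimensions:
            continue
        selected.append((n_observations, name))
    return [name for _, name in sorted(selected)]


# Stand-in for all_datasets, using four rows from the table above.
sample = {
    "challenger": (23, 4),
    "yacht": (308, 6),
    "breastcancer": (194, 33),
    "song": (515345, 90),
}
small = select_datasets(sample, max_observations=1000, max_dimensions=10)
print(small)
```

With the package installed, passing `all_datasets` instead of `sample` should work in the same way, provided the dictionary has the structure assumed here.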

Papers using these datasets

The following papers use the same datasets and test-train splits present in this repository.
