mldata 1.0 #2

Open
wants to merge 65 commits into base: master
Changes from all commits
Commits (65)
a25762b
Skeleton of the Dataset abstract class.
ASalvail Mar 10, 2014
a6e925c
First draft of the metadata class.
ASalvail Mar 10, 2014
f21cca6
First draft of an in-memory implementation of a dataset.
ASalvail Mar 10, 2014
225ddc7
Added a version number to datasets metadata
ASalvail Mar 11, 2014
5197fad
Added the Dictionary class stub.
ASalvail Mar 11, 2014
27ef780
Updated datasets definition to correctly handle targets datasets.
ASalvail Mar 11, 2014
5ad4ee8
Updated the datasets general methods concerning splits and applying p…
ASalvail Mar 11, 2014
1c605d1
Added a check to make sure splits are defined via a tuple.
ASalvail Mar 12, 2014
2fb5d6b
Deleted InMemoryDataset and transfered methods to Dataset
ASalvail Mar 12, 2014
6d8ee8b
Added definitions of iterator and get item supporting supervised and …
ASalvail Mar 12, 2014
35f112f
utils to save and load the config file which contains the path to the…
ASalvail Mar 18, 2014
aaffb2f
Changed the config file logic to support by dataset path.
ASalvail Mar 26, 2014
7ef811c
Added a check to see if a specific dataset path exists.
ASalvail Mar 26, 2014
625bfea
Added an hash function for easy versioning.
ASalvail Mar 26, 2014
79fc344
New version of dataset_store with versioning and metadata supported.
ASalvail Apr 1, 2014
f11ba43
Added an importer for CSV files based on numpy.loadtxt()
ASalvail Apr 3, 2014
ab262ff
Added comments and new parameters
ASalvail Apr 3, 2014
00730de
Efficient buffered iteration added.
ASalvail Apr 3, 2014
b21420e
Removed uses of utils/utils
ASalvail Apr 3, 2014
fc9e766
Test suite for Datasets
ASalvail Apr 3, 2014
ef46db9
Test suite for Config
ASalvail Apr 8, 2014
c923699
Test suite for Dataset_store
ASalvail Apr 8, 2014
4c8b313
Corrected saving of default config file
ASalvail Apr 9, 2014
d8b4cfa
Corrected loading of config file
ASalvail Apr 9, 2014
f3ea2a6
Corrected joining of parts in CONFIGFILE
ASalvail Apr 9, 2014
0ff610e
Updated to python 3.4
ASalvail Apr 9, 2014
347d24b
Small correction to skip a line
ASalvail Apr 9, 2014
d5adda5
assert syntax correction
ASalvail Apr 9, 2014
2d9b70f
Added __init__ files for tests
ASalvail Apr 9, 2014
52bb43f
Changed the whole logic of __iter__ and __getitem__
ASalvail Apr 10, 2014
ab0b52b
Changed test to remove errors when run.
ASalvail Apr 10, 2014
1b4b158
Create the default dataset directory. Insure the dataset folders woul…
ASalvail Apr 15, 2014
93425f7
Changed preprocess logic to follow python's capacities.
ASalvail Apr 15, 2014
571fbf7
Changed preprocess logic to follow python's capacities.
ASalvail Apr 15, 2014
e652374
Corrected a small mistake in the splits of the test case.
ASalvail Apr 15, 2014
8bde933
Removed a nonsensical test.
ASalvail Apr 15, 2014
ef37e3a
Corrected split argument in a test
ASalvail Apr 15, 2014
62ff2a7
Corrected how h5py is called to store a ndarray.
ASalvail Apr 15, 2014
035ba7d
Small corrections in tests
ASalvail Apr 15, 2014
85bae50
Changed preprocess functions to named function as lambdas can't be pi…
ASalvail Apr 15, 2014
75bc15c
Corrected tests to reflect read-only datasets
ASalvail Apr 15, 2014
b9d5ef4
Corrected errors in dataset loading
ASalvail Apr 15, 2014
8010e60
Added a method to remove a dataset.
ASalvail Apr 15, 2014
594b4f5
Insured datasets were cleaned after tests.
ASalvail Apr 15, 2014
145ecb6
Corrected hashing of function to account for the difference between h…
ASalvail Apr 15, 2014
aa79e28
Module docstring
ASalvail Apr 15, 2014
bb7086b
edited todos comments
ASalvail Apr 15, 2014
a3010fa
Rollback to python 3.3 for CI.
ASalvail Apr 16, 2014
e39be55
Added a check for config file existence
ASalvail Apr 16, 2014
ff6c658
Rename test_Dataset.py to test_dataset.py
ASalvail Apr 16, 2014
2db5970
Corrected typo in config.py
ASalvail Apr 16, 2014
6af517a
Added a LookupError to handle missing datasets.
ASalvail Apr 16, 2014
9e0b80c
Added a LookupError to handle missing datasets_versions.
ASalvail Apr 16, 2014
2411f40
Added test to make sure missing datasets are handled properly.
ASalvail Apr 16, 2014
e623803
Changed ``splits`` logic.
ASalvail Apr 16, 2014
29c723e
Added a split iterator.
ASalvail Apr 16, 2014
503be66
Added a lazy read test.
ASalvail Apr 16, 2014
8e99c21
Corrected an assert statement
ASalvail Apr 25, 2014
a59492d
Close h5py File handle and correct noTarget dset
ASalvail Apr 25, 2014
0f35b10
Added support for minibatches
ASalvail May 13, 2014
5425285
Changed iterators to cycle infinitely and corrected an assert statement.
ASalvail May 31, 2014
6a16d43
Removed infinite cycle in iterator.
ASalvail May 31, 2014
ed98c62
Changed import names to reflect name change. Some other minor correct…
ASalvail May 31, 2014
6ccca01
Added a convenient method the whole dataset in memory.
ASalvail Jul 7, 2014
ada9521
Targets reshaped to fit adequate data structure.
ASalvail Jul 7, 2014
mldata/dataset.py: 260 additions & 10 deletions (270 changes)
@@ -1,17 +1,267 @@
# -*- coding: utf-8 -*-
"""Datasets store the data used for experiments."""
from itertools import accumulate
import hashlib

import numpy as np

BUFFER_SIZE = 1000


class Dataset():
    """Interface to interact with a physical dataset.

    A `Dataset` presents a unified access to the data, independent of
    implementation details such as laziness.

    Parameters
    ----------
    meta_data : Metadata
    data : array_like
    target : array_like, optional

    Attributes
    ----------
    meta_data : Metadata
        Information about the data. See the `Metadata` documentation for
        more info.
    data : array_like
        The array of data to train on.
    target : array_like, optional
        The array of targets to use for supervised learning. `target` should
        be `None` when the dataset doesn't support supervised learning.

    """
    def __init__(self, meta_data, data, target=None):
        assert len(data) == meta_data.nb_examples, \
            "The metadata ``nb_examples`` is inconsistent with the length " \
            "of the dataset."
        assert len(data) == meta_data.splits[-1] or \
            len(data) == sum(meta_data.splits), \
            "The metadata ``splits`` is inconsistent with the length of " \
            "the dataset."
        self.data = data
        self.target = target
        self.meta_data = meta_data

    def __len__(self):
        return self.meta_data.nb_examples

    def __hash__(self):
        """Hash function used for versioning."""
        hasher = hashlib.md5()
        for l in self.data:
            hasher.update(np.array(l))
        if self.target is not None:
            for l in self.target:
                hasher.update(np.array(l))
        return hasher.hexdigest()[:8]

    def __iter__(self):
        """Provide an iterator handling whether the Dataset has a target."""
        # todo: retest the efficiency of this buffering in python3. With zip
        # now being lazy, it might not be better than the vanilla iter.
        buffer = min(BUFFER_SIZE, len(self))

        if self.target is not None:
            for idx in range(0, len(self.data), buffer):
                stop = min(idx + buffer, len(self))
                for ex, tg in zip(self.data[idx:stop],
                                  self.target[idx:stop]):
                    yield (ex, tg)
        else:
            for idx in range(0, len(self.data), buffer):
                stop = min(idx + buffer, len(self))
                for ex in self.data[idx:stop]:
                    yield (ex,)

    def __getitem__(self, key):
        """Get the entry specified by the key.

        Parameters
        ----------
        key : numpy-like key
            The `key` can be a single integer, a slice or a tuple defining
            coordinates. Can be treated as a NumPy key.

        Returns
        -------
        (array_like, array_like) or (array_like,)
            Return the element specified by the key. It can be an array or
            simply a scalar of the type defined by the data (and target)
            arrays. The returned values are put in a tuple (data, target)
            or (data,).

        """
        if self.target is not None:
            return self.data[key], self.target[key]
        else:
            return self.data[key],

    def _split_iterators(self, start, end, minibatch_size=1):
        """Iterate over a split.

        Parameters
        ----------
        start : int
            Id of the first element of the split.
        end : int
            Id of the next element after the last.
        minibatch_size : int
            Number of examples yielded at each iteration.

        """
        buffer = min(BUFFER_SIZE, end - start)

        if self.target is not None:
            for idx in range(start, end, buffer):
                stop = min(idx + buffer, end)
                for i in range(idx, stop, minibatch_size):
                    j = min(stop, i + minibatch_size)
                    yield (self.data[i:j], self.target[i:j].reshape((1, -1)))
        else:
            for idx in range(start, end, buffer):
                stop = min(idx + buffer, end)
                for i in range(idx, stop, minibatch_size):
                    j = min(stop, i + minibatch_size)
                    yield (self.data[i:j],)

    def get_splits_iterators(self, minibatch_size=1):
        """Create a tuple of iterators, each iterating over one split.

        Each iterator returned is used to iterate over the corresponding
        split. For example, if the ``Metadata`` specifies a ``splits`` of
        (10, 20, 30), ``get_splits_iterators`` returns a 3-tuple with an
        iterator over the first ten examples, another over the next ten and
        a third over the last ten.

        Parameters
        ----------
        minibatch_size : int
            The size of the minibatches received at each iteration.

        Returns
        -------
        tuple of iterable
            A tuple of iterators, one for each split.

        """
        sp = self._normalize_splits()

        itors = [self._split_iterators(start, end, minibatch_size) for
                 (start, end) in zip([0] + sp, sp)]
        return itors

    def get_splits(self):
        """Get the dataset's arrays.

        WARNING: This method will try to load the entire dataset in memory.

        Returns
        -------
        tuple of tuple of array
            The data and targets sliced into multiple subarrays:
            ``((data1, target1), (data2, target2), ...)``

        """
        sp = self._normalize_splits()
        indices = zip([0] + sp, sp)

        if self.target is not None:
            return tuple((self.data[slice(*s)], self.target[slice(*s)])
                         for s in indices)
        else:
            return tuple((self.data[slice(*s)],) for s in indices)


    def apply(self):
        """Apply the preprocess specified in the associated metadata.

        This method simply applies the function given in the metadata (the
        identity by default) to the dataset. This function is expected to
        work on the data and the targets, leaving the rest intact. Still,
        as long as the result is a `Dataset`, `apply` will work.

        Returns
        -------
        Dataset
            The preprocessed dataset.

        """
        ds = self.meta_data.preprocess(self)
        assert isinstance(ds, Dataset)
        return ds

    def _normalize_splits(self):
        sp = list(self.meta_data.splits)

        # Normalize the splits to cumulative indices.
        if sum(sp) == len(self):
            sp = list(accumulate(sp))
        assert sp[-1] == len(self), "The splits couldn't be normalized."

        return sp


class Metadata():
    """Keep track of information about a dataset.

    An instance of this class is required to build a `Dataset`. It gives
    information on how the dataset is called, the splits, etc.

    A single `Dataset` can have multiple metadata files specifying different
    splits or a special pre-processing that needs to be applied. The
    philosophy is to have a single physical copy of the dataset with
    different views that can be created on the fly as needed.

    Attributes
    ----------
    name : str
        The name of the `Dataset`. Default: "Default".
    nb_examples : int
        The number of examples in the dataset (including all splits).
        Default: 0.
    dictionary : Dictionary
        *Not yet implemented.*
        Gives a mapping of words (str) to ids (int). Used only when the
        dataset has been saved as an array of numbers instead of text.
        Default: None.
    splits : tuple of int
        Specifies the splits used by this view of the dataset. The numbers
        can be either the position of the last example of each subset or the
        number of examples in each subset. Default: ().
    preprocess : function or None
        A function that is callable on a `Dataset` to preprocess the data.
        The function cannot be a lambda function since those can't be
        pickled. Default: the identity function.
    hash : str
        The hash of the linked ``Dataset``. Default: "".

    """
    def __init__(self):
        self.name = "Default"
        self.nb_examples = 0
        self.dictionary = None
        self.splits = ()
        self.preprocess = default_preprocess
        self.hash = ""


def default_preprocess(dset):
    """Identity preprocess: return the dataset unchanged."""
    return dset


class Dictionary:
    """Word / integer association list.

    This dictionary is used in `Metadata` for NLP problems. This class
    ensures O(1) conversion from id to word and O(log n) conversion from
    word to id.

    Notes
    -----
    The class is *not yet implemented*.

    The plan is for the dictionary to be implemented as a list of words in
    alphabetical order, with the index of a word being its id. A method
    would implement a binary search over the words in order to retrieve
    an id.

    """

    def __init__(self):
        raise NotImplementedError("The class Dictionary is not yet "
                                  "implemented.")