mldata 1.0 #2 (Open)

Wants to merge 65 commits into base: master.

Changes from 47 commits (65 commits total).

Commits
a25762b
Skeleton of the Dataset abstract class.
ASalvail Mar 10, 2014
a6e925c
First draft of the metadata class.
ASalvail Mar 10, 2014
f21cca6
First draft of an in-memory implementation of a dataset.
ASalvail Mar 10, 2014
225ddc7
Added a version number to datasets metadata
ASalvail Mar 11, 2014
5197fad
Added the Dictionary class stub.
ASalvail Mar 11, 2014
27ef780
Updated datasets definition to correctly handle targets datasets.
ASalvail Mar 11, 2014
5ad4ee8
Updated the datasets general methods concerning splits and applying p…
ASalvail Mar 11, 2014
1c605d1
Added a check to make sure splits are defined via a tuple.
ASalvail Mar 12, 2014
2fb5d6b
Deleted InMemoryDataset and transferred methods to Dataset
ASalvail Mar 12, 2014
6d8ee8b
Added definitions of iterator and get item supporting supervised and …
ASalvail Mar 12, 2014
35f112f
utils to save and load the config file which contains the path to the…
ASalvail Mar 18, 2014
aaffb2f
Changed the config file logic to support by dataset path.
ASalvail Mar 26, 2014
7ef811c
Added a check to see if a specific dataset path exists.
ASalvail Mar 26, 2014
625bfea
Added a hash function for easy versioning.
ASalvail Mar 26, 2014
79fc344
New version of dataset_store with versioning and metadata supported.
ASalvail Apr 1, 2014
f11ba43
Added an importer for CSV files based on numpy.loadtxt()
ASalvail Apr 3, 2014
ab262ff
Added comments and new parameters
ASalvail Apr 3, 2014
00730de
Efficient buffered iteration added.
ASalvail Apr 3, 2014
b21420e
Removed uses of utils/utils
ASalvail Apr 3, 2014
fc9e766
Test suite for Datasets
ASalvail Apr 3, 2014
ef46db9
Test suite for Config
ASalvail Apr 8, 2014
c923699
Test suite for Dataset_store
ASalvail Apr 8, 2014
4c8b313
Corrected saving of default config file
ASalvail Apr 9, 2014
d8b4cfa
Corrected loading of config file
ASalvail Apr 9, 2014
f3ea2a6
Corrected joining of parts in CONFIGFILE
ASalvail Apr 9, 2014
0ff610e
Updated to python 3.4
ASalvail Apr 9, 2014
347d24b
Small correction to skip a line
ASalvail Apr 9, 2014
d5adda5
assert syntax correction
ASalvail Apr 9, 2014
2d9b70f
Added __init__ files for tests
ASalvail Apr 9, 2014
52bb43f
Changed the whole logic of __iter__ and __getitem__
ASalvail Apr 10, 2014
ab0b52b
Changed test to remove errors when run.
ASalvail Apr 10, 2014
1b4b158
Create the default dataset directory. Ensure the dataset folders woul…
ASalvail Apr 15, 2014
93425f7
Changed preprocess logic to follow python's capacities.
ASalvail Apr 15, 2014
571fbf7
Changed preprocess logic to follow python's capacities.
ASalvail Apr 15, 2014
e652374
Corrected a small mistake in the splits of the test case.
ASalvail Apr 15, 2014
8bde933
Removed a nonsensical test.
ASalvail Apr 15, 2014
ef37e3a
Corrected split argument in a test
ASalvail Apr 15, 2014
62ff2a7
Corrected how h5py is called to store a ndarray.
ASalvail Apr 15, 2014
035ba7d
Small corrections in tests
ASalvail Apr 15, 2014
85bae50
Changed preprocess functions to named function as lambdas can't be pi…
ASalvail Apr 15, 2014
75bc15c
Corrected tests to reflect read-only datasets
ASalvail Apr 15, 2014
b9d5ef4
Corrected errors in dataset loading
ASalvail Apr 15, 2014
8010e60
Added a method to remove a dataset.
ASalvail Apr 15, 2014
594b4f5
Ensured datasets were cleaned after tests.
ASalvail Apr 15, 2014
145ecb6
Corrected hashing of function to account for the difference between h…
ASalvail Apr 15, 2014
aa79e28
Module docstring
ASalvail Apr 15, 2014
bb7086b
Edited TODO comments
ASalvail Apr 15, 2014
a3010fa
Rollback to python 3.3 for CI.
ASalvail Apr 16, 2014
e39be55
Added a check for config file existence
ASalvail Apr 16, 2014
ff6c658
Rename test_Dataset.py to test_dataset.py
ASalvail Apr 16, 2014
2db5970
Corrected typo in config.py
ASalvail Apr 16, 2014
6af517a
Added a LookupError to handle missing datasets.
ASalvail Apr 16, 2014
9e0b80c
Added a LookupError to handle missing datasets_versions.
ASalvail Apr 16, 2014
2411f40
Added test to make sure missing datasets are handled properly.
ASalvail Apr 16, 2014
e623803
Changed ``splits`` logic.
ASalvail Apr 16, 2014
29c723e
Added a split iterator.
ASalvail Apr 16, 2014
503be66
Added a lazy read test.
ASalvail Apr 16, 2014
8e99c21
Corrected an assert statement
ASalvail Apr 25, 2014
a59492d
Close h5py File handle and correct noTarget dset
ASalvail Apr 25, 2014
0f35b10
Added support for minibatches
ASalvail May 13, 2014
5425285
Changed iterators to cycle infinitely and corrected an assert statement.
ASalvail May 31, 2014
6a16d43
Removed infinite cycle in iterator.
ASalvail May 31, 2014
ed98c62
Changed import names to reflect name change. Some other minor correct…
ASalvail May 31, 2014
6ccca01
Added a convenient method to load the whole dataset in memory.
ASalvail Jul 7, 2014
ada9521
Targets reshaped to fit adequate data structure.
ASalvail Jul 7, 2014
2 changes: 1 addition & 1 deletion .travis.yml
@@ -1,6 +1,6 @@
language: python
python:
- "3.3"
- "3.4"
Member:
Support for Python 3.4 is not released yet for Travis. See travis-ci/travis-ci#1989

Member Author:
By the time you get around to fixing CI, support should be available.

- "2.7"
- "2.6"
# - "pypy"
(rest of .travis.yml unchanged and not shown)
201 changes: 191 additions & 10 deletions mldata/dataset.py
@@ -1,17 +1,198 @@
This diff removes the old list-based stubs:

class Dataset(list):
    info = {}

    def __init__(self, data=[]):
        super(Dataset, self).__init__(data)


class LazyDataset(Dataset):
    def __init__(self, lazy_functions):
        super(LazyDataset, self).__init__()
        self.lazy_functions = lazy_functions

    def __iter__(self):
        return self.lazy_functions['__iter__']()

and replaces them with the new implementation:

# -*- coding: utf-8 -*-
"""Datasets store the data used for experiments."""
import hashlib

import numpy as np

BUFFER_SIZE = 1000


class Dataset():
    """Interface to interact with a physical dataset.

    A `Dataset` presents unified access to data, independent of
    implementation details such as laziness.

    Parameters
    ----------
    meta_data : Metadata
    data : array_like
    target : array_like

    Attributes
    ----------
    meta_data : Metadata
        Information about the data. See the `Metadata` documentation for more info.
    data : array_like
        The array of data to train on.
    target : array_like, optional
        The array of targets to use for supervised learning. `target` should
        be `None` when the dataset doesn't support supervised learning.

    """
    def __init__(self, meta_data, data, target=None):
        self.data = data
        self.target = target
        assert isinstance(meta_data, Metadata)
        self.meta_data = meta_data

    def __len__(self):
        return self.meta_data.nb_examples

    def __hash__(self):
        """Hash function used for versioning."""
        hasher = hashlib.md5()
        for l in self.data:
            hasher.update(np.array(l))
        if self.target is not None:
            for l in self.target:
                hasher.update(np.array(l))
        return hasher.hexdigest()[:8]

    def __iter__(self):
        """Provide an iterator over the examples, with targets when available."""
        # todo: retest efficiency of this buffering in python3. With zip now
        # being lazy, it might not be better than the vanilla iter.
        buffer = min(BUFFER_SIZE, len(self.data))
        if self.target is not None:
            for idx in range(0, len(self.data), buffer):
                for ex, tg in zip(self.data[idx:idx + buffer],
                                  self.target[idx:idx + buffer]):
                    yield (ex, tg)
        else:
            for idx in range(0, len(self.data), buffer):
                for ex in self.data[idx:idx + buffer]:
                    yield (ex,)

    def __getitem__(self, key):
        """Get the entry specified by the key.

        Parameters
        ----------
        key : numpy-like key
            The `key` can be a single integer, a slice or a tuple defining
            coordinates. Can be treated as a NumPy key.

        Returns
        -------
        (array_like, array_like) or (array_like,)
            Return the element specified by the key. It can be an array or
            simply a scalar of the type defined by the data [and target
            arrays].
            The returned values are put in a tuple (data, target) or (data,).

        """
        if self.target is not None:
            return (self.data[key], self.target[key])
        else:
            return (self.data[key],)

    def get_splits(self):
Member:
How do we get the examples for, let's say, the second split? Do we need to use a preprocessing function to do it?

If I remember correctly, we discussed creating a new Dataset instance to represent each subset.

Member Author:
It'll actually be the trainer's job to follow the splits. This way, if you fancy doing cross-validation, you're not stuck with multiple Datasets.
Though I agree we should maybe define the iterators directly on Dataset.

Member Author:
The iterators are added.

"""Return the splits defined by the associated metadata.

The split is given via a tuple of integer with each integers
representing the integer after the last id used by this split. For
example::

(5000, 6000, 7000)

would give a test set of all examples from 0 to 4999, a validation
set of examples 5000 to 5999 and a test set of examples 6000 up to
6999. This means that 7000 is also the number of examples in the
dataset.

Returns
-------
tuple of int
Where each integer gives the id of the example coming after the
last one in a split.

Notes
-----
For now, only a tuple is accepted. Eventually, predicates over the
examples id could be supported.

"""
if isinstance(self.meta_data.splits, tuple):
return self.meta_data.splits
else:
raise NotImplementedError("Only splits with tuple are supported.")

    def apply(self):
        """Apply the preprocess specified in the associated metadata.

        This method simply applies the function given in the metadata (the
        identity by default) to the dataset. That function is expected to
        work on the data and the targets, leaving the rest intact. Still,
        as long as the result is a `Dataset`, `apply` will work.

        Returns
        -------
        Dataset
            The preprocessed dataset.

        """
        ds = self.meta_data.preprocess(self)
        assert isinstance(ds, Dataset)
        return ds


class Metadata():
    """Keep track of information about a dataset.

    An instance of this class is required to build a `Dataset`. It gives
    information on how the dataset is called, its splits, etc.

    A single `Dataset` can have multiple metadata files specifying different
    splits or a special pre-processing that needs to be applied. The
    philosophy is to have a single physical copy of the dataset with
    different views that can be created on the fly as needed.

    Attributes
    ----------
    name : str
        The name of the `Dataset`. Default: "Default".
    nb_examples : int
        The number of examples in the dataset (including all splits). Default: 0.
    dictionary : Dictionary
        Gives a mapping of words (str) to ids (int). Used only when the
        dataset has been saved as an array of numbers instead of text.
        Default: None.
    splits : tuple of int
        Specifies the split used by this view of the dataset. Default: ().
    preprocess : function or None
        A function that is callable on a `Dataset` to preprocess the data.
        The function cannot be a lambda function since those can't be pickled.
        Default: identity function.
    hash : str
        The hash of the linked ``Dataset``. Default: "".

    """
    def __init__(self):
        self.name = "Default"
        self.nb_examples = 0
        self.dictionary = None
        self.splits = ()
        self.preprocess = default_preprocess
        self.hash = ""

def default_preprocess(dset):
    return dset


class Dictionary:
    """Word / integer association list.

    This dictionary is used in `Metadata` for NLP problems. This class
    ensures O(1) conversion from id to word and O(log n) conversion from
    word to id.

    Notes
    -----
    The class is *not yet implemented*.

    Plans are for the dictionary to be implemented as a list of words
    alphabetically ordered, with the index of the word being its id. A
    method implements a binary search over the words in order to retrieve
    its id.

    """

    def __init__(self):
        raise NotImplementedError("The class Dictionary is not yet "
                                  "implemented.")