mldata 1.0 #2

Open · wants to merge 65 commits into master
Conversation

ASalvail
Member

Adds:

  • Datasets loaded lazily or in memory
  • Metadata to keep track of datasets
  • Configuration files to index datasets
  • Utilities to save/load/remove datasets
  • A first basic CSV importer to create datasets
  • Full docstring documentation
  • Test suite

ASalvail added 30 commits March 10, 2014 15:51
The storage requirement will instead be controlled by the driver (hdf5).
__iter__() and __getitem__() are not true instance methods: Python looks them up on the class, which makes defining them on the fly quite tricky. The old solution would reassign the correct definition in the __init__ method, but since the lookup happens on the class, that also affects other objects (which might need the other iterator or getter). Thus, the best way to make it work is the naive one: check on every call whether dataset.target is None.
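For illustration, a minimal sketch of the naive approach described above (class and attribute names follow the commit message, not the actual mldata code):

class Dataset:
    def __init__(self, data, target=None):
        self.data = data
        self.target = target

    def __getitem__(self, idx):
        # Special methods are looked up on the type, not on the instance,
        # so one shared definition must branch here rather than being
        # swapped per object in __init__.
        if self.target is None:
            return self.data[idx]
        return self.data[idx], self.target[idx]
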
@@ -1,6 +1,6 @@
language: python
python:
- "3.3"
- "3.4"
Member

Support for Python 3.4 has not been released yet on Travis. See travis-ci/travis-ci#1989.

Member Author

By the time you get around to fixing CI, support should be available.

@MarcCote
Member

The folder tools should be renamed scripts.

I know we discussed it, but are we planning on supporting both Python 2.7+ and Python 3+?


def setup_module():
    # save current config file
    os.rename(cfg.CONFIGFILE, cfg.CONFIGFILE + ".bak")
Member

This line fails when $HOME/.mldataConfig does not already exist.
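One possible guard, as a sketch (cfg.CONFIGFILE is reused from the excerpt above; this assumes the same test-module setup):

import os

def setup_module():
    # Back up the config file only if it exists; a fresh machine may
    # not have $HOME/.mldataConfig yet.
    if os.path.exists(cfg.CONFIGFILE):
        os.rename(cfg.CONFIGFILE, cfg.CONFIGFILE + ".bak")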

Member Author

Good catch, will be fixed.

@MarcCote
Copy link
Member

Why is there a capitalized D in tests/test_Dataset.py? Is it because of the class with the same name?

@ASalvail
Member Author

There is no tools folder...
No, we stick to Python 3.
And yes for the capital D.

@MarcCote
Member

Oops, you're right about tools, I must have looked at my branch :P

Regarding the uppercase D, I think it should be lowercase because it refers to the file under test, that is, dataset.py. Inside the test file, it is fine to use the name of the class.

def _create_default_config():
    """ Build and save a default config file for MLData.

    The default config is saved as ``.MLDataConfig`` in the ``$HOME`` folder
Member

Typo: you mean .mldataConfig

path = None
if cfg.dataset_exists(dset_name):
    path = cfg.get_dataset_path(dset_name)
return _load_from_file(dset_name + '_' + version_name, path, lazy)
Member

If the dataset is not found in the config file, _load_from_file will fail in os.path.join with a None path. Maybe we should display a better error message.
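One way the check could look, sketched with the names from the excerpt above (the message wording is hypothetical):

if not cfg.dataset_exists(dset_name):
    raise FileNotFoundError(
        "Dataset '{}' is not listed in the config file; "
        "import it before loading it.".format(dset_name))
path = cfg.get_dataset_path(dset_name)
return _load_from_file(dset_name + '_' + version_name, path, lazy)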

dataset = LazyDataset(lazy_functions)
datasetFile = h5py.File(file_to_load, mode='r', driver='core')

data = datasetFile['/']["data"]
Member

If lazy == False, do we want data to be an ndarray? Right now it is an HDF5 dataset, but it still supports iteration and indexing like numpy.
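For reference, a sketch of how the eager path could materialize the array (file_to_load and lazy come from the surrounding code; slicing an h5py dataset with [...] copies it into a numpy ndarray):

import h5py

datasetFile = h5py.File(file_to_load, mode='r', driver='core')
data = datasetFile['/']['data']
if not lazy:
    # Read the whole HDF5 dataset into memory as a numpy ndarray.
    data = data[...]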
