mldata 1.0 #2

Open · wants to merge 65 commits into master
Conversation

ASalvail
Member

Adds:

  • Datasets loaded lazily or in memory
  • Metadata to keep track of datasets
  • Configuration files to index datasets
  • Utilities to save/load/remove datasets
  • A first basic CSV importer to create datasets
  • Full docstring documentation
  • Test suite

ASalvail added 30 commits March 10, 2014 15:51
The storage requirement will instead be controlled by the driver (hdf5).
__iter__() and __getitem__() are not true instance methods: Python looks them up on the class, which makes defining them on the fly quite tricky. The old solution would reassign the correct definition in the __init__ method, but since the lookup happens on the class, that also affects other objects (which might need the other iterator or getter). Thus, the best way to make it work is the naive one: check on every call whether dataset.target is None.
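For illustration, a minimal sketch of the naive approach described above (class and attribute names follow the commit message, not the actual mldata code):

class Dataset:
    def __init__(self, data, target=None):
        self.data = data
        self.target = target

    def __getitem__(self, idx):
        # Special methods are looked up on the type, not on the instance,
        # so one shared definition must branch here rather than being
        # swapped per object in __init__.
        if self.target is None:
            return self.data[idx]
        return self.data[idx], self.target[idx]
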
@@ -1,6 +1,6 @@
language: python
python:
- "3.3"
- "3.4"
Member

Support for Python 3.4 has not been released yet on Travis. See travis-ci/travis-ci#1989.

Member Author

By the time you get around to fixing CI, support should be available.

@MarcCote
Member

The folder tools should be renamed scripts.

I know we discussed it, but are we planning on supporting both Python 2.7+ and Python 3+?


def setup_module():
    # save current config file
    os.rename(cfg.CONFIGFILE, cfg.CONFIGFILE + ".bak")
Member

This line fails when $HOME/.mldataConfig does not already exist.
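One possible guard, as a sketch (cfg.CONFIGFILE is reused from the excerpt above; this assumes the same test-module setup):

import os

def setup_module():
    # Back up the config file only if it exists; a fresh machine may
    # not have $HOME/.mldataConfig yet.
    if os.path.exists(cfg.CONFIGFILE):
        os.rename(cfg.CONFIGFILE, cfg.CONFIGFILE + ".bak")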

Member Author

Good catch, will be fixed.

@MarcCote
Copy link
Member

Why is there a capitalized D in tests/test_Dataset.py? Is it because of the class with the same name?

@ASalvail
Member Author

There is no tools folder...
No, we stick to Python 3.
And yes for the capital D.

@MarcCote
Member

Oops, you're right about tools, I must have looked at my branch :P

Regarding the uppercase D, I think it should be lowercase because it refers to the file under test, that is, dataset.py. Inside the test file, it is fine to use the name of the class.

def _create_default_config():
    """ Build and save a default config file for MLData.

    The default config is saved as ``.MLDataConfig`` in the ``$HOME`` folder
Member

Typo: you mean .mldataConfig

path = None
if cfg.dataset_exists(dset_name):
    path = cfg.get_dataset_path(dset_name)
return _load_from_file(dset_name + '_' + version_name, path, lazy)
Member

If the dataset is not found in the config file, _load_from_file will fail in os.path.join with a None path. Maybe we should display a better error message.
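One way the check could look, sketched with the names from the excerpt above (the message wording is hypothetical):

if not cfg.dataset_exists(dset_name):
    raise FileNotFoundError(
        "Dataset '{}' is not listed in the config file; "
        "import it before loading it.".format(dset_name))
path = cfg.get_dataset_path(dset_name)
return _load_from_file(dset_name + '_' + version_name, path, lazy)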

dataset = LazyDataset(lazy_functions)
datasetFile = h5py.File(file_to_load, mode='r', driver='core')

data = datasetFile['/']["data"]
Member

If lazy == False, do we want data to be an ndarray? Right now it is an HDF5 dataset, but it still supports iteration and indexing like numpy.
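For reference, a sketch of how the eager path could materialize the array (file_to_load and lazy come from the surrounding code; slicing an h5py dataset with [...] copies it into a numpy ndarray):

import h5py

datasetFile = h5py.File(file_to_load, mode='r', driver='core')
data = datasetFile['/']['data']
if not lazy:
    # Read the whole HDF5 dataset into memory as a numpy ndarray.
    data = data[...]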
