Memory leak in ParquetDataset #998
Comments
Why create 1000 Datasets? This will create 1000 Dataset objects in memory.
Oh sorry, I made a mistake with the iteration count. With 180 iterations, it only costs 0.5M, but when the iteration count increases to 1000, it costs 3 GB. Why, and is there any advice on how to release the dataset? In our scenario, the data source is stored in a Hadoop directory every hour or every day, and we either train on 3 months of data or do online learning every 15 minutes. Daily training, for example:
The ParquetDataset supports accepting a list of files:
filenames = [file1, file2]  # all parquet files for training
dataset = ParquetDataset(filenames, ...)
Create only one 'ParquetDataset' for training, and another 'ParquetDataset' for eval.
It does create only one 'ParquetDataset' per day. We train and evaluate on 90 days of data, so it creates 180 'ParquetDataset' objects.
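To make the list-of-files suggestion concrete for a per-day layout, here is a minimal sketch that collects the paths for all 90 days up front, so that only two datasets are needed instead of 180. The helper name `daily_parquet_paths`, the `hdfs://namenode/data` root, and the `<root>/<date>/<split>.parquet` layout are assumptions made for illustration, not the reporter's actual paths; the `ParquetDataset` calls are left as comments because their remaining arguments are not shown in this thread.

```python
from datetime import date, timedelta


def daily_parquet_paths(root, start, num_days, split):
    """Return one Parquet path per day, oldest first.

    The directory layout <root>/<YYYY-MM-DD>/<split>.parquet is hypothetical.
    """
    return [
        f"{root}/{(start + timedelta(days=i)).isoformat()}/{split}.parquet"
        for i in range(num_days)
    ]


# Hypothetical root and start date, covering a 90-day window.
train_files = daily_parquet_paths("hdfs://namenode/data", date(2023, 1, 1), 90, "train")
eval_files = daily_parquet_paths("hdfs://namenode/data", date(2023, 1, 1), 90, "eval")

# With the lists built, one dataset per split covers the whole window:
# train_dataset = ParquetDataset(train_files, ...)
# eval_dataset = ParquetDataset(eval_files, ...)
```

For the 15-minute online-learning case, the same idea would apply with a shorter window: rebuild the file list for the new interval rather than keeping earlier dataset objects around.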
System information
Describe the current behavior
A memory leak occurs in ParquetDataset: after running the Python code, memory usage increases to 3 GB.
Describe the expected behavior
Memory usage stays stable when using ParquetDataset.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.