OutOfMemoryError while computing LDA model for large .mallet file #165
Comments
Hi Phillip, I ran into something similar before. I think you can get around this issue by splitting the original dataset into small text files (about 1 MB each). I hope it goes well.
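(For illustration only, a minimal sketch of that kind of split, breaking at line boundaries so no document is cut in half; the file names and the 1 MB threshold are placeholders, nothing Mallet-specific:)

```java
import java.io.*;

// Split a large line-per-document text file into ~1 MB chunks.
// Splits only at line boundaries so no document is truncated.
// File names are hypothetical.
public class SplitCorpus {
    public static void main(String[] args) throws IOException {
        long maxBytes = 1024 * 1024; // ~1 MB per chunk
        int part = 0;
        long written = 0;
        BufferedReader in = new BufferedReader(new FileReader("corpus.txt"));
        PrintWriter out = new PrintWriter(new FileWriter("chunk-" + part + ".txt"));
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line);
            written += line.length() + 1;
            if (written >= maxBytes) { // roll over to a new chunk
                out.close();
                part++;
                written = 0;
                out = new PrintWriter(new FileWriter("chunk-" + part + ".txt"));
            }
        }
        out.close();
        in.close();
    }
}
```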
@pstroe yes, this is a problem preventing an otherwise great implementation from being useful for practical data sets in the wild. I didn't want to load everything into memory when creating the .mallet serialized data, so I hacked it to iterate: #170. It's not clear to me that multiple files would help, and that route creates another problem: a directory of files expects one instance per file. In my case, that would mean hundreds of millions of files, which isn't practical.
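(For context, and not the actual patch in #170: a rough sketch of iterating instances through Mallet's standard pipe API instead of materializing all the raw text first. The file name, the id/label/text line format, and the regexes are assumptions. Note that the resulting InstanceList itself still sits in memory before save(), which is presumably what the linked change had to work around more deeply.)

```java
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;

import java.io.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

// Feed documents through the pipe one line at a time via CsvIterator
// rather than reading the whole corpus up front. Paths and line
// format (id<TAB>label<TAB>text) are assumptions for illustration.
public class StreamingImport {
    public static void main(String[] args) throws IOException {
        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));

        Reader reader = new BufferedReader(new FileReader("corpus.tsv"));
        // CsvIterator reads lazily; each line becomes one instance.
        instances.addThruPipe(new CsvIterator(reader,
                Pattern.compile("^(\\S+)\\t(\\S+)\\t(.*)$"), 3, 2, 1));
        reader.close();

        instances.save(new File("corpus.mallet"));
    }
}
```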
@JeloH thanks for the suggestion, but as @jfelectron explains, this would not help in our case. Also: thanks @jfelectron for your response. I would say our data is very practical, it just comes in new dimensions. Our workaround was training the model (which is not a problem, so Mallet can handle that amount of data) and writing out an inferencer; we then ran inference on the training data. So instead of writing out the data all at once, wouldn't it be possible to constantly append to the output file?
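(A minimal sketch of that workaround, assuming the inferencer and the .mallet instances were already written to disk; the file names and the sampling parameters passed to getSampledDistribution are illustrative, not the exact values used above:)

```java
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

import java.io.File;
import java.io.PrintWriter;

// Re-infer topic proportions for the training data and write each
// document's row as soon as it is computed, so nothing is accumulated
// in memory. File names and sampling parameters are illustrative.
public class StreamDocTopics {
    public static void main(String[] args) throws Exception {
        TopicInferencer inferencer = TopicInferencer.read(new File("model.inferencer"));
        InstanceList instances = InstanceList.load(new File("corpus.mallet"));

        PrintWriter out = new PrintWriter(new File("doc-topics.txt"));
        int doc = 0;
        for (Instance instance : instances) {
            // 100 sampling iterations, thinning 10, burn-in 10
            double[] proportions = inferencer.getSampledDistribution(instance, 100, 10, 10);
            StringBuilder line = new StringBuilder();
            line.append(doc).append('\t').append(instance.getName());
            for (double p : proportions) {
                line.append('\t').append(p);
            }
            out.println(line);
            doc++;
        }
        out.close();
    }
}
```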
Hello there,
While training a model on a rather large data set, we get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.TreeMap.put(TreeMap.java:577)
    at java.util.TreeSet.add(TreeSet.java:255)
    at cc.mallet.topics.ParallelTopicModel.getTopicDocuments(ParallelTopicModel.java:1743)
    at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1762)
We created a .mallet file for our data with the bulk-load function. In total, we have about 1.3 billion words in more or less 17 million articles. We compute on 59 cores and reserve 180g of heap for Mallet. The 1000 iterations to estimate 100 topics run through without any problem; it seems that writing the doc-topics file is what aborts the process. Any thoughts on why this might be the case? Or is there another issue?
Looking forward to reading your answer,
Phillip
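(An observation from the stack trace, not a maintainer answer: getTopicDocuments appears to build a sorted set of (weight, document) pairs per topic over all 17M documents before anything is printed, which is a large extra allocation on top of the already-trained model; that would explain why training succeeds and only the output step dies. A hedged sketch of a streaming alternative that writes one row per document straight from the trained model via getTopicProbabilities; file names, hyperparameters, and thread/iteration counts are illustrative:)

```java
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

import java.io.File;
import java.io.PrintWriter;

// Write per-document topic proportions directly from the trained model,
// one line at a time, instead of ranking all documents per topic in
// memory first. File names and parameters are placeholders.
public class WriteDocTopics {
    public static void main(String[] args) throws Exception {
        InstanceList instances = InstanceList.load(new File("corpus.mallet"));

        ParallelTopicModel model = new ParallelTopicModel(100, 50.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(59);
        model.setNumIterations(1000);
        model.estimate();

        PrintWriter out = new PrintWriter(new File("doc-topics.txt"));
        for (int doc = 0; doc < instances.size(); doc++) {
            double[] proportions = model.getTopicProbabilities(doc);
            StringBuilder line = new StringBuilder();
            line.append(doc).append('\t').append(instances.get(doc).getName());
            for (double p : proportions) {
                line.append('\t').append(p);
            }
            out.println(line);
        }
        out.close();
    }
}
```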