OutOfMemoryError while computing LDA model for large .mallet file #165
Comments
Hi Phillip, I ran into something similar before. I think you can get around this issue by splitting the original dataset into small text files (about 1 MB each). I hope it goes well.
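(For illustration only, a minimal sketch of that kind of split, breaking at line boundaries so no document is cut in half; the file names and the 1 MB threshold are placeholders, nothing Mallet-specific:)

```java
import java.io.*;

// Split a large line-per-document text file into ~1 MB chunks.
// Splits only at line boundaries so no document is truncated.
// File names are hypothetical.
public class SplitCorpus {
    public static void main(String[] args) throws IOException {
        long maxBytes = 1024 * 1024; // ~1 MB per chunk
        int part = 0;
        long written = 0;
        BufferedReader in = new BufferedReader(new FileReader("corpus.txt"));
        PrintWriter out = new PrintWriter(new FileWriter("chunk-" + part + ".txt"));
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line);
            written += line.length() + 1;
            if (written >= maxBytes) { // roll over to a new chunk
                out.close();
                part++;
                written = 0;
                out = new PrintWriter(new FileWriter("chunk-" + part + ".txt"));
            }
        }
        out.close();
        in.close();
    }
}
```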
@pstroe yes, this is a problem preventing an otherwise great implementation from being useful for practical data sets in the wild. I didn't want to load everything into memory when creating the .mallet serialized data, so I hacked it to iterate: #170. It's not clear to me that multiple files would help, and that route creates another problem: a directory of files expects one instance per file. In my case, that would mean hundreds of millions of files, which isn't practical.
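(For context, and not the actual patch in #170: a rough sketch of iterating instances through Mallet's standard pipe API instead of materializing all the raw text first. The file name, the id/label/text line format, and the regexes are assumptions. Note that the resulting InstanceList itself still sits in memory before save(), which is presumably what the linked change had to work around more deeply.)

```java
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;

import java.io.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

// Feed documents through the pipe one line at a time via CsvIterator
// rather than reading the whole corpus up front. Paths and line
// format (id<TAB>label<TAB>text) are assumptions for illustration.
public class StreamingImport {
    public static void main(String[] args) throws IOException {
        ArrayList<Pipe> pipes = new ArrayList<Pipe>();
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));

        Reader reader = new BufferedReader(new FileReader("corpus.tsv"));
        // CsvIterator reads lazily; each line becomes one instance.
        instances.addThruPipe(new CsvIterator(reader,
                Pattern.compile("^(\\S+)\\t(\\S+)\\t(.*)$"), 3, 2, 1));
        reader.close();

        instances.save(new File("corpus.mallet"));
    }
}
```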
@JeloH thanks for the suggestion, but as @jfelectron explains, this would not help in our case. Also: thanks @jfelectron for your response. I would say our data is very practical, it just comes in new dimensions. Our workaround was training the model (which is not a problem, so Mallet can handle that amount of data) and writing out an inferencer; we then ran inference on the training data. So instead of writing out the data all at once, wouldn't it be possible to constantly append to the output file?
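(A minimal sketch of that workaround, assuming the inferencer and the .mallet instances were already written to disk; the file names and the sampling parameters passed to getSampledDistribution are illustrative, not the exact values used above:)

```java
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

import java.io.File;
import java.io.PrintWriter;

// Re-infer topic proportions for the training data and write each
// document's row as soon as it is computed, so nothing is accumulated
// in memory. File names and sampling parameters are illustrative.
public class StreamDocTopics {
    public static void main(String[] args) throws Exception {
        TopicInferencer inferencer = TopicInferencer.read(new File("model.inferencer"));
        InstanceList instances = InstanceList.load(new File("corpus.mallet"));

        PrintWriter out = new PrintWriter(new File("doc-topics.txt"));
        int doc = 0;
        for (Instance instance : instances) {
            // 100 sampling iterations, thinning 10, burn-in 10
            double[] proportions = inferencer.getSampledDistribution(instance, 100, 10, 10);
            StringBuilder line = new StringBuilder();
            line.append(doc).append('\t').append(instance.getName());
            for (double p : proportions) {
                line.append('\t').append(p);
            }
            out.println(line);
            doc++;
        }
        out.close();
    }
}
```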
Hello there,
While training a model on a rather large data set, we get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.TreeMap.put(TreeMap.java:577)
    at java.util.TreeSet.add(TreeSet.java:255)
    at cc.mallet.topics.ParallelTopicModel.getTopicDocuments(ParallelTopicModel.java:1743)
    at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1762)
We created a .mallet file for our data with the bulk-load function. In total, we have about 1.3 billion words in more or less 17 million articles. We compute on 59 cores and reserve 180g of heap for Mallet. The 1000 iterations to estimate 100 topics run through without any problem; it seems that writing the doc-topics file is what aborts the process. Any thoughts on why this might be the case? Or is there another issue?
Looking forward to reading your answer,
Phillip
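(An observation from the stack trace, not a maintainer answer: getTopicDocuments appears to build a sorted set of (weight, document) pairs per topic over all 17M documents before anything is printed, which is a large extra allocation on top of the already-trained model; that would explain why training succeeds and only the output step dies. A hedged sketch of a streaming alternative that writes one row per document straight from the trained model via getTopicProbabilities; file names, hyperparameters, and thread/iteration counts are illustrative:)

```java
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

import java.io.File;
import java.io.PrintWriter;

// Write per-document topic proportions directly from the trained model,
// one line at a time, instead of ranking all documents per topic in
// memory first. File names and parameters are placeholders.
public class WriteDocTopics {
    public static void main(String[] args) throws Exception {
        InstanceList instances = InstanceList.load(new File("corpus.mallet"));

        ParallelTopicModel model = new ParallelTopicModel(100, 50.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(59);
        model.setNumIterations(1000);
        model.estimate();

        PrintWriter out = new PrintWriter(new File("doc-topics.txt"));
        for (int doc = 0; doc < instances.size(); doc++) {
            double[] proportions = model.getTopicProbabilities(doc);
            StringBuilder line = new StringBuilder();
            line.append(doc).append('\t').append(instances.get(doc).getName());
            for (double p : proportions) {
                line.append('\t').append(p);
            }
            out.println(line);
        }
        out.close();
    }
}
```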