Issue Description
The crawler crashes unexpectedly after a while, claiming that resource limits have been reached.
How to reproduce it
Seed the crawler with 10,000 unique URLs and crawl using the default fetcher, and you will be greeted with the following:
2021-04-15 13:45:06 INFO FairFetcher$:71 - Adding doc to SOLR
[15128.721s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
2021-04-15 13:45:06 WARN BlockManager:69 - Block rdd_25_0 could not be removed as it was not found on disk or in memory
2021-04-15 13:45:06 ERROR Executor:94 - Exception in task 0.0 in stage 15.0 (TID 11)
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.lang.Thread.start0(Native Method) ~[?:?]
at java.lang.Thread.start(Thread.java:799) ~[?:?]
at shaded.org.apache.http.impl.client.IdleConnectionEvictor.start(IdleConnectionEvictor.java:96) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at shaded.org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:1227) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:319) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:330) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:268) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:255) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.solr.client.solrj.impl.HttpSolrClient.(HttpSolrClient.java:204) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.build(HttpSolrClient.java:952) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at edu.usc.irds.sparkler.storage.solr.SolrProxy.newClient(SolrProxy.scala:45) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at edu.usc.irds.sparkler.storage.solr.SolrProxy.(SolrProxy.scala:78) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at edu.usc.irds.sparkler.storage.StorageProxyFactory.getProxy(StorageProxyFactory.scala:33) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at edu.usc.irds.sparkler.model.SparklerJob.newStorageProxy(SparklerJob.scala:54) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:72) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:29) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:311) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.scheduler.Task.run(Task.scala:127) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) ~[sparkler-app-0.2.2-SNAPSHOT.jar:?]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) [sparkler-app-0.2.2-SNAPSHOT.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
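Reading the trace, every FairFetcher.next call goes through SparklerJob.newStorageProxy into HttpSolrClient$Builder.build, and each client built that way starts an IdleConnectionEvictor thread. If those clients are never closed, evictor threads accumulate until pthread_create fails with EAGAIN. The sketch below illustrates that pattern only; FakeClient and the OPEN counter are hypothetical stand-ins, not Sparkler's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLeakSketch {
    // Counts clients that have been constructed but not closed.
    static final AtomicInteger OPEN = new AtomicInteger();

    // Stand-in for HttpSolrClient: construction spawns a background thread,
    // mimicking IdleConnectionEvictor.start(); close() is what reclaims it.
    static class FakeClient implements AutoCloseable {
        private final Thread evictor;
        FakeClient() {
            OPEN.incrementAndGet();
            evictor = new Thread(() -> {
                try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
            }, "fake-evictor");
            evictor.setDaemon(true);
            evictor.start();
        }
        @Override public void close() {
            evictor.interrupt();
            OPEN.decrementAndGet();
        }
    }

    public static void main(String[] args) throws Exception {
        // Leaky pattern the trace suggests: a fresh client per fetched
        // document, never closed -> one evictor thread lingers per document.
        for (int i = 0; i < 100; i++) {
            FakeClient perDoc = new FakeClient();
            // ... index one document, then drop the reference without close()
        }
        System.out.println("clients still open after loop: " + OPEN.get()); // 100

        // Safe pattern: one shared client for the whole batch, closed once.
        try (FakeClient shared = new FakeClient()) {
            for (int i = 0; i < 100; i++) {
                // ... index with `shared`
            }
        }
        // Only the 100 leaked clients from the first loop remain open here.
    }
}
```

If this reading is right, reusing one client per partition (or closing the proxy after each document) would stop the thread growth.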
Environment and Version Information
Java Version
1.8
Spark Version
Embedded spark in the JAR
Operating System name and version
Linux crawler 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
I have raised the limit on the maximum number of processes to unlimited. Checking the system while the crawl was in progress, there were 27,302 processes, 26,540 of them belonging to sparkler, so this looks like there is a leak somewhere.
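For anyone reproducing this, the counts above can be watched from a second terminal while the crawl runs. A quick sketch, assuming a standard Linux userland (the `pgrep` pattern matches my process name and may differ on your machine):

```shell
# Per-user limit on processes/threads that pthread_create runs into
ulimit -u

# Total lightweight processes (threads) on the box; each -L line is one thread
ps -eLf | wc -l

# Thread count of the sparkler JVM, if one is running
pid=$(pgrep -f sparkler | head -n1)
if [ -n "$pid" ]; then ps -o nlwp= -p "$pid"; else echo "no sparkler process"; fi
```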