JanusGraph becomes unresponsive after ~10 days #2462
Unanswered
BrunoBerisso asked this question in Q&A
Replies: 1 comment 2 replies
- Do you have any thread dump and/or heap dump before/when OOM happens?
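For anyone hitting the same issue, here is a minimal sketch of how such dumps could be captured from inside the JVM before the OOM actually hits. The thread-count threshold, output path, and class name are arbitrary illustration values, not anything from JanusGraph itself:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import com.sun.management.HotSpotDiagnosticMXBean;

public class OomDiagnostics {

    // Arbitrary values for illustration only.
    private static final int THREAD_THRESHOLD = 2000;
    private static final String HEAP_DUMP_PATH = "/tmp/janusgraph-heap.hprof";

    // Log a thread dump and write a heap dump once the live thread count climbs
    // past the threshold, i.e. before "unable to create new native thread" is hit.
    public static void captureIfThreadsHigh() throws Exception {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (threads.getThreadCount() < THREAD_THRESHOLD) {
            return;
        }

        // Thread dump: each thread's name, state and (truncated) stack trace.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.err.print(info.toString());
        }

        // Heap dump of live objects only, written to HEAP_DUMP_PATH.
        HotSpotDiagnosticMXBean diagnostics = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(HEAP_DUMP_PATH, true);
    }
}
```

The same information can also be collected externally with standard JDK tools such as jstack and jmap.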
-
Hi all.
Important: I'm running a modified version of JanusGraph 4.0 that includes the changes described in #2148.
I have a three-node JanusGraph cluster backed by another three-node Scylla cluster. We use ConfiguredGraphFactory to create and drop approximately 10 graphs per day, and we keep around 70 graphs with historic data. The graphs are read-only, have a couple of simple indices, and are small, with under 1M vertices or edges; a rough sketch of this create/drop cycle is included below.

The problem is that after around two weeks of running smoothly, one of the nodes starts timing out 95% of the requests. It shows high memory usage, and the logs show two types of exceptions that I believe are related:
org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
java.lang.OutOfMemoryError: unable to create new native thread
I'm not sure how these two exceptions are related, but it seems to be some kind of leak during the drop process that's causing the second one.
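For context, this is roughly what the daily create/drop cycle looks like; it is a simplified sketch, not the actual code, and it assumes a template configuration has already been registered via ConfiguredGraphFactory.createTemplateConfiguration(...):

```java
import org.janusgraph.core.ConfiguredGraphFactory;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class DailyGraphCycle {

    // Create a new graph from the pre-registered template configuration.
    public static JanusGraph createDailyGraph(String graphName) {
        return ConfiguredGraphFactory.create(graphName);
    }

    // Drop an old graph the way described below: open it, then call
    // JanusGraphFactory.drop(graph), which clears the graph's data in the
    // storage backend and closes the instance.
    public static void dropOldGraph(String graphName) throws Exception {
        JanusGraph graph = ConfiguredGraphFactory.open(graphName);
        JanusGraphFactory.drop(graph);
    }
}
```

ConfiguredGraphFactory also exposes drop(graphName), which additionally removes the stored configuration; whether using that instead changes anything here is unclear.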
Could not execute operation due to backend exception
Sometimes when a graph is dropped I see this exception in the logs on the node that performed the JanusGraphFactory.drop(graph); call. The other nodes don't show any particular activity.

java.lang.OutOfMemoryError
This exception appears in the logs after one of the drop() operations is performed and then continues to appear 261 times in a 3-minute window with almost no other messages interleaved. This is the log entry: