Connection to ZooKeeper lost on OS X #16

Open
lauritzthamsen opened this issue Jul 4, 2014 · 5 comments

@lauritzthamsen
Member

running the example clients currently fails on OS X with the following output:

2014-07-04 13:47:15,339 |  INFO [main] (LocalClusterSimulator.java:98) - CREATE TMP DIRECTORY: '/var/folders/5g/lk8wz6sd62b63m831_rh3h1w0000gn/T/zookeeper'
2014-07-04 13:47:15,820 |  INFO [nioEventLoopGroup-2-1] (DataReader.java:96) - network server bound to address /141.23.83.200:55283
2014-07-04 13:47:15,824 |  INFO [localEventLoopGroup-4-1] (DataReader.java:96) - network server bound to address local:5d47babf-0251-48ce-8bac-415aa2980314
2014-07-04 13:47:15,827 |  INFO [nioEventLoopGroup-6-1] (IOManager.java:312) - network server bound to address /141.23.83.200:55283
2014-07-04 13:47:21,044 | ERROR [main] (NIOServerCnxnFactory.java:44) - Thread Thread[main,5,main] died
java.lang.IllegalStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /aura
    at de.tuberlin.aura.workloadmanager.InfrastructureManager.<init>(InfrastructureManager.java:116)
    at de.tuberlin.aura.workloadmanager.InfrastructureManager.getInstance(InfrastructureManager.java:133)
    at de.tuberlin.aura.workloadmanager.WorkloadManager.<init>(WorkloadManager.java:77)
    at de.tuberlin.aura.workloadmanager.WorkloadManager.<init>(WorkloadManager.java:57)
    at de.tuberlin.aura.client.executors.LocalClusterSimulator.<init>(LocalClusterSimulator.java:128)
    at de.tuberlin.aura.client.executors.LocalClusterSimulator.<init>(LocalClusterSimulator.java:63)
    at de.tuberlin.aura.demo.examples.IntegrationTests.main(IntegrationTests.java:480)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

this is the case for both the state on master (e.g. SimpleClient at 90147c3) and develop (e.g. IntegrationTests at 87451d6).

stepping through these clients sometimes leads to successful runs, which might suggest a timing issue and not a general problem with OS X.

@lauritzthamsen
Member Author

The problem seems to be that we use the zookeeper-object without making sure that a connection to zookeeper has been established.

logging

LOG.info(String.valueOf(zookeeper.getState()));

before calling

ZookeeperHelper.initDirectories(this.zookeeper);

shows that the zookeeper-object is still in the CONNECTING state just before

java.lang.IllegalStateException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /aura

@lauritzthamsen
Member Author

waiting for state CONNECTED resolves this issue.
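
a minimal sketch of such a wait, assuming a latch that the Watcher releases once the connected event arrives. `ConnectionGate` and its method names are hypothetical and not part of Aura or ZooKeeper; the sketch uses only the JDK so it can run standalone, and the comments mark where the ZooKeeper Watcher would hook in.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: blocks callers until the connection is reported as established. */
final class ConnectionGate {
    private final CountDownLatch connected = new CountDownLatch(1);

    /**
     * Intended to be called from the Watcher callback, e.g. when
     * event.getState() == Watcher.Event.KeeperState.SyncConnected.
     */
    void markConnected() {
        connected.countDown();
    }

    /** Returns true if the connection was reported within the timeout, false otherwise. */
    boolean awaitConnected(long timeout, TimeUnit unit) throws InterruptedException {
        return connected.await(timeout, unit);
    }
}
```

with something like this, the setup code could call `awaitConnected(...)` before `ZookeeperHelper.initDirectories(this.zookeeper)` and fail with a clear message if the timeout expires. the timeout value is left to the caller.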

i also found Apache Curator. it's a framework built on top of ZooKeeper and provides a higher-level API as well as connection guarantees. i think it might be a good idea for us to use Curator.

@Teots
Teots commented Jul 8, 2014

Session establishment is asynchronous. This constructor will initiate connection to the server and return immediately - potentially (usually) before the session is fully established. The watcher argument specifies the watcher that will be notified of any changes in state. This notification can come at any point before or after the constructor call has returned.

Apparently, the connection setup can occasionally take longer than the execution of the constructor. But this can be solved easily by adding a new case to the switch in the Watcher: it should execute the initDirectories method after receiving the connected state.
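
A rough sketch of that Watcher-driven variant, using only the JDK so it runs standalone. `SessionState` is a hypothetical, reduced stand-in for ZooKeeper's `Watcher.Event.KeeperState`, and `InitOnConnect` is an illustrative name, not existing Aura code. The guard is there because the connected state is reported again after a reconnect, so the initialization would otherwise run more than once.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for Watcher.Event.KeeperState, reduced to three values. */
enum SessionState { DISCONNECTED, CONNECTED, EXPIRED }

/** Runs a one-time initialization step when the connected state is first observed. */
final class InitOnConnect {
    private final Runnable init;
    private final AtomicBoolean done = new AtomicBoolean(false);

    InitOnConnect(Runnable init) {
        this.init = init;
    }

    /** Would be called from Watcher.process(WatchedEvent) with the event's state. */
    void onStateChange(SessionState state) {
        switch (state) {
            case CONNECTED:
                // Run only once: a reconnect reports the connected state again.
                if (done.compareAndSet(false, true)) {
                    init.run();
                }
                break;
            case DISCONNECTED:
            case EXPIRED:
            default:
                // Nothing to initialize in these states.
                break;
        }
    }
}
```

Here the `Runnable` would wrap the call to initDirectories; any other code that touches ZooKeeper still has to be sequenced after this callback on its own.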

@lauritzthamsen
Member Author

well, all further interactions with the ZooKeeper files need the connection to be established, not just initDirectories(). all these interactions would have to take place in the Watcher's event callback, but the TaskManager's setupZookeeper() method even returns the zookeeper object for further interactions with the ZooKeeper server... i think it's easiest to explicitly wait for the connection to be established as a fix for now.

@lauritzthamsen
Member Author

i'll also have a look at Curator in the next few days. would just be cool to have it take care of connection establishment and failures for us.
