Adds more resiliency to retrying read operations during failures. #685
Conversation
Actually, hold off on merging. This code retries only once. We just tested it; it retried once, but then failed. It needs to retry more than once, as an election can take a few seconds.
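To illustrate the behavior being asked for, here is a minimal sketch, not the driver's actual code, of a read retry loop that makes several attempts with a pause between them so it can outlast an election that takes a few seconds. The `max_retries` and `retry_interval` names are assumptions for illustration; the rescued classes are assumed to be the 2.x driver's socket error types.

```ruby
# Minimal sketch only: retry a read several times with a pause between
# attempts so that a replica set election has time to complete.
def read_with_retry(max_retries: 5, retry_interval: 1)
  attempts = 0
  begin
    yield
  rescue Mongo::Error::SocketError, Mongo::Error::SocketTimeoutError
    attempts += 1
    raise if attempts > max_retries
    sleep(retry_interval) # give the replica set a chance to elect a new primary
    retry
  end
end

# Hypothetical usage:
# read_with_retry { collection.find(name: 'test').first }
```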
We ran a variety of failover tests earlier this week against HEAD. It seems that this driver is very resilient to errors at the mongoS layer, but abysmal at elections and problems in the mongoD layer. Scenarios we tested:
I believe this pull will fix that as well via commit 86053b8. However, this driver still very badly needs additional retry logic to handle even a graceful step down.
@@ -103,7 +107,9 @@ def ensure_same_process!
      end

      def read
-       ensure_connected{ |socket| Protocol::Reply.deserialize(socket) }
+       read_with_retry do
This will not work here, as the connection does not know anything about the cluster and the retry code calls cluster.scan!.
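In other words, a retry wrapper has to live at a layer that holds a reference to the cluster so it can rescan the topology between attempts. A hypothetical sketch of that shape, not the driver's actual retry code, with an assumed signature:

```ruby
# Hypothetical sketch: the retry wrapper needs the cluster so it can
# call cluster.scan! and rediscover the new primary before retrying.
def read_with_retry(cluster, retries = 1, &block)
  block.call
rescue Mongo::Error::SocketError, Mongo::Error::SocketTimeoutError
  raise if retries <= 0
  cluster.scan! # re-examine all servers; a new primary may have been elected
  read_with_retry(cluster, retries - 1, &block)
end
```

A bare Connection has no such cluster reference, which is the objection above.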
Hm. Thoughts on how to handle a retry if the read that ultimately gets called from here gets a socket error, then?
Stack trace again:
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/socket.rb:123:in `block in read'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/socket.rb:191:in `handle_errors'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/socket.rb:122:in `read'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/protocol/serializers.rb:62:in `deserialize'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/protocol/message.rb:193:in `deserialize_header'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/protocol/message.rb:83:in `deserialize'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connectable.rb:106:in `block in read'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connectable.rb:91:in `ensure_connected'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connectable.rb:106:in `read'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connection.rb:180:in `deliver'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connection.rb:125:in `block in dispatch'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/monitoring/publishable.rb:47:in `publish_command'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connection.rb:124:in `dispatch'
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/operation/executable.rb:36:in `block in execute'
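One way to handle that, sketched here under assumptions rather than as the driver's actual solution, is to retry further up the stack shown above, around operation execution, where the server and therefore the cluster are reachable. The method and accessor names below are illustrative only.

```ruby
# Hypothetical sketch: retry at the operation layer instead of inside
# Server::Connectable#read, because the operation layer can reach the cluster.
def execute_with_retry(operation, server, retries = 1)
  operation.execute(server.context)
rescue Mongo::Error::SocketError, Mongo::Error::SocketTimeoutError
  raise if retries <= 0
  server.cluster.scan! # refresh the topology so the retry targets the new primary
  execute_with_retry(operation, server, retries - 1)
end
```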
What is your
We don't alter the
@durran I saw that you all released 2.1.0 last week. Did you test this and fix the problem? IMO if the official driver can't handle a graceful step-down, I don't think it's ready for prime time.
This is not yet resolved... We did tests with planned elections and everything worked as expected for the basic scenarios, which is already better than what the 1.x series had... We're still looking at better retry along the lines of what you've proposed here, which will probably go into the 2.1.1 release. We had other dependencies pressuring the 2.1.0 GA that did not need more robust retry... But we're still looking at it.
Do you have any idea when you are aiming for that release? That's the last thing we are bottlenecked on in order to upgrade to the official driver / Mongoid 5, which then lets us immediately upgrade our staging environment to MongoDB 3.0.x and start testing that. Many features in our upcoming roadmap depend on 3.0.x, and we've been waiting a very long time to try it out.
I'm looking into this today and will then put it into master for testing... If that looks good, we'll figure out the release of 2.1.1 in our weekly meeting tonight.
Just to confirm: you're testing on a sharded cluster with multiple mongos behind a single load balancer, correct? That would be why the general ping strategy is not enough for your case: the ping goes to the mongoses, which are always running and ready throughout the reconfiguration.
Yes, that's correct. We host with ObjectRocket, which uses one connection string to the load balancer. We once tried to move away from this setup but ran into a lot of issues with connection limits, since we're still on Mongo 2.4 and the connection handling isn't nearly as good as it is in >= 2.6.
@durran if you can get this onto master this week, I'll redo the failover testing and report back on how it worked.
@jonhyman Yeah, I'll have something in this week.
@durran friendly reminder. I have scheduled time with our DBAs to do failover testing on Tuesday night, so please try to have something before then. Thanks.
Sounds great - will have it in before Monday.
See #692. You can test against the branch
Adds some failover support code from Moped which handles reads during failovers better.
We've been doing failover testing with the new driver (add8a77). While doing a kill -9 of the mongod primary, we saw this error:

While doing a stepdown of a primary, we got the same stack trace, but the error was
This pull should fix these issues.