
Adds more resiliency to retrying read operations during failures. #685

Closed
jonhyman wants to merge 2 commits

Conversation

@jonhyman (Contributor) commented Sep 1, 2015

Adds some failover support code from Moped which handles reads during failovers better.

We've been doing failover testing with the new driver (add8a77). While doing a kill -9 of the mongod primary, we saw this error:

D, [2015-09-01T17:58:43.370391 #2194] DEBUG -- : MONGODB | REDACTED | staging.find | STARTED | {"find"=>"my_collection", "filter"=>{}}
D, [2015-09-01T17:58:43.374908 #2194] DEBUG -- : MONGODB | REDACTED | staging.find | FAILED | can't connect to new replica set master [REDACTED], err: couldn't connect to server REDACTED (REDACTED), connection attempt failed (13639) | 0.004322347000000001s
Mongo::Error::OperationFailure: can't connect to new replica set master [REDACTED], err: couldn't connect to server REDACTED (REDACTED), connection attempt failed (13639)
  from /var/www/platform/shared/dashboard/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-add8a77a48aa/lib/mongo/operation/result.rb:226:in `validate!'
  from /var/www/platform/shared/dashboard/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-add8a77a48aa/lib/mongo/operation/executable.rb:36:in `block in execute'
  from /var/www/platform/shared/dashboard/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-add8a77a48aa/lib/mongo/server/connection_pool.rb:111:in `with_connection'

While doing a stepdown of a primary, we got the same stack trace but the error was

D, [2015-09-01T18:05:43.974336 #2194] DEBUG -- : MONGODB | REDACTED | staging.find | FAILED | error querying server (15988) | 0.0012782479999999998s
Mongo::Error::OperationFailure: error querying server (15988)
from /var/www/platform/shared/dashboard/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-add8a77a48aa/lib/mongo/operation/result.rb:226:in `validate!'

This pull should fix these issues.

@jonhyman (Contributor, Author) commented Sep 1, 2015

Actually, hold off on merging. This code retries only once. We just tested it by kill -9ing a mongod and got this:

D, [2015-09-01T18:40:34.779203 #2090] DEBUG -- : MONGODB | REDACTED | staging.find | FAILED | error querying server (15988) | 0.006565967999999999s
D, [2015-09-01T18:40:34.781297 #2090] DEBUG -- : MONGODB | REDACTED | staging.find | STARTED | {"find"=>"my_collection", "filter"=>{}}
D, [2015-09-01T18:40:34.785768 #2090] DEBUG -- : MONGODB | REDACTED | staging.find | FAILED | ReplicaSetMonitor no master found for set: f35f71b69154572b602a36ded3d5e04b (10009) | 0.004394989s

so it retried once, but then failed. It needs to retry more than once, as an election can take a few seconds.

@jonhyman (Contributor, Author) commented Sep 3, 2015

@durran / @estolfo:

We ran a variety of failover tests earlier this week against HEAD. It seems that this driver is very resilient to errors at the mongoS layer, but abysmal at handling elections and problems in the mongoD layer. Scenarios we tested:

  • Graceful step down: Fails, the app exceptions out. Stack trace in the body of the pull.
  • kill -9 a mongod primary for a "dirty election": Fails, the app exceptions out. Stack trace in the body of the pull.
  • kill -9 a mongos process: Success, reconnects and keeps going.
  • Rolling restart of mongos processes: Success, reconnects and keeps going.
  • Simulate a host hanging completely. This has happened to us on bad hardware. To simulate it, while performing reads, enable two iptables firewall rules that silently drop all incoming/outgoing packets, so zero bytes are sent on the wire (the driver hangs on the blocking socket read; see my comments at mongoid/moped#331, "Hangs forever in refresh after 5 minutes when 2ary unreachable"). This mimics a really bad network/host failure, since heartbeats and connections from the rest of the replica set and the mongoS hang until they hit a timeout, which makes the election take much longer. We have patched the driver as I describe in that Moped issue, setting a long-running driver timeout via read_nonblock and IO.select (a sketch of that approach appears below). In this failure scenario our patch works great in Moped 2, but with this new driver we get the following stack trace:
from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/socket.rb:123:in `block in read'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/socket.rb:191:in `handle_errors'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/socket.rb:122:in `read'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/protocol/serializers.rb:62:in `deserialize'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/protocol/message.rb:193:in `deserialize_header'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/protocol/message.rb:83:in `deserialize'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connectable.rb:106:in `block in read'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connectable.rb:91:in `ensure_connected'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connectable.rb:106:in `read'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connection.rb:180:in `deliver'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connection.rb:125:in `block in dispatch'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/monitoring/publishable.rb:47:in `publish_command'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/server/connection.rb:124:in `dispatch'
  from /var/www/bundle/ruby/2.2.0/bundler/gems/mongo-ruby-driver-a2a7e0f949a3/lib/mongo/operation/executable.rb:36:in `block in execute'

I believe this pull will fix that as well via commit 86053b8.
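For reference, a minimal sketch of the kind of read-timeout patch described in the hang scenario above, assuming a raw Ruby socket; the method name, default timeout, and error class are illustrative, not driver API:

```ruby
require 'timeout'

# Sketch only: read `length` bytes, but bound each blocking wait with IO.select
# so a host that silently drops packets raises instead of hanging forever.
def read_with_deadline(socket, length, deadline = 30)
  data = String.new
  while data.bytesize < length
    begin
      data << socket.read_nonblock(length - data.bytesize)
    rescue IO::WaitReadable
      # Wait up to `deadline` seconds for the socket to become readable.
      unless IO.select([socket], nil, nil, deadline)
        raise Timeout::Error, "socket not readable within #{deadline} seconds"
      end
      retry
    end
  end
  data
end
```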

However, this driver still very badly needs additional retry logic in order to handle even a graceful step down. The read_with_retry code only retries one time, but it takes a few seconds for the election to happen. I can copy similar code from Moped that has something like cluster.max_retries and cluster.retry_interval and retry more than once. But do you have plans to add this yourself?
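A rough sketch of what that Moped-style loop could look like; `max_retries`, `retry_interval`, and the rescued error classes are assumptions here, not existing settings in this driver:

```ruby
# Sketch only: retry a read several times across an election, rescanning the
# cluster between attempts. max_retries and retry_interval mirror Moped's
# cluster options and do not exist in this driver.
def read_with_retries(cluster, max_retries: 5, retry_interval: 1)
  attempts = 0
  begin
    yield
  rescue Mongo::Error::SocketError, Mongo::Error::OperationFailure => e
    attempts += 1
    raise e if attempts > max_retries
    sleep(retry_interval) # give the replica set time to elect a new primary
    cluster.scan!         # refresh the topology view before trying again
    retry
  end
end
```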

@@ -103,7 +107,9 @@ def ensure_same_process!
       end
 
       def read
-        ensure_connected{ |socket| Protocol::Reply.deserialize(socket) }
+        read_with_retry do
A member commented on the diff:
This will not work here as the connection does not know anything about the cluster and the retry code calls cluster.scan!.

@jonhyman (Contributor, Author) replied:
Hm. Thoughts on how to handle a retry if the read that ultimately gets called from here gets a socket error, then?

The stack trace is the same as the one in my comment above (socket.rb:123 `block in read` up through operation/executable.rb:36 `block in execute`).

@durran (Member) commented Sep 4, 2015

What is your server_selection_timeout option set at? It defaults to 30 seconds and should give a 30-second window to retry here. Once the cluster is scanned on the retry and the operation is executed again, it goes back through the server selection code, which should wait up to 30 seconds before issuing the next query. Unless the ping command is actually telling us it's ready when it's not... I'll double-check this as well now.
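For context, the option is set on the client; the host and database name below are placeholders:

```ruby
require 'mongo'

# server_selection_timeout bounds how long server selection waits for a usable
# server before raising (defaults to 30 seconds).
client = Mongo::Client.new(
  ['load-balancer.example.com:27017'],
  database: 'staging',
  server_selection_timeout: 30
)
```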

@jonhyman (Contributor, Author) commented Sep 4, 2015

We don't alter the server_selection_timeout at all. During the stepdown and the kill -9 of the mongod process, the code exceptioned out instantly, so it is likely the ping command reports ready when it is not.

@jonhyman (Contributor, Author) commented:

@durran I saw that you all released 2.1.0 last week. Did you test this and fix the problem? IMO if the official driver can't handle a graceful step-down, I don't think it's ready for prime-time.

@durran (Member) commented Sep 14, 2015

This is not yet resolved... We did tests with planned elections and everything worked as expected for the basic scenarios, which is already better than the 1.x series... We're still looking at better retry behaviour along the lines of what you've proposed here, and that will probably go into the 2.1.1 release. We had other dependencies pressuring the 2.1.0 GA that did not need more robust retry... But we're still looking at it.

@jonhyman (Contributor, Author) commented:

Do you have any idea when you are aiming for that release? That's the last thing we are bottlenecked on in order to upgrade to the official driver / Mongoid 5, which then lets us immediately upgrade our staging environment to MongoDB 3.0.x and start testing that. Many features in our upcoming roadmap are dependent on 3.0.x and we've been waiting a very long time to try it out.

@durran (Member) commented Sep 15, 2015

I'm looking into this today and will then put it into master for testing... If that looks good, we'll figure out the release of 2.1.1 in our weekly meeting tonight.

@durran (Member) commented Sep 15, 2015

Just to confirm, you're testing on a sharded cluster with multiple mongoses behind a single load balancer, correct? This would be why the general ping strategy is not enough for your case: the ping goes to the mongoses, which are always running and ready throughout the reconfiguration.

@jonhyman (Contributor, Author) commented:

Yes, that's correct. We host with ObjectRocket, which uses one connection string to the load balancer. We once tried to move away from this scenario but ran into a lot of issues with connection limits since we're still on Mongo 2.4 and the connection handling isn't nearly as good as it is in >= 2.6.

@jonhyman (Contributor, Author) commented:

@durran if you can get this onto master this week I'll redo the failover testing and report back how it worked.

@durran (Member) commented Sep 21, 2015

@jonhyman Yeah I'll have something in this week.

@jonhyman (Contributor, Author) commented:

@durran friendly reminder. I have scheduled time with our DBAs to do failover testing on Tuesday night, so please try to have something before then. Thanks.

@durran (Member) commented Sep 25, 2015

Sounds great - will have it in before Monday.

@durran (Member) commented Sep 27, 2015

See: #692

You can test against the branch retry-mongos.

durran closed this Sep 27, 2015