Hangs forever in refresh after 5 minutes when 2ary unreachable #331
Conversation
In MRI 2.1.3 - and probably most other versions - setting the SO_RCVTIMEO option on a socket will cause read(2) to return EWOULDBLOCK as expected, but the interpreter internals will retry the operation instead of passing the failure up to interpreted code. Thus, the timeout has no visible effect. What we should do instead is call Kernel.select with the required timeout to check whether data is available, *then* call read(2). If the select times out, we raise an Errors::ConnectionFailure on the assumption that something higher up in the stack will catch it and cope.
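A minimal sketch of that select-then-read idea (a hypothetical wrapper, not the patch itself; it assumes Moped is loaded so Moped::Errors::ConnectionFailure is available, and the 5-second timeout is illustrative):

```ruby
require 'moped'

# Wait for readability with Kernel.select before doing the blocking read.
# select returning nil means the timeout expired with no data available.
def read_with_timeout(socket, length, timeout = 5)
  unless Kernel.select([socket], nil, [socket], timeout)
    raise Moped::Errors::ConnectionFailure,
          "No data received from socket within #{timeout} seconds"
  end
  socket.read(length)
end
```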
We experience this issue as well. Would love to see it merged.
@telent, have you been running this in production? I tested this out on production and had some really strange errors (moped cursor returning empty documents) that were sporadic and hard to reproduce. I reverted the change for now but will do some more investigation later.
Though to be clear, I had done it by copying the code into a fork of Moped 1.5. I didn't try with Moped 2 yet.
I tried testing this again by issuing a scatter-gather query on a massive collection with hundreds of millions of documents. This wreaked havoc on the database, as it appears the query kept getting retried but the cursor wasn't closed. As such, before I CTRL+C'd it, we had 19 identical queries running, which caused massive queued reads and blocking. I'm not sure this is safe. I had retrofitted it to 1.5, though, so it may work in 2.0, but I am bearish on this right now given what happened to us.
@durran have you thought about this problem at all in the new ruby driver? As @telent points out,
@jonhyman I have a branch up on the new driver that actually does an
I may be wrong, but I don't think that that's going to help. Going back to Moped and this issue for a second, the Moped code already has an `alive?` check:

```ruby
# Moped's existing liveness check: a zero-timeout select, then eof?.
def alive?
  if Kernel::select([ self ], nil, [ self ], 0)
    !eof? rescue false
  else
    true
  end
rescue IOError
  false
end
```

so if the read path is just

```ruby
check_for_alive!
handle_socket_errors { super }
```

you don't know if the socket is ready for reading when you go to do a blocking read. From all my testing locally, I can't figure out how to solve this hanging case in Ruby that does not involve using `read_nonblock`:

```ruby
def read_from_socket(length)
  begin
    # Try a non-blocking read first; if nothing is available yet, wait on
    # select with a timeout and retry instead of blocking indefinitely.
    @socket.read_nonblock(length) || String.new
  rescue IO::WaitReadable, OpenSSL::SSL::SSLErrorWaitReadable
    ready = IO.select([@socket], nil, [@socket], 300)
    if ready
      retry
    else
      raise StandardError.new("Maximum query time exceeded, no data from socket returned in 300 seconds.")
    end
  end
end
```

Note here I'm not raising a
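One wrinkle with the `read_nonblock` approach above is that a non-blocking read can return fewer bytes than requested, so a caller that needs exactly `length` bytes has to accumulate in a loop. A rough sketch of that, with a hypothetical helper name (`read_exactly` is not Moped or driver API) and an illustrative 300-second timeout:

```ruby
require 'socket'
require 'openssl'

# Accumulate exactly `length` bytes, waiting up to `timeout` seconds whenever
# the socket has nothing ready. read_nonblock raises EOFError if the peer
# closes the connection, which the caller can map to a connection failure.
def read_exactly(socket, length, timeout = 300)
  data = String.new
  while data.bytesize < length
    begin
      data << socket.read_nonblock(length - data.bytesize)
    rescue IO::WaitReadable, OpenSSL::SSL::SSLErrorWaitReadable
      unless IO.select([socket], nil, [socket], timeout)
        raise "no data from socket returned in #{timeout} seconds"
      end
      retry
    end
  end
  data
end
```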
tl;dr: setsockopt(SO_RCVTIMEO) doesn't work for us in MRI Ruby 2.0/2.1; infinite hangs result.
We're having trouble in our Mongo cluster when a secondary becomes unreachable (due to e.g. a crashed instance or a network partition): the symptom is that after five minutes all attempts to contact the primary seem to hang indefinitely (or until Unicorn kills our worker process). Note that our mongoid.yml has `options: {read: primary}`.
Attempting to replicate the problem in a test environment gets us a call stack something like this (some frames omitted for brevity), and after some digging the problem seems to be that implementing timeouts using socket options doesn't work in Ruby MRI. Looking at it with strace I see a call to `read` that returns after five seconds with EAGAIN, followed by a call to `ppoll` - the EAGAIN doesn't ever get as far as interpreted code. See also http://stackoverflow.com/questions/9853516/set-socket-timeout-in-ruby-via-so-rcvtimeo-socket-option
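For illustration, here is a rough standalone reproduction of that behaviour (this is not the spec from the patch; it assumes Linux and MRI, and that struct timeval packs as two native longs):

```ruby
require 'socket'

# A server that accepts a connection but never sends any data.
server = TCPServer.new("127.0.0.1", 0)
Thread.new { server.accept }

# A client with a 5-second receive timeout set via SO_RCVTIMEO.
client = TCPSocket.new("127.0.0.1", server.addr[1])
timeval = [5, 0].pack("l_2") # tv_sec = 5, tv_usec = 0
client.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, timeval)

# Under strace, read(2) returns EAGAIN after 5 seconds, but MRI falls back
# to ppoll and keeps waiting: this call never raises and never returns.
client.read(1)
```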
The attached patch has a spec that demonstrates the problem (try running it under strace) and a fix which implements read timeouts "by hand" using select() instead of relying on socket options. A fuller solution would probably address the write() case and be tested on SSL connections as well as plaintext ones, but I thought I'd push this early diagnosis up to you for comment first to see if I'm on the right track.
/cc @dawid-sklodowski @cpoo22 just fyi