This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Hangs forever in refresh after 5 minutes when 2ary unreachable #331

Open
wants to merge 2 commits into mongoid:master from simplybusiness:timeout_in_read

Conversation

@telent (Contributor) commented Nov 3, 2014

tl;dr: setsockopt(SO_RCVTIMEO) doesn't work for us in MRI Ruby 2.[01]; infinite hangs result.

We're having trouble in our Mongo cluster when a secondary becomes unreachable (due to e.g. crashed instance or network partition): the symptom is that after five minutes all attempts to contact the primary seem to hang indefinitely (or until Unicorn kills our worker process). Note that our mongoid.yml has options: {read: primary}

Attempting to replicate the problem in a test environment gets us a call stack something like this (some frames omitted for brevity):

     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:47:in `read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:47:in `block in read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:119:in `handle_socket_errors'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:47:in `read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection.rb:221:in `with_connection'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection.rb:114:in `read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/protocol/query.rb:213:in `receive_replies'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/node.rb:180:in `block in ensure_connected'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/node.rb:432:in `refresh'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/cluster.rb:182:in `block in refresh'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/cluster.rb:151:in `nodes'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/cluster.rb:240:in `with_primary'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/read_preference/primary.rb:55:in `block in with_node'

and after some digging the problem seems to be that implementing timeouts using socket options doesn't work in Ruby MRI. Looking at it with strace, I see a call to read that returns after five seconds with EAGAIN, followed by a call to ppoll; the EAGAIN never makes it as far as interpreted code.

getsockname(7, {sa_family=AF_INET, sin_port=htons(44829), sin_addr=inet_addr("192.168.195.6")}, [16]) = 0
setsockopt(7, SOL_SOCKET, SO_SNDTIMEO, "\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
futex(0x7f24117f6654, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7f24117f66d0, FUTEX_WAKE_PRIVATE, 1) = 0
select(8, [7], NULL, [7], {0, 0})       = 0 (Timeout)
read(7, 0x7f2413130f00, 100)            = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8  <hang indefinitely>

See also http://stackoverflow.com/questions/9853516/set-socket-timeout-in-ruby-via-so-rcvtimeo-socket-option
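
As a rough illustration (a standalone sketch, not part of the patch; the socket setup here is made up for the example), setting SO_RCVTIMEO on a socket in MRI and then reading from a peer that never sends data simply hangs instead of timing out:

require 'socket'

# Hypothetical repro: a connected socket whose peer never sends anything.
server = TCPServer.new('127.0.0.1', 0)
client = TCPSocket.new('127.0.0.1', server.addr[1])

# struct timeval { tv_sec = 5, tv_usec = 0 } packed as two native longs
client.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, [5, 0].pack('l_2'))

# Expected: give up after ~5s. Observed on MRI: read(2) returns EAGAIN
# internally, the interpreter retries via ppoll, and this call blocks forever.
client.read(100)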

The attached patch has a spec that demonstrates the problem (try running it under strace) and a fix that implements read timeouts "by hand" using select() instead of relying on socket options. A fuller solution would probably address the write() case and be tested on SSL connections as well as plaintext ones, but I thought I'd push this early diagnosis up for comment first, to see if I'm on the right track.

/cc @dawid-sklodowski @cpoo22 just fyi

In MRI 2.1.3 - and probably most other versions - setting the
SO_RCVTIMEO option on a socket will cause read(2) to return EWOULDBLOCK
as expected, but the interpreter internals will retry the operation
instead of passing the failure up to interpreted code.  Thus, the
timeout has no visible effect.

What we should do instead is call Kernel.select with the required
timeout to check if there is data available, *then* call read(2).
If .select fails we raise an Errors::ConnectionFailure on the assumption
that something higher up in the stack will catch it and cope.
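
For illustration, here is a minimal sketch of the select-before-read approach described above (not the actual patch; read_with_timeout is a made-up name, and Errors::ConnectionFailure refers to Moped's connection error class):

# Hypothetical sketch only: wait with Kernel.select before reading, so the
# timeout is enforced in Ruby code rather than via SO_RCVTIMEO.
def read_with_timeout(socket, length, timeout)
  ready = Kernel.select([socket], nil, [socket], timeout)
  unless ready
    # Moped's connection error; something higher up the stack catches it.
    raise Errors::ConnectionFailure,
          "Took more than #{timeout} seconds to receive data"
  end
  socket.read(length)
end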
@coveralls

Coverage Status

Coverage decreased (-0.06%) when pulling c56498f on simplybusiness:timeout_in_read into 7e5ac3e on mongoid:master.

@telent (Contributor, Author) commented Dec 19, 2014

cc @ngsmrk @dcoxall

@jonhyman (Contributor) commented Apr 8, 2015

We experience this issue as well. Would love to see it merged.

jonhyman added a commit to jonhyman/moped that referenced this pull request Apr 20, 2015
jonhyman added a commit to jonhyman/moped that referenced this pull request Apr 24, 2015
@jonhyman (Contributor)

@telent, have you been running this in production? I tested it out in production and saw some really strange errors (Moped cursors returning empty documents) that were sporadic and hard to reproduce. I reverted the change for now but will do some more investigation later.

@jonhyman (Contributor)

Though to be clear, I had done it by copying the code into a fork of Moped 1.5. I didn't try with Moped 2 yet.

@jonhyman (Contributor)

I tried testing this again by issuing a scatter-gather query on a massive collection with hundreds of millions of documents. This wreaked havoc on the database: the query appears to have been continually retried without the cursor ever being closed. Before I Ctrl+C'd it, we had 19 identical queries running, which caused massive queued reads and blocking. I'm not sure this is safe. That said, I had retrofitted the change to 1.5, so it may work in 2.0, but I'm bearish on this right now given what happened to us.

@jonhyman (Contributor)

@durran have you thought about this problem at all in the new Ruby driver? As @telent points out, SO_RCVTIMEO simply does not work in MRI anymore. We experience this issue when there is a bad networking event or when a physical host running multiple mongods freezes or crashes; our processes completely hang and must be kill -9'd. We get a stack trace identical to the one @telent describes in this issue. Reviewing the code for the new driver, I think we'll have the same issue as long as IO.select is not used before the socket read. This pull as it stands is problematic for us: my guess is that if a query takes longer than @timeout to return, the resulting ConnectionFailure causes it to be re-issued (as I mentioned above, we wreaked havoc on a database with this pull and a 10-second timeout). Still, this is a real operational issue that causes really bad problems for us.

@durran (Member) commented Aug 23, 2015

@jonhyman I have a branch up on the new driver that actually does an IO.select before selecting a server to use in an operation... (https://github.com/mongodb/mongo-ruby-driver/pull/670/files#diff-ea2082a78a5d9b36b80ce79b7705943dR86) The server.connectable? call is eventually calling a select before returning it, and the operation would execute immediately after.

@jonhyman (Contributor)

I may be wrong, but I don't think that's going to help. Going back to Moped and this issue for a second: the Moped code already has a check_if_alive! immediately before the read, but one issue is that alive? is defined like this (in Moped; it's roughly identical in the new driver):

def alive?
  if Kernel::select([ self ], nil, [ self ], 0)
    !eof? rescue false
  else
    true
  end
rescue IOError
  false
end

so if select returns nil, alive? still returns true. The alive? check uses a 0-second timeout, so all it tells you is that the socket is not dead; if select returns nil, the socket has no data on it right now and is not ready for reading. Because the current Moped code does this:

check_if_alive!
handle_socket_errors { super }

you don't know if the socket is ready for reading when you go to do a blocking read, and the super call hangs indefinitely as @telent's stack trace shows (and we have that exact stack trace when it happens to us).
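To make those select semantics concrete, here is an illustrative snippet (not from Moped; the server/socket setup is made up) showing that a zero-timeout select returns nil on a perfectly healthy socket that simply has no data yet:

require 'socket'

# A connected socket whose peer has not written anything yet.
server = TCPServer.new('127.0.0.1', 0)
sock   = TCPSocket.new('127.0.0.1', server.addr[1])

# Zero-timeout select: returns nil because there is nothing to read,
# even though the socket is alive and well.
Kernel.select([sock], nil, [sock], 0)  # => nil

# alive? maps that nil to true, so the blocking read that follows can hang.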

In your pull, connectable? simply does an alive? check, so if select returns nil (signifying that the socket is not ready for reading), connectable? will still return true. Then you'll end up calling @socket.read in https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/socket.rb#L172 and I think this code will be vulnerable to hanging again (I have not yet tested; your pull is the last one I'm waiting on before we perform failover testing scenarios with the new driver 😁).

From all my testing locally, I can't find a way to solve this hanging case in Ruby that doesn't involve using select to set a "maximum read time", by making read_from_socket look something like this:

def read_from_socket(length)
  begin
    # Non-blocking read: raises IO::WaitReadable (or the SSL equivalent)
    # when no data is available yet.
    @socket.read_nonblock(length) || String.new
  rescue IO::WaitReadable, OpenSSL::SSL::SSLErrorWaitReadable
    # Wait up to 300 seconds for the socket to become readable, then retry.
    ready = IO.select([@socket], nil, [@socket], 300)
    if ready
      retry
    else
      raise StandardError.new("Maximum query time exceeded, no data from socket returned in 300 seconds.")
    end
  end
end

Note that here I'm not raising a ConnectionFailure to retry; I'm literally failing. If the query genuinely should take longer than the max timeout, a ConnectionFailure would cause it to be re-issued, and you can get into the loop we got into 4 months ago, when I tested this, ran a dozen and a half huge, slow scatter-gather queries on a very large cluster, and caused high locking/queued reads.
