-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(store): Explicitly time the client out if the connection hung #4039
Conversation
Example for the error message, as rendered to the terminal:
|
I would prefer if we would not show a stacktrace to the user. It is a "planned" exceptional behaviour, the error message is enough. |
@vodorok I'm not sure about that. Without the stack trace, we do not know whether the timeout was triggered when reading from the socket (a.k.a., where we expect this) or during the writing of the socket (where such could, at least theoretically, also happen, but where we do NOT expect the program to hang for so long). The cardinal problem is the exception trigger is an asynchronous operation, so depending on where the normal execution was when the timeout triggered, it could have had killed it anywhere that is within the execution path for The example output message I triggered by putting an explicit and large enough |
I checked the implementation in the meantime, and you only guard the rpc call with the watchdog. What I trying to say, that it will always be a server (or network) related issue, when the timeout triggers, and the user usually cannot do anything about it, apart from restarting the process, and hoping for the best. But I might be missing some context here... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stack trace is a but ugly IMHO, but I understand why we need it.
LGTM.
The only difference which might be meaningful is whether the send() fails versus the recv() failing. Nevertheless, I will put the stack trace behind the debug verbosity mode and update the patch. |
@whisperity Thanks that is a good compromise. |
This is a partial solution that attempts to gracefully kill an ongoing `CodeChecker store` client if the reply to the `massStoreRun()` API call does not arrive in a timely manner. After reaching the timeout, the client throws an exception onto itself and presents an understandable error message about the situation. This implementation is temporary, thus the timeout is a hard-coded limit. The client still exits with a failure in these cases, because there are no guarantees (and no deterministic ways to check) whether the timeout happened because the TCP connection failed and stuck (which is observed in a lot of cases, unfortunately) or whether the server indeed took more time than allotted to process the stored reports. A successful exit signal would result in CI scripts thinking the store was successfull and then potentially breaking later, unreliably, on the fact that the stored run does not in fact exist, or is in an outdated state, etc.
3e06352
to
15af7d8
Compare
@vodorok @dkrupp Meanwhile I came up with a better idea, where we still print just the two most relevant parts of the stack, but we do that outside of debug mode as well. What do you think?
This ensures if users notify us about the problem we can clearly identify (I'd wager not many run the very verbose debug mode...) that it is indeed what we expected to happen (unfortunately), and give us a way out if we see something like |
I think it is much better like this. It is ok to land. |
@vodorok Could you please APPROVE this so it doesn't show as if you had not reviewed? I think I have good news. After having run an analysis in which the (knowingly bogus!) checker I executed produced 178753 reports, I managed to test this in the real. The store began today 10:58 Zulu, with the client emitting the following before going down to hang:
The client is now hung on Compare a living SSH connection to what is seen on the store-side: (Note the In the server logs, there is no indication that the store of
Nevertheless, and as expected, one hour after the store started executing, the timeout was triggered, explained, and the
Given the fact that this specific CI job is configured to not hard-error on store errors, it seems that we do the best of both worlds: |
This is a partial solution that attempts to gracefully kill an ongoing
CodeChecker store
client if the reply to themassStoreRun()
API call does not arrive in a timely manner. After reaching the timeout, the client throws an exception onto itself and presents an understandable error message about the situation. This implementation is temporary; thus, the timeout is a hard-coded limit.The client still exits with a failure in these cases because there are no guarantees (and no deterministic ways to check) whether the timeout happened because the TCP connection failed and stuck (which is observed in a lot of cases, unfortunately) or whether the server indeed took more time than allotted to process the stored reports. A successful exit signal would result in CI scripts thinking the store was successful and then potentially breaking later, unreliably, on the fact that the stored run does not, in fact, exist or is in an outdated state, etc.