Jobs that timeout will never be able to run again #2

lwc · 2014-08-27T05:23:45Z

When a job overruns its TTR, beanstalkd will increment the job's timeout stat and put it back on the work queue for another worker to reserve.

In an effort to prevent pathological jobs from dog-piling all available workers, cmdstalk will bury a task it reserves that has timeouts greater than 1. This means that once a task is buried because of a timeout, it will always re-bury instantly each time it is kicked: the job becomes un-runnable.
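The loop described above can be modelled in a few lines of Go. This is only a sketch: `JobStats`, `shouldBuryOnReserve` and `kick` are hypothetical names and not cmdstalk's actual code, and the threshold is a parameter because the exact comparison cmdstalk uses is not shown here.

```go
package main

// JobStats mirrors the three per-job counters discussed in this issue.
// It is an illustrative type, not part of cmdstalk or any beanstalkd client.
type JobStats struct {
	Timeouts int // incremented by beanstalkd when a reserved job overruns its TTR
	Buries   int // incremented each time the job is buried
	Kicks    int // incremented each time the job is kicked
}

// shouldBuryOnReserve models the rule described above: a freshly reserved
// job is buried when its timeout count exceeds maxTimeouts.
func shouldBuryOnReserve(s JobStats, maxTimeouts int) bool {
	return s.Timeouts > maxTimeouts
}

// kick models kicking a buried job back to the ready queue. Only the kick
// counter changes; the timeout counter is not reset, which is why the
// rule above fires again on the next reserve.
func kick(s JobStats) JobStats {
	s.Kicks++
	return s
}
```

With `s := JobStats{Timeouts: 2, Buries: 1}`, `shouldBuryOnReserve(kick(s), 1)` is still true, so the job is re-buried no matter how many times it is kicked.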

Using just the buried, kicked and timeout counters, there does not appear to be a way to tell that a job was kicked after being buried for a timeout, which is what cmdstalk would need in order to bury a job only the next time it is reserved after a timeout.

The beanstalkd protocol docs mention a one-second grace period at the end of a reserve time. Would it be possible to use this grace period to bury a timed-out job in the "same run" as the timeout occurred?


lwc · 2014-08-27T05:52:32Z

Upon further reading I'm less clear on how DEADLINE_SOON is meant to operate 😕

lox · 2014-09-11T05:13:23Z

DEADLINE_SOON is sent to a client that is in a blocking reserve if there are no other jobs to send it and a job it has reserved is nearing its TTR deadline.

The issue with racing beanstalkd to bury a task is that you miss out on the timed-out metadata: if you beat the server to it, the job is simply buried.

Perhaps the problem is just the fact that the job is buried on timeout? What should actually happen to timed-out jobs? If we just kick them at the moment, then perhaps we should change the behaviour to release with delay to prevent dog-piling. Perhaps timeouts could result in a more aggressive exponential backoff, or an earlier bury.

Either way, seems like we haven't got it 100% right. Thoughts @pda @rbone?

rbone · 2014-09-11T05:17:23Z

A longer backoff sounds like a reasonable change for the moment. It is tricky, however, as some tasks may merit more aggressive burying strategies while others may be safe to retry very frequently. I'd say a longer backoff makes sense as a default, but it might be nice in the future to make this behaviour configurable, possibly even on a per-tube basis.

lox · 2014-09-11T05:18:30Z

Should the backoff be proportional to the TTR?

rbone · 2014-09-14T23:56:21Z

Honestly I can't make up my mind on what the default behaviour should be, so it probably doesn't matter too much which way you go. A backoff proportional to the TTR should be fine. I think having it be configurable per tube will become pretty important, however.
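One way the two ideas in this comment could fit together is sketched below. Everything in it is an assumption made for illustration: `TubePolicy`, `policyFor`, `releaseDelay`, the default values and the formula `TTRFactor * TTR * timeouts` are hypothetical and do not describe anything cmdstalk implements.

```go
package main

import "time"

// TubePolicy is a hypothetical per-tube retry policy.
type TubePolicy struct {
	MaxTimeouts int     // bury once the job has timed out more than this many times
	TTRFactor   float64 // release delay = TTRFactor * TTR * timeouts
}

// defaultPolicy applies to tubes without an override; the values are arbitrary.
var defaultPolicy = TubePolicy{MaxTimeouts: 3, TTRFactor: 2}

// policyFor returns the override for a tube if one exists, else the default.
func policyFor(tube string, overrides map[string]TubePolicy) TubePolicy {
	if p, ok := overrides[tube]; ok {
		return p
	}
	return defaultPolicy
}

// releaseDelay decides what to do with a job that has timed out `timeouts`
// times: bury it, or release it with a delay proportional to its TTR.
// ttrSeconds is an integer because beanstalkd expresses TTR in whole seconds.
func (p TubePolicy) releaseDelay(ttrSeconds, timeouts int) (delay time.Duration, bury bool) {
	if timeouts > p.MaxTimeouts {
		return 0, true
	}
	secs := p.TTRFactor * float64(ttrSeconds) * float64(timeouts)
	return time.Duration(secs * float64(time.Second)), false
}
```

Under these assumed defaults, a job with a 10 second TTR on an unconfigured tube would be released with a 40 second delay after its second timeout, while a tube with a stricter override could bury it at that point instead.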

pda · 2014-09-15T17:48:24Z

I think a simple function of the try count c should work fine for now.

PR #4 proposes 3 tries with c*c * time.Hour; delays are 0 (first try), 1 hour, 4 hours; total of 5 hours.
4 tries at c*c * time.Hour could also work; that would add an extra retry after an additional 9 hours.
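The arithmetic above can be checked with a short Go sketch. The function names `backoff` and `totalDelay` are hypothetical, and `c` is assumed to be the zero-based try count, which is what the quoted delays of 0, 1 and 4 hours imply; PR #4 itself may be structured differently.

```go
package main

import "time"

// backoff returns the delay before try number c (zero-based),
// using the c*c * time.Hour rule quoted above.
func backoff(c int) time.Duration {
	return time.Duration(c*c) * time.Hour
}

// totalDelay sums the delays preceding each of the given number of tries.
func totalDelay(tries int) time.Duration {
	var total time.Duration
	for c := 0; c < tries; c++ {
		total += backoff(c)
	}
	return total
}
```

Here `totalDelay(3)` is 5 hours (0 + 1 + 4), and `totalDelay(4)` is 14 hours, since the fourth try adds a 9 hour delay.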

pda mentioned this issue Aug 28, 2014

WIP: ability to kick failed jobs back onto queue. #3

Closed
