Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle resources error gracefully. (#155) #167

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

Dutil
Copy link
Contributor

@Dutil Dutil commented Sep 13, 2017

Adresses #155 .

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.5%) to 93.076% when pulling e6687c3 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.5%) to 93.076% when pulling 1b7ffbb on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@Dutil Dutil changed the title WIP: Issue #155 Handle resources error gracefully. (#155) Sep 13, 2017
with self.assertRaises(SystemExit) as context:
smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also want to check that it can pass.

with self.assertRaises(SystemExit) as context:
smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

with self.assertRaises(SystemExit) as context:
smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting confused here. You expect the function to raise SystemExit, but you mocked adding side_effect CalledProcessError with exception code 1. Why would it rather raise SystemExit?

By the way, what's the difference with test_smart_dispatch.TestSmartdispatcher.test_launch_job_check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it raise a CalledProcessError, which is catch here: https://github.com/Dutil/smartdispatch/blob/master/smartdispatch/smartdispatch_script.py#L193
and then quit. I don't reraise the exception in the script, since qsub/msub already print the error message and the stack trace is not really helpful.

And yeah, just realized that I had the same test 2 times, and forgot to remove one. In fact I think I should remove the smartdispatch/tests/test_smartdispatch_script.py entirely, since the tests that I've add in test_smart_dispatch test the same thing, just in different ways.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it raise a CalledProcessError, which is catch here: https://github.com/Dutil/smartdispatch/blob/master/smartdispatch/smartdispatch_script.py#L193
and then quit. I don't reraise the exception in the script, since qsub/msub already print the error message and the stack trace is not really helpful.

Oh right! Thanks for the clarification

And yeah, just realized that I had the same test 2 times, and forgot to remove one. In fact I think I should remove the smartdispatch/tests/test_smartdispatch_script.py entirely, since the tests that I've add in test_smart_dispatch test the same thing, just in different ways.

I agree, those tests are somehow duplicates. Anyhow, I would add tests in test_smart_dispatch which call smartdispatch.main directly. That will make it easier to test the error messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, should I just move test_gpu_check and test_cpu_check to test_smart_dispatch.TestSmartdispatcher?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say so, yes.

self.assertTrue(context.exception.code, 2)

except subprocess.CalledProcessError:
self.fail("smartdispatch_script.main() raised CalledProcessError unexpectedly!")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if self.fail will keep the stack trace from CalledProcessError? I know it is mocked so there won't be interesting information in the message itself, but the stacktrace would point to the unexpected call to check_output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it doesn't. Gonna add the stack trace to the assert message.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to reraise the error with the message changed than adding the stacktrace as a string to the message (if I understand correctly what you intend).

I would use six.reraise. You can look here for a usage example.

# Actual test
exit_status_100 = call(self.launch_command_with_gpus.format(gpus=100), shell=True)

# Test validation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not testing 0 gpus too? Because it passes? Then we could assert_equal(test exit_status_0, 0).

with self.assertRaises(SystemExit) as context:
smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem for passing tests

self.assertTrue(context.exception.code, 2)

except subprocess.CalledProcessError:
self.fail("smartdispatch_script.main() raised CalledProcessError unexpectedly!")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem for stacktrace

def parse_arguments(argv=None):

if argv is None:
argv = sys.argv[1:]
Copy link
Collaborator

@bouthilx bouthilx Sep 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's useless, parser.parse_args(None) will do this internally.

@bouthilx
Copy link
Collaborator

Note for myself, this PR should also fix #96.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 92.797% when pulling fba5bf3 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.08%) to 92.45% when pulling 5d99e79 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.2%) to 92.293% when pulling 7e357d8 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@bouthilx
Copy link
Collaborator

bouthilx commented Oct 2, 2017

@Dutil You commit's title says "Removing duplicate test" but I don't see anything removed. Did I miss something?

try:
smartdispatch_script.main(argv=argv)
except SystemExit as e:
self.fail("The command failed the check, but it was supposed to pass.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you just let it crash on smartdispatch_script.main()? Is it bad because of SystemExit? Do you lose the stacktrace or a valuable error message by doing self.fail?

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.7%) to 92.815% when pulling a83e500 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@Dutil
Copy link
Contributor Author

Dutil commented Oct 3, 2017

@bouthilx Yeah sorry, I've remove the other file now.

try:
smartdispatch_script.main(argv=argv)
except SystemExit as e:
self.fail("The command failed the check, but it was supposed to pass.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you just let it crash on smartdispatch_script.main()? Is it bad because of SystemExit? Do you lose the stacktrace or a valuable error message by doing self.fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we lose the the information of the stacktrace when using the self.fail. Otherwise the test crash and we get an error (the rest of the tests runs normally).

I used the self.fail instead of letting the script crash, because I felt it was more inline with what the test was supposed to do (i.e. the test failed because a SystemExit was thrown, it's not really an unexpected error). But it's right that we lose information that way. Should I simply let it crash instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is important when a unit test fails that the error message is informative enough that we can easily pin-point the problem and fix it. In this case, I believe both SystemExit and self.fail are just as bad, because SystemExit doesn't write down the stack trace nor informative message about the source of the problem, right? If this is true, then I would do like the other tests where the error is reraised as another exception type. In this case, the reraise procedures should be refactored as a single function to avoid code duplication.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.8%) to 92.774% when pulling 7eb4b84 on Dutil:iss155 into 9133e15 on SMART-Lab:master.


def get_advice(cluster_name):

helios_advice = """On Helios, don't forget that the queue gpu_1, gpu_2, gpu_4 and gpu_8 give access to a specific amount of gpus.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8 please 😢 😜

helios_advice = """On Helios, don't forget that the queue gpu_1, gpu_2, gpu_4 and gpu_8 give access to a specific amount of gpus.
For more advices, please refer to the official documentation: 'https://wiki.calculquebec.ca/w/Helios/en'"""
mammouth_advice = "On Mammouth, please refer to the official documentation for more information: 'https://wiki.ccs.usherbrooke.ca/Accueil/en'"
hades_advice = """On Hades, don't forget that the queue name '@hades' needs to be use.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be *used


cluster_advice = utils.get_advice(CLUSTER_NAME)

sys.stderr.write("smart-dispatch: error: The launcher wasn't able the launch the job(s) properly. The following error message was returned: \n\n{}\n\nMaybe the pbs file(s) generated were invalid. {}\n\n".format(e.output, cluster_advice))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't ask you to pep8ify all code, but please do so for the lines you editied like this one.


# Check that requested gpu number does not exceed node total
if args.gpusPerCommand > queue.nb_gpus_per_node:
sys.stderr.write("smart-dispatch: error: gpusPerCommand exceeds nodes total: asked {req_gpus} gpus, nodes have {node_gpus}\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a little advice would be helpful here too. Some users might be confused to get a message saying there is only 1 GPU per node available if they know nodes have more than that. They might not know that specific queues can limit the number of GPUs available (on helios for instance).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to show the same advice as when qsub/msub fail?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's not so nice. Maybe just had something like: Make sure you specified the correct queue.

@@ -136,3 +136,26 @@ def get_launcher(cluster_name):
return "msub"
else:
return "qsub"

def get_advice(cluster_name):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be preferable to have the advices outside of get_advice, following the constant variable naming convention (HELIOS_ADVICE, MAMMOUTH_ADVICE, etc).

import sys
import traceback

def rethrow_exception(exception, new_message):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about moving this helper function to utils.py? utils/testing.py would be even better actually but I would leave utils refactoring for later.

@@ -23,17 +47,23 @@ def setUp(self):
self.nb_commands = len(self.commands)

scripts_path = abspath(pjoin(os.path.dirname(__file__), os.pardir, "scripts"))
self.smart_dispatch_command = '{} -C 1 -q test -t 5:00 -x'.format(pjoin(scripts_path, 'smart-dispatch'))
self.smart_dispatch_command = '{} -C 1 -G 1 -q test -t 5:00 -x'.format(pjoin(scripts_path, 'smart-dispatch'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still have tests without GPUs? We need to test both settings; With and without GPUs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I just remembered why I added that. the default value for 'gpusPerCommand' is 1, but for
'gpusPerNode' the default value is 0, when the queue is undefined (like the queue 'test', that is used for the tests). And that triggers the gpu check.

So in order to have those tests pass, we need to either have the -G 1 options, change the check to let it pass if gpusPerNode == 0, or change the gpusPerCommand default to 0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we should have tests for -g 0 -G 0 and -g 0. Do we?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't the -g 0 -G 0 and -g 0 tests still missing?

…ailable on the queue. pep8fy some long strings.
@coveralls
Copy link

Coverage Status

Coverage decreased (-1.6%) to 92.912% when pulling cec39e6 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@@ -150,7 +128,22 @@ def test_gpu_check(self):
argv[2] = '1'
smartdispatch_script.main(argv=argv)

@rethrow_exception(SystemExit, "smartdispatch_script.main() raised SystemExit unexpectedly.")
# Test if we don't have gpus. (and spicified in script).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spicified 😝


def func_wraper(func):

def test_func(*args, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be better to use @functools.wraps(func) here for the stacktrace? I mean

@functools.wraps(func)
def test_func(*args, **kwargs):

I just had a problem with stack trace because of decorators while implementing tests for Slurm clusters. I wonder if the same thing is happening here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried, it didn't change the stack trace.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, well you can leave it there anyway.

argv[2] = '1'
smartdispatch_script.main(argv=argv)

# Test if we don't have gpus. (and spicified in script).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you should do grep "spicified" 😝

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my defence, it's the same line/mistake as the other one 😜

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coherence is important indeed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, it's the same file? hahaha! 😊

argv[4] = '0'
smartdispatch_script.main(argv=argv)

# Don't have gpus, but the user specofy 1 anyway.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A variant! Specofy! 🙃

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.6%) to 92.916% when pulling 3b27597 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-1.6%) to 92.916% when pulling 71af7a3 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

@bouthilx
Copy link
Collaborator

@mgermain Ready for merge. I believe the decrease in coverage is due to the move of scripts/* to smartdispatch/*_script.py. The unit-tests for the proposed changes seems to have a proper coverage in my opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants