Handle resources error gracefully. (#155) #167
base: master
Conversation
…, "script/smart-dispatch" and "script/sd-launch-pbs", have been moved to "smartdispatch/smartdispatch_script.py" and "smartdispatch/sd_launch_pbs_script.py". The last commit of "script/smart-dispatch" was : 072dce6
with self.assertRaises(SystemExit) as context:
    smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
We also want to check that it can pass.
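For instance, the passing case could be checked right next to the failing one. A minimal sketch, where failing_argv and passing_argv stand in for whatever argument lists the test builds:

    # Sketch only: invalid arguments are expected to exit with code 2...
    with self.assertRaises(SystemExit) as context:
        smartdispatch_script.main(argv=failing_argv)
    self.assertEqual(context.exception.code, 2)

    # ...while valid arguments should run through without raising SystemExit.
    smartdispatch_script.main(argv=passing_argv)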
with self.assertRaises(SystemExit) as context:
    smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
idem
with self.assertRaises(SystemExit) as context:
    smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
idem
I'm getting confused here. You expect the function to raise SystemExit, but you mocked the call by adding a side_effect of CalledProcessError with exit code 1. Why would it raise SystemExit instead?
By the way, what's the difference with test_smart_dispatch.TestSmartdispatcher.test_launch_job_check?
So it raises a CalledProcessError, which is caught here: https://github.com/Dutil/smartdispatch/blob/master/smartdispatch/smartdispatch_script.py#L193
and the script then quits. I don't reraise the exception in the script, since qsub/msub already print the error message and the stack trace is not really helpful.
And yeah, I just realized that I had the same test twice and forgot to remove one. In fact, I think I should remove smartdispatch/tests/test_smartdispatch_script.py entirely, since the tests I've added in test_smart_dispatch test the same thing, just in different ways.
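For context, the behaviour described here looks roughly like this; a sketch based on the explanation above rather than the exact code at the linked line (the launch_jobs name and the exit code are assumptions):

    import subprocess
    import sys

    def launch_jobs(launcher, pbs_filenames):
        # If qsub/msub fails, catch the error and simply exit: the launcher
        # already printed a useful message, so reraising CalledProcessError
        # would only add an unhelpful stack trace.
        try:
            for pbs_filename in pbs_filenames:
                subprocess.check_output([launcher, pbs_filename], stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError:
            sys.exit(2)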
So it raises a CalledProcessError, which is caught here: https://github.com/Dutil/smartdispatch/blob/master/smartdispatch/smartdispatch_script.py#L193
and the script then quits. I don't reraise the exception in the script, since qsub/msub already print the error message and the stack trace is not really helpful.

Oh right! Thanks for the clarification.

And yeah, I just realized that I had the same test twice and forgot to remove one. In fact, I think I should remove smartdispatch/tests/test_smartdispatch_script.py entirely, since the tests I've added in test_smart_dispatch test the same thing, just in different ways.

I agree, those tests are somewhat duplicated. Anyhow, I would add tests in test_smart_dispatch which call smartdispatch.main directly. That will make it easier to test the error messages.
Ok, should I just move test_gpu_check and test_cpu_check to test_smart_dispatch.TestSmartdispatcher?
I would say so, yes.
    self.assertTrue(context.exception.code, 2)

except subprocess.CalledProcessError:
    self.fail("smartdispatch_script.main() raised CalledProcessError unexpectedly!")
Do you know if self.fail will keep the stack trace from CalledProcessError? I know it is mocked so there won't be interesting information in the message itself, but the stacktrace would point to the unexpected call to check_output.
No, it doesn't. I'll add the stack trace to the assert message.
I think it would be better to reraise the error with the message changed, rather than adding the stack trace as a string to the message (if I understand correctly what you intend).
I would use six.reraise. You can look here for a usage example.
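A minimal sketch of that idea (the helper name and message format are made up here):

    import sys

    import six

    def reraise_as(new_exception_type, new_message):
        # Hypothetical helper: convert the active exception into another type
        # while keeping the original traceback attached.
        _, exc_value, exc_traceback = sys.exc_info()
        new_exception = new_exception_type("{} (original error: {!r})".format(new_message, exc_value))
        six.reraise(new_exception_type, new_exception, exc_traceback)

A test could then catch subprocess.CalledProcessError and call reraise_as(AssertionError, "..."), so the failure still points back at the unexpected check_output call.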
tests/test_smart_dispatch.py
Outdated
# Actual test
exit_status_100 = call(self.launch_command_with_gpus.format(gpus=100), shell=True)

# Test validation
Why not test 0 GPUs too? Because it passes? Then we could assert_equal(exit_status_0, 0).
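Something along these lines, reusing the command template from the diff above (assert_equal is assumed to come from nose.tools):

    # Sketch: 0 gpus is a valid request, so the command should exit cleanly.
    exit_status_0 = call(self.launch_command_with_gpus.format(gpus=0), shell=True)
    assert_equal(exit_status_0, 0)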
tests/test_smart_dispatch.py
Outdated
with self.assertRaises(SystemExit) as context:
    smartdispatch_script.main(argv=argv)

self.assertTrue(context.exception.code, 2)
idem for passing tests
tests/test_smart_dispatch.py
Outdated
    self.assertTrue(context.exception.code, 2)

except subprocess.CalledProcessError:
    self.fail("smartdispatch_script.main() raised CalledProcessError unexpectedly!")
idem for stacktrace
def parse_arguments(argv=None):

    if argv is None:
        argv = sys.argv[1:]
That's useless, parser.parse_args(None) will do this internally.
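For reference, argparse already falls back to sys.argv[1:] when it is given None; a small sketch (the option name is made up):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-q", "--queueName")

    # parse_args(None) reads sys.argv[1:] internally, so an explicit
    # "if argv is None" check before calling it is redundant.
    args = parser.parse_args(None)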
Note to myself: this PR should also fix #96.
…he script in the same file.
@Dutil Your commit's title says "Removing duplicate test" but I don't see anything removed. Did I miss something?
try:
    smartdispatch_script.main(argv=argv)
except SystemExit as e:
    self.fail("The command failed the check, but it was supposed to pass.")
What happens if you just let it crash on smartdispatch_script.main()? Is it bad because of SystemExit? Do you lose the stacktrace or a valuable error message by doing self.fail?
…he script in the same file.
@bouthilx Yeah sorry, I've removed the other file now.
tests/test_smart_dispatch.py
Outdated
try:
    smartdispatch_script.main(argv=argv)
except SystemExit as e:
    self.fail("The command failed the check, but it was supposed to pass.")
What happens if you just let it crash on smartdispatch_script.main()? Is it bad because of SystemExit? Do you lose the stacktrace or a valuable error message by doing self.fail?
Yes, we lose the stack trace information when using self.fail. Otherwise the test crashes and we get an error (the rest of the tests run normally).
I used self.fail instead of letting the script crash because I felt it was more in line with what the test was supposed to do (i.e. the test failed because a SystemExit was thrown; it's not really an unexpected error). But you're right that we lose information that way. Should I simply let it crash instead?
I think it is important when a unit test fails that the error message is informative enough that we can easily pinpoint the problem and fix it. In this case, I believe both SystemExit and self.fail are just as bad, because SystemExit doesn't write down the stack trace nor an informative message about the source of the problem, right? If this is true, then I would do like the other tests where the error is reraised as another exception type. In this case, the reraise procedures should be refactored into a single function to avoid code duplication.
smartdispatch/utils.py
Outdated
def get_advice(cluster_name):

    helios_advice = """On Helios, don't forget that the queue gpu_1, gpu_2, gpu_4 and gpu_8 give access to a specific amount of gpus.
pep8 please 😢 😜
smartdispatch/utils.py
Outdated
helios_advice = """On Helios, don't forget that the queue gpu_1, gpu_2, gpu_4 and gpu_8 give access to a specific amount of gpus. | ||
For more advices, please refer to the official documentation: 'https://wiki.calculquebec.ca/w/Helios/en'""" | ||
mammouth_advice = "On Mammouth, please refer to the official documentation for more information: 'https://wiki.ccs.usherbrooke.ca/Accueil/en'" | ||
hades_advice = """On Hades, don't forget that the queue name '@hades' needs to be use. |
needs to be *used
cluster_advice = utils.get_advice(CLUSTER_NAME)

sys.stderr.write("smart-dispatch: error: The launcher wasn't able the launch the job(s) properly. The following error message was returned: \n\n{}\n\nMaybe the pbs file(s) generated were invalid. {}\n\n".format(e.output, cluster_advice))
I won't ask you to pep8ify all the code, but please do so for the lines you edited, like this one.
# Check that requested gpu number does not exceed node total
if args.gpusPerCommand > queue.nb_gpus_per_node:
    sys.stderr.write("smart-dispatch: error: gpusPerCommand exceeds nodes total: asked {req_gpus} gpus, nodes have {node_gpus}\n"
Maybe a little advice would be helpful here too. Some users might be confused to get a message saying there is only 1 GPU per node available if they know the nodes have more than that. They might not know that specific queues can limit the number of GPUs available (on Helios, for instance).
So to show the same advice as when qsub/msub fail?
Yeah, that's not so nice. Maybe just add something like: Make sure you specified the correct queue.
smartdispatch/utils.py
Outdated
@@ -136,3 +136,26 @@ def get_launcher(cluster_name):
        return "msub"
    else:
        return "qsub"

def get_advice(cluster_name):
I think it would be preferable to have the advice strings outside of get_advice, following the constant naming convention (HELIOS_ADVICE, MAMMOUTH_ADVICE, etc.).
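A sketch of that layout, reusing the messages from the diff above (the dictionary keys are assumed cluster names, and the long strings are abridged here):

    # Module-level constants in smartdispatch/utils.py (sketch only).
    HELIOS_ADVICE = ("On Helios, don't forget that the queues gpu_1, gpu_2, gpu_4 and gpu_8 "
                     "give access to a specific amount of gpus. For more advice, please refer to "
                     "the official documentation: 'https://wiki.calculquebec.ca/w/Helios/en'")
    MAMMOUTH_ADVICE = ("On Mammouth, please refer to the official documentation for more "
                       "information: 'https://wiki.ccs.usherbrooke.ca/Accueil/en'")
    HADES_ADVICE = "On Hades, don't forget that the queue name '@hades' needs to be used."

    ADVICES = {"helios": HELIOS_ADVICE, "mammouth": MAMMOUTH_ADVICE, "hades": HADES_ADVICE}


    def get_advice(cluster_name):
        # Unknown clusters simply get no extra advice.
        return ADVICES.get(cluster_name, "")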
tests/test_smart_dispatch.py
Outdated
import sys
import traceback

def rethrow_exception(exception, new_message):
What about moving this helper function to utils.py? utils/testing.py would be even better actually but I would leave utils refactoring for later.
@@ -23,17 +47,23 @@ def setUp(self):
        self.nb_commands = len(self.commands)

        scripts_path = abspath(pjoin(os.path.dirname(__file__), os.pardir, "scripts"))
-       self.smart_dispatch_command = '{} -C 1 -q test -t 5:00 -x'.format(pjoin(scripts_path, 'smart-dispatch'))
+       self.smart_dispatch_command = '{} -C 1 -G 1 -q test -t 5:00 -x'.format(pjoin(scripts_path, 'smart-dispatch'))
Do we still have tests without GPUs? We need to test both settings: with and without GPUs.
So I just remembered why I added that. The default value for 'gpusPerCommand' is 1, but for 'gpusPerNode' the default value is 0 when the queue is undefined (like the queue 'test' that is used for the tests), and that triggers the gpu check.
So in order to have those tests pass, we need to either use the -G 1 option, change the check to let it pass if gpusPerNode == 0, or change the gpusPerCommand default to 0.
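For example, the second option would turn the check shown earlier into something like the following sketch (the trailing format/exit lines are assumptions, since the original snippet is truncated):

    # Only enforce the limit when the queue actually reports a GPU count;
    # an undefined queue (like 'test') has nb_gpus_per_node == 0.
    if queue.nb_gpus_per_node > 0 and args.gpusPerCommand > queue.nb_gpus_per_node:
        sys.stderr.write("smart-dispatch: error: gpusPerCommand exceeds nodes total: "
                         "asked {req_gpus} gpus, nodes have {node_gpus}\n"
                         .format(req_gpus=args.gpusPerCommand,
                                 node_gpus=queue.nb_gpus_per_node))
        sys.exit(2)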
Then we should have tests for -g 0 -G 0 and -g 0. Do we?
Aren't the -g 0 -G 0 and -g 0 tests still missing?
…ailable on the queue. pep8fy some long strings.
tests/test_smart_dispatch.py
Outdated
@@ -150,7 +128,22 @@ def test_gpu_check(self):
            argv[2] = '1'
            smartdispatch_script.main(argv=argv)

    @rethrow_exception(SystemExit, "smartdispatch_script.main() raised SystemExit unexpectedly.")
    # Test if we don't have gpus. (and spicified in script).
spicified 😝
smartdispatch/utils.py
Outdated
    def func_wraper(func):

        def test_func(*args, **kwargs):
Wouldn't it be better to use @functools.wraps(func) here for the stack trace? I mean:

    @functools.wraps(func)
    def test_func(*args, **kwargs):

I just had a problem with stack traces because of decorators while implementing tests for Slurm clusters. I wonder if the same thing is happening here.
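For illustration, with the decorator from the diff it would sit like this; the body here is a guess at what rethrow_exception does, based on its name and how the tests use it:

    import functools
    import sys

    import six


    def rethrow_exception(exception, new_message):
        def func_wraper(func):
            # functools.wraps copies __name__, __doc__, etc. from the wrapped
            # test onto the wrapper, which keeps test reports readable.
            @functools.wraps(func)
            def test_func(*args, **kwargs):
                try:
                    return func(*args, **kwargs)
                except exception:
                    _, _, tb = sys.exc_info()
                    six.reraise(AssertionError, AssertionError(new_message), tb)
            return test_func
        return func_wraper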
Just tried, it didn't change the stack trace.
OK, well you can leave it there anyway.
tests/test_smart_dispatch.py
Outdated
        argv[2] = '1'
        smartdispatch_script.main(argv=argv)

        # Test if we don't have gpus. (and spicified in script).
Looks like you should do grep "spicified"
😝
In my defence, it's the same line/mistake as the other one 😜
Coherence is important indeed
Wait, it's the same file? hahaha! 😊
tests/test_smart_dispatch.py
Outdated
        argv[4] = '0'
        smartdispatch_script.main(argv=argv)

        # Don't have gpus, but the user specofy 1 anyway.
A variant! Specofy! 🙃
@mgermain Ready for merge. I believe the decrease in coverage is due to the move of scripts/* to smartdispatch/*_script.py. The unit tests for the proposed changes seem to have proper coverage, in my opinion.
Addresses #155.