Handle resources error gracefully. (#155) #167

Dutil · 2017-09-13T15:12:42Z

Addresses #155.


coveralls · 2017-09-13T17:19:38Z

Coverage decreased (-1.5%) to 93.076% when pulling e6687c3 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

…, "script/smart-dispatch" and "script/sd-launch-pbs", have been moved to "smartdispatch/smartdispatch_script.py" and "smartdispatch/sd_launch_pbs_script.py". The last commit of "script/smart-dispatch" was : 072dce6

coveralls · 2017-09-13T18:54:24Z

Coverage decreased (-1.5%) to 93.076% when pulling 1b7ffbb on Dutil:iss155 into 9133e15 on SMART-Lab:master.

bouthilx · 2017-09-18T20:08:48Z

smartdispatch/tests/test_smartdispatch_script.py

+        with self.assertRaises(SystemExit) as context:
+            smartdispatch_script.main(argv=argv)
+
+        self.assertTrue(context.exception.code, 2)


We also want to check that it can pass.

bouthilx · 2017-09-18T20:08:55Z

smartdispatch/tests/test_smartdispatch_script.py

+        with self.assertRaises(SystemExit) as context:
+            smartdispatch_script.main(argv=argv)
+
+        self.assertTrue(context.exception.code, 2)


bouthilx · 2017-09-18T20:09:10Z

smartdispatch/tests/test_smartdispatch_script.py

+            with self.assertRaises(SystemExit) as context:
+                smartdispatch_script.main(argv=argv)
+
+                self.assertTrue(context.exception.code, 2)


I'm getting confused here. You expect the function to raise SystemExit, but the mock's side_effect is a CalledProcessError with return code 1. Why would it raise SystemExit instead?

By the way, what's the difference with test_smart_dispatch.TestSmartdispatcher.test_launch_job_check?

So it raises a CalledProcessError, which is caught here: https://github.com/Dutil/smartdispatch/blob/master/smartdispatch/smartdispatch_script.py#L193
and then the script quits. I don't reraise the exception in the script, since qsub/msub already prints the error message and the stack trace is not really helpful.

And yeah, I just realized that I had the same test twice and forgot to remove one. In fact, I think I should remove smartdispatch/tests/test_smartdispatch_script.py entirely, since the tests that I've added in test_smart_dispatch test the same thing, just in different ways.

> So it raises a CalledProcessError, which is caught here: https://github.com/Dutil/smartdispatch/blob/master/smartdispatch/smartdispatch_script.py#L193
> and then the script quits. I don't reraise the exception in the script, since qsub/msub already prints the error message and the stack trace is not really helpful.

Oh right! Thanks for the clarification

> And yeah, I just realized that I had the same test twice and forgot to remove one. In fact, I think I should remove smartdispatch/tests/test_smartdispatch_script.py entirely, since the tests that I've added in test_smart_dispatch test the same thing, just in different ways.

I agree, those tests are somewhat duplicated. In any case, I would add tests in test_smart_dispatch which call smartdispatch.main directly. That will make it easier to test the error messages.

Ok, should I just move test_gpu_check and test_cpu_check to test_smart_dispatch.TestSmartdispatcher?

I would say so, yes.
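A minimal sketch of the behaviour discussed in this thread, under stated assumptions: `launch_jobs` and this simplified `main` are hypothetical stand-ins, not the actual smartdispatch_script code, and the error text is a placeholder. It shows a mocked `check_output` raising CalledProcessError, the script catching it and exiting with status 2, and a companion test where the launcher succeeds. Note that the assertion uses `assertEqual` and sits after the `with` block: `assertTrue(x, 2)` treats `2` as the failure message, and a statement placed inside the `with` block after the raising call never runs.

```python
import subprocess
import sys
import unittest
from unittest import mock


def launch_jobs(launcher, pbs_filenames):
    """Hypothetical stand-in: submit each PBS file with the launcher."""
    for pbs_filename in pbs_filenames:
        subprocess.check_output([launcher, pbs_filename],
                                stderr=subprocess.STDOUT)


def main(argv=None):
    """Hypothetical, simplified entry point (argv is ignored here)."""
    try:
        launch_jobs("qsub", ["job.pbs"])
    except subprocess.CalledProcessError as e:
        # The launcher already printed its own error; report it and quit
        # instead of reraising with an unhelpful stack trace.
        sys.stderr.write("smart-dispatch: error: launcher failed:\n"
                         "{}\n".format(e.output))
        sys.exit(2)


class TestLauncherCheck(unittest.TestCase):
    def test_launcher_failure_exits_with_2(self):
        error = subprocess.CalledProcessError(1, "qsub", output="mocked error")
        with mock.patch("subprocess.check_output", side_effect=error):
            with self.assertRaises(SystemExit) as context:
                main(argv=[])
        # Outside the `with` block, so it is actually executed.
        self.assertEqual(context.exception.code, 2)

    def test_launcher_success_does_not_exit(self):
        with mock.patch("subprocess.check_output", return_value=""):
            main(argv=[])  # must not raise SystemExit
```

The two test methods cover both directions asked for in the review: the failing launcher and the passing one.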

bouthilx · 2017-09-18T20:12:11Z

smartdispatch/tests/test_smartdispatch_script.py

+                self.assertTrue(context.exception.code, 2)
+
+        except subprocess.CalledProcessError:
+            self.fail("smartdispatch_script.main() raised CalledProcessError unexpectedly!")


Do you know if self.fail will keep the stack trace from CalledProcessError? I know it is mocked so there won't be interesting information in the message itself, but the stacktrace would point to the unexpected call to check_output.

No it doesn't. Gonna add the stack trace to the assert message.

I think it would be better to reraise the error with a changed message than to add the stack trace as a string to the message (if I understand correctly what you intend).

I would use six.reraise. You can look here for a usage example.
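As an illustration of the suggestion above, here is a small sketch of reraising with a new message while keeping the original traceback. It uses the Python 3 only `with_traceback` method; `six.reraise(tp, value, tb)` is the Python 2/3 compatible spelling mentioned in the comment. The helper name `reraise_with_message` and the demo functions are hypothetical, not code from this PR.

```python
import subprocess
import sys
import traceback


def reraise_with_message(new_message):
    """Reraise the exception currently being handled with a new message.

    Must be called from inside an `except` block. The original traceback
    is attached to the new exception, so the failing call stays visible.
    """
    _, exc_value, exc_tb = sys.exc_info()
    new_exc = AssertionError(
        "{}\nOriginal error: {!r}".format(new_message, exc_value))
    raise new_exc.with_traceback(exc_tb)


def failing_call():
    # Stands in for the mocked check_output raising unexpectedly.
    raise subprocess.CalledProcessError(1, "qsub")


def run_check():
    try:
        failing_call()
    except subprocess.CalledProcessError:
        reraise_with_message(
            "main() raised CalledProcessError unexpectedly!")
```

Calling `run_check()` raises an AssertionError whose traceback still contains the `failing_call` frame, which is the information `self.fail` alone would drop.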

bouthilx · 2017-09-18T20:15:00Z

tests/test_smart_dispatch.py

+        # Actual test
+        exit_status_100 = call(self.launch_command_with_gpus.format(gpus=100), shell=True)
+
+        # Test validation


Why not test 0 GPUs too? Because it passes? Then we could assert_equal(exit_status_0, 0).

bouthilx · 2017-09-18T20:16:26Z

tests/test_smart_dispatch.py

+            with self.assertRaises(SystemExit) as context:
+                smartdispatch_script.main(argv=argv)
+
+                self.assertTrue(context.exception.code, 2)


Same for passing tests.

bouthilx · 2017-09-18T20:16:34Z

tests/test_smart_dispatch.py

+                self.assertTrue(context.exception.code, 2)
+
+        except subprocess.CalledProcessError:
+            self.fail("smartdispatch_script.main() raised CalledProcessError unexpectedly!")


Same for the stack trace.

bouthilx · 2017-09-18T20:19:41Z

smartdispatch/smartdispatch_script.py

+def parse_arguments(argv=None):
+
+    if argv is None:
+        argv = sys.argv[1:]


That's useless, parser.parse_args(None) will do this internally.
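A short sketch of the point above: `argparse.ArgumentParser.parse_args` falls back to `sys.argv[1:]` when it receives `None`, so the explicit default is redundant. The `-q`/`--queueName` option here is only a hypothetical example option for the demonstration.

```python
import argparse
import sys


def parse_arguments(argv=None):
    parser = argparse.ArgumentParser(prog="smart-dispatch")
    # Hypothetical option, only here so there is something to parse.
    parser.add_argument("-q", "--queueName")
    # No need for `if argv is None: argv = sys.argv[1:]`;
    # parse_args(None) reads sys.argv[1:] itself.
    return parser.parse_args(argv)
```

Tests can still pass an explicit list, for example `parse_arguments(["-q", "test"])`, while the command-line entry point calls `parse_arguments()` with no argument.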

bouthilx · 2017-09-19T17:24:37Z

Note to self: this PR should also fix #96.

coveralls · 2017-09-21T16:40:36Z

Coverage decreased (-1.7%) to 92.797% when pulling fba5bf3 on Dutil:iss155 into 9133e15 on SMART-Lab:master.


coveralls · 2017-09-21T20:33:11Z

Coverage decreased (-2.08%) to 92.45% when pulling 5d99e79 on Dutil:iss155 into 9133e15 on SMART-Lab:master.


coveralls · 2017-09-26T15:08:05Z

Coverage decreased (-2.2%) to 92.293% when pulling 7e357d8 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

bouthilx · 2017-10-02T18:01:45Z

@Dutil Your commit's title says "Removing duplicate test" but I don't see anything removed. Did I miss something?

bouthilx · 2017-10-02T18:12:29Z

smartdispatch/tests/test_smartdispatch_script.py

+        try:
+            smartdispatch_script.main(argv=argv)
+        except SystemExit as e:
+            self.fail("The command failed the check, but it was supposed to pass.")


What happens if you just let it crash on smartdispatch_script.main()? Is it bad because of SystemExit? Do you lose the stacktrace or a valuable error message by doing self.fail?


coveralls · 2017-10-03T04:09:01Z

Coverage decreased (-1.7%) to 92.815% when pulling a83e500 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

Dutil · 2017-10-03T04:14:20Z

@bouthilx Yeah sorry, I've removed the other file now.

bouthilx · 2017-10-03T12:32:04Z

tests/test_smart_dispatch.py

+        try:
+            smartdispatch_script.main(argv=argv)
+        except SystemExit as e:
+            self.fail("The command failed the check, but it was supposed to pass.")


What happens if you just let it crash on smartdispatch_script.main()? Is it bad because of SystemExit? Do you lose the stacktrace or a valuable error message by doing self.fail?

Yes, we lose the stack trace information when using self.fail. Otherwise the test crashes and we get an error (the rest of the tests run normally).

I used self.fail instead of letting the script crash because I felt it was more in line with what the test was supposed to do (i.e. the test failed because a SystemExit was thrown; it's not really an unexpected error). But it's true that we lose information that way. Should I simply let it crash instead?

I think it is important, when a unit test fails, that the error message is informative enough that we can easily pinpoint the problem and fix it. In this case, I believe both SystemExit and self.fail are just as bad, because SystemExit doesn't write down the stack trace nor an informative message about the source of the problem, right? If this is true, then I would do as in the other tests, where the error is reraised as another exception type. In this case, the reraise procedures should be refactored into a single function to avoid code duplication.

coveralls · 2017-10-09T18:07:39Z

Coverage decreased (-1.8%) to 92.774% when pulling 7eb4b84 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

bouthilx · 2017-10-09T20:11:32Z

smartdispatch/utils.py

+
+def get_advice(cluster_name):
+
+    helios_advice = """On Helios, don't forget that the queue gpu_1, gpu_2, gpu_4 and gpu_8 give access to a specific amount of gpus.


pep8 please 😢 😜

bouthilx · 2017-10-09T20:11:55Z

smartdispatch/utils.py

+    helios_advice = """On Helios, don't forget that the queue gpu_1, gpu_2, gpu_4 and gpu_8 give access to a specific amount of gpus.
+For more advices, please refer to the official documentation: 'https://wiki.calculquebec.ca/w/Helios/en'"""    
+    mammouth_advice = "On Mammouth, please refer to the official documentation for more information: 'https://wiki.ccs.usherbrooke.ca/Accueil/en'"
+    hades_advice = """On Hades, don't forget that the queue name '@hades' needs to be use.


needs to be *used

bouthilx · 2017-10-09T20:14:03Z

smartdispatch/smartdispatch_script.py

+
+            cluster_advice = utils.get_advice(CLUSTER_NAME)
+
+            sys.stderr.write("smart-dispatch: error: The launcher wasn't able the launch the job(s) properly. The following error message was returned: \n\n{}\n\nMaybe the pbs file(s) generated were invalid. {}\n\n".format(e.output, cluster_advice))


I won't ask you to pep8ify all the code, but please do so for the lines you edited, like this one.

bouthilx · 2017-10-10T13:43:58Z

smartdispatch/smartdispatch_script.py

+
+    # Check that requested gpu number does not exceed node total
+    if args.gpusPerCommand > queue.nb_gpus_per_node:
+        sys.stderr.write("smart-dispatch: error: gpusPerCommand exceeds nodes total: asked {req_gpus} gpus, nodes have {node_gpus}\n"


Maybe a little advice would be helpful here too. Some users might be confused to get a message saying there is only 1 GPU per node available if they know nodes have more than that. They might not know that specific queues can limit the number of GPUs available (on helios for instance).

So, show the same advice as when qsub/msub fails?

Yeah, that's not so nice. Maybe just add something like: Make sure you specified the correct queue.
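A minimal sketch of the check with the suggested hint appended, assuming a standalone helper: `check_gpus` and its parameter names are hypothetical, and the exit status 2 follows the value asserted in the tests under review. The message text is taken from the diff above plus the proposed sentence.

```python
import sys


def check_gpus(gpus_per_command, nb_gpus_per_node):
    """Exit with status 2 if a command asks for more GPUs than a node has."""
    if gpus_per_command > nb_gpus_per_node:
        sys.stderr.write(
            "smart-dispatch: error: gpusPerCommand exceeds nodes total: "
            "asked {req_gpus} gpus, nodes have {node_gpus}. "
            "Make sure you specified the correct queue.\n".format(
                req_gpus=gpus_per_command, node_gpus=nb_gpus_per_node))
        sys.exit(2)
```

The hint stays generic on purpose, so it does not repeat the cluster-specific advice shown when qsub/msub fails.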

bouthilx · 2017-10-10T13:45:31Z

smartdispatch/utils.py

@@ -136,3 +136,26 @@ def get_launcher(cluster_name):
        return "msub"
    else:
        return "qsub"
+
+def get_advice(cluster_name):


I think it would be preferable to define the advice strings outside of get_advice, as module-level constants following the constant naming convention (HELIOS_ADVICE, MAMMOUTH_ADVICE, etc.).
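One possible shape for that refactoring, as a sketch only. The advice texts come from the diff under review (with "used" corrected); the rest of the Hades text is not visible in the review, so only its first sentence appears. The lowercase cluster names used as dictionary keys and the empty-string fallback are assumptions.

```python
HELIOS_ADVICE = (
    "On Helios, don't forget that the queues gpu_1, gpu_2, gpu_4 and gpu_8 "
    "give access to a specific number of GPUs.\n"
    "For more advice, please refer to the official documentation: "
    "'https://wiki.calculquebec.ca/w/Helios/en'")

MAMMOUTH_ADVICE = (
    "On Mammouth, please refer to the official documentation for more "
    "information: 'https://wiki.ccs.usherbrooke.ca/Accueil/en'")

# Only the first sentence is visible in the reviewed diff.
HADES_ADVICE = (
    "On Hades, don't forget that the queue name '@hades' needs to be used.")

# Assumption: cluster names are the lowercase strings below.
CLUSTER_ADVICE = {
    "helios": HELIOS_ADVICE,
    "mammouth": MAMMOUTH_ADVICE,
    "hades": HADES_ADVICE,
}


def get_advice(cluster_name):
    """Return cluster-specific advice, or an empty string if none exists."""
    return CLUSTER_ADVICE.get(cluster_name, "")
```

With the strings at module level, the long lines can be wrapped for pep8 once, and tests can compare against the constants directly.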

bouthilx · 2017-10-10T13:47:43Z

tests/test_smart_dispatch.py

+import sys
+import traceback
+
+def rethrow_exception(exception, new_message):


What about moving this helper function to utils.py? utils/testing.py would be even better actually but I would leave utils refactoring for later.

bouthilx · 2017-10-10T13:50:23Z

tests/test_smart_dispatch.py

@@ -23,17 +47,23 @@ def setUp(self):
        self.nb_commands = len(self.commands)

        scripts_path = abspath(pjoin(os.path.dirname(__file__), os.pardir, "scripts"))
-        self.smart_dispatch_command = '{} -C 1 -q test -t 5:00 -x'.format(pjoin(scripts_path, 'smart-dispatch'))
+        self.smart_dispatch_command = '{} -C 1 -G 1 -q test -t 5:00 -x'.format(pjoin(scripts_path, 'smart-dispatch'))


Do we still have tests without GPUs? We need to test both settings: with and without GPUs.

So I just remembered why I added that. The default value for 'gpusPerCommand' is 1, but for 'gpusPerNode' the default value is 0 when the queue is undefined (like the queue 'test', which is used for the tests), and that triggers the GPU check.

So in order to have those tests pass, we need to either pass the -G 1 option, change the check to let it pass if gpusPerNode == 0, or change the gpusPerCommand default to 0.

Then we should have tests for -g 0 -G 0 and -g 0. Do we?

Aren't the -g 0 -G 0 and -g 0 tests still missing?
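To make the cases in this thread concrete, here is a small table-driven sketch. It assumes, as the discussion suggests, that `-g` sets gpusPerCommand (default 1) and `-G` sets gpusPerNode (default 0 for an undefined queue such as 'test'). `gpu_check_passes` is a hypothetical stand-in for the check in the script, not the script's own code.

```python
DEFAULT_GPUS_PER_COMMAND = 1
DEFAULT_GPUS_PER_NODE = 0  # default when the queue is undefined


def gpu_check_passes(gpus_per_command=None, gpus_per_node=None):
    """Return True if the requested GPUs per command fit on one node."""
    if gpus_per_command is None:
        gpus_per_command = DEFAULT_GPUS_PER_COMMAND
    if gpus_per_node is None:
        gpus_per_node = DEFAULT_GPUS_PER_NODE
    return gpus_per_command <= gpus_per_node


# (flags as they would appear on the command line, kwargs, expected result)
CASES = [
    ("(no GPU flags)", {}, False),  # 1 > 0 triggers the check
    ("-G 1", {"gpus_per_node": 1}, True),
    ("-g 0 -G 0", {"gpus_per_command": 0, "gpus_per_node": 0}, True),
    ("-g 0", {"gpus_per_command": 0}, True),
]

for flags, kwargs, expected in CASES:
    assert gpu_check_passes(**kwargs) == expected, flags
```

The first row is the situation that made the existing tests fail without `-G 1`; the last two rows are the GPU-free settings the review asks to cover.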


coveralls · 2017-10-10T21:11:07Z

Coverage decreased (-1.6%) to 92.912% when pulling cec39e6 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

bouthilx · 2017-10-16T19:38:34Z

tests/test_smart_dispatch.py

@@ -150,7 +128,22 @@ def test_gpu_check(self):
        argv[2] = '1'
        smartdispatch_script.main(argv=argv)

-    @rethrow_exception(SystemExit, "smartdispatch_script.main() raised SystemExit unexpectedly.")
+        # Test if we don't have gpus. (and spicified in script).


spicified 😝

bouthilx · 2017-10-16T19:43:22Z

smartdispatch/utils.py

+
+    def func_wraper(func):
+
+        def test_func(*args, **kwargs):


Wouldn't it be better to use @functools.wraps(func) here for the stack trace? I mean

    @functools.wraps(func)
    def test_func(*args, **kwargs):

I just had a problem with stack trace because of decorators while implementing tests for Slurm clusters. I wonder if the same thing is happening here.

Just tried, it didn't change the stack trace.

OK, well you can leave it there anyway.
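For reference, a sketch of what such a decorator could look like with `functools.wraps` applied. The names `rethrow_exception`, `func_wrapper` and `test_func` follow the diff; the body, the choice of RuntimeError as the new exception type, and `demo_test` are assumptions. `functools.wraps` copies metadata such as `__name__` and `__doc__` onto the wrapper; it does not alter tracebacks, which is consistent with the observation above that the stack trace did not change.

```python
import functools
import sys


def rethrow_exception(exception, new_message):
    """Decorator factory: turn `exception` into a RuntimeError with
    `new_message`, keeping the original traceback (Python 3 only)."""

    def func_wrapper(func):

        @functools.wraps(func)  # keeps the test's name and docstring
        def test_func(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exception:
                exc_tb = sys.exc_info()[2]
                raise RuntimeError(new_message).with_traceback(exc_tb)

        return test_func

    return func_wrapper


@rethrow_exception(SystemExit, "main() raised SystemExit unexpectedly.")
def demo_test():
    """Hypothetical test body that exits unexpectedly."""
    sys.exit(2)
```

Keeping `__name__` intact matters mostly for test reports, where the wrapped test would otherwise show up as `test_func`.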

bouthilx · 2017-10-16T19:45:43Z

tests/test_smart_dispatch.py

+        argv[2] = '1'
+        smartdispatch_script.main(argv=argv)
+
+        # Test if we don't have gpus. (and spicified in script).


Looks like you should do grep "spicified" 😝

In my defence, it's the same line/mistake as the other one 😜

Coherence is important indeed

Wait, it's the same file? hahaha! 😊

bouthilx · 2017-10-16T19:46:19Z

tests/test_smart_dispatch.py

+        argv[4] = '0'
+        smartdispatch_script.main(argv=argv)
+
+        # Don't have gpus, but the user specofy 1 anyway.


A variant! Specofy! 🙃

coveralls · 2017-10-17T16:22:29Z

Coverage decreased (-1.6%) to 92.916% when pulling 3b27597 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

coveralls · 2017-10-17T18:01:33Z

Coverage decreased (-1.6%) to 92.916% when pulling 71af7a3 on Dutil:iss155 into 9133e15 on SMART-Lab:master.

bouthilx · 2017-10-17T18:18:20Z

@mgermain Ready for merge. I believe the decrease in coverage is due to the move of scripts/* to smartdispatch/*_script.py. The unit tests for the proposed changes seem to have proper coverage in my opinion.

Adding a check to the number of requested gpus to give a meaningful m…

072dce6

…essage

Dutil added 7 commits September 13, 2017 14:40

changed the call to check_output to be able to mock it.

d9c4e9c

Adding a check for the launcher error.

4a0d63d

Adding test to test the well behaviour of the script

9f63771

Using an unexisting queue.

cc07bc6

more precise error message.

4d76f5a

Puttig all the tests on the script in the same file.

1b7ffbb

Dutil force-pushed the iss155 branch from e6687c3 to 1b7ffbb Compare September 13, 2017 18:52

Dutil changed the title ~~WIP: Issue #155~~ Handle resources error gracefully. (#155) Sep 13, 2017

bouthilx reviewed Sep 18, 2017


Dutil added 2 commits September 21, 2017 12:37

The tests cover more cases and are more explicit.

90c4f4d

removing useless lines.

fba5bf3

Raising a new Exception instead with the stack instead of falling the…

5d99e79

… test.

Removing duplicate test. Putting all the tests that have to do with t…

7e357d8

…he script in the same file.

bouthilx reviewed Oct 2, 2017


Dutil added 2 commits October 3, 2017 00:04

Removing duplicate test. Putting all the tests that have to do with t…

c33292b

…he script in the same file.

Merge branch 'iss155' of https://github.com/Dutil/smartdispatch into …

a83e500

…iss155

bouthilx reviewed Oct 3, 2017


refactoring the test to catch and reraise the exceptions.

94f58d0

Adding specific advices depending on the cluster we are currently on.

7eb4b84

bouthilx reviewed Oct 9, 2017


bouthilx reviewed Oct 10, 2017


refactoring the test utils, and testing le script when no gpus are av…

cec39e6

…ailable on the queue. pep8fy some long strings.

bouthilx reviewed Oct 16, 2017


Correcting some typo and use better decorator helper functions.

3b27597

Adding some comments to make the tests easier to understand.

71af7a3

