-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use an Asynchronous Event Loop in SbyJob (rebased) #108
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: William D. Jones <[email protected]>
Signed-off-by: William D. Jones <[email protected]>
Signed-off-by: William D. Jones <[email protected]>
Signed-off-by: William D. Jones <[email protected]>
Signed-off-by: William D. Jones <[email protected]>
Signed-off-by: William D. Jones <[email protected]>
Signed-off-by: William D. Jones <[email protected]>
Signed-off-by: William D. Jones <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there are a lot of considerations that went into SbyJob/SbyTask that I am not fully aware of and that you'll need to wait for review from @clairexen on, but just testing it out I did notice that the error handling is now significantly less graceful, e.g. if a solver is not installed:
on master:
make test_demo2
cd sbysrc && python3 sby.py -f demo2.sby
SBY 12:57:53 [demo2] Removing directory 'demo2'.
SBY 12:57:53 [demo2] Writing 'demo2/src/demo.v'.
SBY 12:57:53 [demo2] engine_0: aiger suprove
SBY 12:57:53 [demo2] engine_1: aiger avy
SBY 12:57:53 [demo2] nomem: starting process "cd demo2/src; yosys -ql ../model/design_nomem.log ../model/design_nomem.ys"
SBY 12:57:53 [demo2] nomem: finished (returncode=0)
SBY 12:57:53 [demo2] aig: starting process "cd demo2/model; yosys -ql design_aiger.log design_aiger.ys"
SBY 12:57:53 [demo2] aig: finished (returncode=0)
SBY 12:57:53 [demo2] engine_0: starting process "cd demo2; suprove model/design_aiger.aig"
SBY 12:57:53 [demo2] engine_1: starting process "cd demo2; avy --cex - model/design_aiger.aig"
SBY 12:57:53 [demo2] engine_0: finished (returncode=127)
SBY 12:57:53 [demo2] engine_0: COMMAND NOT FOUND. ERROR.
SBY 12:57:53 [demo2] engine_1: finished (returncode=0)
SBY 12:57:53 [demo2] engine_1: Status returned by engine: PASS
SBY 12:57:53 [demo2] summary: Elapsed clock time [H:MM:SS (secs)]: 0:00:00 (0)
SBY 12:57:53 [demo2] summary: Elapsed process time [H:MM:SS (secs)]: 0:00:00 (0)
SBY 12:57:53 [demo2] summary: engine_1 (aiger avy) returned PASS
SBY 12:57:53 [demo2] DONE (ERROR, rc=16)
make: *** [test_demo2] Error 16
on async (after merging master to get the new tests):
make test_demo2
cd sbysrc && python3 sby.py -f demo2.sby
SBY 12:59:13 [demo2] Removing directory 'demo2'.
SBY 12:59:13 [demo2] Writing 'demo2/src/demo.v'.
SBY 12:59:13 [demo2] engine_0: aiger suprove
SBY 12:59:13 [demo2] engine_1: aiger avy
SBY 12:59:13 [demo2] nomem: starting process "cd demo2/src; yosys -ql ../model/design_nomem.log ../model/design_nomem.ys"
SBY 12:59:13 [demo2] nomem: finished (returncode=0)
SBY 12:59:13 [demo2] aig: starting process "cd demo2/model; yosys -ql design_aiger.log design_aiger.ys"
SBY 12:59:13 [demo2] aig: finished (returncode=0)
SBY 12:59:13 [demo2] engine_0: starting process "cd demo2; suprove model/design_aiger.aig"
SBY 12:59:13 [demo2] engine_1: starting process "cd demo2; avy --cex - model/design_aiger.aig"
SBY 12:59:13 [demo2] engine_0: finished (returncode=127)
Traceback (most recent call last):
File "sby.py", line 430, in <module>
retcode |= run_job(t)
File "sby.py", line 388, in run_job
job.run(setupmode)
File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 691, in run
self.taskloop()
File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 267, in taskloop
loop.run_until_complete(poll_fut)
File "/Users/nak/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
return future.result()
File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 302, in task_poller
await task.shutdown_and_notify()
File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 186, in shutdown_and_notify
self.handle_exit(self.p.returncode)
File "/Users/nak/Source/SymbiYosys/sbysrc/sby_core.py", line 109, in handle_exit
self.exit_callback(retcode)
File "/Users/nak/Source/SymbiYosys/sbysrc/sby_engine_aiger.py", line 81, in exit_callback
assert retcode == 0
AssertionError
make: *** [test_demo2] Error 1
From the latter it really isn't clear what the problem is at all.
I also remember that we ran into https://bugs.python.org/issue23548 a few times in CI when processes launched in asyncio errored, and subcommands failing is a very frequent and expected thing to happen in sby. Is this accounted for here?
It might be a couple days before I can dig into this again, but the first issue with missing tools definitely looks like a regression introduced by me. The handling of exit code 127 was added since the original PR was authored and although I tried to incorporate it, it looks like it will need to be adapted. I haven't run into the second issue but it looks like the I am assuming the CI infrastructure is internal to Symbiotic EDA? I wouldn't blame you for wanting some unit tests for this but wouldn't want to reinvent the wheel. Also some of these errors are probably race conditions and hard to reproduce... FWIW I was having issues with Yices crashing on OS X with python 3.8 (which uses the same event loop as linux) and here are two example logs:
|
Yes, looking at it more closely it's just a question of moving the check for We have added some tests to this repo the other day, which we also use in CI. If you merge recent master you can use I'll be honest though, it's going to be an uphill battle getting this PR merged, as there's a lot of platforms to consider (every flavor of linux from ancient CentOS distributions to cutting-edge Arch, MacOS, several ways of launching things on Windows...) and we don't have the structure in place to test a major rewrite like this on all of them. The current implementation is at a point where we're mostly sure that it works most of the time on most of these platforms only because we got enough bug reports from users that told us when it didn't work, and it's an unappealing proposition to make our customers go through that again. Do I understand the description of the original PR correctly that the problem this addresses is just the performance impact of not running multiple solvers in parallel on windows? |
I can't remember if multiple solvers in parallel worked in Windows before this PR. When I said "parallel" in the original PR, that was in reference to the intended The performance impact comes from From what I remember, sby's main loop is basically: If an event arrives, handle event. If timeout is reached, check to see if we should bail, and if not, wait for another event. IIRC, the async rewrite removes the "check to see if we should bail part", and thus the need for any looping logic, by making the Python event loop decide when to bail. |
Ah, I see. So it's still working correctly before this change though, just suboptimally because it wastes one core busy-looping, right? |
@nakengelhardt I apologize, it looks like I misremembered a bit. The busy-waiting is a problem, but by "parallel" it looks like I was referring to this comment in a previous PR:
Unfortunately, I don't remember the difference between At the time, what I meant by this is "if you're doing I didn't write a good bug report- if I did, I would described this behavior in more detail other than "the culprit is |
This is mostly just a rebased version of #39. On @whitequark's request I have also replaced the use of
taskkill
on Windows with a console process group that is terminated using a ctrl+break signal. I verified using SysInternals ProcessMonitor that this causes the child processes to exit with code 0xC000013A which means they received the signal.cc: @cr1901 as the original author