[PM-1932] Failure to cleanup /tmp in nonsharedfs mode #2045
Comments
Author: @rynge Is it the pegasuslite directory you are seeing, or something else?
Author: @vahi I know kickstart does stuff in /tmp, and so does our integrity stuff.
Author: @rynge Some Slurm clusters are set up that way as well. However, we should consider the base case here: we should do as good a job as we can to clean up.
Author: @papajim From Slurm's docs: https://slurm.schedmd.com/scancel.html So we might be able to capture the SIGTERM signal, if we don't already.
Author: @papajim I took a look at the pegasus-lite code. We trap SIGTERM and SIGINT, but I don't think we handle them properly. When pegasus-lite executes commands like pegasus-transfer or the main task (e.g., a Singularity-based task), we don't invoke them as child processes and then wait for the child PIDs to exit.
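To make the "run as a child, then wait on its PID" idea concrete, here is a minimal Python sketch of the pattern. It is not the pegasus-lite code (which is a shell script) and not a proposed patch; `run_with_cleanup` and the scratch directory are hypothetical names used only for illustration. The wrapper starts the command as a child process, forwards SIGTERM/SIGINT to it, waits for it to exit, and removes the scratch directory in every case.

```python
import shutil
import signal
import subprocess
import tempfile


def run_with_cleanup(cmd, workdir=None):
    """Run cmd as a child process and always remove workdir afterwards.

    Hypothetical helper: workdir stands in for the job's data under /tmp.
    """
    workdir = workdir or tempfile.mkdtemp(prefix="job-scratch-")
    child = subprocess.Popen(cmd, cwd=workdir)

    def forward(signum, _frame):
        # Pass the signal on so the child can shut down as well.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handler, remembering the previous handlers.
    previous = {
        sig: signal.signal(sig, forward)
        for sig in (signal.SIGTERM, signal.SIGINT)
    }
    try:
        # wait() returns once the child has exited, normally or by signal.
        return child.wait()
    finally:
        # Restore the old handlers, then clean up the scratch directory.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
        shutil.rmtree(workdir, ignore_errors=True)
```

On POSIX, the return value is the child's exit code, or the negative signal number if the child was terminated by a signal. A sketch like this only helps when the scheduler actually delivers SIGTERM with enough time before any SIGKILL; whether that holds on a given cluster is an assumption.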
Author: @rynge Good point! If we are going to run these as child processes, there are a few things to consider:
Setting: Running Pegasus workflows on Eclair
Scheduler: Slurm
How to replicate: Run the pegasus-remove command while a job is executing
When we issue the pegasus-remove command while jobs are running on Eclair (and probably any other Slurm-based cluster), Pegasus fails to clean up data from /tmp on the worker host.
I'm not sure if this can be solved by capturing a signal from Slurm and gracefully exiting the job.
Reporter: @papajim
Watchers:
@papajim
@rynge
@vahi
Attachments: