[PM-1932] Failure to cleanup /tmp in nonsharedfs mode #2045

Open
mayani opened this issue Dec 14, 2024 · 7 comments
Assignees
Labels
fix-5.1.0 minor Minor loss of function, or other problem where easy workaround is present. sync-from-jira Synced from Jira

Comments

@mayani
Member

mayani commented Dec 14, 2024

Setting: Running Pegasus workflows on Eclair
Scheduler: Slurm
How to replicate: Run the pegasus-remove command while a job is executing

When we issue the pegasus-remove command while jobs are running on Eclair (and probably any other Slurm-based cluster), Pegasus fails to clean up data from /tmp on the worker host.

I'm not sure if this can be solved by capturing a signal from Slurm and gracefully exiting the job.

Reporter: @papajim
Watchers:
@papajim
@rynge
@vahi
Attachments:
Image

@mayani
Member Author

mayani commented Dec 14, 2024

Author: @rynge

Is it the pegasuslite directory you are seeing, or something else?

@mayani
Member Author

mayani commented Dec 14, 2024

Author: @papajim

@rynge Check the attached image for an example.
Relevant dirs are the:

  • ks.*
  • pegasus.*
  • rootfs-*

@mayani
Member Author

mayani commented Dec 14, 2024

Author: @vahi

I know kickstart uses /tmp, and so does our integrity checking.
@rynge if we were to use Condor scratch directories for this, the issue could go away. Condor will clean up the directory for us.
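
A minimal sketch of that idea, assuming HTCondor's `_CONDOR_SCRATCH_DIR` environment variable is set inside the job. The function names `pick_scratch_base` and `make_work_dir` and the `pegasus.XXXXXX` template are hypothetical, chosen for illustration only, and are not existing Pegasus code:

```bash
# Hypothetical helper: choose a base directory for per-job temporary data.
# _CONDOR_SCRATCH_DIR is set by HTCondor inside a running job; falling
# back to TMPDIR and then /tmp is an assumption for this sketch.
pick_scratch_base() {
    if [ -n "${_CONDOR_SCRATCH_DIR:-}" ] && [ -d "${_CONDOR_SCRATCH_DIR}" ]; then
        printf '%s\n' "${_CONDOR_SCRATCH_DIR}"
    else
        printf '%s\n' "${TMPDIR:-/tmp}"
    fi
}

# Hypothetical helper: create a private work directory under that base
# and print its path.
make_work_dir() {
    mktemp -d "$(pick_scratch_base)/pegasus.XXXXXX"
}
```

With this layout, anything created under the Condor scratch directory would be removed by Condor when the job ends, whereas the /tmp fallback would still need explicit cleanup.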

@mayani
Member Author

mayani commented Dec 14, 2024

Author: @rynge

Some Slurm clusters are set up that way as well. However, we should consider the base case here - we should do as good a job as we can to clean up.

@mayani
Member Author

mayani commented Dec 14, 2024

Author: @papajim

From Slurm's docs (https://slurm.schedmd.com/scancel.html):
"To cancel a job, invoke scancel without --signal option. This will send first a SIGCONT to all steps to eventually wake them up followed by a SIGTERM, then wait the KillWait duration defined in the slurm.conf file and finally if they have not terminated send a SIGKILL. This gives time for the running job/step(s) to clean up."

So we might be able to capture the SIGTERM signal, if we don't already.
If we already do this, the KillWait duration on Eclair might be set too low.
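
To reproduce this without a cluster, the quoted sequence can be imitated locally. The sketch below is only a stand-in for what scancel is documented to do (SIGCONT, SIGTERM, a grace period, then SIGKILL). It does not call Slurm, and `simulate_scancel` and its `kill_wait` argument are hypothetical names:

```bash
# Local stand-in for the documented scancel sequence (not Slurm itself).
# Usage: simulate_scancel PID [GRACE_SECONDS]
simulate_scancel() {
    local pid=$1 kill_wait=${2:-30} waited=0

    kill -CONT "$pid" 2>/dev/null
    kill -TERM "$pid" 2>/dev/null

    # Poll once per second until the process is gone or the grace
    # period (playing the role of KillWait) runs out.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$kill_wait" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null
        return 1    # grace period expired; the process was killed
    fi
    return 0        # the process exited within the grace period
}
```

Pointing this at a locally started job wrapper with a short grace period should show whether the cleanup code gets a chance to run before the SIGKILL arrives.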

@mayani
Member Author

mayani commented Dec 14, 2024

Author: @papajim

I took a look into the pegasus-lite code and we trap SIGTERM, SIGINT but I don't think we handle them properly.
According to section 12.2.2 of this https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html : "When Bash receives a signal for which a trap has been set while waiting for a command to complete, the trap will not be executed until the command completes. When Bash is waiting for an asynchronous command via the wait built-in, the reception of a signal for which a trap has been set will cause the wait built-in to return immediately with an exit status greater than 128, immediately after which the trap is executed."

In pegasus-lite, when we execute commands like pegasus-transfer or the main task (e.g., a Singularity-based task), we don't start them as background child processes and then wait for the child PIDs to exit.
As a result, the trap code runs only after the command finishes. When the command takes longer than Slurm's KillWait timeout, the trap code never gets executed.
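
A minimal sketch of the background-plus-`wait` pattern described above, so that a trapped signal interrupts the shell right away instead of being deferred until the command ends. This is not the pegasus-lite code; `run_with_cleanup` and `work_dir` are hypothetical names:

```bash
# Hypothetical sketch: run a command so that SIGTERM/SIGINT are acted on
# immediately, then remove the work directory.
# Usage: run_with_cleanup WORK_DIR COMMAND [ARGS...]
run_with_cleanup() {
    local work_dir=$1 child="" rc=0
    shift

    # Forward the signal to the child so that it stops too.
    trap '[ -n "$child" ] && kill -TERM "$child" 2>/dev/null' TERM INT

    # Start the command in the background and block in `wait`, which a
    # trapped signal interrupts immediately.
    "$@" &
    child=$!
    wait "$child" || rc=$?

    # If a signal interrupted `wait`, the child may still be exiting,
    # so wait once more to reap it before cleaning up.
    if [ "$rc" -gt 128 ]; then
        wait "$child" 2>/dev/null || true
    fi

    trap - TERM INT
    rm -rf "$work_dir"
    return "$rc"
}
```

For example, `run_with_cleanup /tmp/pegasus.abc123 sleep 600` (a made-up path) would return shortly after a SIGTERM with the directory removed. There is no timeout in this sketch, so a child that ignores SIGTERM would still block the second `wait`.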

@mayani
Member Author

mayani commented Dec 14, 2024

Author: @rynge

Good point! If we are going to run these as child processes, there are a few things to consider:

  • if we receive signals, we need to send appropriate signals to the child process
  • probably need a timeout for that signal to be handled downstream, but what do we do at timeout?
  • consider a job which ignores signals
  • consider a job which takes too long to respond (cleaning up/writing checkpoints/...)
  • once a child process finishes, handle exit codes
  • do we need to take process groups/sessions into account?
  • we need some nice wrapper code for the above
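
One possible shape for such a wrapper, as a rough sketch only. It covers signal forwarding, a grace period with escalation to SIGKILL, and exit code handling. It does not deal with process groups or sessions, so grandchildren of a killed child can be left running. `run_child` and its `grace` argument are hypothetical names, and escalating to SIGKILL at timeout is just one answer to the open question in the list:

```bash
# Hypothetical wrapper sketch; not part of pegasus-lite.
# Usage: run_child GRACE_SECONDS COMMAND [ARGS...]
run_child() {
    local grace=$1 child="" rc=0 r2=0 waited=0 got_signal=0
    shift

    # Forward SIGTERM/SIGINT to the child and remember that it happened.
    trap 'got_signal=1; [ -n "$child" ] && kill -TERM "$child" 2>/dev/null' TERM INT

    "$@" &
    child=$!
    wait "$child" || rc=$?

    if [ "$got_signal" -eq 1 ]; then
        # Give the child up to $grace seconds to finish its own cleanup.
        while kill -0 "$child" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
            sleep 1
            waited=$((waited + 1))
        done
        # A child that ignores SIGTERM, or is too slow, gets SIGKILL.
        if kill -0 "$child" 2>/dev/null; then
            kill -KILL "$child" 2>/dev/null
        fi
        # Collect the child's real exit status. 127 means the shell no
        # longer tracks the child, so keep the status from the first wait.
        wait "$child" 2>/dev/null || r2=$?
        if [ "$r2" -ne 127 ]; then
            rc=$r2
        fi
    fi

    trap - TERM INT
    return "$rc"
}
```

The return value is the child's own exit code when it finishes normally, and a value above 128 when it ends because of a signal, for instance 137 after the SIGKILL escalation.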

@mayani mayani added sync-from-jira Synced from Jira fix-5.1.0 minor Minor loss of function, or other problem where easy workaround is present. labels Dec 14, 2024
@mayani mayani changed the title PM-1932 [PM-1932] Failure to cleanup /tmp in nonsharedfs mode Dec 14, 2024