[PM-1932] Failure to cleanup /tmp in nonsharedfs mode #2045

mayani · 2024-12-14T08:28:54Z

Setting: Running Pegasus workflows on Eclair
Scheduler: Slurm
How to replicate: Run pegasus-remove command as a job is being executed

When we issue the pegasus-remove command while jobs are running on Eclair (and probably on any other Slurm-based cluster), Pegasus fails to clean up data from /tmp on the worker host.

I'm not sure if this can be solved by capturing a signal from Slurm and gracefully exiting the job.

Reporter: @papajim
Watchers:
@papajim
@rynge
@vahi
Attachments:

mayani · 2024-12-14T20:00:42Z

Author: @rynge

Is it the pegasuslite directory you are seeing, or something else?

mayani · 2024-12-14T20:00:43Z

Author: @papajim

@rynge Check the attached image for an example.
The relevant directories are:

ks.*
pegasus.*
rootfs-*

mayani · 2024-12-14T20:00:44Z

Author: @vahi

I know kickstart does stuff in /tmp, and so does our integrity stuff.
@rynge if we were to use Condor scratch directories for this, the issue could go away: Condor would clean up the directory for us.

mayani · 2024-12-14T20:00:45Z

Author: @rynge

Some Slurm clusters are set up that way as well. However, we should consider the base case here: we should do as good a job as we can to clean up.

mayani · 2024-12-14T20:00:46Z

Author: @papajim

From Slurm's documentation (https://slurm.schedmd.com/scancel.html):
"To cancel a job, invoke scancel without --signal option. This will send first a SIGCONT to all steps to eventually wake them up followed by a SIGTERM, then wait the KillWait duration defined in the slurm.conf file and finally if they have not terminated send a SIGKILL. This gives time for the running job/step(s) to clean up."

So we might be able to capture the SIGTERM signal, if we don't already.
If we already do, the KillWait duration on Eclair might be set too low.
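One way to check the KillWait hypothesis is to read the configured value on the cluster. This is a minimal sketch, assuming `scontrol show config` prints a line of the form `KillWait = 30 sec`; the helper name `killwait_seconds` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: extract the KillWait value (in seconds) from
# `scontrol show config` output supplied on stdin.
# Assumes a line of the form "KillWait = 30 sec".
killwait_seconds() {
    awk '$1 == "KillWait" && $2 == "=" { print $3 }'
}

# On a Slurm login node one would run:
#   scontrol show config | killwait_seconds
```

Any cleanup triggered by SIGTERM has to finish within that many seconds, since SIGKILL follows once KillWait expires.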

mayani · 2024-12-14T20:00:47Z

Author: @papajim

I took a look at the pegasus-lite code. We trap SIGTERM and SIGINT, but I don't think we handle them properly.
According to section 12.2.2 of the Bash Beginners Guide (https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html): "When Bash receives a signal for which a trap has been set while waiting for a command to complete, the trap will not be executed until the command completes. When Bash is waiting for an asynchronous command via the wait built-in, the reception of a signal for which a trap has been set will cause the wait built-in to return immediately with an exit status greater than 128, immediately after which the trap is executed."

In pegasus-lite, when we execute commands like pegasus-transfer or the main task (e.g., a Singularity-based task), we don't launch them as background child processes and then wait for the child PIDs to exit.
As a result, the trap code runs only after the command finishes. When the command takes longer than Slurm's KillWait timeout, SIGKILL arrives first and the trap code never gets executed.
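The background-plus-wait pattern described in the quoted section could look roughly like this. It is a sketch under assumptions, not the pegasus-lite implementation: `run_with_cleanup` and its `SCRATCH_DIR` argument are hypothetical names, and only SIGTERM is handled.

```shell
#!/bin/bash
# Sketch only: run_with_cleanup is a hypothetical name, not pegasus-lite code.
# Usage: run_with_cleanup SCRATCH_DIR COMMAND [ARGS...]
run_with_cleanup() {
    scratch_dir=$1
    shift
    child_pid=

    # On SIGTERM: stop the payload, remove the scratch directory, and exit
    # with the conventional 128 + 15 status.
    trap 'kill "$child_pid" 2>/dev/null || true; rm -rf "$scratch_dir"; exit 143' TERM

    # Running "$@" in the foreground would defer the trap until the command
    # returned. Starting it in the background and blocking in wait lets the
    # trap run as soon as the signal arrives.
    "$@" &
    child_pid=$!
    status=0
    wait "$child_pid" || status=$?

    # Normal completion: clean up and pass the payload's exit code through.
    trap - TERM
    rm -rf "$scratch_dir"
    return "$status"
}
```

With this shape, a SIGTERM delivered while the payload is still running removes the scratch directory right away instead of after the payload returns, which is what matters when KillWait is short.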

mayani · 2024-12-14T20:00:48Z

Author: @rynge

Good point! If we are going to run these as child processes, there are a few things to consider:

- if we receive signals, we need to send appropriate signals to the child process
- probably need a timeout for that signal to be handled downstream, but what do we do at timeout?
  - consider a job which ignores signals
  - consider a job which takes too long to respond (cleaning up/writing checkpoints/...)
- once a child process finishes, handle exit codes
- do we need to take process groups/sessions into account?
- we need some nice wrapper code for the above
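A possible shape for such a wrapper is sketched below. It forwards SIGTERM to the child, escalates to SIGKILL after a grace period, and returns the child's exit status. The names `run_child` and `GRACE_SECONDS` are hypothetical, the 10-second default is arbitrary, and process groups and sessions are not addressed here.

```shell
#!/bin/bash
# Sketch only: run_child and GRACE_SECONDS are hypothetical names.
GRACE_SECONDS=${GRACE_SECONDS:-10}

# Usage: run_child COMMAND [ARGS...]
run_child() {
    child_pid=
    term_requested=0
    # Only record the signal here; the handling happens after wait returns.
    trap 'term_requested=1' TERM INT

    "$@" &
    child_pid=$!
    status=0
    wait "$child_pid" || status=$?   # returns early if a trapped signal arrives

    if [ "$term_requested" -eq 1 ]; then
        # Forward the signal so the payload can clean up or checkpoint.
        kill -TERM "$child_pid" 2>/dev/null || true

        # Timer: escalate to SIGKILL if the payload ignores SIGTERM or takes
        # longer than GRACE_SECONDS to exit. Output is discarded so a
        # leftover sleep cannot hold the caller's stdout open.
        ( sleep "$GRACE_SECONDS"; kill -KILL "$child_pid" 2>/dev/null ) \
            >/dev/null 2>&1 &
        timer_pid=$!

        # Collect the payload's final exit status.
        status=0
        wait "$child_pid" || status=$?

        # Stop the timer if it is still pending.
        kill "$timer_pid" 2>/dev/null || true
        wait "$timer_pid" 2>/dev/null || true
    fi

    trap - TERM INT
    return "$status"
}
```

For example, `run_child pegasus-transfer ...` would return 143 if the transfer exits on the forwarded SIGTERM, or 137 if it had to be killed. Under Slurm the grace period would need to stay below KillWait, and a second signal arriving during the final wait is not handled in this sketch.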

mayani added the sync-from-jira (Synced from Jira), fix-5.1.0, and minor (Minor loss of function, or other problem where easy workaround is present.) labels Dec 14, 2024

mayani changed the title ~~PM-1932~~ [PM-1932] Failure to cleanup /tmp in nonsharedfs mode Dec 14, 2024

mayani assigned rynge Dec 14, 2024
