Flux machine issues #106

Open
jwhite242 opened this issue Feb 27, 2023 · 2 comments
Labels
question Further information is requested

Comments

@jwhite242
Collaborator

I am currently seeing some issues with the flux scheduler that don't crop up with the other schedulers. The machine appears to struggle with lots of fast-running jobs, possibly losing track of resource availability, and throughput eventually grinds to a halt. Could this be related to resource tracking being split between ats and the flux handler?

Additionally, the time argument to flux mini run is causing some issues: it requires large overestimates of job allocation times. Otherwise there is a race condition where ats thinks jobs are still remaining, but flux won't schedule any of them because the time requested exceeds the remaining allocation time.
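To make the time-limit stall concrete, here is a minimal sketch of the condition described above. It is not code from ats or flux: the function name `partition_by_time`, the `(name, requested_s)` job tuples, and the `margin_s` parameter are all hypothetical, and the remaining allocation time is passed in as a plain number rather than queried from flux.

```python
def partition_by_time(jobs, remaining_s, margin_s=0.0):
    """Split jobs into those that still fit in the allocation and those that do not.

    jobs        -- iterable of (name, requested_s) pairs (hypothetical layout)
    remaining_s -- seconds left in the enclosing allocation (supplied by the caller)
    margin_s    -- optional safety margin added to every request
    """
    runnable, stranded = [], []
    for name, requested_s in jobs:
        # A job can only be scheduled if its requested time fits in what is left.
        if requested_s + margin_s <= remaining_s:
            runnable.append(name)
        else:
            stranded.append(name)
    return runnable, stranded


# One job with a tight estimate and one with a heavily padded estimate.
jobs = [("fast_test", 60), ("padded_test", 1800)]
runnable, stranded = partition_by_time(jobs, remaining_s=600)
print("runnable:", runnable)
print("stranded:", stranded)
```

With 600 seconds left, the padded job ends up stranded: the driver still counts it as remaining, but its request can no longer fit, which matches the stall described above.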

This was tested with a project-specific ats wrapper on a flux-scheduled cluster (not bootstrapped within slurm, etc.).

Jeremy

@dawson6
Member

dawson6 commented Mar 20, 2023

​[4:12 PM] Dawson, Shawn A.
This one needs to be better defined so we can find the real issue, whether it is in flux itself or in our use of flux. The simple solution of not passing the '-t' option to flux will cause other types of errors, so that is still up for discussion.
​[4:12 PM] Dawson, Shawn A.
Is there a way we can create a simple reproducer for this? I have several small test codes; if it is just a matter of ensuring I submit hundreds of them of a certain size or whatnot, we can make that happen.
<https://teams.microsoft.com/l/message/19:[email protected]/1679353822518?tenantId=a722dec9-ae4e-4ae3-9d75-fd66e2680a63&amp;groupId=07b41973-fc19-4954-88da-63aae39c8ca1&amp;parentMessageId=1678980904524&amp;teamName=ATS Testing System&channelName=General&createdTime=1679353822518&allowXTenantAccess=false>
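As a starting point for such a reproducer, the sketch below only builds the command lines for many short jobs and does not launch anything. The helper name `build_run_commands`, the default `sleep 0` payload, and the `"1m"` time limit are assumptions for illustration; the only flux pieces used are `flux mini run` and its `-t` option, both mentioned in this issue. Launching the commands concurrently (for example through ats or a process pool) is left to the caller.

```python
def build_run_commands(count, time_limit="1m", command=("sleep", "0")):
    """Return `count` argv lists of the form: flux mini run -t <time_limit> <command...>.

    count      -- number of short jobs to generate (hypothetical knob)
    time_limit -- value passed to -t; "1m" is an assumed placeholder
    command    -- the tiny payload each job runs; "sleep 0" is an assumed stand-in
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    return [
        ["flux", "mini", "run", "-t", time_limit, *command]
        for _ in range(count)
    ]


# Generate a few hundred identical fast-running jobs.
commands = build_run_commands(300)
print(len(commands), "commands, first:", " ".join(commands[0]))
```

Varying `count` and `time_limit` against the remaining allocation time would be one way to probe both the throughput stall and the `-t` behavior separately.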

@dawson6
Member

dawson6 commented Aug 16, 2023

@jwhite242

The 7.0.114 release has significant changes to how flux is used. Can you test this out on rzvernal and see if there are still issues?

@dawson6 dawson6 added the question Further information is requested label Aug 16, 2023

2 participants