3.11.1 slurmctld core dumps with error message: double free or corruption (!prev) #6529
Comments
I have not seen this issue again since the last restart (four days ago). There was something wrong with my reverse DNS lookup that I fixed before the restart. I don't want to believe that is it, but isn't it always DNS? Nonetheless, I am increasing my testing, but I've also modified the systemd service file for slurmctld to include the ability to self-restart.
Four new lines added, including in the [Service] section. Maybe something you might want to consider adding to the standard distribution?
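A minimal sketch of such a drop-in (the actual change used four lines; the directives and values below are illustrative assumptions, not the exact lines added):

```bash
# Illustrative slurmctld self-restart drop-in; Restart/RestartSec values are assumptions.
sudo mkdir -p /etc/systemd/system/slurmctld.service.d
sudo tee /etc/systemd/system/slurmctld.service.d/99-restart.conf > /dev/null <<'EOF'
[Service]
Restart=on-failure
RestartSec=10
EOF
sudo systemctl daemon-reload   # pick up the drop-in
```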
Hi @gwolski From the previous error logs it does look like a networking issue. Are you still facing the issue after the changes you made to DNS? Thank you for the suggestion on self-restarting slurmctld, we will track this in our backlog.
My DNS server did not have a reverse lookup zone, so slurmctld could not find my submission hosts. I don't know why this would be a requirement, but once I added my submission hosts to the reverse lookup zone, things quieted down. So I am suspicious this was the cause. I am now building a new 3.11.1 pcluster image with my latest custom AMI changes and will deploy it in a new 3.11.1 cluster. I will run it and then update this ticket. I will add a feature request for the self-restarting slurmctld so it is outside of this ticket.
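For anyone hitting the same thing, a quick sanity check that forward and reverse lookups agree for a submission host (the hostname and IP below are placeholders):

```bash
# Forward lookup of a submission host (placeholder name).
dig +short submit01.example.com
# Reverse lookup of its IP (placeholder address); this fails if no PTR record exists.
dig +short -x 10.0.0.25
```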
Hi @gwolski, Thank you for reporting the extra details and keeping us posted about the progress. In order for us to better investigate what has caused Slurm to crash, would you mind sharing the following if they are still available:
Thank you, Francesco
Hi @demartinofra, I'm starting srun jobs directly from the head node and I'm encountering random errors on one type of simulation job. I have some Tcl code that tries to write to stdout, and I get:
It's "random" as to why one job out of hundreds gets this error. (I'm running same regression over and over to stress the system). I can't imagine why I have a stale file handle. I'm using nfsv3 with AWS FSX Ontap storage and I never see this error elsewhere (I have a 3.9.1 cluster running this same code and it has had 10s of thousands of jobs go through). I'm running a custom Rocky 8.10 ami (with latest updates) with parallelcluster 3.11.1 overlaid. I'll update once I figure out this problem and enable submission hosts to the cluster and have success (or failure). |
Not something I've seen before. How are you mounting the AWS FSx for ONTAP storage on the compute nodes?
I'm mounting the filesystems with autofs. I will be enabling my remote submission hosts tomorrow.
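The mounts are plain NFSv3 automount entries, roughly along these lines (the SVM DNS name, mount point, and export path are placeholders):

```bash
# Hypothetical autofs entries for an FSx for ONTAP volume over NFSv3.
echo '/proj  /etc/auto.proj' | sudo tee -a /etc/auto.master
echo 'tools  -fstype=nfs,vers=3  svm-example.fsx.us-east-1.amazonaws.com:/vol_tools' | sudo tee /etc/auto.proj
sudo systemctl reload autofs
```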
Thanks for the update. Let us know if you are able to reproduce the coredump issue again, and in that case please share the content of the coredump file with us so that we can look into it.
@demartinofra I had a crash during the night. Reviewing the slurmctld.log file, again I'm seeing slurmctld unable to resolve compute node names. And now something clicked: I have my /etc/resolv.conf locked down, so it does not have an entry for my cluster_name.pcluster domain on the search line. I have now updated my /etc/resolv.conf search line to include "tsi4.pcluster" so compute names will resolve (though I have not modified it on my compute nodes); the resulting search line is sketched at the end of this comment.
However, I've been running the same setup in PC 3.9.1 for months with no crashes, and I do see the same occasional error message in those logs - the last one from Nov 6. So how on earth is Slurm resolving node names if I don't have my /etc/resolv.conf correct during the many days things "are just working"? Maybe this is the cause of the crash?
Right before the first coredump, my /var/log/messages has:
Nov 12 03:00:41 ip-10-6-2-5 slurmctld[9322]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-c7a-4xl-dy-sp-32-gb-16-cores-1:6818) failed: Name or service not known
And here's the start of the core dump info:
Nov 12 03:00:41 ip-10-6-2-5 systemd-coredump[3641170]: Process 9322 (slurmctld) of user 401 dumped core.
Stack trace of thread 3641156:
#0  0x0000149e3c4e952f __GI_raise (libc.so.6)
#1  0x0000149e3c4bce65 __GI_abort (libc.so.6)
#2  0x0000149e3c52a727 __libc_message (libc.so.6)
#3  0x0000149e3c531a2c malloc_printerr (libc.so.6)
#4  0x0000149e3c533abc _int_free (libc.so.6)
#5  0x0000149e3d595823 slurm_xfree (libslurmfull.so)
#6  0x0000149e3d527b1b slurm_free_node_alias_addrs_members (libslurmfull.so)
#7  0x0000149e3d4d56b4 destroy_forward (libslurmfull.so)
#8  0x0000000000426258 _thread_per_group_rpc (slurmctld)
#9  0x
I have a bunch of core dumps in:
$ ls -ltr /var/lib/systemd/coredump/
You asked me to run a gdb command above, but I don't have a /var/spool/slurm.state/ directory. Maybe you want me to run it on one of these core dumps (note they are .lz4 files - I am guessing I have to uncompress them first?).
Do you want one of the coredump files? It probably has confidential info - is there any way I can get it to you other than as an attachment?
Attached are all the files I believe you have asked for in these two tar files:
Please let me know how to provide the core files and if I can provide anything else.
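For completeness, the resolv.conf change mentioned above was just adding the cluster domain to the search line ("tsi4.pcluster" is this cluster's domain; the second domain shown is a placeholder):

```bash
# Check that the cluster's private domain is now on the search line.
grep '^search' /etc/resolv.conf
# expected to contain something like: search tsi4.pcluster us-west-2.compute.internal
```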
The team has root caused the "transcript error" message above and why it is happening with 3.11.1. It is due to our EDA tool having a five-minute timeout waiting for machines to come alive. If they do not come alive, it resubmits the job. So two copies of the same job were being submitted, and they were stomping on each other. The five-minute timeout worked just fine in 3.9.3, as the machines mostly came up in under five minutes. In 3.11.1 I am seeing machine start-up times of over seven minutes. I just built a cluster with your generic RHEL 8 AMI today, and the first machine I created had a boot time of 7:08, i.e. from CF to RUNNING. It was a spot r7a.medium. See #6479. I have been running 3.9.3 and saw this happen only once, just last week, as compared to about every 100 simulations in 3.11.1. We run hundreds of sims per day through PC right now.
We configure Slurm so that IP addresses are propagated by the controller. In particular, we explicitly set the IP address (NodeAddr) of each Slurm node as part of our instance provisioning process so that Slurm does not need to perform any DNS resolution. The fact that slurmctld is crashing when DNS resolution is misconfigured sounds like an issue in how Slurm is handling it. Once we have the coredumps available we can work with SchedMD to nail down the issue.
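As a quick sanity check on an affected cluster, NodeAddr should already contain the instance's IP for a node that has been launched (the node name below is taken from the logs above):

```bash
# If NodeAddr shows an IP address, the controller should not need DNS to reach this node.
scontrol show node sp-r7a-m-dy-sp-8-gb-1-cores-21 | grep -E 'NodeAddr|NodeHostName'
```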
This is unexpected. To process systemd coredumps (in case you don't have the coredumps in the slurm.state directory) you can use coredumpctl.
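Something along these lines (the --debugger-arguments option requires a fairly recent systemd, which is why it does not work on Rocky 8 as noted below):

```bash
# List slurmctld coredumps known to systemd-coredump.
coredumpctl list slurmctld
# Open the most recent one in gdb non-interactively and print a backtrace.
coredumpctl --debugger-arguments="-batch -ex bt" debug slurmctld
```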
You are right. If you have the possibility to open an AWS support case, you could do that and upload the file in the case. Otherwise I can reach out over email and provide you with a different way to safely upload the coredump.
I will ask the team to look into this.
I mistyped - I meant to say I have no core.NNN files in the /var/spool/slurm.state directory. Here is what I have in /var/spool/slurm.state:
I'm running on Rocky 8 and its coredumpctl does not support --debugger-arguments. I can run
I have attached that output. I have never used coredumpctl, and the last time I seriously used gdb was back in the 90s, so I'm really rusty there... I would need your help if you want me to do anything specific. I do not have AWS technical support right now. You can get my email from Pedro Gil or Allan Carter @cartalla in your company and send me a private note.
No worries at all, let me try to help :) If the coredumpctl version does not support the --debugger-arguments option, I guess we can first dump the coredumps and then process them separately with gdb. Can you try something like this:
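A sketch of the idea (the slurmctld binary path below is the usual ParallelCluster location and is an assumption):

```bash
# Extract the most recent slurmctld coredump from the journal to a file.
coredumpctl --output=/tmp/slurmctld.core dump slurmctld
# Run gdb non-interactively and save full backtraces of every thread.
gdb -batch -ex 'thread apply all bt full' /opt/slurm/sbin/slurmctld /tmp/slurmctld.core \
  > slurmctld_coredump_analysis.txt 2>&1
```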
If that works, can you send both the coredump and slurmctld_coredump_analysis.txt to us? I reached out to you over email so that you don't need to share them here.
Hmm, trying the coredumpctl command today, I'm not having success:
Yet this worked a few days ago. journalctl --verify shows things are "PASS", i.e. all OK. I tried googling for ideas as to why no core files get listed, but can't find anything. I can keep looking, but I'm hoping you have an idea...
It seems the journal files are being rotated, so coredumpctl can't find them anymore, but I do have the coredump files. I will package up all the coredump files for you to review. I was able to run your gdb command on each individual coredump file after uncompressing them. Details on how you can retrieve the information were sent in a private email.
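Roughly what that looked like (the file glob and binary path are illustrative):

```bash
# Uncompress each rotated coredump and run the same gdb command on it,
# appending everything to one analysis file.
for f in /var/lib/systemd/coredump/core.slurmctld.*.lz4; do
  lz4 -d "$f" "${f%.lz4}"
  gdb -batch -ex 'thread apply all bt full' /opt/slurm/sbin/slurmctld "${f%.lz4}" \
    >> slurmctld_coredump_analysis.txt 2>&1
done
```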
Attempting to move to ParallelCluster 3.11.1. I have been using 3.9.1 (with its known bug since May) without this issue.
The 3.9.1 setup is a custom Rocky 8.9 image with a ParallelCluster 3.9.1 overlay.
The new setup is a custom Rocky 8.10 AMI on which I have overlaid ParallelCluster 3.11.1 with pcluster build.
It deployed just fine and ran for a week with a small number of jobs submitted.
I then started banging on it a bit more, with hundreds of jobs per day. About every 24 hours, slurmctld dumps core, with the final core dump error message in /var/log/messages:
Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: double free or corruption (!prev)
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 30 11:04:30 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 2977/UID 0).
Oct 30 11:04:30 ip-10-6-11-248 systemd-coredump[2978]: Process 1374 (slurmctld) of user 401 dumped core.
Stack trace of thread 2531:
#0
I rebooted the machine, and 24 hours later:
Oct 31 10:48:04 ip-10-6-11-248 slurmctld[1383]: corrupted double-linked list
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Oct 31 10:48:04 ip-10-6-11-248 systemd[1]: Started Process Core Dump (PID 728511/UID 0).
Oct 31 10:48:04 ip-10-6-11-248 systemd-coredump[728512]: Process 1383 (slurmctld) of user 401 dumped core.
Stack trace of thread 728510:
#0
The error messages right before this are of the form:
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-21:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-21"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-25:6818) failed: Name or service not known
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-25"
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: unable to split forward hostlist
Oct 30 11:04:21 ip-10-6-11-248 slurmctld[1374]: slurmctld: error: _thread_per_group_rpc: no ret_list given
[duplicate messages not included, just different node names]
Oct 30 11:04:30 ip-10-6-11-248 slurmctld[1374]: slurmctld: agent/is_node_resp: node:sp-m7a-l-dy-sp-8-gb-2-cores-4 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication conne
[duplicate messages not included, just different node names]
Error messages in slurmctld.log from the same time as the second crash message above:
[2024-10-31T10:47:04.008] error: unable to split forward hostlist
[2024-10-31T10:47:04.008] error: _thread_per_group_rpc: no ret_list given
[2024-10-31T10:47:05.134] error: slurm_receive_msg [10.6.2.57:50156]: Zero Bytes were transmitted or received
[2024-10-31T10:47:18.718] error: slurm_receive_msg [10.6.9.248:41134]: Zero Bytes were transmitted or received
[2024-10-31T10:47:20.862] error: slurm_receive_msg [10.6.14.229:53816]: Zero Bytes were transmitted or received
[2024-10-31T10:47:22.137] error: slurm_receive_msg [10.6.2.57:50996]: Zero Bytes were transmitted or received
[2024-10-31T10:48:04.000] cleanup_completing: JobId=4094 completion process took 134 seconds
[2024-10-31T10:48:04.000] error: Nodes sp-r7a-m-dy-sp-8-gb-1-cores-37 not responding, setting DOWN
[2024-10-31T10:48:04.003] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-10:6818) failed: Name or service not known
[2024-10-31T10:48:04.003] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-10"
[2024-10-31T10:48:04.005] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-11:6818) failed: Name or service not known
[2024-10-31T10:48:04.005] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-11"
[2024-10-31T10:48:04.007] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-33:6818) failed: Name or service not known
[2024-10-31T10:48:04.007] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-33"
[2024-10-31T10:48:04.009] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores-36:6818) failed: Name or service not known
[2024-10-31T10:48:04.009] error: slurm_set_addr: Unable to resolve "sp-r7a-m-dy-sp-8-gb-1-cores-36"
[2024-10-31T10:48:04.010] error: xgetaddrinfo: getaddrinfo(sp-r7a-m-dy-sp-8-gb-1-cores
Has anyone seen this? I'm going back to 3.10.1 and will attempt to deploy that version.