A common tuning for MariaDB running under systemd on Linux is to set LimitNOFILE to something larger than the default. In our case, we set it to infinity, which means different things depending on the version of systemd:
64k prior to systemd 234
the value of fs.nr_open from systemd 234 onwards, which can be extraordinarily large: 1073741816 on RHEL 9
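To see which of the two meanings applies on a given host, the relevant values can be read directly. This is a minimal sketch, assuming a Linux system with /proc mounted; the variable names are only for illustration.

```shell
#!/bin/bash
# Soft and hard open-file limits of the current shell.
soft_limit=$(ulimit -Sn)
hard_limit=$(ulimit -Hn)
echo "soft nofile: $soft_limit"
echo "hard nofile: $hard_limit"

# Kernel ceiling that newer systemd substitutes for "infinity" (Linux only).
if [ -r /proc/sys/fs/nr_open ]; then
    nr_open=$(cat /proc/sys/fs/nr_open)
    echo "fs.nr_open: $nr_open"
fi
```

The limit that a running server actually inherited from its unit file can be read from the "Max open files" row of /proc/<pid>/limits for that process.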
The check_port function calls lsof. The problem is that lsof closes all file handles except stdin, stdout & stderr when it starts. When the nofile limit is high, this can take longer than some hard-coded timeouts, e.g. the ones in recv_joiner in the wsrep_sst_mariabackup script and in wait_for_listen.
So the check_port call needs to complete before the timeout configured in recv_joiner in order to signal to the donor that we're ready to receive the backup. This never occurs, because lsof is still busy closing file handles when the timeout expires. On RHEL 8 with LimitNOFILE=infinity set in the systemd unit file for MariaDB, everything is peachy as it's really 64k. But the same config migrated to RHEL 9 will result in being unable to bootstrap a cluster & there's very little in the way of logging to indicate why.
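One way to confirm the diagnosis is to time a single lsof call under different soft limits. This is a rough sketch, not the code from the SST scripts: time_lsof is a hypothetical helper, and port 4444 is assumed to be the SST port in use.

```shell
#!/bin/bash
# Hypothetical helper: time one lsof call under a given soft nofile limit.
time_lsof() {
    local limit="$1"
    if ! command -v lsof >/dev/null 2>&1; then
        echo "lsof not found"
        return 1
    fi
    (
        # Change the limit in a subshell so the caller's limit is untouched.
        ulimit -Sn "$limit" 2>/dev/null || exit 1
        start=$(date +%s)
        # Port 4444 is an assumption; the result of lsof itself is ignored.
        lsof -Pnl -i ":4444" >/dev/null 2>&1
        end=$(date +%s)
        echo "nofile=$limit elapsed=$((end - start))s"
    )
}

time_lsof 1024
```

Running it again with the limit mariadbd actually inherits (for example the value of fs.nr_open, if the hard limit allows it) should show whether lsof alone can outlast the SST timeouts on that host.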
Would it be reasonable to set some sane limits within the code that calls the scripts associated with wsrep_sst_method, or perhaps to call ulimit -n 4096 or similar within the wsrep_sst_* scripts? It really is a nasty gotcha.
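For illustration, the second suggestion could look something like the following. It is only a sketch: cap_nofile is a hypothetical helper that does not exist in the wsrep_sst_* scripts, and it assumes lsof sizes its close loop from the soft limit.

```shell
#!/bin/bash
# Hypothetical helper: lower the soft open-file limit to at most $1.
# A limit that is already at or below the cap is left untouched.
cap_nofile() {
    local cap="$1"
    local current
    current=$(ulimit -Sn)
    # Both "unlimited" and any number above the cap get lowered.
    if [ "$current" = "unlimited" ] || [ "$current" -gt "$cap" ]; then
        ulimit -Sn "$cap"
    fi
}

cap_nofile 4096
echo "soft nofile is now $(ulimit -Sn)"
```

Only the soft limit is changed, so anything the script starts afterwards could still raise its own limit back up to the hard limit if it needs to.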