Etcd cluster keeps getting corrupted on Talos since version 1.8.x upgrade ( etcd 3.5.16) #18933
Comments
The logs you provided only show request failures caused by an earlier corruption alarm. Can you provide the logs from the time the corruption happened?
This is what the log bundle contained at the time I generated it. I'm not sure what happened then or why it starts from that time. Maybe it got overwritten, since it is constantly reporting that the cluster is corrupted. Do you think any other log file from the bundle might help, or should I upload the whole bundle?
Could you provide the complete log of all etcd instances?
Yes, please. Is it possible to upload all the db files (under ${data_dir}/member/snap/db), if they don't contain any sensitive data?
Sorry, no. I only have log bundles from the nodes (the nodes themselves have been reset). I'll check the log bundles for sensitive data, remove it, and then provide them.
Here are the support log bundles I got from nodes running -
Unfortunately, the log in this bundle isn't complete either. The first log entry is the same ( I also do not see the data files as mentioned in #18933 (comment). If you can provide those files, I can analyze the db file directly.
I'm not sure how I could get the etcd db from Talos nodes, as there is no direct shell access, just commands via talosctl.
You might want to raise the question in the Talos community.
I have the same problem, and in my case it happens on the VMware platform. Cluster: 3 control plane nodes, 3 worker nodes. After a clean installation everything works fine for about 4-6 hours, then etcd goes into the error "etcdserver: no leader"; this is the last thing I could diagnose. I built the Talos image in Factory with extensions : customization:
Interestingly, if you completely power off the Talos virtual machines, wait a few seconds, and start them again one by one, beginning with the first control plane node, everything starts working again. Then after a few hours the whole thing repeats.
Bug report criteria
What happened?
Hi folks, I have a really odd issue that I'm troubleshooting. I have a 3-node Talos (1.8.3) cluster at home where etcd (3.5.16) keeps getting corrupted after a while. Initially I thought it could be a disk-related issue, so I bought brand new disks and swapped them in. I installed a new cluster last night (around 8pm), and when I woke up this morning (8am) the cluster was not working and etcd was reporting the cluster as corrupted.
Looking at the logs, it seems something happened around 6am, but I'm unable to work out what the cause is.
I have redeployed the cluster 4 times in the past week, and every time etcd has ended up corrupted.
Any help/guidance to troubleshoot this would be much appreciated.
What did you expect to happen?
The cluster should not get corrupted.
How can we reproduce it (as minimally and precisely as possible)?
I'm not sure how this can be reproduced in your environment, as I don't fully understand why it happens.
Anything else we need to know?
I have saved log bundles from all 3 cluster nodes using
talosctl -n node_ip support
I'm just not sure which log files would be helpful. If you could advise which logs are needed I can provide them:
the log bundle has folders:
kubernetes-logs
service-logs (etcd.log file here, I pasted it in the relevant log section)
and separately log files:
controller-runtime.log
dmesg.log
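To narrow down which part of etcd.log is worth sharing, something like the following could help locate the first entries that mention corruption. This is only a sketch: it assumes the log is one JSON object per line with `ts` and `msg` fields (etcd's default structured logging), the keyword list is a guess rather than a list of actual etcd messages, and `find_suspect_entries` is a hypothetical helper name.

```python
import json

# Hypothetical search terms; adjust to whatever actually shows up in your log.
KEYWORDS = ("corrupt", "alarm", "mismatch")


def find_suspect_entries(lines, keywords=KEYWORDS):
    """Return (timestamp, message) pairs for lines containing any keyword.

    Lines that parse as JSON objects contribute their "ts" and "msg" fields;
    any other non-empty line is returned as-is with an empty timestamp.
    """
    hits = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            entry = None
        if isinstance(entry, dict):
            ts = str(entry.get("ts", ""))
            msg = str(entry.get("msg", ""))
        else:
            ts, msg = "", raw
        # Match against the whole raw line so extra fields (e.g. "error") count too.
        lowered = raw.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((ts, msg))
    return hits
```

Calling it as `find_suspect_entries(open("etcd.log", encoding="utf-8"))` on the file from the service-logs folder returns the matching entries in file order, so the first element is the earliest match that survived in the bundle.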
Etcd version (please run commands below)
here is the output of EtcdConfigs.etcd.talos.dev file from node1
Etcd configuration (command line flags or environment variables)
metadata:
    namespace: etcd
    type: EtcdConfigs.etcd.talos.dev
    id: etcd
    version: 1
    owner: etcd.ConfigController
    phase: running
    created: 2024-11-20T20:09:34Z
    updated: 2024-11-20T20:09:34Z
spec:
    advertiseValidSubnets:
        - 10.1.1.0/24
    advertiseExcludeSubnets:
        - 10.1.1.30
    listenValidSubnets:
        - 10.1.1.0/24
    listenExcludeSubnets: []
    image: gcr.io/etcd-development/etcd:v3.5.16
    extraArgs:
        listen-metrics-urls: http://0.0.0.0:2381
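For readers unfamiliar with the valid/exclude subnet fields above, here is a rough sketch of how such a filter could be evaluated. It assumes the simplest reading (an address qualifies if it falls inside at least one valid subnet and inside no excluded one) and is not taken from the Talos source; `is_eligible` is a hypothetical name, and 10.1.1.11 is a made-up node address used only for illustration.

```python
import ipaddress


def is_eligible(addr, valid_subnets, exclude_subnets):
    """Return True if addr is in some valid subnet and in no excluded subnet.

    Assumption: an empty valid list matches nothing. The real implementation
    may treat that case differently.
    """
    ip = ipaddress.ip_address(addr)
    # A bare address such as "10.1.1.30" is parsed as a single-host /32 network.
    in_valid = any(ip in ipaddress.ip_network(net) for net in valid_subnets)
    in_excluded = any(ip in ipaddress.ip_network(net) for net in exclude_subnets)
    return in_valid and not in_excluded


# Values taken from the spec above.
ADVERTISE_VALID = ["10.1.1.0/24"]
ADVERTISE_EXCLUDE = ["10.1.1.30"]
```

Under this reading, `is_eligible("10.1.1.30", ADVERTISE_VALID, ADVERTISE_EXCLUDE)` is false, while any other address in 10.1.1.0/24 would be eligible to be advertised.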
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
I'm not sure how I can run the commands below on Talos.
Relevant log output