Hello, I’ve been working with Talos on EC2 instances (3 instances, one in each of 3 AZs). While deploying some applications using Helm, all the etcd nodes became unhealthy at some point.

t health
discovered nodes: ["10.0.122.99" "10.0.110.191" "10.0.94.120"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: 3 errors occurred:
* 10.0.110.191: service is not healthy: etcd
* 10.0.122.99: service is not healthy: etcd
* 10.0.94.120: service is not healthy: etcd

I attempted to reboot one of the nodes, but the main node went down and never came back up.

t reboot
◱ watching nodes: [{{ public ip }}]
* {{ public ip }}: unavailable, retrying...
t health
healthcheck error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp {{ public ip }}:50000: connect: connection refused"

Ping also doesn’t work. I tried connecting through the EC2 console, but that didn’t work either. Even after stopping and restarting the EC2 instance, there was no improvement.

I understand that the etcd quorum might be an issue, but before addressing that, I’d like to know how to restart the Talos cluster installed on EC2 instances. My main question is: How can I reboot or recover a node running on an EC2 instance in this situation?
Replies: 1 comment 7 replies
Your issue is something else and is not directly related to EC2. On AWS, you can pull the logs from the serial console to understand what is wrong. If etcd is unhealthy, something is seriously wrong, such as broken connectivity between the machines or resource exhaustion (not enough RAM/CPU).
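
As a rough sketch of how those logs could be collected (not a definitive procedure): the first helper fetches the EC2 serial console output with the AWS CLI, and the second asks a Talos node for its etcd service state and etcd logs. The function names `console_log` and `etcd_diag` are made up for illustration, the instance ID and node IP are whatever you pass in, and the AWS CLI is assumed to be configured with credentials that may call `ec2:GetConsoleOutput`.

```shell
# Sketch only: the instance ID and node IP passed in are placeholders.

# Print the EC2 serial console output for one instance.
# Needs AWS CLI credentials allowed to call ec2:GetConsoleOutput.
console_log() {
  aws ec2 get-console-output --instance-id "${1:?usage: console_log INSTANCE_ID}" \
    --output text --query Output
}

# Show the etcd service state and its logs on one Talos node.
# Only works while the Talos API on that node is still reachable.
etcd_diag() {
  talosctl -n "${1:?usage: etcd_diag NODE_IP}" service etcd
  talosctl -n "$1" logs etcd
}
```

For example, `console_log i-0123456789abcdef0` (a placeholder instance ID) prints the boot log even when the node does not answer on the network, while `etcd_diag 10.0.110.191` only helps for nodes that still respond to `talosctl`.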
Talos seems to boot fine as per this https://gist.github.com/vitamaxDH/fc466b32dae7ceca3c47aefbfc68e7b6#file-gistfile1-txt-L646 and it does reboot later on when requested. It could be a firewall rule or a local routing issue that prevents access.
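
One way to test the firewall or routing theory from the client side is to check whether the Talos API port (50000, as seen in the `connection refused` error above) accepts TCP connections at all. This is a minimal sketch, assuming `bash` and coreutils `timeout` are available; `check_port` is a hypothetical helper name, and the 5-second timeout is an arbitrary choice.

```shell
# Sketch only: try a TCP connect to HOST:PORT (PORT defaults to 50000,
# the Talos API port from the error message in this thread).
check_port() {
  host=${1:?usage: check_port HOST [PORT]}
  port=${2:-50000}
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait.
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port unreachable"
    return 1
  fi
}
```

Running it against the node's public IP from your workstation, and against the private IP from another instance in the same VPC, can help narrow it down: if only the public path fails, the security group, network ACL, or route table is a more likely culprit than Talos itself.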