How to reboot Talos cluster on AWS EC2 instance? #10058

vitamaxDH · 2024-12-27T05:14:56Z

Hello, I’ve been working with Talos on EC2 instances (3 instances, one in each of 3 AZs).

While I was deploying some applications with Helm, all the etcd members became unhealthy at some point.

t health
discovered nodes: ["10.0.122.99" "10.0.110.191" "10.0.94.120"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: 3 errors occurred:
    * 10.0.110.191: service is not healthy: etcd
    * 10.0.122.99: service is not healthy: etcd
    * 10.0.94.120: service is not healthy: etcd

I attempted to reboot one of the nodes, but the main node went down and never came back up.

t reboot
◱ watching nodes: [{{ public ip }}]
    * {{ public ip }}: unavailable, retrying...

t health
healthcheck error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp {{ public ip }}:50000: connect: connection refused"

Ping also doesn’t work. I tried connecting through the EC2 console, but that didn’t work either. Even after stopping and restarting the EC2 instance, there was no improvement.

I understand that the etcd quorum might be an issue, but before addressing that, I’d like to know how to restart the Talos cluster installed on EC2 instances.

My main question is: How can I reboot or recover a node running on an EC2 instance in this situation?
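
On the quorum concern: etcd needs a majority of its voting members to accept writes, so a three-member cluster keeps working with one member down but not with two. The sketch below is only that arithmetic in Python. The function names are made up for illustration and are not part of Talos or etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting etcd members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"members={n} quorum={quorum(n)} tolerated={tolerated_failures(n)}")
```

For the three-node cluster described here, quorum(3) is 2. If the three unhealthy members are really down rather than just slow, has_quorum(3, 0) is false, and bringing back a single node would not be enough on its own to restore etcd.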

smira · 2024-12-27T09:23:40Z
Maintainer

Your issue is something else, not related directly to EC2.

On AWS, you can pull the logs from the serial console to understand what is wrong.

If etcd is unhealthy, something is seriously wrong, such as broken connectivity between the machines or resource exhaustion (not enough RAM/CPU).

7 replies

vitamaxDH Dec 31, 2024
Author

Thanks, but rebooting / stopping & starting the instance still did not work. It still returns the same error.

frezbo Dec 31, 2024
Maintainer

Could you post the logs from the EC2 serial console, or a talosctl support bundle if some connectivity works?

vitamaxDH Dec 31, 2024
Author

@frezbo I have these system logs instead (gists). Would they also be helpful?

frezbo Dec 31, 2024
Maintainer

Talos seems to boot fine according to this log (https://gist.github.com/vitamaxDH/fc466b32dae7ceca3c47aefbfc68e7b6#file-gistfile1-txt-L646) and does reboot later when requested. It could be a firewall rule or a local routing issue that prevents access.
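
One way to narrow down a firewall or routing problem is to look at how a plain TCP connection to the Talos API port fails. The error quoted in the question shows port 50000 returning "connection refused". What follows is a minimal Python sketch using only standard-library sockets. probe_tcp is a hypothetical helper name, and the interpretation in the comments is a rule of thumb rather than a diagnosis.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the result.

    "open"    : something accepted the connection.
    "refused" : a reset came back, so packets reached a host or load
                balancer, but nothing was listening (or a rule rejected it).
    "timeout" : no answer at all, which is typical when packets are
                silently dropped (for example by a security group) or
                when there is no route to the target.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, "network unreachable", and similar errors.
        return f"error: {exc.strerror or exc}"
```

Calling probe_tcp with the node's address and port 50000, once from outside the VPC and once from another instance inside it, can show whether the API is unreachable everywhere or only along one path. The node address is left as a placeholder here, as it is in the question.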

Answer selected by vitamaxDH

vitamaxDH Dec 31, 2024
Author

Access with talosctl through the load balancer worked with the existing security groups, but it stopped working after the CPU spike and the failed reboot. I'll look into this more. I've decided to switch to a different instance family, such as M6, since t4g.small is too weak for production traffic. Thanks for your input.
