Etcd Pods Crashloop in GreptimeDB Deployment leading to data loss #5218

Open
cbisht31 opened this issue Dec 23, 2024 · 3 comments
Labels
C-bug Category Bugs

Comments

@cbisht31

What type of bug is this?

Crash

What subsystems are affected?

Distributed Cluster, Frontend, Datanode

Minimal reproduce step

  • Deploy GreptimeDB using the provided Helm chart in a Kubernetes cluster.
  • Ensure the etcd pods are using persistent volume claims for data storage.
  • Simulate a high-write workload or introduce a network partition between etcd peers.
  • Monitor the etcd pods; the affected pod (etcd-2 in this case) enters a crashloop state.
  • We observe that the cluster requires complete reinstallation, and data is lost if backups are not in place.

What did you expect to see?

  • Stable etcd pods with proper cluster health.
  • Automatic recovery of data or self-healing in case of minor disruptions.
  • No data loss during pod failures.

What did you see instead?

  • Etcd pods repeatedly crash and fail to restart.
  • Warning in Kubernetes: Back-off restarting failed container etcd
  • Cluster requires complete reinstallation, leading to data loss.

What operating system did you use?

Windows 10 x64

What version of GreptimeDB did you use?

0.9.5

Relevant log output and stack trace

etcd 10:12:03.99 INFO  ==> 
etcd 10:12:03.99 INFO  ==> Welcome to the Bitnami etcd container
etcd 10:12:03.99 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 10:12:03.99 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 10:12:03.99 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
etcd 10:12:03.99 INFO  ==> 
etcd 10:12:04.00 INFO  ==> ** Starting etcd setup **
etcd 10:12:04.04 INFO  ==> Validating settings in ETCD_* env vars..
etcd 10:12:04.04 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 10:12:04.05 INFO  ==> Initializing etcd
etcd 10:12:04.05 INFO  ==> Generating etcd config file using env variables
etcd 10:12:04.06 INFO  ==> Detected data from previous deployments
etcd 10:12:09.25 WARN  ==> Cluster not responding!
etcd 10:12:09.25 WARN  ==> Disaster recovery is disabled, the cluster will try to recover on it's own
etcd 10:12:09.25 INFO  ==> ** etcd setup finished! **

etcd 10:12:09.26 INFO  ==> ** Starting etcd **
{"level":"info","ts":"2024-12-20T10:12:09.278820Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_ADVERTISE_CLIENT_URLS","variable-value":"http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2379,http://etcd.greptimedemo.svc.cluster.local:2379"}
{"level":"info","ts":"2024-12-20T10:12:09.278886Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTH_TOKEN","variable-value":"jwt,priv-key=/opt/bitnami/etcd/certs/token/jwt-token.pem,sign-method=RS256,ttl=10m"}
{"level":"info","ts":"2024-12-20T10:12:09.278898Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_TLS","variable-value":"false"}
{"level":"info","ts":"2024-12-20T10:12:09.278910Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CLIENT_CERT_AUTH","variable-value":"false"}
{"level":"info","ts":"2024-12-20T10:12:09.278921Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_DATA_DIR","variable-value":"/bitnami/etcd/data"}
{"level":"info","ts":"2024-12-20T10:12:09.278959Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_ADVERTISE_PEER_URLS","variable-value":"http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2380"}
{"level":"info","ts":"2024-12-20T10:12:09.278970Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER","variable-value":"etcd-0=http://etcd-0.etcd-headless.greptimedemo.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.greptimedemo.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2380"}
{"level":"info","ts":"2024-12-20T10:12:09.278978Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"new"}
{"level":"info","ts":"2024-12-20T10:12:09.278986Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_TOKEN","variable-value":"etcd-cluster-k8s"}
{"level":"info","ts":"2024-12-20T10:12:09.279008Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_CLIENT_URLS","variable-value":"http://0.0.0.0:2379"}
{"level":"info","ts":"2024-12-20T10:12:09.279021Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LISTEN_PEER_URLS","variable-value":"http://0.0.0.0:2380"}
{"level":"info","ts":"2024-12-20T10:12:09.279028Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_LOG_LEVEL","variable-value":"info"}
{"level":"info","ts":"2024-12-20T10:12:09.279042Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"etcd-2"}
{"level":"info","ts":"2024-12-20T10:12:09.279052Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_PEER_AUTO_TLS","variable-value":"false"}
{"level":"warn","ts":"2024-12-20T10:12:09.279079Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_TRUSTED_CA_FILE="}
{"level":"warn","ts":"2024-12-20T10:12:09.279097Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_SERVICE_HOST=10.0.118.114"}
{"level":"warn","ts":"2024-12-20T10:12:09.279104Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2380_TCP_ADDR=10.0.118.114"}
{"level":"warn","ts":"2024-12-20T10:12:09.279111Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2379_TCP=tcp://10.0.118.114:2379"}
{"level":"warn","ts":"2024-12-20T10:12:09.279123Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DISABLE_STORE_MEMBER_ID=no"}
{"level":"warn","ts":"2024-12-20T10:12:09.279129Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_CONF_FILE=/opt/bitnami/etcd/conf/etcd.yaml"}
{"level":"warn","ts":"2024-12-20T10:12:09.279135Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_SNAPSHOT_HISTORY_LIMIT=1"}
{"level":"warn","ts":"2024-12-20T10:12:09.279143Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_ON_K8S=yes"}
{"level":"warn","ts":"2024-12-20T10:12:09.279153Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2380_TCP_PROTO=tcp"}
{"level":"warn","ts":"2024-12-20T10:12:09.279159Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_SNAPSHOTS_DIR=/snapshots"}
{"level":"warn","ts":"2024-12-20T10:12:09.279166Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_BIN_DIR=/opt/bitnami/etcd/bin"}
{"level":"warn","ts":"2024-12-20T10:12:09.279174Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_VOLUME_DIR=/bitnami/etcd"}
{"level":"warn","ts":"2024-12-20T10:12:09.279181Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2379_TCP_PORT=2379"}
{"level":"warn","ts":"2024-12-20T10:12:09.279186Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DEFAULT_CONF_DIR=/opt/bitnami/etcd/conf.default"}
{"level":"warn","ts":"2024-12-20T10:12:09.279196Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2379_TCP_ADDR=10.0.118.114"}
{"level":"warn","ts":"2024-12-20T10:12:09.279203Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2380_TCP_PORT=2380"}
{"level":"warn","ts":"2024-12-20T10:12:09.279210Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2380_TCP=tcp://10.0.118.114:2380"}
{"level":"warn","ts":"2024-12-20T10:12:09.279216Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_CLUSTER_DOMAIN=etcd-headless.greptimedemo.svc.cluster.local"}
{"level":"warn","ts":"2024-12-20T10:12:09.279222Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DISASTER_RECOVERY=no"}
{"level":"warn","ts":"2024-12-20T10:12:09.279229Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_KEY_FILE="}
{"level":"warn","ts":"2024-12-20T10:12:09.279235Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_CONF_DIR=/opt/bitnami/etcd/conf"}
{"level":"warn","ts":"2024-12-20T10:12:09.279241Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DAEMON_GROUP=etcd"}
{"level":"warn","ts":"2024-12-20T10:12:09.279246Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_START_FROM_SNAPSHOT=no"}
{"level":"warn","ts":"2024-12-20T10:12:09.279253Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_SERVICE_PORT_CLIENT=2379"}
{"level":"warn","ts":"2024-12-20T10:12:09.279260Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_INIT_SNAPSHOT_FILENAME="}
{"level":"warn","ts":"2024-12-20T10:12:09.279267Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_INIT_SNAPSHOTS_DIR=/init-snapshot"}
{"level":"warn","ts":"2024-12-20T10:12:09.279274Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DISABLE_PRESTOP=no"}
{"level":"warn","ts":"2024-12-20T10:12:09.279282Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_TMP_DIR=/opt/bitnami/etcd/tmp"}
{"level":"warn","ts":"2024-12-20T10:12:09.279289Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_SERVICE_PORT_PEER=2380"}
{"level":"warn","ts":"2024-12-20T10:12:09.279297Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_BASE_DIR=/opt/bitnami/etcd"}
{"level":"warn","ts":"2024-12-20T10:12:09.279305Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT_2379_TCP_PROTO=tcp"}
{"level":"warn","ts":"2024-12-20T10:12:09.279315Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_CERT_FILE="}
{"level":"warn","ts":"2024-12-20T10:12:09.279321Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_SERVICE_PORT=2379"}
{"level":"warn","ts":"2024-12-20T10:12:09.279327Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_PORT=tcp://10.0.118.114:2379"}
{"level":"warn","ts":"2024-12-20T10:12:09.279333Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_NEW_MEMBERS_ENV_FILE=/bitnami/etcd/data/new_member_envs"}
{"level":"warn","ts":"2024-12-20T10:12:09.279339Z","caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_DAEMON_USER=etcd"}
{"level":"warn","ts":"2024-12-20T10:12:09.279364Z","caller":"embed/config.go:689","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2024-12-20T10:12:09.279398Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd"]}
{"level":"warn","ts":"2024-12-20T10:12:09.279451Z","caller":"etcdmain/etcd.go:446","msg":"found invalid file under data directory","filename":"member_id","data-dir":"/bitnami/etcd/data"}
{"level":"info","ts":"2024-12-20T10:12:09.279470Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/bitnami/etcd/data","dir-type":"member"}
{"level":"warn","ts":"2024-12-20T10:12:09.279493Z","caller":"embed/config.go:689","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2024-12-20T10:12:09.279506Z","caller":"embed/etcd.go:128","msg":"configuring peer listeners","listen-peer-urls":["http://0.0.0.0:2380"]}
{"level":"info","ts":"2024-12-20T10:12:09.279651Z","caller":"embed/etcd.go:136","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2024-12-20T10:12:09.279770Z","caller":"embed/etcd.go:311","msg":"starting an etcd server","etcd-version":"3.5.17","git-sha":"507c0de","go-version":"go1.22.9","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"etcd-2","data-dir":"/bitnami/etcd/data","wal-dir":"","wal-dir-dedicated":"","member-dir":"/bitnami/etcd/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2380"],"listen-peer-urls":["http://0.0.0.0:2380"],"advertise-client-urls":["http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2379","http://etcd.greptimedemo.svc.cluster.local:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2024-12-20T10:12:09.355979Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/bitnami/etcd/data/member/snap/db","took":"14.307112ms"}
{"level":"info","ts":"2024-12-20T10:12:10.159838Z","caller":"etcdserver/server.go:511","msg":"recovered v2 store from snapshot","snapshot-index":200002,"snapshot-size":"12 kB"}
{"level":"info","ts":"2024-12-20T10:12:10.159888Z","caller":"etcdserver/server.go:524","msg":"recovered v3 backend from snapshot","backend-size-bytes":73584640,"backend-size":"74 MB","backend-size-in-use-bytes":73543680,"backend-size-in-use":"74 MB"}
{"level":"info","ts":"2024-12-20T10:12:10.651513Z","caller":"etcdserver/raft.go:540","msg":"restarting local member","cluster-id":"46376e04dfc54987","local-member-id":"d3b6dfbcd90d4670","commit-index":277384}
{"level":"info","ts":"2024-12-20T10:12:10.652930Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"d3b6dfbcd90d4670 switched to configuration voters=(4372864707720991929 14548331106888106469 15255626789952505456)"}
{"level":"info","ts":"2024-12-20T10:12:10.652971Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"d3b6dfbcd90d4670 became follower at term 7"}
{"level":"info","ts":"2024-12-20T10:12:10.652983Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"newRaft d3b6dfbcd90d4670 [peers: [3caf8948828d54b9,c9e60e0fb4d4c1e5,d3b6dfbcd90d4670], term: 7, commit: 277384, applied: 200002, lastindex: 277384, lastterm: 7]"}
{"level":"info","ts":"2024-12-20T10:12:10.653098Z","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5"}
{"level":"info","ts":"2024-12-20T10:12:10.653114Z","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"46376e04dfc54987","local-member-id":"d3b6dfbcd90d4670","recovered-remote-peer-id":"3caf8948828d54b9","recovered-remote-peer-urls":["http://etcd-1.etcd-headless.greptimedemo.svc.cluster.local:2380"]}
{"level":"info","ts":"2024-12-20T10:12:10.653122Z","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"46376e04dfc54987","local-member-id":"d3b6dfbcd90d4670","recovered-remote-peer-id":"c9e60e0fb4d4c1e5","recovered-remote-peer-urls":["http://etcd-0.etcd-headless.greptimedemo.svc.cluster.local:2380"]}
{"level":"info","ts":"2024-12-20T10:12:10.653127Z","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"46376e04dfc54987","local-member-id":"d3b6dfbcd90d4670","recovered-remote-peer-id":"d3b6dfbcd90d4670","recovered-remote-peer-urls":["http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2380"]}
{"level":"info","ts":"2024-12-20T10:12:10.653132Z","caller":"membership/cluster.go:287","msg":"set cluster version from store","cluster-version":"3.5"}
{"level":"info","ts":"2024-12-20T10:12:10.841836Z","caller":"mvcc/kvstore.go:423","msg":"kvstore restored","current-rev":12284}
{"level":"info","ts":"2024-12-20T10:12:10.844600Z","caller":"etcdserver/quota.go:94","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":"2024-12-20T10:12:10.847289Z","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:10.847313Z","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:10.847429Z","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:10.847431Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:10.847449Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:10.847461Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:10.847467Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9","remote-peer-urls":["http://etcd-1.etcd-headless.greptimedemo.svc.cluster.local:2380"]}
{"level":"info","ts":"2024-12-20T10:12:10.847498Z","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:10.847479Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:10.847538Z","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:10.847614Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:10.847630Z","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:10.847640Z","caller":"rafthttp/stream.go:169","msg":"started stream writer with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:10.847649Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5","remote-peer-urls":["http://etcd-0.etcd-headless.greptimedemo.svc.cluster.local:2380"]}
{"level":"info","ts":"2024-12-20T10:12:10.847656Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:10.847675Z","caller":"etcdserver/server.go:864","msg":"starting etcd server","local-member-id":"d3b6dfbcd90d4670","local-server-version":"3.5.17","cluster-id":"46376e04dfc54987","cluster-version":"3.5"}
{"level":"info","ts":"2024-12-20T10:12:10.847663Z","caller":"rafthttp/stream.go:395","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:10.847751Z","caller":"etcdserver/server.go:773","msg":"starting initial election tick advance","election-ticks":10}
{"level":"info","ts":"2024-12-20T10:12:10.847786Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/bitnami/etcd/data/member/snap","suffix":"snap.db","max":5,"interval":"30s"}
{"level":"info","ts":"2024-12-20T10:12:10.847810Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/bitnami/etcd/data/member/snap","suffix":"snap","max":5,"interval":"30s"}
{"level":"info","ts":"2024-12-20T10:12:10.847822Z","caller":"fileutil/purge.go:50","msg":"started to purge file","dir":"/bitnami/etcd/data/member/wal","suffix":"wal","max":5,"interval":"30s"}
{"level":"info","ts":"2024-12-20T10:12:10.847868Z","caller":"v3rpc/health.go:61","msg":"grpc service status changed","service":"","status":"SERVING"}
{"level":"info","ts":"2024-12-20T10:12:10.848883Z","caller":"embed/etcd.go:600","msg":"serving peer traffic","address":"[::]:2380"}
{"level":"info","ts":"2024-12-20T10:12:10.848901Z","caller":"embed/etcd.go:572","msg":"cmux::serve","address":"[::]:2380"}
{"level":"info","ts":"2024-12-20T10:12:10.848925Z","caller":"embed/etcd.go:280","msg":"now serving peer/client/metrics","local-member-id":"d3b6dfbcd90d4670","initial-advertise-peer-urls":["http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2380"],"listen-peer-urls":["http://0.0.0.0:2380"],"advertise-client-urls":["http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2379","http://etcd.greptimedemo.svc.cluster.local:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[]}
{"level":"warn","ts":"2024-12-20T10:12:10.852946Z","caller":"etcdserver/server.go:1154","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2024-12-20T10:12:10.852979Z","caller":"etcdserver/server.go:1155","msg":"data-dir used by this member must be removed"}
{"level":"warn","ts":"2024-12-20T10:12:10.853019Z","caller":"etcdserver/server.go:2161","msg":"failed to publish local member to cluster through raft","local-member-id":"d3b6dfbcd90d4670","local-member-attributes":"{Name:etcd-2 ClientURLs:[http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2379 http://etcd.greptimedemo.svc.cluster.local:2379]}","request-path":"/0/members/d3b6dfbcd90d4670/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-12-20T10:12:10.853065Z","caller":"etcdserver/server.go:2161","msg":"failed to publish local member to cluster through raft","local-member-id":"d3b6dfbcd90d4670","local-member-attributes":"{Name:etcd-2 ClientURLs:[http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2379 http://etcd.greptimedemo.svc.cluster.local:2379]}","request-path":"/0/members/d3b6dfbcd90d4670/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2024-12-20T10:12:10.853082Z","caller":"etcdserver/server.go:2151","msg":"stopped publish because server is stopped","local-member-id":"d3b6dfbcd90d4670","local-member-attributes":"{Name:etcd-2 ClientURLs:[http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2379 http://etcd.greptimedemo.svc.cluster.local:2379]}","publish-timeout":"7s","error":"etcdserver: server stopped"}
{"level":"info","ts":"2024-12-20T10:12:11.149960Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"d3b6dfbcd90d4670 switched to configuration voters=(4372864707720991929 14548331106888106469)"}
{"level":"info","ts":"2024-12-20T10:12:11.150020Z","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"46376e04dfc54987","local-member-id":"d3b6dfbcd90d4670","removed-remote-peer-id":"d3b6dfbcd90d4670","removed-remote-peer-urls":["http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2380"]}
{"level":"info","ts":"2024-12-20T10:12:11.150056Z","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:11.150076Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:11.150097Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:11.150108Z","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:11.150121Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:11.150130Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:11.150137Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"c9e60e0fb4d4c1e5"}
{"level":"info","ts":"2024-12-20T10:12:11.150141Z","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:11.150150Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:11.150164Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:11.150175Z","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:11.150181Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:11.150195Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"d3b6dfbcd90d4670","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:11.150200Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"3caf8948828d54b9"}
{"level":"info","ts":"2024-12-20T10:12:11.160126Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
{"level":"info","ts":"2024-12-20T10:12:11.160145Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
cbisht31 added the C-bug Category Bugs label Dec 23, 2024
@killme2008
Contributor

This is unexpected.

According to the etcd disaster recovery document (https://etcd.io/docs/v3.3/op-guide/recovery/), etcd is designed to withstand machine failures: an etcd cluster automatically recovers from temporary failures (e.g., machine reboots) and tolerates up to (N-1)/2 permanent failures for a cluster of N members.
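
To make the (N-1)/2 figure concrete, here is a small arithmetic sketch. The function names are hypothetical, and it assumes the usual majority rule for a Raft-based cluster (quorum is more than half of the members), with the division rounded down.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def fault_tolerance(members):
    """Permanent failures the cluster can absorb while keeping a majority: (N-1)/2, rounded down."""
    return (members - 1) // 2


for n in (1, 3, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
# members=1 quorum=1 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

For the three-member cluster in this issue (etcd-0, etcd-1, etcd-2), losing one member should therefore leave a working majority of two.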

@zyy17
Collaborator

zyy17 commented Dec 23, 2024

@cbisht31

Simulate a high-write workload or introduce a network partition between etcd peers.

Can you give me more context on how to simulate the network partition? I can use your scenarios to try to reproduce the issue.

@zyy17
Collaborator

zyy17 commented Dec 23, 2024

@cbisht31 In my experience, the cause might be the following:

  1. When initializing the etcd cluster, --initial-cluster-state=new must be set for each etcd peer, which the Bitnami chart does (refer to https://etcd.io/docs/v3.5/op-guide/configuration/).

  2. When one of the peers loses its connection for a long time (the WAL may have been compacted in the cluster in the meantime), it cannot rejoin the cluster using the original settings. You have to remove the old member and add the node as a new member (--initial-cluster-state should be existing) with a clean data directory.
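
The remove-and-re-add procedure from the second point can be sketched as an ordered plan. This is an illustration only, not a tested recovery script: `rejoin_plan` is a hypothetical helper that builds `etcdctl member list`, `member remove`, and `member add` command lines without running them, and the endpoint URL in the usage example is an assumption derived from the peer hostnames in the log. How to clear the data directory and set `ETCD_INITIAL_CLUSTER_STATE` under the Bitnami chart is left as a manual step.

```python
def rejoin_plan(member_id, name, peer_url, endpoints):
    """Build the ordered steps for replacing a stale etcd member.

    Nothing is executed. Each step is a (description, argv) pair, where argv is
    None for steps that have to be carried out by hand.
    """
    base = ["etcdctl", "--endpoints=" + ",".join(endpoints)]
    return [
        ("List members to confirm the current membership",
         base + ["member", "list"]),
        ("Remove the old member if it is still listed",
         base + ["member", "remove", member_id]),
        ("Empty the old member's data directory (manual step)",
         None),
        ("Register the node as a new member",
         base + ["member", "add", name, "--peer-urls=" + peer_url]),
        ("Start etcd with ETCD_INITIAL_CLUSTER_STATE=existing (manual step)",
         None),
    ]


# Values taken from the log in this issue; the endpoint of a healthy peer is assumed.
plan = rejoin_plan(
    member_id="d3b6dfbcd90d4670",
    name="etcd-2",
    peer_url="http://etcd-2.etcd-headless.greptimedemo.svc.cluster.local:2380",
    endpoints=["http://etcd-0.etcd-headless.greptimedemo.svc.cluster.local:2379"],
)
for description, argv in plan:
    print(description if argv is None else f"{description}: {' '.join(argv)}")
```

In the log above the cluster reports that etcd-2 has already been removed, so the `member remove` step may turn out to be unnecessary there; listing the members first shows whether it applies.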
