TiDB operator fails and hangs when updating status.status-port in spec.tidb.config #6014

Open
kos-team opened this issue Dec 26, 2024 · 0 comments

Bug Report

What version of Kubernetes are you using?
Client Version: v1.31.1
Kustomize Version: v5.4.2

What version of TiDB Operator are you using?
v1.6.0

What did you do?
We deployed a TiDB cluster with 3 replicas each of PD, TiKV, and TiDB. After the cluster was fully up and healthy, we changed spec.tidb.config and set status.status-port to 10079.

The last Pod terminated and restarted with the updated configuration. However, the TiDB operator cannot connect to the restarted Pod because the status port in the StatefulSet was still set to the old port number (10080), which is inconsistent with TiDB's configuration. The TiDB operator uses the port in the StatefulSet to query the health of the TiDB Pods and fails to get TiDB's status, so it mistakes the TiDB Pods for unhealthy ones and waits indefinitely.

The TiDB operator gets the status of the Pods in

health, err := m.deps.TiDBControl.GetHealth(tc, int32(id))

which constructs the URL with v1alpha1.DefaultTiDBStatusPort at

baseURL := fmt.Sprintf("%s://%s.%s.%s:%d", scheme, hostName, TiDBPeerMemberName(tcName), ns, v1alpha1.DefaultTiDBStatusPort)

The status port on the TiDB container is also hardcoded:

ContainerPort: int32(10080),
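
For context, here is a minimal, self-contained sketch (not the operator's actual code) of how the configured port could be resolved from the TOML in spec.tidb.config, falling back to the default 10080 only when status.status-port is absent. The helper name resolveStatusPort and the use of the BurntSushi/toml package are assumptions for illustration.

```go
package main

import (
	"fmt"

	"github.com/BurntSushi/toml"
)

const defaultTiDBStatusPort int32 = 10080

// tidbStatusConfig models only the part of spec.tidb.config relevant here.
type tidbStatusConfig struct {
	Status struct {
		StatusPort int32 `toml:"status-port"`
	} `toml:"status"`
}

// resolveStatusPort returns the user-configured status port, or the default
// when status.status-port is not set in the TOML.
func resolveStatusPort(rawTOML string) (int32, error) {
	var cfg tidbStatusConfig
	if _, err := toml.Decode(rawTOML, &cfg); err != nil {
		return 0, err
	}
	if cfg.Status.StatusPort != 0 {
		return cfg.Status.StatusPort, nil
	}
	return defaultTiDBStatusPort, nil
}

func main() {
	port, _ := resolveStatusPort("[status]\nstatus-port = 10079\n")
	fmt.Println(port) // 10079, not the hardcoded 10080
}
```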

How to reproduce

  1. Deploy a TiDB cluster, for example:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: '
      [performance]

      tcp-keep-alive = true
      '
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
  2. Add status.status-port to spec.tidb.config:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: '
      [performance]

      tcp-keep-alive = true

      [status]

      status-port = 10079
      '
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0

What did you expect to see?
We expected the TiDB Pods to restart and the new configuration to take effect.

What did you see instead?
The last Pod terminated and restarted. However, the operator cannot connect to the Pod because the status port was still set to 10080 in the StatefulSet. Thus, the operator thought the last Pod still needed time to become ready and hung.

Root Cause
The operator uses 10080 as the default value for the status port. When the user specifies status.status-port in spec.tidb.config, the operator still creates the StatefulSet using the default value, so the operator connects to the Pod on the wrong port.
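
To make the mismatch concrete, here is a small, self-contained illustration (not operator code; the Pod DNS name is hypothetical, only the port numbers come from this issue) of the two endpoints involved: the one the operator keeps querying with the hardcoded default, and the one the restarted TiDB Pod actually serves its status on.

```go
package main

import "fmt"

func main() {
	const (
		defaultStatusPort    = 10080 // hardcoded in the StatefulSet and the health check
		configuredStatusPort = 10079 // what the user set in spec.tidb.config
	)
	// Hypothetical Pod DNS name, for illustration only.
	host := "test-cluster-tidb-2.test-cluster-tidb-peer.default"

	// URL the operator keeps querying (built from the default port):
	fmt.Printf("operator queries: http://%s:%d/status\n", host, defaultStatusPort)
	// URL the restarted TiDB Pod actually serves its status on:
	fmt.Printf("tidb listens on:  http://%s:%d/status\n", host, configuredStatusPort)
	// Every health check hits the old port, so the rolling update never completes.
}
```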
