Failed update breaks redis cluster #1097
Comments
@Leo791, as the code shows:
// updateStatefulSet is a method to update statefulset in Kubernetes
func updateStatefulSet(cl kubernetes.Interface, logger logr.Logger, namespace string, stateful *appsv1.StatefulSet, recreateStateFulSet bool) error {
	_, err := cl.AppsV1().StatefulSets(namespace).Update(context.TODO(), stateful, metav1.UpdateOptions{})
	if recreateStateFulSet {
		sErr, ok := err.(*apierrors.StatusError)
		if ok && sErr.ErrStatus.Code == 422 && sErr.ErrStatus.Reason == metav1.StatusReasonInvalid {
			failMsg := make([]string, len(sErr.ErrStatus.Details.Causes))
			for messageCount, cause := range sErr.ErrStatus.Details.Causes {
				failMsg[messageCount] = cause.Message
			}
			logger.V(1).Info("recreating StatefulSet because the update operation wasn't possible", "reason", strings.Join(failMsg, ", "))
			propagationPolicy := metav1.DeletePropagationForeground
			if err := cl.AppsV1().StatefulSets(namespace).Delete(context.TODO(), stateful.GetName(), metav1.DeleteOptions{PropagationPolicy: &propagationPolicy}); err != nil { //nolint
				return errors.Wrap(err, "failed to delete StatefulSet to avoid forbidden action")
			}
		}
	}
	if err != nil {
		logger.Error(err, "Redis statefulset update failed")
		return err
	}
	logger.V(1).Info("Redis statefulset successfully updated ")
	return nil
}
StatefulSets are only deleted when you attempt to update forbidden fields, such as the volumeClaimTemplates field. Therefore, in my opinion, when a StatefulSet gets stuck in a pending state due to insufficient resources, we need to manually delete the StatefulSet (and its pods) under the current code design.
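The condition under which the quoted function deletes the StatefulSet can be reduced to a small predicate. The following is a minimal, standard-library-only sketch of that check, not the operator's code: `statusError` and `shouldRecreate` are hypothetical names, and `statusError` merely stands in for the two fields of `*apierrors.StatusError` that the function inspects.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusError is a hypothetical stand-in for *apierrors.StatusError,
// keeping only the two fields the quoted function looks at.
type statusError struct {
	Code   int
	Reason string
}

func (e *statusError) Error() string {
	return fmt.Sprintf("status %d: %s", e.Code, e.Reason)
}

// shouldRecreate (hypothetical helper) reports whether a failed update would
// lead to deleting the StatefulSet: the recreate flag must be set AND the
// update must have been rejected as 422 with reason "Invalid".
func shouldRecreate(updateErr error, recreateEnabled bool) bool {
	if !recreateEnabled || updateErr == nil {
		return false
	}
	sErr, ok := updateErr.(*statusError)
	if !ok {
		return false
	}
	return sErr.Code == http.StatusUnprocessableEntity && sErr.Reason == "Invalid"
}

func main() {
	invalid := &statusError{Code: 422, Reason: "Invalid"}
	conflict := &statusError{Code: 409, Reason: "Conflict"}

	fmt.Println(shouldRecreate(invalid, true))  // rejected as Invalid, flag set
	fmt.Println(shouldRecreate(invalid, false)) // flag not set
	fmt.Println(shouldRecreate(conflict, true)) // a different kind of failure
	fmt.Println(shouldRecreate(nil, true))      // update accepted by the API server
}
```

The last case is the one relevant to this issue: an update that the API server accepts but that leaves a pod Pending produces no error, so the delete branch is never reached regardless of the flag.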
But isn't that the purpose of the annotation redis.opstreelabs.in/recreate-statefulset: "true"?
No, we only recreate the StatefulSet when there is an update to forbidden fields. We cannot recreate the StatefulSet when a pod is pending because we cannot determine whether the pending state is temporary or permanent.
Got it, thank you! And what about the failover issue, where the operator is looking for a pod that doesn't exist? Are you aware of it?
Actually, the role string in the pod name does not represent the actual role of the Redis node. We should not rely on the pod name to identify its role.
Failover is handled by the Redis cluster itself, not by the operator. The operator simply creates resources and integrates them into a Redis cluster. Failover is automatically managed by the cluster.
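To illustrate reading roles from the cluster rather than from pod names, here is a minimal sketch that tallies roles from the text returned by the Redis `CLUSTER NODES` command, whose third field is a comma-separated flag list (`master`, `slave`, `fail`, and so on). Whether the operator counts nodes this way is not established in this thread; `countRoles` and `nodeRoles` are hypothetical names, and the node IDs and addresses in the sample are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeRoles tallies roles as reported by Redis itself (hypothetical type).
type nodeRoles struct {
	Masters, Replicas, Failed int
}

// countRoles parses CLUSTER NODES output. Each line has the form
//   <id> <ip:port@cport> <flags> <master-id> ...
// Nodes flagged "fail" are counted separately so that a deleted master
// still listed by the cluster does not inflate the master count.
func countRoles(clusterNodes string) nodeRoles {
	var r nodeRoles
	for _, line := range strings.Split(clusterNodes, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // blank or malformed line
		}
		flags := make(map[string]bool)
		for _, f := range strings.Split(fields[2], ",") {
			flags[f] = true
		}
		switch {
		case flags["fail"]:
			r.Failed++
		case flags["master"]:
			r.Masters++
		case flags["slave"]:
			r.Replicas++
		}
	}
	return r
}

func main() {
	// Invented sample: c3 is a master that has failed; d4 took over its slots.
	sample := `a1 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-5460
b2 10.0.0.2:6379@16379 master - 0 1700000000000 2 connected 5461-10922
c3 10.0.0.3:6379@16379 master,fail - 1700000000000 1700000000000 3 disconnected
d4 10.0.0.4:6379@16379 master - 0 1700000000000 4 connected 10923-16383
e5 10.0.0.5:6379@16379 slave a1 0 1700000000000 1 connected`

	fmt.Printf("%+v\n", countRoles(sample)) // prints {Masters:3 Replicas:1 Failed:1}
}
```

In this sample the cluster lists four entries with the `master` flag, but only three are live; a count that ignored the `fail` flag would report one master too many.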
But right now we are indeed using the pod name to identify the role, no? We think that is what's causing the problem.
How is the operator getting the pod roles and the number of masters or slaves? Is it through cluster nodes? If so, we believe that the fact that cluster nodes remains with a deleted master in
Hello, any news regarding this? We still believe that the operator is expecting the wrong number of master nodes, and that this is why the cluster remains inconsistent.
What version of redis operator are you using?
redis-operator version: v0.18.0
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (kubectl version)?
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
What did you do?
redis.opstreelabs.in/recreate-statefulset: "true" to the crd
What did you expect to see?
What did you see instead?
The operator throws the following errors:
And if we exec into a cluster pod and ask for the cluster node info we get:
We believe the operator is promoting the follower to leader, but expects it to be named leader-3 instead of follower-x. This blocks the update of the StatefulSet, and we cannot roll the cluster back to a healthy state.
Is there a way to prevent the failover, and therefore the promotion, from occurring?