Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NROP CR status is not aligned with the specs #920

Open
shajmakh opened this issue Apr 10, 2024 · 0 comments
Open

NROP CR status is not aligned with the specs #920

shajmakh opened this issue Apr 10, 2024 · 0 comments

Comments

@shajmakh
Copy link
Member

shajmakh commented Apr 10, 2024

steps to reproduce:

  • add tolerations to spec
  • delete them and see the status still contains them
  • however, RTE pods are restarted with no tolerations so it seems only a wrong reporting in NROP CR status

this is not particularly to Tolerations but can happen with the other config fields too.

spec:
  logLevel: Trace
  nodeGroups:
  - machineConfigPoolSelector:
      matchLabels:
        pools.operator.machineconfiguration.openshift.io/worker: ""
status:
  conditions:
  - lastTransitionTime: "2024-04-10T14:31:37Z"
    message: ""
    reason: Available
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-10T14:31:37Z"
    message: ""
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2024-04-10T14:31:37Z"
    message: ""
    reason: Progressing
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-04-10T14:31:37Z"
    message: ""
    reason: Degraded
    status: "False"
    type: Degraded
  daemonsets:
  - name: numaresourcesoperator-worker
    namespace: openshift-numaresources
  machineconfigpools:
  - config:
      infoRefreshMode: Periodic
      infoRefreshPeriod: 1s
      podsFingerprinting: Enabled
      tolerations:
      - effect: NoSchedule
        key: sriov
        operator: Equal
        value: "true"
    name: worker
shajmakh added a commit to shajmakh/numaresources-operator that referenced this issue Oct 17, 2024
We have been seeing several cases where the operator status doesn't get
updated correctly yet it's been hard to reproduce and track down the
reason (see https://issues.redhat.com/browse/OCPBUGS-16058 &
https://issues.redhat.com/browse/CNF-9080).

During the updates of the controller tests to extend the coverage of
reflecting the node group config under the status, with further
debugging it turned out that updating the operator status was skipped
in case of 2 similar successive conditions. This is wrong because each
condition status can vary with the reasons, for example, the operator is
successful with one set of NodeGroupConfig and later is configured with
another successful set of settings, which we still want to reflect.
Fortunately this is just a status reflection that explains why the RTE
daemonset for example gets updated even if the status on the NROP CR is
not up-to-date.

Fix this by relaxing the updateStatus() function not to make it a hard
requirement on the successive conditions to unmatch, and thus allowing
always update the status after the end of each reconciliation loop.

candidate fix for openshift-kni#920

Signed-off-by: Shereen Haj <[email protected]>
shajmakh added a commit to shajmakh/numaresources-operator that referenced this issue Oct 17, 2024
We have been seeing several cases where the operator status doesn't get
updated correctly yet it's been hard to reproduce and track down the
reason (see https://issues.redhat.com/browse/OCPBUGS-16058 &
https://issues.redhat.com/browse/CNF-9080).

During the updates of the controller tests to extend the coverage of
reflecting the node group config under the status, with further
debugging it turned out that updating the operator status was skipped
in case of 2 similar successive conditions. This is wrong because each
condition status can vary with the reasons, for example, the operator is
successful with one set of NodeGroupConfig and later is configured with
another successful set of settings, which we still want to reflect.
Fortunately this is just a status reflection that explains why the RTE
daemonset for example gets updated even if the status on the NROP CR is
not up-to-date.

Fix this by relaxing the updateStatus() function not to make it a hard
requirement on the successive conditions to unmatch, and thus allowing
always update the status after the end of each reconciliation loop.

candidate fix for openshift-kni#920

Signed-off-by: Shereen Haj <[email protected]>
shajmakh added a commit to shajmakh/numaresources-operator that referenced this issue Oct 17, 2024
We have been seeing several cases where the operator status doesn't get
updated correctly yet it's been hard to reproduce and track down the
reason (see https://issues.redhat.com/browse/OCPBUGS-16058 &
https://issues.redhat.com/browse/CNF-9080).

During the updates of the controller tests to extend the coverage of
reflecting the node group config under the status, with further
debugging it turned out that updating the operator status was skipped
in case of 2 similar successive conditions. This is wrong because each
condition status can vary with the reasons, for example, the operator is
successful with one set of NodeGroupConfig and later is configured with
another successful set of settings, which we still want to reflect.
Fortunately this is just a status reflection that explains why the RTE
daemonset for example gets updated even if the status on the NROP CR is
not up-to-date.

Fix this by relaxing the updateStatus() function not to make it a hard
requirement on the successive conditions to unmatch, and thus allowing
always update the status after the end of each reconciliation loop.

candidate fix for openshift-kni#920

Signed-off-by: Shereen Haj <[email protected]>
shajmakh added a commit to shajmakh/numaresources-operator that referenced this issue Oct 21, 2024
We have been seeing several cases where the operator status doesn't get
updated correctly yet it's been hard to reproduce and track down the
reason (see https://issues.redhat.com/browse/OCPBUGS-16058 &
https://issues.redhat.com/browse/CNF-9080).

During the updates of the controller tests to extend the coverage of
reflecting the node group config under the status, with further
debugging it turned out that updating the operator status was skipped
in case of 2 similar successive conditions. This is wrong because each
condition status can vary with the reasons, for example, the operator is
successful with one set of NodeGroupConfig and later is configured with
another successful set of settings, which we still want to reflect.
Fortunately this is just a status reflection that explains why the RTE
daemonset for example gets updated even if the status on the NROP CR is
not up-to-date.

Fix this by relaxing the updateStatus() function not to make it a hard
requirement on the successive conditions to unmatch, and thus allowing
always update the status after the end of each reconciliation loop.

candidate fix for openshift-kni#920

Signed-off-by: Shereen Haj <[email protected]>
shajmakh added a commit to shajmakh/numaresources-operator that referenced this issue Oct 21, 2024
We have been seeing several cases where the operator status doesn't get
updated correctly yet it's been hard to reproduce and track down the
reason (see https://issues.redhat.com/browse/OCPBUGS-16058 &
https://issues.redhat.com/browse/CNF-9080).

During the updates of the controller tests to extend the coverage of
reflecting the node group config under the status, with further
debugging it turned out that updating the operator status was skipped
in case of 2 similar successive conditions. This is wrong because each
condition status can vary with the reasons, for example, the operator is
successful with one set of NodeGroupConfig and later is configured with
another successful set of settings, which we still want to reflect.
Fortunately this is just a status reflection that explains why the RTE
daemonset for example gets updated even if the status on the NROP CR is
not up-to-date.

Fix this by relaxing the updateStatus() function not to make it a hard
requirement on the successive conditions to unmatch, and thus allowing
always update the status after the end of each reconciliation loop.

candidate fix for openshift-kni#920

Signed-off-by: Shereen Haj <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant