
post-start should check Node status.Conditions (expecting Status=false) #17

Open
poblin-orange opened this issue Sep 20, 2023 · 0 comments
Labels
bug Something isn't working


@poblin-orange
Member

poblin-orange commented Sep 20, 2023

In order to benefit from the bosh canary / max_in_flight mechanism, the bosh release should check all of the k8s node status.conditions in the bosh post-start script.
The expected state is Status=False.
Status=True should make post-start fail, thus preventing further impact on the following instance groups.

For example, for the Ready condition only: kubectl wait --for=condition=Ready node/agents-concourse-r1-z1-0 --timeout=10s
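The kubectl wait command above only looks at the Ready condition, whereas the proposal is to check every entry in status.conditions. A minimal sketch of listing them all, assuming kubectl is on the PATH with a kubeconfig for the cluster: it uses kubectl's JSONPath range/end syntax to print one Type=Status line per condition. The function name node_condition_lines is hypothetical and not part of the release.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bosh release): print every
# status.conditions entry of a node as "Type=Status", one per line,
# e.g. "NetworkUnavailable=False" or "Ready=True".
# Assumes kubectl is installed and configured for the target cluster.
node_condition_lines() {
  local node="$1"
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

For example, node_condition_lines agents-concourse-r1-z1-0 would print, among others, NetworkUnavailable=False and Ready=True for the node shown below.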

  conditions:
  - lastHeartbeatTime: "2023-08-29T17:04:07Z"
    lastTransitionTime: "2023-08-29T17:04:07Z"
    message: Cilium is running on this node
    reason: CiliumIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2023-09-20T15:35:02Z"
    lastTransitionTime: "2023-09-09T23:53:50Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready

Note that Ready has inverted semantics: for Ready, Status=True is the expected (healthy) state.
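Combining both rules (every condition should report False, except Ready, which should report True), a pass/fail check could look like the sketch below. It assumes the conditions arrive on stdin as Type=Status lines (for instance from kubectl's JSONPath output {range .status.conditions[*]}{.type}={.status}{"\n"}{end}); the function name and input format are hypothetical. It reports each unexpected condition on stderr so it lands in the post-start log, and treats any other value, including Unknown, as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical check: read "Type=Status" lines (one node condition per line)
# on stdin. Ready is expected to be "True"; every other condition, including
# third-party ones such as KernelDeadlock, is expected to be "False".
# Any other value (e.g. "Unknown"), or a missing Ready condition, is reported
# on stderr and makes the function return 1.
check_node_conditions() {
  local type status expected failed=0 seen_ready=0
  # "|| [ -n "$type" ]" also processes a final line without a trailing newline
  while IFS='=' read -r type status || [ -n "$type" ]; do
    [ -n "$type" ] || continue                # skip blank lines
    if [ "$type" = "Ready" ]; then
      seen_ready=1
      expected="True"                         # Ready has inverted semantics
    else
      expected="False"                        # all other conditions: False = healthy
    fi
    if [ "$status" != "$expected" ]; then
      echo "unexpected node condition: ${type}=${status} (expected ${expected})" >&2
      failed=1
    fi
  done
  if [ "$seen_ready" -eq 0 ]; then
    echo "node condition Ready not reported" >&2
    failed=1
  fi
  return "$failed"
}
```

With the YAML excerpt above, NetworkUnavailable=False and Ready=True pass. In a post-start script, a non-zero result followed by exit 1 is what would make bosh stop the rollout at the canary.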

https://kubernetes.io/docs/reference/node/node-status/#condition

| Node Condition | Description |
| --- | --- |
| Ready | True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds) |
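Since the reported status can lag behind the actual node state (Unknown is only set once node-monitor-grace-period has elapsed, and a restarted node may briefly report Ready=False), failing post-start on the very first sample may be too strict. One option, similar in spirit to the --timeout of kubectl wait, is to retry the check until a deadline. A minimal sketch; wait_until and its timeout/interval arguments are hypothetical, and suitable values would need tuning.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: run the given command every <interval> seconds
# until it succeeds or <timeout> seconds have elapsed.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2" start
  shift 2
  start=$(date +%s)
  until "$@"; do
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

For example, wait_until 120 5 my_node_check || exit 1, where my_node_check stands for whatever (hypothetical) function performs the condition check, would give the node up to about two minutes to settle before failing post-start.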

The standard node conditions are documented at https://kubernetes.io/docs/reference/node/node-status/#condition
Additional node conditions can be set by third-party components such as node-problem-detector; see https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/#exporter

For example, the node-problem-detector kernel monitor config declares a KernelDeadlock condition:
https://github.com/kubernetes/node-problem-detector/blob/ed94dff2cd827764dc43a9c90b0b3af773457dbd/config/kernel-monitor.json#L67-L70

"condition": "KernelDeadlock",

@poblin-orange poblin-orange added the bug Something isn't working label Sep 20, 2023
@gberche-orange gberche-orange changed the title post-start should check Node state (expected Ready) post-start should check Node status.Conditions (expected Ready) Sep 20, 2023
@gberche-orange gberche-orange changed the title post-start should check Node status.Conditions (expected Ready) post-start should check Node status.Conditions (expecting Status=false) Sep 20, 2023
@gberche-orange gberche-orange transferred this issue from orange-cloudfoundry/k3s-boshrelease Jan 10, 2024