As an operator I would like my apps to stay online during Kubernetes upgrades #600
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/176243873 The labels on this github issue will be updated when the story is started. |
Oh interesting, I'd expect that for production, HA (highly available) apps:
@Birdrock @davewalter @acosta11 I'd be curious to hear y'all's input here. |
I think your second point is mostly true. However, there are a few scenarios where apps would still go offline even with 2+ instances.
You can see how these scenarios can happen no matter the number of instances I choose in CF, and they may be more likely depending on the surge capacity. For instance, scenario 1 is very unlikely to happen if you were to create 5 instances (since all 5 instances would have to be scheduled on just 2 nodes), but it is still possible, and it would be a waste of resources just to achieve HA. The second scenario is not very likely for the majority of workloads, but the first is. cf-for-k8s specifically makes this downtime more likely because it uses StatefulSets. The following documentation details best practices around this:
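The primitive those best practices revolve around is the PodDisruptionBudget (PDB). A minimal sketch, with hypothetical names and labels (Eirini's real labels may differ):

```yaml
# Voluntary disruptions (e.g. node drains) must leave at least one pod running.
apiVersion: policy/v1         # policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app-guid: my-app-guid   # hypothetical label
```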
Curious what others perhaps with more K8S experience think though. |
@braunsonm Thank you for the explanation - that makes sense. I'll take a deeper look into Eirini and see how it matches against Kubernetes recommendations. My feeling is that Kubernetes already has the machinery to handle situations like this, but we aren't properly taking advantage of it. |
No problem @Birdrock I actually just made a small edit which might be relevant to your investigation of Eirini. |
Could we solve this without exposing PDBs to the end-user? I imagine that Eirini could automatically set up a PDB with a minAvailable of 1, or 50% of the desired instance count. |
Absolutely agree on taking yet another look at the StatefulSet vs Deployment/ReplicaSet discussion. IIRC the only reason to make it StatefulSets was to allow for addressing individual instances. /cc @herrjulz and also cross-referencing cloudfoundry/eirini#94 If we could deprecate routing to individual instances, I guess using Deployments could be a thing? |
That's true, if we deprecate routing to individual instances we could switch to Deployments instead of using StatefulSets. @bkrannich @voelzmo |
@herrjulz I guess this is still parallel to the question of PDBs, isn't it? |
@loewenstein yes it is parallel. As PDBs are set on a per-app basis, we could consider having Eirini create a PDB for every app that has more than one instance. |
50% could work but it makes assumptions about the load that an app can handle. I would prefer it being user configurable. |
@braunsonm one thought I had was that this setting might rather be a setting of the foundation instead of the individual app. Like, if you run an app in production on 20 instances you'll probably have some reason and likely fail to keep it available if you drop below idk 15 instances. If you didn't, why would you run 20 instances in the first place. This might be different for staging, qa, playground systems though. WDYT? |
@loewenstein Hmm I'm confused by your reply. It makes perfect sense what you said, but that's exactly why I was thinking it's better as an individual app setting vs. a foundational one. Because I don't care about some app in my dev space as much as one in prod. |
@braunsonm Good point. I was seeing dev foundation vs. prod foundation. With dev spaces and prod spaces in the same foundation, this is of course looking different. I'd still prefer not to expose PDBs to app developers. They shouldn't know anything about Pods or the details of Kubernetes node update. How's this handled with Diego BTW? |
@loewenstein Ah yea still I don't think that would be preferred. For instance I might have an app with two instances but still not really worry about downtime (perhaps it's a consumer for a rabbit queue and not user facing) I wouldn't want to make assumptions about what availability it needs just because some other app needs 75%. I'd prefer not to expose PDBs either. Not sure how Diego handles this. The only thing I could think of would be a new manifest property for |
@loewenstein In Diego-land, an app developer doesn't need to specify anything more than the desired number of instances. FYI, I'm a colleague of @braunsonm and CF-for-VMs operator, in case you're wondering where I'm coming from. 😄 In our CF-for-VMs foundations I don't think we've ever seen an app suffer total failure during a CF upgrade if it was running 2+ instances; I think this kind of behaviour is ultimately the goal that @braunsonm (and me, by extension!) are looking to have with cf-for-k8s. |
Reading earlier comments in this thread, I came here to say what @9numbernine9 has meanwhile said: I believe that a good (IMHO: the best) default is to do what Diego does because we'll see people moving over from CF-for-VMs not expecting a behavior change when it comes to CF app scaling. If we later on want to add additional flexibility by considering something like dev vs. prod foundations this is fine, but I'd advocate for keeping the status quo first. Re-reading @9numbernine9's comment, I'm not sure if it is suggesting to keep the exact same Diego behavior or if the suggestion is to at least keep an app instance up-and-running to be able to serve requests. As mentioned above, my strong preference would be to retain Diego behavior. |
That's why I've added the question about Diego behavior. My guess would be Diego drain makes sure all apps are properly evacuated to other cells. Getting the exact behavior could get complicated, though. |
Adding in @PlamenDoychev, both for visibility but also to add comments around Diego draining behavior in case they have not been covered here already. |
@bkrannich Sorry, I should've expressed myself more clearly! I don't necessarily think that emulating Diego's draining behaviour exactly should be a requirement, but providing a behaviour that keeps a subset of app instances alive during an upgrade probably should be. In my experience, Diego's behaviour seems quite reasonable: if an app is deployed with multiple instances, some of them stay up while cells are drained and updated. Personally, Diego's current drain behaviour makes existing infrastructure upgrades (e.g. a CF upgrade or upgrading all the stemcells) the kind of activity that can be done during working hours without disruption to clients, whereas if we didn't have this behaviour we would need to schedule upgrades during off-hours - and I hate working weekends. 😆 |
Keeping at least one instance could cause issues under load. Or are we discussing a PDB with minAvailable = instances - 1, thus minimizing the amount of app disruption? |
Depending on the ratio of app instances to k8s worker nodes, a PDB with min instances-1 is likely to block worker node updates I think. When instance count doubles the number of workers, an optimal spread would mean there's no worker that can be shut down. |
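For what it's worth, the minAvailable = instances - 1 constraint can also be expressed directly as `maxUnavailable: 1`, which keeps meaning the same thing when the app is rescaled; a sketch with a hypothetical selector:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  maxUnavailable: 1           # at most one instance down at a time, at any scale
  selector:
    matchLabels:
      app-guid: my-app-guid   # hypothetical label
```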
So far, we have educated our users around the current Diego behavior, which is that if they specify multiple instances, their app stays available during updates. As mentioned, I believe today's Diego behavior should be the default for cf-for-k8s as well (or at least there should be a system-wide option to retain this behavior) because we want people to upgrade to cf-for-k8s without too many changes (otherwise, why upgrade to cf-for-k8s and not alternatives for running stateless apps with buildpacks). I think part of the discussion here is different, which is: from a coding perspective, what options does K8s offer to achieve one or the other behavior once we have settled on a default? But I'd suggest making this a second step which is informed by answering the question of "what do our users want"? |
@loewenstein @bkrannich @9numbernine9 @braunsonm I just discussed this topic with the team and we realised that we already set PDBs, but we set it to 1 instance if the instance count is greater than 1. |
I agree that a default behaviour that matches Diego would be good, but I don't think that we should stop there. As @herrjulz said, I would like to see it configurable per app. Guaranteeing only a single available instance would take production workloads offline because of the increased load. Also interesting that you already default to at least a 1-instance PDB! I didn't notice that before. In that case the default behaviour already is what Diego has, and this issue is more about improving that, since currently only having a single instance available during an upgrade would result in downtime for higher-traffic apps. |
With the current approach, we have one single PDB setting for every app. If we want to make the setting more individual (e.g. per app), the cloud controller would need to provide this information to Eirini so that Eirini can set the PDB for every app individually. This would require some work on the CC side. |
Hi folks 👋, a bit off-topic, but I think that to decide what's best it's good to better understand how Diego maintains zero downtime for apps during cf/infra updates, so I decided to put in a few words about it. Note - to understand the details below it's good to have a basic understanding of how Diego works; it's enough to read the first few lines in this doc. During an update, bosh stops, updates and then starts all the jobs on a subset of the Diego cells.
In short, the behaviour is to kill an evacuating app instance only after it is replicated on another cell. Hope this gives some insight into how Diego handles updates. 😄 Another note - Diego always tries to schedule instances of a single app across different cells to maintain HA. PS0 - I'm not the author of the evacuation feature, but I'm familiar with the code base since I had to research it once. |
Hi all, we started to work on exposing the Eirini PDB setting in the Another thing to point out is that we have to deal with the fact that depending on the backend (diego or k8s) a user has more/less features available. |
Hi everyone! In Kubernetes this would be equivalent to
We're going to run a spike about this to see if there's a way to achieve Diego's behaviour in Kubernetes. If we can't find any, we might settle with |
Hey @gcapizzi I'd like to point out that I think this is acceptable, but I wanted to point it out. You could speed this up by changing the CF apps to deploy as Deployments. |
Hi @braunsonm! Isn't this a problem Diego is already facing? My main argument for this is that it would be the closest thing to what we already have on cf-for-vms (and have had for a long time). It's hard to establish what the contract we have with our users is in this case, which is why we tend to consider the cf-on-vms/Diego behaviour to be it, although it might not be the behaviour expected by all users. Maybe we need to talk with more operators about their experience upgrading cf-for-vms and how that would translate into expectations for cf-for-k8s. |
@loewenstein could you explain this in a bit more detail? I haven't seen this mentioned anywhere in the Kubernetes docs and it seems quite odd to be honest. We'll make sure that we test this scenario in our spike! |
Well, with e.g. 10 application instances on 5 nodes - if we consider an even spread - we'll have 2 instances on each node. But you might be right and it's just taking longer and Kubernetes is taking off Pods one by one (after spinning them up on other nodes) until finally one node is completely empty and can be replaced. I'd run an experiment though, before I'd go with |
It doesn't sound like it. From what was said above, the instance is started up on another cell before the old one is torn down, no matter how large an instance count you specify. This behaviour cannot be replicated because CF-for-k8s schedules apps as StatefulSets.
This is true, since the PDB would allow only one instance to be moved at a time. An example: I have a 10-node cluster on AKS with a surge capacity of 50%. This means during an upgrade AKS will give me 5 new nodes and will drain 5 old nodes at a time. Even though 5 nodes are draining, the app is only moving one instance at a time to the new node pool, which will stall the upgrade (potentially causing it to time out, though you'd have to confirm that). As @loewenstein said, I think some testing would be required. |
@braunsonm I believe this is the related issue in Eirini: [request] Support to create a K8s Deployment instead of Statefulset resource We'll revisit this after the Eirini team has completed work on generating Deployments instead of StatefulSets. cc @jspawar |
@Birdrock That does not seem related and they seem against the idea. Although a |
Hey everyone! We're currently investigating this, to see which options we have. Re: |
I commented on cloudfoundry/eirini#94 as well, still thought I'd throw it in here directly as well.
This won't help for cluster operations like |
This is interesting but seems more complicated than allowing a user to set |
I guess I should have stated, that I was not necessarily suggesting to implement this. It might just not be worth replicating the exact behaviour here. On the other hand, partially leaking out Kubernetes constructs sounds dangerous to me. Maybe, we should just decide on a reasonable expectation on the percentage of instances that are guaranteed to be up and running and accepting traffic, set this as PDB and rely on the cf-for-k8s/Kubernetes operator to detect and resolve any Node update issues that might be caused by those PDBs. |
Hi everyone! 👋 1. Set all PDBs to
|
Option 3 is still the most flexible option here. Your change will require us to run with double the number of instances we normally would need to, just to ensure we have a decent number of available instances during upgrades. I understand it's additional work but 2 and 4 are not satisfactory in the long term. 2 is still okay as a temporary solution but also does not line up with the process in CF for VMs. |
It's important to understand that the CF-for-VMs behaviour is just not achievable on Kubernetes. Kubernetes will never surge during node drains, so even being able to set |
@gcapizzi 3 would allow us to keep This is also better than 2 because we do not need to waste compute capacity in anticipation of an upgrade. |
Sorry, wrong number! I meant: |
@gcapizzi I suppose 4 would work. It just means all apps need to have a high guarantee of availability even if only one app in the whole foundation does! That is why 3 is ideal in my opinion. However with 4 being an optional setting to override 2 the operator would be aware of upgrade timeouts if they specified I'm fine with this |
@gcapizzi, @braunsonm: In a recent discussion with @emalm I believe the general sense was that it might overall be beneficial to look into # 5, even if it means giving up on a certain set of existing CF behavior (app instance index, etc.). On the other hand, my gut feeling continues to be to aim for something as close as possible to today's Diego behavior when it comes to keeping app instances running (I realize that I'm advocating for dropping and keeping CF compatibility at the same time here). Additionally, I wonder if it would help to increase the number of replicas when we know we are doing an update and decreasing the number of replicas again once the update is done. So, I guess my question would be: Would maybe the combination of the above allow us for a more Diego-like behavior? |
@bkrannich I think your proposal is essentially 6. We just don't think manipulating the instance count of a The unfortunate truth is that there is no way to replicate the behaviour of Diego. This is because Diego can evacuate cells knowing exactly what it can and cannot do when rescheduling containers, while in Kubernetes the drain and upgrade procedure needs to work with any I agree that switching to |
A couple of questions/comments on options 1&2, 3&4, and 5&6:
The rollouts should eventually leave a draining node empty so that it can be shut down and replaced with an updated node. |
I think there's a misconception that the Kubernetes draining procedure will either kill all Pods on the node or none of them, which is not true. Kubernetes will kill as many Pods as it can within all PDB constraints, and then stop, wait for 5 seconds and try again. This goes on until the node is completely drained. The only way to make this procedure completely stuck is to have PDBs with
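As an illustration of the "completely stuck" case, a PDB like the following would deny every eviction, so the drain loop described above would retry forever (names and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  maxUnavailable: 0           # no pod may ever be evicted voluntarily
  selector:
    matchLabels:
      app-guid: my-app-guid   # hypothetical label
```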
This would only really improve things if we used Deployments, as StatefulSets won't surge anyway. Some people have asked to implement Deployment surging during upgrades, but that doesn't seem to be going anywhere. I'm uneasy with trying to hijack the upgrade procedure ourselves. |
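For comparison, this is the surge behaviour Deployments already expose for rollouts (not for node drains); all names and values here are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%            # bring up extra pods before old ones are removed
      maxUnavailable: 0        # never drop below the desired replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:latest   # hypothetical image
```

Note that surging only kicks in when the pod template changes, i.e. during a rollout; a node drain still just evicts pods within the PDB constraints.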
@gcapizzi: I read the K8s 1.21 release notes and found https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown - not sure if such a feature would be of any help, though. |
Yes, I think this is related to how |
Is your feature request related to a problem? Please describe.
During Kubernetes upgrades, nodes are drained, which shuts down the pods running CF applications before they are started on a new node. This means an application could go up and down for the duration of the upgrade.
Describe the solution you'd like
Some way for developers who push apps to configure a PodDisruptionBudget so that their application can stay online. It should only be possible to configure this budget when there is more than 1 instance, so that upgrades can complete.

At first I was thinking CF-for-k8s could just add a budget with a minimum available of 1, but on larger deployments that see a lot of traffic, that would probably cause 503s if the service got hammered on a single replica.
You should only be able to configure this if your replica count is more than 1. Configuring a minimum available when you only have a single replica will mean that upgrades cannot happen at all.
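One hypothetical shape for such a setting in an app manifest (the `minimum-available` key is invented here for illustration; it is not an existing CF manifest property):

```yaml
applications:
- name: my-app
  instances: 4
  # hypothetical: translated into a PodDisruptionBudget with minAvailable: 3
  minimum-available: 3
```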
Describe alternatives you've considered
There are no alternatives. You can scale your CF apps and "hope for the best" that you won't have all of them unscheduled at the same time.