Support zone awareness #203
Hi @rverma-nsl, thank you for reporting this issue. Can you elaborate on what you are missing from the helm-chart?
If you look at the ingester_config reference here, https://cortexmetrics.io/docs/configuration/configuration-file/#ingester_config, you will observe that for zone awareness we currently need to provide the AZ per ingester. That means we actually need two different config files and two different deployments of ingesters. I am not sure if this can be handled any other way.
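For illustration, the per-zone piece would look something like this (the zone name is an example; field placement follows the linked ingester_config reference and is worth double-checking):

```yaml
# Per-zone part of the Cortex config: each zone's ingesters need their own
# availability_zone, while zone_awareness_enabled is shared ring config.
ingester:
  lifecycler:
    availability_zone: us-east-1a      # different value per deployment
    ring:
      replication_factor: 3
      zone_awareness_enabled: true
```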
Yeah, to be honest that sounds reasonable to me. With Helm you could also have multiple values files, so you could have one file for the general config and then two or more files with a very minimal config that is basically just the Cortex AZ settings (sketched below).
Otherwise I'm not sure how the cortex-helm-chart is supposed to support this, and I also don't think it makes sense.
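For illustration, a hypothetical per-zone overlay on top of the shared values.yaml (file names and value paths are assumptions about the chart's layout):

```yaml
# values-zone-a.yaml -- only the zone-specific pieces live here; everything
# else stays in the shared values.yaml.
config:
  ingester:
    lifecycler:
      availability_zone: us-east-1a            # example zone name
ingester:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a

# Applied per zone, e.g.:
#   helm install cortex-zone-a <chart> -f values.yaml -f values-zone-a.yaml
```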
The bigger issue here is making sure the pods end up in the zone they are configured for, and fortunately the chart supports a nodeSelector label. Following on the example above, the problem starts when you create different deployments (or statefulsets, for that matter) for each zone: they have to have different names, which the chart doesn't support. Say you add support for overriding deployment names; then you run into the same problem with the service name and the headless service: you don't want those to have different names. Now you have to add support for services to take names independent of the deployment names (currently they all take their name from the cortex.fullname helper).

This problem is not limited to ingesters; store gateways and distributors fall in the same category: different deployments are needed in each availability zone. In the end I don't like this approach, it works against the Kubernetes pod scheduler. Without making any change whatsoever, in the current version of the chart the pods are evenly distributed across AZs, which is great. The only thing that's left is to insert the AZ name in the Cortex config. I'm thinking an init container (which the chart supports adding) could do that.
I'm open to ideas if you wanna give that a go.
I don't really understand that point. When you use a different Helm release name, the names also change, and with Kubernetes namespaces I don't see any issue. I never deployed multiple AZs, so maybe I'm missing something.
You'd want a single Helm release, otherwise you'd end up with multiple querier and distributor services for the same Cortex instance.
Ah, I see. I feel like it could be in somebody's best interest to have multiple "sets". But for memberlist you would have to configure the memberlist service to pick up multiple AZs (the selector currently is quite strict). I don't think you can do that as of now without manually changing stuff. Either way, it seems this feature is not really supported by the helm-chart. I'm open to any suggestions.
Yeah, memberlist is another one; I use etcd so I don't have that problem. The idea is to create a dynamic config file for Cortex and add the zone awareness configuration to it from an init container that has get access to pods and nodes in the kube API. That would require adding a cluster role with read access to pods and nodes in the core API group and binding it to the cortex service account (similar to how it was done for the ruler sidecar). The init container would then look up the node the pod has been scheduled on, read that node's zone label, and write the zone into the config.
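Roughly like this (object names and the namespace are hypothetical; the subject would be whatever service account the chart creates):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cortex-zone-reader        # hypothetical name
rules:
  - apiGroups: [""]               # core API group
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cortex-zone-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cortex-zone-reader
subjects:
  - kind: ServiceAccount
    name: cortex                  # the chart's service account
    namespace: cortex             # hypothetical namespace
```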
For the init container to write a file, the config file would have to live on an emptyDir mount. Currently it comes from a read-only Secret or ConfigMap, so we would need the ability in the chart to override the config filename; the sketch after this comment includes such an override.
The init container would copy the read-only config that comes from the ConfigMap and replace the AZ name. Thoughts?
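A rough, untested sketch of how that could look on the ingester pod spec (the image, mount paths, and placeholder token are assumptions; it relies on the downward API for the node name and on the read access to nodes sketched above):

```yaml
# Hypothetical fragment of the ingester pod spec. The init container copies the
# read-only config into an emptyDir and substitutes the zone read from the node.
initContainers:
  - name: inject-zone
    image: bitnami/kubectl:latest      # any image that ships kubectl would do
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName   # known by the time the init container runs
    command:
      - /bin/sh
      - -c
      - |
        ZONE=$(kubectl get node "$NODE_NAME" \
          -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
        sed "s/__AVAILABILITY_ZONE__/${ZONE}/" /etc/cortex-ro/cortex.yaml \
          > /etc/cortex-rw/cortex.yaml
    volumeMounts:
      - name: config            # existing read-only ConfigMap/Secret mount
        mountPath: /etc/cortex-ro
      - name: runtime-config    # emptyDir shared with the cortex container
        mountPath: /etc/cortex-rw
volumes:
  - name: runtime-config
    emptyDir: {}
```

The cortex container would then be pointed at the file on the emptyDir (e.g. /etc/cortex-rw/cortex.yaml) via the config-file override mentioned above.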
I like the parsing approach. Wait, what are you trying to do exactly here with the ConfigMap? Are you trying to mount the non-AZ config and then alter and save it as an AZ-aware config for later use by the Cortex components? Do you have some thoughts on this, @kd7lxl?
I think a mutating admission controller might be a better way to inject the zone name into the cortex container arguments or environment. Surely there is an existing admission controller that is capable of injecting a node label into a pod environment, like https://github.com/Collaborne/kubernetes-aws-metadata-service (though ideally one that is still maintained). Otherwise, we can write one. Then we can configure Cortex to read that env var. Related reading: elastic/cloud-on-k8s#3933 (comment)
Update: we would also have to support a way to configure memberlist with the 2nd and 3rd zones.
Do you wanna give that a go, @danfromtitan? It would be great if we could just use something that already exists.
I'm looking into it. Do you want the webhook config/deployment added to the chart with a zone_aware_enabled flag around it?
There ain't any that does exactly what we want here, but I can change an existing one. Either way, the admission controller image is asking for a separate project: how do you want to go about that? There is an option to keep everything outside this chart and deploy the admission controller in a separate namespace.
👍 sounds good.
Yeah, I feared this might be the case. I guess I could ask if we could get a separate repo in the cortexproject space. Then we could use GitHub Actions to build/deploy the Docker image, and maybe even the GitHub registry.
I'm all ears. I mean, we could do our part here and maybe have a guide that references the controller, which you would have to set up yourself?
That'd be the cleanest approach; the chart project is already overloaded (the ruler sidecar is an example that comes to mind).
@kd7lxl - there is one fundamental issue with the idea of using an admission controller: at the time the API request makes it through the controller, there is no decision yet by the scheduler about the worker node where the pod will be deployed, meaning the AZ cannot be determined at that time. It seems to me an init container is the only way to determine the AZ, because by the time the init container is started, the pod has been scheduled to a given node. Without going into nasty hardwiring, I can't think of any other way to tell a cortex container the AZ where it's running.
The pod/binding event looked promising. Have you tried that?
That surely exists, right? Or would we also have to build that ourselves?
For an admission controller, I think the sequence would have to be like this:
I haven't tested this to confirm the sequence and events all work. Maybe I missed something.
initContainer would work too, but you still have to come up with a way to get the result of the initContainer into the environment of the primary container. I really don't like having to use wrapper scripts to replace a container's entrypoint, nor do I really want to give the initContainer permission to create ConfigMaps. The reason I like the admission controller approach is that the only thing that needs permission to mutate pods and create ConfigMaps is the admission controller, not the app itself. This is a commonly requested feature: kubernetes/kubernetes#62078 (and others). If we can provide a good solution, I'm sure many people will appreciate it.
We can also use https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ to ensure that pods are spread across AZs, and then use the ConfigMap with envFrom so the value would be correctly assigned.
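A minimal sketch of such a constraint on the ingester pods (the label selector is an assumption about how the chart labels its pods):

```yaml
# Spreads ingester pods evenly across zones. Note this only controls placement;
# it does not by itself tell Cortex which zone a given pod landed in.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: ingester
```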
I'm not sure the API server would run the resource admission through the admission controller twice between steps 3 and 4 like you've described. In fact, what I see happening at step 3 is that the pod resource is not blocked from starting; it's rather created and sits there looping with a "configmap not found" error. Once the error condition is cleared by creating the configmap, there is no UPDATE admission request sent to the controller; the pod rather follows through and completes the start-up sequence. I tried this twice, with a standalone pod and with pods from a deployment, same outcome.

An UPDATE operation could be initiated by an outside request (i.e. a helm chart update or kubectl apply), but not by having the admission controller sit in a loop. In the subsequent UPDATE operation I can indeed see the nodeName making it into the admission request, but that is the node where the pod runs before the update gets applied. I guess after the update the scheduler might move the pod to a different node (say, because node capacity has been changed by unrelated pods), so we still won't know where the pod lands next. Even if this would somehow work, there is still an open question of what happens with legitimate deployment updates: the trick with the missing configmap won't work a second time since the resource already exists.

By contrast, an initContainer is a bit simpler and is guaranteed to work all the time, since it runs in the same pod.
The only update that's needed in the chart is to change the hardcoded configuration file path in the Cortex container arguments to a value taken from a chart parameter (a hypothetical sketch follows).
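A hypothetical sketch of that parameterization (the value name and default path are assumptions, not the chart's actual keys):

```yaml
# values.yaml (hypothetical key; the default is assumed to match the current
# read-only mount path)
config_file: /etc/cortex/cortex.yaml

# In the component templates, the container args would then read the value,
# e.g. (Helm template excerpt shown as comments):
#   args:
#     - "-target=ingester"
#     - "-config.file={{ .Values.config_file }}"
```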
There is no problem spreading pods evenly across AZs with the current anti-affinity rules in the chart. The issue is telling the Cortex distributors/queriers where the ingesters run, so that time series can be replicated across different AZs and not be lost in case of a zone failure. Having read through the part of the Cortex code that implements writing across AZs, I'm not even sure zone awareness is worth the effort of implementing: the distributors merely attempt a fair distribution of time series within the configured replication factor and make no attempt to keep the traffic in the same AZ and save on cross-zone data transfer costs.
Couldn't find any "pod/binding" resource in the API reference. The implementation of binding a pod to a machine doesn't look like it leaves room for catching/updating the request in an outside admission controller.
It's called `pods/binding`:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
  ...
  rules:
    - apiGroups:
        - ""
      apiVersions:
        - v1
      operations:
        - CREATE
      resources:
        - pods
        - pods/binding
      scope: '*'
```

The AdmissionReview the webhook receives for the binding looks like this:

```json
{
  "kind": "AdmissionReview",
  "apiVersion": "admission.k8s.io/v1beta1",
  "request": {
    "uid": "645135a1-48c9-4550-8ca0-9e5d1343c3d5",
    "kind": {
      "group": "",
      "version": "v1",
      "kind": "Binding"
    },
    "resource": {
      "group": "",
      "version": "v1",
      "resource": "pods"
    },
    "subResource": "binding",
    "requestKind": {
      "group": "",
      "version": "v1",
      "kind": "Binding"
    },
    "requestResource": {
      "group": "",
      "version": "v1",
      "resource": "pods"
    },
    "requestSubResource": "binding",
    "name": "nginx",
    "namespace": "default",
    "operation": "CREATE",
    "userInfo": {
      "username": "system:kube-scheduler",
      "groups": [
        "system:authenticated"
      ]
    },
    "object": {
      "kind": "Binding",
      "apiVersion": "v1",
      "metadata": {
        "name": "nginx",
        "namespace": "default",
        "uid": "75fc040e-c49e-4418-8525-674c4deeabbc",
        "creationTimestamp": null,
        "managedFields": [
          {
            "manager": "kube-scheduler",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2022-02-08T03:16:50Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:target": {
                "f:kind": {},
                "f:name": {}
              }
            }
          }
        ]
      },
      "target": {
        "kind": "Node",
        "name": "kind-control-plane"
      }
    },
    "oldObject": null,
    "dryRun": false,
    "options": {
      "kind": "CreateOptions",
      "apiVersion": "meta.k8s.io/v1"
    }
  }
}
```

No mutation to this resource is required -- the artifact of this event would be creating a ConfigMap.
@kd7lxl, there are a couple of issues with putting the env vars in a ConfigMap.
Instead of ConfigMaps, I was going to explore the idea of adding container env vars from the binding admission request. Do you know whether the mutation is possible from the binding event? I'm seeing the error below while the pod sits in pending, and I'm not sure whether mutation is not allowed or there is another reason.
NVM, I answered my own question: the mutate admission applies to the object kind in the request, in this case "kind": "Binding", so it would not patch the pod here.
I agree that ConfigMaps are not needed. Labeling the pod works just fine -- no need to create another resource. I wrote a little proof of concept here: https://gist.github.com/kd7lxl/1997a4d162bd49d4dbdb8f69506a1e01. It does the whole process. I would not return a mutation/patch from the pods/binding AdmissionReview (pretty sure that wouldn't work); rather, I'd just use the event to trigger a labeling operation on the pod with a client. It's just slightly simpler than watching all pod events.
I had a peek at some admission webhook examples and the whole setup can be a bit tricky, especially since you need a proper certificate for your webhook server. We can automate that with something like https://github.com/newrelic/k8s-webhook-cert-manager (maybe not exactly this), but it sure as hell won't make things easier. So I'm asking myself at what point this becomes an entire software suite :D
A noob question: can't we use Istio? It has zone-aware routing, which could reduce the zone-awareness functionality needed from Cortex. Similar functionality exists in Cilium too.
Maybe. I wouldn't advise it, but that is definitely something you will have to try on your own :)
Fully functional admission controller that inserts env vars from node labels. There are no changes required to the helm chart. To enable zone awareness, deploy the admission controller and reference the injected environment variable in the Cortex config.
If there are no outstanding questions, this issue can be closed.
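For completeness, here is one way the Cortex config could consume the injected variable, assuming the controller exposes the zone as an environment variable (the variable name is hypothetical) and Cortex is started with -config.expand-env=true so placeholders are expanded at startup:

```yaml
# Hypothetical: AVAILABILITY_ZONE is injected by the admission controller from
# the node's topology.kubernetes.io/zone label; -config.expand-env=true makes
# Cortex substitute the placeholder when it loads the config.
ingester:
  lifecycler:
    availability_zone: ${AVAILABILITY_ZONE}
    ring:
      replication_factor: 3
      zone_awareness_enabled: true
```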
Just read this interesting article, which fits into this discussion.
@danfromtitan I must have overlooked this. Thank you for your implementation of the custom webhook. Are you willing to write up a small guide in our docs/ folder for setting it up (see the DEV Guide)?
Yeah, can do.
Hi folks, to confirm my understanding, we have a couple of options to make this work:
1. Deploy the external admission controller that injects the zone from node labels.
2. Support zone awareness directly in the Helm chart.
I think I prefer option 2 if we want to support this in the Helm chart. I will test out option 1 as well, though we (as users) usually don't want to add an additional component (with additional CVE concerns), so the Helm way is preferred in our case.
To support zone awareness, we need to create as many different ingester configurations as there are zones. Currently this is not possible through a helm install.