Auto-instrumentation lost on resumption of cluster from hibernation #1329
Is the hibernation shutting down all pods? If that's the case, I would say the OTEL operator starts after the application pods. (The OTEL operator uses a mutating admission webhook to install the auto-instrumentation, so it has to be running before the application pods are created.) Is there a way you could control the starting order of the pods, e.g. give the infra/OTEL operator pods a higher priority?
Thanks for the pointer. I will verify this and update you again tomorrow after another hibernation.
Yes, on our cluster this seems to be the case: the OTEL operator starts after the application pods. To mitigate the issue, I created a PriorityClass and assigned it to the operator deployment.
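For reference, a minimal sketch of such a PriorityClass; the name and value here are illustrative and the value must exceed whatever priority your application pods get:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: otel-operator-priority   # illustrative name
value: 1000000                   # must be higher than the application pods' priority
globalDefault: false
description: "Schedule the OpenTelemetry operator before application workloads."
```

The operator's Deployment then references it via `priorityClassName: otel-operator-priority` in its pod spec.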
@santoshkashyap any news on this ticket? Can we close it? As a fix, maybe we could set the priority class by default?
@santoshkashyap is this still a problem? We've refactored how reconciliation works, which I think should help with this.
Hi @jaronoff97,
@M1lk4fr3553r do you have an easy way to reproduce this? I ran the operator locally on a kind cluster with auto-instrumentation, and it idles and wakes up fine.
I have created this chart to show the issue.
@M1lk4fr3553r this is a limitation of our current webhook configuration. Right now we only get injection events on Pod creation (see here), and I'm not sure of the best way to get around that. The Istio operator functions the same way; I wonder if they have a way of solving this issue... I'll ask around and see if there's anything we can do here.
Having the operator delete arbitrary Pods sounds like a dangerous capability that I'd rather not add unless we have no other choice. If you'd like your Pods to wait until the operator starts and is able to inject instrumentation, you can set the webhook failurePolicy to Fail. This is a dangerous setting, as by default it will reject ALL Pods, the operator itself included. If you go down this path, please make sure to also set an objectSelector so the webhook only applies to Pods you actually want instrumented.
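A sketch of what that could look like on the operator's Pod mutating webhook; the service name, path, and the label used to exempt the operator's own pods are assumptions that would need to match your installation:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutation   # name may differ in your installation
webhooks:
  - name: mpod.kb.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                   # reject Pod creation while the operator is down
    clientConfig:
      service:
        name: opentelemetry-operator-webhook-service   # assumed service name
        namespace: opentelemetry-operator-system
        path: /mutate-v1-pod
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    objectSelector:                       # limit the blast radius of failurePolicy: Fail
      matchExpressions:
        - key: app.kubernetes.io/name     # illustrative: exempt the operator's own pods
          operator: NotIn
          values: ["opentelemetry-operator"]
```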
I would not simply delete the pods; I was thinking of triggering a rolling restart of the affected workloads instead. Also, in any case, this should be an option that is off by default, since I doubt anyone is shutting down their production cluster every day. For development and integration clusters, it does not seem uncommon to shut them down during non-working hours to save money.
I would suggest trying out the webhook settings first, since that seems like a more idiomatic solution to your problem. If you want a rolling restart of your Deployments/StatefulSets/DaemonSets, you can always create a Job with a simple Go program (or even a bash script) that waits until the operator is ready and then takes care of the restarts. That gives you control over exactly what happens to your workloads and in which order.
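A rough sketch of such a Job in bash, assuming an image that ships kubectl and a ServiceAccount allowed to read the operator's Deployment and restart your workloads (all names here are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: restart-after-otel-operator           # illustrative name
spec:
  template:
    spec:
      serviceAccountName: workload-restarter  # needs get on deployments + patch for restarts
      restartPolicy: OnFailure
      containers:
        - name: restarter
          image: bitnami/kubectl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Block until the operator reports its rollout as complete...
              kubectl -n opentelemetry-operator-system rollout status \
                deployment/opentelemetry-operator-controller-manager --timeout=10m
              # ...then roll the instrumented workloads so the webhook re-injects.
              kubectl -n my-app-namespace rollout restart deployment my-app
```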
To my knowledge, https://github.com/Azure/AKS/issues/4002 currently prevents setting the objectSelector correctly in AKS through the Helm chart, which means there is currently no reliable way to use auto-instrumentation and sidecar injection with the operator Helm chart on AKS. Also, I think the Helm chart's default settings should prevent this issue, since it isn't obvious at first. The way this is done in Dapr (periodically checking for, and deleting, pods that are missing their injected sidecars) may not be perfect, but it works for me.
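For comparison, that pattern could be approximated with a CronJob along these lines; it assumes an image with kubectl and jq, RBAC to list and delete pods, and that the injected volume is named `opentelemetry-auto-instrumentation` (all of which are assumptions to verify against your setup):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: reinject-missing-agents             # illustrative name
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-reaper    # needs list/delete on pods
          restartPolicy: OnFailure
          containers:
            - name: checker
              image: my-registry/kubectl-jq:latest   # assumed image with kubectl + jq
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Delete pods that asked for Java injection but are missing
                  # the agent volume the webhook would have added.
                  kubectl get pods -n my-app-namespace -o json \
                    | jq -r '.items[]
                        | select(.metadata.annotations["instrumentation.opentelemetry.io/inject-java"] == "true")
                        | select([.spec.volumes[]?.name] | index("opentelemetry-auto-instrumentation") | not)
                        | .metadata.name' \
                    | xargs -r -n1 kubectl delete pod -n my-app-namespace
```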
I'm pinning this issue for now, because I have to link to it fairly often and it's something I assume many users are confused by.
We have our K8s development clusters set to hibernate every day at the end of regular work hours; the cluster becomes active again the next day. We have set up the opentelemetry-operator on our cluster and configured an OpenTelemetry Collector as a DaemonSet, with the corresponding injection annotations on our pods (Java/NodeJS).
For Java:
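The standard Pod annotation for Java injection (the value `"true"` picks up the Instrumentation resource in the Pod's namespace):

```yaml
instrumentation.opentelemetry.io/inject-java: "true"
```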
For NodeJS:
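Likewise, the standard NodeJS injection annotation:

```yaml
instrumentation.opentelemetry.io/inject-nodejs: "true"
```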
With this setup, everything works fine. For example, for Java apps the Java agent is volume-mounted automatically; the agent instruments the application and ships traces to an OpenTelemetry Collector pod (created by the operator from the OpenTelemetryCollector CR), and the Collector pod in turn ships the traces to our observability backend. However, when the workloads resume the next day after hibernation, the instrumentation seems to be lost (see screenshot below). Not sure why this happens? There is not much information in the application logs, the OpenTelemetry daemon pod logs, or even the opentelemetry-operator-controller-manager pod in the opentelemetry-operator-system namespace.

Container spec before hibernation:
After resumption from hibernation: OpenTelemetry setup is lost
Thanks in advance!!!