Auto-instrumentation lost on resumption of cluster from hibernation #1329

Open
santoshkashyap opened this issue Dec 15, 2022 · 15 comments
Open
Labels
question Further information is requested

Comments

@santoshkashyap commented Dec 15, 2022

We have our K8s development clusters set to hibernate every day at the end of regular working hours; the cluster becomes active again the next day. We have set up the opentelemetry-operator on our cluster and configured the OpenTelemetry Collector as a DaemonSet, with the corresponding annotations on the pods (Java/NodeJS).
For Java:

```
# format: <namespace>/<Instrumentation CR name>
instrumentation.opentelemetry.io/inject-java=dev-opentelemetry/opentelemetry-instrumentation
```

For NodeJS:

```
# format: <namespace>/<Instrumentation CR name>
instrumentation.opentelemetry.io/inject-nodejs=dev-opentelemetry/opentelemetry-instrumentation
```

With this setup, everything works fine. For example, for Java apps the Java agent is volume-mounted automatically; the agent instruments the application and ships traces to an OpenTelemetry Collector pod (created by the operator from the Collector CR), which in turn ships them to our observability backend.

However, when the workloads resume the next day after hibernation, the instrumentation is lost (see screenshots below). We are not sure why this happens. There is not much information in the application logs, the OpenTelemetry daemon pod logs, or even the opentelemetry-operator-controller-manager pod in the opentelemetry-operator-system namespace.
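For reference, the annotation sits on the pod template; a minimal sketch of a Deployment (the name, namespace, and image here are illustrative, not our actual setup):

```yaml
# Illustrative Deployment showing where the injection annotation goes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-app            # example name
  namespace: dev-apps          # example namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-java-app
  template:
    metadata:
      labels:
        app: my-java-app
      annotations:
        # <namespace>/<Instrumentation CR name>, as above
        instrumentation.opentelemetry.io/inject-java: "dev-opentelemetry/opentelemetry-instrumentation"
    spec:
      containers:
        - name: app
          image: my-registry/my-java-app:latest   # example image
```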

Container spec before hibernation: [screenshot]

After resumption from hibernation, the OpenTelemetry setup is lost: [screenshot]

Thanks in advance!!!

pavolloffay added the question label on Dec 19, 2022
@pavolloffay (Member) commented Dec 19, 2022

Is the hibernation shutting down all pods? If that is the case, I would say that the OTEL operator starts after the application pods.

(The OTEL operator uses a mutating admission webhook to install the auto-instrumentation.)

Is there a way you could control the start order of the pods, e.g. give the infra/OTEL operator pods a higher priority?
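For illustration, a minimal sketch of such a PriorityClass (the name and value are arbitrary examples, not operator defaults):

```yaml
# Example PriorityClass to get infra/operator pods scheduled ahead of app pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: otel-operator-priority   # illustrative name
value: 1000000                   # pods without a class default to priority 0
globalDefault: false
description: "Schedule the OpenTelemetry operator before application pods."
```

The operator pod would then reference it via `priorityClassName: otel-operator-priority` in its pod spec.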

@santoshkashyap (Author)

Thanks for the pointer. I will verify this and update you again tomorrow, after another hibernation cycle.

@santoshkashyap (Author) commented Dec 27, 2022

> Is the hibernation shutting down all pods? If that is the case I would say that the OTEL operator starts after the application pods.

Yes, on our cluster this seems to be the case: the OTEL operator starts after the application pods. To mitigate this, I created a PriorityClass and assigned it to the opentelemetry-operator-controller-manager pod. I will update again after the next hibernation cycle on whether this approach works.

@pavolloffay (Member)

@santoshkashyap any news on this ticket?

Can we close it?

As a fix, maybe we could set the priority class by default?

@santoshkashyap (Author) commented Jan 5, 2023

Unfortunately, this still doesn't seem to work.

I have assigned a higher priority to the operator: [screenshot]

The application pods still have no priority class assigned, so they default to 0.

With this setup, I still see application pods in the Running state while the OTEL operator is still in ContainerCreating: [screenshot]

Also, even though the OTEL operator is scheduled early, it seems to take some time to complete container creation.

A workaround we are discussing is a CronJob that runs daily and rollout-restarts the application deployments after resumption, roughly like the sketch below. Meanwhile, if there is anything else I can try, please let me know.
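A rough sketch of that workaround (the namespace, schedule, and ServiceAccount are illustrative; the ServiceAccount needs RBAC to patch deployments in the target namespace):

```yaml
# Daily CronJob that rollout-restarts app deployments after the cluster resumes.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-after-resume
  namespace: dev-apps
spec:
  schedule: "30 6 * * 1-5"   # shortly after the cluster wakes up on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: deployment-restarter   # needs patch on deployments
          restartPolicy: OnFailure
          containers:
            - name: restarter
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "-c"]
              args:
                - kubectl rollout restart deployment -n dev-apps
```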

@jaronoff97 (Contributor)

@santoshkashyap is this still a problem? We've refactored how reconciliation works, which I think should help with this.

@M1lk4fr3553r

Hi @jaronoff97,
yes, this issue still exists in version 0.96.0.
Feel free to ping me if you need any assistance in resolving it.

@jaronoff97 (Contributor)

@M1lk4fr3553r do you have an easy way to reproduce this? I run the operator locally on a kind cluster with auto-instrumentation, and it idles and wakes fine.

@M1lk4fr3553r

I have created this chart to show the issue.
Once you deploy the chart, you will notice that the pod created from deployment-to-instrument has not been injected.
This occurs because the pod is created before the operator pod is ready to inject other pods (this behavior is documented here).
Ideally, there should be a way for the operator to restart pods that should have been injected but weren't.

@jaronoff97 (Contributor)

@M1lk4fr3553r this is a limitation of our current webhook configuration. Right now we only get injection events on pod creation (see here), and I'm not sure of the best way to get around that. The Istio operator functions the same way; I wonder if they have a way of solving this issue... I'll ask around and see if there's anything we can do here.
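For context, the relevant pod-webhook rule looks roughly like this (paraphrased fragment): only CREATE operations on pods trigger the webhook, so pods that already exist when the operator comes up are never mutated.

```yaml
# Paraphrased fragment of the operator's pod mutating webhook rules.
rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]   # no injection events for already-running pods
    resources: ["pods"]
```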

@swiatekm (Contributor) commented Apr 4, 2024

Having the operator delete arbitrary Pods sounds like a dangerous capability that I'd rather not add unless we have no other choice.

If you'd like your Pods to wait until the operator starts and is able to inject instrumentation, you can set the webhook failurePolicy to Fail. The Pod will be rejected by the API server, and its controller will keep retrying until it succeeds.

This is a dangerous setting: by default it will reject ALL Pods, the operator itself included. If you go down this path, please make sure to also set an objectSelector on the webhook so it ignores your critical system services.
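For illustration, a fragment of what that could look like on the operator's MutatingWebhookConfiguration (the webhook name and the exemption label are illustrative; all other fields stay at their defaults):

```yaml
# Fragment: fail closed, but exempt critical system workloads from the webhook.
webhooks:
  - name: mpod.kb.io                  # illustrative pod-injection webhook name
    failurePolicy: Fail               # reject Pods while the webhook is unreachable
    objectSelector:                   # exempt critical system services
      matchExpressions:
        - key: app.kubernetes.io/name # example exemption label
          operator: NotIn
          values:
            - opentelemetry-operator  # never block the operator itself
```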

@M1lk4fr3553r

> Having the operator delete arbitrary Pods sounds like a dangerous capability that I'd rather not add unless we have no other choice.

I would not simply delete the pods; I was thinking of rollout-restarting the deployment. That way a new pod can spin up first, and there should be no risk of downtime.

Also, in any case, this should be an option that is off by default, since I doubt anyone shuts down their production cluster every day. For development and integration clusters, though, it is not uncommon to shut them down outside working hours to save money.

@swiatekm (Contributor) commented Apr 8, 2024

I would suggest trying the webhook settings first, since that seems like the more idiomatic solution to your problem. If you want a rolling restart of your Deployments/StatefulSets/DaemonSets, you can always create a Job with a simple Go program (or even a bash script) that waits until the operator is ready and then takes care of the restarts. That way you control exactly what happens to your workloads and in which order.
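A minimal sketch of such a Job, assuming a kubectl-based script is enough (names and namespaces are examples; the ServiceAccount needs RBAC for the rollout status/restart verbs):

```yaml
# One-shot Job: wait for the operator to be ready, then restart workloads.
apiVersion: batch/v1
kind: Job
metadata:
  name: restart-instrumented-workloads
  namespace: dev-apps
spec:
  template:
    spec:
      serviceAccountName: deployment-restarter
      restartPolicy: OnFailure
      containers:
        - name: restarter
          image: bitnami/kubectl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # block until the operator deployment reports ready
              kubectl rollout status deployment/opentelemetry-operator-controller-manager \
                -n opentelemetry-operator-system --timeout=10m
              # then restart the workloads so the webhook can inject them
              kubectl rollout restart deployment -n dev-apps
```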

@KarstenWintermann

To my knowledge, https://github.com/Azure/AKS/issues/4002 currently prevents setting the objectSelector correctly in AKS through the Helm chart, which means there is currently no reliable way to use auto-instrumentation and sidecar injection with the operator Helm chart on AKS. I also think the Helm chart's default settings should prevent this issue, since it initially isn't obvious.

The way this is done in Dapr (periodically checking for, and deleting, pods whose injected sidecars are missing) may not be perfect, but it works for me.

@jaronoff97 (Contributor)

I'm pinning this issue for now, because I have to link to it fairly often and it is something I assume many users are confused by.
