Auto-instrumentation lost on resumption of cluster from hibernation #1329
Is the hibernation shutting down all pods? If that's the case, I would say the OTEL operator starts after the application pods. (The OTEL operator uses a mutating admission webhook to install the auto-instrumentation, so it has to be running before the application pods are created.) Is there a way you could control the starting order of the pods, e.g. give the infra/OTEL operator pods a higher priority?
Thanks for the pointer. I will verify this and update you again tomorrow after another hibernation.
Yes, on our cluster this seems to be the case: the OTEL operator starts after the application pods. To mitigate the issue, I created a PriorityClass and assigned it to the operator deployment.
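For reference, a minimal sketch of such a PriorityClass; the name and value here are illustrative and the value must exceed whatever priority your application pods get:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: otel-operator-priority   # illustrative name
value: 1000000                   # must be higher than the application pods' priority
globalDefault: false
description: "Schedule the OpenTelemetry operator before application workloads."
```

The operator's Deployment then references it via `priorityClassName: otel-operator-priority` in its pod spec.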
@santoshkashyap any news on this ticket? Can we close it? As a fix, maybe we could set the priority class by default?
@santoshkashyap is this still a problem? We've refactored how reconciliation works, which I think should help with this.
Hi @jaronoff97,
@M1lk4fr3553r do you have an easy way to reproduce this? I ran the operator locally on a kind cluster with auto-instrumentation, and it idles and wakes up fine.
I have created this chart to show the issue.
@M1lk4fr3553r this is a limitation of our current webhook configuration. Right now we only get injection events on Pod creation (see here), and I'm not sure of the best way to get around that. The Istio operator functions the same way; I wonder if they have a way of solving this issue... I'll ask around and see if there's anything we can do here.
Having the operator delete arbitrary Pods sounds like a dangerous capability that I'd rather not add unless we have no other choice. If you'd like your Pods to wait until the operator starts and is able to inject instrumentation, you can set the webhook failurePolicy to Fail. This is a dangerous setting, as by default it will reject ALL Pods, the operator itself included. If you go down this path, please make sure to also set an objectSelector so the webhook only applies to Pods you actually want instrumented.
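A sketch of what that could look like on the operator's Pod mutating webhook; the service name, path, and the label used to exempt the operator's own pods are assumptions that would need to match your installation:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutation   # name may differ in your installation
webhooks:
  - name: mpod.kb.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                   # reject Pod creation while the operator is down
    clientConfig:
      service:
        name: opentelemetry-operator-webhook-service   # assumed service name
        namespace: opentelemetry-operator-system
        path: /mutate-v1-pod
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    objectSelector:                       # limit the blast radius of failurePolicy: Fail
      matchExpressions:
        - key: app.kubernetes.io/name     # illustrative: exempt the operator's own pods
          operator: NotIn
          values: ["opentelemetry-operator"]
```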
I would not simply delete the pods; I was thinking of triggering a rolling restart of the affected workloads instead. Also, in any case, this should be an option that is off by default, since I doubt anyone is shutting down their production cluster every day. For development and integration clusters, it does not seem uncommon to shut them down during non-working hours to save money.
I would suggest trying out the webhook settings first, since that seems like a more idiomatic solution to your problem. If you want a rolling restart of your Deployments/StatefulSets/DaemonSets, you can always create a Job with a simple Go program (or even a bash script) that waits until the operator is ready and then takes care of the restarts. That gives you control over exactly what happens to your workloads and in which order.
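A rough sketch of such a Job in bash, assuming an image that ships kubectl and a ServiceAccount allowed to read the operator's Deployment and restart your workloads (all names here are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: restart-after-otel-operator           # illustrative name
spec:
  template:
    spec:
      serviceAccountName: workload-restarter  # needs get on deployments + patch for restarts
      restartPolicy: OnFailure
      containers:
        - name: restarter
          image: bitnami/kubectl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Block until the operator reports its rollout as complete...
              kubectl -n opentelemetry-operator-system rollout status \
                deployment/opentelemetry-operator-controller-manager --timeout=10m
              # ...then roll the instrumented workloads so the webhook re-injects.
              kubectl -n my-app-namespace rollout restart deployment my-app
```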
To my knowledge, https://github.com/Azure/AKS/issues/4002 currently prevents setting the objectSelector correctly in AKS through the Helm chart, which means there is currently no reliable way to use auto-instrumentation and sidecar injection with the operator Helm chart on AKS. Also, I think the Helm chart's default settings should prevent this issue, since it isn't obvious at first. The way this is done in Dapr (periodically checking for, and deleting, pods that are missing their injected sidecars) may not be perfect, but it works for me.
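For comparison, that pattern could be approximated with a CronJob along these lines; it assumes an image with kubectl and jq, RBAC to list and delete pods, and that the injected volume is named `opentelemetry-auto-instrumentation` (all of which are assumptions to verify against your setup):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: reinject-missing-agents             # illustrative name
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-reaper    # needs list/delete on pods
          restartPolicy: OnFailure
          containers:
            - name: checker
              image: my-registry/kubectl-jq:latest   # assumed image with kubectl + jq
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Delete pods that asked for Java injection but are missing
                  # the agent volume the webhook would have added.
                  kubectl get pods -n my-app-namespace -o json \
                    | jq -r '.items[]
                        | select(.metadata.annotations["instrumentation.opentelemetry.io/inject-java"] == "true")
                        | select([.spec.volumes[]?.name] | index("opentelemetry-auto-instrumentation") | not)
                        | .metadata.name' \
                    | xargs -r -n1 kubectl delete pod -n my-app-namespace
```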
I'm pinning this issue for now, because I have to link to it fairly often and it's something I assume many users are confused by.
We have our K8s development clusters set to hibernate every day at the end of regular work hours; the cluster becomes active again the next day. We have set up the opentelemetry-operator on our cluster and configured an OpenTelemetry Collector as a DaemonSet, with the corresponding injection annotations on our pods (Java/NodeJS).
For Java:
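The standard Pod annotation for Java injection (the value `"true"` picks up the Instrumentation resource in the Pod's namespace):

```yaml
instrumentation.opentelemetry.io/inject-java: "true"
```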
For NodeJS:
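Likewise, the standard NodeJS injection annotation:

```yaml
instrumentation.opentelemetry.io/inject-nodejs: "true"
```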
With this setup, everything works fine. For example, for Java apps the Java agent is volume-mounted automatically; the agent instruments the application and ships traces to an OpenTelemetry Collector pod (created by the operator from the OpenTelemetryCollector CR), and the Collector pod in turn ships the traces to our observability backend. However, when the workloads resume the next day after hibernation, the instrumentation seems to be lost (see screenshot below). Not sure why this happens? There is not much information in the application logs, the OpenTelemetry daemon pod logs, or even the opentelemetry-operator-controller-manager pod in the opentelemetry-operator-system namespace.

Container spec before hibernation:
After resumption from hibernation: OpenTelemetry setup is lost
Thanks in advance!!!