CI/CD: producing long running traces #1648
Comments
When the controller restarts, emitting spans that do not match the actual duration of the tasks they represent, and linking them with span links because they cannot share a common parent span, is suboptimal. I have also observed this issue in the Jenkins OpenTelemetry plugin. As mentioned in the CICD SIG call of 28 November 2024, this seems to be a limitation of the OpenTelemetry Go SDK and might be shared by other OpenTelemetry SDKs. The requirements necessary to solve the issue are: setting a custom start time on spans, supplying a stable trace ID and span ID, and controlling what happens to ongoing spans when the controller shuts down.
According to the specification for the SDKs (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span-creation) there should be an optional parameter for the start timestamp. Checking the Golang SDK, this should be possible when creating spans using the WithTimestamp option. In the implementation of the GitHub Actions event receiver, the opentelemetry-collector receives the GitHub Actions events via webhook, and the events include the relevant timestamps. Regarding stable trace IDs, my quick search did not find how to do it in the SDK. However, given that tracing across microservices using HTTP headers for context propagation works, there should be a way to provide the context in the SDK (e.g. by reading it from a request header). For the sending of all ongoing spans when the controller restarts, I'm not sure whether this behavior can be changed so that the ongoing spans are not sent, or whether spans would always have to be created with a custom start time just before ending them. What impact would this have on child span context propagation? In conclusion: it might be possible to implement resilient tracing for long running jobs without spec or SDK changes, but further investigation is needed to verify this or to discover specific issues in the SDK.
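A minimal Go sketch of the back-dated span part of this (the function and tracer name are illustrative, not existing Argo Workflows code; it assumes a tracer provider has already been configured):

```go
package tracing

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// emitCompletedTaskSpan back-dates a span so its duration matches the task it
// represents, even when the span is only created long after the task started.
func emitCompletedTaskSpan(ctx context.Context, name string, started, ended time.Time) {
	tracer := otel.Tracer("workflow-controller")

	// Start with the recorded start time instead of time.Now().
	_, span := tracer.Start(ctx, name, trace.WithTimestamp(started))

	// End with the recorded end time, so the span duration is correct.
	span.End(trace.WithTimestamp(ended))
}
```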
As long as you can store a trace ID, span ID, and start time, you can probably just generate the entire span when it ends. You can set a custom start time for spans using the https://pkg.go.dev/go.opentelemetry.io/otel/trace#WithTimestamp option on start. The only remaining thing you would need to be able to do is to set the trace and span ID. The only way I can think of to do this today is by providing a custom IDGenerator via https://pkg.go.dev/go.opentelemetry.io/otel/sdk/trace#WithIDGenerator. The IDGenerator has access to the context, which you could use to pass the trace ID and span ID you want to use. Don't pass it in using trace.ContextWithSpanContext, since then it will be used as the parent context. You will want to write your own ContextWithHardCodedSpanContext (or similar) to pass it in explicitly through the context. As for prior art, I and others in the K8s instrumentation SIG explored storing the span context in object annotations as part of kubernetes/enhancements#2312, but didn't move forward with the proposal.
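A sketch of that IDGenerator route (ContextWithHardCodedSpanContext and HardCodedIDGenerator are made-up names for illustration, not SDK API; a production version would also need to ensure only the intended span picks up the stashed IDs, and would retry the unlikely all-zero random result):

```go
package tracing

import (
	"context"
	crand "crypto/rand"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

type hardCodedKeyType struct{}

// ContextWithHardCodedSpanContext stashes the exact trace/span IDs the next
// span should use. Deliberately not trace.ContextWithSpanContext, so the SDK
// does not treat it as a parent.
func ContextWithHardCodedSpanContext(ctx context.Context, sc trace.SpanContext) context.Context {
	return context.WithValue(ctx, hardCodedKeyType{}, sc)
}

// HardCodedIDGenerator reuses IDs stashed in the context when present,
// otherwise it falls back to random IDs.
type HardCodedIDGenerator struct{}

var _ sdktrace.IDGenerator = (*HardCodedIDGenerator)(nil)

func (g *HardCodedIDGenerator) NewIDs(ctx context.Context) (trace.TraceID, trace.SpanID) {
	if sc, ok := ctx.Value(hardCodedKeyType{}).(trace.SpanContext); ok {
		return sc.TraceID(), sc.SpanID()
	}
	var tid trace.TraceID
	var sid trace.SpanID
	_, _ = crand.Read(tid[:])
	_, _ = crand.Read(sid[:])
	return tid, sid
}

func (g *HardCodedIDGenerator) NewSpanID(ctx context.Context, traceID trace.TraceID) trace.SpanID {
	if sc, ok := ctx.Value(hardCodedKeyType{}).(trace.SpanContext); ok {
		return sc.SpanID()
	}
	var sid trace.SpanID
	_, _ = crand.Read(sid[:])
	return sid
}
```

It would then be wired in when building the provider, e.g. `sdktrace.NewTracerProvider(sdktrace.WithIDGenerator(&HardCodedIDGenerator{}), ...)`.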
Notes from SemConv meeting 2024-12-09
There are 2 ways to think about tracing:
When tracing was designed (leading to the Dapper paper?), the use cases were mostly short-running operations. Long running traces can be a problem:
There are similar concerns with people who like to mark errors on traces: if you have a message passing queue in the middle, and the queue errors, you'd mark the whole trace as errored. The other thing is with CI/CD: I don't think you have a large trace problem with this, you just have a long trace problem. Your traces are no bigger than those of a complex distributed system, they just take a long time, because some builds can be. At least, if you're doing the horrible build-for-ARM-emulation thing we were doing in GitHub, it could be like 4h for no reason other than emulating ARM, and you have to deal with that. But the trace isn't any bigger. So I think you actually have an interesting use case here with long running traces where you're not pushing against concerns like tail sampling, where we have to keep giant amounts of data in memory for a very long time. I thought that it might be possible to solve this even with current SDKs (we can start a trace or span with an earlier start time, and it should be feasible to pass a given context, like passing context via HTTP headers). You probably can, yeah, it's probably just not ergonomic. You definitely could do it from Java, and I'm pretty sure you can do it from Go as well.
Note that this is a spec issue and not a semconv issue; there are existing discussions on the matter: open-telemetry/opentelemetry-specification#373. As per my initial thoughts on #1688, I think that the SIG CICD SemConv could be a driving force behind that, so please review those existing issues and PRs, especially #373 by @arminru and the comment by @xBis7 from Oct 15th at the very bottom.
Area(s)
area:cicd
Is your change request related to a problem? Please describe.
This is not strictly a semantic-convention discussion, but it has come about from trying to produce traces for CI/CD, and the sem-conv CI/CD WG is the closest thing to a home for this discussion.
The problem I am trying to solve is producing spans of long duration from a kubernetes controller which may restart.
I'm specifically trying to solve this in argo-workflows, and I gave a talk at argocon (slides) which gives a lot of context to this. Argo Workflows can be and is used as a CI/CD tool, but also for many other batch and machine learning scenarios where traces of the flow of a workflow would be useful. I'm treating the trace as the lifespan of a workflow, which may be several hours (some run for days). There are child spans representing the phases of the workflow and the individual nodes within the DAG that is being executed.
Argo Workflows
The workflow controller is best thought of here as a kubernetes operator, in this case running something not that far removed from a kubernetes job, just that the job is a DAG rather than a single pod. In the usual kubernetes controller manner, this controller is stateless, and can therefore restart at any time. Any necessary state is stored in the workflow Custom Resource. This is how the controller currently works.
I've used the Go SDK to add OTLP tracing to Argo Workflows, which works unless the controller restarts.
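For orientation, the controller-side wiring with the Go SDK looks roughly like this (the endpoint, service name and semconv version are placeholders, not the actual Argo Workflows code):

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// setupTracing builds a TracerProvider that batches spans to an OTLP/gRPC
// endpoint. Shutting the provider down on controller exit is where the
// restart problem described below shows up.
func setupTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("workflow-controller"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```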
Ideas
Delayed span transmission
My initial thought
My problem seemed to be that the OpenTelemetry SDK does not allow me to resume a Span. I'm using the golang SDK, but this is fundamental to the operation of spans. Once a Tracer is shut down it will end all spans, and that's the end of my trace and spans. The spans will get transmitted as ended.
I therefore supposed you could put the SDK or a set of traces/spans into a mode where the end of the span didn't get transmitted at shutdown, and instead the span could be stored outside of the SDK to be transmitted later by the resumed controller once it started up. The SDK could then also facilitate "storing a span to a file".
This is possibly implementable right now with enough hacking; I haven't managed to find time for it.
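Building on the earlier comment about storing a trace ID, span ID and start time, a hedged sketch of what such a checkpoint could look like if it were kept on the Workflow custom resource (the type and annotation scheme are hypothetical, not something Argo Workflows does today):

```go
package tracing

import (
	"time"

	"go.opentelemetry.io/otel/trace"
)

// SpanCheckpoint is the minimal state that would have to survive a controller
// restart, e.g. serialized into an annotation on the Workflow custom resource.
type SpanCheckpoint struct {
	TraceID   string    `json:"traceID"`
	SpanID    string    `json:"spanID"`
	StartTime time.Time `json:"startTime"`
}

// checkpointFromSpan captures the live span's identifiers before shutdown.
func checkpointFromSpan(span trace.Span, start time.Time) SpanCheckpoint {
	sc := span.SpanContext()
	return SpanCheckpoint{
		TraceID:   sc.TraceID().String(),
		SpanID:    sc.SpanID().String(),
		StartTime: start,
	}
}

// restoreSpanContext rebuilds a SpanContext from a checkpoint after restart;
// combined with a custom IDGenerator and WithTimestamp (see the comments
// above) the resumed controller could emit the span as if it never stopped.
func restoreSpanContext(cp SpanCheckpoint) (trace.SpanContext, error) {
	tid, err := trace.TraceIDFromHex(cp.TraceID)
	if err != nil {
		return trace.SpanContext{}, err
	}
	sid, err := trace.SpanIDFromHex(cp.SpanID)
	if err != nil {
		return trace.SpanContext{}, err
	}
	return trace.NewSpanContext(trace.SpanContextConfig{TraceID: tid, SpanID: sid}), nil
}
```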
Links
I could use span links and create multiple traces for a single workflow. This would crudely work, but I'd argue against it unless the presentation layer can effectively hide this from users.
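For reference, the links variant is straightforward with the existing Go API; something like this (the function name is illustrative) would start a fresh trace per controller run, linked back to the previous one:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// startLinkedWorkflowSpan begins a new trace for this controller run and
// links it to the span context recovered from the previous run.
func startLinkedWorkflowSpan(ctx context.Context, previous trace.SpanContext) (context.Context, trace.Span) {
	tracer := otel.Tracer("workflow-controller")
	return tracer.Start(ctx, "workflow",
		trace.WithNewRoot(), // do not parent onto anything in ctx
		trace.WithLinks(trace.Link{SpanContext: previous}),
	)
}
```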
The current UI for argo-workflows already has a basic view showing a timeline of a workflow. This won't display or concern the user with controller restarts.
The target audience for these spans is probably somewhat less technical than the existing audience for HTTP microservice tracing. Having to explain why their trace is in multiple parts and that they'll just have to deal with it isn't ideal. Span metrics are a valuable tool here, and with a workflow split across multiple linked traces they'll be much more complicated, or in some cases impossible.
It may be that the presentation layer can hide this - I have limited exposure to the variety of these in the market and how they deal with links.
Events
We could emit events for span start and end, and make it the "collector's" problem to correlate these and create spans. This is how GitHub Actions tracing works - thanks to @adrielp for telling me about this.
For this to work, something has to retain state: either the "End Span" event contains enough to construct the whole span (e.g. the start time), or the collector has to correlate a start event and an end event, so the start event needs storage. The collector storing state would be wrong - in a cloud native environment I'd not even expect the same collector to receive the `end` as received the `start`. I'm trying not to be opinionated on how you configure your collector, but maybe we have to be.
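For illustration, roughly what a self-contained "End Span" event would need to carry (the field names are hypothetical, not an existing OTel schema):

```go
package events

import "time"

// SpanEndEvent carries everything needed to construct the finished span, so
// no collector has to have seen (or stored) the corresponding start event.
type SpanEndEvent struct {
	TraceID    string            `json:"traceID"`
	SpanID     string            `json:"spanID"`
	ParentID   string            `json:"parentID,omitempty"`
	Name       string            `json:"name"`
	StartTime  time.Time         `json:"startTime"` // included so no start-event correlation is needed
	EndTime    time.Time         `json:"endTime"`
	Attributes map[string]string `json:"attributes,omitempty"`
	Status     string            `json:"status,omitempty"`
}
```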
Protocol changes
A different approach to delaying span transmission, but with a similar goal.
We could change the protocol to allow a span end to be marked as "ended due to shutdown", and then allow a future span end for the same `span_id` to end it properly. This probably just pushes the problem onwards to the collector or the storage backend, which would have to do correlation in a similar way to the events approach, so it isn't an improvement.
Describe the solution you'd like
I don't work in the telemetry business, and so I'm sure I'm missing other prior art.
I'm open to any mechanism to solve this, and would prefer we came up with a common pattern for this and other long-span/restartable-binary problems. I believe these problems will also exist in some of the android/mobile and browser implementations, into which I have little visibility or understanding. Some of these proposed solutions may not work there, so coming up with a common solution which works for my use case and theirs would be ideal.
I hope this sparks a useful discussion.