Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-9983] Leverage local symbols for otel-profiling-agent #23

Merged
merged 4 commits into from
Jun 25, 2024

Conversation

Gandem
Copy link
Member

@Gandem Gandem commented Jun 19, 2024

Upload symbols for executables when symbols are available locally.

By default, this is a no-op. When DD_EXPERIMENTAL_LOCAL_SYMBOL_UPLOAD environment variable is set to true, this will kick off the symbol upload logic for each new executable found in a mapping.

The datadog uploader requires setting up DD_API_KEY and DD_SITE. For now, symbols are only uploaded for executables which contain DWARF data.

I tried keeping changes to otel-profiling-agent to a minimum, the main changes requires are:

  • Wiring up the uploader in execinfomanager, and
  • Passing the global file ID to execinfomanager (in addition to the host-local ID), to be able to pass it to the uploader, and
  • Including the ELF file path in pfelf.File, as passed to os.Open (so that we're able to get a file path to feed to objcopy, that will always work for containers - for e.g. mapping files path)

A few examples of a profiles remotely symbolicated thanks to the local upload:
image
image

@Gandem Gandem changed the title symbolication: support uploading local symbols remotely [PROF-9983] Leverage local symbols for otel-profiling-agent Jun 19, 2024
@Gandem Gandem force-pushed the nayef/local-symbols branch 5 times, most recently from c839c3e to 941648c Compare June 20, 2024 11:08
@Gandem Gandem force-pushed the nayef/local-symbols branch from 941648c to 9efa85c Compare June 20, 2024 11:18
@Gandem Gandem marked this pull request as ready for review June 20, 2024 12:18
@Gandem Gandem requested review from nsavoire and r1viollet June 20, 2024 12:19
when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239
// The simplest binary gcc 10 can generate ("return 0") has >= 48 bytes for each section.
// Let's not worry about executables that may not verify this, as they would not be of
// interest to us.
if section.Size < 32 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good thinking

Copy link
Member Author

@Gandem Gandem Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed it's neat - to be fair all the credit goes to the original authors of the function

func HasDWARFData(elfFile *elf.File) bool {
👌

@@ -49,8 +50,12 @@ const (
// parsed sections (e.g. symbol tables and string tables; libxul
// has about 4MB .dynstr)
maxBytesLargeSection = 16 * 1024 * 1024
buildIDSectionName = ".note.gnu.build-id"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we treat go build id, which might be in a differently named note? I think it's maybe .note.go.build-id or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is handled in:

func (f *File) GetGoBuildID() (string, error) {
s := f.Section(".note.go.buildid")
if s == nil {
s = f.Section(".notes")
}
if s == nil {
return "", ErrNoBuildID
}
data, err := s.Data(maxBytesSmallSection)
if err != nil {
return "", err
}
return getGoBuildIDFromNotes(data)
}

I had to extract the string for the GNU build id section to a constant since the new HasDwarfData() function uses it, and the linter was yelling at me for string constants duplication 🤦

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: the absence of build id being an error is slightly too high in log level

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Right now the log error is only emitted if the ELF has dwarf data but no build id, so it should be pretty rare. I'm tempted to leave it as is, and happy to follow-up on this in an ulterior PR


### Local symbol upload (Experimental)

For compiled languages (C/C++/Rust/Go), the profiling-agent can upload local symbols (when available) to Datadog for symbolication. Symbols need to be available locally (unstripped binaries).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: a nice thing to support in the future could debug links and other standard locations that are usually searched.

r1viollet
r1viollet previously approved these changes Jun 21, 2024
Copy link
Collaborator

@r1viollet r1viollet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Nice feature! :-)

nsavoire
nsavoire previously approved these changes Jun 24, 2024
when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
@Gandem Gandem dismissed stale reviews from nsavoire and r1viollet via d394e12 June 25, 2024 08:40
Copy link
Collaborator

@r1viollet r1viollet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Gandem Gandem merged commit 1101921 into main Jun 25, 2024
5 checks passed
@Gandem Gandem deleted the nayef/local-symbols branch June 25, 2024 11:59
Gandem added a commit that referenced this pull request Jun 26, 2024
* symbolication: support uploading local symbols remotely

* readme: document experimental local symbol upload

* symbolication: remove noisy error log

when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239

* uploader: fix segfault on failure to init datadog uploader (#26)

when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
Gandem added a commit that referenced this pull request Jun 26, 2024
* symbolication: support uploading local symbols remotely

* readme: document experimental local symbol upload

* symbolication: remove noisy error log

when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239

* uploader: fix segfault on failure to init datadog uploader (#26)

when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
Gandem added a commit that referenced this pull request Jun 27, 2024
* symbolication: support uploading local symbols remotely

* readme: document experimental local symbol upload

* symbolication: remove noisy error log

when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239

* uploader: fix segfault on failure to init datadog uploader (#26)

when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
Gandem added a commit that referenced this pull request Jun 28, 2024
* symbolication: support uploading local symbols remotely

* readme: document experimental local symbol upload

* symbolication: remove noisy error log

when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239

* uploader: fix segfault on failure to init datadog uploader (#26)

when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
Gandem added a commit that referenced this pull request Jun 28, 2024
* symbolication: support uploading local symbols remotely

* readme: document experimental local symbol upload

* symbolication: remove noisy error log

when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239

* uploader: fix segfault on failure to init datadog uploader (#26)

when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
Gandem added a commit that referenced this pull request Jun 28, 2024
* symbolication: support uploading local symbols remotely

* readme: document experimental local symbol upload

* symbolication: remove noisy error log

when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239

* uploader: fix segfault on failure to init datadog uploader (#26)

when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants