-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-9983] Leverage local symbols for otel-profiling-agent #23
Conversation
c839c3e
to
941648c
Compare
941648c
to
9efa85c
Compare
when we only filter out fs.ErrNotExist, we get frequent failures in the local development environment with the following error: /proc/<pid>/root// is a directory we seem to think that /proc/<pid>/root// is a executable path, this happens for short-lived runc processes where /proc/<pid>/maps has "/" as the pathname for the runc executable itself. so when we try to open the elf file, we first try to read it from /proc/<pid>/map_files and (as the process is short lived) when this fails, we try to open /proc/<pid>/root// which fails with a "is a directory" error instead of fs.ErrNotExist to fix this, we just ignore all errors returned by GetElf() and only log them for debug purposes. https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239
// The simplest binary gcc 10 can generate ("return 0") has >= 48 bytes for each section. | ||
// Let's not worry about executables that may not verify this, as they would not be of | ||
// interest to us. | ||
if section.Size < 32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good thinking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed it's neat - to be fair all the credit goes to the original authors of the function
otel-profiling-agent/libpf/pfelf/pfelf.go
Line 195 in 967d1d5
func HasDWARFData(elfFile *elf.File) bool { |
@@ -49,8 +50,12 @@ const ( | |||
// parsed sections (e.g. symbol tables and string tables; libxul | |||
// has about 4MB .dynstr) | |||
maxBytesLargeSection = 16 * 1024 * 1024 | |||
buildIDSectionName = ".note.gnu.build-id" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we treat go build id, which might be in a differently named note? I think it's maybe .note.go.build-id
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is handled in:
otel-profiling-agent/libpf/pfelf/file.go
Lines 485 to 499 in 967d1d5
func (f *File) GetGoBuildID() (string, error) { | |
s := f.Section(".note.go.buildid") | |
if s == nil { | |
s = f.Section(".notes") | |
} | |
if s == nil { | |
return "", ErrNoBuildID | |
} | |
data, err := s.Data(maxBytesSmallSection) | |
if err != nil { | |
return "", err | |
} | |
return getGoBuildIDFromNotes(data) | |
} |
I had to extract the string for the GNU build id section to a constant since the new HasDwarfData()
function uses it, and the linter was yelling at me for string constants duplication 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: the absence of build id being an error is slightly too high in log level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Right now the log error is only emitted if the ELF has dwarf data but no build id, so it should be pretty rare. I'm tempted to leave it as is, and happy to follow-up on this in an ulterior PR
|
||
### Local symbol upload (Experimental) | ||
|
||
For compiled languages (C/C++/Rust/Go), the profiling-agent can upload local symbols (when available) to Datadog for symbolication. Symbols need to be available locally (unstripped binaries). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: a nice thing to support in the future could debug links and other standard locations that are usually searched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Nice feature! :-)
when we fail to initialize the datadog symbol uploader, we set the uploader to nil, which will cause a panic later on when handling a new executable for symbol upload. this sets the uploader to symbolication.NewNoopUploader() when the datadog uploader is set up, but fails to be initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* symbolication: support uploading local symbols remotely * readme: document experimental local symbol upload * symbolication: remove noisy error log when we only filter out fs.ErrNotExist, we get frequent failures in the local development environment with the following error: /proc/<pid>/root// is a directory we seem to think that /proc/<pid>/root// is a executable path, this happens for short-lived runc processes where /proc/<pid>/maps has "/" as the pathname for the runc executable itself. so when we try to open the elf file, we first try to read it from /proc/<pid>/map_files and (as the process is short lived) when this fails, we try to open /proc/<pid>/root// which fails with a "is a directory" error instead of fs.ErrNotExist to fix this, we just ignore all errors returned by GetElf() and only log them for debug purposes. https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239 * uploader: fix segfault on failure to init datadog uploader (#26) when we fail to initialize the datadog symbol uploader, we set the uploader to nil, which will cause a panic later on when handling a new executable for symbol upload. this sets the uploader to symbolication.NewNoopUploader() when the datadog uploader is set up, but fails to be initialized.
* symbolication: support uploading local symbols remotely * readme: document experimental local symbol upload * symbolication: remove noisy error log when we only filter out fs.ErrNotExist, we get frequent failures in the local development environment with the following error: /proc/<pid>/root// is a directory we seem to think that /proc/<pid>/root// is a executable path, this happens for short-lived runc processes where /proc/<pid>/maps has "/" as the pathname for the runc executable itself. so when we try to open the elf file, we first try to read it from /proc/<pid>/map_files and (as the process is short lived) when this fails, we try to open /proc/<pid>/root// which fails with a "is a directory" error instead of fs.ErrNotExist to fix this, we just ignore all errors returned by GetElf() and only log them for debug purposes. https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239 * uploader: fix segfault on failure to init datadog uploader (#26) when we fail to initialize the datadog symbol uploader, we set the uploader to nil, which will cause a panic later on when handling a new executable for symbol upload. this sets the uploader to symbolication.NewNoopUploader() when the datadog uploader is set up, but fails to be initialized.
* symbolication: support uploading local symbols remotely * readme: document experimental local symbol upload * symbolication: remove noisy error log when we only filter out fs.ErrNotExist, we get frequent failures in the local development environment with the following error: /proc/<pid>/root// is a directory we seem to think that /proc/<pid>/root// is a executable path, this happens for short-lived runc processes where /proc/<pid>/maps has "/" as the pathname for the runc executable itself. so when we try to open the elf file, we first try to read it from /proc/<pid>/map_files and (as the process is short lived) when this fails, we try to open /proc/<pid>/root// which fails with a "is a directory" error instead of fs.ErrNotExist to fix this, we just ignore all errors returned by GetElf() and only log them for debug purposes. https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239 * uploader: fix segfault on failure to init datadog uploader (#26) when we fail to initialize the datadog symbol uploader, we set the uploader to nil, which will cause a panic later on when handling a new executable for symbol upload. this sets the uploader to symbolication.NewNoopUploader() when the datadog uploader is set up, but fails to be initialized.
* symbolication: support uploading local symbols remotely * readme: document experimental local symbol upload * symbolication: remove noisy error log when we only filter out fs.ErrNotExist, we get frequent failures in the local development environment with the following error: /proc/<pid>/root// is a directory we seem to think that /proc/<pid>/root// is a executable path, this happens for short-lived runc processes where /proc/<pid>/maps has "/" as the pathname for the runc executable itself. so when we try to open the elf file, we first try to read it from /proc/<pid>/map_files and (as the process is short lived) when this fails, we try to open /proc/<pid>/root// which fails with a "is a directory" error instead of fs.ErrNotExist to fix this, we just ignore all errors returned by GetElf() and only log them for debug purposes. https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239 * uploader: fix segfault on failure to init datadog uploader (#26) when we fail to initialize the datadog symbol uploader, we set the uploader to nil, which will cause a panic later on when handling a new executable for symbol upload. this sets the uploader to symbolication.NewNoopUploader() when the datadog uploader is set up, but fails to be initialized.
* symbolication: support uploading local symbols remotely * readme: document experimental local symbol upload * symbolication: remove noisy error log when we only filter out fs.ErrNotExist, we get frequent failures in the local development environment with the following error: /proc/<pid>/root// is a directory we seem to think that /proc/<pid>/root// is a executable path, this happens for short-lived runc processes where /proc/<pid>/maps has "/" as the pathname for the runc executable itself. so when we try to open the elf file, we first try to read it from /proc/<pid>/map_files and (as the process is short lived) when this fails, we try to open /proc/<pid>/root// which fails with a "is a directory" error instead of fs.ErrNotExist to fix this, we just ignore all errors returned by GetElf() and only log them for debug purposes. https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239 * uploader: fix segfault on failure to init datadog uploader (#26) when we fail to initialize the datadog symbol uploader, we set the uploader to nil, which will cause a panic later on when handling a new executable for symbol upload. this sets the uploader to symbolication.NewNoopUploader() when the datadog uploader is set up, but fails to be initialized.
* symbolication: support uploading local symbols remotely * readme: document experimental local symbol upload * symbolication: remove noisy error log when we only filter out fs.ErrNotExist, we get frequent failures in the local development environment with the following error: /proc/<pid>/root// is a directory we seem to think that /proc/<pid>/root// is a executable path, this happens for short-lived runc processes where /proc/<pid>/maps has "/" as the pathname for the runc executable itself. so when we try to open the elf file, we first try to read it from /proc/<pid>/map_files and (as the process is short lived) when this fails, we try to open /proc/<pid>/root// which fails with a "is a directory" error instead of fs.ErrNotExist to fix this, we just ignore all errors returned by GetElf() and only log them for debug purposes. https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239 * uploader: fix segfault on failure to init datadog uploader (#26) when we fail to initialize the datadog symbol uploader, we set the uploader to nil, which will cause a panic later on when handling a new executable for symbol upload. this sets the uploader to symbolication.NewNoopUploader() when the datadog uploader is set up, but fails to be initialized.
Upload symbols for executables when symbols are available locally.
By default, this is a no-op. When
DD_EXPERIMENTAL_LOCAL_SYMBOL_UPLOAD
environment variable is set to true, this will kick off the symbol upload logic for each new executable found in a mapping.The datadog uploader requires setting up
DD_API_KEY
andDD_SITE
. For now, symbols are only uploaded for executables which contain DWARF data.I tried keeping changes to
otel-profiling-agent
to a minimum, the main changes requires are:execinfomanager
, andexecinfomanager
(in addition to the host-local ID), to be able to pass it to the uploader, andpfelf.File
, as passed to os.Open (so that we're able to get a file path to feed toobjcopy
, that will always work for containers - for e.g. mapping files path)A few examples of a profiles remotely symbolicated thanks to the local upload: