-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Learning ebpf_exporter by adding examples/cgroup-rstat-flushing #457
Conversation
For now this implements basic counting of cgroup rstat flushes per cgroup depth "level". Further developement depends on tracepoints added in kernel v6.10. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This implementation for counting locks with different "type" (normal vs yield) and state for contented isn't compatible with the way ebpf_exporter can extract data. The BPF code works, but the ebpf_exporter decoder isn't build for having a complex (e.g. struct) as map value. The 'counters:' metrics isn't build for this. The 'spans:' type can handle and decode map values being complex structs, but seems specific to ringbuf. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Example output via command: curl -s localhost:9435/metrics # HELP ebpf_exporter_cgroup_rstat_locked_state Number of times rstat lock was obtainted and contended state # TYPE ebpf_exporter_cgroup_rstat_locked_state counter ebpf_exporter_cgroup_rstat_locked_state{contended="0"} 174290 ebpf_exporter_cgroup_rstat_locked_state{contended="1"} 269 Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The 'counters:' metric should only be there one time. After this I see all the counters, like this example: # HELP ebpf_exporter_cgroup_rstat_flush_total Total number of times cgroup rstat were flushed (recorded per level) # TYPE ebpf_exporter_cgroup_rstat_flush_total counter ebpf_exporter_cgroup_rstat_flush_total{level="0"} 9480 ebpf_exporter_cgroup_rstat_flush_total{level="1"} 0 ebpf_exporter_cgroup_rstat_flush_total{level="2"} 396127 ebpf_exporter_cgroup_rstat_flush_total{level="3"} 0 ebpf_exporter_cgroup_rstat_flush_total{level="4"} 0 ebpf_exporter_cgroup_rstat_flush_total{level="5"} 0 # HELP ebpf_exporter_cgroup_rstat_locked_state Total number of times rstat lock was obtainted and contended state # TYPE ebpf_exporter_cgroup_rstat_locked_state counter ebpf_exporter_cgroup_rstat_locked_state{contended="0"} 405550 ebpf_exporter_cgroup_rstat_locked_state{contended="1"} 1063 # HELP ebpf_exporter_cgroup_rstat_locked_yield Number of times rstat lock was obtainted again after yield and contended state # TYPE ebpf_exporter_cgroup_rstat_locked_yield counter ebpf_exporter_cgroup_rstat_locked_yield{contended="0"} 1 ebpf_exporter_cgroup_rstat_locked_yield{contended="1"} 0 Signed-off-by: Jesper Dangaard Brouer <[email protected]>
3b26bbe
to
727a241
Compare
# HELP ebpf_exporter_cgroup_rstat_lock_contended Lock contention counters per cgroup level # TYPE ebpf_exporter_cgroup_rstat_lock_contended counter ebpf_exporter_cgroup_rstat_lock_contended{level="0"} 36 ebpf_exporter_cgroup_rstat_lock_contended{level="1"} 847 ebpf_exporter_cgroup_rstat_lock_contended{level="2"} 415 ebpf_exporter_cgroup_rstat_lock_contended{level="3"} 0 ebpf_exporter_cgroup_rstat_lock_contended{level="4"} 0 ebpf_exporter_cgroup_rstat_lock_contended{level="5"} 0 Signed-off-by: Jesper Dangaard Brouer <[email protected]>
72b7a2f
to
c6ff257
Compare
Thinking it will be more obvious for prometheus query to rename the counter cgroup_rstat_locked_state to highlight 'total' in name as cgroup_rstat_locked_total. Example of new naming: # HELP ebpf_exporter_cgroup_rstat_locked_total Total number of times rstat lock was obtainted and contended state # TYPE ebpf_exporter_cgroup_rstat_locked_total counter ebpf_exporter_cgroup_rstat_locked_total{contended="0"} 340710 ebpf_exporter_cgroup_rstat_locked_total{contended="1"} 1234 Signed-off-by: Jesper Dangaard Brouer <[email protected]>
examples/cgroup-rstat-flushing.yaml
Outdated
size: 4 | ||
decoders: # contended boolean converted to 0 and 1 | ||
- name: uint | ||
- name: cgroup_rstat_locked_yield |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cgroup_rstat_locked_yields_total
, see https://prometheus.io/docs/practices/naming/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the link , as a Prometheus newbie I didn't realize that the _suffix had a meaning.
I now realize that general strucure is library_name_unit_suffix.
Jakub recommended: Prometheus: Up & Running, 2nd Edition
examples/cgroup-rstat-flushing.bpf.c
Outdated
#define MAX_CGRP_LEVELS 5 | ||
|
||
struct { | ||
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tabs are making clang-format unhappy in CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO then BPF restricted-C code should be written with same tabs style as kernel code.
Lets change for formatting requirements, please!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it for another issue. For now let's just make clang-format happy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This projects .clang-format
file states:
# Minimal format loosely matching the kernel style.
IMHO it doesn't match kernel style at all...
As it contains:
IndentWidth: 4
TabWidth: 4
UseTab: Never
Looking at kernel source in .clang-format
this should have been:
IndentWidth: 8
TabWidth: 8
UseTab: Always
I will respect that this project requires its own special clang-format rule and obey to these, but I still think they are wrong.
examples/cgroup-rstat-flushing.bpf.c
Outdated
(*cnt)++; | ||
} | ||
|
||
/* What cgrp level is interesting, but I didn't manage to encode it in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: cgrp
-> cgroup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure to nit.
I was hoping to get your input to the contents of the comment:
/* What cgrp level is interesting, but I didn't manage to encode it in
* above counters. As contended case is the most interesting, have
* level counter for contended.
*/
When coding BPF, I usually do one map lookup to get a stats struct
, that I will update with multiple stats.
Here I end-up doing 3 map lookups for single counters.
When I started coding I though I could represent something like:
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="0"} 6558
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="0"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="1"} 34
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="1"} 123
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="5"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="5"} 0
Question: But ...
- I cannot figure out how to get ebpf_exporter to generate this?
- I don't know if these counters makes sense for Prometheus ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if these counters makes sense for Prometheus ?
They do
I cannot figure out how to get ebpf_exporter to generate this?
Each line here is a separate key in a single map called ebpf_exporter_cgroup_rstat_locked_total
:
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="0"} 6558
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="0"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="1"} 34
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="1"} 123
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="5"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="5"} 0
The mapping is effectively:
{contended="0", level="0"}
->6558
{contended="1", level="0"}
->0
{contended="0", level="1"}
->34
- ...
To encode each key, you turn the raw bytes of the key into the values with the help of decoders.
Working backwards from what you expect on the output, you want something like this in the config:
- name: cgroup_rstat_locked_total
help: Total number of times rstat lock was obtainted and contended state
labels:
- name: contended
size: 1
decoders:
- name: uint
- name: level
size: 1
decoders:
- name: uint
On the kernel side you'd want something like this to represent the key:
struct key_t {
u8 contended;
u8 level;
};
The names here do not matter, only sizes do (u8 corresponds to size: 1
in the config).
Given that your keys are complex structs and not small numbers, you cannot store these in an array, so you need a hashmap:
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
__uint(max_entries, 128);
__type(key, struct key_t);
__type(value, u64);
} cgroup_rstat_locked_total SEC(".maps");
To actually increment a value, you need to construct the key and to the increment in the map:
// construct however you need to
struct key_t key = {};
key.contended = 1;
key.level = 2;
increment_map_nosync(&cgroup_rstat_locked_total, &key, 1);
Here's an example with a complex key:
- https://github.com/cloudflare/ebpf_exporter/blob/master/examples/pci.yaml
- https://github.com/cloudflare/ebpf_exporter/blob/master/examples/pci.bpf.c
Hope this makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These suggestions worked for encoding more information ("labels" in Prometheus terms).
I've coded it up and pushed as part of this patchset/PR.
(p.s. I've also encoded "yield" state)
The current Ubuntu CI system doesn't have kernel v6.10 which contains the tracepoints cgroup-rstat-flushing uses. Thus, disable CI checks via CONFIGS_TO_IGNORE_IN_CHECK. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Spell out CGRP to CGROUP. Replace MAX_CGRP_LEVELS -> MAX_CGROUP_LEVELS Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This allow us to remove the other maps. What a nice cleanup :-) Signed-off-by: Jesper Dangaard Brouer <[email protected]>
328bcba
to
ee3de80
Compare
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
examples/cgroup-rstat-flushing.bpf.c
Outdated
#define MAX_CGRP_LEVELS 5 | ||
|
||
struct { | ||
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it for another issue. For now let's just make clang-format happy.
examples/cgroup-rstat-flushing.bpf.c
Outdated
lock_key.yield = 1; | ||
|
||
lock_key.contended = contended; | ||
lock_key.level = (level & 0xFF); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable level
is u32 and lock_key.level
u16, so this essentially a cast, and I'm also telling BPF verifier variable is bounded.
It shouldn't be needed, but my previous experience with clang/llvm and BPF-verifier, tells me that sometimes it is needed anyhow.
For verifier concerns, I think I can remove it here, because code is putting another bound on level via MAX_CGROUP_LEVELS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would normally just cast it explicitly: lock_key.level = (u16) level
, which should have the same effect.
examples/cgroup-rstat-flushing.bpf.c
Outdated
u64 now = bpf_ktime_get_ns(); | ||
u64 pid = bpf_get_current_pid_tgid(); | ||
struct lock_key_t lock_key = { 0 }; | ||
u32 level = cgrp->level; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can operate on lock_key.level
directly, no need for a separate variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below I'm using level
and bounding it to MAX_CGROUP_LEVELS
.
I prefer to keep level type u32
in this bounding code, and do the assignment later to lock_key.level
as it is type u16
.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Adjusted to clang-format rule rules Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Too many time measurements are in 1 usec bucket. Start buckets at 0.1 usec resolution to get more better resolution results. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
as it is not longer in usec. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This have special clang-format rules, that needs to be obeyed. The .clang-format file states "it loosely matching the kernel style". IMHO it doesn't match kernel style as it contains: IndentWidth: 4 TabWidth: 4 UseTab: Never It it want to better match kernel style it should use: IndentWidth: 8 TabWidth: 8 UseTab: Always Signed-off-by: Jesper Dangaard Brouer <[email protected]>
e5c4870
to
824cb1d
Compare
Looking at other examples the name usually contains "latency". Thus name it: cgroup_rstat_flush_latency_seconds Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Hopefully this doesn't create too many records, maximum is cgroups on the system. Per cgroup record average latency via two counters per cgroup flush_seconds_count and flush_seconds_sum. Also add a level label, as this allows up to query "without" the cgroup string, and we can get per level average flush latency cost. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Let me know when it's ready to go. It would be helpful if you attached the exact metrics that you see locally with this running and capturing some events. |
Running this locally overnight, I see an issue with cgroup decoding: e.g
Something on my system seems to be creating cgroups for a short period and ebpf_exporter is picking these up
I'm want to change the code such that this per-cgroup stats cannot explode like this
|
Regarding above "unknown_cgroup_id:nnn" issue (Cc @bobrik ) Can the cgroup decoder be instructed to exclude unknown cgroups in the output?
|
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Does ebpf_exporter say anything about
Not in the current state, but it seems like a good idea: #460. Another option, which might not be possible: check for cgroup age and if it's less than some threshold (say 10s), set cgroup_id to zero to do the collapsing manually. |
I don't think
Awesome that you created a ticket for this 😃 |
It's a matter of having v6.6 or newer: torvalds/linux@0ce7c12e88cf. There isn't anything to configure. As long as ebpf_exporter doesn't say anything about fanotify at startup, it is using it. |
Resolution of cgroup_rstat_flush_seconds_{sum,count} was wrong in code, this was in 0.1 usec. Fix by changing this to nanosec resolution. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Noticed histrogram field _sum was always zero. Function (define) _increment_histogram uses max_bucket+1 for storing the sum counter. Because dealing with maps of array type (BPF_MAP_TYPE_PERCPU_ARRAY) we need an additional +1 (now +2) when creating the map. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Using read_array_ptr() isn't correct for LRU maps as they can fail the lookup. Always using bpf_map_update_elem() on an LRU map is expensive, because kernel will always create a new element (and insert it before old element in list) and copy over key and value data into new element. Doing a lookup first activate less complex kernel code (that just sets a ref bit on the element, which keeps in on the active list). Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Timestamps helpers had use case with more complex key. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Observed this on live system: ebpf_exporter_cgroup_rstat_map_errors_total{type="no_elem"} 239 The issue seems to be the order the tracepoints gets activated. E.g. tracepoints using get_timestamp() gets activiated before those creating timestamps via record_timestamp. The fix is to check that the timestamp from get_timestamp() isn't zero. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Recording two counters per cgroup generates too much data for prometheus. For troubleshooting this will be a practical feature, but don't enable this on all servers per default. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
6fed241
to
3397837
Compare
The different cgroup levels have different flush completion times. Thus add per level in ebpf_exporter_cgroup_rstat_flush_latency latency tracking histogram via level label. This generates less data than CONFIG_TRACK_PER_CGROUP_FLUSH. And gives us an indicator if some levels are showing an abnormal latency pattern. For troubleshooting the individual cgroups the CONFIG_TRACK_PER_CGROUP_FLUSH can be enabled. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Some example output running this in "localhost" mode
|
The build check bots cannot handle include <errno.h> it fails with: In file included from /usr/include/errno.h:25: In file included from /usr/include/features.h:526: /usr/include/x86_64-linux-gnu/gnu/stubs.h:7:11: fatal error: 'gnu/stubs-32.h' file not found # include <gnu/stubs-32.h> The fix is to include <linux/errno.h> Signed-off-by: Jesper Dangaard Brouer <[email protected]>
1f2bf1e
to
c5e15de
Compare
This PR implements example cgroup-rstat-flushing
Thanks to for @bobrik for reviewing as I need to learn how to use ebpf_exporter