Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Learning ebpf_exporter by adding examples/cgroup-rstat-flushing #457

Merged
merged 33 commits into from
Aug 19, 2024
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
20e65de
Add examples/cgroup-rstat-flushing basic flush counting
netoptimizer Jul 29, 2024
043a0d9
cgroup-rstat-flushing: wrong implementation for decoders
netoptimizer Aug 1, 2024
793ed07
cgroup-rstat-flushing: simple locked counter
netoptimizer Aug 1, 2024
c8bbfaa
cgroup-rstat-flushing: Add seperate counter for yield case
netoptimizer Aug 1, 2024
727a241
cgroup-rstat-flushing: fix yaml config
netoptimizer Aug 1, 2024
c6ff257
cgroup-rstat-flushing: New contended counter per level
netoptimizer Aug 1, 2024
f6855b0
cgroup-rstat-flushing: rename cgroup_rstat_locked_{state,total}
netoptimizer Aug 1, 2024
99c6b9c
cgroup-rstat-flushing: CI system missing tracepoint
netoptimizer Aug 2, 2024
f5dabd2
cgroup-rstat-flushing: trivial whitespace fix in yaml
netoptimizer Aug 2, 2024
623c9ad
cgroup-rstat-flushing: Fix type for level_key used in map lookup
netoptimizer Aug 2, 2024
bd96abc
cgroup-rstat-flushing: use longer define name
netoptimizer Aug 2, 2024
e30576c
cgroup-rstat-flushing: Fix comment spell out cgrp to cgroup
netoptimizer Aug 2, 2024
7168fd3
cgroup-rstat-flushing: use increment_map_nosync
netoptimizer Aug 2, 2024
9959e13
cgroup-rstat-flushing: Attempt with complex lock_key type
netoptimizer Aug 6, 2024
ee3de80
cgroup-rstat-flushing: also encode yield in lock_key_t
netoptimizer Aug 6, 2024
eb44da9
cgroup-rstat-flushing: cleanups
netoptimizer Aug 6, 2024
c32df50
cgroup-rstat-flushing: measure time waiting for lock
netoptimizer Aug 6, 2024
b02f687
cgroup-rstat-flushing: add lock hold time histogram
netoptimizer Aug 6, 2024
dc4ba3e
cgroup-rstat-flushing: adjust histogram resolution to 0.1 microsec
netoptimizer Aug 6, 2024
eaba651
cgroup-rstat-flushing: rename variable delta_usec
netoptimizer Aug 7, 2024
824cb1d
cgroup-rstat-flushing: Obey clang-format rules
netoptimizer Aug 7, 2024
d9a98b3
cgroup-rstat-flushing: Add histogram for flush time latency
netoptimizer Aug 7, 2024
139acaf
cgroup-rstat-flushing: record flush time spend per cgroup
netoptimizer Aug 7, 2024
62bdba0
cgroup-rstat-flushing: Adjustments to obey clang-format rules
netoptimizer Aug 8, 2024
b9cf32c
cgroup-rstat-flushing: cgroup_rstat_flush_nanoseconds_{sum,count}
netoptimizer Aug 15, 2024
4f14b2a
cgroup-rstat-flushing: histogram maps needs extra room for _sum
netoptimizer Aug 15, 2024
054842d
cgroup-rstat-flushing: LRU maps handling
netoptimizer Aug 16, 2024
b80cdd8
cgroup-rstat-flushing: extend timestampe helpers with key
netoptimizer Aug 16, 2024
7a12be4
cgroup-rstat-flushing: set MAX_ERROR_TYPES correctly
netoptimizer Aug 16, 2024
ebe07e7
cgroup-rstat-flushing: handle reload on live system
netoptimizer Aug 16, 2024
3397837
cgroup-rstat-flushing: Default disable per cgroup recording
netoptimizer Aug 19, 2024
1f21ae1
cgroup-rstat-flushing: per cgroup level histograms for flush time
netoptimizer Aug 19, 2024
c5e15de
cgroup-rstat-flushing: include errno.h issues
netoptimizer Aug 19, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
139 changes: 139 additions & 0 deletions examples/cgroup-rstat-flushing.bpf.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
/*
* Measure cgroup rstat (recursive stats) flushing overhead and latency.
*
* Loosely based on bpftrace script cgroup_rstat_tracepoint.bt
* - https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
*
* Depends on tracepoints added in kernel v6.10
*/

#include <vmlinux.h>
#include <bpf/bpf_tracing.h>
#include "maps.bpf.h"

#define MAX_CGRP_LEVELS 5
netoptimizer marked this conversation as resolved.
Show resolved Hide resolved

struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tabs are making clang-format unhappy in CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO then BPF restricted-C code should be written with same tabs style as kernel code.
Lets change for formatting requirements, please!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it for another issue. For now let's just make clang-format happy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This projects .clang-format file states:
# Minimal format loosely matching the kernel style.
IMHO it doesn't match kernel style at all...

As it contains:

IndentWidth: 4
TabWidth: 4
UseTab: Never

Looking at kernel source in .clang-format this should have been:

IndentWidth: 8
TabWidth: 8
UseTab: Always

I will respect that this project requires its own special clang-format rule and obey to these, but I still think they are wrong.

__uint(max_entries, MAX_CGRP_LEVELS + 1);
__type(key, u32);
__type(value, u64);
} cgroup_rstat_flush_total SEC(".maps");

/* Total counter for obtaining lock together with contended state
*
* This counter also contains "yield" case. To determine "normal" lock
* case subtract "yield" counter in prometheus query.
*/
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 2); /* contended state used as key */
__type(key, u32);
__type(value, u64);
} cgroup_rstat_locked_total SEC(".maps");

/* Counter for obtaining lock again after yield (and contended state).
*
* Kernel can yield the rstat lock when walking individial CPU stats.
* This leads to "interresting" concurrency issues. Thus, keep
* seperate counter for "yield" cases to help diagnose.
*/
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 2); /* contended state used as key */
__type(key, u32);
__type(value, u64);
} cgroup_rstat_locked_yield SEC(".maps");

/* Counter for lock contended case, recorded per cgroup level */
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, MAX_CGRP_LEVELS + 1);
__type(key, u32);
__type(value, u64);
} cgroup_rstat_lock_contended SEC(".maps");

/** Measurement#1: lock rates
* =========================
For locks, the problematic/interesting case is when the lock was contended.

Simply counting the lock+unlock is complicated by yielding the lock in the main
flushing loop (in cgroup_rstat_flush_locked()). The tracepoints "cpu" argument
will be (minus) -1 when the lock is not a yielded lock.

Concern: The lock rates will vary (a lot) and aggregating this as an average
will not capture spikes. Especially given Prometheus capture intervals only
happens every 53 seconds.

Q: will lock "type=yield" be a good idea?

*/

SEC("tp_btf/cgroup_rstat_locked")
netoptimizer marked this conversation as resolved.
Show resolved Hide resolved
int BPF_PROG(rstat_locked, struct cgroup *cgrp, int cpu, bool contended)
{
u32 key = contended;
u64 *cnt;

read_array_ptr(&cgroup_rstat_locked_total, &key, cnt);
netoptimizer marked this conversation as resolved.
Show resolved Hide resolved
(*cnt)++;

if (cpu >= 0) {
read_array_ptr(&cgroup_rstat_locked_yield, &key, cnt);
(*cnt)++;
}

/* What cgrp level is interesting, but I didn't manage to encode it in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cgrp -> cgroup

Copy link
Contributor Author

@netoptimizer netoptimizer Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure to nit.
I was hoping to get your input to the contents of the comment:

	/* What cgrp level is interesting, but I didn't manage to encode it in
	 * above counters.  As contended case is the most interesting, have
	 * level counter for contended.
	 */

When coding BPF, I usually do one map lookup to get a stats struct, that I will update with multiple stats.
Here I end-up doing 3 map lookups for single counters.

When I started coding I though I could represent something like:

ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="0"} 6558
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="0"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="1"} 34
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="1"} 123
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="5"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="5"} 0

Question: But ...

  1. I cannot figure out how to get ebpf_exporter to generate this?
  2. I don't know if these counters makes sense for Prometheus ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if these counters makes sense for Prometheus ?

They do

I cannot figure out how to get ebpf_exporter to generate this?

Each line here is a separate key in a single map called ebpf_exporter_cgroup_rstat_locked_total:

ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="0"} 6558
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="0"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="1"} 34
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="1"} 123
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="2"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="3"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="4"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="0", level="5"} 0
ebpf_exporter_cgroup_rstat_locked_total{contended="1", level="5"} 0

The mapping is effectively:

  • {contended="0", level="0"} -> 6558
  • {contended="1", level="0"} -> 0
  • {contended="0", level="1"} -> 34
  • ...

To encode each key, you turn the raw bytes of the key into the values with the help of decoders.

Working backwards from what you expect on the output, you want something like this in the config:

    - name: cgroup_rstat_locked_total
      help: Total number of times rstat lock was obtainted and contended state
      labels:
        - name: contended
          size: 1
          decoders:
            - name: uint
        - name: level
           size: 1
           decoders:
            - name: uint

On the kernel side you'd want something like this to represent the key:

struct key_t {
    u8 contended;
    u8 level;
};

The names here do not matter, only sizes do (u8 corresponds to size: 1 in the config).

Given that your keys are complex structs and not small numbers, you cannot store these in an array, so you need a hashmap:

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 128);
    __type(key, struct key_t);
    __type(value, u64);
} cgroup_rstat_locked_total SEC(".maps");

To actually increment a value, you need to construct the key and to the increment in the map:

// construct however you need to
struct key_t key = {};
key.contended = 1;
key.level = 2;

increment_map_nosync(&cgroup_rstat_locked_total, &key, 1);

Here's an example with a complex key:

Hope this makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These suggestions worked for encoding more information ("labels" in Prometheus terms).
I've coded it up and pushed as part of this patchset/PR.
(p.s. I've also encoded "yield" state)

* above counters. As contended case is the most interesting, have
* level counter for contended.
*/
if (contended) {
bobrik marked this conversation as resolved.
Show resolved Hide resolved
u32 level = cgrp->level;

if (level > MAX_CGRP_LEVELS)
level = MAX_CGRP_LEVELS;

read_array_ptr(&cgroup_rstat_lock_contended, &level, cnt);
(*cnt)++;
}

return 0;
}


/** Measurement#2: latency/delay caused by flush
* ============================================
Measure both time waiting for the lock, and time spend holding the lock.

This should be a histogram (for later latency heatmap).

*/


/** Measurement#3: flush rate
* =========================
Simply count invocations of cgroup_rstat_flush_locked().

Concern: The flush rates will vary (a lot), e.g. then cadvisor collects stats
for all cgroups in the system, or when kswapd does concurrent flushing (of root
cgroup). Averaging this (over approx 1 minute) gives the wrong impression.

Mitigation workaround: Store counters per cgroup "level" (level=0 is root).
This will allow us to separate root-cgroup flushes from cadvisor walking all
cgroup levels.

*/
SEC("fentry/cgroup_rstat_flush_locked")
int BPF_PROG(cgroup_rstat_flush_locked, struct cgroup *cgrp)
{
u64 level_key = cgrp->level;
netoptimizer marked this conversation as resolved.
Show resolved Hide resolved

if (level_key > MAX_CGRP_LEVELS)
level_key = MAX_CGRP_LEVELS;

increment_map_nosync(&cgroup_rstat_flush_total, &level_key, 1);

return 0;
}

char LICENSE[] SEC("license") = "GPL";
31 changes: 31 additions & 0 deletions examples/cgroup-rstat-flushing.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
metrics:
counters:
- name: cgroup_rstat_flush_total
help: Total number of times cgroup rstat were flushed (recorded per level)
labels:
- name: level
size: 4
decoders:
- name: uint
- name: cgroup_rstat_locked_total
help: Total number of times rstat lock was obtainted and contended state
labels:
- name: contended
size: 4
decoders: # contended boolean converted to 0 and 1
- name: uint
- name: cgroup_rstat_locked_yield
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cgroup_rstat_locked_yields_total, see https://prometheus.io/docs/practices/naming/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link , as a Prometheus newbie I didn't realize that the _suffix had a meaning.
I now realize that general strucure is library_name_unit_suffix.

Jakub recommended: Prometheus: Up & Running, 2nd Edition

help: Number of times rstat lock was obtainted again after yield and contended state
labels:
- name: contended
size: 4
decoders: # contended boolean converted to 0 and 1
- name: uint
- name: cgroup_rstat_lock_contended
netoptimizer marked this conversation as resolved.
Show resolved Hide resolved
help: Lock contention counters per cgroup level
labels:
- name: level
size: 4
decoders:
- name: uint

netoptimizer marked this conversation as resolved.
Show resolved Hide resolved
Loading