Expose info on Probe status #36

fangel · 2018-09-28T10:13:06Z

Varnish Varnish+ 4.1
prometheus_varnish_exporter 1.4.1

I've been building up a dashboard based on the Prometheus metrics exposed by this exporter. One of my goals was to try and make the dashboard that the Varnish Agent provides redundant. So far I've been able to replicate everything except one thing: the health-check probe status.

That is, for each backend, I'd like to have the number of successful probes, the threshold for when the service will be marked unhealthy, and the total window size.
In the dashboard from Varnish Agent, you can see the status for each backend, e.g. "Healthy 8/8".

The reason this can't be gathered using this exporter is that the info isn't exposed by varnishstat; instead, it needs to be retrieved using varnishadm.

Some example output from running varnishadm backend.list -p (backends defined by the goto-director):

varnishadm -n [ident] backend.list -p
Backend name                   Admin      Probe
boot.dummy                     probe      Healthy (no probe)
boot.goto.00000000.(XX.YY.ZZ.WW).(http://service.name:80) probe      Healthy 8/8
  Current states  good:  8 threshold:  3 window:  8
  Average response time of good probes: 0.073743
  Oldest ================================================== Newest
  4444444444444444444444444444444444444444444444444444444444444444 Good IPv4
  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit
  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv
  HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy

Ideally, this could be exposed in a handful of new metrics, like

varnish_backend_probe_good{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 8
varnish_backend_probe_threshold{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 3
varnish_backend_probe_window{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 8
varnish_backend_probe_avg_response_seconds{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 0.073743

I've never touched Go before, but if this type of information is desired by others, I wouldn't mind trying to create a PR for scraping varnishadm to expose these metrics...

Kind regards
Morten


jonnenauha · 2018-09-28T12:47:45Z

I'm not sure if it's possible to do what you want using just the varnishstat output this exporter relies on.

The only "probe" metric I can find on my servers is:

"VBE.my_backend.happy": {
    "description": "Happy health probes",
    "type": "VBE", "ident": "my_backend", "flag": "b", "format": "b",
    "value": 18446744073709551615
},

I use this to provide varnish_backend_up = 0|1, which is the latest happy probe value. It is extracted from that uint like this: https://github.com/jonnenauha/prometheus_varnish_exporter/blob/master/varnish.go#L142-L164

I don't think the threshold and window can be known from varnishstat. Do you know where I could query those values in a programmatic way?

From this Varnish code, it looks like the window of 8 is hard-coded and can't be changed: https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishstat/varnishstat_curses.c#L680 (the code that renders the varnishstat stdout visualization).

We could then easily loop over those 8 bits and provide at least varnish_backend_probe_good = [0,8] for each server.
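A minimal sketch of that counting step in Go. It assumes that bit 0 of the happy value is the newest probe and each higher bit is one probe older; that ordering is an assumption here, not something confirmed in this thread. The function name goodProbes is hypothetical and not part of the exporter. Instead of an explicit loop it uses a population count over the masked value, and it takes the window as a parameter rather than fixing it at 8.

```go
package main

import (
	"fmt"
	"math/bits"
)

// goodProbes counts the successful probes among the `window` most recent
// results stored in the VBE.<backend>.happy bitmap.
// Assumption: bit 0 is the newest probe, each higher bit is one probe older.
func goodProbes(happy uint64, window uint) int {
	if window >= 64 {
		return bits.OnesCount64(happy)
	}
	mask := uint64(1)<<window - 1 // keep only the lowest `window` bits
	return bits.OnesCount64(happy & mask)
}

func main() {
	// Value from the varnishstat sample: all 64 bits set.
	fmt.Println(goodProbes(18446744073709551615, 8)) // 8
	// Two newest probes failed, the six before them succeeded.
	fmt.Println(goodProbes(0xFC, 8)) // 6
}
```

Note that this only yields the "good" count; the threshold and window still cannot be read from the bitmap itself.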

fangel · 2018-09-28T12:53:45Z

No, sadly the info isn't exposed by varnishstat - to get the full info, it would need to scrape varnishadm backend.list -p as well.

fangel · 2018-09-28T12:56:58Z

(Which I realise is a horrible format to try and parse, because it's meant to be human-readable, and thus parsing it would be quite fragile: it could break if any Varnish update changes the format of this text.)

fangel · 2018-09-28T13:04:40Z

As an illustration, the graph that I would love to be able to generate would look something like the following:

Which would indicate a short-lived error (lasted for 4 scrape-cycles).

It would be a combination of two queries, maybe like

A: varnish_backend_probe_good{instance="X"}
B: varnish_backend_probe_window{instance="X"} - varnish_backend_probe_good{instance="X"}
C: varnish_backend_probe_window{instance="X"} - varnish_backend_probe_threshold{instance="X"}

A and B would be a stepped, stacked graph, and C would then be a line-graph.

jonnenauha · 2018-09-28T13:05:34Z

Hmm, looks like it's not hard-coded to 8. I guess that is the max window and the most that particular tool will render. I get this for my prod Varnish:

Current states good: 5 threshold: 3 window: 5

It's quite horrible to parse, but one could do a simple regexp for that. But if it can change per server, then it will require more robust logic to parse it for each server. Doing that with indentation or something similar is quite fragile and could change between Varnish versions.

fangel · 2018-09-28T13:45:48Z

It would be very error-prone and fragile to parse the human-readable text, yes.

I wonder what would be the most resilient way, though. Maybe assuming that the good, threshold, window counts are always on line 1 of each block, and that the avg. response-time is always on line two.

If that's the assumption, then a regex like /\s([^:\s]+):\s+(\d+)/g could be used to extract the [type]: [count] info (note the \s+ after the colon, since the example output pads the counts with more than one space).
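A rough Go sketch of that extraction, under the assumption that the "Current states" line keeps the shape shown in the example output. The names statePair and parseStates are made up for illustration. The pattern uses \s+ between the colon and the number, because the example output pads the counts with extra spaces.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// statePair matches "<name>: <count>" pairs; \s+ tolerates the padded
// spacing seen in the example output ("good:  8").
var statePair = regexp.MustCompile(`([^:\s]+):\s+(\d+)`)

// parseStates extracts the name/count pairs (good, threshold, window)
// from a single "Current states" line.
func parseStates(line string) map[string]int {
	out := make(map[string]int)
	for _, m := range statePair.FindAllStringSubmatch(line, -1) {
		n, err := strconv.Atoi(m[2])
		if err != nil {
			continue // count too large to fit an int; skip it
		}
		out[m[1]] = n
	}
	return out
}

func main() {
	s := parseStates("  Current states  good:  8 threshold:  3 window:  8")
	fmt.Println(s["good"], s["threshold"], s["window"]) // 8 3 8
}
```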

As for the avg. response line, it would probably just be a matter of looking at the tail end of the second line.

As for identifying the backends that have probes (not all backends do – see the first line in my example output in the issue), it would be a question of finding a block of "line starting at char 0, then indented lines and then a blank line" or something along those lines.
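The block detection described above could be sketched in Go roughly as follows. This is only an illustration under several assumptions: backend lines start at column 0, their first whitespace-separated field is the backend name, detail lines are indented, and the states and average-response lines look like the example output in the issue. The type probeInfo and the function parseBackendList are hypothetical names. Rather than relying on line positions within a block, the sketch matches the two detail lines by their text.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// probeInfo holds the values parsed for one backend (hypothetical type).
type probeInfo struct {
	Backend                 string
	Good, Threshold, Window int
	AvgResponse             float64
}

var (
	statesRe = regexp.MustCompile(`good:\s*(\d+)\s+threshold:\s*(\d+)\s+window:\s*(\d+)`)
	avgRe    = regexp.MustCompile(`Average response time of good probes:\s*([0-9.]+)`)
)

// parseBackendList returns one entry per backend that has a "Current states"
// line. Backends without probe details (e.g. "no probe") are skipped.
func parseBackendList(out string) []probeInfo {
	var res []probeInfo
	name := "" // backend whose block we are currently inside
	cur := -1  // index in res for that backend, -1 until a states line is seen
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "" || strings.HasPrefix(line, "Backend name") {
			name, cur = "", -1 // blank line or table header ends any block
			continue
		}
		if line[0] != ' ' && line[0] != '\t' {
			name, cur = strings.Fields(line)[0], -1 // new backend line
			continue
		}
		if name == "" {
			continue // indented line outside of any backend block
		}
		if m := statesRe.FindStringSubmatch(line); m != nil {
			g, _ := strconv.Atoi(m[1])
			t, _ := strconv.Atoi(m[2])
			w, _ := strconv.Atoi(m[3])
			res = append(res, probeInfo{Backend: name, Good: g, Threshold: t, Window: w})
			cur = len(res) - 1
		} else if m := avgRe.FindStringSubmatch(line); m != nil && cur >= 0 {
			if f, err := strconv.ParseFloat(m[1], 64); err == nil {
				res[cur].AvgResponse = f
			}
		}
	}
	return res
}

// sample is an abbreviated copy of the example output from the issue.
const sample = `Backend name                   Admin      Probe
boot.dummy                     probe      Healthy (no probe)
boot.goto.00000000.(XX.YY.ZZ.WW).(http://service.name:80) probe      Healthy 8/8
  Current states  good:  8 threshold:  3 window:  8
  Average response time of good probes: 0.073743
  Oldest ================================================== Newest
`

func main() {
	for _, p := range parseBackendList(sample) {
		fmt.Printf("%s good=%d threshold=%d window=%d avg=%g\n",
			p.Backend, p.Good, p.Threshold, p.Window, p.AvgResponse)
	}
}
```

Matching on the text of each detail line keeps the parser independent of line order, but it would still break if Varnish renames those labels.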

For the sake of having some more test-data, here's the full output of varnishadm backend.list -p on one of our hosts: https://gist.github.com/fangel/c24aa039765336b25bb9a4685df61dfd
