
Expose info on Probe status #36

Open
fangel opened this issue Sep 28, 2018 · 6 comments

Comments


fangel commented Sep 28, 2018

  • Varnish: Varnish+ 4.1
  • prometheus_varnish_exporter: 1.4.1

I've been building a dashboard based on the Prometheus metrics exposed by this exporter. One of my goals was to try to make the dashboard that the Varnish Agent provides redundant. So far I've been able to replicate everything but one thing: the health-check probe status.

That is, for each backend, expose the number of successful probes, the threshold at which the backend will be marked unhealthy, and the total window size.
In the dashboard from Varnish Agent, you can see the status for each backend, e.g. "Healthy 8/8".

The reason this can't be gathered using this exporter is that the info isn't exposed by varnishstat – instead it needs to be retrieved using varnishadm.

Some example output from running varnishadm backend.list -p (backends defined by the goto director):

varnishadm -n [ident] backend.list -p
Backend name                   Admin      Probe
boot.dummy                     probe      Healthy (no probe)
boot.goto.00000000.(XX.YY.ZZ.WW).(http://service.name:80) probe      Healthy 8/8
  Current states  good:  8 threshold:  3 window:  8
  Average response time of good probes: 0.073743
  Oldest ================================================== Newest
  4444444444444444444444444444444444444444444444444444444444444444 Good IPv4
  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Good Xmit
  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR Good Recv
  HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH Happy
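
(For whoever ends up implementing this: below is a minimal sketch, in Go, of how that output could be captured programmatically – it assumes varnishadm is on the PATH, the -n handling is only illustrative, and nothing like this exists in the exporter yet.)

package main

import (
	"fmt"
	"os/exec"
)

// backendListProbes runs `varnishadm backend.list -p` and returns its raw
// stdout for later parsing. ident is optional and maps to varnishadm's -n flag.
func backendListProbes(ident string) (string, error) {
	args := []string{}
	if ident != "" {
		args = append(args, "-n", ident)
	}
	args = append(args, "backend.list", "-p")
	out, err := exec.Command("varnishadm", args...).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("varnishadm failed: %v: %s", err, out)
	}
	return string(out), nil
}

func main() {
	out, err := backendListProbes("")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Print(out)
}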

Ideally, this could be exposed in a handful of new metrics, like

varnish_backend_probe_good{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 8
varnish_backend_probe_threshold{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 3
varnish_backend_probe_window{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 8
varnish_backend_probe_avg_response_seconds{backend="goto.00000000.(XX.YY.ZZ.WW)",server="http://service.name:80"} 0.073743
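
A rough sketch of how gauges with those names and labels could be registered and set with the Prometheus Go client (client_golang) – the metric names just follow my proposal above and aren't part of the exporter yet:

package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauges for the proposed probe metrics.
var (
	probeLabels = []string{"backend", "server"}

	probeGood = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "varnish_backend_probe_good",
		Help: "Number of good probes in the current window.",
	}, probeLabels)
	probeThreshold = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "varnish_backend_probe_threshold",
		Help: "Probe threshold below which the backend is marked sick.",
	}, probeLabels)
	probeWindow = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "varnish_backend_probe_window",
		Help: "Size of the probe window.",
	}, probeLabels)
	probeAvgResponse = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "varnish_backend_probe_avg_response_seconds",
		Help: "Average response time of good probes.",
	}, probeLabels)
)

func init() {
	prometheus.MustRegister(probeGood, probeThreshold, probeWindow, probeAvgResponse)
}

// recordProbe sets the values parsed from varnishadm for one backend.
func recordProbe(backend, server string, good, threshold, window, avgResp float64) {
	labels := prometheus.Labels{"backend": backend, "server": server}
	probeGood.With(labels).Set(good)
	probeThreshold.With(labels).Set(threshold)
	probeWindow.With(labels).Set(window)
	probeAvgResponse.With(labels).Set(avgResp)
}

func main() {
	// Example values taken from the backend.list output above.
	recordProbe("goto.00000000.(XX.YY.ZZ.WW)", "http://service.name:80", 8, 3, 8, 0.073743)
}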

I've never touched Go before, but if this type of information is desired by others, I wouldn't mind trying to create a PR for scraping varnishadm to expose these metrics...

Kind regards
Morten

@jonnenauha (Owner)

I'm not sure if it's possible to do what you want with just the varnishstat output this exporter is using.

Only "probe" metric I can find from my servers is

"VBE.my_backend.happy": {
    "description": "Happy health probes",
    "type": "VBE", "ident": "my_backend", "flag": "b", "format": "b",
    "value": 18446744073709551615
},

I use this to provide varnish_backend_up = 0|1. This is derived from the latest happy probe value; it is extracted from that uint like this: https://github.com/jonnenauha/prometheus_varnish_exporter/blob/master/varnish.go#L142-L164

I don't think the threshold and window can be known from varnishstat. Do you know where I could query those values in a programmatic way?

From this Varnish code, it looks like the window of 8 is hard-coded and can't be changed: https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishstat/varnishstat_curses.c#L680 (the code that renders the varnishstat stdout visualization).

We could then easily loop over those 8 bits and provide at least varnish_backend_probe_good = [0,8] for each server.
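
Something along these lines, perhaps – a sketch that assumes the happy value is a bitmap of recent probe results with the most recent probe in the lowest bit (which is how I read the varnish.go code linked above); treat that bit order as an assumption to verify:

package main

import "fmt"

// probeGood counts how many of the last `window` probes were happy,
// reading the low `window` bits of the VBE.*.happy bitmap.
func probeGood(happy uint64, window int) int {
	good := 0
	for i := 0; i < window; i++ {
		if happy&(uint64(1)<<uint(i)) != 0 {
			good++
		}
	}
	return good
}

func main() {
	// 18446744073709551615 is all 64 bits set, i.e. every recent probe happy.
	fmt.Println(probeGood(18446744073709551615, 8)) // prints 8
}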


fangel commented Sep 28, 2018

No, sadly the info isn't exposed by varnishstat – to get the full info, the exporter would need to scrape varnishadm backend.list -p as well.


fangel commented Sep 28, 2018

(Which I realise is a horrible format to try to parse, because it's meant to be human-readable – so it would be quite fragile and could break with any Varnish update that changes the format of this text.)


fangel commented Sep 28, 2018

As an illustration, the graph I would love to be able to generate would look something like the following:

[Screenshot: example probe-status graph, 2018-09-28 15:01]

This would indicate a short-lived error (one that lasted for 4 scrape cycles).

It would be a combination of three queries, maybe like

A: varnish_backend_probe_good{instance="X"}
B: varnish_backend_probe_window{instance="X"} - varnish_backend_probe_good{instance="X"}
C: varnish_backend_probe_window{instance="X"} - varnish_backend_probe_threshold{instance="X"}

A and B would be a stepped, stacked graph, and C would then be a line-graph.

@jonnenauha (Owner)

Hmm, looks like it's not hard-coded to 8. I guess that is the max window and the most that particular tool will render. I get this for my prod Varnish:

Current states good: 5 threshold: 3 window: 5

It's quite horrible to parse, but one could do a simple regexp for that. But if it can change per server, then it will require more robust logic to parse it for each server. Doing that with indentation or something is quite fragile and could change between Varnish versions.


fangel commented Sep 28, 2018

It would be very error-prone and fragile to parse the human-readable text, yes.

I wonder what the most resilient way would be, though. Maybe assuming that the good, threshold, and window counts are always on the first line of each block, and that the avg. response time is always on the second line.

If that's the assumption, then a regex like /\s([^:\s]+):\s(\d+)/g could be used to extract the [type]: [count] info.

As for the avg. response line, it would probably just be a matter of looking at the tail end of the second line.


As for identifying the backends that have probes (not all backends do – see the first line in my example output in the issue), it would be a question of finding a block of "line starting at char 0, then indented lines and then a blank line" or something along those lines.
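
To make that concrete, here is a rough sketch in Go of that parsing strategy: unindented lines start a new backend block, and the indented "Current states" and "Average response time" lines are matched with regexps. It only assumes the layout in my example output above (the sample backend name below is made up) and would definitely need hardening across Varnish versions:

package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

type probeStats struct {
	Good, Threshold, Window int
	AvgResponse             float64
}

var (
	countsRe = regexp.MustCompile(`([^:\s]+):\s+(\d+)`)
	avgRe    = regexp.MustCompile(`Average response time of good probes:\s+([\d.]+)`)
)

// parseBackendList extracts per-backend probe stats from the output of
// `varnishadm backend.list -p`. Backends without probes never get an entry
// because they have no "Current states" line.
func parseBackendList(out string) map[string]probeStats {
	stats := map[string]probeStats{}
	current := ""
	scanner := bufio.NewScanner(strings.NewReader(out))
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		// An unindented, non-header line starts a new backend block.
		case line != "" && !strings.HasPrefix(line, " ") && !strings.HasPrefix(line, "Backend name"):
			current = strings.Fields(line)[0]
		case current != "" && strings.Contains(line, "Current states"):
			s := stats[current]
			for _, m := range countsRe.FindAllStringSubmatch(line, -1) {
				n, _ := strconv.Atoi(m[2])
				switch m[1] {
				case "good":
					s.Good = n
				case "threshold":
					s.Threshold = n
				case "window":
					s.Window = n
				}
			}
			stats[current] = s
		case current != "" && strings.Contains(line, "Average response time"):
			if m := avgRe.FindStringSubmatch(line); m != nil {
				s := stats[current]
				s.AvgResponse, _ = strconv.ParseFloat(m[1], 64)
				stats[current] = s
			}
		}
	}
	return stats
}

func main() {
	// A tiny sample in the same shape as the output quoted earlier.
	sample := "Backend name  Admin  Probe\n" +
		"boot.example  probe  Healthy 8/8\n" +
		"  Current states  good:  8 threshold:  3 window:  8\n" +
		"  Average response time of good probes: 0.073743\n"
	fmt.Printf("%+v\n", parseBackendList(sample))
}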

For the sake of having some more test-data, here's the full output of varnishadm backend.list -p on one of our hosts: https://gist.github.com/fangel/c24aa039765336b25bb9a4685df61dfd
