Metrics refactor

This document describes the fabio metrics layer and documents the transition from the go-metrics based layer to a more flexible approach. Once that transition has been completed the documentation for the transition will be removed.

Why change?

Fabio metrics started out as an implementation on top of the go-metrics library, mostly for Graphite, since that is what we were using at eCG. It became somewhat more flexible over time, but the design does not make it easy to add providers like DogStatsD, Prometheus and others which support tagged metrics.

Also, the go-metrics library aggregates histograms internally, which does not work well with providers like statsd and Circonus that aggregate histograms on the server. Fabio does not support multiple metrics providers simultaneously, which makes migration between metrics systems difficult. And last but not least, the go-metrics library hasn't seen significant updates in over a year. The last commit is from 28 Nov 2016.

Fabio Metrics <= 1.5.x

Fabio currently supports Graphite, statsd, Circonus and stdout for debugging.

The metric names fall into two groups: service metrics and internal metrics.

Service metric names are generated from the template defined in metrics.names, which defaults to <service>.<domain>.<path>.<host:port>

Internal metric names are hard-coded like http.status.code.200 or notfound.

All metric names can have a prefix, which is configured through a template defined in metrics.prefix and which defaults to <hostname>.<exec name>.
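The name generation described above can be sketched as a plain placeholder substitution. This is a minimal illustration and not fabio's actual code: the helpers `expandName` and `metricName` and all sample values are hypothetical, and the sketch does not sanitize characters such as dots or colons inside the substituted values, which a real implementation would probably need to do.

```go
package main

import (
	"fmt"
	"strings"
)

// expandName replaces every <key> placeholder in tmpl with its value.
// Hypothetical helper; the placeholder syntax follows the defaults
// quoted in the text above.
func expandName(tmpl string, values map[string]string) string {
	pairs := make([]string, 0, len(values)*2)
	for k, v := range values {
		pairs = append(pairs, "<"+k+">", v)
	}
	return strings.NewReplacer(pairs...).Replace(tmpl)
}

// metricName joins an optional prefix and a metric name with a dot.
func metricName(prefix, name string) string {
	if prefix == "" {
		return name
	}
	return prefix + "." + name
}

func main() {
	// Sample values are made up for this example.
	prefix := expandName("<hostname>.<exec name>", map[string]string{
		"hostname":  "lb-1",
		"exec name": "fabio",
	})
	name := expandName("<service>.<domain>.<path>.<host:port>", map[string]string{
		"service":   "svc-a",
		"domain":    "example_com",
		"path":      "foo",
		"host:port": "10_0_0_1_8080",
	})
	fmt.Println(metricName(prefix, name))
	fmt.Println(metricName(prefix, "notfound"))
}
```

Internal metrics such as notfound skip the first step and only receive the prefix.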

The Graphite and statsd providers provide aggregated histograms whereas the Circonus provider sends events to the server.

There are several issues open for additional providers:

Fabio currently provides the following metrics:

Depending on the metrics provider, timer aggregation happens either in the metrics library (go-metrics: statsd, Graphite) or in the system of the metrics provider (Circonus).

Name	Type	Description
`http.status.code.${status_code}`	timer	aggregation over all http requests per status code
`notfound`	counter	counts all http route lookup failures
`requests`	timer	aggregation of all http requests
`ws.conn`	gauge	current number of open web socket connections
`tcp.conn`	counter	counts the number of successful TCP proxy connections
`tcp.connfail`	counter	counts the number of failed TCP connections
`tcp.noroute`	counter	counts the number of TCP route lookup failures
`tcp_sni.conn`	counter	counts the number of successful TCP+SNI proxy connections
`tcp_sni.connfail`	counter	counts the number of failed TCP+SNI connections
`tcp_sni.noroute`	counter	counts the number of TCP+SNI route lookup failures
`{{ metrics.name }}`	timer
`{{ metrics.name }}.rx`	counter
`{{ metrics.name }}.tx`	counter

timer - counts events and provides an average throughput and latency number
counter - counts events and provides a monotonically increasing value
gauge - current value
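To make the three types concrete, here is a minimal in-memory sketch. The interface and type names (`Counter`, `Gauge`, `Timer`, `memCounter`, `memGauge`, `memTimer`) are assumptions made for this example and are not taken from fabio or go-metrics. The timer only tracks event count and mean latency; throughput over time is left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Counter counts events; its value only ever increases.
type Counter interface{ Inc(n int64) }

// Gauge holds the current value of something that can go up and down.
type Gauge interface{ Set(v int64) }

// Timer records the duration of an event.
type Timer interface{ Update(d time.Duration) }

// memCounter is a minimal in-process Counter.
type memCounter struct{ n int64 }

func (c *memCounter) Inc(n int64)  { atomic.AddInt64(&c.n, n) }
func (c *memCounter) Value() int64 { return atomic.LoadInt64(&c.n) }

// memGauge is a minimal in-process Gauge.
type memGauge struct{ v int64 }

func (g *memGauge) Set(v int64)  { atomic.StoreInt64(&g.v, v) }
func (g *memGauge) Value() int64 { return atomic.LoadInt64(&g.v) }

// memTimer aggregates in process: it keeps the event count and total time.
type memTimer struct {
	mu    sync.Mutex
	count int64
	total time.Duration
}

func (t *memTimer) Update(d time.Duration) {
	t.mu.Lock()
	t.count++
	t.total += d
	t.mu.Unlock()
}

// Count returns the number of recorded events.
func (t *memTimer) Count() int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.count
}

// Mean returns the average recorded duration, or 0 if nothing was recorded.
func (t *memTimer) Mean() time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.count == 0 {
		return 0
	}
	return t.total / time.Duration(t.count)
}

// Compile-time checks that the sketches satisfy the interfaces.
var (
	_ Counter = (*memCounter)(nil)
	_ Gauge   = (*memGauge)(nil)
	_ Timer   = (*memTimer)(nil)
)

func main() {
	var notfound memCounter
	var wsConn memGauge
	var requests memTimer

	notfound.Inc(1)
	wsConn.Set(3)
	requests.Update(10 * time.Millisecond)
	requests.Update(30 * time.Millisecond)

	// prints: 1 3 2 20ms
	fmt.Println(notfound.Value(), wsConn.Value(), requests.Count(), requests.Mean())
}
```

A provider that aggregates on the server would implement `Timer.Update` by forwarding each duration instead of summing it locally.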

New approach

A new metrics layer must be flexible enough to support aggregation in process or on the server. It needs to support flat namespaces and tags, and it needs to be compatible with existing fabio installations.

These metrics libraries are in use by other projects:

https://github.com/armon/go-metrics
https://github.com/go-kit/kit (metrics pkg)

armon/go-metrics supports circonus, graphite, statsd, statsite, datadog and prometheus.

go-kit/kit/metrics supports cloudwatch, dogstatsd, expvar, graphite, influx, pcp, prometheus, statsd. Circonus was supported but later removed because of flaky tests.

go-kit/kit/metrics is the best fit for what fabio provides today and what users want. Existing go-metrics implementations could be written as legacy drivers, if necessary.

The problem that go-kit does not solve, however, is name generation for the different metrics providers. Providers like Graphite and statsd, which do not support tags, need a flat namespace with the tag values encoded into the name of the metric. Tagged providers can have more generic names and carry the additional attributes as tags. On top of that, we also need to support the existing legacy metric names.

Fabio could make these names configurable with sensible defaults for each provider. However, this would add quite a number of config options which would almost never be changed. Also, we need to decide which attributes should be tagged and which should be part of the name and whether those attributes should be configurable at all or even for each provider.

Metric names could be evaluated at runtime, e.g. through the Go template engine. However, we would need to measure the allocation overhead of this evaluation since the code is in the hot path and is executed a lot.
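One way to get a first number for that overhead is `testing.AllocsPerRun` from the standard library, which reports the average number of allocations per call. The sketch below is only an assumption about how such a measurement could look: the `target` struct, its fields, the template text and `renderName` are invented for the example and do not reflect fabio's data model.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
	"text/template"
)

// target holds the attributes a name template may refer to.
// The field names are assumptions for this sketch.
type target struct {
	Service string
	Host    string
}

// nameTmpl is parsed once at startup so that only the execution
// is on the hot path.
var nameTmpl = template.Must(template.New("name").Parse("{{.Service}}.{{.Host}}"))

// renderName executes the template for one target and returns the name.
// It returns an empty string if the template cannot be executed.
func renderName(t target) string {
	var buf bytes.Buffer
	if err := nameTmpl.Execute(&buf, t); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	t := target{Service: "svc-a", Host: "10_0_0_1_8080"}
	fmt.Println(renderName(t))

	// Average allocations for one template evaluation over 1000 runs.
	allocs := testing.AllocsPerRun(1000, func() { renderName(t) })
	fmt.Printf("allocations per evaluation: %.0f\n", allocs)
}
```

If the number turns out to be too high, the rendered name could be computed once per route and cached instead of being evaluated on every request.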

Since providers are either tagged or not tagged we could provide two names for each metric and depending on which provider is used we use either the one or the other.

Legacy Name	Flat name	Tagged name
`http.status.code.${status_code}`	`http.status.code.${status_code}`	`http.status code:${status_code}`
`notfound`	`http.noroute`	`http.noroute`
`requests`	`http.requests`	`http.requests`
`ws.conn`	`ws.conn`	`ws.conn`
`tcp.conn`	`tcp.conn`	`tcp.conn`
`tcp.connfail`	`tcp.connfail`	`tcp.connfail`
`tcp.noroute`	`tcp.noroute`	`tcp.noroute`
`tcp_sni.conn`	`tcp_sni.conn`	`tcp_sni.conn`
`tcp_sni.connfail`	`tcp_sni.connfail`	`tcp_sni.connfail`
`tcp_sni.noroute`	`tcp_sni.noroute`	`tcp_sni.noroute`
`{{ metrics.name }}`	`{{ metrics.name }}`	`{{ metrics.tagged_name }} service:<svc> host:<host:port>`
`{{ metrics.name }}.rx`	`{{ metrics.name }}.rx`	`{{ metrics.tagged_name }}.rx service:<svc> host:<host:port>`
`{{ metrics.name }}.tx`	`{{ metrics.name }}.tx`	`{{ metrics.tagged_name }}.tx service:<svc> host:<host:port>`
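The two-names-per-metric idea from the table can be sketched as a small lookup keyed by the naming mode of the active provider. Everything here is hypothetical: the `mode` type, the `names` struct and `statusName` are illustrations of the approach, not an existing fabio API.

```go
package main

import (
	"fmt"
	"strconv"
)

// mode selects which naming scheme the active provider uses.
type mode int

const (
	legacy mode = iota // names used by fabio <= 1.5.x
	flat               // untagged providers, e.g. Graphite
	tagged             // providers with tag support, e.g. Prometheus
)

// names holds all spellings of one metric, one per column of the table.
type names struct {
	legacy, flat, tagged string
}

// pick returns the name for the given mode.
func (n names) pick(m mode) string {
	switch m {
	case flat:
		return n.flat
	case tagged:
		return n.tagged
	default:
		return n.legacy
	}
}

// routeMiss is the "notfound" row of the table.
var routeMiss = names{legacy: "notfound", flat: "http.noroute", tagged: "http.noroute"}

// statusName returns name and tags for the per-status-code timer.
// Untagged modes encode the code into the name; the tagged mode
// moves it into a "code" tag.
func statusName(m mode, code int) (string, []string) {
	c := strconv.Itoa(code)
	if m == tagged {
		return "http.status", []string{"code:" + c}
	}
	return "http.status.code." + c, nil
}

func main() {
	fmt.Println(routeMiss.pick(legacy), routeMiss.pick(flat))

	name, tags := statusName(flat, 200)
	fmt.Println(name, tags)

	name, tags = statusName(tagged, 200)
	fmt.Println(name, tags)
}
```

Keeping the legacy column as a third mode lets existing installations continue to emit the old names until they opt in to the new ones.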
