
Zipkin 2.10 completes our v2 migration

@codefromthecrypt released this 07 Jul 06:49

Zipkin 2.10 drops the v1 library dependency and the v1 http read endpoints. Those using the io.zipkin.java:zipkin (v1) Java library should transition to io.zipkin.zipkin2:zipkin, as the next release of Zipkin will stop publishing updates to the former. Don't worry: Zipkin server will continue accepting all formats, even v1 thrift, for the foreseeable future.

Below is the story of our year-long transition to a v2 data format, ending with what we've done in version 2.10 of our server (mostly in the UI). This is largely a story of how you address a big upgrade in a big ecosystem when almost everyone involved is a volunteer.

Until a year ago, the OpenZipkin team fielded (and asked ourselves) many confused questions about our thrift data format. Why do service endpoints repeat all the time? What are binary annotations? What do we do if we have multiple similar events or binary annotations? Let's dig into "binary annotations", as many reading this probably still have no idea what they are!

Binary annotations were sophisticated tags, for example an http status. While the name is confusing, most of the problems came from being too flexible, and that flexibility led to bugs. Specifically, it was a list of elements with more type diversity than proved useful. While a noble aim that made sense at the time, a binary annotation could be a string, binary, or various bit lengths of integer or floating-point number. Even things that seemed obvious could be thwarted. For example, some would accidentally choose the type binary for a string, effectively disabling search. Seemingly simple things like numbers were bug factories: folks would add random numbers as an i64, not realizing that you can't fit one in a json number without quoting it or losing precision.

Things that seemed like low-hanging fruit were not. Take http status, for example. Clearly, this is a number, but which? Is it 16-bit (technically correct) or 32-bit (to avoid signed misinterpretation)? Could you search on it the way you want to (<200 || >299 && !404)? Tricky, right? And say someone sent it as a different type by accident: would it mess up your indexing if sent as a string (for some backends, definitely!)? Even if all of this were solved, Zipkin is an open ecosystem that includes private sites with their private code. How much time does it cost volunteers to help others troubleshoot code that can't be shared? How can we reduce the support burden while remaining open to third-party instrumentation?
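To make the i64 pitfall concrete, here's a minimal sketch (plain Java; the value is chosen for illustration) of why a 64-bit integer can't round-trip through an unquoted json number, which consumers typically parse as a double:

```java
public class JsonNumberPrecision {
  public static void main(String[] args) {
    // 2^53 + 1 is the first long that a double (and so an unquoted
    // json number) cannot represent exactly.
    long i64 = 9007199254740993L;
    double asJsonNumber = (double) i64;

    System.out.println(i64);                        // 9007199254740993
    System.out.println((long) asJsonNumber);        // 9007199254740992
    System.out.println(i64 == (long) asJsonNumber); // false: precision lost
  }
}
```

This is one reason quoting numbers as strings is the safer default: a quoted 64-bit id survives any json parser.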

That is the long-winded story of how our version 2 data format came about. We cleaned up our data model, simplifying it in an attempt to optimize for reliability and supportability over precision. For example, we scrapped "binary annotations" in favor of "tags", which can neither repeat nor use numeric types. There are disadvantages to these choices, but explaining them is cheap and the consequences are well understood. Last July, we started accepting a version 2 json format. Later, we added a protobuf representation.
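To get a feel for the simplified model, here's a minimal sketch using the io.zipkin.zipkin2:zipkin library mentioned above (the ids, names, and values are illustrative):

```java
import zipkin2.Endpoint;
import zipkin2.Span;
import zipkin2.codec.SpanBytesEncoder;

public class V2SpanExample {
  public static void main(String[] args) {
    // In the v2 model, "binary annotations" are gone: tags are a flat
    // string-to-string map, so keys can't repeat and values can't be numeric.
    Span span = Span.newBuilder()
        .traceId("86154a4ba6e91385")
        .id("4d1e00c0db9010db")
        .kind(Span.Kind.SERVER)
        .name("get")
        .timestamp(1_530_000_000_000_000L) // epoch microseconds
        .duration(207_000L)                // microseconds
        .localEndpoint(Endpoint.newBuilder().serviceName("frontend").build())
        .putTag("http.status_code", "200") // always a string
        .build();

    // Encode in the v2 json format the server started accepting last July
    System.out.println(new String(SpanBytesEncoder.JSON_V2.encode(span)));
  }
}
```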

Now, why are we talking about a data format we started supporting a year ago? Because we just finished! It takes a lot of effort to carefully roll something out into an ecosystem as large as Zipkin's while being respectful of the time impact on our volunteers and site owners.

At first, we ingested our simplified format on the server side. This "unlocked" libraries, regardless of how they were written and who wrote them, into simpler data: data that closely resembles the tracing operations themselves. We next focused on libraries that facilitate sending and receiving data, notably brown-field changes (options), so as to neither disrupt folks nor scare them off. We wanted the pipes that send data to become "v2 ready", so owners could use the new and old formats simultaneously rather than being expected to make an unrealistic synchronized switch of data format. After this, we migrated our storage and collector code, so that internal functionality resembles v2 constructs even while reading or writing old data in old schemas. Finally, in version 2.10, we changed the UI to consume only v2 data.
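As a sketch of what "v2 ready" pipes look like, here's how a sender built on the io.zipkin.reporter2 libraries can post either format to the same server (the host and port are illustrative):

```java
import zipkin2.Span;
import zipkin2.codec.SpanBytesEncoder;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.okhttp3.OkHttpSender;

public class V2ReadyPipes {
  public static void main(String[] args) {
    // New format: v2 json posted to the v2 endpoint.
    OkHttpSender v2Sender =
        OkHttpSender.create("http://localhost:9411/api/v2/spans");
    AsyncReporter<Span> reporter = AsyncReporter.create(v2Sender);

    // During a migration, the same libraries can keep sending the old
    // format, as the server accepts both:
    //   OkHttpSender v1Sender =
    //       OkHttpSender.create("http://localhost:9411/api/v1/spans");
    //   AsyncReporter<Span> v1Reporter =
    //       AsyncReporter.builder(v1Sender).build(SpanBytesEncoder.JSON_V1);

    reporter.close(); // flush buffered spans before exit
  }
}
```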

So, what did the UI change include? What's interesting about that? Isn't the UI old? Let's start with the last question. While it's true the UI has only had facelifts and smaller visible features, there has certainly been work involved in keeping it going: for example, backporting tests, restructuring its internal routing, and adding configuration hooks and integration patterns. When you don't have UI staff, keeping things running is what you end up spending most time on! More to the point, before 2.10, all the interesting data conversion and processing logic happened in Java, on the api server: for example, merging of data and correcting clock skew. This set up a hard job for those emulating zipkin, at least those who emulated the read side. Custom read api servers or proxies can be useful in practice. Maybe you need to stitch in authorization or data-filtering logic; maybe your data is segmented. In short, while most read scenarios are supported out of the box, some advanced proxies exist for good reason.

Here's a real-life example: Yelp saves money by not sending trace data across paid links. In Amazon's cloud (and most others), if you send data from one availability zone to another, you pay for it. To reduce this type of cost, Yelp uses an island + aggregator pattern: trace data is saved locally in each zone, and traces are materialized across zones only when needed. At their site, this works particularly well because search doesn't use Zipkin anyway: they use a log-based tool to find trace IDs. Once they find a trace ID, they use Zipkin to view it, but doing so still requires data from all zones. To solve this, they made an aggregating read proxy. Before 2.10, that proxy had to do more than simple json re-bundling: they found our server did things like span merging and clock-skew correction. That code is complex and high maintenance, but it was needed for the UI to work correctly. Since 2.10 moves it into UI javascript, Yelp's read proxy becomes much simpler and easier to maintain. In summary, having more logic in the UI means less work for those with DIY api servers.
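To illustrate, here's a hypothetical aggregating read proxy in the spirit of Yelp's (this isn't their code, and the zone hosts are made up): it fans a lookup out to each zone's Zipkin server via the standard GET /api/v2/trace/{traceId} endpoint and simply re-bundles the json lists, leaving merging and skew correction to the 2.10 UI.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AggregatingTraceProxy {
  // Hypothetical per-zone Zipkin servers
  static final String[] ZONES = {
      "http://zipkin.zone-a.internal:9411",
      "http://zipkin.zone-b.internal:9411"
  };

  /** Returns one json array holding the spans found in every zone. */
  public static String getTrace(String traceId) throws Exception {
    StringBuilder merged = new StringBuilder("[");
    for (String zone : ZONES) {
      HttpURLConnection c = (HttpURLConnection)
          new URL(zone + "/api/v2/trace/" + traceId).openConnection();
      if (c.getResponseCode() != 200) continue; // zone has no data for this id
      String json = readUtf8(c.getInputStream()).trim();
      String body = json.substring(1, json.length() - 1).trim(); // strip [ ]
      if (body.isEmpty()) continue;
      if (merged.length() > 1) merged.append(',');
      merged.append(body); // no merging or skew correction: the UI does that
    }
    return merged.append(']').toString();
  }

  static String readUtf8(InputStream in) throws Exception {
    try (InputStream closed = in) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      for (int n; (n = closed.read(buf)) != -1; ) out.write(buf, 0, n);
      return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
  }
}
```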

Another advantage of having processing logic in the UI is better answering "what's wrong with this trace?" We know data can be missing or incorrect. When processing is done server-side, there is friction in deciding how to present errors: do you decorate the trace with synthetic data, use headers, or some enveloping? When that code is in the UI instead, such decisions are more flexible and don't impact the compatibility of others. While we've not done anything here yet, you can imagine it is now easier to show, with color or otherwise, that you are viewing "a bad trace". Things like this are extremely exciting, given our primary goal is usually to reduce the cost of support!

In conclusion, we hope that sharing our story has given you better insight into the OpenZipkin way of doing things, how we prioritize tasks, and how seriously we take support. If you are a happy user of Zipkin, find a volunteer who's helped you and thank them, star our repository, or get involved if you can. You can always find us on Gitter.