Releases: openzipkin/zipkin
Zipkin 2.3
Zipkin 2.3 allows querying across all services and introduces Cassandra 3 support
Exploring all data
In the past, the zipkin UI required specifying a service name. This is fine when you know which service you are interested in, but it isn't helpful when exploring all data. Now, Zipkin defaults to searching across all services.
For example, you can look for any trace containing an http path that took at least 15 milliseconds to complete.
Thanks to @xqliang for helping with the code on this.
Cassandra 3 support
Cassandra has been a supported storage option in Zipkin for over 5 years. Thanks to immense help from @michaelsembwever and @llinder, we have a modernized option taking advantage of Cassandra v3.9+ features and the Zipkin v2 data format.
Cassandra is a very capable backend. It supports data expiration (TTL) and has powerful replication models, which can simplify your ability to trace even across regions. However, if you used our cassandra schema, you'd notice it wasn't great for browsing: our schema stored encoded thrift blobs, so you couldn't meaningfully query it in CQL. Moreover, our duration query support was problematic to the point that the feature was pulled.
The "cassandra3" storage type eliminates these problems while retaining all the strengths. It installs automatically into the "zipkin2" keyspace, named to match the v2 data format it stores. You need Cassandra v3.9+ (we test on 3.11.1, currently the latest). Look at our README for technical details about the schema.
What's notable is that the schema is now a span model, as opposed to serialized thrifts. This means you can look at the data in cqlsh and make sense of it. For example, the trace ID is the same hex as B3 headers, which means you can literally paste it into CQL if you want.
While most will use zipkin's UI, some of you will like being able to write queries like the one below:
cqlsh:zipkin2> select trace_id, toTimestamp(ts) as timestamp, duration as duration_micros, minus(writetime(span), plus(ts,duration)) as write_lag_micros, span as name, value(tags, 'http.path') as path, l_service from span limit 10;
 trace_id         | timestamp                       | duration_micros | write_lag_micros | name | path | l_service
------------------+---------------------------------+-----------------+------------------+------+------+-----------
 907a5124315b2cc1 | 2017-11-13 08:16:39.365000+0000 |             634 |           810418 |  get | /api |   backend
 907a5124315b2cc1 | 2017-11-13 08:16:39.364000+0000 |            1693 |           810891 |  get | /api |  frontend
 907a5124315b2cc1 | 2017-11-13 08:16:39.363000+0000 |            2891 |           810507 |  get |    / |  frontend
 d7f8afc80b3f357b | 2017-11-13 08:14:50.478000+0000 |             616 |           288349 |  get | /api |   backend
 d7f8afc80b3f357b | 2017-11-13 08:14:50.476000+0000 |            3093 |           288574 |  get |    / |  frontend
 d7f8afc80b3f357b | 2017-11-13 08:14:50.476000+0000 |            1687 |           289084 |  get | /api |  frontend
 2483c223283177d2 | 2017-11-13 08:14:48.527000+0000 |             895 |          1129547 |  get | /api |   backend
 2483c223283177d2 | 2017-11-13 08:14:48.523000+0000 |            4125 |          1130022 |  get | /api |  frontend
 2483c223283177d2 | 2017-11-13 08:14:48.522000+0000 |            6171 |          1131768 |  get |    / |  frontend
 4ff33dd90b4f8f61 | 2017-11-13 08:16:37.350000+0000 |             659 |           798568 |  get | /api |   backend

(10 rows)
A severe amount of thanks is owed to @llinder (Lance) and @michaelsembwever (Mick). Lance has been piloting a beta model for several months, fixing issues like pagination and generally being first to fire. He also wrote a pilot version of the dependency linking spark job. Mick helped significantly with porting that work in progress to the simpler v2 span model, as well as with stress tests, advice, tons of advice, code, more code, and advice. Please reach out and thank these two for volunteering the hard yards needed to reinvent our Cassandra model.
Zipkin 2.2
Zipkin 2.2 focuses on operations, allowing proxy-mounting the UI and bundles a Prometheus Grafana dashboard
@stepanv modified the zipkin UI so that it can work behind reverse proxies which choose a different path prefix than '/zipkin'. If you'd like to try zipkin under a different path, Stepan wrote docs showing how to set up Apache HTTP Server.
Previously, zipkin had both spring and prometheus metrics exporters. Through hard work from @abesto and @kristofa, we now have a comprehensive example setup including a Zipkin+Prometheus Grafana dashboard. To try it out, use our docker-compose example, which starts everything for you. Once that's done, you can start viewing the health of your tracing system, including how many messages are dropped.
Here's an example of what you'd see at http://192.168.99.100:3000/dashboard/db/zipkin-prometheus?refresh=5s&orgId=1&from=now-5m&to=now if using docker-machine.
Other notes
- our docker JVM has been upgraded to 1.8.0_144 from 1.8.0_131
- the zipkin-server no longer writes log messages about dropped messages at warning level, as doing so can fill up disk. Enable debug logging to see the cause of drops
- elasticsearch storage will now drop on backlog as opposed to backing up, as the latter led to out-of-memory crashes under load surges.
Finally, please join us on gitter if you have any questions or feedback about Zipkin 2.2
Zipkin 2.1
Thanks to @shakuzen, zipkin 2.1 adds RabbitMQ to the available span transports.
RabbitMQ has been requested many times, though we only started formally tracking it this year. A lot of interest grew from spring-cloud-sleuth, which supported a custom RabbitMQ transport. Starting with Zipkin 2.1, RabbitMQ support is built into zipkin-server (though custom deployments can remove it).
Using this is easy: just set RABBIT_ADDRESSES to a comma-separated list of RabbitMQ hosts. If playing around, you can use localhost:
$ RABBIT_ADDRESSES=localhost java -jar zipkin.jar
More documentation is available here.
Once a server is running, applications send spans to RabbitMQ, specifically to the queue/routing key associated with zipkin (defaults to "zipkin"). You can post a test trace using the normal CLI while you wait for tracers to support the RabbitMQ transport.
$ echo '[{"traceId":"9032b04972e475c5","id":"9032b04972e475c5","kind":"SERVER","name":"get","timestamp":1505990621526000,"duration":612898,"localEndpoint":{"serviceName":"brave-webmvc-example","ipv4":"192.168.1.113"},"remoteEndpoint":{"serviceName":"","ipv4":"127.0.0.1","port":60149},"tags":{"error":"500 Internal Server Error","http.path":"/a"}}]' > sample-spans.json
$ rabbitmqadmin publish exchange=amq.default routing_key=zipkin < sample-spans.json
Many thanks to @shakuzen for driving this feature. There's a lot more work than just coding when we add a new default feature. Evenings and weekend time from Tommy are gratefully received.
Zipkin 2
In version 1.31, we introduced our v2 http api, offering dramatically simplified data types. Zipkin 2 is the effort to move all infrastructure towards that model, while still remaining backwards compatible.
What's new?
The core java library (under the package zipkin2) has model, codec and storage types. This includes a bounded in-memory storage component used in test environments.
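To give a feel for the new types, here's a minimal sketch that bounds the in-memory component and stores a span synchronously. The maxSpanCount builder method and import paths reflect our reading of the zipkin2 api; treat the details as illustrative:

import static java.util.Arrays.asList;

import zipkin2.Span;
import zipkin2.storage.InMemoryStorage;

InMemoryStorage storage = InMemoryStorage.newBuilder()
    .maxSpanCount(10000) // bounded: old spans are evicted to make room for new ones
    .build();

Span span = Span.newBuilder()
    .traceId("86154a4ba6e91387").id("86154a4ba6e91387")
    .name("get")
    .build();

storage.spanConsumer().accept(asList(span)).execute(); // synchronous store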
The following artifacts are new and can coexist with previous ones.
- io.zipkin.zipkin2:zipkin:2.0.0 (core library)
- io.zipkin.zipkin2:zipkin-storage-elasticsearch:2.0.0 (first v2 native storage driver)
Note: If you are using io.zipkin.java:zipkin and io.zipkin.zipkin2:zipkin, use version 2.0.0 (or later) for both, as we still maintain the old libraries.
What's next?
There are a few storage implementations in-flight and some may port to the new libraries. Next, we will add a v2 native transport library and work on a Spring Boot 2 based server. Expect incremental progress along the way. Please join us on gitter if you have ideas!
The server itself is still the same
Note: if you are only using or configuring Zipkin, there's little impact: the server hasn't changed, so you just upgrade it. If you have a java tracing setup, read the below. Otherwise, you are done unless you want extra details.
Changing java applications to use Zipkin v2 format
Java applications often use the zipkin-reporter project directly or indirectly to send data to Zipkin collectors. Our version 2 json format is smaller and measurably more efficient.
Once you've upgraded your Zipkin servers, opt into the version 2 format like this:
/** Configuration for how to send spans to Zipkin */
@Bean Sender sender() {
- return OkHttpSender.create("http://your_host:9411/api/v1/spans");
+ return OkHttpSender.json("http://your_host:9411/api/v2/spans");
}
/** Configuration for how to buffer spans into messages for Zipkin */
- @Bean Reporter<Span> reporter() {
- return AsyncReporter.builder(sender()).build();
+ @Bean Reporter<Span> spanReporter() {
+ return AsyncReporter.v2(sender()).build();
}
If you are using Brave directly, you can stick the v2 reporter here:
return Tracing.newBuilder()
- .reporter(reporter()).build();
+ .spanReporter(spanReporter()).build();
If you are using Spring XML, the related change looks like this:
- <bean id="sender" class="zipkin.reporter.okhttp3.OkHttpSender" factory-method="create"
+ <bean id="sender" class="zipkin.reporter.okhttp3.OkHttpSender" factory-method="json"
destroy-method="close">
- <constructor-arg type="String" value="http://localhost:9411/api/v1/spans"/>
+ <constructor-arg type="String" value="http://localhost:9411/api/v2/spans"/>
</bean>
<bean id="tracing" class="brave.spring.beans.TracingFactoryBean">
<property name="reporter">
<bean class="brave.spring.beans.AsyncReporterFactoryBean">
+ <property name="encoder" value="JSON_V2"/>
What's new in the Zipkin v2 library
Zipkin v2 libraries are under the zipkin2 java package and the io.zipkin.zipkin2 maven group ID. The core library has a few changes, which mostly clean up or pare down features we had before. Here are some highlights:
Span now uses validated strings as opposed to parsed objects
Our new json encoder is twice as fast as the prior one, due to factors including the validation approach. For example, we previously used the java long type to represent a 64-bit ID and a 32-bit integer to represent an ipv4 address. Most of the time, IDs and IPs are transmitted and stored as strings, so this resulted in needless, expensive conversions. Switching to validated strings also makes using other serialization libraries easier, as you don't need custom type converters.
Ex.
- Endpoint.builder().serviceName("tweetie").ipv4(192 << 24 | 168 << 16 | 1).build());
+ Endpoint.newBuilder().serviceName("tweetie").ip("192.168.0.1").build());
protip: if you have an old endpoint, you can do endpoint.toV2() on it!
Span now uses auto-value instead of public final fields
We originally had public final fields for our model types (borrowing from square wire style). This had a slight glitch: data transformations couldn't use method references (as fields aren't methods!). This is cleaned up now.
- assertThat(spans).extracting(s -> s.duration)
+ assertThat(spans).extracting(Span::duration)
Asynchronous operations are now cancelable
Most will not make custom Zipkin servers, but those making storage or transport plugins have a cleaner api.
Borrowing heavily from Square Retrofit and OkHttp, Zipkin storage interfaces return a Call object, which represents a single unit of work, such as storing spans. This provides means to either synchronously invoke the command, pass a callback, or compose with your favorite library. Unlike before, calls are cancelable.
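For instance, here's a minimal sketch of those usage styles, assuming a zipkin2 StorageComponent named storage and the getTrace signature shown later in these notes:

Call<List<Span>> call = storage.spanStore().getTrace("86154a4ba6e91387");

// option 1: synchronously invoke, blocking until the result arrives
List<Span> trace = call.execute();

// option 2: pass a callback instead of blocking
call.clone().enqueue(new Callback<List<Span>>() { // a call runs once, so clone before reuse
  @Override public void onSuccess(List<Span> value) { /* use the trace */ }
  @Override public void onError(Throwable t) { /* handle failure */ }
});

// unlike before, in-flight calls can be canceled
call.cancel();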
For example, before, if you wanted to write integration tests that synchronously invoke storage, you'd need to play callback games. These are gone.
- CallbackCaptor<Void> callback = new CallbackCaptor<>();
- storage().asyncSpanConsumer().accept(spans, callback);
- callback.get();
+ storage.spanConsumer().accept(spans).execute();
As an implementor, the whole thing is simpler, especially combined with validated string IDs:
- @Override public void getTrace(long traceIdHigh, long traceIdLow, Callback<List<Span>> callback) {
- String traceIdHex = Util.toLowerHex(traceIdHigh, traceIdLow);
+ @Override public Call<List<Span>> getTrace(String traceId) {
(json) Codec libraries are cleaned up
We've introduced SpanBytesEncoder and SpanBytesDecoder instead of the catch-all Codec type from v1. When writing zipkin-reporter, we noticed that almost all applications don't need decode logic, as they simply serialize and send out of process. For those writing data to Zipkin, we can serialize either the old format or the new one with SpanBytesEncoder.JSON_V1 or SpanBytesEncoder.JSON_V2 accordingly. It is important to note that writing the v1 format does not require a version 1.x jar in your classpath.
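Here's a minimal sketch of encoding the same span both ways. Span construction is as in the earlier examples; the encodeList name reflects our reading of the zipkin2 codec api:

import static java.util.Arrays.asList;

import zipkin2.Span;
import zipkin2.codec.SpanBytesEncoder;

Span span = Span.newBuilder()
    .traceId("86154a4ba6e91387").id("86154a4ba6e91387")
    .name("get")
    .build();

byte[] v2Json = SpanBytesEncoder.JSON_V2.encodeList(asList(span)); // new format
byte[] v1Json = SpanBytesEncoder.JSON_V1.encodeList(asList(span)); // old format, no 1.x jar needed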
Zipkin 1.30
Zipkin 1.30 accepts a new simplified json format on all major transports including http, Kafka, SQS, Kinesis, Azure Event Hub and Google Stackdriver.
The primary goal of this format is making Zipkin data easier to understand and simpler for folks to write. A dozen folks in Zipkin have vetted ideas on this format for over a year. We took it seriously because we don't want to bother you with a format unless it will last years. Thanks especially to @bplotnick @basvanbeek and @mansu for donating time recently towards vetting final details.
Here's an example curl command that uploads json representing a server operation:
# make epoch seconds into epoch microseconds, because... microservices!
$ date +%s123456
1502677917123456
$ curl -s localhost:9411/api/v2/spans -H'Content-Type: application/json' -d'[{
"traceId": "86154a4ba6e91387",
"id": "86154a4ba6e91387",
"kind": "SERVER",
"name": "get",
"timestamp": 1502677917123456,
"duration": 207000,
"localEndpoint": {
"serviceName": "hamster-wheel",
"ipv4": "113.210.108.10"
},
"remoteEndpoint": {
"ipv4": "77.12.22.11"
},
"tags": {
"http.path": "/api/hamsters",
"http.status_code": "302"
}
}]'
The above says a lot with a little: the server's identifier in discovery (hamster-wheel), the http route and the client IP (likely from X-Forwarded-For or similar). This request took 207ms in the server and resulted in a redirect.
We released the collector side ahead of the client/reporter side, so that folks can roll out version upgrades ahead of demand. That said, there is already work in progress using this, like census and @flier's c/c++ tracer, so update to the most recent patch release as soon as you can!
If you are interested in more on this format, check out the newly polished OpenApi spec, or a go client example compiled from it (thx @devinsba). If you have further questions, hop on https://gitter.im/openzipkin/zipkin
Next releases will formalize more, including "zipkin2" java types for those who need them. That said, one nice thing about the new format is that it is easy enough for normal json tools to manage. Regardless, keep your eyes open for more, and thanks for the interest.
Zipkin 1.29
Zipkin 1.29 models messaging spans, shows errors in the service graph and supports Elasticsearch 6
Message tracing
Producing and consuming messages from a broker, such as RabbitMQ or Kafka, is similar to, but different from, one-way RPC. For example, one message can have multiple consumers, and many times the producer can't know whether this will be the case. Also, particularly in Kafka, consuming a message is often completely decoupled from processing it, and consumption may happen in bulk.
Through community discussion, notably advice from @bogdandrutu from Census, we reached this conclusion for message tracing with Zipkin:
- Messaging consumers should always be a child span of the producing span (and not a linked trace)
  - If using B3, this means X-B3-SpanId is the parent of the consumer span
- "ms" and "mr" annotate message send and receive events
  - span2 format replaces these with Span.Kind.PRODUCER, CONSUMER
- If producer and consumer spans include duration, it should only reflect local batching delay
  - time spent processing a message should be in a different child span
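In the zipkin2 span types described in the Zipkin 2 notes above, that model looks roughly like this minimal sketch (all ids are made up for illustration):

import zipkin2.Span;

// the producer span; its id travels with the message as X-B3-SpanId
Span producer = Span.newBuilder()
    .traceId("86154a4ba6e91387").id("a7d3bbc10b4b86cb")
    .kind(Span.Kind.PRODUCER)
    .name("send")
    .build();

// the consumer span is a child of the producer, not a linked trace
Span consumer = Span.newBuilder()
    .traceId("86154a4ba6e91387").parentId("a7d3bbc10b4b86cb").id("3c1d2a7709a2f2b1")
    .kind(Span.Kind.CONSUMER)
    .name("receive")
    .build();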
There are diagrams on the website of how instrumentation works with this model. You can also look at @ImFlog's Kafka 0.11 tracing work in progress. If you have more questions or want to share your work, contact us on gitter.
Visualizing error count between services
Thanks to @hfgbarrigas' initial work, and lots of review support by @shakuzen, we now have errorCount on dependency links, indicating how many of the callCount between services were in error.
MySQL users who want this need to add the error_count column:
alter table zipkin_dependencies add `error_count` BIGINT
The UI is relatively simple, coloring the line yellow when 50% or more of calls are in error, and red when 75% or more are. These rates can be overridden or disabled with configuration.
Example of when >50% of calls are in error
Example of when >75% of calls are in error
Trace instrumentation's contract is easy: add the "error" tag, for example on an http 500. When aggregating links, the value of the "error" tag isn't important. Please update to the latest versions of instrumentation if you don't see errors yet. For example, zipkin-ruby recently added support for this thanks to @jcarres-mdsol.
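In java with Brave, for example, satisfying the contract amounts to something like this hypothetical sketch (the response variable stands in for whatever http response type you have):

// span is the brave.Span covering this request
if (response.code() >= 500) {
  span.tag("error", String.valueOf(response.code())); // the tag name matters; the value doesn't
}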
Elasticsearch 6
Currently, our Elasticsearch storage uses one index for all types: spans, dependencies (and a special service name index). Elasticsearch 6 no longer supports multiple types per index, so we write separate indexes for spans and dependency links when Elasticsearch 6 is detected. Incidentally, we also use the new span2 json format, which is simplified and more efficient.
The next version will support the same single-type indexing with Elasticsearch 2.4+. If you can't wait that long, look at #1674 for the experimental flag you can use today.
Thanks to @anuraaga @ImFlog @xeraa and @jcarres-mdsol for advice and support leading to this feature. The next release will thank those who test it!
Zipkin 1.28
Zipkin 1.28 bounds the in-memory storage component
Since the rewrite, we've always had a way to start zipkin without any storage service dependency. This is great for running examples, unit tests, or ad-hoc tests. It wasn't good for tests in more persistent environments like Kubernetes, as eventually the memory would blow up and we'd recommend people use something else. It also wasn't good for short tests that take a lot of traffic, for the same reason.
Initially, we were hesitant to add features that might end up with people accidentally going to production on our in-memory storage. However, many people asked about this, usually after something blew up in test: we realized bounding the memory provider was indeed worthwhile. Thanks to hard work and tuning by @joel-airspring, the default server now starts and won't likely blow up if you send a lot of traffic to it.
So, now you can play around and zipkin will just drop old traces to make room for new ones.
# run with self-tracing enabled, so each api hit is traced, and max-spans set lower than the default of 500000 spans
$ SELF_TRACING_ENABLED=true java -Dzipkin.storage.mem.max-spans=500 -jar ./zipkin-server/target/zipkin-server-*exec.jar
# in another window, do this for a while
$ while true; do curl -s localhost:9411/api/v1/services;done
# then, check to see the span count is less than or equal to what you set it to: <=500
$ curl -s localhost:9411/api/v1/traces?limit=1000000|jq '.[]|.[]|.id'|wc -l
Please note this option can still break under certain types of load, so please don't consider the in-memory provider production-grade, or on a path to be the latest data grid! If you are interested in an in-memory storage option for production, you might consider upvoting Hazelcast, noting you want it to work embedded.
Zipkin 1.27
Zipkin 1.27 moves the UI under the path /zipkin, allows listening on multiple Kafka topics and improves Cassandra 3 support.
The Zipkin UI was formerly served from an unmodified server as the base path. We've had folks ask for a year in various ways to have this under a subpath instead. We decided to move the UI under /zipkin as it matched most users' requirements and was easiest for our single-page app to route. Thanks to @eirslett @danielkwinsor and @neilstevenson for help with implementation and testing.
We recently added Kafka 0.10 support. This version includes the ability to listen on multiple topics, something you might do if you have environments where spans come from different sources. Thanks to @danielkwinsor for implementation and @dgrabows for review, we now support this by simply comma-delimiting the topic. Note: there are some gotchas if you are considering migrating from Kafka 0.8 to 0.10. Thanks to @fedj for noting something you might run into.
Some of you may be using the experimental "cassandra3" storage type. We had a serious glitch @llinder found, where blocking could occur on a query depending on the count of results returned. Not only did Lance fix the glitch, he also added testcontainers to ensure clean, docker-based integration tests run on every PR.
Finally, Zipkin 1.27 fixes a number of broken windows. Thanks to @NithinMadhavanpillai for adding a test to help us fix a bad-data bug when parsing dependencies, @fgcui1204 for finding out why service names were sometimes cut off in the UI, @ImFlog for backfilling docs about how ports can be specified in cassandra, and @joel-airspring for fixing a few distracting glitches in our build.
Zipkin 1.26
Thanks to @dgrabows, Zipkin 1.26 now supports Kafka 0.10. Notably, this allows you to run without a ZooKeeper dependency. (Recent versions of Kafka no longer require consumers to connect to ZooKeeper)
Our docker image will automatically use this if the variable KAFKA_BOOTSTRAP_SERVERS is set instead of KAFKA_ZOOKEEPER. An example docker setup is available here.
While you do not need to upgrade your instrumented apps, you can choose to opt-in by using libraries such as our kafka10 sender.
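A sender setup might look like this minimal sketch (the kafka10 package and create factory reflect our reading of the zipkin-reporter sender at the time; the broker addresses are placeholders):

import zipkin.Span;
import zipkin.reporter.AsyncReporter;
import zipkin.reporter.kafka10.KafkaSender;

// bootstrap servers point at kafka brokers directly; no zookeeper needed
KafkaSender sender = KafkaSender.create("kafka1:9092,kafka2:9092");
AsyncReporter<Span> reporter = AsyncReporter.builder(sender).build();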
Thanks again for the comprehensive work by @dgrabows and review feedback by @StephenWithPH
Zipkin 1.25
Zipkin 1.25 lets you disable the query api when deploying collector-only services. It also lets you log http requests sent to Elasticsearch. Finally, it fixes a bug where a non-default MySQL schema would fail health checks.
Disabling the UI and Query api for collector-only servers
@SirTyro's security team wants collectors deployed separately, in a way that reduces exposure if compromised. You can now disable the api and UI by setting QUERY_ENABLED=false. Thanks to @shakuzen for help implementing this.
Understanding Zipkin's requests to Elasticsearch
Reflecting on a troubleshooting session with @ezraroi, we could have used more data to understand why an Elasticsearch index template was missing. This would have saved us time. You can now set ES_HTTP_LOGGING=BASIC to see what traffic is sent from zipkin to Elasticsearch. Other options include HEADERS and BODY. Thanks to OkHttp for the underlying interceptor that does this.
Fixed health check when you have a non-default MySQL schema
@zhanglc stumbled upon a bug where the health check misreported a service as unhealthy if it had a non-default schema. This is now fixed.