- Active Controller count
- Offline partitions
- Unclean Elections
- Under Replicated Partitions
- Under Min In Sync Replicas
- Max lag by groupId by partition
- Broker IO Activity
- Broker Net Activity
- Zookeeper Avg Latency
- Zookeeper Connections
- Broker Zookeeper disconnections
Bonus: Why don't I have to monitor the number of brokers?
- Type: Message Delivery
- Description: The Controller is responsible for maintaining the list of partition leaders and coordinating leadership transitions (for example, on topic creation).
  - If `ActiveControllerCount < 1`: Producers/Consumers can't get the partition leaders anymore.
  - If `ActiveControllerCount > 1`: A split-brain occurs, and that's really bad!
- Metric: Sum the JMX metric `kafka.controller:type=KafkaController,name=ActiveControllerCount` across the cluster. Note: Each broker exposes the `ActiveControllerCount` metric, whose value is `1` if the node is the controller and `0` otherwise. It looks like a boolean, but you need to do an integer sum across the cluster.
- Notification:
  - Send a warning when `ActiveControllerCount != 1`
  - Send an alarm when `ActiveControllerCount != 1` for more than 10s.
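
The cluster-wide sum described above can be sketched as follows. This is only an illustration: it assumes you already collect each broker's `ActiveControllerCount` value by some means (JMX exporter, agent, etc.), and the function name and state labels are hypothetical, not part of Kafka.

```python
from typing import Iterable


def controller_state(per_broker_counts: Iterable[int]) -> str:
    """Classify the cluster from each broker's ActiveControllerCount (0 or 1).

    The labels returned here are made up for this sketch.
    """
    # Integer sum across the cluster, not a boolean OR.
    total = sum(per_broker_counts)
    if total < 1:
        return "no-controller"  # clients can't get the partition leaders
    if total > 1:
        return "split-brain"    # more than one broker claims to be the controller
    return "ok"


# Three brokers, exactly one of them is the controller.
print(controller_state([0, 1, 0]))
```

Summing first and comparing to `1` afterwards is what lets the same check catch both failure modes.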
- Type: Message Delivery
- Description: An Offline Partition is a partition without an active leader and is hence neither writable nor readable. The presence of offline partitions compromises the data integrity of the cluster.
- Metric: JMX `kafka.controller:type=KafkaController,name=OfflinePartitionsCount`
- Notification: Alarm when `OfflinePartitionsCount > 0`
- Type: Message Delivery
- Description: Normally, when a broker that is the leader for a partition goes offline, a new leader is elected from the set of ISRs for the partition. An unclean leader election is a special case in which no available replicas are in sync. Because each topic must have a leader, an election is held among the out-of-sync replicas and a leader is chosen—meaning any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability.
- Metric: JMX `kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec`
- Notification:
  - Notify when `UncleanLeaderElectionsPerSec > 0`
  - Alarm when `UncleanLeaderElectionsPerSec > 0` for more than 1min.
- Type: Message Delivery
- Description: In a healthy cluster, the number of in-sync replicas (ISRs) should be exactly equal to the total number of replicas. In other words, the metric ensures the partitions respect the topic replication factor configuration. Under-replicated partitions can appear when a cluster node is down.
- Metric: JMX `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`
- Notification:
  - Notify when `UnderReplicatedPartitions > 0`
  - Alarm when `UnderReplicatedPartitions > 0` for more than 1min.
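
The "notify first, alarm if it lasts" rule used for `UnderReplicatedPartitions` (and for unclean elections above) can be sketched as a small state tracker. This is a simplified illustration, not a replacement for your alerting tool: the class name, the level labels, and the idea of passing the timestamp in by hand are all assumptions made for the example.

```python
from typing import Optional


class SustainedAlert:
    """Escalate from 'notify' to 'alarm' when a breach lasts too long."""

    def __init__(self, threshold: float = 0.0, alarm_after_s: float = 60.0):
        self.threshold = threshold
        self.alarm_after_s = alarm_after_s
        self._breach_started: Optional[float] = None

    def update(self, value: float, now: float) -> str:
        """Feed one sample (metric value, timestamp in seconds)."""
        if value <= self.threshold:
            # Back to normal: forget the ongoing breach.
            self._breach_started = None
            return "ok"
        if self._breach_started is None:
            self._breach_started = now
        if now - self._breach_started > self.alarm_after_s:
            return "alarm"
        return "notify"


alert = SustainedAlert(threshold=0, alarm_after_s=60)
print(alert.update(2, now=10))   # breach just started
print(alert.update(2, now=100))  # breach has lasted 90 seconds
```

Resetting the timer as soon as the value returns to normal means a flapping metric only ever notifies; whether that is what you want depends on your context.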
- Type: Message Delivery
- Description: If the cluster can't reach `min.insync.replicas` and `acks` is set to `all` (because not enough brokers are up in the cluster, for instance), the data producer can't get its writes acknowledged: it gets an error or eventually a timeout exception, and may retry depending on its configuration. In a nutshell, when you have partitions under min ISR, it is almost certain that data production is blocked.
- Metric: JMX `kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount`
- Notification: Alarm when `UnderMinIsrPartitionCount > 0`
- Type: Performance
- Description: The lag is the delta between the offset of the last record appended to a topic partition and the last offset committed by a consumer group. The bigger the lag, the further behind the consumer is in reading the data.
- Metric:
- Notification: N/A
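
To make the definition concrete, here is a minimal sketch of the lag computation. It assumes you already have, for one consumer group, the end offset and the committed offset of each partition (how you fetch them is out of scope). The function names and the `(topic, partition)` tuple are hypothetical, and a partition with no committed offset is counted from offset 0, which is a simplification.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def lag_by_partition(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Lag = last appended offset - committed offset, per partition."""
    lags = {}
    for tp, end in end_offsets.items():
        # Simplification: no committed offset is treated as offset 0.
        lags[tp] = max(end - committed.get(tp, 0), 0)
    return lags


def max_lag(lags: Dict[TopicPartition, int]) -> Tuple[TopicPartition, int]:
    """Return the partition with the largest lag and that lag."""
    tp = max(lags, key=lags.get)
    return tp, lags[tp]


end = {("orders", 0): 100, ("orders", 1): 50}
done = {("orders", 0): 90, ("orders", 1): 50}
print(max_lag(lag_by_partition(end, done)))
```

Running this once per `groupId` gives the "max lag by groupId by partition" view from the list at the top.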
- Type: Performance
- Description: Internally, a broker node uses the `I/O Thread` to read a message from the `Request Queue`, write it to the OS page cache, and place it into the `Purgatory`, where your replication strategy will be executed. It's interesting to monitor the thread idle time ("Idle" means not active):
  - When `idle==1`: The broker is inactive; from a pure performance standpoint it could be removed.
  - When `idle==0`: The broker is always processing; you should either increase the number of threads or add a new broker to the cluster.

  Note: Thread scaling should be done carefully since it can have a huge impact on the overall node performance.
- Metric: JMX `kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`
- Notification: Alarm when `RequestHandlerAvgIdlePercent < 0.4`
- Type: Performance
- Description: Same as above, a broker node uses the `Network Thread` to read a message from the network and place it into the `Request Queue`. It's interesting to monitor the thread idle time ("Idle" means not active):
  - When `idle==1`: The broker has no inbound traffic; from a pure performance standpoint it could be removed.
  - When `idle==0`: The broker is always receiving messages; you should either increase the number of threads or add a new broker to the cluster.

  Note: Thread scaling should be done carefully since it can have a huge impact on the overall node performance.
- Metric: JMX `kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`
- Notification: Alarm when `NetworkProcessorAvgIdlePercent < 0.4`
- Type: Cluster Health
- Description: Zookeeper is a key part of Kafka's distributed capability: it is responsible for the Active Controller election, topic configuration, ACLs, and broker membership in the cluster. In other words, if your Zookeeper cluster is unhealthy, your Kafka cluster will soon face issues.
  The latency is the amount of time it takes for the server to respond to a client request. The value depends on the context (10ms is a standard reference) and should be stable over time.
- Metric: JMX `AvgRequestLatency`
- Notification:
  - Notify when `AvgRequestLatency > 10ms`
  - Alarm when `AvgRequestLatency > 10ms` lasts for more than 1m.
- Type: Cluster Health
- Description: Zookeeper limits the number of client connections it can handle (configured by `maxClientCnxns`, which applies per client IP address). When the limit is reached, new connections are dropped, which can affect your Kafka service (controller election, ACLs, etc.).
- Metric: JMX `NumAliveConnections`
- Notification: Alarm when `NumAliveConnections/maxClientCnxns > 0.7`
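
The ratio rule above could be sketched like this, assuming both numbers are already available to your monitoring code. The function name and the default threshold parameter are hypothetical. One detail the sketch handles explicitly: in Zookeeper, `maxClientCnxns` set to `0` removes the limit, so the ratio is undefined and no alarm is raised.

```python
def connections_alarm(
    num_alive_connections: int,
    max_client_cnxns: int,
    threshold: float = 0.7,
) -> bool:
    """Return True when connection usage exceeds the threshold."""
    if max_client_cnxns <= 0:
        # 0 means "no limit" in Zookeeper: nothing to compare against.
        return False
    return num_alive_connections / max_client_cnxns > threshold


print(connections_alarm(45, 60))  # 45/60 = 0.75 of the limit
print(connections_alarm(30, 60))  # 30/60 = 0.5 of the limit
```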
- Type: Cluster health
- Description: The current Broker has been disconnected from the ensemble. The broker lost its previous connection to a server and it is currently trying to reconnect. The session is not necessarily expired. A high rate of disconnection is a symptom of network issues.
- Metric: JMX `kafka.server:type=SessionExpireListener,name=ZooKeeperDisconnectsPerSec`
- Notification:
  - Notify when `ZooKeeperDisconnectsPerSec > X`
You may be surprised not to see the number of brokers as a monitoring point. Think about it: what can happen when you lose a broker?
- The producers are blocked because the min ISR criterion is not satisfied? The Under Min In Sync Replicas alert will be triggered.
- You can't guarantee the durability of the messages because partitions are under replicated? The Under Replicated Partitions alert will be triggered.
- You have performance issues? The Broker Activity (IO and/or Net) alerts will be triggered.
Your Kafka cluster is shaped to meet one or more of the Kafka key features (speed, durability, resilience, etc.). While the number of brokers is good information to have, it is a redundant detail when building a health-check monitoring view.