    description: "Kafka target has disappeared. An exporter might have crashed.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: JvmMemory Filling Up
  expr: (sum by (instance)(jvm_memory_bytes_used{area="heap",juju_charm!=".*"}) / sum by (instance)(jvm_memory_bytes_max{area="heap",juju_charm!=".*"})) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: JVM memory filling up (instance {{ $labels.instance }})
    description: "JVM memory is filling up (> 80%)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    summary: 'Broker {{ $labels.instance }} :: Broker State :: The ZooKeeper session has expired.'
    description: 'When a session expires, we can have leader changes and even a new controller. It is important to keep an eye on the number of such events across a Kafka cluster and to investigate if the overall number is high.'
    summary: 'Broker :: Controller and Partitions :: No active controller'
    description: 'No broker in the cluster is reporting as the active controller in the last 1 minute interval. During steady state there should be only one active controller per cluster.'

- alert: Offline Partitions
  expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount{juju_charm!=".*"}) by (instance) > 0
    description: 'After a successful leader election, if the leader for a partition dies, the partition moves to the OfflinePartition state. Offline partitions are not available for reading and writing. Restart the brokers, if needed, and check the logs for errors.'
    summary: 'Broker {{ $labels.instance }} :: Controller and Partitions :: Too many partitions :: {{ $value }} partitions in broker'
    description: 'The recommended number of partitions per broker is below 4000. Increase the number of brokers and rebalance partitions to keep this number under control.'

# =======================
# Replicas and Partitions
# =======================
- alert: Under Replicated Partitions
  expr: sum(kafka_server_replicamanager_underreplicatedpartitions{juju_charm!=".*"}) by (instance) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: 'Broker {{ $labels.instance }} :: Replicas and Partitions :: {{ $value }} under replicated partitions'
    description: 'Under-replicated partitions mean that one or more replicas are not available. This is usually because a broker is down. Restart the broker, and check for errors in the logs.'
    description: 'If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, ISRs are expanded once the replicas are fully caught up. Other than that, the expected value for the ISR expansion rate is 0. If the ISR is expanding and shrinking frequently, adjust the allowed replica lag.'

- alert: ISR Shrinks Rate
  expr: max(rate(kafka_server_replicamanager_isrshrinkspersec{juju_charm!=".*"}[5m])) by (instance) != 0
    description: 'If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, ISRs are expanded once the replicas are fully caught up. Other than that, the expected value for the ISR shrink rate is 0. If the ISR is expanding and shrinking frequently, adjust the allowed replica lag.'
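# Illustrative sketch (an assumption, not taken from the rules in this file): the
# "allowed replica lag" mentioned in the description above is assumed to map to the
# Kafka broker setting replica.lag.time.max.ms, which bounds how long a follower may
# fall behind before it is removed from the ISR. On a broker configured through
# server.properties it might look like this (30000 is an example value only):
#
#   replica.lag.time.max.ms=30000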
    description: '{{ $value }} unclean partition leader elections in the cluster reported in the last 1 minute interval. When unclean leader election is held among out-of-sync replicas, there is a possibility of data loss if any messages were not synced prior to the loss of the former leader. So if the number of unclean elections is greater than 0, investigate broker logs to determine why leaders were re-elected, and look for WARN or ERROR messages. Consider setting the broker configuration parameter unclean.leader.election.enable to false so that a replica outside of the set of in-sync replicas is never elected leader.'
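# Illustrative sketch (an assumption about the deployment, not part of these rules):
# the parameter named in the description above is a broker-level setting, so on a
# broker configured through server.properties it could be set as follows. A managed
# deployment may expose the same option through its own configuration layer instead.
#
#   unclean.leader.election.enable=false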
    summary: 'Broker {{ $labels.instance }} :: Consumer :: The maximum lag is {{ $value }}.'
    description: 'The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.'

# ===============
# Thread Capacity
# ===============
- alert: Network Processor Idle Percent
  expr: avg(sum(kafka_network_processor_idlepercent{juju_charm!=".*"}) by (instance, networkProcessor)) by (instance) < 0.3
    description: 'The average fraction of time the network processors are idle. A lower value {{ $value }} indicates that the network workload of the broker is very high.'

- alert: Request Handler Idle Percent
  expr: avg(kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent_total{juju_charm!=".*"}) by (instance) < 0.3
    description: 'The average fraction of time the request handler threads (IO) are idle. A lower value {{ $value }} indicates that the workload of a broker is very high.'
    description: 'Max request queue time exceeded 200ms for a request. It is the time, in milliseconds, that a request currently spends in the request queue.'
    description: 'Max response queue time exceeded 200ms for a request. It is the length of time, in milliseconds, that the request waits in the response queue.'