Commit 924ea21

Add job labels to monitoring alerts (#1065)
1 parent 47fac27 commit 924ea21

1 file changed: +4 -4 lines changed

internal/alerts/example-alert-template.json

@@ -20,7 +20,7 @@
 "rules": [
 {
 "alert": "Amd64 metric missing in cluster ci-dev-aks-mac-eus",
-"expression": "absent(node_uname_info{machine=\"x86_64\"}) == 1 or node_uname_info{machine=\"x86_64\"} == 0",
+"expression": "absent(node_uname_info{job=\"node\",machine=\"x86_64\"}) == 1 or node_uname_info{job=\"node\",machine=\"x86_64\"} == 0",
 "for": "PT30M",
 "annotations": {
 "description": "Amd64 metric missing in cluster ci-dev-aks-mac-eus"
@@ -200,7 +200,7 @@
 },
 {
 "alert": "CPU usage % greater than 75 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
-"expression": "sum(sum by (cluster, namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job=\"cadvisor\", image!=\"\", namespace=\"kube-system\", container=\"prometheus-collector\"}[5m]) ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=\"\", namespace=\"kube-system\"}) )) by (container, pod) *100 > 75",
+"expression": "sum(sum by (cluster, namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job=\"cadvisor\", image!=\"\", namespace=\"kube-system\", container=\"prometheus-collector\"}[5m]) ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{job=\"kube-state-metrics\",node!=\"\", namespace=\"kube-system\"}) )) by (container, pod) *100 > 75",
 "for": "PT3M",
 "annotations": {
 "description": "CPU usage greater than 75% for prometheus-collector on cluster ci-dev-aks-mac-eus"
@@ -218,7 +218,7 @@
 },
 {
 "alert": "Memory usage % greater than 75 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
-"expression": "(sum(container_memory_working_set_bytes{namespace=\"kube-system\", container=\"prometheus-collector\", image!=\"\"}) by (container, pod) / sum(kube_pod_container_resource_limits{namespace=\"kube-system\", container=\"prometheus-collector\", resource=\"memory\"}) by (container, pod)) * 100> 75",
+"expression": "(sum(container_memory_working_set_bytes{job=\"cadvisor\",namespace=\"kube-system\", container=\"prometheus-collector\", image!=\"\"}) by (container, pod) / sum(kube_pod_container_resource_limits{job=\"kube-state-metrics\",namespace=\"kube-system\", container=\"prometheus-collector\", resource=\"memory\"}) by (container, pod)) * 100> 75",
 "for": "PT3M",
 "annotations": {
 "description": "Memory usage greater than 75% for prometheus-collector containers on cluster ci-dev-aks-mac-eus"
@@ -254,7 +254,7 @@
 },
 {
 "alert": "New agent version found for prometheus collector",
-"expression": "count(count (kube_pod_container_info{image=~\"mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector.*\"}) by (image)) > 4",
+"expression": "count(count (kube_pod_container_info{job=\"kube-state-metrics\",image=~\"mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector.*\"}) by (image)) > 4",
 "for": "PT60S",
 "annotations": {
 "description": "New agent version found for prometheus collector. This alert is only used in near ring regions for prod monitoring clusters"

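The four hunks above all make the same kind of change: each metric selector in an alert expression gains a `job="..."` matcher. As a rough illustration of how that pattern could be checked, here is a minimal Python sketch that lists selectors lacking a `job` equality or regex matcher. The function name `selectors_missing_job` and the regular expressions are hypothetical and are not part of this repository. The sketch also assumes selectors are written with a `{...}` block and that no label value contains a `}`; it is not a PromQL parser.

```python
import re

# Hypothetical helper, not part of the repository: a regex-based sketch only.
# Matches a metric name followed by a {...} label-matcher block.
SELECTOR = re.compile(r'([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}')
# Matches a double-quoted label value, including escaped characters.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')
# Matches `job=` or `job=~` at the start of the block or after a comma.
JOB_MATCHER = re.compile(r'(?:^|,)\s*job\s*=')


def selectors_missing_job(expression):
    """Return metric names whose {...} block has no job= or job=~ matcher."""
    missing = []
    for metric, matchers in SELECTOR.findall(expression):
        # Blank out label values so text inside quotes cannot look like a matcher.
        bare = QUOTED.sub('""', matchers)
        if not JOB_MATCHER.search(bare):
            missing.append(metric)
    return missing


# The first alert's expression before and after this commit (JSON escaping removed).
before = 'absent(node_uname_info{machine="x86_64"}) == 1 or node_uname_info{machine="x86_64"} == 0'
after = 'absent(node_uname_info{job="node",machine="x86_64"}) == 1 or node_uname_info{job="node",machine="x86_64"} == 0'

print(selectors_missing_job(before))
print(selectors_missing_job(after))
```

For the pre-change expression the helper reports `node_uname_info` once per selector; for the post-change expression it reports nothing. Selectors written without braces would go unnoticed by this sketch, so a real check would likely need a proper PromQL parser.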