KubeVirt metrics should align with the Kubernetes metric names. KubeVirt users should have the same experience when searching for node, container, pod, and virtual machine metrics.
Naming requirements:

- Check if a similar Kubernetes metric exists (for node, container, or pod) and try to align with it.
- KubeVirt metrics prefixes:
  - Running VM metrics should have a `kubevirt_vmi_` prefix
  - HCO operator metrics should have a `kubevirt_hco_` prefix
  - Network operator metrics should have a `kubevirt_network_` prefix
  - Storage operator metrics should have a `kubevirt_cdi_` prefix
  - SSP operator metrics should have a `kubevirt_ssp_` prefix
  - HPP operator metrics should have a `kubevirt_hpp_` prefix
For example, see the following Kubernetes network metrics:

- node_network_receive_packets_total
- node_network_transmit_packets_total
- container_network_receive_packets_total
- container_network_transmit_packets_total

The KubeVirt metrics for a VMI should be:

- kubevirt_vmi_network_receive_packets_total
- kubevirt_vmi_network_transmit_packets_total
The metric `Help` message MUST also be verbose, since it is propagated to the metrics.md file when running `make generate`.
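As a concrete illustration of these rules, the sketch below (not taken from the KubeVirt code base) shows how a VMI network counter could be declared with the Prometheus Go client: the name carries the `kubevirt_vmi_` prefix and mirrors the Kubernetes node/container equivalents, and the `Help` string is verbose enough to read well in the generated metrics.md. The label set is a hypothetical choice.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter following the naming convention: kubevirt_vmi_ prefix plus a name
// that mirrors node_network_receive_packets_total, and a verbose Help string.
var vmiNetworkReceivePacketsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kubevirt_vmi_network_receive_packets_total",
		Help: "Total number of network packets received by the virtual machine instance, broken down by network interface.",
	},
	[]string{"namespace", "name", "interface"}, // hypothetical label set
)

func init() {
	// Register the metric with the default Prometheus registry.
	prometheus.MustRegister(vmiNetworkReceivePacketsTotal)
}
```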
Use recording rules for calculations, or when the same query is used by several alerts or dashboards, instead of repeating the query in many places.
Recording rules appear in Prometheus as metrics.
To make the KubeVirt recording rules easy to identify, they should follow the same naming conventions as the metrics.
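For instance, a recording rule can be declared with the prometheus-operator Go types, as operator code commonly does. Both the record name and the expression below are assumptions made for illustration, not existing KubeVirt rules; note how the record name looks like an ordinary `kubevirt_` metric.

```go
package rules

import (
	promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// A recording rule named like a regular metric (kubevirt_vmi_ prefix), so
// dashboards and alerts can query it as if it were one. The record name and
// the expression are illustrative assumptions.
var vmiPhaseCountRule = promv1.Rule{
	Record: "kubevirt_vmi_phase_count",
	Expr:   intstr.FromString("sum by (phase) (kubevirt_vmi_info)"),
}
```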
When creating a KubeVirt alert rule, please follow the OpenShift Alerting Consistency Guide.
In addition to the OpenShift style guide, KubeVirt alerts MUST include:
- A `kubernetes_operator_part_of` label indicating the operator name. The value should be set to `kubevirt`.
- A `kubernetes_operator_component` label indicating the sub-operator name.
- An `operator_health_impact` label indicating how the alert impacts the operator's functionality. This label differs from `severity`: `severity` indicates the ability to deliver a service for the cluster as a whole, whereas `operator_health_impact` indicates the impact of the issue on the operator's functionality. The loss of the operator's functionality doesn't necessarily mean that the ability to deliver services for the cluster as a whole is affected. For example, an alert may have a `warning` severity when talking about the impact on the cluster health, but a `critical` impact on the operator's health. Also, when an alert is tied to a specific workload it can have a `warning` severity, but no impact on the operator's health.

  Valid values for this label are:

  - `critical` - for alerts that indicate a loss of the operator's functionality; part of the operator might not work as expected.
  - `warning` - for alerts that indicate a risk to the operator's functionality; parts of the operator might soon stop working as expected.
  - `none` - for alerts that don't indicate a loss of the operator's functionality; the operator is working as expected.
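The sketch below shows what an alert rule carrying these labels could look like, again using the prometheus-operator Go types. The alert name, expression, and component value are made-up examples; the label values illustrate the `severity` vs. `operator_health_impact` distinction described above.

```go
package rules

import (
	promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Hypothetical alert: name, expression, and component value are illustrative
// only. The cluster-level severity is "warning", while the impact on the
// operator itself is "critical".
var exampleOperatorAlert = promv1.Rule{
	Alert: "KubeVirtExampleOperatorPodDown",
	Expr:  intstr.FromString(`up{pod=~"example-operator-.*"} == 0`),
	Labels: map[string]string{
		"severity":                      "warning",
		"operator_health_impact":        "critical",
		"kubernetes_operator_part_of":   "kubevirt",
		"kubernetes_operator_component": "example-operator",
	},
	Annotations: map[string]string{
		"summary": "The example operator pod is down, so parts of the operator will not work as expected.",
	},
}
```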
Optional labels:

- A `priority` label indicating the alert's level of importance and the order in which it should be fixed.
  - Valid priorities are `high`, `medium`, or `low`. The higher the priority, the sooner the alert should be resolved.
  - If the alert doesn't include a `priority` label, we can assume it is a `medium` priority alert.
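A small hypothetical helper, just to make the default explicit: when an alert does not set a `priority` label, it is treated as `medium`.

```go
package rules

// alertPriority is a hypothetical helper that returns the alert's priority
// label, falling back to "medium" when the label is not set, per the
// convention above.
func alertPriority(labels map[string]string) string {
	if p, ok := labels["priority"]; ok {
		return p
	}
	return "medium"
}
```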
Note: KubeVirt alert runbooks are saved in the kubevirt/monitoring repository.