Hacking your way to Observability — Part 2 : Alerts
Taking advantage of metrics by sending notifications via Slack
In my previous post, we deployed Prometheus Operator with the Helm Chart and a set of services to demonstrate how to collect metrics using prom-client and exporters. Don’t forget that the purpose of observability is to infer the status of a system, so for the metrics to serve their purpose, they must communicate something to the right people when the values are not within the boundaries defined by the organization. Alerts can help you with that.
Alerting in Prometheus is separated into two parts. In Prometheus, you create alerting rules that define the conditions for alerts to be fired. When an alert fires, Prometheus sends it to an Alert Manager, which can then silence, group, or route the alert to send notifications to different platforms.
In this post, we will create some alert rules and then send notifications of those alerts via Slack. You can find all the resources used in this post here: https://github.com/jonathanbc92/observability-quickstart/tree/master/part2
Alert rules
To create an alert rule using the Prometheus Operator, we will use the PrometheusRule custom resource. PrometheusRule requires you to specify the following:
- Groups: Collections of rules that are evaluated sequentially.
- Rules: The conditions for alerts to be fired; each rule includes a name, the condition expression, a waiting period, and labels and annotations to provide additional information.
The alert condition is written as a Prometheus expression; you can use the Prometheus expression browser to validate it before creating the rule. In the following example, we have a group of rules named database.rules with a single rule that fires when the mysql_up metric has been absent for at least 1 minute.
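A minimal sketch of what such a PrometheusRule could look like is shown below; the metadata (resource name, namespace, and the release label the Operator uses to discover rules) is an assumption that depends on how the chart was installed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: database-rules
  namespace: observability      # assumption: a namespace the Operator watches for rules
  labels:
    release: prometheus         # assumption: label matching the chart's rule selector
spec:
  groups:
  - name: database.rules
    rules:
    - alert: MysqlDown
      expr: absent(mysql_up)    # true when the mysql_up metric is missing
      for: 1m                   # the metric must be absent for at least 1 minute
      labels:
        severity: critical
      annotations:
        summary: MySQL is down
        description: The mysql_up metric has been absent for more than 1 minute.
```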
Create the resource using kubectl apply -f alertrules.yml and go to the Alerts page in Prometheus.
To test the rule, we need to scale down the MySQL deployment:
kubectl scale deployment/mysql --replicas=0 -n applications
After a minute or so, we will see the alert firing:
We don’t have to build everything on our own; the kube-prometheus-stack Helm deployment already comes with many valuable alerts that we can check to learn about Kubernetes metrics or even create new alerts based on them.
Think about the following: what happens if you have a significant system outage? Many of your applications may fail, and you will end up with many alerts firing and many notifications being sent to your team. You can reduce the number of notifications by using Alert Manager’s grouping capability, which allows you to group alerts of a similar nature into a single notification.
To demonstrate that, we will create a simple alert that fires when a deployment has fewer than two container replicas (sketched below) and test this scenario after Slack is configured.
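A sketch of that rule, assuming the deployment availability metric exposed by kube-state-metrics (which the chart installs); the alert name, labels, and namespace filter are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deployment-rules
  namespace: observability      # assumption: same conventions as the previous rule
  labels:
    release: prometheus
spec:
  groups:
  - name: deployment.rules
    rules:
    - alert: DeploymentReplicasLow
      # kube-state-metrics reports the available replicas per deployment
      expr: kube_deployment_status_replicas_available{namespace="applications"} < 2
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: '{{ $labels.deployment }} has fewer than two replicas'
        description: 'Deployment {{ $labels.deployment }} in {{ $labels.namespace }} has {{ $value }} available replicas.'
```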
Preparing Slack
Since we will be sending all our alerts to Slack, we need to do some preparation first. Let’s start by creating a Slack channel.
Create an App in your workspace. Then, enable Incoming Webhooks on your App and add a new webhook to the workspace. Don’t forget to copy the webhook URL; we will need it later.
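The Alert Manager configuration further down expects the webhook URL to live in a Kubernetes Secret. Assuming a Secret named slack-webhook with a key named apiURL in the observability namespace (these names are only a convention for this post and just need to match the configuration below), it can be created with:

kubectl create secret generic slack-webhook --from-literal=apiURL='<your webhook URL>' -n observability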
Configuring AlertManager
To configure Alert Manager, we need to create a custom resource called AlertmanagerConfig. It requires us to configure at least one receiver, which is the platform the message will be sent to, and a route that directs alerts to one of those receivers.
The route has some configurations related to grouping.
- The groupBy parameter lists the labels that Alert Manager will use to group alerts into a single notification.
- groupWait indicates how long to wait before sending the initial notification.
- groupInterval indicates how long to wait before sending a notification update.
- repeatInterval indicates how long to wait before sending the last notification again.
Slack receiver configuration has multiple parameters; you can check all of them here.
- The Slack configuration requires the webhook URL to be provided as a secret; that secret is referenced in the apiURL parameter.
- The channel parameter specifies the Slack channel that will receive the notifications.
- sendResolved is a parameter that tells Alert Manager to send a notification when the condition for the alert is no longer met.
- title and text are parameters to modify the format of the Slack message. Both can include a reference to an existing template.
In the code below, all alerts with the same alert name that fire within a 30-second window will be grouped into a single Slack notification.
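A sketch of that configuration, assuming the Secret created earlier (slack-webhook with the key apiURL), a channel named #alerts, and the template names that will be defined in the next section:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: slack-config
  namespace: observability            # assumption: a namespace watched by the Operator
spec:
  route:
    receiver: slack
    groupBy: ['alertname']            # alerts sharing the same alert name are grouped together
    groupWait: 30s                    # wait 30 seconds before sending the first notification
    groupInterval: 5m                 # wait before sending an update for the group
    repeatInterval: 12h               # wait before re-sending an already sent notification
  receivers:
  - name: slack
    slackConfigs:
    - apiURL:
        name: slack-webhook           # Secret holding the webhook URL
        key: apiURL
      channel: '#alerts'              # assumption: the channel created earlier
      sendResolved: true              # also notify when the alert stops firing
      title: '{{ template "alert_title" . }}'
      text: '{{ template "alert_description" . }}'
```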
Create the configuration by running kubectl apply -f alertmanagerconfig.yml.
Go to the Alert Manager status page and you will see all the routes configured. The route we configured has changed: the name is different, and it has a match parameter. The match parameter specifies the labels an alert needs to have to be routed to the receiver. By default, every route you configure is modified to include the namespace label in the match parameter, even if other labels are included.
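For reference, the modified route on the status page might look something like this (the generated receiver name and the namespace depend on where the AlertmanagerConfig resource was created):

```yaml
routes:
- receiver: observability/slack-config/slack   # <namespace>/<AlertmanagerConfig name>/<receiver name>
  match:
    namespace: observability                   # label added automatically by the Operator
  group_by:
  - alertname
  group_wait: 30s
```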
To see whether an alert will be routed to the right receiver, we can use the routing tree editor. Copy the Alert Manager configuration from the status page and test it with the labels of your alerts.
Templates
Alert Manager supports defining templates to be used in notifications. Templates allow us to standardize the notification message for all our alerts.
The following piece of code is an example of a template:
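Here is a sketch of what that template file can look like (the exact template used in the repository may differ):

```
{{/* __title: the alert name of every firing and resolved alert */}}
{{ define "__title" }}{{ range .Alerts.Firing }}{{ .Labels.alertname }} {{ end }}{{ range .Alerts.Resolved }}{{ .Labels.alertname }} {{ end }}{{ end }}

{{/* alert_title: [STATUS:count] plus the title when only one alert is involved */}}
{{ define "alert_title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if or (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 1) }}{{ template "__title" . }}{{ end }}{{ end }}

{{/* alert_description: description, severity and graph link for a single alert, otherwise a list of the grouped alerts */}}
{{ define "alert_description" }}
{{- if or (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 1) -}}
{{- range .Alerts }}
{{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Graph:* {{ .GeneratorURL }}
{{- end }}
{{- else -}}
{{- range .Alerts }}
- {{ .Labels.alertname }}
{{- end }}
{{- end }}
{{ end }}
```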
The define keyword declares a reusable chunk of template code. The code has three reusable chunks: __title, alert_title, and alert_description.
- __title: iterates over the firing alerts and the resolved alerts to print the alert name.
- alert_title: prints the status in uppercase between square brackets and the count of firing alerts. It also includes the content of __title if there is only one alert firing or one alert resolved.
- alert_description: if there is only one alert firing or one alert resolved, prints the description of the alert along with the severity and a link to the Prometheus graph URL. If there is more than one, it prints the list of alerts.
To include a template file in an Alert Manager managed by the Operator, we need to update the Alertmanager custom resource; we can do that by passing custom values to the Helm chart. Since we only need to add the template files, the following file will be enough.
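A values file along these lines should be enough, using the kube-prometheus-stack chart's alertmanager.templateFiles value; the file name custom_templates.tmpl is an assumption, and the content is the template from the previous section:

```yaml
# values.yml - only the Alertmanager template files are set;
# --reuse-values keeps the rest of the existing release configuration.
alertmanager:
  templateFiles:
    custom_templates.tmpl: |-
      {{ define "__title" }}{{ range .Alerts.Firing }}{{ .Labels.alertname }} {{ end }}{{ range .Alerts.Resolved }}{{ .Labels.alertname }} {{ end }}{{ end }}
      {{ define "alert_title" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ if or (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 1) }}{{ template "__title" . }}{{ end }}{{ end }}
      {{ define "alert_description" }}{{ if or (eq (len .Alerts.Firing) 1) (eq (len .Alerts.Resolved) 1) }}{{ range .Alerts }}{{ .Annotations.description }} *Severity:* {{ .Labels.severity }} *Graph:* {{ .GeneratorURL }}{{ end }}{{ else }}{{ range .Alerts }}- {{ .Labels.alertname }} {{ end }}{{ end }}{{ end }}
```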
We can update the Helm deployment by running the following command.
helm upgrade --reuse-values prometheus prometheus-community/kube-prometheus-stack -n observability -f values.yml
Testing Notifications
To receive the Slack notifications, we need to trigger the alerts. Let’s start by scaling down MySQL to receive a single alert notification:
Scale down MySQL: kubectl scale deployment/mysql --replicas=0 -n applications
After a couple of minutes, scale up: kubectl scale deployment/mysql --replicas=1 -n applications
Go to your Slack channel to see the notification.
Lastly, we should test a grouped alert notification.
Scale down all the Node.js services to a single replica:
kubectl scale deployment/format-service-depl --replicas=1 -n applications
kubectl scale deployment/hello-service-depl --replicas=1 -n applications
kubectl scale deployment/people-service-depl --replicas=1 -n applications
After a couple of minutes, scale up:
kubectl scale deployment/format-service-depl --replicas=2 -n applications
kubectl scale deployment/hello-service-depl --replicas=2 -n applications
kubectl scale deployment/people-service-depl --replicas=2 -n applications
Go back to Slack and compare the results.
As you can see, the alerts are grouped under a single notification. In this case, they belong to the same group because all of them have the same alert name. You can change the group configuration by adding more labels in the future.
Conclusion
Notifications are a powerful way to communicate to your team what is happening with your systems. Take advantage of templates to show more accurate messages; providing precise information helps solve issues faster. And don’t forget to group your alerts to avoid notification hell!
Happy Alerting 🚨!