Alerting is an essential part of monitoring any pipeline. The general idea behind alerting is simple: you pick a metric that you know is related to the capacity of your system in some way, you describe the constraints that define what you understand as the normal behavior of that metric, and if the metric breaks those constraints, you emit an alert to bring human attention into the picture and rectify the problem.
Nonetheless, just because something is simple doesn't mean it is easy to do correctly. The way alerting is done in most places today is by defining a static threshold. This approach is not only inefficient but can even become painful for the teams that receive the alerts.
Static thresholds lack the ability to keep up with your service's evolution. Services grow and change, as do the hardware and software that sustain them. A threshold tuned for today's conditions can be totally irrelevant tomorrow. Moreover, the humans who pick thresholds often cannot grasp the full complexity behind the web of functionality they are supposed to monitor, and have only a vague idea of how the metrics and thresholds they picked actually impact users.
A weak monitoring system is no better than no system at all. As time goes on, if humans don't find value in the alerts they are tasked to investigate, they'll eventually just ignore them. A better approach is needed.
Instead of alerting on system metrics, you should focus on what actually matters: user experience, which is embedded in the very nature of SLOs. Creating alerts based on SLOs therefore lets you discard noise while automatically adapting to your service's evolution.
We mentioned in our previous article In-depth guide on SLIs, SLOs and Error Budgets that you are only as reliable as your users experience you to be. It doesn't matter what you see in your fancy system-monitoring dashboards; what matters is what your users think of you. If we translate this mindset into the alerting landscape, we can say that problems aren't really problems if they don't impact your users. By alerting on SLO-related metrics, you bypass all the complexity of your system and go straight to what matters most.
As Ewaschuk and Davidovič et al. have argued, alerts should be tied to symptoms that users actually feel.
For you to truly grasp what we're about to explain, you should be somewhat familiar with the terminology behind SRE, namely: SLOs, Error Budgets, and Burn Rates. We provide an overview of these terms in our article Monitoring SLIs and SLOs and a more in-depth analysis in our other article In-depth guide on SLIs, SLOs and Error Budgets. Feel free to check them out before continuing. If you're here, let's get to it!
In general terms, there are two main classes of problems that you need to be concerned about: fast burns, where a sudden event starts consuming significant amounts of your error budget in a short time, and slow burns, where continuous, gradual degradation quietly eats through the budget over a longer period.
Another thing you must consider when setting up your alerts is the time interval between the triggering of an alert and the resolution of its root cause, known as response time. Issue resolution is a human task that takes time and forensic analysis, and you must take this into account when planning your alerts. You don't want to alert only when it's too late, since you won't have time to prevent the issue from happening; on the other hand, you also don't want to alert too early, or you risk creating an alert that's just noise and therefore not actionable.
Assuming that your SLOs are meaningful, breaking your error budget means that your quality standards have been violated and that you are not delivering the service you want to. It should be something you avoid at all costs, and when it does happen it should force a reflection within your team - not something that occurs lightly.
As such, it is only natural for part of your alerting policy to be based on the amount of remaining error budget (REB) you have during your SLO's compliance window. You can pick the REB values at which you want to be notified, so you know when you are at risk of breaking the budget.
By this point, you may be thinking: Hang on a second! Aren't these values just thresholds, and isn't this the very thing we were trying to avoid all along? Well, yes, these values are in fact thresholds, but they aren't the enemy. The problem with traditional threshold alerting is that the thresholds are set on system metrics. If your server's CPU was stuck near 100% for a long period of time, you know that traffic has been intense - but does that mean there's a problem? Not necessarily. If the latency SLO for your payments service has a REB of 10%, it means a large share of your users have been stuck for longer than they should when trying to purchase something. That is a problem for sure, because it can easily hurt your revenue.
The REB values you pick to be alerted on will depend heavily on your SLO's configuration, mainly its compliance window (size and type) and your error budget's evaluation criteria (time-based or request-based).
The thresholds themselves may be defined in percentage units, which are always valid but can be a bit harder to interpret for some of your stakeholders (e.g. alert on 15% REB). Alternatively, they can follow your budget's evaluation criteria, meaning time units for time-based error budgets (e.g. alert on 10 min REB) and request counts for request-based ones (e.g. alert on 300 requests REB).
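To make the REB idea concrete, here is a minimal sketch of computing the remaining error budget for a request-based SLO and checking it against a few alert tiers. The function name, the SLO target, and the threshold values are all illustrative assumptions, not tied to any particular monitoring tool.

```python
# Illustrative sketch: remaining error budget (REB) for a
# request-based SLO, checked against example alert thresholds.

def remaining_error_budget(total_requests: int,
                           failed_requests: int,
                           slo_target: float) -> float:
    """Return the fraction of the request-based error budget still unspent."""
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 0.0
    spent = failed_requests / allowed_failures
    return max(0.0, 1.0 - spent)

# A 99.9% availability SLO over a window with 1,000,000 requests
# allows ~1,000 failed requests; 850 failures spend 85% of the budget.
reb = remaining_error_budget(1_000_000, 850, 0.999)
print(f"REB: {reb:.0%}")  # 15% of the budget remains

# Example alert tiers expressed as REB percentages.
for threshold in (0.50, 0.25, 0.10):
    if reb <= threshold:
        print(f"alert: remaining error budget at or below {threshold:.0%}")
```

For a time-based budget, the same shape applies with "bad minutes" in place of failed requests and allowed downtime in place of allowed failures.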
Alerting on error budgets can be immensely useful, but alerts on burn rates are the real game-changer. Burn rates indicate the rate at which the error budget is being consumed, which lets you forecast how much error budget you will have spent at a future point in time. The simplest use case is to alert if your error budget is being consumed faster than your SLO allows.
Burn rates are measured as follows:

burn rate = errors observed over a period / errors allowed by the SLO over that same period

In other words, your burn rate is simply the ratio between the errors you endured over a period of time and the number of errors your SLO allowed for in that same period.
If your burn rate is >1.0, you have consumed more budget over that period of time than your budget allows; if it is <1.0, you consumed less. A good way of thinking about this metric is as a multiplier of the amount of error budget you are expected to consume: if your average burn rate is 2.0, then by the end of your compliance period you can expect to have spent twice the error budget you set out with.
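The formula and the multiplier interpretation above can be sketched in a few lines. The numbers here are illustrative assumptions (a 99.9% SLO with a 30-day window), not a prescription.

```python
# Minimal sketch of the burn-rate formula described above.

def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """Ratio of observed errors to the errors the SLO allows."""
    allowed_error_ratio = 1.0 - slo_target
    return observed_error_ratio / allowed_error_ratio

# A 99.9% SLO allows a 0.1% error ratio. Observing 0.2% errors over
# the measurement period means the budget burns twice as fast as allowed.
rate = burn_rate(0.002, 0.999)
print(rate)  # approximately 2.0

# The multiplier view: if this rate holds, a 30-day budget is
# exhausted in roughly half the window, i.e. about 15 days.
print(30 / rate)
```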
Note that the period over which you measure your burn rate is different from your compliance period. You can have an hourly burn rate, a daily burn rate, and so on. This is useful for identifying the kinds of problems we're trying to detect, namely the fast-burn and slow-burn problems we listed earlier.
Bigger periods of time allow you to detect continuous degradation - slow burns. Smaller periods allow you to identify events that suddenly start spending significant amounts of the budget - fast burns.
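One common way to combine both period sizes is to pair short-window and long-window burn rates with different alert severities. The sketch below assumes illustrative windows and thresholds (e.g. a 1-hour burn rate of 14.4, which would exhaust a 30-day budget in about two days if sustained); your own values should come from your SLO configuration.

```python
# Hedged sketch: mapping per-window burn rates to alert severities,
# in the spirit of the fast-burn / slow-burn split described above.
# Windows and thresholds are illustrative, not prescriptive.

# (window, burn-rate threshold, severity)
ALERT_POLICY = [
    ("1h",  14.4, "page"),    # fast burn: sudden, severe budget spend
    ("6h",  6.0,  "page"),
    ("24h", 3.0,  "ticket"),  # slow burn: steady degradation
    ("72h", 1.0,  "ticket"),
]

def evaluate(burn_rates: dict) -> list:
    """Return the alerts fired for the measured per-window burn rates."""
    fired = []
    for window, threshold, severity in ALERT_POLICY:
        if burn_rates.get(window, 0.0) >= threshold:
            fired.append(f"{severity}: {window} burn rate >= {threshold}")
    return fired

# A sudden outage shows up first in the short window:
print(evaluate({"1h": 20.0, "6h": 4.0, "24h": 1.2, "72h": 0.9}))
```

Splitting severities this way means fast burns page a human immediately, while slow burns open a lower-urgency ticket before the budget quietly runs out.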
If you're following SRE good practices, then by setting up your SLOs you are also mapping out the underlying topology that supports your application. SLOs help measure the quality of user journeys, which are supported by several services.
This means that even though your alerts are triggered based on user experience, your engineers will have a clear path for where to start their forensic analysis in order to determine the root cause of the problems that led to the alert.
Furthermore, with the proper tools, you may also be able to correlate alerts with one another and uncover previously unknown connections between different parts of your application. This improves the actionability of alerts by giving you a clearer vision of the complex system that supports your application.
"My Philosophy On Alerting" by Rob Ewaschuk | 2014
"Reduce Toil Through Better Alerting" by Štěpán Davidovič and Betsy Beyer | 2019
"Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets" by Alex Hidalgo | 2020