From Google’s SRE Book:
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.
This quantitative measure is generally a ratio between two numbers:
$$\text{ SLI } ={ \text{ Good Events } \over \text{ Valid Events }}$$
In reliability engineering, the events used to compute this ratio are software and IT infrastructure metrics collected by monitoring tools, such as Prometheus, New Relic or Datadog.
In a nutshell, an SLI intends to convert a stream of raw and complex metrics about a software application into a simple but informative health score percentage. For a given time window, the SLI starts at 100% and decreases as the number of bad events increase.
Here’s a practical example of an availability SLI of a REST API:
SLI Type: Availability
SLI Specification: Proportion of requests to the REST API that did not return a 5xx
SLI Implementation:
Good events: Total count of all non-5xx events logged for calls to the REST API
Valid events: Total count of all logged events for calls to the REST API