Over the past five years, I partnered with dozens of fast-growing startups while at AWS, and later joined one myself at Unbabel. I’ve seen that engineering teams still need to invest large amounts of time in keeping systems reliable, however much easier that became once public cloud providers arrived.
Cloud-based architectures are getting more complex and depend on more and more third parties. Building reliable systems in 2022 is still incredibly hard for engineers. On top of this, SRE teams lack accurate data on how their work affects customer journeys over time, positively or negatively.
For executive or business teams, the site reliability function is often reduced to a “how often were the systems down last quarter” kind of thesis. Although availability matters enormously, it can’t be the sole focus, since SREs have additional concerns such as latency, security, developer experience and infrastructure costs. Taken to the extreme, this rationale can bias engineers towards availability and short-term fixes. A shortsighted focus on basic threshold alerting can lead teams to neglect the tasks that actually ensure long-term reliability.
By agreeing on Service Level Objectives (SLOs) with the business, SRE teams can set data-driven goals or OKRs, accurately report on the starting metric and the corresponding results, and consequently show the organization the ROI of their work, all by tracking those SLOs on detech.ai. The thesis can now evolve to a longer-term, holistic approach:
Behemoths like Google, Microsoft or Netflix follow an SLO-based approach to site reliability, using many components built in-house by the hundreds of site reliability engineers working at each of these companies.
In early 2021, I met José Velez for the first time, and he told me he wanted to productize this framework and make it available to any organization, no matter its size. It has been exciting to watch José and the rest of the team succeed at it. The detech.ai platform is up and running in pre-beta, with over 10 organizations using it weekly to:
Now, I’m joining detech.ai to lead our Go-To-Market activities and I’m really looking forward to speaking with engineering teams around the world about their reliability efforts!
If you are an engineer or engineering leader looking after site reliability / DevOps, please reach out if you’d like to learn more.