Site Reliability Engineering Lingo For Newbies
SRE has a vernacular that can take a while for newbies and non-tech stakeholders (e.g. management) to get to grips with. So I'll cover the key terms in primer-level detail.
Hey there, fellow SRE fan.
It’s Ash 😉
We will return to our usual programming of understanding SRE issues in depth shortly, but I need a buffer after the last few weeks (make that months) of putting out fires at work.
As I mentioned earlier, this will be a non-exhaustive explainer of key SRE terms I hear over and over again.
But I hope to offer a different angle from what you may have read before.
After all, my day job is literally about getting stakeholder buy-in.
Translation: I have to explain technical issues to non-tech leaders who are constantly glaring at their phones and saying “We’ve got budgets, ya know” at the same time.
On that note, let's begin...
Error budget
"That last release caused an outage that seriously ate into our error budget!"
SREs are given an error budget. Not the kind of budget my leadership colleagues are thinking of, but it is similar in that SREs can “spend” it too.
It's an allowance for errors, so SREs can experiment and let systems fail up (or down, depending on how you see it) to an agreed threshold. The idea is that SREs should keep tabs on incidents while keeping this error budget in mind.
Their goal should be to automate away work to reduce the risk of errors. The budget covers not only the SRE's work but also any issues arising from releases that developers deploy to production. So it's a symbiotic effort to stay within the error budget.
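To put a number on it, here's a rough sketch in Python of how an error budget could be worked out. The 99.9% target, the 30-day window, the 25 minutes of downtime and the function names are all my own assumptions for illustration, not figures from any real team.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in the window for a given availability target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(budget: timedelta, downtime_so_far: timedelta) -> timedelta:
    """Budget left to 'spend'; a negative value means it is overspent."""
    return budget - downtime_so_far


# Assumed example: a 99.9% availability target over a 30-day window,
# with 25 minutes already lost to an outage after a release.
budget = error_budget(0.999, timedelta(days=30))
left = budget_remaining(budget, timedelta(minutes=25))
print(f"budget: {budget.total_seconds() / 60:.1f} min")
print(f"remaining: {left.total_seconds() / 60:.1f} min")
```

With these made-up numbers the whole month's allowance is roughly 43 minutes, so one 25-minute outage really does eat most of it.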
Post-mortem
You know what I think about this term based on last week’s post, but let’s power through it. Post-mortems are reviews that SREs undertake after an incident. Like any on-call engineer, they first address the incident as quickly and effectively as possible.
They are then expected to analyse the issue as it happened, trawling through timestamped logs and other data sources for contributing causes. Whatever the cause may be, a key tenet of SRE is that post-mortems are blameless, i.e. no human gets blamed or shamed as the cause, conduit or complicit party.
SLO (Service Level Objective)
An SLO is an internal target for how reliable a service should be. It's easy to think that it might be similar to an SLA, but it's not. SLAs carry too loaded a meaning to allow for psychologically safe engineering practices. Corporate IT folklore is full of tales about the ramifications of not meeting an SLA.
That means punishment for failure, which goes against SRE's philosophy of learning from failure. SREs can tell whether they are meeting SLOs by keeping tabs on SLIs (service level indicators), which are their metrics for performance.
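Here's a minimal sketch of how an SLI gets compared against an SLO. I'm assuming a simple availability SLI (good requests divided by total requests) and a 99.9% target; the request counts and function names are made up for the example.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo_target


# Made-up counts for one measurement window.
sli = availability_sli(good_requests=999_412, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The point is that the SLI is what you measure and the SLO is the line you hold it against. Nobody gets punished in this function.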
Observability
Not exactly exclusive to the SRE world, but a major aspect of it at least. It's both a reactive and proactive approach to identifying indicators of performance, reliability and availability issues, and it gives a higher-level view of the system than monitoring.
SREs will often keep an eye on latency, traffic, errors and saturation across microservices. Some SREs will even keep an eye on real-time user experience as a measure of service performance.
It’s getting more nuanced (and exciting) now with distributed tracing practices like full-stack tracing. I’ll cover this subtopic in my upcoming Uber SRE case study.
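For the curious, the four signals mentioned above (latency, traffic, errors and saturation) can be sketched from a handful of request records. This is only an illustration: the `Request` record, the `golden_signals` helper, the 95th-percentile choice and the sample numbers are my assumptions, and real teams would pull these from their monitoring stack rather than compute them by hand.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_seconds, in_use, capacity):
    """Summarise one window of traffic as the four signals."""
    if len(requests) < 2:
        raise ValueError("need at least two requests to estimate a percentile")
    latencies = [r.latency_ms for r in requests]
    return {
        # Latency: 95th percentile (last of the 19 cut points for n=20).
        "latency_p95_ms": quantiles(latencies, n=20)[-1],
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: share of requests that failed.
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        # Saturation: how full the service is (e.g. workers in use).
        "saturation": in_use / capacity,
    }


# Made-up sample: ten requests in five seconds, one of which failed.
sample = [Request(latency_ms=100.0 + 10 * i, ok=(i != 7)) for i in range(10)]
signals = golden_signals(sample, window_seconds=5.0, in_use=30, capacity=40)
for name, value in signals.items():
    print(f"{name}: {value:.2f}")
```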
Toil
Toil, aka manual work, is something no productive software engineer wants to do day in, day out. Manual work is prone to risk from human fatigue, contempt and inattention during repetitive tasks. SREs take eliminating toil seriously, all the way to a firm target in their work scheduling.
At least 50% of their time should be spent on work to "make tomorrow better than today", i.e. proactive work. A lot of this involves running experiments, coding up tools, and planning, such as capacity planning.
Now that you understand (or have reinforced) the unique elements of SRE, you can reiterate its key ethos to others...
"SRE is what happens when you ask a software engineer to design an operations function", Ben Treynor Sloss, VP of Engineering @ Google.