Agile and SRE are NOT mutually exclusive
You may have inferred from the title above that:
Agile needs SRE and
SRE needs Agile
Well, neither statement is a simple yes or no. So let's modify them a bit:
Agile needs SRE (to a great extent, across the board) and
SRE needs Agile (to some extent, in some situations)
Let’s begin with how Agile needs SRE (and it so, so, so does)
Let me preface by saying that I am not a professional Agile critic. If I were, I wouldn’t have sat through many a long workshop to get my Scrum Master certifications.
As it turns out, most of the software vendors I “value-add” into at my day job switched over to Agile work only in the last 2 years. Some of them mentioned COVID as the key driver toward this. But I’m now noticing an effectiveness gap in their ability to reliably deliver cloud-based services.
Most of these vendors have shipped more features in the last 2 years than in the previous 8, and those same 2 years have also been the most unstable period for their systems. In my opinion, part of this fragility comes from their lack of insight into, or interest in, Site Reliability Engineering.
Every time I mention the need for a systematic way of addressing non-functional requirements (NFRs), error budgeting and SLOs, my counterparts at these vendors attempt to soothe me with “Mmm-hmm. Uh-huh.”
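For anyone who also answers “Mmm-hmm” to error budgets: the idea is simply that an availability SLO implies a fixed allowance of downtime per window, and you track how much of it is left. Here is a minimal sketch of that arithmetic. The 30-day window, the function names and the example figures are my own assumptions for illustration, not any vendor's actual numbers.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo is a fraction, e.g. 0.999 for "three nines".
    window_days is an assumed rolling window; 30 days is a common choice.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Error budget left after the downtime already observed in the window.

    A negative result means the budget is blown.
    """
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # Hypothetical: 30 minutes of outages so far leaves about 13.2 minutes.
    print(round(budget_remaining(0.999, 30.0), 1))
```

Once the remaining budget nears zero, the usual response is to slow feature releases and spend the next sprint on stability instead.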
The more agile work these vendors do, the more fragile their software seems to become. At least, it’s noticeable in production. I’ve heard their test automation engineers are happy campers.
Guess I’ll have to wait another 2-3 years until SRE becomes the next buzzword among the late majority of adopters, e.g. software vendors for healthcare providers. If you’re curious about which buzzwords came before, I personally noted that Agile was the buzziest of words in 2018 and 2019, remote work in 2020 and hybrid work in 2021.
If you're doing Agile, that means you're changing your software (in production) roughly every 4-6 weeks. The changes compound over time, and so software put into production on Day 0 is going to morph into a very different beast by Day 30, 60, 90, 180 etc.
In some situations, by Day 365, you may no longer recognise the software you shipped on Day 0. And the more services you add or modify over time, the greater the complexity. This quote aptly describes the conundrum we face:
There is a fallacy in computer programming circles that all applications are ultimately decomposable - that is to say, you can break down complex applications into many more simple ones. In point of fact, however, you often cannot get more complex behaviors to actually start working until you have the right combination of components working, and even then you will run into problems with synchronization of data availability, memory usage and deallocation and race conditions - problems that will only become apparent when you've built most of the plumbing. This is why "but will it scale?" entered the lexicon of programmers everywhere. Scale problems only show up once you've built the system out almost completely and attempt to make it work under more extreme conditions. The solutions often entail scrapping significant parts of what you've just built, much to the consternation of managers everywhere. — Kurt Cagle, Community Editor @ Data Science Central
Extreme conditions are nowadays a regular event for many applications, even the ones that would otherwise never reach enterprise scale. That’s because business user demand has gone through the roof in recent years. Yes, many non-tech workers used to run on-prem software and rely on faxes for communication. The horror!
SRE acts as a nice security blanket over the whole Agile mess that seems to grow and grow and grow. It gives teams the ability to respond effectively to outages, performance degradation etc.
It gives assurance that the various underlying services will be able to handle pressure when it occurs. It is the proactive approach to successful production software.
The higher cost of an SRE can be justified by the reduced risk (real and likely) of excessive downtime, which costs serious $$$ in almost every industry now that so many areas of production and service are software-dependent.
Users want features, but now they also need reliability, because business models are now dependent on cloud services. Downtime means money lost: thousands, sometimes millions of dollars. At the very least, SRE principles and/or a subset of the full SRE capability set are called for.
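To make the "downtime means money lost" argument concrete, here is a rough back-of-the-envelope sketch. The availability levels and the cost per hour of downtime are placeholder assumptions I made up for illustration; plug in your own figures. The function names are hypothetical too.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability fraction."""
    return HOURS_PER_YEAR * (1.0 - availability)


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime, assuming a flat cost per hour."""
    return annual_downtime_hours(availability) * cost_per_hour


def reliability_saving(before: float, after: float, cost_per_hour: float) -> float:
    """Yearly downtime cost avoided by moving from one availability to another."""
    return annual_downtime_cost(before, cost_per_hour) - annual_downtime_cost(after, cost_per_hour)


if __name__ == "__main__":
    # 99% availability is about 87.6 hours of downtime a year;
    # 99.9% is about 8.76 hours.
    print(round(annual_downtime_hours(0.99), 2))
    print(round(annual_downtime_hours(0.999), 2))
    # Assumed cost of 10,000 per hour of downtime (a placeholder, not real data).
    print(round(reliability_saving(0.99, 0.999, 10_000)))
```

If the saving comfortably exceeds what the SRE capability costs you per year, the higher salary more or less pays for itself. A flat hourly cost is a simplification; real outages tend to cost more at peak times.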
I think I spent too much time on why Agile needs SRE and have run out of space for the inverse in today’s post. I’ll ponder further on where SREs need Agile methodologies and write it up as a Part 2.