High-stakes software needs observability
In this post, I'll cover two examples where observability can protect revenue and prevent costly errors. But first, let's set the scene...
Setting the scene
In the past, you could “write some logs and collect some metrics” and get away with it, but with microservices, that’s no longer enough
“When a user makes a request to your service, that may hit 10 different backend services that have to do different operations, which is a nightmare to decipher with traditional metrics”
So you now have 10 points of failure; distributing risk beats a single point of failure, but it still won't make for a reliable user experience if several services fail or degrade at the same time
Observability can protect e-commerce revenue (example)
Imagine you’re responsible for the reliability of an e-commerce website
The user comes to the homepage and sees the features as if they were a monolith, but they are not: they are microservices glued together into a cohesive user experience
“The frontpage alone will call on the order recommendation system, the authentication system to log them in, which needs to pull up their user preferences, which needs to pick their most recently purchased items — that might involve 30, 40 or even 100 microservices”
Now imagine that you have weak or non-existent observability. Also, imagine that a microservice in a user’s value stream is dysfunctional.
This (simplified) chain of events may occur:
The item-inventory call fails outright or times out
Without a working inventory lookup, the add-to-cart service can't activate
With the add-to-cart service down, its button can't put items in the user's cart
End result: the cart never receives the user's selected items and the sale is lost
Multiply that by millions of users and you're looking at revenue loss on a grand scale
Good observability can mitigate or prevent an issue like this. More on this later.
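As a concrete illustration, here's a minimal Python sketch (the service name, latency budget and metrics shape are hypothetical, not from the interview) of timing each downstream call and failing loudly when a dependency like item-inventory breaks, so the problem surfaces in your metrics instead of as a silently dead cart:

```python
import time

# Hypothetical, simplified instrumentation for the add-to-cart path.
TIMEOUT_S = 0.5  # assumed latency budget for the inventory call

class DependencyError(Exception):
    pass

def call_with_observability(name, fn, metrics, timeout_s=TIMEOUT_S):
    """Time a downstream call, record the outcome, and fail loudly."""
    start = time.monotonic()
    try:
        result = fn()
    except Exception:
        metrics.append({"service": name, "status": "error",
                        "latency_s": time.monotonic() - start})
        raise DependencyError(f"{name} failed")
    latency = time.monotonic() - start
    status = "slow" if latency > timeout_s else "ok"
    metrics.append({"service": name, "status": status, "latency_s": latency})
    return result

metrics = []
# A healthy inventory lookup...
call_with_observability("item-inventory", lambda: {"sku": "A1", "stock": 3}, metrics)
# ...and a failing one, which now shows up instead of silently breaking the cart
try:
    call_with_observability("item-inventory", lambda: 1 / 0, metrics)
except DependencyError:
    pass

print([m["status"] for m in metrics])  # ['ok', 'error']
```

With even this much instrumentation, the on-call engineer sees an error spike on item-inventory rather than a mysterious drop in completed checkouts.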
Observability can protect a trader's bottom line (example)
Imagine you’re responsible for the reliability of a professional share trading platform
The user, typically a day trader, loads their Java-based desktop application, which serves as mission control for their trading work
Many of their trades are automated with support from quants who crunch non-linear scenarios, e.g. if x goes up and y goes down, then buy z
The application may contain hundreds of microservices, including the portfolio, daily P&L, portfolio P&L, share chart, share ticker, last price, buy price, sell price and price book
Now imagine that you have weak or non-existent observability. Also, imagine that a microservice within a user's value stream is dysfunctional.
This (simplified) chain of events may occur:
The last-price service lags due to latency in the exchange data feed
The trader sees a lower-than-actual market price and places a trade
The quant's algorithm detects the low price and triggers a barrage of purchases
End result: the trader ends up holding shares that don't match their trading strategy
Professional traders place trades worth millions of dollars, so this can be a costly error.
Good observability could mitigate or entirely prevent an issue like this.
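One simple guard that observability enables here is a freshness check: if every quote carries its own timestamp, the algorithm can refuse to act on a lagging last-price feed. A minimal sketch, with hypothetical field names and an assumed two-second staleness budget:

```python
import time

# Illustrative guard (not from the interview): refuse to trade on stale quotes.
MAX_QUOTE_AGE_S = 2.0  # assumed staleness budget

def quote_is_fresh(quote, now=None, max_age_s=MAX_QUOTE_AGE_S):
    """Return True only if the last-price quote is recent enough to act on."""
    now = time.time() if now is None else now
    return (now - quote["ts"]) <= max_age_s

now = 1_000_000.0
fresh = {"symbol": "XYZ", "last_price": 101.2, "ts": now - 0.5}
stale = {"symbol": "XYZ", "last_price": 99.8, "ts": now - 30.0}  # lagging feed

print(quote_is_fresh(fresh, now=now))  # True
print(quote_is_fresh(stale, now=now))  # False: the algorithm should hold fire
```

The point isn't the check itself but the observability behind it: without per-quote latency data, the lagging feed looks identical to a genuine price drop.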
Briefly, how good observability works
Get the basics right
You need to measure the latency between all of your microservices
You might need to measure individual function calls in those microservices
That helps you pinpoint where a performance issue lives and measure how long it has been impacting users
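Function-level measurement can start as simply as a timing decorator that records a latency sample per call. Real systems would use a tracing library such as OpenTelemetry; this hand-rolled sketch (all names are illustrative) just shows the idea:

```python
import time
from functools import wraps

# Minimal sketch of per-function latency measurement; names are illustrative.
LATENCIES = {}  # function name -> list of observed latencies in seconds

def traced(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            LATENCIES.setdefault(fn.__name__, []).append(
                time.monotonic() - start)
    return wrapper

@traced
def fetch_preferences(user_id):
    time.sleep(0.01)  # stand-in for a downstream call
    return {"user": user_id, "theme": "dark"}

fetch_preferences("u42")
fetch_preferences("u42")
print(len(LATENCIES["fetch_preferences"]))  # 2 samples to aggregate
```

Aggregating these samples per function is what lets you answer "where is the slowdown, and since when" instead of guessing.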
“Observability everywhere” mindset
Quick filtering and dashboarded analytics for on-call events
Build it into release practices like canary testing, blue-green deployments and A/B tests
Go beyond a systems-only monitoring mindset: social media monitoring can reveal the impact and scale of an incident from the end-user perspective, which matters most
Uptime is only part of the equation: you're not winning if a component is down and the user experience degrades even while the site stays up, e.g. a down CDN means slower load times for users far from your origin servers
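In the canary case, the observability data feeds a promote-or-rollback decision directly. A hypothetical sketch (the status codes and the 1% tolerance are assumptions, not from the interview) comparing a canary's error rate against the baseline:

```python
# Illustrative canary gate: promote only if the canary is no worse
# than the baseline, within a small tolerance.

def error_rate(status_codes):
    """Fraction of requests that returned a 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for s in status_codes if s >= 500) / len(status_codes)

def canary_passes(baseline, canary, tolerance=0.01):
    """True if the canary's error rate is within tolerance of the baseline's."""
    return error_rate(canary) <= error_rate(baseline) + tolerance

baseline = [200] * 98 + [500] * 2      # 2% errors
good_canary = [200] * 99 + [500] * 1   # 1% errors
bad_canary = [200] * 90 + [503] * 10   # 10% errors

print(canary_passes(baseline, good_canary))  # True
print(canary_passes(baseline, bad_canary))   # False: roll back
```

Production gates would compare latency percentiles and saturation as well as errors, but the shape of the decision is the same.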
Quotes are from an interview with Seth Vargo, a developer relations engineer at Google.