High-stakes software needs observability
In this post, I'll cover two examples where observability can protect revenue and prevent costly errors. But first, let's set the scene...
Setting the scene
In the past, you could “write some logs and collect some metrics” and get away with it, but with microservices, that’s no longer enough
“When a user makes a request to your service, that may hit 10 different backend services that have to do different operations, which is a nightmare to decipher with traditional metrics”
So you now have 10 points of failure; distributing risk beats a single point of failure, but it still won't make for a reliable user experience if several services fail or degrade at the same time
Observability can protect e-commerce revenue (example)
Imagine you’re responsible for the reliability of an e-commerce website
The user comes to the homepage and sees the features as if they were a monolith, but they are not: they are microservices glued together into a cohesive user experience
“The frontpage alone will call on the order recommendation system, the authentication system to log them in, which needs to pull up their user preferences, which needs to pick their most recently purchased items — that might involve 30, 40 or even 100 microservices”
Now imagine that you have weak or non-existent observability. Also, imagine that a microservice in a user’s value stream is dysfunctional.
This (simplified) chain of events may occur:
The item-inventory call fails outright or times out
Without a working inventory lookup, the add-to-cart service can't activate
With the add-to-cart service down, its button can't put items in the user's cart
End result: the cart never receives the user's selected items and the sale is lost
Multiply that by millions of users and you're looking at revenue loss on a grand scale
Good observability can mitigate or prevent an issue like this. More on this later.
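As a concrete illustration, here's a minimal Python sketch (the service name, latency budget and metrics shape are hypothetical, not from the interview) of timing each downstream call and failing loudly when a dependency like item-inventory breaks, so the problem surfaces in your metrics instead of as a silently dead cart:

```python
import time

# Hypothetical, simplified instrumentation for the add-to-cart path.
TIMEOUT_S = 0.5  # assumed latency budget for the inventory call

class DependencyError(Exception):
    pass

def call_with_observability(name, fn, metrics, timeout_s=TIMEOUT_S):
    """Time a downstream call, record the outcome, and fail loudly."""
    start = time.monotonic()
    try:
        result = fn()
    except Exception:
        metrics.append({"service": name, "status": "error",
                        "latency_s": time.monotonic() - start})
        raise DependencyError(f"{name} failed")
    latency = time.monotonic() - start
    status = "slow" if latency > timeout_s else "ok"
    metrics.append({"service": name, "status": status, "latency_s": latency})
    return result

metrics = []
# A healthy inventory lookup...
call_with_observability("item-inventory", lambda: {"sku": "A1", "stock": 3}, metrics)
# ...and a failing one, which now shows up instead of silently breaking the cart
try:
    call_with_observability("item-inventory", lambda: 1 / 0, metrics)
except DependencyError:
    pass

print([m["status"] for m in metrics])  # ['ok', 'error']
```

With even this much instrumentation, the on-call engineer sees an error spike on item-inventory rather than a mysterious drop in completed checkouts.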
Observability can protect a trader's bottom line (example)
Imagine you’re responsible for the reliability of a professional share trading platform
The user, typically a day trader, loads their Java-based desktop application, which serves as mission control for their trading work
Many of their trades are automated with support from quants who crunch non-linear scenarios, e.g. if x goes up and y goes down, then buy z
The application may contain hundreds of microservices, including the portfolio, daily P&L, portfolio P&L, share chart, share ticker, last price, buy price, sell price and price book
Now imagine that you have weak or non-existent observability. Also, imagine that a microservice within a user's value stream is dysfunctional.
This (simplified) chain of events may occur:
The last-price service lags due to latency in the exchange data feed
The trader sees a lower-than-actual market price and places a trade
The quant's algorithm detects the low price and triggers a barrage of purchases
End result: the trader ends up holding shares that don't match their trading strategy
Professional traders place trades worth millions of dollars, so this can be a costly error.
Good observability could mitigate or entirely prevent an issue like this.
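One simple guard that observability enables here is a freshness check: if every quote carries its own timestamp, the algorithm can refuse to act on a lagging last-price feed. A minimal sketch, with hypothetical field names and an assumed two-second staleness budget:

```python
import time

# Illustrative guard (not from the interview): refuse to trade on stale quotes.
MAX_QUOTE_AGE_S = 2.0  # assumed staleness budget

def quote_is_fresh(quote, now=None, max_age_s=MAX_QUOTE_AGE_S):
    """Return True only if the last-price quote is recent enough to act on."""
    now = time.time() if now is None else now
    return (now - quote["ts"]) <= max_age_s

now = 1_000_000.0
fresh = {"symbol": "XYZ", "last_price": 101.2, "ts": now - 0.5}
stale = {"symbol": "XYZ", "last_price": 99.8, "ts": now - 30.0}  # lagging feed

print(quote_is_fresh(fresh, now=now))  # True
print(quote_is_fresh(stale, now=now))  # False: the algorithm should hold fire
```

The point isn't the check itself but the observability behind it: without per-quote latency data, the lagging feed looks identical to a genuine price drop.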
Briefly, how good observability works
Get the basics right
You need to measure the latency between all of your microservices
You might need to measure individual function calls in those microservices
That helps you pinpoint where a performance issue lives and measure how long it has been impacting users
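Function-level measurement can start as simply as a timing decorator that records a latency sample per call. Real systems would use a tracing library such as OpenTelemetry; this hand-rolled sketch (all names are illustrative) just shows the idea:

```python
import time
from functools import wraps

# Minimal sketch of per-function latency measurement; names are illustrative.
LATENCIES = {}  # function name -> list of observed latencies in seconds

def traced(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            LATENCIES.setdefault(fn.__name__, []).append(
                time.monotonic() - start)
    return wrapper

@traced
def fetch_preferences(user_id):
    time.sleep(0.01)  # stand-in for a downstream call
    return {"user": user_id, "theme": "dark"}

fetch_preferences("u42")
fetch_preferences("u42")
print(len(LATENCIES["fetch_preferences"]))  # 2 samples to aggregate
```

Aggregating these samples per function is what lets you answer "where is the slowdown, and since when" instead of guessing.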
“Observability everywhere” mindset
Quick filtering and dashboarded analytics for on-call events
Build it into release practices like canary testing, blue-green deployments and A/B tests
Go beyond a systems-only monitoring mindset: social media monitoring can reveal the impact and scale of an incident from the end-user perspective, which matters most
Uptime is only part of the equation: you're not winning if a component is down and the user experience degrades even while the site stays up, e.g. a down CDN means slower load times for users far from your origin servers
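In the canary case, the observability data feeds a promote-or-rollback decision directly. A hypothetical sketch (the status codes and the 1% tolerance are assumptions, not from the interview) comparing a canary's error rate against the baseline:

```python
# Illustrative canary gate: promote only if the canary is no worse
# than the baseline, within a small tolerance.

def error_rate(status_codes):
    """Fraction of requests that returned a 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for s in status_codes if s >= 500) / len(status_codes)

def canary_passes(baseline, canary, tolerance=0.01):
    """True if the canary's error rate is within tolerance of the baseline's."""
    return error_rate(canary) <= error_rate(baseline) + tolerance

baseline = [200] * 98 + [500] * 2      # 2% errors
good_canary = [200] * 99 + [500] * 1   # 1% errors
bad_canary = [200] * 90 + [503] * 10   # 10% errors

print(canary_passes(baseline, good_canary))  # True
print(canary_passes(baseline, bad_canary))   # False: roll back
```

Production gates would compare latency percentiles and saturation as well as errors, but the shape of the decision is the same.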
Quotes are from an interview with Seth Vargo, a developer relations engineer at Google.