Rundown of Netflix's SRE practice
Netflix's extensive movie and TV show library isn't the only thing that keeps users hooked. Let's explore the people and practices behind the app performance that makes for such a sticky experience.
📊 Performance stats for Netflix
When it was alone on top of the streaming world in 2016...
100,000 instances at peak on AWS with 700+ microservices
Over 30 terabits per second of Internet traffic
SREs played an important role in making sure all of this ticked over smoothly
🤝 How SRE fits into the Netflix org and culture
SRE team at Netflix is known as CORE (Cloud Operations Reliability Engineering)
Belongs to a larger group known as Operations Engineering
SREs work alongside specialist roles interrelated with SRE work, such as Performance Engineers and Chaos Engineers
Culture at Netflix is freedom and responsibility — both are important to effective SRE work
CEO Reed Hastings' radical candor approach — be critical because you care about the other person — may make it easier for SREs to call out poor prod decisions
SREs act as consultants for developers who need to run what they build
They are also the last line of defence when issues affect production — e.g. if the testing service goes down, blocking code from being pushed to production, they'll fix it
Solvers of problems that don't have a straightforward approach, i.e. where RTFM may not work; a willingness to experiment and seek novel solutions helps
Fixes can take minutes, hours, days, weeks or months — there is no fixed time to solve; these can be larger projects that other teams don't have time for
A lot of reading source code & documentation, sourcing experiment ideas, running experiments and then measuring the outcomes
Work can be done as a solo mission or in a temporary, problem-specific team
🧰 Support production tooling
Paved paths have been designed by operations engineers so developers can leverage service discovery, application RPC calls and circuit breakers effectively (a toy circuit breaker sketch follows below)
This is not prescriptive: developers can deviate if they want to carve their own path for their service
Path deviators are still subject to attacks by the Simian Army (Chaos Monkey and its siblings)
Extreme DevOps — you write it, you run it — engineers do the full job of developing software, deploying it through the pipeline and running it in production
SREs codify best practices from past deployments to make sure production is optimal
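Netflix's production-grade circuit breaking came from its Hystrix library on the JVM; purely to illustrate the pattern the paved path bakes in, here is a minimal Python sketch (the class and thresholds are illustrative, not Netflix's API):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: fail fast once a downstream service looks
    unhealthy, then allow a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # a healthy call closes the circuit
            return result
```

On the paved path, developers get this behaviour for free from the RPC client rather than wiring it up themselves.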
Netflix is best known to the SRE world for Chaos Monkey, its Chaos Engineering tool
But wait, there's more!
Canary tools for developers to check code and make sure there is no performance regression (see the sketch at the end of this list)
Dashboards to review service performance, like upstream error rates and alerts for supporting services
Distributed system tracing to trace performance across the microservices ecosystem
Chat rooms, pagers and ticket systems for the fun engineer-level support work
Actionable alerts — check the right things, go off when appropriate, quiet when not
Spinnaker — allows for blue-green deployment with multi-cloud setup (insanely powerful)
Pre-production checklist that scores each aspect of service before going into production
Example of SRE codified tool — Is your service production ready?
Source: Jonah Horowitz, SRECon 2016
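On the canary tooling above: Netflix later open-sourced its automated canary analysis as Kayenta, but the core idea can be sketched in a few lines. A minimal sketch, assuming we compare only error rates (real canary analysis scores many metrics at once):

```python
def canary_regression(baseline_errors, baseline_requests,
                      canary_errors, canary_requests,
                      tolerance=1.25):
    """Flag the canary if its error rate exceeds the baseline's by
    more than 25%. The tolerance is illustrative, not Netflix's."""
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / max(canary_requests, 1)
    return canary_rate > baseline_rate * tolerance

# e.g. baseline at 0.1% errors, canary at 0.4% -> fail the canary
assert canary_regression(100, 100_000, 40, 10_000)
```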
🔥 Incident management
Get the right people into the room and make sure they can troubleshoot the incident
Document everything during the incident to help with post-mortem analysis
Post-mortems aren't necessarily blameless — something went wrong because someone did something, but rather than punish that person, have them own it as a learning process
The #1 business metric is SPS (Starts Per Second): the number of people successfully hitting the play button
Short and to-the-point checklists for handling emergencies are codified in readily accessible manuals
Developers can assign metrics for their services to be addressed by SRE once certain thresholds are hit (see the sketch below)
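A minimal sketch of what such an actionable, threshold-driven check could look like, using SPS as the metric (the window, threshold and function names are made up for illustration):

```python
def sps_alert(current_sps, recent_sps, drop_fraction=0.1):
    """Page only when SPS falls meaningfully below its recent trend:
    an 'actionable' alert that stays quiet otherwise. The 10% drop
    threshold and the averaging window are illustrative."""
    baseline = sum(recent_sps) / len(recent_sps)
    return current_sps < baseline * (1 - drop_fraction)

# Steady ~200k starts/sec, current reading 170k -> page someone
assert sps_alert(170_000, [201_000, 199_500, 200_250])
```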
🏎️ Support performance engineers
Need for consistently good service performance rather than one-off high performance — users should see acceptably low time-to-interactive (TTI) and time-to-render (TTR)
TTI - user can interact even if not everything is fully loaded or rendered
TTR - everything above the fold is rendered
SREs support autoscaling for on-demand scaling — saving money versus pre-purchased on-prem compute — for encoding, precompute, failover and blue-green deployments (see the sketch below)
They handle tricky autoscaling issues like under-provisioned resources, as well as sticky traffic, bursty traffic and uneven traffic distribution
An example of a performance dashboard would cover load issues, errors, latency issues, saturation of resources (e.g. CPU load averages) and instance counts
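For flavour, here's one piece of that plumbing as it looks on AWS: a target-tracking scaling policy created with boto3. This is a generic AWS sketch rather than Netflix's actual setup, and the group and policy names are hypothetical:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hold the group at ~50% average CPU; AWS adds or removes instances
# to track the target. Names and target value are illustrative.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-service-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,
    },
)
```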
What is Chaos Engineering?
Experimenting on a distributed system in order to build confidence in the system's ability to withstand turbulent conditions in production — Nora Jones, ex-Senior Chaos Engineer, Netflix
Chaos engineering is heavily based on Netflix's work from 2008 through the early 2010s
Builds on the value of common tests like unit testing and integration testing
Chaos takes it up a notch by adding failure or latency to calls between services (see the sketch below)
Helps uncover and resolve issues typically found when services call on one another, like network latency, congestion, and logical or scaling failures
Can cause a culture shift from "What happens if this fails?" to "What happens when this fails?"
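Netflix formalised this style of injection in its FIT (Failure Injection Testing) framework; as a toy illustration of the idea, here's a wrapper that adds latency or errors to a sample of calls (all names and rates are illustrative):

```python
import random
import time

def chaotic(call, latency_s=2.0, failure_rate=0.05, inject_rate=0.1):
    """Wrap a service call so a sample of requests gets extra latency
    or an injected error, exercising callers' timeouts and fallbacks.
    The rates here are illustrative."""
    def wrapped(*args, **kwargs):
        if random.random() < inject_rate:
            time.sleep(latency_s)  # injected latency
            if random.random() < failure_rate:
                raise ConnectionError("injected failure")
        return call(*args, **kwargs)
    return wrapped
```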
How Chaos Engineering can be done
Graceful restarts and degradations using the Chaos Monkey tool
Targeted chaos for specific components of a system, e.g. deleting Kafka topics
Cascading failures of one part of the system triggering failure in other parts of the system
Injecting failure into services in an automated manner, with limits on the number of users affected — the experiment can be cut short if SPS would drop below acceptable levels (see the sketch below)
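A minimal sketch of that guardrail loop, assuming hypothetical start_injection/stop_injection/read_sps helpers (Netflix's real automation for this lived in its ChAP platform):

```python
import time

def run_experiment(start_injection, stop_injection, read_sps,
                   min_sps_fraction=0.98, max_duration_s=300):
    """Failure-injection experiment with an SPS guardrail: abort as
    soon as SPS drops below 98% of the pre-experiment baseline.
    Thresholds and helper functions are illustrative."""
    baseline = read_sps()
    start_injection()
    deadline = time.monotonic() + max_duration_s
    try:
        while time.monotonic() < deadline:
            if read_sps() < baseline * min_sps_fraction:
                return "aborted: SPS guardrail tripped"
            time.sleep(5)  # polling interval, illustrative
        return "completed"
    finally:
        stop_injection()  # always stop injecting, even on abort
```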
All of this to make sure that you can easily binge-watch your fave show this weekend!