Runbooks for better incident management

An SRE's best friend, runbooks can help engineering teams stop putting out the same fires again and again.

Why runbooks are useful

  1. Automation doesn't protect against every issue, so software still needs tens to hundreds of different activities carried out by skilled humans to keep the system running.

  2. "30-40% of procedures require human judgement to resolve safely, so that's still a bunch of runbooks [that] won't go away, even if large parts of deployment are push-button processes."

  3. Runbooks prevent situations like this one: "I recently ran into a situation where I spent 6 hours understanding how something works that would have taken 20 minutes if the relevant information was stored somewhere."


Ways that teams have set up their runbooks…

Read the rest of this archived post on my website: