Runbooks
Our aim is to plan our response to potential incidents in advance. Principled incident management can go out the window in real-life situations. Our runbooks, alongside gamedays, aim to anticipate failure scenarios and plan for them.
As part of our production requirements, runbooks should be created before a system is put into production.
At minimum, runbooks should analyse the following potential scenarios:
- Failure of any third-party dependencies
- Failure of data infrastructure
- Failure of core hosting provider
- Failure of any hosting provider services
- Failure of domain routing and fallback
Each failure scenario should be graded into one of the following categories:
1. Failure can be monitored within the team and has automatic failover.
2. Failure will require some manual steps to resolve.
3. Stakeholders need to be communicated with and actions undertaken by them.
The process of creating runbooks should highlight which issues fall into category 3, so that we can aim to develop fallbacks that move them into category 1 or 2.
The runbooks should also contain a resolution strategy and a communication strategy for each scenario.
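One way to keep runbooks consistent is to record each scenario in a structured form that can be checked automatically. The following is a minimal sketch in Python, not a prescribed format: the class names, field names, and scenario keys are all hypothetical, and the grades are numbered 1 to 3 in the order the categories are listed above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    """Grading categories, numbered in the order they are listed above."""
    AUTOMATIC_FAILOVER = 1   # monitored within the team, fails over automatically
    MANUAL_STEPS = 2         # requires some manual steps to resolve
    STAKEHOLDER_ACTION = 3   # stakeholders must be informed and must act


# Hypothetical keys for the minimum set of scenarios a runbook should analyse.
REQUIRED_SCENARIOS = {
    "third_party_dependency",
    "data_infrastructure",
    "core_hosting_provider",
    "hosting_provider_service",
    "domain_routing_and_fallback",
}


@dataclass
class Scenario:
    kind: str                    # one of REQUIRED_SCENARIOS (assumed naming)
    description: str
    grade: Grade
    resolution_strategy: str
    communication_strategy: str


def missing_scenarios(runbook: list[Scenario]) -> set[str]:
    """Return the required scenario kinds the runbook does not yet cover."""
    return REQUIRED_SCENARIOS - {s.kind for s in runbook}


def needs_fallback(runbook: list[Scenario]) -> list[Scenario]:
    """Return category 3 scenarios, the candidates for new fallbacks."""
    return [s for s in runbook if s.grade == Grade.STAKEHOLDER_ACTION]


# Illustrative entries only; the services described are invented.
example_runbook = [
    Scenario(
        kind="data_infrastructure",
        description="Primary database becomes unavailable",
        grade=Grade.AUTOMATIC_FAILOVER,
        resolution_strategy="Replica is promoted automatically; confirm via monitoring",
        communication_strategy="Post a note in the team channel",
    ),
    Scenario(
        kind="domain_routing_and_fallback",
        description="DNS records must be changed by the domain owner",
        grade=Grade.STAKEHOLDER_ACTION,
        resolution_strategy="Ask the domain owner to repoint records to the fallback",
        communication_strategy="Contact the stakeholder directly and log the request",
    ),
]
```

Calling `missing_scenarios(example_runbook)` lists the scenario kinds still to be written up, and `needs_fallback(example_runbook)` lists the category 3 entries worth prioritising when designing fallbacks.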
These runbooks should be freely visible to the engineering team and included as part of an application's documentation.