Running GameDays is an effective chaos engineering practice for rehearsing failure scenarios, discovering how the system as a whole (including both the technical components and the people involved) responds under stress, and generally bolstering resilience.
We have put together several resources with the aim of inspiring and enabling people to run GameDays in their organisations. Here is an article covering the technical aspects, while this one provides you with guidance on how to approach the planning and facilitation side of things.
Organising a GameDay is a journey which starts with getting a group of people into alignment, and then narrowing in on the logistical planning required to run a series of experiments on a system. The GameDay itself is a highly collaborative and intense experience, involving a cross-functional team working together to methodically observe how the system responds under stress.
The framework provided here is based on our experience running a GameDay in an enterprise-scale financial institution. The culture, size and complexity of your organisation will determine the degree to which all of the steps and activities are relevant in your context, so we encourage you to mix and match as feels appropriate.
Getting stakeholders on board
There are two important reasons to take the time to ensure that managers and executives understand the proposed activity and are aligned in their support of the initiative.
One reason relates to differing attitudes towards failure and the early identification of weaknesses. Mature organisations embrace the philosophies that underpin chaos engineering, but you shouldn’t take for granted that all of your stakeholders share that mindset. The concept of discovering “bugs in production” could be very nerve-wracking in situations where there is a culture of blame. While it is sometimes necessary to ruffle some feathers to drive important changes through an organisation, this should usually be done strategically, with champions in place who can influence appropriately at each level of the organisation, and provide air cover as necessary.
The other reason is that running a GameDay costs money, and requires people’s focus and attention, usually over and above their normal responsibilities. If there is no clear support for the initiative from above, then it’s probable that you’ll run into difficulty trying to get all of the right people lined up at the same time.
Conduct a workshop or put together a document/pack which educates the stakeholders on the benefits of running a GameDay, and that outlines the timeline and resource requirements. Ensure that as an outcome a manager with sufficient control of the purse strings is on board, and that there is an appropriate degree of understanding and alignment to be able to proceed safely.
People and roles
Determining exactly who needs to be involved in a GameDay is a bit of a chicken and egg situation. The scenarios you plan to test will inform the people and skills you need, but you won’t know the exact scenarios until the group do some collaborative brainstorming.
A good place to start is with people who have a solid, broad understanding of each layer or component of the stack. They may have a role title like “solution architect” or “platform engineer”. They may not have the deepest understanding of the actual codebase, but they will know how everything hangs together, and they will know the names of the exact people to call for any specialised area.
We have also found that performance engineers tend to be excellently placed to add value to GameDay planning and execution. Due to the nature of the role, they are often across both the functional aspects of a system and how information traverses the whole stack. And critically, they have the tooling, scripts and test data to be able to execute scenarios and generate load.
It is important to think beyond just the folks who build the software, to also include those who support the systems and interact with the customers. A key tenet of GameDay should be that it takes a customer-centric view. Resilient organisations ensure that the customer can always achieve their goals, or at least has an experience which degrades gracefully and is kept well informed when there are issues. This brings in people such as production support, frontline staff (e.g. call centre) and potentially even user experience designers.
The balancing act is to ensure that you have the bare minimum people you need to make the day successful, without ending up with a cast of thousands.
Here is a template to help plan and track the people involved.
Organising the workshops
Running a GameDay involves getting several people into alignment about the objectives and approach. This process is most effectively achieved through a series of workshops as a lot of shared understanding comes about through the conversations.
The goal of these workshops is to emerge with a defined set of scenarios to test, and a plan for getting all of the logistics in place.
Workshop 1 – Setting the scene
Start by getting everyone on the same page as to what a GameDay is, and why it is important to the organisation. Ideally have the senior person or sponsor there to publicly declare their support for the initiative.
Draw up the system architecture on the whiteboard before the meeting, and give everyone attending the workshop a chance to introduce themselves and to point out the areas they are currently working on.
A reasonable goal for this first workshop is to arrive at an agreed view of the whole system architecture that everyone is comfortable with, and to identify the very broad areas or themes that would make sense to think about testing. For example, in a banking environment, discuss whether it would make sense to test a payment flow. It would likely provide great value to do so, but it might be quite complicated to set up. It might be more feasible (initially at least) to test simpler customer flows such as Login or Check Balance. Aim for around three or four themes.
At this point, check that you have the right group of people on board. You may discover that there are gaps, or maybe that some people aren’t required if certain flows or themes won’t be included in the GameDay.
Workshop 2 – Identify the scenarios
Once you have broadly framed the themes, it is time to identify the scenarios you’ll be exercising on your GameDay.
A way to approach this is as a brainstorming exercise.
- Draw the architecture on the whiteboard.
- Have everyone write all the things they can think of that could go wrong on Post-It notes, and stick them on the relevant component or integration point in the architecture. Keep the Post-Its colour consistent for each theme – e.g. for Login everyone uses yellow, for Check Balance everyone uses pink etc.
- Once the team have captured their ideas for a theme, have everyone briefly explain what each note means.
- Assign everyone a set amount of votes, and have them indicate the scenarios they think make sense to test. When making their decision, encourage the group to think about what would provide the most value to test in a GameDay – e.g. things that don’t get looked at in the course of normal feature or DR testing.
- Capture all of the items and the number of votes against each of them.
Repeat this process for each of the themes.
Once you have a list of scenarios, across the various themes, as a group you can further refine the list. You might refine the list by plotting them on a value versus complexity graph. You should aim to arrive at a balanced set of three or four scenarios. If they are all highly complex to set up and execute, then it will put a lot of pressure on the preparation and execution of your GameDay. But you also don’t want them to be too simple. Sometimes the really valuable scenarios just feel too tricky to set up for your first GameDay, and that’s OK – be realistic about how much you bite off.
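As a rough illustration of the value versus complexity exercise, here is a minimal Python sketch. It assumes the group has scored each scenario from 1 to 5 for value and complexity; the scoring scale, the quadrant labels, the cap on complex scenarios and the example scenarios are all our own assumptions for illustration, not part of any toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    name: str
    theme: str
    votes: int       # votes received in the brainstorming session
    value: int       # 1 (low) to 5 (high), agreed by the group (assumed scale)
    complexity: int  # 1 (simple) to 5 (hard to set up) (assumed scale)


def quadrant(s: Scenario, midpoint: int = 3) -> str:
    """Place a scenario in one of four value/complexity quadrants."""
    high_value = s.value >= midpoint
    high_complexity = s.complexity >= midpoint
    if high_value and not high_complexity:
        return "quick win"
    if high_value and high_complexity:
        return "big bet"
    if not high_value and not high_complexity:
        return "filler"
    return "avoid for now"


def shortlist(scenarios: List[Scenario], limit: int = 4,
              max_big_bets: int = 2) -> List[Scenario]:
    """Pick up to `limit` scenarios, capping the number of complex ones."""
    # Highest value first, then most votes, then simplest to set up.
    ranked = sorted(scenarios, key=lambda s: (-s.value, -s.votes, s.complexity))
    chosen: List[Scenario] = []
    big_bets = 0
    for s in ranked:
        q = quadrant(s)
        if q == "avoid for now":
            continue
        if q == "big bet":
            if big_bets >= max_big_bets:
                continue
            big_bets += 1
        chosen.append(s)
        if len(chosen) == limit:
            break
    return chosen


# Hypothetical scenarios for a banking system, for illustration only.
candidates = [
    Scenario("Identity provider times out", "Login", votes=6, value=5, complexity=2),
    Scenario("Payment gateway drops messages", "Payments", votes=7, value=5, complexity=5),
    Scenario("Balance cache returns stale data", "Check Balance", votes=4, value=4, complexity=2),
    Scenario("Core ledger failover", "Payments", votes=5, value=5, complexity=4),
    Scenario("Login page static asset missing", "Login", votes=1, value=2, complexity=1),
    Scenario("Rare batch job collision", "Check Balance", votes=1, value=2, complexity=5),
]

for s in shortlist(candidates):
    print(f"{s.name} ({s.theme}): {quadrant(s)}")
```

Lowering `max_big_bets` is one way to keep a first GameDay realistic; the numbers only support the group conversation and should not replace it.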
Workshop 3 – Define your hypotheses
For each of the scenarios the team have agreed to exercise during GameDay, you need to define the hypotheses for how you expect the system to respond. This involves spelling out:
- How the system should be performing during steady state
- What chaos will be injected
- How the system is expected to respond
Think about things like what alerts should fire and what the end customer should experience on their actual device.
There may be several aspects of the system which will be impacted under one scenario, and each will need to be investigated and measured on the GameDay.
Repeat this process for each of the scenarios. You may find that it takes more than one workshop.
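One way to keep hypotheses consistent across scenarios is to capture each one in the same structure. The following Python sketch is only a suggestion: the field names and the example content are assumptions made for illustration, and a shared document or spreadsheet would serve equally well.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    scenario: str
    steady_state: str       # how the system should perform before chaos
    chaos_injected: str     # what will be broken or degraded
    expected_response: str  # how the system is expected to respond
    expected_alerts: List[str] = field(default_factory=list)
    expected_customer_experience: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of the parts that have not been filled in yet."""
        parts = {
            "steady_state": self.steady_state,
            "chaos_injected": self.chaos_injected,
            "expected_response": self.expected_response,
            "expected_alerts": self.expected_alerts,
            "expected_customer_experience": self.expected_customer_experience,
        }
        return [name for name, value in parts.items() if not value]


# Hypothetical example content, not taken from a real system.
login = Hypothesis(
    scenario="Identity provider times out",
    steady_state="Logins succeed within the agreed response time",
    chaos_injected="Add latency to calls to the identity provider",
    expected_response="Login requests fail fast and retry with backoff",
    expected_alerts=["Login error rate above threshold"],
    expected_customer_experience="A clear 'try again shortly' message in the app",
)

draft = Hypothesis(
    scenario="Balance cache returns stale data",
    steady_state="Balances shown match the system of record",
    chaos_injected="Stop cache invalidation messages",
    expected_response="Service falls back to the system of record",
)

print(login.missing_fields())  # nothing left to define
print(draft.missing_fields())  # still needs alerts and customer experience
```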
Workshop 4 – Plan your logistics
With your scenarios and hypotheses identified, you have a clear view of what needs to be done in order to create each of the scenarios on your GameDay.
This workshop is about assigning very tangible actions to people. Things like:
- What environment will this be run in?
- Who will configure the test accounts and data?
- To see the system from an end customer’s viewpoint, do you have the necessary versions of the application on appropriate devices that can access the system you will be testing?
- How will the steady state be created and how will it be verified that the system is functioning normally before any chaos is injected to begin the experiments?
- Who has the necessary system access and privileges to execute the activities that form your chaos scenarios?
- How do you plan to monitor and investigate each of these scenarios to verify that the system is responding as expected?
- How will you clean up once the test is complete to return the system to its original state?
Carefully assign these tasks and responsibilities, and ensure that everything is ready in time for the GameDay.
Setting up for GameDay
There are a few different aspects related to setting up for the GameDay itself:
- Technical setup to enable people to work effectively
- Room setup, facilitation tools and materials
- Things to make the day fun
- Prepare a run sheet
Technical Setup
Apart from the environments, data, scripts etc which you’ll need to have set up in advance, here are some things to think about:
- Does everyone have the necessary network access?
We have seen cases where certain systems are only accessible from particular physical locations, which has proved a challenge when bringing everyone to one spot for a GameDay. Think about the variety of tasks people are planning to do, and how they intend to connect all of the devices they plan to use.
- How will systems be monitored?
Ideally you will be able to display dashboards which give visibility to the relevant key metrics for the systems involved in your GameDay. Identify who has access to these dashboards and who can set them up, and then think about how you will display them during the GameDay. Relying on a projector may not be ideal, depending on your room setup, as projectors often consume valuable wall space. Think about whether you need dedicated machines to drive the dashboards, and also whether you’ll need long cables to be able to reach a TV monitor.
Room setup, facilitation tools and materials
Having a room where people can comfortably work, focus and collaborate is important. Here are some things to consider:
- Whiteboards!
For the facilitators, GameDays can be challenging environments in that there is a lot of detailed information to visualise and manage. Ideally you should have several whiteboards to work with, so that you can visualise the infrastructure, while also outlining the scenarios and capturing the results as they emerge.
- Will you need a fan?
If you are holding your GameDay in a meeting room, the air conditioning may not be designed to cater for that many people and that many computers running flat out for a whole day.
- How are the ergonomics?
Some people really prefer to work with an external monitor, keyboard and mouse. Think about how you will cater for any such requirements in terms of equipment, desk space and power.
- Retrospective
Apart from capturing all the insights about your systems coming out of the GameDay scenarios, be sure to collect learnings about the process itself. Ideally have dedicated wall space to collect feedback and learnings as the day progresses.
Things to make the day fun
GameDay is a great opportunity to build relationships between people who don’t always get to work together. It is also something a bit outside the norm, so it is a chance to make it a fun and memorable day for all involved. One great way to do that is with food and snacks. An added benefit of providing food is that it prevents the need for people to leave the GameDay room and risk disturbing the focus of the group.
Prepare a run sheet
There will be a lot to get through and line up on the day itself, and your run sheet will be your bible. While it is important to be comfortable with the concept of going off script to pursue any interesting and valuable path that the experiments may reveal, having a clear and solid plan will give you the flexibility to adjust your timings to accommodate any such deviations.
Here is a template of a run sheet.
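If it helps to see how much room you have for going off script, a run sheet can also be sketched in a few lines of Python. This is a simple illustration under our own assumptions: the activities, owners, durations, date and hard stop below are made up, and the template above remains the primary tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class RunSheetItem:
    activity: str
    owner: str
    minutes: int


def schedule(items: List[RunSheetItem],
             start: datetime) -> Tuple[List[Tuple[datetime, RunSheetItem]], datetime]:
    """Lay the items end to end; return (start time, item) pairs and the finish time."""
    timeline = []
    current = start
    for item in items:
        timeline.append((current, item))
        current += timedelta(minutes=item.minutes)
    return timeline, current


def slack_minutes(items: List[RunSheetItem], start: datetime,
                  hard_stop: datetime) -> int:
    """Minutes left between the planned finish and the hard stop (negative if overrunning)."""
    _, finish = schedule(items, start)
    return int((hard_stop - finish).total_seconds() // 60)


# Hypothetical run sheet; the date is arbitrary.
run_sheet = [
    RunSheetItem("Component roll call", "Facilitator", 30),
    RunSheetItem("Scenario 1", "Performance engineer", 90),
    RunSheetItem("Scenario 2", "Platform engineer", 90),
    RunSheetItem("Lunch", "Everyone", 45),
    RunSheetItem("Scenario 3", "Solution architect", 90),
    RunSheetItem("Retrospective", "Facilitator", 45),
]
start = datetime(2024, 1, 15, 9, 0)
hard_stop = datetime(2024, 1, 15, 17, 0)

timeline, finish = schedule(run_sheet, start)
for begins, item in timeline:
    print(f"{begins:%H:%M}  {item.activity} ({item.owner}, {item.minutes} min)")
print(f"Planned finish {finish:%H:%M}, slack {slack_minutes(run_sheet, start, hard_stop)} min")
```

Knowing the slack up front makes it easier to decide on the day whether an interesting tangent can be pursued without dropping a planned scenario.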
Shake out the environments
The first technical activity on the GameDay itself should be a component roll call, whereby the team go through and verify that the system is healthy and the steady state is working as expected. Ideally, you need this activity to be over as quickly as possible so as not to eat into your actual experimentation time. If possible, perform the shakeout the day before the GameDay, or at least rehearse and prepare for this activity so that it goes as smoothly and quickly as possible.
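To make the shakeout quick to repeat, the health checks can be scripted ahead of time. The sketch below shows one possible shape in Python, assuming each component can be checked by a function that returns True when healthy. The component names and stub checks are hypothetical placeholders; in practice each check would call whatever health endpoint or monitoring query your system actually provides.

```python
from typing import Callable, Dict, List, Tuple


def roll_call(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every health check and report which components are not ready.

    A check that raises an exception is treated as unhealthy, so one
    broken check cannot stop the rest of the roll call.
    """
    unhealthy: List[str] = []
    for component, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False
        if not healthy:
            unhealthy.append(component)
    return (not unhealthy, unhealthy)


def unreachable_check() -> bool:
    # Stub simulating a component that cannot be reached at all.
    raise ConnectionError("component did not respond")


# Hypothetical components with stub checks, for illustration only.
example_checks = {
    "login service": lambda: True,
    "balance API": lambda: True,
    "notification queue": lambda: False,
    "monitoring dashboard": unreachable_check,
}

ready, problems = roll_call(example_checks)
if ready:
    print("Steady state confirmed, ready to inject chaos")
else:
    print("Not ready, investigate:", ", ".join(problems))
```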
Running the GameDay
With all of your preparations in place, the running of the GameDay itself is a largely technical facilitation exercise. It involves guiding the team through the scenarios and capturing the outcomes and learnings. The real art of this process is to know when to encourage the team to stay focussed on the planned scenarios, and when to support them to chase down leads to unearth those precious “unknown unknowns” – the types of issue which are often only discovered through an exercise like a GameDay.
It’s important for the facilitators to be on hand to ensure that any impediments the team encounter are removed quickly, wherever possible. One example we encountered in a large-scale environment was needing a specific person from a vendor team to make a system change; that person required a timesheet code before they could engage with us.
After the GameDay
Once the GameDay has been completed, you will hopefully have generated a number of insights and specific findings about where the people and processes can be improved. These now need to be collated and communicated to the relevant people for later prioritisation in a fashion that is appropriate in your workplace.
It’s important to think about how you can take the learnings forward, and to build on each experience you have of running a GameDay. This can include reuse of tools, scripts, planning templates etc, and also upskilling more people in the organisation on how to run a GameDay of their own.
We hope you find these tips and resources useful in planning and running your GameDays. We’d love to hear your feedback, and to incorporate your ideas in the toolkit so that other practitioners can benefit.