“Build it and they will come” - gone are the days when a system could be launched into production and left alone for customers to bask in its wonderful glory. Customers have choices, and we are in an ever-present arms race to ensure they make the right choice by continually increasing the value of our product.
With the rise of the Internet, globalisation, mobile, “the cloud” and, more recently, the Internet of Things (IoT), the demands on our systems - and delivery teams - have never been higher. While we compete in the Red Queen’s Race, we have more people, systems and devices interacting with our systems from all corners of the globe. Our customers expect us to continually delight them with new features and capabilities. Crucially, our customers assume their needs can be served on demand - 3am maintenance windows are simply not acceptable when you have a global customer base.
Under these environmental pressures, our systems have scaled beyond what any single server - or team - could reasonably be expected to handle, moving us into the realm of distributed systems and emergent architectures, and redefining how we look at availability and resilience.
Trade-offs and the speed of light
One of the side-effects of modern agile development is that architecture is often more a product of evolution than creation. We like to think we have control over how our systems are designed, but in reality things are much more fluid than that.
We are constantly making trade-offs. For instance, any system that is distributed is bound by an annoying law of physics - the speed of light. This has frustratingly important consequences for storing data redundantly, and therefore on its availability. It forces us to make a choice in our data consistency model - eventually or strongly consistent - which trades off the availability of a system for a consistent view of the data. These decisions exert other pressures on the system, influencing other design decisions.
But more often, these trade-offs are grounded in the pressures of everyday delivery: “this feature must be shipped ASAP so that we can market test it”, “we don’t have the budget for this until Q4, can we deploy the tactical solution and revisit it then?” and so on. Often the consequences of these decisions are opaque to the other systems that depend on them.
To illustrate this point further: in one of our GameDay exercises we saw first hand the impact of Conway’s Law, when we discovered a potentially catastrophic flaw in a component that existed purely to work around networking restrictions, saving the team the delays of costly requests to other teams. Conway’s Law highlights a very human thing: getting things done within your own team is much easier than getting others to do them - even if asking others is the better technical solution.
The point is, our systems are never as precise and clean as the box-and-arrow view provided by your Enterprise Architect - they are messy.
What it means to be available or “Hope is not a strategy”
We’ve established that our systems are messy, and that there are immense environmental factors at play forcing us to continually evolve a changing, complex ecosystem. So how on earth do we create a stable system if hope is not a strategy?!
Let’s first define what availability is, and scratch beneath the surface of the nines of availability to find meaningful insights into how to improve it.
Availability is usually expressed as a percentage of uptime in a given year, such as 99.99% (“four nines”) or 99.999% (“five nines”), which equate to 52.56 minutes and 5.26 minutes of allowable downtime per year respectively.
Expressed as a formula:
Availability = uptime / (uptime + downtime)
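As a rough sketch of the arithmetic behind these figures, the snippet below converts an availability target into a yearly downtime budget. The function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(uptime: float, downtime: float) -> float:
    """Availability = uptime / (uptime + downtime), in any consistent time unit."""
    return uptime / (uptime + downtime)


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes of downtime per year")
```

Each additional nine shrinks the budget by a factor of ten, which is why the jump from four to five nines is so demanding.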
Traditional approaches to availability focus on reducing the amount of downtime a system has at any point in time. The leading cause of downtime (failure) is change introduced into the system, so naturally this resulted in tension between change agents - Product and Development teams - and those incentivised to maintain a stable system, traditionally an Operations team.
We can, however, restate the same formula in terms of failure and recovery, which casts the problem in another light:
Availability = MTTF / (MTTF + MTTR)
MTTF is “mean time to failure” and
MTTR is “mean time to recovery”.
By focusing on recovery time instead of downtime, we can see that optimising MTTR is in many ways a more effective strategy, both mathematically and practically. By focussing on improving the process of releasing software and reducing the severity of incidents, we find that performing change more often reduces the likelihood of failure, and speeds up recovery. By making deployments simple and boring, and constantly flexing the muscles used to release, we avoid lengthy outages when failures do occur, as our collective muscle memory kicks in. This is the case for continuous delivery.
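To make the MTTF / MTTR trade-off concrete, here is a small worked example. The numbers (one failure a month, a four-hour recovery cut to fifteen minutes) are invented purely for illustration, and `required_mttf` is a hypothetical helper obtained by rearranging the formula above.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def required_mttf(target_availability: float, mttr_hours: float) -> float:
    """MTTF needed to hit a target availability at a fixed MTTR.

    Rearranged from A = MTTF / (MTTF + MTTR) to MTTF = A * MTTR / (1 - A).
    """
    return target_availability * mttr_hours / (1 - target_availability)


if __name__ == "__main__":
    # Illustrative baseline: a failure every 30 days, 4 hours to recover.
    baseline = availability(mttf_hours=720, mttr_hours=4)
    # Same failure rate, but recovery drilled down to 15 minutes.
    fast_recovery = availability(mttf_hours=720, mttr_hours=0.25)
    print(f"baseline:      {baseline:.4%}")
    print(f"fast recovery: {fast_recovery:.4%}")

    # How rare would failures need to be to match that without faster recovery?
    hours = required_mttf(fast_recovery, mttr_hours=4)
    print(f"equivalent MTTF at 4h recovery: {hours / 720:.0f} months")
```

Under these assumptions, matching the gain from faster recovery by avoiding failure alone would mean going roughly sixteen months between failures instead of one, which is a much harder target for a system that is still changing.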
It’s convenient to think of each of our systems as independent boxes on diagrams; however, any system of significance needs to collaborate with other systems. This has serious implications for availability.
The first is that a service cannot be more available than the intersection of all its critical dependencies [1]. This seems obvious, but even if you exclude transitive dependencies this is still troubling. In order to understand and address this implication, Google uses the “rule of the extra 9”, where critical dependencies must offer one additional 9 relative to your own service. Very quickly, you’ll find that your architecture tightens up as a result of this terrifying constraint.
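One simple way to see why dependencies matter is to multiply availabilities together. This is a sketch under two stated assumptions that real systems rarely satisfy exactly: dependencies fail independently, and the service is down whenever any critical dependency is down. The function name and figures are illustrative.

```python
from math import prod
from typing import Iterable


def composite_availability(own: float, critical_dependencies: Iterable[float]) -> float:
    """Availability of a service that needs itself and every dependency up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return own * prod(critical_dependencies)


if __name__ == "__main__":
    four_nines = 0.9999
    five_nines = 0.99999

    # Three critical dependencies that are only as good as the service itself.
    same = composite_availability(four_nines, [four_nines] * 3)
    # The same three dependencies, each offering one extra nine.
    extra = composite_availability(four_nines, [five_nines] * 3)

    print(f"dependencies at four nines: {same:.5%}")
    print(f"dependencies at five nines: {extra:.5%}")
```

With dependencies at the same level as the service, the combined figure drops well below four nines; with the extra nine, the dependencies cost comparatively little.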
Secondly, a service’s downtime is at least its incident frequency multiplied by the time taken to detect and recover from each incident, so it cannot be more available than those two factors allow. We’ve discussed recovery; detection time, however, is often overlooked. It emphasises the need for any serious system to have reliable real-time monitoring and alerting, automated controls where possible and a well-drilled incident response team.
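The effect of detection time can be sketched in a few lines. The incident counts and durations below are made up for illustration, and the model assumes incidents do not overlap and that the service is fully down for the whole of each one.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_from_incidents(
    incidents_per_year: float,
    minutes_to_detect: float,
    minutes_to_recover: float,
) -> float:
    """Best-case availability given incident frequency and duration."""
    downtime = incidents_per_year * (minutes_to_detect + minutes_to_recover)
    return 1 - downtime / MINUTES_PER_YEAR


if __name__ == "__main__":
    # Monthly incidents that take 30 minutes to notice and 30 to fix.
    slow_detection = availability_from_incidents(12, 30, 30)
    # Same incidents, but alerting catches them within 5 minutes.
    fast_detection = availability_from_incidents(12, 5, 30)
    print(f"30 min to detect: {slow_detection:.3%}")
    print(f" 5 min to detect: {fast_detection:.3%}")
```

In this example nothing about the fix itself changed; better monitoring alone removed 300 minutes of downtime from the year.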
To achieve high availability, we aim to create resilient systems - systems with the ability to withstand known and potentially unknown conditions and recover automatically from failure.
According to Erik Hollnagel, a resilient system must:
- know what to expect (anticipation);
- know what to look for (monitoring);
- know what to do (how to respond); and
- know what just happened (learning).
However, the term resilience is not sufficient to capture our true intentions. Resilient systems resist and recover from stress, unlike fragile systems, which do not. But somewhere along the spectrum of stress or change there is a breaking point beyond which even the most resilient system fails.
Resilience is also generally applied to a narrower view of “system” than the one we mean in this article. We are concerned with the combination of technology and the supporting people and process wrapped around it, not simply technology.
What we really want is the opposite of fragile - antifragile: something that doesn’t simply resist stress up to a point, but benefits from repeated bouts of stress over time.
It’s hard to imagine a technology that can withstand and recover automatically from any stress, including stresses it has yet to encounter. But when we add humans to a given system, we can create feedback loops that continually stress the system in order to improve it: induce failure, learn and understand what is going on, and adjust the future behaviour of the system accordingly.
It is a way to artificially speed up the evolution of the natural learning process that comes from production incidents, and provides us with answers to all four of Hollnagel’s questions.
For the purposes of this article (both beyond and leading up to this point), when we refer to resilience, let us mean this broader definition of antifragile resilience.
Born to fail
Complex, distributed systems are destined to fail. As engineers, we resort to oversimplifications so that we can reason about and navigate our systems without being bogged down by minutiae such as the speed of light and how it rules us. But deep down we know that our perceptions deceive us, and we have no complete comprehension of how these complex systems interact - the behaviour is emergent.
You may have doubts about the previous statement; however, when you add the human factor to the equation, those doubts quickly subside. Decades of research in this field reveal the human hand in most catastrophic failures. In “How Complex Systems Fail”, Richard Cook summarises a number of key takeaways from this research as applied to complex systems. Here we highlight just three that are immediately relevant:
#1 Complex systems are intrinsically hazardous systems
#3 Catastrophe requires multiple failures – single point failures are not enough
#6 Catastrophe is always just around the corner
You should be convinced by now that when we work with distributed systems we are dealing with complex systems (#1), and perhaps these statements make some form of intuitive sense. If you reflect on production incidents you’ve been involved with, or look to well-documented public post-mortems, you will see the compounding effects of #3. Rarely, if ever, is a catastrophe caused by a simple bug or accidental keystroke - it’s normally a combination of things.
There is, however, an upside:
#18 Failure free operations require experience with failure
And this is the point. If we want to build resilient systems that resist - or rather, embrace - failure, then we need to practise it.
This realisation, that we must embrace failure to build confidence in our systems, is what led to the birth of Chaos Engineering. A relatively new discipline in our field, it sets out to exploit the fact that although we don’t fully understand how our systems work, we can still test whether they work under certain (volatile) conditions.
Chaos Engineering encompasses a number of different practices, but at its core it is all about running controlled experiments against our systems on various scales, in order to understand their behaviour.
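The shape of such a controlled experiment can be sketched in code. Everything below is hypothetical and deliberately tiny: `ToyShop` stands in for a real system, and `run_experiment` is not the API of any chaos tool. It simply checks a steady-state hypothesis before, during and after an injected failure, and always rolls the failure back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    name: str
    steady_before: bool
    steady_during: bool
    steady_after: bool


def run_experiment(
    name: str,
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check the steady state before, during and after an injected failure."""
    if not steady_state():
        # Abort: never add stress to a system that is already unhealthy.
        return ExperimentResult(name, False, False, False)
    inject_failure()
    try:
        steady_during = steady_state()
    finally:
        rollback()  # always restore the system, even if the check raises
    return ExperimentResult(name, True, steady_during, steady_state())


class ToyShop:
    """Toy stand-in for a real system: a cache in front of a database."""

    def __init__(self) -> None:
        self.cache_up = True
        self.database_up = True

    def can_browse(self) -> bool:
        return self.cache_up or self.database_up  # either source can serve reads

    def can_checkout(self) -> bool:
        return self.database_up  # orders must be written to the database


def database_outage(
    shop: ToyShop, steady_state: Callable[[], bool], name: str
) -> ExperimentResult:
    """Run one experiment that takes the toy database down and back up."""
    return run_experiment(
        name,
        steady_state,
        inject_failure=lambda: setattr(shop, "database_up", False),
        rollback=lambda: setattr(shop, "database_up", True),
    )


if __name__ == "__main__":
    shop = ToyShop()
    print(database_outage(shop, shop.can_browse, "browse survives a database outage"))
    print(database_outage(shop, shop.can_checkout, "checkout survives a database outage"))
```

In this toy, browsing holds its steady state throughout while checkout fails during the outage and recovers afterwards. A real experiment would measure customer-facing signals rather than booleans, but the hypothesis, injection, observation and rollback steps are the same.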
One of the most powerful Chaos Engineering practices is the GameDay - a team exercise where we place our systems under stress in order to learn and improve resilience. GameDay exercises simulate catastrophic failure scenarios as a series of controlled experiments on live systems, and test how our systems, people and processes respond. In our experience, outside of genuine production issues, nothing comes close to the insights and findings uncovered in these exercises.
The name “GameDay” was coined by Jesse Robbins, aka the “Master of Disaster”, whilst working at Amazon in the mid-2000s on large-scale distributed systems. His time training as a firefighter and working as an operations engineer led him to many of the same conclusions we’ve discussed here.
GameDays implement and build upon the principles discussed earlier: practising failure, understanding our systems’ critical dependencies, reducing MTTR and so on. Most importantly, we take a customer-centric view by asking the question “what is the impact on the customer if this happens?”
By focusing on the customer, we can start to see problems from another angle and introduce simple but effective changes to improve UX in times of failure. For example, in a banking context, if we lose the ability to make a payment, the customer should still be able to log in to their mobile banking app and check a balance - the system should not simply crash.
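The banking example can be sketched as a small piece of graceful degradation. All names here (`PaymentsUnavailable`, `build_home_screen` and the stub functions) are hypothetical and stand in for whatever clients a real app would use. The point is only that a payments failure is caught and reported, while the balance is still shown.

```python
from typing import Callable


class PaymentsUnavailable(Exception):
    """Raised by the (hypothetical) payments client when its backend is down."""


def build_home_screen(
    customer_id: str,
    get_balance: Callable[[str], float],
    get_payees: Callable[[str], list],
) -> dict:
    """Assemble the home screen, degrading gracefully if payments are down."""
    screen = {
        "balance": get_balance(customer_id),
        "payees": [],
        "payments_enabled": False,
        "notice": None,
    }
    try:
        screen["payees"] = get_payees(customer_id)
        screen["payments_enabled"] = True
    except PaymentsUnavailable:
        # Keep the rest of the app usable and tell the customer what is going on.
        screen["notice"] = "Payments are temporarily unavailable. Please try again later."
    return screen


def fake_balance(customer_id: str) -> float:
    """Stub balance lookup used only for this illustration."""
    return 1250.0


def payees_during_outage(customer_id: str) -> list:
    """Stub payments lookup that simulates an outage."""
    raise PaymentsUnavailable("payments backend is down")


if __name__ == "__main__":
    print(build_home_screen("customer-1", fake_balance, payees_during_outage))
```

Only the payments failure is caught here. Whether a balance failure should also degrade, and how, is exactly the kind of question a GameDay tends to surface.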
We often get asked how a GameDay differs from a disaster recovery (DR) exercise. DR exercises are necessary for many compliance-type activities or for confirming that a project requirement has been completed. But they are not sufficient to find and address the types of problems we are looking for - the unknown unknowns.
Disaster recovery exercises are narrowly focussed on a project-related activity, such as a database failover and restore, or a multi data-centre networking switch. We absolutely need them, but they are not designed for discovery. Conversely, GameDay activities are customer focussed and are designed to find obscure and hard-to-detect problems and, crucially, process problems where interplay and collaboration are required between multiple teams and systems.
This table summarises some of the key differences between DR and GameDay exercises:
| |DR exercise|GameDay|
|---|---|---|
|Approach|Run sheet + requirements|Loose plan + a little chaos|
|Who|Operations|Cross functional, multi-disciplinary team|
|Assumption|System is built to a robust design|System is hazardous|
Now that you have a good understanding of the challenges modern engineering teams face, and how GameDays and Chaos Engineering can help build resilience into your products, it’s time to dive a bit deeper into how you can run one of these for yourself!
Interested to learn more? Here is an extensive list of our references.
About the Author: Matt Fellows
Matt is a self-described polyglot who enjoys working at the intersection between computers, humans and software engineers - ideally fully caffeinated - and has been doing so since Y2K was a thing.
Currently frustrated at the large amount of time spent not building products, he has been helping improve the automation and deployment tooling situation (sometimes called DevOps). When not absorbing the Internet via osmosis, he can be found outdoors playing basketball, wakeboarding, snowboarding and other things that reduce his ability to walk on a Monday.
Want to follow Matt’s online adventures? Find him on Twitter @mattfellows