An experimental mindset
In this part, we investigate some of the tools and techniques that should be in your kit bag as a Chaos Engineer (yes, this section will be technical), using a fictional music streaming service to demonstrate the concepts. We begin by discussing resilience patterns, in particular microservices and circuit breakers, and then turn to designing and running experiments on our systems to identify weaknesses and improve resilience.
The code for the experiments in this resource can be found on our GitHub page.
Patterns of Resilience
In 1985, Jim Gray advised us that “A way to improve availability is to install proven hardware and software, and then leave it alone”. But as we’ve discovered, things are much more complicated than they used to be, and this “resilience strategy” is simply not effective. In fact, depriving systems of change is the very opposite of what an antifragile system wants!
Thankfully, as our systems have grown and evolved, we have developed myriad strategies to improve our ability to be resilient and allow flexibility for our different needs.
Uwe Friedrichsen has created some excellent maps that we can use to help apply many of these patterns in particular situations. Whether used offensively to reduce the blast radius of a problem, or defensively to prevent cascading failures, these patterns show that we are maturing as an industry in this field. Here are two such maps to give you an idea of the breadth of options available:
Figure 1: Resilience strategies grouped by approach https://www.slideshare.net/ufried/patterns-of-resilience
Figure 2: Resilience strategies grouped by lifecycle https://www.slideshare.net/ufried/patterns-of-resilience
Given the vast array of patterns and strategies, we will focus on just a couple to demonstrate the usefulness of chaos engineering techniques. In keeping with current industry trends, we will focus on microservices and circuit breakers (with “fail fast”).
Antifragility teaches us that almost anything monolithic or large is fragile, and when it breaks, it is bound to be catastrophic. Conversely, things that are small and distributed are antifragile – the system as a whole is able to cope with disorder in pockets of the system. Microservices have been posited as antifragile, and may take on many properties of resilient systems: isolation, loose coupling, statelessness, idempotency, event-driven communication and so on.
In short, a microservice architecture is an approach to developing an application as a set of small services, each running in its own process and communicating via lightweight mechanisms, often an HTTP resource API.
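As a minimal, self-contained illustration of this idea (the endpoint, artist names and port are our own, not part of the streaming service), a microservice can be little more than a single HTTP handler running in its own process:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// topArtists is the single, narrow capability this small service owns.
// A microservice typically wraps one such capability behind a
// lightweight HTTP resource API.
func topArtists() []string {
	return []string{"Artist A", "Artist B", "Artist C"}
}

// handler exposes the capability as a JSON resource.
func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(topArtists())
}

func main() {
	http.HandleFunc("/recommendations", handler)
	// One service, one process, one port.
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```

A second service (such as the User Service we meet shortly) would then call this over HTTP rather than via an in-process function call, which is exactly the seam where failures, and resilience patterns, live.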
However, there are pitfalls with microservices architectures. Examples include architectures that resemble distributed monoliths, “big balls of mud” or star-based architectures, where so many services communicate with each other synchronously that the system is fragile to a single weak link in the chain.
A natural complement to microservices and distributed architectures, circuit breakers exist to prevent cascading failures in systems. Cascading failures not only directly impact the customer, they consume increasing resources across the system (often while customers keep hitting “refresh”), including components of the system that are still working, further impacting (a potentially new set of) customers in a nasty, circularly dependent kind of way.
As we learned earlier, a service is only as available as the intersection of its critical dependencies. Of course, non-critical dependencies may still take down the system if not properly handled. We know that failure of a system is inevitable, so we should look to be proactive about it.
The circuit breaker pattern takes its inspiration from real physical circuit breakers and is applied between two remote systems. When a system makes a call to another system, we wrap the call in a circuit breaker, which monitors for failures in the communication (timeouts, errors etc.). Once failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error immediately, without the protected call being made at all. This frees up system resources and reduces cascading failures across a system.
Probably the best known and most sophisticated implementation is Hystrix, which takes this one step further: it allows you to specify a fallback behaviour for when the circuit has been tripped, and provides real-time visibility into the state of the system.
To illustrate these points, we are going to build a fictional recommendations API for a music streaming service, as a set of two microservices, and demonstrate how chaos engineering practices can help us build more reliable and resilient systems.
Music Streaming Recommendations Engine
Let’s assume for a moment that you are building the next disruptive music streaming service, and as part of that you provide a recommendations section on your logged-in MyAccount page, showing related artists and songs based on customers’ existing music preferences.
This service is accessed via the User Service, which supplies the extra context about the current user needed to retrieve the recommendations.
Here is a simplified view of our architecture:
Figure 3: Music Streaming Architecture
Principles of Chaos
We now have a system we want to improve, and we would like a formal approach to guide our path of destruction in order to surface fragility in our systems.
Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses.
These experiments follow four steps:
- Start by defining ‘steady state’ as some measurable output of a system that indicates normal behaviour.
- Hypothesise that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
The harder it is to disrupt the steady state, the more confidence we have in the behaviour of the system. If a weakness is uncovered, we now have a target for improvement before that behaviour manifests in the system at large.
These four steps are the core of what is called the principles of chaos.
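The four steps above can be sketched in code as a comparison of a steady-state metric between a control group and an experimental group. The metric (fraction of HTTP 200 responses) matches the one we use later; the tolerance and sample data are our own illustrative choices:

```go
package main

import "fmt"

// steadyState is our measurable output: the fraction of successful
// (HTTP 200) responses in a sample of status codes.
func steadyState(statusCodes []int) float64 {
	if len(statusCodes) == 0 {
		return 0
	}
	ok := 0
	for _, c := range statusCodes {
		if c == 200 {
			ok++
		}
	}
	return float64(ok) / float64(len(statusCodes))
}

// hypothesisHolds tries to DISPROVE the hypothesis that steady state
// is unchanged: it compares the control and experimental groups and
// reports false if they differ by more than a tolerance.
func hypothesisHolds(control, experimental []int, tolerance float64) bool {
	diff := steadyState(control) - steadyState(experimental)
	if diff < 0 {
		diff = -diff
	}
	return diff <= tolerance
}

func main() {
	control := []int{200, 200, 200, 200}
	// The introduced variable (a failing downstream) shows up as 503s.
	experimental := []int{200, 200, 503, 503}
	fmt.Println(hypothesisHolds(control, experimental, 0.01))
}
```

The harder it is to make this comparison report a difference, the more confidence we have in the system; when it does report one, we have found a target for improvement.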
Metrics, Experiments and Hypotheses
Using our newfound framework for experiments, we must first identify our steady-state metric before creating a number of experiments and hypotheses to test. To keep things simple, we will run these as local experiments in a set of Docker containers.
For this service, a simple metric for determining the overall health of the system is the percentage of 200 responses from the User Service; specifically, we want 100%. A secondary metric of interest, however, is how long requests take to complete, as this can be a canary signal that our system is about to fail.
To drive load into the system and provide simple output metrics, we will use Vegeta as it is simple to use, has no dependencies and is cross-platform. We might usually use a tool like Gatling for this purpose, however setting it up is a bit more involved and requires some coding (see this article for a similar example that tests nginx failover using Gatling and Muxy).
To introduce failures into our system, we will use Muxy. Similar to Vegeta, it is dead simple to use, declarative, cross-platform and has no dependencies.
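To give a feel for Muxy’s declarative style, here is an illustrative configuration sketch. The proxy settings, middleware names and keys are recalled from Muxy’s documentation and may not match the current schema exactly, and the host names are our own; consult the Muxy README for the authoritative format:

```yaml
# Illustrative Muxy config (schema recalled from memory, may differ):
# sit between the User Service and the recommendations service,
# then inject latency and failures.
proxy:
  - name: http_proxy
    config:
      host: 0.0.0.0
      port: 8080
      proxy_host: recommendations
      proxy_port: 80

middleware:
  # Delay requests to simulate network latency
  - name: delay
    config:
      request_delay: 1000

  # Rewrite responses to simulate a failing downstream service
  - name: http_tamperer
    config:
      response:
        status: 503
```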
Here is a summary of the tools we are going to use and what they do:
| Tool | Purpose | Description |
| --- | --- | --- |
| Docker | Create local environments | Lightweight virtualisation technology. Allows creating programmable, repeatable and immutable environments. |
| Muxy | Inject chaos into system | Declarative tool that enables injection of common faults into systems, such as network latency, scrambled messages and probabilistic failure. |
| Vegeta | Load and test driver | Cross-platform load testing tool. Drills at a consistent rate, and reports on the key metrics we’re interested in (success/failure rate, durations etc.). |
| StatsD, Graphite and Grafana (SGG stack) | Monitoring and metrics | A suite of tools often used together to provide near real-time visibility of runtime application and system metrics. |
Table 1: Tools for our chaos experiments
GitHub repository and Video demonstration
All of the resources for these experiments can be found at https://github.com/DiUS/gameday-resources so you can follow along.
If you’d rather see the experiments run, you can check out this demonstration video.
Experiment 1: No circuit breaker
Time to get our hands dirty, so crack open your terminal! First, let’s pull down the following repository:
git clone https://github.com/DiUS/gameday-resources.git
cd gameday-resources
git checkout experiment1
./run-chaos.sh
Figure 4: Experiment 1
The first time you run this it may take some time while all of the containers are pulled down and created.
Our experiment is a failure, as demonstrated by the following output:
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | FAIL
test_1   | exit status 1
test_1   | FAIL    _/app    5.965s
gamedayresources_test_1 exited with code 1
....
vegeta_1 | Requests      [total, rate]            750, 50.07
vegeta_1 | Duration      [total, attack, wait]    19.981740988s, 14.979999791s, 1.741197ms
vegeta_1 | Latencies     [mean, 50, 95, 99, max]  3.595773ms, 1.665649ms, 6.091287ms, 46.085067ms, 183.416853ms
vegeta_1 | Bytes In      [total, mean]            323, 0.43
vegeta_1 | Bytes Out     [total, mean]            0, 0.00
vegeta_1 | Success       [ratio]                  95.60%
vegeta_1 | Status Codes  [code:count]             200:717  503:33
vegeta_1 | Error Set:
vegeta_1 | 503 Service Unavailable
As you can see, we only have a 95.60% success rate, and the tests take ~20s to run, which is 5s more than intended. This is because our User Service has no protection against a faulty downstream dependency – we’ve just identified that our SLAs are at risk.
Press Ctrl-C when you’re ready for the next step.
Experiment 2: Circuit breaker
It’s time to apply a basic circuit breaker to the system; it introduces latency and fault tolerance, and the ability to “fail fast”. This is what our updated experiment looks like:
Figure 5: Experiment 2
git checkout experiment2
./run-chaos.sh
Unfortunately, this test still fails. However, we have reduced the test duration to approximately 15s, as we now fail fast when the circuit is open:
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | muxy_test.go:50: Expected 200 response code, but got 503
test_1   | FAIL
test_1   | exit status 1
test_1   | FAIL    _/app    3.965s
gamedayresources_test_1 exited with code 1
....
vegeta_1 | Requests      [total, rate]            750, 50.07
vegeta_1 | Duration      [total, attack, wait]    14.981740988s, 14.979999791s, 1.741197ms
vegeta_1 | Latencies     [mean, 50, 95, 99, max]  3.595773ms, 1.665649ms, 6.091287ms, 46.085067ms, 183.416853ms
vegeta_1 | Bytes In      [total, mean]            323, 0.43
vegeta_1 | Bytes Out     [total, mean]            0, 0.00
vegeta_1 | Success       [ratio]                  95.60%
vegeta_1 | Status Codes  [code:count]             200:717  503:33
vegeta_1 | Error Set:
vegeta_1 | 503 Service Unavailable
gamedayresources_vegeta_1 exited with code 0
Experiment 3: Circuit breaker with fallback function
Finally, we are going to add a fallback function to our circuit breaker. This means that when we detect the recommendations service is not functioning correctly, we will still return a pre-canned response. This strategy isn’t appropriate for all APIs; however, in the case of a recommendations service we can fall back to graceful behaviour, subtly returning hard-coded recommendations of the past month’s most popular artists.
Figure 6: Experiment 3
Run the following:
git checkout experiment3
./run-chaos.sh
You should see something like:
test_1   | Response:
test_1   | Call from backup function
test_1   | Call from backup function
test_1   | --- PASS: Test_Example100calls (4.69s)
test_1   | PASS
test_1   | ok      _/app    4.703s
gamedayresources_test_1 exited with code 0
...
vegeta_1 | Requests      [total, rate]            750, 50.07
vegeta_1 | Duration      [total, attack, wait]    14.981356728s, 14.979999806s, 1.356922ms
vegeta_1 | Latencies     [mean, 50, 95, 99, max]  4.790972ms, 1.590297ms, 6.194134ms, 118.380217ms, 183.775359ms
vegeta_1 | Bytes In      [total, mean]            18288, 24.38
vegeta_1 | Bytes Out     [total, mean]            0, 0.00
vegeta_1 | Success       [ratio]                  100.00%
vegeta_1 | Status Codes  [code:count]             200:750
vegeta_1 | Error Set:
gamedayresources_vegeta_1 exited with code 0
Finally, our tests pass! Our User Service API will always respond with a 200, even if the recommendations service is down.
The experiments here are fairly simplistic, yet demonstrate a powerful approach that we can apply to much more complicated systems.
In summary we:
- Created simple hypotheses about our systems, using customer-focussed steady-state metrics to determine success or failure
- Set up repeatable experiments to test the hypotheses using simple, free and open-source tools – Docker, Muxy, Vegeta and StatsD
- Demonstrated how circuit breakers are useful in microservice-based architectures
- Used metrics and dashboards to build intuitions about how our system is behaving and the steps we could take to improve it
- Methodically applied the principles of chaos, each time resulting in a more resilient system
GameDay Technical Facilitation
Now that we have a solid grasp on the principles of chaos and how to apply them at a small scale, it is time to consider how we might apply them as a team across an entire ecosystem.
In a GameDay exercise, we often have multiple teams, systems and processes and so facilitation becomes one of the key ingredients to success. Taking on the role of “Technical Facilitator” comes with its own unique set of challenges.
You need to help guide the team to a set of testable hypotheses and runnable experiments so that GameDay can be a success. There are a number of aspects to this:
- Measurement – a strong understanding of what is important to the customer and how we will measure it
- Design – how might we create an experiment without suffering from the measurement problem?
- Communication – possess strong visual communication skills (e.g. sequence diagrams, system architecture) to bring shared understanding to the team
- Preparation – what environments, dashboards, people and systems need to be available, set up and configured in order to make the day a success?
One common barrier to effective technical facilitation is having intimate knowledge of the systems and their design (or implementation!) in the first place – a common bias is assuming that the system behaves as intended.
Your job on the day is arguably the most important. Once the exercise starts, the team will be looking to you to make numerous and frequent decisions about how long to persist down a certain path, how to interpret results and what to do next. Keeping the following in mind will help you keep a clear head as the day unfolds:
- Preparation – having the agenda, run sheet and system context/architecture on the whiteboard and ensuring all components are “green” before commencement of any experiment
- Leadership – depending on the size of the exercise, you may need to keep yourself free and delegate so you can keep the exercise on track, make decisions and pivot ideas
- Dashboards – ensure everyone has access to salient information, be it visual dashboards, log files etc. Make it visible to the team to improve shared understanding and know when to draw the whole team together to make a point
- Documentation – create visual spaces (e.g. whiteboards) to capture findings, highlight current experiment etc. Ensure any information from the day is stored in a central location such as a wiki.
Most important is adaptability and the ability to uncover the “unknown unknowns”. Whilst you have created a plan, you must be prepared to deviate from it when you sense you might uncover something never seen before. Discovering an unknown unknown is the Holy Grail of GameDay.
Now that you have run your GameDay, your job is to follow up:
- Document – capture all information and findings and make it available to the broader team to improve organisational learning
- Learn – what did we learn, and how can we feed these learnings back into the platform (people, process, fixes etc.)?
- Culture – are we seeing recurring patterns in design or issues that we need to address? How might we encourage all teams to test for resilience earlier in their lifecycle?
- Automation – ask yourself if any of the experiments could be automated using tools like Muxy or Chaos Monkey to “bake in” resilience at various steps of the software development lifecycle.
As you can see, GameDays are powerful team exercises to uncover all manner of technical, people, process and even cultural issues.
Starting with small experiments during development cycles is a great way to get your feet wet in Chaos Engineering and socialise the benefits with a broader audience. From there, you can progress from small GameDays to bigger ones, and then to automated infrastructure testing.