Docker is a topic that needs no introduction. Although the concept of containerisation is nearly as old as computing itself, the specific implementation of it known as Docker has gone from zero to near-ubiquity in less than five years. Never content to let the world pass them by, AWS was quick to launch their container management service ECS (EC2 Container Service) in late 2014, with the companion service ECR (EC2 Container Registry) following in 2015.
ECS is far from being the only container management service in existence. In fact, the space is already crowded, with more services being launched at a dizzying rate. However, while many offerings concentrate on the UI, or simplicity, or low barrier to entry, Amazon’s focus is different. ECS is utilitarian—but solid and reliable. It focuses on the fundamentals of deploying and running containers. Being from AWS, it has the complete CLI and API support we are accustomed to.
If this were a typical ‘Introduction to ECS’ blog post, I would now be demonstrating how to stand up a simple container in ECS, and I’d probably conclude with a paragraph about how easy it is. But in the real world, our containers are not perfect spheres hanging in space. They need infrastructure to do useful things: load balancers, databases, queues, DNS records and everything else that goes into a working system.
CloudFormation is the primary offering from AWS for orchestrating infrastructure. But how can we make the somewhat rigid CloudFormation approach play nicely with the more dynamic Docker world?
The techniques described in this article were developed in conjunction with our client irexchange, an innovative B2B startup based in Melbourne. irexchange is developing a real-time ordering and fulfilment platform hosted in AWS. The platform is being developed in-house using agile software development techniques, with a high emphasis on automation, monitoring and alerting.
The Scenario
To make the discussion more concrete, let’s imagine we’re working with a dev team who are deploying a set of Docker-based microservices. The MVP is ready, and we’re going to deploy out a set of three containers and make them available—our three example microservices are product, customer and order.
Inside ECS
In this section we are going to examine the building blocks of running services on ECS and see how they can be orchestrated with CloudFormation.
Cluster
The first thing we need is somewhere to deploy our microservices. This is our ECS cluster that will host our Docker containers. One cluster can host many containers.
This is the minimum set of infrastructure for an ECS cluster. It is simply an auto-scaling group with associated launch configuration that boots the Amazon ECS-Optimised AMI and causes the ECS agent running on the instances to register with the designated ECS cluster. To grant the ECS agent permission to register, there is an Instance Profile and IAM Role that uses the AWS-managed ‘AmazonEC2ContainerServiceforEC2Role’.
Once registered with a cluster, the EC2 instance is known as a Container Instance and is ready to host Docker containers.
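As a rough illustration of the cluster building blocks described above, here is a minimal sketch that assembles the CloudFormation resources as a Python dictionary and serialises it to JSON. The parameter names (EcsAmiId, SubnetIds), the instance type and the group sizes are placeholders chosen for this example, and a security group is omitted for brevity, so treat it as a starting point rather than a complete template.

```python
import json


def cluster_template():
    """Return a minimal CloudFormation template for an ECS cluster as a dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Hypothetical parameter names; supply values for your own account.
            "EcsAmiId": {"Type": "AWS::EC2::Image::Id"},
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Resources": {
            "Cluster": {"Type": "AWS::ECS::Cluster"},
            # Role that lets the ECS agent on each instance register with ECS.
            "InstanceRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"Service": "ec2.amazonaws.com"},
                            "Action": "sts:AssumeRole",
                        }],
                    },
                    "ManagedPolicyArns": [
                        "arn:aws:iam::aws:policy/service-role/"
                        "AmazonEC2ContainerServiceforEC2Role"
                    ],
                },
            },
            "InstanceProfile": {
                "Type": "AWS::IAM::InstanceProfile",
                "Properties": {"Roles": [{"Ref": "InstanceRole"}]},
            },
            "LaunchConfiguration": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": {"Ref": "EcsAmiId"},
                    "InstanceType": "t2.medium",  # example value only
                    "IamInstanceProfile": {"Ref": "InstanceProfile"},
                    # Tell the ECS agent which cluster to register with.
                    "UserData": {"Fn::Base64": {"Fn::Sub": (
                        "#!/bin/bash\n"
                        "echo ECS_CLUSTER=${Cluster} >> /etc/ecs/ecs.config\n"
                    )}},
                },
            },
            "AutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "LaunchConfigurationName": {"Ref": "LaunchConfiguration"},
                    "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                    "MinSize": "1",  # example sizes only
                    "MaxSize": "2",
                },
            },
        },
        "Outputs": {"ClusterName": {"Value": {"Ref": "Cluster"}}},
    }


template_json = json.dumps(cluster_template(), indent=2)
```

The resulting JSON string can be saved to a file and passed to CloudFormation in the usual way.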
Services
The fundamental unit of orchestration inside ECS is the Task.
A task defines:
- One or more Docker containers to run
- The container resources (memory, CPU, volumes)
- The container environment.
A task may optionally have a TaskRole, which is similar to an instance profile for an EC2 instance. It allows the container to transparently access AWS resources with temporary credentials. This ability for a single ECS cluster to run multiple containers with differing IAM policies is very powerful.
With the task defining what is to be run, the Service defines how to run it. It defines how many instances of the task should be run at once, and the behaviour when upgrading the service to a new version. The service also refers to the AWS-managed ‘AmazonEC2ContainerServiceRole’, which is used to allow services to register with a load balancer.
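To make the task and service relationship more tangible, the following sketch builds the relevant CloudFormation resources as a Python dictionary. The function name, the CPU and memory figures, the SERVICE_NAME environment variable and the ClusterName and TargetGroupArn parameters it refers to are all assumptions made for this example.

```python
def assume_role_policy(service_principal):
    """Trust policy letting the given AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }


def service_resources(name, image, container_port):
    """Return the roles, task definition and service for one microservice."""
    return {
        # Assumed by the containers themselves; attach policies as needed.
        "TaskRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument":
                    assume_role_policy("ecs-tasks.amazonaws.com"),
            },
        },
        # Assumed by ECS so it can register tasks with the load balancer.
        "ServiceRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument":
                    assume_role_policy("ecs.amazonaws.com"),
                "ManagedPolicyArns": [
                    "arn:aws:iam::aws:policy/service-role/"
                    "AmazonEC2ContainerServiceRole"
                ],
            },
        },
        # The task: what to run, with which resources and environment.
        "TaskDefinition": {
            "Type": "AWS::ECS::TaskDefinition",
            "Properties": {
                "Family": name,
                "TaskRoleArn": {"Fn::GetAtt": ["TaskRole", "Arn"]},
                "ContainerDefinitions": [{
                    "Name": name,
                    "Image": image,
                    "Cpu": 128,      # example values only
                    "Memory": 256,
                    "Environment": [{"Name": "SERVICE_NAME", "Value": name}],
                    # HostPort 0 asks for a dynamically allocated host port.
                    "PortMappings": [
                        {"ContainerPort": container_port, "HostPort": 0}
                    ],
                }],
            },
        },
        # The service: how many copies to run and how to roll out updates.
        "Service": {
            "Type": "AWS::ECS::Service",
            "Properties": {
                "Cluster": {"Ref": "ClusterName"},  # hypothetical parameter
                "TaskDefinition": {"Ref": "TaskDefinition"},
                "DesiredCount": 2,
                "DeploymentConfiguration": {
                    "MinimumHealthyPercent": 50,
                    "MaximumPercent": 200,
                },
                "Role": {"Ref": "ServiceRole"},
                "LoadBalancers": [{
                    "ContainerName": name,
                    "ContainerPort": container_port,
                    # hypothetical parameter
                    "TargetGroupArn": {"Ref": "TargetGroupArn"},
                }],
            },
        },
    }
```

Calling service_resources("customer", "mycontainer:1", 8080) yields a Resources section that can be dropped into a template which declares the two parameters.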
Load Balancing
In August 2016, AWS announced the Application Load Balancer, or ALB. These are quite different to the original Elastic Load Balancers (which AWS now call Classic Load Balancers). They make ECS much more useful because they support dynamic port allocation. This is where the container instance allocates a port when it starts the Docker container and then communicates this back to the ALB.
The other major improvement the ALB offers is support for what AWS calls content-based routing. This means we can examine the incoming URL and route the request to different target groups. In the context of ECS, there is a one-to-one relationship between a target group and a service.
This flexibility means we have a decision to make on how to expose our services. We could put an ALB in front of each of our three example services, but this would increase costs and make it more difficult to add new services later. Instead we can use a single ALB, and choose between port-based routing and path-based routing.
Port-based Routing
In this model, we would allocate a port for each of our services, for example product on port 8000, customer on port 8001 and order on port 8002. In CloudFormation it looks like this:
(Note: the arrows indicate relationships, not necessarily the flow of traffic)
This approach has the advantage of simplicity, and it is conceptually similar to how a group of containers is spun up together when using Docker Compose. The disadvantage is that it lacks discoverability: which services run on which ports is something you just have to know.
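The port-based model could be sketched along the following lines, with one listener and one target group per service on a shared ALB. The resource naming scheme and the LoadBalancer and VpcId references are assumptions for this example.

```python
SERVICES = {"product": 8000, "customer": 8001, "order": 8002}


def port_based_listeners(services, load_balancer_ref="LoadBalancer"):
    """Return one target group and one listener per service."""
    resources = {}
    for name, port in services.items():
        title = name.capitalize()
        resources[title + "TargetGroup"] = {
            "Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
            "Properties": {
                # Placeholder port; ECS registers each task's dynamic port.
                "Port": 80,
                "Protocol": "HTTP",
                "VpcId": {"Ref": "VpcId"},  # hypothetical parameter
            },
        }
        resources[title + "Listener"] = {
            "Type": "AWS::ElasticLoadBalancingV2::Listener",
            "Properties": {
                "LoadBalancerArn": {"Ref": load_balancer_ref},
                "Port": port,
                "Protocol": "HTTP",
                # Everything arriving on this port goes to this service.
                "DefaultActions": [{
                    "Type": "forward",
                    "TargetGroupArn": {"Ref": title + "TargetGroup"},
                }],
            },
        }
    return resources
```

Adding a fourth service means adding an entry to SERVICES, and opening one more port on the load balancer's security group.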
Path-based routing
Here we are going to use the capabilities of the ALB to route traffic on a single port based on the URL. We will continue to use port 8000, but we will route URLs starting with /products to the product service and so forth. In CloudFormation, it looks like this:
Path-based routing allows us to build a set of namespaces on a single address and port, for example: api.mycompany.com/products, api.mycompany.com/customers. It is also discoverable as there is a clear relationship between the path, the rule and the container.
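A minimal sketch of the path-based model follows: a single listener, with one listener rule per service that forwards matching paths to that service's target group. The naming scheme, the Listener reference and the way priorities are assigned are assumptions for this example.

```python
PATHS = {"product": "/products", "customer": "/customers", "order": "/orders"}


def path_based_rules(paths, listener_ref="Listener"):
    """Return one listener rule per service, keyed by resource name."""
    resources = {}
    # Rule priorities must be unique per listener; sort for a stable order.
    for priority, (name, prefix) in enumerate(sorted(paths.items()), start=1):
        title = name.capitalize()
        resources[title + "Rule"] = {
            "Type": "AWS::ElasticLoadBalancingV2::ListenerRule",
            "Properties": {
                "ListenerArn": {"Ref": listener_ref},
                "Priority": priority,
                "Conditions": [{
                    "Field": "path-pattern",
                    "Values": [prefix + "*"],
                }],
                "Actions": [{
                    "Type": "forward",
                    # Assumes a target group named after the service.
                    "TargetGroupArn": {"Ref": title + "TargetGroup"},
                }],
            },
        }
    return resources
```

Requests that match no rule fall through to the listener's default action, so it is worth pointing that at something sensible.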
The biggest disadvantage is that the ALB does not strip paths when evaluating rules. That means if a user requests api.mycompany.com/customers/customer/1, the container will receive a request for /customers/customer/1 and not, as you might expect, /customer/1.
This means that some frameworks are more suitable for this approach than others. Spring Boot, for example, can be very easily configured to deal with the above scenario by setting the configuration property server.contextPath to /customers. In Ruby on Rails, a similar effect can be achieved by setting relative_url_root in environments/production.rb.
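Conceptually, a context path setting means the application removes the prefix itself before matching its own routes. The helper below is a hypothetical illustration of that idea only; it is not how Spring Boot or Rails implement it.

```python
def local_path(request_path, context_path):
    """Return the path relative to context_path, or None if it is outside it."""
    if request_path == context_path:
        return "/"
    if request_path.startswith(context_path + "/"):
        return request_path[len(context_path):]
    return None


print(local_path("/customers/customer/1", "/customers"))  # /customer/1
print(local_path("/orders/order/7", "/customers"))        # None
```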
On Apr 5, 2017 AWS announced support for host-based routing. Conceptually this works the same as path-based routing, i.e. as a ListenerRule attached to a Listener. This feature would allow the services to be routed via hostname, e.g. product.api.mycompany.com.
Pulling it together
Having laid out the components of our cluster, we can use CloudFormation Nested Stacks to pull it all together. This is what it looks like:
- A master template to contain all the resources and nested stacks
- One ECS Cluster nested stack containing everything to stand up an empty ECS cluster
- An ECS Service nested stack per service we wish to run
- The components of the ALB are split between the two nested stacks. A single ALB and listener are created in the ECS Cluster stack, and a listener rule and target group are created in the ECS Service stack
- Any other infrastructure we might need such as DNS Records, RDS Instance etc can go into the master template, or into other nested stacks depending on the need for reuse
The following diagram shows how the outputs of one nested stack are fed into the other as parameters to form the relationships between the services. The ECS Service stack is repeated for each service to be run.
The complexity of the ECS Service template will be relative to the homogeneity of the containers it is supporting. If we can make assumptions about the containers, such as how long they take to start, the location of the health endpoint, what protocol they speak and so on, the template can be generic. Otherwise additional parameters may need to be added to the template to cover more use cases.
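The wiring between the stacks might look something like the sketch below, where the master template creates one cluster stack and then one service stack per microservice, passing the cluster stack's outputs in as parameters. The template file names, the parameter names and the output names (ClusterName, ListenerArn) are assumptions for this example and are not necessarily those used in the DiUS example templates.

```python
def master_template(services, template_base_url):
    """Return a master template with one cluster stack and N service stacks."""
    resources = {
        "ClusterStack": {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                # Hypothetical file name under an S3 URL you control.
                "TemplateURL": template_base_url + "/ecs-cluster.json",
            },
        },
    }
    for priority, (name, path) in enumerate(sorted(services.items()), start=1):
        resources[name.capitalize() + "ServiceStack"] = {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                "TemplateURL": template_base_url + "/ecs-service.json",
                "Parameters": {
                    "ServiceName": name,
                    "PathPattern": path + "*",
                    # Nested stack parameter values are passed as strings.
                    "RulePriority": str(priority),
                    # Outputs of the cluster stack become inputs here.
                    "ClusterName": {
                        "Fn::GetAtt": ["ClusterStack", "Outputs.ClusterName"]
                    },
                    "ListenerArn": {
                        "Fn::GetAtt": ["ClusterStack", "Outputs.ListenerArn"]
                    },
                },
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
```

Because each service stack refers to outputs of the cluster stack, CloudFormation works out the creation order without any explicit dependency declarations.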
As a starting point, a set of example templates is available on the DiUS GitHub account. These example templates implement the pattern described above with path-based routing and can be further customised for your needs.
Sharp edges
Now that we have ECS and CloudFormation working in harmony, is it all sunshine and rainbows? Sadly not. There are a few rough patches and limitations right now with these services. They are discussed briefly below, and should the situation change, this post will be amended.
Grace Periods
The classic combination of an ELB and an autoscaling group supports setting a grace period. This gives the instances the specified time to come online prior to the health of the instance being evaluated, so instances aren’t killed in the middle of booting.
Right now this functionality does not exist for the combination of ALB and ECS. The implication is that the container has an amount of time equal to HealthCheckIntervalSeconds * UnhealthyThresholdCount to start responding to health checks, otherwise ECS will kill the task and replace it with another one (which will almost certainly suffer the same fate).
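As a quick worked example of that arithmetic, the helpers below compute the window a new task has to become healthy, and the threshold needed to cover a given startup time. The function names and the numbers are illustrative assumptions, not AWS defaults.

```python
import math


def startup_window_seconds(interval_seconds, unhealthy_threshold):
    """Time a new task has before failed health checks get it replaced."""
    return interval_seconds * unhealthy_threshold


def threshold_for_startup(startup_seconds, interval_seconds):
    """Smallest UnhealthyThresholdCount whose window covers the startup time."""
    return math.ceil(startup_seconds / interval_seconds)


print(startup_window_seconds(30, 5))   # 150
print(threshold_for_startup(300, 30))  # 10
```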
For containers that come online very quickly this is not a big issue. But if the container has to do substantial work on startup (e.g. load state from an external data store), then there are a few less than perfect options.
- The best option, where it is possible to change the code in the container, is to ensure that the application responds to health checks early in the startup sequence and does any long-running operations in a background thread. If the application later encounters a fatal error, the application can terminate. Unlike applications running on EC2, ECS knows immediately when a task has terminated and does not need to wait for the failed health check count to reach the threshold.
- The simple option is to just increase the container health check interval or threshold to give the container more time to start. The downside is that if the container requires 5 minutes to start, then should the container become unhealthy (but not terminate) during normal operation it will also take 5 minutes to detect this state.
- The complex option is to run two containers in each task; the first container is the application and the second container monitors the first for health and reports it to the ALB, with appropriate logic to fake a grace period. The downside of this approach is that upon seeing a healthy status, the ALB will start sending traffic to the application, which will not be able to handle it properly during startup.
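The first option above could be sketched roughly as follows, using only the Python standard library. The /health path, the choice to answer other requests with a 503 until initialisation finishes, and the class and function names are all assumptions made for this example.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class App:
    """Application whose slow initialisation runs in a background thread."""

    def __init__(self, load_state):
        self._load_state = load_state  # callable doing the slow startup work
        self.ready = threading.Event()

    def start(self):
        worker = threading.Thread(target=self._initialise, daemon=True)
        worker.start()
        return worker

    def _initialise(self):
        try:
            self._load_state()
        except Exception:
            # Fatal startup error: exit so ECS sees the task stop at once.
            os._exit(1)
        self.ready.set()


def respond(app, path):
    """Return (status, body) for a GET request."""
    if path == "/health":
        return 200, "ok"  # answered even while still starting up
    if not app.ready.is_set():
        return 503, "starting"
    return 200, "hello"


def serve(app, port):
    """Start background initialisation, then serve requests forever."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = respond(app, self.path)
            self.send_response(status)
            self.end_headers()
            self.wfile.write(body.encode())

    app.start()
    HTTPServer(("", port), Handler).serve_forever()
```

The server starts listening straight away, so the ALB health check passes while load_state is still running, and a failure in load_state terminates the process rather than leaving a half-started container behind.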
Task Placement Strategies and Constraints
ECS supports task placement strategies and task placement constraints. These are used to control how tasks are placed on the container instances, for example to maximise cluster density or ensure a task lands on a specific type of instance.
Sadly neither of these can currently be expressed with CloudFormation. And because they are set when the service is created or the task is run, they can’t even be changed manually after CloudFormation creates them.
Until AWS adds support for these to CloudFormation, task placement can still be influenced by the use of CPU and Memory Reservation. By setting these appropriately, CPU or memory-hungry tasks can be placed separately to other tasks.
On Apr 28, 2017 AWS announced support for defining placement strategies and constraints in CloudFormation.
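With that support in place, a placement strategy and constraint can be added to the properties of an AWS::ECS::Service resource. The sketch below shows one possible combination; the specific strategy and the instance-type expression are example choices rather than recommendations.

```python
def with_placement(service_properties):
    """Return a copy of ECS service properties with placement settings added."""
    props = dict(service_properties)
    # Pack tasks onto as few instances as possible, based on memory.
    props["PlacementStrategies"] = [{"Type": "binpack", "Field": "memory"}]
    # Only place tasks on instances whose type matches the expression.
    props["PlacementConstraints"] = [{
        "Type": "memberOf",
        "Expression": "attribute:ecs.instance-type =~ t2.*",
    }]
    return props
```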
One Load Balancer per Service
Despite the relevant parameter for an ECS Service being named in the plural, LoadBalancers, only one load balancer can be associated with a service. Amazon’s guidance is as follows:
Currently, Amazon ECS services can only specify a single load balancer or target group. If your service requires access to multiple load balanced ports (for example, port 80 and port 443 for an HTTP/HTTPS service), you must use a Classic Load Balancer with multiple listeners. To use an Application Load Balancer, separate the single HTTP/HTTPS service into two services, where each handles requests for different ports. Then, each service could use a different target group behind a single Application Load Balancer.
Managing State
CloudFormation and ECS have some functional overlap in that they both manage state—they both will try to ensure services are healthy. Most of the time they work well together, but it is possible to encounter unexpected failure modes. Consider the following scenario in a rapid development environment:
- There is an ECS cluster. It is hosting a stable service which runs mycontainer:1.
- The CloudFormation template is updated to use mycontainer:2.
- While leaving mycontainer:1 running, ECS starts a new task with mycontainer:2.
- Unfortunately, this version of the container has a bug and terminates with an error, but not before updating the database schema to the latest version.
- There is actually no problem at this point. Our original instance of mycontainer:1 is still healthy and serving requests. The best course of action is to leave everything as it is and retry the update with a fixed mycontainer:3. But …
- ECS reports back to CloudFormation that with mycontainer:2, the service failed to stabilise.
- CloudFormation will attempt to roll back the update by asking ECS to create another instance of mycontainer:1.
- This new instance of mycontainer:1 notices that the database schema has been updated to a later version and refuses to start.
- ECS reports the failure of the new task. CloudFormation is unable to roll forward or backwards and therefore gives up, marking the stack as UPDATE_ROLLBACK_FAILED, which means no further updates can be applied.
- Recovering the stack at this point would probably involve manually fixing up the database and retrying the rollback.
There are a couple of approaches that can be used to avoid this scenario.
- Rollback can simply be disabled for the CloudFormation stack. However, this has to be set at the stack level, not just the nested stack. Outside of this scenario, rollback is usually desirable.
- Another approach is to create the ECS task and service with CloudFormation, but update it to new versions with the CLI or a third-party tool like ecs-deploy. This approach works well with CI/CD tooling, but care must be taken with CloudFormation updates that recreate the task; it will be recreated at the version given in the CloudFormation template, not the version currently running.
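For the second approach, the update performed outside CloudFormation boils down to registering a new task definition revision and pointing the service at it. The sketch below assumes it is handed a boto3 ECS client and that the task has a single container; the function name is hypothetical, and only a few task definition fields are carried over for brevity.

```python
import copy


def deploy_image(ecs, cluster, service, image):
    """Register a new task definition revision using image and deploy it.

    ecs is expected to be a boto3 ECS client, e.g. boto3.client("ecs").
    """
    current = ecs.describe_services(
        cluster=cluster, services=[service])["services"][0]
    task_def = ecs.describe_task_definition(
        taskDefinition=current["taskDefinition"])["taskDefinition"]

    # Copy the container definitions and swap in the new image.
    containers = copy.deepcopy(task_def["containerDefinitions"])
    containers[0]["image"] = image  # assumes a single-container task

    request = {"family": task_def["family"], "containerDefinitions": containers}
    # Carry over optional settings if the current revision has them.
    for key in ("taskRoleArn", "volumes"):
        if key in task_def:
            request[key] = task_def[key]

    new_arn = ecs.register_task_definition(
        **request)["taskDefinition"]["taskDefinitionArn"]
    ecs.update_service(cluster=cluster, service=service, taskDefinition=new_arn)
    return new_arn
```

A CI/CD job could call deploy_image(boto3.client("ecs"), "my-cluster", "customer", "mycontainer:3") after pushing the image, where the cluster and service names here are placeholders.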
Conclusion
In this post we’ve seen how CloudFormation can be used to stand up ECS-based infrastructure, how nested stacks can reduce the amount of repetition in our templates, and speed the deployment of new services. Some potential pitfalls and issues have been discussed along with mitigation strategies.
The combination of ECS and CloudFormation is a powerful one, despite its current limitations, and is an excellent basis for deploying modern microservice-based systems.