Zero Downtime Deployments With Docker Swarm

Zero-downtime deployment is an important topic when it comes to hosting highly available software. This is especially true for teams that deploy frequently, follow Agile methodologies, and apply continuous deployment practices.

The Problem

The zero-downtime deployment topic is most relevant to web applications or services that expose public interfaces to their clients. These interfaces must be available to clients at all times, even during software deployments and the rollout of new releases. For instance, the back-end API for a mobile application should always be available; otherwise, end users will start seeing communication errors on their phones, simply because the mobile application cannot reach the back end.

On the other hand, zero-downtime deployment is less important for asynchronous services such as background jobs, because these services are not connected to end users directly.

In the Docker world, deploying software or rolling out a new release is done by simply replacing the currently running container with a new container that runs the new code. Both containers should have the same interface and expose the same APIs. During this process, both containers will be down for a period of time, and any requests coming to the container will be rejected, which causes downtime for the corresponding service. The length of the downtime also depends on the method used to replace the containers: if the replacement is done manually, the downtime will be considerable. So, we need to find a way to automate and control the deployment and rollout of Docker containers.

Luckily, Docker Swarm provides a solution to this issue. The solution is simply to use Docker services instead of starting and stopping Docker containers manually. The service object can be defined and configured not only to control and guarantee zero downtime but also to control failure and rollback use cases.

In addition, with Swarm services, we can achieve zero-downtime deployments even for services that have only one container or replica. In this article, I’ll highlight the service configuration needed to achieve high availability and zero-downtime deployment with Swarm services.


The Solution

Let’s start by deploying the Nginx reverse proxy with minimal configurations in a Swarm cluster.

Below is the Docker stack that we’re going to start with. This stack uses the default values for the deployment configuration. As a result of deploying this stack, Swarm will create a new nginx service with only one container and publish it on port 8088, which is mapped to port 80 inside the Docker container:
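A minimal sketch of such a stack file could look like the following. The compose file version and the image tag are assumptions made for illustration; only the service name, the single container, and the 8088-to-80 port mapping come from the description above.

```yaml
# nginx.yml: minimal stack with default deployment settings.
# The file format version and image tag are assumed, not taken from the article.
version: "3.7"

services:
  nginx:
    image: nginx:1.25
    ports:
      - "8088:80"   # host port 8088 -> container port 80
```

Because there is no deploy section, Swarm falls back to its defaults: one replica and a stop-first update order.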

Deploying this service can be done by simply executing the following command, assuming that the stack is stored in a file called nginx.yml:

$> docker stack deploy -c nginx.yml nginx

Redeploying the Nginx service, or updating the service’s Docker image, with this configuration will cause downtime. That’s because we have only one Nginx replica: at some point during the rollout, both the old and the new container will be down, and any request coming to the service will be rejected. This behavior is a result of Swarm’s default update_config settings. By default, Swarm stops the old container first and only then starts the new one.

To improve the situation, we can update the stack file with the following new configuration:

First, it’s important to provide a health check for the stack’s services. This allows Swarm to determine whether a service is healthy, which it needs in order to achieve zero downtime. For instance, if we deploy a new Nginx image that contains syntax errors and there is no health check, Swarm will start the new Nginx container and then remove the old one. After that, the container with the new image will keep restarting because it exits with an error code, but Swarm will never roll back to the old image because the health check is missing. On the other hand, if a health check is present, Swarm will verify that the new container is healthy, and only then will it continue the update process for the other replicas.
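As a rough sketch, a health check for the nginx service might be declared as follows. The probe command and all timing values are illustrative assumptions; in particular, this assumes that curl is available inside the image and that Nginx answers on port 80.

```yaml
services:
  nginx:
    healthcheck:
      # Assumed probe: succeeds only if Nginx answers over HTTP.
      test: ["CMD", "curl", "-f", "http://localhost:80"]
      interval: 10s       # time between probes
      timeout: 5s         # a probe taking longer than this counts as failed
      retries: 3          # consecutive failures before the container is unhealthy
      start_period: 10s   # grace period while the container boots
```

If the image does not ship curl, any command that exits with a non-zero status when the server is not serving requests can take its place.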

The next important configuration section is update_config. Here we need to change the order of the update process to start-first. This makes Swarm create the new containers first and only replace the old containers after the new ones pass the health check.

It’s also important to recover from broken containers. So, we need to change the failure_action from the default, pause, to rollback. This is needed especially for single-replica services, because with pause Swarm will stop the old containers.
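The two update settings described above could be written like this. The delay value is an assumption added for illustration; order and failure_action are the settings discussed in the text.

```yaml
services:
  nginx:
    deploy:
      update_config:
        order: start-first         # start the new container before stopping the old one
        failure_action: rollback   # revert to the previous version if the update fails
        delay: 10s                 # assumed pause between update batches
```

Note that the order option requires compose file format 3.4 or later.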

The next section is the rollback_config section. Here we need to roll back all containers simultaneously. Therefore, we need to set parallelism to 0, which means applying the rollback to all affected containers at the same time. In addition, the order of the rollback should be stop-first to make sure that the broken containers are removed as soon as possible.
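A sketch of the rollback settings just described, using only the two options named in the text:

```yaml
services:
  nginx:
    deploy:
      rollback_config:
        parallelism: 0      # 0 = roll back all affected containers at once
        order: stop-first   # remove the broken containers before restoring the old ones
```

The rollback_config section requires compose file format 3.7 or later.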

The last section is the restart_policy section. In this section, we specify the maximum number of restarts in case of a failure during the update. We need to limit the number of restarts to avoid consuming a lot of resources on the host server.
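For illustration, a restart policy that caps the number of restarts might look like the following. All four values are assumptions chosen for the example rather than recommendations from the article.

```yaml
services:
  nginx:
    deploy:
      restart_policy:
        condition: any    # restart regardless of why the container stopped
        delay: 5s         # wait between restart attempts
        max_attempts: 3   # give up after three failed restarts
        window: 120s      # how long to wait before judging a restart successful
```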

The above compose configuration guarantees zero downtime deployments for applications with one replica. But what about the services where we have multiple replicas? Is that configuration enough?

Luckily, the above configuration also applies to services that have multiple replicas. However, in that case, we need to be aware of the effect of the parallelism setting in the update_config section.

The default value of parallelism is 1, which means that Swarm will update the containers one at a time. On the other hand, setting parallelism to 0 means we want to update all containers simultaneously. Sometimes there’s a need to increase the value of parallelism to speed up the process of rolling out or updating services on Swarm clusters. In all these cases, the new value shouldn’t be equal to the number of service replicas. Otherwise, the situation will be similar to setting parallelism to 0, which reintroduces the risk of downtime during the deployment.
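As an example of a faster rollout that still respects this rule, consider a hypothetical service with four replicas that is updated two containers at a time. The replica count and parallelism value are assumptions chosen for illustration.

```yaml
services:
  nginx:
    deploy:
      replicas: 4
      update_config:
        parallelism: 2   # update two containers per batch, fewer than the 4 replicas
        order: start-first
        failure_action: rollback
```

With these values, at least two containers of the old version keep serving requests while each batch is being replaced.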

Below is a complete Nginx Docker stack with the zero-downtime deployment configuration in place. This configuration can be applied to any Docker service to achieve the same goal:
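A sketch of such a stack, combining the settings discussed in the previous sections, could look like this. The file format version, image tag, health check command, and all timing values are assumptions; the order, failure_action, and rollback parallelism settings are the ones described in the text.

```yaml
# nginx.yml: single-replica Nginx service with zero-downtime update settings.
version: "3.7"   # assumed; rollback_config needs format 3.7 or later

services:
  nginx:
    image: nginx:1.25   # assumed tag
    ports:
      - "8088:80"
    healthcheck:
      # Assumed probe; requires curl inside the image.
      test: ["CMD", "curl", "-f", "http://localhost:80"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        order: start-first         # new container first, then remove the old one
        failure_action: rollback   # revert automatically on a failed update
        delay: 10s
      rollback_config:
        parallelism: 0             # roll back everything at once
        order: stop-first          # drop broken containers immediately
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3            # bound restarts to protect host resources
        window: 120s
```

The stack can be deployed or redeployed with the same docker stack deploy command shown earlier.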


Conclusion

Achieving zero-downtime deployments in Swarm is pretty straightforward and can be done by modifying the following configuration sections in the Docker stack for the corresponding service: healthcheck, update_config, rollback_config, and restart_policy.