I want to deploy new versions of an application with no downtime. It turns out to be a bit tricky. Here is one solution that sort of works.
The Problem
I am not in control of the deployment process. All I can do is monitor a URL and stop sending traffic to it if there are errors.
I want to deploy small changes often to reduce the risk associated with large deploys. This is not a distributed system with lots of small services; it is a monolith that is redeployed often.
The Solution
The solution is to have more than one server handling the load and to divide the traffic between them. The technique is called load balancing and is not new. All I have to do is set up a load balancer and configure it properly.
Two categories of load balancers
Load balancers work either on layer 4, the transport layer, or on layer 7, the application layer. The layers refer to the OSI model. I want to load balance a web application, so a layer 7 load balancer is what I need.
Using HAProxy as a layer 7 load balancer does the trick.
Installing HAProxy
The installation of HAProxy differs between systems. I installed it on Ubuntu 16.04 like this:
apt-get install software-properties-common
add-apt-repository ppa:vbernat/haproxy-1.7
apt-get update
apt-get install haproxy
I found the instructions at https://haproxy.debian.net/ and was able to install the latest version, 1.7 as of this writing.
Configure HAProxy
Installing HAProxy was the easy part. The real work was in tuning its configuration. I ended up with this configuration in /etc/haproxy/haproxy.cfg:
global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 2000
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 10000
    timeout server 10000

frontend loadbalanser
    stats enable
    stats uri /admin?stats
    bind *:80
    mode http
    default_backend gfr

backend gfr
    stats enable
    stats uri /admin?stats
    mode http
    balance roundrobin
    option forwardfor
    http-request set-header X-Forwarded-Port %[dst_port]
    option httpchk GET /service/foretag/6.0/ws?wsdl
    server gfr1 l7700744.ata.ams.se:8580 check rise 8 downinter 30000ms observe layer7 on-error mark-down
    server gfr2 l7700745.ata.ams.se:8580 check rise 8 downinter 30000ms observe layer7 on-error mark-down
The most important part is the last two lines. They specify the two servers that should handle the load.
- server - indicates that this line specifies a server
- gfr1 - a logical name for the instance
- l7700744.ata.ams.se:8580 - the host and port where the application is served
- check - indicates that this server should be health checked. The option httpchk directive defines how the check is done
- rise 8 - the number of successful health checks needed before the server is considered operational
- downinter 30000ms - the time between health checks when the server is down. In this case, 30 seconds
- observe layer7 - monitor the application response codes
- on-error mark-down - mark the server as down if an error is received
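After changing a server line like this, it can be worth letting HAProxy check the file before reloading it. The following is a minimal sketch, assuming a systemd-based Ubuntu where `systemctl reload haproxy` is available. The function name `safe_reload` and the overridable variables are only illustrative, not part of my setup.

```shell
#!/bin/sh
# Assumed defaults; override them from the environment if your setup differs.
HAPROXY_BIN=${HAPROXY_BIN:-haproxy}
RELOAD_CMD=${RELOAD_CMD:-"systemctl reload haproxy"}
CFG=${CFG:-/etc/haproxy/haproxy.cfg}

# Reload HAProxy only if the configuration file passes the syntax check.
safe_reload() {
    # "haproxy -c -f <file>" parses the file and exits non-zero on errors.
    if "$HAPROXY_BIN" -c -f "$CFG"; then
        # Left unquoted on purpose so the command string splits into words.
        $RELOAD_CMD
    else
        echo "configuration invalid, not reloading" >&2
        return 1
    fi
}
```

Calling `safe_reload` as root after an edit leaves the running instance untouched if the file has a syntax error.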
The real magic, and the tuning, was finding values for the server specification that allow a deploy while the servers are in use. I kept the servers busy with load generated by Gatling.
The health check is an HTTP call to a URL where I check whether the WSDL for a web service is available. If it isn't, the application isn't up and running.
- option httpchk - an HTTP check should be done to verify that the application is alive
- GET - the HTTP verb to use when doing the check
- /service/foretag/6.0/ws?wsdl - the URL that should respond properly
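To see roughly what the health check sees, the same URL can be probed by hand. This is a sketch, not what HAProxy runs internally: it assumes curl is installed and uses the default rule that a 2xx or 3xx response counts as a passed check. The function names `check_result` and `probe` are made up for the example.

```shell
#!/bin/sh
# Classify an HTTP status code the way a default httpchk check would:
# 2xx and 3xx count as "up", everything else as "down".
check_result() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) echo up ;;
        *) echo down ;;
    esac
}

# Probe one server by hand. Pass host:port as the first argument.
# curl prints 000 as the status code when the connection itself fails.
probe() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        "http://$1/service/foretag/6.0/ws?wsdl")
    check_result "$code"
}
```

For example, `probe l7700744.ata.ams.se:8580` run in a loop during a deploy shows when the WSDL disappears and when it comes back.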
Result
The load balancing works. When a server responds with an error, that particular server is marked as down. It comes back when the deploy is done and the expected WSDL is available again.
I still lose a few calls during deployment. With constant load, about twice the production load, I lose approximately ten calls per server when they are reinstalled. That's not good, but given that I'm not able to alter the deploy process, I guess it will have to do.
I wish I could find a setting that resends a failed call once to another server, but I can't find one that works.
The option redispatch setting seemed promising, but it didn't work well for me. When I had option redispatch and retries set, I lost more traffic than without them.
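In the configuration file, those two settings would look something like this. The retry count of 3 is only an illustrative value, not the one from my tests.

```
defaults
    # Number of connection attempts to a server before giving up (illustrative value)
    retries 3
    # Allow a retried connection to be sent to a different server
    option redispatch
```

As far as I can tell from the HAProxy 1.7 documentation, both settings deal with failed connection attempts rather than with requests that already reached the application and came back as errors. That may be part of why they did not help in this case.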
A better result
If I could change the deploy process, I would change it so that the server that is about to be redeployed is removed from the load balancer before the deploy. HAProxy is really good at reloading its configuration. A script that removes a server, reloads HAProxy's configuration, performs the deployment, adds the server again, and finally reloads the configuration would not be too hard to write. This would give me a real zero-downtime deployment, not just the short-downtime deployment I am able to achieve with this setup.
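The steps above could be sketched as a small shell script. This is an untested outline under several assumptions: GNU sed is available, the server lines look like the ones in the configuration above, and `systemctl reload haproxy` reloads the service. The function names and the deploy command are hypothetical.

```shell
#!/bin/sh
# Assumed defaults; override them from the environment if needed.
CFG=${CFG:-/etc/haproxy/haproxy.cfg}
RELOAD_CMD=${RELOAD_CMD:-"systemctl reload haproxy"}

# Comment out the "server <name> ..." line in the configuration.
remove_server() {
    sed -i "s/^\([[:space:]]*\)server $1 /\1#server $1 /" "$CFG"
}

# Restore a line that remove_server commented out.
add_server() {
    sed -i "s/^\([[:space:]]*\)#server $1 /\1server $1 /" "$CFG"
}

# rolling_deploy <server-name> <deploy-command...>
# The deploy command is a placeholder for whatever installs the new
# version on that server.
rolling_deploy() {
    name=$1; shift
    # Take the server out of rotation and make HAProxy pick up the change.
    remove_server "$name" && $RELOAD_CMD || return 1
    # Run the deployment; leave the server out of rotation if it fails.
    "$@" || return 1
    # Put the server back and reload once more.
    add_server "$name" && $RELOAD_CMD || return 1
}
```

A call could look like `rolling_deploy gfr1 ./deploy.sh l7700744.ata.ams.se`, where `deploy.sh` is a made-up name. Since the configuration already exposes an admin-level stats socket, sending `disable server gfr/gfr1` and later `enable server gfr/gfr1` to that socket might achieve the same thing without editing the file, but I have not tried that here.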
Conclusion
HAProxy works very well. It is possible to re-configure it during usage without losing traffic.
Acknowledgements
I would like to thank Malin Ekholm for proofreading.
Resources
- HAProxy - an open source load balancer
- Thomas Sundberg - author