HTTP and its underlying protocols have some nice High Availability features built in. Many of these features come for free when load balancing or scalability improvements are made. In the enterprise our scalability needs are often smaller than those of a large web site, but our high availability requirements are more stringent. HTTP can be combined with part-time TCP keepalives to work effectively in this environment.
Web vs Enterprise
A large web site can depend on its users to hit reload under some failure conditions. An enterprise system with a significant number of automated clients needs more explicit guidance. The systems I work with tend to require no single point of failure and an explicit, bounded failover time. Say we have four seconds to work with: a request/response protocol needs to be able to determine within 4s whether or not it will receive a response. If a client determines that it will not receive a response, it will repeat its idempotent request to a backup server.
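The failover behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name fetch_with_failover, the example URLs and the injectable opener parameter are hypothetical, and the timeout passed to urllib bounds each blocking socket operation rather than the request as a whole, so it approximates the 4s budget rather than enforcing it.

```python
import urllib.request


def fetch_with_failover(urls, timeout_s=4.0, opener=urllib.request.urlopen):
    """Send an idempotent GET to each server in turn until one answers.

    urls      -- primary first, then backups (hypothetical addresses)
    timeout_s -- per-operation socket timeout, standing in for the 4s budget
    opener    -- injectable for testing; defaults to urllib's urlopen
    """
    last_error = None
    for url in urls:
        try:
            with opener(url, timeout=timeout_s) as response:
                return response.read()
        except OSError as exc:
            # URLError, timeouts and connection resets are all OSError
            # subclasses; treat any of them as "no response is coming"
            # and repeat the request against the next server.
            last_error = exc
    raise RuntimeError("all servers failed") from last_error
```

A call might look like fetch_with_failover(["http://primary.example/status", "http://backup.example/status"]). Repeating the request is only safe because it is idempotent.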
The mechanism HTTP provides to detect a failed server is the request timeout. Say we have a timeout of forty seconds. If 40s passes without a response, the client knows that either:
- the server has failed, or
- the server is taking an unusually long time to respond.
If we tighten the failure detection window to only 4s, this ambiguity becomes more stark: a healthy server that is merely slow is far more likely to be written off as failed. A better approach would be to send heartbeats to the server while our request is outstanding. However, the server is not permitted to respond to HTTP requests out of order. Any heartbeat request we send behind the outstanding request cannot be replied to until that request completes, unless we drop below the HTTP protocol layer. Luckily, TCP gives us an out in this situation.
Using TCP Keepalive
The TCP Keepalive mechanism essentially sends zero bytes of traffic to the server, requiring an acknowledgement at the TCP level. This isn't enough to detect all forms of failure, but will detect anything that causes the TCP connection to terminate.
The interesting thing is that these keepalive probes don't need to add a lot of overhead. A traditional heartbeating system would be active all the time. The TCP keepalive need only be enabled while one or more requests are outstanding on the connection. It should be disabled while the connection is idle. Even when enabled, heartbeats will only be sent when:
- A request takes longer than the heartbeat time, and
- No other requests are being transmitted down the connection
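The part-time keepalive described above can be sketched as a small wrapper that counts outstanding requests and toggles SO_KEEPALIVE accordingly. This is a minimal sketch, not a definitive implementation: the class name KeepaliveWhileBusy and the timing values are assumptions, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are option names that Python exposes only on platforms that support them (Linux, for example).

```python
import socket


class KeepaliveWhileBusy:
    """Enable TCP keepalive only while requests are outstanding (hypothetical helper)."""

    def __init__(self, sock, idle_s=1, interval_s=1, probes=2):
        self._sock = sock
        self._outstanding = 0
        # Tune probe timing where the platform exposes these options.
        # On Linux a dead peer is detected after roughly
        # idle_s + interval_s * probes seconds (about 3s with these
        # assumed values, inside a 4s budget).
        if hasattr(socket, "TCP_KEEPIDLE"):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    def request_started(self):
        self._outstanding += 1
        if self._outstanding == 1:
            # First outstanding request: start probing.
            self._sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    def request_finished(self):
        self._outstanding -= 1
        if self._outstanding == 0:
            # Connection is idle again: stop probing.
            self._sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 0)
```

An HTTP client would call request_started() just before writing a request and request_finished() once the matching response has been read. The kernel itself only sends probes when the connection has been quiet for idle_s, which gives the two conditions listed above.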
This system of heartbeats really needs to be augmented by local detection on the server side of failures that the client can't detect. For this reason it may still be useful for the client to time requests out eventually. However, this timeout then becomes a back-stop that doesn't need to meet such stringent requirements.
Connecting quickly is still important, both after a failure and while a particular server is down. The HTTP implementation should create simultaneous TCP/IP connections whenever the DNS name associated with a URL resolves to multiple IP addresses. The first to respond should typically be the connection used to make requests.
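One way to sketch the simultaneous connection attempt is with a thread per resolved address, keeping whichever socket connects first and closing the rest. The helper names resolve_all and connect_first are hypothetical, and a production client would probably use non-blocking sockets rather than threads; this is only meant to show the shape of the idea.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def resolve_all(host, port):
    """Return every distinct (ip, port) pair the DNS name resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return list(dict.fromkeys(info[4][:2] for info in infos))


def _close_if_connected(future):
    # Done-callback for the losing attempts: close any late connection.
    try:
        future.result().close()
    except OSError:
        pass  # that attempt failed, so there is nothing to close


def connect_first(addresses, timeout_s=4.0):
    """Connect to all addresses at once and return the first socket to succeed."""
    if not addresses:
        raise ValueError("no addresses to connect to")
    pool = ThreadPoolExecutor(max_workers=len(addresses))
    futures = [
        pool.submit(socket.create_connection, addr, timeout_s)
        for addr in addresses
    ]
    pool.shutdown(wait=False)  # already-submitted attempts keep running
    for future in as_completed(futures):
        try:
            winner = future.result()
        except OSError:
            continue  # this address refused or timed out; wait for the others
        for other in futures:
            if other is not future:
                other.add_done_callback(_close_if_connected)
        return winner
    raise OSError("no address accepted a connection")
```

Usage would be along the lines of connect_first(resolve_all("service.example", 80)), where the host name is a placeholder. A failed server then costs nothing as long as at least one other address answers promptly.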
Conclusion
It is important to note that this kind of failure detection is required even when an HA cluster is used. TCP/IP connections typically don't fail over as part of the cluster. Adding TCP keepalives that are enabled only while requests are outstanding, and reconnecting quickly, adds minimal overhead to achieve a 90% HA solution. Local health monitoring on the server side can then cover the remaining cases.
Benjamin