I have been reading up over the last week on an area of my knowledge that is sorely lacking. Despite being deeply involved in the design of a high-availability distributed software architecture, I don't have a good understanding of how high availability can be provided at the network level. Given the cost of these solutions in the past and the relatively small scale of the systems I have worked with, we have generally treated the network as a dumb transport. Deciding which network interface of which server to talk to has been a problem for the application layer. A typical allowable failover time in this architecture would be four seconds (4s). This is achieved with end-to-end pinging between the servers to ensure they are all up, and from clients to their selected server to ensure it is still handling their requests and subscriptions.
The book I have been reading is a Cisco title, Building Resilient IP Networks. It has improved my understanding of layer 2 switching, an area which has changed significantly since I last really looked at networking. Back then, the hub was still king. It also delved into features for supporting host redundancy, including NIC teaming, clustering, and the combination of Server Load Balancing and a Global Site Selector.
It may be just my lack of imagination, but the book seems to get tantalisingly close to a complete solution for the kinds of systems I build. Just not quite there. It talks about clustering and NIC teaming within a single access module, and that offers at least a half-way solution. It seems you could add a VLAN out to another site (thus another access module) for disaster recovery, but the book repeatedly warns against such an architecture without offering a clear alternative.
So, I have three servers. Two are at my main site. One is at my disaster recovery site. I can issue pings using layer 3 protocols, so I don't strictly need my servers to be on the same subnet. However, I need my clients to fail over from one server to another within a fixed period after any single point of failure. It looks like I need IP address takeover between the sites to solve my failover problem at the network level.
The DNS-based Global Site Selector option discussed in the book is fine if we want the failover to affect only new clients. Old clients will retain cached DNS records, and may not issue another DNS query for requests that are still pending. Issuing a mass DNS cache expiry multicast or using very short DNS cache periods both seem like poor options. Ideally we would contain the failover event within the cluster and its immediate network somehow.
A routing-based failover solution might allow a floating IP address to be taken over by a different node within the cluster as failover occurs. For this to work we would need a fast-converging OSPF network that allowed a single IP address to be served from multiple sites. Failover of connections would be handled at the OSPF level. This solution (if implementable) would have similar characteristics to any multiple-site VLAN solution based on RSTP. In either case, the problem remains of clients that are already in particular communication states with the failed server.
A current client may be either part-way through issuing a request, or may be holding a subscription to resources at a server. If the client is to reissue its request to the new server after a failover, it can wait only as long as the failover time before declaring the original request failed and in an unknown state of completion. The maximum request processing time on the server is therefore bounded by the failover time, less the round-trip network latency to the client. With a four-second failover budget and, say, 200ms of round-trip latency, the server would have to complete any request in under roughly 3.8 seconds.
An alternative to timing out when the failover time is reached would be to sample the state of the connection or request at a rate faster than the failover time. If your failover time is four seconds (4s), you could sample the state every three seconds, or two, or one. A timeout would not be necessary if the sampling indicated that the request was still being processed.
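As a rough illustration, here is a minimal sketch of that polling loop in Python. The `conn.poll_reply` and `get_status` calls are hypothetical stand-ins for whatever the real client API provides; the point is that the hard timeout becomes a distant safety net rather than a bound tied to the failover time.

```python
import time

FAILOVER_TIME = 4.0   # seconds: the cluster's promised failover window
SAMPLE_PERIOD = 1.0   # comfortably shorter than FAILOVER_TIME

def await_reply(conn, request_id, get_status, hard_timeout=60.0):
    """Wait for a long-running request without tying the client timeout to
    the failover window.  Every SAMPLE_PERIOD we probe the request's state;
    only if the probe says the server no longer knows the request do we give
    up early.  `conn.poll_reply` and `get_status` are hypothetical stand-ins
    for the real client API; `get_status` returns "running" or "unknown"."""
    deadline = time.monotonic() + hard_timeout
    while time.monotonic() < deadline:
        reply = conn.poll_reply(request_id, timeout=SAMPLE_PERIOD)  # hypothetical call
        if reply is not None:
            return reply                       # the request completed normally
        if get_status(request_id) == "unknown":
            raise ConnectionError("server lost the request; reissue it after failover")
        # otherwise the request is still being processed: no timeout needed yet
    raise TimeoutError("request exceeded its eventual hard timeout")
```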
The sampling itself could come in the form of a pipelined "PING" request queued up behind the main request. Whenever the transmit queue is nonempty on the client side, TCP will retransmit unacknowledged packets on an exponential back-off strategy. So long as routes to the new holder of the cluster IP address are established before too many packets are lost, the new server should respond (typically with a reset) indicating that it doesn't know about the connection. Another option would be to employ a dedicated request-state sampling protocol or to craft specially-designed TCP packets for transmission to sample the state.
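A minimal sketch of that pipelined PING, assuming a line-oriented application protocol that tolerates an interleaved no-op message (the "PING\n" framing here is invented for illustration, as is the assumption that the takeover server resets unknown connections):

```python
import socket

def wait_with_ping(sock, period=1.0, hard_timeout=60.0):
    """Wait for the main reply while keeping a small "PING" queued on the same
    TCP connection, so the client's transmit queue is never empty for long.
    If the cluster address is taken over by a server that knows nothing of
    this connection, its reset of the retransmitted segments surfaces here as
    ConnectionResetError well before any long application timeout."""
    sock.settimeout(period)
    waited = 0.0
    while waited < hard_timeout:
        try:
            sock.sendall(b"PING\n")              # keep the transmit queue non-empty
            return sock.recv(4096)               # the reply, or a PONG to discard
        except socket.timeout:
            waited += period                     # no reply yet; the request is still in flight
        except (ConnectionResetError, BrokenPipeError):
            raise ConnectionError("connection reset after failover; reissue the request")
    raise TimeoutError("request exceeded its eventual hard timeout")
```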
Subscription is a problem in its own right. The server has agreed to report changes to a particular input as they come in, but the server may fail. The client must therefore sample the state of its subscriptions at a rate faster than the failover time if it is to detect failure, and issue requests to the new server to re-establish the subscription. This again is an intensive process that we would rather do without. One solution is to persist subscription state across failover. Clients should not receive an acknowledgement to their subscription requests until news of the subscription has spread sufficiently far and wide.
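A sketch of the server side of that idea, assuming some synchronous replication mechanism to peer nodes (the `peer.replicate` call and the quorum of two are invented for illustration):

```python
def handle_subscribe(request, peers, quorum=2):
    """Acknowledge a subscription only once news of it has spread far enough
    to survive a single failure.  `peers` and their replicate() call are
    hypothetical stand-ins for the cluster's replication transport; `quorum`
    counts the nodes (including the DR site) that must hold the record."""
    record = {"client": request.client_id, "resource": request.resource}
    copies = 1                                       # this node already holds the record
    for peer in peers:
        if peer.replicate("subscription", record):   # hypothetical synchronous replication
            copies += 1
    if copies >= quorum:
        return "SUBSCRIBED"                          # safe to promise: state survives one failure
    return "RETRY-LATER"                             # not replicated widely enough to acknowledge
```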
Both the outstanding-request and outstanding-subscription client states can be resolved through these mechanisms when the server is behaving itself. However, there is the possibility that an outstanding request will never return. Likewise, there is the possibility that a client's subscription state will be lost. For this reason, outstanding requests still demand an eventual timeout, and outstanding subscriptions still need to be periodically sampled and renewed. These periods can be much longer than the failover period.
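On the client side, those safety nets might look something like this sketch; the constants only illustrate the relative timescales, and `conn.renew` is again a hypothetical, idempotent call:

```python
import time

FAILOVER_TIME   = 4.0      # seconds; handled by the cluster and the network
RENEWAL_PERIOD  = 300.0    # safety net only: re-assert subscriptions occasionally
REQUEST_TIMEOUT = 600.0    # eventual hard timeout on any outstanding request

def renew_subscriptions(conn, subscriptions):
    """Re-assert each subscription on a long period in case its state was
    silently lost despite the failover machinery.  This replaces intensive
    fast sampling with an occasional, cheap renewal."""
    while True:
        for sub in subscriptions:
            conn.renew(sub)        # a no-op on the server if it still holds the subscription
        time.sleep(RENEWAL_PERIOD)
```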
Clustering and network-based solutions can be expensive, but they can provide scalable failover for the service of new clients. Existing clients still need some belts and straps to ensure they fail over safely to the new server.
My high-availability operating-system support wishlist:
- Support for explicit connection-state sampling without having to write more data down the TCP/IP pipe: Send a keep-alive on demand. If OK, keep processing and don't tell me about it. If failed, kill the connection and let me know when I next try to read from it. (A rough per-socket approximation using today's keep-alive options is sketched after this list.)
- Per-subnet TCP/IP timeout tuning: Set the maximum period between SYN requests for a connect. Control the rate at which TCP/IP retries are sent. Tune TCP/IP down to the size of a corporate WAN for local IPs, and to the size of the Internet for others.
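Neither item exists as such, but on Linux the keep-alive socket options get part of the way there on a per-socket (rather than per-subnet) basis. A minimal sketch, assuming Python on a Linux host where these option names are defined:

```python
import socket

def tune_for_fast_failure_detection(sock, idle=2, interval=1, probes=3):
    """Make TCP keep-alive aggressive enough to notice a dead peer within a
    few seconds on an otherwise idle connection.  The TCP_KEEP* option names
    are Linux-specific; the numbers here suit a corporate WAN, not the open
    Internet."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # idle seconds before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)      # failed probes before the connection is killed
```

It is still not the on-demand probe the first item asks for, and the tuning is per socket rather than per subnet, but it keeps the detection period within a few seconds without writing application data down the pipe.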
Benjamin