Sound advice - blog

Tales from the homeworld

My current feeds

Wed, 2007-Nov-28

High Availability HTTP

HTTP and its underlying protocols have some nice High Availability features built in. Many of these features come for free when load balancing or scalability improvements are made. In the enterprise our scalability needs are often smaller than a large web site, but our high availability requirements are more stringent. HTTP can be combined with part-time TCP keepalives to work effectively in this environment.

Web vs Enterprise

A large web site can depend on its users to hit reload under some failure conditions. An enterprise system with a significant number of automated clients needs more explicit guidance. The systems I work with tend to require no single point of failure, and an explicit bounded failover time. Say we have four seconds to work with: A request/response protocol needs to be able to determine within 4s whether or not it will receive a response. If a client determines that it will not receive a response, it will repeat its idempotent request to a backup server.

The mechanism HTTP provides to detect a failed server is the request timeout. Say we have a time-out of forty seconds. If 40s passes without a response, the client knows only that either:

  1. the server has failed, or
  2. the server is taking an unusually long time to respond.

If we tighten the failure detection window to only 4s this choice becomes even starker. A better approach would be to send heartbeats to the server while our request is outstanding. However, a server is not permitted to respond to HTTP requests on one connection out of order, so any heartbeat request we sent would simply queue behind the slow request. We cannot get a timely answer unless we drop below the HTTP protocol layer. Luckily, TCP gives us an out in this situation.

Using TCP Keepalive

The TCP keepalive mechanism essentially sends the server a probe segment carrying no data, requiring an acknowledgement at the TCP level. This isn't enough to detect every form of failure, but it will detect anything that causes the TCP connection to terminate.

The interesting thing is that these keepalive probes don't need to add a lot of overhead. A traditional heartbeating system would be active all the time. The TCP keepalive need only be enabled while one or more requests are outstanding on the connection. It should be disabled while the connection is idle. Even when enabled, heartbeats will only be sent when:

  1. A request takes longer than the heartbeat time, and
  2. No other requests are being transmitted down the connection
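The on-while-busy, off-while-idle toggling can be sketched with the BSD sockets API. The tuning values here are illustrative only, and the TCP_KEEP* option names are platform-specific (these are the Linux ones), hence the guards:

```python
import socket

def set_keepalive(sock, enabled, idle=1, interval=1, probes=3):
    """Turn TCP keepalive probes on or off for a connected socket.

    Enable just before a request goes out; disable once the
    connection falls idle again. The idle/interval/probes numbers
    are illustrative, and the TCP_KEEP* constants are not available
    on every platform, hence the hasattr() guards.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1 if enabled else 0)
    if not enabled:
        return
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before the first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before the connection is reset
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

A client would call set_keepalive(sock, True) as its first request goes out on the connection, and set_keepalive(sock, False) once the last outstanding response has arrived.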

This system of heartbeats really needs to be augmented by local detection on the server side of failures that the client can't detect. For this reason it may still be useful for the client to time requests out eventually. However this then becomes a back-stop that doesn't need to have such stringent requirements on it.

Connecting quickly is still important, both after a failure and while a particular server is down. The HTTP implementation should create simultaneous TCP/IP connections whenever the DNS name associated with a URL resolves to multiple IP addresses. The first connection to respond should typically be the one used to make requests.
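A minimal sketch of this connect-in-parallel idea, assuming threads are acceptable. A production client would also stagger the attempts and prefer address families (RFC 8305, "Happy Eyeballs", covers that for the modern web); this version just races every resolved address and keeps the winner:

```python
import concurrent.futures
import socket

def connect_fastest(host, port, timeout=4.0):
    """Attempt a TCP connection to every address `host` resolves to,
    in parallel, returning whichever socket connects first.

    A sketch only: it waits for the losing attempts to finish before
    returning, rather than cancelling them early.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)

    def attempt(info):
        family, stype, proto, _, addr = info
        s = socket.socket(family, stype, proto)
        s.settimeout(timeout)
        try:
            s.connect(addr)
        except OSError:
            s.close()
            raise
        return s

    with concurrent.futures.ThreadPoolExecutor(max_workers=len(infos)) as pool:
        futures = [pool.submit(attempt, i) for i in infos]
        winner = None
        for f in concurrent.futures.as_completed(futures):
            try:
                s = f.result()
            except OSError:
                continue               # this address was unreachable
            if winner is None:
                winner = s             # first successful connection wins
            else:
                s.close()              # a slower attempt also succeeded; discard it
    if winner is None:
        raise OSError("no address for %s:%s was reachable" % (host, port))
    return winner
```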


It is important to note that this kind of failure detection is required even when an HA cluster is used. TCP/IP connections typically don't fail over as part of the cluster. Adding TCP keepalives that are enabled only while requests are outstanding, plus quick reconnection, adds minimal overhead to achieve a 90% HA solution. This solution can be augmented on the server side with local health monitoring to complete the picture.


Mon, 2007-Nov-19

"Four Verbs Should be Enough for Anyone"

The classic four verbs of REST are GET, PUT, POST and DELETE. Whenever the verbs come up it is inevitable that someone will say that it is "obvious" that four verbs aren't enough. I use four basic request verbs, but I don't use POST. It isn't idempotent, so it makes efficient, reliable messaging difficult or unscalable. I'm actually not even that big a fan of DELETE, which I see as simply an idiomatic "PUT null".

The point of any messaging in a distributed environment is to transfer information from one place to another. In a high-level design document we can draw flows of data from one place to another without worrying about the details. When it comes time for the detailed implementation a few additional questions need answering:

  1. Which side should hold the configuration information relating to the data flow? This is the client side, and it is the client's responsibility to ensure the transfer is successful.
  2. Which side is the data being transferred to (Where)?
  3. Which side knows when the data needs to be transferred (When)?
  4. Is the data null?

The correct method to use can be read from the following table:

Where   When    Is null  Method
------  ------  -------  -------------------------
Client  Client  *        GET
Server  Client  No       PUT
Server  Client  Yes      DELETE
Client  Server  *        SUBSCRIBE
Server  Server  *        None - swap Client/Server
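The table reads naturally as a small decision function. A sketch, with hypothetical lowercase string arguments standing in for the table's columns:

```python
def choose_method(where, when, is_null=False):
    """Pick the request method from the table above.

    `where` is the side the data is being transferred to, and `when`
    is the side that knows the transfer is due; both are "client" or
    "server". `is_null` marks a null (deleted) piece of state.
    """
    if where == "client" and when == "client":
        return "GET"
    if where == "server" and when == "client":
        return "DELETE" if is_null else "PUT"
    if where == "client" and when == "server":
        return "SUBSCRIBE"
    # where == "server" and when == "server": no method fits;
    # swap the client and server roles and look again.
    return None
```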

SUBSCRIBE is sadly missing from popular specifications and implementations at this stage. It is a hard problem with some delicate balancing acts and a real-time focus. Getting it to work when events are generated faster than the network can deliver them remains unsolved in public implementations and specifications.

Other request methods and variations on these methods exist for a number of reasons. Some are for greater efficiency (HEAD, conditional GET, etc). Others are there to deal with non-REST legacy requests, or requests that take information not in the request into account (MOVE, COPY, etc).

Two other things are needed after you establish which method you want to use. You need to pick the document type you are transferring, and the URL the client interacts with. The media type should be the simplest, most standard type that conveys the necessary semantics. The fewer semantics you transfer the better, as coupling is reduced. Shared semantics are both the fundamental requirement of machine to machine communication, and its downfall. The less any particular machine knows about the information passing through it, the better. If you can get away with plain text, or a bit of HTML you are laughing.

Picking the right URL is still something of an art, but bear in mind the basic principle: Whichever method you use should make sense for the URL you define. The URL should "name" some state on the server side that can be sampled or updated as an atomic transaction.


Sat, 2007-Nov-17


I was just watching a video of Stefan Tilkov at the BeJUG SOA conference. I have seen most of this material before, but this time I wanted to comment on slide 31.

The original slide compares REST to "Technical" SOA ((T)SOA) by placing two SOA-style interface definitions beside five URLs conforming to a uniform interface. One implication that could be drawn from this diagram is that REST fundamentally changes the structure of the architecture. My view is that the change isn't fundamental. I see REST as simply tweaking the interface to achieve a specific set of properties.

Following is my diagram. Apologies for its crudeness, I don't have my regular tools at hand:


Some differences to Stefan's model:

Separate domain names

In the business I am in we might use the word "subsystem" instead of "service", taking a military-style systems engineering approach. The client would also be, or be part of, a subsystem. It is useful to be able to define and control the interface between subsystems separately from the definition and control of interfaces within each subsystem. Stefan puts the URLs for the two services under one authority, but I use a separate authority for each service/subsystem. The definition of these URL-spaces would be controlled and evolve separately over time.

Safe and Idempotent methods

I use only safe and idempotent methods, meaning that I have reliable messaging built in: A client retries requests that time out. Reliable messaging is critical to automated software agents. Idempotency provides the simplest, most reliable, and most scalable approach. Note that for automated clients this may mean IDs have to be chosen on the client side. This has some obvious and non-obvious "cons".

HTTP introduces some special difficulties when it comes to reliable ordering of messages, so automated HTTP clients should ensure they don't have different PUT requests outstanding to the same URL at the same time.
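Both points can be sketched together: retry idempotent requests that fail, and hold a per-URL lock so that two PUTs to the same URL are never outstanding at once. The class name, retry count and timeout here are illustrative, not from the original, and the error handling is deliberately crude:

```python
import threading
import urllib.request

class ReliableClient:
    """Sketch of reliable messaging over idempotent PUT.

    Because PUT is idempotent, a request that times out can simply be
    repeated. A per-URL lock keeps PUTs to the same resource serialized,
    so a retry can never be reordered against a newer PUT.
    """

    def __init__(self, retries=3, timeout=4.0):
        self.retries = retries
        self.timeout = timeout
        self._put_locks = {}            # one lock per URL
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, url):
        with self._guard:
            return self._put_locks.setdefault(url, threading.Lock())

    def put(self, url, body):
        # Only one outstanding PUT per URL at a time.
        with self._lock_for(url):
            last_error = None
            for _ in range(self.retries):
                req = urllib.request.Request(url, data=body, method="PUT")
                try:
                    with urllib.request.urlopen(req, timeout=self.timeout) as resp:
                        return resp.status
                except OSError as e:    # timeout or connection failure
                    last_error = e      # idempotent, so safe to repeat
            raise last_error
```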

Query part of the URL

I use the query part of a URL whenever I expect an automated client to insert parameters as part of a URL. I know that there is a move to do this with URI templates, but I personally view the query part of the URL and its use as a good feature. It helps highlight the part of the URL that needs special knowledge somewhere in client code. Opaque URLs can be passed around without special knowledge, but where a client constructs a URL it first needs to know how. This is especially important for automated clients, which don't have a user to help them supply data to a form.

Don't supply every method

I don't provide all valid methods on every URL. Of course, every method still gets some response in practice: if the client requests a DELETE on a URL that doesn't allow it, the request will be rejected with an appropriate error. However, I don't want to complicate the architectural description with these additional no-op methods, nor do I want developers or architects to feel that they have to provide functions that are not required. It should always be easy to describe what a GET to the /allorders URL would mean, but that doesn't mean we actually need to provide it when we don't expect any client to issue the request.


REST doesn't have to redraw the boundaries of your services or your subsystems. It is a technology that improves interoperability and evolvability over time. It is worth doing because of the short term and long term cost savings and synergies. It provides a shorter path to translating your high-level data-flow diagrams into working code, and should ultimately reduce your time to market and improve your business agility. That said: It needn't erode your existing investments, and from the high level isn't really a big change. In the end, the same business logic will be invoked within the same clients and services.