Sound advice - blog

Tales from the homeworld

Mon, 2008-Jun-09

Delta Encoding in HTTP

So I have a document at a given URL, say "https://example.com/doc". It's a big document, and clients need a consistent synchronised replica of this document. The problem is that this document changes frequently, but only in small ways. It makes sense that a client will have to grab a copy of the whole document initially, but does it really need to grab the whole document over and over as it changes? What options exist? Is "Delta Encoding in HTTP" the right answer, or is something else needed?

Straight HTTP

Let's start with plain HTTP/1.1, as defined in RFC 2616. This protocol lets us go and fetch the data initially. Along with that fetch we might get an ETag, a Last-Modified, and/or explicit cache control mechanisms. These let us avoid transferring the document again if it has not changed. If it has changed, however, we must acquire the whole thing again.
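
As a rough sketch of what that looks like from the client side, here is a conditional GET in Python using only the standard library. The URL is just the example from above, and the caller is assumed to hold on to the body and ETag between calls:

import urllib.error
import urllib.request

def fetch(url, cached_body=None, cached_etag=None):
    # Return (body, etag), revalidating against a cached ETag if we have one.
    request = urllib.request.Request(url)
    if cached_etag is not None:
        request.add_header("If-None-Match", cached_etag)
    try:
        with urllib.request.urlopen(request) as response:
            # 200 OK: either we had no copy, or the document changed.
            # Either way we transfer the whole thing again.
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304 and cached_body is not None:
            # 304 Not Modified: our copy is still current, nothing transferred.
            return cached_body, cached_etag
        raise

body, etag = fetch("https://example.com/doc")
# ... later, a cheap revalidation:
body, etag = fetch("https://example.com/doc", body, etag)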

Delta Encoding in HTTP

RFC 3229 introduces "Delta encoding". This allows us to make a request that says: "Hey, I have the representation with this ETag already. Can you give me the diff against my document to catch up to the current representation?". It looks a little like this:

GET https://example.com/doc HTTP/1.1
If-None-Match: "123xyz"
A-IM: diffe

This says that the client is willing to accept a patch in "diff -e" format from the server. The client can apply this patch against the document it already has to update it to the current revision. A special 226 IM Used response indicates that the server has understood the request and knows what to do with it.
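
A sketch of the client side, assuming Python's standard library for the request and the local "patch" command to apply the ed-script delta (a server or intermediary that doesn't understand A-IM will simply ignore it and answer normally):

import os
import subprocess
import tempfile
import urllib.request

def fetch_delta(url, etag):
    # Ask for a "diff -e" delta against the revision we already hold.
    request = urllib.request.Request(url)
    request.add_header("If-None-Match", etag)
    request.add_header("A-IM", "diffe")
    with urllib.request.urlopen(request) as response:
        body = response.read()
        new_etag = response.headers.get("ETag")
        if response.status == 226:       # 226 IM Used: body is a delta
            return "delta", body, new_etag
        return "full", body, new_etag    # 200 OK: full representation

def apply_ed_patch(original, ed_script):
    # "patch --ed" applies an ed script to a file on disk, so round-trip
    # through a temporary file for the purposes of this sketch.
    with tempfile.NamedTemporaryFile(delete=False) as doc:
        doc.write(original)
        path = doc.name
    subprocess.run(["patch", "--ed", path], input=ed_script, check=True)
    with open(path, "rb") as patched:
        result = patched.read()
    os.unlink(path)
    return result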

The general idea would be that the server keeps either copies or diffs relating to several recent versions of a representation. At request time the server will combine recent diffs into an update that brings the client up to date from its current ETag. If the client happens to ask with an ETag that is too old, it will simply get the whole document back again.

The model is still nicely stateless. Any number of clients might be sitting on the same ETag-identified document version. The server doesn't try to track individual clients, so it should be able to handle a serious number of them. The server's only important decision is how long to make the queue. This could relate to a fixed amount of available storage space, a fixed number of document revisions, a fixed time range, an unbounded revision control history with manual purging, or even perhaps some dynamic behaviour based on requests made to date. Interestingly, I suppose, only revisions that were actually transmitted to someone need to appear in the list of diffs.
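
The server-side bookkeeping might look something like the following sketch, which keeps a fixed-length queue of recent revisions keyed by ETag and leaves the actual diff format abstract (make_diff could shell out to "diff -e", for instance). The class and method names are purely illustrative:

from collections import OrderedDict

class DeltaStore:
    def __init__(self, max_revisions=16):
        self.revisions = OrderedDict()   # ETag -> full document body
        self.max_revisions = max_revisions
        self.current_etag = None

    def publish(self, etag, body):
        # Record a new current revision, dropping the oldest when the queue is full.
        self.revisions[etag] = body
        self.current_etag = etag
        while len(self.revisions) > self.max_revisions:
            self.revisions.popitem(last=False)

    def respond(self, client_etag, make_diff):
        # Decide what to send a client that claims to hold client_etag.
        if client_etag == self.current_etag:
            return "unchanged", None                  # 304 Not Modified
        current = self.revisions[self.current_etag]
        if client_etag in self.revisions:
            delta = make_diff(self.revisions[client_etag], current)
            return "delta", delta                     # 226 IM Used
        return "full", current                        # too old: whole document again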

The delta encoding RFC is a proposed standard, but it seems to have a number of problems in this usage profile.

So what's the alternative?

Multi-URL approach

The two basic approaches here are a one-URL solution and an "n"-URL solution. RFC 3229 uses a one-URL approach, where different diffs are returned to different clients from the main resource URL. An "n"-URL solution would give a unique URL to each diff. Adding URLs seems to be the answer to most problems in the REST domain, so let's have a go at applying it to this one.

The first request

The first request is always going to return the whole document. Instead of simply returning an ETag, consider the possibility that the response will also contain a hyperlink. This hyperlink would point to the document that will become the diff against this revision of the resource, bringing it up to the "current" revision at that time. An example response:

HTTP/1.1 200 OK
Link: <https://example.com/doc/deltas?last=123xyz>; rel="Deltas"
Content-Type: ...
...

This rel="Deltas" link might tell an aware client that it can acquire deltas against the current version by querying the listed URL. The client may use this URL for the second request, below. An ETag and other cache control information may also be included for non-delta-aware clients.
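
A delta-aware client would need to pull that link out of the response headers. A simplified sketch (it only handles a single link per Link header value, with rel as the first parameter):

import re

def find_link(headers, rel):
    # Return the target of the first Link header with the given rel, or None.
    for value in headers.get_all("Link") or []:
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="?([^";]+)"?', value)
        if match and match.group(2).lower() == rel.lower():
            return match.group(1)
    return None

# e.g. after the first request:
#   deltas_url = find_link(response.headers, "Deltas")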

The second request

When the client suspects that the resource has changed it will issue a GET request to the Deltas URL. It will specify the patch formats it understands as mime types in an Accept header, and otherwise the request will be as normal:

GET https://example.com/doc/deltas?last=123xyz HTTP/1.1
Accept: ...
...

The response that comes back might depend on whether changes have occurred, and on whether the delta can be generated. The normal delta case:

HTTP/1.1 200 OK
Link: <https://example.com/doc/deltas?last=456abc>; rel="Next"
Content-Type: ...
...

This would deliver the delta to the client in a format the client understands. Normal content negotiation and other features of GET are employed and various kinds of errors can be reported. Some errors would leave the client in the dark as to how it should proceed, and the client would fall back to fetching from the main URL again.

The "Next" link is important. We can't have a client that has successfully applied this delta to ask for its delta from the same place again. Other clients may need this URL, and the client needs to move on to another. In this case the client should move onto the "next" delta in the series for its next request.

Here we see the case where nothing has changed, yet:

HTTP/1.1 204 No Content

Nice and easy. Nothing has changed, so nothing to see here. Cache controls are likely to be the same as for the main URL, but might differ. The client should try again next time it suspects something might have changed.

Finally, we have the case where the delta we are trying to fetch is too old:

HTTP/1.1 410 Gone

Gone is appropriate here, as this URL is no longer expected to work for anyone. However, clients must also be ready to treat a generic 404 Not Found in the same way. The client would be expected to fall back to querying the main URL, restarting the state synchronisation from a current state.
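
Putting the three cases together, the client side of this scheme might look something like the sketch below. It reuses the find_link helper from earlier, leaves patch application as an assumed apply_patch function, and uses an invented "text/x-patch" media type purely as a placeholder for whatever patch formats the client actually understands:

import urllib.error
import urllib.request

def synchronise(state, apply_patch):
    # state is a dict holding 'body', 'deltas_url' and 'main_url'.
    request = urllib.request.Request(state["deltas_url"])
    request.add_header("Accept", "text/x-patch")        # patch formats we understand
    try:
        with urllib.request.urlopen(request) as response:
            if response.status == 204:                   # nothing has changed yet
                return state
            delta = response.read()                      # 200 OK: a delta we can apply
            state["body"] = apply_patch(state["body"], delta)
            next_url = find_link(response.headers, "Next")
            if next_url:
                state["deltas_url"] = next_url           # move on to the next delta
            return state
    except urllib.error.HTTPError as error:
        if error.code in (404, 410):                     # delta too old: start over
            with urllib.request.urlopen(state["main_url"]) as response:
                state["body"] = response.read()
                state["deltas_url"] = find_link(response.headers, "Deltas")
            return state
        raise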

Subsequent Requests

Subsequent requests would hop from "Next" to "Next", each time the client anticipates a change may have occurred.

There is actually a fairly deep question in this approach as to whether the client should follow a Delta or Next link immediately or not. The answer is potentially tied up in caching.

We theoretically want caches to work well with this approach. Now, caches might be smart enough to apply patches themselves and feed them downstream. On the other hand, we might be dealing with caches that are deployed today. Rather than treating the caches as smart relays, we could potentially treat them as "dumb" independent relays for main URL and delta URL responses.

The fact that we want the caches to work for us means that we don't want to set no-cache directives on delta URLs. The data retrieved by any given client may therefore be out of date when it arrives, within the freshness criteria attached to the delta. These criteria will cause the cache either to refetch, to revalidate, or to return data without revalidating. This information is combined with client preferences to determine whether the client is actually returned the freshest possible data or not.

However we do this, I think we can trust the existing caching models for dealing with deltas. Whenever we are given a delta URL the assumption should be that until proven otherwise there is no data there. If the caching model on the data that supplied the new delta URL says that data can be a few seconds out of date, the link will also be a few seconds out of date. If the model says we must revalidate, we'll get something more up to date.

Conclusion

Some advantages of the alternative approach are already apparent.

I am fast forming the view that a few links in HTTP headers will be more effective than an implementation of RFC 3229. I think this approach avoids complication at proxies, avoids introducing new return codes and headers, and will be more generally applicable. For example, it could be used to cheaply transmit new Atom feed entries to clients who aren't quite up to date, rather than just keeping a large document in sync.

Benjamin