Sound advice - blog

Tales from the homeworld

Sun, 2005-Aug-07

Internet-scale Subscription

When characterising an interface within a distributed system I ask myself a series of questions. The first two I ask together.

  1. Which direction is the data flowing, and
  2. Who is responsible for making this interface work?

I also ask:

The first is easy. It starts where the data is, and it ends where the data isn't. It's a technical issue. The second is more complicated and centres on where the configuration lives and who is responsible for maintaining it, especially the configuration for locating the interface. The client is responsible for knowing a lot more than the server, although it may discover some of what it needs as it accesses the interface itself. Web browsers get all the information they need for navigation from links on a page, but even they need a home page or a manually-entered URL to get them going. Machine-to-machine interactions don't have the luxury of a human operator telling them which way to go, so they typically have fewer steps between their starting configuration and the configuration they use in practice to operate.

When data flows from client to server, life is easy. The client can push, and if its data doesn't get through it can simply try again later. It can also use bandwidth fairly efficiently, even with the HTTP protocol: pipelining lets the client fill the available bandwidth so long as it is generating an idempotent sequence of requests. Mixes of GET and PUT aren't idempotent as a sequence (although each request itself should be), so they can stall the pipeline and reduce performance to a level that depends on latency rather than bandwidth. Depending on the client-side processing, this can sometimes be avoided altogether or contained within a single client function so it doesn't become an overall bottleneck. This matters because it is easier to increase bandwidth than to reduce latency; latency, unfortunately, has something to do with the speed of light.
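The push-and-retry-later idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `push_with_retry` and `flaky_send` are hypothetical names, and `send` stands in for whatever HTTP PUT your client library provides.

```python
import time

def push_with_retry(send, data, attempts=5, base_delay=0.1):
    """Push `data` via `send`; if it doesn't get through, back off and
    try again later. `send` is a stand-in for an HTTP PUT -- any
    callable that raises on failure will do."""
    for attempt in range(attempts):
        try:
            return send(data)
        except IOError:
            if attempt == attempts - 1:
                raise  # out of attempts; give up
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# A flaky stand-in server that refuses the first two pushes, then accepts.
calls = []
def flaky_send(data):
    calls.append(data)
    if len(calls) < 3:
        raise IOError("connection refused")
    return "accepted"
```

Because each retry re-sends the same idempotent request, the client needs no coordination with the server beyond the eventual acknowledgement.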

The problem is that data often flows both ways. You could reverse the client-server relationship depending on which way the data flows, and sometimes this is appropriate. On the other hand, you're sure to eventually need to push data from server to client. Today's Internet protocols aren't good at this. To reverse the client-server relationship in HTTP you need an HTTP server. That isn't hard. The hard part is opening the port required to accept the associated connections.

At present we have a two-tier Internet developing. At the high end are established servers that can accept connections and participate in complex collaborations. At the low end are ad hoc clients trapped behind firewalls that allow them to connect out but won't allow others to connect back in. SOAP protocols are devised for the top-tier Internet and rely on two-way connectivity to make things work. I think our target should be the second tier. These clients are your everyday web browsers, and when data has to flow from a server to one of these web browsers we have no good established options. Clients are reduced to polling in order to let data flow from the server.
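Polling is worth pinning down, since it's the baseline any subscription scheme has to beat. Here is a minimal sketch, under my own assumptions: `fetch` stands in for a conditional HTTP GET that returns a validator (think ETag) and a body, and the function names are illustrative, not from any library.

```python
import time

def poll(fetch, interval, rounds):
    """Plain polling: all a second-tier client can do is connect out on
    a timer and ask whether anything changed. We record a change only
    when the validator differs from the one last seen."""
    seen = None
    changes = []
    for _ in range(rounds):
        etag, body = fetch()
        if etag != seen:  # server state moved on since the last poll
            seen = etag
            changes.append(body)
        time.sleep(interval)
    return changes

# Three polls against a resource that changes once: the first two see
# the same version, the third sees a new one.
versions = iter([("v1", "a"), ("v1", "a"), ("v2", "b")])
def fake_fetch():
    return next(versions)
```

The weakness is visible in the shape of the loop: every update is delayed by up to one polling interval, and every empty poll wastes a round trip.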

HTTP Subscription

The GENA protocol I mentioned previously is built for the top-tier Internet. On the bottom tier the following constraints apply:

That rules out everything proposed so far, I think. I have spent some time on this myself, though, and have a protocol that is regular HTTP and only connects out from the client. There are some other considerations I would like to address:

The whole thing should be RESTful, so

Here is the closest I've been able to come up with so far:

The theory is that the subscription keeps track of whether new data is available at any time. When a NEXT request arrives, the server returns the data immediately if it is available. If new data isn't available, it holds off replying until data arrives. If a proxy is sitting between client and server, it will eventually time out, causing the client to issue a new NEXT request.
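The server side of that deferred reply can be sketched with a condition variable. This is a minimal sketch of my own, not an implementation of any standard: the `Subscription` class and its method names are made up for illustration, and the timeout parameter plays the role of the proxy timing out.

```python
import threading

class Subscription:
    """Server-side state behind a NEXT request: return data at once if
    it is there, otherwise hold off replying until publish() is called
    or a proxy-style timeout expires."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = None

    def publish(self, value):
        # New data has become available; wake any waiting NEXT handler.
        with self._cond:
            self._pending = value
            self._cond.notify_all()

    def next_value(self, timeout):
        # Handler for one NEXT request.
        with self._cond:
            if self._pending is None:
                self._cond.wait(timeout)  # the "holds off replying" step
            value, self._pending = self._pending, None
            # None models the proxy timing out: the client should simply
            # issue a fresh NEXT request.
            return value

sub = Subscription()
# Simulate data arriving 50ms after the NEXT request starts waiting.
threading.Timer(0.05, sub.publish, args=("update-1",)).start()
```

A real server would run `next_value` inside its request handler, one blocked thread (or continuation) per outstanding NEXT.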

Clearly this approach has problems. I think the creation of the subscription is fine, but the subscription itself has several problems. The first is that it can't make use of available bandwidth. This problem is endemic to the proxy behaviour and can't be solved without a change to the HTTP protocol that allows multiple responses to a single request rather than the single request/response pair. The second is that no confirmation gets back to the server: a response may be sent down the TCP/IP connection that the NEXT request arrived on but never be transmitted to the client because the connection was closed. This could be solved by adding a URI to both the NEXT request and the two responses available for the NEXT and SUBSCRIBE requests. Alternatively, NEXT requests might be directed to the URI (URL) of the next value rather than being sent to the subscription itself. Responses would then specify the URI to use in the next NEXT request passed to the subscription. If the URI matches what is currently available, the server should return the data immediately with a new URI. If the URI matches an older state, the server should return the current state, but also indicate, if possible, how many updates the client missed. If the URI is still a future URI (the next URI), the response should be deferred.
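The three-way decision in that URI scheme is easy to get wrong, so here it is as a pure function. All names and the return encoding are mine, invented to illustrate the logic: `requested` is the URI the client presented, `current` the value now available, `next_uri` the not-yet-existing next value, and `history` the ordered list of older URIs.

```python
def handle_next(requested, current, next_uri, history):
    """Decide how to answer a NEXT request carrying `requested`.
    Returns an (action, missed_updates) pair."""
    if requested == current:
        return ("reply-now", 0)            # fresh data: return at once
    if requested in history:
        # Client fell behind: send the current state and report how
        # many intermediate updates it never saw.
        missed = len(history) - history.index(requested) - 1
        return ("reply-current", missed)
    if requested == next_uri:
        return ("defer", 0)                # hold the response open
    return ("unknown", 0)                  # not a URI we ever issued
```

Keeping this as a pure function over explicit state makes the confirmation semantics testable without any network in the way.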

The proxies get in the way of a decent solution. Really, the only solution is to come up with a new protocol (perhaps a special extension of HTTP) or use a different existing protocol.

XMPP Subscription

At least one person I know likes to talk about Jabber whenever publish-subscribe comes up. Here is the standard defined for XMPP. The thing that immediately gets the hairs on the back of my neck going is that Jabber doesn't seem to support REST concepts. It's a message-passing system whose conceptual framework relies on you connecting to someone who knows more than you about how to find things on the network. That doesn't seem right to me. I prefer the concept of caches that mirror network topology to the idea of connecting to some server three continents away that might be arbitrarily connected to some other servers but most probably is not connected to everything, just a subset of the Internet. I also tend to think of publish-subscribe as an intrinsic part of the client-server relationship rather than something you slap onto the top of either an established HTTP protocol or an established XMPP protocol.

Those things said, oh gosh it's complicated. I really think you don't need much over existing HTTP to facilitate every possible collaboration technique, but as you enter the XMPP world you are immediately hit with complicated message types combined with HTTP return codes combined with still more complicated return nodes. The resource namespaces don't seem well organised. I'm not sure I quite understand what the node "generic/pgm-mp3-player" refers to in the JEP's example. It's all peer-to-peer rather than client-server and... well... I'm sorry that I can't say I'm a fan. Maybe once it's proven itself a little more I'll give XMPP another look.


I've already suggested some more radical approaches to adding subscription support to HTTP. I do believe it's a first-class problem for an Internet-scale protocol and should be treated as one. I think that making appropriate use of available bandwidth is an important goal and constraint. Unfortunately, I believe that working with the existing Internet infrastructure is also important, and at the moment proxies make this a hard problem to solve well. In the interim, feel free to try out my "NEXT" HTTP subscription protocol and see how you like it. It may at least open things up to the second tier of users.