Sound advice - blog

Tales from the homeworld


Sun, 2005-Mar-27

A second stab at HTTP subscription

A few weeks back I put forward my first attempt at doing HTTP subscription for use in fast-moving REST environments. I wasn't altogether happy with it at the time, and the result was a bit half-arsed... but this is blogging after all.

I thought I'd give it another go, based on the same set of requirements as I used last time. This time I thought I'd try and benefit clients and servers that don't use subscription as well as those that do.

One of the major simplifying factors in HTTP has been the concept of a simple request-response pair. Subscription breaks this model by sending multiple responses for a single request. I have called this concept a "multi-callback" in the past, where each response carries a marker indicating it is not the last one the client should expect to receive. In its original incarnation HTTP performed each request-response exchange over its own TCP/IP connection, increasing overhead but again promoting simplicity. In HTTP/1.1 the default behaviour became to "hang on" to a connection and to allow pipelining (multiple requests sent before any response is received) to reduce dependence on a low-latency connection for high performance. Without pipelining it takes at least n times the round-trip latency to make n requests.

One restriction that can still affect HTTP pipelining performance is the requirement that responses be returned in the order their requests were made. This may be fine when you're serving static data such as files, but if you are operating as a proxy to a legacy system you may have to make further requests to that system in order to fulfil the HTTP request. In the meantime, other requests that could be made in parallel to the legacy system are either backing up in the request pipe, or have already been completed but are waiting for a particularly slow legacy response to be returned via HTTP before they themselves can be.

This brings me to my first recommendation: Add a request id header. Specifically,

14.xx Request-Id

The Request-Id field is used both as a request-header and a response-header. When used as a request-header it provides a token the server SHOULD return in its response using the same header name. If a Request-Id is supplied by a client the client MUST be able to receive the response out of order, and the server MAY return responses to identified requests out of order. Fairness algorithms SHOULD be used on the server to ensure every identified request is eventually dealt with.

A client MUST NOT reuse a Request-Id until its transaction with the server is complete. A SUBSCRIBE request MUST include a Request-Id field.
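
To make the idea concrete, here is a minimal sketch of the client side of this proposal. It assumes a hypothetical server that honours Request-Id and answers pipelined requests out of order; the wire data, the paths and the simplistic parsing are purely illustrative, and nothing here is part of RFC 2616.

    # A minimal sketch of matching out-of-order responses to pipelined requests
    # by the proposed Request-Id header. The wire data below is hypothetical;
    # Request-Id is this post's proposal, not part of RFC 2616.

    def parse_responses(buffer):
        """Split back-to-back HTTP responses into (status, headers, body),
        relying on Content-Length framing."""
        responses = []
        while buffer:
            head, _, rest = buffer.partition(b"\r\n\r\n")
            lines = head.decode("iso-8859-1").split("\r\n")
            headers = dict(line.split(": ", 1) for line in lines[1:])
            length = int(headers.get("Content-Length", 0))
            body, buffer = rest[:length], rest[length:]
            responses.append((lines[0], headers, body))
        return responses

    # Two responses returned out of order by a (hypothetical) server; the
    # client told it which request each belongs to via Request-Id.
    wire = (
        b"HTTP/1.1 200 OK\r\nRequest-Id: 2\r\nContent-Length: 5\r\n\r\nfast!"
        b"HTTP/1.1 200 OK\r\nRequest-Id: 1\r\nContent-Length: 5\r\n\r\nslow!"
    )

    pending = {"1": "/legacy/slow-point", "2": "/legacy/fast-point"}
    for status, headers, body in parse_responses(wire):
        print(pending.pop(headers["Request-Id"]), status, body)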

I've been toing and froing over this next point. That is, how do we identify that a particular response is not the end of a transaction? Initially I said that a 1xx series response code should be used, but that has its problems. For one, the current HTTP/1.1 standard says that 1xx series responses can't ever have bodies. That's maybe not the final nail in the coffin, but it doesn't help. The main reason I'm wavering from the point, though, is that a 1xx series response just isn't very informative.

Consider the case where a resource is temporarily 404 (Not Found), but the server is still able to notify a client when it comes back into existence as a 200 (OK). The subscription should be able to convey that kind of information. I've therefore decided to reverse my previous decision and make the subscription indicate its non-completeness through a header. This has some precedent in the "Connection: close" header, which an HTTP/1.1 server uses to indicate it won't hold the connection open for further requests.

Therefore, I would add something like the following:

14.xx Request

The Request general-header field allows the sender to specify options that are desired for a particular request over a specific connection and MUST NOT be communicated by proxies over further connections.

The Request header has the following grammar:

       Request = "Request" ":" 1#(request-token)
       request-token  = token

HTTP/1.1 proxies MUST parse the Request header field before a message is forwarded and, for each request-token in this field, remove any header field(s) from the message with the same name as the request-token. Request options are signaled by the presence of a request-token in the Request header field, not by any corresponding additional header field(s), since the additional header field may not be sent if there are no parameters associated with that request option.

Message headers listed in the Request header MUST NOT include end-to-end headers, such as Cache-Control.

HTTP/1.1 defines the "end" request option for the sender to signal that no further responses to the request will be sent after completion of the response. For example,

       Request: end

in either the request or response header fields indicates that the SUBSCRIBE request transaction is complete. It acts both as a means for a server to indicate SUBSCRIBE transaction completion and for a client to indicate a subscription is no longer required.

A system receiving an HTTP/1.0 (or lower-version) message that includes a Request header MUST, for each request-token in this field, remove and ignore any header field(s) from the message with the same name as the request-token. This protects against mistaken forwarding of such header fields by pre-HTTP/1.1 proxies.
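
Pulling the two headers together, a rough client loop for this scheme might look like the sketch below. The SUBSCRIBE method, the Request-Id header and the Request header are the proposals above rather than standard HTTP, and the server address and resource path are invented for the example.

    # A rough client loop for the proposal above: issue a SUBSCRIBE, then keep
    # reading responses on the same connection until one carries "Request: end".
    import socket

    def read_response(rfile):
        # Read one HTTP response framed by a Content-Length body.
        status = rfile.readline().decode("iso-8859-1").strip()
        headers = {}
        for line in iter(lambda: rfile.readline().decode("iso-8859-1"), "\r\n"):
            name, value = line.strip().split(": ", 1)
            headers[name] = value
        body = rfile.read(int(headers.get("Content-Length", "0")))
        return status, headers, body

    sock = socket.create_connection(("localhost", 8080))   # hypothetical server
    sock.sendall(b"SUBSCRIBE /pumps/7/pressure HTTP/1.1\r\n"
                 b"Host: localhost\r\n"
                 b"Request-Id: 1\r\n"
                 b"\r\n")
    rfile = sock.makefile("rb")
    while True:
        status, headers, body = read_response(rfile)
        print(status, body)                      # most recent state of the resource
        if "end" in headers.get("Request", ""):  # transaction complete
            break

A client or server that doesn't implement any of this can simply ignore the extra headers and fall back to plain request-response, which was one of the goals above.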

Benjamin

Sun, 2005-Mar-13

Where does RDF fit into broader architecture?

I've been pondering this question lately. RDF is undoubtedly novel and interesting. Its possibilities are cool... but where and how will it be used if it is ever to escape the lab in a real way? What are its use cases?

I like to think that sysadmins are fairly level-headed people who don't tend to get carried away by the hype, so when I read this post by David Jericho I did so with interest. David wants to use RDF with the Kowari engine as a back end. In particular, he wants to use it as a kind of data warehousing application that he can draw graphs from and add ad hoc data to as time goes on. As Andrew Newman points out in a late 2004 blog entry, RDF used in this way can be an agile database.

That doesn't seem to be RDF's main purpose. The extensive use of URIs gives it away pretty quickly that RDF is designed for the web, not the desktop or the database server. It's designed as essentially a hyperlinking system between web resources. RDF is not meant to actually contain anything more complicated than elementary data along with its named, reversible hyperlinks. RDF Schema takes this up a notch by stating equivalences and relationships between the predicate types (hyperlink types) so that more interesting relationships can be drawn up.

But how does this fit into the web? The web is a place of resources and hyperlinks already. The hyperlinks don't have a type. They always mean "look here for more detailed information". Where does RDF fit?

Is RDF meant to help users browse the web? Will a browser extract this information and build some kind of navigation toolbar to related resources? Wouldn't the site's author normally prefer to define that toolbar themselves? Will it help us to write web applications in a way that snippets of domain-specific XML wouldn't? Is it designed to help search engines infer knowledge about sites, knowledge that has no direct bearing on users' use of the site itself? One suspects that technology that doesn't directly benefit the user will never be used.

So let's look back at the database side. The RDF Primer says "The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web." So maybe it isn't really meant to be part of the web at all. The web seems to have gotten along fine without it so far. Perhaps it's just a database about the web's resources. Such databases might be useful to share around, and in that respect it is useful to serialise them and put them onto the web, but basically we're talking about RDF being consumed by data-mining tools rather than by the tools individuals use to browse the web. It's not about consuming a single RDF document and making sense of it. It's more about pulling a dozen RDF documents together and seeing what you can infer from the greater whole. Feed aggregators such as planet humbug might be the best example so far of how this might be useful to end users, although this use is also very application-specific and relies more on XML than RDF technology.

So, we're at the point where we understand that RDF is a good serialisation for databases about the web, or more specifically about resources. It's a good model for those things as well, or will be once the various competing query languages finally coalesce. It has its niche between the web and the regular database... but how good would it be at replacing traditional databases, as David wants to do?

You may be aware that I've been wanting to do this too. My vision isn't as grand in scope as "a database about the web". I just want "a database about the desktop", and "a database about my finances". What I would like to do is to put Evolution's calendar data alongside my account data, and alongside my share price data. Then I'd like to mine all three for useful information. I'd like to be able to record important events and deadlines from the accounting system into the Evolution data model. I'd like to be able to pull them both together and correlate a huge expense to that diary entry that says "rewire the house". RDF seems to be the ideal tool.

RDF will allow me to refer to web resources. Using relative URIs with an RDF/XML serialisation seems to allow me to refer to files on my desktop that are outside of the RDF document itself, although they might not be readily transferable. Using blank nodes or URIs underneath that of the RDF document's URI we can uniquely identify things like transactions and accounts (or perhaps it's better to externalise those)... but what about the backing store? Also, how do we identify the type of a resource when we're not working through HTTP to get the appropriate MIME type? How does the RDF data relate to other user data? How does this relate to REST, and is REST a concept applicable to desktop applications just as it is applicable to the web?
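
As a toy illustration of the finance-plus-calendar idea, the sketch below uses rdflib (Redland's Python bindings would do just as well). The vocabulary namespace, file URIs and property names are all invented for the example; they aren't part of any existing schema.

    # A toy model of the idea above. The vocabulary and file URIs are invented
    # purely for illustration.
    from rdflib import Graph, Literal, Namespace, URIRef

    DESK = Namespace("http://example.org/desktop-vocab#")
    g = Graph()

    event = URIRef("file:///home/me/calendar.ics#rewire-the-house")
    expense = URIRef("file:///home/me/accounts.rdf#txn-2005-02-14-0042")

    g.add((event, DESK.summary, Literal("rewire the house")))
    g.add((expense, DESK.amount, Literal("1850.00")))
    g.add((expense, DESK.relatesTo, event))   # the cross-file correlation

    # "Which diary entries explain this expense?" -- transcending file boundaries
    for _, _, related in g.triples((expense, DESK.relatesTo, None)):
        print(g.value(related, DESK.summary))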

In the desktop environment we have applications, some of which may or may not be running at various times. These applications manage files (resources) that are locally accessible to the user. The files can be backed up, copied, and moved around. That's what users do with files. The files themselves contain data. The data may be in the form of essentially opaque documents that only the original authoring application can read or write. They may be in a common shared format. They may even be in a format representable as RDF, and thus accessible to an RDF knowledge engine. Maybe that's not necessary, with applications like Beagle that seem to be able to support a reasonable level of knowledge management without any explicit need for data homogeneity. Instead, Beagle uses drivers to pull various files apart and throw them into its index. Beagle is focused on text searching, which is not really RDF's core concern... I wonder what will happen when those two worlds collide.

Anyway. I kind of see a desktop knowledge engine working the same way. Applications provide their data as files with well-known and easy-to-discern formats. A knowledge management engine has references to each of the files, and whenever it is asked for data pertaining to a file it first checks to see whether its index is up to date. If so, it uses the cached data. If not, it purges all data associated with the file and replaces it with the current file content. Alternatively, the knowledge manager becomes the primary store of that file and allows users to copy, back up, and modify the file in a similar way to that supported by the existing filesystem.
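
A skeleton of that freshness check might look something like the sketch below. The KnowledgeIndex class and the extractor callback are invented names, and a real engine would persist its index rather than hold it in memory.

    # A skeleton of the freshness check described above: cached statements for a
    # file are reused only while the file is unchanged, otherwise purged and
    # re-extracted. The extractor callback is a placeholder.
    import os

    class KnowledgeIndex:
        def __init__(self, extract):
            self.extract = extract          # file path -> list of statements
            self.cache = {}                 # path -> (mtime, statements)

        def statements_for(self, path):
            mtime = os.path.getmtime(path)
            cached = self.cache.get(path)
            if cached and cached[0] == mtime:
                return cached[1]            # index is up to date
            statements = self.extract(path) # purge and replace with current content
            self.cache[path] = (mtime, statements)
            return statements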

I think it does remain important for the data itself to be grouped into resources, and that it not be seen as outside the resources (files) model. Things just start to get very abstract when you have a pile of data pulled in from who-knows-where and are trying to infer knowledge from it. Which parts are reliable? Which are current? Does that "Your bank account balance is $xxx" statement refer to the current time, or is it old data? I think people understand files and can work with them. I think a file or collection paradigm is important. At the same time, I think it's important to be able to transcend file boundaries for query and possibly for update operations. After all, it's that transcending of file boundaries by defining a common data model that is really at the heart of what RDF is about.

<sigh>, I started to write about the sad state of triplestores available today. My comments were largely sniping from the position of not having an SQLite equivalent for RDF. I'm not sure that's true after some further reading. It's clear that Java has the most mature infrastructure around for dealing with RDF, but it also seems that the query languages still haven't been agreed on and that there are still a number of different ways people are thinking about RDF and its uses.

Perhaps I'll write more later, but for now I think I'll just leave a few links lying around to resources I came across in my browsing:

I've been writing a bit of commercial Java code lately (in fact, my first line of Java was written about a month ago now). That's the reason I'm looking at Kowari again. I'm still not satisfied with the closed-source nature of the underlying Java technology. Kaffe doesn't yet run Kowari, as it seems to be missing relevant nio features, and can't even run the iTQL command line given that it is missing Swing components. I don't really want to work with something like Kowari until that is ironed out, but if I'm ever going to get back to writing my accounting app the first step will be to throw the current prototype out and start again with a more RDF-oriented back end. I'm concerned about the network-facing (rather than desktop-facing) nature of Kowari and am still not convinced that it will be appropriate for what I want. I would prefer a set of libraries that allow me to talk to the file system, rather than a set that allows me to talk to a server. Opening network sockets to talk to your own applications on a multi-user machine is asking for trouble, and not just security trouble.

Given what I've heard about Kowari, though, I'm willing to keep looking at it for a while longer. I've even installed the non-free Java SDK on my Debian PC for the purpose of research into this area. If the non-free status of the supporting infrastructure can be cleaned up, perhaps Kowari could still do something for me in my quest to build desktop knowledge infrastructure. On the other hand, I may still just head back to Python and Redland with an SQLite back end.

Benjamin

Sat, 2005-Mar-05

A RESTful subscription specification

Further to my previous entry on the subject, this blog entry documents a first cut at how I would update RFC 2616 to support RESTful subscription. This is a quick hack update, not a thorough working-through of the original document. A rough sketch of the client side follows the numbered changes below.

The summary: add SUBSCRIBE and UNSUBSCRIBE methods, 102 (SUBSCRIBE update) and 103 (SUBSCRIBE cancelled) response codes, and an Updates-Missed header to HTTP/1.1.

The detail:

  1. Add SUBSCRIBE and UNSUBSCRIBE methods to section 5.1.1
  2. Add SUBSCRIBE and UNSUBSCRIBE methods to the list of "Safe Methods" in 9.1.1
  3. Add section "9.10 SUBSCRIBE" with the following text:

    The SUBSCRIBE method means retrieve and subscribe to whatever information (in the form of an entity) is identified by the Request-URI. If the Request-URI refers to a data-producing process, it is the produced data which shall be returned as the entity in the response and not the source text of the process, unless that text happens to be the output of the process.

    A response to SUBSCRIBE SHOULD match the semantics of GET. In addition to the GET semantics, a successful subscription MUST establish a valid subscription. The subscription remains valid until an UNSUBSCRIBE request matching the successful SUBSCRIBE URL is successfully made, until the server returns a 103 (SUBSCRIBE cancelled) response, or until the connection is terminated. A server with a valid subscription SHOULD return changes to the URL content immediately using a 102 (SUBSCRIBE update) response, but MAY delay responses according to flow control or server-side decisions about the priority of subscription updates as compared to regular response messages. Whenever a 102 (SUBSCRIBE update) response is returned it SHOULD represent the most recent URL data state. Data MAY be returned as a difference between the current and previously-returned URL state if client and server can agree to do this out of band. An Updates-Missed header MAY be returned to indicate the number of undelivered subscription updates.

    A SUBSCRIBE request made to a URL for which a subscription is already valid SHOULD match the semantics of GET, but MUST NOT establish a new valid subscription.

    The response to a SUBSCRIBE request is cacheable if and only if the subscription is still valid. Updates to the subscription MUST either update the cache entry or cause the client to treat the cache entry as stale.

  4. Add section "9.11 UNSUBSCRIBE" with the following text:

    The UNSUBSCRIBE method means cancel a valid subscription. A server MUST set the state of the selected subscription to invalid. A client MUST either continue to process 102 (SUBSCRIBE update) responses for the URL as per a valid subscription, or ignore 102 (SUBSCRIBE update) responses. A successful unsubscription (one that marks a valid subscription invalid) SHOULD return a 200 (OK) response.

  5. Add section "10.1.3 102 SUBSCRIBE update" with the following text:

    A valid subscription existed at the time this response was generated on the server side, and the resource identified by the subscription URL may have a new value. The new value is returned as part of this response.

    This response should not be assumed to be associated with an in-sequence request, and may be returned when no request is outstanding.

  6. Add section "10.1.4 103 SUBSCRIBE cancelled" with the following text:

    A valid subscription existed at the time this response was generated on the server side, but the server is no longer able or willing to maintain the subscription. The subscription MUST be marked invalid on the client side.

    This response should not be assumed to be associated with an in-sequence request, and may be returned when no request is outstanding.

  7. Add section "14.48 Updates-Missed" with the following text:

    The Updates-Missed header MAY be included in 102 (SUBSCRIBE update) response messages. If included, it MUST contain a numeric count of the missed updates.

           Updates-Missed = "Updates-Missed" ":" 1*DIGIT
    

    An example is

           Updates-Missed: 34
    

    A response carrying this header should not be assumed to be associated with an in-sequence request, and may be returned when no request is outstanding.
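
As promised, here is a rough sketch of the client-side state handling these additions imply. It deliberately leaves out connection handling and response parsing, and it assumes the client can tell which subscription URL a given response belongs to (something the text above leaves to the connection, or to a mechanism like the Request-Id idea in the 2005-Mar-27 entry above); the names are invented for illustration.

    # A sketch of client-side subscription state, driven by each parsed response.
    # How responses are framed and matched to their subscription URL is assumed
    # to be handled elsewhere.
    subscriptions = {}          # url -> latest entity body for valid subscriptions

    def handle_response(url, status_code, headers, body):
        if status_code == 102:                         # SUBSCRIBE update
            missed = int(headers.get("Updates-Missed", 0))
            if missed:
                print("%d updates were dropped for %s" % (missed, url))
            subscriptions[url] = body                  # most recent URL data state
        elif status_code == 103:                       # SUBSCRIBE cancelled
            subscriptions.pop(url, None)               # subscription is now invalid
        elif status_code == 200:
            # 200 here is assumed to be the initial response to a SUBSCRIBE;
            # responses to other requests would be routed elsewhere.
            subscriptions[url] = body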

I guess the question to ask is whether or not subscription is a compatible concept with REST. I say it is. We're still using URLs and resources. We still have a limited number of request commands that can be used for a wide variety of media types. Essentially, all I'm doing is making a more efficient meta-refresh tag part of the protocol. It's part of what makes some of the proprietary protocols I've used in the past efficient and scalable. It's particularly something that helps servers deal gracefully with killer load conditions. You have a fixed number of clients making requests down a fixed number of TCP/IP connections. When the rate of requests goes up and the server starts to hurt, it simply increases the response time. No extra work is put on the system in times of stress. The server works as fast as it can, and any missed updates simply get recorded as such.

In a polling-based system things tend to degrade. You don't know whether the new TCP/IP connection is an old client or a new one, so you can't decide how to allocate resources between your clients. Even if they are using persistent connections, they keep asking you for changes when they haven't happened yet! If we're to see a highly-dynamic web I'm of the opinion that subscription needs to become part of it at some point.

Why not head for the WS-* stack? Well, I think the question answers itself... but in particular the separation between a client's request and the connection it's using to maintain that subscription makes it hard to assess whether the subscription is still valid. When it's not clear whether a subscription is up or down, time is wasted on both sides. My approach is simple, and the broad conceptual framework has been proven in-house (although it hasn't been applied to HTTP in-house just yet).

On another note, I was surprised to see the lack of high-performance HTTP client interfaces available in Java. I was hoping to be able to make use of request pipelining to improve throughput where I'm gathering little pieces of data from a wide variety of URL sources on a display. There's just very little around that doesn't require a separate TCP/IP connection for each request, and usually a separate thread as well. When you're talking about a thousand requests on a display, and possibly dozens of HMIs that want to do this simultaneously... well, the server starts to look sick pretty quickly...
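
For what it's worth, the technique itself is simple enough to sketch; here it is in Python rather than Java, purely for brevity. The host and paths are placeholders, and the sketch assumes the server keeps the connection open and frames every response with Content-Length.

    # A bare-bones illustration of request pipelining: all requests go out on one
    # connection before any response is read, so total time is closer to one
    # round trip than to len(paths) round trips.
    import socket

    def pipelined_get(host, paths, port=80):
        sock = socket.create_connection((host, port))
        request = "".join(
            "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n" % (path, host) for path in paths
        )
        sock.sendall(request.encode("iso-8859-1"))
        rfile = sock.makefile("rb")
        bodies = []
        for _ in paths:                      # responses come back in request order
            headers = b""
            while not headers.endswith(b"\r\n\r\n"):
                headers += rfile.read(1)
            length = 0
            for line in headers.decode("iso-8859-1").split("\r\n"):
                if line.lower().startswith("content-length:"):
                    length = int(line.split(":", 1)[1])
            bodies.append(rfile.read(length))
        sock.close()
        return bodies

    # e.g. pipelined_get("intranet.example", ["/point/%d" % i for i in range(3)])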

Benjamin