Sound advice - blog

Tales from the homeworld

Sat, 2005-Apr-30

On URI Parsing

Whenever you find you have to write code that parses a URI, DON'T start with the BNF grammar in appendix A of the RFC. A character-by-character recursive descent parser will not work. If you are silly enough to keep pushing the barrow and translate the whole thing into a yacc grammar, your computer will explain why. URIs are not LALR(1), or LALR(k) for that matter. They would be if you could tokenise the pieces before feeding them into the parser generator, but you can't, because the same sequence of characters (e.g. "foo") could appear as easily in the scheme as in the authority part.

Instead, skip to appendix B and take the regular expression as your starting point. If you'd been smart enough to look at it closely the first time you would have realised you need to apply some kind of greedy consumption of characters to it. Start at the beginning, and look for a colon. Next, look for a slash, question-mark or hash. Finally look for hashes and question-marks to delimit the final pieces. Much easier, and no need to waste a whole day on it like I did! :)
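As a minimal sketch of that starting point, here is the appendix B regular expression wrapped in a small Python function. The regular expression itself is the one printed in the RFC; the function name split_uri and the dictionary keys are my own choices for illustration, not part of any library.

```python
import re

# Regular expression from appendix B of the URI RFC.
# Groups 2, 4, 5, 7 and 9 are scheme, authority, path, query and fragment.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five generic components.

    Absent components come back as None; the path is always a string
    (possibly empty). The name split_uri is hypothetical.
    """
    # Every part of the pattern is optional, so it matches any string.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

Because each character class excludes the delimiters that end its component, the expression does the "look for a colon, then a slash, question-mark or hash" scan in one pass, and relative references such as ../foo?x=1 simply come back with no scheme or authority.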

I actually spent three days all up. The first was churning parsing techniques. The second was churning design and API. The third was cleaning up and adding support for things like relative URIs. Phew, and minimal documentation update required.

I put my library together as part of RDF handling/generating work I've been assembling. My hope is that by putting my data in an RDF-compatible XML form I can increase its utility to other applications that might cache and aggregate results of the program I'm writing. Working this long on URIs specifically, though, has firmed an opinion I have.

A URI is generally defined as scheme:scheme-specific-part. The scheme determines how the namespace under the scheme is laid out, and has implications on the kinds of things you might be able to do with the resource. A "generic" URI is defined as scheme://authority/path?query#fragment (fragment is actually a bit special). If the generic URI is IP-based, then the authority section looks like userinfo@host:port. A http URI implies an authority (the DNS name and port), a path, and an optional query. When you give someone a http URI you are telling them that you own the right to use that DNS name and port, and to structure the paths and queries beneath that DNS name. Up until http was used for non-URL URNs, you were also giving them a reasonable surety they could put the URI into a web browser to find out more.
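To make the userinfo@host:port layout concrete, here is a rough sketch of splitting an IP-based authority into its parts. The function name split_authority is hypothetical, and the sketch deliberately ignores bracketed IPv6 literals.

```python
def split_authority(authority):
    """Split 'userinfo@host:port' into (userinfo, host, port).

    Missing parts are returned as None. Bracketed IPv6 literals are
    not handled by this sketch.
    """
    # Anything before the last "@" is userinfo; without "@" there is none.
    userinfo, sep, hostport = authority.rpartition("@")
    if not sep:
        userinfo = None
    # The port, if present, follows the first ":" after the host.
    host, sep, port = hostport.partition(":")
    return userinfo, host, (int(port) if sep and port else None)
```

For example, split_authority("me@ftp.example.org:2121") yields the userinfo "me", the host "ftp.example.org" and the port 2121 (the host name here is made up for the example).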

http is obviously -not- the only scheme. Ftp is almost identical. It is IP-based and generic also, so uses the same DNS (or IP) authority scheme. It is perhaps more likely to include a userinfo part to its authority than that of http. When you give someone an FTP URI they can almost certainly plug it into a web browser, and what they get back will depend on the path, the DNS name, and the userinfo (they may be looking at files relative to userinfo's home area). You could define any number of other schemes. You could define a mysql scheme which identified a database in the same way and used (escaped) SQL in the query part. You could define your own scheme to break away from DNS to utilise the technically much better ENS (Eric's Naming Service? :)).

As far as I'm concerned, non-dereferenceable URNs should follow the same pattern as any IP-based generic URI. That's why the generic URI idea exists in the RFC, so we don't have to reinvent the wheel every time we come up with a new hierarchical namespace. We especially don't have to reinvent the wheel whenever we come up with a new IP-based hierarchical namespace, but we should use different schemes when we're implying different resources. When I give someone a URN that I know is not a URL (it doesn't point to anything) I should be flagging that to them with an appropriate scheme identifier. When I give them a "urn" URI with an authority component I want them to know that I am identifying something, I do own the DNS part and the right to create paths within that DNS namespace and scheme, and am also transferring my knowledge that there's nothing at the other end. That's useful information, particularly if they're some kind of automated crawler bot or for that matter some kind of web browser. The crawler bot would know not to look up the URI as if it were a http URL, and the web browser would know to explain the problem to their user or to start searching for information "about" the URI instead of "at" the URI.

I think that RDF URIs that aren't URLs but don't tell anyone that they aren't URLs are broken and wrong. I really don't see the point of aliasing the different concepts together in such a confusing and misleading way. Is it for future expansion? "Maybe I'll want to put something at the URI some day". It doesn't sound very convincing to me. If you really want to do that, you're changing its meaning slightly anyway, and you can always tell everyone that the new URI is owl:sameAs the old one. I really don't understand it, I'm afraid.


Tue, 2005-Apr-12

A considered approach to REST with RDF content

I wrote recently about beginning to implement a REST infrastructure, and had some thoughts at the time about using RDF as content to enable browsing and crawling as part of the model. This is a quick note to tell you what I've come up with so far.


URIs in RDF files can be dereferenced by default, and you will usually find RDF content at the end point. Other content can also be stored, and you can tell the difference by looking at the returned Content-Type header.


GET to a URI returns an RDF document describing all outgoing arcs from the URI. The limits of the returned graph are either the first named node encountered, or the first cycle. The graph's root is "about" itself. A simple object with non-repeating properties is represented something like:

GET /some-rdf HTTP/1.1

<MyClass rdf:about="">
<MyClass.myReference rdf:reference=""/>

From this input it is possible to crawl over an entire RDF data model one object at a time. The model itself may refer to external resources which can be crawled or left uncrawled. From this perspective, the RDF model becomes an infinitely-extensible and linkable backbone of the web and can replace HTML in this role.
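The "first named node or first cycle" limit on a GET response can be sketched roughly as follows. This is only an illustration under my own assumptions: triples are plain (subject, predicate, object) string tuples, anonymous nodes are written with a "_:" prefix, and the function name describe is hypothetical.

```python
def describe(triples, root):
    """Return the arcs reachable from root, stopping at named nodes and cycles.

    triples is an iterable of (subject, predicate, object) strings.
    Anonymous nodes are assumed to carry a "_:" prefix.
    """
    triples = list(triples)
    result, seen, pending = [], {root}, [root]
    while pending:
        subject = pending.pop()
        for s, p, o in triples:
            if s != subject:
                continue
            result.append((s, p, o))
            # Only anonymous nodes are expanded. A named node ends the walk
            # (the crawler can GET it separately), and a node that was
            # already seen marks a cycle.
            if o.startswith("_:") and o not in seen:
                seen.add(o)
                pending.append(o)
    return result
```

A crawler would then issue a fresh GET for each named node that appears as an object in the result, which is what keeps each response small.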


A PUT should seek to set the content of a URI without respect to prior state. In this case, I believe this should mean its semantics should attempt to create the subject URI. It should remove all existing assertions with this URI as the subject, and it should establish a new set of assertions in their place:

PUT /some-rdf HTTP/1.1

<MyClass rdf:about="">
<MyClass.myReference rdf:reference=""/>

If PUT was successful, subsequent GET operations should return the same logical RDF graph as was PUT.


While PUT should seek to set the content without respect to prior state, POST should seek to supplement prior state. To this end, it should perform a merge with the existing URI graph. The definition of this merge should be left to the server side, but in general all arcs in the POST graph should be preserved in the merged graph. Where it is not valid to supply multiple values for a property, or where conflict exists between new and existing arcs, the new arcs should take precedence.

POST /some-rdf HTTP/1.1

<MyClass rdf:about="">


DELETE should seek to remove all outgoing arcs of the URI, possibly erasing server knowledge of the URI as a side-effect. Incoming arcs should not automatically be affected, though... except as the server chooses to enforce referential integrity constraints.
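Putting the four verbs side by side, here is a small in-memory sketch of the semantics described above. It is not a real server: the class name ArcStore, the single_valued policy and the dictionary layout are all assumptions made for illustration, and the merge rule is just one reading of "new arcs take precedence".

```python
class ArcStore:
    """Outgoing arcs per subject URI: {subject: {predicate: set(objects)}}."""

    def __init__(self, single_valued=()):
        self.arcs = {}
        # Predicates that may hold only one value (hypothetical server policy).
        self.single_valued = set(single_valued)

    def get(self, subject):
        # Return a copy of the subject's outgoing arcs.
        return {p: set(objs) for p, objs in self.arcs.get(subject, {}).items()}

    def put(self, subject, graph):
        # Replace: drop every existing assertion about the subject.
        self.arcs[subject] = {p: set(objs) for p, objs in graph.items()}

    def post(self, subject, graph):
        # Merge: keep existing arcs and add new ones; new arcs win on conflict.
        current = self.arcs.setdefault(subject, {})
        for p, objs in graph.items():
            if p in self.single_valued:
                current[p] = set(objs)
            else:
                current.setdefault(p, set()).update(objs)

    def delete(self, subject):
        # Remove outgoing arcs only; arcs pointing at the subject are untouched.
        self.arcs.pop(subject, None)
```

After a put, get returns exactly the graph that was put; after a post, it returns the union, with single-valued predicates overwritten by the posted value.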

Where we are now

I believe this gives us a fairly simple mechanism for uniformly exposing object properties such as the properties of a Java bean. Beans can then behave as they consider appropriate in response to changes. They themselves may propagate changes internally or may make further http requests. Internally, there should probably be an abstraction layer in place that makes the difference invisible (by hiding the fact that some objects are internal, rather than hiding the fact that some are external). Under this model objects can interact freely with each other across platforms and across language barriers.

Now, the big question is whether or not I actually use it!

So far I've been using an XAML encoding for object properties. It looks something like this:

<MyClass xmlns="" attr1="foo" attr2="bar">
<MyClass.myWhitespacePreservingProperty> baz </MyClass.myWhitespacePreservingProperty>

With RDF it would look like this:

<MyClass rdf:about="" xmlns="">
<MyClass.myWhitespacePreservingProperty> baz </MyClass.myWhitespacePreservingProperty>

or this:

<ex:MyClass rdf:about="" xmlns:ex="" ex:MyClass.attr1="foo" ex:MyClass.attr2="bar">
<ex:MyClass.myWhitespacePreservingProperty> baz </ex:MyClass.myWhitespacePreservingProperty>

I have two problems with the RDF representation. The first is simple aesthetics. I like my XAML-based notation with object attributes in XML attributes. It's simple and easy to read, and I don't even have to refer to an external RDF namespace. The fact that XML attributes don't fall into the default namespace of their enclosing elements does not help things, meaning each must be prefixed explicitly. The second is transformation. Throwing all of those namespaces into attributes means I have to seed my XPath engine with QName prefixes like "ex:". I'd really rather deal with no namespaces, or just one. Internally I'm actually merging this information with a DOM that contains data from many other sources as well. Some are gathered over the network. Some are from user input. In merging these document fragments together I want to avoid removing existing elements (processing may be triggered from their events), and so my merge algorithm attempts to accumulate attributes for a single conceptual object together to be applied to beans or other objects.

Hrrm... I'm not making a very strong argument. I'd just feel happier if the XML rdf encoding treated @attr1 as equivalent to the sub-element ClassName.attr1 in the parent node's namespace. That would fit better to object encoding. Leave the references to external namespaces to explicit child elements and let the attributes stay simple.

Oh, well...


Sat, 2005-Apr-09

What do you store in your REST URIs?

I have been tinkering away on my HTTP-related work project, and have a second draft together of an interface to a process starting and monitoring application that we built and use internally. Each process has a name, and that is simple to match to a URI which contains read-only information about its state.

You can picture it in a URI like this: http://my.server/processes/myprocess, returning the XML equivalent of "the process is running, it is currently the main in a redundant pair of processes". It actually gets a little more complicated than that, with a single HTTP access point able to return the status of various processes across various hosts, to report on the status of the hosts themselves, and also to arrange the processes in various other ways that are useful to us and make statements about those collections.

My next experimental step was to allow enabling and disabling of processes by adding a */enabled uri for each process. When enabled it would return text/plain "true". When disabled it would return text/plain "false". A PUT operation would change this state and cause the process to be started or stopped. I was hoping I'd be able to access this via a HTML form, but urg... no luck there. I had to add a POST method to the process itself with an "enabled=true" uri-encoding. Not nice, but together they're workable for now.
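The */enabled resource boils down to something like the following sketch. It is a plain function-call model rather than a real HTTP server, and the class name ProcessTable, the handle signature and the status codes chosen are my assumptions for illustration; actually starting and stopping processes is left out.

```python
class ProcessTable:
    """Maps process names to an enabled flag (a stand-in for real process control)."""

    def __init__(self, names):
        self.enabled = {name: False for name in names}

    def handle(self, method, path, body=""):
        """Return (status, content_type, body) for a request on */enabled."""
        parts = path.strip("/").split("/")
        if len(parts) != 3 or parts[0] != "processes" or parts[2] != "enabled":
            return 404, "text/plain", "not found"
        name = parts[1]
        if name not in self.enabled:
            return 404, "text/plain", "not found"
        if method == "GET":
            return 200, "text/plain", "true" if self.enabled[name] else "false"
        if method == "PUT":
            value = body.strip()
            if value not in ("true", "false"):
                return 400, "text/plain", "expected true or false"
            # A real implementation would start or stop the process here.
            self.enabled[name] = value == "true"
            return 200, "text/plain", value
        return 405, "text/plain", "method not allowed"
```

A PUT of "true" to /processes/myprocess/enabled followed by a GET of the same path returns the text/plain body "true".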

Now we're at the point where I ask the question: How do I find and represent the list of processes? I ask, "How do I navigate to the Host URI associated with this process?". I ask, "How do I know what to append to find the enabled URI?".

I have been returning pretty basic stuff. If my data is effectively a string, (ie, a datum encoded using XSD data type rules) I've been returning a text/plain body containing that data. If the data is more complicated, and needs an XML representation I've been returning application/xml with the XML document as the body. In my HMI application I typically map those strings and XML elements onto the properties of java beans, or onto my own constructions that map their data eventually onto beans. The expected data format is therefore pretty well known to my application and doesn't need much in the way of explicit schema declaration. The URIs are also explicitly encoded into the XML page definitions that go into the HMI. If I start to look outside the box, though, particularly to debugging- or browsing- style applications that might exist in the future I want to be able to find my data.

As I was working through the problem, I started to understand for the first time where the XLink designers were coming from. My fingers were aching to just type something like

<Processes type="linkset"><a href="foo"/><a href="bar"/></Processes>

and be done with it. XLink is dead, though, and apparently with good reason... so it starts to look like RDF is the "right way to do it".

Rethinking the REST web service model in terms of RDF is an interesting approach, and one I feel could work fairly nicely. I'm still thinking in terms of properties. If I had an object of class foo with property bar, then I could write the following fairly easily:

<foo rdf:about=".">

That's almost identical to the verbose form of the XML structures I'm using right now (I would currently put myvalue into foo/@bar to reduce the verbosity). In this way, the content of each URI would be the rdf about that URI. If this were backed by a triplestore, you might simply list all relationships that the URI has directly in this response body.

It seems simple to produce an rdf-compatible hyperlinking solution for the GET side of things, so what about the PUT?

On first glance this looks simple, but in fact the PUT now needs more data than it did previously. What I really want to do is to PUT an enabled assertion between my process and the literal "false". What do I put, exactly? Perhaps something like this:

PUT /processes/myprocess/ HTTP/1.1
Host: my.server


You can see the difficulty. I need to encode the URI of the subject (http://my.server/processes/myProcess) and the URI of the predicate ( Finally I need to encode the object, clearly identified as a URI, a literal, or a new RDF instance with its own set of properties.

Another thing you need to do is work out the semantics of the PUT operation as well as the POST operation. In the truest HTTP sense it is probably sensible for PUT to attempt to overwrite any existing assertions on the object with the same predicate, while POST would seek to accumulate a set of assertions by adding to rather than overwriting earlier statements.

There is another question unanswered in all of this. If I have a piece of RDF relating to a specific URI, what do I have to do to get more information about it? Sometimes you'll be able to dereference the URI and find more RDF. Sometimes you'll get a web page or some other resource, and if you're lucky you'll find rdf at a ".rdf"-extension variant of the filename. Sometimes you'll find nothing at the link. Shouldn't these options be formalised somewhere? I don't think it's possible to write an rdf "crawler", otherwise... or the source RDF document must point to both the target rdf and the related object. In other questions arising from this line of thought, "Is there a standard way to update RDF in a REST way? If so, does it work from a web browser with simple web forms?"
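The informal lookup order described above (try the URI itself, then a ".rdf" variant, then give up) could be sketched like this. Nothing here is a standard: the function name find_rdf and the caller-supplied fetch callback are hypothetical, and the only outside fact assumed is that RDF/XML is served as application/rdf+xml.

```python
def find_rdf(uri, fetch):
    """Try to locate RDF for uri; return (location, body) or None.

    fetch(url) is supplied by the caller and returns (content_type, body),
    or None when nothing is found at the URL.
    """
    for candidate in (uri, uri + ".rdf"):
        response = fetch(candidate)
        if response is None:
            continue
        content_type, body = response
        # Ignore any parameters such as "; charset=utf-8".
        if content_type.split(";")[0].strip() == "application/rdf+xml":
            return candidate, body
    return None
```

Passing the fetch function in keeps the sketch independent of any particular HTTP library; a crawler would plug in its own.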

The web browser is becoming my benchmark of how complicated a thing is allowed to be. If you can't generate that PUT from a web form, maybe you're overthinking the problem. If you can't browse what you've created happily in Mozilla, perhaps you need to simplify it.



Sun, 2005-Apr-03

Wouldn't it be nice to have one architecture?

We seem to be evolving towards a more open model of interaction between software components. Client and server are probably speaking HTTP to each other to exchange data in the form of XML, or some other well-understood content. Under the REST architectural style the server is filled with objects with well-known names as URIs. Clients GET, and sometimes PUT but will probably POST to the server.

The URI determines five things. The protocol looks something like http://. The host and port number look something like, or maybe The object name looks like /foo/bar/baz, and the query part looks like ?myquery. That's fine for http over a network with a well-known host and port name. I think it might fall down a little in more tightly-coupled environments.

Let's take it down a notch from something you talk to over the internet to something on the local LAN. A single server might offer multiple services. Perhaps it could provide not just regular web data to clients but also information about system status, such as the state of various running processes. Perhaps it has an application that provides access to time data to replace the old unix time service. Perhaps it has an application to provide a quote of the day, or a REST version of SMTP. The server is left with an unpleasant set of options. It can let the programs run independently, each opening distinct ports in the classic UNIX style. The client must then know the port numbers, and needs to negotiate them out of band (IANA has helped us do this in the past). If that's no good, and you want to have a single port open to all of these applications, you start to introduce coupling. You either operate like inetd or a cgi script and exec the relevant process after opening the connection, or you make all of your processes literally part of the one master process using servlets.

Not so bad, you say. There are still options there, even if the traditional web approach and the traditional unix approaches differ. You can even argue that they don't differ and that unix only ever intended to open different ports when different protocols are in use. We've now agreed on a simple standard protocol that everyone can understand the gist of, even if you need to know the XML format being transported intimately to actually extract useful data out of the exchange.

In a way, the REST concept introduces many more protocols than we are used to dealing with. Like other advances in architecture development it takes out the icky bits and says: "Right! This is how we'll do message exchange". It then leaves the content of the messages down to individual implementors of problem domains to work out for sure. It builds an ecosystem with a common foundation rather than trying to design a city all in one go.

Anyway, back to the options. When you have multiple applications within a single server the uncoupled options look grim. How do I let my client know that the service they want is available on port 8081? DNS allows me to map to an IP address, but does not cover the resolution of port identifiers. That's left to explicit client-side knowledge, so a client can only reasonably query if we have previously agreed that dict should appear in their /etc/services file. It's much more likely that we can agree to a URI of on the standard HTTP port of 80.

This leaves us with the options of either having an inetd-equivalent process starting a new process for each connection made to the server, or making the application a servlet. The first option is unsatisfying because it doesn't allow a single long-running program to answer the requests, and we need to introduce other interprocess communication mechanisms such as shared memory if forked instances of the same process want to share or distribute processing. You can see this conflict in an application like Samba. You get a choice between executing via inetd for simplicity and ease of administration or executing as standalone processes for improved performance. The second option is to me fairly unsatisfying because it introduces coupling between otherwise unrelated applications. In fact, there's a third option. You could have the server process answer the queries by itself querying back-end applications in weird and wonderful ways. That approach is limited because the server may become both a bottleneck and a single point of failure. When all of the data in your system flows through a single process... well... you get the point.

You can see where I'm headed. If I'm uncomfortable with how you would offer a range of different services in a small LAN scenario, imagine my disquiet over how applications should talk to each other within the desktop environment!

I think the REST architecture remains sound. You really want to be able to identify objects, some of which may be applications... others of which may represent your files or other data. You want to be able to send a request that reads something like local://mydesktop/mytaxreturn.rdf?respondwith=taxable-income. There's some sensitive data in this space, so you may feel as I do that opening network sockets to access it is a bit of a worry. Even opening a port on may allow other users of the current machine to access your data. A unix named pipe might work, but may not be portable outside of the unix sphere and may be hard to specify in the URL. After all, how do you say "speak http to file:///home/me/desktopdata, and request the tax return uri you find there"? You also start running into the set of options for serving your data that you had with the small LAN server. How do you decouple all of the services behind the access-point name in your URI?

So, let's start again and try to abstract that REST architecture. To me it appears decomposable into the following applications:

  1. A client with a request verb and a URI including protocol, access point, and object identifier
  2. An access point broker that can interpret the access point specification and return a file descriptor
  3. A server with a matching URI

It seems that DNS is a fine access point broker for web servers that all live on the same port. An additional mechanism might still be useful for discovering the port number to connect to by name when multiple uncoupled services are on offer. A new access point broker would be needed for the desktop. A new URI encoding scheme might be avoidable if the access broker is able to associate a particular named pipe with a simpler name such as "desktop:kowari", making a whole address look like http://desktop:kowari/mydatabase. Clients would need to be updated to talk to the appropriate access point provider, which I suggest would have to be provided through a shared library like the one we currently use with DNS. Servers would need to open named pipes instead of network sockets, and may need additional protocol to ensure one file descriptor is created per local "connection".
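To show what such a broker might look like to a client, here is a rough sketch. It is purely illustrative: the class name AccessPointBroker, the registry of local names and the default port handling are my own assumptions, and I have swapped in a Unix domain socket where the text talks about a named pipe, since that is what Python's socket module can connect to directly.

```python
import socket


class AccessPointBroker:
    """Resolves an access point name to something a client can connect to."""

    def __init__(self, local_names=None):
        # Hypothetical registry mapping desktop names to local socket paths.
        self.local_names = dict(local_names or {})

    def resolve(self, access_point, default_port=80):
        """Return ("unix", path) for a registered local name, else ("tcp", (host, port))."""
        if access_point in self.local_names:
            return "unix", self.local_names[access_point]
        # Anything else is treated as an ordinary host or host:port pair,
        # with name-to-address lookup left to DNS at connect time.
        host, sep, port = access_point.partition(":")
        return "tcp", (host, int(port) if sep else default_port)

    def connect(self, access_point):
        """Open a stream socket to the resolved access point.

        Local names need a platform that provides AF_UNIX.
        """
        kind, address = self.resolve(access_point)
        family = socket.AF_UNIX if kind == "unix" else socket.AF_INET
        sock = socket.socket(family, socket.SOCK_STREAM)
        sock.connect(address)
        return sock
```

With a registry entry for "desktop:kowari", a client handed http://desktop:kowari/mydatabase would speak HTTP over the local socket, while http://my.server:8081/ would fall through to an ordinary TCP connection.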

The definition of the access point is interesting in and of itself. What happens when access point data changes? Can that information be propagated to clients so they know to talk to the new location rather than the old? Can you run the same service redundantly so that when one fails the information of the second instance is propagated and clients fail over without spending more than a few seconds thinking about it?

REST is an interesting model for coordinating application interactions. It seems to work well in the loosely-coupled large scale environments it was developed for. I'd like to see it work on the smaller scale just as well, and to see the difference made transparent to both client and server.


P.S. Is it just me, or is there no difference between SOAP over HTTP and REST POST? In fact, it seems to me that an ideal format for messages to and from the POST request could be SOAP. Am I missing some things about REST? I think I understand the GET side fine, but the POST I'm really not sure about...