Sound advice - blog

Tales from the homeworld

Sat, 2005-Apr-30

On URI Parsing

Whenever you find you have to write code that parses a URI, DON'T start with the BNF grammar in appendix A of the RFC. A character-by-character recursive descent parser will not work. If you are silly enough to keep pushing the barrow and translate the whole thing into a yacc grammar, your computer will explain why: URIs are not LALR(1), or LALR(k) for that matter. They would be if you could tokenise the pieces before feeding them into the parser generator, but you can't, because the same sequence of characters (e.g. "foo") could appear as easily in the scheme as in the authority part.

Instead, skip to appendix B and take the regular expression as your starting point. If you'd been smart enough to look at it closely the first time, you would have realised that what a URI needs is some kind of greedy consumption of characters up to the next delimiter. Start at the beginning and look for a colon, which ends the scheme. Next, look for a slash, question mark or hash, which ends the authority. Finally, look for hashes and question marks to delimit the final pieces: the path, query and fragment. Much easier, and no need to waste a whole day on it like I did! :)
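
For illustration only, here is a minimal sketch of that approach in Python. The regular expression is the one printed in appendix B of the RFC; the function name split_uri and the dictionary it returns are my own assumptions for the example, not the API of the library described in this post.

```python
import re

# The regular expression from appendix B of the URI RFC, copied unchanged.
# Every component is optional, so the pattern matches any input string.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five generic components.

    split_uri is a hypothetical helper name. A component that is absent
    comes back as None; the path is always a string, possibly empty.
    """
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),     # text before the first ":"
        "authority": m.group(4),  # text after "//", up to "/", "?" or "#"
        "path": m.group(5),       # everything up to "?" or "#"
        "query": m.group(7),      # text after "?", up to "#"
        "fragment": m.group(9),   # text after "#"
    }
```

With this sketch, split_uri("http://example.org/a/b?x=1#top") yields the scheme "http", the authority "example.org", the path "/a/b", the query "x=1" and the fragment "top", while a bare "foo" comes back as a path with no scheme or authority, which is exactly the ambiguity that defeats a tokeniser.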

I actually spent three days all up. The first went on churning through parsing techniques. The second went on churning through design and API. The third was cleaning up and adding support for things like relative URIs. Phew, and only a minimal documentation update required.

I put my library together as part of the RDF handling and generating work I've been assembling. My hope is that by putting my data in an RDF-compatible XML form I can increase its utility to other applications that might cache and aggregate results of the program I'm writing. Working this long on URIs specifically, though, has firmed up an opinion of mine.

A URI is generally defined as scheme:scheme-specific-part. The scheme determines how the namespace under the scheme is laid out, and has implications for the kinds of things you might be able to do with the resource. A "generic" URI is defined as scheme://authority/path?query#fragment (fragment is actually a bit special). If the generic URI is IP-based, then the authority section looks like userinfo@host:port. An http URI implies an authority (the DNS name and port), a path, and an optional query. When you give someone an http URI you are telling them that you own the right to use that DNS name and port, and to structure the paths and queries beneath that DNS name. Up until http was used for non-URL URNs, you were also giving them reasonable assurance that they could put the URI into a web browser to find out more.
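
To make the userinfo@host:port layout concrete, here is a rough Python sketch that pulls an authority string apart. The name split_authority and the tuple it returns are assumptions for the example, and it does no validation beyond what is needed to find the delimiters.

```python
def split_authority(authority):
    """Split "userinfo@host:port" into (userinfo, host, port).

    split_authority is a hypothetical helper. Missing parts come back
    as None, and the port is kept as a string rather than converted.
    """
    # The userinfo, if present, is everything before the last "@".
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None

    # Treat the text after the last ":" as a port only if it is empty or
    # all digits. A bracketed IPv6 literal such as "[::1]" ends in "]",
    # so its internal colons are never mistaken for the port separator.
    host, port = hostport, None
    head, colon, tail = hostport.rpartition(":")
    if colon and (tail == "" or tail.isdigit()):
        host, port = head, tail

    return userinfo, host, port
```

For example, split_authority("user@example.org:8080") gives ("user", "example.org", "8080"), and a plain "example.org" gives (None, "example.org", None).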

http is obviously -not- the only scheme. FTP is almost identical. It is IP-based and generic as well, so it uses the same DNS (or IP) authority scheme. An ftp URI is perhaps more likely than an http one to include a userinfo part in its authority. When you give someone an ftp URI they can almost certainly plug it into a web browser, and what they get back will depend on the path, the DNS name, and the userinfo (they may be looking at files relative to userinfo's home area). You could define any number of other schemes. You could define a mysql scheme which identified a database in the same way and used (escaped) SQL in the query part. You could define your own scheme to break away from DNS to utilise the technically much better ENS (Eric's Naming Service? :)).

As far as I'm concerned, non-dereferenceable URNs should follow the same pattern as any IP-based generic URI. That's why the generic URI idea exists in the RFC, so we don't have to reinvent the wheel every time we come up with a new hierarchical namespace. We especially don't have to reinvent the wheel whenever we come up with a new IP-based hierarchical namespace, but we should use different schemes when we're implying different resources. When I give someone a URN that I know is not a URL (it doesn't point to anything) I should be flagging that to them with an appropriate scheme identifier. When I give them a "urn" URI with an authority component I want them to know that I am identifying something, that I do own the DNS part and the right to create paths within that DNS namespace and scheme, and that I am also passing on my knowledge that there's nothing at the other end. That's useful information, particularly if they're some kind of automated crawler bot or, for that matter, some kind of web browser. The crawler bot would know not to look up the URI as if it were an http URL, and the web browser would know to explain the problem to its user or to start searching for information "about" the URI instead of "at" the URI.
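
As a sketch of what that buys a crawler, the decision can hang entirely off the scheme. Everything here is hypothetical: the should_fetch name, the particular set of schemes treated as dereferenceable, and the example URIs are assumptions for illustration, and the "urn"-with-authority form is the convention argued for above rather than anything already standardised.

```python
from urllib.parse import urlsplit

# Assumed set of schemes a crawler is willing to dereference.
DEREFERENCEABLE_SCHEMES = {"http", "https", "ftp"}


def should_fetch(uri):
    """Return True if the URI's scheme says something lives at the other end.

    should_fetch is a hypothetical helper. It looks only at the scheme,
    so a name-only URI is skipped without any network traffic.
    """
    scheme = urlsplit(uri).scheme.lower()
    return scheme in DEREFERENCEABLE_SCHEMES
```

Under these assumptions should_fetch("http://example.org/people/benjamin") is true, while should_fetch("urn://example.org/people/benjamin") is false, so the crawler can go looking for information "about" the second URI instead of trying to retrieve it.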

I think that RDF URIs that aren't URLs but don't tell anyone that they aren't URLs are broken and wrong. I really don't see the point of aliasing the different concepts together in such a confusing and misleading way. Is it for future expansion? "Maybe I'll want to put something at the URI some day." It doesn't sound very convincing to me. If you really want to do that, you're changing its meaning slightly anyway, and you can always tell everyone that the new URI is owl:sameAs the old one. I really don't understand it, I'm afraid.

Benjamin