Sound advice - blog

Tales from the homeworld

My current feeds

Sat, 2008-Feb-23

The Future of Media Types

MIME is the cornerstone of document type identification on the Web today. XHTML, for example, carries the application/xhtml+xml media type. On the other hand, it also carries an XML namespace. Media types are controlled centrally by the IANA. XML namespaces are URLs, and therefore open to anyone to create. What are the trade-offs, and what is the future for document type identification?


Mark Baker comments on recent attempts to decentralise media type assignment, saying:

Been there, tried that. I used to think that was a good idea, but no longer do.

I'm a centrist on most technical and political issues. I see both sides of this debate. On the one hand, tight technical control ensures that document types on the Web are well controlled. This is a good thing. It means that different components of the Web's architecture can exchange information through these data types. On the other hand, there are always going to be document types that make sense outside of the context of the Web. These might be used within an isolated railway control system, or in a Business-to-Business pairing or grouping.

As the importance of the REST architectural style sinks in outside the Web, it becomes less likely that the set of Web content types will be sufficient to convey all useful semantics. If we assume that all document types ultimately need to be identified, we might ask questions like: "What is the content type of the configuration file format for my SMTP server?". Spiralling outwards from this very specific case: "What is the content type of a machine-readable description of a railway?". Neither of these cases appears on the Web, and the IETF is unlikely to be an appropriate forum for standardising the identification of these documents. Web protocols and standards extend beyond the single World-Wide Web, and I think it is important to recognise this in the development of those standards.


Even if we do open up document types to a very wide base, MIME Types and URLs do not carry the same information. Importantly, URLs are generally opaque. In contrast, MIME Types can be interpreted. An atom document containing content of a future XML type knows from the application/future+xml type attribute that the content is included as an unescaped XML sub-document. It can interpret text/future as meaning that the sub-document is an XML text node. It can interpret the absence of either condition as meaning that the content is binary, and base64 encoded. Likewise, parameters on text document types can convey information such as character encodings.
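These dispatch rules can be sketched in a few lines. This is only an illustration of the interpretation an atom processor performs on the type attribute; the "future" media types are hypothetical, and a real processor would consult the full Atom rules rather than this simplified function:

```python
import base64

def decode_atom_content(media_type, payload):
    """Interpret an atom content payload from its type attribute.

    media_type -- the type attribute, e.g. "application/future+xml"
    payload    -- the raw character content of the element
    """
    if media_type.endswith("+xml") or media_type.endswith("/xml"):
        # An XML media type: the payload is an unescaped XML sub-document.
        return ("xml", payload)
    if media_type.startswith("text/"):
        # A text type: the payload is a plain XML text node.
        return ("text", payload)
    # Neither condition holds: treat the payload as base64-encoded binary.
    return ("binary", base64.b64decode(payload))
```

The point is that the decision is driven entirely by structure visible inside the media type string, something an opaque URL cannot offer.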

A danger in heading down the URL path for document identification is that we lose this additional metadata. Ideally we would be able to extract from a document both its type and any parent types it may have that we could understand.
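The "+xml" suffix convention already gives a limited version of this parent-type extraction. A minimal sketch, assuming a simple suffix-based lineage (the function name and the "future" type are illustrative, not part of any standard API):

```python
def type_lineage(media_type):
    """Return the media type followed by any parent type implied
    by a "+suffix" in its subtype, most specific first."""
    lineage = [media_type]
    main, _, sub = media_type.partition("/")
    if "+" in sub:
        # application/future+xml implies the parent type application/xml
        suffix = sub.rsplit("+", 1)[1]
        lineage.append(main + "/" + suffix)
    return lineage
```

A consumer that does not understand application/future+xml could fall back to treating the document as generic application/xml. A URL-based identifier gives a consumer no such fallback.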

Where I stand

My solution for the moment is to just make up media types for special-purpose applications. These types are not registered with the IANA, and are not exchanged in general parlance. My theory is that a time will come where this type either comes into contact with another type with the same name, or another type with the same essential data schema. There will be some kind of conflict when either of these things happen, and the conflict will have to be resolved through social (rather than technical) means.

Once these types are well-enough developed for this kind of conflict to have occurred, it is likely that they will be ready for inclusion in some form of register. That might eventually be the IANA, but I suspect that satellite bodies will need to participate in the control of the document type space.

The question with this approach is where it leaves us with XML namespaces, and with URLs in general, for document type identification. At present I don't recommend the use of XML namespaces at all. I think that MIME Types are king, and will remain king for the immediately foreseeable future. XML namespaces should therefore, in general, be ignored by consumers as redundant information. On the other hand, I might just be swimming against the tide on that one. I guess that atom could get by perfectly well with an XML namespace and no type attribute for XML sub-documents. Perhaps there is no practical benefit in being able to parse a document type for additional metadata.


Sun, 2008-Feb-10

Mortal Bloggers

Tim Bray brings up the issue of what happens to private web sites after we die. It may be self-important of me, but I think the state has some responsibility for preserving the public works of its citizens. I first wrote about this problem back in January 2007.

My theory is that digital works such as personal web sites should be considered analogous to other published works, such as books. Libraries should be responsible for performing restoration work on the data by moving it to their servers, and for maintaining it and its associated domain registrations thereafter. The cost of maintaining this store, in terms of storage and bandwidth, should be small. It therefore seems reasonable that the state should attempt to preserve all the published digital works of its citizens.

On the other hand, perhaps we vanity bloggers would all be better off moving to free public hosting provided by existing large companies. If the private sector is already meeting the need, there is little chance that government will step in to stop the rot.


Sat, 2008-Feb-09

SQLite and Polling locks


I agree that the SQLITE_BUSY return code is insane. The root cause is that sqlite does not use blocking operating-system locks, meaning it has to poll to determine whether it can obtain access. Internally it has already retried its locks a number of times before reporting SQLITE_BUSY back to you.
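The usual workaround is to wrap your own retry loop around the polling. A minimal sketch using Python's sqlite3 module, which surfaces SQLITE_BUSY as an OperationalError ("database is locked"); the function name and retry parameters are my own, not anything sqlite provides:

```python
import sqlite3
import time

def execute_with_retry(conn, sql, params=(), retries=5, delay=0.1):
    """Execute a statement, polling again on SQLITE_BUSY.

    sqlite3 raises OperationalError with "database is locked" when the
    underlying library reports SQLITE_BUSY after its own internal retries.
    """
    for attempt in range(retries):
        try:
            return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            if "locked" not in str(e) or attempt == retries - 1:
                raise
            time.sleep(delay)  # back off, then poll the lock again
```

This is still polling, of course, with all the latency and fairness problems that implies; it just moves the loop into application code where the backoff policy is at least visible.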

I hacked sqlite 2 for WRSA's internal use to use blocking locks. Unfortunately, I have never gotten around to figuring out whether blocking locks can be introduced to v3 without causing problems; the v3 locking model is much more complex. Most of this feature is held in a single source file (os.c, iirc), so it should be possible for a single human to get their head around it. If anyone does get it together, perhaps it would be worth submitting a patch back to DRH via the mailing list.


Sat, 2008-Feb-09

RESTWiki Updates

I have had a rough time with services of late, but am now officially back on the Internet. To celebrate, I have made some minor revisions to a RESTWiki article on how to design XML file formats for semantic-web-scale information exchange. I have also summarised some recent blog content into a page on how to do reliable messaging with plain HTTP and simple idempotent messages.
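The core of the idempotent-messaging approach is that a client can blindly retry a PUT to a message-specific URL until it sees success, because repeating the same PUT leaves the server in the same state. A toy sketch of that retry pattern, with a stand-in server object simulating lost requests (FlakyServer and reliable_put are hypothetical names for illustration, not part of any library):

```python
class FlakyServer:
    """Stand-in for an HTTP server that loses the first few requests."""
    def __init__(self, fail_first_n):
        self.store = {}
        self.failures_left = fail_first_n

    def put(self, url, body):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise IOError("connection lost")
        # PUT is idempotent: repeating the same url+body yields the
        # same server state, so duplicate deliveries are harmless.
        self.store[url] = body
        return 200

def reliable_put(server, url, body, max_attempts=5):
    """Retry an idempotent PUT until acknowledged or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return server.put(url, body)
        except IOError:
            continue  # safe to retry without knowing if the PUT landed
    raise IOError("gave up after %d attempts" % max_attempts)
```

Because the client never needs to know whether a lost request reached the server before failing, no duplicate-detection protocol is required beyond choosing one URL per message.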

Have fun reading!

Update 2008-02-09: And here is another article on achieving high availability characteristics between an HTTP server and client.