Sound advice - blog

Tales from the homeworld

Wed, 2006-Jan-11

Machine Microformats

I find microformats appealing. They solve the problem of putting data on the web simply, without having to create extra files at extra urls and provide extra links to go and find those files. The data is in the same page as the human-readable content you provide. Like HTML itself, microformats allow you to put your own custom data into the gaps between the standard data. They effectively plug a gap in the semantic spectrum between universally-applicable and individually-applicable terms. I have been working on various data formats during my first week back from annual leave, and the question has occurred to me: "How do I create machine-only data that plugs the gap in a similar way?"

It doesn't make sense to use microformats directly in a machine-only environment. They are designed for humans first and machines second. However, it does make some sense to try and learn the lessons of HTML and microformats. When XML became the way people did machine-to-machine comms a strange thing happened. Instead of learning from HTML and other successful SGML applications, we jumped straight into strongly-typed thinking. HTML allows new elements to be added to its schema implicitly, with "must-ignore" semantics for anything a parser does not understand. This allows for forwards-compatibility of data structures. New elements and attributes can be added to represent new things without breaking the models that existing parsers use. Instead of following this example in XML we defined schemas that do not assume must-ignore semantics. We defined namespaces and schema versions. When we introduce version 3.0 of our schema, we expect existing parsers to discard the data and raise an error. This is the way we're used to doing things in the world of remote procedure calls and base classes. In fact, it is the wrong way.

My approach so far has been to think of an XML document as a simple tree. A parser should follow the tree down as far as it knows how to interpret the data, and should ignore data it does not understand. Following the microformat lead, I'm attempting to reuse terminology from existing standards before inventing my own. The data I've been presenting is time-oriented, so most terms and structure have been borrowed from iCalendar. The general theory is that it should be possible to represent universal (cross-industry, cross-application), general (cross-application), and local (application-specific) data in a freely-mixed way. Where there is a general term that could be used instead of a local term, you use it. Where there is a universal term that could be used instead of a general one, you do. The further you push things towards the universal end of that spectrum, the more parsers will understand all of your data.
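To make the tree-walking idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions rather than the format I am actually using: the document layout, the "schedule" root and the "shift-crew" element are made up for the example, and only the names vevent, summary, dtstart and dtend are borrowed from iCalendar. The parser descends into what it recognises and silently skips everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. The element names vevent, summary, dtstart and
# dtend are borrowed from iCalendar; "schedule" and "shift-crew" are
# invented here to stand in for local, application-specific terms.
DOC = """
<schedule>
  <vevent>
    <summary>Track maintenance</summary>
    <dtstart>2006-01-11T09:00:00</dtstart>
    <dtend>2006-01-11T11:00:00</dtend>
    <shift-crew>B</shift-crew>
  </vevent>
  <vtodo>
    <summary>Not understood by this parser</summary>
  </vtodo>
</schedule>
"""

# The only fields this particular parser knows how to interpret.
KNOWN_FIELDS = {"summary", "dtstart", "dtend"}


def read_events(xml_text):
    """Collect the known fields of each vevent, ignoring everything else."""
    root = ET.fromstring(xml_text.strip())
    events = []
    for child in root:
        if child.tag != "vevent":
            continue  # must-ignore: an unknown subtree is skipped, not an error
        event = {}
        for field in child:
            if field.tag in KNOWN_FIELDS:
                event[field.tag] = (field.text or "").strip()
            # must-ignore: unknown fields such as shift-crew are left alone
        events.append(event)
    return events


events = read_events(DOC)
```

A newer producer can add elements like shift-crew or vtodo without breaking this reader, and a newer reader can start understanding them without the producer changing anything.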

At present, I am also following the microformat lead of not jumping into the world of namespaces. I am still not convinced at this stage that they are beneficial. One possible end-point for this development would be to use no namespace for universal terms, and progressively more precise namespaces for general and local terms. Microformats themselves only deal in universal terms so they should be able to continue to get away without using namespaces.
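If that end-point were ever reached, a reader might look something like the following sketch. Everything here is hypothetical: the example.com namespace URIs, the "rail" and "app" prefixes and the element names other than summary are invented for illustration. Terms with no namespace are treated as universal and always read; namespaced terms are read only if the parser has been told it understands that namespace, and are otherwise ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: no namespace for the universal term (summary),
# an industry-level namespace for a general term, and a more precise
# namespace for a local term. All URIs and prefixed names are made up.
DOC = """
<vevent xmlns:rail="http://example.com/ns/rail"
        xmlns:app="http://example.com/ns/rail/my-app">
  <summary>Track maintenance</summary>
  <rail:track-section>12A</rail:track-section>
  <app:operator-console>3</app:operator-console>
</vevent>
"""


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


def classify(xml_text, understood_namespaces):
    """Return (understood fields, names of ignored fields)."""
    root = ET.fromstring(xml_text.strip())
    understood, ignored = {}, []
    for child in root:
        uri, local = split_tag(child.tag)
        if uri is None or uri in understood_namespaces:
            understood[local] = child.text
        else:
            ignored.append(local)  # must-ignore, never an error
    return understood, ignored


# A parser that knows the general vocabulary but not the local one.
understood, ignored = classify(DOC, {"http://example.com/ns/rail"})
```

A parser that knows no namespaces at all would still get the summary, which is the point of keeping universal terms un-namespaced.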

By allowing universal and local terms to mingle freely it is possible to make use of universal terms wherever they apply. I suppose this has been the vision of RDF all along. In recent years the semantic web seems to have somehow transformed into an attempt to invent a new Prolog, but I think a view of the semantic web as a meeting place for universal and local terms is of more immediate use. I think it would be useful to forget about RDF schemas for the most part and just refer to traditional standards documentation such as RFCs when dealing with ontology. I think it would also be useful to forget about trying to aggregate RDF data for now, and to think about a single format for the data rather than about multiple RDF representations. Perhaps thinking less about the data model RDF provides and more about a meeting of semantic terms would make RDF work for the people it has so far disenfranchised.

Benjamin