Sound advice - blog

Tales from the homeworld

Sun, 2006-Oct-01

The Preconditions of Content Type definition (and of the semantic web)

The microformats crowd have an admirable mechanism for defining standards. It relies on existing innovation. New invention, and the ego that goes with it, is kept to a minimum. The process essentially amounts to this:

  1. Find the closest match, and use that as your starting point.
  2. Cover the 80% case and evolve rather than trying to dot every "i" all at once.
  3. Sort out any namespace clashes with previously ratified formats to ensure that terms line up as much as possible and that namespaces do not have to be used.
  4. Allow extensions to occur in the wild without forcing them into their own namespaces.

I have already blogged about REST being the underlying model of the semantic web. Programs exchange data using a standard set of verbs and content types (i.e. ontologies). All program state is demarcated into resources that are represented in one of the standard forms and operated on using standard methods.
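
As a rough illustration of that model, here is a minimal Python sketch. The `ResourceSpace` class, the `/events/1` URL and the body text are all made up for the example, and this is not any particular framework's API. It only shows the shape of the idea: state lives in resources, and the only way to reach it is a small fixed set of verbs carrying typed representations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    content_type: str  # one of a small set of agreed content types
    body: str          # placeholder payload, not a complete document


class ResourceSpace:
    """Hypothetical in-memory stand-in for a server: URL -> representation."""

    def __init__(self):
        self._resources = {}

    def put(self, url, representation):
        # PUT replaces the state of the resource at this URL.
        created = url not in self._resources
        self._resources[url] = representation
        return 201 if created else 200

    def get(self, url):
        # GET returns the current representation, or 404 if there is none.
        representation = self._resources.get(url)
        if representation is None:
            return 404, None
        return 200, representation

    def delete(self, url):
        # DELETE removes the resource; deleting a missing one reports 404.
        if self._resources.pop(url, None) is None:
            return 404
        return 204


space = ResourceSpace()
event = Representation("text/calendar", "SUMMARY:Microformats meetup")
put_status = space.put("/events/1", event)
get_status, fetched = space.get("/events/1")
```

Nothing in the client code depends on how the event is stored or which product stores it. The verbs and the content type are the whole contract, which is the point of the paragraph above.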

This is a new layer in modern software practice. It is the information layer. Below it is typically an object layer, then a modular or functional layer for implementation of methods within an object. The information layer is crucial because while the layers below work well within a particular product and a particular version, they do not work well between versions of a product or between products produced by different vendors. The information layer described by the REST principles is known to scale across agency boundaries. It is known to support forwards- and backwards-compatible evolution of interaction over time and space.

I think that the microformats model sets the basic preconditions under which standardisation of content types can be achieved, and thus the preconditions under which the semantic web can be established:

  1. There must be sufficient examples of content available, produced without reference to any standard. These must be based on application need only, and must imply a common schema.
  2. There must be sufficient examples of existing attempts to standardise within the problem space. No one is smart enough to get it right the first time, and relying on experience with the earlier attempts is a necessary part of getting it right next time.

I think there need to be on the order of a dozen diverse examples from which an implied schema is extracted, and on the order of half a dozen existing formats. The source documents are likely to be selected from thousands of candidates in order to achieve an appropriately diverse set. This means that there is a fixed minimum scale to machine-to-machine information transfer at the inter-agency, Internet scale that can't be forced or worked around. Need alone is not sufficient to produce a viable standard.
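
To make "implied schema" a little more concrete, here is a small Python sketch of one way it could be extracted. This is only an assumption about the mechanics, not a description of how the microformats community actually works: the `implied_schema` function, the 0.8 threshold (borrowed from the "80% case" rule of thumb) and the sample field names are all invented for illustration.

```python
from collections import Counter


def implied_schema(samples, threshold=0.8):
    """Return the field names used by at least `threshold` of the samples.

    `samples` is a list of sets, one set of field names per example
    document found in the wild.
    """
    if not samples:
        return []
    counts = Counter()
    for fields in samples:
        counts.update(set(fields))  # count each field once per document
    total = len(samples)
    return sorted(name for name, n in counts.items() if n / total >= threshold)


# Made-up field names from five hypothetical event listings.
samples = [
    {"name", "date", "location"},
    {"name", "date", "location", "fax"},
    {"name", "date"},
    {"name", "date", "location"},
    {"name", "location"},
]

schema = implied_schema(samples)
# schema == ["date", "location", "name"]; "fax" appears once and is left out
```

With only a handful of samples a single odd document swings the result, which is one way to see why a dozen diverse examples is a floor rather than a target.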

My predictions about the semantic web:

  1. The semantic web will be about network effects relating to data which is already published with an implied schema
  2. Information that is of an obscure and ad hoc nature or structure will continue to be excluded from machine understanding
  3. The semantic web will spawn from the microformats effort rather than any RDF-related effort.
  4. The nature of machine understanding will have to be simplified in order for the semantic web to be accepted for what it is, at least for the first twenty years or so

RDF really isn't the cornerstone of the semantic web. RDF is too closely aligned with artificial intelligence, and with high ideals about how information can be reasoned over generically, to be really useful as an information exchange mechanism. In future, machine understanding will have to be accepted as something that relies primarily on human understanding. It will be more about which widget a program puts a particular data element into than about what other data it can infer automatically from the information at hand. One is simply useful. The other is a dead end.
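
Here is a sketch of what "which widget a program puts a particular data element into" might look like in Python. The class names `summary`, `dtstart` and `location` come from the hCalendar microformat; everything else is assumed for the example, including the widget names, the `route_fields` function and the sample markup. It reads only the text of flat, non-nested elements, which is a simplification of real microformat parsing.

```python
from html.parser import HTMLParser

# Hypothetical mapping from microformat class name to a UI widget name.
WIDGET_FOR_CLASS = {
    "summary": "title_label",
    "dtstart": "date_picker",
    "location": "map_pin",
}


class FieldRouter(HTMLParser):
    """Collects the text of recognised elements, keyed by target widget."""

    def __init__(self):
        super().__init__()
        self.widgets = {}
        self._current = None  # widget receiving text right now, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in WIDGET_FOR_CLASS:
                self._current = WIDGET_FOR_CLASS[name]
                self.widgets[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.widgets[self._current] += data

    def handle_endtag(self, tag):
        # Flat elements only: any closing tag ends the current field.
        self._current = None


def route_fields(html):
    router = FieldRouter()
    router.feed(html)
    router.close()
    return router.widgets


# Made-up sample markup in the hCalendar style.
page = """
<div class="vevent">
  <span class="summary">Microformats meetup</span>
  <span class="dtstart">2006-10-01</span>
  <span class="location">Brisbane</span>
</div>
"""

fields = route_fields(page)
```

The program infers nothing here. A human decided what `dtstart` means and which widget it belongs in, and the machine just follows the mapping.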

The semantic web is here today, with or without RDF. Even when simple HTML is exchanged, clients and servers understand each other's notations about paragraph marks and other information. The level of semantics that can be exchanged fundamentally relies on a critical mass of people and machines implicitly exchanging those semantics before standardisation and shared understanding begin. The microformat community is right: chase identification, time and date, and location. Those semantics are huge, and enough formats already exist to pick from. The next round of semantics may have to wait another ten or twenty years, until more examples with their implied schemas have been built up.