Sound advice - blog

Tales from the homeworld


Sat, 2007-Jan-27

RDF and the Semantic Web - Are we there, yet?

RDF is supposed to be the basis of the semantic web, but what is the semantic web, and does RDF help realise the semweb vision? I will frame this discussion in terms of the capabilities of XML and RDF, as well as involving the REST architectural style.

The Semantic Web

Tim Berners-Lee writes:

The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help... [T]he Semantic Web approach... develops languages for expressing information in a machine processable form.

The goal of the semantic web can therefore be phrased as applying REST practice to machines. On the face of it, the semantic web seems like a tautology. Machines already exchange semantics using the REST architectural style. They exchange HTML documents that contain machine-readable paragraph markers, headings, and the like. They exchange Atom documents that contain update times, entry titles, and author fields. They exchange vcalendar documents that convey time and date information suitable for setting up meetings between individuals. They even exchange vcard documents that allow address book entries to be transferred from one machine to another.
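
To make that concrete, here is a minimal sketch of a machine pulling Atom's title, update time, and author out of a document with nothing more than a standard XML parser. The feed entry itself is invented for illustration:

    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    # A made-up Atom entry; real feeds carry the same machine-readable fields.
    ENTRY = """
    <entry xmlns="http://www.w3.org/2005/Atom">
      <title>RDF and the Semantic Web</title>
      <updated>2007-01-27T00:00:00Z</updated>
      <author><name>Benjamin</name></author>
    </entry>
    """

    entry = ET.fromstring(ENTRY)
    # No human judgement is needed to find these; the Atom vocabulary says
    # exactly which element holds the title, the update time, and the author.
    print(entry.findtext(ATOM + "title"))
    print(entry.findtext(ATOM + "updated"))
    print(entry.findtext(ATOM + "author/" + ATOM + "name"))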

So the question is not whether or not we are equipped to transfer machine-readable semantics, but why the semantics are so low level and whether or not RDF can help us store and exchange information compared to today's leading encoding approach: XML.

The Fragmented Web

I would start out by arguing that machine to machine communication is hard. The reason it is hard is not because of the machines, but because of the people who write the software for those machines. Ultimately, every successful information transfer involves agreement between the content producer and content consumer. This agreement covers encoding format, but more than that: it covers vocabulary. The producer and consumer have either directly agreed on or dictated the meaning of their shared document, or have implemented a shared standard agreed by others. Each agreement exists within some sort of sub-culture that is participating in the larger architecture of the web.

Even if I use some sort of transformation layer that massages your data into a form I am more comfortable working with, I still must understand your data. I must agree with you as to its meaning in order to accept your input for processing. Transformations are more costly than bare agreement because an agreement is still required to feed into the transformation process.

REST views the web architecture in terms of universal document types that are transferred around using universally-understood methods and a universally-understood identifier scheme. The document type needs to indicate any particular encoding that is used, but also the vocabulary that is in use. In other words, REST assumes that a limited number of vocabularies plus their encoding into documents will exist in any architecture. Certainly far fewer vocabularies than there are participants in the architecture.
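
As a small illustration of that last point, the media type in an ordinary HTTP exchange names both the encoding and the vocabulary in one universally understood place. The URL below is hypothetical; the interesting part is the Content-Type header:

    import urllib.request

    # Ask for a specific vocabulary, not just "some XML".
    request = urllib.request.Request(
        "http://example.com/feed",                      # hypothetical resource
        headers={"Accept": "application/atom+xml"},
    )
    with urllib.request.urlopen(request) as response:
        # "application/atom+xml" says both "this is XML" (the encoding) and
        # "this is Atom" (the vocabulary). A bare "application/xml" would
        # only say the former.
        print(response.headers.get("Content-Type"))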

I'll continue with the theme from my last article, that in practice we don't have a single universal architecture. What we have is a rough universal main architecture that is broken down along human sub-culture boundaries into sub-architectures. These sub-architectures will each have their own local concepts, conventions, and jargon. In this environment we can gauge the effectiveness of an encoding or modelling approach for data by how well it bridges the divides between main and sub-architectures. Do whole new languages have to be introduced to cope with local concepts, or can a few words of jargon mixed into a broader vocabulary solve the problem?

The eXtensible Markup Language

First, let's look at XML. XML is a great way to encode information into a document of a defined type. It has proven useful for an enormous number of ad hoc document types or document types that have non-universal scope. It is also making a move into the universal world with document types such as atom and xhtml in the mix.

The dual reasons for the success of XML are that it is easy to encode most information into it, and it is easy to work with the information once encoded. The transformation tools such as xslt or pure dom manipulation are good. It is easy to encode information from arbitrary program data structures or database tables, and easy to decode into the same. It imposes low overheads for correctness, demonstrates good properties for evolution, and is basically understood by everyone who is likely to care.
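
As a rough sketch of that round trip (the element names and the record are invented), encoding a program data structure into XML and recovering it again takes only a few lines:

    import xml.etree.ElementTree as ET

    def to_xml(record):
        # One element per field; attributes would work just as well.
        root = ET.Element("person")
        for key, value in record.items():
            ET.SubElement(root, key).text = value
        return ET.tostring(root, encoding="unicode")

    def from_xml(text):
        return {child.tag: child.text for child in ET.fromstring(text)}

    record = {"name": "Benjamin", "role": "author"}
    document = to_xml(record)
    assert from_xml(document) == record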

XML has the ability to evolve when its consumers ignore parts of the document they don't understand. This allows producers and consumers of new versions of the document type to interoperate with producers and consumers of the old document type. More generally, XML is good at subclassing document types. A document with extra elements or attributes can be processed as if it did not have those extensions. This corresponds to the ability in an object-oriented language to operate through a base-class or interface-class instead of the specific named class.
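
A sketch of that rule in action: a consumer written against version 1 of a hypothetical document type keeps working when a version 2 document arrives carrying an element it has never heard of.

    import xml.etree.ElementTree as ET

    # A made-up version 2 document; <priority> did not exist in version 1.
    V2_DOCUMENT = """
    <event>
      <title>Timetable review</title>
      <start>2007-01-27T09:00:00</start>
      <priority>high</priority>
    </event>
    """

    def read_v1(text):
        # The version 1 consumer asks only for the elements it knows about,
        # so the unknown <priority> element is simply never consulted.
        root = ET.fromstring(text)
        return root.findtext("title"), root.findtext("start")

    print(read_v1(V2_DOCUMENT))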

Subclassing is not the only way that XML can accommodate changes. An XML document can be made to include other XML documents in a form of aggregation. For example, we have the atom specification referring to xhtml for its definition of title and content elements. This is similar to an object-oriented language allowing public member variables to be included in an object.
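
Here is a sketch of that kind of aggregation: a hypothetical Atom entry carrying an XHTML fragment as its content, with each vocabulary keeping its own namespace and its own definition.

    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"
    XHTML = "{http://www.w3.org/1999/xhtml}"

    ENTRY = """
    <entry xmlns="http://www.w3.org/2005/Atom">
      <title>Timetable change</title>
      <content type="xhtml">
        <div xmlns="http://www.w3.org/1999/xhtml">
          <p>The 9:05 service now departs from platform <b>3</b>.</p>
        </div>
      </content>
    </entry>
    """

    entry = ET.fromstring(ENTRY)
    # The Atom part of the document points at an included XHTML part, much
    # as an object might expose a member belonging to another class.
    div = entry.find(ATOM + "content/" + XHTML + "div")
    print(ET.tostring(div, encoding="unicode"))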

The Resource Description Framework

As XML can do subclassing and aggregation, it makes sense to view it as a practical way to encode complex data in ways that will be durable. However, RDF challenges this view from a database-oriented viewpoint. It says that we should be able to arbitrarily combine information, and extract it from a given document using an SQL-like query mechanism. We should be able to combine information from different documents and vocabularies for use in these queries. This creates hybrid documents that could conceivably be used to combine information from different sub-architectures. By providing a common conceptual model for all information, RDF hopes that the vocabularies will sort themselves out within its global context.
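
A minimal sketch of that database view, assuming the third-party rdflib package: two invented documents in different vocabularies are merged into one graph, and a single SPARQL query reads across both.

    from rdflib import Graph

    CONTACTS = """
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.com/people/benjamin> foaf:name "Benjamin" .
    """

    POSTS = """
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    <http://example.com/posts/rdf-semweb>
        dc:creator <http://example.com/people/benjamin> ;
        dc:title "RDF and the Semantic Web" .
    """

    graph = Graph()
    graph.parse(data=CONTACTS, format="turtle")
    graph.parse(data=POSTS, format="turtle")   # same graph, second vocabulary

    # An SQL-like (SPARQL) query that spans both vocabularies at once.
    for title, name in graph.query("""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        PREFIX dc:   <http://purl.org/dc/elements/1.1/>
        SELECT ?title ?name WHERE {
            ?post dc:title ?title ;
                  dc:creator ?person .
            ?person foaf:name ?name .
        }"""):
        print(title, name)

The same query keeps working however many further documents are poured into the graph, which is the point of the exercise.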

Personally, I wonder about all that. Whenever you mix vocabularies you incur a cost in terms of additional namespaces. It's like having a conversation where instead of saying, "I'm going to the shops, then out to a caffe", you say: "I'm old-english:gAn old-english:tO old-english:thE old-english:sceoppa, old-english:thonne old-english:ut old-english:tO a italian:caffe". Just where did that term you are using come from again? Is caffe Italian or French? Imagine if today's html carried namespaces like "microsoft:" and "netscape:" throughout. Namespaces to identify concepts do not handle cultural shifts very well. In the end we just want to have a conversation about going to the shops. We want to do it in today's language using today's tools. We don't want a history lesson. Supporting these different namespaces may even help us avoid coming to proper consensus between parties, fragmenting the vocabulary space unnecessarily.

The main thing RDF does practically today is allow data from different sources to be placed in a database that is agnostic as to the meaning of its data. Queries that have knowledge of specific vocabularies can be executed to extract information from this aggregated data set. So far this class of application has not proven to be a significant use case on the web, but has made some inroads into traditional database territory where a more ad hoc approach is desired.

Conclusion

So it seems to depend on your view of information transfer as to whether XML or RDF currently makes more sense. If you see one machine sending another machine a document for immediate processing, you will likely prefer XML. It is easy to encode information into XML and extract it back out of the document. If you see the exchange as involving a database that can be later queried, RDF would seem to be the front-runner. RDF makes this database possible at the cost of making the pure information exchange more complex.

In terms of how the two approaches support an architecture built up of sub-architectures, well... I'm not sure. XML would seem to offer all of the flexibility necessary. I can subclass the iCalendar-as-xml type and add information for scheduling passenger information displays on a train platform. I can include xhtml content for display. It would seem that I can introduce my local jargon at a fairly low cost, although it may be advisable to use a mime type that clearly separates the PIDS use from other subclasses. That mime type would ideally include the name of the type it derives from so that it can be recognised as that type as well as the subclass type: application/pids+calendar+xml.
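
As a sketch of that low-cost jargon (every element name and namespace below is invented), a PIDS document can extend a calendar-as-XML event and still be read by a consumer that only knows the base type:

    import xml.etree.ElementTree as ET

    CAL = "{http://example.com/ns/calendar}"     # hypothetical base vocabulary
    PIDS = "{http://example.com/ns/pids}"        # hypothetical local jargon

    # Served, ideally, as application/pids+calendar+xml.
    EVENT = """
    <event xmlns="http://example.com/ns/calendar"
           xmlns:pids="http://example.com/ns/pids">
      <summary>Morning service</summary>
      <start>2007-01-27T09:05:00</start>
      <pids:platform>3</pids:platform>
    </event>
    """

    root = ET.fromstring(EVENT)
    # A plain calendar consumer sees only the base vocabulary...
    print(root.findtext(CAL + "summary"), root.findtext(CAL + "start"))
    # ...while a PIDS-aware consumer also picks up the local jargon.
    print(root.findtext(PIDS + "platform"))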

RDF also allows me to perform subclassing and aggregation, and even to include XML data as the object of a triple. In RDF I would be required to come up with a new namespace for my extensions, something that is not particularly appealing. However, the extra functionality is there if you are willing to pay for the extra complexity.
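
A sketch of the same extension done as RDF triples, again assuming rdflib. The pids namespace is invented, and RDF obliges me to mint it before I can say anything in my local jargon:

    from rdflib import Graph, Literal, Namespace, URIRef

    ICAL = Namespace("http://www.w3.org/2002/12/cal/ical#")  # RDF Calendar vocabulary (assumed)
    PIDS = Namespace("http://example.com/ns/pids#")          # hypothetical local namespace

    event = URIRef("http://example.com/events/0905-service")
    graph = Graph()
    graph.add((event, ICAL.summary, Literal("Morning service")))
    graph.add((event, ICAL.dtstart, Literal("2007-01-27T09:05:00")))
    graph.add((event, PIDS.platform, Literal("3")))   # the extension triple

    print(graph.serialize(format="turtle"))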

Benjamin