Sound advice - blog

Tales from the homeworld

Sat, 2007-Jan-27

RDF and the Semantic Web - Are we there, yet?

RDF is supposed to be the basis of the Semantic Web, but what is the Semantic Web, and does RDF help realise the semweb vision? I will frame this discussion in terms of the capabilities of XML and RDF, as well as the REST architectural style.

The Semantic Web

Tim Berners-Lee writes:

The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help... [T]he Semantic Web approach... develops languages for expressing information in a machine processable form.

The goal of the semantic web can therefore be phrased as applying REST practice to machines. On the face of it the semantic web seems like a tautology. Machines already exchange semantics using the REST architectural style. They exchange HTML documents that contain machine readable paragraph markers, headings, and the like. They exchange Atom documents that contain update times, entry titles, and author fields. They exchange vcalendar documents that convey time and date information suitable for setting up meetings between individuals. They even exchange vcard documents that allow address book entries to be transferred from one machine to another.
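
To make this concrete, here is a minimal sketch of a machine consuming those semantics, written against Python's standard library; the feed URL is hypothetical.

    # A minimal sketch of a machine reading Atom semantics, using only
    # Python's standard library. The feed URL is hypothetical.
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    with urllib.request.urlopen("https://example.org/feed.atom") as response:
        feed = ET.parse(response).getroot()

    for entry in feed.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title")
        updated = entry.findtext(ATOM + "updated")
        author = entry.findtext(ATOM + "author/" + ATOM + "name")
        print(title, updated, author)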

So the question is not whether or not we are equipped to transfer machine-readable semantics, but why the semantics are so low level and whether or not RDF can help us store and exchange information compared to today's leading encoding approach: XML.

The Fragmented Web

I would start out by arguing that machine to machine communication is hard. The reason it is hard is not because of the machines, but because of the people who write the software for those machines. Ultimately, every successful information transfer involves agreement between the content producer and content consumer. This agreement covers encoding format, but more than that. It covers vocabulary. The producer and consumer have either directly agreed on or dictated the meaning of their shared document, or have implemented a shared standard agreed by others. Each agreement exists within some sort of sub-culture that is participating in the larger architecture of the web.

Even if I use some sort of transformation layer that massages your data into a form I am more comfortable working with, I still must understand your data. I must agree with you as to its meaning in order to accept your input for processing. Transformations are more costly than bare agreement because an agreement is still required to feed into the transformation process.

REST views the web architecture in terms of universal document types that are transferred around using universally-understood methods and a universally-understood identifier scheme. The document type needs to indicate any particular encoding that is used, but also the vocabulary that is in use. In other words, REST assumes that a limited number of vocabularies plus their encoding into documents will exist in any architecture. Certainly far fewer vocabularies than there are participants in the architecture.

I'll continue with the theme from my last article, that in practice we don't have a single universal architecture. What we have is a rough universal main architecture that is broken down along human sub-culture boundaries into sub-architectures. These sub-architectures will each have their own local concepts, conventions, and jargon. In this environment we can gauge the effectiveness of an encoding or modelling approach for data by how well it bridges divides between main and sub-architectures. Do whole new languages have to be introduced to cope with local concepts, or can a few words of jargon mixed into a broader vocabulary solve the problem?

The eXtensible Markup Language

First, let's look at XML. XML is a great way to encode information into a document of a defined type. It has proven useful for an enormous number of ad hoc document types or document types that have non-universal scope. It is also making a move into the universal world with document types such as atom and xhtml in the mix.

The dual reasons for the success of XML are that it is easy to encode most information into it, and it is easy to work with the information once encoded. The transformation tools such as xslt or pure dom manipulation are good. It is easy to encode information from arbitrary program data structures or database tables, and easy to decode into the same. It imposes low overheads for correctness, demonstrates good properties for evolution, and is basically understood by everyone who is likely to care.
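
As a rough sketch of how little ceremony this involves, the following round-trips a small program data structure through XML using only Python's standard library; the element names are invented for illustration.

    # A rough sketch of round-tripping a program data structure through XML
    # with Python's standard library. The element names are invented.
    import xml.etree.ElementTree as ET

    record = {"origin": "Roma Street", "destination": "Central", "platform": "3"}

    # Encode: one element per field.
    root = ET.Element("service")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    document = ET.tostring(root, encoding="unicode")

    # Decode: back into a plain dictionary.
    decoded = {child.tag: child.text for child in ET.fromstring(document)}
    assert decoded == record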

XML has the ability to evolve when its consumers ignore parts of the document they don't understand. This allows producers and consumers of new versions of the document type to interoperate with producers and consumers of the old document type. More generally, XML is good at subclassing document types. A document with extra elements or attributes can be processed as if it did not have those extensions. This corresponds to the ability in an object-oriented language to operate through a base-class or interface-class instead of the specific named class.
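
A small sketch of that must-ignore discipline, with an invented document type: a consumer written against the old vocabulary keeps working when a newer producer adds elements.

    # A sketch of must-ignore evolution: the consumer only looks for the
    # elements it knows, so a newer document with extensions still parses.
    import xml.etree.ElementTree as ET

    newer_document = """
    <contact>
      <name>Benjamin</name>
      <email>ben@example.org</email>
      <im-handle>ben@chat.example.org</im-handle>  <!-- unknown extension -->
    </contact>
    """

    contact = ET.fromstring(newer_document)
    known = {"name": contact.findtext("name"), "email": contact.findtext("email")}
    # The <im-handle> extension is silently ignored, as a base-class view would be.
    print(known)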

Subclassing is not the only way that XML can accommodate changes. An XML document can be made to include other XML documents in a form of aggregation. For example, we have the atom specification referring to xhtml for its definition of title and content elements. This is similar to an object-oriented language allowing public member variables to be included in an object.

The Resource Description Framework

As XML can do subclassing and aggregation, it makes sense to view it as a practical way to encode complex data in ways that will be durable. However, RDF challenges this view from a database-oriented viewpoint. It says that we should be able to arbitrarily combine information, and extract it from a given document using an SQL-like query mechanism. We should be able to combine information from different documents and vocabularies for use in these queries. This creates hybrid documents that could conceivably be used to combine information from different sub-architectures. By providing a common conceptual model for all information, RDF hopes that the vocabularies will sort themselves out within its global context.
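
A minimal sketch of that database-oriented view, assuming the third-party rdflib package is available: two vocabularies are mixed in one graph and queried together with SPARQL.

    # A minimal sketch of the database-style view, assuming the third-party
    # rdflib package is installed. Two vocabularies (FOAF and Dublin Core)
    # are mixed in one graph and queried together.
    from rdflib import Graph

    data = """
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix dc:   <http://purl.org/dc/elements/1.1/> .
    <http://example.org/post/1> dc:title   "Are we there, yet?" ;
                                dc:creator <http://example.org/people/ben> .
    <http://example.org/people/ben> foaf:name "Benjamin" .
    """

    g = Graph()
    g.parse(data=data, format="turtle")

    results = g.query("""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        PREFIX dc:   <http://purl.org/dc/elements/1.1/>
        SELECT ?title ?name WHERE {
            ?post dc:title ?title ; dc:creator ?author .
            ?author foaf:name ?name .
        }
    """)
    for title, name in results:
        print(title, name)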

Personally, I wonder about all that. Whenever you mix vocabularies you incur a cost in terms of additional namespaces. It's like having a conversation where instead of saying "I'm going to the shops, then out to a caffe", you say: "I'm old-english:gAn old-english:tO old-english:thE old-english:sceoppa, old-english:thonne old-english:ut old-english:tO a italian:caffe". Just where did that term you are using come from again? Is caffe Italian or French? Imagine if today's HTML carried namespaces like "microsoft:" and "netscape:" throughout. Namespaces to identify concepts do not handle cultural shifts very well. In the end we just want to have a conversation about going to the shops. We want to do it in today's language using today's tools. We don't want a history lesson. Supporting these different namespaces may even help us avoid coming to proper consensus between parties, fragmenting the vocabulary space unnecessarily.

The main thing RDF does practically today is allow data from different sources to be placed in a database that is agnostic as to the meaning of its data. Queries that have knowledge of specific vocabularies can be executed to extract information from this aggregated data set. So far this class of application has not proven to be a significant use case on the web, but has made some inroads into traditional database territory where a more ad hoc approach is desired.

Conclusion

So it seems to depend on your view of information transfer as to whether XML or RDF currently makes more sense. If you see one machine sending another machine a document for immediate processing, you will likely prefer XML. It is easy to encode information into XML and extract it back out of the document. If you see the exchange as involving a database that can be later queried, RDF would seem to be the front-runner. RDF makes this database possible at the cost of making the pure information exchange more complex.

In terms of how the two approaches support an architecture built up of sub-architectures, well... I'm not sure. XML would seem to offer all of the flexibility necessary. I can subclass the iCalendar-as-xml type and add information for scheduling passenger information displays on a train platform. I can include xhtml content for display. It would seem that I can introduce my local jargon at a fairly low cost, although it may be advisable to use a mime type that clearly separates the PIDS use from other subclasses. That mime type would ideally include the name of the type it derives from so that it can be recognised as that type as well as the subclass type: application/pids+calendar+xml.
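
As a sketch only, here is one way a consumer might interpret that hypothetical compound mime type, claiming any document whose type components include one it understands; multiple "+" suffixes are this article's own convention, not a registered one.

    # A sketch of dispatching on the hypothetical subclassed media type from
    # the text. A consumer that only understands the calendar base type can
    # still claim the document by looking at the type's components.
    def understands(media_type: str, consumer_knows: set) -> bool:
        # "application/pids+calendar+xml" -> {"pids", "calendar", "xml"}
        subtype = media_type.split("/", 1)[1]
        components = set(subtype.split("+"))
        return bool(components & consumer_knows)

    print(understands("application/pids+calendar+xml", {"calendar"}))  # True
    print(understands("application/pids+calendar+xml", {"pids"}))      # True
    print(understands("application/atom+xml", {"calendar"}))           # False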

RDF also allows me to perform subclassing and aggregation, and even include XML data as the object of a triple. In RDF I would be required to come up with a new namespace for my extensions, something that is not particularly appealing. However extra functionality is there if you are willing to pay for the extra complexity.

Benjamin

Sat, 2007-Jan-20

Breaking Down Barriers to Communication

When the cut-and-paste paradigm was introduced to the desktop, it was revolutionary. Applications that had no defined means of exchanging data suddenly could. A user cuts or copies data from one application, and pastes it into another. Instead of focusing on new baseclasses or IDL files in order to make communication work, the paradigm broke the communication problem into three separate domains: Identification, Methods, and Document Types. A single mechanism for identification of a cut or paste point combined with a common set of methods and document types allows ad hoc communication to occur. So why isn't all application collaboration as easy and ad hoc as cut-and-paste?

The Importance of Architectural Style

The constraints of REST and of the cut-and-paste paradigm contain significant overlap. REST also breaks communication down into a single identification scheme for the architecture, a common set of methods for the architecture, and a common set of document types that can be exchanged as part of method invocation. The division is designed to allow an architecture to evolve. It is nigh impossible to change the identification scheme of an architecture, though the addition of new identifiers is an everyday occurrence. The set of methods rarely changes because of the impact this change would have on all components in the architecture. The most commonly-evolving component is the document type, because new kinds of information and new ways of transforming this information to data are created all of the time.
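
A small sketch of that division in Python: the method set and identifier scheme stay fixed, and only the requested document type varies; the URLs are hypothetical.

    # A sketch of the uniform interface: one method (GET), one identifier
    # scheme (URIs), and document types negotiated per request. The URLs
    # are hypothetical.
    import urllib.request

    def fetch(uri: str, accept: str) -> bytes:
        request = urllib.request.Request(uri, headers={"Accept": accept})
        with urllib.request.urlopen(request) as response:
            return response.read()

    # The same client code integrates with unrelated servers because only
    # the document type varies, not the protocol or the method set.
    page = fetch("https://example.org/timetable", "application/xhtml+xml")
    feed = fetch("https://example.org/timetable", "application/atom+xml")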

The web is to a significant extent an example of the application of REST principles. It is for this reason that I can perform ad hoc integration between my web browser and a host of applications across thousands of Internet servers. It is comparable to the ease of cut-and-paste, and the antithesis of systems that focus on the creation of new baseclasses to exchange new information. Each new baseclass is in reality a new protocol. Two machines that share common concepts cannot communicate at all if their baseclasses don't match exactly.

The Importance of Agreement

Lee Feigenbaum writes about the weakness of REST:

This means that my client-s[i]de code cannot integrate data from multiple endpoints across the Web unless those endpoints also agree on the domain model (or unless I write client code to parse and interpret the models returned by every endpoint I'm interested in).

Unfortunately, to do large scale information integration you have to have common agreed ways of representing that information as data. This includes mapping to a particular kind of encoding, but more than that. It requires common vocabulary with common understanding of the semantics associated with the vocabulary. In short, every machine-to-machine information exchange relies on humans agreeing on the meaning of the data they exchange. Machines cannot negotiate or understand data. They just know what to do with it. A human told them that, and made the decision as to what to do with the data based on human-level intelligence and agreement.

Every time two programs exchange information there is a human chain from the authors of those programs to each other. Perhaps they agreed on the protocol directly. Perhaps a standards committee agreed, and both human parties in the communication read and followed those standards. Either way, humans have to understand and agree on the meaning of data in order for information to be successfully encoded and extracted.

Constraining the Number of Agreements

In a purely RESTful architecture we constrain the number of document types. This directly implies a constraint on the number of agreements in the architecture to a number that grows more slowly than the number of components participating in the architecture. If we look at the temporal scale we constrain the number of agreements to grow less rapidly than the progress of time. If we can't achieve this we won't be able to understand the documents of the previous generation of humanity, a potential disaster. But is constraining the number of agreements practical?

On the face of it, I suspect not. Everywhere there is a subculture of people operating within an architecture there will be local conventions, extensions, and vocabulary. This is often necessary because concepts that are understood within the context of a subculture may not translate to other subcultures. They may be local rather than universal concepts. This suggests that what we will actually have over any significant scale of architecture is a kind of main body which houses universal concepts above an increasingly fragmented set of sub-architectures. Within each sub-architecture we may be able to ensure that REST principles hold.

Solving the Fragmentation Problem

This leaves us, I think, with two outs: One is to accept the human fragmentation intrinsic to a large architecture, and look for ways to make the sub-architectures work with wider architectures. The other is to forgo direct machine-to-machine communication and involve humans in the loop.

We do both already on the web in a number of ways. In HTML we limit the number of universal concepts such as "paragraph" and "heading 3", but allow domain-specific information to be encoded into class attributes, and allow even more specific semantics to be conveyed in plain text. The class attributes need to work with the local conventions of a web site, but could convey semantics to particular subcultures as microformat-like specifications. The human-readable text conveys no information to a machine, but by adding human-level intelligence a person who is connected to the subculture the text came from can provide an ad hoc interpretation of the data into information.
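
A sketch of a machine harvesting those class-attribute semantics with Python's standard library; the markup and the set of interesting class names are invented for illustration.

    # A sketch of reading microformat-like semantics out of HTML class
    # attributes with the standard library. The markup is invented.
    from html.parser import HTMLParser

    class ClassHarvester(HTMLParser):
        def __init__(self, wanted):
            super().__init__()
            self.wanted, self.capturing, self.values = wanted, None, {}

        def handle_starttag(self, tag, attrs):
            classes = dict(attrs).get("class", "").split()
            hits = [c for c in classes if c in self.wanted]
            if hits:
                self.capturing = hits[0]

        def handle_data(self, data):
            if self.capturing:
                self.values[self.capturing] = data.strip()
                self.capturing = None

    harvester = ClassHarvester({"dtstart", "summary"})
    harvester.feed('<p><span class="summary">Staff meeting</span> at '
                   '<abbr class="dtstart">2007-01-27T10:00</abbr></p>')
    print(harvester.values)  # {'summary': 'Staff meeting', 'dtstart': '2007-01-27T10:00'}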

We see this on the data submission side of things too. We see protocols such as atompub conveying semantics via agreement, but we also have HTML forms which can perform ad hoc information submission when a human is in the loop. The human uses their cultural ties to interpret the source document and fill it out for submission back to the server.

Conclusion

I don't think that either REST or the Semantic Web can ignore the two ends of an architectural picture fragmented by human subcultures. Without universal concepts that have standard encodings and vocabulary to convey them, we can't perform broad-scale information integration across the architecture. Without the freedom to perform ad hoc agreement, the architecture opens itself up to competition. Without a bridge between these two extremes, what should simply be a few local jargon expressions thrown into a widely-understood conversation will become languages of their own that only the locals understand. The RDF propensity to talk about mapping between vocabularies is itself a barrier to communication. It will always be cheaper to have a conversation when a translator is not required between the parties for concepts both parties understand.

Benjamin

Sat, 2007-Jan-06

Death and Libraries

Have you ever wondered what will happen to your blog when you die? Perhaps it is the influence of parenthood on my life, but I have been thinking about the topic of late. If a part of your legacy is in your blog, what will your legacy be? Perhaps libraries have a role in guaranteeing the future of today's web.

I suspect that most bloggers haven't really thought about the problem. How long will your blog or web site last? Only as long as the money does. The monthly internet bill needs paying, or the annual web hosting fee if your hosting occurs externally. If you have that covered, your domain registration will be up for renewal in less than two years. Perhaps you don't have a vanity domain. Maybe you are registered with blogger. This kind of blog is likely to last a lot longer, but for how long? Will your great grandchildren be able to read your blog? Will their great grandchildren? Will your great great great grandchildren be able to leave new comments on your old material?

Blogs are collections of resources. Resources demarcate state, and return representations of that state. These representations are documents in particular formats, such as HTML4. So in addition to the question of whether the resources themselves will be durable we must consider how durable the document formats used will be. We may even have to look at whether HTTP/1.1 and TCP/IPv4 will be widely understood a hundred years from now.

The traditional way to deal with these sorts of longevity problems is to produce hard copies of the data. You could print off a run of 1000 bound copies of your blog to be distributed amongst interested parties. These parties might be your descendants, historians who think you have some special merit in the annals of mankind, and perhaps most universally: librarians who wish to maintain a collection of past and present thought.

We could attempt the same thing with the web, however the web maps poorly to the printed word given the difficulty of providing appropriate hyperlinks. It also rests on the notion that the person interested in a particular work is geographically close to the place that it is housed, and can find it through appropriate means. Let us consider another possibility in the future networked world. Consider the possibility that those with an interest in the works host the works from their own servers.

Consider the cost of running a small library today. If all data housed in the library eventually became digital data, that data could be distributed anywhere in the world for a fraction of the cost of running a library today. We already see sites like the wayback machine attempting to record the web of yesteryear, or google cache trying to ensure that today's content is reliably available. Perhaps the next logical step is for organisations to start hosting the resources of the original site directly. After all, there is often as much value in the links between resources as there is in the resource content itself. Maintaining the original URLs is important. Perhaps web sites could be handed over to these kinds of institutions to avoid falling off the net. Perhaps these institutions could work to ensure the long survival of the resources.

The technical challenges of long-term data hosting are non-trivial. A typical web application consists of application-specific state, some site-specific code such as a PHP application, a web server application, an operating system, physical hardware, and a connection to an ISP. Just to start hosting the site would likely require a normalisation of software and hardware. Perhaps an application that simply stores the representations of each resource and returns them to its clients could replace most of the software stack. The connection to the ISP is likely to be different, and will have to change over time. The application protocols will change over the years as IPv6 replaces IPv4 and WAKA replaces HTTP (well, maybe). The data will have to hop from hardware platform to hardware platform to ensure ongoing connectivity, and from software version to software version.
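
A sketch of that stripped-down stack, using Python's standard library: an application that stores each resource's representation and content type, and simply replays them at the original URLs; the archived entries are invented.

    # A sketch of the stripped-down archival stack: an application that
    # stores each resource's representation and content type, and simply
    # replays them at the original URLs. The stored entries are hypothetical.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    ARCHIVE = {
        "/blog/2007/01/27": ("application/xhtml+xml",
                             b"<html><body><h1>Are we there, yet?</h1></body></html>"),
    }

    class ArchiveHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            entry = ARCHIVE.get(self.path)
            if entry is None:
                self.send_error(404)
                return
            content_type, body = entry
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), ArchiveHandler).serve_forever()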

If all of this goes to plan your documents will still be network accessible long after your bones have turned to dust. However, this still assumes the data formats of today can be understood, or at worst translated into an equivalent form in the future. I suggest that we have already travelled a few decades with HTML, and that we will travel well for another few decades. We can still read the oldest documents on the web. With good standards management it is likely this will still be the case in 100 years. Whether the document paradigm that HTML sits in will still exist in 100 years is another question. We may find that these flat documents have to be mapped into some sort of immersive virtual environment in that time. The librarians will have to keep up to date with these trends to ensure ongoing viability of the content.

I see the role of librarian and of system administrator as becoming more entwined in the future. I see the librarian as a custodian for information that would otherwise be lost. Will today's libraries have the foresight to set the necessary wheels in motion? How much information will be lost before someone steps in and takes over the registration and service of discontinued domains?

Benjamin