Sound advice - blog

Tales from the homeworld

My current feeds

Sun, 2005-Mar-13

Where does rdf fit into broader architecture?

I've been pondering this question lately. RDF is undoubtedly novel and interesting, and its possibilities are cool... but where and how will it be used if it is ever to escape the lab in a real way? What are its use cases?

I like to think that sysadmins are fairly level-headed people who don't tend to get carried away by hype, so when I read this post by David Jericho I did so with interest. David wants to use RDF, with the Kowari engine as a back end. In particular, he wants to use it as a kind of data warehousing application which he can draw graphs from and add ad hoc data to as time goes on. As Andrew Newman points out in a late 2004 blog entry, RDF used in this way can be an agile database.

That doesn't seem to be RDF's main purpose. The extensive use of URIs gives it away pretty quickly that RDF is designed for the web, not the desktop or the database server. It's designed as essentially a hyperlinking system between web resources. RDF is not meant to contain anything more complicated than elementary data along with its named, reversible hyperlinks. RDF Schema takes this up a notch by stating equivalences and relationships between the predicate types (hyperlink types) so that more interesting relationships can be drawn up.
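That model is simple enough to sketch in a few lines of plain Python. Every statement is a (subject, predicate, object) triple, and the predicate names the kind of hyperlink. All of the URIs here are invented for the example:

```python
# Each RDF statement is a (subject, predicate, object) triple of
# URIs or literal values. These URIs are made up for illustration.
triples = [
    ("http://example.org/post/42",
     "http://purl.org/dc/elements/1.1/title", "Where does RDF fit?"),
    ("http://example.org/post/42",
     "http://purl.org/dc/elements/1.1/creator",
     "http://example.org/people/benjamin"),
    ("http://example.org/people/benjamin",
     "http://xmlns.com/foaf/0.1/name", "Benjamin"),
]

def objects(subject, predicate):
    """Follow a named hyperlink: all objects for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow the "creator" link from the post, then the "name" link
# from the creator -- two typed hops across the graph.
author = objects("http://example.org/post/42",
                 "http://purl.org/dc/elements/1.1/creator")[0]
print(objects(author, "http://xmlns.com/foaf/0.1/name"))  # ['Benjamin']
```

The point of the sketch is that the predicate is the interesting part: unlike a plain web hyperlink, each edge says *what kind* of relationship it asserts.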

But how does this fit into the web? The web is already a place of resources and hyperlinks. Those hyperlinks don't have a type; they always mean "look here for more detailed information". Where does RDF fit?

Is RDF meant to help users browse the web? Will a browser extract this information and build some kind of navigation toolbar to related resources? Wouldn't the site's author normally prefer to define that toolbar themselves? Will it help us to write web applications in a way that snippets of domain-specific XML wouldn't? Is it designed to help search engines infer knowledge about sites, knowledge that has no direct bearing on users' use of the site itself? One suspects that technology that doesn't directly benefit the user will never be used.

So let's look back at the database side. The RDF Primer says "The Resource Description Framework (RDF) is a language for representing information about resources in the World Wide Web." So maybe it isn't really meant to be part of the web at all. The web seems to have gotten along fine without it so far. Perhaps it's just a database about the web's resources. Such databases might be useful to share around, and in that respect it is useful to serialise them and put them onto the web, but basically we're talking about RDF being consumed by data mining tools rather than by tools individuals use to browse the web. It's not about consuming a single RDF document and making sense of it. It's more about pulling a dozen RDF documents together and seeing what you can infer from the greater whole. Feed aggregators such as Planet HUMBUG might be the best example so far of how this might be useful to end users, although this use is also very application-specific and relies more on XML than RDF technology.
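The "pull a dozen documents together" idea can be sketched as nothing more exotic than set union over triples. In this made-up example (predicates abbreviated as dc: and foaf: for readability, URIs invented), neither document alone can answer the question, but the merged graph can:

```python
# Two independently published "documents", each just a set of triples.
doc_a = {
    ("http://example.org/feed/item1", "dc:creator",
     "http://example.org/people/alice"),
}
doc_b = {
    ("http://example.org/people/alice", "foaf:mbox",
     "mailto:alice@example.org"),
}

# Because both documents name Alice with the same URI,
# merging the graphs is simply set union.
merged = doc_a | doc_b

def objects(graph, subject, predicate):
    return {o for s, p, o in graph if s == subject and p == predicate}

# Who wrote item1, and what is their mailbox? Only the merged
# graph can answer both hops.
author = next(iter(objects(merged, "http://example.org/feed/item1",
                           "dc:creator")))
print(objects(merged, author, "foaf:mbox"))
```

Shared URIs are what make the union meaningful: the two documents join on the identifier, not on any schema agreed in advance.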

So, we're at the point where we understand that RDF is a good serialisation for databases about the web, or more specifically about resources. It's a good model for those things as well, or will be once the various competing query languages finally coalesce. It has its niche between the web and the regular database... but how good would it be at replacing traditional databases, as David wants to do?

You may be aware that I've been wanting to do this too. My vision isn't as grand in scope as "a database about the web". I just want "a database about the desktop", and "a database about my finances". What I would like to do is to put Evolution's calendar data alongside my account data, and alongside my share price data. Then I'd like to mine all three for useful information. I'd like to be able to record important events and deadlines from the accounting system into the Evolution data model. I'd like to be able to pull them both together and correlate a huge expense to that diary entry that says "rewire the house". RDF seems to be the ideal tool.
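That correlation is a join across two datasets that share nothing but a date. A sketch with invented sample data standing in for the accounting and calendar models:

```python
from datetime import date

# Invented sample data: one record set from the accounting system,
# one from the calendar.
transactions = [
    {"desc": "Electrical contractor", "amount": 4200.00,
     "date": date(2005, 2, 14)},
    {"desc": "Groceries", "amount": 86.50, "date": date(2005, 2, 14)},
]
diary = [
    {"summary": "rewire the house", "date": date(2005, 2, 14)},
    {"summary": "HUMBUG meeting", "date": date(2005, 2, 26)},
]

def explain(threshold=1000.00):
    """For each unusually large expense, find diary entries on the
    same day that might explain it."""
    hits = []
    for t in transactions:
        if t["amount"] >= threshold:
            notes = [d["summary"] for d in diary if d["date"] == t["date"]]
            hits.append((t["desc"], t["amount"], notes))
    return hits

print(explain())
# [('Electrical contractor', 4200.0, ['rewire the house'])]
```

With both datasets expressed as triples in one store, this becomes a single query rather than application code that knows about both file formats.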

RDF will allow me to refer to web resources. Using relative URIs with an RDF/XML serialisation seems to allow me to refer to files on my desktop outside the RDF document itself, although they might not be readily transferable. Using blank nodes or URIs underneath that of the RDF document's own URI we can uniquely identify things like transactions and accounts (or perhaps it's better to externalise those)... but what about the backing store? Also, how do we identify the type of a resource when we're not working through HTTP to get the appropriate MIME type? How does the RDF data relate to other user data? How does this relate to REST, and is REST a concept applicable to desktop applications just as it is to the web?
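The relative-URI point can be made concrete with the standard library (the file paths are invented). Relative references resolve against the document's own URI, which is exactly why they stop working once the document moves:

```python
from urllib.parse import urljoin

# The RDF/XML document's own URI acts as the base for resolution.
base = "file:///home/benjamin/data/finances.rdf"

# A relative reference inside the document points at a sibling file...
print(urljoin(base, "receipts/2005-02.png"))
# file:///home/benjamin/data/receipts/2005-02.png

# ...and a fragment identifier mints a name "underneath" the
# document's URI, e.g. for a transaction with no web presence.
print(urljoin(base, "#txn-0042"))
# file:///home/benjamin/data/finances.rdf#txn-0042
```

Copy finances.rdf to another machine and the base changes, so every relative reference silently points somewhere else; that is the transferability problem in a nutshell.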

In the desktop environment we have applications, some of which may or may not be running at any given time. These applications manage files (resources) that are locally accessible to the user. The files can be backed up, copied, and moved around. That's what users do with files. The files themselves contain data. The data may be in the form of essentially opaque documents that only the original authoring application can read or write. It may be in a common shared format. It may even be in a format representable as RDF, and thus accessible to an RDF knowledge engine. Maybe that's not necessary, with applications like Beagle that seem to be able to support a reasonable level of knowledge management without an explicit need for data homogeneity. Instead, Beagle uses drivers to pull various files apart and throw them into its index. Beagle is focused on text searching, which is not really RDF's core concern... I wonder what will happen when those two worlds collide.

Anyway. I kind of see a desktop knowledge engine working the same way. Applications provide their data as files with well-known and easy to discern formats. A knowledge management engine has references to each of the files, and whenever it is asked for data pertaining to a file it first checks to see if its index is up to date. If so, it uses the cached data. If not, it purges all data associated with the file and replaces it with current file content. Alternatively, the knowledge manager becomes the primary store of that file and allows users to copy, backup, and modify the file in a similar way to that supported by the existing filesystem.
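The "check if the index is up to date" step could be as simple as comparing modification times. This is only a sketch of the idea, not how Beagle or any real engine does it:

```python
import os

class Index:
    """Caches extracted triples per file, keyed by modification time."""

    def __init__(self):
        self.cache = {}  # path -> (mtime, triples)

    def triples_for(self, path, extract):
        """Return cached triples if the file is unchanged; otherwise
        purge the stale entry and re-extract with extract(path)."""
        mtime = os.stat(path).st_mtime
        cached = self.cache.get(path)
        if cached and cached[0] == mtime:
            return cached[1]            # index is current: use the cache
        triples = extract(path)         # stale or missing: rebuild
        self.cache[path] = (mtime, triples)
        return triples
```

The extract callable stands in for the per-format drivers: each file type supplies its own way of turning a file into triples, while the freshness logic stays generic.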

I think it does remain important for the data itself to be grouped into resources, and that it not be seen as outside the resources (files) model. Things just start to get very abstract when you have a pile of data pulled in from who-knows-where and are trying to infer knowledge from it. Which parts are reliable? Which are current? Does that "Your bank account balance is $xxx" statement refer to the current time, or is it old data? I think people understand files and can work with them. I think a file or collection paradigm is important. At the same time, I think it's important to be able to transcend file boundaries for query and possibly for update operations. After all, it's that transcending of file boundaries by defining a common data model that is really at the heart of what RDF is about.

<sigh>, I started to write about the sad state of the triplestores available today. My comments were largely sniping from the position of not having an RDF equivalent of SQLite. I'm not sure that's true after some further reading. It's clear that Java has the most mature infrastructure around for dealing with RDF, but it also seems that the query languages still haven't been agreed on and that there are still a number of different ways people are thinking about RDF and its uses.
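For what it's worth, the shape of the "RDF equivalent of SQLite" I have in mind is tiny. Here is a toy sketch built on Python's built-in sqlite3 module; the single-table schema is my own invention, not any real triplestore's layout:

```python
import sqlite3

class TripleStore:
    """A toy file-backed triplestore: one table of (s, p, o) strings."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS triples "
            "(s TEXT, p TEXT, o TEXT, UNIQUE (s, p, o))")

    def add(self, s, p, o):
        self.db.execute(
            "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", (s, p, o))

    def match(self, s=None, p=None, o=None):
        """Query with any combination of bound and unbound terms."""
        clauses, args = [], []
        for col, val in (("s", s), ("p", p), ("o", o)):
            if val is not None:
                clauses.append(col + " = ?")
                args.append(val)
        where = " WHERE " + " AND ".join(clauses) if clauses else ""
        return self.db.execute(
            "SELECT s, p, o FROM triples" + where, args).fetchall()

store = TripleStore()
store.add("ex:acct", "ex:balance", "123.45")
store.add("ex:acct", "rdf:type", "ex:Account")
print(store.match(s="ex:acct", p="ex:balance"))
# [('ex:acct', 'ex:balance', '123.45')]
```

No server, no sockets, just a library talking to a file on disk, which is exactly the desktop-facing property I keep wishing Kowari had.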

Perhaps I'll write more later, but for now I think I'll just leave a few links lying around to resources I came across in my browsing:

I've been writing a bit of commercial Java code lately (in fact, my first line of Java was written about a month ago now). That's the reason I'm looking at Kowari again. I'm still not satisfied with the closed-source nature of existing Java base technology. Kaffe doesn't yet run Kowari, as it seems to be missing relevant NIO features, and can't even run the iTQL command line given that it is missing Swing components. I don't really want to work with something like Kowari until that is ironed out, but if I'm ever going to get back to writing my accounting app the first step will be to throw the current prototype out and start again with a more RDF-oriented back end. I'm concerned about the network-facing (rather than desktop-facing) nature of Kowari and am still not convinced that it will be appropriate for what I want. I would prefer a set of libraries that allow me to talk to the file system, rather than a set that allows me to talk to a server. Opening network sockets to talk to your own applications on a multi-user machine is asking for trouble, and not just security trouble.

Given what I've heard about Kowari, though, I'm willing to keep looking at it for a while longer. I've even installed the non-free Java SDK on my Debian PC for the purpose of research into this area. If the non-free status of the supporting infrastructure can be cleaned up, perhaps Kowari could still do something for me in my quest to build desktop knowledge infrastructure. On the other hand, I may still just head back to Python and Redland with an SQLite back end.

Benjamin