Sound advice - blog

Tales from the homeworld

Wed, 2006-Mar-29

Desktop Identifiers

The Resource Description Framework (RDF) is based around the use of Uniform Resource Identifier (URI) references. An RDF statement identifies a subject, an object, and a predicate. The subject is a URI that says what the statement is about. The object is a URI or a literal value. The predicate is a URI that identifies the relationship between the subject and object. A collection of statements forms a graph, and this graph is sufficient to describe a logical model of anything and everything. Anything that understands the meaning of the whole set of predicates can understand the whole graph. Anything that understands a subset of the predicates will understand a corresponding subset of the graph. Different parts of the graph can be controlled by different agencies, so long as each identifier used in the graph is unique. The uniqueness of identifiers is the cornerstone of making the system work.
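As a rough illustration of the point about predicate subsets, here is a minimal sketch that models statements as plain (subject, predicate, object) tuples. Every URI in it is hypothetical and lives under example.org, the helper name understood_subgraph is my own invention, and literals are simplified to bare strings rather than typed RDF literals.

```python
# Each statement is a (subject, predicate, object) tuple.
# All URIs are hypothetical and exist only for this illustration.
EX = "http://example.org/"

graph = {
    (EX + "txn/31", EX + "terms/hasEntry", EX + "entry/1"),
    (EX + "entry/1", EX + "terms/account", EX + "account/12"),
    (EX + "entry/1", EX + "terms/amount", "50.00"),
    (EX + "account/12", EX + "ledger/riskRating", "low"),
}


def understood_subgraph(statements, known_predicates):
    """Return only the statements whose predicate the agent understands."""
    return {s for s in statements if s[1] in known_predicates}


# An agent that knows the core vocabulary but not the ledger vocabulary.
core_predicates = {
    EX + "terms/hasEntry",
    EX + "terms/account",
    EX + "terms/amount",
}
visible = understood_subgraph(graph, core_predicates)
```

The agent sees the three core statements and simply skips the riskRating statement. That only works because no two agencies have used the same predicate URI to mean different things.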

The deep dark secret of URIs is that they are hard to come up with. The problem of a single URI having multiple meanings has been reasonably well canvassed, but the initial URI selection is still a difficult problem. What is the correct URI to use for the ISO 4217 currency code AUD (AUstralian Dollar)? Should iso4217 be used as a scheme to make iso4217:AUD the URI? Is the scheme just "iso", and the URI iso:4217:AUD? Do we trust OASIS and use urn:oasis:ubl:codeList:ISO4217:Currency%20Code:3:5:ISO::AUD? How about MDDL, or www.xe.com?

Scaling down a touch, which URI do I use to identify a record in my email agent's address book? What about for a file in my filesystem? Is file:/home/fuzzy/accounts.db good enough? How about http://localhost:1234/? Just as in the iso4217 case, I have an identifier. I just don't have an agreed context to work with. Sean McGrath writes:

Utterances are always a rich steamy broth of the extensional and the contextual. The context bit is what makes us human. We take short-cuts in utterances all the time. That is the context. Obviously, this drives computers mad because computers don't do context.

The problem is not unique to RDF. Whenever we have two databases managed by different applications that want to refer to each other, we have a problem. Just how much context do we provide? If I want my accounting application to relate somehow to my email client's address book, what is the best way to do it? If I want my stock market monitor application to match up with my accounting application's records, what key should I use? If the two pieces of data were in the same relational database the problem would be easy to solve, but the schemas of these two databases are controlled by different agencies. In the general case their data models should be able to evolve independently of each other, but there are points at which their data models interact. Those points should be controlled with identifiers that carry enough context to determine whether they indeed refer to the same entity.

I am finding myself itching to solve the desktop accounting problem again. I want to define the cornerstone of the overall data model now. I want to define what a transaction looks like. Transactions have entries, and transaction entries link to accounts. I want the model of what an account looks like to evolve separately from that of a transaction, because it is a much fuzzier concept. It has a lot to do with strange ledgers that refer to specific problem domains. These problem domains don't impact the core transaction representation, nor do they impact the major financial reporting and querying activities. I would like to be able to provide hard dependable definitions of the hard dependable parts of my data model without setting soft definitions in concrete.

I feel like the best way to achieve something like that is to have a database of transactions alongside one or more databases of accounts. A common key could bind the two data models together. Transactions themselves could have extra information attached to them in a separate database. Core query and reporting capabilities need only depend on information in the core transactions database. Clever reports and domain-specific ledgers could make use of additional information to mark up transactions and accounts. The ideal key to bind these databases together would be a uniform identifier. That would allow me to unambiguously move these databases around and combine them with other databases in different contexts. Within a single database I could use simple integer keys (or RDF blank nodes). In a universal database I need to use uniform identifiers. Is there a middle line for databases that are spread only across a desktop or corporate context, or is there an easy universal scheme I could use?
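To make the split concrete, here is a small sketch using Python's sqlite3 module, with a core transactions database and a separately attached accounts database bound together by a shared text key. The table names, column names, and the urn:example keys are all assumptions made for illustration. Both databases are kept in memory so the example is self-contained.

```python
import sqlite3

# The core transactions database. A real setup would use a file path.
conn = sqlite3.connect(":memory:")
# A second, independent database holding the fuzzier account model.
conn.execute("ATTACH DATABASE ':memory:' AS accounts")

conn.execute(
    "CREATE TABLE main.entry (txn_id TEXT, account_id TEXT, amount_cents INTEGER)"
)
conn.execute("CREATE TABLE accounts.account (account_id TEXT PRIMARY KEY, name TEXT)")

# Hypothetical identifiers; the shared account_id is the only coupling.
conn.executemany(
    "INSERT INTO main.entry VALUES (?, ?, ?)",
    [
        ("urn:example:txn:31", "urn:example:account:cash", -5000),
        ("urn:example:txn:31", "urn:example:account:rent", 5000),
    ],
)
conn.executemany(
    "INSERT INTO accounts.account VALUES (?, ?)",
    [
        ("urn:example:account:cash", "Cash"),
        ("urn:example:account:rent", "Rent"),
    ],
)

# A core report: needs nothing beyond the transactions database.
balances = conn.execute(
    "SELECT txn_id, SUM(amount_cents) FROM main.entry GROUP BY txn_id"
).fetchall()

# A richer report: joins across the two databases on the common key.
named_entries = conn.execute(
    "SELECT a.name, e.amount_cents FROM main.entry AS e "
    "JOIN accounts.account AS a ON a.account_id = e.account_id "
    "ORDER BY a.name"
).fetchall()
```

The first query checks that each transaction balances without ever touching the accounts database. The second only works while both databases agree on what the key means, which is exactly where the choice of identifier starts to matter.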

We are pretty much working in the world of machine-generated identifiers now. That may mean we can take Microsoft's old favourite technique on board and make use of a machine-generated globally unique identifier. Human-readability is not all that important, so long as the identifier is easy to generate and otherwise work with in the database. Full GUIDs could be used wherever an identifier is needed, in the form urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6 as per RFC 4122. We could alternatively try to use the GUID only as the context part of the identifier, e.g. http://localhost/uuid/f81d4fae-7dec-11d0-a765-00a0c91e6bf6/321 for record 321. We can't attach 321 to the urn:uuid URI because the RFC does not permit it, but this localhost business is still a grand hack.
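Both forms are easy to produce with Python's standard uuid module. This is only a sketch: the function names are hypothetical, and I have assumed random (version 4) UUIDs for new identifiers, although the example UUID from the text is a time-based one.

```python
import uuid


def new_identifier() -> str:
    """Mint a fresh identifier in urn:uuid form (random UUID assumed)."""
    return uuid.uuid4().urn


def record_uri(database_id: uuid.UUID, record_number: int) -> str:
    """Build the localhost-style URI, using the UUID only as context."""
    return f"http://localhost/uuid/{database_id}/{record_number}"


# The example UUID used in the text.
example = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")

whole_identifier = example.urn
record_321 = record_uri(example, 321)
```

Here whole_identifier is the urn:uuid form and record_321 is the localhost form for record 321. The first costs one UUID per record; the second costs one per database plus whatever local key the database already has.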

We could dodge the whole question of context for a time by using a relative URI or URI reference. If we treat the database as a document with its own URI, we could use the identifier "#transaction31" to stand for a unique identifier within the document. This doesn't solve the problem, really, because chances are the database is being located at either file:/home/benjamin/my.db (giving a full URL of file:/home/benjamin/my.db#transaction31) or at http://localhost:1234/ (giving a full URL of http://localhost:1234/#transaction31). Importantly, anything that refers to the identifier using either one of these paths depends on the same port on localhost being opened every time your application starts. It depends on the database being found at the same path every time. In fact, we could make use of a relative URI again. If I have a database at file:/home/benjamin/my.db and another at file:/home/benjamin/myother.db, the two could refer to each other with the relative paths "my.db" and "myother.db". They could refer to each other's identifiers as "my.db#transaction31" and "myother.db#account12". So long as both files move together, their context really could be for the most part ignored.
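The resolution rules can be tried out with urljoin from Python's urllib.parse. In this sketch the resolve helper and the /mnt/backup location are hypothetical, and the base URIs are written in the file:/// form, which names the same files as the file:/ form used above.

```python
from urllib.parse import urljoin, urlsplit


def resolve(database_uri: str, reference: str) -> str:
    """Resolve a relative reference against the URI of the database holding it."""
    return urljoin(database_uri, reference)


home = "file:///home/benjamin/my.db"
moved = "file:///mnt/backup/my.db"  # hypothetical location after a move

# A fragment-only reference stays inside the same database.
local = resolve(home, "#transaction31")

# A relative path reaches the sibling database in the same directory.
sibling = resolve(home, "myother.db#account12")

# After both files move together, the same relative reference still works.
sibling_after_move = resolve(moved, "myother.db#account12")

print(urlsplit(sibling).path, urlsplit(sibling_after_move).path)
```

The stored reference "myother.db#account12" never changes. Only the absolute form it resolves to changes, which is why anything outside the pair that recorded the absolute form breaks when the files move.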

Perhaps these non-universal universal identifiers are good enough. Perhaps we will never use these databases outside of the context of their original paths on their original machines. Perhaps we will learn to control the movement of documents and data around a desktop as carefully as we must on the open internet. Perhaps a DNS-style abstraction layer is the solution. I think choosing an identifier is still a hard problem, especially at the cusp of the online and offline worlds.

Benjamin