Sound advice - blog

Tales from the homeworld

Mon, 2006-Jan-16

On XML Language Design

Just when I'm starting to think seriously about how to fit a particular data schema into XML, Tim Bray is writing on the same subject. His advice is chiefly "don't invent a new language... use xhtml, docbook, odf, ubl, or atom", followed by "put more effort into extensibility than everything else".

His column was picked up all over the web, including by Danny Ayers. Danny dives into a discussion of how to build an RDF model, rather than an XML language:

When working with RDF, my current feeling (could be wrong ;-) is that in most cases it’s probably best to initially make up afresh a new representation that matches the domain model as closely as possible(/appropriate). Only then start looking to replacing the new terms with established ones with matching semantics. But don’t see reusing things as more important than getting an (appropriately) accurate model. (Different approaches are likely to be better for different cases, but as a loose guide I think this works.)

I've been following more of the Tim/microformats approach, which is to start with an established model and extend minimally. I think Tim's stated advantages to this approach are compelling, particularly the increased likelihood that software you didn't write will understand your input. When your machine interfaces to my machine, I want both to require minimal change in order for one to understand the other. I'm not sure the same advantages are available to an RDF schema that simply borrows terms from established vocabularies. Borrowing predicate terms and semantics is useful, but the most useful overlaps between schemas will be terms for specific subject types and instances.

From Tim,

There are two completely different (and fairly incompatible) ways of thinking about language invention. The first, which I’ll call syntax-centric, focuses on the language itself: what the tags and attributes are, and which can contain which, and what order they have to be in, and (even more important) on the human-readable prose that describes what they mean and what software ought to do with them. The second approach, which I’ll call model-centric, focuses on agreeing on a formal model of the data objects which captures as much of their semantics as possible; then the details of the language should fall out.

I think I fall on Tim's syntax-centric side of the fence. I understand the utility of defining a model as part of language design, but I think this will rarely be the model that software speaking the new language uses internally. Any software that actually wants to do anything with documents in your language will transform the data into its own internal representation. Sometimes this will be so that it can support more than one language: Liferea understands RSS, Atom, and a number of other formats. Sometimes it will be related to the way a program maps your data onto its graphical elements, where it may be more useful to refer to a list or map than a graph.
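As a rough sketch of that kind of translation (this is not how Liferea actually does it), the following Python reads either an RSS 2.0 or an Atom document and reduces both to the same internal list of dictionaries. The function name `parse_feed`, the two sample documents, and the choice of `title` and `link` as the only fields kept are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 namespace, written in ElementTree's "{uri}tag" form.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample documents carrying the same entry in two languages.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <title>Example</title>
  <item><title>Hello</title><link>http://example.org/1</link></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>Hello</title><link href="http://example.org/1"/></entry>
</feed>"""


def parse_feed(xml_text):
    """Map an RSS 2.0 or Atom document onto one internal model:
    a plain list of {'title': ..., 'link': ...} dictionaries."""
    root = ET.fromstring(xml_text)
    items = []
    if root.tag == "rss":
        for item in root.iter("item"):
            items.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            })
    elif root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            # Atom carries the URL in an href attribute, not in element text.
            link = entry.find(ATOM + "link")
            items.append({
                "title": entry.findtext(ATOM + "title", default=""),
                "link": link.get("href", "") if link is not None else "",
            })
    else:
        raise ValueError("unsupported feed language: " + root.tag)
    return items
```

Calling `parse_feed(RSS_SAMPLE)` and `parse_feed(ATOM_SAMPLE)` yields the same list, which is the point: the rest of the program only ever sees the list, never the model either language was designed around.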

I think a trap one could easily fall into with RDF is to think that the model is important and the syntax is not. This changes a syntax->xml-dom-model->internal-model translation in an app that implements the language into a syntax->xml-dom-model->rdf-graph-model->internal-model translation. With the variety of possible RDF encodings (even just considering the variation allowed in XML), it isn't really possible to parse an XML document based on its RDF schema alone. It must first be handled by RDF-specific libraries, then transformed. I think that, in code, transforming from the lists, maps, and hierarchy of an XML DOM is typically easier than transforming from the graphs and triples of an RDF model.
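To make the encoding problem concrete, here is a small sketch using only Python's standard XML library. RDF/XML allows a literal-valued property to be written either as a child element or as an attribute on the node element, so the two hypothetical documents below describe the same triple. The URIs, the sample text, and the helper name `title_by_dom_path` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The property written as a child element.
AS_ELEMENT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/post">
    <dc:title>On XML Language Design</dc:title>
  </rdf:Description>
</rdf:RDF>"""

# The same triple, written with the property-attribute abbreviation.
AS_ATTRIBUTE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/post"
                   dc:title="On XML Language Design"/>
</rdf:RDF>"""


def title_by_dom_path(xml_text):
    """Naive DOM-level lookup that assumes one particular XML shape."""
    root = ET.fromstring(xml_text)
    # Returns None when no dc:title child element exists at this path.
    return root.findtext("rdf:Description/dc:title", namespaces=NS)
```

The lookup finds the title in `AS_ELEMENT` and returns `None` for `AS_ATTRIBUTE`, even though an RDF library would report identical graphs for both. A consumer that wants to accept any valid encoding therefore cannot stop at the DOM.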

From Danny:

This [starting with your own model, then seeing which terms you can exchange for more general ones already defined] is generally the opposite of what Tim suggests for XML languages, but there is a significant difference. Any two (or however many) RDF vocabularies/models/syntaxes can be used together and there will be a common interpretation semantics. Versioning is pretty well built in through schema annotations (esp. with OWL).

There isn’t a standard common interpretation semantics for XML beyond the implied containership structure. The syntax may be mixable (using XML namespaces and/or MustIgnore) but not interpretable in the general case.

Extensibility has to be built into the host language in XML. It should be possible to add extension elements with a defined meaning for anyone who understands both the host language and the extension. I don't think aggregation is an important concept yet for XML, although if Google Base proves useful I may start to revise that view. I think that aggregation is presently still something you do from the perspective of a particular host language or application domain, such as atom or "syndication". From that perspective there is currently little value in common interpretation semantics for XML, as it will only be parsed by software that understands the specific XML semantics.
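A minimal sketch of what that looks like for a consumer, assuming a made-up host language with a `note` root and `title` and `body` children, plus a made-up `geo:point` extension: the reader keeps what it understands and skips everything else instead of rejecting the document. None of these element names or the `read_note` helper come from a real vocabulary.

```python
import xml.etree.ElementTree as ET

# Elements this particular consumer understands (hypothetical host language).
KNOWN = {"title", "body"}

# A document that carries an extension element from another namespace.
SAMPLE = """<note>
  <title>Hi</title>
  <geo:point xmlns:geo="http://example.org/geo">1.0 2.0</geo:point>
  <body>Text</body>
</note>"""


def read_note(xml_text):
    """Read the known children of a note and ignore the rest.

    Returns (note, ignored): the understood fields as a dict, and the
    tags that were skipped, so a caller can log them if it wants to."""
    root = ET.fromstring(xml_text)
    note, ignored = {}, []
    for child in root:
        if child.tag in KNOWN:
            note[child.tag] = (child.text or "").strip()
        else:
            # Unknown element: skip it rather than fail.
            ignored.append(child.tag)
    return note, ignored
```

Software that also understands the extension would add its namespaced tag to the set of elements it handles. Everyone else still gets a usable title and body from `read_note(SAMPLE)`.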

I have not yet seen a use I consider compelling for mustUnderstand to support extensibility; however, I am completely convinced of the need for mustIgnore semantics. I am also convinced that one should start with established technologies and extend them minimally wherever there is a good overlap. While this might not always be possible, I think it will be in a reasonable proportion of cases.