I've spent some of this afternoon playing with Dia.
I have played with it before and found it wanting, but that was coming from
a particular use case.
At work I've used Visio
extensively, starting with the version created before Microsoft purchased
the program and began integrating it with their office suite. As I've
mentioned previously, I use a public domain stencil set for authoring UML2
that I find useful in producing high-quality print documentation. Coming to
Dia from this perspective, I found it very difficult to put together
visually appealing diagrams in any reasonable amount of time.
Today I started from the perspective of using Dia as a software authoring
tool, much like Visio's standard UML stencils are supposed to support but with
my own flavour to it. Dia is able to do basic UML editing and, because it
saves to an XML file (compressed with gzip), it is possible to actually use
the information you've created. Yay!
I created a couple of XSLT stylesheets to transform a tiny
restricted subset of Dia UML diagrams into a tiny restricted subset of RDF
Schema. I intend to add to the supported set as I find a use for it, but for
now I only support statements that indicate the existence of certain classes
and of certain properties. I don't currently describe range, domain, or
multiplicity information in the RDFS, but this is only meant to be a rough
scribble. Here's what I did:
- First, uncompress the dia diagram:
$ gzip -dc foo.dia > foo.dia1
Urrgh. That XML format looks terrible:
<dia:object type="UML - Class" version="0" id="O9">
  <dia:attribute name="obj_pos">
    <dia:point val="20.6,3.55"/>
  </dia:attribute>
  <dia:attribute name="obj_bb">
    <dia:rectangle val="20.55,3.5;27,5.8"/>
  </dia:attribute>
It's almost as bad as the one used by GNOME's Glade!
I'm particularly averse to seeing "dia:attribute" elements when you could have
used actual XML attributes and saved everyone a lot of typing. The other
classic mistake they make is to assume that a consumer of the XML needs to be
told what type to use for each attribute. The fact is that the type of a piece
of data is the least of a consumer's worries. They have to decide where to
put it on the screen, or which field to insert it into in their database.
Seriously, if they know enough to use a particular attribute they'll know its
type. Just drop it and save the bandwidth. Finally (and for no apparent reason)
strings are bounded by hash (#) characters. I don't understand that at all :)
Here's part of the XSLT stylesheet I used to clean it up:
<xsl:for-each select="@*"><xsl:copy/></xsl:for-each>
<xsl:for-each select="dia:attribute[not(dia:composite)]">
  <xsl:choose>
    <xsl:when test="dia:string">
      <xsl:attribute name="{@name}">
        <xsl:value-of select="substring(*,2,string-length(*)-2)"/>
      </xsl:attribute>
    </xsl:when>
    <xsl:otherwise>
      <xsl:attribute name="{@name}">
        <xsl:value-of select="*/@val"/>
      </xsl:attribute>
    </xsl:otherwise>
  </xsl:choose>
</xsl:for-each>
<xsl:apply-templates select="node()">
  <xsl:with-param name="parent" select="$parent"/>
</xsl:apply-templates>
Ahh, greatly beautified:
$ xsltproc normaliseDia.xsl foo.dia1 > foo.dia2
<dia:object type="UML - Class" version="0" id="O9" obj_pos="20.6,3.55" obj_bb="20.55,3.5;27,5.8" elem_corner="20.6,3.55"...
This brings the uncompressed byte count for my particular input file from in
excess of 37k down to a little over 9k, although it only reduces the size of
the compressed file by 30%.
Most importantly, it is now much simpler to write the final stylesheet, because
now I can get at all of those juicy attributes just by saying @obj_pos, and
@obj_bb. If I had really been a cool kid I would probably have folded the
"original" attributes of the object (type, version, id, etc) into the dia
namespace while allowing other attributes to live in the null namespace.
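For anyone who would rather script this step than write XSLT, the same normalisation can be roughly sketched in Python. This is only an illustration under assumptions: the namespace URI bound to the "dia" prefix is assumed to be the one below, the function names flatten and normalise are invented for the sketch, and the input is assumed to be gzip-compressed as described above.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumption: the namespace URI that Dia files declare for the "dia" prefix.
DIA_NS = "http://www.lysator.liu.se/~alla/dia/"
DIA = "{" + DIA_NS + "}"


def flatten(element):
    """Fold simple dia:attribute children of `element` into XML attributes, in place."""
    for child in list(element):
        if child.tag == DIA + "attribute" and child.find(DIA + "composite") is None:
            values = list(child)
            if not values:
                continue  # nothing to fold, leave the element alone
            value = values[0]
            if value.tag == DIA + "string":
                # Strings are wrapped in '#' characters: drop the first and last one.
                folded = (value.text or "")[1:-1]
            else:
                folded = value.get("val", "")
            element.set(child.get("name"), folded)
            element.remove(child)
        else:
            # Composite attributes and all other elements are processed recursively.
            flatten(child)


def normalise(path):
    """Read a gzip-compressed Dia file and return its flattened root element."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    flatten(root)
    return root
```

Calling normalise("foo.dia") would then give an element tree where values such as obj_pos and obj_bb are ordinary attributes of each object.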
Now that that is complete, the final stylesheet is nice and simple (I've
cut only the actual stylesheet declaration, including the namespace declarations):
<xsl:template match="/">
  <rdf:RDF>
    <xsl:for-each select="//dia:object[@type='UML - Class']">
      <xsl:variable name="classname" select="@name"/>
      <rdfs:Class rdf:ID="{$classname}"/>
      <xsl:for-each select="dia:object.attributes">
        <rdfs:Property rdf:ID="{concat($classname,'.',@name)}"/>
      </xsl:for-each>
    </xsl:for-each>
    <xsl:for-each select="//dia:object[@type='UML - Association']">
      <rdfs:Property rdf:ID="{@name}"/>
    </xsl:for-each>
  </rdf:RDF>
</xsl:template>
Of course, it only does a simple job so far:
$ xsltproc diaUMLtoRDFS.xsl foo.dia2 > foo.rdfs
<rdf:RDF xmlns...>
<rdfs:Class rdf:ID="Account"/>
<rdfs:Property rdf:ID="Account.name"/>
<rdfs:Class rdf:ID="NumericContext"/>
<rdfs:Property rdf:ID="NumericContext.amountDenominator"/>
<rdfs:Property rdf:ID="NumericContext.commodity"/>
...
My only problem now is that I don't really seem to be able to do anything
much useful with the RDF schema, other than describe the structure of the
data to humans, which the original diagram does more intuitively. I do have
a script which constructs an SQLite schema from RDFS, but I really don't have
anything to validate the RDFS against, and I'm not aware of any program that
will validate RDF data against the schema. Perhaps there's
something in the Java sphere I should look into.
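To make the database idea concrete, here is a minimal sketch of how a schema could be derived from output like that shown earlier. It is not the script mentioned above. The mapping is an assumption made for illustration: each class becomes a table, each property named "Class.property" becomes a TEXT column, and properties without a class prefix (such as associations) are skipped. The function name rdfs_to_ddl is hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"


def rdfs_to_ddl(rdfs_text):
    """Return one CREATE TABLE statement per rdfs:Class found in `rdfs_text`."""
    root = ET.fromstring(rdfs_text)
    columns = {el.get(RDF + "ID"): [] for el in root.iter(RDFS + "Class")}
    # The output above spells properties rdfs:Property; standard RDF Schema
    # uses rdf:Property, so this sketch accepts either spelling.
    props = list(root.iter(RDFS + "Property")) + list(root.iter(RDF + "Property"))
    for prop in props:
        owner, _, name = (prop.get(RDF + "ID") or "").partition(".")
        # Assumption: only "Class.property" identifiers map to columns.
        if name and owner in columns:
            columns[owner].append(name)
    statements = []
    for table, names in columns.items():
        # Assumption: a surrogate integer key plus untyped TEXT columns.
        cols = ["id INTEGER PRIMARY KEY"] + ['"%s" TEXT' % n for n in names]
        statements.append('CREATE TABLE "%s" (%s);' % (table, ", ".join(cols)))
    return statements


def create_schema(connection, rdfs_text):
    """Execute the generated statements on an open sqlite3 connection."""
    for statement in rdfs_to_ddl(rdfs_text):
        connection.execute(statement)
```

Because range and multiplicity are not yet described in the RDFS, every column is plain TEXT here. Identifiers are quoted but not escaped, so this only suits names that contain no double quotes.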
The main point, though, is that Dia has proven a useful tool for a
small class of problems. Schema information that can be most simply described
in a graphical format and is compatible with Dia's way of doing things can
viably be part of a software process.
I think this is important. I have already been heading down this path lately
with XML files. Rather than trying to write code to describe a constrained
problem space, I've been focusing on nailing down the characteristics of the
space and putting them into a form that is human and machine readable (XML)
but is also
information-dense. The sparsity of actual information in some forms of code
(particularly those dealing with processing of certain types of data) can
lead to confusion as to what the actual pass/fail behaviour is. It can be
hard to verify the coded form against a specification, and hard to
reverse-engineer a specification from existing code. The XML approach allows
a clear specification, from which I would typically generate rather than write
the processing code. After that, hand-written code can pass that information
on or process it in any appropriate way. That hand-written code is improved in
density because the irrelevant rote parts have been moved out into the XML
file.
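As a rough illustration of the idea, the sketch below keeps the pass/fail rules for a record in a small XML specification and derives the checking code from it. Everything in it is hypothetical: the record and field element names, the type and required attributes, and the build_validator function are invented for the example. It also interprets the specification at run time rather than generating source code, which is a simplification of the approach described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification: element and attribute names are invented for this sketch.
SPEC = """
<record name="Account">
  <field name="name" type="string" required="true"/>
  <field name="balance" type="integer" required="false"/>
</record>
"""

# Maps the spec's type names to functions that raise on an unacceptable value.
CONVERTERS = {"string": str, "integer": int}


def build_validator(spec_text):
    """Return a function that checks a dict against the fields in `spec_text`."""
    fields = [
        (f.get("name"), CONVERTERS[f.get("type")], f.get("required") == "true")
        for f in ET.fromstring(spec_text).iter("field")
    ]

    def validate(record):
        errors = []
        for name, convert, required in fields:
            if name not in record:
                if required:
                    errors.append("missing " + name)
                continue
            try:
                convert(record[name])
            except (TypeError, ValueError):
                errors.append("bad value for " + name)
        # Anything the specification does not mention is reported as well.
        unknown = sorted(set(record) - {name for name, _, _ in fields})
        errors.extend("unknown field " + name for name in unknown)
        return errors

    return validate
```

The rules live entirely in SPEC, so checking the behaviour against a written specification means reading a few lines of XML rather than tracing through the validation code.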
So what this experiment with Dia means to me is that I have a second human-
and machine-readable form to work with. This time it is in the form of a
diagram, and part of a tool that appears to support some level of extension.
I think this could improve the software process even more for these classes
of problem.
Benjamin