Sound advice - blog

Tales from the homeworld


Sun, 2005-Feb-27

My wishlist RESTful subscription protocol

As part of the HMI prototyping work I've been doing lately I've also put together what I've called the "System Interface" prototype. Essentially, it's a RESTful way to get at the system's data.

It's probably going to be the most enduring part of any HMI development that actually takes place, because currently the only way to talk to our system is pretty much to link against our C++ libraries. That's not impossible for a Java HMI, but it's not straightforward. If you want to talk to the system from anywhere else, you're pretty much toast.

So, what does a system interface look like? The principle I've used so far is REST. It keeps things simple, and leaves me in control of the vocabulary that is spoken back and forth between the system and any clients. My rough picture is that you do an HTTP GET (possibly with a query part in the URL) to access the current value of a piece of data, an HTTP PUT to set the value of some data, and an HTTP POST to ask the piece of data you've identified to do something for you.
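
As a rough illustration (and only that), this is the sort of exchange I have in mind, sketched in present-day Python. The host name, resource path, and payloads are invented; the real vocabulary is whatever the system interface ends up defining:

import urllib.request

# Hypothetical resource name; the real vocabulary belongs to the system interface.
base = "http://system.example/points/pump1/pressure"

# GET the current value (possibly with a query part in the URL)
print(urllib.request.urlopen(base + "?units=kPa").read())

# PUT to set the value of the data
req = urllib.request.Request(base, data=b"101.3", method="PUT")
urllib.request.urlopen(req)

# POST to ask the identified piece of data to do something for us
req = urllib.request.Request(base + "/recalibrate", data=b"", method="POST")
urllib.request.urlopen(req)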

What this doesn't give me is subscription.

When system data changes, it is important that operators are made aware of the change quickly so they can react to it. Our current system uses a proprietary protocol to perform this function, and I want to know whether any existing protocol is going to help me do this in a more standards-compliant way. If none does exist, then perhaps some hints on what I should use as a basis would be useful.

So... here is my model of how the protocol should work:

My list is a little prescriptive, but it is borne out of experience. TCP/IP connections between client and server can be an expensive commodity once you get to a few hundred HMI stations, each asking a single server for subscriptions to a thousand values to display. They should be reused. The only thing I've missed is a keep-alive, which the client should probably be able to specify to ensure the current value is returned at least every n seconds. That way the client also knows when the server has gone silent.

One picture I have is that this extra protocol capability becomes a proprietary extension to HTTP, where a web browser can still connect to our server and fall back to a "regular" GET operation. The response codes would have to be expanded. Each response must indicate the request it is associated with, and the traditional 200 series won't be sufficient for saying "This is a complete response you must process, but hang on the line and I'll give you further updates". Using the 100 series seems equally dodgy.
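
To make that concrete, here is one way a client might consume such an extended GET, sketched in Python. Everything in it is invented for illustration; the header names, the resource, and the framing of the updates are exactly the things such an extension would have to pin down:

import socket

# "Subscribe" and "Keep-Alive-Interval" are made-up header names, for illustration only.
request = (
    "GET /points/pump1/pressure HTTP/1.1\r\n"
    "Host: system.example\r\n"
    "Subscribe: updates\r\n"          # please hold the connection open and send changes
    "Keep-Alive-Interval: 30\r\n"     # and resend the current value at least every 30s
    "\r\n"
)

sock = socket.create_connection(("system.example", 80))
sock.sendall(request.encode("ascii"))

# Read the initial response, then each further update as it arrives.
# How those updates are framed and what response codes they carry is the open question.
with sock.makefile("rb") as stream:
    for line in stream:
        print(line.decode("ascii", "replace").rstrip())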

Another possibility is trying to get something going with the XMPP protocol, but that quickly heads into the land of "no web browser will ever go here". I really would like this to be backwards compatible with HTTP, at least until web browsers commonly support retrieval of documents over more message-oriented protocols.

Benjamin

Sat, 2005-Feb-26

RDF encoding for UML?

Paul Gearon asks

Is there a standard for serialised UML into RDF?

The standard encoding for UML 2.0 is XMI, but RDF Schema already does a nice job of modelling some concepts equivalent to those of UML. The W3C has this to say:

Web-based and non-Web based applications can be described by a number of schema specification mechanisms, including RDF-Schema. The RDF-Schema model itself is equivalent to a subset of the class model in UML.

Here is an attempt to overcome the limitations of XMI by mapping UML to RDF more generally than is supported by RDF Schema. Since RDFS is a subset of UML, this is "similar to defining an alternative RDF Schema specification".

Benjamin

Sat, 2005-Feb-26

CM Synergy

Martin Pool has begun a venture into constructing a new version control system called bazaar-NG. At first glance I can't distinguish it from CVS, or the wide variety of CM tools that have come about recently. It has roughly the same set of commands, and similar concepts of how working files are updated and maintained.

This is not a criticism, in fact at first glance it looks like we could be seeing a nice refinement of the general concepts. Martin himself notes:

I don't know if there will end up being any truly novel ideas, but perhaps the combination and presentation will appeal.

To the end of hopefully contributing something useful to the mix, I thought I would describe the CM system I use at work. When we first started using the product it was called Continuus Change Management (CCM); it has since been bought by Telelogic and rebadged as CM Synergy. Since our earliest use of the product it has been shipped with a capability called Distributed Change Management (DCM), which has since been rebadged Distributed CM Synergy.

Before I start, I should note that I have seen no CM Synergy source code and have only user-level knowledge. On the other hand, my user-level knowledge is pretty in-depth, given that I was build manager for over a year before moving into actual software development on my company's product (it's considered penance in my business ;). At that time Telelogic's predecessor-in-interest, Continuus, had not yet entered Australia and we were being supported by another firm. This firm was not very familiar with the product, and for many years the CCM expertise in my company exceeded that of the support firm in many areas. Some of my detailed knowledge may be out of date, as I've now been back in the software domain for a number of years.

CCM is built on an Informix database which contains objects, object attributes, and relationships. Above this level is the archive, which uses gzip to store object versions of binary types and a modified GNU RCS to store object versions of text types. Above this level is the cache, which contains extracted versions of all working-state objects and of static (archived) objects in use within work areas. Working-state objects only exist within the cache. The final level is the work area. Each user will have at least one, and that is where software is built. Under Unix, the controlled files within the work area are usually symlinks to cache versions. Under Windows, the controlled files must be copies. Object versions that are no longer in use can be removed from the cache with an explicit cache clean command. A work area can be completely deleted at any time and recreated from cache and database information with the sync command. Arbitrary objects (including tasks and projects, which we'll get to shortly) can be transferred between CCM databases using the DCM object version transfer mechanism.

CCM is a task-based CM environment. That means that it distinguishes between the concept of a work area and what is currently being worked on. The work area content is decided by the reconfigure activity, which uses the reconfigure properties on a project as its source data: a baseline project and a set of tasks to apply (including working-state and "checked-in" (static) tasks). The set of tasks is usually determined by a set of task folders, which can be configured to match the content of arbitrary object queries.

Once the baseline project and set of tasks are determined by updating any folder content, the tasks themselves and the baseline project are examined. Each one is equivalent to a list of specific object versions. Starting at the root directory of the project, the most-recently-created version of that directory object within the task and baseline sets is selected. The directory itself specifies not object versions, but file-ids. The slots that these ids identify are filled out in the same way, by finding the most-recently-created version of the object within the task and baseline sets.
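
As I understand it, the selection rule boils down to something like the following sketch. This is my own Python pseudocode, not anything Telelogic ships, and the attribute names (create_time, entries) are invented:

def select_version(file_id, candidates):
    """candidates maps each file-id to the object versions contributed for it
    by the baseline project and the applied tasks; pick the newest one."""
    return max(candidates[file_id], key=lambda version: version.create_time)

def reconfigure(root_directory_id, candidates):
    """Walk the project from its root directory, filling out each slot."""
    selected = {}
    pending = [root_directory_id]
    while pending:
        file_id = pending.pop()
        version = select_version(file_id, candidates)
        selected[file_id] = version
        # A directory version names file-ids (slots), not specific object versions,
        # so each slot it mentions is resolved by the same newest-version rule.
        pending.extend(getattr(version, "entries", []))
    return selected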

So, this allows you to be working on multiple tasks within the same work area. It allows you to pick up tasks that have been completed by other developers but not yet integrated into any baseline and include them in your work area for further changes. The final and perhaps most important thing it allows you to do is perform a conflicts check.

The conflicts check is a more rigorous version of the reconfigure process. Instead of just selecting the most-recently-created object version for a particular slot, it actively searches the object history graph. This graph is maintained as "successor" relationships in the Informix database. If the graph analysis shows that any of the object versions contributed by the baseline or task set are not predecessors of the selected versions, then a conflict is declared. The user typically resolves this conflict by performing a merge between the two parallel branch versions using a three-way diff tool. Conflicts are also declared if part of a task is included "accidentally" in a reconfigure. This can occur if you have a task A and a task B where B builds on A. When B is included but A is not, some of A's objects will be pulled into the reconfigure by virtue of being predecessors of "B" object versions. This is detected, and the resolution is typically either to pull A in as well or to remove B from the reconfigure properties.
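
At heart the conflict test is a reachability question over that successor graph. Again, this is my own rough pseudocode rather than the real implementation, and the file_id and successors attributes are assumptions:

def is_predecessor(older, newer):
    """True if 'newer' can be reached from 'older' by following successor links."""
    seen, frontier = set(), [older]
    while frontier:
        version = frontier.pop()
        if version is newer:
            return True
        if id(version) not in seen:
            seen.add(id(version))
            frontier.extend(version.successors)
    return False

def find_conflicts(selected, contributed):
    """Report contributed object versions that are neither selected for their slot
    nor ancestors of the selected version; these are the ones that need a merge."""
    conflicts = []
    for version in contributed:
        chosen = selected[version.file_id]
        if chosen is not version and not is_predecessor(version, chosen):
            conflicts.append((version, chosen))
    return conflicts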

The conflicts check is probably the most important feature of CCM from a user perspective. Not only can you see that someone else has clobbered the file you're working on, but you can see how it was clobbered and how you should fix it. On the other side, though, is the build manager perspective. Task-based CM makes the build manager role somewhat more flexible, if not actually easier.

The standard CCM model assumes you will have user work areas, an integration work area, and a software quality assurance work area. User work areas feed into integration on a continuous or daily basis, and every so often a cut of the integration work area is taken as a release candidate to be formally assessed in the slower-moving software quality assurance work area. Each faster-moving work area can use one of the slower-moving baselines as its baseline project (work area, baseline, and project are roughly interchangeable terms in CCM). Personally, I only used an SQA build within the last few months or weeks of a release. The means of delivering software to be tested by QA is usually a build, and you often don't need an explicit baseline to track what you gave them in earlier project phases.

One way we're using the CCM task and project system at my place of employment is to delay integration of unreviewed changes. Review is probably the most useful method for validating design and code changes as they occur, whether it be document review or code review. Anything that hasn't been reviewed isn't worth its salt yet, and it certainly shouldn't be built on top of by other team members.

So what we do is add an approved_by attribute to each task. While approved_by is None, the task can be explicitly picked up by developers if they really need to build upon it before the review cycle is done... but it doesn't get into the integration build (it's excluded from the folder query). When review is done, the authority who accepts the change puts their name in the approved_by field, and either that person or the original developer does a final conflicts check and merge before the nightly build occurs. That means that work is not included until it is accepted, and not accepted until it passes the conflicts check (as well as other checks such as developer testing rigour). In the meantime other developers can work on it if they are prepared to have their own work depend on the acceptance of the earlier work. In fact, users can see and compare the content of all objects, even working-state objects that have not yet been checked in. That's part of the beauty of the cache concept, and of the idea of checking out objects (and having a new version number assigned to the new version) before working on them.

I should note a few final things before closing out this blog entry. Firstly, I do have to use a customised GNU make to ensure that changes to a work area symlink (ie, selection of a different file version) always cause a rebuild. It's only a one-line change, though. Also, CCM is both a command-line utility and a graphical one. The graphical version makes merging and understanding object histories much easier. There is also a set of Java GUIs which I've never gotten around to trying. Telelogic's Change Synergy (a change request tracking system similar in scope to bugzilla) is designed to work with CCM, and should reach a reasonable level of capability in the next few months or years, but is currently a bit snafued. I haven't gone into any detail about the CCM object typing system or other aspects for which there are probably better solutions these days anyway. I also haven't covered project hierarchies, or controlled products, which have a few interesting twists of their own.

Benjamin

Sat, 2005-Feb-19

Aliens don't use XML

According to Sean McGrath, aliens don't use XML. He says they have separate technology stacks for dealing with tabular data, written text data, and relational data. I wonder, then, what the aliens do when they want to mix their data? :)

Perhaps a non-XML alternative for data representation will re-emerge at the cutting edge some time in the future, but the same homogeneity issues will still have to be addressed by any new definition. CSV++ would have to find a way to embed or uniformly refer to XHTML++ and N3++ data. XHTML++ and N3++ would need similar embedding.

For the time being, XML with namespaces looks like holding the top spot in being able both to define the structure of data and to identify the correct interpretation of its content.

Sun, 2005-Feb-13

To disable Internet Explorer

The HUMBUG mailing lists have recently been abuzz with talk of Suncorp-Metway's apparent pro-IE+Windows and anti-Mozilla+Unix stance in its online terms and conditions. These conditions were later clarified, in this follow-up, in a form that I hope will stand up in court but have personal doubts about.

Greg Black linked to this article which contained a comment by Bill Godfrey:

I keep IE hanging around, but I have the proxy server set to 0.0.0.0 and I make exceptions in the no-proxy-for list.

I've now adopted this policy on my work machine, although I've set my proxy to localhost (127.0.0.1) instead of 0.0.0.0 as I consider this a touch safer. Since IE and Mozilla have distinct proxy settings this prevents IE and its variants from accessing remote sites while allowing me to roam freely under Firefox. This is particularly important for me because Lotus Notes explicitly embeds IE for its internal web browsing and has a habit of doing things like looking up IMG tags from spam. Hopefully this will put a stop to that.

Sat, 2005-Feb-12

XP and Test Reports

Adrian Sutton writes about wanting to use tests as specifications for new work to be done:

You need to plan the set of features to include in a release ahead of time so you wind up with a whole heap of new unit tests which are acting as this specification that at some point in the future you will make pass.  At the moment though, those tests are okay to fail.

It isn't crazy to want to do this. It's part of the eXtreme Programming (XP) model for rapid development. In that model, documentation is thrown away at the end of a unit of work or not produced at all. The focus is on making the code tell the implementation story and making the tests tell the specification story. In XP terms, what you're trying to do is not unit testing, but acceptance testing:

Customers are responsible for verifying the correctness of the acceptance tests and reviewing test scores to decide which failed tests are of highest priority. Acceptance tests are also used as regression tests prior to a production release.

Implicit in this model is the test score. In my line of work we call this the test report, and it must be produced at least once per release but preferably once per build. A simple test report might be "all known tests pass". A more complicated one would list the available tests and their pass/fail status.

Adrian continues,

The solution I'm trying out is to create a list of tests that are okay to have fail for whatever reason and a custom ant task that reads that list and the JUnit reports and fails the build if any of the tests that should pass didn't but lets it pass even if some of the to do tests fail.

If you start from the assumption that you'll produce a test report, the problem of changes to the set of passed or failed tests can become a configuration management one. If you commit the status of your last test report and diff it against the freshly-built one during the make process, you can break the build on any unexpected change to the test passes and fails. In addition to ensuring that only known changes occur to the report, it is possible to track (and review) positive and negative impacts on the report numbers. All the developer has to do is check in a new version of the report to acknowledge the effect their changes have had (they're prompted to do this by the build breakage). Reports can be produced one per module (or one per Makefile) for a fine-grained approach. As a bonus you get to see exactly which tests were added, removed, broken, or fixed at exactly which time, by whom, and who put their name against acceptance of the changed test report and associated code. You have a complete history.
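
A minimal sketch of that check in Python, assuming the committed report sits alongside the one the build just generated; the file names and layout are mine, not from any particular build system:

import difflib
import sys

# Compare the committed report against the one the build just produced.
with open("test_report.committed") as f:
    expected = f.readlines()
with open("build/test_report.generated") as f:
    actual = f.readlines()

delta = list(difflib.unified_diff(expected, actual,
                                  "test_report.committed", "build/test_report.generated"))
if delta:
    sys.stdout.writelines(delta)
    print("Test report changed: review the diff and check in the new report to accept it")
    sys.exit(1)   # break the build until someone acknowledges the change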

This approach can also benefit other things that are generated during the build process. Keep a controlled version of your object dump schema and forever after no unexpected or accidental changes will occur to the schema. Keep a controlled version of your last test output and you can put an even finer grain on the usual pass/fail criteria (sometimes it's important to know two lines in your output have swapped positions).

Sat, 2005-Feb-12

On awk for tabular data processing

My recent post on awk one-liners raised a little more controversy than I had intended.

It was a response to Bradley Marshall's earlier post on how to do the same things in perl, and it led him to respond:

I'm glad you can do it in awk - I never suspected you couldn't... I deal with a few operating systems at work, and they don't always have the same version of awk or sed installed... it was no easier or harder for me to read. Plus, with Perl I get the advantage of moving into more fully fledged scripts if I need to...

Byron Ellacott also piped up, with:

As shown by Brad and Fuzzy, you can do similar things with different tools that often serve similar purposes. So, here's the same one-liners using sed(1) and a bit of bash(1)... (Brad, I know you were just demonstrating an extra feature of perl for those who use it. :)

I knew that also, and perhaps should have been more explicit in describing the subtleties of why I responded in the first place. Firstly, I have a long and only semi-serious association with awk vs perl advocacy. My position was always that perl was filling a gap that didn't exist for my personal requirements. I seem to recall that several early jousts on the subject were with Brad.

To my mind, awk was desperately simple and suited most tabular data processing problems you could throw at it. My devil's advocate position was that anything too complicated to do in awk was also too complicated to do legibly in perl. Clearly the weight of actual perl users made this position shaky (if not untenable), but I stuck to my guns, and for the entire time I was at university and for several years afterwards I found no use for perl that couldn't be more appropriately implemented in another way.

Perl has advanced over the years, and while I still have no love for perl as a language, the existence of CPAN does make perl a real "killer app". Awk, with its lack of even basic "#include" functionality, will never stack up to the range of capabilities available to perl. On the other hand, bigger and better solutions are again appearing in other domains such as python, .NET, JVM-based language implementations and the like. I've had to learn small amounts of perl for various reasons over the years (primarily for maintaining perl programs) but I'll still work principally in awk for the things awk is good at.

So, when I saw Brad's post I couldn't resist. The one-liners he presented were absolutely fundamental awk capabilities. They were the exact use case awk was developed for. To present them in perl is like telling a lisp programmer that you need to do highly recursive list handling, so you've chosen python. It's a reasonable language choice, especially if you're already a user of that language. It's just that you have to push the language a little harder to make it happen. It's not what the language was meant to do, it's just something the language can do.

I absolutely understand that you can do those things in other languages. I sincerely sympathise with Brad's "Awk? Which version of awk was that, again?" problem. I don't believe everyone should be using awk.

On the other hand, if you were looking for a language to do exactly those things in I would be happy to guide you in awk's direction. Given all the alternatives I still maintain that for those exact use-cases awk is the language that is most suitable. As for Brad's "with Perl I get the advantage of moving into more fully fledged scripts" quip, awk is better for writing full-fledged scripts than most people assume. So long as your script fits within the awk use case (tabular data handling) you won't have to bend over backwards to make fairly complicated transformations fly with awk. If you step outside that use-case, for example you want to run a bunch of programs and behave differently based on their return codes... well awk can still do that, but it's no longer what awk is designed for.

Benjamin

Sun, 2005-Feb-06

Awk one-liners

I felt I had to respond to this post by Bradley Marshall on perl one-liners. My position as an awk advocate is a long-suffering one, and one that could do with some updating :)

Given the following data set in a file:

foo:bar:baz

The following one-liner will pull it out in a useful fashion:

$ awk -F: '{print $2}' filea
bar

A neat extension is, given a dataset like:

1:2:3:4
5:6:7:8

You can use a one liner like the following:

$ awk -F: '{tot += $1};END{print tot}' fileb
6

I, for one, feel cleansed. Now wasn't that a touch easier and more legible? :) As always, see the awk manpage for more information...

Benjamin

Sat, 2005-Feb-05

The Common Development and Distribution License

OpenSolaris has been launched with the release of DTrace under the Common Development and Distribution License (CDDL), pronounced "cuddle" (although I always think "ciddle" when I see it). CDDL is an OSI-approved license. This should be a good thing, and is.

This blog entry covers some of the scope of the controversy over the CDDL, and Sun's decision to create it and to use it. I'll be reporting on discussions within the Groklaw, HUMBUG, and Debian organisations, but mostly I'll be linking to things I was interested to hear or things I agree with ;)

Groklaw was looking into the CDDL quite early in the piece. During December 2004 PJ wrote this article requesting feedback on the license as submitted to OSI for approval. After noting the license was not going to be GPL-compatible, she put this message forward, front and center:

So what, you say? Other licenses are not (GPL-compatible) either. But the whole idea of Open Source is that it's, well, open. For GNU/Linux and Solaris to benefit each other, for example, they'd need to choose a licence that allows that cross-pollination. So Sun is letting us know that it is erecting a Keep Out sign as far as GNU/Linux is concerned with this license...

She goes on to quote Linus Torvalds who had earlier spoken to eWeek.com about CDDL to support her view.

By the 18th of December 2004 Sun had responded to Groklaw concerns regarding some parts of the license. PJ doesn't comment further on the pros and cons of the license at this time.

By the 26th of January 2005 OSI approval had been granted. It was time for PJ to get back on the bandwagon with this sort of sentiment:

Yes, they are freeing up 1600 patents, but not for Linux, not for the GPL world. I'm a GPL girl myself. So it's hard for me to write about this story on its own terms. I am also hindered by the fact that I've yet to meet a Sun employee I didn't like personally. But, despite being pulled confusingly in both those directions at once, in the end, I have to be truthful. And the truth is Sun is now competing with Linux. That's not the same as trying to kill it, but it's not altogether friendly either. Yet, at the teleconference, Sun said they want to be a better friend to the community. I feel a bit like a mom whose toddler has written "I LUV MOMMY" on the wall with crayons. Now what do I say?

A further 28th of January article highlights another possible technical issue with the CDDL arrangement, but expects the problem will be solved when the Contributors Agreement is drawn up.

After reading the Groklaw articles I had a number of things to think about, and I wrote three emails to the HUMBUG general mailing list. The first just pointed to the 26th of January 2005 Groklaw article. The second two were a bit more exploratory of the problem of GPL-incompatibility, and of what it means to create a new open source license. On the 30th of January 2005, I wrote:

I, for one, am not surprised by this release initially being under the CDDL only. It does seem like a reasonable license given the circumstances, just as the MPL did in the early days of mozilla. I think (and hope) that over time the open source experiment will prove beneficial to all parties and that dual-licensing under the GPL or LGPL will one day be possible. It does seem unlikely that the GPL camp will move too far from its position regarding compatibility after all this time. As the newcomer to open source, sun.com will eventually have to expose itself to the GPL if it is to maximise its community support and exposure. Eventually, I hope that this open source experiment leads to benefits to open source operating system development everywhere.

After some interesting followup by a resident Sun employee (who was not representing Sun in the conversation), I wrote a more concrete exploration piece covering the topics of whether opening Solaris would benefit Linux and of general open source license compatibility. I wrote:

In order to make CDDL and GPL compatible we have to look at both directions of travel. CDDL->GPL could be achieved by dual licensing of the software or dropping the CDDL in favour of something like LGPL or newBSD. Both options are probably unacceptable to Sun who wrote this license in order to protect itself and do other lawyerly things. On the flipside, GPL->CDDL is equally hair-raising. Linux code would similarly have to dual-license or use a weaker license. Would the CDDL terms be an acceptable release condition for Linux kernel software? Probably not, because CDDL allows combinations with closed-source code. That would allow the linux kernel to be combined with closed-source code also. The two licenses exist under two different ideologies and two different commercial realities. The licenses are reflective of that.

and this:

I suspect developers need to be careful about looking at Solaris code with a view to repeating what they've seen in Linux, and likewise linux developers may have to be careful about what they would contribute to OpenSolaris. Sure, it's fine to recontribute your own work but if you're copying someone else's work there may be issues. Like closed-source software, open-source software is not public domain and can't be copied from one place to the other without reference to the relevant licence agreements. When Microsoft code has leaked onto the internet in days past I seem to recall fairly strong statements that anyone who had seen such code would not be able to work on the same kinds of features in Linux as they had seen in the leaked code. There's too much of a risk that subconscious copying (or the perception of it) could lead to future legal difficulties.
...
Still, even if no copying or cross-pollination can occur at the code level the open sourcing of Solaris should bring the developers closer at the collaborative community level. From that perspective even with the GPL/OSI fracture we should all see some benefits from Sun's undeniably generous and positive actions.

More recently, I've read up on what debian-legal had to say about the CDDL. Mostly it was "We shouldn't be spending any time thinking about this until someone submits CDDL code for inclusion into Debian", but some interesting opinions did come up. Personally, I trust Debian's take on what free software is more than I trust OSI's take on what open source software is. Despite the striking similarity between Debian's Free Software Guidelines and OSI's Open Source Definition (the OSD is based on the DFSG), the two organisations seem to put different emphasis on who they serve. Debian appears very conservative in making sure that user and redistributor freedoms are preserved. I've never quite worked out whose freedoms OSI is intended to preserve, but I believe they have a more business-oriented slant. From my reading it seems that Debian's (notional) list of acceptable licenses is shorter than OSI's (official) list.

Two threads appeared on the debian-legal mailing list. One commented on the draft license, and the other on the OSI-approved license. I think the most pertinent entry from the former thread was this one, by Juhapekka Tolvanen, which states:

It probably fails the Chinese Dissident test, but I don't think that's a problem. The requirement to not modify "descriptive text" that provides attributions /may/ be a problem, but that'll depend on specific code rather than being a general problem...

Andrew Suffield elaborates, saying:

Is that license free according to DFSG?
Not intrinsically. Individual applications of it may be, with a liberal interpretation, or may not be, with a lawyer one. Notably it's capable of failing the Chinese Dissident test, and of containing a choice-of-venue provision. It also has a number of weasel-worded lawyer clauses that could be used in nasty ways...
Yeah, it's another of those irritating buggers. We'll have to analyse each license declaration that invokes this thing.

Followups in the later thread reinforce that none of the problems debian-legal had with the original draft appear to have shifted.

To close out this entry I'd like to bring the sage words of Stuart Yeates from debian-legal to bear:

The CDDL is almost certainly better from pretty much every point of view (including that of the DFSG) than the current licences for Solaris. If you had no ethical problems with the old licences for Solaris, you're unlikely to have ethical problems with the CDDL.

As for the free software world's general acceptance of, and participation in, the CDDL, it is probably no worse than the Mozilla Public License or any number of other licenses that have appeared over time and been declared open source. Personally, I won't be trusting any license that Debian doesn't support, but we won't find out whether that test is passed for quite some time yet (unless someone wants to try a dtrace port...)

New licenses are created because the developer can't accept the protections, guarantees, and restrictions of any existing license under which they might release their code. The lawyers who write these licenses deliberately make them incompatible with other licenses in order to prevent the code being distributed under such unacceptable terms. In doing so, they prevent cooperation at the source level between them and anyone else. They create a barrier.

If I were Sun I'd want to be pretty damn sure that other people had the same view about existing licenses and saw their license as the perfect alternative before shutting out so much of the existing developer community. Regardless of your attitudes about what represents open source or free software, these barriers are not good. Every time a new license is written, somewhere a fairy dies. Please, think of the fairies.

Benjamin