Sound advice - blog

Tales from the homeworld

Wed, 2004-Dec-29

Own the Standards

Harry Phillips writes:

Linux does not need a "Killer app", more open source projects need to be ported to Windows.

He refers to this article as his inspiration.

The article does not appear to be well-researched. It is an opinion piece, but I think it holds a few grains of truth. This blog entry is an opinion piece too, and is probably less well-researched. Today, we already see open source software creeping onto the windows desktop. Companies like Green PC in the sub-$500 PC market appear to love the cost savings they can make with a bit of open source software (in this case, openoffice) thrown into the bundle.

It's a niche with pros and cons, but I think some of the pros are notable ones. People may start to get used to working with openoffice software at home. Slowly it creeps into educational institutions, four years later it starts to make its mark on industry.

Maybe that's a pipe-dream. Doubly-so because I would rate the software's usability at around 2/10. I hate the bastard.

Nevertheless, the crossover of open source software onto the windows desktop is very interesting indeed. The mozilla and openoffice technology stacks are mature on windows as well as unix-like and mac platforms (as a short list). GTK applications are starting to make a splash as well. Gimp now has windows support which should continue to improve over the next few gtk revisions. Some might point out qt, especially given their recent version four beta release, which has been mature and open source for some time but still has the ring of "that company owns it, not us" about it and has a strong C++ aftertaste.

Anyway, the future is looking bright for open source software under windows. Why is this important?

I think the use of open source software on the windows platform is not that useful to open source. It may have positive impacts on the individual products. They may receive more contributions and more feedback from the wider audience they've courted. This may feed back to a better overall user experience. It may even raise the profile of open source. On the whole, though, I don't think it will be a big win in and of itself. To me, the win is in retaking the platform.

For some years, Microsoft has been able to claim all the best software because of a self-fulfilling prophecy (and, in part, because of a reasonably sound software development platform). You need things to work under windows, so you develop for the windows platform. Java tried to break this model with the write once, run everywhere philosophy. This philosophy has been adopted by a number of other portable platforms that have been migrated over the top of windows. As I recall, the first major beachhead was retaken with perl. It was genuinely useful to windows users, and filled a hole that Microsoft had failed to fill. With more and more common platform shared between Windows and those other camps, it is starting to become viable again to write software that will work under other operating systems "as well".

I'm currently tinkering with (nibbling at) an accounting system written in python. To get it working under windows all I had to do was to make sure it had only gtk dependencies, not those of gnome. I use sqlite as the file-storage-slash-database, which was also available trivially. I expect that if I add further dependencies they'll be available in a similarly straightforward fashion. I can write once, and run (test) everywhere.

I think the most important problem in software at the moment is in defining practical platforms for the development of high-level software. I think the main risk to this problem being solved entirely is still the split between the three main free desktop camps. Mozilla, openoffice, and gnome (Did I forget KDE? Oh, my bad...) are all built on essentially-incompatible software stacks with three different sets of abstractions. I believe this harms the solution not only on the free desktop but on the windows desktop. If this problem isn't solved in the next five years or so (perhaps in the timeframe of the Windows Longhorn release) I think the tide may turn back again and we may find Microsoft dictating the platform once again. Perhaps Mono or other .NET clones will help us to still write portable software in that time, but my real ambition for open source would be for Microsoft itself to eventually be forced to contribute to the one true codebase.

I'm all for open source software under Windows, especially if it means that open source platforms become de facto standards under the windows desktop. I would love to see sufficient momentum behind those platforms that they outevolve and outdevelop Microsoft's competition on its own operating system.

Once the standard windows desktop is gnome-based, why would anyone buy windows?

Benjamin (the dreamer)

Sat, 2004-Dec-25

Windows is everywhere

Ben Fowler writes:

If you're stupid enough, despite the EULA, to use our (Microsoft Windows) software to run a nuclear reactor, weapon system or other safety critical system, then it's your funeral (and maybe everyone else's)

The trouble is, windows is everywhere.

I've been working with Solaris for the last five years, but that's coming to an end. Particularly in Asia, Sun is often seen by our customers as a supplier with an unsteady future ahead of it. They want commodity hardware to work with, and commodity software. Some ask for Linux by name. Some ask for Windows by name, especially for desktop machines.

The reasons are sometimes complex and varied. Sometimes they ask for a SCADA system but secretly dream of a general computing platform that they can use to access other systems as well, or maybe just trawl the internet for humor. Sometimes they want to avoid us having them over a barrel when it comes time to upgrade. They want to be able to buy their own hardware, or at least feel confident they could if they needed to.

Often the customer doesn't have as much expertise as they think they do, and what they assume anyone can do would actually introduce risks to the system unless it is very carefully considered with some reasonably in-depth knowledge. In the end, we provide a solution. When the solution needs to be updated we are probably the best people to do it (if you're still staying with our systems).

Anyway, I meander from the point.


Windows is used in safety-related systems. Not all of them, but many of them. People who work with safety-related systems want commodity hardware and software, too, and until recently the options have been very slim indeed. They remain slim, to be honest. When you aren't dealing with hard realtime requirements and you have a software-heavy solution you don't want to reinvent the operating system, hardware, and development environment. You use something off the shelf that does the job and doesn't cost too much money. Companies that provide these systems aren't very good at sharing with each other, so truth be known there's not a whole lot out there. Windows is said to be a good choice, at least once you've pulled out any unnecessary services and run the same version for half a decade or so. It can be a good choice. Most vendors in the field have far more experience in deploying Windows solutions than Linux or BSD solutions.

My background over most of the past five years hasn't actually been with safety-related software. The project I was working on involved code that was one step removed from the safety-related component. That's no longer true, so I've been immersed in safety-related thinking relatively recently. At first it surprised me that the people who did have a lot of safety-related software experience dealt mostly with windows. It surprised me more when they told me that while they were currently only certified with an ageing windows NT operating system base they felt confident in achieving certification very soon under Windows XP. They weren't much interested in Linux, and the idea of using Solaris seemed outright confusing to them.

Of course, we're not talking about nuclear reactors here. We're talking SIL2 systems (sometimes called non-vital in the old tongue) that tell SIL4 systems what to do. In the end it is the SIL4 systems that decide whether something is safe or not, and are perfectly willing to override the SIL2 decision when it suits them. Some of those SIL4 systems are technological. Some are procedural. Still, it's very embarrassing when your SIL2 system goes down even after accumulating several years of uptime. We prefer to see the hardware fail before the software does. Likewise, SIL2 systems do have safety-related responsibilities (otherwise they'd be SIL0). Unlike a vital safety-critical (SIL3 or SIL4) system, your non-vital safety-related (SIL1 or SIL2) system typically generates unsafe situations when it fails badly as opposed to actually killing someone directly. We all like to be sure that our systems aren't going to fail badly.

Meandering again, yes.

The safety-related parts of our applications also tend to be running on secure networks, although we've seen recently that isn't always a true safeguard. Oh well.

Now what does the UK Health & Safety directorate have to say about the use of Linux and Windows? They're pretty conservative guys, but are surprisingly positive about linux. They were (very briefly) less positive about windows. That report was pulled, though, no doubt after pressure from both Microsoft and parts of the various represented industries that were more comfortable with their windows solutions than any potential linux solution.

In the end, I think we'll see a linux vs windows battle on the safety-related software front. Each one will creep up to around the high SIL2 mark and most applications will be able to make use of either one. Currently windows is still out in front in that arena (at least from where I'm sitting) because of longer term industry exposure. When we look towards the SIL4 mark we will continue to see (as we do now) lots of bespoke software and hardware that brings in enough margin not to have to commoditize. Until the market shifts to put pricing pressure on those guys I think we'll continue to see that approach. On my end of the market, though, there is a price squeeze that makes bespoke impossible. Non-commodity non-bespoke solutions such as the use of Sun hardware and software are becoming a nonsense to our sector as they are to many other sectors. Windows and Linux look like the only contenders (and linux has a lot of catching up to do).


Tue, 2004-Dec-21

More on Exceptions

I suppose I should get around to replying to this article, part of a blogosphere exchange that Adrian Sutton and I were having about a month ago when I dropped off the net for a while.

I'll start by attempting to resummarise our respective positions. Adrian, please let me know if I've misrepresented you on any of these points:

The question: Main language background
Adrian's position: Java
My position: C++

The question: Industry background
Adrian's position: Web-based desktop applications
My position: Safety-related distributed back-end applications

The question: Main software concerns
Adrian's position: Not surprising the user
My position: Failing (over to the backup) in a way that doesn't kill anyone

The question: When to throw an exception (when to try and recover from an error)
Adrian's position: Whenever unexpected input arrives. Some other code might be able to handle this problem in a way that allows the user to save their work.
My position: Whenever unexpected input arrives from outside the program. Unexpected input from inside the program is an indication of faulty software and faulty software must be terminated with prejudice to avoid unsafe situations from arising.

The question: Uncaught exceptions
Adrian's position: An inherently good backup strategy for regular exception handling. If no one knows how to deal with the condition the program exits (reasonably) gracefully.
My position: What you do when you can't find the assert function :), but faulty nonetheless. The thrower and rethrowers of the exception expected someone to catch it, and include code to be able to continue operating after the exception is caught and further functions are called on them. If the exception wasn't caught this recovery code is untestable at the system level and likely to be buggy. Throwing an exception in this case just adds a longer code path than you would find in the assert-based case and gives developers the false impression that they can rely on the recovery code.

The question: (Implicitly or explicitly) rethrown exceptions
Adrian's position: Good for propagating exceptions to the guy who knows how to deal with the problem
My position: Possibly-surprising side-effects to function calls. In java this isn't a huge problem, as your code won't compile if you don't catch all of the exceptions you are supposed to catch. C++ doesn't hold your hand, unfortunately, but that's the fault of C++. Python is similar to C++ in this respect and I'll come back to python later in this blog entry.

I get serious about control flow simplicity. In the normal case I forbid the use of multiple return statements in a function. I forbid the use of multiple exit points in loops. I forbid the use of not operators in boolean expressions where the "then" and "else" branches could simply be reversed. Let's just say that exceptions (like GOTOs) mess with my... space.

The question: Caught exceptions
Adrian's position: The standard way of dealing with exceptional conditions
My position: Personally, I would prefer an object be returned. If the object's return code (or activator function) was not accessed by the time the last reference disappeared, that's where things should die. The object would form a contract between the "thrower" and the caller to handle the object, and failure to do so would be termination-worthy. Awkward control flow jumps are bad. Returning multiple types from a single function is bad. Exceptions encapsulate both badnesses and make them a first-order language feature.
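The "returned object that must be checked" idea can be sketched in python. This is only an illustration under assumptions: CheckedResult, parse_port and the on_ignored hook are hypothetical names, and the sketch leans on CPython destroying an object as soon as its last reference disappears.

```python
import os
import sys


def _die(message):
    """Default policy: report the broken contract and terminate at once."""
    sys.stderr.write(message + "\n")
    os._exit(1)


class CheckedResult:
    """Return value whose status must be read before it is discarded."""

    def __init__(self, ok, value=None, on_ignored=_die):
        self._ok = ok
        self._value = value
        self._checked = False
        self._on_ignored = on_ignored  # injectable so the policy can be tested

    def ok(self):
        # Reading the status fulfils the caller's side of the contract.
        self._checked = True
        return self._ok

    def value(self):
        assert self._checked and self._ok, "value() used without a successful ok()"
        return self._value

    def __del__(self):
        # Runs when the last reference goes away (immediately, in CPython).
        if not self._checked:
            self._on_ignored("result discarded without being checked")


def parse_port(text, on_ignored=_die):
    """Hypothetical 'thrower': reports failure through the result object."""
    if text.isdecimal() and 0 < int(text) < 65536:
        return CheckedResult(True, int(text), on_ignored)
    return CheckedResult(False, None, on_ignored)
```

A caller that writes `result = parse_port(text)` and then tests `result.ok()` carries on normally; a caller that drops the result on the floor is stopped at the point where the object dies.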

The reason I started writing about exceptions was not to annoy the gentle reader. The spark was actually in python's pygtk module. You see, python has a problem: It doesn't detect errors at compile time. This means that whenever you make a typo or do anything equally silly you have to find out at runtime. Rather than terminating the process when these are discovered, python throws an exception.

Fine. An exception is thrown. I don't catch it. I get a stack trace on the command-line, and everyone's happy right?

No. The pygtk main loop catches it, reports it, and keeps running.

You can't really blame pygtk, can you? Just above it on the stack is the C-language gtk library code which has no exception support whatsoever. Even if it did, chances are it wouldn't support arbitrary python exceptions being thrown through it. The pygtk library has one of two choices: Catch it, or kill it.

As you might have guessed by now, my preference is "kill it!". There is supposed to be an environment variable in pygtk to govern this behaviour, but it appears to have disappeared at some point as I haven't been able to stop it from handling the exception. As a result, my program keeps running (after not fulfilling the task I gave it) and I have to explicitly check the command-line to see what went wrong. Pain-in-the-bloody-arse-exceptions.
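One way to get "kill it!" behaviour without relying on the toolkit is to wrap each callback before handing it over. The sketch below is not pygtk API: fatal and toy_main_loop are hypothetical, and the toy loop only stands in for a main loop that reports exceptions and carries on.

```python
import functools
import os
import traceback


def fatal(exit_process=os._exit):
    """Decorator factory: an exception escaping the callback ends the process."""
    def wrap(callback):
        @functools.wraps(callback)
        def guarded(*args, **kwargs):
            try:
                return callback(*args, **kwargs)
            except Exception:
                traceback.print_exc()  # keep the stack trace visible
                exit_process(1)        # then refuse to keep running
        return guarded
    return wrap


def toy_main_loop(handlers):
    """Stand-in for a toolkit loop that reports errors and keeps going."""
    for handler in handlers:
        try:
            handler()
        except Exception:
            traceback.print_exc()
```

With the default exit_process, a typo inside a decorated handler terminates the program with its traceback on stderr rather than leaving a half-working window on screen. The exit function is a parameter only so the policy can be exercised without actually killing the interpreter.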

I don't think exceptions are a great idea. I think they add more code than they reduce, and add more code paths than the (in my opinion) usually better course of action in terminating immediately. I know my views are heavily-grounded in systems where failure of the process doesn't matter because you always have a redundant copy and where you're more worried about misleading data making its way to the user than to have them see no data at all. Still, I think my views are more applicable to the desktop world than the views of the desktop world are applicable to my world.

Save your work? That's for sissys. Use a journalling file-saving model. Save everything the user does immediately. You can support the traditional file save/load facility using checkpoints or other niceties but I fail to see why any application in this modern age of fast hard drives should ever lose data the user has entered more than a few hundred milliseconds ago.
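A journalling save model might look something like the following. It is a minimal sketch under assumptions: the Journal class, the replay function and the one-JSON-object-per-line format are all invented for illustration.

```python
import json
import os


class Journal:
    """Append-only log of user edits; each edit is on disk before we return."""

    def __init__(self, path):
        self._file = open(path, "a", encoding="utf-8")

    def record(self, edit):
        self._file.write(json.dumps(edit) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # ask the OS to push it to the drive

    def close(self):
        self._file.close()


def replay(path):
    """Rebuild document state by re-applying every complete journal line."""
    state = {}
    if not os.path.exists(path):
        return state
    with open(path, encoding="utf-8") as journal:
        for line in journal:
            if not line.endswith("\n"):
                break  # torn final write from a crash: ignore it
            edit = json.loads(line)
            state[edit["field"]] = edit["value"]
    return state
```

A traditional "save" then becomes a checkpoint: write out the replayed state and start a fresh journal.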

Use exceptions because of a bug in some piece of code that's feeding you data? You're just propagating the bug and making handling it a much more complicated problem. Just fix the code in question and be done with it. Don't get me wrong, I absolutely think exceptions have some good points over returning a silly little error code. I just think that the bad outweighs the good. I believe that a refinement of the return code model rather than the complete reimplementation we see in exceptions would have been a better course for everyone.


Sat, 2004-Oct-23

I think exceptions in Java are still at least a bit evil

Again, I think it comes to the number of paths through a piece of code. Adrian points out that Java uses garbage collection (a practice I once considered at least a bit dodgy for reasons I won't go into here, but have warmed to somewhat since using it in python), and that garbage collection makes things much simpler in Java than they would be in C++. I have to agree. A consistent memory management model across all the software written in a specific language is a huge step forward over the C++ "generally rely on symmetry but where you can't do that: hack up some kind of reference-counting system...".

After his analysis I'm left feeling that Java's exceptions are at worst half as evil as those of C++ due to that consistent memory management. Leaving that aside, though, we have the crux of my point. Consider the try {} finally {} block. C++ has a similar mechanism (that requires more coding to make work) called a monitor. You instantiate an object which has the code from your finally {} block in its destructor. It's guaranteed to be destroyed as the stack unwinds, unlike dynamically-allocated objects.

Unfortunately, in both C++ and Java when you arrive in the finally {} block you don't know exactly how you got there. Did you commit that transaction, or didn't you? Can we consider the function to have executed successfully? Did the exception come from a method you expected to raise the exception, or was it something that (on the surface) looked innocuous? These are all issues that you have to consider when using function return codes to convey success (or otherwise) of operations, but with an exception your thread of control jumps. The code that triggered the exception does not (may not) know where it has jumped to, and the code that catches the exception does not (may not) know where it came from. Neither piece of code may know why the exception was raised.
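Python's try/finally has the same property, so it makes a convenient illustration. In this sketch (Transaction and transfer are hypothetical) the finally block is reached on success, when any step fails, and when commit itself fails, and only an explicit flag tells those arrivals apart.

```python
class Transaction:
    """Toy transaction that records what was done to it."""

    def __init__(self):
        self.log = []

    def commit(self):
        self.log.append("commit")

    def rollback(self):
        self.log.append("rollback")


def transfer(txn, steps):
    committed = False
    try:
        for step in steps:
            step()  # any of these calls may raise
        txn.commit()
        committed = True
    finally:
        # We cannot tell from here which line we came from;
        # the flag is the only record of how far we got.
        if not committed:
            txn.rollback()
```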

So what do I do, instead?

The main thing I do is to minimise error conditions by making things preconditions instead. This removes the need for both exception handling and return code handling. Instead, the calling class must either guarantee or test the precondition (using a method alongside the original one) before calling the function. Code that fails to meet these criteria effectively gets the same treatment as it would if an unhandled exception were thrown. A stack trace is dumped into the log file, and the process terminates. I work in fault-tolerant software. A backup instance of the process takes over and hopefully does not trigger the same code-path. If it does though, it's safer to fail than to continue operating with a known malfunction (I work in safety-related software).
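As a rough python rendering of "a precondition plus a method alongside it", assuming a hypothetical Account class:

```python
class Account:
    def __init__(self, balance):
        assert balance >= 0, "precondition violated: negative opening balance"
        self._balance = balance

    def can_withdraw(self, amount):
        """Query the caller uses to establish the precondition beforehand."""
        return 0 < amount <= self._balance

    def withdraw(self, amount):
        # Not an error path: violating this is a bug in the caller.
        # (Note that python strips assert statements when run with -O.)
        assert self.can_withdraw(amount), "precondition violated: withdraw"
        self._balance -= amount
        return self._balance
```

The calling code either knows the precondition holds or asks first with `can_withdraw`. There is no failure branch inside `withdraw` itself, so an unguarded bad call ends the program with a traceback.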

The general pattern I use to make this a viable alternative to exception and return code handling, though, is to classify my classes into two sets. One set deals with external stimulus, such as user interaction and data from other sources. It is responsible for either cleaning or rejecting the data and must have error handling built into it. Once data has passed through that set objects no longer handle errors. Any error past that point is a bug in my program, not the other guy's. A bug in my program must terminate my program.

Since most software is internal to the program, most software is not exposed to error handling either by the exception or return code mechanisms. A small number of classes do have to have an error handling model, and for that set I continue to use return codes as the primary mechanism. I do use exceptions, particularly where there is some kind of deeply-nested factory object hierarchy and faults are detected in the bowels of it. I do so sparingly.
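The split between the two sets can be sketched as follows. All three function names are hypothetical, and the "dollars.cents" input format is an assumption made purely for the example.

```python
def parse_amount(text):
    """Boundary code: returns (ok, cents) and never raises on bad input."""
    parts = text.strip().split(".")
    if len(parts) == 1:
        parts.append("00")
    if len(parts) != 2:
        return (False, 0)
    dollars, cents = parts
    if not (dollars.isdecimal() and len(cents) == 2 and cents.isdecimal()):
        return (False, 0)
    return (True, int(dollars) * 100 + int(cents))


def add_cents(total, cents):
    """Internal code: data was cleaned at the boundary, so only assert."""
    assert isinstance(cents, int) and cents >= 0, "bug: unclean amount"
    return total + cents


def total_from_user_input(lines):
    """Glue: reject bad lines by return code, pass clean data inwards."""
    total, rejected = 0, []
    for line in lines:
        ok, cents = parse_amount(line)
        if ok:
            total = add_cents(total, cents)
        else:
            rejected.append(line)
    return total, rejected
```

Only `parse_amount` has an error-handling model. Everything behind it treats a bad value as a program bug.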

I'm the kind of person who likes to think in very simple terms about his software. A code path must be recognisable as a code path, and handling things by return code makes that feasible. Exceptions add more code paths, ones that don't go through the normal decision or looping structures of a language. Without that visual cue that a particular code path exists, and without a way to minimise the number of paths through a particular piece of code, I'm extremely uncomfortable. Code should be able to be determined correct by inspection, but the human mind can only deal with so many conditions and branches at once. Exceptions put a possible branch on every line of code, and that is why I consider them evil.


Sun, 2004-Oct-17

It's the poor code in the middle that gets hurt

Adrian Sutton argues that exceptions are not in fact harmful but helpful. I don't know about you, but I'm a stubborn bastard who needs to be right all the time. I've picked a fight, and I plan to win it ;)

Adrian is almost right in his assertion that

Checking return codes adds exactly the same amount of complexity as handling exceptions does - it's one extra branch for every different response to the return code.

but gives the game away with this comment:

I'd move the exception logic up a little higher by throwing it from this method and catching it above somewhere - where depends on application design and what action will be taken in response to each error.

He's right that exceptions add no more complexity where they are thrown or where they are finally dealt with. It's the code in-between that gets hurt.

It's the code in-between that suddenly has code-paths that can trigger on any line of code. It's the code in-between that has to be written defensively according to an arms treaty that it did not sign and for which it is not aware of the text. It is the code in-between that suffers and pays.
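To make the "code in-between" concrete, here is a small python sketch. Ledger and flaky_audit are hypothetical; the point is only that post() neither throws nor catches anything, yet an exception passing through it leaves its invariant broken.

```python
def flaky_audit(amount):
    """Hypothetical callee that sometimes raises."""
    if amount > 100:
        raise IOError("audit log full")


class Ledger:
    """Invariant: the balances always sum to zero (double entry)."""

    def __init__(self, audit):
        self.balances = {"cash": 0, "sales": 0}
        self._audit = audit

    def post(self, amount):
        self.balances["cash"] += amount
        self._audit(amount)  # hidden exit point in the middle of the update
        self.balances["sales"] -= amount

    def balanced(self):
        return sum(self.balances.values()) == 0
```

If some caller higher up catches the IOError and keeps going, it is holding a Ledger whose two halves no longer agree, and nothing in post() gives a visual cue that this path exists.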

This article is what got many of us so paranoid about exception handling. It is referenced in this boost article supportive of the use of exceptions under the "Myths and Superstitions" section but which doesn't address my own central point of increased number of code paths. Interestingly, in its example showing that exceptions don't make it more difficult to reason about a program's behaviour they cite a function that uses multiple return statements and replace it with exceptions. Both are smelly in my books.

Code should be simple. Branches should be symmetrical. Loops and functions should have a single point of return. If you break these rules already then exceptions might be for you.

Personally, they form a significant part of my coding practice. I take very seriously the advice attributed to Brian W. Kernighan:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

Adrian also picks up my final points about how to program with the most straightforward error checking, and perhaps I should have worded my final paragraphs more clearly to avoid his confusion. I don't like threads (that share data). They do cause similar code path blowouts and comprehensibility problems as do exceptions unless approached with straightforward discipline (such as a discipline of message-passing-only between threads). Dealing with the operating system can't be written off so easily, and my note towards the end of my original post was meant to convey my sentiment of "exceptions are no good elsewhere, and if the only place you can still argue their valid use is when dealing with the operating system... well... get a life" :) Most software is far more internally-complex than it is complex along its operating system boundary. If you want to use exceptions there, feel free. Just don't throw them through other classes. I personally think that exceptions offer no better alternative to return code checking in that limited environment.


Sat, 2004-Oct-16

Exceptions are evil

For today's entry in my series of "programming is evil" topics, we have exceptions. Yup. They're evil.

Good code minimises branches. Particularly, it minimises branches that can't be exercised easily with testing. Testability is a good indicator of clean design. If you can't prod an object sufficiently with its public interface to get decent coverage of its internal logic, chances are you need to break up your class or do something similarly drastic. Good design is testable (despite testable design not always being good).

Enter exceptions. Exceptions would not be the complete vilest of sins ever inflicted on computer software except for the terrible error some programmers make of catching them. It isn't possible to write a class that survives every possible exception thrown through its member functions while maintaining consistent internal state. It just isn't. Whenever you catch an exception your data structures are (almost) by definition screwed up. This is the time when you stop executing your program and offer your sincerest apologies to your user.

Unfortunately, that's not always what happens. Some people like to put a "catch all exceptions" statement somewhere near the top of their program's stack. Perhaps it is in the main. Perhaps it is in an event loop. Even if you die after catching that exception you've thrown away the stack trace, the most useful piece of information you had available to you to debug your software.

Worse still, some people catch exceptions and try to keep executing. This guarantees errant behaviour will occur only in subtle and impossible to track ways, and ensures that any time an error becomes visible the cause of it is lost to the sands of time.

The worst thing that catching an exception does, though, is add more paths through your code. An almost infinite number of unexpected code paths sprouts in your rich soil of simplicity, stunting all possible growth. Pain in the arse exceptions.

To minimise paths through your code, just follow a few simple guidelines:

  1. Make everything you think should be a branch in your code a precondition instead
  2. Assert all your preconditions
  3. Provide functions that client objects can use to determine ahead of time whether they meet your preconditions

In cases where the state of preconditions can change between test and execute, return some kind of lock object (where possible) to prevent it. Of course multiple threads are evil too, and for many of the same reasons. When dealing with the operating system, just use the damn return codes ;)
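The lock-object suggestion might be sketched like this. SeatMap and Hold are hypothetical names: try_hold both tests the precondition and freezes it, so nothing can invalidate it between the test and the later call.

```python
class Hold:
    """Token proving that a precondition was tested and still holds."""

    def __init__(self, seat_map, seat):
        self.seat_map = seat_map
        self.seat = seat
        self.active = True

    def release(self):
        if self.active:
            self.seat_map._release(self.seat)
            self.active = False


class SeatMap:
    def __init__(self, seats):
        self._free = set(seats)
        self._held = set()

    def try_hold(self, seat):
        """Test the precondition; on success return a Hold, otherwise None."""
        if seat not in self._free:
            return None
        self._free.remove(seat)
        self._held.add(seat)
        return Hold(self, seat)

    def book(self, hold):
        # Precondition: the caller presents a live hold issued by this map.
        assert hold.active and hold.seat in self._held, "book() without a hold"
        self._held.remove(hold.seat)
        hold.active = False
        return hold.seat

    def _release(self, seat):
        self._held.remove(seat)
        self._free.add(seat)
```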


Sat, 2004-Oct-16

The evil observer pattern

There are some design patterns that one can take a lifetime to get right. The state pattern is one of these. As you build up state machines with complicated sub-states and resources that are required in some but not all states some serious decisions have to be made about how to manage the detail of the implementation. Do you make your state objects flyweight? How do you represent common state? When and how are state objects created and destroyed? For some time I've thought that the observer pattern was one of these patterns that simply take experience to get right. Lately I've been swinging from that view. I've now formed the view that the observer pattern itself is the problem. Before I start to argue my case, I think we have to look a little closer at what observer is trying to achieve.

The Observer pattern is fairly simple in concept. You start with a client object that is able to navigate to a server object. The client subscribes to the server object using a special inherited interface class that describes the calls that can be made back upon the client. When events occur, client receives the callbacks.
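In python the bare pattern might be sketched as below. Observer, Subject and Display are illustrative names rather than code from any particular framework.

```python
class Observer:
    """Callback interface that the client inherits from."""

    def on_changed(self, subject, value):
        raise NotImplementedError


class Subject:
    """The 'server' object that clients subscribe to."""

    def __init__(self):
        self._observers = []
        self._value = None

    def subscribe(self, observer):
        self._observers.append(observer)

    def unsubscribe(self, observer):
        self._observers.remove(observer)

    def set_value(self, value):
        self._value = value
        # Iterate over a copy so callbacks may unsubscribe safely.
        for observer in list(self._observers):
            observer.on_changed(self, value)


class Display(Observer):
    """The 'client': records every change it is told about."""

    def __init__(self):
        self.seen = []

    def on_changed(self, subject, value):
        self.seen.append(value)
```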

Observer is often talked about in the context of the Model-View-Controller mega-pattern, where the model object tells the controller about changes to its state as they occur. It is also a pattern crucial to the handling of asynchronous communications and actions in many cases, where an object will register with a scheduler to be notified when it can next read or write to its file descriptor. I developed a system in C++ for my current place of employment very much based around this concept. Because each object is essentially an independent entity connected to external stimulus via observer patterns they can be seen to live a life of their own, decoupled nicely.

The problems in the observer pattern start to emerge when you ask questions like: "When do I deregister from the server object?" and "If this observation represents the state of some entity, should I get a callback immediately?" and "If I am the source of this change event, how should I update my own state to reflect the change?".

The first question is hairier than you might think. In C++ you start to want to create reference objects to ensure that objects deregister before they are destroyed. Even in a garbage-collected system this pattern starts to cause reference leaks and memory growth. It's a pretty nasty business, really.
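The reference leak in a garbage-collected language is easy to demonstrate in python. In this sketch (Server, Client and still_alive_after_drop are hypothetical) a weak reference is used only as a probe to see whether the client object has really gone away.

```python
import gc
import weakref


class Server:
    def __init__(self):
        self._observers = []  # strong references to every subscriber

    def subscribe(self, observer):
        self._observers.append(observer)

    def unsubscribe(self, observer):
        self._observers.remove(observer)


class Client:
    def __init__(self, server):
        server.subscribe(self)


def still_alive_after_drop(deregister):
    """Drop our only reference to a client; report whether it survives."""
    server = Server()
    client = Client(server)
    probe = weakref.ref(client)
    if deregister:
        server.unsubscribe(client)
    del client
    gc.collect()
    return probe() is not None
```

`still_alive_after_drop(False)` reports that the client is still alive, because the server's subscriber list keeps it reachable for as long as the server itself lives.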

The second two questions are even more intriguing. I've settled on a policy of no immediate callbacks and (where possible) no external notification of a change I've caused. The reasons for this are tied up with objects still being on the stack when they receive notifications of changes, and also with the nature of how an object wants to update itself often changing subtly between a purely external event (caused by another object) and one that this object is involved in.

I've begun to favour a much more holistic approach to the scenarios that I've used observer patterns for in the past. I'm leaning towards something much more like the Erlang model. Virtual concurrency. Message queues. Delayed execution. I've built a little trial model for future work in my professional capacity that I hope to integrate into some of the work already in-place, but the general concept is as follows: We define a straightforward message source and sink pair of baseclasses. We devise a method of connecting sources and sinks. Each stateful object has some sources which provide data to other objects, and some sinks to receive updates on.

Once connected, the sinks are able to receive updates, but instead of updating their main object directly, they register that object with a scheduler to ensure that it is eventually reevaluated. In the meantime the sink clears an initial flag that can be used to avoid processing uninitialised input, and sets a changed flag that can be used by the object to avoid duplicate processing.
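A guess at the general shape of such a source/sink arrangement, written in python. Every class here (Scheduler, Sink, Source, Adder, Recorder) is invented for illustration and makes no claim to match the trial model described above.

```python
from collections import deque


class Scheduler:
    """Queue of objects awaiting re-evaluation; each is queued at most once."""

    def __init__(self):
        self._queue = deque()

    def request(self, obj):
        if obj not in self._queue:
            self._queue.append(obj)

    def run(self):
        while self._queue:
            self._queue.popleft().evaluate()


class Sink:
    def __init__(self, owner, scheduler):
        self.owner, self.scheduler = owner, scheduler
        self.initial = True   # no value has arrived yet
        self.changed = False  # value arrived since the owner last looked
        self.value = None

    def update(self, value):
        self.value = value
        self.initial = False
        self.changed = True
        self.scheduler.request(self.owner)  # delayed, never an immediate call


class Source:
    def __init__(self):
        self._sinks = []

    def connect(self, sink):
        self._sinks.append(sink)

    def publish(self, value):
        for sink in self._sinks:
            sink.update(value)


class Adder:
    """Hypothetical stateful object: two sinks in, one source out."""

    def __init__(self, scheduler):
        self.a = Sink(self, scheduler)
        self.b = Sink(self, scheduler)
        self.total = Source()
        self.evaluations = 0

    def evaluate(self):
        if self.a.initial or self.b.initial:
            return  # uninitialised input: nothing sensible to compute
        if not (self.a.changed or self.b.changed):
            return  # nothing new: avoid duplicate processing
        self.a.changed = self.b.changed = False
        self.evaluations += 1
        self.total.publish(self.a.value + self.b.value)


class Recorder:
    """Hypothetical end consumer that remembers each value it evaluates."""

    def __init__(self, scheduler):
        self.input = Sink(self, scheduler)
        self.seen = []

    def evaluate(self):
        if not self.input.initial and self.input.changed:
            self.input.changed = False
            self.seen.append(self.input.value)
```

Several updates arriving before the scheduler runs collapse into a single evaluation, and no object is ever called back while it is still on the stack publishing its own change.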

I'm working on making it a little more efficient and straightforward to use in the context of existing framework, but I'm confident that it will feature prominently in future development.

Because of its ties with HMI-related code, this will probably also make an appearance in TransactionSafe. As it is, I developed the ideas at home with TransactionSafe first, before moving the concept to code someone else owns. Since I'll be working more and more on HMI-related code at work, TransactionSafe may become a testbed for a number of concepts I'm going to use elsewhere.

I actually haven't had the chance to do much python-hacking of late due to illness and being completely worn-out. I don't know if I'll get too much more done before my holidays, but I'll try to get a few hours invested each week.


Sun, 2004-Oct-03

Playing with Darcs

I know that the revision control system is one of the essential tools of a programmer, behind only the compiler/interpreter and the debugger, but so far for TransactionSafe I haven't used one. I suppose it is also telling that I haven't used a debugger so far, relying solely on the old fashioned print(f) statement to get to the bottom of my code's behaviour. Anyway, on the issue of revision control I have been roughly rolling my own. I toyed with rcs while the program was small enough to fit reasonably into one file, but since it broke that boundary I have been rolling one tar file per release, then renaming my working directory to that of my next intended revision.

That is a system that works well with one person (as I remain on this project) but doesn't scale well. The other thing it doesn't handle well is the expansion of a project release into something that encapsulates more than a few small changes. A good revision control system allows you to roll back or review the state of things at a sub-release level. To use the terminology of Telelogic CM Synergy, a good revision control system allows you to keep track of your changes at the task level.

Well, perhaps by now you've worked out why I haven't committed to using a specific tool in this area yet. I use CM Synergy professionally, and am a little set in my ways. I like task-based CM. In this model (again using the CM Synergy terminology), you have a project. The project contains objects from the central repository, each essentially given a unique object id and a version number. A project version can either be a baseline version of an earlier release, or a moving target that incorporates changes going forward for a future release. Each developer has their own version of the project, which only picks up new versions as they require them. Each project only picks up changes based on the developer's own criteria.

The mechanism of change acceptance is the reconfigure. A reconfigure accepts a baseline project with its root directory object, and a list of tasks derived by specific inclusion or by query. Tasks are beasts like projects. They each contain a list of specific object versions. Directories contain a specific list of objects, but do not specify their versions. The reconfigure process is simple. Take the top-level directory, and select the latest version of it from the available tasks. Take each of its child objects and choose versions of them using the same algorithm. Recurse until done. Developer project versions can be reconfigured, and build manager project versions can be reconfigured. It's easy as pie to put tasks in and to pull them out. Create a merge task to integrate changes together, and Bob's your uncle.
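The reconfigure described above can be sketched in a few lines of Python. This is only an illustration of the recursion, not CM Synergy's real selection logic: the data structures are invented, versions are assumed to be plain integers, and "latest" is assumed to mean the highest number among the baseline and the included tasks.

```python
def reconfigure(root_id, baseline, tasks, children):
    """Pick one version for every object reachable from root_id.

    baseline: dict of object id -> version in the baseline project
    tasks:    list of dicts, each mapping object id -> version in that task
    children: dict of (directory id, version) -> list of child object ids
    """
    # Gather every candidate version offered by the baseline and the tasks.
    candidates = {obj: [version] for obj, version in baseline.items()}
    for task in tasks:
        for obj, version in task.items():
            candidates.setdefault(obj, []).append(version)

    selected = {}

    def visit(obj):
        if obj in selected or obj not in candidates:
            return
        version = max(candidates[obj])  # assumption: highest number is latest
        selected[obj] = version
        # A directory version lists its children without naming their versions.
        for child in children.get((obj, version), []):
            visit(child)

    visit(root_id)
    return selected
```

Putting a task in or pulling it out is then just a matter of changing the tasks list and reconfiguring again.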

This simple mechanism, combined with some useful conflict-checking algorithms to check for required merges, makes CM Synergy the best change management system I've used. It's a pity, really, because the interface sucks and it's obvious that no serious money has gone into improving the system for a decade. Since it's closed source I can't make use of it for my own work, nor can it be effectively improved without serious consulting money changing hands.

I've been out of the revision control circles for a few years, now, but Anthony Towns' series of entries regarding darcs has piqued my interest. Like CM Synergy, every developer gets to work with their own branch version of the project. Like CM Synergy, each unit of change can be bundled up into a single object and either applied or backed out of any individual or build manager's project. The main difference between darcs' and CM Synergy's approaches appears to be that while CM Synergy's change unit is a set of specifically selected object versions, darcs' change unit is the difference between those object versions and the previous versions.

It is an interesting distinction. On the plus side for darcs, this means you don't have to have explicit well-known version numbering for objects. In fact, it is likely you'll be able to apply the patch to objects not exactly the same as those the patch was originally derived-from. That at least appears to bode well for ad-hoc distributed development. On the other side, I think this probably means that the clever conflicts checking algorithms that CM Synergy uses can't be applied to darcs. It might make it harder to be able to give such guarantees as "All of the object versions in my release have had proper review done on merges". Perhaps there are clever ways to do this that I haven't thought of, yet.

On the whole darcs looks like the revision control system most attuned to my way of thinking about revision control at the moment. I'll use it for the next few TransactionSafe versions and see how I go.

While I'm here, I might as well give an impromptu review of the telelogic products on display:

CM Synergy: Good fundamental design, poor interface.
Change Synergy: A poor man's bugzilla at a rich man's price. It's easy to get the bugs in, but the query system will never let you find them again. Don't go near it.
Object Make: Give me a break. I got a literal 20x performance improvement out of my build by moving to gnu make. This product has no advantages over free alternatives.
DOORS: A crappy requirements tracking system in the midst of no competition. It is probably the best thing out there, but some better QA and some usability studies would go a long way to making this worth the very hefty pricetag.

Fri, 2004-Oct-01

Singletons in Python

As part of my refactoring for the upcoming 0.3 release of TransactionSafe (which I hope to include an actual register to allow actual transactions to be entered) I've made my model classes singletons. Singletons are a pattern I find makes some things simpler, and I hope this is one of those things.

As I'm quite new to python idioms I did some web searching to find the "best" way to do this in Python. My searching had me end up at this page. I took my pick of the implementations available, and ended up using the one at the end of the page from Gary Robinson. His blog entry on the subject can be found here.

Gary was kind enough to release his classes into the public domain, and as is my custom I requested explicit permission for use in my project. As it happens, he was happy to oblige.

As well as the standard singleton pattern, I like to use something I call a "map" singleton under C++. That simply means that instead of one instance existing per class you have one instance per key per class. I've renamed it to "dict" singleton for use in python and adapted Gary's original code to support it. In keeping with the terms of release of his original code I hereby release my "dict" version to the public domain.
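To show what "one instance per key per class" means, here is a small stand-alone sketch. It is neither Gary Robinson's code nor the adapted version mentioned above; the names DictSingleton, get_instance, AccountModel and CommodityModel are made up for the example.

```python
class DictSingleton:
    """Base class that hands out one instance per key per subclass."""

    # Shared registry, keyed by (class, key) so subclasses do not collide.
    _instances = {}

    def __init__(self, key):
        self.key = key

    @classmethod
    def get_instance(cls, key):
        slot = (cls, key)
        if slot not in DictSingleton._instances:
            DictSingleton._instances[slot] = cls(key)
        return DictSingleton._instances[slot]


class AccountModel(DictSingleton):
    """Hypothetical model class, one instance per database file name."""


class CommodityModel(DictSingleton):
    """A second hypothetical model class sharing the same keys."""
```

Asking for AccountModel.get_instance("books.sqlite") twice returns the same object, while a different key or a different class gives a different one. Nothing here stops a caller from constructing AccountModel directly, which a stricter version might want to prevent.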

Fri, 2004-Oct-01

Less Code Duplication != Better Design

I'm sure that Adrian Sutton didn't mean to imply in this blog entry that less code duplication implied better design, but it was a trigger for me to start writing this rant that has been coming for some time.

Code duplication has obvious problems in software. Code that is duplicated has to be written twice, and maintained twice. Cut-and-paste errors can lead to bugs being duplicated several times but only being fixed once. The same pattern tends to reduce developer checking and reading as they cut-and-paste "known working" code that may not apply as well as they thought to their own situation. Of all the maintenance issues that duplicated code can cause, though, your first thought on seeing duplication or starting to add duplication should not be "I'll put this in a common place". No! Stop! Evil! Bad!

Common code suffers from its own problems. When you share a code snippet without encapsulating it in a unit of code that has a common meaning you can cause shears to occur in your code as that meaning or parts of it change. Common code that changes can have unforeseen ramifications if its actual usage is not well understood throughout your software. You end up changing the code for one purpose and start needing exceptional conditions so that it also meets your other purposes for it.

Pretty soon, your common-code-for-the-sake-of-common-code starts to take on a life of its own. You can't explain exactly what it is meant to do, because you never really encapsulated that. You can only try to find out what will break if its behaviour changes. As you build and build on past versions of the software, cohesion itself disappears.

To my mind, the bad design that comes about from prematurely reusing common-looking code is worse than that of duplicated code. When maintaining code with a lot of duplication you can start to pull together the common threads and concepts that have been spelt out by previous generations. If that code is all tied up in a single method, however, the code may be impossible to save.

When trying to improve your design, don't assume that reducing duplication is an improvement. Make sure your modules have a well defined purpose and role in module relationships first. Work on sharing the concept of the code, rather than the behaviour of the code. If you do this right, code duplication will evaporate by itself. Use the eradication of code duplication as a trigger to reconsider your design, not as a design goal in itself.

In the end, a small amount of code duplication probably means that you've designed things right. Spend enough design effort to keep the duplication low, and the design simplicity high. Total elimination of duplication is probably a sign that you're overdesigning anyway. Code is written to be maintained. It's a process and not a destination. Refactor, refactor, refactor.


Sat, 2004-Sep-25

TransactionSafe release 0.1

Well, the first revision is out. It's buggy, non-functional, and poorly-designed. Oh, and it doesn't really handle any error or exceptional cases, yet. You've got to start somewhere, I suppose. Currently the application only supports creation of account trees (not even deletion of accounts). You can't enter transaction data or get it back out again in the form of reports. The design is a first-cut pulled together from a lot of examples on the web rather than one that was planned. I also have a bug that I haven't gotten to the bottom of where when you rename an account it sometimes looks like it's changing the name of a completely different account.

I decided to go with python for this revision, and have found no reason yet that I will move away from it in future revisions. I've had to learn python (which I'd never programmed in before), and gtk (which I'd never programmed with before), but at least I was familiar with the sqlite interface. Given the hurdles I think I've done ok for a first cut.

I made the choice of python because although it isn't the shortest path between sqlite and gtk, it does appear to be the path of least resistance. I haven't had to even think about creating a "database abstraction layer" because the python database model of returning row objects with attributes already in place to do the relevant things is exactly what I would have wanted to put together anyway. I really like the way python deals with databases.
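For a flavour of what that looks like, here is a small sketch using the sqlite3 module from today's Python standard library rather than the binding used for this release. The Account table and its columns are assumptions made for the example, and sqlite3.Row exposes columns by name through indexing rather than as attributes.

```python
import sqlite3


def list_accounts(conn):
    """Return (id, name) pairs, reading each row by column name."""
    # sqlite3.Row lets each result row be indexed by column name.
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT AccountId, AccountName FROM Account ORDER BY AccountId"
    )
    return [(row["AccountId"], row["AccountName"]) for row in rows]
```

There is no mapping layer to write: the query names the columns and the row object hands them back under those names.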

I intend for my next revision to focus on refactoring the GUI code, which is currently in a terrible state. I might even start to put together a basic register (general ledger). Soon after that point I'll have to go back to my data models and work out how to get predicates from different rdf schemas to play nicely in the same database. At that point the current data model is likely to be replaced by a new one, so don't go using the formats I've specified so far for anything important :)

To those scared off reading my previous documentation revisions by the staroffice format they were released in, fear no longer. I have a pdf in the 0.1 release tarball which I hope will make for somewhat interesting reading, even though I haven't really put the time into it to clean it up or to cover what I'm doing with the python.

Anyway, that about does it. The quick link is here

Sat, 2004-Sep-11

A draft in time for HUMBUG

I haven't received any feedback on drafts of my proposal so far, which perhaps isn't surprising given that it's 26 pages long and doesn't even get to talking about design for code or definition of what I'm actually building, yet. Anyway, I thought I'd get another draft (TransactionSafe_Proposal,1.1C.sxw) out before going to HUMBUG for the evening.

This draft tightens up a few things in terms of the schema and conversions. It writes up some requirements for what kind of RDF schemas I expect to work with and how they should be translated to sqlite queries. It also incorporates lot support into the RDF and SQLite queries.

This is the last revision I intend to write before starting to hammer out an API for dealing with the data. I'm still leaning towards doing things in C++, as old habits die hard... but I'm not really thrilled about any current alternatives (C++ included). Anyway... it's more of the conceptual framework I want to map out. The ideas of how you want to access database data inside a software application. My basic requirements are that you should be able to write sql queries to your heart's content, but also be able to work with and navigate between objects without having to do any string concatenation in the process.

Have fun reading


Sun, 2004-Sep-05

My sourceforge submission

I've put together a project submission for sourceforge and if accepted I intend to upload a copy of my current proposal document. Any advice, suggestions, or feedback from Humbug members or other readers of this blog is welcome. It's that feedback I hope a sourceforge presence will elicit.

The project name I've chosen for now is TransactionSafe. It keeps a focus on the practical outcomes I want from this project in terms of a better personal accounting system. It perhaps detracts a little from my grander schemes of universal data access, but I think I need to remain focused on the single task if anything greater is going to happen anyway.

The proposal document I have is in openoffice format (which is a package I personally feel stinks like the goats, but few free alternatives exist for what I'm trying to do). I have two revisions which roughly correspond to two weekends' musings that can be found at TransactionSafe_Proposal,1.1A.sxw, and TransactionSafe_Proposal,1.1B.sxw respectively. I haven't done much this weekend except a marginal amount of cleanup on the 1.1B version.

The first revision walks a fairly naive road and tries to essentially reinvent the wheel wherever possible. 1.1B is my first attempt to map my thinking to available technologies and at least attempts to use an rdf schema to describe the data model. I'm not sure that the rdfs is actually correct, as I haven't actively investigated any validation tool alternatives.

I've laid out much of my technical and philosophical thinking in the document and hopefully it is scrubbed up enough that it won't get too many derisive giggles. There are a couple of inconsistencies already emerging with 1.1B retaining the 1.1A section on lots but not actually including them into its data model (I'm still thinking about how best to implement them). I also expect changes to the data model with the addition of things like account types (is it an asset or a liability, for example?) and the obligatory challenges as to what should be left to the chart of accounts if required and what should be dealt with explicitly by the data model. I may also go back to some existing accounting data model definitions to see if I can straighten out some terms and make them more consistent in my own model (xbrl-gl is an obvious source of material for comparison).

I'm hoping that another side-effect of sourceforge exposure will be that I'm pushed to actually producing something concrete (however insignificant) at least once every couple of weekends to get into the swing of this thing. I'm far too prone to dropping things I get bored of if they're not over quickly, and I don't want this itch to go unscratched.

Sun, 2004-Jul-18

Status update

I was asked tonight at humbug about the status of my accounting software concepts and where I thought they were going. To some bemusement of the onlookers I had to say that it hadn't come very far.

I'm coming to terms with a number of concepts I haven't had to be directly exposed to before, and I'm the kind of person who has to understand a bit of philosophy before continuing onto the real deal.

Here is the summary of my thinking so far:

Firstly, the technology stack. It's short. It's sqlite. I want to create a database schema in sqlite that could become a unifying force rather than a divisive force in free software accounting circles. Ideally, I would build a back-end API for use in both gnucash and kmymoney2 which could be used by disparate apps to extract and enter accounting data. Although lacking in at least one essential feature, sqlite as a database-in-a-file appeals terrifically to me for various reasons. I'm also intimately familiar with much of the sqlite source code, so could modify it as necessary to suit my goals. This might mean a branch for my proposed works, or may be something I can feed back to the author D Hipp.

Secondly, there's the database schema. I'm without a way of running visio or anything I've encountered that I'd consider up-to-scratch for even this diagram, so I'll describe it. You have a transaction entity. You have a transaction entry entity. You have an account entity. You have a commodity entity. The rules are these: A transaction can have multiple entries. An account can have multiple entries. Each entry has a single transaction and account. Each account has a single currency. The final rule is that the amounts of all the transaction entries associated with a single transaction must sum to zero for their respective currency. Positive amounts indicate debit. Negative amounts indicate credit.

There are a couple of special fields in the data model. Each transaction has a date. I'm considering also giving either transactions or their entries a lot number. Each transaction entry has a memo field, and a reconciled status. Accounts are a special case unto themselves.

Since SQL doesn't handle graphs very well, I'd really like to be able to use some kind of bridge into rdf data to manage the relationships between accounts. As it is, I plan to create a table with an rdf feel to it. It will have three columns: subject, predicate, and object. The obvious predicate is something like "parent", where the subject is any account and the object is another account. It's a pity I'll still only be able to use sql over the structure (although it might be possible in alternate realities and future worlds to splice an rdf query mechanism in above the sqlite btree layer...).
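As a rough illustration of living with plain sql over such a table, the sketch below walks up the account tree one query at a time. The table name Relationships and the function names are assumptions for the example, and account names stand in for whatever identifiers the real subject and object columns would hold.

```python
import sqlite3


def make_relationships(conn):
    """Create the three-column table with an rdf feel to it."""
    conn.execute("CREATE TABLE Relationships (Subject, Predicate, Object)")


def add_parent(conn, child, parent):
    """Record that the parent of child is parent."""
    conn.execute(
        "INSERT INTO Relationships VALUES (?, 'parent', ?)", (child, parent)
    )


def ancestors(conn, account):
    """Follow 'parent' triples upward, issuing one query per level."""
    chain = []
    seen = {account}
    while True:
        row = conn.execute(
            "SELECT Object FROM Relationships"
            " WHERE Subject = ? AND Predicate = 'parent'",
            (account,),
        ).fetchone()
        # Stop at the root, or if the data accidentally contains a cycle.
        if row is None or row[0] in seen:
            return chain
        account = row[0]
        seen.add(account)
        chain.append(account)
```

The loop in the calling code is exactly the part a graph-aware query language would do for you.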

Now... since the database effectively "is" the API of the backend I'm proposing, we need to consider how changes to the database structure will affect applications. Since I plan for this structure to be used and reused across applications and for ad-hoc query and reporting applications to be written independent of the application that created the accounting data, I need to consider a versioning scheme. Not only that, but I need to consider what other data might live alongside the accounting data, and be able to ensure that my data doesn't interfere with it or vice-versa while still allowing applications to query across the combination.

My current thoughts aren't well-formed on this issue, but wander along the COM versioning ideas. Essentially you never allow something that is different to the earlier version to be called by the same name. You do this by assigning unique ids to things you care about being the same. I haven't prototyped this concept at all, but my thinking is that all table names would be uuids gathered from a program like uuidgen. Here's one: f013170f-b8ff-419f-abb7-81306e2ccbdb. When the structure or semantics of that data changes, I create a new table to hold the new structure and call it 09c14549-3ab0-4517-a052-aba00af2c30d. I probably also create a table with a well-known uuid to map between uuids and sensible names and version numbers for applications that don't care about certain minor revisions. My thinking is that a minor revision would be one that doesn't cause queries that select specific columns to have to be altered, for example adding a new column. A major change would be one where columns changed names or meaning. Any application that inserts data would likely be sensitive to all schema changes.

Migration programs could be written to take advantage of the explicit versioning structure. When the program finds old data it could move it or replicate it into the new form. Additionally, multiple schemas could live alongside each other. In addition to the accounts themselves, a small payroll system might be included or a table to track market value of your shares.

We end up with a database schema that looks something like this:

-- Column lists are abridged to the columns named by the indexes and by the example query.
CREATE TABLE '5ac164f0-78d2-4461-bb6f-12bbb32b39f6'(
        UUID, Source, Name, Major, Minor
);
INSERT INTO '5ac164f0-78d2-4461-bb6f-12bbb32b39f6' VALUES(
        "5ac164f0-78d2-4461-bb6f-12bbb32b39f6", "", "Schema", 1, 0
);

CREATE TABLE 'ded2f12d-8fd5-4490-bb9e-3e3b31c46b22'(
        TransactionHeaderId INTEGER PRIMARY KEY,
        Date
);
INSERT INTO '5ac164f0-78d2-4461-bb6f-12bbb32b39f6' VALUES(
        "ded2f12d-8fd5-4490-bb9e-3e3b31c46b22", "", "TransactionHeader", 1, 0
);
CREATE INDEX '11da5cb6-c03f-43c4-917a-c0f2502c5bc2' ON 'ded2f12d-8fd5-4490-bb9e-3e3b31c46b22' (Date);

CREATE TABLE '7ba6ce04-66d2-4f32-9d40-31f5838d5bd4'(
        TransactionEntryId INTEGER PRIMARY KEY,
        AccountId,
        Amount
);
INSERT INTO '5ac164f0-78d2-4461-bb6f-12bbb32b39f6' VALUES(
        "7ba6ce04-66d2-4f32-9d40-31f5838d5bd4", "", "TransactionEntry", 1, 0
);
CREATE INDEX 'c234ebb4-11e5-4b09-9036-32a1486fd5fa' ON '7ba6ce04-66d2-4f32-9d40-31f5838d5bd4' (AccountId);

CREATE TABLE '08fd9a02-1497-4f31-8bcf-dc9d4fed74fd'(
        AccountId INTEGER PRIMARY KEY,
        AccountName
);
INSERT INTO '5ac164f0-78d2-4461-bb6f-12bbb32b39f6' VALUES(
        "08fd9a02-1497-4f31-8bcf-dc9d4fed74fd", "", "Account", 1, 0
);
CREATE INDEX '686a4d47-6cd8-48fb-a8ba-e844e13d85a2' ON '08fd9a02-1497-4f31-8bcf-dc9d4fed74fd' (AccountName);

CREATE TABLE '31580110-8eb8-42a1-909a-9aa72cb9534a'(
        CommodityId INTEGER PRIMARY KEY,
        CommodityName
);
INSERT INTO '5ac164f0-78d2-4461-bb6f-12bbb32b39f6' VALUES(
        "31580110-8eb8-42a1-909a-9aa72cb9534a", "", "Commodity", 1, 0
);
CREATE INDEX 'dfbfa695-dee2-4e61-90a0-2000d72e6e2d' ON '31580110-8eb8-42a1-909a-9aa72cb9534a' (CommodityName);

CREATE TABLE '58426fd9-6b99-4e2c-8f5f-975b5508ae93'(
        Subject, Predicate, Object
);
INSERT INTO '5ac164f0-78d2-4461-bb6f-12bbb32b39f6' VALUES(
        "58426fd9-6b99-4e2c-8f5f-975b5508ae93", "", "Relationships", 1, 0
);
CREATE INDEX 'd08d9b0a-3391-4e60-84c7-b8af312b1ad7' ON '58426fd9-6b99-4e2c-8f5f-975b5508ae93' (Subject, Predicate);
CREATE INDEX '151b627b-af70-43df-959d-9dd43301f6e7' ON '58426fd9-6b99-4e2c-8f5f-975b5508ae93' (Object, Predicate);

An obvious flaw with this as a database model for the time being is that transactions are not checked to ensure they sum to zero. I'm going to have to think some more about that.
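One possible stopgap, short of enforcing the rule in the database, is a query that reports offending transactions after the fact. The sketch below uses readable table names instead of uuids, and the TransactionHeaderId and CommodityId columns are guesses at how entries link to transactions and accounts link to commodities. Amounts are assumed to be whole numbers (cents, say) so the comparison with zero is exact.

```python
import sqlite3

# Assumed, simplified schema for the example only.
SCHEMA = """
CREATE TABLE Account (
    AccountId INTEGER PRIMARY KEY, AccountName, CommodityId);
CREATE TABLE TransactionEntry (
    TransactionEntryId INTEGER PRIMARY KEY,
    TransactionHeaderId, AccountId, Amount);
"""

# One result row per (transaction, commodity) pair that fails to sum to zero.
UNBALANCED = """
SELECT e.TransactionHeaderId, a.CommodityId, SUM(e.Amount)
FROM TransactionEntry e JOIN Account a USING (AccountId)
GROUP BY e.TransactionHeaderId, a.CommodityId
HAVING SUM(e.Amount) <> 0
ORDER BY e.TransactionHeaderId, a.CommodityId
"""


def unbalanced_transactions(conn):
    """Return (transaction id, commodity id, imbalance) for each failure."""
    return conn.execute(UNBALANCED).fetchall()
```

An empty result means every transaction balances in each of its commodities. A check like this could run before commit in the application, though that does nothing for other programs writing to the same file.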

Now, let's see an example reporting application:

sqlite foo.sqlite "SELECT Amount FROM $Account JOIN $TransactionEntry USING (AccountId);"

It's just like a bought one.
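The $Account and $TransactionEntry variables have to come from somewhere. One way an application might fill them in is to look the uuid up in the schema table by name and major version, taking the newest minor revision. The sketch below assumes exactly that policy; the helper names are invented, and only the well-known uuid of the schema table is taken from the listing above.

```python
import sqlite3
import uuid

# Well-known uuid of the table that maps uuids to names and versions.
SCHEMA_TABLE = "5ac164f0-78d2-4461-bb6f-12bbb32b39f6"


def quote(name):
    """Quote a table name so that dashes are accepted by sqlite."""
    return '"' + name.replace('"', '""') + '"'


def create_registry(conn):
    conn.execute(
        f"CREATE TABLE {quote(SCHEMA_TABLE)} (UUID, Source, Name, Major, Minor)"
    )


def create_versioned_table(conn, name, major, minor, columns):
    """Create a table under a fresh uuid and record it in the registry."""
    table = str(uuid.uuid4())
    conn.execute(f"CREATE TABLE {quote(table)} ({columns})")
    conn.execute(
        f"INSERT INTO {quote(SCHEMA_TABLE)} VALUES (?, '', ?, ?, ?)",
        (table, name, major, minor),
    )
    return table


def resolve(conn, name, major):
    """Return the uuid of the newest minor revision, or None if unknown."""
    row = conn.execute(
        f"SELECT UUID FROM {quote(SCHEMA_TABLE)}"
        " WHERE Name = ? AND Major = ? ORDER BY Minor DESC LIMIT 1",
        (name, major),
    ).fetchone()
    return row[0] if row else None
```

A reporting tool that only cares about major version 1 of Account would call resolve(conn, "Account", 1) and splice the quoted result into its query, much as the shell variables do above.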

That's where I'm at. No further. I'm still prototyping in shell scripts with sqlite. Perhaps I'll get some more done soonish, but no more tonight.

Sun, 2004-Jul-04

Thin client, Thick client

A humbug post has kind-of-by-proxy raised the issue for me of thin and thick clients.

After reading that the project mentioned relies on PHP and javascript on a web server for its workings, I began drafting a response asking opinions from humbug members about what platforms they think are worth developing software for in the modern age.

It's not really something anyone can answer. I'm sure that at present there is no good language, api, or environment for software that's supported well across multiple operating systems and hardware configurations for GUI work. Instead of replying to humbug general I thought I'd just vent a little here instead.

The web server thing bugs me at the moment. You can write a little javascript to execute on the client, but basically you're restricted to the capabilities of HTML. Unless you dive into the land of XUL and friends or svg and flash there is not really any way to create complex, responsive, and engaging interfaces for users to interact with. Worse, it's very difficult to interact with concepts the user is familiar with in a desktop environment.

As soon as you put both a web browser and a web server between your software and your user it's very difficult to accomplish anything that the user can understand. If you save or operate on files they tend to be on the server-side... and what the hell kind of user understands what that is, where it is, and how to back the damn thing up?

You might have guessed that I'm not a fan of thin clients... but that's not exactly true. I think it's fundamentally a problem of layering, and that's where things get complicated and go wrong.

I think you should be able to access your data from anywhere. Preferably this remote interface looks just like your local one. Additionally, you should be able to take your data anywhere and use it in a disconnected fashion without a lot of setup hassle. Thirdly, I think you need to be able to see where your files are and know how to get them from where they are to where you want them to be.

What starts to develop is not a simple model of client and server, but of data and the ways of getting at it. If we think about it as objects, it's data and the operations you can perform on it. Like a database, the filesystem consists of these funny little objects that have rules about how you can operate on them. POSIX defines a whole bunch of interesting rules for dealing with the content of files, while the user-level commands such as cp and mv define high-level operations for operating on the file objects. Other user-level applications such as grep can pull apart data and put it back together in ways that the original developer didn't expect. Just as with databases, it's important that everyone have the same perspective as to the content and structure of files and that ad hoc things can be done with them.

To me, it is the files more than the client programs that need to be operated on remotely. You need to be able to put a file pretty much anywhere, and operate on it in the ways it defines for itself.

To start on the small scale, I think it's important that you can see all your data files in your standard data file browser. This means they shouldn't be hidden away in database servers like postgres or mysql. On the other hand, you should be able to do all the normal filesystem operations on them as well, so sqlite is a little deficient also. You can't safely copy an sqlite file unless you first get a read lock on exactly the appropriate place. Ideally, you should be able to use cp to copy your database and ensure that neither the source database nor the copied database are corrupt when you're done.
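It is still not cp, but for comparison here is how a consistent copy can be taken through the library instead. The sketch relies on the backup() method of Connection in Python's standard sqlite3 module, which wraps an online backup facility that sqlite gained after this entry was written. The function name and file paths are placeholders.

```python
import sqlite3


def copy_database(source_path, target_path):
    """Copy a sqlite database file while respecting sqlite's own locking."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    try:
        # backup() copies every page of source into target.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

The point stands that a generic file manager has no way of knowing it should do this instead of a byte-for-byte copy.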

I feel there is a great wilderness here, where we're a step away from having a way of ensuring files are operated on in a consistent manner. It's like every file itself has an API and needs hidden code behind it to ensure appropriate action is always taken. Standard apis like the ones used to copy files or to issue sql queries or greps against them must exist, along with some fundamental apis every file must implement... but I keep feeling that we need more. Everyone needs to know how to operate on your file both locally and remotely. That includes nautilus. That includes your rdf analysis tool.

To my mind the fundamental layer we need to put in place is this architecture of file classification and apis to access the files. Past that, thick and thin clients can abound in whatever ways they are appropriate. I think that thick clients will still win in the end, but they will resemble thin clients so closely in the clarity of definition of their environment that we won't really be able to tell the difference.

Sun, 2004-Jun-27

The objects that live in a database

Hello again. It's that time of the evening when my musings about how databases and objects relate to each other reach a new level.

Let's view the world as it is, but through the eyes of objects. Objects live in different places. Some live inside processes. Others live inside databases. Inside processes we have a wide variety of objects and object types. Inside databases there are two.

Inside a database, the first and most fundamental object is a row. Each row has associations via foreign keys with other row objects. Each row has a set of operations that can be performed on it, and through triggers or other means a mechanism for reporting failure or causing other operations to occur when operations are performed on it.

The other kind of object is any cohesive set of tables. This set of tables and their rows may have its own rules and supports operations on its constituent row objects.

Rows are simple objects of a strict type that essentially cannot be extended. This other kind of object, which I will call a database object, is something a little more fluid. Each set has its own type, but may support a set of operations that overlap with operations applicable to another database object. The overlap could be considered inheritance. A simple database object may be incorporated into a larger set of tables and operations. That could be considered composition.

As we build up the database model as objects, instead of trying to store objects in a database, we see a slightly non-traditional view of how databases could be constructed. Traditionally we see a static database schema, because we traditionally see the entire database as a single database object. Perhaps another way to construct things is as follows.

If we have a database object of type foo with tables bar and baz, we could create multiple instances of foo in a single database by prefixing the bar and baz table names. foo1.bar and foo1.baz could sit alongside foo2.bar and foo2.baz in the same database. Each could enforce its own rules internal to itself, and the larger database object we've now constructed to contain these objects could enforce further rules.
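A quick sketch of the prefixing idea, with everything in it assumed for the sake of illustration: the column lists are invented, and an underscore is used as the separator rather than a dot so that the names can't be mistaken for sqlite's database.table syntax.

```python
import sqlite3

# Hypothetical column lists for the two tables that make up a "foo" object.
FOO_TABLES = {
    "bar": "bar_id INTEGER PRIMARY KEY, label",
    "baz": "baz_id INTEGER PRIMARY KEY, bar_id, amount",
}


def foo_table(prefix, table):
    """Return the quoted name of one table of one foo instance."""
    return f'"{prefix}_{table}"'


def create_foo(conn, prefix):
    """Create one instance of foo: a prefixed copy of each of its tables."""
    for table, columns in FOO_TABLES.items():
        conn.execute(f"CREATE TABLE {foo_table(prefix, table)} ({columns})")
```

Calling create_foo(conn, "foo1") and create_foo(conn, "foo2") gives two independent instances in one file, and an ordinary join can still span both.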

A database object, therefore, can be defined as a set of tables with defined operations and rules plus a set of sub-objects. Lists or structures of sub-objects could be managed also.

An accounting system operates in a single currency on a single accounting entity. Multiple accounting entities could be managed in the same database as instances of a general ledger database object type. Multiple currencies for the same accounting entity, or even multiple gaaps might be managed as different objects within the greater database.

If the basic rules of the accounting model can be encapsulated inside a database object, perhaps more complicated structures can be built around the basic models. An interesting side-effect of this way of thinking about objects within databases is that you can still do SQL joins and queries between parts of different database sub-objects. They can still be made efficient.

Perhaps the biggest problem with this way of thinking about database objects is that SQL doesn't really support it. You can do the following kinds of operations:

... but you can't SELECT * FROM TABLES LIKE, or select these things based on type. You would certainly have to do some logic in the calling application if you were trying to perform "arbitrary" data mining across objects.
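For sqlite at least, the logic in the calling application could look something like this sketch: list matching table names from the sqlite_master catalogue, then run one query per table and combine the results in code. The function names are invented, and the catalogue table is specific to sqlite rather than part of standard SQL.

```python
import sqlite3


def tables_like(conn, pattern):
    """Return the names of tables matching a LIKE pattern, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master"
        " WHERE type = 'table' AND name LIKE ? ORDER BY name",
        (pattern,),
    )
    return [name for (name,) in rows]


def total_across(conn, pattern, column):
    """Sum one column over every table whose name matches the pattern."""
    total = 0
    for name in tables_like(conn, pattern):
        quoted = '"' + name.replace('"', '""') + '"'
        # column is spliced in unescaped; it must come from trusted code.
        (subtotal,) = conn.execute(
            f"SELECT COALESCE(SUM({column}), 0) FROM {quoted}"
        ).fetchone()
        total += subtotal
    return total
```

Selecting by type rather than by name would still need a catalogue of the application's own that records which tables belong to which kind of object.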

Oh, well. Something to think about, anyway.

As to how these object relate to applications...

Proxy objects can be created in an application to talk to database objects. You could begin a transaction, instantiate new database object proxies (or extract them from the database), and then modify and commit them. This would be an easy programmatic interface to work with. If a simple XML file defined the structure of the database objects it would be trivial to write an XSLT to generate code for the proxy objects.

Proxy objects could be pulled from the database with select statements and the like. Foreign keys could be represented in the objects as smart pointers that, when dereferenced, would load the next object. Issues you might run into include updating proxy objects when appropriate as the result of triggers and the like (triggers would best be implemented in the object modifying the database in this case), and in dealing with relationships between database sub-objects which might not be modelled particularly well in this concept.
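Here is a rough Python sketch of the smart pointer idea, with a foreign key that loads its target row only when first dereferenced. The class names, and the Account and TransactionEntry tables with their columns, are assumptions for the example. No generated code or XML description is involved.

```python
import sqlite3


class LazyRef:
    """Stands in for a foreign key; loads the target row on first use."""

    def __init__(self, conn, table, key_column, key):
        self._conn = conn
        self._table = table
        self._key_column = key_column
        self.key = key
        self._row = None  # nothing is loaded until get() is called

    def get(self):
        if self._row is None:
            cursor = self._conn.execute(
                f"SELECT * FROM {self._table} WHERE {self._key_column} = ?",
                (self.key,),
            )
            values = cursor.fetchone()
            if values is None:
                raise LookupError(f"no {self._table} row with key {self.key}")
            names = [column[0] for column in cursor.description]
            self._row = dict(zip(names, values))
        return self._row


class EntryProxy:
    """In-memory stand-in for one transaction entry row."""

    def __init__(self, conn, entry_id):
        amount, account_id = conn.execute(
            "SELECT Amount, AccountId FROM TransactionEntry"
            " WHERE TransactionEntryId = ?",
            (entry_id,),
        ).fetchone()
        self.amount = amount
        # The account row itself is not read until someone asks for it.
        self.account = LazyRef(conn, "Account", "AccountId", account_id)
```

Once loaded, the cached row is never refreshed, which is precisely the consistency problem that comes up next.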

Another issue to consider is one that plagues the qof framework that underlies gnucash. Consistency. Consistency between the objects in memory and the database can't be guaranteed easily when multiple processes are modifying the database concurrently. The simple answer would be "start a transaction when you start pulling objects out of the database, and destroy all proxies when you commit". That would be nice, but you may be locking other users out of critical parts of the database. If you use sqlite you'll be locking them out of the whole database.

The naive solution is not to load these objects until you are ready to make your change. The problem with this approach is that you typically need at least some data from the database to know what change you want to make in the first place. This leaves you in the uncomfortable position of making objects that are one level above the proxy objects.

These super-proxy objects would essentially check for consistency whenever they're accessed, or perhaps more often and warn the calling objects of possible inconsistency when they do change. This makes the whole system much more complicated than it otherwise might be, and adds one more complication. When you have modified your in-memory proxy objects and want to commit the changes to disk they must first run a final consistency check. Consistency failure means the commit cannot proceed and must be rolled back and possibly reapplied with the fresh data.
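One common way to do that final consistency check is an optimistic version counter: each row carries a version number, and the commit only goes through if the version is still the one the proxy was loaded with. This is a sketch under that assumption; the version column, the dict-based proxy and the StaleProxyError name are all invented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a version counter bumped on every update.
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT,
                          version INTEGER NOT NULL);
    INSERT INTO account VALUES (1, 'Cash', 1);
""")

class StaleProxyError(Exception):
    """The row changed in the database after the proxy was loaded."""

def load(conn, account_id):
    name, version = conn.execute(
        "SELECT name, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    return {"id": account_id, "name": name, "version": version}

def commit_rename(conn, proxy, new_name):
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE account SET name = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_name, proxy["id"], proxy["version"]),
        )
        if cur.rowcount != 1:
            raise StaleProxyError(f"account {proxy['id']} changed underneath us")
    proxy["name"] = new_name
    proxy["version"] += 1

mine = load(conn, 1)
theirs = load(conn, 1)                 # a second user loads the same row
commit_rename(conn, theirs, "Cash at bank")
try:
    commit_rename(conn, mine, "Petty cash")
    conflict = False
except StaleProxyError:
    conflict = True                    # reload and reapply with fresh data
```

The write lock is only held for the length of one UPDATE, at the price of occasionally being told to start again.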

Oh, well. Complication always abounds when you try to exert your influence over objects that someone else owns.

The new sqlite version (3.0) will hopefully alleviate this problem a little. When a writer begins a transaction it won't exclusively lock the database. Instead, it will prepare the transaction in-memory while it keeps a reserved lock. Readers can still read. It's only writers that can't get into the database. Still, it would be a pain in the arse to have to wait for slow-poke-allen to finish entering his transaction before you can even begin editing yours.

One more optimisation could be applied, here. New objects might be able to be created in proxy form, and then applied to the database in a short transaction. Since they essentially have nothing to check consistency against they could be constructed in a way that was essentially guaranteed to commit successfully. This is good news, because in a financial database most transactions are insertions. Only updates would need to lock the database.

The only problem with that approach is when you're relying on the database for some information about the new rows or database objects, for example if you're relying on the database to assign you a unique identifier which you use as a foreign key between objects. Bugger. That's almost always required.
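One partial way around it, sketched here on the assumption that the new objects can sit in memory without identifiers until the last moment: do the inserts in a single short transaction and read the assigned key back through the cursor's lastrowid (a real attribute of Python's sqlite3 cursors). The txn and entry tables, the Cash line and the signed amounts (debits positive, credits negative) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a transaction header plus its entries.
conn.executescript("""
    CREATE TABLE txn (id INTEGER PRIMARY KEY, memo TEXT);
    CREATE TABLE entry (id INTEGER PRIMARY KEY,
                        txn_id INTEGER NOT NULL REFERENCES txn(id),
                        account TEXT, amount INTEGER);
""")

# A new object built entirely in memory; it has no identifier yet.
pending = {
    "memo": "Inventory purchase, 5 books",
    "entries": [("Inventory", 100), ("Cash", -100)],
}

def insert_transaction(conn, pending):
    """Insert the header, then the entries, inside one short transaction."""
    with conn:
        cur = conn.execute("INSERT INTO txn (memo) VALUES (?)", (pending["memo"],))
        txn_id = cur.lastrowid  # the identifier the database just assigned
        conn.executemany(
            "INSERT INTO entry (txn_id, account, amount) VALUES (?, ?, ?)",
            [(txn_id, account, amount) for account, amount in pending["entries"]],
        )
    return txn_id

new_id = insert_transaction(conn, pending)
```

It doesn't remove the dependence on the database for identifiers. It only keeps the window in which the database is locked down to a handful of inserts.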

Hrrm... oh well. Things are forming and I think I'm getting to a more concrete place with the philosophy behind this development. Perhaps one day I'll even see the qof "light" and move back to gnucash with a better appreciation of what's already been created there.

Thu, 2004-Jun-24

Objects and Databases (again)

I'm obsessed about this. Databases and objects. What is the relationship, and how do we model one in the other?

A specific database schema (or XML schema for that matter) can be modelled as an object. It has a bounded set of operations that can be performed on it, defining its "type". It represents a specific set of data and a bounded information system. It is finite and controllable. It can be queried and updated in simple, standard ways. Changes to the schema result in an essentially new database object, which can be operated on in a new set of ways.

Other objects can interact with a database object. Some objects are able to use their specialised knowledge about the database schema to modify it. Ideally, the database itself defines modification functions and enforces any necessary rules for data integrity. Other objects are able to use their specialised knowledge to query the database in specific ways. Again, complex queries are best defined inside the database itself.

So now comes the itch I've been trying to scratch: What if you have two database schemas?

Traditional database design has taken everything that can be conceptually connected and put it into one database, i.e. one object for management. This solves the problem on the small scale, but doesn't work on internet-scale information architectures. Once you get to the massive scale it doesn't make sense to keep all your data in the one place. If we are to develop an architecture that scales we should consider these problems while they're still on the smallest level.

Let's take a simple example: Your accounting database and your daily stock prices database. Your accounting database has a table with account entries in it, each of a specific amount on a specific date and tied to a specific transaction. Your stock prices database shows the value of certain stocks over time. Combine the two, and you can come up with a report that shows the value of your investment portfolio over time. You don't want to duplicate the information between databases, but neither do you want to combine the two databases into one.

Here's the picture (in mock-UML ascii art!): |Accounts|<---|PortfolioValueReport|--->|StockValue|

There's no need to couple the Accounts and StockValue objects together. Apart from a common thread of terminology that relates information to the "ASX:AMP" stock ticker symbol there's no connection between the two and I want to keep it that way. I want PortfolioValueReport to be the only object that has to consider the two side-by-side. So how do we make this report?

We could do it like a web service does it. We could query Accounts, then query StockValue, and leave PortfolioValueReport to put the queries together and make some sense out of the data. I would call that an object-based approach to the information architecture.

Another approach would be to work like sqlite does with multiple databases. You "attach" databases so that they form a single meta-database as needed. You run your query on the meta-database. I would differentiate this from a strict object-based approach. I think it's a relational approach with independent merit. In this approach you get access to all the indices in each database and any query optimisation the databases can do in concert. You don't have to read the whole result of both queries. You essentially let the database technology work out how to do things.
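A small sketch of the attach approach using Python's sqlite3 module. ATTACH DATABASE is real sqlite syntax; the file names, the holding and price tables, and the prices themselves are made up to stand in for Accounts and StockValue.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
accounts_path = os.path.join(workdir, "accounts.db")
stocks_path = os.path.join(workdir, "stockvalue.db")

# Two independent database files, each with its own (hypothetical) schema.
acc = sqlite3.connect(accounts_path)
acc.execute("CREATE TABLE holding (ticker TEXT, units INTEGER)")
acc.execute("INSERT INTO holding VALUES ('ASX:AMP', 200)")
acc.commit()
acc.close()

stk = sqlite3.connect(stocks_path)
stk.execute("CREATE TABLE price (ticker TEXT, day TEXT, cents INTEGER)")
stk.executemany(
    "INSERT INTO price VALUES (?, ?, ?)",
    [("ASX:AMP", "2004-06-23", 610), ("ASX:AMP", "2004-06-24", 625)],
)
stk.commit()
stk.close()

# PortfolioValueReport: the only place where the two are seen side by side.
report = sqlite3.connect(accounts_path)
report.execute("ATTACH DATABASE ? AS stocks", (stocks_path,))
rows = report.execute("""
    SELECT p.day, h.units * p.cents AS value_cents
    FROM main.holding AS h
    JOIN stocks.price AS p ON p.ticker = h.ticker
    ORDER BY p.day
""").fetchall()
print(rows)
```

The only thing the two files share is the ticker string, and the join is planned by sqlite rather than stitched together by hand in the report.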

I feel comfortable with this kind of approach. In addition to a unified query mechanism, with sqlite you even get atomic transactions. Perhaps you have several independent data models, but you want to perform an operation on all. You want the operation to succeed only if it can be done to every one. The simple example here might be that you've bought some shares of a company you've never been involved in before. You want to add it to your accounts, but also to your share price watch list... or maybe you want to keep several instances of your account database type. Perhaps you want to keep one set of accounts for Australian GAAP, and a different (but related) set for American GAAP. You'd want to know both were updated before allowing the transaction to go ahead.
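To make the all-or-nothing behaviour concrete, here is a sketch with an accounts file and an attached watch list file. The table names and tickers are invented. As I understand the sqlite documentation, a commit spanning attached files is atomic so long as the main database is an ordinary file and the journal mode isn't WAL, so treat that as a stated assumption of the example.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(workdir, "accounts.db"))
conn.execute("ATTACH DATABASE ? AS watch", (os.path.join(workdir, "watchlist.db"),))

# Hypothetical schemas: holdings in one file, the watch list in the other.
conn.execute("CREATE TABLE main.holding (ticker TEXT PRIMARY KEY, units INTEGER)")
conn.execute("CREATE TABLE watch.watched (ticker TEXT PRIMARY KEY)")
conn.execute("INSERT INTO watch.watched VALUES ('ASX:AMP')")
conn.commit()

def buy_new_stock(conn, ticker, units):
    """Record a first purchase in both files, or in neither."""
    try:
        with conn:  # one transaction spanning both attached files
            conn.execute("INSERT INTO main.holding VALUES (?, ?)", (ticker, units))
            conn.execute("INSERT INTO watch.watched VALUES (?)", (ticker,))
        return True
    except sqlite3.IntegrityError:
        return False

ok_new = buy_new_stock(conn, "ASX:BHP", 50)
# Already on the watch list: the second insert fails, so the first is undone.
ok_dup = buy_new_stock(conn, "ASX:AMP", 10)
holdings = conn.execute(
    "SELECT ticker, units FROM main.holding ORDER BY ticker"
).fetchall()
```

The failed purchase leaves no trace in the accounts file even though its own insert succeeded, which is the behaviour wanted for the two-GAAP case as well.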

I believe that web services do have a distributed transaction mechanism available, and that's something I may use in the future as technology and frameworks become more advanced. In the meantime, I'm thinking that these multiple objects as multiple on-disk sqlite database files might be a good first step.

My current thinking is that I define and publish the database schemas. Once that is done, I start work on applications or objects that are able to operate on those databases for various purposes. I think a well-defined schema level will provide a capability for adding functionality in a modular way that current information systems really only dream about. We have applications all over the place that have a "private" schema that means you have to go through the front-end to do anything useful. I'm not keen on it. I want the data API to be clean, simple, and published. It's what's above that should be "private".

My biggest concern with sqlite as a data solution is that it won't be able to incorporate the full data model I want to push into each database object. I may have to add functions using the C APIs and provide my own sqlite interface. That wouldn't be so good. Sqlite doesn't really have anything along the lines of stored procedures, although views and triggers are supported.

The other issue with sqlite is the remoting capability. You would essentially have to mount the database in order to access it remotely, and that's fraught with problems of bad nfs implementations and the like. I don't think I can offer remoting capabilities for the time-being.

Hrrm... this weekend for sure.

Mon, 2004-Jun-14

Objects and Databases

In information technology there are two main archetypes. A software engineer works in programs, applications, procedures, objects. The database stream works in tables, forms, reports, data. The two are worlds apart.

Objects are a means of data hiding. They behave in concrete and reliable ways because they themselves define all the operations that can be performed on them and how those operations affect their state. Their data is hidden. It may reflect what is held within the object, but there is no need for it to be so.

A schema in the software world is the set of objects and their methods. If an object is transmitted across a network or stored in a file a memento must be created in the object's place to be activated, read, or otherwise interpreted to construct the equivalent object in the target process. The memento itself can't be understood by anything but the receiving object in that target process.

Well, that's not true. A memento can be anything. It could be a hard binary representation subject to change whenever an object is modified. It may be an XML representation that can at least be partially understood... but still it can't reconstruct that object on the other side.

Databases are a weird and woolly world, where it is not the behaviour that is tightly controlled but the representation. Instead of defining a set of operations that can be performed on an object, they define the object in terms of its internal data. The operations are basic, standard, and implicitly allowed. They allow modification to any part of the data. This fairly primitive model of operations is suffered because the main purpose of a database is not to maintain the integrity of the objects it represents, but to allow those objects to be pulled apart, rearranged, and reassembled at will during the advanced query process.

SQL queries are a declarative mechanism for defining how source data from tables should be transformed into target tables for reporting. These new tables are essentially new objects, so the whole database can be thought of as an engine for collecting and transforming objects into new object sets. It's an essential and well-performed service.

The problem with SQL databases is that they don't deal so well with the things that the software world has learned regarding objects over the years. Because the data is so exposed, database folk get used to the idea and it seems only small steps have been taken towards object-orientation. Stored procedures are the "standard" way of hiding data from applications that want to modify it behind standard operations. This, in effect, makes the entire database an object with methods.

Views are an attempt to hide data from the other side. A number of "standard" object sets are created to do queries against, while the actual data remains hidden. The underlying data can thus change without harming the queries and reports that hinge on it.

Is there a way to make databases more like object models, while retaining the query and reporting capabilities that have made SQL great? I suppose another way of looking at the question is to ask: Is there a way of making object models more like databases, without sacrificing the data encapsulation capabilities we hold so dear?

It seems what we need to do is define a database that contains real objects. Those objects define a set of operations that can be performed on each, and present a number of "properties" that SET and UPDATE operations can be performed against. Further properties, or the same ones, would support SELECT statements. The objects themselves would be able to reject certain updates with an error or perform actions on other objects to achieve the SET operations.

One spanner-in-the-works of this whole way of thinking is that in databases, objects are essentially either rows or are the entire database. In software, objects are often complex beasts involving a number of sub-objects. These kinds of objects can be represented in a database, but don't match their natural form when doing so. The (understandable) obsession with tables in database technologies means that complex objects are difficult to query over and to extract.

To get to the point, if you have a database containing accounting objects you need at least three classes. You need a transaction, a set of transaction entries, and an account for each transaction entry to reference. Certain constraints must be present on these objects, for example "at the end of each update the total credit transaction entries must equal the total debit transaction entries for each transaction instance". In essence, a transaction and its entries behave as one complicated object, but must be represented for the database as several objects. This has its advantages, making it easy to pull out transaction entries associated with individual accounts. It has its disadvantages, making it difficult to know when to apply this test. It can't be done each time an insert, erase, or update is issued to a transaction entry. Perhaps before each commit? But if so, which transactions do we run the check on? It's infeasible to check all transactions.

Perhaps I'd better just give in and start thinking relational. I suppose with triggers I could just maintain a dirty list, checked and cleared before each transaction completes.
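Here is roughly what I have in mind for the dirty list, sketched with sqlite triggers driven from Python. The entry and dirty_txn tables, the signed amounts (debits positive, credits negative) and the checked_commit helper are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (txn_id INTEGER, account TEXT, amount INTEGER);
    CREATE TABLE dirty_txn (txn_id INTEGER PRIMARY KEY);

    -- Every change to an entry marks its transaction as needing a check.
    CREATE TRIGGER entry_ins AFTER INSERT ON entry BEGIN
        INSERT OR IGNORE INTO dirty_txn VALUES (NEW.txn_id);
    END;
    CREATE TRIGGER entry_upd AFTER UPDATE ON entry BEGIN
        INSERT OR IGNORE INTO dirty_txn VALUES (OLD.txn_id);
        INSERT OR IGNORE INTO dirty_txn VALUES (NEW.txn_id);
    END;
    CREATE TRIGGER entry_del AFTER DELETE ON entry BEGIN
        INSERT OR IGNORE INTO dirty_txn VALUES (OLD.txn_id);
    END;
""")

class UnbalancedError(Exception):
    """A touched transaction's debits and credits do not net to zero."""

def checked_commit(conn):
    """Check only the transactions on the dirty list, then clear it and commit."""
    unbalanced = conn.execute("""
        SELECT d.txn_id
        FROM dirty_txn AS d
        LEFT JOIN entry AS e ON e.txn_id = d.txn_id
        GROUP BY d.txn_id
        HAVING COALESCE(SUM(e.amount), 0) != 0
    """).fetchall()
    if unbalanced:
        conn.rollback()  # also discards the dirty marks made in this transaction
        raise UnbalancedError([txn_id for (txn_id,) in unbalanced])
    conn.execute("DELETE FROM dirty_txn")
    conn.commit()

conn.executemany("INSERT INTO entry VALUES (?, ?, ?)",
                 [(1, "Inventory", 100), (1, "Cash", -100)])
checked_commit(conn)  # balanced, so it goes through

conn.execute("INSERT INTO entry VALUES (2, 'Inventory', 40)")
try:
    checked_commit(conn)
    rejected = False
except UnbalancedError:
    rejected = True   # one-sided entry, rolled back
remaining = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
```

The check only ever looks at transactions that were touched since the last commit, which answers the "which transactions do we run the check on?" question without scanning everything. The weak spot is that nothing forces callers to go through checked_commit.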

P.S. On another note, I apologise for the untruncated nature of my blog entries. I've done some experimentation with the "seemore" plugin, but it doesn't appear to work for statically rendered blogs such as this one. Perhaps some knowledgeable person will take pity and refer me to one that does work on static pages :)

Sat, 2004-May-29

Microsoft's Service Oriented Architecture

I've just skimmed an article discussing entity aggregation in Microsoft's Service-Oriented Architecture. I don't really have the brain-power to throw at its detail right now, so I'm left with a little bit of confusion as to whether SOA is a real technology or a set of "best practices".

Regardless of the concreteness of SOA as a technology it does appear to have some interesting synergies with the way I've been thinking about creating information architectures.

Essentially the framework seems to come down to this:

The aggregation is what this article was focused on. Something I thought was particularly interesting is that the architecture of the aggregator is very similar to the architecture of the gnucash Query Object Framework (QOF). You have means of querying through the aggregator to the silo web services, and a means of updating the data in the back-end stores. There's an option for various levels of replication between the aggregator and the back-end services.

The thing that's happening, there, does seem to be the right one. I'm not sure how to define the query mechanisms, yet, but maybe I don't have to. When it comes to transactions there are a few complex queries that can be codified as function calls, and less common simple queries that could probably be expressed as xpath. The xpath could be translated to sql by the transaction management service to adapt onto an sqlite database.

I'm more and more heading into the domain of web services with my thinking. I guess I'm not sure why, really. I like the sound of being able to transplant these data stores to any machine on the network and to be able to aggregate and update data from highly disparate sources without having to rethink the kind of API you're coding to.

Modern implementations of the .NET web services allow you to target web services as your platform, but then be able to open up IPC capabilities that follow a less heavy-weight encoding of the transmitted data. I guess that by targeting web services my hope is that I'll create a scalable solution that will still be able to apply to the smaller scale.

With a fairly clear idea of my technology-base in hand I suppose I should begin actual work on this project soon. My first target will be to create a web service that allows the creation, reading, update, and deletion of transactions. It should allow a query for all transaction entries that relate to a specific set of accounts, a query for transaction entries between two sets of accounts, and an xpath query processor.

Subsequent data islands will be the data island describing all account metadata (probably RDF-based), a data island for budgeting information, a data island for accessing stockmarket information, and other data islands for scheduled transaction handling (a very difficult-to-define application).

The last thing I want to do before I begin working on this (which will begin with learning some new languages methinks) is to have a good hard look at gnue, the gnu CRM suite written in python. I just don't know enough about it yet to decide whether I'll be just duplicating work they've already done or whether what I want to do is genuinely new.

Now, if only my wife didn't want to get back on the computer...

Sun, 2004-May-23

Accounting for Commodities

I'm gathering confidence that my understanding of international accounting or accounting for commodities or for inventory is the correct one. This kind of accounting is where you carry shares, foreign currency, bicycles, or whatever else can't be described strictly in the (Australian) dollar sense. Nonetheless, your accounts must track their value in order to be correct.

I've been using Accounting 3[1] as my main reference tome. It has sections on inventory and international business, but always talks about them in dollar terms. This is clearly correct, but is not the whole picture. The book explains how when you purchase inventory at a particular cost price all the costs of that inventory go into your inventory asset account. When you sell the item you take the cost of the item out of your inventory, and record the income event in the same transaction:

Inventory purchase, 5 books

Sales Revenue    $150 CR
Cost of Sales    $100 DR
Inventory sale, 5 books

This is actually really simple. You spend some money to get something, but it's actually going to bring in some income pretty soon. You don't want to record the expense event until you can match it to the income event. It brings some balance to your chaotic accounting system. It can get more complicated when you don't sell all your books at once, and it can also get more complicated when you buy some more books before you've finished selling the old ones. You might buy and sell some of the books at different prices, so it can be tricky to keep track of things.

One way of keeping track is to take every individual book and record its individual cost and sale price. That can get overly wearying, so mostly you take your identical books and put them in a single pool. You use some standard methods for guessing the actual cost of each single book you sell, based on how many books are left and how much the whole lot cost you. You can see what's missing, though.
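Weighted-average cost is one of those standard pooling methods (first-in-first-out is another). A small sketch, working in cents to avoid floating point; the InventoryPool class and the second purchase at $120 are invented for illustration.

```python
class InventoryPool:
    """A pool of identical items costed by the weighted-average method."""

    def __init__(self):
        self.units = 0
        self.cost_cents = 0

    def purchase(self, units, total_cost_cents):
        self.units += units
        self.cost_cents += total_cost_cents

    def sell(self, units):
        """Remove units from the pool and return their share of the pooled cost."""
        if units > self.units:
            raise ValueError("cannot sell more units than are in the pool")
        cost = self.cost_cents * units // self.units  # average cost times units sold
        self.units -= units
        self.cost_cents -= cost
        return cost

pool = InventoryPool()
pool.purchase(5, 10000)    # 5 books for $100
pool.purchase(5, 12000)    # 5 more at a higher price (made-up figure)
cost_of_sales = pool.sell(4)
print(cost_of_sales, pool.units, pool.cost_cents)
```

Note that the pool has to carry a unit count alongside the dollar cost to work at all, and that count is exactly what the dollar accounts don't record.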

We haven't actually recorded the number of books we have on hand in a machine-readable way!

This brings us to the crux of this issue. You've got a complete, self-consistent set of accounts in aussie dollars which track the cost value of the books you own. They don't track the number of books, and can't tell you anything about the books themselves. What you need to do is to keep another set of records.

This second record set looks very much like a set of accounts, but it tracks a different kind of commodity. In this case your identical set of books. I see this again as a self-contained accounting system:

Inventory         5 books DR
Books received    5 books CR
Inventory purchase, 5 books

Books sold        5 books DR
Inventory         5 books CR
Inventory sale, 5 books

Notice the parallels with the "real money" transactions. So we now have two sets of accounts that track the same thing, but in different quantities. As in this example you will often have the two amounts vary at the same time. On the other hand, they may vary independently. You might be given 5 books for free. This doesn't change the cost basis of your (now) 10 books. It does change the amount you have on hand, though.
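The two-ledger idea can be sketched in a few lines. Each Ledger below balances in its own commodity and knows nothing about the other. The Ledger class, the Cash account on the dollar side and the signed amounts (debits positive, credits negative) are assumptions for the example.

```python
from collections import defaultdict

class Ledger:
    """A self-contained set of accounts denominated in one commodity."""

    def __init__(self, commodity):
        self.commodity = commodity
        self.balances = defaultdict(int)

    def post(self, memo, entries):
        """Post (account, amount) pairs; debits positive, credits negative."""
        if sum(amount for _, amount in entries) != 0:
            raise ValueError(f"{memo!r} does not balance in {self.commodity}")
        for account, amount in entries:
            self.balances[account] += amount

aud = Ledger("AUD cents")
books = Ledger("books")

# The purchase moves both ledgers at once...
aud.post("Inventory purchase, 5 books", [("Inventory", 10000), ("Cash", -10000)])
books.post("Inventory purchase, 5 books", [("Inventory", 5), ("Books received", -5)])

# ...but a gift of 5 books only moves the book ledger.
books.post("Gift, 5 books", [("Inventory", 5), ("Books received", -5)])

print(aud.balances["Inventory"], books.balances["Inventory"])
```

After the gift the dollar ledger still shows a cost basis of $100 while the book ledger shows 10 on hand, which is the independent variation described above.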

I think this concept of a separate set of accounts stands up across all kinds of commodities, and even stacks up in a system of barter where you buy commodities with other commodities... or you buy US stocks with US dollars. Every commodity you carry has its own set of accounts to deal with them.

I see real deficiencies in the way gnucash deals with multiple currencies. Under its current system, a transaction involving multiple currencies has a "home" currency. All components of the transaction must have home currency equivalent value attached to them and no checking is done to ensure that the foreign currency amounts actually balance. I think what will be needed in the future accounting system is to allow a single transaction to affect accounts in several currencies, and ensure that each currency's total balances correctly.
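A sketch of the check I'd like to see: total the entries of a single transaction per currency and report any currency that doesn't net to zero. The currency trading accounts are one way of making each side balance on its own and are an assumption here, as are the amounts and the exchange rate.

```python
from collections import defaultdict

def unbalanced_currencies(entries):
    """Return {currency: residual} for each currency that does not net to zero.

    entries is a list of (account, currency, amount) tuples with debits
    positive and credits negative.
    """
    totals = defaultdict(int)
    for _account, currency, amount in entries:
        totals[currency] += amount
    return {currency: total for currency, total in totals.items() if total != 0}

# Converting AUD to USD, with a hypothetical trading account in each currency.
conversion = [
    ("Cash (AUD)", "AUD", -140000),
    ("Currency trading (AUD)", "AUD", 140000),
    ("Cash (USD)", "USD", 100000),
    ("Currency trading (USD)", "USD", -100000),
]
broken = conversion[:-1]  # the USD credit has gone missing

print(unbalanced_currencies(conversion))
print(unbalanced_currencies(broken))
```

No home currency is involved anywhere in the check. Each currency's entries simply have to balance among themselves.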

The other impact this all has is at query time. Queries can only really occur over transaction entries in the same set of accounting records. You can't compare apples and oranges (except perhaps in terms of taste, size, tanginess, and texture). So you can see how your net worth is doing based on the cost value of your shares... or you could see how many shares you have. You might even pull in current market data to do that apples and oranges comparison and try to see how much you would be worth if you sold your shares. In the end, though, it is the accounts that I'm trying to define. They're the sacred... uhh... apple. The accounts themselves will be consistent for each currency represented.

[1] From inside the front cover:

3rd ed.
Includes index.
ISBN 0 7248 0500 1.

1. Accounting. I. Horngren, Charles T., 1926-.


Sat, 2004-May-22

Attribution found

I didn't have to look very hard in the end for the attribution of the XML-related quote I used in this post. A fellow humbugger, Mr Martin Pool, had posted it in this article.

Sat, 2004-May-22

Avoiding Data Islands

I've been working on the models of accounting for useful things in a future accounting system. I'm pretty happy with my understandings of most basic accounting functions, but am still a little unclear on handling of multiple commodities and the like. On the whole things are progressing well, usually during my bouts of insomnia at around 12:30.

I still feel like my biggest problem is coming up with an accessible technology base.

I'm comfortable with the notion of quite a simple accounting data model of transactions and accounts. Each transaction lists a number of entries and each entry lists the identifier of the account it affects. What I'm not comfortable with is how to selectively expose this model to applications, generally. What API should be provided? What kind of query and update language should be used? How can the data in this island be combined with data from other islands?

Again, I'm still trying to work out the details of this. If any accountant-type readers are tuning in right now I'd love to hear your advice on anything I might be getting a little wrong. I think the following is a clear case of wanting to bring data together from different data mines:

Say I own some shares. GAAP requires that I report the value of these shares at the "lesser of cost and market value". I can account for shares as I would inventory, that is to say in Australian dollars at cost basis instead of as share counts. That provides the "cost" part of my query, but if I want to combine this information with current market value to fill out my report I have to know the following:

  1. The number of units in my possession, and
  2. the current market value of those units
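The calculation itself is tiny once the three inputs are in one place. This sketch just makes the data dependencies visible; the function name and all the figures are made up.

```python
def reportable_value_cents(cost_cents, units, price_cents):
    """Lesser of cost and market value for one holding, in cents.

    cost_cents comes from the general ledger (cost basis), units from a
    unit-count record, and price_cents from a market data source.
    """
    market_cents = units * price_cents
    return min(cost_cents, market_cents)

# 200 shares bought for $1,200 in total (hypothetical figures).
when_price_is_up = reportable_value_cents(120000, 200, 610)    # market is $1,220
when_price_is_down = reportable_value_cents(120000, 200, 550)  # market is $1,100
print(when_price_is_up, when_price_is_down)
```

Only the first argument lives in the general ledger. The other two come from somewhere else entirely, which is the whole data island problem in one line of arithmetic.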

Suddenly I have to know about a lot more than that which lives in my general ledger, and I need a general interface to query the information for the generation of reports. It would also be useful to have that information stored in such a way that the backup operations I would apply to my accounting information also cover that other information I might run reports over.

I might want to run less directed queries. I might want to compare the share price of a company with the rainfall statistics that affect that business. I might want to pull in the data of my purchases and sales of the stock and compare my profit or loss to the profit I might have made in an alternative scenario.

My feeling of how something like this must work is as follows:

Many of the objectives I have appear to be best met by some XML technologies. Others appear to be best met by existing relational database technologies.

As I mentioned earlier, I'm having real trouble trying to find a technology base that's really applicable. Essentially I'm in the market for a transplantable platform that covers all major data handling functions in a cross-platform, beautifully-integrated manner.

I heard a cute quote a while back. Just long enough ago that I don't recall where I saw it or the attribution, but the quote itself was as follows: "XML is like violence. If it doesn't solve your problem, you're not using enough of it". I kind of feel that way. I really like where the XML world is heading in many ways, but in terms of data management (as opposed to data exchange) XML still appears to be in a confused place. At the same time traditional database technologies are looking outmoded and unagile. I think it's a question of unsolved problems.

AJ discussed loss of diversification due to competition in this blog entry. If you follow it through to the "see more" part of his post he discusses the fact that competition doesn't seem to have killed off the various email servers of the internet. We essentially have a "big four". AJ refers to email being somewhat of a solved problem where competition is not really required anymore.

I'm of a mind to think that there's always a money angle. Whereas I think AJ is leaning towards technical issues when he talks about solved problems, I would lean towards the economic issues. Mail servers don't suffer a lot of competition because they've already reached a price point where they're a commodity. The fact that none can gain an effective foothold over the others on a technical basis maintains the commodity status. I think the fact that several offerings are free software helps contribute to the commoditisation of the solutions and therefore the continuing diversity of choice.

Five years ago it looked like data management was a solved problem, too. Relational databases were and still are king, and back then it looked like they would stay king. The XML hype has put question marks over everything. XML has become the standard way of doing data interchange, so the data storage has to become more and more XML friendly. At the same time we've also been transitioning from a world of big backend monolithic databases to a world of loosely-coupled, distributed data and data more closely tied to an individual user and their desktop than to the machine. We want to carry more data around with us so we can look at our data at work as easily as we can at home. We want to be able to look at it again while we're on the train.

I think that some form of XML technology will eventually be involved with filling the gap between what the big databases currently provide and what we actually need. We've had several attempts to fill it so far. sqlite is awesome for little things but it's still hard to get the data in and out. Web services are starting to lean away from the big iron and onto the desktop, especially with Longhorn's Indigo offerings coming in a few years. Actually, I suspect that the only technology we'll still be betting our businesses on in five years' time in the data management arena will be XPath, which has already survived quite a few major changes of hosting environment. XPath has even been implemented in silicon. It's really hard to pick what's going to happen above that level. Will XQuery really pick up? Will it be superseded by something more geared towards querying and collating results from multiple web services? Again, I don't really know.

In the end I think the data management world has some catching up to do before it can fill the new niches and still claim to be mature technology. What we do now will influence that process. As for my usage, I'm still undecided but I'm watching the stars and the blogs and the news for signs that a uniform approach is starting to emerge.