Sound advice - blog

Tales from the homeworld

My current feeds

Thu, 2006-Mar-30

Desktop Identifiers

The Resource Descriptor Framework (RDF) is based around the use of Uniform Resource Identifier (URI) rerefences. An RDF statement identifies a subject, an object, and a predicate. The subject is a URI that says what the statement is about. The object is a URI or a literal value. The predicate is a URI that identifies the relationship between the subject and object. A collection of statements forms a graph, and this graph is sufficient to describe a logical model of anything and everything. Anything that understands the meaning of the whole set of predicates can understand the whole graph. Anything that understands a subset of the predicates will understand a corresponding subset of the graph. Different parts of the graph can be controlled by different agencies, so long as each identifier used in the graph is unique. The uniqueness of identifiers is the cornerstone of making the system work.

The deep dark secret of URIs is that they are hard to come up with. The problems of a single URI having multiple meanings has been reasonably well canvassed, but the initial URI selection is still a difficult problem. What is the correct URI to use for the iso4217 currency symbol AUD (AUstralian Dollar)? Should iso4217 be used as a scheme to make iso4217:AUD the uri? Is the scheme just "iso", and the URI iso:4217:AUD? Do we trust oasis and use urn:oasis:ubl:codeList:ISO4217: Currency%20Code:3:5:ISO::AUD? How about mddl, or www.xe.com?

Scaling down a touch, which URI do I use to identify a record in my email agent's address book? What about for a file in my filesystem? Is file:/home/fuzzy/accounts.db good enough? How about https://localhost:1234/? Just as in the iso4217 case, I have an identifier. I just don't have an agreed context to work with. Sean McGrath writes:

Utterances are always a rich steamy broth of the extensional and the contextual. The context bit is what makes us human. We take short-cuts in utterances all the time. That is the context. Obviously, this drives computers mad because computers don't do context.

The problem is not unique to RDF. Whenever we have two databases managed by different applications that want to refer to each other, we have a problem. Just how much context do we provide? If I want my accounting application to relate somehow to my email client's address book, what is the best way to do it? If I want my stock market monitor application to match up with my accounting application's records, what key should I use? If the two pieces of data were in the same relational database the problem would be easy to solve, but the schemas of these two databases are controlled by different agencies. In the general case their data models should be able to evolve independently of each other, but there are points at which their data models interact. Those points should be controlled with identifiers that carry enough context to determine whether they indeed refer to the same entity.

I am finding myself itching to solve the desktop accounting problem again. I want to define the cornerstone of the overall data model now. I want to define what a transaction looks like. Transactions have entries, and transaction entries link to accounts. I want the model of what an account looks like to evolve separately to that of a transaction, because it is a much fuzzier concept. It has a lot to do with strange ledgers that refer to specific problem domains. These problem domains don't impact on the core transaction representation, nor do they impact the major financial reporting and quering activities. I would like to be able to provide hard dependable definitions of the hard dependable parts of my data model without setting soft definitions in concrete.

I feel like the best way to achieve something like that is to have a database of transactions alongside one or more databases of accounts. A common key could bind the two data models together. Transactions themselves could have extra information attached to them by in a separate database. Core query and reporting capabilities need only depend on information in the core transactions database. Clever reports and domain-specific ledgers could make use of additional information to mark up transactions and accounts. The ideal key to bind these databases together would be a uniform identifier. That would allow me to unambiguously move these databases around and combine them with other databases in different context. Within a single database I could use simple integer keys (or RDF blank nodes). In a universal database I need to use uniform identifers. Is there a middle line for databases that are spread only across a desktop or corporate context, or is there an easy universal scheme I could use?

We are pretty much working in the world of machine-generated identifiers, now. That may mean we can take microsoft's old favourite technique on board and make use of a machine-generated globally unique identifier. Human-readability is not all that important, so long as the identifier is easy to generate and otherwise work with in the database. Full GUIDs could be used whenever and identifier is used in the form urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6 as per rfc4122. We could alternatively try to use it as the context only in the identifier, eg https://localhost/uuid/ f81d4fae-7dec-11d0-a765-00a0c91e6bf6/321 for record 321. We can't attach 321 to the urn:uuid uri because the rfc does not permit it, but this localhost business is still a grand hack.

We could dodge the whole question of context for a time by using a relative uri or uri reference. If we treat the database as a document with its own URI, We could use the identifier "#transaction31" to stand for a unique identifier within the document. This doesn't solve the problem, really, because chances are the database is being located at either file:/home/benjamin/my.db (giving a full url of file:/home/benjamin/my.db#transaction31) or at https://localhost:1234/ (giving a full url of https://localhost:1234#transaction31). Importantly, anything that refers to the identifier using either one of these paths depends on the same port on localhost being opened every time your application starts. It depends on the database being found at the same path every time. In fact, we could make use of a relative URI again. If I have a database at file:/home/benjamin/my.db and another at file:/home/benjamin/myother.db, the two could refer to each other with the relative paths "my.db" and "myother.db". They could refer to each other's identifiers as "my.db#transaction31" and "myother.db#account12". So long as both files moved together their context really could be for the most part ignored.

Perhaps these non-universal universal identifiers are good enough. Perhaps we will never use these databases outside of the context of their original paths on their original machines. Perhaps we will learn to contol the movement of documents and data around a desktop as carefully as we must on the open internet. Perhaps a dns-style abstraction layer is the solution. I think choosing an identifier is still a hard problem, especially in a world at the cusp of the online and offline worlds.

Benjamin

Sat, 2006-Mar-18

Emergent Efficient Software

This bugzilla comment came across my desk this morning:

If there are any developers out there interested in implementing this functionality (to be contributed back to Mozilla, if it will be accepted) under contract, please contact me off-bug, as my company is interested in sponsoring this work.

The comment applies to a mozilla enhancement request for SRV record support. SRV records are part of the DNS system, and would allow one level of server load balancing to be performed by client machines on the internet of today. Unfortunately, HTTP is a protocol that has not wholeheartedly embraced the approach as yet. What I think is interesting, however, is that there are customers who find bugs and want to throw money at them.

The extent to which this is true will be a major factor in the success of Efficient Software. This particular individual would like to develop a 1:1 releationship with a developer who can do the work and submit it back to the project codebase. I wonder how open they would be to sharing the burden and rewards.

This is the kind of bug that seems likely to attract most intrest for the Efficient Software initiative. It has gone a long time without a resolution. There is a subset of the community with strong views about whether or not it should be fixed. There seems to be some general consensus that it should be fixed, but for the project team for whatever reason it is currently not a priority.

It is unclear whether putting money into the equation would make this bug a priority for the core team, or whether they would prefer to stick to more release-critical objectives. There may be a class of developer who is more occasional that could take on the unit of work and may have an improved incentive if money were supplied. That is how I see the initiative making its initial impact: By working at the periphery of projects. I don't see the project being a good selector of which bugs should recieve funds because, after all, if the cashed up core developers thought it was release critical they would already be working on it. No, it is the user base who should supply the funds and determine which bugs they should be directed to.

There are important issues of trust in any financial transaction. I think that an efficient approach can adress these issues. The individual who commented on the SRV record bug is willing to contract someone to do the work, but whom? How do they know whether the contractor can be trusted or not? The investor needs confidence that their funds will not be exhausted if they are supplied towards work that is not completed by the original contractor. Efficient software does this by not paying out the responsible party (the developer or the project) until the bug is actually resolved. Likewise, the contractor must know their money is secure. Efficient Software achieves this by requiring investment is supplied up-front into effectively an escrow account while the work is done.

The biggest risk for an investor is that they will put their money towards a bug that is never resolved, despite the incentive they provide. A project may fork if funds are left sitting in the account. The investor's priorities may change. They may want that money put to a more productive use. I don't know of any way to mitigate that risk except to supply more and more incentive, or to first find a likely candidate to perform the implementation before actually putting funds into escrow. Perhaps the solution is to allow investors to withdraw funds assigned to a bug up until the bug is commenced. Once work is started, the money cannot be withdrawn. If the developer fails to deliver a resolution they may return the bug to an uncommenced state and investors can again withdraw funds to put to a more productive use.

The fact that the efficient software approach is an emergent phenomenon gives me increased confidence that it can be developed into workable process. In time, it may even become an important process to the open source software development world. Do you have comments or suggestions regarding an efficient approach to software? Blog, or join us on our mailing list.

Benjamin

Sun, 2006-Mar-12

Bounty Targeting

Bounties have been traditionally seen in open source as a way of brining new blood into a project, or increasing the pool of developer resources available to open source. Offering money for the production of a particular feature is intended to inspire people not involved with the project to come in, do a piece of work, then go back to their day to day lives. The existing developers may be too overworked to implement the feature themselves due to preexisting commitments. The item of work may even be designed to cross project boundaries and inspire cooperation at a level that did not exist before the bounty's effect was felt.

There are seveeral problems with this view of a bounty system, but perhaps the most important is one that Mary Gardiner identifies:

I mean, these things just seem like a potential minefield to me. And I don't mean legally, in the sense of people suing each other over bountified things that did or did not happen or bounties that did or did not get paid. I just mean in the sense of an enormous amount of sweat and blood spilled over the details of when the task is complete.

The point she makes is that it isn't possible to simply develop new feature x as a stand-alone piece of software and dump it into someone else's codebase. There is a great deal of bridge building that needs to happen on both the technical and social levels before a transfer of code is possible between a mercenary developer and an fortified project encampment.

These are the same kinds of issues a traditional closed software house has when they hire a contractor. Who is this contractor? What are their skills? Why are they being paid more than me? Will they incorporate into our corporate culture? Will their code incorporate into our codebase? Will they follow our development procedures and coding standards? There are plenty of ways to get each other off-side.

I consider it important to look for in-house talent. I don't think bounty systems should be geared towards the outside contractor, but instead to the core development team. I don't think bounty funds should be provided by the core development team to outsiders. Instead, I see bounties as a way for users of free software to contribute effectively to the core development team.

The Effient Software view of bounty collection and dispersion is that bounties are paid to developers who are already integrated on a social and techinical level with the core team. They may be full time or part time. They may work with other projects as well. This does not make them a mercenary. These are the people who don't come to the project just to do a single job. They watch the mailing lists. They spend appropriate time in irc channels and involved in other forms of instant communications for the sake of resolving technical issues. It is the core developer who should be rewarded for meeting the needs of the project's user base. It is the core developer who has the best chance of a successful development.

Finding the conclusion of the development should be straightforward and uncontraversial. It is as per project policy. The policy may be that an initial code drop is sufficient to collect a bounty. The policy may require a certain level of unit testing or review. It may require a certain level of user satisfaction. Because the developer is engaged in the policy process, the process is not a surprise or a minefield. Newer developers may be attracted to the project by successful funding of more established developers, and will have to break into the culture and policy... but that is to be expected when an outsider wants to become part of any core development group. The newcomer learns the policies over time, and the policies are as reasonable as the project needs them to be to both attract new blood and to fund the project as a whole. The interesting thing about open source is that if they get this balance wrong, it is likely they will be outcompeted by another group working on a fork of their software. The incentive is strong to get it right.

Money is a dangerous thing to throw into any organisation, and most open source projects get by without any dependable supply. There are real risks to changing your development model to one that involves an explicit money supply. I see rewards, however, and I see an industry that is ready to turn down this path. I think open source is the best-poised approach to take this path to its natural conclusion of efficient software production.

Benjamin

Sun, 2006-Mar-05

Free Software is not about Freedom of Choice

I was at HUMBUG last week, and was involved in a wide-ranging discussion. The topic of a particular closed-source software product came up, and a participant indicated that he maintained a windows desktop just to run the software. It was so good and integral to his work practices that he had a whole machine dedicated to it. He went on to criticise sectors of the open source community who tended to be irritated that closed source software was still in use. These are the sectors who have somewhat of a "with us" or "against us" view, and would prefer that closed source not be a part of anyone's lives. He asked (I think I'm getting the words right, here) After all, isn't free software about freedom of choice?.

I don't think it is.

Software alternatives are about freedom of choice. Whether the alternative is open source or closed source, the freedom of choice is not really affected. If I wrote a closed source alternative to Word, I would be providing freedom of choice to consumers. If I wrote an open source alternative to Word, I would be providing the same kind of freedom of choice. The difference is in the freedom of the customer once a transaction has been made. Open source software is primarily about post-choice customer freedom rather than freedom of choice, so it makes sense on at least one level for free software advocates to actively seek out unshackled alternatives to any closed source software they use from day to day.

In the software world we would traditionally see the freedoms of a consumer and the freedoms of a producer of software to be in conflict, however the foundation of open source development is to view the separation of consumer and producer as artificial. Freedoms given to the consumer are also given back to the producer, because the producer is also a consumer of this software. The barrier between consumer and producer exists naturally when only one entity is doing the producing. In that case the producer has automatic freedoms, and granting more to themselves has no meaning. However, consider the case of multiple producers. The freedoms granted to consumers are also granted to every producer when the production is shared between multiple entities. Open source produces a level playing field where entities that may compete in other areas can each limit the cost of this particular shared interest domain by working together.

When viewed from the angle of productivity improvement in a domain of shared interests, closed source alternatives can seem ugly and limiting. You will always know you are limited in closed source no matter how featureful a particular product is. You often can't make it better, and it would cost you a great deal to produce a competitive alternative as an individual. If competative alternatives exist you may be able to transition to one of the available alternative products, however you will still be in the same boat. You can't add a feature, and it only the threat you may change to another competitor that drives the supplier to use your license fee to produce software that suits you better. The competitors won't be sharing their code base with each other, so the overall productivity of the solution is less than the theoretical ideal. If the competitors joined forces they may be able to produce a more abundant feature set for a lower cost, however while they compete the customer pays for the competition. Which is worse? An unresponsive monopoly, or a costly war of features? Closed software represents a cost that the customer cannot easily reduce in an area that is different from their core competancies. It behaves like taxation from a supplier that does not need to outlay any more to continue reaping the benefits of its invesment, or a set of suppliers that duplicate efforts at the cost of the combined customer base. Open source may provide a third alternative: A cooperative of customers each working to build features they need themselves, and forking when their interests diverge.

People who are interested in open source are often also interested in open standards. Unlike open source, open standards do promote freedom of choice. Unlike open standards, open source does promote post-choice freedoms. Both have a tendancy to promote community building and shared understandings, and both are important to the health of the software industry moving forwards. The worst combination for overall productivity is and will continue to be a closed source product that uses closed data formats and interaction models.

Benjamin