Sound advice - blog

Tales from the homeworld


Sun, 2004-Jun-27

The objects that live in a database

Hello again. It's that time of the evening when my musings about how databases and objects relate to each other reach a new level.

Let's view the world as it is, but through the eyes of objects. Objects live in different places. Some live inside processes. Others live inside databases. Inside processes we have a wide variety of objects and object types. Inside databases there are two.

Inside a database, the first and most fundamental object is a row. Each row has associations via foreign keys with other row objects. Each row has a set of operations that can be performed on it, and, through triggers or other means, a mechanism for reporting failure or causing other operations to occur when operations are performed on it.

The other kind of object is any cohesive set of tables. This set of tables and their rows may have its own rules and support operations on its constituent row objects.

Rows are simple objects of a strict type that essentially cannot be extended. This other kind of object, which I will call a database object, is something a little more fluid. Each set has its own type, but may support a set of operations that overlap with operations applicable to another database object. The overlap could be considered inheritance. A simple database object may be incorporated into a larger set of tables and operations. That could be considered composition.

If we build up the database model as objects, instead of trying to store objects in a database, we see a slightly non-traditional view of how databases could be constructed. Traditionally we see a static database schema, because we traditionally see the entire database as a single database object. Perhaps another way to construct things is as follows:

If we have a database object of type foo with tables bar and baz, we could create multiple instances of foo in a single database by prefixing the bar and baz table names. foo1.bar and foo1.baz could sit alongside foo2.bar and foo2.baz in the same database. Each could enforce its own rules internally, while the larger database object we've now constructed to contain them could enforce further rules.
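
A rough sketch of what that instantiation might look like using sqlite from Python (the foo/bar/baz names are just the placeholders from above, and I've used an underscore rather than a dot as the prefix separator because sqlite reserves the dot for attached database names):

    import sqlite3

    def create_foo_instance(conn, prefix):
        # One instance of the hypothetical "foo" database object type.
        # The tables are prefixed (foo1_bar, foo1_baz, ...) so that several
        # instances can live side by side in the one database.
        conn.executescript("""
            CREATE TABLE {p}_bar (
                id    INTEGER PRIMARY KEY,
                label TEXT NOT NULL
            );
            CREATE TABLE {p}_baz (
                id     INTEGER PRIMARY KEY,
                bar_id INTEGER NOT NULL REFERENCES {p}_bar(id),
                amount INTEGER NOT NULL
            );
        """.format(p=prefix))

    conn = sqlite3.connect(":memory:")
    create_foo_instance(conn, "foo1")   # first instance, with its own internal rules
    create_foo_instance(conn, "foo2")   # second instance alongside it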

A database object, therefore, can be defined as a set of tables with defined operations and rules plus a set of sub-objects. Lists or structures of sub-objects could be managed also.

An accounting system operates in a single currency on a single accounting entity. Multiple accounting entities could be managed in the same database as instances of a general ledger database object type. Multiple currencies for the same accounting entity, or even multiple GAAPs, might be managed as different objects within the greater database.

If the basic rules of the accounting model can be encapsulated inside a database object, perhaps more complicated structures can be built around the basic models. An interesting side-effect of this way of thinking about objects within databases is that you can still do SQL joins and queries between parts of different database sub-objects. They can still be made efficient.

Perhaps the biggest problem with this way of thinking about database objects is that SQL doesn't really support it. You can do the following kinds of operations:

... but you can't SELECT * FROM TABLES LIKE foo%.bar, or select these things based on type. You would certainly have to do some logic in the calling application if you were trying to perform "arbitrary" data mining across objects.
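
One workaround would be to pattern-match sqlite's table catalogue and then issue one query per matching table, merging the results in the application. Something like this sketch, reusing the made-up foo naming from above:

    import sqlite3

    conn = sqlite3.connect("objects.db")

    # sqlite keeps its table catalogue in sqlite_master, so the instances of
    # the foo type can at least be discovered by name...
    tables = [name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name GLOB 'foo*_bar'")]

    # ...but the per-table queries and the merge are up to the calling application.
    rows = []
    for table in tables:
        rows.extend((table, row) for row in conn.execute("SELECT * FROM " + table))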

Oh, well. Something to think about, anyway.

As to how these objects relate to applications...

Proxy objects can be created in an application to talk to database objects. You could begin a transaction, instantiate new database object proxies (or extract them from the database), and then modify and commit them. This would be an easy programmatic interface to work with. If a simple XML file defined the structure of the database objects it would be trivial to write an XSLT to generate code for the proxy objects.

Proxy objects could be pulled from the database with select statements and the like. Foreign keys could be represented in the objects as smart pointers that, when dereferenced, would load the next object. Issues you might run into include updating proxy objects when appropriate as the result of triggers and the like (triggers would best be implemented in the object modifying the database in this case), and in dealing with relationships between database sub-objects, which might not be modelled particularly well in this concept.
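
A minimal sketch of what such proxies might look like in Python, with made-up entry and account tables and the foreign key exposed as a smart pointer that loads its target the first time it's dereferenced:

    import sqlite3

    class LazyRef:
        # A "smart pointer" to a row in another table: dereferencing it loads
        # the target row on demand and caches it.
        def __init__(self, conn, table, rowid):
            self.conn, self.table, self.rowid = conn, table, rowid
            self._row = None

        def get(self):
            if self._row is None:
                self._row = self.conn.execute(
                    "SELECT * FROM %s WHERE id = ?" % self.table,
                    (self.rowid,)).fetchone()
            return self._row

    class EntryProxy:
        # Proxy for one row of an assumed "entry" table; the account_id
        # foreign key is exposed as a LazyRef rather than a bare integer.
        def __init__(self, conn, row):
            self.id = row["id"]
            self.amount = row["amount"]
            self.account = LazyRef(conn, "account", row["account_id"])

    def load_entries(conn):
        conn.row_factory = sqlite3.Row
        return [EntryProxy(conn, row) for row in conn.execute("SELECT * FROM entry")]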

Another issue to consider is one that plagues the qof framework that underlies gnucash: consistency. Consistency between the objects in memory and the database can't be guaranteed easily when multiple processes are modifying the database concurrently. The simple answer would be "start a transaction when you start pulling objects out of the database, and destroy all proxies when you commit". That would be nice, but you may be locking other users out of critical parts of the database. If you use sqlite you'll be locking them out of the whole database.

The naive solution is not to load these objects until you are ready to make your change. The problem with this approach is that you typically need at least some data from the database to know what change you want to make in the first place. This leaves you in the uncomfortable position of making objects that are one level above the proxy objects.

These super-proxy objects would essentially check for consistency whenever they're accessed, or perhaps more often, and warn the calling objects of possible inconsistency when things do change. This makes the whole system much more complicated than it otherwise might be, and adds one more step: when you have modified your in-memory proxy objects and want to commit the changes to disk, they must first run a final consistency check. Consistency failure means the commit cannot proceed and must be rolled back and possibly reapplied with the fresh data.
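
One way that final check might work is an optimistic test against a version counter. The sketch below assumes each row carries a version column and that the proxy remembers the version it was loaded from; it's an illustration, not anything qof or gnucash actually do:

    class StaleObjectError(Exception):
        pass

    def commit_entry(conn, proxy):
        # Optimistic consistency check: the UPDATE only takes effect if the
        # row is still at the version this proxy was loaded from.
        with conn:                      # one short transaction, rolled back on error
            cur = conn.execute(
                "UPDATE entry SET amount = ?, version = version + 1 "
                "WHERE id = ? AND version = ?",
                (proxy.amount, proxy.id, proxy.version))
            if cur.rowcount == 0:
                # Someone else changed the row since it was read; the caller
                # must re-read fresh data and reapply its change.
                raise StaleObjectError("entry %d changed underneath us" % proxy.id)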

Oh, well. Complication always abounds when you try to exert your influence over objects that someone else owns.

The new sqlite version (3.0) will hopefully alleviate this problem a little. When a writer begins a transaction it won't exclusively lock the database. Instead, it will prepare the transaction in-memory while it keeps a reserved lock. Readers can still read. It's only writers that can't get into the database. Still, it would be a pain in the arse to have to wait for slow-poke-allen to finish entering his transaction before you can even begin editing yours.

One more optimisation could be applied here. New objects might be able to be created in proxy form, and then applied to the database in a short transaction. Since they essentially have nothing to check consistency against, they could be constructed in a way that was essentially guaranteed to commit successfully. This is good news, because in a financial database most transactions are insertions. Only updates would need to lock the database.

The only problem with that approach is when you're relying on the database for some information about the new rows or database objects, for example if you're relying on the database to assign you a unique identifier which you use as a foreign key between objects. Bugger. That's almost always required.
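
To be fair, within the short insert transaction itself the database-assigned identifier can be picked up and handed to the child rows before anyone else sees them; it's only references to rows sitting in other not-yet-committed proxies that stay awkward. A sketch with made-up txn and entry tables:

    def insert_transaction(conn, description, entries):
        # Insert a new transaction and its entries in one short transaction.
        # lastrowid hands back the identifier the database just assigned, so
        # the child rows can use it as their foreign key straight away.
        with conn:
            cur = conn.execute(
                "INSERT INTO txn (description) VALUES (?)", (description,))
            txn_id = cur.lastrowid
            conn.executemany(
                "INSERT INTO entry (txn_id, account_id, amount) VALUES (?, ?, ?)",
                [(txn_id, account_id, amount) for (account_id, amount) in entries])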

Hrrm... oh well. Things are forming and I think I'm getting to a more concrete place with the philosophy behind this development. Perhaps one day I'll even see the qof "light" and move back to gnucash with a better appreciation of what's already been created there.

Thu, 2004-Jun-24

Objects and Databases (again)

I'm obsessed with this. Databases and objects. What is the relationship, and how do we model one in the other?

A specific database schema (or XML schema for that matter) can be modelled as an object. It has a bounded set of operations that can be performed on it, defining its "type". It represents a specific set of data and a bounded information system. It is finite and controllable. It can be queried and updated in simple, standard ways. Changes to the schema result in an essentially new database object, which can be operated on in a new set of ways.

Other objects can interact with a database object. Some objects are able to use their specialised knowledge about the database schema to modify it. Ideally, the database itself defines modification functions and enforces any necessary rules for data integrity. Other objects are able to use their specialised knowledge to query the database in specific ways. Again, complex queries are best defined inside the database itself.

So now comes the itch I've been trying to scratch: What if you have two database schemas?

Traditional database design has taken everything that can be conceptually connected and put it into one database, i.e. one object for management. This solves the problem on the small scale, but doesn't work on internet-scale information architectures. Once you get to the massive scale it doesn't make sense to keep all your data in the one place. If we are to develop an architecture that scales, we should consider these problems while they're still at the smallest level.

Let's take a simple example: Your accounting database and your daily stock prices database. Your accounting database has a table with account entries in it, each of a specific amount on a specific date and tied to a specific transaction. Your stock prices database shows the value of certain stocks over time. Combine the two, and you can come up with a report that shows the value of your investment portfolio over time. You don't want to duplicate the information between databases, but neither do you want to combine the two databases into one.

Here's the picture (in mock-UML ascii art!): |Accounts|<---|PortfolioValueReport|--->|StockValue|

There's no need to couple the Accounts and StockValue objects together. Apart from a common thread of terminology that relates information to the "ASX:AMP" stock ticker symbol there's no connection between the two, and I want to keep it that way. I want PortfolioValueReport to be the only object that has to consider the two side-by-side. So how do we make this report?

We could do it like a web service does it. We could query Accounts, then query StockValue, and leave PortfolioValueReport to put the queries together and make some sense out of the data. I would call that an object-based approach to the information architecture.
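
In sketch form (with made-up holding and daily_price tables), that approach is two independent queries glued together by the report object itself:

    import sqlite3

    accounts = sqlite3.connect("accounts.db")
    prices   = sqlite3.connect("stockvalue.db")

    # Two independent queries, one against each database object...
    holdings = dict(accounts.execute(
        "SELECT symbol, SUM(quantity) FROM holding GROUP BY symbol"))
    latest = dict(prices.execute(
        "SELECT symbol, price FROM daily_price "
        "WHERE date = (SELECT MAX(date) FROM daily_price)"))

    # ...and PortfolioValueReport glues the results together itself.
    portfolio_value = sum(qty * latest.get(symbol, 0)
                          for symbol, qty in holdings.items())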

Another approach would be to work like sqlite does with multiple databases. You "attach" databases so that they form a single meta-database as needed. You run your query on the meta-database. I would differentiate this from a strict object-based approach; I think it's a relational approach with independent merit. In this approach you get access to all the indices in each database and any query optimisation the databases can do in concert. You don't have to read the whole result of both queries. You essentially let the database technology work out how to do things.
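
The same report done the relational way, attaching the second file so the join and its optimisation happen inside the database (same made-up table names as above):

    import sqlite3

    conn = sqlite3.connect("accounts.db")
    conn.execute("ATTACH DATABASE 'stockvalue.db' AS stock")

    # One query spans both files; sqlite gets to use the indices in each
    # and to plan the join itself.
    report = conn.execute("""
        SELECT h.symbol, SUM(h.quantity) * p.price AS value
        FROM   holding AS h
        JOIN   stock.daily_price AS p ON p.symbol = h.symbol
        WHERE  p.date = '2004-06-24'
        GROUP  BY h.symbol, p.price
    """).fetchall()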

I feel comfortable with this kind of approach. In addition to a unified query mechanism, with sqlite you even get atomic transactions. Perhaps you have several independent data models, but you want to perform an operation on all of them. You want the operation to succeed only if it can be done to every one. The simple example here might be that you've bought some shares of a company you've never been involved in before. You want to add it to your accounts, but also to your share price watch list... or maybe you want to keep several instances of your account database type. Perhaps you want to keep one set of accounts for Australian GAAP, and a different (but related) set for American GAAP. You'd want to know both were updated before allowing the transaction to go ahead.

I believe that web services do have a distributed transaction mechanism available, and that's something I may use in the future as technology and frameworks become more advanced. In the meantime, I'm thinking that these multiple objects as multiple on-disk sqlite database files might be a good first step.

My current thinking is that I define and publish the database schemas. Once that is done, I start work on applications or objects that are able to operate on those databases for various purposes. I think a well-defined schema level will provide a capability for adding functionality in a modular way that current information systems really only dream about. We have applications all over the place that have a "private" schema that means you have to go through the front-end to do anything useful. I'm not keen on it. I want the data API to be clean, simple, and published. It's what's above that should be "private".

My biggest concern with sqlite as a data solution is that it won't be able to incorporate the full data model I want to push into each database object. I may have to add functions using the C APIs and provide my own sqlite interface. That wouldn't be so good. Sqlite doesn't really have anything along the lines of stored procedures, although views and triggers are supported.
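
Triggers can still push part of the data model down into the database, though. A sketch of the kind of rule I mean, rejecting an obviously bad row before it lands (entry table and amount column assumed):

    import sqlite3

    conn = sqlite3.connect("accounts.db")
    conn.executescript("""
        -- No stored procedures, but a trigger can still reject a bad row
        -- before it ever reaches the table.
        CREATE TRIGGER entry_nonzero
        BEFORE INSERT ON entry
        WHEN NEW.amount = 0
        BEGIN
            SELECT RAISE(ABORT, 'transaction entries must be non-zero');
        END;
    """)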

The other issue with sqlite is the remoting capability. You would essentially have to mount the database in order to access it remotely, and that's fraught with problems of bad nfs implementations and the like. I don't think I can offer remoting capabilities for the time-being.

Hrrm... this weekend for sure.

Mon, 2004-Jun-14

Objects and Databases

In information technology there are two main archetypes. A software engineer works in programs, applications, procedures, objects. The database stream works in tables, forms, reports, data. The two are worlds apart.

Objects are a means of data hiding. They behave in concrete and reliable ways because they themselves define all the operations that can be performed on them and how those operations affect their state. Their data is hidden. What an object exposes may reflect what is held within it, but there is no need for it to do so.

A schema in the software world is the set of objects and their methods. If an object is transmitted across a network or stored in a file a memento must be created in the object's place to be activated, read, or otherwise interpreted to construct the equivalent object in the target process. The memento itself can't be understood by anything but the receiving object in that target process.

Well, that's not true. A memento can be anything. It could be a hard binary representation subject to change whenever an object is modified. It may be an XML representation that can at least be partially understood... but still it can't reconstruct that object on the other side.

Databases are a weird and woolly world, where it is the representation rather than the behaviour that is tightly controlled. Instead of defining a set of operations that can be performed on an object, they define the object in terms of its internal data. The operations are basic, standard, and implicitly allowed. They allow modification to any part of the data. This fairly primitive model of operations is suffered because the main purpose of a database is not to maintain the integrity of the objects it represents, but to allow those objects to be pulled apart, rearranged, and reassembled at will during the advanced query process.

SQL queries are a declarative mechanism for defining how source data from tables should be transformed into target tables for reporting. These new tables are essentially new objects, so the whole database can be thought of as an engine for collecting and transforming objects into new object sets. It's an essential and well-performed service.

The problem with SQL databases is that they don't deal so well with the things that the software world has learned regarding objects over the years. Because the data is so exposed, database folk get used to the idea, and it seems only small steps have been taken towards object-orientation. Stored procedures are the "standard" way of hiding data, behind standard operations, from applications that want to modify it. This, in effect, makes the entire database an object with methods.

Views are an attempt to hide data from the other side. A number of "standard" object sets are created to do queries against, while the actual data remains hidden. The underlying data can thus change without harming the queries and reports that hinge on it.
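
For example (with made-up account and entry tables), a view can publish a stable reporting surface while the tables underneath remain free to change shape:

    import sqlite3

    conn = sqlite3.connect("accounts.db")
    conn.executescript("""
        -- Reports hinge on the view; the tables underneath can be
        -- rearranged without breaking them.
        CREATE VIEW account_balance AS
        SELECT a.id, a.name, COALESCE(SUM(e.amount), 0) AS balance
        FROM   account AS a
        LEFT JOIN entry AS e ON e.account_id = a.id
        GROUP  BY a.id, a.name;
    """)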

Is there a way to make databases more like object models, while retaining the query and reporting capabilities that have made SQL great? I suppose another way of looking at the question is to ask: Is there a way of making object models more like databases, without sacrificing the data encapsulation capabilities we hold so dear?

It seems what we need to do is define a database that contains real objects. Those objects define a set of operations that can be performed on each, and present a number of "properties" that SET and UPDATE operations can be performed against. Further properties, or the same ones, would support SELECT statements. The objects themselves would be able to reject certain updates with an error or perform actions on other objects to achieve the SET operations.
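
sqlite can get part of the way there with updatable views: an INSTEAD OF trigger lets the "object" decide how, and whether, to honour an UPDATE against one of its properties. A sketch, again with made-up account and entry tables:

    import sqlite3

    conn = sqlite3.connect("accounts.db")
    conn.executescript("""
        -- A published "property" surface: SELECTs and UPDATEs go through
        -- the view, and the INSTEAD OF trigger decides how to honour them.
        CREATE VIEW account_property AS
        SELECT id, name, closed FROM account;

        CREATE TRIGGER account_property_update
        INSTEAD OF UPDATE OF closed ON account_property
        BEGIN
            -- Refuse to close an account that still has a balance.
            SELECT RAISE(ABORT, 'account still has a balance')
            WHERE NEW.closed = 1
              AND (SELECT COALESCE(SUM(amount), 0) FROM entry
                   WHERE account_id = NEW.id) != 0;
            UPDATE account SET closed = NEW.closed WHERE id = NEW.id;
        END;
    """)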

One spanner-in-the-works of this whole way of thinking is that in databases, objects are essentially either rows or the entire database. In software, objects are often complex beasts involving a number of sub-objects. These kinds of objects can be represented in a database, but don't match their natural form when doing so. The (understandable) obsession with tables in database technologies means that complex objects are difficult to query over and to extract.

To get to the point, if you have a database containing accounting objects you need at least three classes. You need a transaction, a set of transaction entries, and an account for each transaction entry to reference. Certain constraints must be present on these objects, for example "at the end of each update the total credit transaction entries must equal the total debit transaction entries for each transaction instance". In essence, a transaction and its entries behave as one complicated object, but must be represented for the database as several objects. This has its advantages, making it easy to pull out transaction entries associated with individual accounts. It has its disadvantages, making it difficult to know when to apply this test. It can't be done each time an insert, erase, or update is issued to a transaction entry. Perhaps before each commit? But if so, which transactions do we run the check on? It's infeasible to check all transactions.

Perhaps I'd better just give in and start thinking relational. I suppose with triggers I could just maintain a dirty list, checked and cleared before each transaction completes.
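
That might be as small as the sketch below: triggers note which transactions have been touched, and a single query checks just those before the commit goes ahead (made-up txn and entry tables, with signed amounts so a balanced transaction sums to zero):

    import sqlite3

    conn = sqlite3.connect("accounts.db")
    conn.executescript("""
        -- The dirty list: every touched transaction gets noted by trigger.
        CREATE TABLE dirty_txn (txn_id INTEGER PRIMARY KEY);

        CREATE TRIGGER entry_dirty_insert AFTER INSERT ON entry
        BEGIN INSERT OR IGNORE INTO dirty_txn VALUES (NEW.txn_id); END;

        CREATE TRIGGER entry_dirty_update AFTER UPDATE ON entry
        BEGIN
            INSERT OR IGNORE INTO dirty_txn VALUES (NEW.txn_id);
            INSERT OR IGNORE INTO dirty_txn VALUES (OLD.txn_id);
        END;

        CREATE TRIGGER entry_dirty_delete AFTER DELETE ON entry
        BEGIN INSERT OR IGNORE INTO dirty_txn VALUES (OLD.txn_id); END;
    """)

    def check_and_commit(conn):
        # Only the transactions on the dirty list need the debits-equal-credits
        # test, then the list is cleared and the transaction committed.
        unbalanced = conn.execute("""
            SELECT txn_id FROM entry
            WHERE  txn_id IN (SELECT txn_id FROM dirty_txn)
            GROUP  BY txn_id
            HAVING SUM(amount) != 0
        """).fetchall()
        if unbalanced:
            conn.rollback()
            raise ValueError("unbalanced transactions: %r" % unbalanced)
        conn.execute("DELETE FROM dirty_txn")
        conn.commit()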

P.S. On another note, I apologise for the untruncated nature of my blog entries. I've done some experimentation with the "seemore" plugin, but it doesn't appear to work for statically rendered blogs such as this one. Perhaps some knowledgeable person will take pity and refer me to one that does work on static pages :)