Sound advice - blog

Tales from the homeworld

My current feeds

Thu, 2004-Jun-24

Objects and Databases (again)

I'm obsessed about this. Databases and objects. What is the relationship, and how do we model one in the other?

A specific database schema (or XML schema for that matter) can be modelled as an object. It has a bounded set of operations that can be performed on it, defining its "type". It represents a specific set of data and a bounded information system. It is finite and controllable. It can be queried and updated in simple, standard ways. Changes to the schema result in an essentially new database object, which can be operated on in a new set of ways.

Other objects can interact with a database object. Some objects are able to use their specialised knowledge about the database schema to modify it. Ideally, the database itself defines modification functions and enforces any necessary rules for data integrity. Other objects are able to use their specialised knowledge to query the database in specific ways. Again, complex queries are best defined inside the database themselves.

So now comes the itch I've been trying to scratch: What if you have two database schemas?

Traditional database design has taken everything that can be conceptually connected and put them into one database, i.e one object for management. This solves the problem on the small scale, but doesn't work on internet-scale information architectures. Once you get to the massive scale it doesn't make sense to keep all your data in the one place. If we are to develop an architecture that scales we should consider these problems while they're still on the smallest level

Let's take a simple example: Your accounting database and your daily stock prices database. Your accounting database has a table with account entries in it, each of a specific amount on a specific date and tied to a specific transaction. Your stock prices database shows the value of certain stocks over time. Combine the two, and you can come up with a report that shows the value of your investment portfolio over time. You don't want to duplicate the information between databases, but neither do you want to combine the two databases into one.

Here's the picture (in mock-UML ascii art!): |Accounts|<---|PortfolioValueReport|--->|StockValue|

There's no need to couple the Accounts and StockValue objects together. Apart from a common thread of terminology that relates information to the "ASX:AMP" sstock ticker symbol there's no connection between the two and I want to keep it that way. I want PortfolioValueReport to be the only object that has to consider the two side-by-side. So how do we make this report?

We could do it like a web service does it. We could query Accounts, then query StockValue, and leave PortfolioValueReport to put the queries together and make some sense out of the data. I would call that an object-based approach to the information architecture.

Another approach would be to work like sqlite does with multiple databases. You "attach" databases so that they form a single meta-database as needed. You run your query on the meta-database. I would differentiate this form a strict object-based approach. I think it's a relational approach with independent merit. In this approach you get access to all the indices in each database and any query optimisation the databases can do in concert. You don't have to read the whole result of both queries. You essentially let the database technology work out how to do things.

I feel comfortable with this kind of approach. In addition to a unified query mechanism, with sqlite you even get atomic transactions. Perhaps you have several independent data models, but you want to perform an operation on all. You want the operation to succeed only if it can be done to every one. The simple example here might be that you've bought some shares of a company you've never been involved in before. You want to add it to your accounts, but also to your share price watch list... or maybe you want to keep several instances of your account database type. Perhaps you want to keep one set of accounts for australian gaap, and a different (but related) set for american gaap. You'd want to know both were updated before allowing the transaction to go ahead.

I belive that web services do have a distributed transaction mechanism available, and that's something I may use in the future as technology and frameworks become more advanced. In the mean-time, I'm thinking that these multiple objects as multiple on-disk sqlite database files might be a good first step.

My current thinking is that I define and publish the database schemas. Once that is done, I start work on applications or objects that are able to operate on those databases for various purposes. I think a well-defined schema level will provide a capability for adding functionality in a modular way that current information systems really only dream about. We have applications all over the place that have a "private" schema that means you have to go through the front-end to do anything useful. I'm not keen on it. I want the data API to be clean, simple, and published. It's what's above that should be "private".

My biggest concerns with sqlite as a data solution are that it won't be able to incorporate the full data model I want to push into each database object. I may have to add functions using the C apis and provide my own sqlite interface. That wouldn't be so good. Sqlite doesn't really have anything along the lines of stored procedures, although views and triggers are supported.

The other issue with sqlite is the remoting capability. You would essentially have to mount the database in order to access it remotely, and that's fraught with problems of bad nfs implementations and the like. I don't think I can offer remoting capabilities for the time-being.

Hrrm... this weekend for sure.