| Author: | Andrew J. Todd esq <andy47@halfcooked.com> |
| --- | --- |
| Date: | September, 2009 |
This paper describes a journey through identifying, understanding and managing change in database schema objects. It defines some common terminology, outlines the kinds of changes that can happen to database schemas, and describes some ways to record and control those changes.
Changing application source code is hard, but there are handy tools and techniques to help developers cope with those changes. Assuming that the same tools and techniques work equally well for your relational database can lead to problems. This paper attempts to explain the what, how and why of changes to databases - or more particularly the objects within them - and to provide some useful pointers that should help anyone who works with databases as part of their software applications.
For the purposes of this paper I am focussing on relational databases, although the principles should apply to any application data store. Commonly used open source relational databases include MySQL, PostgreSQL, SQLite, Firebird and Ingres.
In this paper I discuss why and when your database objects will change and what to do to support these changes in as painless a fashion as possible. The database components of an application complicate configuration management: application code (or the compiled objects created from it) can simply be deleted and replaced, but you can't just drop and replace database objects when you want to upgrade. You have to deal not only with the containers of your data but also with their contents.
Throughout this paper I am going to use some terms very specifically, and it is worth defining here what I mean by them:
Things in a software application system change mainly because what we want them to do changes. Either the functional requirements are modified with requests for new features or our understanding of the problem domain improves as we work in it.
With data models, as with any other aspect of a software system, there is a trade-off between what you need now and designing for the future. This has been explored in many publications and the arguments won't be repeated here. However, due to the nature of data and domain models, building out areas of the domain model that have not previously been implemented can be treated almost as new development. Modifying existing database objects is rarer than changing application code because they have fewer vectors of change: database objects can be added, dropped or renamed, and little else. Hence database schemas, unlike application code, are unlikely to require large scale refactoring or updating as new features are introduced.
Data models (and the structures that implement them) have a lower rate of change than the applications that work with them, but that doesn't mean that they never change. A common anti-pattern of software development is to assume that the domain model is fixed because 'we got it right the first time' and then to adapt functionality to work with it in often unnatural ways. This is often why developers think databases are hard to work with, intractable or just plain useless.
Typically a project development team will build a database schema based on some kind of conceptual or logical model. Because this looks primarily at the information requirements of a domain or subject area, rather than just an application or functional area, it should result in a structure that supports multiple processes and is therefore partially future-proof. Separating the application logic from its information requirements means that the transitory, transactional requirements of the application do not overly affect the structure of the data it uses.
For instance, we know that an Order can have one or more Order Lines and build a data model accordingly. If our particular application limits the number of lines per order to 5, the model will support this, and if this limit is increased in the future it is unlikely that the database schema will need to change to support the new functionality.
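As a minimal sketch of such a model (the table and column names here are illustrative, not taken from any particular application):

```sql
-- A minimal sketch of the Order / Order Line model; names are illustrative only.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date  DATE    NOT NULL
);

CREATE TABLE order_lines (
    order_id    INTEGER NOT NULL REFERENCES orders (order_id),
    line_number INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_number)
);

-- The 'maximum of 5 lines per order' rule belongs in the application (or at most
-- a CHECK constraint); raising the limit later needs no change to this structure.
```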
All of the objects in a database schema can change but we only need to deal with and record some of the changes. Specifically:
Some of these events happen more often than others in software applications. In particular it is common for some objects such as tables to never be removed because 'they may be useful later'. This should be discouraged. As long as there is decent configuration management there should never be a need to keep unnecessary objects in the database, just like there should never be redundant application code.
The one thing that should rarely, if ever, change is the database user or schema that holds the application's objects. Different database engines use different terms to describe schemas, but fundamentally, once a schema is established its attributes, such as its name, should hardly ever need to change.
The usual answer to this problem with relational databases is to make sure that the source control system has an up-to-date copy of the DDL required to create our database schema. Or, if you are using a framework like Rails, it will contain a database-neutral model from which the DDL to create the database schema can be generated.
The problem with this is that your DDL is not your database. It is a model of what the database should be, and for all we know the actual schema objects in your production, test or development databases may, or may not, reflect that model. It is also missing one crucial element of any database - the data. The data in a database is what distinguishes it from any other artefact of a software system. An application without data is of little or no use. The need to preserve data during changes is of paramount importance in any software application: whatever change is applied to the system must not affect the ability of users to use it.
The DDL is also missing some implementation details. For instance, in PostgreSQL you do not specify exactly where to store each block of data. In MySQL two databases may have different default storage engines, and creating a table from the same DDL statement in the two databases will result in subtly different objects being created. Another common cause of confusion is the encoding (character set) used to store the data elements. This isn't usually defined at the object level, and although the data may be consistent the way it is written to disk may be completely different. These differences are subtle but can be very important when changing a software application.
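For example, in MySQL the same logical CREATE TABLE can produce different physical objects on two servers with different defaults; spelling the options out makes them part of the DDL you version control (a sketch, with illustrative names):

```sql
-- Relying on server defaults: the storage engine and character set will be
-- whatever this particular MySQL instance is configured to use.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    full_name   VARCHAR(200) NOT NULL
);

-- Being explicit: the same logical table, but the implementation details are
-- now recorded in the DDL and will be the same wherever the script is run.
CREATE TABLE customers_explicit (
    customer_id INTEGER PRIMARY KEY,
    full_name   VARCHAR(200) NOT NULL
) ENGINE = InnoDB DEFAULT CHARSET = utf8;
```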
Fundamentally, though, the presence of data in a database schema is what distinguishes it from the DDL statements used to create the component objects within it. If you decide to rename a column in a table you can't just drop and re-create the table - what do you do with the rows of data within it? This is what makes database refactoring tricky. Luckily, some smart people have done a lot of thinking about the problems inherent in this situation, and as a result we have the concepts of evolutionary databases and database refactoring.
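Where the engine supports it, a simple rename can be done in place so the rows are untouched (a sketch in PostgreSQL syntax, with illustrative names); it is the changes that cannot be expressed this directly that need the refactoring techniques discussed below:

```sql
-- PostgreSQL: rename a column in place, keeping every row of data intact.
ALTER TABLE orders RENAME COLUMN order_date TO placed_date;

-- The data survives, but any views, triggers and application code that refer
-- to the old name still have to be changed in step with this DDL.
```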
But how do you manage these changes? Even with the approaches discussed above you still need to change database objects. Teams responsible for software applications need to use an ordered approach to recording and making changes to database schemas.
A common approach is not to bother subjecting the database schema to version control. This is a bad idea and falls out of favour the first time a change to a database schema needs to be backed out. Once the database schema is treated the same way as other application components and is subject to version control, the project team has a number of choices. Regardless of the technique chosen, the approach to controlling change has to support two use cases: creating a new (test, development, disaster recovery) instantiation of the database schema in an empty database, and migrating an existing database schema from one specific revision to another (usually the latest). Any configuration management scheme needs to be able to cope with both.
The usual approach is to store a complete set of DDL scripts in version control. This has the advantage of being able to completely create (or recreate) a schema at any point in time. The disadvantage is that it is hard to determine what changes need to be made to bring an existing schema to a specific revision. The complementary approach is to store an initial set of DDL scripts and then keep copies of the series of change DDL scripts used for each revision. In this way a schema can be upgraded by applying each change DDL between the version it currently represents and the intended version. The disadvantage of this approach is that creating a new database schema at the latest version of the system can be a time-consuming process.
The recommended approach is to store both sets of DDL scripts in your version control system. This does require a little duplication and redundancy but has the advantage that it supports both of the use cases.
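One commonly used supporting technique (a sketch; the table and its columns are illustrative rather than prescribed) is to have every baseline and change script record the revision it leaves the schema at, so both use cases can be verified:

```sql
-- Illustrative only: a small table that every baseline and change script
-- updates, so any database can report which revision its schema represents.
CREATE TABLE schema_version (
    version     VARCHAR(20)  NOT NULL,
    applied_at  TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    description VARCHAR(200)
);

-- The last statement in, for example, the change script for revision 1.4:
INSERT INTO schema_version (version, description)
VALUES ('1.4', 'Rename orders.order_date to placed_date');
```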
It is also advantageous to store scripts that populate key reference data alongside the DDL scripts in version control. Any semi-static data in the system used to categorise or describe transactional data (e.g. status codes, tax rates, group upper and lower limits) is stored both to support initial schema creation and to check that an existing schema has the correct values.
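Such a reference data script is usually just a table definition and a handful of INSERT statements kept next to the DDL (a sketch with made-up status codes):

```sql
-- Semi-static reference data, version controlled alongside the DDL it belongs to.
CREATE TABLE order_status (
    status_code VARCHAR(4)   PRIMARY KEY,
    description VARCHAR(100) NOT NULL
);

INSERT INTO order_status (status_code, description) VALUES ('NEW',  'Order received');
INSERT INTO order_status (status_code, description) VALUES ('PAID', 'Payment confirmed');
INSERT INTO order_status (status_code, description) VALUES ('SHIP', 'Order despatched');
```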
Problems happen. At least they do in the real world. The ability to cope with problems is a key reason for having a defined configuration management process and associated version control tools.
The secret to success in this, as in many other areas of software development, is good testing. Each migration should be treated as a unit of work and should be appropriately tested, preferably with some form of automated test script to validate that it has been performed successfully. The migration should then be run in a non-production environment and the tests executed successfully before any production upgrades are scheduled.
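The validation can be as simple as a script of queries whose results are compared with expected values before and after the migration. A sketch, continuing the earlier rename example (the checks and names are illustrative):

```sql
-- The renamed column should now exist on the orders table...
SELECT COUNT(*) AS placed_date_columns
  FROM information_schema.columns
 WHERE table_name = 'orders'
   AND column_name = 'placed_date';

-- ...and no rows should have been lost or left without a value.
SELECT COUNT(*)           AS total_orders,
       COUNT(placed_date) AS orders_with_a_date
  FROM orders;
```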
At the very least your change DDL scripts should have both an 'upgrade' and a 'downgrade' component. Rails migrations do this with the up and down methods on each migration. Alternatively, each set of upgrade scripts should have a corresponding set of downgrade scripts that can be run to back out the migration and leave the data intact. This isn't always as simple as it sounds, especially if tables or columns are dropped. It is even more tricky when column definitions change (for instance reducing the precision or changing to a more restrictive data type).
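For a simple additive change the pairing is straightforward (a sketch with illustrative names; the two statements would live in separate upgrade and downgrade scripts):

```sql
-- upgrade: add the new, nullable column.
ALTER TABLE orders ADD COLUMN delivery_notes VARCHAR(500);

-- downgrade: remove it again. Because the column was newly added, dropping it
-- loses nothing that existed before the upgrade.
ALTER TABLE orders DROP COLUMN delivery_notes;
```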
Luckily, there are specific strategies described in the literature (see database refactoring) for coping with the trickier cases. One straightforward mechanism is to make the change in small, additive stages, keeping the old and new structures side by side until the change has been verified; one such sequence is sketched below.
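For example, to change a column's definition without losing data (a generic expand-and-contract sketch rather than a prescription, with illustrative names; it assumes an orders table with an order_total column whose definition needs to change):

```sql
-- Step 1 (upgrade): add a replacement column with the new definition and copy
-- the existing values across. The old column stays in place for now.
ALTER TABLE orders ADD COLUMN order_total_v2 NUMERIC(12, 2);
UPDATE orders SET order_total_v2 = order_total;

-- Step 2 (a later migration, once the application uses the new column and the
-- change has been verified): drop the old column.
ALTER TABLE orders DROP COLUMN order_total;

-- Backing out before step 2 is simple: drop order_total_v2 and the original
-- data is untouched.
```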
For a belt and braces approach many DBAs also perform a full backup of the production database before applying any migrations. This enables them to restore to the known good state that existed before the migration was applied, and can be useful with large or complex migrations. However, restoring will lose any data that is created or updated whilst the migration is taking place, so it may not be ideal in environments where it isn't possible to restrict or stop activity during the migration.
In environments with high availability requirements more complex arrangements are usually involved which are outside the scope of this paper. They tend to include live mirrors and transaction recording and replay to provide seamless operations whilst upgrades are taking place.
There is a wealth of material on this subject. Some good places to start are Martin Fowler's essay on evolutionary databases and Scott Ambler's work on database refactoring; in fact the whole Agile Data web site is worth a read, and you could also try Pramod Sadalage's blog.