Using Relational Databases with Python

Author: Andrew J. Todd esq <andy47@halfcooked.com>
Date: September, 2006
License: GNU Free Documentation License

Abstract

Python has a standard mechanism for accessing databases called the Database API (DB-API). It is documented in Python PEP 249 and is influenced by ODBC, JDBC and to a lesser extent Perl's DBI. Modules are available for most of the popular relational databases and a number of non-relational databases as well.

Introduction

In this presentation to OSDC 2006 I'll talk about how you can use Python to access data stored in databases and the various tools and technologies available to help you. Primarily I'll be focussing on relational databases, although you can use Python to access other data stores. I'll also, for illustration, refer to the ways that databases are accessed and used in other programming languages such as Java, Perl and PHP.

Writing Python code to access databases is made easier by the presence of the Python Database API (commonly referred to as the DB-API). Because it is a pythonic API it fits nicely into existing Python code and allows Python programmers to easily store and retrieve data from databases.

The Python DB-API specifies a way to connect to databases and issue commands to them. This gives the advantage that there is a standard way to write code that deals with a database using connections, cursors and transactions. It also defines a standard exception hierarchy that modules must implement.

The DB-API has support for bind variables, data type mapping and transactions. Each of these features has multiple implementation options. The DB-API also specifies a number of optional extensions in the form of additional attributes and methods that DB-API modules and classes may implement.

The DB-API gives a common starting point for anyone writing code to interact with databases. However, parts of the standard have multiple implementations, and the presence of optional features means that writing cross database code is rather problematic. The inconsistencies in SQL support between different databases make this less of a problem than it might appear: even if each of the Python DB-API modules had exactly the same public interface, the SQL commands they were given would still have to differ.

There are a number of Python modules that build on the foundations of the DB-API with different levels of abstraction. They range from simple result set wrappers to full blown object relational mappers. The simple wrappers typically wrap the results of database operations in more Pythonic data structures like dictionaries whilst the object relational mappers allow Python programmers to (largely) distance themselves from writing SQL statements.

This paper assumes only a passing acquaintance with either Python or relational databases (and particularly SQL), although a little knowledge of either subject will certainly help. If you wish to follow along with the examples shown here you will need some software installed on your computer. At a minimum you will need Python, a database and the appropriate Python database module.

A good place to start is the SQLite database and the accompanying pysqlite2 Python database module. These are the tools we will use in the example code in this paper. The code should work with most recent releases of Python, although I suspect that you will have problems with releases before version 2.1. If you've got Python 2.5 then you will already have the pysqlite2 module, as it is now part of the standard library (except that it has been renamed to sqlite3).

Python

Best described as an agile language, Python was first released in 1991. Python has an extensive standard library, but access to databases comes mainly from third party modules. The standard library offers lower level storage interfaces like dbm, pickle and shelve. Some of the pure object oriented databases available for Python (ZODB, Durus and Schevo) build on these standard library modules.
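
As a quick, minimal sketch of those lower level interfaces, shelve behaves like a persistent dictionary of pickled objects (the filename here is just an example):

>>> import shelve
>>> store = shelve.open('scratch_shelf')  # example filename; creates a file on disk
>>> store['beatles'] = ['John', 'Paul', 'George', 'Ringo']
>>> store['beatles']
['John', 'Paul', 'George', 'Ringo']
>>> store.close()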

Databases

A database is, at its simplest, a persistent store for your data. I'd narrow that definition to include only those stores that conform to the ACID principle for transactions. ACID is defined in ISO/IEC 10026-1:1992 Section 4, but to save you the research here are the salient points (a short code illustration follows the list);

Atomicity
All of a transaction is committed or nothing is
Consistency
A transaction takes the database from one consistent state to another
Isolation
A transaction in process must be isolated from every other transaction
Durability
Committed data is preserved
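
To make atomicity concrete, here is a minimal sketch using the sqlite3 module that appears later in this paper, against a throwaway in-memory database; the insert disappears because it was never committed:

>>> import sqlite3
>>> conn = sqlite3.connect(':memory:')  # throwaway in-memory database
>>> curs = conn.cursor()
>>> curs.execute("CREATE TABLE t (x INTEGER)")
>>> curs.execute("INSERT INTO t VALUES (1)")
>>> conn.rollback()  # the uncommitted insert is rolled back
>>> curs.execute("SELECT count(*) FROM t")
>>> curs.fetchone()
(0,)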

When most people talk about databases these days they are referring to relational databases (also known as RDBMSs), which are persistent data stores that implement an interface based on the relational model. In a relational database everything is a relation: tables are relations, as are query results. The relational algebra then defines a number of operations that operate on relations, giving rise to the set based, declarative language SQL.

Practically, relational databases are made up of tables. Each table has a number of columns with defined data types and precisions. Each table will contain zero, one or more rows which are something like instances of objects.

Python DB-API

Originally published in April 1996, the first version of the Python database API (DB-API) was defined in Python PEP 248. Modules complying with that standard are now few and far between.

The reason for this is that the API was revised and version 2.0 was released in April 1999. This paper will concentrate on this current version of the DB-API, which can be found in Python PEP 249.

DB-API 2.0

The DB-API doesn't specify a common module providing connectivity like Perl's DBI; rather, it assumes that each database has its own module. But unlike PHP, where each database driver implements its own, often slightly different, commands for interacting with the database, in Python there is a level of consistency between modules.

Modules are available for most of the popular relational databases, be they F/OSS or commercial. If you have a database that you already use the chances are that there is a Python database module for it. Each DB-API module must provide at the very least a constructor and a number of module globals. It also acts as the owner of a number of standard exceptions. So, for our SQLite database we would do something like:

>>> try:
...     from sqlite3 import dbapi2 as sqlite
... except ImportError:
...     from pysqlite2 import dbapi2 as sqlite
...
>>> db_connection = sqlite.connect('sample.db')
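
Among the module globals the DB-API requires are apilevel, threadsafety and paramstyle, so you can check what you are dealing with before you start. The values shown here are what I'd expect from pysqlite2; other modules will differ:

>>> sqlite.apilevel
'2.0'
>>> sqlite.paramstyle
'qmark'
>>> sqlite.threadsafety
1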

To see the available methods and attributes on your connection use Python's introspection features:

>>> dir(db_connection)

You should see something like this:

['DataError', 'DatabaseError', 'Error', 'IntegrityError', 'InterfaceError', 'InternalError', 'NotSupportedError', 'OperationalError', 'ProgrammingError', 'Warning', '__class__', '__delattr__', '__doc__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__str__', 'close', 'commit', 'create_aggregate', 'create_function', 'cursor', 'isolation_level', 'rollback']

Certain methods are available on the connection object returned by this constructor. They all relate to 'global' operations such as transaction control (commit and rollback) and, most importantly, creating cursor objects. Each connection can have multiple cursors; as a rule you should create one for each concurrent transaction (or group of transactions), although it is perfectly common to create just one per connection. We create cursors with a call to the constructor method on the connection:

>>> db_curs = db_connection.cursor()

A cursor object is the means by which we issue SQL statements to our database and then get the results. To run a specific SQL statement use the execute method:

>>> db_curs.execute("CREATE TABLE people (
                        id INTEGER PRIMARY KEY, first_name VARCHAR(20),
                        last_name VARCHAR(30), date_of_birth DATE)")
>>> db_curs.execute("INSERT INTO people (first_name, last_name, date_of_birth)
                     VALUES ('David', 'Bailey', '1938-1-2')")

Transaction control is effected through our connection object, so to commit this change we use the commit method:

>>> db_connection.commit()

Then we need to be able to get our data back again. We can continue to use our original cursor as we don't need to keep the results of any prior operations around. Getting our data is a two step process:

>>> db_curs.execute("SELECT * FROM people")
>>> db_curs.fetchall()
[(1, u'David', u'Bailey', u'1938-1-2')]

For those who know their databases, these two steps are analogous to the execute and fetch phases of statement processing.

The DB-API specifies a number of different 'fetch' methods that a cursor should provide: fetchone, fetchmany and fetchall. They pretty much do what they say, fetching one row from the result set, a group of rows, or every row your query will return in one step. Obviously the fetchall method should be avoided when you are likely to have very big result sets, as it may take a long time to return any data.

Note that the 'fetch' methods only have to return a sequence. In Python any number of data types are classified as sequences so don't assume that you will always get a tuple (or a list).
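
As a minimal sketch, continuing the session above (remember there is only one row in the people table at this point):

>>> db_curs.execute("SELECT * FROM people")
>>> db_curs.fetchone()
(1, u'David', u'Bailey', u'1938-1-2')
>>> db_curs.fetchone()  # returns None once the result set is exhausted
>>> db_curs.execute("SELECT * FROM people")
>>> db_curs.fetchmany(5)  # at most five rows; only one is available
[(1, u'David', u'Bailey', u'1938-1-2')]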

The DB-API isn't as comprehensive and prescriptive as, for instance, JDBC, and this is sometimes seen as a weakness. It provides a basic standard level of functionality, enabling Python programs that deal with databases to be quite similar in structure and content. That said, it can be a non-trivial exercise to take code written for one DB-API module and run it against a different database through another DB-API module. This isn't as big a problem as it first seems, because rarely do two different databases implement the same functionality, and when they do it is rarely through exactly the same interface. This means that even if the DB-API modules provided exactly the same methods and properties you would still need to change your SQL code anyway.

The specification authors took the view that the DB-API would be like the SQL standard, specifying a core of standard functionality and recognising that different databases would need different code to support their different extensions. It was better to provide some flexibility in implementation because this reflects the reality of modern databases. When they all use exactly the same dialect of SQL I'm sure we will see a uniform DB-API, although I'm not holding my breath.

Parameter Style

One of the trickiest things people new to databases and Python get into trouble with is bind parameters. If you've used JDBC or Perl then the concept will be familiar. PHP by default doesn't have bind parameters so PHP programmers coming to Python often have problems with them. The typical first use scenario of parameters is something like:

>>> search_name = 'Parkinson'
>>> stmt = "SELECT * FROM people WHERE last_name='%s'" % search_name
>>> db_curs.execute(stmt)
>>> db_curs.fetchone()

This approach has two main problems. First, you are potentially vulnerable to SQL injection attacks, especially if you are taking the value of search_name from an untrusted source. Second, you aren't giving the database a chance to properly optimise your query, because each time you run db_curs.execute(stmt) it is treated as a brand new statement and parsed again, adding to the cost of the operation.

Proper use of bind variables and parameters addresses both of these problems. Let's try another insert into our table:

>>> stmt = "INSERT INTO people (first_name, last_name, date_of_birth) VALUES (?,?,?)"
>>> db_curs.execute(stmt, ('Terence', 'Donovan', '1936-9-14'))
>>> db_curs.execute(stmt, ('Norman', 'Parkinson', '1913-4-21'))

In this case the database is more likely to keep the parsed version of stmt around and save a few machine cycles on the second insert. Because we are passing the values as explicit parameters the DB-API module can properly escape the contents and reduce the likelihood of malicious or accidental damage to our database.
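
The DB-API also defines an executemany method on cursors for exactly this situation. As a sketch, the two inserts above could have been written as:

>>> rows = [('Terence', 'Donovan', '1936-9-14'),
...         ('Norman', 'Parkinson', '1913-4-21')]
>>> db_curs.executemany(stmt, rows)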

People new to the Python DB-API often have trouble with this concept which is why I'm taking the time to go over it here. This confusion isn't helped by the paramstyle options provided by the DB-API. The module author is free to support one or more of the five available styles:

qmark
Question mark style, e.g. '...WHERE name=?'
numeric
Numeric, positional style, e.g. '...WHERE name=:1'
named
Named style, e.g. '...WHERE name=:name'
format
ANSI C printf format codes, e.g. '...WHERE name=%s'
pyformat
Python extended format codes, e.g. '...WHERE name=%(name)s'
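
pysqlite2, for instance, accepts both the qmark and named styles; here is a short sketch against our people table (check your module's paramstyle attribute before assuming either):

>>> db_curs.execute("SELECT * FROM people WHERE last_name=?", ('Bailey',))
>>> db_curs.fetchone()
(1, u'David', u'Bailey', u'1938-1-2')
>>> db_curs.execute("SELECT * FROM people WHERE last_name=:name", {'name': 'Bailey'})
>>> db_curs.fetchone()
(1, u'David', u'Bailey', u'1938-1-2')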

The format option provides all kinds of opportunities for trouble, especially as it is supported by the MySQL driver, which is a common first DB-API module. Consider these two examples:

>>> stmt1 = "SELECT * FROM people WHERE name='%s'"
>>> stmt2 = "SELECT * FROM people WHERE name=%s"

stmt1 has an extra set of quote marks around the %s and will therefore require the use of string substitution, whereas stmt2 uses the format parameter style and the %s will be treated as a bind parameter. One is good practice, the other bad, but the visible difference is very subtle. To execute these statements you would do something like:

>>> db_curs.execute(stmt1 % 'Bailey')
>>> db_curs.execute(stmt2, ('Bailey',))

Again, semantically worlds apart but syntactically quite similar. The good news is that this parameter style will go away in DB-API 2.1 (or 3000). The bad news is that there isn't a target date for its release.

Building on the DB-API

There are a variety of Python modules that provide services on top of the DB-API, ranging from simple helpers like dtuple.py and db_row to full blown object relational mappers like SQLObject and SQLAlchemy; there is a tool for every need. For a list of these modules your best bet is the higher level database programming page on the Python Wiki.

Of most benefit to the new or casual user are helpers like dtuple.py. This module by Greg Stein allows you to deal with the result sets returned from cursors as dictionaries or objects rather than plain sequences. For example:

>>> import dtuple
>>> db_curs.execute("SELECT * FROM people")
>>> descr = dtuple.TupleDescriptor(db_curs.description)
>>> raw_row = db_curs.fetchone()
>>> raw_row
(1, u'David', u'Bailey', u'1938-1-2')
>>> row = dtuple.DatabaseTuple(descr, raw_row)
>>> row.first_name
u'David'
>>> row.date_of_birth
u'1938-1-2'

Sadly, you are on your own if you want to change anything:

>>> row.last_name = 'Smith'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "dtuple.py", line 148, in __setattr__
    raise TypeError, "can't assign to this subscripted object"
TypeError: can't assign to this subscripted object

More information can be found in the Python cookbook recipe Using dtuple for Flexible Query Result Access.
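
If you would rather avoid the extra dependency, here is a minimal hand-rolled equivalent using nothing but the DB-API; dict_rows is a hypothetical helper written for this paper, not part of any module:

>>> def dict_rows(cursor):
...     names = [d[0] for d in cursor.description]  # column names from the cursor metadata
...     return [dict(zip(names, row)) for row in cursor.fetchall()]
...
>>> db_curs.execute("SELECT * FROM people")
>>> dict_rows(db_curs)[0]['first_name']
u'David'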

Programmers coming from an object oriented background often don't want to write SQL. Object relational mappers (ORM) are the perfect tools to keep SQL statements out of application code. The two most popular Python ORMs are SQLObject and SQLAlchemy. Django also comes with its own ORM loosely based on SQLObject.

These ORMs enable a table centric view of your database, allowing you to describe your tables in code or to read them from the database data dictionary. SQLAlchemy then gives you a number of different ways of mapping those tables onto your application classes, whereas SQLObject ties each class directly to a single table.

The ORMs provide rich wrappers that produce data transfer objects (DTOs) from the records in your database tables. Operations on these DTOs are transparently echoed into your database by the services they provide. Using our SQLite database from earlier and SQLAlchemy here are some of the things we can do without writing any SQL:

  1. Establish a connection to a database
>>> from sqlalchemy import *
>>> import datetime
>>> sa_conn = BoundMetaData('sqlite:///tutorial.db')
  2. Define an object that corresponds to a table
>>> people = Table('people', sa_conn,
...                Column('id', Integer, primary_key=True),  # the primary key created earlier; needed by the mapper below
...                Column('first_name', String(20)),
...                Column('last_name', String(30)),
...                Column('date_of_birth', Date())
...               )
  3. Create a row
>>> i = people.insert()
>>> i.execute(first_name='Patrick', last_name='Anson', date_of_birth=datetime.date(1939, 4, 25))
  4. Read a row from the table
>>> r = people.select().execute()
>>> r.fetchone()
(1, u'David', u'Bailey', datetime.date(1938, 1, 2))
  5. Delete a row
>>> class Person(object):
...     pass
...
>>> personmapper = mapper(Person, people)
>>> session = create_session()
>>> query = session.query(Person)
>>> earl = query.get_by(first_name='Patrick')
>>> session.delete(earl)
>>> session.flush()
  6. Update some values
>>> david = query.get_by(first_name='David')
>>> david.first_name
u'David'
>>> david.first_name = 'Brian'
>>> david.first_name
u'Brian'
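
One caveat worth noting: as with the delete above, this vintage of SQLAlchemy only writes the change to the database when the session is flushed:

>>> session.flush()  # issues the corresponding UPDATE statement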

And the equivalent operations with SQLObject:

  1. Establish a connection to a database
>>> from sqlobject import *
>>> import os, datetime
>>> db_filename = os.path.abspath('tutorial.db')
>>> sqlhub.processConnection = connectionForURI('sqlite:' + db_filename)
  2. Define an object that corresponds to a table
>>> class Person(SQLObject):
...     class sqlmeta:
...         table = 'people'
...     firstName = StringCol(length=20)
...     lastName = StringCol(length=30)
...     dateOfBirth = DateCol()
...
  3. Create a row
>>> cecil = Person(firstName='Cecil', lastName='Beaton', dateOfBirth=datetime.date(1904, 1, 14))
  4. Read a row from the table
>>> check = Person.selectBy(firstName='Cecil')
>>> cecil == check[0]
True
  5. Delete a row
>>> cecil.destroySelf()
  6. Update some values
>>> david = Person.selectBy(firstName='David')
>>> david = david[0]
>>> david.firstName
'David'
>>> david.firstName = 'Brian'

The advantage of these tools is that they can initially make your application code simpler. By letting the application code interact only with Python objects you can concentrate on solving the problems your application is aimed at without dealing with the object relational impedance mismatch. The drawback is that the compromises they make in transaction management, and in generalising between different databases, may mean that they actually end up making your application code more complex than it needs to be. In cases where you just need the ability to persist object instances as rows in a table, ORMs can provide incredible boosts to productivity; examples can be seen in the applications produced with Django and Ruby on Rails.

For situations where the requirements of your application aren't met by this standard functionality, both of the ORMs mentioned here allow you to drop down to raw SQL when and where you need to. This provides a good compromise between the object view that will be prevalent in your application and the set based approach that is the strength of the relational database.
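
As a hedged sketch of what that looks like with the SQLAlchemy objects from earlier (assuming, as the tutorials of the day suggest, that the BoundMetaData object exposes its engine):

>>> result = sa_conn.engine.execute("SELECT * FROM people")  # raw SQL, no table objects involved
>>> row = result.fetchone()  # rows come back much like the DB-API tuples we saw earlier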

Non-relational Databases

Of course, you don't need to be a disciple of Ted Codd to want to persistently store data from your Python programs. The standard library provides the pickle and shelve modules, which are perfectly suited to saving Python objects to the file system, but they are quite low level and don't provide much support for complex transactions, multi user access or network access. To this end there are a number of object oriented database modules available for Python. I'm not going to cover them in this paper, but do check out the ones mentioned earlier: ZODB, Durus and Schevo.

Further Reading

Hopefully this paper has whetted your appetite for all things Python and database. If you want to find out more, good places to start are the DB-API specification itself (Python PEP 249), the higher level database programming page on the Python Wiki, and the Python cookbook recipe mentioned above.