March 5, 2011, 1:25 p.m.
posted by oxy
Item 33: Consider using optimistic concurrency for better scalability
Consider the canonical problem of two users trying to work with the same data: each takes out a working copy of the data in the database and makes updates to it (thus creating transient state for each, as described in the introduction to Chapter 5). Now each looks to preserve the changes back to the data store. Several mechanisms are available to us—in one case, we might rely on the underlying database's concurrency facilities by starting a transaction when the first user reads the data, thus taking out a lock on the data and ensuring that nobody else can tamper with the data until our first user commits the changes made. Unfortunately, this approach creates a huge lock window (violating the advice of Item 29) by explicitly ceding control back to the user while holding a transaction open (see Item 30). As scalability goes, you can't do much worse than this.
So the next approach might be to make the read one transaction and the updates a separate transaction. The problem we encounter here is plain to see when we run through the full scenario: our two users do their reads, and when the time comes to do their updates, the two updates happen one right after the other, the second blithely overwriting the first. Worse yet, the underlying resource manager (the database, in this case) doesn't offer any hint or signal that data has just been overwritten.
An alternative might be to turn down the isolation level on each transaction (see Item 35 for details), thus allowing simultaneous reads to occur. The net effect will be the same, however: data is lost because each user will update without realizing that he or she is overwriting another user's changes.
Optimistic concurrency models, also known as Optimistic Offline Locks [Fowler, 416] or Optimistic Locks [Nock, 395], offer a way to have your cake and eat it too: for a small amount of additional work on your part, you can keep the number of actual locks taken out to a minimum. In essence, you tack some kind of "data-modification marker" onto each set of data retrieved from the data store; when doing an update, you check the marker held in transient state against what's in the database, and if they differ, you take appropriate action to resolve the difference. "Appropriate action" varies from one system to the next, but some possibilities include trying to merge the changes automatically, presenting the user with the changed data and allowing him or her the opportunity to do the merge, simply overwriting (the problem we tried to get away from in the first place is sometimes still the right thing to do), or presenting "hey, somebody modified your data, what should I do next?" messages to the user.
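To make that dispatch concrete, here is a minimal Java sketch of the "appropriate action" decision. The `ConflictAction` enum, the `resolve()` method, and the version-gap heuristic are all illustrative inventions for this sketch, not part of any standard API:

```java
// Hypothetical sketch of the "appropriate action" dispatch described above.
enum ConflictAction { OVERWRITE, MERGE, ASK_USER, ABORT }

class ConflictResolver {
    // Pick an action based on how far behind our copy is; a real system
    // would consult business rules rather than a simple version gap.
    static ConflictAction resolve(long heldVersion, long currentVersion) {
        long gap = currentVersion - heldVersion;
        if (gap == 0) return ConflictAction.OVERWRITE; // no conflict at all
        if (gap == 1) return ConflictAction.MERGE;     // one intervening change: try to merge
        return ConflictAction.ASK_USER;                // data has moved on; punt to the user
    }
}
```

A real system would base this decision on business rules, probably varying per use case, rather than a simple numeric gap.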
Actual optimistic concurrency implementations can take a variety of forms; Version Number [Marinescu, 70] is one such approach, in which each table contains an additional column for a version number; other approaches prefer to use a timestamp of last modification. Using a version number gives you an indication of how many times the data has been modified since you last looked at the data (if you're holding version 1, and the current version is 20, the data has had a lot of changes, and you might prefer to just throw it away and start over), which isn't present with a simple timestamp approach. On the other hand, a timestamp approach takes you halfway to having a temporal element to your database, which can be extremely useful for auditing purposes, something we'll discuss in a bit.
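The version-number variant is easy to demonstrate without a database. The following in-memory Java sketch (class and method names are hypothetical) shows the essential check-then-bump semantics that an UPDATE with a "version = ?" predicate provides:

```java
import java.util.*;

class VersionedStore {
    // In-memory stand-in for a table with a version column (illustrative only).
    static final class Row {
        String data; long version;
        Row(String d, long v) { data = d; version = v; }
    }
    private final Map<Long, Row> rows = new HashMap<>();

    void insert(long pk, String data) { rows.put(pk, new Row(data, 1)); }

    // Hand back a copy -- the caller's transient working state.
    Row read(long pk) { Row r = rows.get(pk); return new Row(r.data, r.version); }

    // The optimistic check: succeed only if the caller's version is still current.
    boolean tryUpdate(long pk, String newData, long heldVersion) {
        Row r = rows.get(pk);
        if (r == null || r.version != heldVersion) return false; // someone else won
        r.data = newData;
        r.version++;   // the moral equivalent of SET version = version + 1
        return true;
    }
}
```

The second writer's `tryUpdate` fails because the row's version has already moved on, which is exactly the lost-update protection we were after.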
Regardless of which approach you choose, the basic code for accessing data under an optimistic concurrency model looks something like this:
SELECT primary_key, data elements of interest, last_modified
  FROM Table
 WHERE primary_key = ?

Cache off last_modified; when it comes time to update, do:

UPDATE Table
   SET data elements of interest = new values,
       last_modified = system-generated timestamp
 WHERE primary_key = ?
   AND last_modified = old last_modified value
Keeping an eye on the row count of the UPDATE statement is crucial; if it returns 0 rows, it means that the update's predicate (the WHERE clause) failed. Unless you have a data corruption problem, that means the row identified by the primary key doesn't have a last_modified value matching the one you thought it should have; in other words, the last_modified timestamp of the row in question has changed, which tells you the whole row has possibly changed.
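In JDBC terms, the check amounts to building the UPDATE with the extra last_modified predicate and testing the value executeUpdate() returns. A minimal sketch (the table name, column names, and helper methods here are illustrative, not from any particular schema):

```java
final class OptimisticSql {
    // Build the optimistic UPDATE: every data column, plus the new
    // last_modified, guarded by the old last_modified in the WHERE clause.
    static String updateSql(String table, String[] dataCols) {
        StringBuilder sb = new StringBuilder("UPDATE ").append(table).append(" SET ");
        for (String c : dataCols) sb.append(c).append(" = ?, ");
        sb.append("last_modified = ? ");
        sb.append("WHERE primary_key = ? AND last_modified = ?");
        return sb.toString();
    }

    // Interpret executeUpdate()'s return: 1 = success, 0 = optimistic conflict.
    static boolean succeeded(int rowCount) {
        if (rowCount > 1)
            throw new IllegalStateException("primary key not unique?");
        return rowCount == 1;
    }
}
```

`succeeded()` returning false is the signal to kick off whatever resolution policy you chose; more than one row updated suggests the primary key isn't actually unique.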
At this point, what happens next is entirely up to you. In particular, the question on the table is what to do with the data that's now out of sync with what's stored in the database.
No one approach will fit all systems; in many cases, you'll use more than one resolution approach even within the same module or section of code. It all really depends on the business and the domain logic needs.
In some cases, you may even want to add a temporal axis to your data; by including both start and end timestamp columns, representing the "live" period this data was active, and then creating a composite primary key consisting of the original primary key column(s) and the start and end columns, you create a historical record of every row in that table:
CREATE TABLE person (
    primary_key INTEGER,
    first_name VARCHAR(120),
    last_name VARCHAR(120),
    other interesting data elements,
    start TIMESTAMP,
    end TIMESTAMP,
    PRIMARY KEY (primary_key, start, end)
);
Retrieving the "most current" row for a given primary_key simply means querying for the record where the end column is set to either 0, NULL (if your database supports NULL values in primary keys), or some nonsensical value:
SELECT first_name, last_name, ...
  FROM person
 WHERE primary_key = ? AND end = 0
Updates to the table are deceptively simple, in that you only UPDATE the existing row to close off the record's lifetime by filling in the end column, and INSERT a new row (with the new data) with a NULL end column:
Date now = new java.util.Date();

String sql = "UPDATE person SET end = ? " +
             "WHERE primary_key = ? AND end IS NULL";
PreparedStatement ps = conn.prepareStatement(sql);
ps.setTimestamp(1, new java.sql.Timestamp(now.getTime()));
ps.setInt(2, primaryKeyValue);
int rows = ps.executeUpdate();
if (rows != 1)
    throw new InconsistentDatabaseException("Something's rotten...");

sql = "INSERT INTO person(" +
      "  primary_key, first_name, last_name, ..., start) " +
      "VALUES (?, ?, ?, ..., ?)";
ps = conn.prepareStatement(sql);
ps.setInt(1, primaryKeyValue);
ps.setString(2, firstName);
ps.setString(3, lastName);
// ... fill in any other data for the person table as well
ps.setTimestamp(10, new java.sql.Timestamp(now.getTime() + 1));
    // in other words, 1 millisecond after the
    // end of the previous record
rows = ps.executeUpdate();
if (rows != 1)
    throw new InconsistentDatabaseException("Something's rotten...");
Bear in mind that both of these statements should really be done as part of a transaction, to eliminate the race condition between the UPDATE and the INSERT statements. Deletions are similarly simple: you just UPDATE the row to fill in the end column, and don't insert a new record with a given start and an empty end column.
While this scheme means that data is never actually erased from the database, it also means that you can recreate the exact state of the database at any given moment in time, which is a powerful auditing tool (particularly if you also include timestamps on each of your audit log statements). It also helps to simplify some of your programming logic, since you never actually UPDATE an existing row to modify data (although you will have to UPDATE to fill in the end column at some point). Instead, you just INSERT a new row with the current timestamp in the start column and the new data. In fact, the only time you need any special-case logic is the moment the first logical row is entered (in other words, when a new primary-key-identified row is inserted), since there will be no corresponding UPDATE to fill in the "previous" version's end column.
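The point-in-time reconstruction can be sketched in memory as well. Assuming hypothetical write() and asOf() operations over [start, end) lifetimes (names and representation are illustrative), a minimal Java version looks like this:

```java
import java.util.*;

// Rows are never updated in place: each logical change closes the old row's
// lifetime and inserts a new one, exactly as in the UPDATE/INSERT pair above.
class TemporalStore {
    static final class Row {
        final String data; final long start;
        long end = Long.MAX_VALUE;             // "open" row, i.e., an empty end column
        Row(String d, long s) { data = d; start = s; }
    }
    private final Map<Long, List<Row>> history = new HashMap<>();

    void write(long pk, String data, long now) {
        List<Row> rows = history.computeIfAbsent(pk, k -> new ArrayList<>());
        if (!rows.isEmpty())
            rows.get(rows.size() - 1).end = now;  // close off the previous record
        rows.add(new Row(data, now));             // insert the new version
    }

    // Reconstruct the value as of any instant -- the auditing query.
    String asOf(long pk, long t) {
        for (Row r : history.getOrDefault(pk, Collections.emptyList()))
            if (r.start <= t && t < r.end) return r.data;
        return null;
    }
}
```

asOf() is the programmatic equivalent of querying WHERE start <= t AND (end > t OR end IS NULL) against the person table above.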
Bringing this discussion back around to optimistic concurrency, clients working with data sets they've read from the database can verify the continued authenticity of the data they hold by checking to see whether the start column of their data matches the most recent row's start column. If the two don't match, obviously somebody modified the data, and we go back to the four-point decision raised earlier: abort, overwrite, merge, or ask.
Optimistic concurrency offers a couple of benefits, the first and foremost of which is the lack of any native database locks taken out during the modification of data read earlier; in other words, client code reading data doesn't have to decide "up front" whether it wants to take out a lock. Instead, it simply reads the optimistic marker (version number or timestamp) as part of the data, and only later, when it comes time to write the data back to the database, does it check whether the data was modified. Only then, during the actual modification (the UPDATE statement), do you need any kind of transactional lock, and that's usually handled for you by the database and/or the JDBC driver (if it's in auto-commit mode).
A side benefit to the optimistic concurrency approach is its improved diagnostic capability. In the event that another user modified the data, your code has an opportunity to inform your user of what's happened, as well as what to do about it. Had native database locking been used, no information about why a SQLException was thrown would be available unless the database itself chose to present it. As Nock points out, "You can tailor this notification to include whatever diagnostic information will suit clients best. For example, it may be helpful to inform users who updated the data last and when it changed. This type of information can encourage users to be more aware of concurrency issues and remind them to refresh working copies after returning from extended breaks" [Nock, 398–399].
Unfortunately, all is not perfect with the optimistic concurrency model. In particular, you're relying on application code to keep the concurrency model straight, meaning that if some other program decides to access your database (see Item 45), you're at the mercy of that code and its programmers to get the concurrency model right. For that matter, you're even at the mercy of the other programmers on your team to get the model right, which makes a good case for hiding the details of the concurrency model behind some kind of encapsulation layer (see Item 42).
Also, you may find that the optimistic concurrency model doesn't give you quite as much control as you'd like over how everything works together. For example, you may want to create "read-intent" locks or "write-intent" locks within the system, where a user indicates the desire to have exclusive access to a row or collection of rows for a period of time.
Yet you're still reluctant to fall back to native database locks, both for scalability reasons and because you want to control the locking a bit more finely than the database itself will allow. In such situations, you should consider building a pessimistic concurrency model (see Item 34), which can coexist quite peacefully with an optimistic concurrency model, as long as you don't try to use both for the same data.
Many object-relational data access toolkits (JDO implementations and the like) may provide optimistic concurrency mechanisms "out of the box" for you. If the one you use provides such a mechanism, unless you have solid reasons not to (see Item 11), take advantage of it, particularly since it usually costs you nothing more than a line in a configuration file or persistence descriptor.
In general, the optimistic concurrency model is intended purely as a scalability measure, designed to minimize lock windows (see Item 29) by reducing the actual number of native database locks taken out and the length of time they're held. For most enterprise Java systems, it should probably be the default concurrency model: relying on the underlying database to provide all concurrency might work for a short while but ultimately will create contention as you scale up the system.