June 15, 2011, 1:23 a.m.
posted by oxy
Item 47: Eager-load frequently used data
Eager-loading is the opposite of lazy-loading (see Item 46): rather than taking the additional network round-trip to retrieve data later, when it's needed, we decide to pull extra data across the wire now and just hold it on the database client on the grounds that we'll need it eventually.
This idea has a couple of implications. First, the payoff—actually accessing the extra data—has to justify the cost of marshaling it across the wire from server to client. If the data never gets used, eager-loading it is a waste of network bandwidth and hurts scalability. For a few columns in a single-row retrieval query, this is probably an acceptable loss compared with the cost of making the extra round-trip back to the database for those same columns. For 10,000 rows, each unused column just plain hurts, particularly if more than one client executes this code simultaneously.
Second, we're also implicitly assuming that moving the data across the wire isn't excessively expensive. For example, if we're talking about pulling a large BLOB column across, make sure that this column will actually be needed, since most databases aren't particularly efficient at retrieving and sending BLOBs.
You don't hear much about eager-loading data because to many developers it conjures up some horrific images of early object databases and their propensity to eager-load objects when some "root" object was requested. For example, if a Company holds zero to many Department objects, and each Department holds zero to many Employee objects, and each Employee holds zero to many PerformanceReview objects . . . well, imagine how many objects will be retrieved when asking for a list of all the Company objects currently stored in the database. If all we were interested in is a list of Company names, all the Department, Employee, and PerformanceReview objects are obviously a huge waste of time and energy to send across the wire.
In fact, eager-loading has nothing to do with object databases (despite the fact that they were blamed for it)—I've worked with object-relational layers that did exactly the same thing, with exactly the same kind of results. The problem, although "solved" differently, is the same as that for lazy-loading: the underlying plumbing layer just doesn't know which data to pull back when retrieving object state. So, in the case of the eager-loading system, we err on the side of fewer network round-trips and pull it all back.
Despite the obvious bandwidth consumption, eager-loading has a number of advantages that lazy-loading can't match.
First, eager-loading the complete set of data you'll need helps tremendously with concurrency situations. Remember, in the lazy-loaded scenario, it was possible (without Session Façade [Alur/Crupi/Malks, 341] in place) for clients to see semantically corrupt data because each entity bean access resulted in a separate SQL query under a separate transaction. This meant that another client could modify that row between transactions, thus effectively invalidating everything that had been retrieved before, without us knowing about it. When eager-loading data, we can pull across a complete dump of the row as part of a single transaction, thus obviating the need for another trip to the database and eliminating the possibility of semantically corrupt data: there's no "second query" to return changed data. In essence, eager-loading fosters a pass-by-value dynamic, as opposed to lazy-loading's pass-by-reference approach.
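The pass-by-value versus pass-by-reference distinction can be sketched without a real database. The example below is only an illustration under stated assumptions: `FakeRow` is a hypothetical in-memory stand-in for a single database row, and each of its `read` methods represents one round-trip in its own transaction. It is not code from any EJB container or persistence layer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for one database row (an assumption for illustration).
class FakeRow {
    private final Map<String, String> columns = new HashMap<>();

    // Another client's write: replaces column values atomically.
    synchronized void update(Map<String, String> changes) {
        columns.putAll(changes);
    }

    // Lazy style: one simulated round-trip per column, each in its own "transaction".
    synchronized String readColumn(String name) {
        return columns.get(name);
    }

    // Eager style: one simulated round-trip that copies the whole row at once.
    synchronized Map<String, String> readAll() {
        return new HashMap<>(columns);
    }
}

class SnapshotDemo {
    public static void main(String[] args) {
        FakeRow row = new FakeRow();
        row.update(Map.of("firstName", "Ann", "lastName", "Smith"));

        // Eager: take a complete copy of the row in a single read.
        Map<String, String> snapshot = row.readAll();

        // Lazy: read the first column now...
        String lazyFirst = row.readColumn("firstName");

        // ...another client rewrites the row in between...
        row.update(Map.of("firstName", "Bob", "lastName", "Jones"));

        // ...and read the second column later.
        String lazyLast = row.readColumn("lastName");

        // The lazy reads mix two versions of the row; the snapshot does not.
        System.out.println("lazy:  " + lazyFirst + " " + lazyLast);
        System.out.println("eager: " + snapshot.get("firstName") + " " + snapshot.get("lastName"));
    }
}
```

The lazy reader ends up with "Ann Jones", a person who never existed in the row, while the eager snapshot still holds the internally consistent "Ann Smith". The snapshot may be stale, but it is never a blend of two versions.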
This has some powerful implications for lock windows (see Item 29), too. The container won't have to hold a lock against the row (or page, or table, depending on your database's locking defaults) for the entire duration of the session bean's transaction (if accessed via EJB) or of the explicit transaction maintained through either a JTA Transaction or a JDBC Connection. Shorter lock windows mean lower contention, which means better scalability.
Of course, eager-loading all the data also means there's a window of opportunity for other clients to modify the data, since you're no longer holding a transactional lock against it, but this can be solved through a variety of means, including explicit transactional locks, pessimistic concurrency models (see Item 34), and optimistic concurrency models (see Item 33).
Second, as already implied, eager-loading data can drastically reduce the total time spent on the wire retrieving data for a given collection if that data can be safely assumed to be needed. Consider, for example, user preferences: data that individual users can define in order to customize their use of the application in a variety of ways, such as background images for the main window, whether to use the "basic" or "advanced" menus, and so on. This is data that may not be needed immediately at the time the user logs in, but it's a fair bet that all of it will be used at some point or another within the application. We could lazy-load the data, but considering that each data element (configuration item) will probably be needed independently of the others, we're looking at, again, an N+1 query problem, in this case retrieving each individual data element rather than each individual row. Go ahead and pull the entire set across at once, and just hold it locally for use by the code that needs to consult user preferences when deciding how to render output, windowing decorations, or whatever.
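To make the round-trip difference concrete, here is a minimal sketch of the user-preferences case. Everything in it is assumed for illustration: `PreferenceStore` is a hypothetical in-memory stand-in for the preferences table of one user, its `roundTrips` counter simulates network trips, and the preference keys are invented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one user's preferences table; counts simulated round-trips.
class PreferenceStore {
    private final Map<String, String> table = new HashMap<>();
    int roundTrips = 0;

    PreferenceStore() {
        table.put("backgroundImage", "beach.png");
        table.put("menuStyle", "advanced");
        table.put("fontSize", "12");
    }

    // Lazy style: one round-trip per preference.
    String fetchOne(String key) {
        roundTrips++;
        return table.get(key);
    }

    // Eager style: one round-trip for the whole set.
    Map<String, String> fetchAll() {
        roundTrips++;
        return new HashMap<>(table);
    }
}

// Eager-loaded preferences: pulled across once, then served from local memory.
class UserPreferences {
    private final Map<String, String> values;

    UserPreferences(PreferenceStore store) {
        this.values = store.fetchAll();
    }

    String get(String key) {
        return values.get(key);
    }
}

class PreferencesDemo {
    public static void main(String[] args) {
        // Lazy: each lookup goes back to the store.
        PreferenceStore lazyStore = new PreferenceStore();
        lazyStore.fetchOne("backgroundImage");
        lazyStore.fetchOne("menuStyle");
        lazyStore.fetchOne("fontSize");
        System.out.println("lazy round-trips:  " + lazyStore.roundTrips);

        // Eager: one load up front, then purely local lookups.
        PreferenceStore eagerStore = new PreferenceStore();
        UserPreferences prefs = new UserPreferences(eagerStore);
        prefs.get("backgroundImage");
        prefs.get("menuStyle");
        prefs.get("fontSize");
        System.out.println("eager round-trips: " + eagerStore.roundTrips);
    }
}
```

With three preferences the lazy path costs three simulated trips and the eager path costs one; the gap grows with every preference the application consults.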
Eager-loading isn't just for pulling back columns in a row, either. As with lazy-loading, you can apply eager-loading principles at scopes larger than just individual rows. For example, we can apply eager-loading principles across tables and dependent data, so that if a user requests a Person object, we load all of the Address, PerformanceReview, and other objects associated with this Person, even though it might mean multiple queries executed as a single-round-trip batch (see Item 48).
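One way to picture eager-loading at this larger scope is a loader that returns a fully populated Person in one step. The sketch below makes several assumptions: `FakeDatabase` is a hypothetical in-memory stand-in for three tables, `loadPersonGraph` represents several queries sent together as one batch, and addresses and reviews are simplified to plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Fully populated value object: reading it needs no further database access.
class Person {
    final String name;
    final List<String> addresses;
    final List<String> performanceReviews;

    Person(String name, List<String> addresses, List<String> performanceReviews) {
        this.name = name;
        // Defensive, immutable copies so the loaded graph behaves like a snapshot.
        this.addresses = List.copyOf(addresses);
        this.performanceReviews = List.copyOf(performanceReviews);
    }
}

// Hypothetical in-memory stand-in for a database with three tables.
class FakeDatabase {
    final Map<Integer, String> personNames = new HashMap<>();
    final Map<Integer, List<String>> addressesByPerson = new HashMap<>();
    final Map<Integer, List<String>> reviewsByPerson = new HashMap<>();
    int roundTrips = 0;

    // Simulates three queries sent together as one batch: a single round-trip.
    Person loadPersonGraph(int personId) {
        roundTrips++;
        return new Person(
                personNames.get(personId),
                addressesByPerson.getOrDefault(personId, new ArrayList<>()),
                reviewsByPerson.getOrDefault(personId, new ArrayList<>()));
    }
}

class PersonGraphDemo {
    public static void main(String[] args) {
        FakeDatabase db = new FakeDatabase();
        db.personNames.put(1, "Ann Smith");
        db.addressesByPerson.put(1, List.of("12 Main St", "PO Box 7"));
        db.reviewsByPerson.put(1, List.of("2010: exceeds expectations"));

        Person ann = db.loadPersonGraph(1);

        // Everything below is served from memory; the round-trip count stays at one.
        System.out.println(ann.name + " has " + ann.addresses.size() + " addresses and "
                + ann.performanceReviews.size() + " review(s)");
        System.out.println("round-trips: " + db.roundTrips);
    }
}
```

Copying the dependent lists into immutable collections is a design choice in this sketch, not a requirement: it keeps the loaded graph a detached snapshot, in line with the pass-by-value behavior described earlier.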
While we're at it, there's never any reason why a table that holds read-only values shouldn't be eager-loaded. For example, many systems put internationalized text (like days of the week, months of the year, and so on) into tables in the database for easy modification at the client site. Since the likelihood of somebody changing the names of the days of the week is pretty low, go ahead and read the entirety of this table once and hold the results in some kind of in-process collection class or RowSet for later consultation. Granted, it might make system startup take a bit longer, but end users will see faster access on each request-response trip. (In many respects, this is just a flavor of Item 4.)
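A read-only lookup table like this can be held in a small in-process cache. The following is a minimal sketch, not a definitive implementation: `DayNameCache` is a hypothetical class, and the `Supplier` passed to it stands in for whatever single query reads the table (in real code this would likely be a JDBC call or a RowSet, which is not shown here).

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Read-only lookup data, loaded once (for example at startup) and held in process.
class DayNameCache {
    private final Map<Integer, String> names;

    // The loader is assumed to perform the single query that reads the whole table.
    DayNameCache(Supplier<Map<Integer, String>> loader) {
        this.names = Collections.unmodifiableMap(new LinkedHashMap<>(loader.get()));
    }

    // Served from memory; never goes back to the database.
    String nameOf(int dayNumber) {
        return names.get(dayNumber);
    }
}

class DayNameCacheDemo {
    public static void main(String[] args) {
        int[] loads = {0}; // counts how often the simulated query runs

        DayNameCache cache = new DayNameCache(() -> {
            loads[0]++;
            Map<Integer, String> rows = new LinkedHashMap<>();
            rows.put(1, "Monday");
            rows.put(2, "Tuesday");
            rows.put(3, "Wednesday");
            return rows;
        });

        // Many lookups across many requests, still only one load.
        System.out.println(cache.nameOf(1));
        System.out.println(cache.nameOf(3));
        System.out.println("loads: " + loads[0]);
    }
}
```

Wrapping the data in an unmodifiable map makes the read-only assumption explicit in code: if the table ever does need to change at runtime, this cache would need a reload mechanism, which the sketch deliberately omits.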
In the end, eager-loading data is just as viable and useful an optimization as lazy-loading data, despite its ugly reputation. In fact, in many cases it's a far more acceptable tradeoff than lazy-loading, given the relatively low cost of additional memory compared with the expensive and slow network access we currently live with. As always, be sure to profile (see Item 10) before doing either lazy- or eager-loading optimizations, but if an eager-load can save you a couple of network accesses, it's generally worth the extra bandwidth on the first trip and the extra memory to hold the eager-loaded data.