March 18, 2011, 11:55 a.m.
posted by oxy
Item 46: Lazy-load infrequently used data
Given the cost of a trip across the network to the remote database, it follows that we don't want to actually pull any data across the network unless we need it.
This seems like a simple idea, but it's remarkable how often it gets left behind in systems that try to hide the database from the programmer—for example, naïve object-relational mapping mechanisms often map objects to tables in a one-to-one fashion, so that any request to retrieve a particular object from the database means every column for a particular row gets pulled across to completely fill out the object in question.
In many cases, this is obviously a bad situation: consider the canonical drill-down scenario, where I want to present a list of items to the user from which he or she selects one (or more) for a more complete examination. If a large number of items must be presented to the user in a summary view for selection, say, 10,000 persons that meet the initial search criteria, pulling each one back in its entirety means 10,000 x N bytes of data must be sent across the network, where N is the complete size of a single row from the table. If the row has foreign-key relations that must in turn be retrieved as part of this object retrieval (the Person has 0..* Address instances associated with it, perhaps), that number easily bloats to unmanageable proportions.
For these reasons, many object-relational mechanisms choose not to retrieve all of the data for a particular row as part of a standard fetch operation but instead choose to lazy-load the data, preferring to take the additional round-trip to the database in exchange for keeping this summary request down to manageable levels. In fact, this is such a common scenario that it has been documented in several places already, most notably as the Lazy Load pattern [Fowler, 200].
Take careful note: the key is to lazy-load infrequently used data, not just data that hasn't been used yet.
Here the entity beans portion of EJB tends to fall down in catastrophic fashion, similar to the classic N+1 query problem. Many EJB entity bean implementers decided that since we wanted to avoid pulling back any data that wasn't going to be used immediately, when retrieving an entity bean from a Home-based finder method, nothing was going to be retrieved from the actual data store (relational database), and instead only the primary key for the row would be stored in the entity bean inside the EJB container. Then, when a client accessed a property method (get or set) for that entity bean, a separate query would be issued to retrieve or update that particular column value for that particular bean.
The dangerous part, of course, occurred if you wrote client-side code that did something like this:

    PersonHome personHome = getHomeFromSomewhere();
    Person person = personHome.findByPrimaryKey(1234);
    String firstName = person.getFirstName();
    String lastName = person.getLastName();
    int age = person.getAge();

With such code, under the hood of the entity bean, you were issuing requests like this:

    SELECT primary_key FROM Person WHERE primary_key = 1234;
    SELECT first_name FROM Person WHERE primary_key = 1234;
    SELECT last_name FROM Person WHERE primary_key = 1234;
    SELECT age FROM Person WHERE primary_key = 1234;
In addition to this being a phenomenal waste of CPU cycles (parsing each query, projecting a query plan for each, marshaling the returned column in a relational tuple, and so on), as well as flooding the cache to the point of irrelevancy, you're also taking a huge risk that the data doesn't change between bean method calls—remember, each of those calls represents a separate EJB-driven transaction, thus leaving open the possibility that a different client could change the same row between calls (see Item 31). For these reasons and more, two patterns were born: the Session Façade [Alur/Crupi/Malks, 341] and the Data Transfer Object [Fowler, 401].
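In its simplest form, the Data Transfer Object is just a serializable bundle of the fields the client actually needs, filled in server-side by a single query inside a single transaction. A minimal sketch (class and field names are illustrative, not from any particular framework):

```java
import java.io.Serializable;

// Minimal Data Transfer Object sketch: all three fields cross the wire
// in one remote call, populated by one SELECT inside one transaction,
// instead of one round-trip (and one implicit transaction) per getter.
class PersonDTO implements Serializable {
    private final String firstName;
    private final String lastName;
    private final int age;

    PersonDTO(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    String getFirstName() { return firstName; }
    String getLastName()  { return lastName; }
    int getAge()          { return age; }
}
```

A session bean (the Session Façade) would construct this object on the server from one SELECT of exactly the columns needed and hand it back in a single call, sidestepping both the per-property queries and the per-call transaction risk.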
The problem, however, isn't necessarily with the idea of lazy-loading; the problem is knowing which data to lazy-load. In the case of a container-managed entity bean implementation, where all details of SQL are hidden from the EJB developer, the EJB container has no way to know which data elements are more likely to be used than others, so it makes the drastic assumption that none of them will be needed right away. In retrospect, this is probably a bad decision, only slightly better than the other alternative left to the container implementor: assuming that all of the columns will be used. The ideal situation would be for the container to allow the entity bean implementor to provide some kind of usage hints about which fields should be pulled back as part of the initial bean state and which ones would be best retrieved lazily, but such things are vendor-dependent and reduce your portability significantly (if you care; see Item 11).
If you're writing your own relational access code, such as writing a BMP entity bean or just writing plain-vanilla JDBC or SQL/J code, the decision of what to pull back falls into your lap once again. Here, as with any SQL access, you should be explicit about precisely what data you want: nothing more, nothing less. This means avoiding SQL queries that look like this one:
    SELECT * FROM Person WHERE ...
Although it may not seem all that important at the time you write this query, you're pulling back every column in the Person table every time it executes. If, at first, the definition of the Person table happens to match the exact list of data you want, the wildcard * seems easier and simpler than spelling out the columns, but remember: you don't own the database (see Item 45). Columns can be added, renamed, or reordered out from under you, and any of those changes will create problems in code that relies on the wildcard.
Despite the additional typing required, it's almost always preferable to list the columns explicitly. Fortunately, it's also something that's fairly easy to generate via a code-generation tool, should the opportunity arise.
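As a sketch of how little machinery such generation needs (the helper name is hypothetical), an explicit projection can be assembled from the same table metadata a code generator would already have on hand:

```java
// Hypothetical helper: build an explicit projection instead of SELECT *.
// A code-generation tool would emit calls like this from table metadata.
class QueryBuilder {
    static String selectFor(String table, String... columns) {
        return "SELECT " + String.join(", ", columns) + " FROM " + table;
    }
}

// QueryBuilder.selectFor("Person", "first_name", "last_name", "age")
//   -> "SELECT first_name, last_name, age FROM Person"
```

When the table definition changes, only the generated column list needs regenerating; code that asked for first_name, last_name, and age keeps getting exactly those columns.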
Lazy-loading isn't just restricted to loading columns from the query, however. There are other opportunities to apply the same principle to larger scales. For example, go back to the canonical drill-down scenario. In most thin-client applications, when retrieving search results, we often display only the first 10, 20, maybe 50 results to the user, depending on the size of the results window. We don't want to display to the user the complete set of results—which sometimes stretches into the thousands of items when the user provides particularly vague criteria. Reasons for this are varied but boil down to the idea that the user will typically either find the desired result in the first 10 or 20 items or go back and try the search again, this time with more stringent criteria.
So what happened to the other 990 rows you retrieved as part of that request? Wasted space, wasted CPU cycles, wasted bandwidth.
There are a couple of ways to control how much data gets pulled back as part of the SQL query. One is at the JDBC level: set the fetch size to match the display window you wish to present to the user. If you're showing only the first 10 items retrieved, call setFetchSize(10) on the Statement (or on the ResultSet before retrieving the first row via next). Strictly speaking, the fetch size is only a hint to the driver, but a cooperative driver will bring back that first screenful of rows in a single round-trip rather than dribbling them across one at a time. Alternatively, you can let the driver keep what it believes to be the optimal fetch size for that driver/database combination, and instead limit the absolute number of rows returned from the statement by calling setMaxRows; any rows beyond the number passed to setMaxRows will be silently dropped and never sent.
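Putting both knobs together, a summary query for the drill-down list might look like the following sketch (method and table names are assumptions, and real code would of course need connection handling around it):

```java
import java.sql.*;
import java.util.*;

class SummaryQuery {
    // Format one summary line; kept as a separate pure method.
    static String summaryLine(String firstName, String lastName) {
        return lastName + ", " + firstName;
    }

    // Retrieve only the first pageSize rows for the summary view:
    // setMaxRows is a hard cap (rows beyond it are never sent), while
    // setFetchSize hints that the driver should bring them back in a
    // single round-trip.
    static List<String> firstPage(Connection conn, int pageSize)
            throws SQLException {
        List<String> lines = new ArrayList<>();
        try (Statement stmt = conn.createStatement()) {
            stmt.setMaxRows(pageSize);
            stmt.setFetchSize(pageSize);
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT first_name, last_name FROM Person")) {
                while (rs.next()) {
                    lines.add(summaryLine(rs.getString("first_name"),
                                          rs.getString("last_name")));
                }
            }
        }
        return lines;
    }
}
```

Note that only first_name and last_name are projected: the summary view combines the explicit-column-list advice with the row cap, so both dimensions of the result set stay as small as the display window.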
Another approach, supported by some databases, is to put the row limit in the request itself, as in SELECT TOP 5 first_name, last_name FROM Person WHERE .... Only the first five rows that meet the predicate will be returned from the database. Although a nonstandard extension, it's still useful and is supported by a number of the database vendors. TOP is SQL Server syntax; other databases use constructs like FIRST (Informix), LIMIT (MySQL), or a ROWNUM predicate in the WHERE clause (Oracle). All work similarly.
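Because the syntax varies by vendor, the dialect difference is easy to isolate behind a tiny helper. A sketch, with made-up dialect keys and method name (Oracle's ROWNUM-based limiting lives in the WHERE clause, so it doesn't fit this simple template and is omitted):

```java
// Hypothetical helper: wrap a projection-plus-FROM clause in the
// vendor-specific row-limiting syntax discussed above.
class RowLimiter {
    static String limitQuery(String dialect, String body, int n) {
        switch (dialect) {
            case "sqlserver": return "SELECT TOP " + n + " " + body;   // SQL Server
            case "informix":  return "SELECT FIRST " + n + " " + body; // Informix
            case "mysql":     return "SELECT " + body + " LIMIT " + n; // MySQL
            default: throw new IllegalArgumentException("unknown dialect: " + dialect);
        }
    }
}
```

Centralizing the limit syntax this way keeps the nonstandard extension in one place, so the rest of the data-access code stays portable across vendors.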
Again, the idea is simple: pull back only the data required at this time, on the grounds that you want to avoid pulling back excessive amounts of data that won't be used. Be careful, however; lazy-loading is a slippery slope, and if you're not aware of it, it's easy to find yourself excessively lazy-loading individual data elements across the wire, resulting in what Fowler calls "ripple loading," more commonly known as the N+1 query problem [Fowler, 270]. In those situations, it's sometimes better to eager-load the data (see Item 47) in order to avoid the network traffic.