Centralizing Your Data

Centralizing Your Data

For all the talk about grids, clustered servers, and the like, spreading data across many servers means adding a considerable amount of complexity to a system. The more complicated a structureany type of structurethe less robust it is. Technological advance does indeed slowly push up the threshold of acceptability. In the eighteenth century, clocks indicating the minutes were considered much less reliable than those indicating only the hour, and much more reliable than those showing the day in the month or the phases of the moon. But nevertheless, try to keep the theater of operations limited to that which is strictly required.

Transparent references to remote data are performance killers, for two reasons. First, however "transparent" it may look, crossing more software layers and a network has a heavy cost. To convince yourself, just run a procedure that inserts a few thousands rows into a local table, and another one doing the very same thing acrossfor instance, an Oracle database link, even on the same databaseyou can expect performance to be in the neighborhood of five times slower, if not worse, as you see demonstrated in Chapter 8.

Second, combining data from several sources is extremely difficult. When comparing data from source A to data from source B, you have no choice other than literally copying the data from A to B or the reverse. Transfer is one significant overhead. Data drawn from its own carefully constructed environment no longer benefits from the planning which went into establishing that environment (carefully thought-out physical layout, indexes, and so forth). Instead, that data lands in some temporary storagein memory if the amount of data transferred is modest, otherwise on disk. The management of temporary storage is another major overhead. In a case where nested loops would be, in theory, the most efficient way to proceed when querying local data, an optimizer is left with two unattractive possibilities when some of the data is remotely located:

  • Using nested loops and incurring high overhead with each iteration

  • Sucking the remote data in, and then operating against the local copy, which has left all indexes behind

Optimizers can be forgiven for not performing at their best under these circumstances.

When it comes to the placement of major data repositories, some of the art is simply keeping a balance. If your company operates worldwide, keeping all the data at one location is unlikely to be a popular solution with people who live and work at the antipodes. Hitting a remote server is certainly no problem when surfing the Internetit is quite another matter when using an application intensely. It's not a question of bandwidth, it's a question of light speed, for which, unfortunately, not much improvement can be expected from technological progress. Whatever you do, issuing a query against a server located on another continent adds another quarter or half second to response times, depending on the continentand this at the best of times. If you need everyone to have the global picture, replication solutions and products (as opposed to remote access) should be contemplated. For each group of players, keep their own chessboard right at handdon't make players reach.

The nearer you are to your data, the faster you can get at it!

 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows