System Management

Most of this chapter focuses on the back-room requirements for managing your DW/BI system in production. Although we’ve separated these operational issues into this chapter, you need to think ahead during design and development to ensure you build a maintainable system.

There are several components of the back-room system management:

  • Executing and monitoring the ETL system

  • Monitoring resources and usage

  • Managing data growth and disk space

  • Performance tuning

  • Managing partitioning

  • Backup and recovery

  • Generating statistics for the BI portal

The more automated you can make your systems management, the better. At the very least, automate backups and launching the ETL packages. SQL Server provides enough tools that the basics are easy, and there’s no excuse for not implementing some system automation.

Unlike many issues where we’ve talked about how small teams might cut corners, organizations of any size benefit from system automation. Indeed, the smallest organizations are perhaps least equipped to apply human resources to a problem that can be automated. It’s hard to imagine how a DW/BI team of one to three people could possibly operate without significant automation.

The ideal management system requires no human intervention except for the occasional troubleshooting. Such a system automatically adds and drops partitions, checks for disk space, reports on performance problems or unusual usage, and corrects the vast majority of data oddities during the ETL process.

No matter how automated your operations are, you must have a plan. Like all plans, your Operations Plan should be written down.

Executing the ETL Packages

During development, you design and execute Integration Services packages within BI Studio. In the development environment, you can set breakpoints, examine variables’ values and status, watch the movement of data through a Data Flow task, and explore the data with data viewers. All of these tools are valuable during development, but are of no interest in production. You want the ETL system to execute, upon a schedule, with no human intervention.

Integration Services packages are easy to schedule. SQL Server ships the tools you need: dtexec and SQL Agent. You should already be familiar with dtexec and its friend dtexecui, from the process of testing the Integration Services packages. Dtexec and dtexecui execute a package from the command line. They are basically the same, except that dtexecui brings up a user interface that helps you construct the command by choosing which package to run and picking various options like logging levels and connection information.
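To make this concrete, here is a sketch of what the dtexec command line looks like. The package paths and server name are hypothetical; substitute your own, and check the switch list in Books Online for your build.

```cmd
rem Run a package stored in the SSIS package store on a named server,
rem with verbose console reporting (package path and server are examples):
dtexec /DTS "\MSDB\MasterETL" /SERVER "DWPROD01" /REPORTING V

rem Run a package from the file system, pointing it at a different
rem configuration file (both paths are examples):
dtexec /FILE "C:\ETL\MasterETL.dtsx" /CONFIGFILE "C:\ETL\Prod.dtsConfig"
```

The easiest way to build such a command is to set the options in dtexecui and copy the command line it generates.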


Why is this utility called dtexec? It’s an artifact of Integration Services’ ancestry in SQL Server 2000 DTS. The feature name was changed to Integration Services relatively late in the game, and Microsoft decided not to mess with all the back-end names. The utility will be called dtexec long after DTS fades from our collective memory.

Once you’ve set up the package execution options, you can create a SQL Agent job to run the package, and then schedule that job. SQL Agent is a standard feature of the SQL Server database.

To schedule an Integration Services job, launch Management Studio and open Object Explorer for the server that you’ll use to run SQL Agent. This is often, but not necessarily, the database server on which the Integration Services packages will run. Navigate to the SQL Server Agent folder, and right-click on Jobs to create a new job.

Name the job, and categorize it if you wish. Categorization is extremely useful if you have a lot of jobs. Next, go to the Steps page of the New Job dialog box. When you add a new step, the New Job Step dialog box appears, as illustrated in the figure.

Figure: SQL Agent New Job Step

Name the step, and set the Type to SQL Server Integration Services package execution. The package execution job step includes a user interface for setting up your Integration Services package. This interface is effectively the same as the one in dtexecui. First identify the package you plan to execute. Packages can be stored remotely but they always execute on the machine that’s running SQL Agent. That point is worth repeating: Packages execute within the process of the calling executable, SQL Agent in this case.

Looking back at the figure, you can see a variety of job configuration options. Most people will never need most of these options; the ones you’re most likely to use are:

  • Data sources: Change the database connection information, for example to point to the production servers. Note: We generally prefer to change database connection information in a configuration file, as discussed in the next bullet point.

  • Configurations: Change the location of a configuration file containing variable values and other object properties. The location of configuration files might be different on your development, test, and production servers (although it’s certainly best practice to set up test and production the same way). Of more interest, you can overlay a configuration file to change one or two items, like global variables that contain the date range for which to run the package.


The dtexec utility is well documented in the Books Online topics “dtexec Utility” and “How to: Run a Package Using the DTExec Utility.”

Once you’ve set up your SQL Agent job step, schedule it to run on the appropriate schedule. In most cases, you’ll define one or a small number of SQL Agent jobs for your ETL system. Each job calls a master Integration Services package, which in turn calls subpackages in the correct order. Build logic flow and dependencies into the Integration Services master package, and use SQL Agent only to kick off the main job.
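If you prefer scripts to the GUI, the same job can be created with the SQL Agent stored procedures in msdb. This is a minimal sketch: the job name, package path, server name, and schedule are all hypothetical placeholders.

```sql
-- A sketch of creating the nightly ETL job in T-SQL rather than through
-- Management Studio. All names here are examples, not requirements.
USE msdb;

EXEC sp_add_job @job_name = N'Nightly DW Load';

-- The SSIS subsystem takes dtexec-style arguments as its command
EXEC sp_add_jobstep
    @job_name  = N'Nightly DW Load',
    @step_name = N'Run master package',
    @subsystem = N'SSIS',
    @command   = N'/DTS "\MSDB\MasterETL" /SERVER "DWPROD01"';

-- Run daily at 1:00 a.m. (freq_type 4 = daily; time is HHMMSS)
EXEC sp_add_jobschedule
    @job_name          = N'Nightly DW Load',
    @name              = N'Nightly at 1am',
    @freq_type         = 4,
    @freq_interval     = 1,
    @active_start_time = 010000;

-- Target the local server
EXEC sp_add_jobserver @job_name = N'Nightly DW Load';
```

Scripting the job this way makes it easy to recreate the production schedule on a rebuilt or standby server.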

Some organizations have standardized on enterprise-wide job management software like Autosys. SQL Agent is nothing special; if your organization uses such software, by all means conform to the corporate standard. The job management software simply calls dtexec on the computer running Integration Services.


Job management software is designed to handle complex dependencies very well. It can wait until files have arrived in a directory, a table has been loaded, or a flag set, before firing off additional steps. Integration Services can do all these things, too, so we don’t expect anyone will buy external job management software to support the ETL system. However, if you have job management software in-house, you should evaluate whether you want to design your Integration Services packages so they leverage this functionality from the external tool.

Monitoring the Business Intelligence System

You need to implement and understand two types of monitoring. First, what’s going on in your system? What system resources like memory are being used, when, and by what processes? If you can’t monitor system events and resource usage, you’ll never be able to troubleshoot and tune your system. The other kind of monitoring focuses on the users: Who’s accessing the system, and when? What queries are they running, and how much time are those queries taking? If you don’t know what your users are doing, you’ll never be able to satisfy them.

Resource Monitoring

Microsoft does a good job of providing tools for you to monitor resource usage for the relational data warehouse, ETL system, Analysis Services database, and BI applications. Consistent and integrated monitoring tools are a significant advantage of the SQL Server toolset. It’s not a feature that’s sexy or exciting, but it affects your ability to manage your system.

The two most basic tools are familiar to your system administrators: System Monitor and Event Viewer. These Windows monitoring tools are part of the operating system, and the SQL Server components write counters and events that you can view in a standard way. System Monitor counters and events can be written to files or to a SQL Server database. System Monitor is sometimes referred to as Performance Monitor or by the shortened name PerfMon. The Event Viewer is used primarily for logging errors, and is sometimes referred to as the error log.

SQL Server ships with some utilities and functionality that are key to monitoring your DW/BI system. SQL Profiler is a stand-alone utility that you launch from the Start menu. Profiler tracks engine process events, like when a transaction was rolled back or a user logged in. SQL Profiler 2005 is similar to the Profiler familiar from SQL 2000, with several significant improvements. First, you can now profile Analysis Services databases and display events raised by Integration Services. The relational engine and Analysis Services include a flight recorder feature that captures the last sequence of trace events, which can later be played back using Profiler. The flight recorder is extremely valuable for diagnosing a system crash or other unwelcome event.

The Activity Monitor window in Management Studio shows the current activity on the relational database: who’s logged in, what processes are running, and what locks are active. Old hands may prefer the classic system stored procedures like sp_who. Activity Monitor and Profiler present similar information for the relational database. Use Activity Monitor to see what’s happening right now. Set up Profiler to capture similar information into a permanent file or database.

The Analysis Services Activity Viewer is a sample utility that ships with SQL Server 2005. The Activity Viewer provides functionality for the Analysis Services database that’s analogous to the Activity Monitor window in Management Studio. It’s available as a sample utility but isn’t integrated into Management Studio.


See the topic “Activity Viewer” in Books Online.

Integration Services and Reporting Services have various mechanisms for logging system information. These features are discussed in the following sections.

System Monitor, SQL Profiler, Integration Services, and Reporting Services can all log their information to SQL Server. You should plan to create databases to hold these log tables.

Baseline Performance

Part of your Operations Plan should be to establish a baseline performance profile for your system. It’s orders of magnitude easier to fix a performance problem in the future if you’ve kept track of how the system performed over time.

The most important measures to track in your performance baseline are pretty obvious: the usage of memory, processors, and disk. Plan to monitor your production servers all the time, perhaps at 15-minute intervals. Store the monitored information, preferably in a SQL Server database. For all the servers in your system, set up System Monitor to track at least the following counters.

For disk usage:

  • PhysicalDisk: Avg. Disk sec/Read and Avg. Disk sec/Write.

  • PhysicalDisk: Avg. Disk Queue Length. If this counter is consistently too high, it means I/O requests are arriving faster than the physical disk can service them. This is the first place to look if you think your system is I/O bound.


What value for Avg. Disk Queue Length is too high? It depends on the configuration of the RAID set for the disk. “Too high” is 2 to 3 times the number of spindles in the RAID set. For a drive that consists of a single disk, then, “too high” is 2 or 3.

For CPU usage:

  • Processor: % Processor Time

  • System: %Total Processor Time

For memory usage:

  • Memory: Available Bytes

  • Memory: Pages/sec

  • Memory: Page Faults/sec
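One way to automate this ongoing collection is the Windows logman utility, which creates and starts a counter log from the command line. In this sketch, the log name, output path, and 15-minute sample interval are our choices, not requirements:

```cmd
rem Create a counter log sampling the baseline counters every 15 minutes.
rem The log name and output path are examples; adjust them to your standards.
logman create counter DWBaseline -si 00:15:00 -o "C:\PerfLogs\DWBaseline" ^
  -c "\PhysicalDisk(*)\Avg. Disk sec/Read" "\PhysicalDisk(*)\Avg. Disk sec/Write" ^
     "\PhysicalDisk(*)\Avg. Disk Queue Length" "\Processor(_Total)\% Processor Time" ^
     "\Memory\Available Bytes" "\Memory\Pages/sec" "\Memory\Page Faults/sec"

rem Begin collecting
logman start DWBaseline
```

Scripting the collection makes it trivial to set up identical baselines on every server in the system.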

In addition to the ongoing monitoring, it’s really useful to create a fine-grained baseline on a monthly basis. This fine-grained baseline will differ from the ongoing monitoring by sampling performance more frequently, say at 5-second intervals. You may also collect additional counters.

For the monthly detailed logging of the ETL process, ramp up the logging frequency just before kicking off ETL processing, and return to normal levels as soon as the ETL job completes. For the monthly detailed logging of usage, ramp up monitoring for a few hours during a reasonably busy time of day.

Verbose logging can degrade system performance, so you may want to monitor from a remote server. This is especially true if you’re writing the logs directly to a SQL Server database rather than to files. For performance reasons, consider writing verbose logs to a file. As the DW/BI team, you should have no difficulty importing the log file into the SQL Server logging database.
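For example, once a counter log has been converted to comma-delimited format, a simple BULK INSERT brings it into the logging database. The database, table, and file names here are hypothetical, and the target table must already match the file’s column layout:

```sql
-- A sketch of importing a counter log (in CSV format) into the logging
-- database. DWLogging, PerfMonBaseline, and the file path are examples.
USE DWLogging;

BULK INSERT dbo.PerfMonBaseline
FROM 'C:\PerfLogs\DWBaseline.csv'
WITH (FIELDTERMINATOR = ',',
      ROWTERMINATOR   = '\n',
      FIRSTROW        = 2);   -- skip the header row
```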

Monitoring RDBMS Resources

There are books, classes, and certifications offered on how to monitor a relational database. Much of this information is oriented to transactional databases rather than data warehouse databases, but the concepts are standard. The biggest issue we’ve seen is educating an experienced transactional DBA on the requirements of a relational data warehouse. Such DBAs inevitably want to improve performance by adding hints to queries, which is impractical for ad hoc use of the relational data warehouse. They like to convert all known queries to stored procedures, which doesn’t hurt but doesn’t provide much benefit either, as the time to compile the query plan usually pales in comparison to executing the analytic query. They agonize over any denormalizations you’ve made. And they tend to underestimate the complexity of indexing a dimensional schema, and the cost of maintaining each new index at load time.

It really helps to have the operational DBA involved with the project during the development phase, so they will understand how the system is intended to be used and what the tradeoffs are. If that’s not possible, then hand them the system documentation we’ve been begging you to write. In a large organization, consider hiring and training operational DBAs who focus exclusively on the business intelligence system.

The tools for monitoring and managing the SQL Server relational database are more mature than the other BI components. In addition to the generic System Monitor counters listed previously, the relational database server should also track some counters published by the database engine:

  • SQL Server: Buffer Manager: Page Reads/sec

  • SQL Server: Buffer Manager: Page Writes/sec

  • Process: Working Set

  • SQL Server: Buffer Manager: Buffer Cache Hit Ratio

  • SQL Server: Buffer Manager: Total Pages

  • SQL Server: Memory Manager: Total Server Memory (KB)

Many DBAs use SQL Server Profiler only in an ad hoc way, to investigate problems. We recommend that you set up a basic trace that runs whenever the database is running. This basic trace should be as lean as possible, so justify every event class that you’re logging. Consider tracing the following events:

  • Security Audit: Audit Login and Audit Logout, and Sessions: ExistingConnection. These event classes are most important for usage monitoring, although user counts affect performance. In some regulated environments, you will be required to capture login/logout information for all users of the system. You may also be required to collect the text of all queries.

  • Database: Log and Data File Autogrow and Autoshrink (four event classes). These (let’s hope infrequent) events can significantly impact system performance.

  • Transactions: SQLTransaction. In a normal data warehouse database populated by batch processes, there are relatively few (large) transactions. Monitoring transactions is reasonable and not terribly expensive. The closer you move your processing to real time, the more transactions you’ll have. The cost of tracing all those transactions may well outweigh the benefits.

  • SQL: BatchCompleted. Track users’ queries and any other SQL being executed. The TextData column shows the SQL being executed. This event class is likely to generate significant data volumes, especially if your trace includes the TextData. You may also want to track SQL: BatchStarting.


The Books Online topics that start with “Monitoring Performance and Tuning Data Stores” contain good information about how to troubleshoot relational database performance.
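The always-on basic trace described above can also be scripted as a server-side trace with the sp_trace procedures, which avoids the overhead of the Profiler GUI. This is a minimal sketch: the file path is an example, and the numeric event and column IDs (event 12 is SQL:BatchCompleted, 14 and 15 are Audit Login and Audit Logout; columns 11, 12, 13, and 14 are LoginName, SPID, Duration, and StartTime) should be verified against the sp_trace_setevent documentation.

```sql
-- A lean server-side version of the basic trace. Keeps TextData out of
-- the trace to minimize volume; add column 1 to event 12 if you must
-- capture query text. The trace file path is an example.
DECLARE @traceid int, @maxsize bigint, @on bit;
SET @maxsize = 100;   -- MB per rollover file
SET @on = 1;

-- Option 2 = file rollover when @maxsize is reached
EXEC sp_trace_create @traceid OUTPUT, 2, N'C:\Traces\BasicTrace', @maxsize;

-- Audit Login (14) and Audit Logout (15): LoginName, SPID, StartTime
EXEC sp_trace_setevent @traceid, 14, 11, @on;
EXEC sp_trace_setevent @traceid, 14, 12, @on;
EXEC sp_trace_setevent @traceid, 14, 14, @on;
EXEC sp_trace_setevent @traceid, 15, 11, @on;
EXEC sp_trace_setevent @traceid, 15, 12, @on;
EXEC sp_trace_setevent @traceid, 15, 14, @on;

-- SQL:BatchCompleted (12): LoginName, Duration, StartTime
EXEC sp_trace_setevent @traceid, 12, 11, @on;
EXEC sp_trace_setevent @traceid, 12, 13, @on;
EXEC sp_trace_setevent @traceid, 12, 14, @on;

EXEC sp_trace_setstatus @traceid, 1;   -- start the trace
```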

Monitoring Integration Services

The primary goal for monitoring Integration Services package execution on the production system is to help you evaluate whether and how to improve processing performance. You also want to be able to tie the Audit dimension, discussed in Chapter 6, to information about the package’s execution.

The most important tools for monitoring Integration Services are System Monitor (PerfMon) and Integration Services Logging. In addition to the generic System Monitor counters that we listed previously, Integration Services publishes some counters that you can see in PerfMon. These are not as useful as you’d hope because they track information at a high level. What you really want to see is how much memory each step of a data flow is using; instead, you can see how much memory Integration Services is using.

Nonetheless, the following SQLServer:SSIS Pipeline counters are somewhat useful:

  • Buffer Memory: How much memory is Integration Services using? If this number is larger than the physical memory available to Integration Services, some data is being spooled to disk during processing.

  • Rows Read: The total number of rows read from source adapters.

  • Rows Written: The total number of rows written to destination adapters.

The logging that’s generated by Integration Services packages is akin to the SQL Server tracing and profiling functionality that we discussed previously. Like Profiler, it tracks events. The kind of events that can be tracked and logged will be familiar to anyone who’s run an Integration Services package: It’s exactly the same kind of information that you can see in the Execution Results tab every time you execute a package in the BI Studio.

You seldom want to store all those package execution results permanently. Define package logging so that you store information about interesting events only. Most often, you’d set up the packages on your production system so they log to a SQL Server database, though you have several choices.

For every task, and for the overall package, you can track many events, of which the most useful are:

  • OnPreExecute: Logs a row when execution begins

  • OnPostExecute: Logs a row when execution ends

  • OnWarning: Logs a row when the task issues a warning

  • OnError: Logs a row when the task issues an error

Unfortunately, the Data Flow task is like a black box to the logging system. There are multiple steps and transformations within a Data Flow task, and ideally you’d like to know how long each step takes. This is simply impossible with the logging framework as it exists. Although it’s a hard problem, we hope to see Data Flow task logging in a future release.


Check the Microsoft web site: the Project REAL team worries about operational issues like this one, and may post scripts, workarounds, and other useful information on this and other problem areas.

If Analysis Services processing is part of your ETL, make sure you gather statistics on it independently and as a component of the overall flow. It’s important to establish a baseline for Analysis Services processing as a separate entity because its tuning activities are so different from the other ETL steps. We discuss the Analysis Services counters in the following section, “Monitoring Analysis Services.”

Set up logging by editing the package in BI Studio. Choose SSIS Logging, and specify where the logs will be stored. Note that SQL Profiler is one of the options. Logging to SQL Profiler makes it easier to interleave package events with database events, and is especially useful during testing. However, most people log to a SQL Server table in production. It’s best to create a separate database to hold logging information. Integration Services will automatically create its logging table, compellingly named sysdtslog90.

You could also store the logs in the file system, or write to a custom log provider. The latter option is interesting if you’ve purchased a third-party system management product that works with Integration Services.

Specify the events that you want to log on the Details tab of the Configure SSIS Logs Wizard, as illustrated in the figure.

Figure: Configure Integration Services logging

On a production system, you’ll seldom need to capture more events than those listed previously. If a problem arises, for example if the packages start throwing warnings or errors, use your test system to investigate. Most performance testing and rework should also occur on test or development systems. That’s just as well, because there’s no way to change the logging configuration at run time; your only option is to edit the package.

Once you’ve executed a package, and logged events to SQL Server, you can report on those logs. A set of basic sample reports is included with SQL Server. These reports are implemented in Reporting Services. You can find them in the Samples folder.
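As a sketch of the kind of report you can write yourself, the following query pairs the OnPreExecute and OnPostExecute rows in sysdtslog90 to compute each task’s duration. The executionid column identifies a single package execution; the logging database name is up to you.

```sql
-- Task durations from SSIS logging, assuming logging to SQL Server with
-- OnPreExecute/OnPostExecute enabled. Matches the pre and post rows for
-- each task (sourceid) within each execution (executionid).
SELECT  pre.source  AS task_name,
        pre.starttime,
        post.endtime,
        DATEDIFF(ss, pre.starttime, post.endtime) AS duration_sec
FROM    dbo.sysdtslog90 pre
JOIN    dbo.sysdtslog90 post
          ON  post.executionid = pre.executionid
          AND post.sourceid    = pre.sourceid
          AND post.event       = 'OnPostExecute'
WHERE   pre.event = 'OnPreExecute'
ORDER BY pre.starttime;
```

Trending these durations over time is the easiest way to spot a package whose performance is quietly degrading as data volumes grow.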

As we described in Chapter 6, you should build your packages to collect information about package execution times, rows inserted into target tables, and possibly other information like the sum of amounts. This metadata is stored in the Audit dimension and, at your discretion, is available to business users for querying. You may choose to pick up additional information from logging, to include with the Audit system.

Finally, you may want to set up package logging to log errors to the application event log, accessible from the Event Viewer. This application event may still be written even if Something Really Bad happens on the database server, which would prevent the standard logging from working properly.

Monitoring Analysis Services

You use the same tools, for the same reasons, to monitor the Analysis Services database as you do for the relational database. The main tools are System Monitor (PerfMon) and SQL Profiler.

In addition to the basic system information we’ve already discussed, add some Analysis Services counters to your System Monitor configuration. Over 100 counters are available. For the standard monitoring in production, don’t over-monitor. We’ve found the following counters to be useful.

To monitor OLAP query performance, use the MSAS 2005: Storage Engine Query counters:

  • Avg time/query

  • Total queries requested

  • Total queries answered

  • Total bytes sent

To monitor connection requests, use the MSAS 2005: Connection counters:

  • Total requests or Requests/sec

  • Total successes or Successes/sec

If your DW/BI system includes data mining models, you should monitor the performance of both processing and ongoing queries, using the following MSAS 2005: Data Mining Prediction counters:

  • Total Predictions or Predictions/sec

  • Total Queries or Queries/sec

Analysis Services, like all the BI components of SQL Server, loves memory. Monitor how much Analysis Services is using, and for what purposes, by using the MSAS 2005: Memory counters:

  • AggCacheKB

  • Memory Usage KB

To evaluate whether you are running short on memory during processing aggregations, use the MSAS 2005: Proc Aggregations counter:

  • Temp file bytes written/sec

It’s seldom necessary to monitor processing performance as part of the standard monitoring set of counters. SQL Profiler can pick up the important events, like processing starts and stops. Monitor the processing counters (Data Mining Model Processing, Processing, and Proc Aggregations) only during test, or if you need to diagnose a performance or other processing problem.

It’s as easy to use SQL Profiler to set up a trace on Analysis Services as it is on the relational database. Create a new trace to Analysis Services, and choose the desired event classes. As with the relational database engine, you can run multiple traces at the same time. We recommend that you set up a basic trace that runs whenever the database is running. This basic trace should be as lean as possible, so justify every event class that you’re logging. Consider tracing the following events:

  • Security Audit: Audit Login and Audit Logout. These event classes are most important for usage monitoring, although user counts affect performance. In some regulated environments, you will be required to capture login/logout information for all users of the system. You may also be required to collect the text of all queries.

  • Security Audit: Audit Backup/Restore Event and Audit Server Starts and Stops. These two classes should generate few events. The rationale for logging these events should be pretty obvious.

  • Errors and Warnings: Error. It’s always a good idea to log errors.

  • Command Events: Command End. The types of commands collected by this event class are specified by the event subclass column, and include CREATE, ALTER, PROCESS, and so on. The TextData column contains the text of the command. The Command End event includes the command start and stop times.

  • Query Events: Query End. Queries can be MDX to a cube, MDX to a data mining model, or SQL, as specified by the event subclass column. The TextData column contains the text of the query. The Query End event includes the query start and stop times.

  • Progress Reports: Progress Report Begin, End, and Error. Progress reports occur during dimension and partition processing. You may also want to track Progress Report Current, although this will generate somewhat more information.

Monitoring Reporting Services

The story for monitoring Reporting Services begins in a familiar way, with System Monitor. We’ll outline the counters to include in a stripped-down standard monitor. Events are captured for you automatically by Reporting Services, rather than by setting up a trace in Profiler.

As with the other BI components, we recommend that you run System Monitor (PerfMon) all the time with a few counters, storing the results in a SQL Server table. The most important counters to run all the time are listed following.

RS Web Service counters track activity through the web service:

  • Active Sessions counts all browser sessions, whether they are still active or not.

  • Total Requests or Requests/sec counts all requests made to the report server after the service started.

  • Total Reports Executed or Reports Executed/sec counts the number of reports that were executed.

RS Windows Service counters track activity through scheduled activities like snapshots and report subscription and delivery:

  • Active Sessions counts all browser sessions generated from report subscriptions, whether they are still active or not.

  • Report Requests or Requests/sec counts all requests made to the report server after the service started.

  • Total Reports Executed or Reports Executed/sec counts the number of reports that were executed.

If you have implemented Reporting Services on a web farm, these counters track only one server, not the aggregate across the farm. You can easily monitor the multiple servers on a single instance of System Monitor, but you’d need to aggregate the web farm servers’ counters yourself. You could get fancy and build a data mart for your Reporting Services logging information, or just use a spreadsheet.

The Reporting Services Execution Log is more interesting than the System Monitor counters. These logs are captured in the operational database that Reporting Services uses, called the ReportServer database by default. However, you shouldn’t report directly from the ReportServer execution logs. These tables are part of the Reporting Services operational system, and reporting directly from them is analogous to reporting directly from any other transaction system.

The Report Server catalog is just like any other transaction system. It will lose history when its transaction logs are truncated. It will lose referential integrity when reports are deleted that have been run in the past, or when users are removed who have used the system in the past. Definitions of the tables and columns may change over time, breaking the existing library of usage reports. It is not easy to query directly because it uses codes instead of descriptions for most of the attributes. Most important, even a simple interactive query could place read locks on the running system, dramatically reducing system performance. For all these reasons, it makes sense to build a simple database that tracks the process of generating reports.

SQL Server has made this easy by providing an Integration Services package that will manage this reporting log database for you. Most people will schedule this package to run nightly, although there’s no harm in doing it more or less often. Just make sure that you run it more often than Reporting Services clears execution records. Otherwise execution log entries will be deleted before they’re gathered up by the package.


You can read about how to set up and execute that package in the Books Online topic “Querying and Reporting on Report Execution Log Data.”

SQL Server also includes some sample reports on the reporting execution log data. The sample reports are oriented to the DW/BI team. Supplement these reports with reports designed for your users and operations environment. As we show in Chapter 16, one of these user-oriented reports should include a time series that shows how overall DW/BI system usage is, one hopes, increasing. Another popular report lists a count of reports run by user or department over the last 30 days. This report has been known to stimulate interest in the DW/BI system by the VPs and directors of those departments.
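The “reports run by user over the last 30 days” report might be sketched like this. The table and column names (ExecutionLogs, Users, UserName, TimeStart) reflect the schema created by the sample execution-log package; verify them against your own database before relying on this query.

```sql
-- Count of report executions by user over the trailing 30 days, against
-- the database loaded by the sample execution-log package. Names assumed.
SELECT   u.UserName,
         COUNT(*) AS ReportsRun
FROM     dbo.ExecutionLogs el
JOIN     dbo.Users u
           ON u.UserID = el.UserID
WHERE    el.TimeStart >= DATEADD(dd, -30, GETDATE())
GROUP BY u.UserName
ORDER BY ReportsRun DESC;
```

Group by department instead of user, and you have the report that gets the VPs’ competitive juices flowing.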

Usage Monitoring

The usage of your DW/BI system has a huge impact on its performance. Our discussion of resource monitoring suggested that you collect usage information. You should always be collecting counts of queries run and rows returned, by type of application. The System Monitor counters we’ve outlined previously will collect that information for you.

You also should collect sample queries. Sample queries will help you tune system performance by building new indexes or aggregations. You may be able to identify training opportunities by seeing what strange queries your users are putting together.

Some organizations collect the text of all queries submitted to the system, or at least all ad hoc queries. This is actually a lot of data, and is probably overkill unless you have a compliance mandate for collecting such information. By collecting all queries, all the time, you’re placing a non-trivial performance burden on the system. And, your logging database will grow to be quite large.

A reasonable alternative is to turn on query text logging only occasionally, for example from 1 p.m. to 3 p.m. daily or even weekly. For queries into the relational database, this is your best bet. It’s simple to create and schedule a trace that captures the appropriate column (TextData) from the SQL:BatchStarting event class, and run it for a specific time. The results of this trace would supplement the constant usage monitoring which, presumably, would not include TextData.

The Profiler events that we discussed for Analysis Services will automatically collect query counts by user. You may also choose to collect the MDX and DMX statements that are issued for OLAP and Data Mining queries. Do so by collecting the TextData column of the QueryEvents\Query End event in Profiler. Note, however, that this column collects the query text in a human-readable format. The volume of data collected here is potentially far larger than the query log for usage-based aggregations. As with the relational data warehouse, you may choose to set up Profiler to collect query text for only a few hours a day or week. You should plan to collect query counts at all times.

There’s a second query log that’s used to support the wizard for designing usage-based aggregations. This query log is a server-wide setting that’s very important to turn on. We describe this second query log later in the chapter, in the section on optimizing aggregation design.

Reporting on Usage

Your BI portal web site should devote a small amount of screen real estate to reporting on system usage. VPs and Directors are often very interested in how much their staff members are using your system. They tend to be competitive people, and simply seeing another department using the DW/BI system has been known to spur a VP to encourage his or her staff to use the system more. So a time series of reporting and ad hoc use by department is a really good report to publish.

The DW/BI team should also know who is doing what with the system, how much, and when. This information is imperative for performance tuning, as well as for identifying problem areas and users whose skills may be leveraged more broadly. If you’re monitoring the text of ad hoc queries, you should communicate to business users what, how often, and why you’re monitoring their use, what you plan to do with the information, and who has access to that information. This isn’t a big deal; just put a few sentences in the informational document for new users.

Managing Disk Space

One of the most common reasons for ETL job failure is one of the easiest to prevent: running out of disk space.

At the very minimum, set up a System Monitor counter and alert to warn when free space on each disk falls below a certain threshold. The relevant counter is Logical Disk: Free Megabytes. Set up two alerts: one to warn when you’re within a month of running out of disk space, and one to blare stridently when you’re about a week away.

Figures 15.4 and 15.5 show how to set up System Monitor (PerfMon) to trigger an alert when disk space falls below a threshold, in this case 5,000MB. On the General tab of the dialog box (see Figure 15.4), specify the free space threshold you’re looking for, and how often to check. Checking daily is usually fine. On the Action tab you can specify whether to log the alert event to the application event log (always a good idea), send a network message, or launch a program. Figure 15.5 illustrates how to set up the Schedule tab so that the alert will continue running even if the machine reboots or the alert triggers.

Figure 15.4: Set up a System Monitor alert to warn of low disk space
Figure 15.5: Set up the alert to continue running after reboot or after it triggers

How do you know what level of free space to check for? Focus on the big files that you accumulate:

  • Staging files for fact table extracts: You may keep the fact table extract files online forever, for restartability, recoverability, or regulatory reasons. Alternatively, you may keep 30 days online and back up the rest. If you keep a set number of extracts on disk, factor in the growth of monthly data volumes, especially if your fact table covers a new line of business. Make a conservative guess, and make an entry in your operations plan to re-check this guess periodically.

  • Analysis Services measure group (fact) partitions: Analysis Services cubes can accumulate data, or you can set them up to keep a rolling window of data, like 37 months.

  • Relational database filegroups (including tempdb): You should monitor the logical disks on which your database files are located. SQL Server provides additional tools for managing its file space.

It’s almost universally true that the incremental disk space you’ll be using each month goes to fact data storage, whether in the file system, relational database, or Analysis Services database. Most dimensions are relatively small and static in size, at least compared to fact tables.

When you set up a relational database, you can specify an initial size, an automatic growth factor like 10 percent, and a maximum file size. We recommend that you enable automatic growth as a failsafe, but monitor free space so that you extend database space on your own schedule. This is classic relational database management. There are tons of books available on the subject, and there’s nothing particularly unusual for the data warehouse environment—except the overall size of the system and the (usually) batch nature of inserts and updates.

You could get a lot fancier than the simple mechanism we’ve described here. If your data volumes are accelerating, you should write a little program to look at the sizes of your staging files and Analysis Services databases. Forecast next month’s disk requirements based on the recent trend. You could write this program as a Script Task from within Integration Services, and schedule it using SQL Agent. You could go so far as to automatically allocate more disk space for the RDBMS, staging area, or Analysis Services. But do the basics at least: You should be ashamed if you actually run out of disk space on your production system.
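The relational piece of such a monitoring program can be sketched in T-SQL. This is a minimal sketch; xp_fixeddrives is an undocumented but widely used procedure that returns free megabytes per drive letter, and the temporary table name is arbitrary:

```sql
-- Sketch: compare allocated database file size to free drive space.
CREATE TABLE #DriveSpace (Drive char(1), FreeMB int);
INSERT #DriveSpace EXEC master..xp_fixeddrives;

SELECT d.Drive,
       d.FreeMB,
       SUM(f.size) * 8 / 1024 AS AllocatedMB   -- size is in 8KB pages
FROM   #DriveSpace d
LEFT JOIN sys.master_files f
       ON LEFT(f.physical_name, 1) = d.Drive
GROUP BY d.Drive, d.FreeMB;

DROP TABLE #DriveSpace;
```

Capture this result daily into a logging table and you have the trend data you need to forecast next month’s disk requirements.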

Killing Queries

Any DW/BI system that supports ad hoc queries—and we hope that’s all of them—needs to be able to kill a query. No matter how hard you try to tune, to educate users, to build front ends that provide a safe environment, it’s inevitable that someone will launch a query that’s Really Bad. Maybe it will bring back a billion rows. Maybe it’s malformed. Maybe it’s a good query but expensive enough that you want to run it overnight. The time will come, sooner than you may think, when you need to kill a query.

The system monitoring counters that we described should tell you if there’s a monstrous query going on. If you set up an alert, your DBA may receive a message to investigate and possibly intervene.


You could even set up a System Monitor alert to automatically kill any query that’s using too many system resources, by having the alert launch a program when it’s triggered by a large query. We don’t encourage this, as the DW/BI system is designed for users and ad hoc use. Those big queries are often quite valuable. We might consider setting this up for a few hours during high peak time, say 8 to 10 am, when everyone in the company is running their daily reports.

For the relational database, killing a query is a piece of cake. Under the covers, the key tools are the sp_who stored procedure and the KILL command. But the best approach is to use SQL Server Management Studio. Within Object Explorer, navigate to Management → Activity Monitor. Find the offending process and right-click to kill it.
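If you prefer to work from a query window, the same operation looks like this (the session ID 53 is, of course, just an example taken from the sp_who output):

```sql
-- List current sessions; note the spid of the offending query.
EXEC sp_who;

-- Kill the session by its spid.
KILL 53;
```

The killed session’s transaction is rolled back, which for a very large query can itself take some time.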

Management Studio, inexplicably, does not include an Activity Monitor for Analysis Services. SQL Server does ship a separate sample application to perform the analogous functions to sp_who and sp_kill.


You can find the Activity Monitor application in the samples folder, at Program Files\Microsoft SQL Server\90\Samples\Analysis Services\Administrator\ActivityViewer. The source code for this sample application is included. The application includes a small help file, but like the relational database Activity Monitor, the Analysis Services ActivityViewer is straightforward to use.


The samples are not installed by default. During installation (or modification of the existing installation), choose the Advanced button and ensure the samples and sample databases are installed.

Service and Availability Management

The DW/BI team is responsible for ensuring that the system, including data, reports, and other applications, is available to end users. The level of availability required, measured as the amount of acceptable down time, depends on the business impact of the system’s unavailability. You should work with the business users to develop a service level agreement (SLA), and build a plan for meeting that specified service level.

When you develop your availability plan, consider the components of the DW/BI system within the context of your entire IT infrastructure. Let’s hope your IT team already has plans and procedures for managing hardware, software, software installation media and CD keys, and user names and passwords—all necessary for rebuilding or restoring a system as quickly as possible. Assess the following issues:

  • Do any parts of the DW/BI system need continuous, 24-hour-a-day query access? Analysis Services? The relational data warehouse? If so, how do you process new data without compromising availability?

  • If continuous query access is not required, how do you ensure that processing fits within the nightly processing window?

  • How do you handle situations that require an entire Analysis Services cube to be reprocessed? As discussed in Chapter 7, this can happen if an attribute declared to be unchanging does in fact change, among other reasons.

  • How do you recover from a system failure during Integration Services processing? How do you restart a process mid-stream?

  • How is the DW/BI system protected against failure of one or more components on the server(s)?

  • How is the DW/BI system and data protected against failure within the enterprise, like source system outages, or serious problems with the Active Directory servers?

Develop a plan for addressing these issues to achieve the necessary availability. Test each element of the plan. A well-trained staff that is prepared to handle any contingency is an essential part of any disaster recovery plan.

Your availability plan must explicitly state how you’ll detect a problem with a service. Maybe you want to wait for the VP of Marketing to call, but you probably want to be a bit more proactive.

Your Windows system administrators should already have procedures in place for identifying service outages. These procedures probably use System Monitor, Windows Events, or both. Use those same techniques to identify whether the components of your DW/BI system are currently available. If your organization doesn’t already use Microsoft Operations Manager (MOM), you should consider purchasing it or a third-party operations manager like Tivoli. Any operations management product will be looking at standard Windows logs like System Monitor and Windows Events, and SQL Server Profiler.

The SLA should be explicit about the availability that’s required for each component of the DW/BI system. For Analysis Services, it’s particularly important to distinguish between the SLA for availability and the SLA for query performance. In many cases, Analysis Services incremental processing can occur in the background, with user access to the cubes undisturbed. During this period, however, query performance may degrade significantly. You need to help your business users understand these tradeoffs, so they can help make the case for a different architecture if needed to meet their availability and performance requirements.

The same issues of availability versus query performance are relevant for the relational data warehouse. Many organizations simply close access to the DW/BI system during ETL processing. Others, with higher availability requirements, will perform the bulk of processing in the background or on a remote server. User query performance on the relational data warehouse usually suffers during ETL processing, perhaps intolerably so. A careful consideration of these issues, and good communication with the users, will help you design a cost-effective architecture that meets your requirements.

Performance Tuning the DW/BI System

The system operations plan should include strategies for periodic performance tuning of all system components. The resource monitoring that we described earlier in the chapter provides the base information you’ll need to determine how to improve system performance. If performance is degrading, you need to identify which operations are causing the problem, and whether the primary bottleneck is in memory, disk, or processing power. All the BI components love memory, so it’s a good bet to check memory usage first.

The best way to solve resource contention may be to distribute your DW/BI system across multiple servers. A lot of DW/BI teams will try to build an all-in-one system, with the four major BI components (relational database, Integration Services, Analysis Services, and Reporting Services) on a single server. As we discussed in Chapter 4, it’s often sensible to distribute across multiple servers. There’s no hard and fast rule about which components to group together, although most often we see the relational database and Integration Services on the same box.

Because of the memory requirements of all the components, 64-bit hardware is very appealing. The price of a 64-bit commodity server with 8GB of RAM is astonishingly low. The extra memory can make a huge difference in the performance of your system. It feels intellectually lazy to recommend a hardware upgrade to solve performance problems, but if you’ve done a good job on your system design it’s often the easiest—if not the only—degree of freedom. An exception is in performance tuning for Analysis Services, where it’s often fairly easy to extract significant improvement.

No matter what you’re doing to improve performance, it’s really important to follow good change management and tuning techniques:

  • Work on the test system, and script any changes to production.

  • Document baseline performance.

  • Change one thing at a time.

  • Document changes to performance.

  • Test all the changes together before moving to production.

Performance Tuning Analysis Services

There are three big things you can do to improve the performance of your Analysis Services database:

  • Upgrade your hardware, especially memory, as discussed in the preceding subsection.

  • Create an optimal set of aggregations.

  • Use partitions to improve query and processing performance.

Optimizing Aggregation Design

We’ve been saying for decades that the single best way to improve query performance for a DW/BI system is to build aggregations. This is as true for an OLAP system as it’s always been for the relational data warehouse. Aggregations are so much better and easier to manage with Analysis Services that we seldom create relational aggregations any more.

When you first design your cube and measure groups, you can use the Aggregation Design Wizard to define aggregations. That wizard highlights the inevitable tradeoff in aggregation design: Query performance increases with the number of aggregations, but processing performance decreases. Optimal aggregation design maximizes query performance while minimizing the processing penalty.

The single best way to optimize aggregation design is to base aggregations on the actual usage of the Analysis Services database: to use the Usage Based Optimization (UBO) Wizard. But before you can design usage-based aggregations, you must capture queries in the query log. The query log is completely separate from the usage monitoring that you capture from Profiler. It’s a server-wide setting that’s used only to gather data for the UBO Wizard. It’s not turned on by default.


Set up the query log from within Management Studio by right-clicking on the Analysis Server node in the Object Explorer window and choosing Properties. Store the query log in a SQL Server database that you’ve created to hold Analysis Services logs. The properties you may need to edit are:

  • Log\Query Log\QueryLogConnectionString: Set up the connection to the logging database.

  • Log\Query Log\QueryLogSampling: The default is 10 (every tenth query is captured).

  • Log\Query Log\CreateQueryLogTable: Set to True to automatically create the query log table in SQL Server and start logging.

Launch the UBO Wizard from Management Studio by right-clicking on a measure group’s partition and choosing Usage Based Optimization. The UBO Wizard reads the query log and designs aggregations based on your users’ queries. These are the aggregations that you really want. The generic aggregations that were created at design time by the Aggregation Design Wizard are inferior to the usage-based aggregations.

As a good practice, use the Aggregation Design Wizard to create a small number of aggregations at design time. An aggregation level of 10 percent is a reasonable maximum; many experts design zero aggregations at the outset. Turn on the query log during the system testing phase, especially testing by business users. As one of the last pre-deployment steps, design usage-based aggregations. During the first few months the new Analysis Services database is in production, redesign aggregations every few weeks. When the database is in a steady state, you can redesign aggregations much less frequently, say on a quarterly basis.

Using Partitions

Partitions are almost as important as usage-based aggregations for improving query performance. Using partitions can also significantly improve processing performance.


You should consider using partitions, or increasing the grain of your existing partitioning, if a partition is larger than 5GB or 20 million rows.

There are two costs to partitioning:

  • Operational complexity: We discuss this in the next section.

  • Licensing cost: Partitioning requires SQL Server Enterprise Edition, which is approximately four times as expensive as Standard Edition.

The most obvious partitioning strategy is to partition by date, usually monthly. If your fact table has multiple date roles, like Order Date versus Ship Date, you should consider which role is used most often in users’ queries. This is the best choice for query performance, but you might override this choice if another date role provides huge benefits for maintenance operations.

Partitioning improves query performance because the system keeps track of what slice of data is in each partition, and the query optimizer is selective. So if you partition by month, and a query asks for data from only one month, then only one partition is accessed to answer the query.


Analysis Services 2000 required that you manually set the data slice for each partition. Analysis Services 2005 automatically tracks that information for partitions stored in its own MOLAP structure. Partitions that are kept in the relational database still need their slice set by hand.

Partitioning improves processing performance when you process multiple partitions in parallel. This is usually a huge benefit during full processing of the database. Partitioning also improves processing performance if you need to process only the current partition.


Analysis Services 2005 automatically processes multiple partitions in parallel. Parallel processing in Analysis Services 2000 was harder; most people used the parallel processing utility distributed by Microsoft. That utility is no longer necessary.

The more finely you partition, the more you’ll improve query and processing performance. You’re usually better off by partitioning along multiple dimensions—say Date.Month and Product.Brand—than by partitioning more finely along a single dimension like Date.Day. This makes sense if you think about it: You’re more likely to get query selectivity along two dimensions than along one, and you’re more likely to be able to process partitions in parallel, especially during incremental processing.


If you create a lot of partitions, you’ll find it increasingly awkward to use Management Studio. With a thousand partitions, it may take five minutes to open the database node in order to browse the dimensions and measure groups.

Most implementations will partition evenly along a dimension level like Product.Brand. You don’t have to, though. If you have five brands that combined make up 80 percent of sales, and then 500 brands that make up the rest, you might create 6 partitions: one for each large brand, and one for all the rest.

Managing Partitioning

We hope we’ve convinced you that partitioning your Analysis Services and relational fact tables is a good idea. For Analysis Services, as we just discussed, partitioning can greatly improve query and processing performance. Partitioning can also help you solve some thorny processing problems, particularly with snapshot fact tables.

In the relational data warehouse, partitioned fact tables are easier to manage than non-partitioned tables, particularly for backups. Because you can set partitions to be read-only (more accurately, you set the partition’s filegroup to be read-only), incremental backups can be much smaller and faster. Partitioned fact tables can also make the initial load of historical data occur much more quickly and easily. If you’re willing to invest in a more complex management system, you can use partitioned fact tables to help you with incremental loads into the relational data warehouse.

Managing Relational Partitions

You’d expect to partition only the fact table in the relational data warehouse. There’s no technical reason that you couldn’t partition a dimension table, but it’s hard to envision a scenario where doing so is helpful.

Almost everyone who partitions the fact table does so by date, and only by date. Unlike Analysis Services, where finer-grained partitioning is really helpful, in the relational engine you can partition by a single column only. This section, then, assumes you’re partitioning by date, usually by month.

A partitioned table looks and behaves like a single, monolithic table. You can select from, insert into, and update data in the partitioned table using normal SQL. Queries are selective: a query for a single day’s data will touch only the partition that contains that day’s data. Inserts work correctly: Under the covers SQL Server finds the correct partition and inserts the new row into it. Updates, of course, also work correctly.
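A minimal example makes this concrete. This sketch assumes an integer date key in yyyymmdd form; the function, scheme, and table names are hypothetical:

```sql
-- Sketch: a fact table partitioned monthly on an integer date key.
CREATE PARTITION FUNCTION PF_FactSalesDate (int)
    AS RANGE RIGHT FOR VALUES (20050101, 20050201, 20050301);

-- Map every partition to the same filegroup for simplicity; production
-- systems often spread partitions across several filegroups.
CREATE PARTITION SCHEME PS_FactSales
    AS PARTITION PF_FactSalesDate ALL TO ([PRIMARY]);

CREATE TABLE FactSales (
    DateKey     int   NOT NULL,
    ProductKey  int   NOT NULL,
    SalesAmount money NOT NULL
) ON PS_FactSales (DateKey);
```

A query with a predicate like `WHERE DateKey BETWEEN 20050201 AND 20050228` touches only the February partition; inserts are routed to the correct partition automatically.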

Under the covers, you can think of a partitioned table as a set of identically structured tables that are hooked together by metadata. The requirement that the structures be identical is serious: same columns, same column names, same indexes, same constraints, same everything. The only difference is that they can be on different filegroups. These requirements are discussed in greater detail in Chapter 4.


Partitioned tables are really a single object. The optimizer evaluates a single set of statistics and table-level metadata. Only the underlying storage is partitioned.

In Chapter 4, we talked about how most systems will fast-load identically structured partition tables during the initial historical load, then combine those tables into a single partitioned table. Under normal operations, most people will simply load today’s data directly into the partitioned table. As we’ve already said, SQL Server takes care of sending that data to the correct partition.

With this simple partitioning plan, all you need to do is make sure you have partitions ready before you need them. If you’re partitioning monthly, the easiest thing to do is to create the next year’s worth of partitions at the end of every year. You should document your operations plan to this effect. It’s always dangerous to rely on people to remember to do something once a year—even if it’s documented—but in this case no tragedy will ensue if you forget. You can always split a partition later. It takes processing resources, and you should back up your table before you begin, but it’s not a big deal.
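Creating next year’s partitions ahead of time is a matter of splitting the empty partition at the top of the range. A hedged sketch, assuming the hypothetical function and scheme names PF_FactSalesDate and PS_FactSales and a RANGE RIGHT function on a yyyymmdd integer key:

```sql
-- Tell the partition scheme which filegroup the next new partition uses.
ALTER PARTITION SCHEME PS_FactSales
    NEXT USED [PRIMARY];

-- Add an empty partition for February 2006 before any of its data arrives.
ALTER PARTITION FUNCTION PF_FactSalesDate()
    SPLIT RANGE (20060201);
```

Splitting an empty partition is a metadata-only operation and is effectively instantaneous; as the text notes, splitting a partition that already contains data is far more expensive.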

Fast Loading Daily Partitions

If your incremental fact table loads are huge—10 million or more rows per day—and your load window is small, you need to be more creative. You’ll need to fast-load even your incremental fact table loads. If you’re keeping less than a year of data online you can partition daily, but you shouldn’t plan to have more than several hundred partitions online.

To fast-load data into a very large partitioned table, follow these steps:

  1. Near the beginning of the Integration Services package that performs the fact table load, clone a table partition. You want an empty table that’s structured identically to the target partition table. Name the empty table something like MyFactTable_yyyymmdd. You could implement this cloning as an Execute SQL task in Integration Services, using a variable expression to supply the table name. Alternatively, you could use the SQL Server SMO object model and the Script task. Chapter 4 describes the necessary characteristics of the empty table, including a check constraint.

  2. The Data Flow task for the incremental load package is the same as it would be for a non-partitioned table, with one exception. The target is your new empty table rather than the partitioned table. (You could create a view when you create the empty table so that the Data Flow task always has the same target name, or you could parameterize the OLE DB Destination in the Data Flow task.)

  3. Load data into your new, empty table, using the fast-load techniques we discussed in Chapter 6.

  4. Create the appropriate indexes and foreign key constraints (if any) on the new table. As with the cloning step, you can use an Execute SQL task or a Script task.

  5. Switch the newly loaded table with today’s data into the partitioned table. Again, you can use an Execute SQL or a Script task. In Chapter 4 we detail what we mean by switching the table into the partition, and show the SQL syntax.
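The switch in step 5 can be sketched in T-SQL. This assumes the hypothetical names FactSales_20051015 (the newly loaded daily table), FactSales (the partitioned fact table), and PF_FactSalesDate (its partition function); the $PARTITION function computes the target partition number from the day’s date key:

```sql
-- Switch the loaded daily table into the fact table's partition for
-- October 15, 2005. This is a metadata operation: no data moves.
ALTER TABLE FactSales_20051015
    SWITCH TO FactSales
    PARTITION $PARTITION.PF_FactSalesDate(20051015);

-- The source table is now empty and can be dropped.
DROP TABLE FactSales_20051015;
```

The switch succeeds only if the target partition is empty and the source table’s check constraint guarantees all its rows fall within that partition’s boundary—which is exactly why step 1 creates the clone with that constraint.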

At the end of the month, merge the daily partitions into a monthly partition by bulk inserting the daily data into an empty table, and then switching that new table (now containing a month of data) into the partitioned table. You can implement this as a separate Integration Services package that you’d schedule monthly. Set last month’s partition to be read-only.


Do not merge or split partitions, other than empty partitions. SQL Server will do the right thing, in the sense that it will move the data for you. But it will do it intolerably slowly. Use bulk insert operations into an empty table, and then switch into the partitioned table.

Late-Arriving Data

Your processing logic has to be a lot more complicated if you’ll receive late-arriving data for your daily partitioned fact table. We’ve worked on very large systems where 70 percent of the data we receive today was from yesterday; 20 percent was from the day before yesterday; and the remaining 10 percent could span back months.

In this scenario, assuming you’re still determined to fast-load the daily partition, you need to put a conditional split into the data flow, dividing the clean data stream into yesterday and all other. Fast-load yesterday’s data into its own daily partition, and use slow loading techniques for aged data.


In this scenario, you should evaluate the costs and benefits of loading the really old data right away. Does data that’s three months old have any value to the DW/BI system? It depends on the business requirements. At the very least, evaluate whether it makes sense to hold onto aged data until the weekend. The data’s already late—is it important that the users see it today?

Accumulating Snapshots

An accumulating snapshot is a kind of fact table that keeps track of a process whose state changes over time, like the lifecycle of an order or manufacturing process. The accumulating snapshot fact tends to have lots of date roles (Order Date, Payment Date, Verification Date, Requested Ship Date, Actual Ship Date) and facts that calculate the amount of time spent between various checkpoints. Each row in the accumulating snapshot tends to receive a lot of updates during its lifecycle.

We hate updates in the DW/BI system. They’re slow. They’re untidy. They mess up your backup strategies. They really mess up cube processing or any aggregation tables you might be maintaining in the relational data warehouse.

You can lessen the pain of managing an accumulating snapshot if you can ascertain when its life is over: when an order has shipped, a package delivered, or a product manufactured. You can partition the fact table both by the date the process began and by IsCompleted. In this case you have two partitions for each month, one to hold facts that are completed, and one to hold those in process. As we discuss in the next section, Analysis Services will benefit hugely from localizing updates to a subset of the partitions.

Managing Analysis Services Partitions

Managing Analysis Services partitions is almost as easy as managing relational partitions. The main difference, apart from syntax, is that you must process each partition by name—there’s no equivalent of simply inserting rows into the partitioned fact table and letting the engine route them. The unit of Analysis Services fact processing is the partition.

Partitioning by Month

As with the relational database, most systems will partition their large cubes by month. When you first develop the cube, you’ll probably define the partitions by hand and then fully process the cube. Ongoing, you’ll need to set up new partitions and incrementally process them. The following steps describe the ongoing process:

  1. Near the beginning of the Integration Services package that performs the fact table load, create a new Analysis Services partition. Use the Integration Services task called SQL Server Analysis Services Execute DDL Task, which we’ll call the AS-DDL task.

  2. Get the basic form of the AS-DDL statement by generating a script for an existing partition in Management Studio (right-click on the partition and choose Script Partition as Create). The AS-DDL statement is in an XML format, and is quite easy to read.

  3. You need to make three modifications to the AS-DDL template statement:

    • Modify the ID and Name attributes so they’re tagged with the current month, for example, MyMeasureGroup_yyyymm.

    • Modify the WHERE clause of the source query so it refers to the correct date range. For example, the source query may include a WHERE clause like the following:

      WHERE DateKey >= 20050300 AND DateKey <= 20050332
    • Set the SLICE property of the AS-DDL statement to point to the year and month contained within the Analysis Services Partition:

      <Slice>[Date Dimension].[Calendar Hierarchy].
      [Calendar YearMo].&amp;[200503]</Slice>
  4. Process the new partition.


SQL Server ships an excellent sample that describes how to automate the partition management process. You should modify this sample to fit your own environment. The default location for the sample is C:\Program Files\Microsoft SQL Server\90\Samples\Integration Services\Package Samples\SyncAdvWorksPartitions Sample.

Partitioning by Two Dimensions

Earlier in this chapter we recommended that you partition very large cubes by a second dimension, say by product family. If you follow this strategy, then each month you’ll have multiple partitions, one for each product family.

The script that creates the monthly partition set is obviously more complicated than the single partition case outlined previously. It’s a straightforward extension of that logic, with one caveat: Make sure you look for new product families, and add a new partition as necessary. With the Date dimension, you don’t need any fancy logic to know when to add a new partition. But with other dimensions, it’s perfectly possible for us to receive a new member, like a new product family.

The script that creates the new month’s partition set should query the cube to find all possible values for Product Family, and create a partition for each. You should also account for the possibility that a product family is added mid-month, if that event can occur in your environment.

Partitioning and Proactive Caching

Proactive Caching is an important new feature of Analysis Services 2005, which simplifies cube processing. As we describe in Chapter 17, a partition with Proactive Caching turned on isn’t processed explicitly by a “Process Partition” command. Instead, Analysis Services monitors the relational database and magically adds new data to the cube as it appears in the source database.

Under normal circumstances, turn on Proactive Caching only for the current partition. As soon as a partition ages, turn off Proactive Caching and do a full process of the partition. It’s important to fully process a partition that’s been incrementally processed by Proactive Caching. The frequent incremental updates leave the partition in an inefficient state, similar conceptually to a fragmented index in the relational database.

If you’re adding data into your relational fact table by switching in partitions, as described earlier in this chapter, you should be aware that Proactive Caching will not pick up this event. In other words, let’s say you added a whole month of data to your relational fact table by loading that data into an empty table and then revising the fact table’s partitioning strategy. The corresponding Analysis Services partition will remain empty because this event doesn’t trigger the Proactive Cache.

The workaround is simple: Create the Analysis Services partition after the fact table partition has been created, populated, and added to the fact table. Fully process the new Analysis Services partition as soon as you create it, to grab this initial set of data. Immediately after processing, turn on Proactive Caching for the new partition. As long as new rows are added to the relational partition through a normal Insert statement, rather than partition switching, the Proactive Caching mechanism will see those rows and add them to the Analysis Services partition.

Snapshot Fact Tables

There are two kinds of snapshot fact tables: the Accumulating Snapshot that we discussed earlier in this chapter, and the Periodic Snapshot fact table. Both of these snapshot fact tables are problematic for the Analysis Services database because both require significant fact row updates.

First, let's discuss the Periodic Snapshot. An example of a Periodic Snapshot fact table is inventory levels. The typical fact table, and associated cube, has inventory levels by warehouse and product for each month end. In the current month, you update inventory levels each day, keeping only the current month to date.

Partitions really help you manage the Periodic Snapshot Analysis Services database. On the relational side, you may choose to replace data in the current partition. Or, you may choose to update the existing fact rows. Your decision here is based on what percentage of rows are updated every day. The relational database can accommodate either approach.

In Analysis Services, there is no way to update a fact. You can add new facts, but the only way to process fact updates is to fully reprocess the partition that contains that fact. If you partition your periodic snapshot cube by month, then every day you can fully reprocess the current month’s partition.
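The daily decision is mechanical: only the current month's partition gets touched. The following sketch is illustrative, with an assumed year-month partition naming convention.

```python
from datetime import date

def daily_processing_plan(today: date, partitions):
    """For a periodic snapshot cube partitioned by month, only the current
    month's facts change, so only that partition is fully reprocessed each
    day; closed months are left alone."""
    current = f"{today:%Y%m}"
    reprocess = [p for p in partitions if p == current]
    untouched = [p for p in partitions if p != current]
    return reprocess, untouched

rep, skip = daily_processing_plan(date(2005, 3, 5), ["200501", "200502", "200503"])
# rep == ["200503"]; skip == ["200501", "200502"]
```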

Accumulating Snapshot cubes track processes with a relatively short life span. They’re much harder to manage in Analysis Services because they inevitably require a lot of updates. If you must publish your Accumulating Snapshot schema as an Analysis Services cube, your first hope should be that the data is small enough to completely refresh the cube every load cycle. That would certainly be easiest to maintain, but it’s seldom the case.

The solution is to partition both by date and by whether the process has completed. Every month you’ll have two partitions: one for processes that have finished, and one for those that have not.

In your Integration Services package, you need to identify which Analysis Services partitions need to be incrementally updated because they’ve received new rows, and which need to be fully reprocessed because they’ve received updates.

Incrementally process any IsCompleted=Yes partition that's received a new row. IsCompleted=Yes partitions receive new rows when a transaction that started in that month gets completed today. In other words, let's say today is 5 March 2005. If you opened an order on 28 February, and you finally shipped it today, then you want to incrementally process the February-2005/IsCompleted=Yes Analysis Services partition. As far as that partition is concerned, this is a new row, so incremental processing will work just fine.

Fully process any IsCompleted=No partition that's received a change. You can assume that the current month's partition always needs to be fully processed. But in the preceding example, you also need to fully process February's NotCompleted partition, because the order from February 28 moved out of it when it was completed. Even if the February 28 order had simply been updated rather than completed, say it shipped but hasn't yet been delivered, you'd still need to fully process the February-2005/NotCompleted partition.

Whew. You can see why people avoid building Accumulating Snapshot cubes.

Within the Integration Services package that manages the relational table, you can set up a loop to identify which cube partitions need to be incrementally and fully processed.
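That loop boils down to a classification rule. The sketch below is illustrative logic only: partition names like `200502_Completed` and the shape of the change records are assumptions, and in practice the processing commands would be issued through Analysis Services tasks, not Python.

```python
def plan_partition_processing(changed_orders):
    """Given today's changed orders as (month, completed_today, updated)
    tuples, decide which cube partitions to process and how."""
    incremental, full = set(), set()
    for month, completed_today, updated in changed_orders:
        if completed_today:
            # The order arrives as a new row in the IsCompleted=Yes partition...
            incremental.add(f"{month}_Completed")
            # ...and disappears from IsCompleted=No, which must be rebuilt.
            full.add(f"{month}_NotCompleted")
        elif updated:
            # Merely updated, still open: rebuild the NotCompleted partition.
            full.add(f"{month}_NotCompleted")
    return sorted(incremental), sorted(full)

# One Feb order completed today; another Feb order merely shipped.
inc, full = plan_partition_processing([("200502", True, True),
                                       ("200502", False, True)])
# inc == ["200502_Completed"]; full == ["200502_NotCompleted"]
```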

Backup and Recovery

No matter what your system's availability requirements are, you need a backup and recovery plan. This seems like an intuitively obvious statement, but we've seen any number of DW/BI systems that purported to be in production, but which had no backup plan.

It’s as important to have a recovery plan as it is to have a backup plan. And it’s equally important to test these procedures. When the inevitable emergency happens, you want to be ready, practiced, and calm. Your test system is an ideal platform for testing these procedures. If you haven’t fully tested your recovery procedures, you’re lying to yourself and your management that you have a real backup and recovery plan.

In the DW/BI world, you can experience the same kinds of emergencies as transaction systems, from server outages and disk failures to earthquakes and floods. Plan for your daily or monthly load cycle to break down occasionally. Develop your ETL system so that failure is fairly unlikely. But let’s face it: The DW/BI system is at the end of a long train of data flows over which you have no control. Only a foolish manager would neglect to plan for backing out a bad load. The auditing system described in Chapter 6 lays the foundation for identifying the rows that were changed during a specific load process.

SQL Server Databases

The relational databases are usually the most vital sets of information to back up regularly. Ideally, back up the following databases after each load:

  • Relational data warehouse databases

  • Staging databases, if any

  • Staging data in the file system, if any

  • Metadata databases

You also need to set up a regular schedule to back up the logging databases.

Your backup and recovery strategies are intertwined with each database’s recovery model. The Simple recovery model lets you restore only to the point of a backup. The transaction log is not backed up. This works fine for many relational data warehouses, where data flows in nightly, weekly, or monthly. The Simple recovery model is appropriately named; it’s faster and simpler to manage than the Full recovery model. Nonetheless, as your DW/BI system moves closer to real time, the Full recovery model becomes increasingly appropriate.


See the Books Online topics “Overview of the Recovery Models,” “Recovery Models and Supported Restore Operations,” and “Selecting a Recovery Model” for more information.
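The recovery model determines what a nightly backup job must include. The sketch below generates the T-SQL a job might run; the database name and backup paths are placeholders.

```python
def backup_commands(db: str, recovery: str):
    """Under SIMPLE recovery, a database backup after each load suffices;
    under FULL recovery, you also back up the transaction log."""
    cmds = [f"BACKUP DATABASE [{db}] TO DISK = N'X:\\Backup\\{db}.bak'"]
    if recovery.upper() == "FULL":
        cmds.append(f"BACKUP LOG [{db}] TO DISK = N'X:\\Backup\\{db}.trn'")
    return cmds

for cmd in backup_commands("MDWT_DW", "SIMPLE"):
    print(cmd)
```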

Most systems use the standard SQL Server backup facilities for relational backup and recovery. The relational data warehouse database is usually quite large, and so it’s often challenging to run a backup at the end of each (nightly) load cycle. There are several alternatives:

  • Store the database on a Storage Area Network (SAN), and use the SAN software to perform the backup. The SAN backup techniques are high performance, and this approach has been a common practice for very large databases with SQL Server 2000.

  • Partition the large fact tables, and set aged partitions to be read-only. Perform occasional full backups, but rely primarily on a strategy of filegroup and partial differential backups. Under the simple recovery model, partial backups back up the primary filegroup and all the read-write filegroups. Read-only partitions are backed up when they’re filled and converted to read-only status. The innovation of read-only partitions greatly improves your ability to quickly back up the changed portions of the relational data warehouse.


See the Books Online topic “Partial and Partial Differential Backups” for more details.
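The filegroup strategy above can be expressed as two kinds of BACKUP commands: a nightly partial backup of the read-write filegroups, plus a one-time backup of each filegroup when it's filled and converted to read-only. The sketch generates that T-SQL; filegroup names and paths are placeholders.

```python
def partial_backup_plan(db, readonly_filegroups_filled_today):
    """Nightly: back up only the read-write filegroups (a partial backup).
    Once: back up each aged filegroup when it becomes read-only."""
    cmds = [
        f"BACKUP DATABASE [{db}] READ_WRITE_FILEGROUPS "
        f"TO DISK = N'X:\\Backup\\{db}_partial.bak'"
    ]
    for fg in readonly_filegroups_filled_today:
        cmds.append(
            f"BACKUP DATABASE [{db}] FILEGROUP = N'{fg}' "
            f"TO DISK = N'X:\\Backup\\{db}_{fg}.bak'"
        )
    return cmds

for cmd in partial_backup_plan("MDWT_DW", ["FG200502"]):
    print(cmd)
```

On most nights `readonly_filegroups_filled_today` is empty, so only the fast partial backup runs.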

The logging database is written to constantly. Some DW/BI teams think the logging data is vitally important, and implement a very strong backup strategy. Other teams are sanguine about the notion of losing a week’s worth of logging data, and manage the database far more loosely. Obviously, if your logging data contains usage data necessary for regulatory compliance, you need to develop a serious backup and recovery strategy. Use Full recovery mode and a backup strategy appropriate for a transaction database. Books Online, and any number of SQL Server books, are filled with information about backup strategies for transaction databases.

Approaches differ on backup and recovery strategies for the staging databases. Many DW/BI teams think of the data in the staging tables as ephemeral, and back up only the table CREATE scripts. On the other hand, most staging databases contain only data for the most recent loads—for example, the last seven days—so a full database backup is really fast.

You may have built a simple application for business users to manipulate custom hierarchies or other attributes of a dimension. Such an application is a transaction system, however small scale. Typically you want the application to write directly to a different database than the data warehouse, one with Full recovery mode and log backups. Similarly, the metadata database should also be treated more like a transactional database than the large data warehouse database.

The msdb system database may include your Integration Services packages. It will certainly include any SQL Agent job definitions and schedules, and other information used by Management Studio, including information about which databases were backed up. For that reason, the msdb database should always be backed up immediately after any other backup operation. Use Full recovery mode for msdb.

Integration Services

The most important information to back up for Integration Services is the package definitions themselves. Packages can be stored in SQL Server, in the file system, or in a managed mode in the file system called the Package Store.

If the package definitions are stored in SQL Server, they’re located in the msdb system database, which as we’ve already discussed should be backed up religiously.

If the packages are stored in the file system or the Package Store, simply use a file system backup utility like Windows Backup to back up the package definitions, configuration files, and associated information. Of course, package definitions should be under source control, and that source control database backed up too.

As we discussed in Chapters 5 and 6, you may be staging or storing data in the file system. With Integration Services, this is a more common staging area than within the relational database. Use Windows Backup or another copy utility to back up staged data. This is especially vital if you’re relying on re-running staged extracts to bring your data warehouse database up-to-date.

Analysis Services

Throughout this book we've encouraged you to think of the Analysis Services database as ephemeral—a database that may need to be fully reprocessed at some point. That's necessary because Analysis Services doesn't support the same level of data manageability as the relational database, notably updates and deletes. The great benefits provided by Analysis Services in query performance, complex security, a calculation engine, and easy user navigation come at a cost. You need a plan for being able to fully reprocess the dimensional database; never throw away the relational data.

You absolutely must back up the definition of the Analysis Services database: the information that enables you to fully process the database. You might think that, because you have the database definition on your development server and checked into source control, you’re safe. You could always re-deploy and re-process the Analysis Services database. That’s largely true, but you’ve probably modified aggregation design and partition strategy on the production database; these changes are not reflected in the version on the development server.

You won’t find a formal command or utility for backing up the database’s definition. The most straightforward approach is to generate a complete CREATE script for the database, and back up that script.

The recommended method for backing up the Analysis Services database is to use the Analysis Services backup and restore facility in Management Studio. For readers familiar with Analysis Services 2000, the new backup facility overcomes the (severe) limitations of the old archive utility. The greatest drawback of Analysis Services 2005 backup is that it works only at the database level. On the plus side, you can launch the Backup and Restore wizards from Management Studio. From within the wizard, you can script the commands for automated operations. Schedule the backup from SQL Agent or launch it from an Integration Services package. The Analysis Services Samples folder includes source code for a sample application that illustrates how to execute backup and restore programmatically.
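When you script the backup from the wizard, what you get is an XMLA Backup command that you can then submit from a SQL Agent job or an Integration Services Analysis Services Execute DDL task. The sketch below builds that command as a string; the database ID and file name are placeholders.

```python
def as_backup_xmla(database_id: str, backup_file: str) -> str:
    """Build an XMLA Backup command for an Analysis Services 2005 database,
    the same shape Management Studio scripts from the Backup wizard."""
    return (
        '<Backup xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">\n'
        f"  <Object><DatabaseID>{database_id}</DatabaseID></Object>\n"
        f"  <File>{backup_file}</File>\n"
        "</Backup>"
    )

print(as_backup_xmla("MDWT_AS", "MDWT_AS.abf"))
```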

The Analysis Services Backup facility backs up all metadata, but only data that’s stored in MOLAP format. This includes all data and aggregations for MOLAP partitions, and aggregations only for HOLAP partitions. Data stored in the relational data warehouse should be backed up using relational backup techniques. Plan for your Analysis Services backups to take about as much space as the database itself. Analysis Services databases are stored so efficiently that—unlike relational backups—we see very little additional compression upon backup.

If your Analysis Services database is small, in the tens of gigabytes, the simplest approach is to perform a full backup every load cycle, or whenever you make a metadata change. No matter how efficient the backup utility might be, if your Analysis Services database is multiple terabytes, it’s just not practical to perform daily full backups.


If you have a Storage Area Network (SAN), frequent full backups are more practical, with very limited downtime and minimal pressure on server resources. You can circumvent the backup utility and copy the files directly:

  • Create a mirror set and wait for it to fully synchronize.

  • Stop Analysis Services; break the mirror; restart Analysis Services.

  • Mount the mirrored image as a separate drive and perform a file level backup of the entire data folder (Program Files\Microsoft SQL Server\MSSQL\OLAP\Data).

Our recommended practice is to back up the database whenever the metadata changes. Metadata changes include redesigning aggregations, adding a partition, or changing security groups and permissions. If you just can’t do a full backup every time the metadata changes, you should first question why your metadata is changing so often! Assuming you’re still in this situation, you must capture the database ALTER scripts, always date-time stamped so it’s easy to identify which changes have occurred since the last backup.


Always script any metadata change on the production server. Most changes, including security and aggregation design, should be tested on your test server, and then rolled to production using ALTER scripts. Use a script to add new partitions as well, although this routine operation might not have a corresponding action on the test server.
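A simple convention makes the date-time stamping automatic, so you can replay every ALTER script made since the last backup in order. The file-name pattern below is an assumption, not a product convention.

```python
from datetime import datetime

def alter_script_name(database_id: str, when: datetime) -> str:
    """Date-time stamp each scripted metadata change so the scripts sort
    chronologically and it's obvious which postdate the last backup."""
    return f"{when:%Y%m%d_%H%M%S}_{database_id}_alter.xmla"

print(alter_script_name("MDWT_AS", datetime(2005, 3, 5, 23, 15, 0)))
# 20050305_231500_MDWT_AS_alter.xmla
```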

To restore the Analysis Services database, run the Restore script. Apply any ALTER scripts. Fully process any partition that’s gotten new data since the backup—usually just the current partition or partition set. If you have late arriving data, you need to build a more complicated system for identifying which partitions need to be reprocessed.
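The last restore step, finding partitions that received data after the backup, is easy if your audit or logging tables record each partition's last load date, as described in Chapter 6. This sketch assumes such a mapping is available.

```python
from datetime import date

def partitions_to_reprocess(backup_date: date, partition_last_loaded):
    """After restoring and replaying ALTER scripts, fully process every
    partition loaded after the backup was taken. partition_last_loaded
    maps partition name -> last load date, from the audit tables."""
    return sorted(
        name for name, loaded in partition_last_loaded.items()
        if loaded > backup_date
    )

stale = partitions_to_reprocess(
    date(2005, 3, 1),
    {"200502": date(2005, 2, 28), "200503": date(2005, 3, 5)},
)
# stale == ["200503"]
```

With on-time data this returns just the current partition; late-arriving data shows up here automatically, as older partitions with recent load dates.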

Reporting Services

All of your Reporting Services report definitions and schedules are in the Report Server database. This database should use Full recovery mode, and be backed up like any transactional database.


It is as important to document and test your recovery plan as it is to perform backups. An emergency is not the time to test your recovery procedures. We could regale you with sad tales of daily backups to corrupt media that were never tested until too late. Despite the fact that this is kindergarten-level system administration, we are past being astonished at finding people who don't know whether their recovery procedures will work.


We’ve said it several times already, but once more: Backup without verification is meaningless and a waste of time. You’re better off not even doing the backup, and not kidding yourself that you’re protected. Good intentions don’t count.

Verification doesn’t mean checking the checkbox in the utility, which verifies the physical media. That’s a good thing to do; it’s just not what we’re talking about. We’re talking about testing the full recovery process, including the BI applications, to make sure everything really works. And it’s not just that the scripts work. You need to confirm, by testing, that your staff know what steps to take to successfully restore the system.
