Architecture of a Microsoft DW/BI System





Architecture of a Microsoft DW/BI System

All DW/BI systems consist of several major components, as pictured in Figure: sources of data, an ETL system, data warehouse databases, and a wide variety of uses. Metadata is the glue that binds together the complete DW/BI system.

Image from book
Figure: Microsoft DW/BI system architecture

As we explained in Chapter 2, the data warehouse databases should be in a dimensional form, consisting of fact tables and their associated dimension tables. Dimensions should be conformed across the enterprise. All business processes that are described by the customer dimension should use the same customer dimension with the same keys.

The primary place to store and manage the dimensional model is in the relational data warehouse database. In Microsoft terms, this is the SQL Server 2005 database engine. You will write an ETL system that populates that database, performing inserts and updates, and perhaps also managing system resources such as disk space and indexes.

The second place to store and manage the dimensional model is in the OLAP data warehouse database. In Microsoft terms, this is the Analysis Services OLAP engine. We recommend that you always build the OLAP database from a clean, conformed relational data warehouse database.

As we describe in Chapter 8, there are many kinds of BI applications, ranging from standard predefined reports to complex analytic applications that use data mining technology to affect business operations. Microsoft offers many technologies here, from Reporting Services for predefined reports, to Analysis Services data mining and the Visual Studio development environment to build custom applications.

The other kind of usage is exploratory or ad hoc. Here, Office Excel continues to be popular, although many organizations struggle with the data anarchy that comes with extensive use of Excel in the enterprise. Many organizations use non-Microsoft tools to deliver more structured, yet still highly flexible, ad hoc query functionality. As we discuss in Chapter 9, the Report Builder component of Reporting Services is designed to provide some ad hoc functionality. Data mining is another kind of exploratory use, delivered by the Analysis Services data mining features.

All these tools use metadata for development and operations, but there’s no specific metadata feature that we can point to. That doesn’t mean the metadata is missing or even that it’s unavailable; it’s just not as easy to get to, nor as integrated as we’d like.

Most readers understand why the relational data warehouse database is important. Let’s talk first about why your architecture should include Analysis Services.

Why Analysis Services?

What functionality is addressed by the OLAP database engine? Why would you want an OLAP engine—an Analysis Services implementation—in addition to the dimensional model stored in the relational database?

All DW/BI systems need a user-oriented layer on top of the dimensional data stored in the relational database. This layer could simply be a set of predefined reports. But for successful ad hoc access by business users, you need a layer that performs the following basic functions:

  • Easy user navigation: User-oriented names for database objects, and transparent join paths between dimensions and facts and between multiple fact tables.

  • Complex calculations: Centralized storage of calculation logic, and execution of calculations.

  • Fast user query performance: Usually accomplished through aggregate navigation and aggregate management.

  • Data security definition and enforcement: Preferably managed on a server rather than on users’ desktops.

For many years people have used relational techniques like views and client-side query tools to deliver this functionality. An OLAP engine like Analysis Services provides a better way. There are two key characteristics of the best OLAP engines that make them superior to the classic approach of using a client query tool atop relational views:

  • Query language: OLAP engines use a different—and better—query language than SQL to express complex calculations.

  • Computational performance: An OLAP engine has been designed as a high-performance server—its ability to resolve the most complex calculations far outstrips any client-based tool (even those client-based tools that run on a shared file server).

Analysis Services meets our definition of an OLAP engine. It provides the following features:

  • User-oriented metadata: The structure of the Analysis Services database explicitly defines dimensions, facts, and hierarchies. These structures have user-oriented names that can be localized into different languages for different users. Query and reporting tools provide an interface to browse the data and generate complex analytics using the familiar slicing, dicing, and drilling metaphors.

  • Complex analytics stored in the database: The Analysis Services database stores information about calculations, from simple information about calculations, such as whether an inventory balance is calculated as an average or period-end, to the definition of complex calculations such as corporate profit or revenue allocation. Such calculations belong within the definition of the database itself, and Analysis Services provides many mechanisms for doing so, including calculated members, calculated cells, scripts, and key performance indicators.

  • Richness of the analytic language: SQL is a set-oriented query language; it is not and will never be an analytic language, even with the ANSI-99 OLAP extensions. Query tools for Analysis Services generate MDX (Multidimensional Expressions) instead of SQL. MDX’s greatest strength is its understanding of the dimensional metadata: facts, dimensions, attributes, hierarchies, parents, siblings, and children.

  • Query performance: Ad hoc query performance is significantly improved over the “same” queries issued directly against the relational data warehouse database. This is especially true when you use the Analysis Services storage for most of the OLAP database.

  • Aggregate management: Aggregate or summary tables are vital for query performance of any relational data warehouse database, but building, populating, and updating those aggregates is painful. Analysis Services is a great aggregate management and navigation system.

Reference 

Chapter 14 of The Data Warehouse Lifecycle Toolkit discusses relational aggregate management in great detail.

The concepts introduced here are discussed more thoroughly in Chapter 7, which describes how to design an Analysis Services database.

Arguments Against Analysis Services

Although Analysis Services has become an extremely popular component of SQL Server 2000, there are still several common objections to using Analysis Services:

  • Scalability: What if your relational database contains several terabytes of data? Analysis Services 2000 scaled to this level, although like its relational cousin it scaled better with simple schemas than complex ones. Analysis Services 2005 has been designed for improved scalability. We wouldn’t hesitate to implement systems with several terabytes of data in Analysis Services. This isn’t to say Analysis Services won’t scale higher than several terabytes, just that we’d be more cautious.

  • Duplication of data: Many users dislike the notion of duplicating all the relational data warehouse data into a second database management system. First, you should recognize that using an Analysis Services database does not require duplicating the data—the data can continue to be stored only in the relational database. Even if you choose an implementation that duplicates the data, the technologies self-manage so that the duplication behaves like an index or cache.

  • Changing the user applications: Your business users are accustomed to using a SQL-based query and reporting tool, which might not work the same way (or at all) against an Analysis Services database. There is significant cost in purchasing new tools and retraining your users. This is the best argument against Analysis Services, especially in companies that have made significant commitments to a SQL-based tool vendor. In many cases, however, the query and reporting tools require periodic expensive upgrades and re-licensing. If a solution based on Analysis Services and Microsoft Office meets most of your users’ needs, and Office is already licensed for basic desktop use, you may minimize the cost of retooling.

Of the three common arguments against using Analysis Services, we find only the third to be broadly compelling. Worries about scalability and data duplication shouldn’t prevent the vast majority of SQL Server implementations from reaping the very real benefits of a DW/BI system that’s built on Analysis Services.

Why a Relational Store?

Perhaps you’re convinced that Analysis Services is a vital part of your DW/BI system architecture. Your next question may be: why do you need to store the dimensional data in the relational database? You aren’t required to do so: Microsoft provides several mechanisms for populating Analysis Services cubes directly from non-dimensional source systems. Why go to the trouble and expense of implementing a relational data warehouse database? Here’s why:

  • Disaster recovery: The tools and knowledge for managing a relational database for easy recovery are better than those for Analysis Services. It’s worth noting that Analysis Services management tools are much improved from SQL Server 2000.

  • Conforming dimensions and facts: In a hypothetical, simple example you could conform data on the way into the Analysis Services database. In the real world, you will have to update and delete some data in the ETL pipeline, and you really want to do this in a relational database.

  • Query performance: Insofar as users’ queries are resolved from the relational database rather than the Analysis Services multidimensional cache, those queries will perform much faster against a dimensional source than against the normalized transaction system sources. This is an order of magnitude more important if the sources are on multiple servers.

  • Database flexibility: If you want to modify an Analysis Services database, you usually need to redeploy and reprocess a large chunk of the database. It’s much easier to “join and go” in the relational world.

  • Comfort: DBAs and power users are very familiar with SQL and relational databases, and will violently resist the elimination of the relational layer.

  • Future flexibility: The notion of eliminating the relational data warehouse database and populating the Analysis Services database directly from transaction systems may sound appealing. But if you choose this architecture, you’re committing to an architecture that’s especially Microsoft-specific.

There are scenarios, particularly around the real-time delivery of analytic data, where the best choice is to skip the relational storage of the dimensional data and populate the Analysis Services database directly from transaction systems. But these are edge cases. Most of us, most of the time, should plan to store and manage the dimensional data in the relational database, and use that store to feed Analysis Services. Think of the Analysis Services layer as metadata for the OLAP engine, which possibly includes a data cache.

Previous Section
Next Section


 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows