Making the Case For (and Against) Real-Time Data

Real-time data sounds so cool. Who would not want to have data that’s as fresh and current as possible? And surely we, as professionals, want all enterprise data to be conformed and consistent? What better way than to populate the DW/BI system in real time?

As we discuss in this section, it may sound cool, but it’s not so easy.

What Makes Delivering Real-Time Data Hard?

The hardest things about delivering real-time data are, as usual, related to people rather than to technology. Nonetheless, there are some technical challenges, too. Microsoft offers many interesting features to help you with the technical challenges.

First, the people-based challenges. One of the biggest issues for the DW/BI team is meeting the demands both of those who want real-time data and of those who most decidedly do not. You may think that pushing the latency closer to real time would be a win for everyone. Surely, in the absence of any cost of delivering data, we’d all prefer to have real-time data?

Actually, no. Anyone who’s trying to develop a non-trivial analysis knows that you need to work on a static dataset, where the numbers aren’t changing from moment to moment and query to query. If you try to make these analysts work against a dynamic database, they’ll copy a bunch of data to a personal computer: exactly the kind of behavior you’re probably hoping to stem with your DW/BI project.

A second person-related problem has to do with the DW/BI team itself. If you add operational duties to this team’s charter, you risk having the urgent overwhelm the important. In other words, the real-time operations will co-opt the strategic nature of the DW/BI system. Think about it: This is why strategic groups generally have no operational responsibility.

Now, on to the technical problems.

If you’re going to put real-time data in the DW/BI system, you need to use the same conformed dimensions, and surrogate keys, used elsewhere in the DW/BI system. Otherwise, there’s really no point. You’ll need to process new dimension members, and updates to existing dimension members, in real time. The real-time (and non-real-time) facts that flow in must swap their business keys for the correct surrogate keys.

This is more of a design challenge than a computational one. Assuming you receive only the dimension rows that are changing, the ETL for hourly processing should take roughly 1/24 the time of the ETL for daily processing; the hard part is getting the design right.
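To make the design concrete, here is a minimal Python sketch of the surrogate key exchange for incoming real-time facts. This is an illustration only, not SQL Server code; the row shapes, column names, and the suspense-row handling are all assumptions.

```python
# Illustrative sketch (hypothetical names): swap business keys on
# incoming real-time fact rows for the warehouse's surrogate keys.

# Current dimension image: business key -> active surrogate key
customer_lookup = {"CUST-1001": 17, "CUST-1002": 42}

def assign_surrogate_keys(fact_rows, lookup):
    """Replace each fact row's business key with its surrogate key.

    Rows whose business key has no dimension member yet (early-arriving
    facts) are routed to a suspense stream instead of the fact table.
    """
    keyed, suspense = [], []
    for row in fact_rows:
        sk = lookup.get(row["customer_bk"])
        if sk is None:
            suspense.append(row)  # no dimension row yet; hold for later
        else:
            keyed.append({**row, "customer_sk": sk})
    return keyed, suspense

facts = [{"customer_bk": "CUST-1001", "amount": 25.0},
         {"customer_bk": "CUST-9999", "amount": 10.0}]
keyed, suspense = assign_surrogate_keys(facts, customer_lookup)
# One fact is keyed; the unknown customer lands in suspense
```

In a real pipeline this lookup runs continuously, and the suspense stream must be retried as new dimension members arrive — which is exactly the design work the paragraph above refers to.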

Most DW/BI systems that are updated daily or monthly track dimension changes on a daily basis. In other words, you combine potentially many database transactions on a customer account record into a single end-of-day image, and you associate all sales transactions that occurred yesterday with that end-of-day image. The end-of-day image isn’t strictly accurate, but business users almost always prefer this structure. There are some exceptions, especially with online businesses. Daily changes are easier to manage, especially if some of the changing attributes track history as a Type 2 Slowly Changing Dimension.

But if you’re consuming and processing dimension change information intraday, you really don’t have this option. You need to date and timestamp all Type 2 changes to the dimension member, and you may end up adding several rows for a customer account during the day. This may or may not be what the business users really want.
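A hedged sketch of what intraday Type 2 processing implies: each change must expire the current row with a full timestamp (not just a date) and insert a new one, so a single customer can accumulate several rows in one day. All names and structures here are hypothetical.

```python
# Illustrative sketch (hypothetical names): intraday Type 2 changes
# need effective timestamps, so one customer can pick up several
# dimension rows in a single day.
from datetime import datetime

dimension = []  # the customer dimension, as a list of row dicts
next_sk = 1     # next surrogate key to assign

def apply_type2_change(business_key, attrs, change_time):
    """Expire the current row for this member (if any), insert a new one."""
    global next_sk
    for row in dimension:
        if row["customer_bk"] == business_key and row["row_end"] is None:
            row["row_end"] = change_time  # expire the prior image
    dimension.append({"customer_sk": next_sk,
                      "customer_bk": business_key,
                      **attrs,
                      "row_start": change_time,
                      "row_end": None})
    next_sk += 1

apply_type2_change("CUST-1001", {"tier": "Silver"}, datetime(2024, 1, 5, 9, 0))
apply_type2_change("CUST-1001", {"tier": "Gold"},   datetime(2024, 1, 5, 14, 30))
# Two rows for the same customer on the same day; only one is current
```

The same customer now has two dimension rows dated January 5 — the situation the business users may or may not actually want.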

This Type 2 problem is annoying, but it’s not as bad as the Type 1 problem. An update to a Type 1 dimension attribute requires updating the dimension member’s row, potentially multiple times during the day. The update itself is mildly problematic. We don’t like lots of updates to tables. The worse problem, however, lies with any aggregations—relational aggregate tables, indexed views, or Analysis Services aggregations—that are built on the Type 1 attribute. All of these aggregations need to be adjusted, not just for the data that’s flowing in today, but for all time.

Warning 

Any precomputed aggregations defined on a Type 1 attribute need to be adjusted (or rebuilt) if the attribute changes. This adjustment affects the entire time series of data. It affects even the fact tables that you’re not updating in real time — any fact table that uses that dimension and has aggregations defined on the Type 1 attribute. The cost of recomputing (or fixing) the aggregations may be bearable for daily processing. It’s almost certainly too expensive for data latency of less than a minute: the system would constantly be recomputing aggregations.

With large data volumes, don’t define any aggregations on a Type 1 attribute that’s updated in real time—neither for the real-time fact table, nor for any other fact table that uses that dimension. This problem is not unique to Microsoft; it’s a fundamental issue with real-time data.
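The warning can be demonstrated in miniature. In this hedged Python sketch (hypothetical names, not Microsoft code), a single Type 1 update to one customer’s region attribute changes a region-level aggregate across every year of history, which is why such aggregates must be recomputed over all time rather than incrementally adjusted for today’s data:

```python
# Illustrative sketch (hypothetical names): a Type 1 update invalidates
# precomputed aggregates across the entire time series.

facts = [  # (customer, year, sales) -- the full history
    ("CUST-1001", 2022, 100.0),
    ("CUST-1001", 2023, 150.0),
    ("CUST-1002", 2023, 80.0),
]
region_of = {"CUST-1001": "East", "CUST-1002": "West"}  # Type 1 attribute

def sales_by_region(facts, region_of):
    """The 'aggregate table': total sales grouped by the Type 1 attribute."""
    agg = {}
    for cust, year, amount in facts:
        key = region_of[cust]
        agg[key] = agg.get(key, 0.0) + amount
    return agg

before = sales_by_region(facts, region_of)  # {'East': 250.0, 'West': 80.0}
region_of["CUST-1001"] = "West"             # one Type 1 update...
after = sales_by_region(facts, region_of)   # {'West': 330.0}: all years moved
```

Every historical fact for that customer moved from East to West, even though no fact rows changed — only the dimension attribute did.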

In general, we’ve found that real-time DW/BI systems are harder, more costly, and more time consuming to build. Hence, adding a real-time element to your DW/BI project will greatly increase its risk of failure. We certainly don’t recommend including real-time data in a Phase 1 project.

What Makes Real-Time Data Valuable?

Real-time data is valuable if it will help you make an important decision. What type of decision is based on something that happened seconds, minutes, or even hours ago? An operational decision. A CEO might want a whiz-bang real-time monitor on her desktop, but any CEO who’s running a company based on what happened five minutes ago should be replaced as soon as possible (unless it’s a very small company where the CEO also works the cash register!). Ditto for any VPs and their analytic support staff. Even the VP of Operations focuses on tactics (if not strategy), rather than what’s happening right now. That’s what line managers are for.

Note 

The exceptions are obvious: major disasters or fundamental changes in the business landscape—like an earthquake, a chemical spill, or a competitor’s legal troubles. A CEO is pretty unlikely to learn about such events from a real-time BI dashboard, don’t you think?

It’s worth repeating this point: Real-time data is intended for operational staff and their direct managers. It’s intended for operations. In the vast majority of cases, an operational system—which is, after all, designed to support an operational process—should deliver the real-time data.

We’re not saying this operational staff, and the decisions they make, aren’t important. Any one decision is small, with a small impact on the business. In aggregate, operational decisions are hugely important. Someone should help these guys out. Maybe that someone is the DW/BI team, which has expertise with business analysis and the query and analysis toolset. Maybe not.

What Should You Do?

The business has a problem: It needs better, more flexible access to real-time data. Business users often look at the DW/BI tools, and ask for that same level of functionality and flexibility on operational systems.

Often, the best solution is to improve the operational systems. If your operational system is purchased—as most are—your company could extend the set of reports that are shipped with the product. If those reports can’t be modified, you should think about whether you made a great product selection. But in the meantime, you could certainly replace or extend the packaged reports with a Reporting Services portal. That portal may even be integrated with the DW/BI portal.

The problem is more interesting if the need for real-time data spans several operational data sources. In this case, you must perform significant data transformation and integration steps. For really simple scenarios, where the data is perfectly clean (what alternate universe would that be?), both Analysis Services and Reporting Services can span multiple data sources. But realistically, any integration that’s not trivial will require that the data flow through Integration Services. From there it can populate a relational database, or even flow directly into Analysis Services or Reporting Services.

If you attempt to integrate the real-time data directly into the relational data warehouse database, you run into the dimension change issues that we discussed earlier in this chapter. You have two choices:

  • Integrate the real-time data into the DW/BI system, using and updating the surrogate keys. This approach can disrupt the user experience for that part of your user community that doesn’t use real-time data. It can also be extremely difficult and problematic to manage.

  • Keep the real-time data separate, and access it using only business keys. This approach can lead to multiple versions of the truth.

We discuss these approaches later in this chapter. Although the first approach is more intellectually appealing, we don’t think it’s very practical for most systems. Segregating the real-time data is a more common approach.
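As a hedged illustration of the second (segregated) approach: real-time rows carry only business keys, and any conformed attributes are looked up at query time. Everything here — names, shapes, and the lookup convention — is an assumption for illustration.

```python
# Illustrative sketch (hypothetical names): a segregated real-time table
# keeps business keys only; conformed attributes are joined at query
# time. Because these rows bypass surrogate-key processing, a report
# built this way can disagree with the warehouse -- the "multiple
# versions of the truth" risk described above.

warehouse_dim = {  # conformed customer dimension, keyed by business key
    "CUST-1001": {"region": "East"},
    "CUST-1002": {"region": "West"},
}

realtime_facts = [  # today's feed, business keys only
    {"customer_bk": "CUST-1001", "amount": 12.0},
    {"customer_bk": "CUST-1002", "amount": 7.5},
]

def realtime_sales_by_region(facts, dim):
    """Join today's real-time rows to conformed attributes on the fly."""
    totals = {}
    for row in facts:
        region = dim.get(row["customer_bk"], {}).get("region", "Unknown")
        totals[region] = totals.get(region, 0.0) + row["amount"]
    return totals

report = realtime_sales_by_region(realtime_facts, warehouse_dim)
```

The query-time join is simple, but nothing guarantees it uses the same dimension image the warehouse used when it loaded yesterday’s facts.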

We realize we haven’t answered the main question: What should you do? Of course, the answer depends on your requirements and environment. The figure below collects some of the approaches we think are most practical and ranks them on key criteria. These criteria are:

  • How easy is the approach to implement?

  • What kind of latency can the approach deliver? Zero latency (++ in the table) is delivered only by direct queries against the transaction system.

  • How well does the approach perform at large scale (data volumes or users)?

  • How well does the approach support ad hoc access?

  • How well does the approach support access to data that’s been transformed and integrated?

  • Does the approach meet our standards for consistent reporting across the enterprise (a single version of the truth)? No approach scores better than a zero on this criterion because of the difficulties inherent with real-time data and analysis in a company that also has an integrated DW/BI system.

Figure: Methods for Delivering Real-Time Data

Each method is scored on six criteria, listed in this order: EASE, LATENCY, PERF/SCALE, AD HOC, INTEGRATE, ENTERPRISE.

Source directly from the transaction system

  • Create Reporting Services reports directly on the transaction database. By default, reports are executed on demand and contain the most recent data. Scores: + +, + +, – –, – –, – –, – –

  • Create Reporting Services reports directly on the transaction database. Set up the reports either to cache or to snapshot on a schedule of your choice. Scores: + +, +, – –, – –, – –

  • Create an Analysis Services database directly on the transaction system. Use proactive caching settings. Scores: +, + +, + +, – –

  • Use Reporting Services Report Builder directly on the transaction database. Scores: +, + +, – –, +, – –, – –

Populate a relational data warehouse database in real time

  • Use Reporting Services reports on the relational DW. Scores: +, +, – –, + +, 0

  • Use Reporting Services Report Builder on the relational data warehouse. Scores: +, +, +, + +, 0

  • Create an Analysis Services database that you update in real time. Scores: – –, +, + +, + +, + +, 0

Perform integration, but skip the relational data warehouse database

  • Write an Integration Services package that integrates and transforms data. Set up a report that sources from that package. Scores: +, + +, – –, – –, +

  • Write an Integration Services package that integrates and transforms data. Use that package to populate Analysis Services objects directly. Scores: – –, +, +, + +, +

In the figure, we scored each method on each criterion on a scale of –2 (noted as – –) to +2 (noted as + +). An “ease” score of + + means the method is the easiest approach for delivering real-time data. No method gets positive marks across all the criteria.

The remainder of this chapter describes the technical features in the SQL Server 2005 toolset that enable business users to access data in real time—latency of less than one day.
