Dimension Processing





Dimension Processing

Dimension tables provide the structure and context for the fact tables and measurements in the DW/BI system. Dimension tables are much smaller than fact tables, which is good because their processing is complex.

Dimensions are updated over time, either by updating in place (Type 1), or by adding new rows to track history (Type 2). Sometimes you’re lucky enough to be handed separate streams of new rows and changed rows. Often you need to figure out what changes have occurred, and handle them appropriately.

This section presents examples related to the AdventureWorks databases. You’ll be able to follow along simply by reading the text and looking at the figures. But if you want to experiment with the packages from this chapter, you need to go to the book’s web site. We’re populating a modified version of the AdventureWorksDW database that ships with SQL Server 2005: We call our versions MDWT_AdventureWorksDW. The web site (www.MsftDWToolkit.com) has copies of the packages that populate MDWT_AdventureWorksDW.

Dimension Processing Basics

Kimball Method Etl Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #1 Extract System: Source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to ETL environment.

  • #9 Surrogate Key Creation System: Robust mechanism for producing stream of surrogate keys independently for every dimension. Independent of database instance, able to serve distributed clients.

Start with a simple dimension: Promotions. In the Adventure Works Cycles case study, the DimPromotions table is sourced from the [Sales].[SpecialOffer] table, with no transformations necessary. You should be so lucky in the real world.

Even though this is the simplest possible example, we’re still going to develop a data flow diagram in Visio, which is illustrated in Figure. As you can see, even a trivial example has several steps. Ignore the Audit dimension for now, and focus on the basics. We discuss the auditing subsystem later in this chapter.

Image from book
Figure: Logic flow diagram for populating DimPromotions

First, define the extract from the source system. There’s a grand total of 16 promotions in our source system, so simply select all the rows from the source table. In the section on Slowly Changing Dimensions, we describe how to compare them against the Promotion dimension table before deciding whether to insert them or update any changed values.

Hands-on Example
Image from book

This section walks you step by step through the process of creating, modifying, and running the simple Promotions package. After this section, we abandon the step-by-step approach.

Image from book

Extract Promotions Data

To extract the Promotions data from the SQL Server source in AdventureWorks, set up a connection and define the extraction query. Open the DimPromotion_ETL package that you created earlier and add a Data Flow task if you don’t have one already. In the Data Flow task, add an OLE DB Source, and set the connection to your source database (AdventureWorks). Specify that you’ll be extracting from a table, and choose [Sales].[SpecialOffer], as you see in Figure.

Image from book
Figure: Promotion package OLE DB source

Note 

If you source from a table or view, Integration Services extracts all the rows in that object. To extract only the rows changed today, you need to change the Data Access Mode to a SQL command. We show how to do that later in this chapter.

Choose all the columns in the table except the last two: rowguid and ModifiedDate. This step is optional, but it’s a good practice to extract only the data you really need. Remember that data is resident in memory within an Integration Services Data Flow task. Keep the data stream as small as possible. Memory is not a concern for our 16 promotions, but you should develop good habits from the outset.

Change column names to match target columns as soon as possible. You can do so in the OLE DB Source Editor, as illustrated in Figure.

Image from book
Figure: OLE DB Source Editor column configuration for DimPromotions

You may wonder, in Figure, what the column names that end in “_U” are all about. These columns are in Unicode format, and our target tables are single-byte character strings. By creating the extract columns with the “U” suffix, you’re saving the correct column name for later on, when you perform this conversion.

You may never run into this Unicode to single-byte string conversion in the real world. Typically, source systems are defined correctly to support the data that is stored in them. Microsoft uses Unicode by default in all of its demo databases to support localization of the AdventureWorks database. But you will definitely use this trick: There’s always something to convert.

The third section of the OLE DB Source Editor is where you configure error outputs. We’re hard pressed to understand how a relational source could have malformed data. Error outputs make a lot of sense for a flat file source, but they’re seldom useful for a relational source.

Note 

If you look closely back at Figure, you’ll see that we renamed the OLE DB Source transform to “Sales SpecialOffers.” We like to name source transforms to indicate the data source table, in this case the SpecialOffers table in the Sales schema. You should rename all objects and descriptions in your packages, and add text annotations to the package surface as well. As we describe in Chapter 13, package object names and descriptions are a part of your metadata. There aren’t many ways to decorate packages with custom metadata, and you should use every option you have.

Transform Promotions Data

Although we claimed that no transformations were necessary on the promotions data, a quick look at Figure tells you that was an exaggeration. You want to get rid of the null values for the maximum quantity applicable to a promotion. If you leave this as null, business users will have to construct a complicated filter condition: where the quantity sold is between the minimum and maximum quantities, or where it’s greater than the minimum, and the maximum is null. Instead let’s take the simple step of replacing null with a large number. Because this is an integer column, use the maximum value an integer can take: 2,147,483,647.

Tip 

The maximum value for a smallint is 32,767; the largest tinyint is 255. The largest bigint is a really big number: 9,223,372,036,854,775,807.

If you often replace null values, add a MaxInt variable to your template package.

Add a Derived Column transform to the data flow diagram, and hook it up to the SpecialOffers Source. Add two columns here. First, add a new column for the AuditKey, and set it equal to zero for now. We’ll come back and fix this later. Next, add a derived column that replaces MaxQty. Set its value to the following expression:

ISNULL(MaxQty) ? 2147483647 : MaxQty

This expression checks to see if MaxQty is null. If so, it replaces its value with our new maximum value; otherwise it keeps the existing value of MaxQty.

Reference 

The Books Online topic “Data Types (Transact-SQL)” contains information about relational data types.

The Books Online topic “Integration Services Expression Reference” is an important reference for creating expressions.

Figure illustrates the Derived Column transform. In addition to the AuditKey and MaxQty columns, trim all of the string columns. String columns often have extra spaces, usually to the right. These spaces are undesirable, and are really hard to debug because they’re invisible.

Image from book
Figure: Derived Column transform
Tip 

Trim all string columns immediately after the data source step.

You could convert the Unicode strings to single-byte in this same transform. This is probably what we’d do in a real package, trimming and converting in the same step. But in this case add a third transform to the data flow, a Data Conversion transform. Set it up to convert PromotionName_U, PromotionType_U, and PromotionCategory_U to single-byte strings of the appropriate lengths (50, 20, and 20, respectively). Remove the “_U” when you specify the Output Alias.

Tip 

Readers who are working with a localized version of SQL Server 2005, localized to a double-byte language, might find that the data in the demo AdventureWorks database has been translated and actually does use double-byte strings. In this case, the conversion step will throw an error.

We like to test packages incrementally, and although we usually build up a bit more logic before testing, let’s start small. Test your logic at various points by setting up a flow to the Row Count transform. The only thing to set up in the Row Count transform is the package variable to write the row count to. Your template package has a RowCount variable already.

Warning 

Package variable names are case sensitive.

The point of writing to the Row Count transform for debugging purposes is not so much for counting the rows, although that’s useful. The greater value is that it lets you add a data visualizer to the pipeline. You want to examine the rows as they flow out of the Derived Column transform. You need one additional transform in order to check the output of the Derived Column transform. Any transform will do. We usually use the Row Count transform to serve the role of low-cost pipeline endpoint for debugging purposes.

Add the data visualizer by right-clicking the flow arrow between the derived column and row count transforms. Choose Data Visualizers Add. There are several options for types of visualizers, including histograms, scatter plots, and column charts, but we use the grid most often.

Figure illustrates our simple package at this point. Execute the package by highlighting it from the list of packages in the Solution Explorer; right-clicking, and choosing Execute Package. BI Studio automatically saves the package when you execute it.

Image from book
Figure: Ready to run the Promotions Package for the first time
Tip 

Always execute packages by right-clicking in the Solution Explorer. Get into this habit from the outset. If you’ve done any Visual Studio development, you know that pressing the F5 key will start execution. But in an Integration Services project, this can, and usually does, launch a package other than the one you’re looking at.

The BI Studio environment provides a lot of information to the developer while the package is executing. Transforms turn yellow while they’re executing, green when they’ve succeeded, and red if they throw an error. You can see how many rows have flowed through each step. And you can actually see the data in any data visualizers in the package. The data visualizer pauses the package’s execution; you need to detach it (or press the green arrow button) to proceed. When a package finishes executing in BI Studio, it’s left in a debugging state. You need to stop the execution by pressing the blue square box in the debug toolbar, or by selecting Stop Debugging in the Debug menu to edit any package.

Tip 

Maximize the useful design surface by setting the Solution Explorer, Toolbox, Variables, and Properties windows to auto-hide. You can get them back by moving the mouse over the window tab.

If you followed our suggestion to add an OnPostExecute breakpoint to the package template, the package will pause for you to examine package variables. You can see the RowCount along with a list of system variables by selecting the Locals tab that opened up in the lower left corner during execution. When you are finished, simply continue or stop the package’s execution.

Tip 

If you close the Locals window, you can get it back by choosing Debug Windows while the package is in Debug mode.

Load Promotions Data

The next step is to write the result set into the target table. This is going to be another throwaway step because you still need to look for whether these are new or changed rows. But it’s a good development practice to test your package bit by bit. It’s easiest to debug when you’re doing something really simple.

Delete the Row Count transform and add an OLE DB Destination adapter to your data flow diagram. Hook it up to the output from the Derived Column transform, and configure it to point to the DimPromotion table in MDWT_AdventureWorksDW. Click on the Mappings tab, and you’ll see that because you changed the column names in the source transform, the source-to-target mappings come up correctly by default. The surrogate key, PromotionKey, has no mapping. That’s because it’s an identity column that’s populated as the data is inserted into the table. Don’t worry about any of the advanced properties for now.

Tip 

You have to go to the Mappings tab in the destination adapter, even if you know the default mappings are going to work. It’s tempting to set up the connection and then click OK, but that doesn’t set up the adapter correctly. You have to click on the Mappings tab.

Now, execute this package. You’d expect to see 16 rows processed and added to DimPromotion. After the package finishes executing, run a query on DimPromotion; the 16 rows should be in there.

Hands-on Example
Image from book

This version of the Promotion package is available on the web site. It’s called Promo v.1.dtsx.

Image from book

You can execute the new package as often as you like. Each time, you add 16 rows to the target table DimPromotion. Obviously, this isn’t what you want to happen. You need to modify the package to check for whether rows exist and add them only if they are new. You also need to look for whether an existing row has been changed. We discuss these modifications in the following sections.

Tip 

The example in this section uses the OLE DB destination adapter. You could have used the SQL destination instead. In Chapter 5 we discuss the tradeoffs between the two destination adapters. In general, use OLE DB during development. If the performance of the insert is a problem, consider switching to the SQL destination, which is sometimes faster.

Extract Changed Rows

Kimball Method Etl Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystem #1 Extract System: source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to ETL environment.

Whether you’re processing dimensions or facts, you either need to be told which rows might have seen changes, or you need to figure it out in your ETL process. In Chapter 5 we discuss some of the many ways that source systems can be designed to flag changed rows. AdventureWorks, luck for us, uses a ModifiedDate column in each table.

Warning 

Explore the ModifiedDate columns or their equivalents in your organization’s source systems. We have seen many systems where ModifiedDate is managed in application logic. Any bulk updates by the DBA do not necessarily set the ModifiedDate correctly. You really want ModifiedDate to be maintained by a database trigger, as is done in AdventureWorks.

Each of your packages should have two package-level variables to bracket the date range for which the package is to be run. We call these ExtractStartDate and ExtractEndDate. Set the initial values of the variables to a useful range for the initial stages of development and testing. In production, you’ll set the variables’ values dynamically at runtime. Because most or all of your packages will use these variables, you should add them to your package template. We use a convention that the data range for the package’s run starts on ExtractStartDate, for Figure:00:00.000 and runs to but excluding the ExtractEndDate, for Figure:00:00.000. In other words, by putting the end date of 2004-08-01 in the variables value, use a “less than” on the comparison. This guarantees that you’ll get all transactions that occurred on July 31 no matter how close to midnight, and no transactions that occurred on August 1.

Tip 

The variable editor taskpad tries to be helpful when you create a date variable. It puts up a calendar date dropdown and forces you to use it. We find that more annoying than helpful. If you go to the variable’s property pane, you can just type in a date.

You could get clever and put a dynamic default value for the end date, equal to today’s date at 00:00. That way, by default the package will pull all changes up until yesterday midnight. There are two tricks to accomplishing this that are worth discussing:

  • Set the variable’s default value to an expression. Open the property pane for the variable, and set EvaluateAsExpression to True. Now you can enter an expression, like getdate(). The function getdate() includes the time part of today’s date, so you need to strip that out.

  • Strip out the time part of the date by using some cast functions. The expression should be:

    (DT_Date)(DT_I4)(DT_Date)getdate()
Warning 

In the real world, you don’t always set the end date to yesterday. You need to be able to run the package for two days ago, in case you missed a load. Use a package configuration to set the value of the variable at runtime. When setting the date from a configuration, you don’t want the variable to be set to an expression: The expression will override the configuration value. Don’t get so clever that you hurt yourself.

If you have a reliable row modified date from the source system, and some package variables with the date range you want to extract, simply parameterize the query from the source system so that it filters on the row modified date. Figure illustrates a parameterized source system query.

Image from book
Figure: Parameterized source system query

The Parameters button on the right side of the SQL command text box brings up a small window titled Set Query Parameters where you map package variables to the parameters in your query. The user interface is fine here, where there are only two parameters. But if your SQL statement has many parameters, you’ll improve readability by making that SQL statement into a stored procedure.

This query will extract only the rows that changed during the date range of interest. For daily processing, that date range refers to yesterday. In a small dimension like Promotions in the demo database, we’d expect that most days would deliver zero changes to dimension rows. You can continue to execute the rest of your package with zero rows, which shouldn’t take long, or add a check to halt package execution. Later in this chapter we discuss how to design your package so that it halts execution at appropriate points.

If no one tells you which rows have changed, you have to figure out the changes by brute force: by comparing the new rows with the existing data. For dimension tables, the easiest way to develop this logic is to use the Slowly Changing Dimension transform.

Slowly Changing Dimensions

In Chapter 2, we talked about the two main types of slowly changing dimensions:

  • Type 1: Restate history by updating the dimension row when attributes change.

  • Type 2: Track history by propagating a new dimension row when attributes change.

Kimball Method Etl Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystems:

  • #9 Surrogate Key Creation System: Robust mechanism for producing stream of surrogate keys, independently for every dimension. Independent of database instance, able to serve distributed clients.

  • #10 Slowly Changing Dimension Processor: Transformation logic for handling three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field).

  • #11 Late Arriving Dimension Handler: Insertion and update logic for dimension changes that have been delayed in arriving at the data warehouse.

  • #12 Fixed Hierarchy Dimension Builder: Data validity checking and maintenance system for all forms of many-to-one hierarchies in a dimension.

Standard Handling for Slowly Changing Dimensions

Any dimension that contains a Type 2 attribute should track the date range for which each dimension row is valid. For any dimension with a Type 2 attribute, add three columns: RowStartDate, RowEndDate, and IsRowCurrent. For every dimension member like customer, there should be one and only one current row at any one time. Older rows have their RowStartDate and RowEndDate set appropriately. Figure illustrates the logic for handling updates to a dimension with both Type 1 and Type 2 attributes.

Image from book
Figure: Logic flow for handling dimension updates

We’ve seen companies get so intimidated by this complexity that they decide to manage all dimensions as Type 1, even if that’s not what the users want. The Integration Services Slowly Changing Dimension transform is a great feature. It does most of this work for you.

When you drag the Slowly Changing Dimension (SCD) transform into the Data Flow design palette, it consists of a single rectangle like all the other transforms. When you edit it, it launches a wizard with several pages of questions. And when you finally press Finish, the wizard generates a bunch of transforms and flows for you. The generated transforms and flows do the work that’s outlined in Figure.

The wizard starts by asking you to specify the dimension table you’ll be loading. Next, identify the business key, used to tie together all the instances of a particular entity. In a customer dimension the business key is usually the account number or customer ID. Map the other columns in the input flow to the attributes in the target dimension.

The wizard next has you identify how to manage the changes in each attribute. In addition to the Types 1 and 2 (restate and track history) described previously, Integration Services includes a Fixed attribute, which should never be updated. Set the attribute change type for all the columns in your target table.

On the next screen you’re asked several housekeeping questions. Do you want the processing to fail when you encounter a change to a fixed attribute? Answer No (you rarely want processing to fail outright). Do you want the bad row to go into an error flow? Answer Yes. You’re also asked if a Type 1 change, when encountered, should update all the historical rows for the dimension entity, or just the current row. The textbook definition of a Type 1 attribute indicates you should update all the rows, and this is the recommended setting. This is a characteristic that’s set for the whole table, not for each column, which is too bad.

If you have any Type 2 attributes in your dimension, you’re next asked how to identify the current row. Do you use row start and end dates, or an indicator like IsRowCurrent?

Note 

You can’t have the SCD Wizard maintain both the row start and end date and the row current indicator. It’s one or the other. As we discuss in the next section, you can edit the results of the wizard, and you can generally make it do what you want. We find it easier to generate the SCD Wizard using the row current indicator technique, and then edit the resulting transforms to add the row date handling.

When you reach the wizard’s screen about inferred members, you are almost at the end. Inferred members are an Integration Services innovation, and help you handle the late arrival of a dimension member generated by an early arriving fact. Early arriving facts appear in your incremental processing stream before you have any dimensional data for them. If you expect to have early arriving facts, design your fact table load so that it will generate a dummy dimension row. That way you can load all of your fact data.

Creating that dummy dimension row creates a problem during dimension processing. If the dimension contains any Type 2 attributes, then as soon as you get the real information about the new dimension member, you’ll automatically propagate a new dimension row. What you really wanted to happen was to update all the missing values in the original row with the new information. And that’s exactly what the inferred dimension member support does: If all the attributes are missing it will update them in place even if it’s a dimension with Type 2 attributes.

Perhaps you’re exasperated with the complexity of this wizard, although we’ll point out that it’s a complex problem. Your reward comes when you press the Finish key and see all the transforms that have been created for you, as illustrated in Figure. These objects are exactly what’s been generated by running through the wizard for a hybrid dimension with Type 1 and Type 2 attributes and inferred member support. We rearranged and renamed the transforms to improve readability, and labeled the data flow diagram branches with 1, 2, 3, and 4 to correspond to the branches in our logical flow in Figure.

Image from book
Figure: Results of running the Slowly Changing Dimension Wizard
Hands-on Example
Image from book

Set up the Slowly Changing Dimension Wizard on the Promotions package. Delete the target transform to DimPromotion, and set up the SCD transform in its place. Because the Promotion dimension attributes are all Type 1, the output isn’t as interesting as the customer example illustrated in Figure. Go through the wizard again, setting up one of the attributes as Type 2. You can see how different flows are created.

Image from book

The Slowly Changing Dimension transform will meet many projects’ requirements without any further changes. However, there are circumstances where you need to do something tricky, or circumvent the wizard altogether. The next few sections discuss some advanced topics around handling dimension changes.

Custom Handling for Slowly Changing Dimensions

You will probably want to customize the output from the SCD Wizard. There are even a few cases where it makes sense for you to develop custom handling for dimension changes.

The SCD Wizard will identify the current row for an entity like customer in one of two ways: with a True/False (or Yes/No) indicator, or with a range of dates for which the row is valid. If you choose the date range approach, the SCD transform will look for the single row for each natural key that has a null end date.

We recommend that you use both the row current indicator technique and the valid date range technique. Also, set the end date to a date far in the future, rather than leave it null. Can you still use the SCD Wizard? Yes you can, but you need to modify the generated objects in order to populate your dimension the way you want. Use the row current indicator to identify the active row, and then modify the computed column transform that’s between the Union All and the OLE DB transforms, in such a way that it calculates the row start and end dates the way you want them. This is easy to do, but be warned: If you need to go through the wizard again, perhaps to switch an attribute from Type 1 to Type 2 handling, you’ll lose any customizations you’ve made to the generated objects.

Note 

The objects generated by the SCD Wizard are standard Integration Services transforms. You can edit them to do whatever you like.

The only logic that you can’t modify is the comparison logic that’s hidden in the SCD transform itself. Under the covers, this transform takes a row in the pipeline and compares it to the current row in the dimension table; determines if the pipeline row is new or an update; and sends the row to the correct downstream path or paths. Except by going through the wizard again, you can’t change any of the logic in this transform.

Row Changed Reason

The Type 2 Slowly Changing Dimension logic doesn’t explicitly track which attributes changed when it creates a new row for an entity like customer. All you know is that something changed. Our Type 2 dimension tables include a Row Changed Reason column. This column should indicate which attributes changed, causing the creation of a new dimension row.

The SCD transform does not supply this information. If you use the wizard—which you almost always want to do—you can’t populate this column. If this information is really important to your business users, you must do one of the following:

  • Develop a post-processing step that runs through the dimension rows added today and figures out what changed.

  • Abandon the SCD transform, and manage dimension changes using fundamental Integration Services transforms.

Reference 

For additional detail on tracking the row changed reason, please refer to the following sources:

  • Search kimballgroup.com for the topic Row Changed Reason for a related article.

  • Visit the book’s web site (www.MsftDWToolkit.com) for a few simple examples of the options described.

Performance of the Slowly Changing Dimension Transform

It makes sense that you can’t modify the internal logic of the SCD transform, but this restriction presents a problem for large dimensions. Unfortunately, the comparison logic is not blazingly fast. Use the SCD transform for as many dimensions as you can. It’s well thought out and tested, and using it will simplify your development process. For very large problems, you may have to custom code the dimension change logic.

The best way to find out if the SCD transform works for you is to test it against your data in your environment. But we can provide some guidance. There are two main parameters: the size of the data flow you’re sending through the SCD transform, and the size of the target dimension.

Tip 

Minimize the number of rows going into the SCD transform. The SCD transform can handle rows in the pipeline that are unchanged: neither inserts nor updates. But if you can efficiently filter out those rows in advance, do so.

If the flow and the target dimension are both small—say less than 10,000 rows—use the Slowly Changing Dimension transform. If the input rowset is small and the dimension is large, the SCD transform should still work acceptably quickly.

If the input rowset is large and the dimension is small, it’s probably because this is a first time load of the dimension table. You might want to build a separate package to handle the historical load of a dimension with Type 2 changes, and simply circumvent the SCD Wizard during the historical load. Other circumstances where the input rowset is large yet the dimension is small are also probably better handled by cleaning up the input rowset using custom logic.

Finally, we reach the tough problem: The input rowset and the dimension are both large. The most obvious solution is better hardware, especially 64-bit with large memory. But that may be too expensive, or not speed the processing sufficiently. You may have to build dimension change logic by hand, as people have done for ages past.

To hand-craft the dimension change logic, use a hash or checksum function to speed the comparison process. Add to your dimension table two new housekeeping columns: HashType1 and HashType2. Soon you’ll have more housekeeping columns than real attributes! Into the HashType1 column place a hash of a concatenation of the Type 1 attributes; similarly for HashType2. Hashing algorithms are commonly used for encryption. The algorithms convert a very long string into a much shorter string that is guaranteed to be almost unique. (Computer scientists, forgive me.) This is a common technique for high performance dimension change management. Compute the hashes and store them in the dimension table. Then compute the hashes on the incoming rowset and compare to the stored values. The comparison on a single, relatively short string column is far more efficient than pair-wise comparison on dozens of separate columns.

Note 

By far the easiest way to compute the hashes is to use the Checksum transform available for download from www.sqlis.com.

Alternatively, use one of the many algorithms for computing hashes. Among those available to you from within Integration Services are:

  • System.Security.Cryptography.KeyedHashAlgorithm

  • System.Security.Cryptography.MD5

  • System.Security.Cryptography.SHA1

  • System.Security.Cryptography.SHA256

  • System.Security.Cryptography.SHA384

  • System.Security.Cryptography.SHA512

These algorithms have different performance and uniqueness characteristics. Experiment in your own environment. If you use a System.Security algorithm, you’ll implement Script Transform. Later in this chapter we walk through an example of creating a Script Transform. Script Transforms are easy to develop, extremely powerful, and are not inherently slow (although you could certainly code a slow transform!).

Here is an outline of the logic flow for custom-coded dimension change management:

  1. Filter out unchanged rows as early in the flow as possible.

  2. Identify new rows as cheaply as possible. Some source systems keep track of the date an entity (like customer) was added to the system, let’s hope in addition to the date the row was last modified. If this information is reliable, by all means use it to separate new rows into a separate stream for insert into the target dimension table.

  3. Compute HashType1 and HashType2 on the incoming stream, as described previously.

  4. Perform an Integration Services Lookup to the active dimension rows in the dimension table, matching on the natural key. Keep the columns from the incoming rowset, plus the surrogate key and hashes from the dimension table.

    1. Create an Error Flow on the Lookup transform to handle failures from the lookup.

    2. Lookup failures are new dimension members not already in the dimension table. Flow the error stream into the dimension table.

  5. Write a script component that creates streams for Type 1 and Type 2 changes. The script component will create two output streams.

    1. Compare the incoming HashType1 to the HashType1 that you picked up in the Lookup or Merge Join.

    2. If the incoming and stored hashes differ, output to the Type 1 output flow.

    3. Repeat for the Type 2 comparison.

    4. You must write a script rather than use the Conditional Split transform because some rows will go to both the Type 1 and Type 2 output flows. Conditional Split sends a row to one output or the other, but not both. Writing this script component sounds really ugly, but it’s approximately 20 lines of code.

  6. Recreate the logic to perform the updates and inserts, similar to what is automatically generated by the SCD transform.

We can think of even more complicated strategies to improve dimension processing performance and functionality. If nothing else, we expect this discussion has convinced you what a great feature the SCD transform is. We didn’t discuss Fixed attributes or Deferred Members in this custom example. That logic is left as an exercise for the reader.

Reference 

The book Professional SQL Server 2005 Integration Services, by Knight, Chaffin, Barnes, Warren, and Young (Wrox, 2006), is an important reference if you need to further improve the performance for processing dimension changes.

An example package that implements the custom dimension change handling logic is available on the book’s web site, www.MsftDWToolkit.com.

Recreating Dimension Change History

If your dimension has one or more Type 2 attributes, you should recreate history for the dimension’s initial load. Some source systems keep track of all the historical addresses associated with a customer, for example, and the date ranges for which those addresses were active. Working with the business users, decide at the outset whether you will recreate dimension changes. Once the dimension is built and associated facts are loaded with the appropriate surrogate keys, it is usually too expensive to go back and rebuild.

As usual, there are many possible solutions to this problem. One alternative is to create a package with a loop that executes the incremental dimension load Data Flow task for each day in the historical time period.

Alternatively, construct a data flow that contains the current image for each dimension member and all the historical changes as well. In the case of a customer dimension that triggers a new row when the address changes, the data flow pipeline would contain one row for each combination of customer and address. Include the date the address became effective. Usually multiple attributes trigger Type 2 changes: include as many as you have historical data for.

Use the Sort transform to sort by natural key and by RowStartDate descending.

Next, develop a Script transform that works through the set of rows associated with each dimension entity like customer. Because the data is sorted descending by RowStartDate, the first row that the script encounters for a new customer will become the active row. Use one row’s start date to set the end date for the next row in the sequence.

Sample Script Transform

In the preceding sample, we blithely talked about creating a script transform. This may sound intimidating, but it’s really easy. Here is an example of the script transform code to set the end dates and row current indicators:

Listing 6.1: Script transform to set end dates
Image from book
Public Class ScriptMain
    Inherits UserComponent

    'These variables retain their values from row to row
    Dim CurrentEntity As String = ""  ' Change data type to match BusinessKey
    Dim NextEndDate As Date

    Public Overrides Sub Input0_ProcessInputRow( ByVal Row As Input0Buffer)
        '
        ' Kimball Group, Microsoft Data Warehouse Toolkit
        ' Example script component used to set row end dates and row current
        ' indicators for the historical load of a dimension with type 2
        ' attributes.
        '
        Dim FirstEndDate As Date

        If Row.BusinessKey <> CurrentEntity Then
            CurrentEntity = Row.BusinessKey      'We have a new group
            NextEndDate = Row.StartDate.AddDays(-1)     'Set next end date
            Row.IsRowCurrent = "Y"
            'We set all row end dates to 9999-12-31 in a computed column
            'before we run this script transform. Leave that value for
            'this row, which is the currently active row.
        Else
            'Fix up end date and is current indicator for the inactive rows
            Row.EndDate = NextEndDate
            Row.IsRowCurrent = "N"
        End If

    End Sub

End Class
Image from book

As you can see, the code is simple. This same logic will work for any dimension, except you’ll need to rename the columns like Row.NaturalKey as appropriate. You may want to do something more complex in your script, but this sample provides the basic structure.

It’s more complicated to set up a Script transform than most Integration Services transforms. In the Input Column setup tab, you must choose which columns will flow through the transform. If you don’t choose a column, it’s thrown away. Also note that you can choose whether an input column is ReadOnly or ReadWrite. In the example illustrated here, we made the EndDate column be ReadWrite.

To add columns to your flow, go to the Column Properties setup tab. Add an output, and then add columns to that output. If you’re simply replacing a ReadWrite input attribute you don’t need to do this, but if you want a new column, you do.

Finally, go to the Script tab of the Script Transform Editor, and click the Design Script button. Integration Services launches a VB.NET editor for you, and autogenerates a bunch of code. Compare the sample script here to what’s automatically generated; you can see how few lines of code—eleven!—we wrote in order to solve this little problem.

Integration Services keeps track of the code you’ve written, integrating it into the package. If you were ambitious enough to open an Integration Services package in Notepad or your favorite XML editor, you’d see it in there. The point is that you don’t need to worry about where to store the source code; it’s part of your package.

Reference 

You can learn more about scripts and the script transform in the book The Rational Guide to Scripting SQL Server 2005 Integration Services Beta Preview, (Rational Guides, 2005) by Donald Farmer.

De-Duplication and the Fuzzy Transforms

A great ETL challenge is to de-duplicate information, like customer accounts. A common situation is a retail operation that’s operated by phone and direct mail now opens a web ordering application. It’s easy to have the same customer or family have many account numbers. Similarly, if your business has grown by acquisition you’re likely to have the same customer—and even product—reflected in the source systems of the different business entities. If your business users are interested in analyzing the history of a customer’s purchases, some de-duplication may be necessary. De-duplication is also necessary when you integrate data from multiple source systems even within a single company, or combine multiple individuals’ accounts into a family or household group.

Kimball Method Etl Subsystem 

This section describes techniques relevant to Kimball Method ETL Subsystem #4 Data Cleansing System: Typically a dictionary-driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. “De-duplication” includes identification and removal usually of individuals and organizations, possibly products or locations. Often uses fuzzy logic. “Surviving” using specialized data merge logic that preserves specified fields from certain sources to be the final saved versions. Maintains back references (like natural keys) to all participating original sources.

De-duplication of existing data is usually done as an analytic project, often as part of the development of a customer-oriented DW/BI system. It’s a labor-intensive process. Tools help a lot, but a person needs to be looking at results and making judgments about where correlations exist and what techniques are working. At the end of this process of trial and error, you’ll have a handful of techniques, and an order in which to apply those techniques that work well with your data. You’ll make a one-time pass to de-duplicate existing data and integrate these de-duplication transforms into your incremental loads.

Integration Services is a productive environment in which to conduct a de-duplication project. The Data Flow task visualizers and Data Source View data viewers are hugely valuable for investigating the data. The Fuzzy Lookup and Fuzzy Grouping transforms are particularly valuable for de-duplication. These transforms find one or more matches in an incoming row to a second data set, using one or more columns in each data set. The underlying algorithm is sophisticated. Unlike some similar algorithms, it’s not specific to a language (like English), but instead is based on pattern matching. The Fuzzy transforms return both matches and measures of similarity and confidence.

Microsoft has published a number of excellent examples on using the Fuzzy transforms, so we won’t repeat usage details here. However, we can’t refrain from mentioning one important best practice. The algorithm used by the Fuzzy transforms is efficient, but it’s not magic. Always use a standard lookup first to find perfect matches. Then use a conditional split to divide your data flow diagram and use Fuzzy matching only on the unmatched data.

Reference 

Search on MSDN (www.msdn.com) for “fuzzy lookup” to find articles and examples about the fuzzy transforms and de-duplication.

Previous Section
Next Section


 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows