Data Mining Examples





Previous Section
Next Section

Data Mining Examples

For many of us, the best way to learn something is by doing it. This is especially true for technical folks, and even more so for data mining. You really need to work through the SQL 2005 Data Mining Tutorial before you run through this example. This will give you a chance to get familiar with the tools and the user interface, which will make these examples easier to understand and follow. If you haven’t already done so, please take the time to go work through the data mining tutorials now.

In this section we start out with a simple example to get a feel for the data mining tool environment and the data mining process. In fact, the first example is so simple that its case set of economic data can be presented on a single page. Then we dig into a more detailed example based on the Adventure Works Cycles data mining tutorial data. Both of these examples will follow the general flow of the data mining process presented in the previous section.

Case Study: Classifying Cities

This first example is a small, simplified problem designed to provide a clear understanding of the process. The dataset has only 48 cases total—not nearly enough to build a robust data mining model. However, the small size allows us to examine the inputs and outputs to see if the resulting model makes sense. The scenario is based on a large non-governmental organization with a mission and operations much like that of the World Bank:

(T)o fight poverty and improve the living standards of people in the developing world. It is a development Bank that provides loans, policy advice, technical assistance and knowledge sharing services to low and middle income countries to reduce poverty. The Bank promotes growth to create jobs and to empower poor people to take advantage of these opportunities. (http://web.worldbank.org/)

Classifying Cities: Business Opportunity

The data miner on the Bank’s DW/BI team held meetings with key directors and managers from around the organization to identify business opportunities that are supported by available data and can be translated into data mining opportunities. During these meetings it became clear that there were not enough resources to properly customize the Bank’s programs. The Bank had historically focused its efforts to provide financial aid at the country level. Several economists felt that targeting policies and programs at the city level would be more effective, allowing for the accommodation of unique regional and local considerations that are not possible at the country level.

A switch to the city level would mean the people who implemented the Bank’s programs would need to deal with potentially thousands of cities rather than the 208 countries they were working with around the world. This switch would significantly expand the complexities of managing the economic programs in an organization that was already resource limited.

The group discussed the possibility of designing programs for groups of cities that have similar characteristics. If there were a relatively small number of city groups, the economic analysts felt it would be possible to design the appropriate programs.

Classifying Cities: Data Understanding

During the meetings, there was discussion about what variables might be useful inputs. The analysts had a gut feel for what the city-groups might be, and some initial guesses about which variables would be most important and how they could be combined. The group came up with a list of likely variables. The data miner combed through the organization’s databases to see if these inputs could be found, and to see what other relevant information was available. This effort turned up 54 variables that the organization tracked, from total population to the number of fixed line and mobile phones per 1,000 people. This wealth of data seemed too good to be true, and it was. Further investigation revealed that much of this data was not tracked at the city level. In fact, there were only ten city-level variables available. The data miner reviewed this list with the business folks and the group determined that of the ten variables, only three were reliably measured. Fortunately, the group also felt these three variables were important measures of a city’s economic situation. The three variables were average hourly wages (local currency), average hours worked per year, and the average price index.

At this point, the data miner wrote up a description of the data mining opportunity, its expected business impact, and an estimate of the time and effort it would take to complete. The project was described in two phases; the goal of the first phase was to develop a model that clusters cities based on the data available. If this model made sense to the business folks, phase two would use that model to assign new cities to the appropriate clusters as they contact the bank for financial support. The data mining opportunity description was reviewed by the economists and all agreed to proceed.

Classifying Cities: Data Preparation

The data miner reviewed the dataset and realized it needed some work. First, the average hourly wages were in local currencies. After some research on exchange rates and discussion with the economists to decide on the appropriate rates and timing, the data miner created a package to load and transform the data. The package extracted the data from the source system and looked up the exchange rate for each city and applied it to the wages data. Because the consumer price data was already indexed relative to Zurich, using Zurich as 100, this package also indexed the wage data in the same fashion. Finally, the package wrote the updated dataset out to a separate table. The resulting dataset to be used for training the cluster model, containing three economic measures for 46 cities around the world, is shown in Figure.

Figure: The City Economic Data Training Set After Data Preparation

OBS.

CITY NAME

AVG WORK HRS

PRICE INDEX

WAGE INDEX

1

Amsterdam

1714

65.6

49

2

Athens

1792

53.8

30.4

3

Bogota

2152

37.9

11.5

4

Bombay

2052

30.3

5.3

5

Brussels

1708

73.8

50.5

6

Buenos Aires

1971

56.1

12.5

7

Cairo

NULL

37.1

NULL

8

Caracas

2041

61

10.9

9

Chicago

1924

73.9

61.9

10

Copenhagen

1717

91.3

62.9

11

Dublin

1759

76

41.4

12

Dusseldorf

1693

78.5

60.2

13

Frankfurt

1650

74.5

60.4

14

Geneva

1880

95.9

90.3

15

Helsinki

1667

113.6

66.6

16

Hong Kong

2375

63.8

27.8

17

Houston

1978

71.9

46.3

18

Jakarta

NULL

43.6

NULL

19

Johannesburg

1945

51.1

24

20

Kuala Lumpur

2167

43.5

9.9

21

Lagos

1786

45.2

2.7

22

Lisbon

1742

56.2

18.8

23

London

1737

84.2

46.2

24

Los Angeles

2068

79.8

65.2

25

Luxembourg

1768

71.1

71.1

26

Madrid

1710

93.8

50

27

Mexico City

1944

49.8

5.7

28

Montreal

1827

72.7

56.3

29

Nairobi

1958

45

5.8

30

New York

1942

83.3

65.8

31

Nicosia

1825

47.9

28.3

32

Oslo

1583

115.5

63.7

33

Panama

2078

49.2

13.8

34

Paris

1744

81.6

45.9

35

Rio de Janeiro

1749

46.3

10.5

36

Sao Paulo

1856

48.9

11.1

37

Seoul

1842

58.3

32.7

38

Singapore

2042

64.4

16.1

39

Stockholm

1805

111.3

39.2

40

Sydney

1668

70.8

52.1

41

Taipei

2145

84.3

34.5

42

Tel Aviv

2015

67.3

27

43

Tokyo

1880

115

68

44

Toronto

1888

70.2

58.2

45

Vienna

1780

78

51.3

46

Zurich

1868

100

100

Note 

This data actually comes from the Economics Research Department of the Union Bank of Switzerland. The original set contains economic data from 48 cities around the globe in 1991. It has been used as example data for statistics classes and can be found on many web sites. Try searching for “cities prices and earnings economic (1991) globe”.

Observe that two of the cities are missing data. The data miner might opt to exclude these cities from the dataset, or include them to see if the model can identify them as outliers. This would help spot any bad data that might come through in the future.

Classifying Cities: Model Development

Because the goal in the first phase was to identify groups of cities that have similar characteristics, and those groupings were not predetermined, the data miner decided to use the Microsoft Clustering algorithm. After creating a new Analysis Services project in BI Studio and adding a data source and data source view that included the City dataset, the data miner right-clicked on the Mining Structures folder and selected New Mining Structure. This opened the Data Mining Wizard to create a new mining structure. The wizard asks if the source is relational or OLAP, and then it asks which data mining technique should be used. The data miner chose Microsoft Clustering and specified the city name as the key and the other columns as input, checking the “Allow drill through” box on the final wizard screen. After the wizard was finished, the data miner clicked the Mining Model Viewer tab, which forced the build/ deploy/process steps. When this whole process completed, BI Studio presented the cluster diagram shown in Figure.

At first glance, this diagram isn’t that helpful. Without knowing anything about the problem, the Microsoft Clustering algorithm is limited to names like Cluster 1 and Cluster 2. At this point, the data miner can explore the model by using the other tabs in the Mining Model Viewer tab, and by using the drillthrough feature to see which cities have been assigned to which nodes. The data miner can also use the Microsoft Mining Content Viewer to examine the underlying rules and distributions. The data in Figure is grouped by cluster and was created by drilling through each cluster and copying the underlying list into a spreadsheet.

Figure: City Economic Data Listed by Cluster

CLUSTER

CITY NAME

AVG WORK HRS

PRICE INDEX

WAGE INDEX

Cluster 1

Bogota

2152

11.5

37.9

 

Bombay

2052

5.3

30.3

 

Buenos Aires

1971

12.5

56.1

 

Caracas

2041

10.9

61.0

 

Kuala Lumpur

2167

9.9

43.5

 

Lagos

1786

2.7

45.2

 

Mexico City

1944

5.7

49.8

 

Nairobi

1958

5.8

45.0

 

Panama

2078

13.8

49.2

 

Rio de Janeiro

1749

10.5

46.3

 

Sao Paulo

1856

11.1

48.9

Cluster 2

Athens

1792

30.4

53.8

 

Hong Kong

2375

27.8

63.8

 

Johannesburg

1945

24.0

51.1

 

Lisbon

1742

18.8

56.2

 

Nicosia

1825

28.3

47.9

 

Singapore

2042

16.1

64.4

Cluster 7

Seoul

1842

32.7

58.3

 

Tel Aviv

2015

27.0

67.3

Cluster 6

Houston

1978

46.3

71.9

 

Los Angeles

2068

65.2

79.8

 

Taipei

2145

34.5

84.3

Cluster 4

Brussels

1708

50.5

73.8

 

Chicago

1924

61.9

73.9

 

Dusseldorf

1693

60.2

78.5

 

Frankfurt

1650

60.4

74.5

 

Luxembourg

1768

71.1

71.1

 

Montreal

1827

56.3

72.7

 

New York

1942

65.8

83.3

 

Sydney

1668

52.1

70.8

 

Toronto

1888

58.2

70.2

 

Vienna

1780

51.3

78.0

Cluster 5

Amsterdam

1714

49.0

65.6

 

Copenhagen

1717

62.9

91.3

 

Dublin

1759

41.4

76.0

 

London

1737

46.2

84.2

 

Madrid

1710

50.0

93.8

 

Paris

1744

45.9

81.6

Cluster 3

Geneva

1880

90.3

95.9

 

Helsinki

1667

66.6

113.6

 

Oslo

1583

63.7

115.5

 

Stockholm

1805

39.2

111.3

 

Tokyo

1880

68.0

115.0

 

Zurich

1868

100.0

100.0

Cluster 8

Cairo

  

37.1

 

Jakarta

   

43.6

Figure has a few interesting items. The clusters seem to make general sense given a basic understanding of the world economic situation in 1991. Many of the poorer cities of South and Central America and Africa ended up in Cluster 1. The two cities that had only price data, Cairo and Jakarta, are on their own in Cluster 8, and that cluster was drawn a distance from the other six in Figure. Cluster 8 seems to be the data anomalies cluster. Beyond this, it’s difficult to see what defines a given cluster from the data table. The underlying rules are too complex to reverse engineer by sight. The Microsoft Mining Content Viewer can help by revealing the equations that determined each cluster, but a graphical representation will usually reveal much more than written equations.

Figure converts the data from Figure into a graphical form. Each city is grouped by cluster along the X axis, with work hours plotted on the left Y axis and wage and price indexes on the right Y axis. The order of the clusters was determined by starting with Cluster 1 in Figure and following the links. This graph allows the viewer to get a better sense for what makes up the clusters.

Image from book
Figure: A graphical view of the city clusters

Even though Figure is a bit busy, it helps clarify the commonalities within each cluster. The cloud-like circles in Figure highlight the Price Index data points. The horizontal line in the graph indicates the average Price Index of about 69.2. Clusters 6, 4, and 3 on the upper part of the right side of the graph have a Price Index above average, while Clusters 1, 2, and 7 on the left side are all below average. This graph also makes it easier to identify the characteristics of the individual clusters. For example, Cluster 4 includes cities whose wages are relatively high, with prices that are at the low end of the high price range, and relatively low work hours. These are the cities that make up the day-to-day workforce of their respective countries—we might call this cluster the “Heartland” cities. Cluster 1, on the other hand, has extremely low wages, higher work hours than most, and relatively lower prices. These are the newly developing cities where labor is cheap and people must work long hours to survive. You might call this cluster the “Hard Knock Life” cities. The Bank would likely call it something a bit more politically correct, like the “Developing Cities” cluster. As we mentioned earlier, good names can help crystallize the defining characteristics of each cluster.

Classifying Cities: Model Validation

Some data mining models are easier to test than others. For example, when you’re building a prediction model, you can create it based on one historical dataset and test it on another. Having historical data means you already know the right answer, so you can feed the data through the model, see what value it predicts, and compare it to what actually happened.

In the City cluster example, validating the model is not so straightforward. You don’t know the classification ahead of time, so you can’t compare the assigned cluster with the right cluster—there is no such thing as “right” in this case. Validation of this model came in two stages. First, the data miner went back to the economic analysts and reviewed the model and the results with them. This was a reasonableness test where the domain experts compared the model with their own understanding of the problem. The second test came when the model was applied to the original business problem. In this case the team monitored the classification of new cities to make sure they continued to stand up to the reasonableness test. Ultimately, the question was “Do the classifications seem right, and do they help simplify how the organization works at the city level?” Is it easier and more effective to work with eight clusters rather than 46 (or 4,600) cities?

Classifying Cities: Implementation

Once the team decided the model would work, or at least that it was worth testing in an operational environment, it was time to move it into production. This can mean a whole range of activities, depending on the nature of the model and the business process to which it’s being applied. In this city classification example, new economic data at the city level can arrive at any time. The team decided to have this new data entered into a table in the database and assign clusters to it in a batch process once a night.

The data miner wrote a simple Integration Services package that reads in unassigned cities, submits them to the data mining model for cluster assignment, and writes the full record out to a table called CityMaster with a batch date identifying when the cluster was assigned. Figure shows what the data flow for this simple package might look like. This flow has a data viewer inserted right after the Data Mining Query task showing the output of the task. Compare the cluster assignments for Manila (Cluster 2), and Milan (Cluster 7), with the other cities in those clusters in Figure. Do these assignments pass the reasonableness test?

Image from book
Figure: An Integration Services package to assign clusters to new cities

Implementation would also integrate this package into the rest of the nightly ETL process. The package should include the standard data and process audit functions described in Chapter 6. Ultimately, the process should be part of the package that manages the City dimension.

This nightly data mining batch process is a common one in many DW/BI systems, using a data mining model to populate a calculated field like a Default Risk score across all loan records or a Credit Rating across all customers. These scores can change depending on customer behaviors, like late payments or deposit balances. Every loan and customer may have to be rescored every night. The same batch process can be used on a one-time basis to address opportunities like identifying customers who are likely to respond to a new product offering.

Classifying Cities: Maintenance and Assessment

As the data changes, the data mining model will likely change as well. In this case, the data mining model should be re-built as data about additional cities is collected, or data about cities that have already been assigned a cluster is updated. The team would then review the resulting clusters to make sure they make sense from a business perspective. Given the nature of the business process and the changing data, this review should probably happen on a regularly scheduled basis—perhaps monthly to start, then quarterly as the model and data stabilize.

This example shows how classification can be used to improve an organization’s business processes. Although this dataset is too small to produce reliable results, it does generate a model that makes sense and shows how that model can be applied to new incoming data.

Case Study: Product Recommendations

The ability to recommend products that might be particularly interesting to a given customer can have a huge impact on how much your customers purchase. This example follows the data mining process from identifying the business requirements through to the implementation of a product recommendation data mining model. It’s based on the Adventure Works Cycles dataset provided with SQL Server 2005. Recall from Chapter 1 that Adventure Works Cycles is a manufacturer, wholesaler, and Internet retailer of bicycles and accessories. While Microsoft has taken pains to build some interesting relationships into the data, this data may not be completely real. That is, the purchasing behaviors might not be drawn from the real world, but may in fact be made up. Data mining can help us find only relationships that are in the data.

The SQL Server data mining tutorial steps you through the process of building a model that predicts whether or not someone will be a bike buyer. While this is interesting information, it doesn’t help you figure out what to display on the web page. Even if you’re pretty certain someone will be a bike buyer, you don’t know which bike to show them. Also, what products should you show all those folks who are not bike buyers? Our goal in this example is to create a data mining model that produces a custom list of specific products that you can show to any given web site visitor based on demographic information you might have. To the extent that the custom product list is more appealing than a random list of products, the visitor is more likely to make a purchase.

In this example, we try to give you enough information to work through the model creation process yourself. It’s not a complete step-by-step tutorial, but if you’ve already worked through the SQL Server data mining tutorial, you should be able to follow along and see how it works for yourself.

Product Recommendations: The Business Phase

Recall that the business phase of the data mining process involves identifying business opportunities, and building an understanding of the available data and its ability to support the data mining process. The Adventure Works Cycles data is more complete than many real-world systems we’ve seen, and it has more than enough customers and purchases to be interesting.

Product Recommendations: Business Opportunities

In many companies, the DW/BI team has to sell the idea of incorporating data mining into the business processes. This is usually either because the business folks don’t understand the potential or because the business will have to change the transaction system, which is a big and scary task. Sometimes, the team gets lucky and data mining starts with a request from the business community. This is how it worked in the Adventure Works Cycles example. One of the folks in marketing who is responsible for e-commerce marketing came to the DW/BI team asking for ways to boost web sales. Web sales accounted for about $9,000,000 in the first half of 2004, or about one-third of Adventure Works Cycles total sales. The Marketing group has created a three-part strategy for growing the online business: Bring in more visitors (Attract), turn more visitors into customers (Convert), and develop long-term relationships with customers (Retain). The marketing person who came to the DW/BI team is responsible for the Convert strategy—that is, for converting visitors to customers.

Tip 

In a case like this, if the Marketing person has a PowerPoint presentation that goes into detail on the marketing strategy, the data miner should review it. We feel sure they have such a presentation.

Because the marketing person is responsible only for conversion, the team will investigate alternatives for increasing the conversion of visitors to customers. They will check to see if the final model also increases the average dollars per sale as a beneficial side effect. After some discussion, the team decided that influencing purchasing behavior (conversion) with relevant recommendations is likely to be the best way to achieve their goals. They translated the idea of recommendations into specifics by deciding to dedicate one section of the left-hand navigation bar on the e-commerce web site to hold a list of six recommended products that will be tailored to the individual visitor. While the marketing person investigated the level of effort required to make this change on the web site, the data miner dug into the availability of relevant data.

Product Recommendations: Data Understanding

After some research, the data miner discovered that when visitors come to the Adventure Works Cycles web site, they are asked to fill out an optional demographics form (“to better serve them”). A few queries revealed that about two-thirds of all visitors actually do fill out the form. Because the form is a required part of the purchase process, the information is also available for all customers. In either case, this demographic information is placed in a database and in cookies in the visitor or customer’s browser. The DW/BI system also has all historical purchasing behavior for each customer at the individual product and line item level. Based on this investigation, the data miner felt that sufficient data was available to create a useful mining model for product recommendations.

The data miner knew from experience that behavior-based models (like purchases or page views) are generally more predictive than demographic-based models. However, there were several opportunities to make recommendations where no product-related behavioral data is available but demographic data is available. As a result, the data miner believed that two data mining models might be appropriate: one to provide recommendations on the home page and any non-product pages, and one to provide recommendations on any product-related pages. The first model would be based on demographics and would be used to predict what a visitor might be interested in given their demographic profile. The second model would be based on product interest as indicated by the product associated with each web page they visit or any products added to their cart.

At this point, the data miner wrote up a Data Mining Opportunity document to capture the goals, decisions, and approach. The overall business goal was to increase conversion rates with an ancillary goal of increasing the average dollars per sale. The strategy was to offer products that have a higher probability of being interesting to any given visitor to the web site. This strategy breaks down into two separate data mining models, one based on demographics and one based on product purchases. This decision was considered a starting point with the understanding that it would likely change during the data mining phase. This example goes through the creation of the demographics-based model. The product-based model is left as an exercise for the reader.

The team also agreed on metrics to measure the impact of the program. They would compare before and after data, looking at the change in the ratio of new customer purchases (conversions) to the total unique visitor count in the same time periods. They would also examine the change in the average shopping cart value at the time of checkout. A third impact measure would be to analyze the web logs to see how often customers viewed and clicked on a recommended link.

Product Recommendations: The Data Mining Phase

With the opportunity document as a guide, the data miner decided to begin with the demographics-based model. This section follows the development of the model from data preparation to model development and validation.

Product Recommendations: Data Preparation

The data miner decided the data source should be the AdventureWorksDW relational database. Because no web browsing data was available yet, sales data would be used to link customer demographics to product preferences—if someone actually bought something, they must have had an interest in it. The advantage of sourcing the data from the data warehouse is that it has already been through a rigorous ETL process where it was cleaned and transformed to meet basic business needs. While this is a good starting point, it’s often not enough for data mining.

The data miner’s first step was to do a little data exploration. This involved running some data profiling reports and creating some queries that examined the contents of the source tables in detail. Because the goal is to relate customer information to product information, there are two levels of granularity to the case sets. The demographic case set is generally made up of one row per observation: in this example, one row per customer. Each row has the customer key and all available demographics and other derived fields that might be useful. The product sales case set is at a lower level of detail, involving customers and the products they bought. Each row in this case set has the customer key and the product model name of the purchased product along with any other information that might be useful. Each customer case set row has a one-to-many relationship with the product sales case set (called a nested case set). You could create a single case set by joining the demographics and sales together up front and creating a denormalized table, but we prefer to rely on the data mining structure to do that for us.

After reviewing the source data in the AdventureWorksDW database, the data miner decided to pull the demographic case data from the DimCustomer table and combine it with other descriptive information from the DimGeography and DimSalesTerritory tables, and to pull the product purchasing case data from the InternetSalesFact table along with some product description fields from DimProduct and related tables. The data exploration also helped the data miner identify several additional transformations that might be useful in creating the data mining model. These are shown in Figure.

Figure: Additional Transformations Used to Create the Example Case Set

TRANSFORMATION

PURPOSE

Convert BirthDate to Age

Reduce the number of distinct values by moving from day to year, and provide a more meaningful value (Age) versus YearOfBirth.

Calculate YearsAsCust

DATEDIFF the DateFirstPurchase from the GetDate() to determine how many years each case has been a customer. This may help as an indicator of customer loyalty.

Create bins for YearlyIncome

Create a discrete variable called IncomeGroup to use as input to algorithms that cannot handle continuous data. Note: This is optional when creating the case set because binning can also be accomplished in the Mining Structure tab, or through the Data Mining Wizard.

As a critical part of data preparation, the data miner recognized the need to split the historical dataset into two subsets: one to train the initial model, and one to test its effectiveness. After a bit of customer count experimentation, the data miner decided to build an Integration Services package to randomly select 18,000 customers from the customer dimension and send 90 percent into the training set and the rest into the test set. The data flow shown in Figure is the part of this package that selects the customers, adds the derived columns, splits the set into test and training datasets, and writes them out to separate tables called DMTestSet and DMTrainSet.

Image from book
Figure: An Integration Services data flow to create test and training datasets
Tip 

This approach works for the Adventure Works Cycles customer dataset because there are only 18,484 customers total. If you have millions of customers, you might look for a more efficient way to extract training and test subsets. One possible approach is to use the last few digits of the customer key (as long as it is truly random). For example, a WHERE clause limiting the last two digits to “42” will return a 1 percent subset.

Another data flow elsewhere in this package uses the two customer sets to extract the product orders for the selected customers and write them out to a table called DMCustPurch. This is the nested product case set. Depending on how rapidly the product list changes, it might make sense to limit the datasets to only those products that have been purchased in the last year and their associated customers.

You can see the tables for the training and test datasets in the Data Source View in Figure. Figure also includes the data model for the nested product case set.

Image from book
Figure: The ProductRecs datasets, presented as a data source view

The nested product case set has one or more rows for each customer. Just the fact that someone with a certain set of demographics bought a certain product is all the information you need. Notice from Figure that the data miner decided to include some additional fields from the orders fact table that will not play a role in making recommendations but may be helpful in troubleshooting the dataset.

Tip 

Integration Services makes it easy to create physical tables snapshots of the exact datasets and relationships at a point in time. You can then use these tables to build and test many different mining models over a period of time without tainting the process with changing data. It’s common to set up a separate database or even a separate server to support the data mining process and keep the production databases clear of this data mining debris.

The SQL Server 2005 Data Mining Tutorial uses views to define the case sets. This helps keep the proliferation of physical tables in the database to a minimum, but it’s not our preferred approach. The views need to be defined carefully; otherwise their contents will change as the data in the underlying database is updated nightly.

At this point, the data miner has enough data in the proper form to move on to the data mining model development process.

Product Recommendations: Model Development

The data miner began the model development process by creating a new Analysis Services project in the BI Studio called ProductRecs. She added a data source that pointed to the AdventureWorksDW database and created a data source view that included the three tables created in the Integration Services package: DMTrainSet, DMTestSet, and DMCustPurch. After adding the relationships between the tables, the project and data source view looked like the screen capture shown in Figure. The keys shown in the Train and Test tables are logical primary keys assigned by the data source view.

Next, the data miner used the Data Mining Wizard to create a new mining structure by right-clicking on the Mining Structures folder and selecting New Mining Structure. The data is coming from an existing relational data warehouse, and the data miner chose the Microsoft Decision Trees data mining technique in order to predict the probability of purchasing a given product based on a set of known attributes (the demographics from the registration process). In the Specify Table Types dialog window of the wizard, the data miner checked DMTrainSet as the Case table and DMCustPurch as the Nested table.

The Specify the Training Data window can be a bit tricky because it’s unclear what columns should be used in what ways. For this model, all of the demographic variables will be used as input only—you’re not trying to predict the Age or Gender of the customer with this model. Also, you need to correctly specify which DMCustPurch columns to include. At a minimum, specify a key for the nested cases.

In this example, ProductModel is the appropriate key, although it’s not enforced in the creation of the table. You also need to specify which column or columns to predict. Again, ProductModel is the obvious choice because it contains the description of the products to recommend. The data miner also included EnglishProductCategoryName as a predicted column because it groups the ProductModels and makes it easier to navigate later on in the Model Viewer. Finally, the data miner did not to include the quantity and amount fields because they are not relevant to this model. Remember, these are the nested purchases for each customer case. With a new visitor, you’ll know their demographics and can use that as input to the model, but you won’t have any purchase information so it makes no sense to include it as available input data. The bottom section of the completed Specify the Training Data window is illustrated in Figure.

Image from book
Figure: The nested table portion of the Specify the Training Data window

The next step in the Data Mining Wizard is meant to specify the content and data types of the columns in the mining structure. The data miner accepted the defaults at this point and went to the final screen, changing the mining structure name to ProductRecs1, and the mining model name to ProductRecs1-DT (for Decision Trees). After hitting the Finish button, the wizard completed the creation of the mining structure and the definition of the Decision Trees data mining model. The data miner is then able to view and verify the model definitions by viewing the Mining Models tab.

The next step is to deploy and process the model. Typically, a data miner works with one model at a time to avoid the overhead of processing all the models in the project (even though there is only one at this point, there will be more).

Tip 

To deploy and process the model, select any column in the ProductRecs1-DT model, right-click, and select Process Model. Select Yes to deploy the project, Run at the bottom of the Process Mining Model window, Close in the Process Progress window (when it’s finished), and finally Close back in the Process Mining Model window.

The Decision Trees algorithm generates a separate tree for each predicted value (each ProductModel), determining which variables historically have had a relationship with the purchase of that particular product. This means it will build 40 trees, one for each distinct value of the predicted variable, ProductModel, found in the DMCustPurch table.

Once the processing is complete, the data miner is finally able to explore the results. Selecting the Model Viewer tab automatically brings up the currently selected mining model in the appropriate viewer—in this case, the Decision Trees sub-tab of the Microsoft Tree Viewer. The tree shown is for the first item alphabetically in the predicted results set: the tree for the All-Purpose Bike Stand, which seems to have a slight bias toward women. Selecting a more interesting product like the Mountain-200 mountain bike brings up a more interesting tree—or at least one with more nodes.

The first split in the initial Mountain-200 tree is on DateFirstPurchase, and then several other fields come into play at each of the sub-branches. Immediately, the data miner recognized a problem. The DateFirstPurchase field was included in the case set inadvertently because it is an attribute of the customer dimension. However, it’s not a good choice for an input field for this model because visitors who have not been converted to customers will not have a DateFirstPurchase by definition. Even worse, after looking at several trees for other bicycle products, it is clear that DateFirstPurchase is also a strong splitter—perhaps because the longer someone has been a customer, the more products they have purchased, and the more likely they are to have purchased a bike. A quick review of all the fields reveals that another field has the same problem: YearsAsCust. This makes sense because the field is a function of DateFirstPurchase, and contains essentially the same information. The data miner decided to remove these fields from the model and reprocess it.

One easy way to do this is to delete the fields from the Mining Structure by right-clicking on the field and selecting Delete. The more cautious way is to remove them from the ProductRecs-DT mining model by changing their type from Input to Ignore in the drop-down menu. This keeps the fields in the Mining Structure, just in case. After changing the type to Ignore on these two fields and reprocessing the model, the decision tree for the Mountain-200 now looks like the one shown in Figure.

Image from book
Figure: The Mountain-200 decision tree

After exploring the model in the Decision Tree tab for a bit, it’s useful to switch over to the Dependency Network tab. This tool provides a graphical way to see which demographic variables are predictive of which products (ProductModels). The dependency network shows the relationships between the input variables and the predicted variables. But the meaning of the initial view of the dependency network for this example, shown in Figure, is not immediately obvious. Each node in the network stands for one of the variables or products in the mining model. Some input variables are predictive of many product models, others only a few. Because we have 15 input variables and 40 individual product models and so many relationships among those variables, we end up with a spider web. In fact, the viewer automatically limits the number of nodes it will display so as not to overwhelm the user.

Image from book
Figure: The default Dependency Network drawing for the ProductRecs1 Decision Trees model

Fortunately, there’s more to the Dependency Network tab than just this view. Zooming in to see the actual names of the variables is a good way to start. Selecting a node highlights the nodes with which it has relationships. The tool uses color and arrow directions to show the nature of those relationships. Finally, the slider on the left of the pane allows the user to limit the number of relationships shown based on the strength of the relationship. The default in this viewer is to show all the relationships. Moving the slider down toward the bottom of the pane removes the weakest relationships in ascending order of strength.

One way to get a better sense of the relationships in the Dependency Network tab is to drag the predictive (input) variables over to one corner of the screen. Figure shows the model from Figure after the data miner dragged the predictive variables to the upper-right corner.

Image from book
Figure: The Dependency Network with predictive variables dragged to the upper-right corner

This is still not very helpful. By zooming in on the upper-right corner, as shown in Figure, you can see that these, in fact, are the input variables. Note that there are only 14 shown in the dependency network. StateProvinceName did not play enough of a role in the model to make it onto the graph. Figure has had another adjustment as well: The slider was moved down to about one-third of the way up from the bottom. This shows that most of the relationships, and the strongest relationships, come from only a few variables. This comes as no surprise to an experienced data miner. Often there are only a few variables that really make a difference—it’s just difficult to figure out ahead of time which ones they’ll be.

Image from book
Figure: The Dependency Network zoomed in on the predictive variables

The input variables with the strongest relationships shown in Figure are English Country Region Name, Yearly Income, and Number Cars Owned. True to the iterative data mining process, this brings up an opportunity. Removing some of the weaker variables will allow the model to explore more relationships among the stronger variables and to generate a more predictive model. For example, Figure shows the decision tree for the Women’s Mountain Shorts product based on the initial model.

Image from book
Figure: The initial decision tree for Women’s Mountain Shorts

There is clearly a variation in preference for these shorts by country. About 14.5 percent of the Canadians bought a pair, but less than one-half of one percent of the Germans bought a pair. Given this information, recommending Women’s Mountain Shorts to a German web site visitor is probably a waste of time.

Figure shows the decision tree for the same product after the model has been narrowed down to five of the strongest input variables shown in Figure: Age, Yearly Income, English Country Region Name, Number of Children, and House Owner Flag.

Image from book
Figure: The expanded decision tree for Women’s Mountain Shorts after reducing the number of input variables

The first split is still based on English Country Region Name, but now there is a second split for three of the country nodes. Canada can be split out by income, showing that Canadian customers making >= $74,000 are more likely to own a pair of Women’s Mountain Shorts (probability 23.46 percent)—much higher than the 15 percent we saw for Canada based on the English Country Region Name split alone in Figure.

The process of building a solid data mining model involves exploring as many iterations of the model as possible. This could mean adding variables, taking them out, combining them, adjusting the parameters of the algorithm itself, or trying one of the other algorithms that is appropriate for the problem. This is one of the strengths of the SQL Server Data Mining workbench—it is relatively easy and quick to make these changes and explore the results.

Moving back to the case study, assume that the data miner worked through several iterations and has identified the final candidate. The next step in the process is to validate the model.

Product Recommendations: Model Validation

As we described in the data mining process section, the lift chart and classification matrix in the Mining Accuracy Chart tab are designed to compare and validate models that predict single, discrete variables. The recommendations model is difficult to validate. Rather than one value per customer, the recommendations data mining model generates a probability for each ProductModel for each customer.

Another problem with validating the model is that the data miner doesn’t really have historical data to test it with. The test data available, and the data used to build the model, is actually purchasing behavior, not responses to recommendations. For many data mining models, the bottom line is you won’t know if it works until you try it.

Meanwhile, the data miner wants to be a bit more comfortable that the model will have a positive impact. One way to see how well the model predicts actual buying behavior is to simulate the lift chart idea in the context of recommendations. At the very least, the data miner could generate a list of the top six recommended products for each customer in the test case set and compare that list to the list of products the person actually bought. Any time a customer has purchased a product on their recommended list, the data miner would count that as a hit. This approach provides a total number of hits for the model, but it doesn’t indicate if that number is a good one. You need more information: You need a baseline indication of what sales would be without the recommendations.

In order to create a baseline number for the recommendations model, the data miner also created a list of six random products for each customer in the test case set. Figure shows the results for these two tests. As it turns out, the random list isn’t a realistic baseline. You wouldn’t really recommend random products; you would at least use some simple data mining in the form of a query and recommend your six top-selling products to everyone—people are more likely to want popular products. Figure includes the results for the top six list as well.

Figure: Recommendations Model Validation Data Points

TEST

NUMBER OF HITS

TOTAL POSSIBLE

HIT RATE

Random Baseline

1,508

10,136

14.9%

Top Six Products

3,733

10,136

36.8%

Recommended List

4,181

10,136

41.2%

The data miner and marketing manager learn from Figure that the model is reasonably effective at predicting what customers bought—it’s not great, but it’s better than listing the top six products, and a lot better than nothing at all. Note that the hit rate in Figure has very little to do with the click-through rate you’d expect to see on the web site. The real number will likely be significantly lower. However, based on these results, the data miner and the marketing manager decided to give the model a try and carefully assess its impact on the original goal of the project, increasing the percentage of visitors who become customers, and increasing the average sale amount.

Product Recommendations: The Operations Phase

The decision to go forward moved the project into the Operations phase of the data mining process. The details of the implementation are well beyond the scope of this book, but the high-level steps would involve making the data mining model available to the web server, and writing the ADOMD.NET calls to submit the visitor’s demographic information and to receive and post the recommendation list. Figure shows an example of the DMX query for the ProductRecs1-DT mining model.

Image from book
Figure: Sample DMX for a data mining query to get product recommendations based on an individual’s demographics

In this case, for a 43-year-old person from France who makes $70,000 per year, has no children, and owns a house, the model recommendations include a Mountain-200 and a Road-750. This is good—you like seeing those high-revenue bikes in the recommendations list.

Assessing the impact of the model would involve several analyses. First, the team would look at the number of unique visitors over time, and the percentage of visitors that actually become customers before and after the introduction of the recommendation list. Increasing this percentage is one of the goals of providing recommendations in the first place. This analysis would look at the average purchase amount before and after as well. It may be that the conversion rate is not significantly affected, but the average purchase amount goes up because current customers are more interested in the recommendations and end up purchasing more.

The second analysis would look at the web browsing data to see how many people click on one of the recommendation links. The analysis would follow this through to see how many people actually made a purchase. In a large enough organization, it might be worth testing the recommendation list against the top six list.

The model would also have to be maintained on a regular basis because it is built on purchasing behaviors. New product offerings and changes in fashion, preference, and price can have a big impact on the purchasing behaviors—you don’t want your recommendations to go stale.

Previous Section
Next Section

 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows