Mine the Predictive Power of Data

Use the built-in functionality of SQL Server to make your business processes more efficient with data mining.

Technology Toolbox: VB.NET, SQL Server 2000, Analysis Services

Editor's Note: Portions of this article have been adapted from Chapter 5, "Data-Mining Predictions," and Chapter 6, "Applying Data-Mining Predictions," of Sara Rea's book, Building Intelligent .NET Applications: Agents, Data Mining, Rule-Based Systems, and Speech Processing, with permission from Addison-Wesley Professional [2005, ISBN: 0321246268]. © 2005 Pearson Education Inc. Publishing as Addison-Wesley. For more information, go to www.awprofessional.com/title/0321246268. To read Chapter 5 in its entirety, go to www.ftponline.com/books/chapters/default.asp?isbn=0321246268.

Data mining is the process of extracting meaningful information from large quantities of data. It involves uncovering patterns in the data and is often tied to data warehousing because it attempts to make large amounts of data usable.

Data elements fall into distinct categories; these categories enable you to make predictions about other pieces of data. For example, a bank might wish to ascertain the characteristics that typify customers who pay back loans. Although this could be done with database queries, the bank would first have to know which customer attributes to query for. Banks can use data mining to identify what those attributes are and then make predictions about future customer behavior.

One of the more difficult aspects of data mining has always been translating the theory into practice. Practical examples can be hard to come by, especially the code that backs them up. I'll show you how a fictional retailer named Savings Mart could use Microsoft's Analysis Services, included with Microsoft SQL Server 2000, to improve operational efficiencies and reduce costs. I'll explain the techniques required to implement and refine data mining, as well as provide the code you need to test this out.

You'll be able to use the techniques described and code provided to create a mining model that lets you predict the right amount of shipments to each store. The mining model is the first step toward revising the way Savings Mart procedurally handles product orders and shipments, with the end goal being reduced operating costs.

You can implement data mining with Analysis Services using one of two popular mining algorithms: decision trees and clustering. You use these algorithms to find meaningful patterns in a group of data, then make predictions about the data. Decision trees are useful for predicting exact outcomes. For example, you can apply the decision trees algorithm to a training data set to create a tree that allows the user to map a path to a successful outcome. At every node along the tree, the user answers a question (or makes a "decision"), such as "years applicant has been at current job (0-1, 1-5, >5 years)." I'll teach you how to use the decision trees algorithm to build a mining model based on shipment dates.

Savings Mart is a fictitious discount retailer operating in a single American state. It has been in business since 2001 and hopes to open new stores by achieving greater operational efficiencies. Since its inception, Savings Mart has relied on a system of adjusting product inventory thresholds to determine when shipments will be made to stores. Every time someone purchases a product, the company's business software updates the quantity for that product and store. When the quantity dips below the minimum threshold allowed for that product and store, the software automatically generates an order that is delivered three days later.

This process might seem like a good way to ensure that the stores are well stocked, but it results in shipments being made to each store almost every day. This means high overhead costs for each store. Management wants to replace the order/shipment strategy with a system designed to predict shipment dates rather than rely on adjustable thresholds.

Load the SavingsMart Database
You will need to create a database named SavingsMart to execute the sample code with SQL Server. Do this by attaching the database file (SavingsMart.mdf) available for download.

Begin by taking a careful look at the diagram of the SavingsMart database (see Figure 1). The Products table contains a record for every product available in Savings Mart's stores. Each product is associated with a certain vendor and assigned to a product type, such as beverage or medicine.

The database writes to the Purchases table every time a customer makes a purchase. It records information common to all purchases, such as when the purchase took place, which store it was made in, and which employee rang up the purchase. Product-level detail is tracked for each store; you determine availability by examining the Quantity field in the ProductQty table. A purchase can include multiple products and more than one unit of each product.

Once a product's quantity falls below the minimum threshold allowed for the store (determined by the ProductThresholds table), the store places a new order automatically. The quantity for the order is based on the maximum amount found in the ProductThresholds table. Each shipment is the direct result of an order and is typically completed three days after the order is placed.
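
To make that rule concrete, here's a minimal VB.NET sketch of the legacy threshold check. The procedure and the PlaceOrder helper are hypothetical illustrations only; the real logic lives in Savings Mart's business software and stored procedures:

' Hypothetical sketch of the legacy threshold-based ordering rule.
Public Class ThresholdOrderingSketch
   Public Sub CheckThreshold(ByVal storeId As Integer, _
      ByVal productId As Integer, ByVal quantityOnHand As Integer, _
      ByVal minThreshold As Integer, ByVal maxThreshold As Integer)

      ' When the on-hand quantity dips below the minimum threshold, order
      ' the maximum threshold amount; the shipment arrives three days later.
      If quantityOnHand < minThreshold Then
         PlaceOrder(storeId, productId, maxThreshold, Date.Now.AddDays(3))
      End If
   End Sub

   Private Sub PlaceOrder(ByVal storeId As Integer, _
      ByVal productId As Integer, ByVal quantity As Integer, _
      ByVal expectedDelivery As Date)
      ' Stub: the sample application inserts the corresponding Orders and
      ' Shipments records in the SavingsMart database.
   End Sub
End Class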

I generated the database available for download using a sample data-loading application. I used this application to simulate random purchases, as well as to initiate orders and shipments needed to restock purchases.

Cleaning up the database is one of the most important tasks in successful data mining. You often construct the databases you want to mine from multiple data sources. These data sources often involve data that is prone to a variety of errors—errors that can destroy any chance you have of making useful predictions. The phrase "garbage in, garbage out" applies particularly acutely to data mining.

Possible errors include records with impossible values or values that fall outside the expected range. Another source of errors: records that include null values where nulls will negatively impact the decision-making process. You can prevent most of these errors with database restrictions or application code, but this doesn't always happen. I created this article's sample database artificially, so it's reasonably safe to assume these errors do not exist.

You have a choice of several different methods to clean up a database. Often, you can correct values with a few update queries made in Query Analyzer. The hardest part is determining what the correct values should be. More than likely, you'll need outside help from people intimately familiar with the data, such as a store manager.

Create the Views
You need to create a special view to ease the process of building a mining model. The view, vw_Shipments, combines fields from five different tables and is used in the next section to create the mining model. The view utilizes the fn_GetLastShipmentDate function to calculate the number of days between shipments. This function accepts @ShipmentID and @ProductID as input variables. It then opens a scrollable cursor (similar to an ADO resultset) based on this SQL statement:

SELECT shippeddate, s.shipmentid
FROM shipments s
   LEFT JOIN shipmentdetails sd
      ON s.shipmentid = sd.shipmentid
WHERE s.storeid IN
      (SELECT StoreID
       FROM Shipments
       WHERE shipmentid = @ShipmentID)
   AND sd.productid = @ProductID
ORDER BY shippeddate

The function loops through the cursor results until it locates the ShipmentID supplied as an input variable. Once it finds that record, the function moves to the preceding record and returns that record's shipment date. This shipment date serves as the first argument to the built-in SQL function DATEDIFF. The resulting value, DaysSinceLastShipped, is an important column for the Analyze Shipments mining model.

After inspecting and cleaning your database, the next step is to create a mining model. Typically, you use the Mining Model Editor in Analysis Manager to create mining models. You can also use Decision Support Objects (DSO) if you need to create and process mining models programmatically.

Analysis Manager lets you create and manage databases used by Analysis Services. A database node in Analysis Manager doesn't represent the physical storage of a large database. Instead, it represents the database that Analysis Services uses to hold the mining-model definitions and the results of processing these mining models. Note that it does reference a physical data source.

Each database in Analysis Manager is associated with one or more data sources. These data sources can be either relational databases or data warehouses. You create data sources by right-clicking on the Data Sources node and selecting New Data Source. From the Data Link Properties dialog, select a data provider along with the connection information. Analysis Services supports SQL Server, Access, and Oracle databases as data sources.

Mining models serve as the blueprint for how you should analyze and process the data. Each model represents a set of cases. The mining-model object stores the information necessary to process the model, such as what queries you need to get the data fields, what data fields are input columns or predictable columns, and what relationship each column has with other columns. Input columns are attributes whose values the model uses to generate results for the predictable columns. In some cases, an attribute can serve as both an input column and a predictable column.

Once you process the model, the data associated with the mining model represents what was learned from the data. Note that the model stores the analyzed results, not the actual data from the training data set, in the Analysis Services database.

Let's walk through the process of creating a mining model using Analysis Manager. You need to install Analysis Services if you haven't done so already. It's available as a separate install with the SQL Server 2000 setup. Make sure that you also install the latest Service Pack release for SQL Server (see Additional Resources).

Develop a Mining Model
Begin by opening Analysis Manager from the Analysis Services menu item. Next, create a new database. Right-click on the server name in the left-hand pane, then select New Database. Next, specify SavingsMart as the Database Name.

Expand the newly added SavingsMart node, right-click on Data Sources, and select New Data Source. From the Data Link Properties dialog box, select Microsoft OLE DB Provider for SQL Server, then Next. Enter the SQL connection information for your SQL Server and test the connection before closing the Data Link Properties dialog.

Next, create the mining model with the mining-model wizard. Begin by right-clicking on Mining Models in the left-hand pane and selecting New Mining Model. Click on Next in the Welcome dialog, then Next on the Select Source dialog because you're using relational data. Select the vw_Shipments view from the Available Tables listbox in the Select Case Tables dialog, and click on Next. It's possible to select multiple tables, but utilizing the view allows access to a calculated field indicating the number of days between shipments.

After you select the case table, accept the default of Microsoft Decision Trees for the data-mining technique. You should also accept ShipmentID as the default Case Key column. Select the "Finish the mining model in the editor" checkbox and click on Next. Finally, name the model "Analyze Shipments" and click on Finish.

Next, you want to fire up the Relational Mining Model Editor. Click on Insert and then Column, and select the column named DaysSinceLastShipped. After you add this, change the usage to Input and Predictable (note that a diamond icon now appears next to the column). Then go to the Advanced Properties and enter DISCRETIZED(CLUSTERS) as the content type.

Choosing "discretized" as the content type lets you group a continuous variable discretely instead. Continuous variables are usually numeric-based values that have an infinite range of possibilities. DISCRETIZED accepts two parameters, such as:

DISCRETIZED(<method>, <#buckets>)

The method parameter accepts one of four values: EQUAL_AREAS, which divides the data into buckets containing equal numbers of cases; THRESHOLDS, which uses inflection points to estimate bucket boundaries; CLUSTERS, which uses a clustering algorithm to estimate buckets; and AUTOMATIC, which tries each method in turn and uses the first one that suggests buckets.

Select the next input column by clicking on Insert and then Column, and select the column named StoreID. After you add this, change the usage to Input and Predictable. Do the same thing for the VendorName and ProductType columns. Continue by clicking on Insert and then Column, and select the column named Quantity. Next, change the usage to Predictable, and enter DISCRETIZED(CLUSTERS) as the content type on the Advanced Properties tab.

Complete the process by clicking on Tools and then Process Mining Model. Next, save the mining model, then click on OK to start a full process of the mining model. Note that this process will take several minutes to run.

Train the Mining Model
The next step is to train the mining model. You do this by processing the mining model in Analysis Manager. Alternatively, you can use a scripting language known as Data Definition Language (DDL) and a connection to the Analysis Server.

DDL is useful in cases when you want to process a mining model programmatically. You can execute DDL through a connection to the Analysis Server. It's also useful for demonstrating how Analysis Manager processes a mining model. Use the CREATE MINING MODEL syntax to create the mining model. This syntax is similar to Transact SQL (T-SQL), and it should look familiar to SQL developers. Writing the CREATE statement for this mining model is easy:

CREATE MINING MODEL [Analyze Shipments] (
   [Shipment Id] LONG KEY,
   [Days Since Last Shipped] LONG DISCRETIZED(CLUSTERS) PREDICT,
   [Store Id] LONG DISCRETE,
   [Quantity] LONG DISCRETIZED(CLUSTERS) PREDICT_ONLY,
   [Vendor Name] TEXT DISCRETE,
   [Product Type] TEXT DISCRETE)
USING Microsoft_Decision_Trees

This statement creates a new mining model named Analyze Shipments. The model utilizes Shipment ID as the case key and defines Days Since Last Shipped and Quantity as predictable columns. Note that Days Since Last Shipped also functions as an input column. The remaining columns—Store ID, Vendor Name, and Product Type—are input columns only. You define mining-model columns as either input, predictable, or input and predictable.

You train a model by inserting data into the model with the INSERT INTO syntax:

INSERT INTO [Analyze Shipments]
   (SKIP, [Days Since Last Shipped], [Store Id], [Quantity],
    [Vendor Name], [Product Type])
OPENROWSET('SQLOLEDB.1',
   'Provider=SQLOLEDB.1;Integrated Security=SSPI;Persist Security Info=False;Initial Catalog=SavingsMart;Data Source=(local)',
   'SELECT DISTINCT
      "dbo"."vw_Shipments"."ShipmentID" AS "Shipment Id",
      "dbo"."vw_Shipments"."DaysSinceLastShipped" AS "Days Since Last Shipped",
      "dbo"."vw_Shipments"."StoreID" AS "Store Id",
      "dbo"."vw_Shipments"."Quantity" AS "Quantity",
      "dbo"."vw_Shipments"."VendorName" AS "Vendor Name",
      "dbo"."vw_Shipments"."ProductType" AS "Product Type"
   FROM "dbo"."vw_Shipments"')

The mining model doesn't store the actual data; instead, it stores the prediction results produced by processing the mining algorithm. This statement uses the OPENROWSET keyword to specify the location of the physical data source.

Select the Content tab to examine the results (see Figure 2). In this example, you can see that VendorName is the most significant factor affecting DaysSinceLastShipped. You know this because it is the first split on the tree. For nodes that have additional branches, two lines follow the node. You can view the additional branches by double-clicking on that node.

Note that the darkest-shaded node in the figure is the one where the Vendor Name is Store Brand. The mining model shades nodes with a higher data density in a darker color. This result is not surprising because more than 25 percent of the 500 products available are represented by the Store Brand. You can confirm this in Query Analyzer with this query:

SELECT v.VendorName,
   (COUNT(ProductID) / 500.0) AS 'Percent'
FROM Products p
   LEFT JOIN Vendors v ON p.VendorID = v.VendorID
GROUP BY v.VendorName
ORDER BY 'Percent' DESC

Double-clicking on the Store Brand node brings up the detail pane. Examine the attributes for the remaining nodes, and you'll see that you can make predictions for all the stores. Store 1 includes an additional branch that distinguishes between a Snack Foods product type vs. all other product types.

Change the Shipment Strategy
If you use the Prediction Tree dropdown box to select the quantity column, you see that the main factor affecting quantity is the days since last shipped. This is possible because you defined the column DaysSinceLastShipped as an input and a predictable column.

The next factor affecting quantity is the Store ID. In some cases where the vendor name is NOT PNuts, Product Type is an additional factor. For example, analyzing the data further reveals that when the days since last shipped is less than or equal to 43, and the Store ID is not 1, there is a 98 percent probability that the quantity should be 200. You can examine the predicted values for each node path by selecting different nodes within the Content Detail pane.

Assume that the managers of Savings Mart, now aware of the prediction results, would like you to redesign the way product orders and shipments are generated automatically. The primary goal is to increase profits by reducing operational overhead.

Previously, orders were generated based on minimum and maximum threshold quantities for each store and product. As purchases were made, the system would check to see whether the available quantity fell beneath the minimum threshold amount for that product. When this occurred, the system would generate an order for that store automatically with a quantity based on the maximum threshold amount.

At the end of one year, management realized that 295 shipments were made for Store 1 alone. Thus, for every day of the year, there was an 81 percent probability (295/365 = .8082191) that a shipment would be made. Each shipment involves substantial operating costs, so management knew it needed a better system.

When you examine the Content through the Relational Mining Model Editor, you can see that the most significant factor affecting days since last shipped is the vendor name. For every vendor, the model makes predictions of the time to ship products and what quantity the store will need. The next step is to get the information into a usable format so you can estimate new shipping schedules. Do this by querying the mining-model results for specific vendors and stores.

You need to be able to communicate with Analysis Services programmatically before you can redesign the way the store generates orders and shipments. Microsoft introduced the ADOMD.NET SDK to enable developers to communicate consistently with multidimensional sources of data such as Analysis Services. The SDK is a standard .NET data provider that utilizes interfaces from the System.Data and System.Data.Common namespaces. It offers an object model similar to the one used by ADO.NET. For example, it uses a Connection object to connect with the data source, as well as a Command object for issuing commands and other objects to store the results of data queries.

It communicates with analytical data sources over the Web using the XML for Analysis specification 1.1. It also lets you query multidimensional data, returning it in an object similar to the ADO.NET DataReader. You make your connections using AdomdConnection objects, and you pass query text and parameters using AdomdCommand objects. ADOMD.NET also offers an object known as the CellSet, which behaves similarly to the ADO.NET DataSet object. In all, ADOMD.NET gives you three choices for retrieving data: the multidimensional CellSet object, the read-only and forward-only DataReader object, and the XmlReader object. The XmlReader, which returns raw XML, is the least resource-intensive.
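
As a point of reference, here's a minimal sketch of that object model in action. It isn't part of the article's sample code, and the Data Source URL is an assumption based on the XML for Analysis provider's default virtual directory (which you'll install in the next section); substitute the address of your own server:

Imports System
Imports Microsoft.AnalysisServices.AdomdClient

Module AdomdSketch
   Sub Main()
      ' Assumed connection string; point it at your XML for Analysis provider.
      Dim conn As New AdomdConnection( _
         "Data Source=http://localhost/xmla/msxisapi.dll;" & _
         "Initial Catalog=SavingsMart")
      conn.Open()
      Try
         Dim cmd As New AdomdCommand()
         cmd.Connection = conn
         ' Any DMX statement can go here; this one browses the nodes of the
         ' trained Analyze Shipments model.
         cmd.CommandText = "SELECT * FROM [Analyze Shipments].CONTENT"

         ' The read-only, forward-only reader is the lightest-weight option.
         Dim reader As AdomdDataReader = cmd.ExecuteReader()
         While reader.Read()
            Console.WriteLine(reader.GetValue(0))
         End While
         reader.Close()
      Finally
         conn.Close()
      End Try
   End Sub
End Module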

Install Tools to Analyze Data
You need to install both the XML for Analysis specification 1.1 and the ADOMD.NET SDK to reproduce the results described in this article and returned by its sample application. Note that you must first install Microsoft XML Parser (MSXML) 4.0 or later (assuming you don't have it installed already). Next, initiate installation of the XML for Analysis specification 1.1 by executing the file named XMLASDK.exe. Finally, install the ADOMD.NET SDK 8.0. All three files are available from Microsoft's Web site.

When installing the XML for Analysis SDK, make sure you check the Enable HTTP Unsecure checkbox. By default, the XML for Analysis Service Provider will attempt to use a secure HTTPS connection with Secure Sockets Layer (SSL).

You also need to apply database changes to execute the sample code for this article. Do this by opening Query Analyzer and executing the SQL script named DBChanges.sql. This script adds four new stored procedures and three new tables to the SavingsMart database.

Once you train and test a mining model and you believe it will produce good results, you need to set up a process that allows you to refresh the mining model periodically. Refreshing a mining model involves retraining the mining model using new and/or updated data without re-creating it. You can do this manually using Analysis Manager or programmatically using DSO.

The LoadSampleData project available with the online code lets you process the mining model programmatically. If you download the files to your local machine, open the solution file and expand the References node in Solution Explorer. You will see references to ADODB and DSO. The DSO reference is a COM-based reference to the Microsoft Decision Support Objects library. You'll also see a third reference to the ADOMD.NET namespace, Microsoft.AnalysisServices.AdomdClient; this last reference resolves only if you installed the ADOMD.NET SDK successfully.

You don't store the data with the mining model, so you need to reprocess the model periodically. Processing a mining model is as simple as right-clicking on the model name from Analysis Manager and selecting Process. You are given two options: Full Process or Refresh Data. Refreshing the data involves retraining the model without re-creating it completely. This approach is faster, and you should choose it if you're making no changes to the model's structure.

The ProcessMiningModel procedure accomplishes programmatically the same thing as clicking on Process from within the Analysis Manager interface. It reexamines the data using the mining model you specify. It first opens a connection to the Analysis Server and sets a reference to the Analyze Shipments mining model (named oModel). It then initiates a refresh of the mining model:

Try
   ' Lock the model to prevent conflicting operations while it processes.
   oModel.LockObject( _
      OlapLockTypes.olapLockProcess, _
      "Training the Shipments mining model")
   ' Refresh (retrain) the model without re-creating its structure.
   oModel.Process( _
      ProcessTypes.processRefreshData)
   oModel.UnlockObject()
Catch ex As Exception
   MessageBox.Show( _
      "Unable to refresh the mining model")
   Exit Sub
End Try

This code places a lock on the object before processing occurs to prevent conflicts, releasing the lock after it refreshes the data. At this stage, it's refreshing the predictions themselves. This is because Analysis Services doesn't store the raw data used as input, but instead stores the predictions made as a result of applying the data-mining algorithm.
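
The connection and model lookup that precede the Try block aren't shown above. A minimal sketch of that setup, assuming a local Analysis Server and the database and model names used in this article, might look like this (the variable names are illustrative):

' Sketch of the setup that precedes the Try block; assumes a local
' Analysis Server, and oModel matches the variable used above.
Dim dsoServer As New DSO.Server()
dsoServer.Connect("localhost")

Dim dsoDatabase As DSO.Database = _
   CType(dsoServer.MDStores.Item("SavingsMart"), DSO.Database)
Dim oModel As DSO.MiningModel = _
   CType(dsoDatabase.MiningModels.Item("Analyze Shipments"), DSO.MiningModel)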

Query for Results
The LoadSampleData project includes a procedure named PopulateShipmentSchedulesTable. This is where you query the mining model for results. To understand how prediction queries work, let's walk through the syntax for building one. It's similar to building a T-SQL statement you want to execute against SQL Server:

SELECT Predict([Days Since Last Shipped]), Predict(Quantity)
FROM [Analyze Shipments]
PREDICTION JOIN
   (SELECT 'Ables' AS VendorName, 5 AS StoreID) AS t
ON [Analyze Shipments].[Vendor Name] = t.VendorName
   AND [Analyze Shipments].[Store ID] = t.StoreID

In this query, you select the two predicted outcome variables from the Analyze Shipments mining model. You use the PREDICTION JOIN keyword to specify that you want to return values for a specific vendor and store. This is called a prediction query because you use it to retrieve results from an already trained model. Analysis Services also supports a content query, which you use to browse the contents of a mining model. A content query is useful if you want to identify trends programmatically in the processed data.

The prediction query is useful for extracting the top prediction result for each vendor and store. You know to restrict the query to these values because they are identified as the top factors in predicting days between shipments. Browsing the contents of the Analyze Shipments mining model from the data-mining Model Browser in Analysis Manager lets you drill down into a specific node path. The prediction query built in the PopulateShipmentSchedulesTable routine returns the same results as using Analysis Manager.

If you examine the Ables vendor and Store 5, you see that the model's highest-probability prediction is 107 days between shipments. You get the same result if you execute the prediction query discussed previously.
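
Here's a hedged sketch of how a routine like PopulateShipmentSchedulesTable might issue that query through ADOMD.NET for any vendor and store. The connection string and the procedure signature are assumptions, but the DMX string matches the prediction query shown earlier:

' Hypothetical helper; assumes an Imports statement and a project reference
' for Microsoft.AnalysisServices.AdomdClient.
Public Sub PredictShipment(ByVal vendorName As String, ByVal storeId As Integer)
   ' Assumed connection string; point it at your XML for Analysis provider.
   Dim conn As New AdomdConnection( _
      "Data Source=http://localhost/xmla/msxisapi.dll;" & _
      "Initial Catalog=SavingsMart")
   conn.Open()
   Try
      Dim cmd As New AdomdCommand()
      cmd.Connection = conn
      cmd.CommandText = _
         "SELECT Predict([Days Since Last Shipped]), Predict(Quantity) " & _
         "FROM [Analyze Shipments] " & _
         "PREDICTION JOIN (SELECT '" & vendorName & "' AS VendorName, " & _
         storeId.ToString() & " AS StoreID) AS t " & _
         "ON [Analyze Shipments].[Vendor Name] = t.VendorName " & _
         "AND [Analyze Shipments].[Store ID] = t.StoreID"

      Dim reader As AdomdDataReader = cmd.ExecuteReader()
      If reader.Read() Then
         ' PredictShipment("Ables", 5) should report 107 days between shipments.
         Console.WriteLine("Days between shipments: {0}", reader.GetValue(0))
         Console.WriteLine("Predicted quantity: {0}", reader.GetValue(1))
      End If
      reader.Close()
   Finally
      conn.Close()
   End Try
End Sub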

You can continue examining the results using the Content tab of the Relational Mining Model Editor to see the remaining predicted shipment schedules. Alternatively, you can use the LoadSampleData application to query the mining model, and write the data to the ShipmentSchedules table (see Table 1).

The mining algorithms used by Analysis Services do more than output averages of data values. They also apply complex mathematical methods in an iterative fashion, enabling you to cluster data into the most predictable groups.

A key facet of using data mining is measuring the success of your analysis. You can determine whether the new shipment method is a success by generating purchases for an entire year and examining how many orders the method generates. Do this by using the GeneratePurchases routine from the LoadSampleData application.

The GeneratePurchases routine simulates random purchase generation for each store, just as the original data-loading application did. The difference between this version and the one used to create the initial database file is that it uses the ShipmentSchedules table to determine when to order products. This version also reduces the time frame to six months, which shortens the processing time.
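
The order-triggering check itself becomes a comparison against the predicted shipment interval rather than a stock threshold. Here's a hypothetical sketch; the values are passed in as parameters because the actual layout of the ShipmentSchedules table is shown in Table 1, not here:

' Hypothetical sketch of the schedule-driven check that replaces the
' threshold rule; the parameter names are illustrative only.
Public Function ShouldOrder(ByVal daysSinceLastShipment As Integer, _
   ByVal predictedDaysBetweenShipments As Integer) As Boolean

   ' Order only when the predicted interval for this store and vendor has
   ' elapsed, instead of reacting to every dip below a minimum threshold.
   Return daysSinceLastShipment >= predictedDaysBetweenShipments
End Function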

Track Product Availability
You use the InsertPurchaseDetailNew stored procedure—created when you executed the script from the DBChanges.sql file—to generate purchase records, as well as to verify whether the product quantity requested is available. If the product quantity isn't available, the stored procedure writes to the ProductAvailabilityTracking table. Management can use this table to measure the success of the new shipment method. Each record indicates an attempt by a customer to purchase a product that isn't available.

It's difficult to take advantage of such a table in the real world. After all, most customers won't announce or even know that they're trying to purchase a product the store doesn't have. This means the store likely uses some other method of tracking product availability, whether it's store managers routinely checking the store aisles or an automated method of querying the database.

The key thing to remember is that you should continually measure and evaluate the application of predictions by data-mining algorithms. It's not good enough to use Analysis Services to make predictions and then assume that everything works as expected.

Other factors that can affect a mining model's predictions relate to the number of splits that occur within the decision tree. You can control some of these factors by using parameters assigned to the model. For example, you might use the COMPLEXITY_PENALTY and MINIMUM_LEAF_CASES parameters of the decision trees algorithm to keep the number of splits in check (see Table 2).
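
If you're refreshing the model through DSO, as in the ProcessMiningModel routine, one place to apply them is the model's Parameters property. This is only a sketch under that assumption, and the values shown are placeholders rather than recommendations:

' Sketch: penalize extra splits and require more cases per leaf. The values
' are placeholders, not recommendations; tune them against your own data.
oModel.Parameters = "COMPLEXITY_PENALTY=0.9,MINIMUM_LEAF_CASES=10"

' Changing parameters alters the tree's structure, so run a full process
' rather than a data refresh.
oModel.Process(ProcessTypes.processFull)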

Successful data mining involves testing the predictions against additional data sets. It can also involve evaluating the success of predictions through various means. Closed-loop processing is an iterative process of reevaluation in which the mining model is refreshed with new data, enabling new predictions to be made. You write the new predictions back to the database in an effort to attain better results. The LoadSampleData project includes a button named Simulate Closed Loop Process.

In the sample app, the closed-loop process generates purchases for 14 days and then initiates a refresh of the Analyze Shipments mining model. You refresh the Analyze Shipments model by calling the ProcessMiningModel routine, which can result in the prediction of new values for the ShipmentSchedules table.
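
The orchestration behind that button might look roughly like the following sketch. The routine names come from the LoadSampleData project, but the signatures and the loop structure are assumptions:

' Rough sketch of the closed-loop simulation; signatures are assumptions.
Private Sub SimulateClosedLoopProcess()
   Dim startDate As Date = #1/1/2004#   ' arbitrary simulation start date
   Dim cycle As Integer

   For cycle = 1 To 12
      ' 1. Simulate purchases (and schedule-driven orders) for a 14-day window.
      GeneratePurchases(startDate, startDate.AddDays(14))

      ' 2. Retrain the Analyze Shipments model against the enlarged data set.
      ProcessMiningModel()

      ' 3. Write any revised predictions back to the ShipmentSchedules table,
      '    closing the loop before the next window begins.
      PopulateShipmentSchedulesTable()

      startDate = startDate.AddDays(14)
   Next
End Sub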

Closed-loop processing is the best method for dealing with real-world situations. After all, many factors go into customer buying habits. What is true at the beginning of the year might not hold true at the end of the year. Closed-loop processing allows you to reevaluate the data continually to ensure that predictions have not changed.

Data mining is a technique that has been around for several years, and the immediate future of predictive data mining should prove interesting. Unfortunately, many of the original tools and techniques for mining data were complex and difficult to grasp for beginners. Many software makers, such as Microsoft, have responded by creating easier-to-use data-mining tools.

The next release of SQL Server is slated to ship with Analysis Services 2005, which introduces a new interface and five new data-mining algorithms. These features will let you take better advantage of all that data mining has to offer. Even if your company is not ready to move to SQL Server 2005, you can utilize Analysis Services 2000 to create some useful applications. You can then migrate the models you create to Analysis Services 2005 when that becomes an option for you.

The introduction of data-mining algorithms with SQL Server represents an important step toward making data mining accessible and useful to more companies. The built-in tools allow users to create mining models visually and then train those models with historical data from relational databases.

The potential applications for data mining are many and can span any industry that collects data. Potential questions that could be answered with data mining include: Is this e-mail considered spam? Is this customer a good credit risk? What other products might this customer buy?
