Data mining and SQL Server 2005

Article

Introduction.

Data mining is a key member in the Business Intelligence (BI) product family in SQL Server 2005¹. Data mining is about analyzing data and finding hidden patterns using automatic or semiautomatic means, which can be explored for valuable information. It is about learning the characteristics of data set, which are not possible to discover by simple seeing.

There are several attempts to define the learning task applied to software systems such as: "Learning is any process that enables a system to achieve a better performance when working on the same task" or "Learning consists of constructing or modifying representations of past experience".

Large volumes of data which comes from information systems have been accumulated and stored in databases. Organizations have become data-rich and knowledge-poor. The information found in the patterns can be used for reporting, and, most importantly, for prediction.

Working with Analysis Service in SQL Server 2005 as a Data Mining tool.

Data mining approach in Analysis Service is rather simple, all you need to do is to select the right data mining algorithm and specify the input columns and the predictable columns (which are the targets for the analysis).

Data mining can be used to solve a several problems such as:

Classification: Classification refers to assigning cases into categories based on a predictable attribute.
Clustering: It is used to identify natural groupings (self-similarity groups) of cases based on a set of attributes.
Association: The common usage of association is to identify common sets of items (frequent itemsets) and rules for the purpose of cross-selling.
Regression: The regression task is similar to classification. The main difference is that the predictable attribute is a continuous number. Linear regression and logistic regression are the most popular regression methods. Other regression techniques include regression trees and neural networks.
Sequence Analysis: It's used to find patterns in a discrete series. Sequence and association data are similar in the sense that each individual case contains a set of items or states. The difference between sequence and association models is that sequence models analyze the state transitions.
Deviation Analysis: It's for finding those rare cases that behave very differently from others. It is also called outlier detection.

There are some fields where data mining application can be found such as: analysis of finance data (prediction of credit risks, discovery of fraud transactions and prediction of stock prices), analysis of genetic databases (discovery of correlations between gene sequences and protein properties, prediction of the biological effect of chemical substances), analysis of text archives (classification, retrieval of news articles, scientific publications, co-citation analysis, discovery of risk factors for diseases from medical archives), search engines (information extraction, classification of documents, disambiguation of words, relevance feedback, structuring and visualization of documents, user modeling (learning user profiles, personalized classification of documents, analysis of search behavior).

The data mining project lifecycle.

A common data mining project has lifecycle process; practitioners use CRISP DM (Cross Industry Process for Data Mining) methodology for planning, executing and controlling the whole lifecycle of former project.

CRISP DM² comprises the following phases:

Business understanding. (Goals: Problem goals and specification. Problem scope. Produce Project Plan. Assess the situation. Key persons). This step includes analyzing requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, create project plan and the effort distribution, and defining data mining goals from business objectives.
Data understanding. (Goals: Collect data. Describe data. Explore data. Verify data quality.) This step is to collect initial data identifying relevant attributes, describing attributes (relevance, statistical characteristics) as well as to explore the data using techniques such as calculating the minimum and maximum values, calculating mean and standard deviations and looking for the distribution of the data and finally discovering errors and missing values in data.
Data preparation. (Goal: Select data. Clean data. Construct data. Integrate data. Format data.) This step is to consolidate, transform and clean (correct false values, insert suitable or estimate values) the data. So, data sources which are scattered across the company and stored in several formats must be consolidated. Data representation must be standardized, for instance, if one department represents the customer information in a different schema than other, they must agree in a common vocabulary for the integration of data. It also is important to find inconsistencies (outlier entities outside the whole group) such as the age of person equal to 200 years old, and other missing information (null values). In this phase the data is generated, described and reformatted if necessary.
Modeling. (Goal: Select modeling techniques. Generate test design. Build model. Assess model.) This is step is to build the mining model. You must separate the prepared data into two distinct dataset, training and testing dataset. You use the training dataset to build the model, and the testing dataset to test the accuracy of the former model. A model typically contains the following structure: input columns, an identifying or key column, predictable columns and the learning algorithm with its associated parameters for tuning, it depends on the experience with specific techniques and tools. Patterns are discovered by passing the original training data through the chosen algorithm. It is remarkable to say that SQL Server 2005 contains several algorithms for each type of model to build, so you must assess the best algorithm according to the business objectives and know knowledge. A good practice is to create several models and rank the former models with respect to success critical. SQL Server 2005 allows doing this task very easy.
Evaluation. (Goal: Evaluate results. Review process. Determine next steps). This step is to evaluate the results and identify failures. Before deploying the model in the production environment, you must first test the model to measure how it performs. Also, you may have created several models and will have to decide which model performs the best according to the business success, so you may have to return to the previous step in the process, either by redefining the problem or by reinvestigating the data in the original dataset. You may test how well the model predicts unseen dataset.
Deployment and maintenance. (Goal: Plan deployment. Plan for monitoring and maintenance. Create the final report). The last step is to deploy the developed model in the production environment and to use the model to predict and make business decisions. You may use the Integration Services provided by SQL Server 2005 to create a mining model package.

Other non-functional aspects to follow in a Data Mining project is check the data privacy and security.

Data mining basic concepts.

There are a lot of concepts in the data mining world.

Case: a basic entity of information, that is, an information holder. It is characterized by attributes, such as Age and Fullname and its underlying state. Attributes are also called variables in statistics terms. An attribute may have values called state. Attributes may be of four different model structure types: key, input, predictable and both input and predictable and two different content types: continuous and discrete. Some may contain only input variables such as the clustering algorithm.

A case may correspond to a row in a table, but it is possible to find a case represented by at least one table row, this type of case is called nested case. Case key is the attributes which uniquely identifies the case. The primary key of the table is used often as case key. Sometimes it is composed by several primary keys, in the case of nested case.

The data mining model is characterized by key columns, input columns, and predictable columns as well as the data mining algorithm on which the model is trained. Training a data model means finding patterns in the training dataset by using the specified algorithm and the underlying parameters for tuning the algorithm. The model also stores the discovered patterns. Model training is a process for discovering knowledge about the training dataset. Model prediction is used for applying trained discovered patterns to a new dataset for predicting potential values.

The data mining algorithm³ is the mechanism that creates mining models. To create a model, an algorithm first analyzes a set of data, looking for specific patterns. The algorithm then uses the results of this analysis to define the parameters of the mining model. It is associated by the desired mining model, the mining settings; the application input specification which output a mining model result which best fits the example data.

The data mining algorithm represents the learned hypothesis using different structures such as:

A set of rules.
A decision tree.
A mathematical model.
A set of clusters.

DMX stands for Data Mining eXtensions and it is a language for defining data mining query adopting several relational concepts.

Getting started with an Data Mining project in SQL Server 2005.

Let's create a simple illustrative data mining project. First of all, I like to say that there are two modes for working in a Data Mining Project.

Immediate mode.
Working in Immediate mode is generally a more natural experience for data mining users. When you work in Immediate mode, you are connected directly and continually to an Analysis Services server. When you open an object, such as a mining structure, you are opening the real object from the server as you would expect. When you change the object and save it, the object is immediately changed on the server.

To get started in immediate mode, then do the following steps,

Launch BI Dev Studio. All Programs --> Microsoft SQL Server 2005 --> SQL Server Business Intelligence Development Studio.
In the File-->Open-->Analysis Service Database
In the Connect to Database dialog, enter the name of the server and database, and then create the underlying solution.

Offline mode.
When working in Offline mode, your project contains files that are stored on your client machine. As you make modifications to objects in this environment, the changes are stored in XML format on your hard drive. The models and other object are not created on the server until you decide to deploy them to the destination server. This provides you as the data mining developer/analyst with the ability to design and test your models on a test server before deploying them to a live server.

To get started in offline mode, then do the following steps as shown in Figure 1.

Launch BI Dev Studio. All Programs --> Microsoft SQL Server 2005 --> SQL Server Business Intelligence Development Studio
In the menu File--> New Project --> Analysis Services Project

Figure 1.

Creating data mining objects.
Let's define the data source to be analyzed as well as the underlying structure and the data mining model.

First of all, let's define the data source object which comprises the connection string to the data management system and additional information about the connection.

Right-click the Data Sources folder in the Solution Explorer, and select New Data Source to launch the Data Source Wizard. In the wizard, we may choose different data source provider, in our example we choose Native OLE/DB, SQL Native Client and then enter the name of the server, the authentication mode (Windows Authentication) and select the example database AdventureWorks provided by Microsoft SQL Server 2005, the next page of wizard is for impersonation information for accessing the database, we choose "Use the service account" option, and finally select a descriptive name for this data source.

Creating a data source view object.
This is the actual view of your data for selecting, organizing, and exploring the data source. In this step, we choose the tables that contain the cases.

Right-click the Data Source Views folder in the Solution Explorer and select New Data Source View to launch the Data Source View Wizard. In the first step, we choose the data source created before. On the Select Tables and Views wizard page click the >> button to move all tables from the Available Objects list to the Included objects list, select the Sales.Customer table to be analyzed, and finally we set a descriptive name for this object such as Sales_Customer Views. You may right-click on the view object, and select explore data for getting knowledge about this data.

You may create a customized view of the data source. Right-click the data source view, and select "New Named Query", and create query for selecting important attributes to be analyzed as shown in Figure 2.

Figure 2.

Using the Data Mining Wizard.
Let's create the major objects in the data mining project: data mining structure and data mining models.

A data mining structure defines the domain of a mining problem, whereas the mining model is the application of a mining algorithm to the structure. A mining structure is composed by a list of structure columns that have data and data types, and a binding to the associated data source as well as additional information about how the data is modeled. The data mining model is the realization of a data mining structure as well as an algorithm with the underlying parameters. For each column of the structure, it is assigned how it will be used in the model and the underlying algorithm specific mapping. The use of the algorithm depends on the business problems and capabilities of your business intelligence server.

You can use Data Mining Wizard to create in one step the data mining structure and the data mining model objects. Right-click the Mining Structure Folder in the Solution Explorer and then choose New Mining Structure. On the first page, you choose whether you are creating a model from a relational or multidimensional source (an OLAP cube). Some mining algorithm doesn't support the creation of model from a cube. In our case, select "From existing relational database or data warehouse", as shown in Figure 3.

Figure 3.

The next page asks you which algorithm to use to create your initial mining model. Choosing which algorithm you are going to use is dependant on the business problem you are trying to solve. In our case, we're going to select Microsoft Clustering algorithm for grouping cases of self-similarity, as shown in Figure 4.

Microsoft Clustering algorithm uses K-means algorithm. The main idea is to initialize randomly k points as the center of the k clusters, and then it is assigned each case to the nearest central point according to a measure function, and so forming new clusters. Later, a mean point for each cluster is calculated, and this is the new central point for each cluster. And then it is again assigned each case to nearest central point, and so on, until no new clusters are created. That is, the algorithm has converged. It is remarkable to say, that the convergence of the algorithm depends strongly of the randomly initialized central points.

Figure 4.

Next pages allow us to select the case table from some defined views, as well as whether the table is nested or not. In that case, the created named query Sales_Customer_Territory_CustomerType, as shown in Figure 5.

Figure 5.

Now let's specify how the attributes are interpreted by the mining algorithm whether they are key, input or predictable. You may click on the suggest button to perform a quick entropy-based analysis to indicate which columns are likely to provide information towards the predictable column.

In our case, as shown in Figure 6, CustomerID attribute is the key for identifying each case, the CustomerType and TerritoryID attributes are the inputs for providing information to the data mining algorithm. In the case of the cluster algorithm, there is no need for a predictable attribute.

Figure 6.

Next page allows us to select for each attribute the respective data and content types. It is very important to select the appropriate content type (discrete or continuous) for the performance and accuracy of the model. Clicking the Detect button causes to sample and analyze the source data for choosing the appropriate content type. In our case, we have described the CustomerID and TerritoryID attributes´content type as discrete for helping the data mining algorithm, as show Figure 7.

Figure 7.

And finally the wizard ends after setting a descriptive name for the data mining structure and data mining model container.

Processing.
Processing is the action to train models on the server. Data mining has the ability to process all the models in a structure in parallel on a single data read. Before processing a newly created structure or model, you must deploy the objects in the server. In the Build menu, select the option Deploy solution.

Interpreting the created model.
Each data mining algorithm is associated with a content viewer. The viewer is specially designed to display patterns discovered with the underlying algorithm.

The cluster viewer.
For discovering information of the cluster, we must navigate for the each tab of the cluster, as shown in Figure 8.

The first tab shows the found cluster and its relationships, that is, how close a given cluster is from any other interested cluster. In this example, we can see that the model found seven clusters and the relationship between them. Each cluster has different color intensity showing the number of cases that belongs to the former cluster.

The Cluster Profiles view, accessed through the second tab, contains a column for each cluster in your model and a row for each attribute. This setup makes it easy to see interesting differences across the cluster space. Using this view, you can choose an attribute of interest and visibly scan horizontally to see its distribution across all clusters. When an item catches your interest, you can look at neighboring cells or other cells of the same cluster to learn more about what that cluster means.

The Cluster Characteristics view allows comparing cluster one by one. Through this process you will refine your view of your chosen cluster.

The last tab is Cluster Discrimination. It comprises variables, values of the variables and the favors or complements from a selected cluster to each cluster or others, indicating which cluster the attribute favors.

Figure 8.

After you have understood the created model, you may rename the cluster for understanding or future use of the model. It is very easy to rename a cluster, just select the cluster to be renamed, right-click and select Rename Cluster action.

Conclusion.

My real intention in this article is to introduce the Data Mining concepts and principles and how to be applied in common business problem, as well as, how we can develop a Data Mining project using good tools provided by Microsoft such as Microsoft Visual Studio 2005 and Microsoft Analysis Service shipped with Microsoft SQL Server 2005. Nowadays, it is very important to embed this kind of components in Information System for decision making.

Web Links.

²Referenced at http://www.crisp-dm.org/, April 2006

References.

¹ZhaoHui Tan, Jamie MacLennan, Data Mining with SQL Server 2005, Wiley Publishing Inc, 2005
³Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997