Data Cleansing Tools In Azure Machine Learning

Introduction

 
Data cleansing is an important part of the Data Science Process which will help in having higher and better accuracy on predictive models. In this article, the process and techniques of doing so shall be discussed using Azure Machine Learning.
 
A video recording describing this process is recorded during the C# Corner Chennai webinar. That can be viewed here.
 
In this example, the credit scoring dataset with the following columns shall be used.
 
LOAN_ID SAVINGS_AC AGE
YEARLY_SALARY EMPLOYMENT_SINCE OTH_INSTALLMENT_PLANS
DURATION_MTH INSTALLMENT_RATE_PERC_INCOME HOUSING
CREDIT_HIST STATUS_SEX EXISTING_CREDITS
PURPOSE OTH_DEBTORS_GUARANTORS JOB
AMT RESIDENCE_SINCE NUM_GUARANTOR
TELEPHONE FOREIGN_WORKER  
 
The DataSet also has a column status label, that is, the column that we want to predict.
 
The first step is, of course, to explore the data in Azure ML Studio.
 
By using the visualize feature of the dataset, we can go through each column of the data set and view properties of each column, such as Mean, Unique Values, and Missing Values.
 
Azure Machine Learning
 
Azure Machine Learning
 
To have a more global picture, the summarized data module can be used. Add the module and connect it to the dataset that needs to be visualized.
 
Azure Machine Learning
 
Here, at one glance, all the details about all the columns can be obtained. It is used to create a set of standard statistical measures that describe each column in the input table. The module does not return the original dataset. Instead, it generates a row for each column, beginning with the column name and followed by relevant statistics for that column, based on its data type.
 
Azure Machine Learning
 
Before reading further, I suggest you download the DataSet and do some exploration.
 
Observations
 
After exploring the DataSet, here are my observations.
 
Duplicates Loan ID
As shown in the screenshot below, the count of unique values is less than the number of values for the column LOAN_ID. This is an indication of duplicates!
 
Azure Machine Learning
 
Null Values
As we continue to explore the output of the summarized data module, another anomaly which is easy to detect is the null values. The column Missing Value count indicates this and it can be identified just in one glance by using the summarize data module.
 
Azure Machine Learning
 
Explore the Data Types
The next step is to scrutinize each column and check if the actual data type matches the expected one. While doing so, we notice that the column INSTALLMENT_PERC_INCOME is of type string while a percentage should normally be of type numeric.
 
Azure Machine Learning
 
If we dig a little bit further and explore the values in the histogram, we’ll actually see that some of the values have “%” in it, which makes the program interpret all the values in the column as a string.
 
Azure Machine Learning
 
Outliers
An outlier is defined as an observation that lies an abnormal distance from other values in a random sample from a population. That is, the values that are too low or too high are compared to the mass volume of the data.
 
One way to quickly identify Outliers visually is to create scatter plots. In this example, I used the create scatterplot custom module to plot AGE against AMT. This could also be achieved using R code, but the custom module is easier and faster in this case.
 
 
Azure Machine Learning
 
Azure Machine Learning
 
As we can see in the scatter plot, there’s some amount which is quite high and far from the “general population”. These are potential outliers!
 
Is the data balanced?
The last item to check is to verify if our data set has the right amount/proportion of data to learn from. In this scenario, we are predicting the outcome of a loan (good or bad).
 
So, is our model learning from more good or from more bad loans?
 
To get this answer, we just have to visualize the histogram of the column STATUS.
 
Azure Machine Learning
 
From this histogram, we notice that the model is learning from more “good” loans. Is this correct? From a cost perspective, it’s more costly to predict a bad loan as good than vice versa! So, it makes more sense to learn from the “bad” loans. We’ll see how to fix this below.
 
Initial Experiment
 
The initial experiment as described in the previous article has an accuracy of 0.743.
 
Azure Machine Learning
 
Data Cleansing
 
We shall now use Azure ML to address the issues above and we’ll see how this can contribute to improving the performance of the machine learning model.
 
Duplicates Loan ID
It is very easy to fix this one, just bring the remove duplicate module on the canvas and select the column that has the duplicates.
 
Azure Machine Learning
 
If we examine the results again, you will see that the duplicates will now be removed. Very easy!
 
Azure Machine Learning
 
Treating Outliers
The easiest way to treat the outliers in Azure ML is to use the Clip Values module. It can identify and optionally replace data values that are above or below a specified threshold. This is useful when you want to remove outliers or replace them with a mean, or threshold value.
 
Azure Machine Learning
 
There are 3 methods that we can use to identify the outliers,
  1. ClipPeaks – If you clip values by peaks, you specify only an upper boundary. Values greater than that boundary value are replaced or removed.
  2. ClipSubpeaks – If you clip only by sub-peaks, you specify only a lower boundary. Values that are less than that boundary value are replaced or removed.
  3. ClipPeaksAndSubpeaks – If you clip values by peaks and sub-peaks, you can specify both the upper and lower boundaries. Values that are outside that range are replaced or removed. Values that match the boundary values are not changed.
Once the Outliers are identified, the following can be used to replace the outliers using the Clip Values module:
  1. Mean – Replace clipped values with the mean of the column values. The mean is computed before values are clipped.
  2. Median – Replace clipped values with the median of the column values before clipping.
  3. Missing – Replace clipped values with the missing (empty) value.
  4. Threshold – Replace clipped values with the specified threshold value.
In this example, we are replacing the outliers with a threshold value. However, you may want to experiment will all the possibilities to identify the best solution to your problem.
 
After running the experiment and creating the scatter plot again (using the clipped amount), the outliers have been removed and the plot looks as follows.
 
Azure Machine Learning
 
Treating the null values
To treat null values, the Clean Missing Data module can be used. It can be used to replace missing values with a placeholder, mean, or other value.
You can also completely remove rows and columns that have missing values.
 
Azure Machine Learning
 
In this example, we are replacing the missing values for columns AGE and DURATION_MTH with a mean value. After running the experiment, you’ll see that there will be no missing value count and the missing data will be replaced by the mean of the column.
 
Azure Machine Learning
 
Custom data manipulations
The AzureML studio also allows you to write your own custom codes using SQL, R, and Python. In this example, we’ll use SQL to fix the data type issues by removing the “%” INSTALLMENT_RATE_PERC_INCOME.
 
To proceed, first, add the Apply SQL Transformation on the canvas.
 
Azure Machine Learning
 
Then, add the following SQL script to remove the “%” and convert the column to an int data type.
 
Azure Machine Learning
 
Below is the result if we compare the column INSTALLMENT_RATE_PERC_INCOME to the new column reworked_INSTALLMENT_RATE_PERC_INCOME.
 
Azure Machine Learning
 
Using SMOTE to create a more “balanced” data set.
As discussed above, we want our model to learn from more negative results than positive ones.
 
You can use the SMOTE module to apply the Synthetic Minority Oversampling Technique to an input dataset.
 
This is a statistical technique for increasing the number of cases in your data set in a balanced way. You use SMOTE in data sets that are imbalanced. Typically this means that the class you want to analyze is under-represented.  The module returns a data set that contains the original samples, plus an additional number of synthetic minority samples, depending on the percentage you specify.
 
Azure Machine Learning
 
If we run the experiment and view the output now, you will see that the status label “2” has more rows.
 
Azure Machine Learning
 
 
Using R to create a more “balanced” data set.
 
The same objective as SMOTE can be achieved using pure R code. Just add the Execute R module and add the following code to increase the number of rows having label “2” by 2 times.
 
Azure Machine Learning
 
Selecting the columns.
In the process of cleaning the data, we created several new columns. Therefore, as the last step of the cleaning process, we need to discard the columns having the “bad data” and keep only the newly created columns. To do so, use the select column module as follows.
 
Azure Machine Learning
 
Evaluating the results
 
Finally, proceed by using the same algorithms and parameters as described in my previous article and view the results of the evaluation module again.
 
Azure Machine Learning
 
References
  1. Engineering Statistics Handbook- What are outliers?
  2. Clip Values
  3. SMOTE
  4. Clean Missing Data


Similar Articles