DP-100 Microsoft Data Science – Data Processing – Solving Data Processing Challenges
- Section Introduction
Hello and welcome. So far we have seen various algorithms and ways of manipulating data. However, we often face issues related to outliers in the data, that is, data points which do not conform to the pattern in the data. Then some variables could be on very different scales: experience might be on a scale of one to 30, while salary could be in the range of hundreds of thousands. We may also be faced with too many variables, the so-called curse of dimensionality, where we have hundreds or sometimes even thousands of columns in a dataset. We have also dealt with missing values in the past.
However, we might want to solve these issues more scientifically for some large data sets with a huge number of missing values. Also, in certain classification problems we may be dealing with an imbalanced data set, such as fraud data. So how do we deal with that? We are also faced with processing data from multiple data sets: we may not always get the data in one sheet or dataset, and we may have to merge data sets on column values. Lastly, though this could be the first step in many cases, how do we summarize and view the data with various measures of the columns and data in one step?
Well, Azure ML provides various modules to deal with these problems. In this section we are going to cover outliers in the data using the Clip Values module; variables at different scales using Normalize Data; dimensionality reduction using Principal Component Analysis, or PCA; and replacing missing values in a large data set using MICE, which is much more scientific than simply replacing them with mean, mode or median values. SMOTE is used for imbalanced data sets such as fraud data, and this is a very interesting concept. We will also learn about merging datasets on column values using Join Data, and we will use the Summarize Data module for summarizing and viewing the data with various measures and estimates. So let's get started, and I will see you in the first lecture of this section. Thank you for joining me in this one.
- How to Summarize Data?
Hello and welcome to the Azure ML course. In this series of lectures we are going to cover some of the most interesting data processing modules. Most of them help us solve various data issues that we may face in real-world examples. In this lecture we will cover the Summarize Data module, which helps us get a summary of all the columns. It generates basic descriptive statistics for the columns in a dataset, and it also provides information about the missing values in all columns. We can also get a count of the categorical values of a column, as well as various statistical measures of the data such as mean, mode, median, quartiles and so on. So, before we deep-dive and run this module, let us try to understand some of the terms that we have not covered in the past.
The first such term is the mean deviation. We have seen what the mean is and have also calculated the mean for this column over here. You can pause the video and calculate it yourself if you want, but this is a very basic concept, so I suggest you follow along. Once we have calculated the mean for a column, we find the distance of every observation from the mean. The average of these distances is nothing but the mean deviation, also known as the mean absolute deviation. It is used as a measure of dispersion when the number of values is small; otherwise we should use the standard deviation. All right, so standard deviation is the next term on the list. As we saw on the previous slide, we have calculated the distance from the mean for every observation.
Next, we square these distances, sum them, and divide by the number of observations minus one. The resultant value is nothing but the variance, or s squared, of the sample, and the standard deviation is simply its square root. The statistical variance gives a measure of how the data distributes itself around the mean or expected value. Unlike the range, which only looks at the extremes, the variance looks at all the data points and then determines their distribution. The standard deviation is also useful when comparing the spread of two separate data sets that have approximately the same mean. The data set with the smaller standard deviation has a narrower spread of measurements around the mean and therefore usually has comparatively fewer very high or very low values. An item selected at random from a data set whose standard deviation is low has a better chance of being close to the mean than an item from a data set whose standard deviation is higher.
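To make these formulas concrete, here is a minimal NumPy sketch on a made-up salary column; the numbers are illustrative, not the ones from the slide:

```python
import numpy as np

salary = np.array([5200, 5800, 6100, 6400, 7000, 7500, 8200])

mean = salary.mean()

# Mean (absolute) deviation: average distance of each value from the mean.
mean_abs_dev = np.abs(salary - mean).mean()

# Sample variance: squared distances, summed, divided by (n - 1).
variance = ((salary - mean) ** 2).sum() / (len(salary) - 1)

# Standard deviation is the square root of the variance.
std_dev = np.sqrt(variance)  # equivalent to salary.std(ddof=1)

print(mean, mean_abs_dev, variance, std_dev)
```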
All right, the next term we are going to cover is the quartile. We have this same salary data. As we know, this row here is the median value. If we divide the first half into two equal sets, this observation here is called the first quartile, and if we divide the second half into two, this observation is called the third quartile. The median is nothing but the second quartile. These observations are called quartiles because they divide the data set into quarters and represent the boundaries of those quarters. Another interesting term to note is the IQR, or interquartile range. The IQR is the difference between Q3 and Q1, and it also helps us determine the spread of the data: the lower the IQR, the narrower the spread, and vice versa.
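A quick NumPy sketch of the same ideas, again with illustrative numbers:

```python
import numpy as np

salary = np.array([5200, 5800, 6100, 6400, 7000, 7500, 8200, 9000])

# Q1, Q2 (the median) and Q3 are the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(salary, [25, 50, 75])

# Interquartile range: the spread of the middle 50% of the data.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```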
All right, the last term we are going to cover in this lecture is skewness. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. I suggest you go through the earlier lecture on basic terms to understand what probability is. Coming back to skewness: a column of data is said to have a positive skew if its probability distribution is tilted towards the left side and has a long tail on the right; it is the opposite for a negatively skewed data set. Data with zero skew is a balanced dataset. That concludes the lecture on the additional terms related to the Summarize Data module. In the next lecture, let's implement it; then some of these terms will become much clearer and we will see the Summarize Data module in action. So see you in the next lecture. Until then, have a great time.
- Summarize Data – Experiment
Hello and welcome. In the previous lecture we covered the various terms used in the Summarize Data module, such as mean deviation, variance, standard deviation, quartiles and so on. In this lecture, let's implement the Summarize Data module and view its output. So let's go to the Azure ML studio and implement it on the adult census data. All right, here we are. Let's now search for the adult census data and drag and drop it here. The next thing is to get the Summarize Data module; let me search for it and drag and drop it here as well. All right, let's make the right connections and we are ready to run it. Great, it has run successfully. Let's visualize the output.
As you can see, the feature column has the names of all the features from the input data set, and then it goes on to provide various summary statistics for each of those variables, such as how many values are present, the number of unique values, and the missing value count, if any. As you can see, workclass, occupation and native country have missing values. So for a large data set, you don't have to go and look at one column at a time.
Instead, Summarize Data can show you all the columns with missing values at once. Similarly, it provides various other statistical measures such as min, max, mean, median, quartiles, variance, standard deviation, and so on. The last few values are the percentiles for that particular column. So P0.5 means the column value at the 0.5th percentile, which means 0.5% of the values are below 24,185 for the column fnlwgt; likewise, 95% of the values are below 379,682, and 99% of the values are below 510,072.
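Outside Studio, you can compute the same percentiles with pandas. Here is a sketch assuming the census data has been exported to a local CSV; the file name is hypothetical:

```python
import pandas as pd

# Hypothetical local export of the adult census income data set.
df = pd.read_csv("adult_census_income.csv")

# P0.5, P95 and P99 for fnlwgt, matching the Summarize Data percentile columns.
print(df["fnlwgt"].quantile([0.005, 0.95, 0.99]))
```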
This can be very useful for detecting outliers as well as for fixing the upper boundaries of many of these columns. I hope you have understood how to use Summarize Data for understanding the various statistics related to the columns. In the next lecture, let's see how we can eliminate the outliers from the data. Thank you so much for joining me in this one and have a great time.
- Outliers Treatment – Clip Values
Hello and welcome to data processing using Azure ML, from a practical perspective. Today we are going to cover the Clip Values module of Azure Machine Learning. But before we jump to the Azure ML studio and start working on the Clip Values module, let's first try to understand why it is important to clip the values of what we call outliers, and what outliers actually are. In this lecture, we are going to cover what an outlier is, the occurrences and causes of outliers, the types of outliers, the impact that outliers have, and how to detect an outlier. All right, the first question: what is an outlier? In simple terms, outliers are observations which are far distant from the other observations, and because of that they impact the predictions or estimates made from the sample. Let's try to understand these statements using the example of a salary data set. In this data set, as you can see here, these two observations are far distant from the other observations, and if we take the simple mean of the salary column, it comes to 8,967.
However, if we omit these two observations, you will see that the mean reduces significantly, to around 6,260, which is much better aligned with the rest of the sample.
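You can verify this effect with a toy column; the numbers here are illustrative, not the slide's exact data:

```python
import numpy as np

# A salary column with two extreme observations at the end.
salary = np.array([5500, 5900, 6200, 6300, 6500, 6700, 25000, 30000])

print(salary.mean())       # mean pulled far up by the two outliers
print(salary[:-2].mean())  # much closer to the typical values
```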
So what causes an outlier and why does it occur? Well, there can be multiple reasons. It could be a human error, such as a data entry error while capturing the observation; it can also occur if the measuring equipment malfunctions; or there could be an error during data transmission or transcription. It can also happen for various other reasons, such as system behavior, fraudulent behavior, natural outliers, and sampling errors. All right, so let's try to understand the different types of outliers. One of them is the univariate outlier, which can be found by looking at the distribution of a single continuous variable, such as this one; as you can see, these two observations are outliers compared to the rest of the data set. For multivariate outliers, you need to look at the data in a multi-dimensional space, such as this one, where certain observations of salary are uncommon when compared to the experience level, and hence can be categorized as outliers.
As you can see, this salary level is not an outlier in the univariate space, because it is not significantly different; but because it occurs when the experience level is very low, it becomes an outlier. Had it been over here, it would have been a normal observation. The same goes for these two observations as well. All right, I hope that explains what an outlier is.
Let's also see the impact of outliers using an example of simple regression. A regression line plotted for this set of data, which has these two observations as outliers, would appear something like this. As you can see, in trying to fit those two observations, the line deviates noticeably from the rest of the data. A regression line fitted without the outliers would fit the data much better than the previous one. I hope that explains the impact of outliers and what they are.
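Here is a small scikit-learn sketch of that impact, fitting the same line with and without two injected outliers on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(0, 1, 20)
y[-2:] += 40  # two outliers at the high end of x

with_outliers = LinearRegression().fit(x, y)
without_outliers = LinearRegression().fit(x[:-2], y[:-2])

# The slope deviates noticeably when the outliers are kept.
print(with_outliers.coef_[0], without_outliers.coef_[0])
```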
So how do we detect the outliers so that we can clip their values? Well, there are multiple techniques, but nothing works as well as visualization. You can view them using a box plot, a histogram or a scatter plot, all of which are readily available in Azure ML when we visualize the data; these observations here would very easily be identified as outliers.
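The box plot's whisker rule is also easy to apply directly in code; a sketch using the common 1.5 x IQR convention on toy numbers:

```python
import numpy as np

salary = np.array([5500, 5900, 6200, 6300, 6500, 6700, 25000, 30000])

# Box-plot convention: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(salary, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(salary[(salary < low) | (salary > high)])  # the detected outliers
```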
You can also use the percentile method for detecting or treating the outliers. For example, in this graph you will see that the majority of the observations fall within the 10th to 90th percentile range and the rest are almost insignificant. We can use such percentile values to detect and treat the outliers. That brings us to the end of this lecture on outliers. In this lecture we have learned what an outlier is, the occurrences and causes of outliers, the types of outliers, the impact of outliers, and how to detect an outlier. In the next lecture, let's use the Clip Values module to treat the outliers in a sample data set. Thank you so much for joining me in this class and I will see you in the next one. Until then, have a great time.
- Outliers Treatment – Clip Values Experiment
Hello and welcome. In the previous lecture we covered what an outlier is, the occurrences and causes of outliers, as well as the types and impact of outliers, and we also saw how to detect one. Let's now go to the Azure ML studio and look at the Clip Values module, which helps us remove outliers from a data set. All right, let's search for Clip Values and drag and drop it here. Let's also get the employee data set onto the canvas and make the right connections. Okay, let's visualize the employee data set and look at the scatter plot of years of experience against monthly income. We select the column Years of Experience and in the visualization we compare it with Monthly Income. As you can see, these four data points are very distant from the rest of the data, and while building a model, such data points can cause deviations from the pattern we may build. In such cases you may want to treat them as outliers and fix a value, or it is possible that beyond a point you may want to treat them all as one category.
This is where the Clip Values module helps us clip the values of the outliers. Let me first close this window. For Clip Values we are going to keep all the default options for now; I will explain the percentile option after we are done with the first one, which is the constant option. Okay. Let's say we are working on a loan application data set, and it does not matter to the bank if the income is beyond $6,000 per month, as such applicants will all be treated equally. So we want to limit the values of monthly income to $6,000. We input 6000 as the constant, click on the column selector, select Monthly Income as the column where we want to clip the values, and click OK. Great.
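What the module does with a constant threshold is essentially a clip operation. In pandas the equivalent would look like this, on toy data, with MonthlyIncome standing in for the column used in this lecture:

```python
import pandas as pd

df = pd.DataFrame({"MonthlyIncome": [3200, 4100, 5600, 9800, 12500]})

# Clip peaks with a constant: anything above 6000 becomes 6000.
df["MonthlyIncome"] = df["MonthlyIncome"].clip(upper=6000)
print(df["MonthlyIncome"].max())  # 6000
```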
Let's now run this module. All right, it has run successfully; let's go and visualize the output. Select the column Monthly Income, and as you can see, the maximum value has now changed to 6000, which is what we wanted. Let's now compare years of experience with monthly income again: our scatter plot looks much better, without significant deviation. Let me close this, and let's do one more experiment before we complete this lecture. Let me delete the employee data set from the canvas and bring in the adult census income data. Great. We ran this module with the default options; let's now see what other options are available to us. The set of thresholds allows us to choose whether we want to clip only peaks or only sub-peaks. Sub-peaks are nothing but the lower thresholds. Remember, there can be outliers on both sides, at the higher or lower ends of a variable's values.
So let me select the clip peaks and sub-peaks option. The next parameter is the threshold, which can be either a constant value or a percentile value; let's choose percentile for this one. The next parameters are the upper threshold value and then the lower one. We choose the 99th percentile, which means all the values above the 99th percentile will be replaced by the 99th percentile value, and the same goes for the 1st percentile at the lower threshold. I know it could be a bit confusing to understand at first, so let's run it and then many things will become clear. The substitute value parameter specifies which values should be substituted for the outliers; we choose the threshold itself. I'm now going to launch the column selector, select fnlwgt as the column, and click OK.
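Again, a pandas sketch of this configuration, clipping both tails at the 1st and 99th percentiles and substituting the threshold values, on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(rng.normal(190000, 100000, 1000))

# Peaks and sub-peaks clipped at the 99th and 1st percentile thresholds.
low, high = s.quantile(0.01), s.quantile(0.99)
clipped = s.clip(lower=low, upper=high)

print(clipped.min(), clipped.max())  # exactly the two percentile values
```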
I'm also going to bring in the Summarize Data module so that we can see the various percentile values and compare them with our results. Let me make the right connection and run the experiment. All right, it has run successfully; let's first visualize the Summarize Data output. For fnlwgt, the value of the first percentile is 27,185.8 and the 99th percentile is 510,072. Because we set these as the threshold limits, the minimum and maximum values in the resultant data set should be 27,185.8 and 510,072.
So let's go and visualize the Clip Values output to confirm it. Let me select the column we clipped the values for, and as you can see, the minimum and maximum values are as expected. You should try it with various other combinations of these options and let me know the results. All right, I hope you have now understood how to do outlier treatment with the Clip Values module. That brings us to the end of this lecture, in which we covered the Clip Values module along with the various options it offers for outlier treatment. I hope you enjoyed learning it with me. Thank you so much for joining me in this lecture and have a great time ahead.
- Clean Missing Data with MICE
Hello, and welcome to the Azure Machine Learning course. So far we have dealt with how to summarize the data, what outliers are and how to clip outlier values. In this lecture we are going to cover one of the very important topics: replacing missing values using MICE. So far we have been replacing missing values with the mean, mode, median or some custom value. But there is a problem with that approach, as it may introduce errors or bias into the data. This kind of approach is also known as a single imputation method, as it makes only one pass over the data.
MICE, which stands for Multivariate Imputation using Chained Equations (also expanded as Multiple Imputation by Chained Equations), makes multiple passes over the data. With MICE, every variable with missing values is modeled conditionally using the other variables or features within the data set. It makes a fundamental assumption that the values are missing at random, which means we can predict them based on the other available data. Hence, it uses regression for predicting continuous missing values and classification for predicting categorical missing values.
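Azure ML implements this inside the Clean Missing Data module, but the same chained-equation idea is available in scikit-learn's IterativeImputer. A minimal sketch with toy numbers:

```python
import numpy as np
# IterativeImputer is experimental and needs this explicit enable import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[25.0, 5000.0],
              [29.0, np.nan],
              [np.nan, 7850.0],
              [42.0, np.nan],
              [51.0, 9100.0]])

# Each column with gaps is modeled on the other columns, and the
# passes are repeated (chained) until the imputed values settle.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```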
I know you may or may not have understood all of this entirely, so let's try to understand it using a simple example. Let's say we have this original data set with two columns, age and salary. However, for some reason, such as data transmission or data entry issues, some of the values from this data set are missing when it is provided to you. All right? So let's see how we can deal with this missing data using MICE instead of replacing it with the mean.
For MICE replacement, we need to follow certain steps. Step one is to calculate the mean based on the available values. Step two is to replace all missing values with the mean. In step three, we choose a dependent column and restore it to its original state. In step four, we apply an appropriate transformation and create the prediction model. Finally, in step five, we predict the missing values and repeat steps three to five. Let's execute them one by one on our sample data set, as a plain explanation alone might not be of much use. All right, step one: calculate the mean based on the available values.
For the available values of these two columns, the mean is 38.1 for age and 7,080 for salary. Next, we replace all missing values with the mean, as you can see for these four observations. In step three, we make salary the dependent column and restore it to its original state, that is, with its missing values, while keeping the independent age column with its missing values replaced by the mean. So what we have done is replace the missing values in the age column with the mean. Okay? In step four, we regress the data set with salary as the dependent variable and build a prediction model using linear regression, which provides us with this equation for salary.
Now, using this equation, when we impute the values of age as 29 and 42, it gives us new salary values of 5,809.33 for 29 years of age and 7,526.24 for age 42. As you can see, these values are very close to what we would have had in the original data set; a replacement by the mean would have created a much less balanced data set here.
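Here is one pass of steps one to five as a pandas/scikit-learn sketch, with toy numbers rather than the slide's exact data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age":    [25, 29, np.nan, 42, 51, np.nan],
                   "salary": [5000, np.nan, 7850, np.nan, 9100, 8500]})

# Steps 1-2: fill every gap with the column mean.
filled = df.fillna(df.mean())

# Step 3: make salary the dependent column and restore its missing cells.
filled["salary"] = df["salary"]

# Step 4: regress salary on age using the rows where salary is known.
known = filled.dropna(subset=["salary"])
model = LinearRegression().fit(known[["age"]], known["salary"])

# Step 5: predict the missing salaries from the (mean-imputed) ages.
missing = filled["salary"].isna()
filled.loc[missing, "salary"] = model.predict(filled.loc[missing, ["age"]])
print(filled)
```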
Similarly, we repeat these steps for age: with the new values of salary in place, we assume age is the dependent variable, and the new regression equation looks something like this. We then calculate the age values for the two missing rows by imputing the corresponding salary values. By doing that, we get 45.48 for the salary of 7,850 and 38 years for the salary of 8,500, which is very close to what it should have been. In MICE, we repeat these steps for a definite number of iterations so that the values converge; with just two iterations we came very close to the actual values. As you can see, MICE provides a very scientific way of replacing missing values. At the same time, because it requires a good understanding of regression, I thought of covering it here instead of in the earlier data processing section. Let's now go to the Azure studio and perform an experiment using MICE on some missing data. I'll see you soon in the next lecture in the Azure ML studio. Thank you so much for joining me here. Have a great time.
- Clean Missing Data with MICE – Experiment
Hello and welcome. In the previous lecture we covered Multivariate Imputation using Chained Equations, or Multiple Imputation by Chained Equations, also known as MICE, and we walked through the steps MICE executes using an example. In this lecture, let's implement MICE and see how it imputes values, using the Clean Missing Data module with MICE as the replacement option. All right, so here we are in the Azure ML studio. I have uploaded this data set for MICE; you can download it from the course materials section for this lecture. Let's drag and drop it here, and let's summarize the data so we know how many missing values we have and where, for validation at a later stage.
So I bring in the Summarize Data module, connect it, and run it. All right, it has run successfully and it's time to visualize it. As you can see, Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term and Credit_History have missing values. Let me close this; let's now bring in Clean Missing Data and make the right connections. We have seen all these parameters in the past, and I'm going to keep them as they are for this experiment. Let's launch the column selector and select the columns which have missing values: Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term and Credit_History, and click OK. Let's now come to the cleaning mode, where we specify "Replace using MICE".
I'm going to check this checkbox, as it will add an indicator column at the end of the data set; we will see that when we look at the results. We know from the previous lecture that MICE runs multiple iterations for imputing values; let's keep the default for now. I have deliberately made a mistake here, and we will check it after running the Clean Missing Data module. So let's run it. All right, let's now visualize the output. We should now have values in the previously missing rows. Let me check Gender first. Oops, it still has missing values. Any guesses why that happened? The hint is the feature type. Pause the video and try to come up with the reason.
Well, the MICE algorithm expects numeric or categorical variables; it cannot process string variables, and hence Gender, Married, et cetera still have missing values, whereas LoanAmount and the other numeric features no longer have any, because MICE has replaced them. If we go all the way to the right, you will see the indicator columns with true or false values; these were added because we checked that checkbox. All right, now let's visualize the values further. I have completed a few steps to save some time: I added the Edit Metadata module, changed the data type of the indicator column "LoanAmount is missing" from Boolean to string, and ran it. Next, I split the data with the splitting mode set to Regular Expression, matching rows where "LoanAmount is missing" starts with "Tr", that is, True; this extracts all the rows where the indicator is true. I have not gone into the details of these modules as we covered them in the past. You may not even have to do this while running an experiment; you can simply take the output of the Clean Missing Data module and feed it into the workflow of your model. All right, let's now visualize the output of the Split Data module. Well, there are 22 rows which had missing values, matching the output of the Summarize Data module, if you recall, and the LoanAmount column now has the values which were replaced using MICE. As you can see, it has multiple different values, from 112 to 183; if we had used the mean, median or mode, we would have had only one single value here. I hope that explains how MICE works and its real-life application.
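In Studio itself, the fix for the string-column problem is to use Edit Metadata to mark those columns as categorical before Clean Missing Data. Outside Studio, a hedged pandas sketch of the same idea; the file name is hypothetical, and the column names are from this loan data set:

```python
import pandas as pd

df = pd.read_csv("loan_data.csv")  # hypothetical local copy

# MICE-style imputers need numeric or categorical inputs, not raw strings.
# Map each string column to numeric category codes, keeping the missing
# cells as NaN so the imputer can still see (and fill) them.
for col in ["Gender", "Married", "Dependents", "Self_Employed"]:
    codes = df[col].astype("category").cat.codes  # -1 marks missing values
    df[col] = codes.where(codes != -1)            # turn -1 back into NaN
```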
Now what I would want you to do is change the string variables such as Gender et cetera into categorical, along the lines of the sketch above, and clean them using MICE. Later on you can run two parallel models with the same algorithm and compare the output to see the effect MICE can have. That concludes the lecture on how to use MICE. Thank you so much for joining me in this one and I will see you in the next class. Until then, enjoy your time.