DP-100 Microsoft Data Science – Regression Analysis
- Linear Regression model using OLS
Hello and welcome to the regression analysis using Azure ML Studio. So far we have seen simple and multivariate linear regression. We also understood some common metrics to measure the effectiveness of a linear regression, such as mean absolute error, root mean squared error and relative absolute error, with ordinary least squares as the solution method. In this lecture we are going to build a linear regression model using the ordinary least squares method. Before we execute and solve the problem, let's try to understand the problem at hand. We need to build a model to predict the price of a vehicle based on the available historic data. So let's go to the Azure ML Studio and understand the problem in detail. All right, let's search for the automobile price data. And there it is. Let's drag and drop it here and visualize the dataset. All right, this dataset has got 205 rows and 26 columns. Let me explain some of the columns in this particular dataset.
This dataset consists of three types of data. First is the specification of an auto in terms of various characteristics such as its make, model, fuel type, length, height, engine type and so on. Second is the assigned insurance risk rating, and this rating corresponds to the degree to which the vehicle is more risky than its price indicates. Now, what do we mean by that? Usually cars are initially assigned a risk factor symbol associated with their price. Then, if the car turns out to be more or less risky, this symbol is adjusted by moving it up or down the scale. Actuaries call this process symboling, and a value of plus three indicates that the auto is risky, while minus three means it is probably pretty safe.
All right, the third type of data we have is the relative average loss payment per insured vehicle per year. This value is normalized for all vehicles within a particular size classification such as two-door small, station wagon, sports and so on, and it represents the average loss per car per year. If you analyze the columns further, you will realize that symboling has six unique values and is actually a category of symbols, and hence should be treated as a categorical variable. Also, the columns normalized-losses, num-of-doors, bore, stroke, horsepower and peak-rpm, as well as price, have missing values. So we need to edit the metadata for symboling as well as the string variables, and we also need to clean the missing data.
So let me close this, and let's first clean the missing data. We search for the Clean Missing Data module, drag and drop it here, and connect it to the automobile price dataset. As you know, we cannot clean two different types of columns in a single operation unless we want to substitute both with the same value. So during the first cleaning operation, let's clean only the numeric columns and replace the missing values with the mean. Let's launch the column selector, select all numeric columns and click OK, and replace them with the mean. All right, we are ready to run it. Well, it has run successfully. Now let's attach another Clean Missing Data module to it for the string variables.
So let me just copy and paste this particular module, connect the output of the first Clean Missing Data to the input of the second, and launch the column selector. All right, select all the string variables and click OK. Now we are ready to run it, so I right-click here and run selected. It has run successfully. You can visualize the output to check if any missing values remain. For now, I'm going to go ahead and edit the metadata of the dataset. So let's search for the Edit Metadata module and drag and drop it here. As you must have guessed, we want to change the metadata of the symboling column and all the string variables. So let's launch the column selector.
All right, let's select all the string columns, also select the column symboling, and click OK. Let's now run this module. Awesome, it has run successfully. Now we need to split the data for training and testing purposes, so I'm going to bring in the Split Data module and drag and drop it here. Let's make the right connections. Next we are going to apply the linear regression on the training data, so let's get the Linear Regression module. And as you know, we need the Train Model, Score Model and Evaluate Model modules as well, so we drag and drop them onto the canvas one by one. Most models require a very standard flow: split the data into train and test sets, bring in an untrained model, train it, score it and evaluate it. So you can always keep these modules handy, or bring all of them onto the canvas at the start of the experiment.
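(If you prefer to follow these steps along in code rather than in Studio, here is a minimal pandas sketch of the same preprocessing. The CSV file name, the "?" missing-value marker and the mode-based fill for string columns are my own assumptions for illustration, not part of the lecture.)

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical export of the Automobile price data; the raw UCI file marks
# missing values with "?", so adjust na_values if your export differs.
df = pd.read_csv("automobile_price.csv", na_values="?")

# Clean Missing Data (numeric): replace missing values with the column mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Clean Missing Data (string): here we fill with the most frequent value.
string_cols = df.select_dtypes(include="object").columns
for col in string_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Edit Metadata: treat the string columns and symboling as categorical.
for col in list(string_cols) + ["symboling"]:
    df[col] = df[col].astype("category")

# Split Data: 70:30 train/test split, as the lecture uses.
train, test = train_test_split(df, train_size=0.7, random_state=123)
```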
All right, for our model over here, let's connect the Linear Regression module to the first input of Train Model and the training dataset to its second input. The trained model then goes to the first input of Score Model and the test dataset to its second input. Finally, the Score Model output goes to Evaluate Model. You can pause the video so that you can do it along with me. All connections seem okay. Before we run it, let's have a look at the parameters the Linear Regression module requires. The first one is the solution method. It can accept one of two values, ordinary least squares or online gradient descent. We know what OLS is, and we will cover gradient descent in the next few lectures.
For this experiment we will run it with OLS, so let's select it. We are already familiar with the L2 regularization weight, and let's keep it at its default for now. Let's change the random seed to 123. Before running it, we need to provide the label column to the Train Model module, because we want to predict the price of the vehicle. So let's launch the column selector; the column that we are going to predict is price, which is our dependent variable as well. We click OK, and everything seems fine.
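(Continuing the code-along sketch above: fitting an OLS linear regression and computing the metrics mentioned earlier. The one-hot encoding is an assumption of this sketch, needed because scikit-learn's LinearRegression expects a numeric matrix.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# One-hot encode the categorical columns; align test columns to train columns.
X_train = pd.get_dummies(train.drop(columns=["price"]))
X_test = pd.get_dummies(test.drop(columns=["price"])).reindex(
    columns=X_train.columns, fill_value=0)
y_train, y_test = train["price"], test["price"]

model = LinearRegression().fit(X_train, y_train)  # ordinary least squares
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
# Relative absolute error: total absolute error relative to a mean predictor.
rae = np.abs(y_test - pred).sum() / np.abs(y_test - y_test.mean()).sum()
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  RAE={rae:.3f}  "
      f"R2={r2_score(y_test, pred):.3f}")
```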
And let's run the experiment. Great, it has run successfully, and let's visualize the output. It has got the values of all the metrics that we saw, and we can say that our model is good based on the coefficient of determination. While these errors appear huge, remember that we are working on a small dataset: we have only 205 observations and 26 columns. Nevertheless, the coefficient of determination is very good. Well, you may ask what the coefficient of determination is and how we can interpret it. Let's reserve that for the next lecture, where we will talk about the coefficient of determination, or R squared, which explains how well the model has performed. So see you in the next lecture, and thank you so much for joining me in this one.
- Linear Regression – R Squared
Hello and welcome. So far we have seen what linear regression is and its types. We also covered OLS, which is one of the methods to perform regression analysis, and in the previous lecture we created a regression model using OLS. But the big question is: how do we know if our model is a good one, that is, whether the regression explains the variation in the predicted variable to a fair degree? That is where the R squared, or coefficient of determination, can help. It explains how much of the variation in y, the dependent variable, or you can say what percentage of the variation in y, is explained by the variation in x, our independent variable.
Let's try to understand the concept and the formula of R squared. Let's go back to our previous example, where we performed the regression analysis for the number of hours of study and the marks obtained by a student. This was the equation, and we also calculated the values of b zero and b one. Using this equation, let's try to predict the marks scored for the number of hours of study. So what I'm going to do is substitute the values of x from the column of hours studied and compute y, which is nothing but the marks obtained. As you can see, for the first record the predicted y will be 41.8 plus 4.55 multiplied by zero, and hence the predicted marks will be 41.8.
So we compute it for all the records, and this is how the table would look. All right, the next thing we need to do is calculate the difference between the actual marks scored and the mean, and square it. Similarly, we calculate the difference between the predicted marks and the mean, and square that as well. Now our table would look something like this once we have computed all the values as discussed; you can pause the video, create this table yourself and check it out. All right, the sums of these columns give us the two key terms SST and SSR. SSR is the sum of squares due to regression, which is the sum of the squared differences between the predicted values and the actual mean.
And SST is the total sum of squares, that is, the sum of the squared differences between the actual values and the actual mean. Once we have these, the R squared, or coefficient of determination, can be calculated as SSR divided by SST. A higher value of R squared means that the variation in y is well explained by the variation in x. I hope that clears up the concept of R squared or the coefficient of determination. I suggest you go back to the experiment, evaluate the results using Evaluate Model and visualize the R squared value so that you get a better understanding of it. That brings us to the end of this lecture on R squared. Thank you so much for joining me in this class. I will see you in the next one. Until then, have a great time.
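For reference, the quantities from this lecture written out in symbols, with y_i the actual marks, the hat denoting the predicted marks, and the bar denoting the mean of the actual marks:

```latex
SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
R^2 = \frac{SSR}{SST}
```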
- Gradient Descent
Hello and welcome. I'm so excited to cover today's topic. So far we have covered some basics; we also now know what OLS is, as well as the coefficient of determination or R squared. In this lecture, I am going to teach a very important and powerful concept: gradient descent. Gradient descent is used while building many machine learning algorithms, and it is the cornerstone of artificial neural networks as well. But before we touch upon gradient descent, let's try to understand some of the core concepts that will lead us to it.
All right, one such concept is the hypothesis. The literal meaning of hypothesis is a proposed explanation made on the basis of limited evidence as a starting point for further investigation. So for linear regression, the hypothesis function can be written as h of x is equal to b zero plus b one x, where x is the input to this function and b zero and b one are the parameters. Our goal here is to find the values of b zero and b one such that y becomes equal to h of x for the given observations. Let's see what I mean by that. We have this example from the previous lectures: a sample dataset of the number of hours of study versus the marks scored by a set of students. When we plot this data in two-dimensional space, it looks something like this. Now, we can draw a line like this, or like this, or something like this. How do we know which one is the best representation of the data? That is where the cost function helps us in determining the best parameters. We consider the cost function for this linear function as one over two n multiplied by the sum of the squared errors between the actual and predicted values, where n is the number of observations.
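Written out, the hypothesis and the cost function just described are:

```latex
h(x) = b_0 + b_1 x, \qquad
J(b_0, b_1) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( h(x_i) - y_i \bigr)^2
```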
All right, let's try to apply the same on our sample dataset, trying this hypothesis with multiple values of b zero and b one. Firstly, let's assume that the value of b zero is zero and b one is equal to one. In that case, our predicted marks for the first observation would be zero plus one multiplied by zero, which is nothing but zero. Similarly, the second predicted value will be zero plus one multiplied by two, and hence two. When we fill in the entire table, it looks something like this. Basically, the predicted marks take the same values as the hours of study in this particular hypothesis. So is that the correct one? Of course not, because we know what the actual values are.
Well, we need to calculate similar values for as many hypothetical parameter values as possible. But before we do that, let's also calculate our cost for this set of parameters. So we calculate the square of the difference between the actual and predicted values of y, sum it all up, and then multiply it by one over two n, where n in this case is 13. You can pause the video and do this in Excel if you want to try it out yourself. You also have the result here: 1944.538. So when b zero is zero and b one is one, our cost as per this formula is 1944.538. We then do the same for various other values of b one, keeping b zero as zero. That gives us all the possible values of the cost for this parameter range where b zero is zero. We can plot it on a two-dimensional plot and, as you can see, it appears like a parabola. The cost is very high in the beginning, but it gradually reduces, finally reaches the bottom and then starts increasing again. So the values assigned to the parameters determine how your hypothesis behaves. And remember, our goal here is to find the set of parameters that gives us the least possible cost for this hypothesis, so that the predicted y is as close as possible to the actual y.
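(As a quick code-along aside, here is the same cost computation in Python. The (hours, marks) pairs below are made up for illustration; they are not the lecture's actual 13 records, so the printed costs will differ from 1944.538.)

```python
import numpy as np

hours = np.array([0, 2, 3, 4, 5, 6, 8])          # hypothetical data
marks = np.array([40, 50, 55, 60, 65, 68, 78])   # hypothetical data

def cost(b0, b1):
    """J(b0, b1) = 1/(2n) * sum of squared errors."""
    predicted = b0 + b1 * hours
    return np.sum((marks - predicted) ** 2) / (2 * len(hours))

# Keep b0 fixed at 0 and sweep b1, as in the lecture:
# the printed costs trace out the parabola described above.
for b1 in [0, 2, 4, 6, 8, 10]:
    print(f"b1={b1:>2}  cost={cost(0, b1):.1f}")
```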
So far, we derived the cost by varying only one parameter, b one. The graph of b zero, b one and the cost would appear something like this when viewed in three-dimensional space. This has been created using the website Academo.org, which provides some beautiful 3D graphs based on the equation you provide to it. All right, so how do we teach our algorithm to descend to the bottom of this structure? Remember, you and I can see the path from anywhere to the bottom in this diagram. But our algorithm is blind; it cannot see. So let's try to behave like a blind algorithm and find out if we can reach the bottom of a physical structure. Let's see that with an example, and assume that you need to reach this beautiful house which is situated at the bottom of the hill. You are either here, or maybe here. You have been asked to reach the house, but you have been blindfolded and cannot see a thing. Still, you would be able to start going in the direction of the house: you know it is at the bottom, and if you follow the descent, you have a chance of reaching it.
So you may take this path and reach the house if you are here; or, if you are here, you may take this other path and reach the house. However, because you are blindfolded, there are high chances that you may cross the house and start going away from it. But very soon you'll realize that you are going up the hill, whereas the house is at the bottom. So what do you do? You turn around and start going down the descent again. You may continue doing that, taking smaller and smaller steps, until you reach the house at the bottom of the hill. The concept of gradient descent is exactly the same when applied to algorithms: it takes these steps in the descending direction to reach the bottom. So how does it do it in pure mathematical terms? Let's say we are at this particular point. We draw a tangent to the curve at this point, identify its slope, and then take one step in the direction of the downward slope, assigning a new value of b one calculated as the previous b one minus alpha, the size of the step (the learning rate), multiplied by the derivative of the cost function at that point.
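(In code, the update rule just described looks like the following sketch, again using the hypothetical data from the earlier cost example. Note that the term multiplied by alpha is the partial derivative of the cost J, not J itself.)

```python
import numpy as np

hours = np.array([0, 2, 3, 4, 5, 6, 8], dtype=float)    # hypothetical data
marks = np.array([40, 50, 55, 60, 65, 68, 78], dtype=float)

b0, b1 = 0.0, 0.0
alpha = 0.01            # learning rate: the step size discussed above
n = len(hours)

for epoch in range(20000):       # each pass over all records is one epoch
    error = (b0 + b1 * hours) - marks
    b0 -= alpha * error.sum() / n              # alpha * dJ/db0
    b1 -= alpha * (error * hours).sum() / n    # alpha * dJ/db1

print(f"b0={b0:.2f}, b1={b1:.2f}")   # parameters with (near) lowest cost
```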
By doing this repeatedly over the entire training set, our algorithm can find the set of parameter values which gives us the lowest cost. One thing to note about the learning rate: it acts as the step size, and hence if it is too high, there are chances that you may never reach the actual bottom, because you keep overstepping it. Similarly, if the step size is too small, it may take a very, very long time to reach the bottom. I hope you have paid the necessary attention to this concept, as it forms the basis of many algorithms and how they reach high accuracy. That brings us to the end of this lecture on gradient descent. The method we discussed is also known as batch gradient descent. In the next lecture, we will cover stochastic gradient descent. Thank you so much for joining me in this one, and have a great time.
- Linear Regression: Online Gradient Descent
Hello and welcome. In the previous lecture we learnt about gradient descent and how it minimizes the cost function. We also saw how the learning rate, that is, the size of the steps taken in the direction of minimum cost, helps us achieve higher accuracy. However, if we have a dataset where the cost function appears like this, or if we have a large dataset, then gradient descent may either settle on a wrong (local) minimum cost or become computationally expensive. In gradient descent, before it takes one step, it has to go through the entire training dataset.
One full pass through the training data is also called an epoch. Hence it takes much longer for the algorithm to reach the bottom, and it does this for all examples, all features, and at the learning rate we specify. That is where the variant of gradient descent called stochastic, or online, gradient descent can help us achieve results much faster. In stochastic gradient descent, we randomly shuffle the dataset and repeat the steps for every example, updating the coefficients at every step.
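(A sketch of that difference in code, reusing the hypothetical data from the previous lecture's examples: shuffle, then update the coefficients after every single example instead of after a full pass.)

```python
import numpy as np

rng = np.random.default_rng(123)
hours = np.array([0, 2, 3, 4, 5, 6, 8], dtype=float)    # hypothetical data
marks = np.array([40, 50, 55, 60, 65, 68, 78], dtype=float)

b0, b1, alpha = 0.0, 0.0, 0.01

for epoch in range(2000):
    order = rng.permutation(len(hours))   # random shuffle each epoch
    for i in order:
        error = (b0 + b1 * hours[i]) - marks[i]
        b0 -= alpha * error               # update after every example
        b1 -= alpha * error * hours[i]

print(f"b0={b0:.2f}, b1={b1:.2f}")
```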
This ensures that our algorithm reaches the bottom much faster compared to batch gradient descent, although a visualization of its path might show a much more zigzag pattern. This was a very short explanation, and I hope it served the purpose. In the next lecture, let's implement linear regression using online gradient descent. Thank you so much for joining me in this one, and have a great time ahead.
- LR – Experiment Online Gradient
Hello and welcome. Today we are going to implement the online gradient descent method for linear regression. As you can see, I have already copied our previous example of linear regression, and keeping all the other steps the same, I am going to change the linear regression solution method to online gradient descent. Well, it requires a different set of parameters, so let's try to understand them. The learning rate is the size of the step the algorithm takes at each gradient descent update.
We also need to specify the number of training epochs; an epoch is one complete pass that our algorithm makes through the training data while moving in the direction of the minimum cost. We know what L2 regularization is, so let's leave all the other parameters at their default values. And while we are almost ready to run, I would always advise using Tune Model Hyperparameters for all such algorithms, because different combinations of these parameters can give you completely different results.
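(For those following along in code, here is a rough scikit-learn analogue of this tuning step, reusing the X_train and y_train from the earlier sketches. The specific parameter grid is my own illustrative choice, not the module's actual default range.)

```python
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Grid over learning rate and number of epochs, scored by R squared.
param_grid = {
    "sgdregressor__eta0": [0.001, 0.01, 0.1],     # learning rate
    "sgdregressor__max_iter": [100, 500, 1000],   # training epochs
}
pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=123))
search = GridSearchCV(pipe, param_grid, scoring="r2")  # try the entire grid
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```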
So let's bring in the Tune Model Hyperparameters module; let me search for it and drag and drop it here. Because we are going to use Tune Model Hyperparameters, we do not need the Train Model module, so I'm going to delete it, bring Tune Model Hyperparameters over here and provide the right connections. The linear regression goes here, and the training dataset comes to this particular node. Let's also change the create trainer mode of our Linear Regression module to parameter range, and keep the range as it is; we are not going to provide any different values.
It already has a sufficient number of possible combinations. For the Tune Model Hyperparameters module, let's choose the entire grid as the sweeping mode. Next, we launch the column selector, select our label column, which is price, and click OK. We also need to provide the metric for measuring performance, and because this is regression, we choose the coefficient of determination from this dropdown here.
That is, we want the combination of these parameters where the coefficient of determination is highest. All right, now we are ready to run our experiment. And believe me, if you are doing this with me, it's going to take a long, long time, as we are running the entire grid for these parameters, so it is going to try that many combinations. You may want to take a break for a few minutes and come back while your experiment is running.
However, you may also want to pause the video before you go. Well, it has run successfully, so let's visualize the output. That's a great result, and we have a high coefficient of determination. That concludes the lecture on how to build linear regression using the online gradient descent method. Thank you so much for joining me in this class, and enjoy your time.
- Decision Tree – What is Regression Tree?
Hello and welcome. In this lecture we are going to learn about regression using decision trees. To recap what we learned during decision trees for classification: the root node is the first node and represents the entire sample dataset. The process of forming the various nodes is called splitting, and when a sub-node splits into further sub-nodes, it is called a decision node. When we cannot, or do not want to, split a node further, it is called a leaf or terminal node. And a subsection of the entire tree that has one or more decision nodes and two or more leaves is called a branch or subtree.
We used decision trees for categorical variables during classification, and we also built a decision tree for loan approval prediction. As we know, a categorical variable can take only a certain number of values. But during regression we are dealing with a continuous variable, which can take any value in a range. So what do we do if we want to use decision trees for regression? Regression analysis is applied to continuous variables, and we definitely don't want to split on every possible value of the predicted variable. So we introduce something called a threshold. Let's try to understand that using an example. In this example, we have two variables, x one and x two, plotted on this graph, and let's say our threshold for x one is at 30.
So we split on the value of x one such that we can ask a true-or-false question, and the resulting observations can be grouped into two subsets. When we ask the question, is x one less than 30, we create this line at x one equals 30. In essence, we create two groups of observations: one where x one is less than 30, and a second where x one is greater than or equal to 30. Then we ask another question, whether x two is less than 20, and that gives us this line at x two equals 20. Now, if the answer to this question is yes, we get the region r one, where all the data points have x one less than 30 and x two less than 20.
However, if the answer to this question is negative, that is, the observations have x one less than 30 but x two greater than or equal to 20, then we get this region r two. Let's say we want to stop at this stage on this side of the decision tree, and let's go back to the answer where x one is not less than 30, that is, it is greater than or equal to 30. So we are talking about this region over here. If we divide it based on whether x two is less than 40 or not, we get this line at x two equals 40, and if the answer is no, that is, the data points have x two greater than or equal to 40, we get the region r three. All the other data points will be in the region below this line. This is fun, so let me divide it further: for the points where x two is less than 40, let's ask whether x one is less than 60. Well, that gives this line, which further divides the space into two regions.
If the answer to the above question is yes, then we get this region r four, and a negative answer gives us the region r five. We can keep on doing this for various thresholds of various variables. So basically, we partition the data into multiple regions, and hence a regression tree for this sample data would look something like this. But the big question then is: how does such a regression tree predict outputs?
Well, every region produces an output, which could be the mean of the target values in that region, or even a simple linear regression fitted over the data points within that particular region, which can then be used to predict the output. I hope that explains the intuition of regression trees. Let's go to the next lecture, where we can learn about and build a regression model using a boosted decision tree. Thank you so much for joining me in this lecture, and I will see you soon in the next one. Have a great time.
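(Before moving on, a tiny sketch of that intuition in code, on made-up data: a regression tree splits on thresholds and predicts the mean of each region.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(123)
X = rng.uniform(0, 100, size=(200, 2))   # two features: x1 and x2
# Target jumps at x1 = 30, so a good first split should land near there.
y = np.where(X[:, 0] < 30, 10, 25) + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
# Each leaf's "value" is the mean target of the data points in that region.
```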
- Decision Tree – What is Boosted Decision Tree Regression?
Hello and welcome. In this short lecture, we will cover what regression using a boosted decision tree is. We are already familiar with most of the concepts of decision trees as well as boosting. In the previous lecture, we covered what a regression tree is and how decision trees are applied to regression. We also saw how the output is predicted using regression trees. During classification, we saw what boosting is and how a decision tree is formed using boosting.
We know that boosting trains the decision trees in a sequence, where each tree learns from the previous one by focusing on the incorrectly predicted observations: it builds a new model with higher weights for the incorrect observations from the previous step. The boosted decision trees in Azure ML are based on the same principles of ensemble learning. They use an efficient implementation of the MART (Multiple Additive Regression Trees) gradient boosting algorithm.
Using gradient boosting, it builds each regression tree in a step-wise fashion, using a predefined loss function to measure the error in each step and correct it in the next. In the next lecture, let's build the regression model using the boosted decision tree. Thank you for joining me in this short lecture, and I'll see you in the next one.
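(A sketch of the same idea with scikit-learn's gradient-boosted trees, assuming the X_train, y_train and X_test from the earlier sketches; each tree in the sequence is fitted to correct the errors of the ensemble so far. This is a generic gradient boosting implementation, not Azure ML's MART itself.)

```python
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    n_estimators=100,      # trees built in sequence
    learning_rate=0.1,     # contribution of each corrective step
    max_depth=3,
    loss="squared_error",  # the predefined loss measured at each step
    random_state=123,
)
gbr.fit(X_train, y_train)
print(gbr.score(X_test, y_test))   # coefficient of determination
```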
- Decision Tree – Experiment Boosted Decision Tree
Hello and welcome. In this lecture we are going to build our regression model using a boosted decision tree. As you can see, I have already copied the previous experiment of linear regression, and just to recap what we have done: we have this automobile price data that contains various observations of vehicle details as well as actuarial information such as symboling and loss data. We applied Clean Missing Data, first for all the numeric values and second for cleaning the missing values from the string variables. We then edited the metadata and converted all the string variables to categorical, as they hold categorical values. We also converted the symboling column to a categorical variable, as it has a finite range of minus three to plus three. We finally split the data in a 70:30 ratio and ran the linear regression model using OLS as the solution method.
All right, we are now going to run the boosted decision tree alongside this model. We will then evaluate and compare the two models and check which one predicts a more accurate price. So let's get the Boosted Decision Tree Regression module. We have always been searching for modules and then dragging and dropping them; however, you can also find them under the various sections over here. All the untrained models can be found under the Machine Learning section, so let me expand it. Under Initialize Model you can see there are four groups: anomaly detection, classification, clustering and regression. Because we are doing regression, let me expand that further. And there we have our Boosted Decision Tree Regression module.
Let me drag and drop it from here. As I explained earlier, it is always advisable to use Tune Model Hyperparameters when we have to deal with so many parameters; it also saves a lot of time in trying out various combinations of them. So before we bring in Tune Model Hyperparameters, let's change the trainer mode from single parameter to parameter range. It already has enough combinations of these parameters, so we simply provide the random number seed as 123 and our model is set. Next, we search for Tune Model Hyperparameters, connect the untrained boosted decision tree to the Tune Model module, and provide the training dataset.
Let's change the random seed to 123 as well, and provide the label column. So let's launch the column selector, select price and click OK. As you know, the next step is to score the trained model and finally evaluate it. So let's get the Score Model module here and provide the right connections. Let's also evaluate the model and compare it with the results of the linear regression using OLS. Everything seems to be okay and we are ready to run it. Great, it has run successfully. Let's visualize the Score Model output first. As you can see, the predicted values are now much closer to the actual price. So it appears that the boosted decision tree has performed really well. But before we come to any conclusion, let's evaluate the results alongside the linear regression. So let me close this and evaluate the overall results. That's fantastic.
As you can see, all the errors for the boosted decision tree are lower than the errors for the linear regression. Also, the coefficient of determination is much higher compared to the 0.856 of the linear regression. So we conclude that the boosted decision tree has performed much better than the linear regression using OLS. Creating and comparing different algorithms is really that simple in Azure ML Studio. If we need to compare a few more models, such as linear regression using gradient descent or any other such model, we can simply create another branch and compare them using Evaluate Model. I hope you have followed along and created the boosted decision tree regression model with an awesome performance. That concludes the lecture on boosted decision tree regression. Thank you so much for joining me in this one, and have a great time ahead.
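(And for completeness, the code-along version of that side-by-side comparison, assuming the fitted `model` (OLS) and `gbr` (boosted trees) from the earlier sketches:)

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Score both models on the same held-out test set and compare the metrics.
for name, m in [("Linear regression (OLS)", model),
                ("Boosted decision tree", gbr)]:
    pred = m.predict(X_test)
    print(f"{name:28s} MAE={mean_absolute_error(y_test, pred):8.1f} "
          f"R2={r2_score(y_test, pred):.3f}")
```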