DP-100 Microsoft Data Science – Clustering
- What is Cluster Analysis?
Hello and welcome. In this lecture we will cover the basics of clustering. Clustering is a very subjective technique. So what is clustering? Well, it is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar in some sense to each other than to those in other groups or clusters. It’s an unsupervised learning method, and it is used for discovering distinct groups in a given set of observations.
For example, take customers who make long distance calls and don’t have a job. Who could they be? Well, they could very well be students from overseas making long distance calls back home. Marketers use this kind of knowledge to develop targeted marketing programs and strategies. Marketing teams can segment the customer portfolio based on demographics, transaction behavior or other preference data. Clustering is typically used for initial profiling, and once we have a good understanding from that profiling, we can use objective techniques to build specific strategies for each segment. One example of clustering is a recommendation engine, where users and items are grouped together based on user preferences and the features of the users and the items. Another example is market segmentation, such as high spenders, customers with a high propensity to buy a particular product, and so on.
Social network analysis is also a good example of clustering, where we create network structures of people, things and so on. It can even be used to identify friendship networks as well as networks of fraudsters. Clustering is also used in the field of medical science to group symptoms into particular patterns. In image segmentation, the pixels of an image are grouped or classified using clustering.
Clustering can also be used to detect anomalies. For example, during cluster analysis, if we observe that a particular cluster or a certain number of clusters contain only a small fraction of the overall number of observations, it may suggest an anomaly within the data. All right, let’s now go ahead and see how the clusters are formed and what the various methods for forming them are. Azure ML supports the K-means clustering method, so let’s try and understand one of the important concepts used in K-means clustering, the Euclidean distance.
Let’s say we have these two points plotted in two-dimensional space: point p1 with coordinates x1 and y1, and a second point p2 with coordinates x2 and y2. The straight-line distance d between these two points is called the Euclidean distance, and the same idea extends to n-dimensional space. Let me explain briefly how this distance is calculated. If these are the coordinates of p1 and p2, we can extend these lines like that, and the length of this line will be the difference between x2 and x1.
Similarly, the length of this portion of the line will be y2 minus y1. I hope you are with me and are getting what we are doing here. As you can see, we have now formed a right-angled triangle over here, and we can calculate the distance d using Pythagoras’ theorem: d is the square root of the sum of the squares of the other two sides, that is, d = sqrt((x2 - x1)^2 + (y2 - y1)^2).
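As a minimal sketch of the distance calculation just described, here it is in Python. The function name euclidean_distance and the sample points are made up for illustration; the function also accepts points with more than two coordinates, since the same formula applies in any number of dimensions.

```python
import math


def euclidean_distance(p1, p2):
    """Distance between two points given as equal-length coordinate sequences."""
    if len(p1) != len(p2):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared differences along every axis, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


# Two-dimensional case: the two shorter sides of the triangle are 3 and 4.
d = euclidean_distance((1.0, 2.0), (4.0, 6.0))
print(d)  # 5.0
```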
The same principle is used for calculating the distance between data points and forming a cluster. The K in K-means is the number of clusters we want. Let’s see that with an example. Let’s say we have these data points spread in this fashion and we want to form two clusters for these data points. So what we do is we group these data points randomly into two clusters in the beginning. In the next step, we calculate the mean, or centroid, of the observations in each cluster.
That gives us these two imaginary points at these locations. Then we calculate the distance of each of these points from both the centroids. That will tell us if any of the data points are closer to the other centroid than to the centroid of their own group or cluster. All right. And as you can see here, these three data points are much closer to the blue centroid than to the original orange centroid, the centroid of the cluster to which they initially belonged.
So we then assign them to the blue cluster and recalculate the mean or centroid location. Okay, the same steps are repeated until the centroids stop moving. I hope you’re with me. And this animation from the Wikipedia page, created by Chire, provides a very good intuition for how the clusters are formed over the iterations. Note how the blue, yellow and red centroids move with every iteration, and how after iteration 14 they almost stop moving. You can either visit this page or rewind the video to see the entire process again.
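The steps above (random grouping, centroid calculation, reassignment to the nearest centroid, repeat until nothing moves) can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation Azure ML uses: the function names, the sample points and the handling of an empty cluster (reseeding it with a random data point) are all assumptions made for this sketch.

```python
import math
import random


def centroid(points):
    """Mean of a non-empty list of points, coordinate by coordinate."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: group the points randomly, making sure no cluster starts empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each current cluster.
        centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Assumption: an empty cluster is reseeded with a random data point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 3: move every point to the cluster of its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # No point changed cluster, so the centroids have stopped moving.
        labels = new_labels
    return labels, centroids


# Made-up data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2, seed=1)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random start, so only the grouping itself is meaningful, not the cluster numbers.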
All right, so what is a good cluster analysis, and when can we say we have done a good job of it? Well, there are two rules of thumb which can guide us here. The first one is that observations in the same group or cluster should share similar characteristics, such as their spending pattern, their interaction with the organization or their behavior.
This can only be understood after further analysis of the clusters. The second point is that all the clusters should have a proportionate number of observations. That does not mean they should have an equal number of data points. But one cluster with 90% of the data points and others with 1% or 2% is definitely a warning sign that something is wrong, unless our aim is to identify such anomalies. A good share for each cluster is anywhere between a minimum of 5% and a maximum of 35% to 40% of the data points.
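The second rule of thumb is easy to check mechanically once every observation has a cluster number. The sketch below is an illustration only: the function names and the two sample assignment lists are made up, and the 5% and 40% bounds are the indicative limits mentioned above, not fixed thresholds.

```python
from collections import Counter


def cluster_shares(assignments):
    """Fraction of all observations that falls into each cluster."""
    counts = Counter(assignments)
    total = len(assignments)
    return {cluster: n / total for cluster, n in counts.items()}


def flag_unbalanced(assignments, low=0.05, high=0.40):
    """Clusters whose share of the data lies outside the indicative range."""
    shares = cluster_shares(assignments)
    return {c: s for c, s in shares.items() if s < low or s > high}


# Made-up cluster assignments for 100 observations each.
balanced = [0] * 30 + [1] * 25 + [2] * 25 + [3] * 20
skewed = [0] * 90 + [1] * 2 + [2] * 8

print(flag_unbalanced(balanced))  # {}
print(flag_unbalanced(skewed))    # {0: 0.9, 1: 0.02}
```

A flagged cluster is a prompt to look closer, either for a modelling problem or for a genuine anomaly in the data.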
It is not a rule cast in stone, but a good indicative one. So now that we have understood what cluster analysis is and how we form the clusters, how do we start that process? How do we do the cluster initialization? Well, there are various methods for forming the initial set of clusters and assigning the data points to them. Let’s see some of the methods that Azure ML supports. The first one is Random, in which the data points are placed at random into the clusters. Next is First N, also known as the Forgy method, in which some initial number of data points are chosen from the data set and used as the initial means.
K-Means++ is the default method and is an improvement in how the initial means are found. In this method, the first cluster center is chosen uniformly at random from the data points, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point’s closest existing cluster center. Okay, you can rewind the video if you feel that you need to understand it a bit better.
Another one is K-Means++ Fast, which is a variant created to improve on the time efficiency of the earlier algorithms, so it is optimized for faster clustering. In the Evenly initialization, the centroids are located equidistant from each other in the d-dimensional space of the data points. Finally, the Use label column initialization uses the values in the label column to guide the selection of centroids. Okay, that concludes the lecture on clustering, or cluster analysis. In the next lecture, let’s perform the cluster analysis. Thank you so much for joining me in this lecture and have a great time.
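The K-Means++ selection rule described above can be sketched as follows. This is a minimal illustration, not Azure ML’s code: the function name kmeans_pp_init and the sample points are made up. The weighted draw uses random.choices from the Python standard library, with each point’s weight set to its squared distance from its closest already-chosen center.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial cluster centers from the data points, K-Means++ style."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random from the data points.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each point: squared distance to its closest existing center.
        # Points already chosen get weight 0, so they are never picked again.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Note: this raises ValueError if every remaining weight is 0,
        # which happens only when k exceeds the number of distinct points.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Made-up data points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers = kmeans_pp_init(points, k=3, seed=1)
print(centers)
```

Because far-away points get large weights, the initial centers tend to be spread out, which is the improvement over picking all of them uniformly at random.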
- Cluster Analysis Experiment 1
Hello and welcome. In the previous lecture we saw what clustering, or cluster analysis, is. We briefly went through the examples of clustering, and we also understood how the clusters are formed using Euclidean distance and centroid formation. We finally went through various cluster initialization methods.
In this lecture, let’s perform a cluster analysis. For this we are going to use the call center data that you can download from the course material. It’s a CSV file with a header. I have already uploaded it to my workspace. I suggest you pause the video, download it from the course material and then upload it to your Azure ML workspace so that we can do it together. All right, so here is the call center data, and let’s visualize it. It has got only two columns and 111 rows. It’s a summary of calls made by the call center employees, the daily average over one month of data. We are going to cluster and group them together and then analyze the trends to identify a strategy to deal with any problems that our cluster analysis reveals. All right, so let me get the K-Means Clustering module, and there it is.
And let’s drag and drop it here. Before we start working on it, let’s go through the parameters it requires. We know what the trainer mode is, and we are going to go with the Single Parameter trainer mode. The number of centroids I’m going to provide is four, and the initialization method will be K-Means++. We have gone through each of those in the last lecture. I’m going to specify the random number seed as 123. The metric is the type of function we will use for measuring the distance between the data points and the centroids. We have seen Euclidean distance. Cosine is another such metric, but it only takes into account the angle between the vectors and not the length of the vectors. The number of iterations here is the number of cycles we want our algorithm to run for. Let it be the default; you can adjust it higher or lower depending on how you want to trade accuracy against training time.
And we are going to ignore the label column. The label column option tells the algorithm how a label column, if we have one, needs to be treated. It has got three options, so let’s go through them one by one. For the Ignore label column option, the values in the label column are ignored and are not used in building the model. In the case of the Fill missing values option, the label column values are used as features to help build the clusters, and if any rows are missing a label, the value is imputed by using other features. All right. And finally, for the third option, Overwrite from closest to center, the label column values are replaced with predicted label values, using the label of the point that is closest to the current centroid. I hope that’s clear. We don’t have any label column, so we are going to ignore it for this particular experiment, and we are ready to train this model using our historic data.
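To make the difference between the two metrics concrete, here is a small sketch comparing them on two made-up vectors that point in the same direction but have very different lengths. The definition of cosine distance as one minus the cosine similarity is a common convention and is an assumption here; the lecture does not state the exact formula Azure ML uses.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Same direction, different length (illustrative values only).
a = (1.0, 2.0)
b = (10.0, 20.0)

print(math.dist(a, b))        # Euclidean distance: large, the points are far apart
print(cosine_distance(a, b))  # Cosine distance: approximately 0, the angle is zero
```

With the cosine metric these two observations look almost identical, while with the Euclidean metric they are far apart, so the choice of metric changes which points end up in the same cluster.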
All right. So far we have used the Train Model module, but in the case of clustering we have a separate Train Clustering Model module, so let’s get it here. The only parameter it requires is the set of columns, so let’s launch the column selector, select these two columns and click OK, and we are ready to run it. Now it has run successfully, and let’s visualize the output. Well, the output here is a two-dimensional plot of derived components, or principal components. It does not provide a really good insight, so for this example let’s visualize the output in a different way. Let me close this, and let’s use the Select Columns module and connect it to the output of Train Clustering Model. Let’s launch the column selector, select all the columns, click OK and run it. I will explain the various columns when we visualize the output.
All right, it has run successfully, and let’s visualize the output now. The first two columns are the original columns from the data set we provided. The Assignments column is the cluster number that has been assigned to every observation, and as you can see, these observations have been grouped into clusters zero, one, two and three. When we look at the histogram of the same, you will observe that there are 39 elements in cluster 2, 35 in cluster 0, 25 in cluster 3 and 12 in cluster 1. The column Distances to Cluster Center measures the distance from the current data point to the centroid of a particular cluster.
Okay. There is a separate such column in the output for each cluster in the trained model. The values for cluster distance are based on the distance metric you selected in the Metric option of the clustering module. All right, and as you can see, this particular observation is closest to the centroid of cluster zero compared to its distance from the other three clusters, and hence it has been assigned to cluster zero. You will observe the same thing when you analyze the observations which have been grouped into the other clusters, such as this one, which is nearest to cluster one, and this one, which is closest to cluster three, as you can see. All right, so we have successfully assigned them to different clusters, but our goal here is to identify the strategy or methods to improve performance by analyzing the clusters of employees and their performance. And the metric for that performance is simply the number of calls made in a day.
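The relationship between the distance columns and the assignment column can be sketched like this: compute the distance from one observation to every centroid, then assign the observation to the cluster with the smallest distance. The centroids and the observation below are made-up values for illustration, not numbers from the call center experiment, and the Euclidean metric is assumed.

```python
import math


def distances_and_assignment(point, centroids):
    """Distance to every centroid, plus the index of the closest one."""
    distances = [math.dist(point, c) for c in centroids]
    assignment = min(range(len(distances)), key=distances.__getitem__)
    return assignment, distances


# Hypothetical centroids as (months of experience, average calls per day).
centroids = [(30.0, 40.0), (30.0, 80.0), (90.0, 40.0), (90.0, 80.0)]
observation = (28.0, 45.0)

assignment, distances = distances_and_assignment(observation, centroids)
print(assignment)  # 0
print(distances)   # one distance per cluster; the smallest is at index 0
```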
All right, so let’s close this. Now, you must understand that Azure ML is built predominantly for machine learning and is not a great tool for visualization of data. You may want to use data visualization tools such as Tableau or Power BI or something similar. You can also do that using Excel for some limited functionality. For now, let me split the data into four data sets using four split modules. So here comes the first one, where I select Assignments less than one, which means zero. Let me copy and paste this for the second one, and we give the relational expression here as less than two, which will select all the ones, as we have already extracted the zeros. I do the same for the third and fourth splits, providing the connections appropriately. All right, and now we are ready to run these modules. Great. It has run successfully, and let’s try to analyze the result.
Let’s visualize the output of the first split. It has got all the observations from cluster zero. And as you can see, the mean or average months of experience here is close to 30 months, or 2.5 years, and the average number of calls attended by this group is close to 40. So let’s go back and check another cluster, and let’s visualize this particular one. As you can see, the average experience is slightly above 90 months, or close to eight years, but the performance in terms of the average number of calls for this group is also close to 40. Similarly, you can visualize the other two groups, but basically what it tells us can be explained using a different plot.
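The per-cluster averages read off the split outputs above can also be computed in one pass by grouping rows on their cluster assignment. The sketch below only illustrates that idea outside Azure ML: the function name cluster_means, the four rows and their assignments are made up, chosen to resemble the two groups just discussed.

```python
from collections import defaultdict
from statistics import mean


def cluster_means(rows, assignments):
    """Column-wise mean of the rows in each cluster."""
    groups = defaultdict(list)
    for row, cluster in zip(rows, assignments):
        groups[cluster].append(row)
    return {
        cluster: tuple(mean(column) for column in zip(*members))
        for cluster, members in sorted(groups.items())
    }


# Hypothetical rows: (months of experience, average calls per day).
rows = [(28.0, 38.0), (32.0, 42.0), (88.0, 41.0), (92.0, 39.0)]
assignments = [0, 0, 1, 1]

summary = cluster_means(rows, assignments)
print(summary)  # {0: (30.0, 40.0), 1: (90.0, 40.0)}
```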
So this cluster has less experience and low performance, but this particular cluster has got a very good performance despite having a similar experience level. The same is true for these two groups. This group here, with very high experience and still handling only a small number of calls, is definitely a reason for worry. However, let me ask you a question. Is there enough information available? Can we draw some conclusions here? Well, this is only considering two dimensions. You may want to analyze further by adding other parameters, such as their age, when they last attended training, who the trainer was, what type of equipment they use, the department they belong to, or the type of calls they make. You may want to include those features in the training set and then form the clusters, as that could possibly give you a different result and a better cluster.
Of course, it will not be easy to visualize the data with so many dimensions, but when you analyze the clusters formed using multiple dimensions, you can definitely arrive at a solution. It could mean providing additional training, changing the equipment, analyzing the training program provided by certain trainers, and so on. I hope that explains how the clusters can be formed and how we can do the cluster analysis. That concludes this lecture on cluster analysis. In the next lecture, let’s understand how to train and score the clustering model for new observations. Thank you so much for joining me in this one, and I will see you in the next lecture.
- Cluster Analysis Experiment 2 – Score and Evaluate
Hello and welcome. In the previous lecture, we saw how we can use clustering for grouping data based on its attributes and how we can derive the strategy or approach to tackle some of the problems. Let’s now understand in detail how we can use clustering for new data. What if we want to use the model we built for new data that is arriving into the system? That is, we want to know which cluster a new observation would belong to and deal with it accordingly. For example, if we have a new customer, grouping him or her into a known cluster can help us understand the customer better, and the organization can then take a specific approach to delight that customer rather than a routine and rudimentary one.
Let’s see that using our call center data. We have this experiment that we ran earlier, and I am going to save it as call center new data. All right, let me clean up the split data modules over here and keep only the required modules. As we know, we need a training set and a test set of data if we want to evaluate performance. So we use a split module here, connect the data set to it, and connect its training data output to the Train Clustering Model. Let’s also change the ratio to 70:30. Keep the randomized split checked, with the random seed as 123.
And because we are not dealing with a classification type of problem where we had a label column, we are going to keep stratified split as false. Okay, all the connections are in place, and let’s now run the Train Clustering Model. Great. It has run successfully, and let’s now score our trained clustering model. However, in the case of clustering, we have a different scoring module. Instead of the Score Model module that we have used so far, we will use Assign Data to Clusters for scoring the trained clustering model.
The module returns a data set that contains the probable assignments for each new data point. All right, so let’s drag and drop it here. The connections we are going to make are similar. That is, the trained model goes here and the test data from the split node connects over here. When this box is checked, the result will include the existing data as well as the resulting assignments. All right, so let’s run it and check our output. It has run successfully, and if we visualize it, it’s going to give us the PCA-based visualization, which is difficult to understand at times.
So let me close this, and let’s use the Select Columns module and check the output. Okay, so let’s make the connections here and launch the column selector. All right, as you can see, it has added five columns, one for assignments and four for cluster distances. Let’s select the first three columns, click OK and run this module. All right, it has run successfully, and let’s now evaluate the model. So we bring in the same old friend, Evaluate Model, which can evaluate almost any type of model. Let’s make the right connections and run it.
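The idea behind the split and the assignment step can be sketched in plain Python: hold back 30% of the rows, then give each held-back row the number of its nearest centroid. This is an illustration under assumptions, not the Azure ML modules themselves: the function names, the rows and the two centroids are made up, the centroids stand in for a model already trained on the 70% portion, and the seed 123 simply mirrors the value used in the experiment.

```python
import math
import random


def split_rows(rows, fraction=0.7, seed=123):
    """Randomized split into a training part and a test part."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def assign_to_clusters(rows, centroids):
    """Cluster number of the nearest centroid for every row."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(row, centroids[i]))
        for row in rows
    ]


# Made-up rows and centroids, for illustration only.
rows = [(float(i), float(i)) for i in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))  # 7 3

centroids = [(0.0, 0.0), (10.0, 10.0)]
new_assignments = assign_to_clusters([(1.0, 1.0), (9.0, 9.0)], centroids)
print(new_assignments)  # [0, 1]
```

The centroids are not recomputed when new rows are assigned, which is what makes this scoring rather than training.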
Well, it has run successfully, and let’s view the output of this module. Wow. Now this is another set of metrics, this time for clustering. The statistics returned for a clustering model describe how many data points were assigned to each cluster, the amount of separation between clusters, and how tightly the data points are bunched within each cluster. Okay. The Combined Evaluation row lists the average scores for the clusters created in this particular model. This score helps us in comparing different models, for example if we had used another K-means clustering model with a different number of clusters and then compared it with this one.
Okay. The Average Distance to Cluster Center column represents how close all the points in a cluster are to that cluster’s centroid. The averages here are very close to the averages from the trained model. You may want to use Select Columns and visualize the output of the Train Clustering Model to see the averages from training. The scores in the column Average Distance to Other Center represent how close, on average, each point in the cluster is to the centroids of all the other clusters.
The Number of Points column shows how many data points were assigned to each cluster, along with the total overall number of data points in any cluster. The Maximal Distance to Cluster Center represents the largest of the distances between each point and the centroid of that point’s cluster. Okay. It basically explains how widely spread our clusters are. For this example, it’s not a cause for worry.
All right, we have now reached the end of this lecture on scoring and evaluating the clustering model. In this lecture, we covered how to score the clustering model using Assign Data to Clusters and also how to interpret the Evaluate Model metrics for clustering. Thank you so much for joining me and have a great day.
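To see what these columns measure, here is a rough sketch that computes comparable statistics by hand for a tiny made-up data set. It is an illustration under assumptions rather than the Evaluate Model implementation: the dictionary keys are informal names, Euclidean distance is assumed, at least two clusters are required, and the maximal distance is taken to be the largest point-to-centroid distance within a cluster, which is an assumption based on the column name.

```python
import math


def evaluate_clusters(rows, assignments, centroids):
    """Per-cluster size, tightness and separation statistics (needs 2+ clusters)."""
    report = {}
    for c, center in enumerate(centroids):
        members = [r for r, a in zip(rows, assignments) if a == c]
        if not members:
            continue  # Skip clusters that received no points.
        own = [math.dist(r, center) for r in members]
        others = [
            math.dist(r, other)
            for r in members
            for j, other in enumerate(centroids)
            if j != c
        ]
        report[c] = {
            "number_of_points": len(members),
            "average_distance_to_cluster_center": sum(own) / len(own),
            "average_distance_to_other_center": sum(others) / len(others),
            # Assumption: the largest distance, as the column name suggests.
            "maximal_distance_to_cluster_center": max(own),
        }
    return report


# Made-up rows, assignments and centroids.
rows = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
assignments = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]

report = evaluate_clusters(rows, assignments, centroids)
for cluster, stats in report.items():
    print(cluster, stats)
```

A small average distance to a cluster’s own center together with a large average distance to the other centers indicates tight, well-separated clusters.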