Google Professional Data Engineer – TensorFlow and Machine Learning part 1
- Introducing Machine Learning
Here’s a question I’d like us all to think about as we go through the contents of this video. How would you recognize a machine learning based system if you saw one? So let’s say you were shown a piece of software or a system. How would you know whether it makes use of machine learning or not? What are the defining characteristics of a machine learning based system? Hello and welcome to this module on understanding the foundations of TensorFlow. In this module, we are going to segue just a little bit away from pure Google Cloud Platform technologies, because machine learning is an incredibly important set of algorithms for the future.
TensorFlow, which runs on the Google Cloud Platform, is heavily tested in the certification test. And to a large extent, it is also a differentiator of the Google Cloud Platform versus competing cloud offerings such as AWS or Azure. Now, that’s not to say that Amazon or Microsoft don’t have their own formidable machine learning offerings. They do. But rather it is because TensorFlow is fast becoming a tool of choice for certain machine learning applications, particularly those involving deep learning on neural networks. So one can plausibly argue that TensorFlow is something which the GCP offers natively and which is highly sought after on every platform.
In any case, with that preamble out of the way, let’s move into an overview of the territory that we are going to cover. TensorFlow is a library for numerical computation, and by far its most common and important use is in machine learning, specifically in building deep learning models based on neural networks. We shall see why TensorFlow is quickly becoming the default library of choice, particularly for deep learning networks. And we’ll do all of this after a quick primer on what machine learning really is and how it differs from alternative approaches. Machine learning is gaining in popularity for many of the same reasons that cloud computing is.
It lies at the convergence of two trends. The first is that data sets are getting bigger and bigger, and the second is that the world is getting smaller and smaller, meaning that interconnections between data items are becoming more pronounced and more important. And that is what machine learning really specializes in. The whole point of machine learning is, given a large amount of data, to infer relationships between those data items. So the typical use cases of machine learning involve working with a large data set, searching for patterns in that data set, and finally making intelligent decisions based on what’s in there.
The whole point, the defining characteristic of machine learning algorithms, is that they change their behavior based on the data that they are operating on. This property, in and of itself, does not make a machine learning algorithm any more complex or more sophisticated than a traditional algorithm which is based on static rules. More on that in a moment. So this is an important definition. Keep this in mind and really internalize this fact. Machine learning algorithms are those which alter their behavior based on the data that they are working on. In a sense, they are learning from the data.
Let’s consider a simple use case for machine learning. Let’s say that you have a large number of emails and you would like to classify these as either spam or not spam, which is known as ham. Now this certainly is a problem for which you could use machine learning. You could also choose to solve this problem without machine learning. In fact, for the longest time, email classification was carried out on the basis of static rules. If an email originated from a particular IP address, it would be treated as spam. If an email contained a certain set of suspicious words, it was treated as spam, and so on. Of course, the problem with this static rule based approach was that spammers were very fast.
They were quick on the uptake, they’d keep changing their behaviors, they would start sending emails from new IP addresses, or they would introduce typos into those sensitive words. And that’s not all. Very often the same IP addresses were those of mail service providers, which also sent out legitimate emails. And for all of these reasons, the static rule based approaches tended to give substandard results. The most efficient way to keep ahead of the spammers in this arms race between spam and legitimate email was to switch to machine learning. Machine learning applications would then search for patterns in spam emails.
As those patterns changed, the way in which those machine learning algorithms carried out the classification would change also. And in this way, machine learning algorithms have become the standard for spam classification. And the outcome of this classification process is, of course, a decision: should that email go to your trash can or to your inbox? This is important. Machine learning algorithms take in a large amount of data, which could be emails on a server. They process the data in some way, for instance, looking for patterns associated with spam emails, and then they typically make a decision of some sort, for instance, whether that email ought to go into your trash can or into your inbox.
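To make the contrast concrete, here is a minimal sketch in plain Python, not the approach any real mail provider uses. The blocklist, the tiny training corpus and the function names are all made up for illustration. The rule based filter never changes, while the learned filter derives its word scores from whatever labeled emails it is given, so it can pick up a misspelled "l0ttery" once examples of it appear in the data.

```python
from collections import Counter

# Hypothetical static rule: a fixed blocklist written once by a person.
BLOCKED_WORDS = {"lottery", "winner"}

def rule_based_is_spam(email):
    """Static rule: spam if any word is on the fixed blocklist."""
    return any(word in BLOCKED_WORDS for word in email.lower().split())

def train_word_scores(emails, labels):
    """Count in how many spam and ham emails each word appears."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label in zip(emails, labels):
        target = spam_counts if label == "spam" else ham_counts
        target.update(set(text.lower().split()))
    return spam_counts, ham_counts

def learned_is_spam(email, spam_counts, ham_counts):
    """Data-driven decision: compare spam evidence with ham evidence."""
    words = set(email.lower().split())
    spam_score = sum(spam_counts[w] for w in words)
    ham_score = sum(ham_counts[w] for w in words)
    return spam_score > ham_score

# Made-up labeled corpus; a real one would be far larger.
emails = [
    "you are a lottery winner",
    "claim your l0ttery prize now",
    "free prize claim now",
    "meeting notes for monday",
    "lunch on monday",
]
labels = ["spam", "spam", "spam", "ham", "ham"]

spam_counts, ham_counts = train_word_scores(emails, labels)
new_email = "claim your l0ttery prize"

print(rule_based_is_spam(new_email))                        # the typo slips past the fixed rule
print(learned_is_spam(new_email, spam_counts, ham_counts))  # the learned scores still flag it
```

Retraining on a fresh corpus changes the scores and therefore the decisions, without anyone rewriting the code. That is the property the spam example is meant to illustrate.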
Machine learning algorithms have gotten more sophisticated and so have the problems to which machine learning can be applied. You could equally treat an image as a large data set, a data set of pixels. The machine learning algorithm might then identify edges, colors and shapes that give it enough information to classify the photograph: is it a picture of a little bird, yes or no? We should mention here that in some of these complex learning tasks, machines are still very far behind human beings. There was this study, a famous study some years ago, which showed that only the absolute cutting edge technology in machine learning was able to identify a video which contained a bear.
On the other hand, this is something that a three to five year old human child was able to do with virtually complete accuracy. And it is precisely in these complex machine learning applications that TensorFlow really shines. In many ways, the starter "hello world" problem in TensorFlow is that of image classification. We’ll get there. But before that, let’s just continue with our survey of machine learning for a little bit. Machine learning has tended to be used for a specific set of problems. The first of these is classification. The second of these is regression. The third of these is clustering, that is, grouping together similar points in a high dimensional hyperspace.
And the last of these is rule extraction: given a large set of data, find if-else relationships. We are going to implement three out of four of these categories using TensorFlow. Let’s keep going for now and understand some of the conceptual underpinnings. Let’s start with classification. The email classification problem, spam or ham, was one example. But let’s now try and understand what makes a machine learning approach different from a rule based one. Let’s come back to the question we posed at the start of this video. The defining characteristic of a machine learning based system is that its internal operations, its internal workings, change in some way based on the data that’s been fed in.
Take, for instance, a linear regression algorithm. Here, there is always going to be a straight line that’s going to be fit to the data points you feed in, but the equation of that line is going to change based on the data points you pass in. By contrast, let’s say you had a system which had been prepared by an expert and which would not change its inner workings based on the data fed in. The equation of the line would be independent of the data. This is the telltale sign. This is the defining characteristic of a machine learning based system in contrast to a rule based approach.
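The linear regression point can be sketched in a few lines of plain Python. This is an illustrative ordinary least squares fit for a single input variable, not TensorFlow code; the function names and the sample data are assumptions made up for the example. The fitted line depends on the data passed in, whereas the hypothetical expert line is the same whatever the data looks like.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def expert_line(xs, ys):
    """Hypothetical rule based system: the line is fixed in advance."""
    return 2.0, 0.0  # ignores the data entirely

# Two made-up data sets over the same x values.
xs = [0, 1, 2, 3]
ys_a = [1, 3, 5, 7]      # points on y = 2x + 1
ys_b = [0, -1, -2, -3]   # points on y = -x

line_a = fit_line(xs, ys_a)
line_b = fit_line(xs, ys_b)

print(line_a, line_b)                            # the learned line differs per data set
print(expert_line(xs, ys_a), expert_line(xs, ys_b))  # the expert line never changes
```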
- Representation Learning
TensorFlow is becoming one of the most popular tools in machine learning these days, in particular because it helps apply techniques such as neural networks to data with features such as images or video. The question that I’d like us to think about during the course of this video is: why are features such as images or video much harder to work with than text or numbers? Let’s revisit this question at the end of this video. Let’s say that the classification problem that we are seeking to solve is to classify animals. Now, for instance, let’s say that we need to classify whales. We want some kind of procedure which tells us whether whales are mammals, which are defined technically as members of a certain class of the animal kingdom, or whether they are fish.
Now, one could make a plausible argument for whales being either fish or mammals. On the one hand, technically they do belong to the infraorder Cetacea, so they are mammals, but on the other hand they also look like fish, swim like fish and move with fish. One approach to classifying animals would be a non machine learning based approach. You should know that non machine learning based approaches are often called rule based or expert based. And we shall see why in a moment. In such an approach, we would start with an animal, say a whale, and then we would pass it into some kind of classification system.
This would be based on a bunch of rules. The obvious question then is where do those rules come from? And the equally obvious answer is they come from the minds, from the combined judgment of human experts. This approach has worked well for many use cases for decades, maybe centuries. The basic idea here is that all of that domain knowledge, in air quotes, is embedded in the brains of these human experts. They just know that a whale does belong to the infraorder Cetacea. Let’s now try and understand how a machine learning based classifier would go about the same problem. The human experts here would have a far reduced role. This process would begin with a set of attributes being passed into the system.
Those attributes could be stuff like: this is an animal which breathes like a mammal, and it gives birth like a mammal. This time our classifier is a machine learning based one. So it does not have a panel of human experts to consult, but it does have a corpus of data. And this is why having a large amount of data to train your machine learning algorithm on is so important. The classifier will refer to all of the examples in that corpus. And if you’ve set it up correctly, it will conclude from comparing your set of attributes to the attributes in the corpus that a whale is a mammal. In a binary classifier like this one, there is still a large onus on you, the person implementing the system.
That’s because if you direct your machine learning based classifier to focus on the wrong features, it’s going to produce the wrong output. In this example, you, as the person architecting the system, still needed to know enough about whales to tell your system to focus on how the animal breathes and how it gives birth. Let’s say that you were not so smart. Let’s say that you gave your system the wrong cues. You asked your system to focus on how this animal moves and what it looks like. It is entirely possible that this algorithm will suggest that this whale is a fish, i.e. it would arrive at the wrong conclusion. The set of attributes which we are going to tell our machine learning algorithm to focus on are called the features.
In the image classification algorithm that we will implement in TensorFlow, the feature or the feature vector is the image itself. It’s, in fact, a matrix of pixels. Selecting the right features is a very important part of setting up a machine learning system correctly. The other really important part is training the machine learning system with a corpus, with a large set of examples. The corpus must be large and representative, and what’s more, it also should consist of the correct features. This is where your role as an expert does come into play, even in a traditional machine learning based system. For instance, you, as the expert, needed to tell your classifier to focus on how the animals breathe and how they give birth, rather than how they move or in what environment they live.
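Here is a small sketch of how the choice of features can flip the answer for the whale. It uses a simple nearest neighbour lookup over a made-up corpus, written in plain Python. The attribute names, the six animals and the classifier itself are assumptions chosen for illustration, not the classifier built later in the course.

```python
# Made-up corpus: each animal has binary attributes and a known label.
CORPUS = [
    ({"breathes_air": 1, "gives_live_birth": 1, "lives_in_water": 0, "fish_shaped_body": 0}, "mammal"),  # dog
    ({"breathes_air": 1, "gives_live_birth": 1, "lives_in_water": 0, "fish_shaped_body": 0}, "mammal"),  # cow
    ({"breathes_air": 1, "gives_live_birth": 1, "lives_in_water": 0, "fish_shaped_body": 0}, "mammal"),  # bat
    ({"breathes_air": 0, "gives_live_birth": 0, "lives_in_water": 1, "fish_shaped_body": 1}, "fish"),    # salmon
    ({"breathes_air": 0, "gives_live_birth": 0, "lives_in_water": 1, "fish_shaped_body": 1}, "fish"),    # tuna
    ({"breathes_air": 0, "gives_live_birth": 0, "lives_in_water": 1, "fish_shaped_body": 1}, "fish"),    # trout
]

def classify(animal, feature_names, corpus=CORPUS):
    """Return the label of the corpus example closest on the chosen features."""
    def distance(example):
        attributes, _ = example
        return sum(attributes[name] != animal[name] for name in feature_names)
    _, label = min(corpus, key=distance)
    return label

whale = {"breathes_air": 1, "gives_live_birth": 1, "lives_in_water": 1, "fish_shaped_body": 1}

good_features = ["breathes_air", "gives_live_birth"]
poor_features = ["lives_in_water", "fish_shaped_body"]

print(classify(whale, good_features))  # matches the mammals in the corpus
print(classify(whale, poor_features))  # matches the fish in the corpus
```

The corpus and the algorithm are identical in both calls. Only the features the classifier was told to look at differ, and that alone decides whether the whale comes out as a mammal or a fish.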
As we shall see in just a moment, an important advantage of deep learning systems like those built in TensorFlow is that they take even this responsibility partially off of your shoulders. Deep learning systems will, on their own, go ahead and pick the correct features, the correct attributes of our data to focus on. Now, before we get there, let’s clearly understand the difference between ML based and rule based classifiers or algorithms in general. ML based algorithms are dynamic because they change their workings based on the data that they operate on or that they are trained on. Rule based algorithms are static because by definition, they are based on a set of rules, much like the constitution of a country.
That set of rules has been drawn up by a panel of experts. Those experts have had years, maybe decades, of training with a large amount of underlying data, and that’s how they come up with those rules. The consequence of this is that experts are somewhat optional in a traditional ML based system. They are most definitely required to build a good rule based system. The experts are optional, but still only somewhat optional in a traditional ML system because they still have a role to play in selecting which attributes, which features the ML classifier ought to focus on. The experts have presumably spent a lot of time examining a large number of data points.
The machine learning algorithm has not had all of those years, maybe decades of experience. So what it does need is a corpus, a large set of data points which it can use to train itself. The size and quality of the corpus are extremely important in determining how good the machine learning algorithm’s performance is. Just as a human expert needs a certain length of time to learn the craft and acquire judgment, a machine learning based algorithm also explicitly requires a training step. This is almost always the case. By the way, there is an asterisk here because not all machine learning algorithms require a training step, but virtually all of them do.
And conceptually it is an important part of the working of an ML algorithm. As we shall see, this training step is extremely important when we are implementing our models in TensorFlow. This is where the model algorithm which we have previously defined, changes itself or tweaks itself on the basis of the corpus of training data that’s passed in. We’ve now clearly understood the difference between rule based and machine learning based systems. These apply to traditional machine learning based systems which are a bit different from deep learning systems, which we are going to talk about in just a moment. Even so, there are a few terms which it’s important for us to get really straight, so let’s cover those.
The set of attributes that we told our machine learning classifier to focus on is called the feature vector. The term vector usually refers to the fact that there are multiple attributes. In a complex system, this feature vector can get very complicated indeed. Now, the output of a classifier is a label. For instance, here the label told us whether the instance passed in was a mammal or a fish. Labels and feature vectors are important terms in classification. In most classification algorithms, the labels are fixed. These are drawn from some categorical variable and the set of possible label values will not change, while the features that make up the feature vector are chosen by you.
In our example, you might choose a different feature vector. You might tell the system to focus on how the animal moves, but the label would still need to be a fish or a mammal. Fish and mammal: that’s the set of labels which our classifier can choose from. The idea of output labels is somewhat specific to classification problems. When we discuss regression, we shall see that the output there is not a label, rather it’s a number. But the idea of a feature vector is incredibly important and it’s quite general. It applies to all machine learning algorithms. The attributes that a machine learning algorithm focuses on are called features.
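As a rough illustration of these terms, the following plain Python sketch builds two different feature vectors from the same record while the set of labels stays fixed. The record, its attribute names and the helper functions are hypothetical and exist only for this example.

```python
# The label set is fixed: values of a categorical variable.
LABELS = ("fish", "mammal")

# One made-up data point with several raw attributes.
animal = {
    "breathes_air": 1,
    "gives_live_birth": 1,
    "lives_in_water": 1,
    "length_m": 25.0,
}

def feature_vector(record, feature_names):
    """Build the feature vector from whichever attributes were chosen."""
    return [float(record[name]) for name in feature_names]

def check_label(label):
    """A classifier's output must be one of the fixed labels."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return label

# The person setting up the system chooses the features...
vector_a = feature_vector(animal, ["breathes_air", "gives_live_birth"])
vector_b = feature_vector(animal, ["lives_in_water", "length_m"])

# ...but whichever vector is used, the output comes from the same label set.
print(vector_a, vector_b)
print(check_label("mammal"))
```

In a regression problem the feature vectors would be built the same way, but the output would be a number rather than one of the entries in `LABELS`.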
Typically, we will pass in a data set in which each data point is represented as a vector of such features, and that vector, known as the feature vector, forms the input into the machine learning algorithm. Next, we’ll talk about what makes deep learning systems like those built in TensorFlow so special. In traditional ML based systems, it is still experts who have to decide what goes into the feature vector. They are the ones who decide what features the algorithm should pay attention to. There is another class of machine learning systems, known as representation learning systems, which figure out by themselves what the important features are.
Let’s talk next about deep learning and understand exactly how this process plays out. Let’s return to the question we posed at the start of this video. Data formats such as images or videos are much harder to work with than text or numbers. And the reason for this is that images or videos are already complex representations or encodings of the underlying data. To extract the correct feature vectors, to focus in on those parts of the images or videos which actually matter, requires much more expertise than working with text or numbers.
Consider, for instance, that virtually anyone has a good way to summarize a set of numbers, maybe using an average or a standard deviation. On the other hand, if I were to ask you to summarize numerically the contents of a photograph or a video, you’d probably struggle. This is why data formats such as images and videos require tools such as neural networks. Neural networks can figure out for themselves what really matters, what the important features are. With text or numbers, it’s much easier for human experts to zero in on the feature vectors that really work.
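A quick sketch in plain Python shows why the usual numeric summaries fall short for images. The numbers and the two tiny 3x3 grayscale "images" are made up for illustration. The average and standard deviation describe the list of numbers reasonably well, but they come out identical for two images that clearly show different things.

```python
import statistics

# Summarizing plain numbers is easy: an average and a spread.
numbers = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(numbers), statistics.pstdev(numbers))

# Two made-up 3x3 grayscale images: a vertical bar and a horizontal bar.
vertical_bar = [
    [0, 255, 0],
    [0, 255, 0],
    [0, 255, 0],
]
horizontal_bar = [
    [0, 0, 0],
    [255, 255, 255],
    [0, 0, 0],
]

def pixel_summary(image):
    """Flatten the pixel matrix and summarize it like any list of numbers."""
    pixels = [value for row in image for value in row]
    return statistics.mean(pixels), statistics.pstdev(pixels)

# The summaries are identical even though the pictures are not.
print(pixel_summary(vertical_bar) == pixel_summary(horizontal_bar))
print(vertical_bar == horizontal_bar)
```

What distinguishes the two images is the arrangement of the pixels, which is exactly the kind of feature a neural network is expected to work out for itself.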