Google Professional Data Engineer – Appendix: Hadoop Ecosystem, Part 1
- Introducing the Hadoop Ecosystem
Hello and welcome to this module on the Hadoop ecosystem. We are going to spend a fair bit of time discussing Hadoop and some of the important components in that ecosystem, and there are two reasons why this is a good use of time. The first reason has to do with the genesis of the Hadoop ecosystem. Recall that Hadoop, MapReduce, and HDFS were actually born out of Google technologies, and so there is a close mapping between the Hadoop ecosystem and the corresponding tools on the Google Cloud Platform. That’s one reason why it makes sense to spend a fair bit of time talking about Hadoop. The other reason has to do with the common use cases of GCP.
A typical use case is that an organization is already making use of distributed computing via the Hadoop ecosystem. It has Hadoop clusters on which it is running MapReduce jobs, Spark data transformations, and all of that good stuff, and it now would like to shift from an on-premises or colocated cluster model to the Google Cloud. Because this is such a common use case, GCP actually has tools and services which are optimized for it, notably Dataproc, but also Dataflow and some of the others. It will only really make sense to understand how these tools work if one is already pretty familiar with Hadoop, Spark, Pig, Hive, and HBase. So let’s go through and understand these technologies one by one.
- Hadoop
Let’s start with a very quick refresher on Hadoop and why Hadoop is at the center of the Hadoop ecosystem. We had already discussed this at the start of the course when we were talking about the pros and cons of cloud computing. We had discussed that there are a couple of important trends that are playing out in the world of technology these days. The first is that data sets are getting bigger and bigger. And the second is that the interconnections between data are getting more and more important. It’s pretty unusual for a serious use case these days to fit into the memory of a single machine.
While that used to be the norm as recently as maybe a decade ago, as soon as data gets too big and too complicated to process on a single ordinary machine, we are faced with a significant jump in complexity. There are two typical ways of dealing with this. One of these is to have one really extraordinary machine, almost a supercomputer. The other is to assemble a federation or a cluster of generic machines, none of which is individually all that special, but which delegate and parallelize and share the load. This choice lends itself really well to sporting metaphors. In the world of computing, these two models are known as monolithic and distributed computing, respectively. For now at least, the distributed computing model seems to be way more popular.
It seems to have decisively won, and that’s because it relies on a lot of cheap hardware, which makes it accessible to everyone and makes it modular and easy to scale. Monolithic computing, as its name would suggest, involves big leaps in cost and complexity each time you add a supercomputer to your setup. Now of course, this cheap hardware is quite likely to fail from time to time, and that’s why replication and fault tolerance become extremely important. As we’ve already discussed, even a low probability of failure on a single machine leads to inevitable failures every month or every year once one has thousands or millions of machines.
So replication and fault tolerance need to be built into the system from the ground up. And clearly the defining characteristic of a distributed system is, well, that it’s distributed. We’ve got to have a bunch of CPUs which communicate with each other, which know how to split up jobs and assemble the results of those different jobs. This distributed computation really is at the heart of the model, and Hadoop has been designed specifically keeping each of these requirements in mind. Data is stored using a lot of cheap hardware on a distributed file system; that’s HDFS. That generic hardware then needs to have its resources carefully allocated.
Resource allocation, scheduling, and all of that stuff happen via a resource allocator called YARN, while replication and fault tolerance are built into HDFS itself. And then really the fundamental and revolutionary breakthrough that Hadoop provided came via MapReduce. MapReduce is an abstraction which allows almost any programmer to express computing operations in terms of map and reduce jobs. This is a clean and simple abstraction which makes distributed computing tractable. MapReduce was devised at Google, and really, this is the single breakthrough which made distributed computing mainstream. And when we put all of these elements together, we get a cluster of Hadoop machines.
These clusters consist of individual machines, thousands or maybe even millions of them, which are called nodes. Together, these clusters run in giant facilities called server farms. To continue with our sporting analogy, a server farm needs some central coordinating bits of software, some really smart functionality, to make sure that all of these nodes, all of these servers, are coordinated, replicated, and kept in sync. For instance, there is a need to partition data across these thousands or millions of machines, to coordinate the map and reduce computations across them, to handle fault tolerance and recovery when some nodes go down, and to allocate capacity to different processes.
All of these steps must be carried out by that coordinating set of software technologies. This is a problem that Google had to grapple with really early on because of the size, scale, and complexity of web search. And indeed, the genesis of Hadoop lies in this same problem. First, Google needed to store millions of records, and those needed to be stored across many different machines. Next, they needed fast, efficient ways of indexing all of that data and running processes to crunch the data so that it could be easily accessed when users submitted search queries. In response to this was born the Google distributed suite, which consisted at the time of the Google File System and MapReduce.
The Google File System, which was a precursor to HDFS, solved distributed storage. MapReduce was the revolutionary paradigm to solve distributed computing. Apache then went ahead and developed open-source versions of these technologies, and these became HDFS and MapReduce. HDFS and MapReduce together comprise Hadoop. As the size of Hadoop clusters grew and the needs around resource allocation became more and more important, it became clear that additional, separate functionality was going to be required to handle that resource allocation. And so, in 2013, when Apache released Hadoop 2.0, they added a third component by splitting MapReduce into two.
That third component was YARN. YARN is an acronym, by the way, for Yet Another Resource Negotiator. These are the three components of Hadoop these days: HDFS, MapReduce, and YARN. The user interaction model works as follows: a user defines map and reduce tasks using the MapReduce abstraction and submits them to Hadoop for computation. This is now a MapReduce job, and it’s submitted using the rather complex MapReduce API. The next step is for that job to be distributed onto the Hadoop cluster. The job is triggered and YARN takes over. YARN will then figure out where and how to run that job, and the result is stored in HDFS. This really is the model of how Hadoop works.
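To make that abstraction concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class name and the input and output paths are purely illustrative; a real job would add error handling and cluster-specific configuration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this mapper's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations (e.g. HDFS directories) come from the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Once packaged into a JAR and submitted to the cluster (for example with `hadoop jar wordcount.jar WordCount /input /output`), YARN schedules the map and reduce tasks across the nodes and the final counts land in the output directory on HDFS.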
An entire ecosystem of tools has sprung up around this core bit of software. We already referred to these tools really quickly at the start of this course, but now we are going to study each one of them in just a little bit of detail. This is a short, high-level double click on each of the tools in the Hadoop ecosystem which are important from the point of view of someone who is going to learn or use the Google Cloud Platform. Hive, along with Spark, is maybe the single most widely used of these tools. It provides a SQL interface to Hadoop, and it maps directly to Google BigQuery. BigQuery on GCP is semantically, logically, almost an exact equivalent of Hive.
Now, it turns out that in a lot of the implementation details, BigQuery is quite different, so we are going to have to study both Hive and BigQuery in this course. Another one of these tools, HBase, we’ve already studied in some detail. HBase is somewhere between a big data technology and a storage system or storage technology. As we’ve seen, HBase maps almost identically to Google’s Bigtable, and for that reason we will not discuss HBase or Bigtable again; we are done with that particular topic. Hive, along with HBase and Pig, is available on Google’s managed Hadoop service, Dataproc.
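To give a flavor of what a SQL interface to Hadoop looks like in practice, here is a minimal sketch of running a HiveQL query through Hive's standard JDBC interface. The server host, port, and the page_views table are placeholders, and the Hive JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 speaks JDBC; 10000 is its conventional port.
    // The table "page_views" is hypothetical.
    String url = "jdbc:hive2://hive-server:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT country, COUNT(*) AS views "
                 + "FROM page_views GROUP BY country")) {
      while (rs.next()) {
        System.out.println(rs.getString("country") + "\t" + rs.getLong("views"));
      }
    }
  }
}
```

An almost identical query could be run in BigQuery, which is why the two map so closely at the query level even though the execution engines underneath are very different.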
Pig is worth talking about in some detail. It is a data manipulation language which helps to transform unstructured data, get it into HDFS, and get it into a format where Hive or one of the other technologies can take over. Pig is preinstalled on Dataproc machines, but Pig, just like Spark, is largely subsumed by a different GCP technology called Dataflow. Spark is a distributed computing engine that’s used along with Hadoop, and it offers an interactive shell with a read-evaluate-print loop (REPL). Pig and Spark are both common for use cases that involve data transformations, and these data transformations on the Google Cloud Platform are typically achieved using Dataflow.
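For a sense of the kind of transformation Pig or Spark is typically used for, here is a hedged sketch using Spark's Java API; the input path, the severity column, and the output location are all made up for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class LogCleanup {
  public static void main(String[] args) {
    // A local Spark session; on a real cluster the master would come from the environment.
    SparkSession spark = SparkSession.builder()
        .appName("log-cleanup")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical raw CSV logs: read, drop malformed rows, keep only errors, write out.
    Dataset<Row> raw = spark.read()
        .option("header", "true")
        .csv("hdfs:///logs/raw/*.csv");   // illustrative path

    Dataset<Row> errors = raw
        .na().drop()                      // discard rows with missing fields
        .filter(col("severity").equalTo("ERROR"));

    errors.write().mode("overwrite").parquet("hdfs:///logs/errors/");
    spark.stop();
  }
}
```

On the Google Cloud Platform, the same shape of pipeline would more typically be written against Dataflow (Apache Beam), but the transformation itself (read, clean, filter, write) looks very similar.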
We are going to discuss Pig, Spark, and Hive in some detail, and we’ve already discussed HBase. There are also a couple of other tools which we will not discuss in a great deal of detail. These are Oozie, which is a workflow scheduling tool on Hadoop, and Kafka, which is used for streaming applications where the data sets are unbounded. We will actually discuss Apache Flink rather than Kafka, and we will also make use of Google Cloud Pub/Sub, which uses a publisher-subscriber model to provide a nice enterprise messaging service. So at a glance, that’s what the Hadoop ecosystem consists of. Let’s go ahead and double click on some of the most important of these components, starting first with Hadoop itself.
- HDFS
Let’s now turn our attention to understanding HDFS, or the Hadoop Distributed File System. Whether or not you are familiar with Hadoop, there is a question which I would like you to try and answer. You don’t need to get the answer right now; do keep this question at the back of your mind, and we’ll discuss the answer by the end of this video. The question is this: when we use Hadoop on GCP, we don’t actually make use of HDFS. We use Google Cloud Storage instead. Can you guess why this is the case? If you don’t know the answer already, do keep it at the back of your mind as we go through this material on HDFS.
HDFS is an integral part of Hadoop, as its name would suggest. It is built to run on commodity hardware; it doesn’t require anything special, and as a result of this it is highly fault tolerant. And really it has to be, because if you’re using millions of machines of commodity hardware, hardware failures are going to be the norm. Do keep in mind that Hadoop and HDFS are both optimized for batch processing; they are not set up for real-time or interactive use. So data access on HDFS has very high throughput, but it’s not optimized for low latency. This is why Hive is not suitable for very low latency applications. The flip side of this, of course, is that HDFS supports extremely large data sets.
Much larger than you could handle in, say, a relational database management system. The main challenge that HDFS is set up to solve is how file storage is to be managed across multiple disks, where each disk resides on a different machine in the cluster. This is similar to the cloud storage that we had discussed, except that here the cluster of machines are all Hadoop machines. Each of these machines is known as a node in the Hadoop cluster, and each of them is going to be quite generic, with one exception. We are going to need one special node which acts as a master and which is, in a sense, a coordinator of all of the activities on the other nodes.
That special master node is called the name node, and it contains all of the metadata required for the other worker nodes, or data nodes as they are called, to be able to do their thing. In that sense, the name node is a metadata repository. If all of the data in HDFS is thought of as a book, then the name node can be thought of as the table of contents. The name node knows where to find stuff, but it’s the data nodes which actually store the raw data, that is, the actual text on each page of the book or the actual blocks of a file in a distributed file system. Let’s now understand the role of the name node and the data nodes a little better.
The name node holds a bunch of metadata because it’s got to manage the overall file system: the directory structure and the metadata of each of the files, which, after all, are going to have file headers and so on. Data nodes, on the other hand, are going to divvy up and physically store the data on different machines in the cluster. Now, given a large file, and files in a distributed file system can be enormous, say just a large text file, how is this going to be stored in HDFS? The answer is in the form of blocks. HDFS is going to divvy up that file into blocks; let’s call those block one through block eight, and each of those blocks is going to have the same size.
This is an important characteristic because it allows files of different lengths to be treated in basically the same way, and storage is to some extent simplified. The block is the unit for replication and fault tolerance. HDFS uses a default block size of 128 megabytes. Any choice of block size contains a trade-off: the larger the block size, the less the parallelism. Imagine, for instance, that every file consists of just one single block; clearly no parallelism is possible at all. But on the other hand, larger block sizes also mean less metadata, because there is going to be metadata for each block. So by going with a smaller block size, you increase the amount of overhead in terms of metadata space and in looking up and working with that metadata.
In any case, this block size of 128 MB is what Hadoop and HDFS have gone with. This size helps keep the seek time small relative to the transfer time for each block. Okay, let’s go back to our file example. Given that we have a bunch of blocks, each of these blocks will now be stored on different data nodes. So let’s say that we have data nodes one through four; we will store blocks one and two on data node one, and so on. Each node in this way contains a partition, a subset of the data, as in the illustration on screen. Now, those partitions are disjoint, but as we shall see in just a moment, for fault tolerance we actually require multiple data nodes to store copies of the same block.
In any case, this gives rise to an important question: how do we know, how does Hadoop know, where the different splits of a particular file are located? The answer is that this information resides on the name node, and this is why the name node is so critically important. Without the name node, we have no idea where the different block locations are. So the first step in reading a file in HDFS is always going to be to use the metadata in the name node to find the actual physical location of the data, and the second step will involve actually reading the blocks from those respective locations, which are going to be on different nodes.
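Those two steps are visible even from a client program. Here is a minimal sketch using the HDFS Java client, assuming an illustrative name node address and file path: it first asks for the block locations of a file (served from the name node's metadata) and then opens the file, at which point the client library streams each block from a data node holding a replica.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    // The name node address and file path are illustrative.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    Path file = new Path("/data/large-text-file.txt");
    FileStatus status = fs.getFileStatus(file);

    // Step 1: ask where each block lives (answered from the name node's metadata).
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }

    // Step 2: open and read the file; the client library streams each block
    // from whichever data node holds a replica of it.
    try (FSDataInputStream in = fs.open(file)) {
      byte[] buffer = new byte[4096];
      int read = in.read(buffer);
      System.out.println("first bytes read: " + read);
    }
  }
}
```

Notice that the application never addresses a data node explicitly; the name node's metadata provides that extra level of indirection.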
As we shall see, the name node is so important that in high-availability configurations for Hadoop on the Google Cloud Platform, there actually is a provision to have multiple name nodes, kept in sync with each other using a piece of software known as ZooKeeper, so that failover happens if any one name node goes down. But in any case, the important bit to remember is that the name node acts as an extra level of indirection. When you want data, you ask the name node; the name node looks up its metadata, figures out on which data nodes that data is located, and the blocks are then read from those data nodes and returned to you or to the requesting application.
This gives rise to a couple of obvious challenges. A distributed storage system is going to need to deal with failure management of the data nodes and, even more critically, failure management of the name node. Failure management of the data nodes is a relatively familiar problem: here we need to deal with eventualities like one of the blocks of data getting corrupted, or the data node hosting the blocks simply crashing. It turns out that these can be solved relatively easily by using replication. We can define a replication factor, which means that a given block of data will then be stored on more than one data node. So here, for instance, block one and block two will be stored on data node one as well as on data node two.
When we replicate blocks, we are going to have to be careful to store those replicas in different locations, and we also need the name node to keep track of all of those different replica locations; otherwise there is not going to be any way to find the backup or failover replicas. While choosing replica locations, there are a couple of considerations. We want to maximize redundancy, but we also want to minimize the network bandwidth used, because remember that we are doing all of this on a distributed file system. If maximizing redundancy were our only consideration, we probably would keep different replicas of the same data on racks which are extremely far away from each other.
But this would make it really difficult, really cumbersome, to access all of this information across different racks, across servers which are very far away from each other. And so the default replication strategy in Hadoop tries to balance both these needs. What it does is to choose the first location essentially at random (or on the node where the writer is running, if that node is part of the cluster), so the first copy of some block of data could reside on any rack. The second replica is then placed on a different rack, if possible, also chosen somewhat at random. The third replica goes on the same rack as the second, albeit on a different node. As a result of this, inter-rack traffic will be relatively low and write performance will remain tolerable, even while achieving a good deal of redundancy.
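Both the replication factor and the block size are tunable per cluster, per job, or even per file. Here is a minimal sketch, assuming client-side overrides are acceptable, of supplying these settings through the Hadoop Configuration API; the values shown are just the common defaults, and the file path is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Cluster-wide defaults usually live in hdfs-site.xml; a client can also
    // override them. 3 replicas and 128 MB blocks are the common defaults
    // discussed above.
    conf.set("dfs.replication", "3");
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

    // Connects to whatever file system core-site.xml points at.
    FileSystem fs = FileSystem.get(conf);

    // The replication factor of an existing file can also be changed directly;
    // the path here is illustrative.
    fs.setReplication(new Path("/data/large-text-file.txt"), (short) 2);
  }
}
```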
Read operations will be sent to the replica closest to the client making the read request. Write requests are always going to be relatively expensive: the first data node which receives the write will be closest to the writing application, but those changes, those writes, are then forwarded on to all of the replica locations until all of them are in sync. So clearly, forwarding these writes is going to consume a fair amount of bandwidth, and the cost of write operations in a distributed file system is always going to be pretty high. Let’s now come back to the question which we had posed at the start of this video: when we run Hadoop on the Google Cloud Platform, we don’t use HDFS.
Rather, we use Google Cloud Storage. The reason for this is that HDFS, by definition, has servers: it has a name node, metadata, and a whole bunch of lookups, and the name node and all of the server configuration require compute virtual machine instances to always be running. On a platform like GCP, you are billed by how long you have VM instances running, so if you were to use HDFS as your long-term storage, you would end up paying a lot of money; and it would also be impractical to fire up your cluster each time you needed to access data, which would just completely kill the performance of your managed Hadoop cluster. That’s why, on the Google Cloud Platform, the underlying storage beneath managed Hadoop is not HDFS, but rather Google Cloud Storage.
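As a small illustration of how transparent this is in practice, here is a hedged sketch that reads an object from Cloud Storage through the standard Hadoop FileSystem API. It assumes the Cloud Storage connector is available (it is preinstalled on Dataproc), and the bucket and object names are made up.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsRead {
  public static void main(String[] args) throws Exception {
    // On Dataproc the Cloud Storage connector is preinstalled, so the gs:// scheme
    // resolves just like hdfs:// would; the bucket and object names are made up.
    Configuration conf = new Configuration();
    FileSystem gcs = FileSystem.get(URI.create("gs://my-example-bucket/"), conf);
    try (FSDataInputStream in =
        gcs.open(new Path("gs://my-example-bucket/input/part-00000"))) {
      byte[] buffer = new byte[4096];
      int read = in.read(buffer);
      System.out.println(new String(buffer, 0, Math.max(read, 0)));
    }
  }
}
```

The same holds for jobs: the WordCount example from earlier could take gs:// paths for its input and output without any change to the code.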