Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 5
- Spark
We’ve now finished discussing Hadoop, Hive, and Pig in the Google Cloud Platform world. Hadoop, Hive, and Pig are all used via Dataproc, which is a managed Hadoop service. It turns out that Pig, Hive, and Spark are services which are available by default on every instance of a Dataproc cluster. So let’s close the loop by now discussing Spark, which is an incredibly popular technology these days. As usual, I have a question that I’d like you to think about as we discuss Spark, and that question is: Hive is to SQL as Spark is to what? This is a simple analogy question. If you’ve taken the GRE, the Graduate Record Examination, at least when I took it a long time ago, there used to be a lot of questions like these. Hive has a certain relationship to SQL.
Hive is an underlying technology which is usually accessed by making use of the HiveQL language, which is superficially similar to SQL. Of course, under the hood, there is very little in common between HiveQL and SQL. In exactly the same manner, in a Hadoop MapReduce world, Spark is a technology which is accessed in a particular way, and that particular way of accessing Spark approximates some other, more familiar technologies. What are those? Having posed this question, let’s now plunge into an introduction to Spark. Spark is really an engine for data processing and analysis. Spark is a general-purpose engine; it goes well beyond just the Hadoop ecosystem. We shall talk more about that in a bit. You can use it for exploring data.
You can use it for cleaning and prepping data, just as you would use Pig. You can use it for applying machine learning algorithms, for building streaming applications and data applications: MLlib for machine learning applications, Spark Streaming for streaming applications. Spark has a whole bunch of libraries. These are abstractions over cool underlying technologies, and the combined effect of all of these is to make Spark pretty versatile. It really is a general-purpose technology. It’s also incredibly interactive. That’s because Spark can be accessed using a REPL environment, that is, a read-evaluate-print loop environment. It’s trivially easy to write commands at a prompt and get the results immediately.
If you’ve used languages such as Python, JavaScript, or Scala, you’ll know that REPL environments are really popular because they give fast feedback. And to add to that, Spark provides a REPL environment in a distributed programming world. In other words, it’s possible to work with data which is in HDFS. It’s possible to run these simple REPL commands on large files with petabytes of data, and all of the underlying distributed nature of the computing is abstracted away for us. So that innocuous-looking command which you type at your REPL prompt is actually processing data over an entire cluster of machines. Spark can integrate with Hadoop, although it’s not necessarily the case that Spark only runs with Hadoop, and it’s also possible for a Spark program to read data from HDFS.
And so the real popularity of Spark can be attributed to the manner in which it frees programmers from the tyranny, in a sense, of the MapReduce programming framework in Java. If you remember how MapReduce jobs have to be coded up, they involve a lot of boilerplate. We need a mapper class that has to be enclosed inside a generic class with a bunch of type parameters. We also need a reducer class, and the type parameters of the reducer and the mapper need to agree with each other. All of this needs to be wired up along with a driver class. That driver needs to make sure that it sets everything up right; it needs to configure, for instance, a Job object and do all of the coordination.
This is complex stuff. Even if you are an experienced Java programmer, it is not easy to write MapReduce code. And even if you do manage to get all of this code working just fine, this is still fundamentally batch processing. So you’re going to have to wait a long time for the results, and all of those results are going to be HDFS files. The coolest bit of functionality about Spark is that it puts an interactive REPL environment on top of all of this MapReduce crud. In this sense, Spark is analogous to Hive. Just as Hive is a complex distributed technology under the hood which makes use of MapReduce but can be accessed using SQL, which is accessible to most business analysts and data professionals, in exactly the same manner,
Spark is also a complex technology which performs distributed, MapReduce-style processing under the hood, but it can be accessed using interactive REPL environments from Scala or Python. It can also be accessed from Java, of course, although unlike Scala or Python, Java doesn’t really have its own interactive REPL. The primary means through which this magic is accomplished by Spark is an abstraction known as an RDD. RDDs, or Resilient Distributed Datasets, are almost as magical as their name would suggest. They are variables, really, but the magical bit about them is that they are both resilient, i.e. resilient to crashes, and in memory. So they are in-memory collections of objects which somehow manage to be distributed and resilient to crashes.
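To make that concrete, here is a minimal PySpark sketch of handling an RDD the same way you would handle an ordinary small collection. The application name is a made-up example, and the data here is tiny, but the API would look identical against a huge dataset on a cluster.

```python
from pyspark import SparkContext

# Run locally for illustration; on a real cluster only the master would change.
sc = SparkContext("local[*]", "rdd-intro")

# This could just as well be an RDD over a multi-petabyte file in HDFS.
numbers = sc.parallelize(range(1_000_000))

evens = numbers.filter(lambda n: n % 2 == 0)   # transformation
squares = evens.map(lambda n: n * n)           # transformation
print(squares.take(5))                         # action: [0, 4, 16, 36, 64]

sc.stop()
```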
With RDDs, you can interact and play with extremely large datasets exactly as you would with a little container holding maybe ten or twenty rows of data. RDDs abstract away all of the complexity, all of the underlying MapReduce-style machinery, from the Spark programmer. Let’s now understand precisely the relationship between Spark and Hadoop. At the heart of Spark is something known as Spark Core. This includes basic functionality such as RDDs, which get most of the cool stuff done. This is a computing engine, and this computing engine needs to somehow work with the underlying data. That’s how the RDD abstractions are going to work. I should add that in addition to Spark Core, there are various very popular Spark libraries such as MLlib and Spark Streaming, which I alluded to a moment ago.
Spark Core is written in Scala, by the way, and there is a reason why the choice of language is so important. Scala runs on Java virtual machines; it uses the Java runtime environment, but it has the concept of closures, which allow RDDs to really do their magic. In any case, the underlying technology behind RDDs is out of scope here. For now, just regard Spark Core as a computing engine. This computing engine needs a couple of additional components in order to actually work with the data. One of these is a storage system; this is an abstraction between Spark Core and the actual underlying data. The other is a cluster manager, which Spark can use to run tasks and distribute computations across a parallel or distributed system.
Now, when you look at it this way, it starts to make a lot of sense. Spark Core is a computing engine which runs on top of a storage system and a cluster manager, and both of these are plug-and-play components. We already know of a great storage system, that’s HDFS, and we also know of a pretty nifty cluster manager, and that is YARN. So why don’t we just plug in HDFS and YARN? Magic results: we basically have Spark running on top of, interacting with, and interoperating with Hadoop. And if you are a user of Spark on top of Hadoop, the beauty of this is that you can now use Spark for all of your computation. You continue to use HDFS and YARN for storage and for cluster management, but now you have the benefits of interactivity.
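To make the plug-and-play idea concrete, here is a rough sketch of a PySpark program that uses Hadoop for both pieces: YARN as the cluster manager and HDFS as the storage system. The application name and the HDFS path are assumptions for illustration, and it presumes a reasonably recent Spark where "yarn" is accepted as the master.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-on-hadoop")   # hypothetical application name
        .setMaster("yarn"))              # let YARN schedule the executors
sc = SparkContext(conf=conf)

# Each partition of this RDD corresponds roughly to an HDFS block of the file.
logs = sc.textFile("hdfs:///data/access_log.txt")        # made-up path
errors = logs.filter(lambda line: "ERROR" in line)        # transformation, lazy
print(errors.count())                                     # action, runs on the cluster

sc.stop()
```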
You can make use of Spark Core rather than MapReduce in order to interface with the data in your HDFS system. Coming back to the question that we posed at the start of this video: Spark is to both Scala and Python what Hive is to SQL. Spark is a technology which provides an interactive and familiar way to access data, and that data lies in HDFS. In the case of Hive, it allows business analysts who are used to writing SQL queries to translate that knowledge and write HiveQL queries instead. In the Spark world, it allows programmers who have used REPL environments in Scala or Python to access data in HDFS without being forced to write all of the boilerplate crud, and without all of the non-interactive batch processing that MapReduce entails.
- More Spark
We’ve mentioned that RDDs are the core programming abstraction in Spark, and we’ve introduced them as resilient, in-memory variables. The question that we are going to seek to answer here is: how can RDDs be in memory if they encapsulate data on the scale of HDFS? After all, files in HDFS can be petabytes in size. There is no possible way that all of this data could fit into memory. So how do RDDs pull it off? How do they manage to be both resilient and in memory? Do ponder over this question while we discuss Spark and how RDDs work their magic. We’ve spent a fair bit of time talking about how Spark is a programmer’s delight, how it offers an interactive way to deal with Hadoop and HDFS data. Let’s spend just a little more time making that real.
Let’s talk about how we’d actually install Spark and run a simple command in it. As always, our emphasis here is not on getting the nitty-gritty right; that will happen in the labs. This is just to give you a sense of how easy it is to install Spark and get going with parallel programming from an interactive environment. There are some really simple prerequisites: in order to install Spark, you need Java 7 or above, and you can then use Scala, or you can use IPython, that is, the Anaconda distribution, or Jupyter as it’s now called, which allows one to work with Python notebooks. We’ll have more to say about these when we talk about Datalab later on. The steps in installing Spark are pretty simple. You download the Spark binaries, update a few environment variables, and then you configure IPython’s notebook to work with Spark.
Once again, we are not going to dwell on the nitty-gritty of these environment variables, but as you can see, they are pretty simple. You just need to tell your machine where Spark has been extracted, and you also need to add that location to the path. Configuring IPython also requires just a couple of additional environment variables to be set. And once all of that is in place, really, we are good to go. You can now open up the interactive environment, the REPL for Spark, just by using one simple command: pyspark. That is the name of the command which encapsulates all of the magic. So you type pyspark at a terminal window and a REPL opens up. In this REPL you have something known as the Spark context.
So this looks and feels exactly like a Python shell, with the important difference that the Spark context is now available for your use. Python functions, data structures, containers, all of that will work exactly as before. We can import and use any other Python source code, just as we usually would. You should know that there are a couple of different modes: a distributed mode and a non-distributed mode. By default, pyspark will launch in local, non-distributed mode. But really, the big difference comes down to the Spark context. When the pyspark shell is launched, it initializes that SparkContext object, and this represents a connection to the Spark cluster.
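As a sketch of what a session at that prompt might look like: note that `sc` already exists, you never construct it yourself in the shell, and the file name below is a hypothetical example.

```python
# Typed at the pyspark prompt; the shell has already created a SparkContext
# and bound it to the name `sc`.
words = ["spark", "hive", "pig"]
lengths = [len(w) for w in words]        # plain Python works exactly as usual

rdd = sc.textFile("sample.txt")          # but `sc` also gives you RDDs
line_lengths = rdd.map(lambda line: len(line))
print(line_lengths.take(3))              # an action: fetch the first three results
```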
And that, technically, is what the Hadoop cluster now is, because we are going to interface with HDFS through Spark rather than through MapReduce. One can now load data into a variable, except that the variable is going to be an RDD. So it seems like we are loading data into memory from a source; that source could be a CSV file or anything else. But the key difference is that the data has been loaded into an RDD, a resilient distributed dataset, and this is where all of the magic is. Resilient distributed datasets are the core programming abstraction in Spark, and they possess a bunch of incredibly powerful properties. The properties that really matter to us in this context are the fact that the underlying data is organized into partitions, the fact that RDDs are read-only, so you cannot change an RDD.
It’s immutable. And lastly, and maybe most magically, the fact that each RDD knows exactly where it came from; it is aware of its lineage. Let’s start with the first of these properties. It should come as no surprise to us that the underlying data beneath an RDD is partitioned. RDDs are in-memory abstractions, but under the hood, of course, this data sits on a distributed file system, like a Hadoop cluster’s file system. This fact is completely abstracted away from us by Spark Core. Remember that Spark Core interacts with the storage manager as well as the cluster manager, and they collectively take care of all of this. But it’s important to remember that when you load a CSV file, for instance, into an RDD, it’s entirely possible that the data from that CSV file resides on many machines, maybe thousands of machines, which are very far away from each other.
Each of them is on a different node, just like it would be in a Hadoop cluster, and all of this abstraction is provided to you by Spark. Any operations which you specify on this RDD are going to need to be performed in parallel across this distributed file system, across all of these nodes. Once again, that’s going to be handled for you by Spark Core. So, just as MapReduce is the programming paradigm in the MapReduce world, Spark Core is the programming manager which makes sure that all operations are parallelized, and that parallelism is abstracted from you. In order to work with distributed data which has been partitioned in this way, RDDs need to be read-only. This is an important point. RDDs are immutable.
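Here is a small way to see both points at once, again inside the pyspark shell and with a made-up HDFS path: an RDD will tell you how it is partitioned, and a "modification" actually yields a brand-new RDD rather than changing the old one.

```python
raw = sc.textFile("hdfs:///data/big_file.csv")   # hypothetical path
print(raw.getNumPartitions())    # on HDFS this typically tracks the file's block count

upper = raw.map(lambda line: line.upper())
print(upper is raw)              # False: `raw` is untouched; `upper` is a new RDD
```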
Immutable data structures are important because they preclude one from writing code with side effects. There are only two types of operations possible on RDDs: transformations and actions. Transformations operate on an RDD and give you another, different RDD. Actions, in a sense, give you a result based on the values in that RDD. We will discuss the idea of lineage in a moment, but the basic principle of lineage is that every RDD remembers all of the transformations which were carried out on the original data before that RDD came into existence. That’s what it means for an RDD to know its lineage. Let’s say that we load a dataset from a CSV file, a gigantic CSV file with petabytes of data, into an RDD. Clearly, something’s got to give.
How does one load petabytes of data into memory? We’ve discussed that RDDs are in-memory abstractions, and of course this has to do with the partitioning: Spark will smartly figure out what actually needs to be loaded into memory. In any case, now that we have this RDD, a user would define a chain of transformations on it. This is going to result in the creation of new RDDs. So step one is to load the data, step two is to maybe pick only the third column from this data, and step three is to sort all of the values in this column. Spark is going to wait until a result is actually requested before it does any of this work, and that’s because Spark uses lazy evaluation of results, which again has to do with lineage.
In any case, that was an example of a transformation chain. Let’s now talk about actions. The difference between a transformation and an action hopefully is clear: a transformation gives you back yet another RDD; an action does not. Rather, an action is a cue for Spark that a result is being demanded. So, for instance, that action might ask for only the first ten rows, or it might be a count or maybe a sum. All of these are, in essence, actions which have to be performed over an RDD, and actions are the trigger for Spark to actually go ahead and start processing data in an RDD. At this point, the entire chain of transformations that had been defined is going to be executed. Now, this leads into the idea of lineage.
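Here is a rough sketch of the chain just described, assuming the pyspark shell, a hypothetical HDFS path, and a numeric third column in the CSV. The first three lines are transformations, each returning a new RDD; the last three lines are actions, which demand results and trigger the work across the cluster.

```python
raw = sc.textFile("hdfs:///data/huge.csv")                    # transformation: load
third_col = raw.map(lambda line: float(line.split(",")[2]))   # transformation: pick column 3
ordered = third_col.sortBy(lambda x: x)                       # transformation: sort

print(ordered.take(10))     # action: first ten values
print(third_col.count())    # action: how many rows
print(third_col.sum())      # action: total
```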
Lazy evaluation means only calculating results, only calculating the outputs of transformations, when an action has been performed. That might happen quite a bit after those transformations were specified in the first place, and the action might be performed on an entirely different RDD than the one which you began with. So for lazy evaluation to work, every RDD is going to need to know the exact chain of transformations which caused its coming into being, and this is nothing but the lineage. When an RDD is created, it holds metadata. That metadata tells it what transformations occurred, starting from the original data source, before it came into being. And it’s important to note that when an RDD has first been created, it holds only metadata.
It actually doesn’t hold any underlying results at all. Now, an efficient way of encapsulating all of this metadata requires just two pieces of information: a transformation and its parent RDD. Given these two pieces of information, it is possible to recursively follow the same metadata in the parent RDD and thereby figure out exactly what transformation sequence led to this RDD being created. So given an RDD, let’s call it RDD One, which is transformed using Transformation One into another RDD, that’s RDD Two: all that RDD Two needs to know is the transformation, that’s Transformation One, and its parent, that’s RDD One. By recursively walking back through this chain, it can trace its lineage all the way back to the source data. At some point, there will be some record of the initial data load from a giant CSV file.
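If you want to peek at that recorded chain yourself, PySpark exposes it on every RDD via toDebugString(). A hedged sketch, again assuming the pyspark shell and a made-up HDFS path:

```python
# Build a short chain of transformations; nothing is computed yet.
raw = sc.textFile("hdfs:///data/huge.csv")
no_comments = raw.filter(lambda line: not line.startswith("#"))
pairs = no_comments.map(lambda line: (line.split(",")[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# toDebugString() lists each parent RDD in the chain, which is essentially the
# lineage metadata Spark replays when an action arrives or a partition is lost.
# (Some PySpark versions return bytes rather than str.)
lineage = counts.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)
```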
And the only time when all of these transformations actually need to be executed is when an action is requested on the child RDD. So, for instance, if you perform an action, or request some results, from RDD Two, that is going to cause the evaluation of all of the transformations that occur in its lineage. The implication of this is that whenever you request an action on an RDD, you are materializing all of its parent RDDs. If you have worked with TensorFlow, you will see that this is a very similar paradigm to TensorFlow’s directed acyclic graph: there you execute a node, and that causes only the subset of the DAG which leads into that node to be evaluated. This lazy evaluation based on lineage has two important implications.
The first of these is resilience. There is an in-built fault-tolerance mechanism here, because if something goes wrong, all that needs to be done is to replay, from the source, all of the transformations. This resilience is where the R in RDD comes from. The second important implication is that it makes lazy evaluation possible. As we’ve discussed previously, in the absence of lineage information, it would have been impossible for an RDD to be materialized only when necessary. So lineage and lazy evaluation really are how RDDs make their magic happen. No matter how big the underlying dataset, Spark is able to cleverly abstract that away and only calculate results, and only load them into memory, when absolutely necessary.