Amazon AWS Certified Data Analytics Specialty – Domain 3: Processing Part 4
- EMR Promises; Intro to Hadoop
Let’s talk about some of the guarantees EMR makes to you. EMR charges you by the hour, in addition to the EC2 charges. So you are paying for every hour that your cluster is running, whether you’re using it or not. It is not a serverless kind of thing; you are provisioning a fixed number of servers and you are paying for those servers for as long as they’re running. Like we talked about, you can use transient clusters to just spin things up, do something, and spin them down. But if you need to store your data on the cluster for a long period of time and perform ongoing analysis on it, you probably need that cluster to be running 24/7.
And that is going to add up over time. So it ain’t cheap, guys. EMR can provision new nodes automatically if a core node fails, so you don’t have to worry about that at least. And you also have the ability to add and remove task nodes on the fly, like we talked about. So if you do have a big surge in computing needs, you can provision some spot instances, add them as task nodes to your EMR cluster, and automatically have that additional computing power available to your cluster; you can then remove it when you’re done with it. You can also resize the core nodes of a running cluster. Adding or removing core nodes on the fly is generally a bad idea, but EMR does allow you to resize them.
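To make that concrete, here’s a minimal sketch of adding spot task nodes to a running cluster through boto3’s EMR API; the cluster ID, instance type, and counts are hypothetical placeholders, not anything from the course.

```python
import boto3

# A minimal sketch of adding spot task nodes to a running EMR cluster with boto3.
# The cluster (job flow) ID, instance type, and counts below are made-up placeholders.
emr = boto3.client("emr", region_name="us-east-1")

emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",          # your cluster ID
    InstanceGroups=[
        {
            "Name": "surge-task-nodes",
            "InstanceRole": "TASK",       # task nodes: compute only, no HDFS data
            "InstanceType": "m5.xlarge",
            "InstanceCount": 4,
            "Market": "SPOT",             # spot pricing for the temporary surge
        }
    ],
)
```

When the surge is over, you would shrink or remove that instance group the same way, through the EMR API (for example, modify_instance_groups), rather than touching the core nodes.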
So, we’ve talked a lot about Hadoop while we’re talking about EMR. EMR is basically a Hadoop implementation, so maybe you’re new to the world of Hadoop. What’s Hadoop all about, anyway? Well, let’s talk about the Hadoop architecture here and what it all means. The Hadoop architecture comprises the following modules. We refer to the core of Hadoop as Hadoop Common, and that consists of these three systems: MapReduce, YARN, and HDFS. These contain the libraries and utilities required by the other Hadoop modules and provide the underlying file system and operating system level abstractions that we need. It also contains all the JAR files and scripts needed to start Hadoop itself. Let’s start at the bottom with HDFS. This is a distributed, scalable file system for Hadoop. It distributes the data it stores across the instances in the cluster, and it stores multiple copies of that data on different instances to ensure that no data is lost if an individual instance fails. Now, HDFS, as we said before, is ephemeral storage.
That data will be lost if you terminate your EMR cluster, since it is stored on the cluster itself. This can be useful from a performance standpoint because it allows Hadoop to try to run your code on the same node where the data is stored. But it does come with that caveat that if you shut down your cluster, that data goes bye-bye. However, you might want to use it for caching intermediate results during MapReduce processing, or for workloads that have significant random I/O where that performance is important. Sitting on top of the Hadoop Distributed File System is YARN, which stands for Yet Another Resource Negotiator. It’s a component introduced in Hadoop 2.0 to centrally manage cluster resources for multiple data processing frameworks. So it’s sort of the thing that manages what gets run where, guys. On top of YARN is Hadoop MapReduce, and that’s a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce program consists of map functions that map data to sets of intermediate key/value pairs, and reduce functions that combine those intermediate results, apply additional algorithms, and produce the final output from your system. And all of this can be parallelized very nicely across your entire cluster. You don’t really see MapReduce itself used very much these days because there are other, newer systems that outperform it. However, these three components are the core of Hadoop itself.
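To make the map/reduce pattern concrete, here’s a toy, single-machine sketch in plain Python; the word-count example and function names are purely for illustration, since a real Hadoop job distributes the map tasks, shuffles the intermediate key/value pairs by key, and runs the reduce tasks in parallel across the cluster.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) key/value pairs -- the 'intermediate results'."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: combine all values for one key into a final result."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": group intermediate pairs by key before reducing.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        grouped[word].append(count)

results = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(sorted(results))   # [('brown', 1), ('dog', 1), ('fox', 2), ...]
```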
- Intro to Apache Spark
So the thing that has mainly taken the place of MapReduce is Apache Spark, an open-source distributed processing system commonly used for big data workloads, whether or not you’re in the AWS world. Even outside of AWS, Apache Spark is a very commonly used tool for processing massive data sets. Its secret is in-memory caching, so it really does a lot of its work in memory instead of on disk. And it uses directed acyclic graphs to optimize its query execution for very fast analytic queries against data of any size. Spark provides development APIs in Java, Scala, Python, and R, so you do need to write code to use Spark. It’s not something where you can just drag and drop components around; you have to actually be a software developer of some sort to use Apache Spark. However, it does come with a lot of code libraries built in, so you can reuse that code across multiple workloads.
The libraries it provides let you do batch processing, interactive queries, real-time analytics, machine learning, and graph processing as well. Some common use cases of Apache Spark include stream processing through Spark Streaming. This allows you to process data collected from Amazon Kinesis, for example, but also things outside of the AWS ecosystem such as Apache Kafka or any other data stream that Spark Streaming can integrate with on Amazon EMR. It can also do streaming analytics in a fault-tolerant way, and you can write the results of those analytics to HDFS or, in the case of AWS, to S3. You can also use Spark for machine learning at massive scale. It includes a library called MLlib, which is a library of machine learning algorithms that work on data at massive scale. And you can do interactive SQL using Spark SQL, which is used for low-latency interactive queries using either SQL or HiveQL.
However, Spark is not meant for OLTP-style workloads. Spark jobs generally take some time to complete, because the work has to be distributed across the cluster and the results collected back, so it’s not really meant for real-time transactional usage; it’s really more for analytical applications or doing larger tasks on a regular schedule. So how does Spark work? Well, Spark applications run as independent sets of processes spread across an entire cluster. The driver of the whole thing is the SparkContext object within your main program.
That program is known as the driver program, or the driver script. This is the actual code of your Spark program that tells the Spark cluster what you want to do with your data. The SparkContext connects to a cluster manager, which takes care of allocating the resources your driver script needs across different applications. In the case of EMR, that’s going to be Apache YARN, because that’s a component of Hadoop that’s already installed on the cluster. However, you can also use Apache Spark outside of a Hadoop cluster; it has its own cluster manager that you can deploy, so you can have a cluster that just runs Spark and nothing else, and that also works.
Once the cluster manager has decided how to distribute that work, Spark will acquire executors on nodes within the cluster. Executors are processes that run computations and store the data for your applications. The application code is then sent to each executor, and in the final step, the SparkContext sends the tasks to the executors to run.
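Here’s a minimal sketch of what a driver script might look like in PySpark; the application name and computation are made up. When you submit this on EMR, YARN allocates executors on the worker nodes and Spark farms the partitioned work out to them.

```python
from pyspark.sql import SparkSession

# The driver program: builds the SparkSession (which wraps the SparkContext),
# describes the work, and lets the cluster manager (YARN on EMR) hand it out
# to executors on the worker nodes.
spark = SparkSession.builder.appName("MyDriverScript").getOrCreate()
sc = spark.sparkContext   # the SparkContext described above

# A trivial distributed computation: the data is partitioned across executors,
# each executor computes its piece, and the result comes back to the driver.
rdd = sc.parallelize(range(1_000_000))
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()
```

You would typically submit a script like this with spark-submit, for example as an EMR step.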
Spark itself is architected in a similar way to Hadoop: you have a core layer at the bottom and a bunch of stuff built on top of it. At the bottom we have Spark Core, which acts as the foundation for the platform. It’s responsible for memory management, fault recovery, scheduling, distributing and monitoring jobs, and interacting with storage systems. And as we said, it supports APIs for Java, Scala, Python, and R. At the lowest level, it’s dealing with something called a Resilient Distributed Dataset, or RDD, which represents a logical collection of data partitioned across the different compute nodes. We tend, however, to deal with data in Spark these days at a higher level than RDDs, and that’s where Spark SQL comes in. Spark SQL is a distributed query engine that provides low-latency interactive queries up to 100 times faster than MapReduce.
It includes a cost-based optimizer, columnar storage, and code generation for fast queries. And it supports various data sources, such as JDBC, ODBC, JSON, HDFS, Hive, ORC, and Parquet, so you can import data into Spark from pretty much anything. It also supports querying Hive tables using HiveQL. Spark SQL is especially important because it contains a construct known as a Dataset that basically lets you view the data that you have in Spark as a giant database, if you will. By using straight-up SQL to interact with your data, it makes the development of your Spark driver scripts a lot simpler, and it allows Spark itself to perform optimizations that it couldn’t normally do. Generally, when you’re writing modern Spark code, it’s using the datasets that are exposed through Spark SQL. Spark Streaming is also built on top of Spark Core, and it also integrates with Spark SQL to use datasets as well. Spark Streaming is a real-time solution that leverages Spark Core’s fast scheduling capability to do streaming analytics.
It ingests data in mini-batches, and it enables analytics on that data with the same application code you would write for batch analytics. That improves your developer productivity, because the same code can be used for both batch processing and real-time streaming applications. It supports data from a variety of streaming sources as well, including Kafka, Flume, HDFS, and ZeroMQ; and in the case of AWS, as we’ll see, it can also integrate with Kinesis. MLlib is also built on top of Spark Core, and that’s a library of algorithms for doing machine learning on data at large scale. These algorithms include things like classification, regression, clustering, collaborative filtering, and pattern mining. MLlib can read data from HDFS, HBase, or any Hadoop data source, as well as S3 on EMR. And as with any Spark code, you can write your MLlib applications in Scala, Java, Python, or SparkR. Finally, we have GraphX.
That’s a distributed graph processing framework built on top of Spark. And we’re not talking about line charts and pie graphs here; we’re talking about graphs in the data structure sense. So imagine a graph of social network users, with lines that represent the relationships between each user in a social network. That’s the kind of graph we’re talking about, in the computer science sense, not the pretty-charts sense.
It provides ETL capabilities, exploratory analysis, and iterative graph computation to enable users to interactively build and transform a graph data structure at scale. It has a highly flexible API, and you can select from a range of distributed graph algorithms. So if you are dealing with something like a social network, that can be a very useful tool.
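Before moving on, here’s a small sketch (with made-up data and names) of how the layers described above feel in practice: Spark Core gives you RDDs, while Spark SQL lets you treat the same data as a table and query it with plain SQL.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("LayersDemo").getOrCreate()

# Spark Core level: a Resilient Distributed Dataset (RDD), a partitioned
# collection you transform with functional operations.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45), ("carol", 29)])
adults = rdd.filter(lambda pair: pair[1] > 30)
print(adults.collect())

# Spark SQL level: the same data as a DataFrame (the Python face of the Dataset
# API), queryable with SQL, which is how most modern Spark code is written.
people = spark.createDataFrame([Row(name=n, age=a) for n, a in rdd.collect()])
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```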
- Spark Integration with Kinesis and Redshift
Let’s talk about Spark Streaming in a little bit more depth. Spark applications, like we said, usually use something called a dataset in their code to refer to your data, and a dataset is treated very much like a database table. With Spark Streaming, and Structured Streaming in particular, you can basically picture your streaming data as a database table that just keeps on growing forever. As new data is received by the stream, it just keeps adding more and more rows to that virtual database table in the form of a dataset. So you can query this data using windows of time. And if we look at this example here, what this code is doing is monitoring all the stuff that’s being thrown into a logs bucket in S3.
And this action with a one-hour window over time means that we’re going to be continually counting up the records received in that bucket over the previous hour, and then we’re going to turn around and write those counts, using JDBC, into some external MySQL database. So the amount of code you have to write for something like this is actually pretty trivial, and that’s kind of the power of Spark Streaming, and Structured Streaming in particular. Yes, you have to write code, but generally it’s very simple code. You can rely on Spark to manage all the dirty work of making sure you’re receiving your stream data reliably and distributing the processing of that stream across your entire cluster. You just focus on the logic of what you want to do, while Spark Streaming takes care of all the nasty bits of making sure it’s happening in a reliable manner.
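Here’s a sketch of the kind of code being described, in PySpark Structured Streaming. The bucket name, log schema, and MySQL endpoint are all hypothetical, and note that pushing a stream out over JDBC goes through foreachBatch, since Structured Streaming has no built-in JDBC sink; this also assumes the MySQL JDBC driver is available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("HourlyLogCounts").getOrCreate()

# Treat new objects landing in the S3 logs bucket as rows being appended to an
# ever-growing table. Streaming reads need an explicit schema (made up here).
log_schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("status", StringType()),
    StructField("url", StringType()),
])

logs = (spark.readStream
        .schema(log_schema)
        .json("s3://my-logs-bucket/"))          # hypothetical bucket

# Continually count records over one-hour windows of event time.
hourly_counts = logs.groupBy(window(col("timestamp"), "1 hour")).count()

def write_to_mysql(batch_df, batch_id):
    # Each micro-batch of updated counts is pushed out over JDBC.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mysql://mydb.example.com:3306/analytics")  # hypothetical
        .option("dbtable", "hourly_log_counts")
        .option("user", "analytics_user")
        .option("password", "REDACTED")
        .mode("append")
        .save())

query = (hourly_counts.writeStream
         .outputMode("update")
         .foreachBatch(write_to_mysql)
         .start())
query.awaitTermination()
```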
You can also integrate Spark Streaming with Amazon Kinesis. And I mean, there’s nothing really magical about integrating Spark Streaming with any other system, because it’s all code, right? With code you can do anything. So it turns out somebody has written a library for Spark Streaming, built on top of the Kinesis Client Library, that allows Spark Streaming to import data from Amazon Kinesis Data Streams. You just have to plug in that library and code against it, and you can treat Kinesis like any other stream of data coming into a dataset in Spark Structured Streaming. So, for example, you might have some Kinesis producers, say a bunch of EC2 hosts generating logs, that are pumping data into a Kinesis data stream. You can then integrate that, using the Spark dataset integration code, like any other dataset coming into Spark Streaming, and process it across the entire cluster on EMR using Apache Spark.
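As a sketch of that integration, here’s roughly what consuming a Kinesis data stream looks like with the spark-streaming-kinesis-asl library and the Spark 2.x PySpark DStream API; exact APIs vary by Spark version, and the stream name, region, and application name here are placeholders.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# Assumes the spark-streaming-kinesis-asl package is on the classpath
# (e.g. via --packages when you spark-submit). API details vary by Spark version.
sc = SparkContext(appName="KinesisLogConsumer")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

stream = KinesisUtils.createStream(
    ssc,
    kinesisAppName="log-consumer",                          # KCL application name
    streamName="my-log-stream",                             # hypothetical stream
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
)

# From here it's ordinary Spark streaming code: count records per micro-batch.
stream.count().pprint()

ssc.start()
ssc.awaitTermination()
```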
You can also integrate Spark with Redshift. We haven’t talked about Redshift yet, but basically it’s a massively distributed data warehouse offered by AWS. And just like there’s a Spark-Kinesis package, there is also a spark-redshift package, which allows Spark to treat datasets from Redshift just like any other SQL database. It just makes Redshift look like any other SQL data source to Apache Spark. And again, you can use the power of Apache Spark across your entire EMR cluster to transform and process that data in any way you want, in a highly distributed and scalable manner.
So that’s a good way of doing ETL on Redshift using Apache Spark. For example, imagine that we have a bunch of airline flight data residing in a huge data lake stored in Amazon S3. I could deploy Amazon Redshift Spectrum on top of that data in S3, which gives me a SQL interface on top of the data that lives in S3. Now, using the spark-redshift package, I can spin up an Apache Spark cluster on Amazon EMR and use it to perform ETL on that data residing in S3, through Amazon Redshift, because Redshift just looks like any other SQL data source to Apache Spark. So by integrating Redshift with Spark, I can distribute the processing of that huge data set sitting in S3, and maybe turn around and put that processed data back into another Amazon Redshift table for further processing. Maybe I’m doing something like preparing that data for some machine learning algorithm that’s going to be used down the road, and that machine learning algorithm can use the prepared and transformed data in the new Redshift table I’ve produced from my Spark job.
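Here’s a sketch of that pattern using a spark-redshift connector. The connector’s package/format name varies by version (this uses the community fork’s name), and the JDBC URL, IAM role, S3 temp directory, table names, and columns are all hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RedshiftETL").getOrCreate()

# Read a Redshift (or Redshift Spectrum) table as just another DataFrame.
# The connector stages the data in a temporary S3 location behind the scenes.
flights = (spark.read
    .format("io.github.spark_redshift_community.spark.redshift")  # name varies by connector version
    .option("url", "jdbc:redshift://my-cluster.example.com:5439/dev?user=etl&password=REDACTED")
    .option("dbtable", "spectrum.flights")                         # hypothetical table
    .option("tempdir", "s3://my-temp-bucket/redshift/")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyRole")
    .load())

# Do the heavy ETL with ordinary Spark transformations across the EMR cluster.
prepared = (flights
    .filter("cancelled = 0")
    .groupBy("carrier")
    .avg("arrival_delay"))

# Write the result back to another Redshift table for downstream use.
(prepared.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.example.com:5439/dev?user=etl&password=REDACTED")
    .option("dbtable", "ml_prep.carrier_delays")
    .option("tempdir", "s3://my-temp-bucket/redshift/")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyRole")
    .mode("overwrite")
    .save())
```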
- Hive on EMR
Another important piece of the Hadoop ecosystem is Apache Hive. That comes up a lot in the exam, and in life in general, really. Hive is basically a way to execute straight-up, well, almost-SQL code on your underlying unstructured data, which might live in HDFS or, in the case of EMR, in S3. Hive sits on top of MapReduce to actually figure out how to distribute the processing of that SQL over the underlying data. Alternatively, there’s another engine called Tez that can take the place of MapReduce. Tez is kind of like Apache Spark in that it uses in-memory processing and directed acyclic graphs to accelerate things, so often you’ll see Hive being used on top of Tez instead of MapReduce. Like I said, MapReduce has kind of fallen out of favor these days in favor of better technologies. But the important thing to know is that Hive basically exposes a SQL interface to the underlying data stored on your EMR cluster. Why would you use Hive? Well, the main advantage is that it uses a very familiar SQL syntax. It’s not exactly SQL, technically it’s called HiveQL, but it’s really close to SQL.
It does have an interactive interface, so you can just log into your cluster and type your SQL queries into a little web page, and Hive will just go off and do it. It’s very scalable, of course; like anything else on a cluster, it works with big data. But it’s really most appropriate for data warehouse and OLAP applications. Even though it is interactive, it’s not that fast, so you’re still going to be using it for easier OLAP queries, trying to extract meaning from large data sets. But it’s going to be a lot easier and a lot faster than trying to write MapReduce code, or even Apache Spark code, for that matter. If you just need to do quick, high-level, interactive SQL queries of your data, just to explore things, Hive is a great tool for that. Hive is obviously highly optimized, and it’s also highly extensible.
It offers user-defined functions, so you can extend HiveQL however you wish, and it also exposes a Thrift server and JDBC and ODBC drivers, so external applications can communicate with Hive from outside of your cluster: analytics applications, web services, whatever you might have. Just keep in mind, though, that Hive is not really for OLTP, so you shouldn’t be writing a web service that hits Hive continuously hundreds of times per second or anything like that, trying to get results back very quickly. A topic that comes up a lot on the exam is the Hive metastore. At the end of the day, Hive is trying to impose structure on unstructured data. It’s basically sitting on top of a mess of CSV files or something that you’re storing on HDFS or EMRFS or what have you on your cluster. So somewhere it needs to store the information that imposes structure on that data and says: this is what the different columns in this raw data mean.
These are the column names, these are their data types, and that information is stored in the Hive metastore. You can take a look at an example of a table being created here; this is how we might create a structured ratings table based on raw ratings data in CSV form. By saying CREATE TABLE ratings, we’re defining the column names and data types for each column of that information, we’re saying how the actual data is delimited and terminated, the format it’s stored in, and where it is actually located. All of that information has to be stored somewhere, and that’s what we call the Hive metastore. Hive will refer to this information as you treat the underlying CSV data as a SQL table.
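Here’s a sketch of the kind of HiveQL being described, wrapped as a string you could hand to the Hive CLI on the master node; the schema, delimiter, and S3 location are illustrative, not the exact contents of the slide.

```python
import subprocess

# Illustrative HiveQL: impose a table structure on raw CSV ratings data.
# The column names, types, delimiter, and location are made-up examples.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS ratings (
    user_id   INT,
    movie_id  INT,
    rating    INT,
    rated_at  BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-data-lake/ratings/';
"""

# Run it through the Hive CLI on the cluster; the column names, types,
# delimiter, format, and location all land in the Hive metastore.
subprocess.run(["hive", "-e", ddl], check=True)
```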
Now here’s where things get interesting. By default, the Hive metastore is just stored in a MySQL database on the master node of your cluster. It needs to be in one place so everybody can get to it and have a consistent view of how the data should be interpreted across your cluster. But obviously that’s not great, especially if you’re going to be shutting down your master node, or if something happens to it. Hive therefore offers the ability to have an external metastore that you host outside of the cluster somewhere, or maybe on some other node within it, and that offers you better resiliency and integration. Going external also offers an opportunity to store that metastore in alternative places. So you might think to yourself, a Hive metastore sounds an awful lot like a Glue Data Catalog, and you’re right: they serve the same function. They’re both maintaining structure information for unstructured data: how do I map this unstructured data to table names, column names, and data types that will allow me to treat it as a straight-up SQL table? It turns out it’s actually possible to store your Hive metastore within the AWS Glue Data Catalog itself. So you can have an AWS Glue Data Catalog that serves double duty as a Hive metastore, which allows you to centrally locate the metadata for your unstructured data and expose it directly to Hive, where Amazon EMR can get to it, but also expose that same metadata to Amazon Redshift, Amazon Athena, or what have you.
And ultimately, that can all refer to underlying data stored in S3, while using the metadata stored in the Glue Data Catalog to impose structure on it. That allows you to centralize your metadata in one place, not just for Hive, but for everything else that requires metadata in the AWS ecosystem. It is also possible to store your Hive metastore on an external Amazon RDS instance or on Amazon Aurora. We’ll talk about that more later on as well, but basically RDS is an AWS service that allows you to spin up a database in the cloud. So instead of storing your Hive metastore in a MySQL database on your master node, you can choose to store it in an external RDS database instance that will be more persistent; even if you shut down your cluster, that metastore will survive in your RDS database. So study this diagram. It is important to understand the intricacies of how the Hive metastore can interact with the Glue Data Catalog and with RDS, because that will come up on the exam.
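If you do want EMR’s Hive to use the Glue Data Catalog as its metastore, it comes down to cluster configuration; here’s a sketch of the hive-site classification as you might pass it when launching a cluster with boto3 (everything else about the cluster is omitted here).

```python
# The hive-site classification that points Hive at the Glue Data Catalog
# instead of a local MySQL metastore on the master node.
glue_metastore_config = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }
]

# You'd pass this as the Configurations argument when launching the cluster,
# for example with boto3's run_job_flow (other required arguments omitted):
#
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(Name="hive-with-glue", Configurations=glue_metastore_config, ...)
```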
Hive and AWS also integrate in other ways. For example, you can integrate Hive with S3 in many different ways. Using Hive with EMR gives you the ability to load table partitions automatically from S3. You might recall that you can store your data in S3 under different subdirectories, for example year, then month, then day, then hour. Those get translated into table partitions, and you can pick them up automatically with Hive on EMR: you can just type in ALTER TABLE RECOVER PARTITIONS, and that will easily import tables concurrently into many clusters without having to maintain a shared metadata store. Also, using Hive with Amazon EMR provides the ability to specify an off-instance metastore, which we already talked about on the previous slide.
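Here’s a sketch of that automatic partition loading: a table partitioned the same way as the S3 prefixes, followed by the EMR-specific RECOVER PARTITIONS statement. The table layout and paths are made up.

```python
import subprocess

# Illustrative HiveQL: a table whose partitions mirror the S3 prefix layout
# (e.g. s3://my-data-lake/logs/year=2023/month=06/day=01/...), then the
# EMR-specific statement that discovers those partitions automatically.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
    request_time STRING,
    status       INT,
    url          STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-data-lake/logs/';

ALTER TABLE logs RECOVER PARTITIONS;
"""

subprocess.run(["hive", "-e", hiveql], check=True)
```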
And when you write data to tables in Amazon S3, the version of Hive installed on EMR writes directly to Amazon S3 without the use of temporary files, so you can write directly to S3 using Hive’s extensions on EMR. The special version of Hive installed on EMR also allows you to reference resources, such as scripts for custom map and reduce operations or additional libraries, located in Amazon S3 directly from within your Hive script. So you can store your scripts and external user-defined functions in S3, and Hive on EMR can pick them up directly. Hive on EMR also integrates directly with Amazon DynamoDB, so you can process DynamoDB data using Hive as well. To do that, you define an external Hive table based on your DynamoDB table. You can then use Hive to analyze the data stored in DynamoDB and either load the results back into DynamoDB or archive them in Amazon S3. This allows you to copy data from DynamoDB into EMRFS or HDFS, and vice versa. And using Hive on EMR, you can perform join operations between DynamoDB tables, which is kind of interesting.
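And here’s a sketch of that DynamoDB mapping, using the DynamoDB storage handler that ships with Hive on EMR; the DynamoDB table name, columns, and column mapping are hypothetical.

```python
import subprocess

# Map a hypothetical DynamoDB table onto an external Hive table, then query it.
# The storage handler and the dynamodb.* properties come from EMR's Hive build;
# the table name and column mapping here are made up.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS ddb_orders (
    order_id  STRING,
    customer  STRING,
    amount    DOUBLE
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "Orders",
    "dynamodb.column.mapping" = "order_id:OrderId,customer:Customer,amount:Amount"
);

-- Analyze the DynamoDB data with HiveQL; you could also join it against other
-- Hive tables or INSERT OVERWRITE the results into a table backed by S3.
SELECT customer, SUM(amount) FROM ddb_orders GROUP BY customer;
"""

subprocess.run(["hive", "-e", hiveql], check=True)
```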