Amazon AWS Certified Data Analytics Specialty – Domain 3: Processing Part 3
- Glue Costs and Anti-Patterns
Let’s talk about Glue’s cost model real quick. Again, it is serverless, so you’re just paying for the capacity that you’re consuming. You will be billed by the minute for your crawler jobs and ETL jobs. So you’re just billed by CPU time, basically, for any crawlers that are extracting schema information from your data, and any ETL jobs that are transforming that data on demand. The first million objects stored and accessed are free for the Glue Data Catalog, and any development endpoints for developing your ETL code are charged by the minute. So if you’re actually going to be sitting there editing your ETL code, that is just charged by the amount of time you spend doing that development. The list of anti-patterns for Glue ETL used to be much larger, but they’ve been whittling away at it over time. Right now, the only thing they really advise against is using multiple ETL engines with Glue ETL. Glue ETL is implemented in Apache Spark, remember?
So if you’re going to be running jobs using other engines such as Hive or Pig and things like that, AWS Data Pipeline or Elastic MapReduce might be a better choice for that sort of thing than Glue ETL. What used to be an anti-pattern was streaming data and processing data as it was being consumed. But as of April 2020, Glue ETL now supports serverless streaming ETL. It can consume directly from Kinesis or Kafka, clean and transform that data in flight, and store the results as they come in into S3 or whatever other data store you might want to be using.
Now, under the hood, Glue ETL again is running on Apache Spark, and this is made possible by just using Apache Spark’s Structured Streaming API. The idea there is that you just have a dataset that grows over time as new information is added to it, and as that dataset expands, as new data is received, your Spark job, your Structured Streaming job, sees that data over some window of time, processes it, and puts it somewhere. So yeah, streaming is now actually something you can do with Glue ETL.
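Just to make that concrete, here’s a minimal PySpark sketch of that Structured Streaming pattern. This is not the Glue job API itself, just the underlying Spark idea; the Kafka brokers, topic name, schema, and S3 paths are all placeholders, and you’d need Spark’s Kafka connector package available for it to actually run.

```python
# A minimal Structured Streaming sketch (assumptions: placeholder Kafka topic,
# placeholder S3 bucket, Spark's Kafka connector on the classpath).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

# Hypothetical schema for the incoming JSON records
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
])

# The "dataset that grows over time": an unbounded stream read from Kafka
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
       .option("subscribe", "sensor-events")                # placeholder topic
       .load())

# Clean and transform the data in flight
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*")
          .filter(col("temperature").isNotNull()))

# Store the results as they arrive, micro-batch by micro-batch, into S3
query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/clean-sensor-data/")        # placeholder bucket
         .option("checkpointLocation", "s3://my-bucket/checkpoints/")
         .start())

query.awaitTermination()
```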
- Elastic MapReduce (EMR) Architecture and Usage
Let’s dive into Elastic MapReduce. It’s a very big topic because there’s a very wide variety of technologies that run on Elastic MapReduce. So buckle in. We’re in for a long ride here, guys. What is EMR? Well, it stands for Elastic MapReduce, and it’s kind of a confusing name because MapReduce is really an obsolete part of Hadoop. And EMR is much more than just a Hadoop cluster, too.
But think of it as a managed Hadoop framework running on EC2 instances. And in addition to Hadoop, it includes many tools that run on top of Hadoop as well, including Apache Spark, HBase, Presto, Flink, Hive, and a lot more. If these are all just names to you right now, don’t worry, we’re going to talk about all this stuff in more depth later on. It also features something called EMR Notebooks, which allow you to interactively query data on your EMR cluster using Python from a little web browser. And EMR also offers many integration points with AWS services. You might think to yourself, this sounds a lot like Cloudera or something, and it is a lot like Cloudera.
The difference is that it’s actually running on EC2 instances, and it has built-in integration with other AWS services. So if you’re working within the AWS big data ecosystem, EMR gives you all the power of a Hadoop cluster that you might normally get from something like Cloudera, within the confines of AWS. So what does an EMR cluster look like architecturally? Well, a cluster is just a collection of EC2 instances at the end of the day, and each EC2 instance is going to be referred to as a node. Each one of these nodes will have a role within the cluster, which is called the node type. Now, at a minimum, you’re going to have one master node. The master node manages the cluster by running the software components that coordinate the distribution of data and the distribution of tasks among all the other nodes for processing. It keeps track of the status of all those tasks and monitors the health of the other nodes within the cluster.
Every cluster must have a master node. It’s even possible to create a single-node cluster that only has a master node that does everything. Now, the master node is also sometimes referred to as a leader node. Now, if you’re going to have more than one node, then you’ll have core nodes as well. And these are nodes that have software components that run tasks and store data in the Hadoop Distributed File System, or HDFS; that’s basically the file system that’s spread out across your entire cluster. So again, a multi-node cluster will have at least one core node.
And these core nodes serve the dual purpose of running tasks that analyze your data on the cluster and actually storing the data itself. Now, the thing is, you can scale an EMR cluster up and down by adding more core nodes or taking core nodes away. But that’s kind of a risky thing, because you might end up losing some data in doing so, since data is stored on your core nodes. Now, Hadoop and HDFS are built to have redundancy, so they can usually recover from that. But there is a small risk of losing data when you do that. That’s why we also have task nodes available, and task nodes are like core nodes, but instead of both running tasks that analyze your data and storing data, all they do is run those tasks. They don’t actually store any data of their own.
So that’s interesting because not only does it mean there’s no risk of losing data as you add and remove task nodes, it means you can just add those as needed as the traffic into your cluster increases. So if you start to need to perform more analysis on your cluster because of some seasonal thing, or maybe you’re a big e-retailer and it’s Christmas time, right, you can throw more task nodes on your cluster to process more data more efficiently, but you can do that without actually messing with the underlying file system that the cluster is using. That also makes for a good example of where you might want to use a spot instance. So you can get really cheap temporary instance availability by using spot instances for these task nodes and just using them on demand as needed to quickly scale up and down the computing capacity of your cluster as a whole.
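As a rough sketch of what that looks like in practice, here’s how you might bolt a group of spot task nodes onto a cluster that’s already running, using boto3. The cluster ID, instance type, counts, and bid price are all placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Add a group of Spot task nodes to an already-running cluster to absorb a
# temporary spike in processing demand (all values below are placeholders).
emr.add_instance_groups(
    JobFlowId="j-EXAMPLECLUSTERID",          # placeholder cluster ID
    InstanceGroups=[
        {
            "Name": "extra-task-nodes",
            "InstanceRole": "TASK",          # task nodes: compute only, no HDFS data
            "InstanceType": "m5.xlarge",
            "InstanceCount": 4,
            "Market": "SPOT",                # cheap, interruptible capacity
            "BidPrice": "0.10",              # optional maximum Spot price
        }
    ],
)
```

Because task nodes hold no HDFS data, losing a spot instance here just costs you some compute capacity, not any of your stored data.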
There are a few ways of using an Elastic MapReduce cluster. One is as a transient cluster. So in this case, you’re configuring your cluster to be automatically terminated after it completes some tasks that you give it. So through the AWS console you can say, I want to perform a bunch of steps, and you’re going to spin up a cluster, perform those steps, save the results somewhere, and then destroy the cluster. These are called transient clusters. So you might just have some steps that include loading input data, processing that data, storing the output results, and then shutting down the cluster. This can be very inexpensive because you’re only using the capacity that you need. These EMR clusters can get expensive if they are sitting there doing nothing.
So by using a transient cluster, you can spin up a cluster, perform a task, and spin it down. Now, in contrast, you can also have a long-running cluster that you spin up and just leave running. And then you can just connect directly to the master node to run jobs as needed. And since the cluster stays up, you don’t have to worry about losing the data stored on it between jobs either, which is kind of nice. So in this case, you create a cluster, interact with the installed applications on that cluster directly, and then manually terminate the cluster when you no longer need it.
So in this example, you might be using the cluster as a data warehouse, where periodic processing runs against a large dataset that stays on the cluster, rather than reloading the data onto a new cluster each time. On a long-running cluster, you would have termination protection enabled by default; that’s going to prevent things from shutting down when you don’t expect it. And auto-termination would also be disabled by default, meaning that it won’t automatically shut down when things are done. So, two different ways to use it. One is to have a transient cluster that just spins up, does what you need, and shuts itself down. Or you can have a cluster sitting there pretty much indefinitely, where you have data stored long term on the cluster itself, and jobs to analyze that data are run across it as needed.
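Here’s a rough boto3 sketch of the transient pattern: submit a step, let it run, and have the cluster terminate itself when the steps are done. The script paths, bucket names, and instance sizes are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Transient cluster: run the steps, then terminate automatically.
response = emr.run_job_flow(
    Name="nightly-transient-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # auto-terminate when the steps finish
        "TerminationProtected": False,
    },
    Steps=[
        {
            "Name": "process-input-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://my-bucket/scripts/process.py",   # placeholder script
                    "s3://my-bucket/input/",
                    "s3://my-bucket/output/",
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```

For a long-running cluster you’d flip KeepJobFlowAliveWhenNoSteps to True, and typically set TerminationProtected to True as well.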
- EMR, AWS integration, and Storage
As we said, the thing that sets EMR apart from other Hadoop systems is its integration with AWS. And here are some ways in which it does that. First of all, it uses Amazon EC2 for the instances that comprise the nodes in your cluster. So that means that instead of trying to provision and manage all your own Hadoop nodes yourself, you’re letting Amazon do a lot of that management for you by provisioning hosts in EC2. And that gives you all the flexibility of EC2 in choosing different kinds of instances, or spot instances, and whatnot as you may need to. It also opens up the power of VPC, so you can run your cluster within a virtual network for security purposes. And it also allows you to use Amazon S3 as both an input and output source for your data. Normally on a Hadoop cluster, you are limited to the storage on the cluster itself, but with EMR, you can actually access data in S3 as well. EMR also integrates with Amazon CloudWatch, so you can monitor the cluster’s performance and even set up alarms when things go wrong.
It integrates with IAM, the Identity and Access Management service, so you can configure the permissions for your cluster and control what other services it can talk to. It also integrates with CloudTrail, so you can create an audit log of all the requests made to the service. And in general, that’s what CloudTrail is for, and that might show up on the exam: it’s a way of creating an audit log of what people did with your service. It also integrates with AWS Data Pipeline. So if you actually have a self-contained job that requires you to spin up an EMR cluster, run that job, and then shut it down, you can schedule that as part of an AWS Data Pipeline as well. Another integration point that’s important with EMR is on the storage side. So normally, a Hadoop cluster will use something called HDFS, the Hadoop Distributed File System. But with EMR, that’s just one of many choices you have of where to store your data. Let’s talk about HDFS first, though. HDFS is a distributed, scalable file system for Hadoop, and it distributes the data it stores across different instances in your cluster.
This is important because it allows Hadoop to actually try to run the code that analyzes your data on the same instance where that data is stored. So that’s a very useful performance optimization there. However, it means that if you shut down your cluster, that data is lost, and that’s not a good thing. So that’s the downside. However, HDFS does try to be very resilient and durable while your cluster is running. It actually stores multiple copies of your data on different instances, and that ensures that no data is lost if an individual instance fails. But if you shut down the whole cluster, well, you’re done. Now, each file in HDFS is stored as blocks and is distributed across the entire Hadoop cluster. That means if you have a large file that you’re storing in HDFS, it’s going to get broken up into blocks, and those blocks are going to be stored in multiple places for backup purposes.
By default, the block size is 128 megabytes. Remember that number, guys. So if you have a big file and you’re going to be storing it on HDFS, that file will ultimately be split up into 128-megabyte chunks for processing. If you see an exam question that talks about the ideal size to break your files into for processing purposes on HDFS, that would be 128 megabytes. Now again, HDFS is ephemeral. When you terminate your cluster, that data is gone. However, it is useful for caching intermediate results during MapReduce processing, or for workloads that have very significant random I/O. Now, if you’re going to be leaving your EMR cluster running indefinitely, that’s not really a concern. But if you do intend to terminate your cluster for any reason at any point, it’s going to be better to store your data someplace else that’s more durable. One such example is the EMR File System, or EMRFS.
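If you ever need to verify or override that block size, it’s controlled by the dfs.blocksize property in hdfs-site. Here’s a hedged sketch of an EMR configuration classification that would set it explicitly; 134217728 is simply 128 MB expressed in bytes.

```python
# Sketch: overriding the HDFS block size at cluster-creation time via an EMR
# configuration classification (this list would be passed as the Configurations
# parameter of run_job_flow, or pasted into the console as JSON).
hdfs_block_size_config = [
    {
        "Classification": "hdfs-site",
        "Properties": {
            "dfs.blocksize": "134217728"   # 128 MB in bytes
        },
    }
]
```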
This basically creates a file system that looks like HDFS, but it’s actually backed by S3, so that if you terminate your cluster, your data will still live in S3 and you don’t lose anything. It’s a good thing. So EMR extends Hadoop to add the ability to directly access data stored in S3 as if it were a file system like HDFS. And you can use either HDFS or S3 as the file system in your cluster. In this case, S3 would be used to store your input and output data, and you could still use HDFS to store intermediate results. Something else EMRFS offers is something called EMRFS Consistent View, and that’s boldfaced for a reason, because you’re likely to see it mentioned on the exam.
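In a Spark job running on EMR, that just looks like reading and writing ordinary paths; here’s a small sketch where durable input and output live in S3 via EMRFS, while intermediate results stay on HDFS. The bucket and path names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-example").getOrCreate()

# Durable input lives in S3, addressed through EMRFS (placeholder bucket)
input_df = spark.read.json("s3://my-bucket/raw-logs/")

# Intermediate results can still go to HDFS on the cluster for speed
input_df.write.mode("overwrite").parquet("hdfs:///tmp/intermediate/")

# Final output goes back to S3 so it survives cluster termination
cleaned = spark.read.parquet("hdfs:///tmp/intermediate/").dropna()
cleaned.write.mode("overwrite").parquet("s3://my-bucket/cleaned-logs/")
```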
Now, a problem here is that you have a consistency problem if you have a bunch of different nodes in your cluster trying to write data or read data from S3 at the same time, right? What happens if one node is trying to write to the same place another node is trying to read data from? On HDFS, that’s not so much of an issue, because data tends to be processed on the same node where it’s stored. But when you bring S3 into the mix, you have this consistency problem. That’s what EMRFS Consistent View solves for you. It’s an optional feature available when using Amazon EMR release 3.2.1 or later. And when you create a cluster with Consistent View enabled, EMR will use a DynamoDB table to store object metadata and track consistency with S3 for you. Now, that underlying DynamoDB table does have some capacity limits, and you can actually expand that if you need to when you’re creating your cluster. The local file system is also an option for data storage in some situations. You will have locally connected disks on the EC2 nodes in your cluster, and those will be pre-configured with an instance store.
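For reference, Consistent View is switched on through the emrfs-site configuration classification at cluster-creation time. Here’s a hedged sketch; the table name and the DynamoDB read/write capacity numbers are just illustrative, and EMR creates the table for you.

```python
# Sketch: enabling EMRFS Consistent View via the emrfs-site classification.
# This list would be passed as the Configurations parameter of run_job_flow
# (or entered as JSON in the console's software settings).
emrfs_consistent_view_config = [
    {
        "Classification": "emrfs-site",
        "Properties": {
            "fs.s3.consistent": "true",
            "fs.s3.consistent.metadata.tableName": "EmrFSMetadata",   # default table name
            "fs.s3.consistent.metadata.read.capacity": "500",          # illustrative capacity
            "fs.s3.consistent.metadata.write.capacity": "100",
        },
    }
]
```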
And of course, that will only persist for the lifetime of that individual EC2 instance. That local storage is useful for storing temporary data that’s continually changing, like buffers, caches, and scratch data, but for longer-lived stuff, obviously, that’s not the best choice of where to store information. There is also EBS for HDFS. Amazon EMR automatically attaches an Amazon EBS general purpose SSD 10 GB volume as the root device for its AMIs to enhance your performance. And you can add additional EBS volumes to the instances in your EMR cluster to allow you to customize the storage on an instance. You can also save costs by reducing the storage on an instance if you know you don’t need it. And this is really useful because it allows you to run EMR clusters on EBS-only instance families, such as the M4 or the C4, which are very general purpose.
Now again, EMR will delete these volumes once the cluster is terminated. So if you want to persist data outside the lifecycle of a cluster, you’ll have to use Amazon S3 as your data store instead. You cannot attach an EBS volume to a running cluster; you can only add EBS volumes when launching a cluster. And when you manually detach an EBS volume, EMR will treat that as a failure and replace both the instance storage, if applicable, and the EBS volume storage. So you can’t mess with your EBS storage once your cluster is up and running; that’s something you need to think about before you launch it.
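Since those EBS volumes can only be requested at launch, here’s a hedged sketch of how an instance group with extra EBS storage might be defined for run_job_flow; the volume type, size, and counts are placeholders.

```python
# Sketch: a core instance group with additional EBS volumes attached at launch.
# This dict would go into the InstanceGroups list passed to run_job_flow.
core_instance_group = {
    "Name": "core-nodes",
    "InstanceRole": "CORE",
    "InstanceType": "m4.xlarge",   # an EBS-only instance family
    "InstanceCount": 2,
    "Market": "ON_DEMAND",
    "EbsConfiguration": {
        "EbsOptimized": True,
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 100},
                "VolumesPerInstance": 2,
            }
        ],
    },
}
```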