Amazon AWS Certified Data Analytics Specialty – Domain 3: Processing Part 6
- Hue, Splunk, and Flume
There are several other technologies associated with EMR that you need to at least know what they are. You’re not going to need a whole lot of depth on these, so I’m just going to spend one slide on each one. The important things here are what each technology does and how it integrates with AWS. Those are the key points. Let’s start with Hue. Hue stands for Hadoop User Experience, and basically this is the front end for your entire cluster. Think of it as the front end interface for managing your entire cluster, spinning up services, getting operational insights into what’s going on, things like that. It can integrate with IAM to make sure that your Hue users can inherit IAM roles in terms of what they can access within Hue. And they’ve also extended Hue to actually allow you to browse and move data between HDFS or EMRFS and S3. So that’s kind of how they’ve extended Hue for EMR.
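Just to make that concrete, here’s a minimal sketch of launching an EMR cluster with Hue installed, using boto3. This isn’t from the course itself; the cluster name, release label, and instance details are hypothetical placeholders.

```python
# Minimal sketch: launch an EMR cluster with Hue as one of its applications.
# All names here are hypothetical placeholders.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="cluster-with-hue",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hue"}],  # install Hue on the cluster
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # EC2 instance profile
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```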
But the main thing you need to remember is that Hue is basically the front end console, the management console, for your entire EMR cluster, okay? So you might run into questions on the exam that are like, what’s the most appropriate tool for this problem? Remember, Hue is just a management tool. It’s the front end dashboard for your entire cluster, if you will. Splunk, similarly, also offers operational insight into your cluster as a whole. It just sits there collecting and indexing data all the time about the actual performance of your cluster and what it’s doing. And in addition to deploying it on EMR, you can just spin up your own Splunk cluster separately.
Amazon offers public AMIs containing Splunk Enterprise on top of a 64-bit Amazon Linux OS that you can just sit there and have monitor your EMR cluster separately. So again, the important thing here is that Splunk is basically an operational tool. It’s used to visualize your EMR and S3 data using your cluster. You’re only going to really see Splunk in the context of a list of various technologies that might fit a certain problem, and usually it’s just there to misdirect you. But if you know what Splunk does, that will help you understand those questions. It’s just an operational tool. Flume is a little bit more interesting.
It’s just another way of streaming data into your cluster, kind of like Kinesis or Kafka might do. It’s a distributed, reliable, and available service that allows you to efficiently collect, aggregate, and move large amounts of log data in particular. The name Flume refers to a log flume from the logging days. It’s really purpose built for log data coming in from a big fleet of web servers, for example. Basically, a web server would act as an external source that provides an event to a Flume source. That event is then stored in one or more channels, and a channel acts as a passive store that keeps the event until it is consumed by a Flume sink. The Flume sink then removes the event from the channel and puts it into an external repository like HDFS on your EMR cluster.
So an example of a sink would be an HDFS sink that writes events into HDFS. It supports creating text and sequence files, and supports compression in both file types as well. You can also have a Hive sink, and that would stream events containing delimited text or JSON data directly into a Hive table or partition; events are written using Hive transactions. So it’s just a way of streaming data from external log sources directly into HDFS or Hive or HBase or what have you. The underlying architecture isn’t that important. You just need to understand what Flume is for. It’s a way of streaming log data into a cluster. So in that context, you might see it as an alternative technology for handling streaming applications on an EMR cluster.
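To give you a feel for that source/channel/sink layout, here’s a minimal sketch of a Flume agent configuration, written out from Python just so all the examples stay in one language. The agent name, log path, and HDFS path are hypothetical placeholders.

```python
# Minimal sketch of a Flume agent: tail a web server log (source), buffer
# events in memory (channel), and write them to HDFS on the cluster (sink).
flume_conf = """
agent.sources = weblog
agent.channels = mem
agent.sinks = hdfs-out

# Source: tail a web server's access log (hypothetical path)
agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F /var/log/httpd/access_log
agent.sources.weblog.channels = mem

# Channel: a passive store that holds each event until a sink consumes it
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Sink: remove events from the channel and write them into HDFS
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.channel = mem
agent.sinks.hdfs-out.hdfs.path = hdfs:///flume/weblogs/
agent.sinks.hdfs-out.hdfs.fileType = DataStream
"""

with open("flume-agent.conf", "w") as f:
    f.write(flume_conf)
```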
Let’s also talk about MXNet. We haven’t really talked about deep learning yet, but MXNet is an alternative to TensorFlow, and it is a library for building and accelerating neural networks. It is included on EMR, and it seems to be kind of the preferred way of doing deep learning on EMR. For the purposes of the exam, they’re not going to ask you to design a neural network, so don’t worry about that. But you do need to know that MXNet is a framework that is used to build deep learning applications. Okay? So MXNet is basically a library that makes it easy to write deep learning code that is distributed across an entire EMR cluster. That’s it. That’s all you need to know.
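If you’re curious what that looks like in practice, here’s a minimal sketch of defining a tiny neural network with MXNet’s Gluon API; the layer sizes and dummy input are arbitrary, just for illustration.

```python
# Minimal sketch: a tiny two-layer neural network in MXNet (Gluon API).
from mxnet import nd
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(64, activation="relu"),  # hidden layer
        nn.Dense(10))                     # output layer (e.g., 10 classes)
net.initialize()

x = nd.random.uniform(shape=(32, 100))    # a dummy batch of 32 inputs
print(net(x).shape)                       # -> (32, 10)
```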
- S3DistCp and Other Services
Another tool you might see on the exam is S3DistCp. What this is, is a tool for copying massive amounts of data either from S3 into HDFS or from HDFS into S3. And because it uses MapReduce, it’s able to split up the copying of your objects across all the machines in your cluster. So it’s a very efficient way of copying a large number of objects in parallel, using your entire cluster to do it all at once. So again, it uses MapReduce to copy stuff between S3 and HDFS in a distributed manner, and it can work across S3 buckets, and it can even work across different AWS accounts. So just know what it is and what it’s for. That’s what’s important here.
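Here’s a minimal sketch of what kicking off an S3DistCp copy as an EMR step might look like with boto3; the cluster ID, bucket, and paths are hypothetical placeholders.

```python
# Minimal sketch: submit an S3DistCp copy (S3 -> HDFS) as a step on a
# running EMR cluster. IDs and paths are hypothetical placeholders.
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[{
        "Name": "Copy logs from S3 into HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src=s3://my-bucket/logs/",   # hypothetical source bucket
                "--dest=hdfs:///input/logs/",   # destination on the cluster
            ],
        },
    }],
)
```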
There’s also a bunch of other external tools that you might see running on a Hadoop cluster in general. Not all of these come preinstalled, but many of them do. Again, you just need to know what they are. There’s no need to go into a lot of depth as to how they operate. You just need to know what they’re for, for the purposes of the exam. A lot of these you’ll just see in the context of, they’re going to throw this out there to misdirect you, but once in a while they’ll expect you to actually know what it is. For example, Ganglia is actually a monitoring tool that comes preinstalled on EMR. It’s just a way of monitoring the status of your cluster. You probably want to use CloudWatch for that in most cases, but Ganglia comes as part of EMR as well. Mahout is actually a machine learning library, so if you see a question that’s asking about ways of doing machine learning on an EMR cluster, that might be a legitimate way of doing it. Obviously, deep learning using something like MXNet or TensorFlow is another way of doing it, but Mahout is a way to do it as well. It’s in the same league as Spark’s MLLib that we talked about earlier. So there’s a lot of ways to do machine learning across a cluster.
Mahout is yet another one of them. And if you want yet another NoSQL database in addition to HBase and DynamoDB, we have Accumulo. The thing about Accumulo is that it was designed with security primarily in mind, but it is fundamentally just another NoSQL database. Sqoop is interesting and might come up; it’s basically a relational database connector, and it’s used primarily for importing data from external databases into your cluster in a very scalable manner. So, kind of like S3DistCp parallelizes the copying of data between S3 and your cluster, Sqoop can parallelize the copying of data from an external relational database into your cluster as well. Sqoop’s raison d’être is to parallelize the importing and exporting of data between external relational databases and your cluster. That might be important to know.
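As a rough illustration (not from the course), here’s what launching a parallel Sqoop import from the master node might look like; the JDBC URL, table, credentials, and paths are hypothetical placeholders.

```python
# Minimal sketch: import a relational table into HDFS with Sqoop,
# splitting the copy across 8 parallel map tasks.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mydb.example.com/sales",  # hypothetical source DB
    "--table", "orders",
    "--username", "etl_user",
    "--password-file", "/user/hadoop/.db-password",      # avoids a plaintext password
    "--target-dir", "/data/orders",                      # HDFS destination
    "--num-mappers", "8",                                # parallelism of the copy
], check=True)
```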
There’s also something called HCatalog. What that is, is table and storage management for your Hive metastore. So it’s kind of a meta layer on top of the metastore, if you will; storage and table management for the Hive metastore, just like it says. That’s all it is. So if you see the term HCatalog, that’s related to Hive metastores, and in turn related to Hive. Also, there’s a thing called the Kinesis Connector. This is just a little library that lets you directly access Kinesis streams from any script that you might be writing. So if you do have a situation where you’re writing your own custom code on an EC2 node or something, you might use the Kinesis Connector to access the data in that stream from a script that you’re writing.
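The Kinesis Connector itself is a Java library aimed at Hive, Pig, and MapReduce jobs; as a rough stand-in, here’s what directly reading a Kinesis stream from a Python script with boto3 looks like. The stream name is a hypothetical placeholder.

```python
# Minimal sketch: read records from a Kinesis stream in a script.
import boto3

kinesis = boto3.client("kinesis")
shard_it = kinesis.get_shard_iterator(
    StreamName="my-stream",            # hypothetical stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

result = kinesis.get_records(ShardIterator=shard_it, Limit=100)
for record in result["Records"]:
    print(record["Data"])              # raw bytes of each record
```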
There’s something called Tachyon that might get mentioned. All Tachyon is, is an accelerator for Apache Spark. So even though Spark is pretty darn fast in its own right, there’s an add-on called Tachyon that will make it even faster. And if you’re looking for yet another relational database, there’s something called Derby. That is an open source relational database that’s implemented entirely in Java. So the thing that makes that database different is that it’s entirely done in Java and is used primarily from Java applications. Finally, there’s Ranger, Apache Ranger.
That is a data security manager for Hadoop. Just another third party tool. None of these are specific to EMR. They’re all general purpose tools that exist in the Hadoop ecosystem. There’s nothing really special about them in the context of an EMR cluster. They’re just things that you might install on any Hadoop cluster in general. But this is an exam about big data, and once in a while they’ll expect you to understand things that are not necessarily within the AWS ecosystem, but within the larger Hadoop big data ecosystem as well.
And in addition, you can install whatever software you want on your EMR cluster. At the end of the day, it’s just a cluster of EC2 hosts, and you can install any external third party systems you want on there. No one’s going to stop you. So if you have your own custom software or some other third party package that you want to run on your cluster, have at it. Why not? You have complete access to this cluster and can install whatever you want on it to do whatever you want. So keep that in mind too. You’re not limited to the tools that come preinstalled on EMR.
- EMR Security and Instance Types
So we’re almost done talking about EMR. As you can tell, there’s a whole lot to EMR. It’s a very large ecosystem, and there’s a lot of stuff you can install on an EMR cluster. And really, EMR is at the heart of a lot of big data applications, and you can expect EMR to be at the heart of the exam as well. A couple more things to talk about before we wrap it up, though. One is security, and how you might keep your data secure on your EMR cluster both in transit and at rest. There are several interfaces for security in EMR. One is by using AWS’s IAM policies (that’s Identity and Access Management), and those IAM policies can grant or deny permissions and determine what actions a user can perform within your Amazon EMR cluster and with other AWS resources.
So it allows you to control what other AWS services your EMR cluster can talk to. You can also combine IAM policies together with tagging to control access on a cluster by cluster basis and manage that independently for different clusters (see the sketch after this paragraph). Also, you can have IAM roles for EMRFS requests to Amazon S3. This would allow you to control whether cluster users can access files from within Amazon EMR based on user, group, or the location of EMRFS data within Amazon S3. Kerberos is also available to you. It’s not really an Amazon tool per se, but Kerberos is another way of providing strong authentication through secret key cryptography. This is a network authentication protocol that ensures that passwords or other credentials aren’t sent over the network in an unencrypted format. Kerberos is available in EMR release version 5.10 or later.
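To make that tag-based idea concrete, here’s a minimal sketch of what such an IAM policy might look like, written as a Python dict; the tag key and value are hypothetical placeholders.

```python
# Minimal sketch: an IAM policy that only allows these EMR actions on
# clusters tagged department=analytics (hypothetical tag key/value).
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "elasticmapreduce:DescribeCluster",
            "elasticmapreduce:ListSteps",
        ],
        "Resource": "*",
        "Condition": {
            "StringEquals": {"elasticmapreduce:ResourceTag/department": "analytics"}
        },
    }],
}
print(json.dumps(policy, indent=2))
```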
Also, SSH, of course, is available: Secure Shell. Again, not really an Amazon specific thing, but any Linux host will offer this, including the EC2 hosts that make up your cluster. SSH provides a secure way for users to connect to the command line on cluster instances, and it also provides tunneling, so you can view web interfaces that are hosted on the master node of your cluster from outside of the cluster itself (see the tunneling sketch after this paragraph). Kerberos or Amazon EC2 key pairs can be used to authenticate clients for SSH, and SSH can also be used as a means of encrypting data in transit. Finally, we have IAM roles. The Amazon EMR service role, instance profile, and service-linked role control how Amazon EMR is able to access other AWS services. Each cluster in Amazon EMR must have a service role and a role for the Amazon EC2 instance profile. IAM policies are attached to these roles and provide permissions for the cluster to interoperate with other AWS services on behalf of the user.
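Here’s what that SSH tunneling might look like in practice, sketched with Python’s subprocess so all the examples stay in one language; the key pair and master node DNS name are hypothetical placeholders, and port 8088 is the YARN ResourceManager web UI.

```python
# Minimal sketch: open an SSH tunnel to the master node so a web UI on the
# cluster can be viewed from a local browser at http://localhost:8088/.
import os
import subprocess

subprocess.run([
    "ssh", "-i", os.path.expanduser("~/my-keypair.pem"),   # hypothetical key pair
    "-N",                          # no remote command; just hold the tunnel open
    "-L", "8088:localhost:8088",   # forward local port 8088 to the master node
    "hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com",      # hypothetical master DNS
])
```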
Back to those IAM roles: for example, if you’re going to be enabling automatic scaling on your cluster, you’re going to need an auto scaling IAM role attached to it. There might also be a service-linked role used if the service role for Amazon EMR has lost the ability to clean up Amazon EC2 resources. So that’s security on EMR in a nutshell. Obviously there’s a lot to think about there, and we’re going to dive a lot deeper into security toward the end of the course. Something else you need to be aware of is how to choose the right instance types for your EMR cluster. Remember, under the hood they’re just EC2 instances, but there are many different kinds available to you. For the master node, Amazon recommends using a general purpose instance like an m4.large if you have fewer than 50 nodes to manage, but if you have more than 50 nodes in your cluster, you might want to step things up to an extra large instance type for the master node instead.
Now, for your core and task nodes, again, a general purpose instance like an m4.large is generally a good choice, but you might want to choose something more specific if you know that your application has certain qualities. So, for example, if you know that your cluster is going to be doing a lot of waiting on external dependencies, like maybe you have a web crawler running and your cluster is just going to sit there a long time waiting for that crawler to complete, then you know that your cluster is going to be sitting idle a lot, and maybe you can get away with something a little bit cheaper for the core and task nodes, like a t2.medium. Okay? But otherwise, if you just don’t know when it’s going to be running, or if it’s going to be running pretty much continuously, it’s probably better to stick with a general purpose large instance instead.
If you’ve got the budget for it and you want even better performance, of course you can upgrade to an extra large instance instead, like an m4.xlarge. But if you know the application that you’re going to be running and the demands that it has, you might choose something more custom suited to that specific application. Now, tread carefully here, because it’s hard to change these out after you spin up the cluster. You might start off with a certain kind of application in mind, but then change your mind later and add different kinds of applications to this cluster over time, and usually that’s why you want more of a general purpose node there. But if you know you’re going to be running stuff that is very computation intensive, like, for example, machine learning comes to mind, a high CPU instance might make sense there too. If you’re going to be doing something that really depends on memory a lot, like databases or things that cache a lot in memory, things like HBase come to mind, right?
Or even Hive, or even Spark for that matter, because those all operate in memory. You might want to try a high memory instance instead for applications that really depend on having lots of memory to accelerate their operations. Finally, if you have an application that you know is both network and CPU intensive, for example, a large data set that’s being crunched on by machine learning or natural language processing algorithms, you might want to choose a cluster compute instance instead. There’s also the decision of whether to use spot instances or reserved instances for your cluster. As we talked about earlier, a spot instance can be a really good choice for task nodes, because that gives you the ability to very cheaply add more computing capacity to your cluster, and to add and remove it without worrying about messing up your underlying file storage in the process. You should generally not use a spot instance for your core or master nodes unless you’re just testing and messing around, or if you’re in an extremely cost sensitive environment. You have to remember that you’re risking partial data loss whenever one of your core or master nodes goes down, so using a spot instance is not a smart choice for those.
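Pulling those recommendations together, here’s a minimal sketch of how that layout might be expressed when creating a cluster with boto3: on-demand master and core nodes, with spot capacity for the task nodes. The names, counts, and roles are hypothetical placeholders.

```python
# Minimal sketch: on-demand master/core nodes (safe for HDFS), spot task
# nodes (cheap, disposable compute). All names are hypothetical.
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.30.0",
    Instances={
        "Ec2KeyName": "my-keypair",
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.large",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m4.large",
             "InstanceCount": 4, "Market": "ON_DEMAND"},
            # Task nodes on spot: add or remove capacity without risking HDFS data
            {"InstanceRole": "TASK", "InstanceType": "m4.large",
             "InstanceCount": 8, "Market": "SPOT"},
        ],
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # EC2 instance profile
    ServiceRole="EMR_DefaultRole",
)
```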