Amazon AWS Certified Data Analytics Specialty – Domain 3: Processing Part 5
- Pig on EMR
Let’s briefly talk about Apache Pig. That is also an important part of the Hadoop ecosystem that comes preinstalled on Amazon EMR. Pig arose as sort of an alternative interface to MapReduce. It recognizes that writing code for mappers and reducers using MapReduce takes a long time, so Pig introduced instead a language called Pig Latin. This is basically a scripting language that lets you use SQL-style syntax to define your map and reduce steps. So instead of writing Java code for a MapReduce job, it’s sort of an abstraction on top of that that allows you to use more of a high-level scripting language, in this case called Pig Latin. It’s kind of an older technology, but it still comes up on the exam and it’s still a popular thing that some people use.
Like Hive, it’s also highly extensible with user-defined functions, so you can expand on Pig’s functionality in any way you want to if you’re willing to write the code for it. Also like Hive, Pig sits on top of MapReduce or Tez, which in turn sits on top of YARN, which in turn sits on top of HDFS, or EMRFS in the case of Amazon EMR. Now, you’re never going to be asked to actually write or understand code on the exam, but if it helps you understand what Pig Latin is all about, I put a little code snippet here of some Pig Latin code that will analyze some ratings data that sits on HDFS someplace. So a little example of some Pig Latin there.
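Since the slide snippet itself isn’t part of this transcript, here’s a hypothetical Pig Latin script in the same spirit: loading some ratings data from HDFS, computing average ratings, and filtering the results. The file path, field names, and output location are all made up for illustration.

```pig
-- Load ratings data from HDFS (path and field names are hypothetical)
ratings = LOAD '/data/ratings.csv' USING PigStorage(',')
          AS (userID:int, movieID:int, rating:int, ratingTime:int);

-- Group ratings by movie and compute the average rating per movie
ratingsByMovie = GROUP ratings BY movieID;
avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID,
             AVG(ratings.rating) AS avgRating;

-- Keep only highly rated movies and store the result back to HDFS
topMovies = FILTER avgRatings BY avgRating > 4.0;
STORE topMovies INTO '/output/top_movies' USING PigStorage(',');
```

Notice there’s no mapper or reducer code anywhere; Pig translates these relational-style statements into MapReduce (or Tez) jobs for you.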
You can see it’s pretty high level. It looks a lot like SQL, but it’s not exactly SQL, so there is some learning curve to it. But once you get the hang of it, it’s not too bad. Pig and AWS integrate in several ways. Just like Hive, they have extended Pig to be a little bit more EMR friendly. So for example, Pig is not limited to HDFS when you’re running it on EMR; it can also query data on S3 through EMRFS. And like Hive, it has the ability to load JARs and scripts that are stored externally on Amazon S3 directly. But that’s pretty much the extent of Pig and AWS integration. Beyond that, it’s straight-up Pig. Again, kind of an older technology, but you just need to understand what it is and the applications that it’s meant for. It’s basically a higher-level alternative to writing MapReduce code. It’s not quite SQL; it’s a scripting language that looks a lot like SQL, but it’s highly distributed and allows you to analyze your data in a distributed manner.
- HBase on EMR
Another piece of the Hadoop ecosystem that comes preinstalled on EMR is HBase. HBase is a non-relational database that’s designed to work on petabyte-scale data that’s distributed across your entire Hadoop cluster on EMR, or any Hadoop cluster for that matter. And like Hadoop itself, it’s based on Google technology. Google published a paper called Bigtable describing their similar system used internally at Google. For them, it was built on top of the Google File System, but for us, it’s built on top of HDFS. So basically, you have unstructured data spread across your entire Hadoop cluster. HBase can treat that like a non-relational NoSQL database and allow you to issue fast queries on it. The reason it’s so fast is because it operates largely in memory, so it’s not doing a bunch of disk seeks when it’s doing its queries. It’s storing that stuff in memory, spread out across the entire cluster.
Also, it integrates with Hive, so you can use Hive to issue SQL-style commands against data that’s exposed through HBase, which is kind of interesting as well. If HBase sounds a lot like DynamoDB, that’s because it is. They’re both NoSQL databases intended for the same sorts of use cases. So how would you choose between using DynamoDB and HBase if you’re going to be using EMR and storing your data on an EMR cluster? Well, if you’re committed to using AWS services anyhow, DynamoDB does offer some advantages over HBase. For example, DynamoDB is fully managed and is a separate system that automatically scales itself, so you’re not going to be tied to this managed EMR cluster for the scale of your HBase database.
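To make that Hive integration concrete, here’s a hedged sketch in HiveQL using HBase’s storage handler, which maps HBase column families onto Hive columns. The table name, row key, and column names are all hypothetical.

```sql
-- HiveQL: expose an existing HBase table to Hive
-- (table and column names are hypothetical)
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,
  name   STRING,
  city   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:city")
TBLPROPERTIES ("hbase.table.name" = "users");

-- Now ordinary SQL-style queries run against the HBase data
SELECT city, COUNT(*) FROM hbase_users GROUP BY city;
```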
DynamoDB is a separate beast where Amazon manages all the scaling of that for you. So that’s a big advantage for using DynamoDB instead of HBase on an EMR cluster. EMR is not a serverless solution, but DynamoDB is, and since they serve much the same purpose, all things being equal, DynamoDB has a pretty big advantage there. DynamoDB also offers more integration with other AWS services, including integration with AWS Glue. So if you’re going to be using a lot of AWS services, you’re going to find that using DynamoDB instead of HBase is going to offer up a lot more possibilities to you. There are some situations, though, where HBase might be the right choice. First of all, if you’re not married to AWS and you think you might want to actually move off to a separate Hadoop cluster outside of AWS at some point, HBase would be a more intelligent choice there.
But it’s also better at storing sparse data. So if you have data that’s just really scattered across your entire cluster, HBase tends to be able to deal with that better than DynamoDB. HBase is also really good for high-frequency counters, because it offers consistent reads and writes a little bit better than DynamoDB does. And a counter application isn’t necessarily a big-scale operation where you need a massive cluster to store that data, so the scaling concerns that DynamoDB addresses might be less of a concern there. HBase also offers really high write and update throughput. So if you’re going to be doing a lot of writes, HBase might be a better solution than DynamoDB from a performance standpoint.
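As a quick illustration of the high-frequency counter use case, the HBase shell offers atomic increments on counter columns. The table and column names below are made up for the example:

```text
hbase(main):001:0> create 'counters', 'cf'                       # table with one column family
hbase(main):002:0> incr 'counters', 'page:home', 'cf:hits', 1    # atomic increment of a counter
hbase(main):003:0> get_counter 'counters', 'page:home', 'cf:hits'
```

Because the increment is atomic on the server side, many clients can hammer the same counter concurrently without a read-modify-write race.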
And if you’re more concerned with integration with Hadoop services as opposed to AWS services, HBase offers that for you as well. So at the end of the day, sometimes it comes down to what’s the ecosystem you’re trying to integrate with. If it’s really AWS, then DynamoDB is a good choice for your NoSQL database. However, if you’re looking to integrate with Hadoop itself, HBase offers more in that department. As with other systems that are on Amazon EMR, they offer some AWS integration points to sort of make it worth your while to use EMR instead of just a straight up Hadoop cluster from Cloudera or something like that.
So, like everything else, you’re not limited to storing data or reading data from HDFS. With HBase on EMR, through EMRFS, you can actually store files and metadata on Amazon S3 instead. You can also back up data from an HBase database to S3 very easily on EMR. So these are both ways in which AWS has integrated HBase more closely with EMR and AWS services for you.
- Presto on EMR
Presto is another technology that comes preinstalled on Amazon EMR, and it’s something you might see in the exam. What Presto does is connect to many different big data databases and data stores at once and allow you to issue SQL-style queries across all those databases together. So basically, you can write a SQL join command that combines data from different databases stored in different technologies that live on your cluster. It’s a way of unifying all of your data and issuing queries across disparate data sources that are stored across your cluster. Now, the key words with Presto are that it offers interactive queries at petabyte scale. So if you see a question that talks about petabyte scale and doing interactive queries across a wide variety of data sources, they’re probably talking about Presto. Presto is great because it has a very familiar SQL syntax.
There’s no new language to learn there really, and it’s optimized for OLAP applications, doing analytical queries and data-warehousing kinds of queries across your massive data sets that are spread out across your entire EMR cluster. What’s interesting is that Presto was originally developed by Facebook and is still partially maintained by them. So it is open source, but it has its roots in Facebook, and obviously they have big data to deal with as well. Also interesting is that this is what Amazon Athena uses under the hood. So Athena is really just a serverless version of Presto with a nice little skin on top of it. Presto also exposes JDBC, command-line, and Tableau interfaces, so you can use it as sort of a window into all the data stored on your cluster.
And you can expose that to external analytics and visualization tools using these external interfaces. So it’s a great way to expose the data on your cluster, even if it’s stored in many different places in different formats. Some of those connectors include HDFS, S3, Cassandra (which is another NoSQL kind of database), MongoDB, HBase, and straight-up SQL databases, whatever they might be. Redshift is also there, and Teradata. So again, a very wide variety of data sources that you can tie together using Presto. And they can be both relational and non-relational databases; it doesn’t care.
It can treat them all through a single SQL interface, which is really interesting. Presto is even faster than Hive in many cases, and Hive is pretty darn fast. So that’s pretty impressive, right? And with Amazon EMR, you can launch a Presto cluster in just minutes. You don’t have to do node provisioning or cluster setup, or Presto configuration or cluster tuning. So it’s very fast and easy to get Presto up and running with EMR. Now, in Presto, all of the processing is done in memory and pipelined across the network in between stages. This avoids any unnecessary I/O overhead, and that’s why it’s so fast. However, even as fast as it is, it’s still not an appropriate choice for OLTP or batch processing. Presto is just an efficient interactive query tool for doing OLAP-style queries, trying to extract meaning from massive data sets that might be stored in different databases across your entire ecosystem.
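To see what a federated Presto query looks like, here’s a hedged sketch joining S3-backed Hive data against an operational MySQL table in one statement. In Presto, each data source is a catalog, so tables are addressed as catalog.schema.table; the catalog, schema, and table names below are hypothetical.

```sql
-- Presto federated query: join S3 data (via the Hive connector)
-- with a MySQL table (names are hypothetical)
SELECT u.country, COUNT(*) AS order_count
FROM hive.web.orders o          -- e.g. Parquet files on S3, exposed through Hive
JOIN mysql.crm.users u          -- an operational MySQL database
  ON o.user_id = u.id
GROUP BY u.country
ORDER BY order_count DESC
LIMIT 10;
```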
- Zeppelin and EMR Notebooks
Apache Zeppelin is something else you might see in the exam, and it also comes preinstalled on EMR. Now, if you’ve ever seen an IPython notebook, that’s basically what Apache Zeppelin is. It’s a hosted notebook on your cluster that allows you to interactively run Python scripts and code against your data that’s stored on your cluster. And it can interleave that code with nicely formatted notes, and it allows you to share these notebooks with other people that are using your cluster. So the nice thing is that instead of running a little IPython notebook on your desktop, it’s actually running on your cluster. That means you can kick off processes on your cluster and run various data analytics tasks across the entire cluster right from an interactive notebook environment.
If you don’t know what an IPython notebook is, basically it’s a web browser interface that allows you to write little blocks of Python code and see the results interactively. And it also lets you sort of intersperse little comments and notes and pictures if you want to so that other people can understand what you’re doing when they look at your code in the notebook. Zeppelin integrates with Apache Spark, JDBC, HBase, Elasticsearch and more. So there’s a lot of different things you can kick off from a Zeppelin notebook. Personally, I’ve usually used it in the context of Apache Spark and the power of Zeppelin is that it allows you to run Spark code interactively, just like you could within a Spark shell. So you can write a notebook on Zeppelin where you just type in some interactive Python code and it will go off and do that for you across your cluster. This is useful for really speeding up your development cycle because it’s a good way of iteratively trying new things with Apache Spark and seeing the results right away. It allows easy experimentation and exploration of your big data. What’s really cool is that it makes it easy to visualize your results in charts and graphs.
So normally when you’re writing Spark code, you’re not going to be able to actually produce pretty charts and graphs to visualize the results. But because Zeppelin integrates with external chart libraries and visualization libraries, it makes it easy not only just to kick off a job to Apache Spark, but also to visualize that data in a meaningful way that might provide new insights from a business standpoint. It also allows you to issue SQL queries directly against your data using Spark SQL. So it makes it very easy to just type in a SQL query and kick that off against your entire data set on the cluster using Apache Spark. So at the end of the day, Zeppelin helps to make Spark feel less like a programming environment and more like a data science tool.
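To illustrate, here’s a hypothetical pair of Zeppelin notebook paragraphs. The %pyspark and %sql lines are Zeppelin interpreter directives; the S3 path, table, and column names are invented for the example.

```text
%pyspark
# Read ratings from S3 and register them as a temp view (path is hypothetical)
df = spark.read.csv("s3://my-bucket/ratings.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("ratings")

%sql
-- Zeppelin renders this result as a table, and can chart it with one click
SELECT movieID, AVG(rating) AS avgRating
FROM ratings
GROUP BY movieID
ORDER BY avgRating DESC
```

Each paragraph runs independently with Shift+Enter, and the Spark work behind it is distributed across the whole cluster.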
So in the world of data science, people very often use IPython notebooks, and this is a very similar format for getting the same sort of capability, except using the entire power of your cluster and Apache Spark to get back those results. So think of Zeppelin as a way of scaling up an IPython notebook to an entire cluster and handling big data instead of just data that resides on a single computer. Interestingly, Amazon also offers something similar called the EMR notebook. This is a similar concept to Apache Zeppelin, but with more AWS integration. And here’s a picture of what that might look like. So if you’re having trouble visualizing it, this is basically what a notebook looks like: little blocks of code that you can type in through a web browser, where you can just hit Shift+Enter and it will go execute them for you and give you back the results.
Now, the interesting things about EMR notebooks are, first of all, they’re backed up automatically to S3, so you’re never going to lose them if your cluster shuts down. But actually, EMR notebooks exist outside of the cluster itself. So you can actually make an EMR notebook that provisions clusters and shuts them down directly from the notebook. So imagine, if you will, an EMR notebook that says: go spin up this Spark cluster, do all this data crunching on it, give me the results back, visualize it, store it back in S3 someplace, and then shut that cluster down. So it’s a way of very simply writing these scripts that can not only analyze data on an existing cluster, but also spin up clusters on demand to do what you need. So that’s kind of an interesting capability.
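As a rough sketch of what “spin up a cluster on demand” means under the hood, here’s a Python example building a request for boto3’s EMR run_job_flow call. The cluster name, instance types, log bucket, and sizes are all illustrative placeholders, and the actual AWS call is kept in a separate function because it requires real credentials.

```python
def build_cluster_request(name="notebook-spark-cluster"):
    """Build a run_job_flow request for a small, auto-terminating Spark cluster.
    All values here are illustrative, not recommendations."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",            # example EMR release
        "Applications": [{"Name": "Spark"}],      # preinstall Spark on the cluster
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # False => the cluster shuts itself down when its work is finished
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",     # default EMR roles
        "ServiceRole": "EMR_DefaultRole",
        "LogUri": "s3://my-bucket/emr-logs/",     # hypothetical log bucket
    }


def launch_cluster(request):
    """Actually launch the cluster; needs AWS credentials, so not run here."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    emr = boto3.client("emr")
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]


request = build_cluster_request()
print(sorted(request.keys()))
```

An EMR notebook gives you this same spin-up-then-terminate pattern without writing the provisioning code yourself.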
They are hosted inside a VPC for security purposes, and you can only get at them through the AWS console. It has some other great features as well. They are packaged with some popular open source graphical libraries from the Anaconda repository, and that helps you to prototype code and visualize results and perform exploratory analysis with Spark data frames. They can be attached to an existing cluster, or you can provision new clusters directly from the notebook. Like I said, that’s a very important capability. And it allows multiple users from your organization to create their own notebooks, attach them to shared multitenant EMR clusters, and just start experimenting with Apache Spark. This is provided at no additional charge to Amazon EMR customers. So that’s really nice. A really nice little value add to the Hadoop cluster running on EMR is the EMR notebook.