Amazon AWS Certified Data Analytics Specialty – Domain 4: Analysis Part 4
- [Exercise] AWS Glue and Athena
Let’s start building out our data warehousing and visualization requirements for Cadabra using Amazon Athena at first. We’re going to do this in a couple of different ways in this course: one using Redshift and one using Athena. Let’s start with Athena, because that one’s easier. All we have to do is set up AWS Glue to infer a schema from our data lake in S3, which houses all of our order data. And we’ve already done the work of importing all that order data into S3 earlier in the course; we set up a Kinesis Data Firehose way back earlier in the course to dump all that data into S3 for us. So we have a bunch of order information just sitting in S3 as CSV files. Now we just need to tell Glue where that is, give it a few hints about how that data is actually named, and then we can issue queries against that data from Athena as if it were a SQL database. So let’s build that out and see if it works. Back to the management console. Let’s get started by going to AWS Glue; go ahead and type that into Find Services if it’s not here. We’ll go to Crawlers and add a crawler, and we’ll give this crawler a name called, I don’t know, Order Data.
The defaults here are fine for security. It’s always worth talking about security: the only option here is to encrypt the logs from Glue that get pushed into CloudWatch, and we’re not going to worry about that. That’s not really sensitive information for us right now. Next we need to choose the data store that our data lives in. We want Glue to go look at our data in S3 and make a table out of it. Other options are JDBC sources, so you could put this on top of some other database, or DynamoDB; but for us, it’s S3. Let’s browse to the path in S3 where our data lives. For us, that’s the orderlogs-sundog-edu bucket; obviously, your bucket name will be slightly different. Go ahead and select that. And we do need to exclude some stuff here, so let’s open that up. If you did the Amazon Elasticsearch Service exercise earlier, we configured Firehose to dump errors into an “es” subdirectory in that same bucket, and we don’t want to see that data as we’re building up our schema for order data. So we’re going to enter es/** to exclude everything underneath the es directory in our bucket. Click out of that (or hit Return) to apply it, and hit Next. We’re done; that’s the only data store we want. We do need to create an IAM role so that Glue can talk to S3.
Let’s call it, I don’t know, order data, and hit Next. We will run this on demand because I don’t anticipate our schema changing anytime soon. And we need to add a database to actually put the information into, so let’s hit Add database and we’ll call it, I don’t know, order logs, and create it. We don’t need a prefix, and we’re all good. Hit Next, review everything, make sure it looks correct, and finish. All right, so we created our crawler; we want to run it now. So click on Run it now and let it go do its thing for a bit. It will take a couple of minutes for this to finish, so we’ll come back when that’s done. All right, after a couple of minutes, my crawler is back in ready status, and you can see that it did add one table. So that seems promising. Let’s go back up to the databases and check it out. Sure enough, there’s our new order logs database. Let’s click on that, and if we click on the tables in order logs, it looks like we now have an orderlogs_sundog_edu table that was created. Of course, for you that will be a different name, based on the bucket that your data was inside.
And let’s go ahead and take a look at what’s in there. Cool. You can see it did extract a schema. Now, the thing is, we don’t have any header information in the CSV data in S3, so it has no idea what these columns actually mean or what they should be called. So we’re going to have to tell Glue what these column names are by hand. Also, we have four partition fields here, and if you recall, the data in S3 for our order logs is broken up by a year directory, then a month subdirectory, then a day subdirectory, and then an hour subdirectory. That’s what these partitions are: they’re the subdirectory structure that our S3 data lives within. And we can take advantage of those partitions to speed up our queries, which is kind of nice. So let’s go ahead and clean these up and give these columns and partitions some real names. Let’s hit Edit schema up here, and we can just click on each column name and tell it what it is. I happen to have a cheat sheet here that tells me that the first column is the invoice number (InvoiceNo is what it’s actually called). The next one is StockCode, followed by Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. And as we said, the partitions correspond to the directory structure our data is stored in, which is the year, month, day, and hour.
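By the way, if you ever want to skip the crawler and define this table by hand, here’s a rough sketch of the kind of external table definition the crawler ends up registering in the Glue Data Catalog, written as Athena DDL. The bucket name, column types, and delimiter settings here are assumptions based on my setup; yours will differ.

```sql
-- Hypothetical DDL approximating what the crawler registers for our CSV data.
-- Bucket name, column types, and delimiter are assumptions; adjust to your own data.
CREATE EXTERNAL TABLE orderlogs_sundog_edu (
  invoiceno   string,
  stockcode   string,
  description string,
  quantity    int,
  invoicedate string,
  unitprice   double,
  customerid  string,
  country     string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://orderlogs-sundog-edu/';
```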
All right, now, it did infer the data types as best it could based on the data that it saw. If you’re going to be using this from Hive, don’t change those, okay? Because you get all sorts of weird errors about the partition keys being different from the main keys in your data, and it’s a really difficult thing to recover from. In my opinion, it’s a bug in Glue; maybe they’ll fix it someday. But if you do need to transform this data into some other type (let’s say I wanted my invoice dates to actually be a date type instead of a string), the best way to do that right now would be to create an ETL job in Glue and write a script to actually do that conversion for you there. But we’ll live with this for now; this is good enough for us to play with. So let’s go ahead and hit Save to apply those changes to our schema. If we scroll down, we can see that our names stuck. It’s SQL, so they got all lowercased, but that’s fine. All right, so now we have a table in AWS Glue that gives some structure to our collection of CSV files that are sitting in S3. Kind of neat.
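As a side note, if all you need is the date conversion at query time rather than a full ETL job, Athena can also parse that string on the fly. This is just a sketch, and the format string is an assumption about how the invoice dates in this dataset are written; check a few rows of your own data first.

```sql
-- Hypothetical query-time conversion; the '%c/%e/%Y %k:%i' format string is an
-- assumption about how these invoice dates look (e.g. '12/1/2010 8:26').
SELECT date_parse(invoicedate, '%c/%e/%Y %k:%i') AS invoice_ts
FROM orderlogs_sundog_edu
LIMIT 10;
```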
We can start treating that S3 data lake as a database. Let’s head over to Athena and see if it actually works. So back to the console; let’s go to Athena. You might have to type that in, of course. And you can see that our order logs database is already here. How cool is that? And our orderlogs_sundog_edu table within it is already here too. So it just works. I didn’t have to set up any servers, I didn’t have to import the table or the database; it’s just there. It just talks to Glue automatically and knows it’s all there. So Athena is extremely easy to use; love it. We can open that table up as well and see what the actual column names and partition names are, so we can refer to them in our SQL queries more easily. And let’s try it out. Just as a test, let’s see if we can get a count of every item ordered from France in the month of February. Okay, why not? Now, you want to substitute in your own table name, of course, because it will be different. And you will probably also have to change the month and year in this query to correspond to the month and year when you actually uploaded that data into S3.
So you might want to go check your S3 data right now and make sure that you have a year and month that make sense. It’s probably the current one if you did this all in the same month in this course. But whatever; the bottom line is, this specific query will not work for you verbatim. You need to adapt it to your own data and your own bucket name. But just to show you that it works, let’s type in the following: SELECT description, COUNT(*) FROM, and the table name is orderlogs_sundog_edu (again, change that to your own table), WHERE country equals France AND year equals whatever year you imported it in, AND month equals whatever month you imported the data in, GROUP BY description. And what that should do is give me a count of each item, bucketed by its description, for orders from France in the month and year we specified. Let’s see if it works. It did. Cool. Awesome. So there we have it.
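Written out, the query looks something like this; the table name, year, and month values are just from my setup, so substitute your own.

```sql
-- Substitute your own table name, and the year/month your data was ingested.
SELECT description, COUNT(*)
FROM orderlogs_sundog_edu
WHERE country = 'France'
  AND year = '2019'
  AND month = '02'
GROUP BY description;
```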
Apparently there were 15 orders of the Woodland Mini Backpack, whatever that is, and 14 of the Charlotte Bag Suki design. Sounds nice. But the cool thing here is how easy that was, right? So once you have Glue set up and you impose some structure on your unstructured data in S3, you can query it like a database from Athena. And that’s a very quick and easy way to explore your data and extract information from it, even do ETL sorts of things on it at the same time. So there you have it: Athena running through AWS Glue on an S3 data lake in action.
- Redshift Intro and Architecture
Next, let’s dive into Amazon Redshift. Redshift is AWS’s distributed data warehouse solution. It is a petabyte-scale data warehouse that is spread across an entire cluster, but it’s fully managed, so you don’t have to think about taking care of that cluster; they take care of all the server maintenance for you. Now, given that the world of big data often involves data warehouses and dealing with massive data sets, it should be no surprise that the exam spends a lot of time talking about Amazon Redshift. So we’re going to spend a lot of time on it as well; a lot of depth is needed here on Redshift. So pay attention folks, this stuff is all important. At a high level, what is Redshift? Well, it is a fast and powerful, fully managed, petabyte-scale data warehouse service, and Amazon claims that it delivers ten times better performance compared to other data warehouses.
So not only is it massively scalable, it’s super fast, and it achieves that speed using machine learning, massively parallel query execution (which is called MPP), columnar storage (which we talked about earlier), and high-performance disks as well. Keep in mind, Redshift is a data warehouse, so it is specifically designed for online analytic processing, OLAP. It’s for querying your data and getting insights out of it from an analytical standpoint. It is not made for OLTP; for OLTP, you would typically want more row-based storage. So you’re not going to be hitting your Redshift data warehouse at massive transaction rates expecting fast responses. It’s meant as an analytics tool, which is why it’s in the analytics section here.
They also claim that in addition to being super fast and super scalable, it’s also super cheap. They claim it is the most cost-effective cloud data warehouse, and there is no upfront cost to using Redshift. That is in stark contrast to building out a massive data warehouse on premises; I can tell you that firsthand. You just pay as you go for the resources that you’re consuming, and that can end up being one-tenth of the cost or less of traditional data warehouses that run on premises. Also, it provides fast querying capabilities over structured data using familiar SQL-based clients and BI tools, just using standard ODBC and JDBC connections. So it just looks like another relational database to the outside world, and you can connect any analytics or visualization tool you want on top of Redshift. It’s also easily scalable: you can scale your cluster up or down with just a few clicks in the AWS management console or with a single API call.
So if you do need to scale it up or scale it down, that’s very easy to do. It won’t be automatic, but at least it’s easy. It also uses replication to enhance your availability and continuous backups to improve your data durability, and it can automatically recover from component and node failures. For monitoring, it integrates with CloudWatch, and metrics for compute utilization, storage utilization, and read/write traffic to the cluster are all available for you free of cost within CloudWatch. You can also add user-defined custom metrics using CloudWatch’s custom metrics functionality. It also provides information on query and cluster performance through the AWS Management Console, and that helps you in diagnosing performance issues, like which user or query is consuming high resources. Some use cases that are listed for Redshift are accelerating all of your analytics workloads. So if you just want your data warehouse to be faster, you might want to move to Redshift. As we said, it uses machine learning, MPP, and columnar storage on high-performance disks, plus result caching, to make it super fast. You might also want to use Redshift because you want to unify your data warehouse and your data lake.
Something we’ll talk about shortly is Redshift Spectrum, which is a way of exposing your unstructured data in S3 as just another table in your data warehouse. So you could actually do joins and such across your structured data that’s been imported into your Redshift cluster itself, together with data lake information that’s actually stored in S3 somewhere. That’s kind of cool. Maybe you just want to modernize your data warehouse and make it faster, more scalable, and easier to manage; Redshift is a potentially easy way of doing that. And there are some more specific use cases that come out of the AWS Big Data white paper. Those include analyzing global sales data, storing historical stock trade data, analyzing ad impressions and clicks, aggregating gaming data, and analyzing social trends. These are all examples of stuff you can do with Redshift, or really any data warehouse for that matter.
Let’s start diving deep into the architecture of Redshift itself. At the highest level, we have clusters, which encompass this entire picture here. A cluster is the core infrastructure component of an Amazon Redshift data warehouse, and a cluster is composed of a leader node, which you see here, and one or more compute nodes. A cluster can contain between 1 and 128 compute nodes, depending on the node type. So it’s not infinitely scalable, but 128 nodes can hold a whole lot of data, and each cluster can contain one or more databases. Now, the user data is stored on the compute nodes. The leader node just manages communication with the client programs and all communication with the compute nodes.
So it’s the interface between your external clients to Redshift and the compute nodes under the hood. It receives all the queries from client applications, parses the queries, and develops execution plans, which are an ordered set of steps to process those queries. It then coordinates the parallel execution of those plans with the compute nodes and also aggregates the intermediate results from those nodes. Finally, the leader node returns those results back to the client applications. Let’s talk about the compute nodes in more depth. Compute nodes are responsible for executing the steps specified in the execution plans they get from the leader node and transmitting data among themselves to serve those queries. They then send intermediate results back to the leader node for aggregation before being sent back to the client applications. Now, each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type you choose. There are two different node types you can choose for a compute node in your Redshift cluster, and you choose this when you set it up. One is dense storage, the DS node type. That allows you to create a very large data warehouse using hard disk drives, or HDDs, for a very low price point. These are available in two different sizes: extra-large (XL) and 8XL.
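If you want to peek at the execution plan the leader node comes up with, Redshift lets you do that with the EXPLAIN command. Here’s a minimal sketch, where the table and column names are just hypothetical placeholders:

```sql
-- EXPLAIN returns the leader node's execution plan without actually running the query.
-- 'sales', 'region', and 'amount' are hypothetical names for illustration.
EXPLAIN
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
```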
An extra-large (XL) has three HDDs with a total of 2 terabytes of magnetic storage, and an 8XL has 24 hard drives with a total of 16 terabytes of magnetic storage on that node. Now, a DS2.8xlarge has 36 Intel Xeon E5-2676 (Haswell) virtual cores and 244 GB of RAM, whereas an extra-large has four Intel Xeon E5-2676 (Haswell) virtual cores and 31 GB of RAM. So, depending on the scale of your data, you have the choice there of extra-large or 8XL dense storage if you want to optimize for storage capacity. The other option besides the dense storage node type is the dense compute node type. So instead of DS, we have DC node types for dense compute. Dense compute node types allow you to create very high performance data warehouses using fast CPUs, large amounts of RAM, and SSDs (solid state disks). They’re also available in two sizes, the large and the 8XL. The large has 160 GB of SSD storage, two virtual cores, and 15 GB of RAM. The 8XL is 16 times bigger, with 2.56 terabytes of SSD storage, 32 virtual cores, and 244 GB of RAM.
So it allows you to get started with a single DC2.large node for 25 cents per hour and scale all the way up to 128 8XL nodes with 326 terabytes of SSD storage, 3,200 virtual cores, and 24 terabytes of RAM. That’s a lot; that’s really big. But at the end of the day, remember that compute nodes can be either dense compute or dense storage, depending on whether you want to optimize for computing capacity or storage capacity. One has SSDs, one has HDDs, and within each, you have a smaller size and an 8XL size available to you as well. All right, so within the compute node, we have node slices. Every compute node is divided into slices, and a portion of the node’s memory and disk space is allocated to each slice, where it processes a portion of the workload assigned to that node. The number of slices per node is determined by the node size of the cluster. All right, that’s a lot of depth there, but really, you need to know it. So a cluster consists of a leader node and one or many compute nodes. A compute node can be dense storage or dense compute. And that compute node, in turn, consists of node slices that process chunks of the data being given to it.
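If you’re curious how your own cluster is carved up, Redshift exposes this through the STV_SLICES system view; a quick sketch:

```sql
-- Lists each compute node in the cluster and the slices allocated on it.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```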
- Redshift Spectrum and Performance Tuning
Next, let’s dive into Redshift Spectrum. Spectrum allows you to query exabytes (exabytes, mind you, we’re past terabytes or petabytes here; we’re in big data land for sure) of unstructured data in S3 without loading it into your cluster. So in very much the same way that Athena can use the AWS Glue catalog to make tables on top of your S3 data, Redshift Spectrum can do the same thing. It’s very much the same idea; it’s just that instead of having a console-based SQL query engine like Athena, it actually just looks like another table in your Redshift database. That way you can have tables that embody your S3 data lake alongside tables that embody data that’s actually stored on the Redshift cluster itself, and you can treat them as the same thing and join between them, whatever you want to do. So it’s really cool. It allows you to run queries against exabytes of unstructured data in S3 without the need to load that data into the Redshift cluster itself or to transform that data in any way.
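To give you a feel for what that looks like in practice, here’s a hedged sketch of registering our Glue catalog database as an external schema in Redshift and then joining an S3-backed Spectrum table against a table stored on the cluster. The schema name, the local table, and the IAM role ARN are all assumptions for illustration.

```sql
-- Hypothetical names throughout; the IAM role must allow Redshift to read S3
-- and the Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum_orders
FROM DATA CATALOG
DATABASE 'order_logs'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Join the S3-backed external table against a hypothetical local table on the cluster.
SELECT c.customer_name, COUNT(*) AS orders
FROM spectrum_orders.orderlogs_sundog_edu o
JOIN local_customers c ON o.customerid = c.customerid
GROUP BY c.customer_name;
```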
It provides limitless concurrency by allowing multiple queries to access the same data in S3 simultaneously. It can scale out to thousands of instances if needed, so your queries will run quickly regardless of data size. And it lets you separate storage and compute capacity, allowing you to scale each independently. So with Spectrum, all of your storage is being done in S3; Spectrum is just doing the compute part of analyzing that data. Redshift Spectrum currently supports many open data formats, including Avro, CSV, Grok, Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV. So, just about any common open data format you can imagine: if you have that sitting in an S3 bucket somewhere, Redshift can parse it out and run queries against it. Spectrum also currently supports Gzip and Snappy compression, so if you do want to compress your S3 data to save space and bandwidth, you can do that too. They’ve pretty much thought of everything here. Going back to Redshift as a whole: how is Redshift so fast? Well, again, it uses massively parallel processing, or MPP. Data and query loads are automatically distributed across all nodes, and adding nodes to the data warehouse is made easy, which also enables fast query performance as the data warehouse grows. So it will do all of its queries in parallel, and if you need more speed or more capacity, just add more nodes to your cluster and it will automatically take advantage of that extra capacity.
It also uses columnar data storage, so your data is organized by column. Column-based systems are ideal for data warehousing and analytics, where you’re querying large data sets but typically only looking at specific columns out of a large number of columns. So you can save a lot of bandwidth and a lot of lookup time by organizing your data by columns instead of by rows. The columnar data is stored sequentially on the storage media and requires far less I/O, thereby improving query performance. By using columnar storage, each data block stores values of a single column from multiple rows. So as records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns. This helps you avoid scanning and discarding unwanted data. Now, columnar databases are generally not well suited for OLTP, online transaction processing. So remember, an anti-pattern for Redshift is OLTP.
It’s meant for OLAP, like any data warehouse would be. Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform database loading or other operations that are part of query execution. It can also do column compression. Compared to row-based data stores, columnar data stores can generally be compressed much more, since each column’s data is all of the same type and stored sequentially on disk. Multiple compression techniques are often applied for better results. Indexes and materialized views are not required with Redshift, and hence it uses less space compared to traditional relational database systems.
It will automatically sample the data and select the most appropriate compression scheme when the data is loaded into an empty table. Compression is a column-level operation that reduces the size of the data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves your query performance. So when you’re loading data into a Redshift cluster, you usually use the COPY command to do that most efficiently. That allows you to copy data into your cluster in a distributed, parallelized manner. And when you issue that COPY command, it will analyze the data and apply compression automatically; you don’t have to do anything, it just does it for you. Now, it is not possible to change the compression encoding for a column after the table is created. But if you want some insight into what’s going on there, Redshift offers an ANALYZE COMPRESSION command, which will perform compression analysis and produce a report with the suggested compression encodings for the tables analyzed.
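For reference, here’s a minimal sketch of both commands; the table name, bucket path, and IAM role ARN are assumptions for illustration.

```sql
-- COPY loads data from S3 in parallel; for an empty table with no encodings
-- defined, Redshift samples the data and applies compression automatically.
COPY order_logs
FROM 's3://orderlogs-sundog-edu/2019/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV;

-- Produce a report of suggested compression encodings for an existing table.
ANALYZE COMPRESSION order_logs;
```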