Amazon AWS Certified Data Analytics Specialty – Domain 4: Analysis Part 7
- [Exercise] Redshift Spectrum, Pt. 1
So let’s continue building out our data warehousing requirements here for Cadabra.com. This time, we’re going to use Amazon Redshift instead of Athena, and that’s a managed solution that requires you to spin up a cluster with a fixed number of servers. So a little bit more work, a little bit more thinking you have to do about capacity there. But at the end of the day, it’s going to operate very similarly. Now, just to tie things together even more and mix things up a little bit, I’m actually going to use Amazon Redshift Spectrum to access our data in S3. Because while I could copy that data into Redshift directly and just have it sit within our Redshift servers, why have a copy of the data if you don’t need to? Since I already have AWS Glue in place, that can expose my S3 data to Redshift Spectrum, and I can just query that data directly, kind of like I did with Athena. But the advantage is that if I had other tables that were loaded into Redshift, I could join that S3 data with that other data as well, if I wanted to, something like the sketch below. So that’s the main difference there.
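Just to make that idea concrete, here is a rough sketch of what such a join might look like. All of the names here are hypothetical: it assumes you have already defined an external schema (spectrum_schema) over your S3 data and loaded a local customers table into Redshift itself.

```sql
-- Hypothetical example: joining an external (S3-backed) Spectrum table
-- with a table stored locally in Redshift. All table and column names
-- are made up for illustration.
SELECT c.customer_name,
       SUM(o.quantity) AS total_items_ordered
FROM spectrum_schema.order_logs o   -- external table; data stays in S3
JOIN public.customers c             -- local Redshift table
  ON o.customerid = c.customerid
GROUP BY c.customer_name
ORDER BY total_items_ordered DESC;
```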
So let’s dive in and build this out. Let’s get started with our hands-on lab here. I’m back in my AWS Management Console, and what we’re going to do, just to recap again, is create a Redshift cluster, a very cheap and small one, and actually use Redshift Spectrum to go through AWS Glue to talk to data in our S3 bucket and query that directly from our Redshift cluster. So let’s get started. Start by going to Redshift. Go ahead and just type in Redshift in the handy search bar here and select that. And there is a new UI here as of December 2019. So we’re going to go ahead and use that because that is the way of the future. They have made it a whole lot simpler. This used to be a lot more complicated to create a Redshift cluster, and they’ve really simplified it quite a bit. So we can just click on this big friendly Create Cluster button. And we definitely don’t want the $13-an-hour version here. We’re just going to be querying a very small table that lives in S3. We don’t even need to store anything locally at all here. It’s going to be querying S3 directly using Redshift Spectrum and AWS Glue. So to save some money, I’m not going to choose the $13-per-hour option. I’m going to choose the 25-cent-per-hour option here, a dc2.large instance type, and I don’t even need two of them. One of them will be plenty. So we’ll have a cluster of one, if you will, and you can see here that that should be reasonable.
We’re going to shut this down as soon as we’re done. So it’s not going to cost that much money. We need to give it a name. Let’s call it cadabra. Note the database port of 5439. If you were going to connect to this database externally, you would need to remember that number. We’ll stick with the default username, awsuser, and you will need to give it a password. So type in whatever your favorite password is and remember it. And the next step is the cluster permissions. And although it says it’s optional, we do need it, because we need to make sure that our new Redshift cluster has the necessary permissions to talk to our Glue catalog and to S3 to get the underlying data. By default, it will not. So we’re going to have to create a new IAM role for this. Let’s break out into a new tab here. I’ll just right-click on the AWS logo there, open a new tab, and we’ll head on over to IAM to create the role that we need. So let’s go to Roles and then Create Role, and this will be an AWS service role. We’re creating a role for Amazon Redshift, so the service in question is Redshift. We’ll select that, and our use case is Redshift - Customizable. And then we’ll go to the next step and give it the specific permissions that we want. Let’s go ahead and search for the two policies that we need to attach. One is AmazonS3ReadOnlyAccess, which should do; we just need to read the data in S3 for this query from Redshift, so select that. And we also need to give it access to Glue. So I’m going to give it broader access to AWS Glue: AWSGlueConsoleFullAccess should be sufficient there. More than sufficient, actually.
All right, moving on. We’ll go to Tags. We don’t need to give any tags to this role, so go on to the review stage. Double-check that we have those two policies attached to it. We need to give this role a name. Let’s call it RedshiftSpectrum and create the role. All right, let’s find that role we just made. Click on it and copy the ARN. We’re going to need that. So click the copy icon there and keep that in your clipboard. Now back to the Redshift console. We can go to Cluster Permissions and open that up. And now if we refresh this, we should see our new RedshiftSpectrum role in the list. So go ahead and attach that: click on Add IAM role. It’s not enough to just select it; you need to hit that button as well. And here’s where things get a lot easier than they used to be.
There are some handy default settings here for all the network, security, and backup settings and everything that you can just use. Now, assuming your account was created sometime within the last few years, you should have a default VPC that you can just use for this cluster. If you do need to create one, then you’re going to have to go off and actually create a new subnet, create a VPC from that, and assign that to your Redshift cluster here. But we can explore what the options are by just temporarily saying don’t use defaults. So you can see, you can specify a specific VPC if you want to, not just use the default one, and a specific subnet group, and a specific availability zone as well. These are all things you can set if you need to. Then there are the database configurations. Interesting that they kind of hid that there; it’s kind of important to know what your database name is, right? So remember, dev is going to be the default database name.
If you did want to create a different database name initially, you would specify that here. Encryption: also very important to note that you can do that. This comes up on the exam a lot. So you do have options here for encrypting your data on the cluster at rest, using AWS KMS or a hardware security module (HSM) as well. And the new console has this annoying habit of refreshing itself when you least expect it. So, maintenance, let’s take a look at what’s in there. Okay, fine. This is just saying that we’re going to allow it to keep up to date on the latest approved cluster version when it goes down for maintenance. We have monitoring options as well with CloudWatch. I’m not going to have any CloudWatch alarms set up here, but you could create them here. Again, this comes up on the exam quite a bit. CloudWatch is there for monitoring your logs and making sure that things are working as expected, and alarming if they are not.
So you could set that up there as well. You can also have automatic backups with an automated snapshot schedule too, and you can specify how often you want those snapshots to run and how long they should be retained for. But I just want the default settings. And there it goes, reconfiguring everything again. Let’s scroll back down here and go back to using defaults for additional configurations. Double-check everything is still correct here, because with all those refreshes (they’ve got to fix that, guys) some settings may have reverted. Let’s change that identifier name back to cadabra and put our password back in. All right, double-check everything before it refreshes again. So again, just make sure that we’re using the cheap one, the dc2.large for $0.25 an hour. We have a single node. We’re going to name our cluster cadabra.
It’s running on port 5439. Our master username is awsuser. The password is whatever you set, and we have to attach our IAM role again, too: the RedshiftSpectrum role we created. And we’re going to use defaults for everything else. So before it goes and refreshes everything yet again, let’s hit that Create Cluster button. And now we just have to sit around and wait for it to get out of the modifying state and become available. So we’re going to hang out for five or ten minutes here, and I’ll come back when it’s done.
- [Exercise] Redshift Spectrum, Pt. 2
All right, a few minutes have passed, and our cluster is now in available status. So let’s go ahead and select that cadabra cluster we just made, and you can see it’s up and running. Everything looks healthy. Let’s go ahead and click on the Editor tab here, and that will allow us to actually interactively query this new database. Once you’ve selected that database, you’ll need to sign into it. Now, it’s actually remembered that I’ve signed in to this before. So for you, you’re going to have to type in that you want to connect to the dev database with the awsuser username and whatever password you specified back when you created this cluster. So go ahead and connect to that database. And here we are. Go ahead and click on the Editor tab here if you haven’t already. What we need to do is create an external schema to actually connect our Redshift cluster with our S3 data through the Glue schema that we created for it, to make it look like a database table. So rather than actually importing a copy of that data from S3 into the cluster itself, we’re going to use Redshift Spectrum to query it in place on S3, using the schema that we created with Glue.
So to do that, we need to create that external schema first. Type in the following to take care of that. Let’s say create external schema, and order_log_schema is what we’ll call it, from data catalog. So this is going to import it from the Glue Data Catalog; that’s what that means. We’re going to look for the database in the Glue Data Catalog called orderlogs (or whatever you named your Glue database earlier), with the following iam_role in single quotes, and hopefully that ARN is still in your clipboard; that’s the role that we created earlier in this exercise. And region, in single quotes, whatever your region name is; for me, that’s us-east-1. All right, let’s hit Run Query to take care of that. And it’s completed, so let’s refresh our data objects here. And we should now see our order_log_schema. Let’s select that.
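For reference, the assembled statement looks something like the sketch below. The Glue database name, the IAM role ARN (the account ID here is just a placeholder), and the region all need to match your own setup.

```sql
-- Create an external schema backed by the Glue Data Catalog.
-- Substitute your own Glue database name, IAM role ARN, and region.
CREATE EXTERNAL SCHEMA order_log_schema
FROM DATA CATALOG
DATABASE 'orderlogs'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrum'
REGION 'us-east-1';
```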
And sure enough, there’s our table. So take note of what that table name is for you; it might be something different based on how you imported things earlier. For me, it’s order_logs_sundog_edu. Go ahead and click on that, and you can explore what’s in it. Sure enough, there’s a schema for our order information from that order data set that we imported into S3 earlier in the course. Pretty cool. So now we can treat that just like any other database table. Let’s run a query against it, shall we? Hit Clear and start typing the following query. Let’s say select description, comma, count star, like that, from. And to specify the table name in this manner, we need to give the schema name, dot, the table name to actually connect it up properly. So the schema name is order_log_schema, and the table name, we’ll say, is order_logs_sundog_edu.
Again, substitute whatever the table name is for you. All right, so we know what we’re selecting from; now let’s specify what we’re going to filter on here. We’re going to look for rows where country equals, in single quotes, France. So we’re going to look for orders from France, and for a specific time frame. So we’ll say year equals, and substitute here whatever year you actually imported your data in. You might have to go and refer back to your S3 bucket to figure out what that is. For me, it’s going to be way back in 2019, and month equals two. That’s way back when I first created this very first version of this exercise for you. Obviously, it will be a much more recent date for you, so substitute whatever month you’re taking this course in, basically, and that should do the job. And we will say group by description. And that’s it; the full query is sketched below. So let’s break down the SQL query here. It’s nothing too fancy. We’re just saying that we want to count up how many times each individual product description occurs, where the order came from France, and it was in the specific year and month that we’re giving here. So let’s run it and see what happens. Looks like it’s doing something. And there we have it. Very cool. We have some results here. So the order and the things that are returned might be different for you, but you should see a similar output to this.
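For reference, the assembled query looks roughly like this. The table name and the year and month values will depend on your own import, and depending on how Glue typed your partition columns you may need to quote those values as strings instead (for example '2019' and '02').

```sql
-- Count orders per product description for France in a given year/month.
-- Adjust the table name and the year/month values to match your own data.
SELECT description, COUNT(*)
FROM order_log_schema.order_logs_sundog_edu
WHERE country = 'France'
  AND year = 2019
  AND month = 2
GROUP BY description;
```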
So, for example, from France, we had 28 orders of Woodland Charlotte bags, 22 orders of a gumball coat rack, and 15 of a robot birthday card. So, cool, it worked. We actually set up a new Redshift cluster from scratch, a very simple one, using a very inexpensive node instance type and only one of them. And rather than just importing data into the cluster itself, we actually connected it to the data that’s sitting in our S3 data lake, using the Glue Data Catalog, S3, and a specific IAM role that allows everything to talk to each other. So we’re using Redshift Spectrum to look at that information that’s stored in S3 in our data lake and interpret it as yet another database table. Pretty cool stuff. All right, that’s it.
Now remember, this is up and running, and we are being charged twenty-five cents per hour for the hardware that we’re using here. So if you are going to go away from this course for a while, you probably want to shut that down. You can go back to the Clusters tab here, select that cluster, and from the Actions menu you can say Delete if you want to actually delete that cluster and make sure that you won’t be billed for it any further. However, our next exercise, when we use QuickSight, does build upon this Redshift cluster. So if you do intend to move forward with this course soon, go ahead and leave it up, but just remember not to leave it up forever, or else you’re going to have a very nasty surprise at the end of the month. Remember, when we created this cluster, it said it was going to cost us at least $180 per month if you just let it run the whole time. So do remember to shut this down when you’re done with it, guys. Otherwise, you’re going to have a pretty big bill at the end of the month that you didn’t anticipate. So you have been warned. Make sure you delete this thing when you’re done with it. And with that, let’s move on.
- Amazon Relational Database Service (RDS) and Aurora
Let’s talk briefly about Amazon RDS, the Relational Database Service. It’s not really relevant to big data; it’s really more for small data. But it will come up on the exam and you need to know what it is, usually as an example of what not to do, or maybe an example of how to import your RDS data into a larger big data system. What is RDS? Well, it is a hosted relational database, and that can be any of the following database types: Amazon Aurora, which we’ll talk about shortly; MySQL; PostgreSQL; MariaDB, which is basically an open-source fork of MySQL; Oracle; or Microsoft SQL Server. So it’s a way to just have AWS host your database server, so you don’t have to worry about maintaining that database host yourself, which, as we all know as DBAs, is not always a fun thing. Kind of a handy thing, but it’s not for big data.
Like I said, it might appear in the exam as an example of what not to use for a big data problem, or in the context of migrating from RDS into Redshift or something like that. It’s made for smaller needs, where you can get away with a single database host and store all of your data in there. For example, I use RDS, specifically Amazon Aurora, as the backing store for my WordPress websites. Now is as good a time as any to talk about ACID compliance. This comes up frequently in the world of relational databases in general, and all of the RDS databases offer full ACID compliance. So if you do need a database that meets these requirements, RDS can be a solution, provided that you don’t need massive data sets that can’t fit on a single host.
Now, under the hood, it might actually be using more than one host. That’s all kind of a black box to us, but it still makes sense to think of RDS as more of a small data solution. ACID, anyway, stands for atomic, consistent, isolated, and durable, and here’s just a little refresher for you guys. Atomicity ensures that either the transaction as a whole is successfully executed, or, if part of a transaction fails, the entire transaction is invalidated. So if you’re sending a transaction that does more than one thing, and any bit of that fails, the entire transaction gets discarded. Consistency ensures that the data written to the database as part of the transaction must adhere to all defined rules and restrictions, including constraints, cascades, and triggers. Isolation ensures that each transaction is independent unto itself, and this is critical in achieving concurrency control. Finally, durability ensures that all of the changes made to the database are permanent once the transaction is successfully completed. A quick sketch of what atomicity looks like in practice is shown below.
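Just to illustrate atomicity, here is a minimal sketch of a transaction, assuming a hypothetical accounts table with id and balance columns; either both updates take effect, or neither does.

```sql
-- Hypothetical transfer between two accounts. The table and values are
-- made up purely to illustrate atomicity: both updates commit together,
-- or a ROLLBACK (explicit, or triggered by an error) discards them both.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

COMMIT;   -- or ROLLBACK; to undo both updates
```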
Let’s talk specifically now about Amazon Aurora, which is one of the database types you can use with RDS. Aurora is compatible with both MySQL and PostgreSQL. So if you have a need for a MySQL or a PostgreSQL backend for something, Amazon Aurora can slip in there and take its place. Again, I’m using it, for example, as the backing store for my WordPress websites. It is up to five times faster than MySQL and three times faster than PostgreSQL, which is pretty darn impressive. And as far as cost goes, it’s up to one-tenth of the cost of hosting your own commercial database. Each Amazon Aurora instance can have up to 64 terabytes of storage and up to 15 read replicas, so it actually does scale up pretty far. It can continuously back up to S3 for you, you have replication across availability zones, and it can automatically scale now with a new feature called Aurora Serverless.
So Aurora is sort of evolving beyond the small data world and into the big data world now, so it’ll be interesting to see where that goes over time. But it’s kind of hard to imagine why you wouldn’t use Aurora if you’re in a situation where you would normally use MySQL or PostgreSQL. As far as security goes, Aurora offers VPC network isolation, obviously. It can also do at-rest encryption using KMS, so data, backups, snapshots, and replicas can all be encrypted. And it can do in-transit security as well, using SSL. So that’s Aurora in a nutshell, and RDS. Again, it usually doesn’t come up in the context of big data, but Aurora is going in interesting directions, so we’ll keep an eye on that going forward.