Amazon AWS Certified Data Analytics Specialty – Domain 4: Analysis Part 6
- Redshift Data Flows and the COPY command
The exam is going to expect a lot of depth from you on importing and exporting data to and from your Redshift cluster. Now, the most efficient way to import or load data into your Redshift table is the COPY command. Using COPY, you can read from multiple data files or multiple data streams simultaneously. You can import that data from S3, EMR, DynamoDB, or some remote host over SSH. You can use role-based or key-based access control to provide authentication for your cluster to perform those load and unload operations. So part of the COPY command would be the authorization for performing that copy.
For example, if you're using the COPY command to load from Amazon S3, well, there are a couple of ways to do that. One is by using an Amazon S3 object prefix, in which case it will load all the data underneath a specific prefix or path in S3. The syntax for that would just be COPY, the table name, FROM the S3 bucket name and object prefix, followed by the authorization to do that copy. You can also use a manifest file, which is a JSON-formatted file sitting in S3 that lists the data files that you want to load. The syntax in that case would just be COPY, the table name, FROM the S3 path of the manifest file, then the authorization to do that, and the keyword MANIFEST.
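Here is a minimal sketch of both forms; the table, bucket, and IAM role names are placeholders for illustration:

```sql
-- Load every file found under an S3 prefix into the "sales" table
COPY sales
FROM 's3://my-example-bucket/sales/2023/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';

-- Load only the files listed in a JSON manifest file stored in S3
COPY sales
FROM 's3://my-example-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST;
```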
So what does that authorization look like? Well, after the COPY command, you would say, for example, IAM_ROLE followed by the ARN of a given IAM role. Or, if you want to do key-based authentication, you could follow your COPY command with the clauses ACCESS_KEY_ID followed by your access key ID and SECRET_ACCESS_KEY followed by your secret access key. So every COPY command says what table you're loading into, where you're taking the data from, and the authorization for performing that copy. Now, if you want to export data to S3, you would use the UNLOAD command for that. That allows you to unload, or export, data from your database tables to a set of files in an S3 bucket. And that's a pretty straightforward thing.
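As a sketch, key-based authorization and a basic UNLOAD look something like this; the credentials, bucket, and table names are placeholders:

```sql
-- Key-based authorization instead of an IAM role
COPY sales
FROM 's3://my-example-bucket/sales/2023/'
ACCESS_KEY_ID '<your-access-key-id>'
SECRET_ACCESS_KEY '<your-secret-access-key>';

-- Export (unload) the result of a query to a set of files in S3
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
TO 's3://my-example-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```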
But if you're being asked what's the most efficient way to dump a table in Redshift into S3, the UNLOAD command is the best way to do that. Generally speaking, there's also something called enhanced VPC routing available to you, and that will force all of your COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC. Otherwise, all of that COPY and UNLOAD traffic will be routed through the Internet. Now, for this to work, you need to configure your VPC correctly or the COPY and UNLOAD operations will fail. So you might need to set up VPC endpoints, NAT gateways, or Internet gateways within your VPC to make sure that COPY and UNLOAD can communicate with the repositories that you want to talk to. A little more depth on the COPY command: the exam seems to be focusing on this more and more.
So remember, you want to use COPY to load large amounts of data from outside of Redshift into a Redshift table on your Redshift cluster. If you see a question about how to efficiently load data into Redshift from outside, odds are the answer is the COPY command. Now, if your data is already in Redshift in some other table, you don't want to use COPY. COPY is for external data that's being imported into your Redshift cluster. If your data is already in another table within Redshift itself, you want to use either an INSERT INTO ... SELECT statement or CREATE TABLE AS, which creates a new table populated from a query against the existing table (there's a quick sketch of both just below). So remember, CREATE TABLE AS or INSERT INTO ... SELECT is what you're going to use for data that's already in Redshift in some other table, and if it's outside of Redshift, you want to use the COPY command. Some neat features of the COPY command: it can actually decrypt data as it's loaded from S3. So if your S3 data is encrypted, COPY can decrypt it as it is loaded in, and it can do that very quickly because it has a hardware-accelerated SSL capability to keep that decryption as fast as possible. It can also speed things up by working with compressed data; it supports GZIP, LZOP, and BZIP2 compressed input files to speed up those data transfers even further when you're using a COPY command. Another neat feature of the COPY command is automatic compression.
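A minimal sketch of the two in-Redshift options, with illustrative table names:

```sql
-- Data that is already inside Redshift is copied with CTAS or
-- INSERT INTO ... SELECT, not with the COPY command

-- Create a brand new table populated from a query
CREATE TABLE sales_2023 AS
SELECT * FROM sales WHERE sale_date >= '2023-01-01';

-- Append rows into an existing table with a matching schema
INSERT INTO sales_archive
SELECT * FROM sales WHERE sale_date < '2023-01-01';
```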
Automatic compression is an option that analyzes the data being loaded and automatically figures out the optimal compression scheme for storing it. So if you have data that compresses very well, automatic compression can figure out how to store it efficiently and optimize the usage of the actual storage on your cluster. There is one special case that the documentation calls out, and I wouldn't be surprised if it shows up on the exam. If you have a narrow table, that means you have a table that has a lot of rows but very few columns.
You want to load that with a single COPY transaction if at all possible. The thing is, if you use multiple COPY transactions, a bunch of hidden metadata columns appear to keep track of where things left off, and those hidden columns consume too much space. So if you're going to be loading a table that has many rows but very few columns, try to do that within one COPY command; don't break it up over several COPY commands. Okay? It's also important to remember that COPY is designed to parallelize things and do things as efficiently as possible, so if you can do things in one COPY command, why not? That's always going to work well. It's actually pretty amazing the level of depth the exam expects from you on this stuff. So let's talk about a very specific case here. Let's say that you want to copy a snapshot from Redshift automatically to another region.
So you have cross-region replication of your snapshots for backup purposes. Let's say you have a KMS-encrypted Redshift cluster, and you have a snapshot of that cluster being saved in S3 in some region. Now, you want to copy that snapshot to another region for even better backup. The way you would do that is as follows: in your destination AWS region, you create a KMS key if you don't have one already, and then you set up a snapshot copy grant, specifying a unique name for that snapshot copy grant in the destination region. As part of setting up that copy grant, you specify the KMS key ID that you're creating it for. Then, back in the source AWS region, you enable copying of snapshots to the copy grant that you just created. So by using a copy grant that you set up with a KMS key, you can securely copy KMS-encrypted snapshots of your Redshift cluster across to another region. Another bit of connectivity trivia is dblink. dblink is an extension that allows you to connect your Redshift cluster to a PostgreSQL instance, which might be hosted using Amazon's RDS service. At a high level, you might want to do that to get sort of the best of both worlds between the columnar storage of Redshift and the row-based storage of Postgres.
Or you might be using it as a way to copy and sync data between a PostgreSQL instance and Redshift. So if you want a very efficient way of copying data and keeping data synchronized between Postgres and Redshift, the dblink extension is a way to do that. The way it works is like this: you launch a Redshift cluster, and you launch a PostgreSQL instance as well, ideally in the same availability zone. You would then configure the VPC security group for the Amazon Redshift cluster to allow an incoming connection from the RDS PostgreSQL endpoint. You would then connect to the RDS PostgreSQL instance and run the SQL code that establishes the dblink connection between the PostgreSQL instance and Amazon Redshift. So just remember, dblink exists to connect Redshift to PostgreSQL and get the best of both worlds, or maybe to copy and synchronize data between the two more efficiently.
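Here is a rough sketch of that setup, run on the RDS PostgreSQL side; the cluster endpoint, database, user, password, and the example table and column names are all placeholders:

```sql
-- Run on the RDS PostgreSQL instance (not on Redshift)
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;

-- Point a foreign server at the Redshift cluster endpoint
CREATE SERVER redshift_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'my-cluster.abc123.us-east-1.redshift.amazonaws.com',
           port '5439', dbname 'dev');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER redshift_server
  OPTIONS (user 'awsuser', password '<password>');

-- Query a Redshift table from PostgreSQL through the dblink connection
SELECT *
FROM dblink('redshift_server', 'SELECT venueid, venuename FROM venue')
  AS t(venueid int, venuename varchar);
```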
- Redshift Integration / WLM / Vacuum / Anti-Patterns
How does Redshift integrate with other AWS services? Well, let's go through a few of them. Amazon S3: you can use parallel processing to export your data from Amazon Redshift to multiple data files on S3, and of course you can also import data from S3, or even sit on top of it using Redshift Spectrum. Amazon DynamoDB: by using the COPY command, you can load a Redshift table with data from a single Amazon DynamoDB table, so it's possible to import data from DynamoDB into Redshift using the COPY command as well. On EMR hosts or EC2 instances, you can import data using SSH; again via the COPY command, you can load data from one or more remote hosts, such as EMR clusters or whatever you're running on EC2, for that matter. It also integrates with AWS Data Pipeline, so you can automate data movement and transformation in and out of Redshift tables using Data Pipeline. And finally, the AWS Database Migration Service, or DMS, can migrate your data into Amazon Redshift for you, or at least help you with the process. The Database Migration Service is a whole big topic of its own; basically, it's a set of tools that allow you to migrate data from some existing data warehouse into Amazon Redshift.
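Before moving on, here's a sketch of the two COPY-based integrations mentioned above, DynamoDB and SSH; the table names, manifest location, and role ARN are placeholders:

```sql
-- Load from a DynamoDB table; READRATIO limits how much of the table's
-- provisioned read capacity the load is allowed to consume
COPY product_catalog
FROM 'dynamodb://ProductCatalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
READRATIO 50;

-- Load from remote hosts (EMR or EC2) over SSH, driven by a manifest in S3
COPY web_logs
FROM 's3://my-example-bucket/ssh_manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
SSH;
```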
And often there's a lot more to that migration than you might imagine. You will also need to at least know what Redshift Workload Management, or WLM for short, is. It's a way to help users prioritize workloads so that short, fast-running queries are not stuck behind long-running, slow queries. The way it works is by creating query queues at runtime according to service classes, and the configuration parameters for the various types of queues are defined by those service classes. Now, you can modify the WLM configuration to create separate queues for long-running queries and for short-running queries, thereby improving system performance and user experience. You can set all this up using the Amazon Redshift management console, the Amazon Redshift command line interface, or the Amazon Redshift API.
The topic of scaling and tuning your Redshift cluster comes up a lot in the exam as well. One feature you need to know about is called concurrency scaling. This feature allows you to automatically add cluster capacity to handle sudden increases in concurrent read queries. So if you have very bursty access to your Redshift cluster, where sudden floods of read queries come in from outside, concurrency scaling can automatically scale up your cluster to handle it. It can support virtually unlimited concurrent users and queries this way; it just keeps adding more capacity as needed as your read queries increase in volume. And you can use workload management queues, which we'll talk about again very shortly, to manage which queries are sent to the concurrency scaling cluster. So through which queue you assign your queries to, you can manage which ones can actually take advantage of concurrency scaling and which ones cannot. This allows you to, for example, segregate the read queries that you think might be bursty in nature or vary in frequency over time, and automatically scale out capacity for those specific queries.
That reduces the risk of inadvertently adding a bunch of capacity for some offline job that doesn't really need to run fast. So you can pick and choose which queries take advantage of concurrency scaling. Obviously it's not free, so you want to give some thought to which queries actually get this ability to automatically add more and more capacity as needed. Workload management (WLM) comes in a couple of different flavors. One is called automatic workload management.
So with automatic workload management, you can define up to eight different queues that are managed for you, and by default you have five queues with an even memory allocation between them. But you can change that, obviously, if you need to. The idea is that if you have a bunch of large queries going into a queue, like big hash joins or something like that, the concurrency on that queue will be lowered automatically for you. And if you have a bunch of small queries going into a queue, like inserts, scans, or simple aggregations, the concurrency level on that queue might be raised. Concurrency is just how many queries can run at once. Larger queries obviously need more capacity, so you might want lower concurrency on those so that they get more resources, whereas small queries can get away with fewer resources and you can run more of them at once. By separating these queries into their own queues, we can take better advantage of the hardware that we have. Each automatic workload management queue can be configured in a bunch of different ways. One thing you can do is set a priority value that just defines the relative importance of queries within a workload. You can also set the concurrency scaling mode.
So basically, that's what we talked about before with concurrency scaling. This is where you can say: I want this particular queue to have access to a concurrency scaling cluster and have the ability to automatically add more and more resources, more and more servers under the hood, to handle the capacity that queue needs. That costs money, obviously, so you want to think carefully about whether or not that's enabled. You can also assign a set of user groups to a queue, and you can do that by specifying a user group name or by using wildcards. So when a member of a listed user group runs a query, that query will automatically run within the corresponding queue. You can assign queries to query queues based on users. You can also set up a query group; all that is, is a label. Basically, at runtime you can assign a query group label to a series of queries, and that will define which queue those queries go into. So a tag, a label that's assigned to the query itself, can define which queue it goes to.
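For example, a session might tag its queries with a query group label like this; the label 'short_reports' is just an illustrative name that would have to match a query group configured on one of your WLM queues:

```sql
-- Route the queries in this session to whichever WLM queue lists
-- the 'short_reports' query group
SET query_group TO 'short_reports';

SELECT COUNT(*) FROM sales WHERE sale_date = CURRENT_DATE;

-- Go back to the default routing
RESET query_group;
```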
You can also set up query monitoring rules. These are pretty cool. They allow you to define metrics-based performance boundaries for workload management queues, and you can specify what action to take when a query goes beyond those boundaries. So, for example, you might have a queue dedicated to short-running queries, and you might have a query monitoring rule that aborts those queries if they run for more than 60 seconds. That way you can enforce that your short query queue is actually handling short queries, and if something goes wrong with one of those queries, it's not going to hold up all the other queries in that queue.
So a very useful tool there. You can also do things like kick queries off to a different queue if they violate some query monitoring rule. The other flavor of WLM is manual workload management, and by default this comes with one queue with a concurrency level of five. Again, the concurrency level is how many queries can run at once within that queue. In addition, there is a superuser queue with a concurrency level of one. This is a queue intended for administrative queries that always must run no matter what. You can define up to eight manual queues, and you can have a concurrency level on a queue of up to 50. Each of those queues allows you to define whether or not the concurrency scaling cluster is available to it, where it can automatically add more capacity to that queue.
You can also set a manual concurrency level on that queue, defining how many queries can run at once within it. You can assign user groups and query groups to it, like we talked about before with automatic queues, allowing you to automatically route queries to a queue based on what user is running them or what query group label has been attached to the query. You can also define the memory allocated to a given queue, the timeout value for running a query within that queue, and again, any query monitoring rules that you might have. You can also enable what's called query queue hopping. So if you have a query that times out within a given queue, you can configure things to have it hop to the next queue and try again on a different queue that might have a higher timeout or more resources available to it. One more thing to talk about here is short query acceleration, or SQA. The idea here is to automatically prioritize short-running queries over longer-running queries. The short queries will run in their own dedicated space so they don't end up waiting in a queue behind longer queries. This can be used in place of workload management queues if you just want to accelerate short queries. It can work with CREATE TABLE AS statements, remember that, and also read-only queries, or SELECT statements. So these are both candidates for short query acceleration.
It can automatically run those in their own space so that they don't get stuck behind longer analytic queries. And it works by using machine learning, which is pretty cool. It actually tries to predict a query's execution time using machine learning algorithms, so it can guess, based on the query itself, how long it might take and figure out whether or not it should go into the dedicated space for short queries. And you can configure how many seconds you consider to be short, so that's one dial that you have on short query acceleration. So remember, short query acceleration is an alternative to WLM workload management if all you want to do is accelerate short queries and make sure they don't get stuck behind longer queries, and it works with CREATE TABLE AS statements and read-only SELECT statements. Finally, you need to know what the VACUUM command does in Redshift. VACUUM is a command used to recover space from deleted rows and to restore the sort order. So basically, it cleans up your table.
There are four different types of VACUUM commands. The first one is VACUUM FULL. This is the default vacuum operation; it will re-sort all of the rows and reclaim space from deleted rows. There is also VACUUM DELETE ONLY, which is the same as a full vacuum except that it skips the sorting part, so it's just reclaiming deleted-row space and not actually trying to re-sort. You can also do a VACUUM SORT ONLY, which will re-sort the table but not reclaim disk space. And finally, there's VACUUM REINDEX, which is used for re-initializing interleaved indexes. Remember we talked about the different sort keys? You can have one of them being interleaved. So REINDEX will re-analyze the distribution of the values in the table's sort key columns and then perform a full vacuum operation after that.
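As a quick sketch, here are the four flavors run against an illustrative table named sales:

```sql
VACUUM FULL sales;        -- default: re-sort rows and reclaim deleted-row space
VACUUM DELETE ONLY sales; -- reclaim deleted-row space, skip the sort
VACUUM SORT ONLY sales;   -- re-sort the table, don't reclaim disk space
VACUUM REINDEX sales;     -- re-analyze interleaved sort keys, then do a full vacuum
```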
And lastly, let's talk about the anti-patterns for Redshift. This also comes straight out of the AWS Big Data whitepaper, listing things they do not want you to use Redshift for. One is small data sets. Remember, Redshift's strength is that it is massive and highly scalable; if you just have a small, tiny table that you want to store, RDS might be a better choice for that. OLTP: again, they cannot stress enough that Redshift is made for analytic, OLAP-style queries. If you need to do very fast transactional queries, you want to be using RDS or DynamoDB instead. And they're also listing unstructured data, which is kind of strange to me, because Redshift Spectrum kind of exists to let you query unstructured data in S3.
But if you do need to do some ETL on that unstructured data, you should do it first with EMR or Glue ETL or something like that. And Redshift is not appropriate for storing BLOB data, meaning large binary files. If you do need to store large binary files in your data warehouse, it's best to just store references to those files where they reside in S3, and not the files themselves. So that's a lot about Redshift. But again, you do need a lot of depth on Redshift for the exam; it will be well worth your while to understand and remember everything in these slides.
- Redshift Resizing (elastic vs. classic) and new Redshift features in 2020
So what if concurrency scaling isn't enough and you need to actually resize your cluster on the fly? You need to add more capacity to it. Well, there are a few ways of doing that, and you need to understand the difference between them. One is called elastic resize. You can use this to quickly add or remove nodes of the same type. So if you're happy with the type of nodes, the type of EC2 instances that are running your Redshift cluster, you can add more of them or remove them using elastic resize. With elastic resize, your cluster will only be down for a few minutes while it adds that capacity, and it actually tries to keep connections open across that downtime as well.
So you might not even drop any queries; it might just keep the connection held open and let it resume once that added capacity is in place on your cluster. Now, one limit of elastic resize is that for certain DC2 and RA3 node types, you're limited to either doubling or halving the size of the cluster. You don't really need to remember which specific node types those are for the exam, but for certain types, you can only double or halve them. If you need to do something a little bit more involved, then you need to go back to what's called classic resize, and that allows you to actually change the node types and/or the number of nodes. The thing with classic resize is that if you're actually changing the node types, it could take hours or even days for your cluster to become writable again.
So your cluster will be in a read-only state for the entire period of time it takes to provision that new hardware and swap it into your cluster, and that can be a very lengthy experience. So you want to use elastic resize when you can, but if you need to change your node type, you have to use classic resize. That's the main difference there. One technique for dealing with that classic resize scenario is to use what's called snapshot, restore, and resize, and this is a strategy for keeping your cluster available during a classic resize operation. If you do need to change your node type, or you need to add an amount of capacity that elastic resize won't allow, what you can do is take a snapshot to make a copy of your cluster and then resize that new cluster. So your data is still going to the old cluster while your new cluster is being resized and going through the classic resize process. Once your new cluster is finally done, you can then shift your traffic over to it. So snapshot, restore, then resize is a way of migrating from one cluster to another, using classic resize while minimizing downtime.
Some new features in Redshift for 2020: these aren't appearing on the exam as of this recording, but I imagine they will at some point in the future. One new thing they offer is RA3 nodes, and these come with managed storage. The thing about RA3 nodes is that they enable you to scale compute capacity and storage capacity independently. So if you have a lot of complex queries running that take a lot of CPU time, but your storage isn't necessarily that big, you can use RA3 nodes to add more computational resources to your cluster without necessarily adding more storage resources at the same time. Sometimes you need to scale those things independently. Also new is Redshift data lake export, which allows you to unload your Redshift queries into S3 in Apache Parquet format. And Parquet is important because it's up to two times faster to unload, and it consumes up to six times less storage.
Parquet format is also compatible with Redshift Spectrum, Athena, Elastic MapReduce, and SageMaker, so it makes a lot of sense to use this when you can. It's also automatically partitioned. So if your data is partitioned by date or whatever it is, it can preserve that partitioning as it exports your query to S3 in Parquet format.
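A minimal sketch of a data lake export; the bucket, role ARN, and partition column are illustrative:

```sql
-- Unload a query result to S3 as Parquet, preserving date-based partitioning
UNLOAD ('SELECT * FROM sales')
TO 's3://my-example-bucket/datalake/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (sale_date);
```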