Amazon AWS Certified Data Analytics Specialty – Domain 4: Analysis Part 5
- Redshift Durability and Scaling
Now, let’s talk about the specifics of Redshift’s durability and scalability. Redshift automatically replicates all of the data within your data warehouse cluster as it is loaded, and it continuously backs that data up to S3 for you. Three copies of your data are maintained: the original and a replica on the compute nodes, and a backup in S3. So your data is stored in three different places. For disaster recovery, Redshift also asynchronously replicates your snapshots to S3 in another region. Automated snapshots of the data warehouse cluster are enabled with a one-day retention period by default, and you can extend that retention period up to 35 days if you want. If you turn the retention period down to zero, however, automated backups will be turned off. Now, if you need to restore your cluster from a backup, you just choose one of the automated backups, and AWS will provision a new data warehouse cluster and restore your data to it.
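As a minimal boto3 sketch of those two knobs, here’s how you might extend the snapshot retention period and enable cross-region snapshot copy; the cluster identifier and regions are placeholder values, not anything from this course:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Extend the automated snapshot retention period from the 1-day default
# up to the 35-day maximum. Setting it to 0 disables automated backups.
redshift.modify_cluster(
    ClusterIdentifier="my-warehouse",  # placeholder cluster name
    AutomatedSnapshotRetentionPeriod=35,
)

# For disaster recovery, asynchronously copy snapshots to another region.
redshift.enable_snapshot_copy(
    ClusterIdentifier="my-warehouse",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep copied snapshots in the destination region
)
```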
You can then switch over to the newly restored cluster from the old one. So again, Redshift replicates the data within your data warehouse cluster and continuously backs that data up to S3, and it mirrors each drive’s data to other nodes within the cluster as well. Redshift will automatically detect a failed drive or node and replace it for you. In the event of a drive failure, the Redshift cluster remains available, with a slight decline in performance for certain queries, while Redshift rebuilds that drive from the replica of its data stored on another drive within that node. Single-node clusters do not support data replication, because there’s nothing to replicate to; in that case, you would have to restore your cluster from a snapshot in S3 instead. In the event of an individual node failure, Redshift will automatically detect the failure and replace the failed node in your data warehouse cluster.
Now, until that replacement node is provisioned and added to the database, the cluster will be unavailable for queries and updates. Your most frequently accessed data is loaded from S3 onto the new node first, however, so you can resume querying that data as quickly as possible. Again, single-node clusters do not support data replication, so in the case of a node failure on a single-node cluster, you will need to restore the cluster from a snapshot in S3. For this reason, AWS recommends at least two nodes in your cluster for production purposes. Now, what about a Redshift cluster Availability Zone outage? In this case, you will not be able to use your cluster until power and network access to the Availability Zone are restored, because Redshift is currently limited to a single Availability Zone. However, you can restore the cluster from any existing snapshot to a new Availability Zone within the same region. In that case too, your most frequently accessed data is restored from S3 first, so you can resume your queries as soon as possible.
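As a sketch of that recovery path, the following boto3 snippet finds the most recent automated snapshot and restores it into a new cluster in a different Availability Zone; all identifiers are placeholders:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Find the most recent automated snapshot for the cluster.
snapshots = redshift.describe_cluster_snapshots(
    ClusterIdentifier="my-warehouse",  # placeholder cluster name
    SnapshotType="automated",
)["Snapshots"]
latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

# Provision a new cluster from that snapshot in a healthy AZ within the
# same region; Redshift restores the most frequently accessed data first.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="my-warehouse-restored",
    SnapshotIdentifier=latest["SnapshotIdentifier"],
    AvailabilityZone="us-east-1b",
)
```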
Again, as of now, Redshift only supports single-Availability-Zone deployments. So if an entire zone were to go down permanently, you would have to restore that data from S3 into a different Availability Zone. How does Redshift scale? Well, Redshift clusters support both vertical scaling, which means increasing the node instance type, and horizontal scaling, which means increasing the number of nodes. When you modify your data warehouse cluster to scale it, your requested changes are applied immediately. Here’s how the process works. During the scaling operation, the existing data warehouse cluster remains available for read operations while a new data warehouse cluster is created. When the new cluster is ready, the existing cluster becomes temporarily unavailable while the CNAME record for the existing cluster is flipped to point to the new data warehouse cluster. That only takes a few minutes, and it’s usually done during a maintenance window. Redshift then moves the data from the compute nodes in the existing data warehouse cluster to the compute nodes in the new cluster in parallel.
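As an illustration, here’s a hedged boto3 sketch of a resize request; Classic=True asks for the classic resize flow just described, while Classic=False requests an elastic resize where the configuration supports it (the cluster name and node type are placeholders):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Scale vertically (bigger node type) and horizontally (more nodes).
# Classic=True forces the classic resize flow described above: a new
# cluster is provisioned, data is copied to it in parallel, and the
# CNAME is flipped to the new cluster.
redshift.resize_cluster(
    ClusterIdentifier="my-warehouse",  # placeholder cluster name
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    Classic=True,
)
```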
- Redshift Distribution Styles
When we discussed the Redshift architecture, we talked about how the data in your table is distributed across many compute nodes, and many slices within those nodes. Now, there are several different ways of actually doing that distribution that you need to understand. When data is loaded into a table, Redshift distributes the table’s rows to the compute nodes and slices according to the distribution style that you chose when you created the table. The two primary goals of data distribution are to distribute the workload uniformly among the nodes in the cluster and to minimize data movement during query execution. There are four different distribution styles, and they are as follows. First, AUTO distribution: if you don’t specify a distribution style, Amazon Redshift uses AUTO distribution, and based on the size of the table data, it will assign an optimal distribution style for you. That might be EVEN, KEY, or ALL, so let’s dive into each one of those independently. First, let’s talk about EVEN distribution. With EVEN distribution, regardless of the values in any particular column, the leader node distributes the rows across the slices in a round-robin fashion. It just steps through each individual slice, assigning new data to each one in a circular manner. This is appropriate when a table does not participate in joins, or when there is not a clear choice between KEY distribution and ALL distribution.
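To make that concrete, here’s a sketch using the Redshift Data API from Python; the cluster, database, user, and table names are all made up for illustration. Omitting the DISTSTYLE clause (or specifying DISTSTYLE AUTO) lets Redshift choose, while DISTSTYLE EVEN forces round-robin placement:

```python
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

# DISTSTYLE EVEN: rows are spread round-robin across slices,
# regardless of the values in any column.
ddl = """
CREATE TABLE page_views (
    view_id   BIGINT,
    user_id   BIGINT,
    viewed_at TIMESTAMP
)
DISTSTYLE EVEN;
"""

data_api.execute_statement(
    ClusterIdentifier="my-warehouse",  # placeholder identifiers
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
```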
So EVEN distribution just tries to spread things out as evenly as possible, without trying to cluster together data that might be accessed at the same time. Next is KEY distribution. Here, the rows are distributed according to the values in one column. The leader node will place matching values on the same node slice, so matching values from the common columns are physically stored together. This can come in handy if you’re typically going to be doing queries based on a specific column in your data. By using KEY distribution, you can make sure that all the data associated with a specific key value is physically located on the same slice, and that can speed up your queries. The diagram here shows that as new rows come in from your incoming data, their keys are hashed and each row is sent to a specific slice based on how its key hashes.
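A hypothetical KEY-distributed version of the same table might look like this; rows sharing a user_id hash to the same slice, so queries and joins on user_id avoid shuffling data between nodes:

```python
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

# DISTSTYLE KEY: the hash of user_id determines which slice each row
# lands on, so all rows for a given user are stored together.
ddl = """
CREATE TABLE page_views_by_user (
    view_id   BIGINT,
    user_id   BIGINT,
    viewed_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id);
"""

data_api.execute_statement(
    ClusterIdentifier="my-warehouse",  # placeholder identifiers
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
```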
Then we have ALL distribution. With ALL distribution, a copy of the entire table is distributed to every node. That ensures that every row is collocated for every join that the table participates in. ALL distribution multiplies the storage required by the number of nodes in the cluster, so it takes much longer to load, update, or insert data into multiple tables. ALL distribution is really only appropriate for relatively slow-moving tables, that is, tables that are not updated frequently or extensively. And because the cost of redistributing small tables during a query is low, small dimension tables do not benefit significantly from ALL distribution. Now, if you ever need to view the distribution style of a table, you can query the PG_CLASS_INFO view or the SVV_TABLE_INFO view. The RELEFFECTIVEDISTSTYLE column in PG_CLASS_INFO indicates the current distribution style for that table.
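For example, a quick way to check distribution styles across your tables is to query SVV_TABLE_INFO through the Data API; identifiers are placeholders, and note the Data API is asynchronous, so you fetch results with follow-up calls:

```python
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

# SVV_TABLE_INFO has one row per user table, including its diststyle.
resp = data_api.execute_statement(
    ClusterIdentifier="my-warehouse",  # placeholder identifiers
    Database="dev",
    DbUser="awsuser",
    Sql='SELECT "table", diststyle FROM svv_table_info;',
)

# The Data API is asynchronous: poll describe_statement until the query
# finishes, then page through get_statement_result for the rows.
status = data_api.describe_statement(Id=resp["Id"])
```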
- Redshift Sort Keys
Another important finer point of Redshift is how it uses sort keys. Sort keys are similar to indexes, or indices, in relational database systems. While creating a table, you can define one or more of its columns as sort keys. When data is initially loaded into an empty table, the rows are stored on disk in sorted order. This sorting enables efficient handling of range-restricted predicates. Redshift automatically stores the minimum and maximum values for each block as part of its metadata, so if a query uses a range-restricted predicate, the query processor can use those min and max values to rapidly skip over large numbers of blocks during table scans.
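For instance, assuming the hypothetical page_views table from earlier has viewed_at as its sort key, a range-restricted query like the one below lets Redshift use those per-block min/max values to skip every block that falls outside the date range:

```python
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

# If viewed_at is the sort key, blocks whose min/max viewed_at values
# fall entirely outside this range are skipped during the scan.
query = """
SELECT COUNT(*)
FROM page_views
WHERE viewed_at BETWEEN '2019-03-19' AND '2019-03-20';
"""

data_api.execute_statement(
    ClusterIdentifier="my-warehouse",  # placeholder identifiers
    Database="dev",
    DbUser="awsuser",
    Sql=query,
)
```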
That’s why sorting is important: it allows Redshift to skip over entire ranges of data very quickly. But how do you choose the best sort key? This is something you need to think about up front. Well, if recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. That will help Redshift skip entire blocks that fall outside of the time range. So if you’re querying based on recency, you probably want a time-based sort key. If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
Redshift can then skip reading entire blocks of data for that column, because it tracks the minimum and maximum column values stored on each block and can skip blocks that don’t apply to the predicate range. If you frequently join a table, specify the join column as both the sort key and the distribution key. That enables the query optimizer to choose a sort merge join instead of a slower hash join. Beyond that, you have the choice of a single-column, compound, or interleaved sort key. So let’s dive into what those mean.
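Here’s a hypothetical sketch of that join pattern: the join column, order_id, serves as both the distribution key and the sort key, which keeps joining rows on the same slice and in sorted order (all names are invented):

```python
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

# order_id is both DISTKEY and SORTKEY: matching rows are collocated on
# the same slice and stored in sorted order, enabling sort merge joins.
ddl = """
CREATE TABLE order_items (
    order_id BIGINT,
    sku      VARCHAR(32),
    quantity INT
)
DISTSTYLE KEY
DISTKEY (order_id)
SORTKEY (order_id);
"""

data_api.execute_statement(
    ClusterIdentifier="my-warehouse",  # placeholder identifiers
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
```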
A single sort key just uses a single column, a single value, to sort the data. This is useful if you’re consistently querying for data within a single filter column. For example, if I were querying primarily by date, I might choose a single-column sort key on date for this data here, and it would be stored just like that: broken down and sorted by date, in order. So if I were searching only for movies that were viewed on March 19, 2019, I could potentially skip right over all the March 18 data and go straight to the data I want, because date is my single-column sort key. Another choice is a compound sort key. A compound key is made up of all the columns listed in the sort key definition, in the order they are listed. This is most useful when a query’s filter applies conditions, such as filters and joins, that use a prefix of the sort keys. Performance can decrease, however, when queries depend only on secondary sort columns without referencing the primary columns.
So you need to think about the order of the columns there, and make sure that your primary column is one you will most often be including in your queries. Compound is the default sort style, and compound keys also help to improve compression, so they’re the default for a reason: they’re pretty general-purpose. If you do not specify a sort key style, it will be compound. Finally, we have interleaved sort keys. An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. This can be useful if multiple queries use different columns for filters. It uses an internal compression scheme for zone map values that enables it to better discriminate among column values that have a long common prefix.
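To round this out, here’s an illustrative pair of table definitions, one with a compound sort key and one interleaved; the table and column names are invented for the sketch:

```python
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

# COMPOUND: sorts by viewed_at first, then user_id within each date.
# Best when queries filter on a prefix of the sort key columns.
compound_ddl = """
CREATE TABLE views_compound (
    viewed_at TIMESTAMP,
    user_id   BIGINT,
    title     VARCHAR(100)
)
COMPOUND SORTKEY (viewed_at, user_id);
"""

# INTERLEAVED: gives equal weight to viewed_at and user_id. Best when
# different queries filter on different columns of the sort key.
interleaved_ddl = """
CREATE TABLE views_interleaved (
    viewed_at TIMESTAMP,
    user_id   BIGINT,
    title     VARCHAR(100)
)
INTERLEAVED SORTKEY (viewed_at, user_id);
"""

for ddl in (compound_ddl, interleaved_ddl):
    data_api.execute_statement(
        ClusterIdentifier="my-warehouse",  # placeholder identifiers
        Database="dev",
        DbUser="awsuser",
        Sql=ddl,
    )
```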