Amazon AWS DevOps Engineer Professional – Incident and Event Response (Domain 5) & HA, Fault T… part 11
- Disaster Recovery – Overview
So disaster recovery as a solutions architect is super important, and the exam expects you to know about disaster recovery. There's a white paper on it that you should read, but I try to summarize everything clearly with graphs and diagrams in this lecture, so you don't have to read it if you don't want to. Overall you can expect some questions on disaster recovery, and as a solutions architect you need to know about disaster recovery anyway. Don't worry, I try to make this as simple as possible for you. So what is a disaster? Well, it's any event that has a negative impact on the company's business continuity or finances. And so disaster recovery is about preparing for and recovering from these disasters.
So what kind of disaster recovery can we do on AWS, or in general? Well, we can do on-premises to on-premises: we have a first data center, maybe in California, and another data center, maybe in Seattle. This is traditional disaster recovery, and it's actually very expensive. Or we can start using the cloud: keep on-premises as the main data center and, if we have any disaster, use the cloud. This is called hybrid recovery. Or, if you're all in on the cloud, you can do AWS Cloud Region A to Cloud Region B, and that would be a full-cloud type of disaster recovery. Now, before we do disaster recovery, we need to define two key terms, and you need to understand them from an exam perspective.
The first one is called RPO, Recovery Point Objective, and the second one is called RTO, Recovery Time Objective. So remember these two terms, and I'm going to explain them right now. What are RPO and RTO? The first one is the RPO, Recovery Point Objective. This is basically how often you run backups, how far back in time you can recover. When a disaster strikes, the time between the RPO and the disaster is going to be your data loss. For example, if you back up data every hour and a disaster strikes, then you can go back in time one hour, and so you'll have lost one hour of data. So the RPO, sometimes it can be an hour, sometimes it might be one minute, it really depends on your requirements.
RPO is how much data loss you are willing to accept in case a disaster happens. RTO, on the other hand, is when you recover from your disaster. And so the time between the disaster and the RTO is the amount of downtime your application has. Sometimes it's okay to have 24 hours of downtime (I don't think it is), sometimes it's not okay and maybe you need just one minute of downtime. So basically, optimizing for the RPO and the RTO drives some solution architecture decisions, and obviously the smaller you want these to be, the higher the cost usually is. So let's talk about disaster recovery strategies. The first one is Backup and Restore, the second one is Pilot Light,
the third one is Warm Standby, and the fourth one is Hot Site, or Multi-Site approach. If we rank them, they all have different RTOs: Backup and Restore has the highest RTO, then Pilot Light, then Warm Standby, then Multi-Site, each with a progressively smaller RTO. Each of these costs more money, but gets you a faster RTO, meaning less downtime overall. So let's look at all of these one by one in detail to really understand, from an architectural standpoint, what they mean. Backup and Restore has a high RPO. Say you have a corporate data center, and here is your AWS Cloud with an S3 bucket. If you want to back up your data over time, maybe you can use AWS Storage Gateway and have a lifecycle policy that puts data into Glacier for cost-optimization purposes.
Or maybe once a week you're sending a ton of data into Glacier using AWS Snowball. Here, if you use Snowball, your RPO is going to be about one week, because if your data center burns down or whatever and you lose all your data, then you've lost one week of data, since you sent that Snowball device once a week. If you're using the AWS Cloud instead, maybe with EBS volumes, Redshift, and RDS, and you schedule regular snapshots and back them up, then your RPO is going to be maybe 24 hours or 1 hour, based on how frequently you create these snapshots. And then, when a disaster strikes you and you need to restore all your data, you can use AMIs to recreate EC2 instances and spin up your applications.
Or you can restore straight from a snapshot and recreate your Amazon RDS database, your EBS volume, your Redshift cluster, whatever you want. Restoring this data can take a lot of time as well, so you get a high RTO too. But the reason we do this is that Backup and Restore is quite cheap: we don't manage any infrastructure in the middle, we just recreate infrastructure when we need it, when we have a disaster. And so the only cost we have is the cost of storing these backups. So that gives you an idea: Backup and Restore is very easy, not too expensive, and you get a high RPO and a high RTO.
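To make this Backup and Restore idea a bit more concrete, here is a minimal boto3 sketch of taking an EBS snapshot and copying it to a second region. The region names, volume ID, and descriptions are placeholder assumptions, not values from the course.

```python
import boto3

# Clients for the primary region and the disaster recovery region (example regions)
ec2_primary = boto3.client("ec2", region_name="us-east-1")
ec2_dr = boto3.client("ec2", region_name="us-west-2")

# Take a snapshot of an EBS volume in the primary region
snap = ec2_primary.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # hypothetical volume ID
    Description="Nightly DR backup",
)

# Wait for the snapshot to complete before copying it
ec2_primary.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Copy the snapshot into the DR region so it survives a regional outage
copy = ec2_dr.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId=snap["SnapshotId"],
    Description="DR copy of nightly backup",
)
print("DR snapshot:", copy["SnapshotId"])
```

If you run something like this on a schedule, say from a CloudWatch Events rule triggering a Lambda function, your RPO is roughly that schedule interval.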
The second strategy is Pilot Light. With Pilot Light, a small version of the app is always running in the cloud, and usually that's going to be your critical core; this is what's called the pilot light. It's very similar to Backup and Restore, but this time it's faster because your critical systems are already up and running. And so when you do recover, you just need to add on all the other systems that are not as critical. Let's take an example. This is your data center; it has a server and a database. And this is your AWS Cloud. Maybe you're going to do continuous data replication from your critical database into RDS, which is going to be running at all times. So you get an RDS database that is ready to go and running.
But your EC2 instances are not critical just yet; what's really important is your data. So they're not running. In case a disaster happens, Route 53 will allow you to fail over from the server in your data center: you recreate that EC2 instance in the cloud and bring it up, and your RDS database is already ready. So what do we get here? We get a lower RPO and a lower RTO, and we still manage costs: we have to have RDS running, but only the RDS database is running, the rest is not, and your EC2 instances are only created when you need to do a disaster recovery. So Pilot Light is a very popular choice.
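As a rough sketch of that recovery step, assuming you have already copied an application AMI to the recovery region, bringing the compute tier back up could look like this with boto3. The AMI ID, subnet, instance type, and region are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # the recovery region (example)

# Launch the application tier from a pre-baked AMI once a disaster is declared;
# the RDS database from the pilot light is already running and ready to serve it.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical pre-copied application AMI
    InstanceType="t3.large",
    MinCount=1,
    MaxCount=2,
    SubnetId="subnet-0123456789abcdef0",  # hypothetical subnet in the recovery VPC
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Purpose", "Value": "pilot-light-recovery"}],
    }],
)
print("Launched:", [i["InstanceId"] for i in resp["Instances"]])
```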
Remember, it's only for critical core systems. Warm Standby is when you have a full system up and running, but at minimum size, so it's ready to go, and upon disaster we can scale it to production load. Let's have a look. We have our corporate data center, maybe a bit more complicated this time: we have a reverse proxy, an app server, and a master database. Currently, Route 53 is pointing the DNS to our corporate data center. In the cloud, we still have data replication to an RDS slave database that is running, and maybe we have an EC2 Auto Scaling group running at minimum capacity that's currently talking to our corporate data center database. And maybe we have an ELB as well, ready to go.
And so if a disaster strikes, because we have a warm standby, we can use Route 53 to fail over to the ELB, and we can use the failover to also change where our application is getting its data from; maybe it's getting its data from the RDS slave now. And so we've effectively promoted our warm standby to take the load, and then maybe using Auto Scaling our application will scale up pretty quickly. This is a more costly thing to do, because we already have an ELB and EC2 Auto Scaling running at all times, but again, you can decrease your RPO and your RTO by doing that. And finally, we get the Multi-Site / Hot Site approach. It has a very low RTO, we're talking minutes or seconds, but it's also very expensive: you get full production scale running on AWS and on-premises.
That means you have your on-premises data center at full production scale and your AWS infrastructure at full production scale, with some data replication happening. And so what happens here is that, because you have a hot site that's already running, Route 53 can route requests to both your corporate data center and the AWS Cloud. It's called an active-active type of setup. The idea is that failover can happen: your EC2 instances can fail over to your RDS slave database if need be, but you get full production scale running on AWS and on-premises. This costs a lot of money, but at the same time you're ready to fail over right away and you're running a multi-DC type of infrastructure, which is quite cool.
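To illustrate the Route 53 side of these failover setups, here is a minimal sketch of an active-passive failover record pair tied to a health check, using boto3. The hosted zone ID, health check ID, and DNS names are hypothetical; an active-active setup like the hot site above would use weighted or latency-based records instead.

```python
import boto3

route53 = boto3.client("route53")

# Primary record points at the on-premises endpoint, secondary at the AWS load balancer.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",  # hypothetical hosted zone
    ChangeBatch={"Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": "primary-onprem",
                "Failover": "PRIMARY",
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",  # hypothetical
                "ResourceRecords": [{"Value": "onprem.example.com"}],
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": "secondary-aws",
                "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "my-elb-123456.us-west-2.elb.amazonaws.com"}],
            },
        },
    ]},
)
```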
Finally, if you wanted to go all-cloud, it would be the same kind of architecture, but multi-region. Maybe we could use Aurora here, because we're really in the cloud: we have a master database in one region, and then an Aurora Global Database being replicated to another region as a slave. Both regions are working, and when I want to fail over, I'm ready to go at full production scale in the other region if I need to. So this gives you an idea of all the strategies you can have for disaster recovery. It's really up to you to select the disaster recovery strategy you need, but the exam will ask you, based on some scenarios, what you recommend: do you recommend Backup and Restore, Pilot Light,
Warm Standby, Multi-Site, or a Hot Site? That kind of thing. Okay, so finally, some disaster recovery tips, and this is more real-life stuff. For backups, you can use EBS snapshots, RDS automated snapshots and backups, et cetera, and you can push all these snapshots regularly to S3, S3 IA, or Glacier. You can implement a lifecycle policy (there's a small sketch of one right below), and you can use Cross-Region Replication if you want to make sure these backups end up in different regions. And if you want to move your data from on-premises to the cloud, Snowball or Storage Gateway are great technologies. For high availability, using Route 53 to migrate DNS from one region to another region is really helpful and easy to implement.
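Here is that lifecycle policy sketch: a single boto3 call that moves objects under a backups/ prefix to Glacier after 30 days and deletes them after a year. The bucket name, prefix, and day counts are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Archive backups to Glacier after 30 days, delete them after 365 days
s3.put_bucket_lifecycle_configuration(
    Bucket="my-dr-backups",  # hypothetical bucket
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-backups",
        "Filter": {"Prefix": "backups/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]},
)
```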
You can also use services that are Multi-AZ, such as RDS Multi-AZ, ElastiCache Multi-AZ, EFS, and S3; all of these are highly available by default or once you enable the option. Obviously, if you're talking about the high availability of your network, maybe you've implemented Direct Connect to connect your corporate data center to AWS, but what if that connection goes down for whatever reason? Maybe you can use a Site-to-Site VPN as a recovery option for your network. In terms of replication, you can use RDS cross-region replication and Aurora Global Databases, maybe database replication software to replicate your on-premises database to RDS, or maybe Storage Gateway as well.
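As one example of that cross-region replication, here is a minimal boto3 sketch that creates an RDS read replica in a second region from a source instance identified by its ARN. The identifiers, ARN, instance class, and regions are hypothetical.

```python
import boto3

# The client is created in the region where the replica should live
rds_dr = boto3.client("rds", region_name="us-west-2")

rds_dr.create_db_instance_read_replica(
    DBInstanceIdentifier="myapp-replica-west",  # hypothetical replica name
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:myapp-primary",  # hypothetical source ARN
    DBInstanceClass="db.r5.large",
    SourceRegion="us-east-1",  # lets boto3 sign the cross-region request
)
```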
In terms of automation, how do we recover from disasters? I think you would know already: CloudFormation and Elastic Beanstalk can help recreate whole new environments in the cloud very quickly. Or, if we use CloudWatch, we can recover or reboot our EC2 instances when a CloudWatch alarm goes off. Lambda can also be great to customize automation: Lambda functions are great for REST APIs, but they can also be used to automate your entire AWS infrastructure. And so overall, if you manage to automate your whole disaster recovery, then you are really well set for success.
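For example, here is a sketch of a CloudWatch alarm that uses the built-in EC2 recover action to automatically recover an instance when its system status check fails. The instance ID, alarm name, and region are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Recover the instance automatically if the underlying host fails the system status check
cloudwatch.put_metric_alarm(
    AlarmName="recover-web-server",  # hypothetical alarm name
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:recover"],  # built-in EC2 recover action
)
```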
And then finally, chaos testing. How do we know we are able to recover from a disaster? Well, you create disasters. An example that is widely quoted in the AWS world is Netflix: they run everything on AWS, and they have created something called the Simian Army, which randomly terminates EC2 instances, for example. They do so much more, but basically they just take an application server and terminate it randomly, in production. Not in dev or test, in production. They want to make sure that their infrastructure is capable of surviving failures, and so they run a bunch of Chaos Monkeys that just terminate stuff randomly, to make sure their infrastructure is rock solid and can survive any type of failure. So that's it for this section on disaster recovery. I hope you enjoyed it, and I will see you in the next lecture.
- Disaster Recovery – DevOps Checklist
Okay, so here is a multi-region disaster recovery checklist that you can use as a DevOps. First of all, is my AMI copied across regions, and is the reference to the AMI stored in the Parameter Store? That helps when you want to standardize, for example, CloudFormation templates across many different regions and your AMI changes over time. Is my CloudFormation stack working in other regions, and have you tested it? What are my RPO and RTO objectives, and what is the cost associated with them? Are my Route 53 health checks working correctly, and are they tied to a CloudWatch alarm? And how can I automate things with CloudWatch Events to trigger some Lambda functions, so I can perform, for example, an automatic RDS read replica promotion?
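To make the first checklist item concrete, here is a minimal boto3 sketch that copies an AMI into a DR region and records the new AMI ID in SSM Parameter Store, so a region-agnostic CloudFormation template can resolve it. All IDs, names, and regions are hypothetical; the commented lines at the end show the read replica promotion call mentioned above.

```python
import boto3

# Clients in the disaster recovery region (example region)
ec2_dr = boto3.client("ec2", region_name="us-west-2")
ssm_dr = boto3.client("ssm", region_name="us-west-2")

# Copy the golden AMI from the primary region into the DR region
copy = ec2_dr.copy_image(
    Name="myapp-golden-ami",
    SourceImageId="ami-0123456789abcdef0",  # hypothetical AMI in the primary region
    SourceRegion="us-east-1",
)

# Store the new AMI ID so CloudFormation templates can look it up per region
ssm_dr.put_parameter(
    Name="/myapp/dr/ami-id",  # hypothetical parameter name
    Value=copy["ImageId"],
    Type="String",
    Overwrite=True,
)

# During a failover, a Lambda function could also promote the cross-region read replica:
# boto3.client("rds", region_name="us-west-2").promote_read_replica(
#     DBInstanceIdentifier="myapp-replica-west")
```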
So you need to make sure that all these things are covered, and as a DevOps you can think through the implications of each of them. Is my data backed up? Again, what are the RPO and RTO for my data, and what does that mean for EBS, for the AMI, for RDS, for S3 Cross-Region Replication, DynamoDB Global Tables, and RDS and Aurora global read replicas? So again, where is my data living? How is it synchronized? How is it replicated? What are the implications in terms of cost, and so on? All these things are important to think about as a DevOps when you consider multi-region disaster recovery. So let's talk about backups now, in the context of multi-region disaster recovery. EFS backups can be done across regions or within a region.
Within a region, you can use AWS Backup, which allows you to set the frequency of the backups, the retention time, and a lifecycle policy. It's a managed service from AWS that supports backups of EFS, and it's something fairly new.
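As a hedged sketch of what such an AWS Backup setup could look like with boto3 (the vault name, schedule, lifecycle days, IAM role ARN, and file system ARN are all assumptions):

```python
import boto3

backup = boto3.client("backup")

# Vault that will hold the EFS recovery points (hypothetical name)
backup.create_backup_vault(BackupVaultName="efs-dr-vault")

# Daily backup rule: move to cold storage after 30 days, delete after 365 days
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "efs-daily-backup",
        "Rules": [{
            "RuleName": "daily",
            "TargetBackupVaultName": "efs-dr-vault",
            "ScheduleExpression": "cron(0 5 * * ? *)",  # every day at 05:00 UTC
            "Lifecycle": {"MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365},
        }],
    }
)

# Tell the plan which resources to back up, here a single EFS file system (hypothetical ARN)
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "efs-filesystems",
        "IamRoleArn": "arn:aws:iam::123456789012:role/aws-backup-role",  # hypothetical role
        "Resources": [
            "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-12345678",
        ],
    },
)
```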
You can also do EFS-to-EFS backup within the same region, so let me show you that architecture now. This solution is probably obsolete by now, replaced by AWS Backup, but I still want to show you the architecture overview just so you get an understanding of what you could create as a DevOps if you needed to build it yourself. In this example, a CloudWatch Event triggers a Lambda function, and that Lambda function is an orchestrator: it looks into DynamoDB, creates an Auto Scaling group of Amazon EC2 instances that copy data from a source EFS to a backup EFS within the same VPC, so within the same region. When it's done, it writes the metadata of what it did into an Amazon DynamoDB table, writes its logs into an Amazon S3 bucket, and finally sends a notification to SNS to tell people that, yes, the backup has been done. So, as we can see here, the data flows from the source EFS through the EC2 instances into the backup EFS. Something to think about right here.
And as we can see, it uses Lambda as an orchestrator, which may not be the best solution. Maybe a Step Function would be better, because the backup could take a lot of time, and if we go over the Lambda function timeout, that's not great. So I'm not super happy about this architecture. Then, something you could do instead of EFS-to-EFS within the same region, which is definitely a strategy, is to back up EFS directly into an S3 bucket, have Cross-Region Replication on that S3 bucket into another region, and from there recreate an Amazon EFS file system, to get full replication across regions for EFS.
So yes, as I said, to do multi-region you can do EFS into S3, S3 Cross-Region Replication, and back into EFS, and that will give you some kind of multi-region backup for EFS. Okay, then Route 53, your DNS: how would you back this up? You use an API called ListResourceRecordSets. What this allows you to do is literally list every single record that you have in your DNS, and you can make an export of that. And then, if you wanted to import that into another DNS provider or into another Route 53 zone, you would need to write your own import script. But because DNS records are somewhat standardized across almost all DNS providers, you can write such scripts.
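As a small sketch of such an export, here is a boto3 snippet that pages through ListResourceRecordSets and dumps every record of a hosted zone to a JSON file; the hosted zone ID and output filename are placeholders.

```python
import json

import boto3

route53 = boto3.client("route53")

# Collect every record set in the hosted zone (the API is paginated)
records = []
paginator = route53.get_paginator("list_resource_record_sets")
for page in paginator.paginate(HostedZoneId="Z0123456789ABCDEFGHIJ"):  # hypothetical zone
    records.extend(page["ResourceRecordSets"])

# Write the export; an import script for another provider or zone would read this back
with open("zone-backup.json", "w") as f:
    json.dump(records, f, indent=2)

print(f"Exported {len(records)} record sets")
```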
But there is no import or export button in the AWS UI, unfortunately; that's something I wish they would implement, but right now they don't. So, just something to think about again as a DevOps. And then finally, for Elastic Beanstalk, how do we do a backup? We've seen this already: we can use saved configurations, and we can generate those using the console or the EB CLI. Using these saved configurations, we are able to recreate an Elastic Beanstalk environment with the same configuration, but in another region. Okay, that's it for this lecture. I will see you in the next lecture.