Amazon AWS DevOps Engineer Professional – Incident and Event Response (Domain 5) & HA, Fault T… part 4
- ASG – Suspending Processes & Troubleshooting
So now we are going to learn some more advanced things about the ASG. I think everything we’ve done so far should already have been known to you, but now we’re going to look at more advanced features: suspending and resuming scaling processes. So let’s go have a look and see the use case for each one, because the exam will test you on the use cases for these suspended processes. As you can see, right now in our ASG the suspended processes list is empty. So let’s edit this configuration, and I will scroll down and find my suspended processes, which is right here, and I’m able to select any of these processes to suspend.
So I’m going to talk about them. But what I would encourage you to do is to go to this page in the documentation and read through it entirely, okay? Because I’m not going to read it to you, but you need to know it by heart. So let’s go through an overview right now. If we suspend the Launch process, this means that if we increase the desired capacity, no instance will be launched. So let’s have a look at it. I just saved this, and the Launch process is now suspended. If we go into our configuration and I say we want three as the desired capacity, normally a new instance would come up. But because the Launch process has been suspended, nothing will happen. So let’s go have a look at the history.
And as we can see, no new instance is being created, because we have suspended the Launch process. If I edit the group and remove the suspension on the Launch process, the ASG will automatically be able to create a new instance for me. So let’s have a look at this right now. And as you can see, a new instance is now being launched. Similarly, if we go back into the suspended processes, suspend the Terminate process, and set the desired capacity to two, the ASG will not terminate any instances, because the Terminate process is suspended. So let’s have a look. As we can see, we now need two instances and we have three right now.
But no instances will be terminated, because we have suspended the Terminate process. So the use cases for suspending Terminate and Launch would probably be to do some troubleshooting, or to make sure that your ASG remains at a fixed size, and so on. Now let’s move on and look at HealthCheck. If we suspend the HealthCheck process, then the ASG will stop doing health checks, and we can see what that means right here. So I’ll type health check: this process checks the health of the instances and marks an instance as unhealthy if EC2 or ELB reports it as unhealthy. So if we suspend HealthCheck, then the instance health status will not change.
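The same suspend and resume steps can also be done through the API instead of the console. Here is a minimal sketch in Python, assuming a boto3 Auto Scaling client is passed in as `client`. The helper names `suspend` and `resume` are made up for illustration, and the process list only covers the processes discussed in this lecture.

```python
# Scaling process names covered in this lecture (the service may define others).
SCALING_PROCESSES = {
    "Launch", "Terminate", "HealthCheck", "ReplaceUnhealthy",
    "AZRebalance", "AlarmNotification", "ScheduledActions", "AddToLoadBalancer",
}


def _check(processes):
    """Reject names that are not in the list above, to catch typos early."""
    unknown = set(processes) - SCALING_PROCESSES
    if unknown:
        raise ValueError(f"unknown scaling processes: {sorted(unknown)}")


def suspend(client, group_name, processes):
    """Suspend the named scaling processes on one Auto Scaling group."""
    _check(processes)
    client.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=list(processes),
    )


def resume(client, group_name, processes):
    """Resume previously suspended scaling processes on the group."""
    _check(processes)
    client.resume_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=list(processes),
    )
```

With real credentials you would call something like `suspend(boto3.client("autoscaling"), "my-asg", ["Launch"])`, where `my-asg` is a placeholder group name, and then `resume` with the same arguments once you are done troubleshooting.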
We can still set it manually using the CLI, okay? But that means that the ASG itself will not decide on the health status of our instances. That could be really helpful if one of your instances is unhealthy and you don’t want it to go away: you suspend the HealthCheck process, and so your instances will not be marked unhealthy. Okay, next we have ReplaceUnhealthy. This is maybe a better one for troubleshooting. This process decides that if an instance is unhealthy, the ASG terminates it and replaces it with a new one. And so if we suspend that process and one instance is marked unhealthy, then it will remain in the group, unhealthy.
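Going back to setting the health status manually while HealthCheck is suspended: here is a small sketch of what that could look like from Python, assuming a boto3 Auto Scaling client is passed in. The function name `mark_health` is hypothetical, and ignoring the grace period is a choice made for this example, not something the lecture prescribes.

```python
def mark_health(client, instance_id, healthy):
    """Manually set the health status the ASG records for one instance."""
    status = "Healthy" if healthy else "Unhealthy"
    client.set_instance_health(
        InstanceId=instance_id,
        HealthStatus=status,
        # Assumption for this sketch: apply the change even during the
        # health check grace period.
        ShouldRespectGracePeriod=False,
    )
    return status
```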
So imagine a situation where all our instances keep becoming unhealthy for various reasons, and we need to be able to troubleshoot why they’re unhealthy before they are terminated. Suspending the ReplaceUnhealthy process allows us to troubleshoot an instance. So let’s have a look. I just added this as a suspended process. Let’s take one of our instances, for example this one, and make it unhealthy. I’m going to connect to it using EC2 Instance Connect, and what I’m going to do is sudo rm -f /var/www/html/health.html, and that will make our instance unhealthy.
So now back into our ASG. Let’s just wait a little bit for the instance to be marked unhealthy. As we can see, one instance is now unhealthy and the other one is healthy, but the unhealthy one is not getting terminated, because we have suspended the ReplaceUnhealthy process. Okay, what’s next? Here we have AZRebalance. AZRebalance is a process that says that if there are too many instances in one AZ and fewer in another AZ, then the ASG will launch and terminate instances to even out the group across all the AZs. If you don’t want that behavior, you can suspend the AZRebalance process. Next we have AlarmNotification. So let’s go back to the documentation and see what that is.
So AlarmNotification is right here: it accepts notifications from CloudWatch alarms that are associated with the group’s scaling policies. So if we suspend AlarmNotification, that means that the scaling policies that are linked to an alarm (we have removed ours, by the way) will not respond when these alarms fire. So effectively what we have said is that scaling policies and their alarms cannot change the Auto Scaling group’s behavior. Okay, what else do we have? We have ScheduledActions, so we can suspend the process of scheduled actions. And finally there’s one last process, which is called AddToLoadBalancer, and this is the one I want to demonstrate, because the exam may ask you about it. So let’s have a look and see what this means.
So if we suspend the AddToLoadBalancer process, then, as the name indicates, the ASG will not add new instances to the load balancers. So let’s go and see what happens in this case. I’m going to save this, and now I’m going to refresh my group and see what’s going on. One instance is being terminated because it was unhealthy, and another one is being created, and I’m not sure whether that one is already in my target group. So let’s have a quick look at the target group. Here in the target group we have two targets right now: one is draining and one is healthy. The one that is draining is the one that was deemed unhealthy, and the one that is being created right now will not be added to this target group, because we have suspended the AddToLoadBalancer process.
So I’ll just wait for things to settle so I can show you what happens. Now we see we have two instances in our Auto Scaling group, and if we go to the target group itself (let’s open this in a new tab), we should see something very interesting. So here we go. As we can see, there’s only one registered target in this target group. The reason is that we have suspended the AddToLoadBalancer process. And something that’s really worth knowing: okay, we have two instances, and one of them has not been added to this target group. But what if we now remove this suspended process, will the instance get added automatically? So let’s remove this suspended process and click on Save, and the answer is no.
So even if we remove AddToLoadBalancer from the suspended processes, that doesn’t mean that the instance that was created while it was suspended will be added to the target group. What we have to do is register the instance directly. So we click on Register and deregister instance, and then we have to find that instance, which is, for example, this one right here, and add it to the target group manually. That manual registration is the consequence of suspending and then resuming the AddToLoadBalancer process. Okay, so this is pretty good. We have seen all the suspended processes, so remember to read the full documentation on them, because you can learn a lot.
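The manual registration step could also be scripted. The sketch below is one possible way to do it, assuming a boto3 Auto Scaling client and a boto3 `elbv2` client are passed in; the function name `register_missing_instances` is made up for this example. It compares the InService instances of the group with the targets already registered and registers only the ones that are missing.

```python
def register_missing_instances(asg_client, elb_client, group_name, target_group_arn):
    """Register InService ASG instances that are absent from the target group."""
    groups = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    in_service = [
        instance["InstanceId"]
        for group in groups["AutoScalingGroups"]
        for instance in group["Instances"]
        if instance["LifecycleState"] == "InService"
    ]

    health = elb_client.describe_target_health(TargetGroupArn=target_group_arn)
    registered = {d["Target"]["Id"] for d in health["TargetHealthDescriptions"]}

    # Instances launched while AddToLoadBalancer was suspended end up here.
    missing = [i for i in in_service if i not in registered]
    if missing:
        elb_client.register_targets(
            TargetGroupArn=target_group_arn,
            Targets=[{"Id": i} for i in missing],
        )
    return missing
```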
But what’s important to note about AddToLoadBalancer is that whenever we remove this suspended process, it doesn’t mean that the instances that were created in the meantime will be added to the target group or load balancer. Okay, let’s have a look at two other things. So we are in our Auto Scaling group, and if we go to Instances, say, for example, we wanted to take an instance out of this Auto Scaling group. I can click on the instance right here and click on Detach, and this will effectively remove the instance from the Auto Scaling group. And it says that when you detach it, it will be removed from the associated ELB so we can troubleshoot it, and the ASG will replace this instance with a new running instance.
So let’s look at this, for example. If we detach this instance, the ASG will automatically replace that instance for us. So this is the kind of use case for detaching an instance: we need to take this instance and have a look at it, but we don’t want to impact our operations. By detaching it, we effectively get a new one added in its place. So this is one option. But the other option we have is to use standby mode. So what is standby mode? If I click here on this instance and choose Set to Standby, this will remove the instance from the Elastic Load Balancer, which will increase the load on the other instances, but it will not create another instance in return.
So when we set it to standby, it still belongs to the same Auto Scaling group, okay? But it’s not going to be part of the load balancer operations, and no traffic will be routed to this instance. So putting an instance in standby is really helpful for troubleshooting, and afterwards we can put the instance back into service. Okay? And then finally there is one last setting you need to know about, which is instance protection. We can set scale-in protection, and by setting scale-in protection we make sure that whenever a scale-in operation happens, that is, whenever instances are being terminated to shrink the group, this one will never be terminated.
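To make the difference between detach and standby concrete, here is a minimal sketch using the corresponding API calls, assuming a boto3 Auto Scaling client is passed in. The helper names are hypothetical. The `ShouldDecrementDesiredCapacity` values are chosen to mirror what the lecture describes: detaching keeps the desired capacity so a replacement is launched, while standby lowers it so no replacement is created.

```python
def detach_for_troubleshooting(client, group_name, instance_id):
    """Detach: the instance leaves the group; desired capacity is kept,
    so the ASG launches a replacement."""
    client.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ShouldDecrementDesiredCapacity=False,
    )


def standby_for_troubleshooting(client, group_name, instance_id):
    """Standby: the instance stays in the group but receives no traffic;
    desired capacity is decremented, so no replacement is launched."""
    client.enter_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ShouldDecrementDesiredCapacity=True,
    )


def back_in_service(client, group_name, instance_id):
    """Return a standby instance to service."""
    client.exit_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
    )
```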
So if we set scale-in protection on both instances, okay, and then go to our Auto Scaling group and set the desired capacity to one, then even though the ASG will try to terminate an instance, it will not happen, because these instances are protected from scale-in. So this was a lecture full of learnings. We have learned about all the suspended processes, we have learned about standby, we have learned about detaching an instance from an Auto Scaling group, and about instance protection from scale-in. All these things can be extremely helpful depending on the use case the exam asks you about, for troubleshooting or for doing various operations on your ASG without impacting your operations. So I hope that was helpful. You really need to know these things going into the exam. And I will see you in the next lecture.
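Scale-in protection can be toggled per instance through the API as well. This is a small sketch, assuming a boto3 Auto Scaling client is passed in; the function name `protect_from_scale_in` is made up for illustration.

```python
def protect_from_scale_in(client, group_name, instance_ids, protected=True):
    """Turn scale-in protection on (or off) for the given instances."""
    client.set_instance_protection(
        InstanceIds=list(instance_ids),
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=protected,
    )
```

Calling it again with `protected=False` removes the protection, so a later scale-in can terminate those instances.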
- ASG – Lifecycle Hooks
Okay, so in this lecture we are going to look at lifecycle hooks for our Auto Scaling groups. First, let’s go to the documentation and understand the context of Auto Scaling lifecycle hooks. If you scroll down, you can see that when you have an Auto Scaling group and an instance is created, it goes into the Pending state, and then it goes into the InService state, and then when it’s being terminated, it goes into the Terminating state and finally the Terminated state. Okay? What we can do is add Auto Scaling lifecycle hooks, and these are the gray boxes in here. The idea is that instead of going directly from Pending to InService, we are able to add a Pending:Wait state and then a Pending:Proceed state.
And we have control over when it goes from Pending:Wait to Pending:Proceed to InService. Similarly, when an instance is being terminated, we have the option not to have it terminated right away. It can go into Terminating:Wait, then we give it a proceed action, and then it goes into the Terminated state. These two lifecycle hooks enable a few use cases. So let’s talk about these use cases first, before we go into the details of how we create these hooks. A use case for the launch lifecycle hook, this one right here, is, for example, when you want to install or configure an application and that takes a lot of time.
So using this hook, we’re able to configure the application, take our time, and then when we’re ready, put the instance in service. And on the termination side, what type of use cases do we have? Well, for example, maybe before an instance is terminated, we want to take all the logs on our instance and put them into S3. Or maybe we want to take a snapshot of the EBS volume before the instance is terminated. There are so many types of use cases; it’s just up to you to find out what you think is a good one. Okay, so the question now is, what do we invoke in these gray boxes to interact with these lifecycle hooks? And the answer is, we have three ways of doing it: SNS, SQS, and Lambda through CloudWatch Events.
Now, SNS and SQS are the legacy way. This is how you used to send notifications and process them for your lifecycle hooks. But the problem with SQS is that you need to have an EC2 instance or a Lambda function in the back end to receive the messages, and with SNS as well, you need to have something that receives the notifications. So now the recommended way, and we’ll see this in the hands-on, is to route notifications to Lambda using CloudWatch Events. And that’s the most flexible option. So you need to remember three targets: SNS, SQS, and Lambda. SNS and SQS you can only configure using the CLI, so we won’t do this. CloudWatch Events you can configure using the UI, and that’s actually the way Amazon recommends as well.
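To recap the state diagram from the documentation, here is a tiny model of the paths an instance can take, written as plain Python with no AWS calls. The function names are made up, and the abandoned-launch path is deliberately simplified: the lecture only says the instance is terminated, so any intermediate states are left out.

```python
def launch_states(has_hook, result="CONTINUE"):
    """States an instance passes through on launch."""
    if not has_hook:
        return ["Pending", "InService"]
    if result == "CONTINUE":
        return ["Pending", "Pending:Wait", "Pending:Proceed", "InService"]
    if result == "ABANDON":
        # Simplification: intermediate states after an abandoned launch
        # are omitted; the instance ends up terminated.
        return ["Pending", "Pending:Wait", "Terminated"]
    raise ValueError("result must be CONTINUE or ABANDON")


def terminate_states(has_hook):
    """States an instance passes through on termination."""
    if not has_hook:
        return ["Terminating", "Terminated"]
    return ["Terminating", "Terminating:Wait", "Terminating:Proceed", "Terminated"]


if __name__ == "__main__":
    print(" -> ".join(launch_states(has_hook=True)))
    print(" -> ".join(terminate_states(has_hook=True)))
```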
So let’s go ahead and create our first lifecycle hook. We are in Lifecycle Hooks, and I’m going to create a hook and give it a name. I’m going to call this launch hook. The Auto Scaling group is the one I have. And we need to choose the lifecycle transition, which can be either Instance Launch or Instance Terminate. Right now we want to deal with the launch, so we’ll put Instance Launch. Then this is the heartbeat timeout, which says: if I don’t hear back about the instance for one hour, then assume that we are done waiting. And what should the result be in case of a timeout: should you abandon or should you continue? If you choose ABANDON, then the instance launch will not be successful, and therefore the instance will be terminated.
Or if you choose CONTINUE, then the instance launch will be successful even if the timeout happens, and the instance will be put in service. So it’s up to you to choose whatever you want. I’ll choose ABANDON. And then here you can provide notification metadata if you want to include some extra information in the message itself. Okay, so we’re good. And here, by the way, you get some use cases: you could install or configure software on newly launched instances, or download log files from an instance before it terminates. Okay? So we are going to create this, and now we are done. As you can see, for this lifecycle hook there is no notification target ARN and there’s no role ARN.
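The hook created in the console above could be created from code along these lines. This is a sketch, assuming a boto3 Auto Scaling client is passed in; the function name `create_launch_hook` and the names in the usage note are placeholders. The defaults mirror the console choices in this lecture: a one hour heartbeat timeout and ABANDON as the default result.

```python
def create_launch_hook(client, group_name, hook_name,
                       timeout_seconds=3600, default_result="ABANDON",
                       metadata=None):
    """Create (or update) a lifecycle hook on the instance launch transition."""
    if default_result not in ("CONTINUE", "ABANDON"):
        raise ValueError("default_result must be CONTINUE or ABANDON")
    params = {
        "LifecycleHookName": hook_name,
        "AutoScalingGroupName": group_name,
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_LAUNCHING",
        "HeartbeatTimeout": timeout_seconds,
        "DefaultResult": default_result,
    }
    if metadata is not None:
        # Optional extra information included in the notification message.
        params["NotificationMetadata"] = metadata
    client.put_lifecycle_hook(**params)
    return params
```

For example, `create_launch_hook(boto3.client("autoscaling"), "my-asg", "launch-hook")` would set up a hook like the one in the demo, with `my-asg` and `launch-hook` standing in for your own names.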
If you were to configure SNS or SQS using the CLI, these would be populated. Okay? And you need to provide a role to your Auto Scaling group so it can send notifications to SNS or messages to SQS. Here we are able to use CloudWatch Events instead, and this is what the UI recommends for interacting with the hook. So let’s go into the CloudWatch console right now and see how we can do things. We are in the CloudWatch console and, for example, we can build a rule on an event pattern. The service would be Auto Scaling, and the event type we’re looking for is Instance Launch and Terminate. We’re looking for a specific instance event, which is the EC2 Instance-launch Lifecycle Action.
And we could specify a group name, but because I want to show you a sample event, I’m not going to specify one. So this is the kind of sample event that will be sent to your target, and the target would be maybe a Lambda function. Okay? The event itself contains a lot of good information. It contains the detail type, and in the detail itself it contains the lifecycle action token, which will be needed to complete the lifecycle hook, whether to abandon the instance or to continue; the Auto Scaling group name; the lifecycle hook name; the EC2 instance ID, which could also be used to complete the hook; and the lifecycle transition, which says that right now it’s for an instance being launched, because we chose the EC2 Instance-launch Lifecycle Action.
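As a rough illustration of the event pattern and the sample event just described, the sketch below builds the pattern as JSON and pulls the useful fields out of an event. The function names and the returned key names are made up for this example; the field names inside `detail` are the ones listed above.

```python
import json


def launch_hook_pattern(group_name=None):
    """Event pattern matching the instance-launch lifecycle action."""
    pattern = {
        "source": ["aws.autoscaling"],
        "detail-type": ["EC2 Instance-launch Lifecycle Action"],
    }
    if group_name is not None:
        # Optionally restrict the rule to one Auto Scaling group.
        pattern["detail"] = {"AutoScalingGroupName": [group_name]}
    return json.dumps(pattern)


def hook_details(event):
    """Extract the fields a target needs in order to act on the hook."""
    detail = event["detail"]
    return {
        "token": detail["LifecycleActionToken"],
        "group": detail["AutoScalingGroupName"],
        "hook": detail["LifecycleHookName"],
        "instance_id": detail["EC2InstanceId"],
        "transition": detail["LifecycleTransition"],
    }
```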
So I’m going to say, okay, this is for this specific event and this group name, and so on. And so now we’re saying that for everything that happens to this lifecycle hook, you go ahead and invoke a Lambda function. And that Lambda function, again, can do many things. It can maybe take a backup, it can take an EBS backup, it can do whatever API calls you need. And I’ll be talking about a special API call in a minute. So you get the idea, right? Now let’s go and practice this. I’m not going to create my CloudWatch Events rule, because I would need to create a Lambda function. But let’s go ahead and add an instance to our Auto Scaling group. I’m going to edit it and say that the desired capacity is now two, the effect of which will be to create a new instance.
So let’s go into Instances here and see what happens. My instance is now Pending. And if you look at the lifecycle chart that we saw earlier, Pending is right here. The next state we expect for our instance is Pending:Wait. So let’s refresh this. And now we are in Pending:Wait. Thanks to the lifecycle hook that we created, this instance will stay in Pending:Wait, with one hour as the timeout. And then we have to issue an API call to make this instance move from Pending:Wait to Pending:Proceed. So how do we make this move? Let’s assume that this has triggered a Lambda function and that the Lambda function has done some things.
Now, finally, the Lambda function must make an API call to tell this instance to go on. To keep this very easy, what we’re going to do is go into our lifecycle hooks, and I have an API call ready for you. So you copy this API call and run it; I just pasted it. We’re saying aws autoscaling complete-lifecycle-action. Then we give the lifecycle action result, which can either be CONTINUE to say it’s good, or ABANDON to stop. Then the lifecycle hook name, which is the launch hook; the Auto Scaling group name, which is demo ASG launch templates; and the instance ID, which we should replace. So let’s go ahead and find the instance ID right here. I’m going to copy this instance ID and paste it here.
But you could use the lifecycle action token instead, if that was passed to you. Okay, then the region we’re in and the profile. So here we’re saying, okay, we have finished whatever we needed to do for our lifecycle hook, and it was successful. I’m going to press Enter, and now we are done. Back in the Auto Scaling UI, if I refresh this page, the instance is now InService, because it went from Pending:Wait to Pending:Proceed to InService. And similarly, I’m not going to do this, but you could create a lifecycle hook for instance termination. Before instances are terminated, they will go into Terminating:Wait, then Terminating:Proceed, and then Terminated, once whatever Lambda function you invoke completes the action.
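The completion call that was run from the CLI has a direct equivalent in the SDK, which is what a Lambda function would use. Here is a minimal sketch, assuming a boto3 Auto Scaling client is passed in; the function name `complete_hook` is made up. As in the lecture, the instance can be identified either by its instance ID or by the lifecycle action token from the event.

```python
def complete_hook(client, group_name, hook_name, result,
                  instance_id=None, token=None):
    """Tell the ASG that the lifecycle action is finished."""
    if result not in ("CONTINUE", "ABANDON"):
        raise ValueError("result must be CONTINUE or ABANDON")
    if instance_id is None and token is None:
        raise ValueError("provide instance_id or token")
    params = {
        "LifecycleHookName": hook_name,
        "AutoScalingGroupName": group_name,
        "LifecycleActionResult": result,
    }
    if instance_id is not None:
        params["InstanceId"] = instance_id
    if token is not None:
        params["LifecycleActionToken"] = token
    client.complete_lifecycle_action(**params)
    return params
```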
So you have to remember a few things, right? You can have many kinds of use cases, really anything you want. Lifecycle hooks just give you control over instance creation and instance termination. And the notifications can go three ways: SNS, SQS, and Lambda. A common question you may have, and that the exam may ask you, is: can you use this to launch a script on an EC2 instance? And the answer is yes, but it is a little bit tricky. Thankfully, there is an AWS sample for this. The way it works is that you would still use a Lambda function, but then you would use EC2 Run Command to make this work.
So how does that work? Well, these are the steps if you wanted to do the hands-on. You create the Auto Scaling group and you configure the lifecycle hook. Then you create an S3 bucket for files; this is to do a backup of log files into the S3 bucket. Then you create an SSM document. This SSM document is a document that allows you to run commands on your EC2 instances, which means that your EC2 instances must have the SSM agent installed. The document uses the run shell script action of Run Command, and it will execute whatever script you want.
And you can have your script right here, in your document. Here the script is doing an aws s3 cp, which means that it’s going to copy some files from the instance into an S3 bucket. And then when it’s done, it’s going to call aws autoscaling complete-lifecycle-action and give the hook name and so on. This is really helpful, because now we can invoke this document directly from a Lambda function. So the way it works is that it goes from the lifecycle hook to CloudWatch Events, and that goes into a Lambda function, which I don’t have here.
And that Lambda function invokes the SSM Run Command document on the EC2 instance. The EC2 instance itself will be issuing the call aws autoscaling complete-lifecycle-action. And therefore you can run scripts directly on your EC2 instances to interact with these lifecycle events. So that’s the full story, and it’s something you have to remember. I would recommend you practice with this hands-on to get a full grasp of how it works. It’s quite a fun one to do. I hope I taught you well about lifecycle hooks, and that you can see the use cases and how they can be really helpful. And I will see you in the next lecture.
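Since the Lambda function is not shown in the lecture, here is one possible shape for it, offered as a sketch and not as the code from the AWS sample. It assumes a boto3 SSM client is passed in, and the document name and the parameter names `ASGNAME` and `LIFECYCLEHOOKNAME` are hypothetical; they would have to match whatever parameters your own SSM document declares.

```python
def make_handler(ssm_client, document_name):
    """Build a Lambda-style handler that runs an SSM document on the
    instance named in a lifecycle hook event."""
    def handler(event, context):
        detail = event["detail"]
        response = ssm_client.send_command(
            InstanceIds=[detail["EC2InstanceId"]],
            DocumentName=document_name,
            # Hypothetical parameter names: they must match the parameters
            # declared in your own SSM document.
            Parameters={
                "ASGNAME": [detail["AutoScalingGroupName"]],
                "LIFECYCLEHOOKNAME": [detail["LifecycleHookName"]],
            },
        )
        # The script in the document is expected to complete the lifecycle
        # action itself once it has finished its work on the instance.
        return response["Command"]["CommandId"]
    return handler
```

In a real deployment you would create the client once at module level, for example with `boto3.client("ssm")`, and expose `make_handler(client, "my-document")` as the Lambda entry point, where `my-document` is a placeholder.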