Amazon AWS DevOps Engineer Professional – Incident and Event Response (Domain 5) & HA, Fault T… part 5
- ASG – Termination Policies
Okay, so now let’s talk about termination policies. In this example we have two instances, but we may have tens or hundreds of EC2 instances in an Auto Scaling group, and they may be spread across many different Availability Zones, and they may use different versions of a launch template or a launch configuration. So the question is: if the Auto Scaling group performs a scale-in action, meaning it terminates instances, which instance will be terminated first? That question can be very important to know. There is a whole documentation page around it, and there are different kinds of termination policies: there’s the default policy, and then you can customize it.
So if we go into the Auto Scaling group console and click on Edit, scrolling down we can see the termination policy is currently Default, but you can select OldestInstance, OldestLaunchConfiguration, NewestInstance, ClosestToNextInstanceHour, AllocationStrategy, or OldestLaunchTemplate. The one you need to know by heart is the default termination policy, but we’ll also discuss OldestInstance, OldestLaunchConfiguration, and so on in this lecture. So let’s look at the default termination policy. The way it works is as follows: it determines which AZ has the most instances and at least one instance that is not protected from scale-in.
If you remember, we can go into the instances, right-click, and enable scale-in protection, which means that the instance cannot be terminated by a scale-in event. So the policy will look for the AZ that has the most instances, and then find one instance in it that is not protected from scale-in. That means that if we have one AZ with three instances and two other AZs with two instances each, the AZ with three instances will get an instance terminated. Then it determines which instance to terminate, so that the allocation strategy for the On-Demand or Spot instances stays the same.
So if we have an allocation strategy (as I said, if you’re using launch templates you can have an allocation strategy to run a mix of On-Demand and Spot instances), it will try to keep that mix going, and then it will look at the instances that use the oldest launch template. If there is such an instance, it will terminate it. So if you have version one, version two, version three, version four, it will look at the oldest version and terminate that instance. For instances that use a launch configuration it is the exact same thing: it will determine whether any of the instances use the oldest launch configuration, and terminate one of them.
And then, if there are multiple unprotected instances and all of them qualify, it will pick the instance closest to the next billing hour, and if there are still several candidates, choose one at random. Okay? So here’s an example: there are two instances in one AZ and three instances in another. When one instance needs to be terminated, it will be one of the instances in the AZ with three, specifically the one with the oldest launch configuration or the oldest launch template. But you can customize the termination policy, as I said. The first option is OldestInstance, which skips the AZ criterion altogether and says: terminate the oldest instance in the group.
This is quite helpful if you’re upgrading instances to a new EC2 instance type, for example, and you can gradually replace them using this termination policy. Similarly, you can use NewestInstance to terminate the newest instance in the group, especially if you’re testing a new launch configuration and don’t want to keep it in production, and OldestLaunchConfiguration to terminate the instance that has the oldest launch configuration. This is different from OldestInstance: you could have an old instance with a newer launch configuration, or a newer instance with an older launch configuration.
We have ClosestToNextInstanceHour to terminate instances that are closest to the next billing hour, which helps when you have an hourly charge for your EC2 instances; this is much less common now because EC2 bills per second by default. Then there is OldestLaunchTemplate, which is similar to OldestLaunchConfiguration but applies if you’re using a launch template instead of a launch configuration for your Auto Scaling group. And finally AllocationStrategy, to say that what we prioritize is the allocation strategy for the Spot versus On-Demand percentage.
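To make the customization concrete, here is a minimal boto3 sketch that sets an ordered list of termination policies on an Auto Scaling group. The group name `my-asg` is a placeholder, and the two policies chosen are just an example; the group evaluates them in order when breaking ties:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Set custom termination policies on an existing ASG (name is a placeholder).
# Policies are applied in order: prefer instances with the oldest launch
# template, then break remaining ties by the next billing hour.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-asg",
    TerminationPolicies=[
        "OldestLaunchTemplate",
        "ClosestToNextInstanceHour",
    ],
)
```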
So that’s it for this lecture. It’s not very fascinating, but you really need to remember how the default termination policy works, because that policy is super important. Remember: it first balances the AZs, then it terminates the instance with the oldest launch configuration or launch template, and finally it tries to terminate an instance based on the next billing hour. But you can customize this any way you want by adding policies in here. Okay, well, that’s it for this lecture, I hope you enjoyed it and I will see you in the next lecture.
- ASG – Integration with SQS
So I want to talk for a few seconds about how to integrate an Auto Scaling group with SQS. It is quite a common pattern to have a queue in SQS and have a bunch of EC2 instances process the messages from that queue. For example, I’ll call it my image processing queue, and this is a standard queue, and I’ll just quickly create that queue. Okay? So let’s imagine that in this queue we have a bunch of images being sent, and the EC2 instances in my group are supposed to process this queue. What we’d like is for these EC2 instances to scale properly based on the number of messages in that queue divided by the number of EC2 instances in my Auto Scaling group.
For this, we’ll need a custom metric. We can’t just use the metric coming from the queue, because it isn’t divided by the number of EC2 instances available. If you go to the documentation, it shows very well how that works: we have an SQS queue, and the application will read the SQS metric, which is how many messages are left in that SQS queue. It will also read the capacity of the Auto Scaling group, which is how many instances are in the group. It will do the division, so the number of messages divided by the number of servers, and it will emit that number as a custom CloudWatch metric.
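A minimal boto3 sketch of that metric publisher could look like the following; the queue URL, ASG name, and the metric namespace and name are all placeholders for this illustration:

```python
import boto3

sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-image-processing-queue"  # placeholder
ASG_NAME = "my-worker-asg"  # placeholder

# 1. How many messages are waiting in the queue?
attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages"],
)
backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# 2. How many instances are in service in the Auto Scaling group?
asg = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]
in_service = sum(1 for i in asg["Instances"] if i["LifecycleState"] == "InService")

# 3. Emit backlog-per-instance as a custom CloudWatch metric.
cloudwatch.put_metric_data(
    Namespace="MyApp",  # placeholder namespace
    MetricData=[{
        "MetricName": "BacklogPerInstance",
        "Value": backlog / max(in_service, 1),  # avoid division by zero
        "Unit": "Count",
    }],
)
```

You would run something like this on a schedule (for example every minute) so the metric stays fresh for the scaling policy.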
Then, using this CloudWatch metric, we can go into scaling policies, create a policy, and choose a type of policy that uses an alarm based on this custom CloudWatch metric. Based on this alarm, the Auto Scaling group will create or terminate instances, and this whole feedback loop just works. The way the documentation says to choose an effective metric and target value is, for example, to choose the backlog per instance, which is how many messages are left to process per instance, and decide what an acceptable backlog per instance is.
If you’re under the acceptable backlog per instance, that means you’re good, you don’t need more instances. If you’re over the acceptable backlog per instance, that means you have more messages than these EC2 instances can process, and therefore the alarm should be triggered thanks to the CloudWatch metric value, and the Auto Scaling group should create new instances for your processing fleet. So that’s definitely one thing to do, and it’s a really good consideration to have. You can go through this and configure scaling, but this is just the architecture you need to remember.
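One way to wire the custom metric into scaling is a target tracking policy, shown in this hedged sketch; the lecture talks about alarm-based policies generally, and target tracking is one option that creates the alarms for you. The names and the target value of 10 messages per instance are assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking against the custom metric published above.
# 10 messages per instance is an illustrative "acceptable backlog".
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-worker-asg",  # placeholder
    PolicyName="backlog-per-instance-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "BacklogPerInstance",
            "Namespace": "MyApp",
            "Statistic": "Average",
        },
        "TargetValue": 10.0,
    },
)
```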
The other thing that’s really interesting is, say you have these instances and they’re processing messages. We want to make sure that they’re not going to be terminated while they’re processing a message, because we don’t want an instance that is in the middle of processing a message to be terminated when a scale-in event happens. There are two ways we can do this. The first is to protect the instance from scale-in whenever it is processing a message: imagine a script on the EC2 instance that issues an API call every time it receives a message from SQS to protect itself from scale-in.
Then, when it’s done processing the message, it would remove the scale-in protection. This is something you have to think of as a DevOps engineer. So again, whenever you want an instance to be protected from termination, for example because it’s a worker in an auto scaling pool performing a long-running task such as processing from the SQS queue, then using scale-in protection via an API call from within the EC2 instance is a great idea, as sketched below. Another obvious use of scale-in protection is when an instance has a special meaning in the Auto Scaling group, such as the master node in a Hadoop cluster, or the master in whatever Jenkins cluster you may have.
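A minimal sketch of that pattern, assuming the worker can reach the instance metadata service (IMDSv1 shown for brevity; with IMDSv2 you would fetch a session token first). The ASG name and the handle() helper are hypothetical:

```python
import boto3
import urllib.request

autoscaling = boto3.client("autoscaling")

# The instance discovers its own ID through the instance metadata service.
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

def process_with_protection(message, asg_name="my-worker-asg"):  # placeholder ASG name
    # Protect this instance from scale-in while the message is in flight...
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=True,
    )
    try:
        handle(message)  # hypothetical helper doing the long-running work
    finally:
        # ...and remove the protection once processing is done.
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=False,
        )
```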
Okay? This is quite a good combination, because again, you can use an API call to enable scale-in protection whenever a message is being processed, and from an SQS perspective that’s really good: it means you won’t reprocess a message twice because an instance was wrongfully terminated. All right, well, that’s it for this lecture. I hope you liked it. It’s more theory than practice, I know, but it’s good to know how SQS, auto scaling, and scale-in all work together. So I will see you in the next lecture.
- ASG – Monitoring
Okay, so finally, let’s talk about the monitoring and notifications we can get for our Auto Scaling group. First, about monitoring: we get Auto Scaling-specific metrics such as the minimum group size, max group size, desired capacity, in-service instances, standby instances, terminating total count, and so on. This is quite interesting to monitor. We can also get EC2-level group metrics, which is the average CPU utilization, the average disk reads and writes, the average network in, the average network out, et cetera. So this is something you can get at the Auto Scaling group level. You can also get notifications: you can send notifications to an SNS topic, and these notifications will be sent whenever an instance is launched or terminated, or when it fails to launch or fails to terminate.
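A minimal boto3 sketch of wiring those four notification types to an SNS topic; the ASG name and topic ARN are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Send launch/terminate success and failure events to an SNS topic.
autoscaling.put_notification_configuration(
    AutoScalingGroupName="my-asg",  # placeholder
    TopicARN="arn:aws:sns:us-east-1:123456789012:asg-events",  # placeholder
    NotificationTypes=[
        "autoscaling:EC2_INSTANCE_LAUNCH",
        "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
        "autoscaling:EC2_INSTANCE_TERMINATE",
        "autoscaling:EC2_INSTANCE_TERMINATE_ERROR",
    ],
)
```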
Okay, so those would be SNS notifications, but we have seen that we could actually use CloudWatch Events for a bit more flexibility. If we go back into CloudWatch and then into Events, we are able to create a CloudWatch Events rule for auto scaling, and we can track specific events: instance launch successful, launch unsuccessful, terminate successful, terminate unsuccessful, and the lifecycle actions for launch and termination as well. The first four are the exact same as the four SNS notification types we just saw. The advantage of using CloudWatch Events instead of the direct SNS integration is that, for example, we could have the target of the rule be a Lambda function directly and send the event to a Slack channel, and so on.
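For illustration, a hedged boto3 sketch of such a rule targeting a Lambda function; the rule name, ASG name, and function ARN are placeholders:

```python
import json
import boto3

events = boto3.client("events")

# Match successful and unsuccessful launches/terminations for one ASG.
events.put_rule(
    Name="asg-lifecycle-events",  # placeholder rule name
    EventPattern=json.dumps({
        "source": ["aws.autoscaling"],
        "detail-type": [
            "EC2 Instance Launch Successful",
            "EC2 Instance Launch Unsuccessful",
            "EC2 Instance Terminate Successful",
            "EC2 Instance Terminate Unsuccessful",
        ],
        "detail": {"AutoScalingGroupName": ["my-asg"]},  # placeholder ASG name
    }),
)

# Route matched events to a Lambda function (placeholder ARN).
events.put_targets(
    Rule="asg-lifecycle-events",
    Targets=[{
        "Id": "notify-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:asg-to-slack",
    }],
)
```

Note that for the Lambda target to actually be invoked, you would also need to grant CloudWatch Events permission to call the function (a resource-based policy on the Lambda).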
So you’re really free to send these events to whatever service you want through CloudWatch Events. The events themselves are pretty similar to the SNS notifications, except for lifecycle hooks, where you should prefer CloudWatch Events to make the setup robust and invoke Lambda functions. And then finally, for CloudWatch Logs: there is no direct integration between your ASG and CloudWatch Logs. It’s up to you, for each EC2 instance created in your ASG, to have the CloudWatch agent installed, make sure that agent is sending its logs to CloudWatch, and make sure the IAM role is correct for this. Okay? So this is all about monitoring an ASG. I hope you liked this lecture and I will see you in the next lecture.
- ASG – CloudFormation CreationPolicy
Okay, so let’s have a look at how we can create Auto Scaling groups using CloudFormation, because this is something you need to know. Let’s go into Create stack, choose a template file, and that template file will be the one you get from the code, under auto scaling and CloudFormation; we’ll start with the creation policy for ASG template. Let’s quickly see what this does in my code editor. In this template I have a parameter, and it’s using the SSM public Parameter Store to get the AMI ID for the latest Amazon Linux 2. Then we create an Auto Scaling group: the AZs are obtained from the region, and the launch configuration is a reference to the launch configuration I defined below. The desired capacity is three instances, the min size is one, and the max size is four. But how do we know if this Auto Scaling group launched correctly? We want the instances to be able to signal the fact that they have launched correctly, so that we know the creation of the Auto Scaling group itself has worked as well. For that, we can attach a CreationPolicy to the Auto Scaling group, so that CloudFormation waits for resources to signal, and here we are waiting for three resources to signal success, which is the same as my desired capacity, obviously.
And we’re saying we are willing to wait 15 minutes for these instances to come up; if that timeout is exceeded, the whole Auto Scaling group will fail to launch. This is really good, because if you have a very complicated launch configuration with a lot of EC2 user data in it and something fails within that user data, you want to make sure this Auto Scaling group receives a failed signal; if it does, the entire thing will just roll back and stop. If we scroll down and look at the launch configuration, it’s very simple: the image ID is the latest Linux AMI ID that I get from the parameters, the instance type is t2.micro, and the user data is very simple.
It’s the base64 of this entire script, which does something very simple: it gets the bootstrap scripts and then uses the cfn-signal script to tell my stack, for the resource AutoScalingGroup which is defined right here, in the region we’re in, that the launch was successful. So this launch configuration, attached to this Auto Scaling group with its three instances, will send three signals to the Auto Scaling group, and hopefully it will work. Let’s go back into CloudFormation, click on Next, I’ll call it demo-asg-cfn, click on Next, scroll down, click on Next, and create my stack at the very bottom. Here we go.
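For reference, the cfn-signal helper is a wrapper around the CloudFormation SignalResource API. A minimal boto3 equivalent, run from the instance itself, could look like the following sketch; the stack name and logical resource ID are assumptions matching this demo:

```python
import boto3
import urllib.request

cloudformation = boto3.client("cloudformation")

# The instance uses its own ID as the UniqueId, like cfn-signal does.
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

cloudformation.signal_resource(
    StackName="demo-asg-cfn",              # placeholder stack name
    LogicalResourceId="AutoScalingGroup",  # logical ID of the ASG in the template
    UniqueId=instance_id,
    Status="SUCCESS",                      # or "FAILURE" if the bootstrap failed
)
```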
Now my creation is in progress, and as we can see here, the Auto Scaling group is being created. If we go to the Auto Scaling group UI, it has been created, but the CloudFormation stack itself is not in CREATE_COMPLETE state, because we are waiting for three signals to come from the three EC2 instances being created for me. So we need to wait for these three EC2 instances to run the user data script and signal their success back to CloudFormation; when they do, we will go into CREATE_COMPLETE. And we can see here that there are three events that happened in this timeline, and each event says it received the SUCCESS signal with a unique ID, which is the instance ID here.
So we have three different instance IDs, and the three instances here, here, and here have signaled their success to CloudFormation. As CloudFormation received the three success counts required by the creation policy, what we get out of it is a CREATE_COMPLETE for this ASG. So remember going into the exam: if you want to make sure that the instances are properly configured at the first creation of an ASG, you need to set a CreationPolicy on the ASG itself. Okay? Well, that’s it for this lecture. In the next lecture we’ll look at the update policy.