Amazon AWS SysOps – S3 Storage and Data Management – For SysOps (incl Glacier, Athena & Snowball) Part 5
- S3 & Glacier Select
Quick theory lecture on S3 Select and Glacier Select. The idea is that we want to retrieve less data, so subsets of what we're requesting, using SQL, by performing server-side filtering. The SQL queries are quite simple: they can only be used to filter by rows and columns, so they're very simple SQL statements. You cannot do aggregations or anything like this. And you will use less network and less CPU cost client-side, because you don't retrieve the full file. S3 will perform the select, the filtering, for you and only return to you what you need.
So the idea is that before, you have Amazon S3 sending all the data into your application, and then you have to filter it application-side to find the right rows you want and only keep the columns you want. And after, you request the data from S3 using S3 Select and it only gives you the data you need: the columns you want and the rows you want. And the results Amazon advertises are that you are up to 400% faster and up to 80% cheaper, because there is less network traffic and the filtering happens server-side. Okay? So similarly, let's just do another diagram.
We have the client asking to get a CSV file with S3 Select, to only get a few columns and a few rows. Amazon S3 will perform server-side filtering on that CSV file to find the right columns and rows we want, and send the filtered data back to our client. So obviously less network, less CPU, and faster. So this is great. To make it concrete, here is what such a query can look like in code.
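As a side note, here is a minimal boto3 sketch of an S3 Select call; the bucket, key, and column names are placeholders for illustration:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to filter the CSV server-side and return only matching
# rows/columns. All names below are placeholders.
response = s3.select_object_content(
    Bucket="my-example-bucket",
    Key="data/sales.csv",
    ExpressionType="SQL",
    # Simple scan-style SQL only: filter by rows and columns.
    Expression="SELECT s.city, s.amount FROM s3object s WHERE s.country = 'IE'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result comes back as an event stream; 'Records' events carry the data.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```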
So to summarize from an exam perspective: anytime you see server-side filtering of data in S3, so that you retrieve less, think about S3 Select and Glacier Select, because it works on Glacier as well. And then for more complex, serverless querying on S3, you'll see in future lectures that we have something called Amazon Athena. All right, that's it. I will see you in the next lecture.
- S3 Event Notifications
Okay, so now let's talk about Amazon S3 event notifications. So some events happen in your S3 bucket. For example, this could be a new object created, an object removed, an object restored, or an S3 replication happening. And you want to be able to react to all these events. You can create rules, and for these rules you can also filter by object name. For example, you want to react only to JPEG files, so *.jpg. So you can create event notification rules, and these rules allow you to trigger some sort of automation inside of your AWS account. A very classic use case would be to generate thumbnails of images uploaded to Amazon S3. So what are the possible targets for S3 event notifications? Well, you have three. You have SNS, the Simple Notification Service, to send notifications and emails.
We have SQS, the Simple Queue Service, to add messages into a queue, and finally Lambda functions to run some custom code. Now, we'll see all these services, SNS, SQS and Lambda, in detail in this course. But for now, just remember that you have these three targets for Amazon S3 event notifications, and that will make a lot of sense by the end of this course. You can create as many S3 event notifications as desired, and most of the time they will be delivered in seconds, but sometimes they can take a minute or longer. And there is a small caveat: if you want to make sure every single event notification is delivered, you need to enable versioning on your bucket. So this is what these two very long lines of text in the documentation are saying.
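For reference, the same kind of rule we are about to click through can be defined through the API. A minimal boto3 sketch, where the bucket name and queue ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Send s3:ObjectCreated:* events for .jpg objects to an SQS queue.
# Bucket name and queue ARN below are placeholders.
s3.put_bucket_notification_configuration(
    Bucket="stephane-event-notification-demo",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:eu-west-1:123456789012:demo-s3-events",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": ".jpg"}]
                    }
                },
            }
        ]
    },
)
```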
So let's go ahead in the console to see how we can set up a very simple S3 event notification. Let me create a bucket. I'll call it stephane-event-notification-demo, and then I will go and create this bucket. Great. So we will enable event notifications for this bucket. Okay, so the first thing I would like to do is to go to Properties and make sure versioning is enabled, and I just enable it right now. Okay, so versioning is enabled, and the next thing I have to do is scroll down and go to Events. This is where we're going to create our S3 event notifications. So I'm going to add a notification and I'll call it DemoNotificationSQS, as we'll be using SQS in this hands-on. And the type of events I want to use is "All object create events".
So with this one, anytime an object is created, it will send a notification. You can set up a prefix and a suffix; this is optional, just to filter amongst the files. And then you can send this to a destination: it could be an SNS topic, an SQS queue, or a Lambda function. To make this as simple as possible, and even though we haven't seen SQS yet, we will choose an SQS queue. And here we need to set up an SQS ARN. So let's go ahead into SQS and set up a queue. I'm going to type SQS in here and open the SQS management console. Now, we haven't started with SQS yet, but it is going to be very simple.
We're going to create an SQS queue called demo-s3-events, and this is going to be a standard queue. Just make sure that you are setting up the queue in the same region where your bucket is; for me it is Ireland. Okay, I will click on Quick-Create Queue, and here we go. Next we pick up the ARN of that queue right here, and then we go back to the S3 management console and paste that ARN into this field, and I will click on Save. Now I am getting an error, which is that S3 is not allowed to publish notifications from this bucket into this SQS queue. So we have to go into the SQS queue and give permission to the S3 bucket to publish to it. For this, I'm going to go to the Permissions tab, and here I can add a permission. For this permission I'm going to say allow, and for the principal I will say everybody; this is going to make it a lot simpler to do.
And I'm going to allow the SendMessage action. So we're going to allow SendMessage to come from everybody. This is very simple and way too permissive, but at least it will make our S3 event notification work; in practice you would scope this permission down, as in the sketch below. So we'll add this permission, so that now anybody can send a message to my demo S3 events queue, and I will click on Save yet again. And now this S3 event notification has been activated and it is working. Okay, so we're going to go to Overview, and I will go to Upload and try to upload my coffee.jpg file. So I upload it, and it's been successful. And so what I would hope to see is a message in my SQS queue.
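For completeness, a production setup would restrict the queue policy to the S3 service and to events coming from our specific bucket. A minimal sketch of such a policy, applied with boto3; the account ID, queue URL, and bucket name are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Restrict SendMessage to the S3 service, and only for events
# originating from our bucket. All identifiers are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": "arn:aws:sqs:eu-west-1:123456789012:demo-s3-events",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:::stephane-event-notification-demo"
                }
            },
        }
    ],
}

sqs.set_queue_attributes(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/demo-s3-events",
    Attributes={"Policy": json.dumps(policy)},
)
```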
So right now there are zero messages available in my SQS queue. But if I refresh on the top right-hand side, as we can see, we have two messages available. And to see these messages, we can do Queue Actions, then View/Delete Messages, say "don't show this again", and start polling for messages. And here we are seeing two messages. Actually, the first one was a test message sent by AWS just to verify that the connection was working, and the second is the one that represents the record, where if we look at more details, it says "ObjectCreated:Put", and somewhere here we should see that it is for our coffee.jpg file. So this is really cool.
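If you prefer to poll the queue from code rather than from the console, a minimal boto3 sketch might look like this; the queue URL is a placeholder:

```python
import boto3

sqs = boto3.client("sqs")

# Long-poll the queue for up to 10 seconds and print any S3 event records.
response = sqs.receive_message(
    QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/demo-s3-events",
    MaxNumberOfMessages=10,
    WaitTimeSeconds=10,
)

for message in response.get("Messages", []):
    print(message["Body"])  # JSON payload describing the S3 event
```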
Anytime we add a file into Amazon S3, it will end up in SQS through an event notification. So I can just show you this again. I'm going to upload a new file, and this time I'm going to upload the beach.jpg file and click on Upload. Here we go, we are successful, and I'm going to start polling for messages again. And the message we are receiving right here is for my beach.jpg file. So this is working: S3 event notifications are working just fine, and it was a quick but simple demo to show you how things work. Now, if you want, you can go ahead and delete this SQS queue, and we'll be able to get back to it later when we do go into SQS. But you don't have to if you don't want to. Okay, well, that's it for this lecture. I will see you in the next lecture.
- S3 Analytics
So now we have S3 analytics, and this is to set up storage class analysis. Basically, with the help of S3 analytics, we can determine when we should transition an object, maybe from Standard to Standard-Infrequent Access. It does not provide analytics for One Zone-IA or Glacier, but it works for the rest. The report will be updated on a daily basis, and it takes about 24 to 48 hours for the report to first appear. And the idea is that once we get that report, we know how to put efficient lifecycle rules in place.
So we know exactly after how many days we should move objects from one storage class to another. This is the kind of thing we can do. Okay, let's see how to enable it. It's pretty easy: we go to Analytics, and here in Analytics we're supposed to be able to create a storage class analysis. So we have to create and add a filter. I'll call it DemoFilter, and we could set a filter for a specific prefix, but we'll set it for everything. The data could also be exported to a destination bucket, maybe in this account or another account, but this is optional. I click on Save, and now it says it's analyzing your data.
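For reference, the same storage class analysis can be created from code. A minimal boto3 sketch, with placeholder bucket names and IDs:

```python
import boto3

s3 = boto3.client("s3")

# Create a storage class analysis on the whole bucket (no prefix filter),
# exporting daily CSV results to a destination bucket. Placeholders throughout.
s3.put_bucket_analytics_configuration(
    Bucket="my-source-bucket",
    Id="DemoFilter",
    AnalyticsConfiguration={
        "Id": "DemoFilter",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::my-analytics-results-bucket",
                        "Prefix": "analytics/",
                    }
                },
            }
        },
    },
)
```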
This will take about 24 to 48 hours to be done, and because I don't have much in this bucket, we're not going to get very meaningful results. But let me show you one of my actual buckets that hosts a website. This is my bucket for a Kafka tutorial, and in there I already enabled analytics, maybe less than a month ago, to show you what it would look like. And I still don't have any recommendations from S3 analytics around when I should transition my objects to Infrequent Access.
But basically we can start seeing some graphs around how much data and storage is used and how much data is retrieved. Then we get information around the percentage of storage that is retrieved, and we can get even more information around how old my objects are. Because I don't have that many objects, I don't see much here. But for example, we can see that some objects are between 90 and 120 days old, a lot of them are over a year old, and fewer of them are less than a year old. So it gives you some insights.
And the idea is that as soon as you have an S3 bucket with a lot more activity than mine, obviously, and a lot more objects, then you're going to get some meaningful insights from S3 analytics. Right now I don't have anything meaningful, but that's fine, I'm sure you get the idea. So that's it for S3 analytics. Just remember what it's used for: basically, to recommend to you when to move objects to IA, for example. And I will see you in the next lecture.
- Glacier Overview
Okay, so let's talk about Glacier. If you remember, Glacier was the last storage class option that we haven't really talked about. And Glacier, as the name indicates, is really, really cold. Cold means low-cost storage, usually for archiving or backup. That means that the data, because it's so cold (think of a real glacier), is going to be stored there for the long term; we're talking about tens of years. It's basically an alternative to on-premises magnetic tape storage. So instead of having this on tapes and keeping them in an on-premises drawer, you would put the data on Glacier, and it would sit there for a very long time for archival or backups.
And you will pay a very, very low fee. The cool thing about Glacier is that it has the same durability that S3 has, so we still don't lose files: on average, one file out of 10 million every 10,000 years, so not much. Now, the cost of storage per month is really, really low, almost a tenth of S3's; we're talking about $0.004 per GB. Plus, if you want to retrieve your files, you're going to pay a retrieval cost. But overall, storing files in Glacier is really, really cheap. Now, in terms of naming convention: in S3, buckets contain objects, while in Glacier each item is called an archive. And an archive can be as big as 40 terabytes.
So you can have really, really big archives. And archives are not stored in buckets anymore; they're stored in vaults. This is something you have to be familiar with. So as an exam tip: if they ask you, "hey, we want to archive data from S3 somewhere after X number of days", think lifecycle policy, and think Glacier, for example.
Now, Glacier has many operations, and there are three of them that you need to know. There is upload, basically to upload an archive; you can upload in a single operation or use a multipart upload if you have larger archives. Then there's download, basically to initiate a retrieval job. The archive is not going to be downloadable right away: Glacier has to prepare the file for download, and once the file is ready, you have a limited time to actually download the data from the staging server. And finally, there is delete, if you want to remove a file from Glacier. So anytime we get a restore link by initiating a download, it will have an expiry date on it.
And so we have three retrieval options, and you can be asked about them at the exam. The retrieval can be Expedited, which is a one-to-five-minute retrieval, and you're going to pay a lot of money for this: $0.03 per GB and $0.01 per request. Then there is Standard, which is a longer type of retrieval, three to five hours, and you're going to pay $0.01 per GB and $0.05 per 1,000 requests, so way cheaper. And then Bulk, when you retrieve a lot of files at the same time; we're talking a five-to-twelve-hour retrieval time, though, and as you can see from the prices right here, it's even cheaper. So the idea is that with Glacier, you have the option to say how fast you want the data.
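As an illustration, here is a minimal boto3 sketch of initiating a retrieval job against a native Glacier vault with a chosen tier; the vault name and archive ID are placeholders:

```python
import boto3

glacier = boto3.client("glacier")

# Initiate a retrieval job for one archive; Tier can be
# "Expedited", "Standard", or "Bulk" and drives speed and cost.
job = glacier.initiate_job(
    vaultName="my-demo-vault",
    jobParameters={
        "Type": "archive-retrieval",
        "ArchiveId": "EXAMPLE-ARCHIVE-ID",
        "Tier": "Standard",
    },
)

# Once the job completes (hours later for Standard; in practice you would
# poll describe_job or subscribe to an SNS topic), download the staged data.
output = glacier.get_job_output(vaultName="my-demo-vault", jobId=job["jobId"])
print(output["status"])
```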
If you want the data really quickly, that kind of goes against how Glacier is designed, and that's why they make you pay more. Now, the last thing you need to know about Glacier is vault policies and vault locks. A vault is a collection of archives, and each vault will have one vault access policy and one vault lock policy. Vault policies are written in JSON, like bucket policies, so it's similar: they allow us to restrict what a user or a specific account can do in our vault. But a vault lock policy, and you absolutely need to know about it, is a policy that you lock, and this is used for regulatory and compliance requirements.
So the idea is that you set a policy on a vault and it's immutable: it can never be changed, and you can never remove a lock policy. That's why it's called a lock. And the idea is that the archive will never, ever go away. You would use this basically to get immutability, and the reason you would do this is for regulatory and compliance requirements.
So number one, you can say, okay, I forbid deleting an archive if it's less than one year old; that could be one of these lock policies. Or number two is to implement something called WORM, write once read many, which basically prevents any archive, once created, from being tampered with. And I know the exam sometimes asks about the WORM policy for Glacier, so how do we do this? Well, we have to use a vault lock policy, as sketched below.
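To illustrate the first example (forbid deleting archives younger than a year), here is a minimal boto3 sketch of initiating and then completing a vault lock; the vault name, account ID, and region are placeholders, and the policy follows the documented deny-on-ArchiveAgeInDays pattern:

```python
import json
import boto3

glacier = boto3.client("glacier")

# Deny deletion of any archive that is less than 365 days old.
lock_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "deny-early-delete",
            "Principal": "*",
            "Effect": "Deny",
            "Action": "glacier:DeleteArchive",
            "Resource": "arn:aws:glacier:eu-west-1:123456789012:vaults/my-demo-vault",
            "Condition": {
                "NumericLessThan": {"glacier:ArchiveAgeInDays": "365"}
            },
        }
    ],
}

# Step 1: attach the policy in an "in progress" state; returns a lock ID.
lock = glacier.initiate_vault_lock(
    vaultName="my-demo-vault",
    policy={"Policy": json.dumps(lock_policy)},
)

# Step 2: within 24 hours, confirm the lock. After this, the policy
# is immutable and can never be changed or removed.
glacier.complete_vault_lock(vaultName="my-demo-vault", lockId=lock["lockId"])
```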
So that's all you need to know about Glacier: three slides, very important, feel free to review them again. But you absolutely need to know this before going into the exam, because they ask a few trick questions on Glacier, including the one about the lock policy. Okay, hope you liked it. I will see you in the next lecture.
- Glacier S3 Storage Class – Hands On
Just a quick hands-on to see how we can use Glacier as a storage class in S3. So I'm going into my demo bucket, and here's my coffee.jpg. As we can see, the storage class is Standard-IA, but I'm going to change it, and I'm going to change the storage class to Glacier. Basically it says it's data archiving with retrieval times ranging from minutes to hours. So here I select Glacier, and there's a small warning if you use cross-region replication, but we're not doing this, and I click on Save. Note that when I change the storage class, the last modified date is updated. So we'll click on Change, and here we go: now my file is archived in Glacier.
And so if I wanted to restore it, I would right-click and click on "Restore from Glacier", and I will have to pay for it. Now I have to specify how many days I want my restored copy to be available. As you can see, that restored copy will be of the RRS storage class. So I'll just say, okay, my restored copy should be available for only one day.
And do I want to retrieve it in five to twelve hours, three to five hours, or one to five minutes? If I select one to five minutes, then I need to purchase capacity units. So basically that means that if you want an expedited retrieval, you need to purchase capacity units to allow you to get this kind of retrieval time. If you do a standard or bulk retrieval, then you don't need those. I don't want to add capacity units.
Otherwise I'm going to pay for it, and it's not refundable, so don't do this. I'm going to do a standard retrieval and click on Restore. And now I do have to wait quite a while for my file to be available, about three to five hours. So this is the idea. As you can see on the right-hand side, it says the restoration is in progress, and the restore tier is Standard. So now I just have to wait three to five hours for my file in Glacier to come back to me. But that's how Glacier works, right? It's supposed to be for long-term data archival.
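If you wanted to script this same restore instead of clicking through the console, a minimal boto3 sketch might look like this; the bucket and key are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to restore a temporary copy of a Glacier-class object for 1 day,
# using the Standard tier (roughly 3 to 5 hours).
s3.restore_object(
    Bucket="my-demo-bucket",
    Key="coffee.jpg",
    RestoreRequest={
        "Days": 1,
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)

# head_object exposes restore progress, e.g. 'ongoing-request="true"'.
print(s3.head_object(Bucket="my-demo-bucket", Key="coffee.jpg").get("Restore"))
```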