Practice Exams:

Amazon AWS SysOps – S3 Storage and Data Management – For SysOps (incl Glacier, Athena & Snowball) Part 4

  1. CloudFront Monitoring

Alright, so the first thing we need to talk about with CloudFront is access logs. It is possible to log every request made to CloudFront into a logging S3 bucket. The idea is that you have many websites behind CloudFront, and once you enable logging, all the edge locations will start logging and all these logs will go into one log bucket, one Amazon S3 bucket. This is really cool because now we have global monitoring and global access logs for our application, and that enables some really useful use cases, such as auditing or seeing if anyone is attacking us. We can also analyze these logs in Athena, and we'll see this in a later section. We can also get reports from CloudFront: cache statistics, popular objects, top referrers, usage reports and viewer reports, which is really nice coming from a CDN. All these reports are generated from the same data that goes into the access logs, although you don't have to enable access logs to get access to these reports.
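As a small preview of that Athena analysis, here is a minimal sketch that runs a query over the access logs with boto3. It assumes an Athena table named cloudfront_logs has already been created over the log bucket; the table name, database, column names and the results location are assumptions for illustration only:

```python
import boto3

# Assumption: an Athena table "cloudfront_logs" already exists over the CloudFront
# log bucket, with columns named "status" and "uri" (names are illustrative).
athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT status, uri, COUNT(*) AS hits
FROM cloudfront_logs
WHERE status >= 400
GROUP BY status, uri
ORDER BY hits DESC
LIMIT 20
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # assumed results bucket
)
print("Query execution id:", response["QueryExecutionId"])
```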

Finally, troubleshooting; maybe you'll get some questions about this. CloudFront will cache HTTP 4xx and 5xx status codes returned by S3, which is your origin server. The idea is that if you get a 4xx code, it means the user doesn't have access to the underlying bucket (that's a 403, access denied), or the object the user is requesting is not found (that's a 404). And if you get a 5xx status code, it means there's a gateway issue. So CloudFront has some monitoring built in, and we're going to see this right now.
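If you want to watch those error rates outside the console, here is a minimal sketch, assuming a hypothetical distribution ID, that pulls CloudFront's 4xxErrorRate metric from CloudWatch (CloudFront pushes its metrics to us-east-1):

```python
import boto3
from datetime import datetime, timedelta

# CloudFront publishes its CloudWatch metrics in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/CloudFront",
    MetricName="4xxErrorRate",
    Dimensions=[
        {"Name": "DistributionId", "Value": "E1EXAMPLE123"},  # hypothetical distribution ID
        {"Name": "Region", "Value": "Global"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,              # 5-minute data points
    Statistics=["Average"],  # average 4xx error rate (% of requests)
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.2f}%')
```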

So on the monitoring side, it is possible to enable logging. For this, in general, I will click Edit, scroll down to the Logging setting and set it to On, and we need a bucket for the logs. So let's go back to S3 and create a new bucket for our logs. We're going to create a bucket called aws-stephane-cloudfront-logs; that sounds about right. Click on Next, Next and Create bucket. So we have created this CloudFront logs bucket, excellent. Now I'm going to go back to my CloudFront distribution, and in there I should be able to just refresh.

So let's just go back to Edit. Here I'm going to set Logging to On, and for the log bucket I'll select aws-stephane-cloudfront-logs. Here we go. We can set a log prefix as well, for example "cloudfront/". We're basically going to log everything that happens in our CloudFront distribution directly into the S3 bucket, because maybe we want to analyze it later on. Everything is good, I click on Yes, Edit, and now the distribution has been edited and we get a status of In Progress. It's going to go ahead and update the entire distribution because of that one little change.
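For reference, the same change can be scripted. Here is a minimal sketch, assuming a hypothetical distribution ID and the log bucket we just created, that turns on standard access logging through the CloudFront API:

```python
import boto3

cloudfront = boto3.client("cloudfront")
distribution_id = "E1EXAMPLE123"  # hypothetical distribution ID

# Fetch the current distribution config together with its ETag (required for updates).
resp = cloudfront.get_distribution_config(Id=distribution_id)
config = resp["DistributionConfig"]
etag = resp["ETag"]

# Turn on standard access logging to our log bucket with a "cloudfront/" prefix.
config["Logging"] = {
    "Enabled": True,
    "IncludeCookies": False,
    "Bucket": "aws-stephane-cloudfront-logs.s3.amazonaws.com",  # log bucket created above
    "Prefix": "cloudfront/",
}

cloudfront.update_distribution(
    Id=distribution_id,
    IfMatch=etag,               # ETag from get_distribution_config
    DistributionConfig=config,
)
print("Update submitted; the distribution will show In Progress while it deploys.")
```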

Finally, in terms of reports and statistics, as we can see here we have cache statistics, monitoring and alarms that we can set, and popular objects, so we can see and understand whether we're getting errors and what kind of error codes we're getting. Here it says we're getting some 3xx; that's the redirect behavior I told you about. You can also get information about the top referrers, usage and viewers. We're not seeing much here, so I'm going to show you something on my personal websites. Here are two distributions I have for two personal websites, datacumulus.com and kafka-tutorials.com.

And if we go to cache statistics now, I'm getting a lot more information: the number of requests I'm getting per day, in the thousands, with some nice graphs; and the number of viewers getting cache hits versus cache misses. Cache hits are when they get a very fast load time, misses are when they get a slower load time, and errors are when they can't get a specific file. Then we get information around the number of bytes transferred through cache hits versus cache misses. Right now my CloudFront distribution is doing really well and is serving a lot more bytes from the cache than it is missing, which means that a lot of my files are properly cached. That's exactly what we want from a CDN.

We get some information around the status codes returned over time: 2xx, 3xx, 4xx and 5xx. This is something I would look at if there were a lot of errors on my website; I would try to understand why people are getting so many 5xx or 4xx responses, if they are. Then you get information around the percentage of GET requests that did not finish downloading, et cetera. We can look at the popular objects, for example the files that are requested the most, so jQuery, script.js, all these things, and we get the status codes for each of these files. We can look at top referrers, which shows where the traffic is coming from, so kafka-tutorials,

 datacumulus, Medium, Google, Quora, Bing; basically all the websites that send traffic to my CloudFront distributions, which is pretty cool. And I don't know why vineyard designs shows up in there, but why not? Then we get information around usage, so the same kind of graphs as before, and information around the viewers, to understand which devices people come from: desktop, mobile, bots, tablets, and the trends for all these devices over time. So overall, CloudFront has a lot of options if you want to troubleshoot and monitor, and if you are really serious about running a website at scale, CloudFront is a great tool to have, especially for all the caching capability. So that's it for the monitoring of CloudFront. I hope you enjoyed it, and I will see you in the next lecture.

  1. S3 Inventory

Alright, a very quick lecture on S3 Inventory. This is a feature to help you manage your storage. With it, you can audit and report on whether or not your objects are replicated and encrypted, and on the status of your objects in general. Some use cases could be business compliance or regulatory needs, or simply understanding how your bucket is doing. It can generate a CSV or a Parquet file, and you can query all that data using Amazon Athena, Redshift, Presto, Hive, Spark; you're really free to do whatever you want with it. We can set up multiple inventories. The only thing we have to configure is that the data goes from a source bucket, which is the bucket we set up inventory on, to a target bucket, which is where the inventory reports are written.

Now, because the target bucket gets data written into it from the source bucket, we need to set up a bucket policy on it so that it accepts data from the source bucket. So why don't we go ahead and do this right now? Okay, so let's go and create a bucket, and I'll call it stephane-s3-inventory. This will contain the inventory of our other buckets. We click Next, Next, Next and Create bucket. Alright, now that this S3 inventory bucket has been created, we have to set up inventory on one of our buckets so that it reports into the inventory bucket. So maybe we'll use my original bucket. Maybe I'll upload another file as well, just so we get some more information: the beach picture, here we go, and upload it. And maybe I'll go ahead and take that beach file and encrypt it.

So I'll change the encryption and set S3-managed encryption, AES-256. Alright, good. Now, under Management, I go to Inventory, and there are no inventory reports enabled yet. We can set up inventory to run on a daily or weekly basis, for the entire bucket or for a shared prefix. So I create a new inventory; I'll call it MainInventory. I will not filter by prefix, and for the destination bucket, I can either pick a bucket in this account or in another account. For now, I'll just use the inventory bucket I've created. We can also set a prefix if we wanted to, and we can set the frequency to be daily or weekly.

I'll just set daily. Now, the output format could be CSV, but that's recommended only when you have fewer than about a million objects; otherwise you could use ORC or Parquet, formats that are friendlier to data analysis tools. For the object versions, we can include all versions or the current version only; I'll include all versions. And we can select some optional fields, such as size, last modified date, storage class, replication status, encryption status and so on. This is really cool, because now we can get all this information automatically, on a daily basis.
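For reference, the same daily inventory can be configured through the API. Here is a minimal boto3 sketch, assuming the source and destination bucket names from this demo and a placeholder account ID:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_inventory_configuration(
    Bucket="stephane-original-bucket",   # assumed source bucket name
    Id="MainInventory",
    InventoryConfiguration={
        "Id": "MainInventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "All",       # include all object versions
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "StorageClass",
            "ReplicationStatus",
            "EncryptionStatus",
        ],
        "Destination": {
            "S3BucketDestination": {
                "AccountId": "123456789012",                     # placeholder account ID
                "Bucket": "arn:aws:s3:::stephane-s3-inventory",  # destination bucket ARN
                "Format": "CSV",                                 # CSV, ORC or Parquet
            }
        },
    },
)
```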

So I'll go and click on Save, and my inventory has been successfully created. And here there is a bucket policy that was created: it says that Amazon S3 has created the following bucket policy on the destination bucket so that the source bucket is allowed to place data in it. So that policy was generated automatically and placed onto my target bucket to allow my current bucket to write to it. We can verify this by going to the other bucket: let's go into S3, open stephane-s3-inventory, Permissions, Bucket Policy.

And as you can see, yes indeed, that bucket policy was created. Let's just read it. It says: allow the S3 service to do PutObject anywhere in this bucket, on the condition that the source account is my own account and that the ACL sent is bucket-owner-full-control. And finally, the source ARN that's allowed to do this is my original bucket, the one I just set inventory up on. So now this inventory will run on a daily basis. You can always edit it and maybe make it weekly, but the annoying thing is that you can't really trigger it on demand. I wish you could, but you can't. So what we have to do now is just wait, and tomorrow when I log back in, there should be something in that bucket. I'll let you know then. Alright?
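If you configure inventory through the API rather than the console, you attach that destination bucket policy yourself. Here is a minimal sketch of the policy the lecture just read out, applied with boto3; bucket names and the account ID are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")

# Allow the S3 service to deliver inventory reports from the source bucket
# into the destination bucket, with bucket-owner-full-control ACL.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InventoryDestinationPolicy",
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::stephane-s3-inventory/*",  # destination bucket objects
            "Condition": {
                "ArnLike": {"aws:SourceArn": "arn:aws:s3:::stephane-original-bucket"},  # source bucket
                "StringEquals": {
                    "aws:SourceAccount": "123456789012",            # placeholder account ID
                    "s3:x-amz-acl": "bucket-owner-full-control",
                },
            },
        }
    ],
}

s3.put_bucket_policy(Bucket="stephane-s3-inventory", Policy=json.dumps(policy))
```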

  1. S3 Storage Tiers

Going into the exam, you need to know about all the S3 storage classes and understand which one is best adapted to which use case. So in this lecture, which is going to be quite long, I want to describe all the different storage classes. The first one is the one we've been using so far, Amazon S3 Standard, which is for general purpose, but there are more optimized ones depending on your workload. The first one is S3 Infrequent Access, or IA.

Also called S3 Standard-IA. This one is for when your files are going to be infrequently accessed, and we'll have a deep dive on all of them, by the way. There's going to be S3 One Zone-IA, for data we can recreate; S3 Intelligent-Tiering, which moves data between storage classes intelligently; Amazon Glacier for archives; and Amazon Glacier Deep Archive for the archives you don't need right away. Finally, there is one last class called Amazon S3 Reduced Redundancy Storage, which is deprecated, and therefore I will not be describing it in detail in this lesson. Okay, so S3 Standard, general purpose: we have very high durability, it's called eleven nines, so 99.999999999% durability of objects across multiple AZs.

So if you store 10 million objects with Amazon S3 general purpose, you can on average expect to incur a loss of a single object once every 10,000 years. Bottom line: you should not lose any object on S3 Standard. There's 99.99% availability over a given year. And all these numbers, by the way, you don't have to remember exactly; they're just indicative, to give you the general idea of each storage class going into the exam. It can sustain two concurrent facility failures, so it's really resistant to AZ disasters. The use cases for general purpose are going to be big data analytics, mobile and gaming applications, content distribution.

Basically anything we've been doing so far. Now we have S3 Standard-Infrequent Access, or Standard-IA, and this is suitable for data that, as the name indicates, is less frequently accessed, but requires rapid access when needed. We get the same durability across multiple AZs, but one nine less availability, and it is lower cost compared to Amazon S3 Standard. The idea is that if you access your objects less, you won't pay as much. It can sustain two concurrent facility failures, and the use cases are going to be a data store for disaster recovery, backups, or any files that you expect to access far less frequently. Next we have S3 One Zone-IA, or One Zone-Infrequent Access. This is the same as IA, but now the data is stored in a single Availability Zone.

Before, the data was stored in multiple Availability Zones, which made sure it was still available in case an AZ went down. Here we have the same durability within that single AZ, but if that AZ is destroyed, imagine an explosion or something like that, then you would lose your data. You have less availability, 99.5%, but you still get the low latency and high throughput performance you would expect from S3. It supports SSL for encryption in transit, and it's going to be lower cost compared to Infrequent Access by about 20%. So the use case for One Zone-IA is going to be storing secondary backup copies of on-premises data, or storing any type of data we can recreate.

So what type of data can we recreate? Well, for example, we can recreate thumbnails from an image: we can store the image on S3 general purpose and the thumbnail on S3 One Zone-IA, and if we ever need to recreate that thumbnail, we can easily do it from the main image. Then we have S3 Intelligent-Tiering, and it has the same low latency and high throughput as S3 Standard, but there is a small monthly monitoring and auto-tiering fee. What it does is automatically move objects between access tiers based on access patterns, so it will move objects between S3 general purpose and S3 IA, and it will decide for you whether your object is less frequently accessed or not.

And you pay a fee to S3 for that level of monitoring. The durability is the same, eleven nines, it's designed for 99.9% availability, and it can resist an event that impacts an entire Availability Zone, so it's resilient. Okay, so that's it for the general-purpose S3 storage tiers, and then we have Amazon Glacier. Glacier is going to be more about archives; Glacier is cold, so think cold archive. It's a low-cost object storage meant really for archiving and backups, where the data needs to be retained for a very long time.

We're talking about tens of years of retaining the data in Glacier. It's a big alternative to on-premises magnetic tape storage, where you would store data on magnetic tapes and put those tapes away; if you wanted to retrieve data from them, you would have to find the tape manually, mount it somewhere, and then restore the data from it. We still have the eleven nines of durability, so we don't lose objects, and the cost per storage is really, really low, around $0.004 per GB, plus a retrieval cost, and we'll see that cost in a second. Each item in Glacier is not called an object, it's called an archive, and each archive can be a file of up to 40 terabytes. And archives are stored not in buckets but in vaults. Okay, but it's a very similar concept.

So there are two tiers within Amazon Glacier we need to know about. The first one is Amazon Glacier, the basic one, and it has three retrieval options that are very important to understand: Expedited, which is one to five minutes, so you request your file and within one to five minutes you get it back; Standard, which is three to five hours;

so you wait a much longer time; and Bulk, when you want multiple files retrieved at the same time, which takes between five and twelve hours to give you back your files. So as you can see, Amazon Glacier is really for retrieving files when there's no urgency around it. If you're really in a rush, you can use Expedited, but it's going to be a lot more expensive than Standard or Bulk. The minimum storage duration for Glacier is 90 days, so again, files that go into Glacier are there for the longer term. And we have an even deeper storage tier for Glacier called Deep Archive; this is for super long-term storage and it's going to be even cheaper. But this time the retrieval options are Standard, at 12 hours,

so you cannot retrieve a file in less than 12 hours; and Bulk, if you have multiple files and you can wait up to 48 hours, which is going to be even cheaper. So Deep Archive is obviously for files that you really don't need to retrieve urgently, even once archived. The minimum storage duration for Deep Archive is 180 days. Now, you do have to remember these numbers at a high level, because going into the exam there will be questions asking you to pick between Glacier and Glacier Deep Archive.

For example, if the file will be stored for less than 180 days, you have to use Glacier; if you need to retrieve the file fairly quickly, within three to five hours, it's going to be Glacier; but if the file can be retrieved within 72 hours and it's going to stay for one year in your vault, then Deep Archive is going to provide you the best cost savings. So let's compare everything that we've seen: S3 Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier and Glacier Deep Archive. For durability they're all eleven nines, so you don't lose any objects. For availability, the one to look at is S3 Standard-IA: because it's infrequently accessed, we have a little bit less availability.

And if it's One Zone-IA, then it's going to be even less availability, because we only have one Availability Zone, so that makes sense. As for the SLA, this is what Amazon guarantees and will reimburse you against; it's not something you need to know, but I'll just put it in this chart in case you need it in real life. Now, the number of AZs your data is stored in is going to be three everywhere except One Zone-IA, because as the name indicates it's only one zone, so you're going to have one. Then there is a minimum capacity charge per object: for S3 Standard or Intelligent-Tiering there is none, but when you're using IA you're charged for a minimum object size of 128 KB, and for Glacier 40 KB. The minimum storage duration is going to be 30 days for Standard-IA and 30 days for One Zone-IA, 90 days for Glacier, and

for Glacier Deep Archive, 180 days. And then finally, is there a retrieval fee? For the first two, no, there is not. But with Standard-IA, because it's infrequently accessed, you're going to be charged a fee any time you retrieve the data. And for Glacier and Glacier Deep Archive, again there's going to be a fee based on the number of gigabytes you retrieve and the speed at which you want to retrieve them. You don't need to know all the numbers, but the numbers should make sense given what each storage tier really means. And for those who like numbers, here's a chart you can look at in your own time. What it shows is that the cost of S3 Standard is $0.023 per GB per month, which is the highest, and if we go all the way to the right, Glacier Deep Archive is $0.00099 per GB per month, which is a lot cheaper. In between, Intelligent-Tiering is going to be between $0.023 and $0.0125, Standard-IA is $0.0125, One Zone-IA is even cheaper, and so on.

It also shows the retrieval cost: if we want an Expedited retrieval from Glacier, it's going to cost us $10 per 1,000 requests, whereas if we use Standard or Bulk it's going to cost a lot less, and the same goes for Glacier Deep Archive. Okay, so that's it. Finally, for S3 Intelligent-Tiering there is a cost to monitor objects, because it has to be able to move them between S3 Standard and Standard-IA on demand; the cost is quite small, $0.0025 per 1,000 objects monitored per month. Okay, well, that's it; let's go into the hands-on to see how we can use these tiers. I'm going to create a bucket and I'll call it stephane-s3-storage-class-demo. I'm going to click on Next, I will not set up anything special, so Next, Next again and Create bucket. Okay, excellent, my bucket is created; I'll just find it, here we go, and I go inside of it.

Next I'm going to upload a file; I will add a file, and that file is going to be my coffee.jpg. Click on Next; this is fine, I'll click on Next. And for the properties, this is the interesting part: we can set the storage class. So as I told you, there are a lot of storage classes: Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier, Glacier Deep Archive, and Reduced Redundancy, which is not recommended because it's deprecated. In this screen we have another table that describes what we have learned already. As we can see, we can choose Standard for frequently accessed data; Intelligent-Tiering when we don't know the access patterns in advance, whether the data is going to be accessed frequently or not, and we want Amazon to make that decision for us; Standard-IA for infrequently accessed data; One Zone-IA when we can recreate the data,

so non-critical data; Glacier for archival, with retrieval times ranging from minutes to hours; and Glacier Deep Archive for data that's going to be rarely accessed, where if we ever need it, we can wait 12 to 48 hours to retrieve it. Okay, so let's have an example. We'll use Standard as the class, click on Next and Upload, and our coffee file has been uploaded. As we can see on the right-hand side, it says storage class: Standard. But what you can do is click on Properties, click on Storage class, and we're able to move that object to another storage class, for example to Standard-IA. So let me just save this, and now our object is in the Standard-IA storage class, which is shown here.

So we've just moved it, very simple. And if I refresh this, it should show storage class Standard-IA. Likewise, if we wanted to change that storage class again, we can go to the properties of the object itself; so, oops, here we go, Standard-IA, I click on it and I say, okay, now I want you to be in Glacier, and it tells me, okay, if you put it in Glacier, it's going to be billed for a minimum of 90 days. So I save it, and here we go, my file is now in Glacier, so it's an archive. It's fairly easy, as you can see, and the UI tells you exactly which file belongs to which storage class. So based on your access patterns and the applications you're using, this is how you get the best cost savings and the best performance. Alright, that's it, I will see you in the next lecture.
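Before moving on, here is a minimal boto3 sketch of what we just did in the console: upload an object in one storage class, then change its class by copying the object onto itself (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
bucket = "stephane-s3-storage-class-demo"   # placeholder bucket name
key = "coffee.jpg"

# Upload the object directly into the STANDARD_IA storage class.
s3.upload_file(
    Filename="coffee.jpg",
    Bucket=bucket,
    Key=key,
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)

# Changing the storage class of an existing object is done with a copy onto itself.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="GLACIER",        # move the object to the Glacier storage class
    MetadataDirective="COPY",      # keep the existing metadata
)
```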

  1. S3 Lifecycle Policies

So you can transition objects between storage classes, as we've seen in the previous hands-on. In what ways can we do it? Well, there is a big diagram on the AWS website that describes the allowed transitions, so it looks pretty complicated, but as you can see, from Standard-IA you can go to Intelligent-Tiering, One Zone-IA, and then Glacier and Deep Archive, and it just shows all the possible transitions. As you can see, from Glacier you cannot go back to Standard-IA: you have to restore the object, and then copy that restored copy into Standard-IA if you want to.
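For reference, here is a minimal sketch of that restore-then-copy path with boto3, assuming an object whose storage class is GLACIER (bucket and key names are placeholders); the copy step can only run once the restore has completed, hours later:

```python
import boto3

s3 = boto3.client("s3")
bucket = "stephane-s3-storage-class-demo"   # placeholder bucket name
key = "coffee.jpg"

# Step 1: ask Glacier to restore a temporary copy of the object for 7 days.
s3.restore_object(
    Bucket=bucket,
    Key=key,
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},  # Expedited, Standard or Bulk
    },
)

# Step 2 (after the restore completes; head_object's "Restore" field tells you
# when the temporary copy is ready): copy the object onto itself with a new class.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="STANDARD_IA",
)
```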

So, for infrequently accessed objects, move them to Standard-IA; for archived objects that we don't need in near-real time, the general rule is to move them to Glacier or Deep Archive. Moving objects around all these classes can be done manually, but it can also be done automatically using something called a lifecycle configuration, and configuring those is something you're expected to know going into the exam. So, lifecycle rules, what are they? You can define transition actions, which are helpful when you want to transition objects from one storage class to another; for example, move objects to the Standard-IA class 60 days after creation, and move to Glacier for archiving six months later. Fairly easy and fairly natural. There are also expiration actions, which delete an object after some time. For example, your access log files: maybe you don't need them after a year, so you would say, hey, all my files over a year old, please delete them, please expire them. It can also be used to delete old versions of a file.

So if you have versioning enabled and you keep overwriting a file, and you know you won't need the previous versions after, say, 60 days, then you can configure an expiration action to expire the old versions of a file after 60 days. It can also be used to clean up incomplete multipart uploads: if some parts have been hanging around for 30 days and you know they will never be completed, then you set up an expiration action to remove those parts. Rules can be applied to a specific prefix, so if you have all your MP3 files under the "mp3/" prefix, or "folder", then you can set a lifecycle rule just for that specific prefix; you can have many lifecycle rules for different prefixes on your bucket. That makes sense.

And you can also have rules scoped to certain object tags, so if you want a rule that applies just to the objects tagged Department: Finance, you can do that. Now, the exam will ask you some scenario questions, and here is one; think about it with me. Your application on EC2 creates image thumbnails after profile photos are uploaded to Amazon S3, and these thumbnails can be easily recreated and only need to be kept for 45 days. The source images should be immediately retrievable for those 45 days, and afterwards the user can wait up to 6 hours. How would you design this solution? I'll let you think for a second; please pause the video, and then we'll get to the solution. So: the S3 source images can be on the Standard class, and you can set up a lifecycle configuration to transition them to Glacier after 45 days.

Why? Because they need to be archived afterwards, and we can wait up to 6 hours to retrieve them. Then for the thumbnails, they can be on One Zone-IA. Why? Because we can recreate them. And we can also set up a lifecycle configuration to expire them, that is, delete them, after 45 days. That makes sense, right? We don't need the thumbnails after 45 days, so let's just delete them; let's move the source images to Glacier; and the thumbnails can be on One Zone-IA because it's going to be cheaper, and in case we lose an entire AZ in AWS, we can easily recreate all the thumbnails from the source images. This gives you the most cost-effective rules for your S3 buckets. Now, a second scenario: there is a rule in your company that states that you should be able to recover your deleted S3 objects immediately for 15 days, although this may happen rarely; after this time, and for up to one year, deleted objects should be recoverable within 48 hours. How would you design this to be cost-effective? Okay, let's do it. You need to enable S3 versioning, right? Because we want to delete files but still be able to recover them, and with S3 versioning we're going to have object versions, and deleted objects are in fact just hidden by a delete marker and can be easily recovered.

But we're also going to have non-current versions, basically the object versions from before, and we want to transition these non-current versions into S3 Standard-IA, because it's very unlikely that these old object versions will be accessed, but if they are, we need to be able to recover them immediately. Then, after those 15 days of grace period to recover the non-current versions, you can transition them into Deep Archive, where for the rest of the period, up to 365 days, they can be archived and still be recoverable within 48 hours. Why don't we just use Glacier? Well, because Glacier would cost us a little bit more money, and since we have a timeline of 48 hours, we can go all the way down to Deep Archive to retrieve our file and get even more savings. So this is the kind of exam question you'll get, and it's really important for you to understand exactly what the question is asking, which storage class corresponds best to it, and which lifecycle rule corresponds best to it.

So let's go into the hands-on, just to set up a lifecycle rule. I am in my bucket, and under Management I have Lifecycle, and I can create a lifecycle rule. I can say this is MyFirstRule, and then I can filter by tag or by prefix for these files. For example, as I said, it could be "mp3/", but it could also be a tag if you wanted; whatever you want. That means you can set up multiple lifecycle rules based on prefixes or tags. For now, I want to apply it to my entire bucket, so I will not add a prefix or a tag filter. Okay, here we go, and Next. Then, the storage class transitions: do we want to transition the current object versions, if we have versioning enabled, or the non-current object versions, so the previous versions?

Let's start with just the current versions of the object. We can add a transition; for example, transition to Standard-IA 30 days after creation, and then transition to Glacier 60 days after creation. It warns, by the way, that transitioning small objects to Glacier or Deep Archive will increase costs; I acknowledge this and I'm fine with it. And we can finally add one last transition: put it into Deep Archive after 150 days. Okay, this looks great. You can also transition old object versions, so non-current versions: if we scroll down, we can add a transition saying, when the object becomes a previous version, transition it to Standard-IA after 30 days, and then transition it into Deep Archive after 365 days. So, here we go, I click the acknowledge box, and I'm done with this. Finally, how about expiration?

Do we want to delete objects after a while? Maybe yes: for the current versions, I want to delete objects after 515 days, that makes sense, and for the previous versions, maybe 730 days. And do we want to clean up incomplete multipart uploads? Yes, okay, that makes sense. I'll click on Next, we can review this entire policy and click on Save. And here we go, we have created our first lifecycle rule, which is showing up here right now. Excellent. You can create multiple ones if you have multiple filters, multiple prefixes or tags and so on, based on the actions you want. But as you can see, it's really, really powerful, and you can set up more than one lifecycle rule per bucket. Alright, that's it for this lecture, I will see you in the next lecture.
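For reference, here is a minimal boto3 sketch of roughly the same rule we just built in the console; the bucket name is a placeholder, and the abort-incomplete-multipart window of 7 days is an assumption:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="stephane-s3-storage-class-demo",   # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "MyFirstRule",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # empty prefix = whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 60, "StorageClass": "GLACIER"},
                    {"Days": 150, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": 30, "StorageClass": "STANDARD_IA"},
                    {"NoncurrentDays": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 515},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 730},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},  # assumed value
            }
        ]
    },
)
```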

  1. S3 Performance

So we have to talk about S3 baseline performance. By default, Amazon S3 automatically scales to a very high number of requests and has very low latency, between 100 and 200 milliseconds to get the first byte out of S3, so this is quite fast. In terms of how many requests per second you can get: 3,500 PUT/COPY/POST/DELETE requests per second per prefix, and 5,500 GET/HEAD requests per second per prefix in a bucket. This is something you can read on the AWS website, and I think it's not very clear, so I'll explain what "per second, per prefix" means. What it means overall is that performance is really, really high, and there's no limit to the number of prefixes in your bucket. Let's take an example of four objects named "file" and analyze the prefix for each object. The first one is in your bucket, under folder1/sub1/file. In this case, the prefix is anything between the bucket and the file.

So in this case it is /folder1/sub1/, which means that for this file, on this prefix, you can get 3,500 PUTs and 5,500 GETs per second. Now, if we have another object at folder1/sub2/file, the prefix is again anything between bucket and file, so /folder1/sub2/, and we also get 3,500 PUTs and 5,500 GETs for that prefix, and so on. So if I have /1/file and /2/file, those are different prefixes again. And so now it's easy to understand what a prefix is, and to understand the rule of 3,500 PUTs and 5,500 GETs per second per prefix in a bucket. That means that if you spread reads across all four prefixes above evenly, you can achieve 4 × 5,500 = 22,000 requests per second for HEAD and GET. Perfect. Next, let's talk about KMS as a limitation on S3 performance.

If you have KMS encryption on your objects using SSE-KMS, then you may be impacted by the KMS limits. When you upload a file, S3 will call the GenerateDataKey KMS API on your behalf, and when you download a file from S3 using SSE-KMS, S3 itself will call the Decrypt KMS API. These two requests count towards your KMS quota. So let's take an example: our users connect to an S3 bucket and want to upload or download a file using SSE-KMS encryption, and so the S3 bucket will perform an API call, either GenerateDataKey or Decrypt, against a KMS key and get the result. By default, KMS has a quota of requests per second, and based on the region you're in it could be 5,500, 10,000 or 30,000 requests per second, and you cannot change that quota; as of today, you cannot request a quota increase for KMS.

What this means is that if you have more than 10,000 requests per second in a region that only supports 5,500 requests per second for KMS, then you will be throttled. So you need to ensure that KMS doesn't become the bottleneck for your S3 performance. Now, these quotas are pretty big for normal usage, but it's still good to know if you have many, many files and very high usage of your S3 bucket. Next, let's talk about S3 performance and how we can optimize it. The first one is multipart upload. It is recommended to use multipart upload for files over 100 MB, and it must be used for files over 5 GB. What multipart upload does is parallelize uploads, which speeds up transfers and maximizes bandwidth. As a diagram it always makes more sense: we have a big file and we want to upload that file into Amazon S3.

We divide it into parts, smaller chunks of that file, and each of these parts is uploaded in parallel to Amazon S3. Once all the parts have been uploaded, S3 is smart enough to put them back together into the big file. Okay, very important. Next we have S3 Transfer Acceleration, which is only for uploads, not for downloads. It increases the transfer speed by transferring the file to an AWS edge location, which then forwards the data to the S3 bucket in the target region. There are many more edge locations than regions; there are over 200 edge locations today and the number is growing. Let me show you in a diagram what that means. And note that transfer acceleration is compatible with multipart upload.

So let's have a look: we have a file in the United States of America and we want to upload it to an S3 bucket in Australia. What we will do is upload that file to an edge location in the United States, which will be very quick because only that short hop uses the public internet, and then from that edge location the data is transferred to the Amazon S3 bucket in Australia over the fast, private AWS network. This is called transfer acceleration because we minimize the amount of public internet we go through and maximize the amount of private AWS network we go through.
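Here is a minimal boto3 sketch combining the two upload optimizations: a multipart upload through the high-level transfer manager, sent through the accelerate endpoint. The bucket name and file path are placeholders, and the chunk sizes and concurrency are assumptions:

```python
import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

bucket = "stephane-s3-demo-bucket"   # placeholder bucket name

# One-time setting: enable Transfer Acceleration on the bucket (regular endpoint).
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket=bucket,
    AccelerateConfiguration={"Status": "Enabled"},
)

# Client that sends requests through the accelerate endpoint.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))

# Multipart upload settings: 100 MB parts, up to 10 parts uploaded in parallel.
transfer_config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # use multipart for files over 100 MB
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=10,
)

# The high-level transfer manager handles splitting, parallel part uploads and reassembly.
s3_accel.upload_file(
    Filename="big-file.bin",                 # placeholder local file
    Bucket=bucket,
    Key="big-file.bin",
    Config=transfer_config,
)
```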

So transfer acceleration is a great way to speed up transfers. Okay, now how about getting files, reading a file in the most efficient way? We have something called S3 byte-range fetches. The idea is to parallelize GETs by requesting specific byte ranges of your file. It also gives you better resilience in case of failures: if you fail to get a specific byte range, you can retry just that smaller byte range. So it can be used to speed up downloads this time. Let's say I have a really big file in S3. Maybe you want to request the first part, which is the first few bytes of the file, then the second part, and then the Nth part. We request all these parts as specific byte-range fetches; it's called "byte range" because we only request a specific range of the file, and all these requests can be made in parallel.
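Here is a minimal sketch of a byte-range fetch with boto3 (bucket and key names are placeholders), requesting only the first 50 bytes of an object, like the header example coming up next:

```python
import boto3

s3 = boto3.client("s3")

# Fetch only the first 50 bytes of the object using an HTTP Range header.
response = s3.get_object(
    Bucket="stephane-s3-demo-bucket",   # placeholder bucket name
    Key="big-file.bin",                 # placeholder key
    Range="bytes=0-49",                 # first 50 bytes (inclusive range)
)

header_bytes = response["Body"].read()
print(len(header_bytes), "bytes fetched")

# Further ranges (bytes=50-99, bytes=100-149, ...) can be fetched the same way,
# in parallel threads, to speed up a full download.
```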

So the idea is that we can parallelize the GETs and speed up downloads. The second use case is to retrieve only a partial amount of the file: for example, if you know that the first 50 bytes of the file in S3 are a header that gives you information about the file, then you can just issue a byte-range request for that header, say the first 50 bytes, and you get that information very quickly. Alright, so that's it for S3 performance: we've seen how to speed up uploads and downloads, we've seen the baseline performance and we've seen the KMS limits. Make sure you know those going into the exam, and I will see you in the next lecture.