Amazon AWS Certified Data Analytics Specialty – Domain 4: Analysis Part 3
- [Exercise] Amazon Elasticsearch Service, Part 2
Let's go back to the Firehose configuration screen here. We can close out of this and choose the Lambda function we just made, our log transform function, and move on to the next page. Now we need a destination. The destination in our case will be the Amazon Elasticsearch Service domain that we just set up. Let's choose our domain. Hopefully it's there. Cadabra. We need to specify an index to put this stuff into. Let's call it weblogs. Let's rotate it, I don't know, daily. That's a good strategy because if you rotate your indices daily, it's very easy and efficient to drop old data once it becomes too old for you to care about. That's the main reason you'd want to rotate things; it also makes it easier to find data for a given day. We need to specify a type name as well. We'll call that weblogs too.
The concept of types in Elasticsearch is actually going away very soon, so you'll probably see that field disappear at some point. We do want to back up our data persistently, just in case. Why not? S3 storage is cheap, so let's go ahead and say that any failed records will get dumped into the S3 bucket that we created earlier, and we can keep them all nicely organized under, say, an "ef" directory. Next. All right, now we need to specify our buffer conditions. This is similar to what you deal with in Kinesis. Again, since we want to see more immediate results, we'll set the 300 seconds down to 60, which is the minimum buffer interval. That way we'll be sure to get data at least once per minute.
Scrolling down, you can see that compression and encryption are also options with Firehose. We've seen this before. We need to create a new IAM role for this Firehose stream, and again, the defaults are fine, so we can just say allow here. If you want to take a look at what it's doing, you can examine the actual policy document here that gives it permission to talk to S3 and to Lambda.
All right, hit next. At this point, we want to make sure that our Elasticsearch domain is actually up and running before we move on any further. So let's go back to the Elasticsearch screen here. You can see it's actually still loading, so we're going to give it some more time to start up before we move forward. I'm going to pause this video and come back when that's ready. All right, about five more minutes have passed, and our domain is now up and running in Elasticsearch. So that's cool. Let's go back to Firehose and finish tying it to Elasticsearch.
So the final step here is to create our delivery stream. We'll scroll down and hit Create delivery stream. All right, so that's getting set up. At this point, we have a Firehose delivery stream that is set up to dump web log data into our Elasticsearch domain, which is up and running right now. And as part of what it's doing, it's calling a Lambda function that we set up to transform that raw log data into JSON-formatted data that is Elasticsearch-friendly.
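For reference, the console is essentially building an Elasticsearch destination configuration like the one sketched below for this delivery stream. The field names follow the Firehose API, but the ARNs, region, bucket name, and Lambda function name here are placeholders, so your actual values will differ:

```json
{
  "ElasticsearchDestinationConfiguration": {
    "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/cadabra",
    "IndexName": "weblogs",
    "TypeName": "weblogs",
    "IndexRotationPeriod": "OneDay",
    "BufferingHints": { "IntervalInSeconds": 60, "SizeInMBs": 5 },
    "S3BackupMode": "FailedDocumentsOnly",
    "S3Configuration": {
      "BucketARN": "arn:aws:s3:::your-bucket-name",
      "Prefix": "ef/"
    },
    "ProcessingConfiguration": {
      "Enabled": true,
      "Processors": [
        {
          "Type": "Lambda",
          "Parameters": [
            {
              "ParameterName": "LambdaArn",
              "ParameterValue": "arn:aws:lambda:us-east-1:123456789012:function:LogTransform"
            }
          ]
        }
      ]
    },
    "RoleARN": "arn:aws:iam::123456789012:role/firehose_delivery_role"
  }
}
```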
So the last thing we need to do before we start playing with it is to actually tie that delivery stream to the logs themselves. To do that, we have to go back to our EC2 instance, and I've already logged in here. Let's configure our Kinesis agent to actually pick up those new logs and put them into the stream that we want. So let's say sudo nano /etc/aws-kinesis/agent.json. Now, if you recall, we actually deleted the streams that we were working with previously, so we can go ahead and delete the CadabraOrders flow here; I'm just going to hit Ctrl-K until that whole clause is gone. We also don't need the purchase logs Firehose flow anymore either, but we do want to tie things into the new Firehose delivery stream that we just set up. To do that, we add a new flow: an opening curly bracket, then a "filePattern" of "/var/log/httpd/ssl_access*", which should pick up all the SSL access logs that I just dumped in there; a "deliveryStream" of "weblogs", which is the name of the Firehose delivery stream that we want to dump it into; and an "initialPosition" of "START_OF_FILE". That way, I don't have to manually keep pushing data into there; it will just pick things up from the beginning of the data that I already copied in. Close off that curly bracket, and that should do the job. Double check everything, make sure we're not missing any commas. I think it looks good. Ctrl-O, Enter, Ctrl-X.
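If you want to sanity check it, the flows section of /etc/aws-kinesis/agent.json should end up looking roughly like this (this assumes you named your delivery stream weblogs; the rest of the file, such as the endpoint settings, is omitted here):

```json
{
  "flows": [
    {
      "filePattern": "/var/log/httpd/ssl_access*",
      "deliveryStream": "weblogs",
      "initialPosition": "START_OF_FILE"
    }
  ]
}
```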
Now, to apply that new configuration, we can say sudo service aws-kinesis-agent restart. Hopefully it will work; if I did make a typo, it would tell me now. All right, we seem to be in good shape. The data should be flowing, so let's go ahead and tail the agent's log and see what's happening: tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log. We can see that it did start up; we just have to wait for it to find that data and start pushing it through. All right, it looks like it's going, so cool, things seem to be working. We have some records being sent successfully to the destinations. Presumably that is the weblogs Firehose stream that we just set up, because that's the only destination we have at this point. So I'll Ctrl-C to get out of that and take it on faith that data is flowing into Elasticsearch. But instead of taking it on faith, we can go look for ourselves. Let's go back to the AWS dashboard and go to the Elasticsearch console again. Let's click on Indices and we can dive into what's going on here. As you recall, we set things up to rotate the index by day, so although we have a prefix of weblogs, we actually get individual indices named weblogs followed by the day that those logs were inserted. We have a weblogs-2019-02-27, because that's the date that I'm recording this on. If we open that up, sure enough, there are 79,120 entries in that index.
So things appear to be going into Elasticsearch successfully. Now we want to be able to actually play with this stuff, and to do that interactively, we're going to use Kibana to mess around with the data, search it, visualize it, and so on. But first we need to grant access from our desktop to this cluster so Kibana can actually talk to it. Like we've said a few times, getting Kibana to talk to an Elasticsearch cluster on AWS can be a bit of a challenge. To do that in the simplest way possible, let's click on Modify access policy, and what we're going to do is add another clause to our access policy that opens up Kibana from our IP address. If you go to the course materials, I've made life easy for you: go to the log analysis folder and open up the Kibana access text file. Now, you do need to fill in your IP address and your account ID here for this to work. What we're doing is opening things up to your IP address, wherever you're connecting from, and from your account ID, to the Cadabra Elasticsearch domain. So basically anything you're doing from your computer at home is going to be fair game here. We need two things: your IP address and your account ID.
Let's start with the account ID. That's the easier of the two. We can just go back to the console here, go to Big Data or whatever you called your account, and go to My Account. Copy your account ID (yours will obviously be different) and paste it in where it says your account ID. We also need your IP address, and that might not be what you think it is; it's your address as seen from the outside world. So you can go to a website like whatismyip.com. Well, there's mine right now, which happens to be the external address for my cable modem. Copy that (obviously yours will be different) and paste it where it says your IP. Now let's copy this entire clause to our clipboard, go back to Elasticsearch here, and add it as another clause underneath the Statement. So after the Statement's open bracket, we're going to hit Enter and paste that all in, and you can see that it is all self-contained, followed by a comma and then the permissions that we had previously. Let's hit Submit, and I'll give that time to apply.
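The clause you paste in looks something like the sketch below. The IP, region, and account ID shown are placeholders (and the exact file in the course materials may differ slightly); the key idea is an allow statement on es:* scoped to the cadabra domain, with an IpAddress condition for your address:

```json
{
  "Effect": "Allow",
  "Principal": { "AWS": "*" },
  "Action": "es:*",
  "Condition": {
    "IpAddress": { "aws:SourceIp": "203.0.113.25" }
  },
  "Resource": "arn:aws:es:us-east-1:123456789012:domain/cadabra/*"
}
```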
- [Exercise] Amazon Elasticsearch Service, Part 3
You can see our domain has gone back into the processing status. It actually needs to restart things to apply that new access policy. So again, we're going to pause this video and come back when that's done. Okay, it didn't take too long. Our domain status is back in the active state here, which means we can play with it again. And now that we've opened things up to our IP address here at home, we should be able to click on this Kibana link and have it just work. Let's see if that does in fact happen. Click on the link that it gives you here. And, cool, we're loading Kibana. So we actually have an interactive view into our Amazon Elasticsearch cluster, which has been receiving data from our web logs through a Firehose stream. Pretty cool. It seems to be working. Let's go to Management and get things wired up here. We need to give it an index pattern. Click on Index Patterns, and the pattern we want will be weblogs*. Sure enough, we have a success: it did find the one index that exists. Click next step. Now it wants to know our time filter field name, and there are actually two to choose from: one is the local timestamp and one is UTC time. To keep life simple, I'm going to go with @timestamp, which is local time, and create the index pattern.
All right, things seem to be working. Now we can go to the Discover link here and actually start messing around with our data. Right now nothing is showing up because we have a "last 15 minutes" time filter here, and these server logs contain data from much, much further back in time. So the first thing we have to do is change that time range. Let's click where it says Last 15 minutes and instead set up an absolute time range. Setting it to 2019-01-27 through 2019-02-02 should pick up the data that's in these actual logs. Click Go, and now we can see we actually have some data here. Automatically it's giving me this histogram of hits per three-hour window.
That's pretty cool to look at. We can see the ebbs and flows of traffic to my website here graphically without doing any work. And we also have individual rows of server log data here that have already been broken out, thanks to our Firehose stream that transformed the raw log format into a JSON structured format instead. So that's pretty cool. We can do searches here if we want. Let's say you want to look for things that had an error. We could say response:500, for example, and it very quickly gives me back all of the 500 responses that we got. And we can see that we did in fact have a big peak here of eleven of them at that particular time of day. So, operationally, that might be something I'd want to dig into. Where did those come from? What drove them? And down here, I can see more detail about where they might have come from. Indeed, I don't really see an obvious rhyme or reason or pattern to these requests, but upon further examination, I might uncover a real issue with my website. My guess is that somebody's crawling it and not identifying themselves as a crawler. But we can also make pretty graphs here.
So let's click on Visualize to play with that, and let's create a visualization. How about a vertical bar chart? Just select the index that we want to graph, and now you can mess around and create graphs interactively, which is kind of cool. Let's add a filter: we'll filter on response is 500. For buckets, let's set the x-axis to a date histogram with an hourly interval and hit the Play button to apply all of that. And we've created a nice little histogram here of just the 500 errors that we're seeing. So we've sort of recreated that same view the hard way by building a custom visualization. We could save this visualization for later use if we ever want a quick way to view 500 errors without typing queries into the search bar at the top, and we can click into these things, dig into them, zoom in, and so on.
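For the curious, that "response:500 bucketed by hour" view corresponds roughly to the Elasticsearch query sketched below. The field names (response, @timestamp) and the exact date_histogram syntax depend on the mappings our transform produced and on your Elasticsearch version, so treat this as illustrative only:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "response": "500" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "2019-01-27", "lte": "2019-02-02" } } }
      ]
    }
  },
  "aggs": {
    "errors_per_hour": {
      "date_histogram": { "field": "@timestamp", "interval": "1h" }
    }
  }
}
```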
So, pretty cool stuff. Anyway, that's Kibana. Obviously, it's capable of a whole lot more than that; we just scratched the surface of it. But the important point is that it's a part of Amazon Elasticsearch Service, and we've integrated it with Kinesis Firehose to automatically dump data into it. Kibana is often used as a way to visualize server log data, so Elasticsearch can serve not just as a search engine, but as an operational tool and a visualization tool as well. Elasticsearch: it's not just for search anymore, if you will. All right, this Elasticsearch cluster is costing us money, so let's remember to shut it down now that we're done and get rid of the stuff we no longer need. Back at the cluster here, let's hit Delete domain to get rid of it. Yes, I really want to get rid of it. And we'll also go back and clean up our Firehose stream, just to be safe; we don't want this thing pointing into la la land. So we'll click on weblogs and delete that. All right, everything's back to normal here, and yeah, congratulations. We have messed around with Amazon's Elasticsearch Service through Kinesis Firehose, dumping data directly into it. That's how Amazon has integrated Elasticsearch into its larger ecosystem, and we've used it to tie the server logs on an EC2 host into Elasticsearch, where we can visualize and search them very easily.
- Intro to Athena
Next, let's dive into Amazon Athena. It's a really cool query engine for doing interactive queries on your data in an S3 data lake, and it's completely serverless, which is really interesting. So what is Athena? Its official definition is an interactive query service for S3. So basically, it's a SQL interface to your data being stored in some S3 data lake. There's no need for you to load your data from S3 into Athena; the data actually stays in S3, and Athena just knows how to interpret that data and query it interactively. Under the hood, it's using Presto. You might remember Presto from our Elastic MapReduce lectures; this is essentially a highly customized and preconfigured Presto deployment managed for you, with a nice little user interface on top of it. But the great thing about Athena is that it's completely serverless, so you do not need to manage any servers.
You don't need to provision servers. You don't even have to think about how it works; you just think about using it. Athena supports a wide variety of data formats that might reside in your S3 buckets. Those include (and these might be important to remember): CSV, JSON, ORC, Parquet, and Avro. And this is as good a time as any to get into how these different formats differ from each other, because depending on your application, a different format might make sense. The good thing about CSV (comma-separated values) and its cousin TSV (tab-separated values) is that they're human readable. You can look at them in a text editor and make sense of the data that's in there very easily. It's just one line per row of data, and every line contains a bunch of fields separated by commas. Easy peasy, right? Same thing with JSON. JSON has a bit more structure to it, in that you can have more hierarchical data inside of it.
But it's still human readable: it's still one row per document, and each document is something you can look at and understand. If you want to really do things at scale and do things more efficiently, though, you should be looking at non-human-readable formats such as ORC and Parquet. ORC and Parquet are both examples of columnar formats that are also splittable. Instead of organizing data by rows, they organize it by column. So if you have an application that queries your data based on specific columns, say a use case where you're always looking up records by their user ID or something, that columnar storage makes it very efficient to go and retrieve data for a specific column's values. The other advantage of ORC and Parquet is that they are splittable. So even though they're organized and compressed in very clever ways, they're still splittable.
So in a big data setting, those files can still be split into chunks that can be distributed across an entire cluster. You can have a massive ORC or Parquet file and still have the ability for your cluster to split that data up and process it across different instances. Avro is also an example of a splittable file format, but it is not columnar, and it is not human readable. It is a row-based format, where you might be looking at an entire row's worth of data at a time, generally speaking. But it is still splittable. It's important to remember that these are all splittable formats, because that can come up on the exam. All right, let's put that little diversion aside and get back to Athena. Another important point is that Athena doesn't really care whether your data in S3 is structured, semi-structured, or unstructured. It can work with Glue and the Glue Data Catalog to impart structure on that data and make it something you can query with a SQL command. Some examples of usage given by Amazon include ad hoc querying of web logs; Athena is positioned as a good way to query web log data in S3, and they'd rather have you use it for that than, say, Elasticsearch. You can also use it for querying staging data before loading it into Redshift.
So maybe you have a bunch of data being dumped into S3, and you want to transform it and load it into a big Redshift data warehouse, but you want to be able to play around with that data and investigate it and see what it looks like beforehand. Athena might be a good way of getting a bigger picture of what that data looks like before you actually commit it to a data warehouse. It's also useful for looking at other logs besides web logs in S3, including CloudTrail logs, CloudFront logs, VPC Flow Logs, Elastic Load Balancer logs, whatever you have.
If it's in S3, Athena can query it. It also offers integration with notebook tools like Jupyter, Zeppelin, and RStudio, because you can just treat it like a database. It has ODBC and JDBC interfaces, so you can treat Athena like any other relational database that you might interact with or integrate with. That also includes QuickSight: you can integrate Amazon's QuickSight visualization tool with Athena and use Athena as sort of the glue (well, "glue" is a poor choice of word, because we're actually using the Glue service for this), that is, the thing that connects your unstructured data in S3 to a more structured visualization or analysis tool such as QuickSight.
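To make that concrete, here's the flavor of an ad hoc Athena query you might run against web log data in S3 once a table has been defined over it. The database, table, and column names here (weblogs.access_logs, status, request_time) are hypothetical, just for illustration:

```sql
-- Count 500 errors per day across web logs stored in S3
SELECT date_trunc('day', request_time) AS day,
       count(*) AS errors
FROM weblogs.access_logs
WHERE status = 500
GROUP BY 1
ORDER BY 1;
```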
- Athena and Glue, Costs, and Security
So, speaking of Glue, let's talk about how Athena integrates with AWS Glue to impart structure on your unstructured data in S3 so that it can query it like a database. Let's do a brief recap on how Glue works. You might have a Glue crawler populating the Glue Data Catalog for your S3 data: it looks at what's stored in S3 and tries to extract column and table definitions out of it for you. And you can use the Glue console to refine that definition as needed.
Once you have a Glue Data Catalog published for your S3 data, Athena will see it automatically, and it can build a table from it automatically as well. Anytime Athena sees something in the Glue Data Catalog in your account, it's going to make a table for it, so you can query it just like you would any other SQL database. And it's not just Athena that can use that Glue Data Catalog either; it allows other analytics tools to visualize or analyze that data as well, for example RDS, Redshift Spectrum, EMR, or any application that's compatible with an Apache Hive metastore.
Because remember, the Glue Data Catalog can be used as a Hive metastore too. But in this example, we're using it to expose that table definition to Amazon Athena. With Athena integrated with the AWS Glue Data Catalog, you can create a unified metadata repository across various services, crawl data to discover schemas, populate your catalog with new and modified table and partition definitions, and maintain schema versioning, all under the hood. Athena just sits on top of that and provides a SQL interface to that underlying Glue structure. The cost model is very simple: because it's serverless, you just pay as you go for the activity you actually consume. It currently charges $5 per terabyte scanned, which is pretty generous. It's important to remember that successful and canceled queries do count toward that scanned total (you get charged for those), but failed queries are not charged. So successful or canceled queries you will be billed for; failed queries, however, are free. There's also no charge for DDL operations such as CREATE, ALTER, or DROP; those are free as well. Now, a very important point: if you want to save money with Athena, converting your data to a columnar format like ORC or Parquet, as we talked about, can save you a lot. Not only does that give you better performance for applications that typically query a small number of columns, it can also save you 30% to 90% on cost as well.
That's because it allows Athena to selectively read only the required columns when processing a query. If your queries only touch certain columns, then by using a columnar format you're reducing the amount of data Athena actually has to scan. And remember, you are charged per terabyte scanned, so by reducing that scanning, you win. So remember: Athena works best with columnar formats. They can save you a lot of money and give you better performance as well. Examples of columnar formats include ORC and Parquet. All right. And in addition to Athena, of course, Glue and S3 have their own charges as well. Athena just sits on top of Glue to get the table definition for your S3 data.
And your data is still being stored in S3, so there are separate charges for Glue and S3 in addition to what you pay for Athena itself. I should also point out that partitioning your data can help Athena reduce costs as well. If you have your data structured in S3 in different partitions, such as by date or hour, queries that are restricted to a given partition will scan less data. So in addition to using a columnar format, partitioning your data within S3 can also save you money with Athena.
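One common way to capture both of those savings is an Athena CTAS (CREATE TABLE AS SELECT) statement that rewrites raw data as partitioned Parquet. This is a sketch under assumed table names, columns, and bucket path, not something from the exercise itself:

```sql
-- Rewrite raw logs as Parquet, partitioned by date, so later queries
-- scan only the columns and partitions they actually need
CREATE TABLE weblogs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/weblogs-parquet/',
  partitioned_by = ARRAY['dt']
) AS
SELECT host, request, status, bytes, dt
FROM weblogs_raw;

-- This query reads only one partition and two columns
SELECT count(*)
FROM weblogs_parquet
WHERE dt = '2019-02-27' AND status = 500;
```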
Let's talk about security, too; that's always important. Athena offers several different ways of securing your traffic. One is access control: who can actually get to Athena and query your information in the first place. You can use IAM, access control lists, and S3 bucket policies to restrict access to that information. The relevant IAM managed policies here are AmazonAthenaFullAccess and AWSQuicksightAthenaAccess. You can also encrypt your query results if you're sensitive about them; they can be encrypted at rest in a staging directory in S3, and that can be done in several different ways. Again, these are important to remember, guys. You can encrypt your S3 query results with server-side encryption using an S3-managed key (SSE-S3), with server-side encryption using a KMS key (SSE-KMS), or with client-side encryption using a KMS key (CSE-KMS). Depending on your security requirements, one or more of these may make sense, so think about how that data is flowing, where it's being stored, and how you want it secured. You can also have cross-account access defined in an S3 bucket policy, so it is possible for Athena to access an S3 bucket that is not owned by your account.
If that other S3 bucket has a policy that grants access to your account, it's possible for the Athena console in one account to access a data lake stored in another account; you can set that up. As far as in-transit security goes, you can use TLS (Transport Layer Security) for all the traffic going between Athena and S3 as well. It's always important to remember the security aspects, guys; it's a huge part of the exam. Finally, let's talk about anti-patterns. Again, these come straight out of the AWS Big Data white paper: things they don't want you to use Athena for. One is highly formatted reports or visualization. At the end of the day, Athena is just a SQL query engine. If you want nicely formatted output, or to visualize things with charts and graphs, well, that's what QuickSight is for, and we'll be getting to that shortly. Also, if you want to do ETL (extract, transform, and load) operations, Athena is generally not the best tool to use; Glue ETL is an alternative there, and you can also use Apache Spark or what have you for larger-scale tasks. So that's Athena in a nutshell. Let's go and play with it, shall we?