Amazon AWS Certified Data Analytics Specialty – Domain 4: Analysis Part 2
- Intro to Elasticsearch
Let’s dive into Amazon’s Elasticsearch Service. Elasticsearch is a pretty exciting technology, I think, for doing large-scale analysis and reporting, at petabyte scale in fact. And what’s interesting is that even though Elasticsearch started out as a search engine, and that’s fundamentally what it was made for originally, it’s not just for search anymore. It’s primarily for analysis and reporting these days. And for some applications, it can actually analyze massive data sets a lot faster than something like Apache Spark could. So for the right sort of queries, Elasticsearch can be a really good choice for getting answers back really quickly across a massive data set that could span petabytes across an entire cluster. Let’s dive into what it’s all about. So what is Elasticsearch? Well, we don’t really talk about Elasticsearch itself so much. It’s part of a larger context called the Elastic Stack, of which Elasticsearch is one component. So Elasticsearch is fundamentally a search engine. You can send a JSON-style request to say, go index this document, and another JSON request to say, go search for documents that contain these keywords or these attributes.
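Just to make that concrete, here is a minimal sketch of what those two kinds of requests might look like, using Python’s requests library. The endpoint URL, index name, and document fields are all made-up assumptions just to show the shape of the JSON going back and forth; this isn’t tied to any particular setup.

```python
# A minimal sketch of the two request styles described above, using Python's
# "requests" library. Endpoint, index name, and fields are hypothetical.
import requests

ES = "http://localhost:9200"  # assumption: an Elasticsearch endpoint you can reach

# "Go index this document": PUT a JSON document to /<index>/_doc/<id>
doc = {"title": "Moby Dick", "body": "Call me Ishmael...", "year": 1851}
requests.put(f"{ES}/books/_doc/1", json=doc).raise_for_status()

# "Go search for documents that contain these keywords": a JSON query to /<index>/_search
query = {"query": {"match": {"body": "ishmael"}}}
hits = requests.get(f"{ES}/books/_search", json=query).json()["hits"]["hits"]
for hit in hits:
    print(hit["_score"], hit["_source"]["title"])
```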
Or you can do things like fuzzy matching and things like that as well. So at its heart, Elasticsearch itself is really a very good, very scalable, and very fast search engine. It’s actually built on top of an open source library called Lucene, and Elasticsearch fundamentally is just a scalable version of Lucene, distributed horizontally across many nodes in a cluster. However, it’s been expanded over the years to include more tools and become more of an analysis and visualization tool. So another piece of the Elastic Stack is a technology called Kibana, which we’ll look at momentarily. That’s basically a visualization tool for querying, analyzing, and visualizing data that’s stored in Elasticsearch. So you don’t have to limit yourself to storing document information in Elasticsearch.
I mean, you could use it to build a search engine for Wikipedia or something, but you can also use it to store semi-structured data, like data coming in from server logs or something like that. And if you’re doing that, you could use Kibana to visualize that data and kind of make your own little Google Analytics dashboard, if you will. It is also a data pipeline. So they realized that there’s a need to actually feed data at scale into Elasticsearch, and kind of like you can use technologies like Kinesis and Kafka to do that in other systems, the Elastic Stack has its own tools for this: Logstash, along with a family of lightweight data shippers called Beats. These are basically frameworks that allow you to import data at scale from a wide variety of different sources into your Elasticsearch cluster. You’re not going to need a whole lot of depth on that for the exam, but just understand that Beats and Logstash are ways of streaming data into your Elasticsearch cluster. It is, as I said, horizontally scalable.
That is the primary thing that Elasticsearch brings to you. It’s basically Lucene scaled out indefinitely as you add more servers to your cluster. So I promised you a closer look at Kibana. Here’s an example of a Kibana dashboard, just to give you sort of a gut-level feel of what it’s capable of. And as we said, you can easily import log data using Logstash into an Elasticsearch cluster and then use Kibana to visualize that data. You can also use Kibana as sort of a friendly front end or UI for actually querying that data interactively and trying out specific requests on your data set, without having to type in command lines using curl from a terminal or something like that. So this is what Kibana looks like. Again, just think of it as sort of a Google Analytics dashboard for your data stored in Elasticsearch, and it is a very scalable solution for that. So for companies that are either uncomfortable with exposing all of their data to Google, or whose data is too large for Google Analytics to handle, Kibana plus Elasticsearch is actually a really good alternative to look into. What is Elasticsearch used for?
Well, like I said, it’s not just for search anymore, although that is still a really great application of it. If you do need to build a search engine for a website, Elasticsearch is great at that, and it’s also very scalable, so it’s totally feasible to build something like search on Wikipedia using Elasticsearch. But it’s used for other things too, like log analytics; it’s very well suited for that. Application monitoring, again based on incoming log data, is another: it’s a very quick and easy way to visualize what’s happening in real time as your data comes in from monitoring your servers.
Security analytics is also an application, and so is clickstream analytics. These all kind of fall under the umbrella of analyzing log data, and that’s really the niche that the Elastic Stack has started to fill. To go into some specific examples, though: for full-text search, MirrorWeb uses Amazon Elasticsearch Service to make the UK government and UK Parliament’s web archives searchable. Using the Amazon Elasticsearch Service, MirrorWeb indexed 1.4 billion documents, billion with a B, for just $337, and indexed 146 million documents per hour, which was 14 times faster than the previously used technology. As you can probably tell, these are case studies that AWS gives you that I’m talking about here, where they talk about how awesome they are, but it actually is impressive how performant and cheap this can be. For log analytics, Adobe uses Amazon’s Elasticsearch Service as well; they’re visualizing large amounts of log data for their developer platform, and at peak they’re receiving over 200,000 API calls per second.
I mean, that’s pretty mind-boggling traffic right there, even for me. But using Amazon ES, that’s shorthand for Elasticsearch Service, Adobe can easily see traffic patterns and error rates and quickly identify and troubleshoot any potential issues, all with reduced operational overhead, because Amazon is doing all the server maintenance for them. For application monitoring in real time, Expedia is an example of a customer of Amazon ES. They’re using Amazon’s Elasticsearch Service for application monitoring, root cause analysis, and price optimization. And Amazon ES is enabling Expedia to monitor huge volumes of Docker logs cost-effectively, identify and troubleshoot issues in real time, scale easily to accommodate additional log sources, and offload the operational overhead as well, because again, it is a managed solution. In the realm of security analytics, the example they give is that
Amazon ES allows you to centralize and analyze events from across your entire organization in real time. You can index and analyze your data as soon as it’s received from multiple sources, and find and prevent threats faster. And finally, for clickstream analytics, the example they give is Hearst Corporation. They built a clickstream analytics platform using Amazon Elasticsearch Service, Amazon Kinesis Streams, and Amazon Kinesis Firehose (we’ll talk about how those all fit together shortly) to transmit and process 30 terabytes of data a day from 300-plus Hearst websites worldwide. That’s a lot of information. And with this platform, Hearst is able to make the entire data stream, from website clicks to aggregated data, available to editors within minutes. So: massive scale, near real time, and it’s cheap. Awesome stuff, right? Let’s talk about Elasticsearch’s main concepts here.
So there are basically three different types of entities you need to think about in the context of Elasticsearch. One is a document. Basically, Elasticsearch is a document storage and retrieval engine; documents are the things you’re searching for, and they’re not just limited to text, you can put structured JSON data in there as well. Now, every document is going to have some unique document ID that you can search for it by, and currently, but not for much longer, a type associated with it as well. Types are what define the schema and mapping shared by documents that represent the same sort of thing, like a log entry type or an encyclopedia article type. Basically, a type defines the schema you expect to see associated with that kind of document. However, types are going away; we’re down to just one type per index right now, and in later Elasticsearch releases, types are going to be eliminated entirely. So today we talk more about indices than types. An index powers search into all documents within a collection of types. An index also contains an inverted index that lets you search across everything within that index all at once.
So currently, in Elasticsearch 6, we have one type per index. The structure that’s expected is that for any given type of document, you’re going to have a specific, separate index for that type. And ultimately, the entire concept of types is going away and you can just put whatever data you want into your JSON documents. So, again, types are going to be a thing of the past soon; we really want to think about documents and indices. But if you see something on the exam about Elasticsearch 6, types might still be a thing. How does it all work? Well, an index, which, again, is a collection of related documents, is split into shards. This is pretty much straightforward horizontal scaling stuff, right? Basically, every document is hashed to a particular shard.
Every shard might live on a different node within a cluster. And what’s interesting is that every shard in Elasticsearch is actually a self-contained Lucene index of its own, so every shard is actually its own little mini search engine. It’s kind of cool. Redundancy works in much the way you would expect. So in this example, we have two primary shards and two replica shards for each primary. The way it works is that write requests are routed to your primary shard and then replicated to however many replicas you specify. Read requests, however, can come from either the primary or the replica shards. So it’s up to your application to actually round-robin that and try to distribute the load of those read requests, so that you’re not putting all that read load on just the primary shard. You can make use of those replica nodes for expanding your read throughput as well.
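If it helps to see where those shard and replica counts come from, here is a rough sketch of creating an index with explicit settings and writing a document into it, again with Python’s requests library. The endpoint, index name, and counts are just assumptions for illustration.

```python
# Rough sketch: create an index with explicit primary/replica shard counts,
# then index a document into it. Endpoint and names are hypothetical.
import requests

ES = "http://localhost:9200"  # assumption: your cluster endpoint

settings = {
    "settings": {
        "number_of_shards": 2,    # two primary shards; each document hashes to one of them
        "number_of_replicas": 2,  # two replica copies of every primary shard
    }
}
requests.put(f"{ES}/logs", json=settings).raise_for_status()

# Writes are routed to the primary shard the document hashes to, then replicated;
# reads can be served by either the primary or any replica of that shard.
requests.put(f"{ES}/logs/_doc/1", json={"status": 200, "path": "/index.html"}).raise_for_status()
```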
- Amazon Elasticsearch Service
Now that we’ve talked about the open source Elastic Stack a bit, let’s talk about what’s specific to the Amazon Elasticsearch Service. So Amazon Elasticsearch Service, or Amazon ES for short as we’ll call it, is a fully managed service that makes it easy to deploy, secure, and operate Elasticsearch at scale with zero downtime. Before this, we would have had to install Elasticsearch ourselves on a bunch of EC2 instances and manage that ourselves based on our requirements. But now you don’t have to worry about all that. All the pain of managing those instances and installing patches and all that stuff is done by AWS, just like it would be with RDS or EMR. But it is not serverless.
So you still have to decide how many servers you want in your Elasticsearch cluster; that’s not going to be scaled up and down for you. You still have to think about the actual servers in your cluster, kind of like EMR. It offers open source Elasticsearch APIs, managed Kibana, and integrations with Logstash and other AWS services such as Kinesis, to enable you to securely put in data from any source and search, analyze, and visualize that data in real time. You only pay for what you use: you pay for the instance hours used by your cluster, the storage actually consumed by it, and the data transfer between it.
But keep in mind that even if your server is just sitting there doing nothing, it’s still consuming instance hours. So you’re paying for what you use in terms of how much server capacity you’ve allocated, which means an idle cluster is still going to cost you something. So if you’re not going to be using your Elasticsearch cluster, make sure you shut it down. It also offers a high level of network isolation that you can achieve with Amazon VPC, and in addition to that, you can ensure data security by encrypting your data at rest and in transit using keys.
And you can manage authentication and access control using Amazon Cognito and IAM policies as well. It integrates with IoT too; one good use case is sending data into Elasticsearch from your Internet of Things devices, which can then be analyzed and visualized. It also offers zone awareness, so you can actually allocate nodes in your Elasticsearch Service cluster across two different Availability Zones in the same region. By doing that, you can increase the availability of the service, but it can cause increased latency.
Of course, other integration points of the Elasticsearch Service include S3 buckets, via Lambda to Kinesis; so you can use Lambda as an intermediary between Kinesis and Elasticsearch to pipe data in from S3. It integrates with Kinesis Data Streams as well, in the same manner. You can also integrate Elasticsearch with DynamoDB Streams to receive data from that, and it also integrates with CloudWatch and CloudTrail for auditing purposes and operational monitoring. When you spin up an Amazon ES cluster, there are a few things it will ask you about that are kind of specific to Amazon. One is how many dedicated master nodes you want, and you have to choose how many of them and what instance type you want to use.
Now, the master nodes are only used for the management of the Elasticsearch domain you’re creating; they do not hold or process any data. So generally you don’t need too many of them unless your cluster is really massive. What’s a domain? Well, again, this is sort of an Amazon-specific thing. An Amazon Elasticsearch Service domain is a collection of all the resources needed to run the ES cluster, so it contains all the configuration for the cluster as a whole. Basically, a cluster in Amazon ES parlance is a domain. It also allows you to enable automatic snapshots to S3 for data backup purposes, so if you do inadvertently shut down your cluster, you won’t lose that data. And like we talked about, zone awareness is also an option if you want increased availability at the price of higher latency. Security is always an important point on this exam, so let’s talk about how Amazon’s security systems integrate with Amazon ES.
We’ll dive into this a little bit more when we get into the exercise. It does allow resource-based policies, so you can attach those to the service domain; those determine what actions a principal can take on the Elasticsearch APIs, where a principal is a user, an account, or a role that can be granted access. You can also have identity-based policies using IAM policies, or IP-based policies to tie specific actions to specific IP ranges as well. You can also sign your requests going into Amazon ES; in fact, you have to. All requests to Amazon ES must be signed. And when you send in requests from the AWS SDKs to Elasticsearch, that will give you the means you need to actually digitally sign all of those requests going in. Otherwise, all that traffic would just be unsigned JSON data, and that’s obviously not a very secure approach when you’re sending stuff in flight across the Internet.
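If you’re curious what that signing looks like outside of the SDK’s higher-level clients, here is a minimal sketch using botocore’s SigV4 signer directly. The region and domain endpoint below are made-up assumptions; normally the AWS SDKs (or a helper library such as requests-aws4auth) take care of this signing for you.

```python
# A minimal sketch of signing an Amazon ES request with Signature Version 4.
import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

region = "us-east-1"                                                    # assumption
endpoint = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"  # assumption

credentials = boto3.Session().get_credentials()
request = AWSRequest(method="GET", url=f"{endpoint}/_cluster/health")
SigV4Auth(credentials, "es", region).add_auth(request)  # adds Authorization and X-Amz-Date headers

response = requests.get(request.url, headers=dict(request.headers.items()))
print(response.json())
```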
For further security, you can also put your cluster in a VPC instead of making it public, and that will give you additional security for your ES cluster by not having it accessible from the outside world at all. Of course, that does make it harder to actually connect to your cluster and use tools like Kibana, but we’ll talk about that on the next slide. Now, keep in mind that you cannot set up your cluster first in a VPC and then move it out of the VPC, or vice versa. You have to decide upfront if your cluster is going to live in a VPC or be publicly accessible; you can’t change that later. And finally, it integrates with Cognito, and primarily that’s useful in the context of talking to Kibana. So the thing is, if you’re hosting your Elasticsearch cluster within a VPC, how the heck are you going to access Kibana on it? You access Kibana through a web interface, so you need to be able to get inside your cluster and open up an HTTP connection to actually view and interact with Kibana.
So the simplest way of dealing with that is using Cognito. AWS offers an integration with the Cognito service, which allows end users to log into Kibana through an enterprise identity provider such as Microsoft Active Directory using SAML 2.0, and also through social identity providers such as Google or Facebook or Amazon. So you could set things up so that Cognito is allowing people to log in using their Facebook account or what have you, and that will grant them access to actually get into Kibana, even if it’s hidden behind a VPC. You can also set up secure, scalable, and simplified sign-on experiences using Amazon Cognito user pools; that makes it a little bit easier to manage these things. But getting inside a VPC from the outside is hard. There are ways around it.
One of them is to use a reverse proxy server, and that’s what we’re showing here in this diagram on the right. So Nginx is an example of a reverse proxy server that you could run on some EC2 host somewhere, which would then turn around and forward your requests to your Elasticsearch domain within the VPC. You could also open up an SSH tunnel to port 5601, which is the port Kibana listens on. Other options for getting to Kibana on your VPC would include Direct Connect, or just using a VPN to get in there. There are some anti-patterns, ways you should not be using Amazon’s Elasticsearch Service. These come straight out of the AWS big data white paper, which again I strongly encourage you to read. One is OLTP. Although Elasticsearch is really fast, it’s not really intended to be used as a transactional database, and it doesn’t have any transactional support like a real database would.
If you’re going to be doing stuff like that, RDS or DynamoDB might be a better choice. Ad hoc data querying? Well, I’m not really sure why that’s an anti-pattern, because you definitely can do that using Kibana, but AWS would rather you use Athena for that sort of thing because, well, that’s their service and that’s what it’s really made for. But remember, Amazon ES is primarily for search and for analytics; that is the bucket that Amazon ES fits into. So if you’re looking for a search or analytics solution for a problem, Amazon ES might be a good fit.
- [Exercise] Amazon Elasticsearch Service, Part 1
So let’s go hands-on with Amazon’s Elasticsearch Service and explore how it integrates with Kinesis Data Firehose. These are concepts that may come up on the exam, and you’ll get your hands dirty here and see how it all works. For this one, we’re going to be doing some operational stuff. We’re going to be using some real Apache HTTP server logs in this case. So we’re going to imagine that we’re actually running a website on my little EC2 instance here, and that’s going to be dumping raw Apache server logs into Amazon Kinesis Data Firehose. That in turn will be configured to dump data directly into Amazon’s Elasticsearch Service, which we can then use to interactively search and create dashboards and visualizations on top of that raw server data. So if something goes horribly wrong, say we see a huge spike in error rates on our web server, this gives us a really easy way to dig into what’s going on operationally. And this is all near real time. We’re dealing with Firehose, so data will be at best a minute behind, but for operational purposes, that’s probably okay. Let’s dive in and build this thing. So first let’s go to our EC2 instance and get those logs.
So I’ve already logged in with PuTTY, and I’m in my home directory, I believe. And this is a zipped-up archive of some real web logs that come from a web server that I run for my company. It’s an HTTPS website, so most of the real traffic will be in the SSL access logs; we’ll see that when we unzip it. So: unzip httpd.zip. And you can see that in addition to our access logs, we also have an SSL access log. That’s where most of the real traffic on my site is, so we can take a look at one. Let’s view the httpd SSL access log. And you can see this is just standard Apache access log format here. It contains a bunch of information: the IP the request came in from, the date it hit, that it was a GET request for this URL, and so on and so forth. Status 200, yada yada. Hit q to get out of that. All right, so let’s actually move that data. Let’s cd back up into our home directory and move it into our /var/log directory, where it would live in the real world. So: sudo mv httpd /var/log/httpd. Cool. Alrighty, back to the AWS console. Let’s actually create an Elasticsearch cluster to listen to this data. So let’s go to Amazon Elasticsearch Service. There it is.
All right, we’re going to create a new domain; that’s basically what they call a cluster in Amazon Elasticsearch. This is just going to be for development and testing, so we don’t need to spend a lot of money on multiple Availability Zones; we’re just messing around here. Obviously, in a production environment, you would choose production instead. Let’s stick with Elasticsearch version 6.4. There are some very big changes planned for Elasticsearch 7, which is slated to come out shortly after this course is released. So to make sure things go smoothly, you probably want to stick with Elasticsearch 6 if you’re given the choice here. Click next. We need to give our domain a name; cadabra sounds as good as anything. It has to be lowercase. m4.large, that’s totally fine. One instance, also totally fine. Again, in production you would obviously have more than one instance for redundancy purposes and to parallelize things, and you would also have a dedicated master instance. But we don’t need that, and we don’t want to pay for it; we want to keep our costs low in this course. Right. All right. Storage. The defaults here are all fine. We don’t need a lot of storage because we’re not really putting a whole lot of data into it.
But you do have options here as to whether you want to store on the instance itself or on EBS. I’m not sure why you would ever choose anything but EBS there. SSD is fine for us. Great. We do not need encryption for this example, but it is important to understand that encryption options are available here in Amazon Elasticsearch: we can do it node-to-node, in flight, and also at rest, when it’s actually written to disk. The defaults here are all fine, so hit next. Now, normally you would want to keep your Elasticsearch clusters safe inside a VPC, but that does make it very difficult for us to access the cluster. We’ve talked a little bit about how tough it is to actually get Kibana traffic through to a VPC to communicate with your Elasticsearch cluster when you’re trying to hit it from your own web browser in your home or your office.
So to make life a little bit easier, we’re actually going to change this to public access, even though that’s generally a very bad idea. We will lock that down to at least only be accessible from our own account. It’s interesting to note that you can actually use Amazon Cognito to authenticate with Kibana as well, so you can use a variety of identity providers to log into your Kibana instance if you want to, and not just the one that’s built into Kibana. Now, for the access policy, we don’t really want this to be totally public. We want to at least lock this down to our account. So first we need our account number. You can get that from this little window up here. I’m going to right-click and open that in a new tab, and obviously your account number will be different.
Just go to My Account and you can copy your account ID there. All right, so now that we have our account ID in our clipboard, we can say allow or deny access to one or more AWS accounts or IAM users. We’re going to allow our account ID, paste that in, and say OK. So we at least have some rudimentary security going on here. Again, it’s not ideal, and given that security plays a huge role on the exam itself, it is important for you to understand how you would actually do this in the real world, where you would have this cluster inside a VPC. Let’s go down and hit next. All we’re doing is reviewing everything here, so let’s keep on scrolling. Everything looks okay to me, so hit Confirm, and we just have to wait for that to spin up.
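By the way, if you’d rather script this than click through the console, here is a rough boto3 sketch of the same setup. The region, account ID, and EBS volume size are assumptions, and the access policy simply mirrors the "allow my account" setting we just chose; in the real world you’d lock this down further or keep the domain inside a VPC as discussed.

```python
# Rough boto3 equivalent of the console steps above (illustrative only).
import json
import boto3

region = "us-east-1"            # assumption
account_id = "123456789012"     # assumption: your own account ID

access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "es:*",
        "Resource": f"arn:aws:es:{region}:{account_id}:domain/cadabra/*",
    }],
}

es = boto3.client("es", region_name=region)
es.create_elasticsearch_domain(
    DomainName="cadabra",
    ElasticsearchVersion="6.4",
    ElasticsearchClusterConfig={
        "InstanceType": "m4.large.elasticsearch",
        "InstanceCount": 1,
        "DedicatedMasterEnabled": False,
        "ZoneAwarenessEnabled": False,
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 10},
    AccessPolicies=json.dumps(access_policy),
)
```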
Now, fortunately, we have this handy dandy button here to create a Firehose delivery stream, to actually set up our input into Elasticsearch using Firehose. So while we’re waiting for that domain to spin up, we can click on this convenient little button to create a Firehose delivery stream, then click on Create delivery stream. We will call it web logs. The source is Direct PUT, because we are going to be using the Kinesis Agent again to actually shovel that data from our server logs into Kinesis Data Firehose. Click Next. Now we’re actually going to do something with these options here for transforming source records. It turns out that if you try to transform the Apache logs on the client side using the Kinesis Agent, the timestamps don’t end up in the right format for Elasticsearch.
So we actually have to use a Lambda function to handle that transformation between the raw Apache logs and JSON data that Elasticsearch will like. So let’s go ahead and enable record transformation, and we will create a new Lambda function to do that: Create New. Now, here’s where things get a little bit weird. What we want to do is convert Apache log files to JSON format so we can turn around and process that into Elasticsearch. And it looks like there is a handy dandy Lambda blueprint for doing just that: Apache log to JSON. If you don’t see that, just select General Firehose Processing or something else; as you’ll see, it doesn’t really matter, because as of this recording, if I hit the Apache log to JSON blueprint, it doesn’t actually exist. It just takes me to a blank function here. So like I said, it doesn’t really matter how you get to this Create function page, you’re going to end up in the same place. It turns out what happened is that there used to be a blueprint for this that worked with Node.js 6, but AWS recently deprecated Node.js 6 in favor of Node.js 10, and apparently they didn’t bother, or forgot, to update that blueprint in the process. If you’re lucky, they’ve added that blueprint back in and you can just ignore this next step; but if you’re unlucky, you’re just staring at some blank code here and you’re stuck on the Author from scratch option.
So in a situation like this, you kind of have to be resourceful. Amazon unfortunately does this sort of thing pretty frequently, where they change things very quickly and sometimes things break in the process. So what do we do here? Well, we still need some Lambda function that will transform Firehose data from Apache access log format to JSON format. Let’s just make sure that blueprint really isn’t there. Let’s click on Use a blueprint and look at the options here. Looks like they did update the syslog-to-JSON Firehose one, but there’s still no Apache access log option available. Well, let’s try this other option: let’s see if there’s something in the repository of entire applications that we can lift from. Let’s click on that and just search for, I don’t know, Apache JSON Firehose, and see what comes up. And there we have something that looks promising: Kinesis Firehose Apache log to JSON. So let’s click on that and explore it, shall we? It looks like this is an entire application stack, but we just want that Lambda function.
So let’s explore its GitHub URL, where the actual source for this thing lives. And sure enough, if you click on index.js, we will find the very same function that we’re looking for. So this is the old Apache log to JSON Firehose Lambda function in all its glory; this is exactly the same function that used to exist as a blueprint. And I don’t know why they didn’t just update it, because it still works on Node.js 10. So let’s go ahead and copy that, shall we? Let’s hit Raw so we can easily copy it. And if you have trouble finding this, I’m going to put a copy of this index.js file in your course materials as well; so worst case, just open up that index.js file in Notepad or whatever and copy and paste it from there. So I’ve copied that to my clipboard. Let’s go back to our function here, and we’ll go back a step to Create function. We’re going to author this from scratch after all.
We need to give it a name; let’s call it LogTransform. We’ll stick with Node.js 10, and we will create a new role with basic permissions, and hit Create function. Now we can just scroll down to the function code, highlight it, and paste over it with our new function. And while we’re down here, let’s go ahead and increase the timeout as well, just to make sure that things work smoothly. So I’m going to change the timeout under basic settings to 1 minute and hit Save. All right, you can look at that code if you want to, but I can promise you there will be no coding on the actual exam. So just take it on faith that this JavaScript code actually does convert Apache log data to JSON format for us as it’s being streamed in. If you want to go through there and try to understand how it works, you can, but again, it’s not important for the exam.
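If you’re curious what that transformation involves but don’t feel like reading JavaScript, here is a rough Python sketch of the same idea: a Firehose transformation Lambda that parses Apache access log lines into JSON records. The regex and field names here are my own assumptions for illustration, not the blueprint’s exact output; the real blueprint also reformats the timestamp into something Elasticsearch can map as a date.

```python
# Hypothetical Python equivalent of the Apache-log-to-JSON Firehose transform.
import base64
import json
import re

# Matches the common Apache access log fields: host, user, timestamp,
# method, request path, status code, and bytes sent.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+) \S+" (?P<response>\d{3}) (?P<bytes>\S+)'
)

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        match = LOG_PATTERN.match(line)
        if match:
            # Re-encode the parsed fields as a JSON document for Elasticsearch.
            payload = json.dumps(match.groupdict()) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
            })
        else:
            # Tell Firehose this record could not be parsed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```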