Amazon AWS Certified Data Analytics Specialty – Domain 4: Analysis
- Intro to Kinesis Analytics
As we start our journey into the analysis domain of big data, let’s start off with Kinesis Analytics. It’s another system for querying streams of data continuously, very similar in spirit to Spark Streaming, but it is specific to AWS Kinesis. Conceptually it’s pretty simple: Kinesis Data Analytics can receive data from either a Kinesis data stream or a Kinesis Data Firehose delivery stream. And just like Spark Streaming, you can set up windows of time that you can look back on to aggregate and analyze data. So it’s always receiving this never-ending stream of data, and you can write straight-up SQL to analyze it and spit the results out to some other stream, which ultimately might feed the analytics tools of your choice. Let’s go into more depth.
There are basically three main parts to Kinesis Analytics. One is the source, or input data; this is where the data to be streamed is coming from. The input can be either a Kinesis stream or a Firehose delivery stream coming in from the left there, and that goes into our input streams. You can also optionally configure a reference data source to enrich your input data stream within the application, which results in an in-application reference table. Maybe it’s some sort of lookup data that you want to refer to within the SQL of your analytics job. If you want to do that, you have to store your reference data as an object in an S3 bucket.
When your Kinesis Analytics application starts, Kinesis Data Analytics reads that S3 object and creates an in-application table, so you can refer to that data in S3 however you wish. Then we have the real-time analytics, or the application code itself, sitting in the middle there; that’s where the actual analysis happens, performing real-time analytics using straight-up SQL queries on your stream of information. The stuff that forms the analytics job is called the application code: just SQL statements, which are processed against the streaming data and reference tables. And again, it can operate over windows of time, so you can look back through some fixed time period as new data is being received. Then we have the destination, going out to the output streams there on the right; that’s where the processed data will go.
Once the data is processed, the output can be sent either to another Kinesis data stream or to a Firehose delivery stream, and from there it can go wherever you want, like an S3 bucket or Amazon Redshift if you want to store it in a data warehouse. Lambda might be involved too, as the glue between these services. Also, if any errors are encountered, those are sent to the error stream: Kinesis Data Analytics automatically provides an in-application error stream for every application, and if your application has any issues while processing certain records, like a type mismatch or a late arrival, that record will be written to the error stream for you. (We’ll sketch out how this source / application code / destination wiring looks in code at the end of this section.) What are some common use cases of Kinesis Analytics? Here are three of them. One is streaming extract, transform, and load, or ETL.
For example, you could build an application that continuously reads IoT sensor data stored in a Kinesis data stream, organizes that data by sensor type, removes duplicate data, normalizes the data per a specified schema, and then delivers the processed data to Amazon S3. Another is continuous metric generation: you could build a live leaderboard for a mobile game by computing the top players every minute and sending that on to Amazon DynamoDB, or you could track the traffic to your website by calculating the number of unique visitors every five minutes or so and sending the processed results to Amazon Redshift for further analysis. Responsive analytics is a third: for example, an application that computes the availability or success rate of a customer-facing API over time and sends those results to Amazon CloudWatch, or one that looks for events meeting certain criteria and automatically notifies the right customers using Kinesis data streams and Amazon Simple Notification Service, or SNS.
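To make that three-part picture concrete, here’s a minimal boto3 sketch of wiring it all together programmatically, assuming made-up stream names, role ARNs, and a toy one-column schema; in practice you’d more likely click through the console, as we do in the hands-on exercises:

```python
import boto3

ka = boto3.client("kinesisanalytics")  # the SQL (v1) Kinesis Analytics API

# Parts 1 and 2: create the application with a Kinesis stream as its source
# and our SQL as the application code. All ARNs and names are placeholders.
ka.create_application(
    ApplicationName="my-analytics-app",
    # A stub; the real application code is your SQL, as we'll see later.
    ApplicationCode='CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("price" DOUBLE);',
    Inputs=[{
        "NamePrefix": "SOURCE_SQL_STREAM",  # referenced as SOURCE_SQL_STREAM_001 in SQL
        "KinesisStreamsInput": {
            "ResourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/InputStream",
            "RoleARN": "arn:aws:iam::123456789012:role/analytics-role",
        },
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [{"Name": "price", "SqlType": "DOUBLE", "Mapping": "$.price"}],
        },
    }],
    # Part 3: send the results of the SQL out to another Kinesis stream.
    Outputs=[{
        "Name": "DESTINATION_SQL_STREAM",
        "KinesisStreamsOutput": {
            "ResourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/OutputStream",
            "RoleARN": "arn:aws:iam::123456789012:role/analytics-role",
        },
        "DestinationSchema": {"RecordFormatType": "JSON"},
    }],
)

# Optional: the S3-backed reference table described above.
ka.add_application_reference_data_source(
    ApplicationName="my-analytics-app",
    CurrentApplicationVersionId=1,
    ReferenceDataSource={
        "TableName": "LOOKUP_TABLE",
        "S3ReferenceDataSource": {
            "BucketARN": "arn:aws:s3:::my-reference-bucket",
            "FileKey": "lookup.json",
            "ReferenceRoleARN": "arn:aws:iam::123456789012:role/analytics-role",
        },
        "ReferenceSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [{"Name": "sensor_type", "SqlType": "VARCHAR(16)", "Mapping": "$.sensor_type"}],
        },
    },
)
```

That NamePrefix, by the way, is why the SQL refers to the input as SOURCE_SQL_STREAM_001, which we’ll see again in the hands-on exercise.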
- Kinesis Analytics Costs; RANDOM_CUT_FOREST
Let’s talk about Kinesis Analytics’ cost model. It is serverless, so you just pay for the resources you consume and nothing else. It scales up for you automatically, and you don’t have to worry about the underlying resources needed to actually run your Kinesis Analytics applications. However, it’s not cheap, unlike some other serverless services that AWS offers. When I was making this course, Kinesis Analytics actually made up the bulk of the charges that I incurred.
So use it carefully, and do make sure to shut down any analytics jobs once you’re done using them, just to be safe. As far as security goes, you use IAM permissions to access the streaming source and destination services that you’re working with; you can configure IAM to allow your Kinesis Analytics application to talk to whatever upstream and downstream services it needs to communicate with. There’s also a cool feature in Kinesis Analytics called schema discovery, and that’s how the column names in your SQL are found. It can actually analyze the incoming data from your stream as you’re setting it up, so you can check that it worked. We’ll see that in practice as we go into our hands-on exercise, but it’s pretty cool: it can analyze an incoming stream, and right in the AWS console you can see how it’s trying to make sense of that data and impose a schema on top of it.
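That schema discovery is also exposed through the API. Here’s a hedged boto3 sketch using discover_input_schema, with placeholder ARNs, pointed at a stream that’s actively receiving data:

```python
import boto3

ka = boto3.client("kinesisanalytics")

# Sample whatever is currently flowing through the stream and infer a
# schema from it. The ARNs here are placeholders for your own resources.
resp = ka.discover_input_schema(
    ResourceARN="arn:aws:kinesis:us-east-1:123456789012:stream/CadabraOrders",
    RoleARN="arn:aws:iam::123456789012:role/analytics-role",
    InputStartingPositionConfiguration={"InputStartingPosition": "NOW"},
)

# Print the column names and SQL types it came up with.
for col in resp["InputSchema"]["RecordColumns"]:
    print(col["Name"], col["SqlType"])
```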
Either way, you can edit and correct the inferred schema as needed. It’s also worth talking about something called RANDOM_CUT_FOREST. This is a SQL function offered by Kinesis Data Analytics that you can use within your application for anomaly detection on any numeric columns in a stream. And you can tell that AWS is really proud of this, because they published a whole paper on it, which you can go look up if you want. It is a novel way of identifying outliers in a data set so you can handle them however you need to. An example they give in the paper is detecting anomalous subway ridership during the New York City Marathon.
So they have a stream of data of people going through the subway turnstiles, and they use Random Cut Forest to automatically identify anomalies in the rate of usage of the subway system. This is important because if you ever see a question about detecting anomalies or outliers in a stream of data, Random Cut Forest with Kinesis Analytics is very likely a good answer. And with that, let’s dive into a hands-on example, because this will all make a whole lot more sense when you see it in action.
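Before we do, here’s a hedged sketch of what invoking RANDOM_CUT_FOREST looks like, following the documented pattern; the application name and the column being scored are made up for illustration:

```python
import boto3

# The documented RANDOM_CUT_FOREST pattern: wrap the source stream in a
# cursor, and the function emits an ANOMALY_SCORE alongside each record.
# The column name "riders" and the application name are assumptions.
RCF_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "riders" INTEGER, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "RCF_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "riders", "ANOMALY_SCORE"
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM "riders" FROM "SOURCE_SQL_STREAM_001")));
"""

ka = boto3.client("kinesisanalytics")

# Apply the SQL to an existing application; the version id must match the
# application's current version, so fetch it first.
app = ka.describe_application(ApplicationName="subway-anomaly-detector")
ka.update_application(
    ApplicationName="subway-anomaly-detector",
    CurrentApplicationVersionId=app["ApplicationDetail"]["ApplicationVersionId"],
    ApplicationUpdate={"ApplicationCodeUpdate": RCF_SQL},
)
```

Each output record carries an ANOMALY_SCORE; the higher the score, the more unusual the record looks relative to recent data, so downstream you would typically alert on scores above some threshold.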
- [Exercise] Kinesis Analytics, Part 1
So our next hands-on activity is going to be the most complicated one in this course by far. It strings together a lot of the services that you’ll see crop up on the exam and illustrates how you can piece them together to do something really interesting and useful. So it’ll be worth the effort, I promise you that. Let’s take a step back and talk about what we’re trying to accomplish with this particular application. Imagine I’m the owner of Cadabra.com, and I want to be alerted if something unusual is happening on my site. Let’s define that as getting more than ten orders in the span of 10 seconds (individual orders, mind you). Assuming that’s way out of whack with the normal rate of orders I receive day to day, this can probably mean one of two things: either my site is under some sort of weird attack, where someone’s using a fraudulent credit card to buy stuff on my website over and over again automatically, or it’s legitimate and I’m about to get rich. Either way, I want to know about it, right? So here’s what this system does.
First of all, it starts by monitoring our order information on our EC2 instance, where we’re generating those order logs that get published into a Kinesis data stream, like it does already. Then we’re going to introduce Amazon Kinesis Data Analytics. This will have some SQL that sits on top of that stream of data and keeps an eye on it. Basically, it’s going to have a sliding window of 10 seconds in which it keeps count of how many orders are coming in, and if that order count exceeds ten items in 10 seconds, it will in turn fire off a trigger that gets fed into another data stream, which in turn triggers AWS Lambda. And Lambda will say, oh, something’s unusual here, I need to do something about it. Lambda can then turn around and talk to Amazon’s Simple Notification Service, which will actually fire off a text message to your cell phone when anomalous activity is detected. And we’re really going to build this.
We’re going to build out this whole thing, generate an anomalous event on the server side, and if everything’s set up right, you will actually receive a text message on your cell phone when that happens. So, fun stuff; let’s dive in and make it happen. First, fire up the AWS console, obviously with your own account, and search for Kinesis. Now, if you did delete your CadabraOrders stream earlier to save money, make sure you remember to recreate it; again, one shard is all you need there. We’re actually going to need a second stream here as well, to feed our alarms into. CadabraOrders, if you remember from the diagram, receives the raw data from our server logs, but once Kinesis Data Analytics does its thing, it needs to feed its results to another data stream.
That’s really all it can talk to. So we need a data stream to serve as the output of the Kinesis Analytics SQL function that monitors the input stream. Let’s go ahead and create a data stream, and let’s call this one OrderRateAlarms. All we need is one shard. And again, this will cost real money, but we only need this stream while we’re working on this particular exercise, so we’ll delete it as soon as we’re done; we’re talking about pennies that you’ll be spending for this one. Create Kinesis stream. All right, that’s spinning up and getting created. Meanwhile, let’s start creating our Kinesis Data Analytics application. If we go to Data Analytics here, we’ll click on Create Application and give it a name; let’s call it TransactionRateMonitor. I won’t bother with the description, but feel free to write some crafty prose if you wish. The runtime will be SQL.
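As an aside, if you’d rather script that stream setup than click through the console, here’s a boto3 sketch, assuming the same name and shard count we just used:

```python
import boto3

kinesis = boto3.client("kinesis")

# One shard is plenty for this exercise. Shards bill by the hour whether
# or not data flows, so remember to delete the stream when you're done.
kinesis.create_stream(StreamName="OrderRateAlarms", ShardCount=1)

# create_stream is asynchronous; block until the stream is ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="OrderRateAlarms")
```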
Back in the console, it’s interesting that Flink is also an option for the runtime here, but we’ll stick with SQL. Click Create Application. All right, now we will connect streaming data to our application, and we will choose the CadabraOrders stream. This is the input into our Kinesis Analytics function; that is what it’s going to be monitoring. Very easy to do. And it tells you, very handily, that you can refer to that stream as SOURCE_SQL_STREAM_001 in your SQL queries. Now, I’m not going to get into a lot of depth about how the actual SQL for this thing works; I’ll explain it at a high level. But keep in mind, on the exam itself you will never be expected to actually write code, or even understand code, so I’m not going to clutter your brain up with how the code works. It would do more harm than good, so we’ll have to take some of this on faith.

Let’s scroll down. These are all fine; we’re not going to pre-process our data with Lambda, although we could. Access permissions we’ll allow it to create automatically as well. And here is that neat schema discovery feature, which can discover the schema of your input data automatically. So let’s see if we can get that to work; click on Discover Schema. Now, the problem is we don’t have any data streaming in right now for it to actually play with, so let’s push some data into that stream so it has something to work with. Go back to EC2, fire up our EC2 instance, log in as ec2-user, and run sudo ./LogGenerator.py to kick off 100 lines of data. Now, because I did run this earlier today, it actually picked up some data already, but you’ll probably have to sit here and wait for that to get processed before schema discovery works for you. If you want an indication of when that might be, we can always tail the Kinesis agent’s log to find out.
And we’ll just wait for that to pick up those 100 new records and process them. Looks like it’s starting to do that right now; I can see that the records-sent count is increasing, so some data is flowing. That’s a good thing. Ctrl-C out of that. And let’s retry schema discovery, just to make sure it has the latest data, and see what it found. So, that’s interesting: we can see that it did pick up some data coming in here, and it actually picked up the field names as well, from some earlier work that we did. So that’s pretty cool; it works. Let’s hit Save and Continue, because everything seems to be in order. We now have all of our fields identified, so we can refer to them in our SQL. Next we need to go to the SQL editor and actually write the code that will monitor this stream of data. Sure, let’s go ahead and start the application, why not? Now, to make life easier, I have provided the SQL you need for this in your course materials: go into the transaction rate alarm folder and you’ll see the analytics query text file there.
Let’s walk through what’s going on here, at a really high level, like I said. We’re actually creating two different in-application streams within this SQL. First, we’re monitoring the incoming stream, SOURCE_SQL_STREAM_001; remember, that’s what it told us the name of our source data was. And this sets up a ten-second sliding window, so it’s always looking at the previous 10 seconds and counting the orders received during that window. If the order count is greater than or equal to ten during that ten-second window, it places a record into a new in-application stream called ALARM_STREAM. Now, the reason we feed this into a second stream is that we don’t want alarms firing off continually; otherwise my cell phone would blow up, and I’d just get constant text messages whenever the order rate was beyond the limit we’ve set. So this ALARM_STREAM is further monitored, so that we only act on it once every minute. That means that, at most, I’ll receive one text per minute while things are in an alarm state, which will be much friendlier to my cell phone battery. So go ahead and hit Ctrl-A to select the query and Ctrl-C to copy it.
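For reference, the query looks roughly like this; this is a hedged sketch consistent with the walkthrough above, so trust the file in your course materials if any details differ. I’m holding it in a Python string only because you could also apply it with kinesisanalytics.update_application() rather than pasting it into the console editor:

```python
# A hedged sketch of the alarm SQL; stream and column names are assumptions.
ALARM_SQL = """
-- Stream 1: count orders over a 10-second sliding window, and emit a
-- record into ALARM_STREAM whenever that count reaches 10.
CREATE OR REPLACE STREAM "ALARM_STREAM" ("order_count" INTEGER);

CREATE OR REPLACE PUMP "ALARM_PUMP" AS
    INSERT INTO "ALARM_STREAM"
    SELECT STREAM "order_count"
    FROM (
        SELECT STREAM COUNT(*) OVER TEN_SECOND_WINDOW AS "order_count"
        FROM "SOURCE_SQL_STREAM_001"
        WINDOW TEN_SECOND_WINDOW AS (RANGE INTERVAL '10' SECOND PRECEDING))
    WHERE "order_count" >= 10;

-- Stream 2: throttle the alarms so downstream consumers see at most one
-- record per minute while we stay in an alarm state.
CREATE OR REPLACE STREAM "TRIGGER_COUNT_STREAM" (
    "order_count" INTEGER, "trigger_count" INTEGER);

CREATE OR REPLACE PUMP "TRIGGER_COUNT_PUMP" AS
    INSERT INTO "TRIGGER_COUNT_STREAM"
    SELECT STREAM "order_count", "trigger_count"
    FROM (
        SELECT STREAM "order_count",
               COUNT(*) OVER ONE_MINUTE_WINDOW AS "trigger_count"
        FROM "ALARM_STREAM"
        WINDOW ONE_MINUTE_WINDOW AS (RANGE INTERVAL '1' MINUTE PRECEDING))
    WHERE "trigger_count" = 1;
"""
```

The WHERE "trigger_count" = 1 filter is what does the throttling: only the first alarm record within any trailing one-minute window makes it through to TRIGGER_COUNT_STREAM.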
And we will paste it in here: Ctrl-A to select everything that’s in the editor already, and Ctrl-V to replace it all with that code. Cool. Let’s go ahead and click Save and run SQL to save and apply it, and it will actually start running live, so we can see the results in real time. All right, it’s running, so let’s kick off some more data for it to play with. Look at that, it’s already picking up some data here; it looks like we’re already in an alarm condition, because I did in fact send off more than ten events in 10 seconds. And you can see how that count just increases over time as I send in more and more data. If you recall, we actually sent in 100 events, so obviously that’s going to keep going up to like 99; during this entire period it’s going to be alarming, and that’s why we had that second stream there, to limit things to one alarm per minute. But it looks like it’s actually working; that’s pretty cool. If you’d like, we can go back and kick off some more events here and watch those come in as well. You can see new results are added here automatically every two to ten seconds, so this is a live view, for the most part. It is sampled, however, so the results can sometimes be a little bit confusing; but basically, if you see activity here, things are probably working pretty well. Let’s click on "scroll to bottom when new results arrive" so we can see that more easily. And yes, it does look like some new data has arrived; the current time on my clock is twelve past the hour, so it is in fact working. That is very cool. All right, let’s continue to build out the rest of our application here.
- [Exercise] Kinesis Analytics, Part 2
So now we need to click on Destination and tie this to a destination. We have our alarm here; we know when we’re in some sort of an alarm condition. What do we do with that information? Well, you only have a few options for where to send that data: another Kinesis stream or a Firehose delivery stream. Since we want this to be real-time in nature, we’re going to connect it to another Kinesis stream, the one that we created earlier in this exercise. So go ahead and click on Connect to a destination, scroll up, and select the OrderRateAlarms stream that we created earlier as the output. We also need to specify the in-application stream that we are outputting from, so manually select TRIGGER_COUNT_STREAM, which is the final output of our SQL. We can output it in JSON; that sounds good, because Lambda will like that format. And we can just let it automatically manage the permissions for us. Save and continue. Okay, so at this point we have a new Kinesis stream that receives data whenever we’re in an alarm state, at most once per minute. Now we can use that to trigger a Lambda function that will do something interesting with that information. So we’ve pretty much got everything set up here on the Kinesis side.
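For the record, the same destination hookup can be scripted; here’s a hedged boto3 sketch, with placeholder ARNs:

```python
import boto3

ka = boto3.client("kinesisanalytics")

# The console's "Connect to a destination" step, done programmatically:
# export the in-application TRIGGER_COUNT_STREAM to the OrderRateAlarms
# Kinesis stream as JSON. The ARNs are placeholders for your own resources.
app = ka.describe_application(ApplicationName="TransactionRateMonitor")
ka.add_application_output(
    ApplicationName="TransactionRateMonitor",
    CurrentApplicationVersionId=app["ApplicationDetail"]["ApplicationVersionId"],
    Output={
        "Name": "TRIGGER_COUNT_STREAM",
        "KinesisStreamsOutput": {
            "ResourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/OrderRateAlarms",
            "RoleARN": "arn:aws:iam::123456789012:role/analytics-role",
        },
        "DestinationSchema": {"RecordFormatType": "JSON"},
    },
)
```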
Let’s set up our Lambda function next, back at the dashboard, under Lambda. Actually, we’re going to need an IAM role for our Lambda function first, so let’s get that out of the way before we set up the function itself. Go back to IAM and create a new role. It will be for the Lambda service, and let’s attach the following policies to it: AWSLambdaKinesisExecutionRole, which allows Lambda to talk to Kinesis; AmazonSNSFullAccess, because we want our Lambda function to fire off SNS events; CloudWatchLogsFullAccess, so we can log our results to CloudWatch; and AWSLambdaBasicExecutionRole, the basic permissions for Lambda itself. And that should do it. We don’t need any tags. Give it a name; let’s call it LambdaKinesisSNS, and create that role. All right, so now we can go back to Lambda and actually set up that function. Create function; let’s call it TransactionRateAlarm. We’re going to use Python 2.7 for this, and we’ll choose the role that we just created, which is LambdaKinesisSNS. Create function. Once again, we will wire Kinesis into our Lambda function as a trigger: click on Kinesis on the left-hand side there, and scroll down to configure it. The stream that we’re actually listening on for updates is our OrderRateAlarms stream; everything else is fine, so go ahead and add that. Now, let’s configure the Lambda function itself. If we scroll down, we do need to increase the timeout on it: 3 seconds might not be enough and might lead to some errors, so let’s go ahead and make this 1 minute instead of 3 seconds.
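Incidentally, what the console just did for us is create an event source mapping; here’s a hedged boto3 sketch of the equivalent calls, with a placeholder stream ARN:

```python
import boto3

lam = boto3.client("lambda")

# What the console's Kinesis trigger sets up under the hood: an event
# source mapping from the alarm stream to our function.
lam.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/OrderRateAlarms",
    FunctionName="TransactionRateAlarm",
    StartingPosition="LATEST",
    BatchSize=100,
)

# And the one-minute timeout we just set in the console:
lam.update_function_configuration(FunctionName="TransactionRateAlarm", Timeout=60)
```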
And now we can paste in the Lambda function itself. Again, that is in your course materials: under the transaction rate alarm folder there’s a lambda text file, so go ahead and open that up and we’ll put it in. You can see it’s very simple: we are using the Boto3 library to create an SNS client, we’ll fill in an SNS topic ARN once we create the topic, and the function itself will simply publish a message that says to investigate a sudden surge in orders. And that’s pretty much it: whenever it gets triggered by that Kinesis stream, it fires off a notification, and SNS will deliver it as a text message. We just have to set up that SNS topic, so go ahead and copy all this in (Ctrl-A, Ctrl-C, Ctrl-V), and let’s go take care of the topic next.
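For reference, the function we just pasted looks roughly like this; a hedged sketch of the course’s lambda file, where the topic ARN is a placeholder we’ll fill in as soon as the topic exists:

```python
import boto3

# Placeholder; paste in the real topic ARN once the topic is created below.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:CadabraAlarms"

sns = boto3.client("sns")

def lambda_handler(event, context):
    # Triggered by records arriving on the OrderRateAlarms stream, which
    # the analytics SQL throttles to at most about one record per minute.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message="Investigate sudden surge in orders",
    )
    return "Sent alarm notification"
```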
Open a new window, so we don’t lose what we did there, and head back into the console; go to SNS and see if we can get this set up. Go to Create Topic. The topic name should be CadabraAlarms, and the display name has to be something shorter, so we’ll just call it Cadabra. Create topic. All right, now we need to create a subscription for this topic: basically, what is it going to do when it gets fired? Let’s have it fire off an SMS message; that’s the protocol for actually sending text messages to your cell phone. So go ahead and type your cell phone number in here (I’ll enter mine; make sure you enter yours) and hit Create Subscription. Now, to test it out, we can say Publish to topic. Let’s make the subject "test" and the message, I don’t know, "this is a test." Hit Publish and let’s see if my phone blows up; let me turn on my ringer so you’ll actually hear it. All right, Publish Message. Hey, I actually got a message: it says, from Cadabra, this is a test. Cool. I don’t know why that excites me so much, but it’s kind of neat that AWS can actually send a text message to your cell phone that easily. It’s cool and frightening at the same time somehow. But anyway, that’s all set up now.
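If you prefer to script it, the topic, subscription, and test publish look roughly like this in boto3; use your own phone number, in E.164 format:

```python
import boto3

sns = boto3.client("sns")

# Create the topic; the display name is what appears as the SMS sender.
topic = sns.create_topic(Name="CadabraAlarms",
                         Attributes={"DisplayName": "Cadabra"})
topic_arn = topic["TopicArn"]

# Subscribe a phone number over the SMS protocol (placeholder number).
sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint="+15555550123")

# The same test publish we just did in the console.
sns.publish(TopicArn=topic_arn, Message="This is a test")
```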
So we’re going to need that topic ARN, right? So we can put it back into our Lambda function. Copy the topic ARN, go back to Lambda, and paste it in there; obviously, yours will be different. Hit save. All right, it should all be up and running at this point, so let’s try it out by kicking off some more fake orders. Back on our EC2 instance, we’ll run the log generator again to dump in 100 orders all at once, which should trigger our alarm. It might take a minute or so for it to get triggered, though, because a lot has to happen here, right? Right now we have a Kinesis agent sitting on this instance that pushes that data into a Kinesis stream; that Kinesis stream is being monitored by our Kinesis Data Analytics application, which will fire off an alarm record to another Kinesis stream; that will in turn trigger a Lambda function, which in turn sends an event through Amazon SNS to deliver a text message to my cell phone.
So let’s see if that actually shows up; we’re listening for a ding on my phone shortly. And there it is: I got a text message that says, Cadabra: investigate sudden surge in orders. It actually worked; that’s really cool. So, yeah, in the space of about 20 minutes here, we’ve actually put together a full end-to-end application that implements an operational monitor on this stream of events using AWS. And it’s a pretty complicated example, so it’s a good illustration of how you can tie these different services together to build a larger, more complex, and actually useful system. This is exactly the sort of thing you can expect to see on the exam; obviously the details of the code are not going to be important, but how the pieces fit together is. Anyway, let’s make sure we clean up our mess here. First of all, we can go back to our Kinesis streams and delete all of them; we’re done with Kinesis at this point in the course, so go ahead and wipe these guys out. CadabraOrders can go away; you can just select these from the dashboard here and say delete. I’m getting more texts here, because another minute has passed and it’s still complaining that I’m in an alarm state, so let’s go ahead and delete these to make sure those stop. All right, delete; and those are getting deleted right now. We can also clean up our Lambda function if you want. It kind of makes me nervous having that SNS subscription set up to my cell phone as well.
So let’s kill that as well: go to Topics and delete the CadabraAlarms topic. Also, very important: make sure that you shut down your Kinesis Analytics application too, or you’ll get a nasty surprise on your bill. Go back to the Kinesis dashboard, select Data Analytics, select your application, and delete that application; that way you’ll make sure you don’t get billed for an application that you’re not using. And if you want, you can also delete the Lambda function, but you’re not going to get charged for that unless it’s being used. So at this point, we should be in a safe state as far as billing and cell phone usage go. All right, congratulations again, guys. That was a really interesting application to build out, and I think it’s really cool that it worked. All right, onward.
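If you’d rather script the teardown, here’s a hedged sketch assuming the names we used in this exercise:

```python
import boto3

# Both Kinesis streams; shards bill hourly whether or not data flows.
kinesis = boto3.client("kinesis")
for stream in ("CadabraOrders", "OrderRateAlarms"):
    kinesis.delete_stream(StreamName=stream)

# The analytics application; delete_application needs the creation timestamp.
ka = boto3.client("kinesisanalytics")
app = ka.describe_application(ApplicationName="TransactionRateMonitor")
ka.delete_application(
    ApplicationName="TransactionRateMonitor",
    CreateTimestamp=app["ApplicationDetail"]["CreateTimestamp"],
)

# The SNS topic (deleting it also stops the text messages).
sns = boto3.client("sns")
sns.delete_topic(TopicArn="arn:aws:sns:us-east-1:123456789012:CadabraAlarms")
```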