Amazon AWS Certified Data Analytics Specialty – Domain 3: Processing
- What is AWS Lambda?
Let’s talk about AWS Lambda. Lambda is a serverless data processing tool, so let’s talk about what that means. What is Lambda? Basically, it’s a way to run little snippets of code in the cloud. If you have a bit of code in pretty much any language you can imagine, Lambda can run it for you without you having to worry about what servers to run it on. You don’t have to go and provision a bunch of EC2 servers to run your code; Lambda does that for you. It handles the actual execution of your code, and you just think about what the code itself does. So it’s a serverless method of running little bits of code that scales out continuously without you having to do anything. Lambda will automatically scale out the hardware it’s running on as needed, depending on how much data is coming into it. You can see how this fits into a big data world: if you have a lot of data flowing in, Lambda will automatically scale itself up and down to handle the processing of that data dynamically. In the context of big data, it’s often used to process data as it moves from one service to another. Some services don’t talk directly to other services in AWS, but Lambda can be used as the glue between pretty much anything.
So Lambda can sit there and get triggered by some other service sending data into it, like a Kinesis data stream, reformat that information into the format required by some other service, send the data along for further processing, and maybe retrieve a result and send it back. Lambda is a way of running little stateless bits of code in the cloud, and in big data it’s often used just to glue different services together.
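To make that concrete, here’s a minimal sketch of that glue pattern in Python, the shape every Lambda function shares; the reshaping logic and the downstream call are hypothetical placeholders:

```python
import json

def lambda_handler(event, context):
    # 'event' carries whatever the triggering service sends: a Kinesis batch,
    # an S3 notification, an API Gateway request, and so on.
    reshaped = {"payload": event, "note": "reformatted for the downstream service"}
    # ...forward 'reshaped' to the downstream service here (e.g., via boto3)...
    return {"statusCode": 200, "body": json.dumps(reshaped)}
```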
Let’s look at an example. Lambda isn’t only used for big data; a very common use is a serverless website. It’s actually possible to build a website without having any servers that you manage at all, and you can do this with AWS Lambda. Now, you normally can’t do this with a really dynamic website, but if you can build your site from static HTML with Ajax calls embedded in it, you can serve that from S3 or something similar, and then the Ajax calls are all you need to handle. So maybe you have an API Gateway in Amazon that serves as the wall between outside clients and the interior of your system. Let’s say, for example, someone needs to log in to your website. That login request goes through the API Gateway, which in turn sends it off to AWS Lambda. Lambda says, okay, the website wants this person to log in; it can turn around and craft a request to Amazon Cognito asking, do you authenticate this user or not?
Amazon Cognito will come back and say, sure, here’s your token. Lambda can then format that result and send it back to the website. So Lambda is the glue there between the website API and Cognito on the back end. Similarly, let’s say we’re building a chat application on the front end, and we need to retrieve the chat history for a user ID once they’ve logged in. Again, the API Gateway receives that request; Lambda is triggered by it, figures out how to craft the corresponding request to DynamoDB, talks to DynamoDB to get that chat history, and sends it back through the API Gateway to the website. So here you see Lambda often fills the role of glue between different services.
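Here’s a minimal sketch of that login glue, assuming a Cognito user pool app client configured for the USER_PASSWORD_AUTH flow; the client ID and the event field names are hypothetical placeholders:

```python
import boto3

cognito = boto3.client("cognito-idp")

def lambda_handler(event, context):
    # API Gateway passes the login request through to us; we reformat it
    # into an authentication request that Cognito understands.
    resp = cognito.initiate_auth(
        ClientId="YOUR_APP_CLIENT_ID",  # hypothetical app client ID
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={
            "USERNAME": event["username"],
            "PASSWORD": event["password"],
        },
    )
    # Format Cognito's token for the website and hand it back through API Gateway.
    return {"statusCode": 200, "body": resp["AuthenticationResult"]["IdToken"]}
```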
That’s exactly how we’re going to use it in our own application here; very soon we’re going to build this out. You’ll recall that in our order history app previously, we used a Kinesis consumer application running on an EC2 host. That served as the glue between Kinesis Data Streams and Amazon DynamoDB. Now, we don’t want to have to manage EC2 servers for such a simple task, so we can use Lambda instead; that’s a much better choice. We’re going to build a Lambda function that sits between our data stream, which is receiving incoming server logs, and DynamoDB, where we want to store that data longer term. Lambda is just going to sit there waiting for trigger events from the data stream.
Each trigger event will actually contain a batch of records, and the function needs to go through, extract each individual record, and then turn around and write it into DynamoDB. So again, Lambda is just the glue between the data stream and some other service, in this case DynamoDB. Later in the course, we’ll build out a transaction rate alarm whose only purpose is to notify us when something unusual is happening in our system. In that case, we have a Kinesis data stream receiving events that say something weird is going on that requires someone’s attention. Lambda is triggered by those data stream events, and it turns around and crafts an SNS request to send a message to your cell phone, notifying you that something needs your attention. So, yet again, Lambda is the glue between two services that don’t talk directly to each other. But because Lambda is just code, it can transform that data in any way, and it can talk to any back-end service you can imagine. It’s the magic glue that lets you put different components together in creative ways, and this is just one example of that.
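As a preview, here’s a minimal sketch of that order history function, assuming the stream carries JSON server-log records; the table name is a hypothetical placeholder:

```python
import base64
import json
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("OrderHistory")  # hypothetical table name

def lambda_handler(event, context):
    # Each invocation delivers a batch of Kinesis records under event["Records"].
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded; DynamoDB needs Decimal, not float.
        payload = json.loads(
            base64.b64decode(record["kinesis"]["data"]), parse_float=Decimal
        )
        table.put_item(Item=payload)
    return {"recordsProcessed": len(event["Records"])}
```

For the transaction rate alarm, the put_item call would simply be swapped for an sns.publish call against a topic your phone number is subscribed to.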
- Lambda Integration – Part 1
So what’s the big deal about serverless processing of your code? Why not just run it on a server? Well, even though Lambda is still running on servers under the hood, they’re not servers that you manage, and that’s a big deal. Amazon has people with pagers keeping those servers working and dealing with all the patches and monitoring and hardware failures, so you don’t have to. You don’t think about that at all when you’re running AWS Lambda; all you think about is the code. And even though individual servers can be cheap, scaling those servers out can get very expensive, very fast. If you scale out a fleet to run your function at peak capacity, you’ll end up paying for a lot of processing time you’re not even using off-peak. With Lambda, you only pay for the processing time you actually consume, and that can be huge from a cost-saving standpoint.
And although it’s not as big of a deal in the big data world, if you’re dealing with something like a serverless website, Lambda does make it easier to split up your development between front-end and back-end work: all the back-end work happens in your Lambda functions while your front-end developers worry about the static client website. So there are several main uses of Lambda. One is real-time file processing: as new data comes into S3 or some other destination within AWS, Lambda can be triggered to process that file however you want, and that can include rudimentary ETL (extract, transform, and load). Lambda is just code; you can do whatever you want in there, and that includes ETL on incoming data. You can also do real-time stream processing, as we discussed, just listening for new events coming in on a Kinesis stream or a Kinesis Data Firehose stream.
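For the file-processing use, here’s a minimal sketch of a handler wired to an S3 object-created trigger; the uppercase transform and the processed/ output prefix are hypothetical stand-ins for real ETL logic:

```python
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # S3 object-created events list the bucket and (URL-encoded) key of each new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Hypothetical transform: uppercase the file and write it back under a new prefix.
        s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=body.upper())
```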
You can also use Lambda as a Cron replacement. And this is interesting, you can use time as a trigger for a Lambda function as well. So just like you can trigger events using Cron on a Linux system, you can trigger off Lambda events on some fixed schedule as well. So if you need something to happen once a day or once an hour, or once a minute or whatever schedule you want, you can actually set that up to call your Lambda function periodically.
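Under the hood, that schedule is a CloudWatch Events / EventBridge rule. Here’s a minimal sketch of wiring one up with boto3; the rule name, function name, and ARN are hypothetical placeholders:

```python
import boto3

events = boto3.client("events")
lam = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:my-batch-job"

# Fire once an hour; cron expressions like "cron(0 12 * * ? *)" also work.
rule = events.put_rule(Name="hourly-batch-job", ScheduleExpression="rate(1 hour)")

# Let the events service invoke the function, then point the rule at it.
lam.add_permission(
    FunctionName="my-batch-job",
    StatementId="allow-eventbridge",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(Rule="hourly-batch-job", Targets=[{"Id": "1", "Arn": FUNCTION_ARN}])
```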
And that might come in handy for kicking off daily batch jobs of something or something like that. And also it can process arbitrary events from AWS services. And as we’ll see, there’s a very long list of AWS services that can generate triggers to Lambda. Pretty much any language you want to develop in is supported by Lambda, which is great node JS, Python, Java, C, Sharp, Go, PowerShell, or Ruby. So you can process data and do whatever you want with it in pretty much whatever language you want. And this is huge because it means that any system that has an interface in any of these languages is fair game for your lambda code. So remember, lambda’s code and with code anything is possible. So you don’t have to just limit yourself to serving as a glue or a transformation between different AWS services.
You could actually talk to services outside of AWS if you want, or pull in other libraries that let you perform more interesting transformations on that data. So really, you can use your imagination when you’re using Lambda; it makes a lot of things possible, all without managing servers in the process, which is awesome. Now, if you are using Lambda as the glue between AWS services, there’s a long list of services that can trigger Lambda events for you. Any of these services can fire off triggers to Lambda functions whenever something interesting happens.
And this is just a partial list, guys: there’s also Alexa, and you can manually invoke Lambda functions as well. Let’s call out a few of the more common ones that are relevant to big data, though, and those would be S3, Kinesis, DynamoDB, SNS, SQS, and IoT. You can integrate Lambda with S3, so whenever a predefined thing happens to an object within S3, that can act as event data that invokes a Lambda function. For example, when a new object is created in S3, you might have a Lambda trigger that picks up the creation of that object, parses out the information, and sticks it in Redshift or something. Same story with DynamoDB: every time something changes in a DynamoDB table, that can trigger event data that invokes a Lambda function as well, which allows for real-time, event-driven processing of the data arriving in DynamoDB tables. You can integrate Lambda with Kinesis streams, and the Lambda function can read records from a stream and process them accordingly. Now, under the hood, Lambda is actually polling the Kinesis stream; the stream is not pushing data into Lambda.
That can be an important distinction in some contexts. So remember, when you have Kinesis streams talking to Lambda, Kinesis isn’t actually pushing that data into Lambda as the architectural diagrams might suggest; Lambda polls the stream periodically and collects information from it in batches. You can also integrate Lambda with AWS IoT: when a device starts sending data to the IoT service, that can invoke Lambda and process the data as defined in your Lambda function. And it integrates with Kinesis Data Firehose: in that case, Lambda can transform the data and deliver the transformed data to S3, Redshift, Elasticsearch, or whatever you want. It can output to pretty much anything, because you can write code to do whatever you want in response to these triggers. All AWS services have APIs you can call, so really the sky is the limit as to what you can call downstream from your Lambda function.
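Here’s a minimal sketch of a Firehose transformation function; Firehose expects each record back with the same recordId, a result status, and base64-encoded data. The uppercase transform is a hypothetical stand-in:

```python
import base64

def lambda_handler(event, context):
    # Firehose hands us a batch of records under event["records"].
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        transformed = raw.upper() + "\n"  # hypothetical transform
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped" or "ProcessingFailed" are the alternatives
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```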
Those are just the services that can trigger a Lambda function for you automatically. Now, as long as these services are all under the same account, you can set up IAM roles to allow Lambda to access them. So as long as all the services you’re using are under that same account, Lambda can talk to pretty much any of them if you set up the proper IAM role.
- Lambda Integration – Part 2
Let’s dive deeper into some specific examples here. You can integrate Lambda with the Amazon Elasticsearch Service; here’s an example of how that might work. Imagine you have some data being sent into Amazon S3, like a data lake sort of thing. Object creation in S3 can trigger a Lambda function that turns around, processes that data, and sends it into the Amazon Elasticsearch Service. So this is one way in which Lambda can serve as the glue between S3 and, in this case, the Amazon Elasticsearch Service, to process and analyze that data and make it searchable.
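A minimal sketch of the indexing step, assuming a hypothetical domain endpoint and index name; in practice the request would also need to be signed with SigV4, or the domain’s access policy would need to allow the Lambda role:

```python
import json
import urllib.request

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical

def index_document(doc_id, doc):
    # PUT one JSON document into the "logs" index (index name is an assumption).
    req = urllib.request.Request(
        url=f"{ES_ENDPOINT}/logs/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req).read()
```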
Another example is using Lambda with AWS Data Pipeline. In this example, an object comes into S3, and we trigger a Lambda function that kicks off a data pipeline to process that data further. So you can use Lambda to activate the pipeline when new data is committed to an S3 bucket, without managing any scheduling; it just kicks off automatically as the data is received. Normally with Data Pipeline, you schedule activities, defining preconditions that check whether data exists on S3 and then allocating resources accordingly. But using Lambda is a better mechanism, because the pipeline can be activated at any time, not on a fixed schedule. You can say: any time I see new data coming into S3, go ahead and kick off this pipeline to perform some complex sequence of steps on that data automatically. It frees you from a fixed schedule and just runs things as needed, which might be more often, might be less often, but either way is a better use of resources.
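A minimal sketch of that activation; the pipeline ID is a hypothetical placeholder for a pipeline you’ve already defined:

```python
import boto3

datapipeline = boto3.client("datapipeline")

def lambda_handler(event, context):
    # Triggered by S3 object creation: activate the pre-defined pipeline on demand
    # instead of on a fixed schedule.
    datapipeline.activate_pipeline(pipelineId="df-0123456789EXAMPLE")
```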
So let’s look at another example, where we use Lambda to load data into Redshift efficiently. The best practice for loading data into Redshift is the COPY command, which we’ll talk about later in the course; it lets you load batches of data into Redshift in parallel. Now, if we were just going to copy data in one record at a time, we could do that with a trigger from S3 that says: some new data was received in S3, Lambda, go deal with it. Your Lambda function would then turn around and insert that record into your Redshift database. But that’s not very efficient; it’s better to batch things up and send them in together in parallel. The problem here is that Lambda cannot hold any stateful information: there’s no way to pass information from one invocation of Lambda to another. The reason is that your Lambda function could be deployed on many different servers, all running concurrently, so they can’t easily share information between themselves.
So how do we keep track of where we left off? How do we know when enough new data has arrived to build up a big enough batch to send into Redshift with a COPY command? In this situation, we can use DynamoDB. Lambda can communicate with a DynamoDB table to keep track of which table the data should go into and how much has accumulated since the last write, and from that it can figure out when it’s time to batch things up and send them into Redshift. So in this example, S3 triggers an event to Lambda; Lambda uses DynamoDB to keep track of how much data has been received so far; and when we hit some threshold, we batch it all up and COPY it into Redshift at once. That’s an example of using DynamoDB to hold stateful information that Lambda itself cannot. The important point, though, is that Lambda has to be stateless: Lambda itself cannot pass information from one Lambda call to the next.
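Here’s a minimal sketch of that bookkeeping, using a DynamoDB atomic counter; the table name, key, and threshold are hypothetical:

```python
import boto3

table = boto3.resource("dynamodb").Table("BatchTracker")  # hypothetical state table

def record_arrival(num_new_rows, threshold=10000):
    # Atomically bump a counter of rows staged in S3 since the last COPY.
    resp = table.update_item(
        Key={"dataset": "server_logs"},
        UpdateExpression="ADD row_count :n",
        ExpressionAttributeValues={":n": num_new_rows},
        ReturnValues="UPDATED_NEW",
    )
    if resp["Attributes"]["row_count"] >= threshold:
        # Time to batch: issue COPY mytable FROM 's3://bucket/prefix/' ... against
        # Redshift here, then reset the counter for the next batch.
        table.update_item(
            Key={"dataset": "server_logs"},
            UpdateExpression="SET row_count = :zero",
            ExpressionAttributeValues={":zero": 0},
        )
```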
Another very important integration example is Lambda and Kinesis, and there are some finer points to using Lambda with Kinesis Data Streams. First of all, you need to understand that your Lambda code receives an event containing a whole batch of stream records; you’re not invoked for every single record coming in from a Kinesis stream. When you set up the trigger between Kinesis Data Streams and your Lambda function, you specify a batch size, and the maximum is 10,000 records. Now, this is important in the context of the exam, because too large a batch size might cause your Lambda function to time out. Lambda functions can run for at most 900 seconds before timing out. So if you have some massive batch of 10,000 records coming in and you’re doing some complicated operation on each of them, and that takes too long, your Lambda function will time out; that counts as an error, and that can cause problems. Batches will also be split to stay under Lambda’s payload limit: if you try to send more than six megabytes of data to a single Lambda invocation, it gets split up automatically. So there’s a six-megabyte limit on the data coming into a Lambda function that you also need to think about.
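Here’s a minimal sketch of creating that trigger with an explicit batch size; the function name and stream ARN are hypothetical placeholders:

```python
import boto3

lam = boto3.client("lambda")

# Map a Kinesis stream onto a Lambda function; Lambda will poll the stream and
# invoke the function with up to BatchSize records at a time (maximum 10,000).
lam.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/server-logs",
    FunctionName="process-server-logs",
    BatchSize=100,
    StartingPosition="LATEST",
)
```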
Now here’s the thing: if Lambda fails for whatever reason, and that includes a timeout, it will keep retrying that batch until it succeeds or the data in it expires. And that can really back things up when it happens; an erroring batch can stall an entire Kinesis shard if you’re not handling errors properly. One way around this is to use more shards, so your processing isn’t totally held up by errors: even if one shard gets stuck, at least the rest will continue processing. But really, you don’t want this happening at all. You should be avoiding errors as much as possible, and too big a batch size or too small a timeout can lead to errors you could otherwise avoid.
So remember: you can specify a batch size on the trigger between Kinesis and Lambda, and a timeout on your Lambda function, and you need to pay attention to how those settings impact the performance of your larger system. This is really a big deal because Lambda processes shard data from Kinesis synchronously: as data is received from Kinesis, the batch gets sent off to Lambda, and your Kinesis shard just sits there waiting for the response. It does not do this asynchronously.
So again, if Kinesis gets stuck waiting for a response from Lambda, or because it’s retrying a Lambda function that keeps erroring out, that’s going to stall the shard. It’s important to understand this, because there will be questions on the exam along the lines of: here’s this beautiful system I set up between Lambda and Kinesis, and it’s not working; why might that be? These are some possible reasons. So keep all these finer points in mind; they are important for the exam.
- Lambda Costs, Promises, and Anti-Patterns
Let’s wrap up our discussion of Lambda, before we dive into some examples, with some more trivia about the service. And yes, these numbers can be important to know. Again, with the cost model, you only pay for what you use, and that’s awesome, because you’re not paying for any idle server capacity; you’re only paying for the resources your Lambda function actually uses. From a cost standpoint, that’s measured by how many requests you send to Lambda and how much memory your Lambda function consumes over time. It also has a generous free tier: up to 1 million requests per month and 400,000 GB-seconds of compute time are totally free. Beyond that point, you’ll be charged $0.20 per million requests and $0.0000166667 per GB-second.
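As a rough worked example with those rates (illustrative numbers only; pricing varies by region and changes over time):

```python
# 5 million invocations a month, 200 ms each, on a 512 MB function.
requests_per_month = 5_000_000
avg_duration_s = 0.2
memory_gb = 0.512

gb_seconds = requests_per_month * avg_duration_s * memory_gb   # 512,000 GB-s
billable_requests = max(0, requests_per_month - 1_000_000)     # free tier: 1M requests
billable_gb_s = max(0, gb_seconds - 400_000)                   # free tier: 400K GB-s

cost = billable_requests / 1_000_000 * 0.20 + billable_gb_s * 0.0000166667
print(f"~${cost:.2f}/month")  # about $0.80 for requests + $1.87 for compute, ~$2.67
```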
So Lambda is pretty cheap. Obviously, with a big enough website or application it can add up, but generally speaking it’s a pretty good deal. Some other promises made by Lambda: high availability. Under the hood, it uses replication and redundancy for that high availability, and there are no scheduled downtimes. And if your code does fail for some reason, Lambda will automatically retry it three times, so that in the worst-case scenario it can hit some other server. Scalability is effectively unlimited. There is a safety throttle of 1,000 concurrent executions per region, which you’re unlikely to run into, but if you do have an application that big, you can request that AWS increase that limit. So really, the sky is the limit for scalability.
And again, you don’t think about how that works; Amazon does. High performance: new functions are callable within seconds of being created, events are typically processed within milliseconds, and your code is cached automatically as necessary. Now, you can specify a timeout of your own, and hitting that timeout can cause problems like we talked about, because errors in Lambda can cause Kinesis shards to stall. The maximum timeout, and this is an important number to remember, is 900 seconds (15 minutes). So if you’re trying to do something in a Lambda function that takes more than 900 seconds, Lambda is not the tool for you; or maybe you need to reduce your batch size to lighten each individual function call. Here are some of the anti-patterns identified by the AWS big data white paper, so you should know them; however, all of them come with caveats. Remember, you have at most 900 seconds to run your function in Lambda, and for heavier processing, the white paper says you should be using EC2 instead.
Or you can chain multiple Lambda functions together, which is kind of a way around that limitation. If you do have to do something more complicated and you don’t want to use EC2, it’s possible to have one Lambda function do some simple part of the process and then pass its output to another Lambda function, and then another. So if you’re able to split up your processing into enough discrete steps, maybe you can do more heavy-duty processing on Lambda than you might think.
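A minimal sketch of that chaining, handing work off asynchronously so each stage stays well under the 900-second cap; the stage names and the first-stage logic are hypothetical:

```python
import json

import boto3

lam = boto3.client("lambda")

def do_first_stage(event):
    # Hypothetical first stage: pretend we enriched the incoming payload.
    return {"stage": 1, "data": event}

def lambda_handler(event, context):
    partial = do_first_stage(event)
    # Fire-and-forget invocation of the next stage in the chain.
    lam.invoke(
        FunctionName="stage-two",      # hypothetical downstream function
        InvocationType="Event",        # asynchronous
        Payload=json.dumps(partial),
    )
```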
Otherwise, the white paper also says that EC2 and CloudFront are a better choice for building dynamic websites. But again, there’s this whole world of serverless applications that use Ajax on the client side to construct websites that seem dynamic; you just shouldn’t be using Lambda to construct HTML on the fly for an entire site. And you can’t assume that Lambda will run in any given environment, so you can’t maintain information from one Lambda call to another; like we said, it is stateless. But you can store state information in DynamoDB or S3, like we talked about, and share it across your Lambda invocations. So it’s not entirely true that you can’t build a stateful application with Lambda; with DynamoDB or S3, you can sometimes get around that as well.