Amazon AWS Certified Data Analytics Specialty – Domain 3: Processing Part 2
- [Exercise] AWS Lambda
So now that we know what AWS Lambda is all about, we can finally finish up our order history app requirement here. We're just going to fill in this last missing piece. All right. We've created our fake server logs on an EC2 instance, we've put those into a Kinesis data stream using the Kinesis agent, and we have a DynamoDB table set up to receive that data, which an app could then talk to to retrieve customer order information for an end user. The missing piece here is AWS Lambda. Right now we have a consumer script running on our EC2 instance playing this role, but obviously that's not a very scalable solution. Rather than a script running on one EC2 instance, it would be much better to have an AWS Lambda function running in a serverless environment on AWS that can scale itself up pretty much indefinitely, on an as-needed basis. So let's go ahead and build out that Lambda function. Note that, in terms of security permissions, our Lambda function needs to consume data from a Kinesis stream and write data into a DynamoDB table. So with that in mind, let's start by creating an IAM role that our Lambda function can use to access the data it needs and to write it out to DynamoDB.
All right, so back to the AWS Management Console. Let's go to IAM, Identity and Access Management, then go to Roles and create a role. The service that will use this role is Lambda. Now we can click Next: Permissions to specify the policies we want to attach to this role. The ones we need are AmazonKinesisReadOnlyAccess, so we can read data from our stream; check that off. We also need access to DynamoDB, so for that we're going to give it AmazonDynamoDBFullAccess, and check that as well. Click Next. We don't need any tags, so we can do a final review here. The role needs a name; let's call it CadabraOrders, and you can give it a better description if you want. Double-check that the two policies we want are there, and hit Create role. So now we have the IAM role that our Lambda function will need to talk to the services it needs to communicate with.
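By the way, if you'd rather script this setup than click through the console, a rough boto3 equivalent might look like the sketch below. The role name and managed policies match what we just created; the trust policy and the overall flow are a hedged illustration, not part of the exercise itself.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="CadabraOrders",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role for the ProcessOrders Lambda function",
)

# Attach the same two managed policies we checked off in the console
for arn in [
    "arn:aws:iam::aws:policy/AmazonKinesisReadOnlyAccess",
    "arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
]:
    iam.attach_role_policy(RoleName="CadabraOrders", PolicyArn=arn)
```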
All right, back to the AWS console. Now we're going to go to the Lambda service. Cool. So let's click the big friendly Create a function button. We're going to author it from scratch, and we'll give it a name; let's call it, I don't know, ProcessOrders. For the runtime, we're going to use Python 2.7, because odds are, if you know a language, it's Python. For the role, we will choose an existing role, and that existing role is the one we just made, CadabraOrders. Create function. So we have a shiny new Lambda function here called ProcessOrders; all we have to do now is wire it up. What we want to do is add a trigger to actually feed data into this Lambda function from our Kinesis stream, CadabraOrders. So we need to wire up a trigger to monitor that Kinesis stream and feed data into our Lambda function as an input, and then we're going to write some Python code to turn around and store that data into DynamoDB. To get started, let's hit Add trigger. We want a Kinesis trigger, and we will select our CadabraOrders stream. The rest of the defaults here are pretty much okay: we don't want a consumer, and the default batch size is fine.
The default starting position is fine too, so we're just going to hit Add. And that appears to have worked. Now we need to give this thing some code to actually process the data coming in from Kinesis; you can see that the Kinesis trigger has been wired into our Lambda function now. So let's click on the ProcessOrders function itself and scroll down to the function code. It has some boilerplate code in here that you can look at and start from, but just to save you some typing, I've given you a copy of the complete Lambda function code within the course materials. So go to your course material download, go into the order history folder, and from there, there is a Lambda function text file. Just open that up, select all, copy it, and paste it in, replacing what's in there already. If you look at this code, it's very similar to the consumer script that we ran on our EC2 instance earlier; it's just deserializing the data a little bit differently. We have this extra code in here to decode the incoming stream that's being fed into Lambda.
But once that's done, it's basically the same thing. We just go through every line of data coming in, parse it out, and then turn around and store it within DynamoDB using the batch_writer object that's part of the DynamoDB Table object from boto3. So again, we're just using the boto3 library to simplify talking to DynamoDB here. Pretty simple what's going on; probably not a whole lot of point in going through it in too much detail. It's basically the same script you looked at before, just with a little more wiring to parse out the data that Lambda hands it. All right, so let's hit Save to store that, and at this point we should be pretty much ready to go. So let's review: everything has been successfully updated and saved, our Kinesis trigger is in place feeding data to the ProcessOrders function, and that function is handling storing the data into DynamoDB. We don't have to explicitly add a destination for DynamoDB; the script itself is taking care of that for us.
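If you don't have the course materials handy, the general shape of that handler is sketched below. This is a hedged reconstruction, not a copy of the file in the course download: the CadabraOrders table name comes from earlier in the exercise, but the item layout and field positions are placeholders you'd adjust to match the actual log format.

```python
import base64
import boto3

def lambda_handler(event, context):
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("CadabraOrders")  # table created earlier in the exercise

    # batch_writer buffers and flushes the PutItem calls for us
    with table.batch_writer() as batch:
        # The Kinesis trigger hands us a batch of records; the data is base64-encoded
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
            fields = payload.strip().split(",")  # the fake server logs are comma-separated

            # Placeholder item layout; map the real CSV columns to your table's keys
            batch.put_item(Item={
                "CustomerID": fields[-1],
                "OrderID": record["kinesis"]["sequenceNumber"],
                "OrderData": payload,
            })

    return "Processed {} records.".format(len(event["Records"]))
```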
So with that, let's fire up our EC2 instance and play with it. Log in as ec2-user and let's put some more log data in there: sudo ./LogGenerator.py 10. That should put another ten rows into our DynamoDB table if all goes well. So let's go to DynamoDB and look for it. It will take a minute for that to get picked up, but let's go back to DynamoDB, go to our tables, select CadabraOrders, and go to Items. Looks like we have ten in there so far, but after that Lambda function gets invoked, that should go up to 20, as soon as Kinesis picks up those ten new rows. So let's just sit here and refresh this until that changes to 20. And there we have it: we now have 20 items. So cool, it worked. That's awesome. We've actually built this order history app out end to end, minus the actual app itself; there wouldn't be much value in building that out right now. But we have raw data going into an EC2 host, and that could be an entire fleet of EC2 hosts.
Right? We have a Kinesis agent sitting on each of those hosts pumping that data into a Kinesis stream, which in turn triggers a Lambda function that turns around and inserts that data into DynamoDB, where our app can query it directly. If you want to poke around and see what's going on in more depth, there's also a Metrics tab here where you can see what's going on with DynamoDB, and you can see the metrics indicating that there is some activity happening. We can also go back and look at Lambda in more detail if we want to; if you go to Monitoring, we should see some activity there as well. And yep, we have these single data points here reflecting our one little burst of activity. So that's a way to monitor what's going on if you're running into trouble. You can also view the logs in CloudWatch if you want more detail on what your Lambda function is actually doing and you need to debug things. But congratulations, we got this thing up and running at this point. If you want to delete that Kinesis stream, you can.
We are going to need it again later in the course when we're creating our order rate monitor, but if you are worried about money, it might be a good idea to delete that Kinesis stream and recreate it when we get back to that stage of the course. That Kinesis stream does incur charges whether you're using it or not, as long as it exists, to the tune of about thirty cents per day. So keep that in mind. It will be easier to keep it up and running if you think you'll be working through this course relatively quickly, but if you're really sensitive about money, it would be a good idea to go delete that Kinesis stream and just recreate it when we need it later on. All right, let's move on.
- What is Glue? + Partitioning your Data Lake
AWS Glue can be an important component in your big data pipeline. Basically, it can define table definitions, perform ETL on your underlying data lake, and provide structure to unstructured data. So what is Glue at a high level? Well, its main use is to serve as a central metadata repository for your data lake in S3. It's able to discover schemas, or table definitions, and publish those for use with analysis tools such as Athena, Redshift, or EMR down the road. It can also run custom ETL jobs on your data, so as it finds new data in S3, it can trigger jobs to transform that data into a more structured or purpose-built format for later processing. Under the hood, those ETL jobs use Apache Spark, but you don't have to worry about that or manage the servers it runs on, because Glue is completely serverless and fully managed. The AWS Glue crawler is one component of Glue. Basically, it scans your data in S3, and often the Glue crawler will infer a schema automatically, just based on the structure of the data it finds there in your S3 buckets. So if you have some CSV or TSV data sitting in S3, it will automatically break out those columns for you, and if they have headers, it might even name them for you.
You can schedule that crawler to run periodically if you need to. So if you think you're going to have new data, or new types of data, just popping into S3 at random times, Glue can run on a schedule and discover that data automatically, so other services downstream can see it and process it. Now, the Glue crawler, when it's scanning your data in S3, will populate what's called the Glue Data Catalog, and this is a central metadata repository used by all the other tools that might analyze that data.
The data itself remains where it was originally, in S3; only the table definition is stored by Glue in the Glue Data Catalog. That means things like the column names, the types of data in those columns, where that data is stored, stuff like that. That's what the Glue Data Catalog is storing and vending to other services like Redshift, Athena, or EMR. The data itself stays in S3; the Glue Data Catalog just tells these other services how to interpret that data and how it's structured. Now, once it's in this catalog, that data is immediately available for analysis in Redshift Spectrum, Athena, or EMR, and we're actually going to build a lot of these components out later in the course. And once you have that in place, you can even visualize that data and run reports using Amazon QuickSight.
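To make the crawler idea a bit more concrete, here's a hedged boto3 sketch of defining a crawler over an S3 path and kicking it off; the crawler name, IAM role, catalog database, and bucket path are all placeholders for illustration.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes inferred table
# definitions into a Glue Data Catalog database (names here are made up).
glue.create_crawler(
    Name="order-data-crawler",
    Role="AWSGlueServiceRole-Demo",   # a role with Glue permissions plus S3 read access
    DatabaseName="orderdata",         # catalog database that will hold the tables
    Targets={"S3Targets": [{"Path": "s3://my-order-data-bucket/logs/"}]},
    Schedule="cron(0 * * * ? *)",     # optional: run hourly to pick up new data
)

glue.start_crawler(Name="order-data-crawler")
```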
I want to talk a little bit about how you structure your data before you use Glue. You have to remember that the Glue crawler is going to extract partitions based on how your data is organized within S3, and how those partitions are structured can really impact the performance of your downstream queries. Glue cannot work magic and automatically restructure your data in S3 in an optimal manner to be scanned by later analysis tools; you have to do that up front, before you do anything else. So before you store any data in an S3 data lake, you need to think about how that data is going to be queried, and that will influence how you structure the directories within that bucket. Let's take an example. Let's say I have a bunch of devices out there in IoT-land that are sending sensor data every hour into S3. How am I going to query that data? Will it be primarily by time ranges? If so, then I would want to organize my buckets with the following directory scheme: I would start with the year as a top-level directory, followed by the month, followed by the day.
That's going to allow me to sort everything by time, have everything for a given day in one place, and then sort by device within that. It means I can very quickly go to the partition that corresponds to the year, month, and day I want, and all the device information for that day will be in one spot that I can scan very quickly in my queries. However, if I were going to query primarily by device and then by date, I would structure things differently. In that case, I would want to have the device as my top-level directory, followed by the year, month, and day. So if I know I'm going to query first by the device ID, like maybe I have thermostats, and then I have, I don't know, door-open sensors, home security stuff, maybe I want to treat those as totally different worlds.
And I would want to structure all that information under each device in its own little directory tree. By bucketing everything together like that, you can make sure that queries that look at those buckets operate more efficiently. And this is going to be important on the exam, folks. You need to think about how you're going to structure your data given a certain application. If you know you're going to be querying primarily on one attribute or one column of your data, make sure your data is structured so that, physically, that information is in the same place, and you can do that by smartly choosing your directory structure in S3.
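Just to make that concrete, here's a small Python sketch of what the object keys might look like under each scheme; the device names and file names are hypothetical.

```python
from datetime import datetime

def key_by_date(device_id, ts, filename):
    # Scheme 1: query primarily by time range -> yyyy/mm/dd/device/...
    return "{:%Y/%m/%d}/{}/{}".format(ts, device_id, filename)

def key_by_device(device_id, ts, filename):
    # Scheme 2: query primarily by device, then date -> device/yyyy/mm/dd/...
    return "{}/{:%Y/%m/%d}/{}".format(device_id, ts, filename)

ts = datetime(2021, 3, 14, 10, 0)
print(key_by_date("thermostat-42", ts, "readings.json"))
# 2021/03/14/thermostat-42/readings.json
print(key_by_device("thermostat-42", ts, "readings.json"))
# thermostat-42/2021/03/14/readings.json
```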
- Glue, Hive, and ETL
Now, we haven't covered Apache Hive in depth yet, but it's a tool that runs on Elastic MapReduce that allows you to issue SQL-like queries on data accessible to your EMR cluster. And it's important to remember that Glue can integrate with Hive: you can use your AWS Glue Data Catalog as your metadata store, or metastore, for Hive, and conversely, you can import a Hive metastore into Glue. This will make more sense when we talk about Hive in more depth, but it's important to remember that the Glue Data Catalog can provide metadata information to Hive on EMR. Let's talk about Glue ETL a little bit more. One feature of Glue ETL is that it can automatically generate code for you to perform that ETL, after you define the transformations you want to make to your data in a graphical manner.
Pretty cool. It will generate that code in either Scala or Python, and you can adjust and tweak that code as necessary. Again, it's running on Apache Spark under the hood, and Spark jobs are written in Scala or Python, so that's where that restriction is coming from. It has some encryption features as well: if you want to store your transformed data in an encrypted manner, it can do server-side encryption to encrypt it at rest, and it will also use SSL in transit to keep your data secure while it's moving around. Glue ETL can be event-driven, so it can be run in response to new data being discovered, for example; you don't have to run it on a schedule.
You can actually have it run just as often as needed. And if you find that your ETL jobs are not completing in the time that you want, which is pretty unusual, you can provision additional DPUs, or data processing units, and that will increase the performance of the underlying Spark jobs that are running your ETL on Glue. If there are any errors in your ETL job, those will be reported to CloudWatch, and from CloudWatch you could integrate SNS to notify you of those errors automatically.
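As a rough illustration of where DPUs come into play, here's a hedged boto3 sketch of defining a Glue job and raising its capacity; the job name, role, and script location are placeholders, and depending on the Glue version you may set a worker type and count instead of MaxCapacity.

```python
import boto3

glue = boto3.client("glue")

# Define an ETL job pointing at a PySpark script in S3 (placeholder values throughout)
glue.create_job(
    Name="transform-orders",
    Role="AWSGlueServiceRole-Demo",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/transform_orders.py",
        "PythonVersion": "3",
    },
    GlueVersion="1.0",
    MaxCapacity=10.0,  # DPUs allocated; raise this if jobs aren't finishing fast enough
    Timeout=60,        # minutes
)

# Kick off a run on demand (a schedule or trigger could do this instead)
glue.start_job_run(JobName="transform-orders")
```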
- Glue ETL: Developer Endpoints, Running ETL Jobs with Bookmarks
Let's go into a little bit more depth on Glue ETL, as it's becoming a more important part of the exam. Remember, Glue ETL's purpose is to transform, clean, and enrich your data before you do your analysis, and it allows you to do this by writing your own ETL code in Python or Scala. You can either start from code that's generated by Glue itself and then build upon it and modify it to meet your needs, or, if you just want to start from scratch, you can provide Glue ETL with your own Spark or PySpark scripts to do whatever you want. Glue, under the hood, will spin up a Spark cluster and execute that for you as part of its process. The output of your ETL job can go to S3, it can go to an external database using the JDBC protocol, which includes all the databases you might be running under Amazon RDS or even Redshift, and you can also output to the Glue Data Catalog itself.
The beauty of it is that it's all fully managed and therefore very cost effective; you only pay for the resources that you consume. The jobs are run on a serverless Spark platform, so the cluster that runs your ETL job just spins up, runs your job, and then goes away, and you don't have to pay for it anymore. The actual creation and maintenance of that Spark platform is all taken care of for you; you just have to provide the code that runs on top of it. It has a Glue scheduler, which we'll go into a little bit more depth on shortly, to run your jobs on a recurring schedule: every month, every day, every hour, whatever you need. And it also has triggers that allow you to automate your job runs based on events, and that can integrate with CloudWatch as well. A little bit more depth on the transformations you can do with Glue ETL: there are some transformations that come bundled with it.
Those include DropFields and DropNullFields, which do what they sound like: they remove fields, or null fields, from your data set. So if you need to clean your data by just dropping missing data instead of imputing it, that allows you to do so very quickly and easily. You can also set up filter functions if you want to filter out records that meet certain criteria. So if you want to clean your data by saying anything above or below a certain threshold value gets thrown out, that might be an easy way to get rid of outliers in your data set, just using the bundled transformations. You can also do joins as part of a bundled transformation, and you can use that to enrich your data, basically joining in data from another table to augment the data that you have.
So maybe you want to include the names associated with some identifier within the data set that you're exporting. You can also do mapping, which allows you to do things like adding fields, deleting fields, or performing other external lookups beyond a join, basically transforming data as it's read in, in pretty much any way you can dream up. You'll also be expected to know about the machine learning transformations that Glue ETL offers. The main one is FindMatches ML, and its purpose in life is to identify duplicate or matching records in your data set. This might sound like an easy thing to do. Why do you need machine learning for that? Well, records don't always match exactly, right? Sometimes records don't have a common unique identifier and no fields match exactly, but it could still be a match. An example would be a product catalog on Amazon.com.
You have multiple vendors, multiple third-party merchants, submitting their different products into the catalog, and often they're selling the same thing, just under slightly different names or slightly different descriptions. With FindMatches ML, you can train that system on a known set of duplicate items, and it will learn from that training set and use that information to identify future duplicates going forward, items that might not be exact matches but still represent the same thing. So that's kind of an important thing, especially in the world of e-commerce. Glue ETL can also do format conversions for you.
This is a very common use case: maybe you need to switch to a different format for various performance reasons. For example, a lot of downstream systems work better with Parquet, so if you need to convert your CSV data to Parquet, Glue ETL can do that for you out of the box. It can currently handle CSV, JSON, Avro, Parquet, ORC, and XML formats automatically. So, important to remember: Glue ETL can do format conversions for you as well. Also, you can just provide your own Apache Spark transformation.
So Spark comes with a bunch of built-in things that it can do, for example k-means clustering; there's a whole machine learning library that comes with Spark, and basically everything that's in Spark is at your disposal as well, so use it if you need to. What if you have a really complicated ETL script that you want to develop, and it's more than you can do just within the Glue console itself? Well, Glue offers AWS Glue development endpoints to allow you to attach an external notebook to Glue ETL and use that notebook to iterate on and develop your ETL script. Then, when you're done, when you've debugged it, worked out all the kinks, and made sure it does what you need it to do, you can create an ETL job that runs that script using Spark and Glue. The endpoint will be within a VPC, controlled by whatever security groups are attached to it, and depending on that security group, you can connect to the endpoint via, say, Apache Zeppelin running on your local machine.
Zeppelin is just a notebook environment for Apache Spark. Or you can have a Zeppelin notebook server running on EC2, and the Glue console makes that very easy to spin up automatically. If you're using Amazon SageMaker to develop a larger machine learning pipeline, you can also develop ETL scripts from a SageMaker notebook using Glue development endpoints. You can also just pop open a terminal window and start typing away; pretty easy. And if you're using PyCharm, a Python IDE, the professional edition of that product has built-in support for AWS Glue development endpoints as well. One thing to note: if your endpoint has a private address, you can use Elastic IPs to access it from outside of the VPC. Now, some details on actually running your Glue job. As we mentioned, you can run it on a time-based schedule; there's a Glue scheduler that lets you run these on some sort of fixed frequency.
The format of that is very much like Linux's cron expressions, and that lets you run it every hour, every day, every week, whatever you can dream up. One important concept is job bookmarks. This is showing up on the exam more and more. This is what lets you pick up where you left off from when you ran the previous job; it persists state from job run to job run. So if you're running your Glue job on a schedule, job bookmarks allow you to process only the new data when it picks up again on the next iteration of that schedule. It can work with S3 sources in a variety of formats, and that seems to be what it works best with: if you have a data lake in S3, job bookmarks work really well for keeping track of where you left off on processing that data with Glue ETL.
But it can also work with relational databases, with some constraints, as long as it can talk to them through JDBC, which covers pretty much every relational database out there. You can also use job bookmarks in that situation, but it depends on the primary keys being in sequential order, so it can keep track of where it left off, and it only handles new rows; it can't go back and find updated rows that were modified. So within those constraints, it can work with databases as well, but mostly job bookmarks are used with S3 data lakes. Also, your Glue job can integrate with CloudWatch and CloudWatch Events. So when your ETL job succeeds or fails, that can fire off a Lambda function or an SNS notification, or whatever you want, through CloudWatch Events. It can also kick off an EC2 Run Command, send an event to Kinesis, or activate an AWS Step Function to move on to the next stage in your pipeline once your data has been transformed using Glue ETL.
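To tie several of these ideas together, here's a hedged sketch of what a Glue ETL script using job bookmarks might look like: it reads from a Glue Data Catalog table, drops null fields with a bundled transform, and writes Parquet back to S3. The transformation_ctx values and job.commit() are what let bookmarks track progress, assuming the job was created with the bookmark option enabled; the database, table, and bucket names are placeholders.

```python
import sys
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a catalog table; transformation_ctx is what job bookmarks key off of
orders = glue_context.create_dynamic_frame.from_catalog(
    database="orderdata",               # placeholder catalog database
    table_name="logs",                  # placeholder table created by a crawler
    transformation_ctx="orders_source",
)

# A bundled transformation: strip out null fields
cleaned = DropNullFields.apply(frame=orders, transformation_ctx="orders_cleaned")

# Write the result back to S3 as Parquet (a format conversion, as discussed above)
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-order-data-bucket/processed/"},
    format="parquet",
    transformation_ctx="orders_sink",
)

# Committing the job is what advances the bookmark
job.commit()
```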