DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Event Hubs and Stream Analytics part 2
- What is Azure Stream Analytics
In this chapter, I want to give an introduction to Azure Stream Analytics. This is a real-time analytics and event processing service. I've taken this diagram directly from the Microsoft documentation because it gives a good picture of what this service can achieve. On the left-hand side, we are looking at ingesting data from a variety of data sources. You could ingest data that's coming in from your IoT devices, from your log files, or from your SaaS-based applications. You can then use services such as Azure Event Hubs, which you have seen earlier on, to ingest that data, or your data might be copied directly onto Azure Blob storage. You could also use Azure IoT Hub to take the data from your IoT-based devices.
You can then use Azure Stream Analytics to take the data that is being ingested into Event Hubs or IoT Hub and analyze it in real time, and then store or transfer that data onto a destination data store. For example, you could store the data in an Azure SQL data warehouse, you could look at the data using Power BI, or you could store the data in, let's say, an Azure Data Lake Gen2 storage account. So the entire purpose of Azure Stream Analytics is to analyze your data in real time. We'll be looking at examples of how to use the service in the subsequent chapters.
- Lab – Creating a Stream Analytics job
Now in this chapter, I just want to go ahead and create a Stream Analytics job. So here in All resources, I'll hit Create, search for the Stream Analytics job, and hit Create. I'll give a job name, choose my resource group, and set the location to North Europe. In terms of the streaming units, I'll leave the value as it is. Streaming units are a measure of the amount of compute and other resources that will be given to our Stream Analytics job. Remember, this is a managed service, but the job still needs some sort of infrastructure to ensure it can analyze the information you're going to send to it. Again, that infrastructure is managed for you, and the amount of resources given to your job is based on this measure of streaming units.
At this point in time, I'll leave it at three streaming units and hit Create. It will now create the Stream Analytics job, so let's wait till this is complete. Once we have the resource in place, I can go onto it. Now, if you go on to the Inputs section, this is where you add an input for your Stream Analytics job: where is your data coming from? When it comes to streaming inputs, you can choose Azure Event Hubs, Azure IoT Hub, or an Azure Blob storage / Data Lake Gen2 storage account. You can also add reference inputs, and I'll go through reference inputs in a subsequent chapter.
Then you have your Outputs. Once you take the data from the input and perform some sort of transformation or analysis on it, where do you want to send that transformed data? Here you have a variety of options: you could send your data onto a SQL database, back onto, let's say, an Azure Data Lake Gen2 storage account, or onto Azure Synapse. Once you define your inputs and your outputs, you then have to define your query. This is a SQL-based query that you can use to analyze your data, and based on that analysis, you can then send the data onto your output, onto your destination data store. So at this point in time, I just want to create the Stream Analytics job. As I said, in subsequent chapters we'll see how to work with this particular service.
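Just to give you a feel for the shape of such a query, here is a minimal sketch. The aliases [my-input] and [my-output] are placeholders for whatever names you give your input and output, and the filter column is purely illustrative.

```sql
-- Minimal Stream Analytics query sketch.
-- [my-input] and [my-output] are placeholder aliases that must match the
-- names you define on the Inputs and Outputs sections of the job.
SELECT
    *
INTO
    [my-output]
FROM
    [my-input]
WHERE
    quantity > 10   -- illustrative filter on a hypothetical column
```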
- Lab – Azure Stream Analytics – Defining the job
Now, in an earlier chapter on Azure Event Hubs, we saw how we could use a .NET program to send events onto an Event Hub. The name of our Event Hub was app-hub. Here in the Overview, you can see the metrics about the requests and the messages. Now let's use our Stream Analytics job to read data from the Event Hub and then write that data onto a table in Azure Synapse. So firstly, I need to go onto my Inputs and define my stream input. Here I'll choose Event Hub and give a name for the input; I'll give it the same name as the Event Hub, but you can define your own name. I'll scroll down and choose my Event Hub namespace.
I'll choose my Event Hub and scroll down. Here it will create a new Event Hub consumer group. Our .NET program could be reading data based on one consumer group, the default consumer group, and the Stream Analytics job will create a new consumer group because it is a new consumer of the events in that Event Hub, so I can leave it as it is. For the authentication mode, I'll choose connection string, and here it will create a new Event Hub policy; again, I'll leave it as it is. Next, you can specify a partition key. We don't have a partition key defined in our program, so we are leaving it to Azure Event Hubs to handle the partitioning, and I'm leaving this blank here.
Now, in terms of the event serialization format: what is the format of the incoming data? Is it JSON, CSV, or Avro? I'll leave it as JSON, leave the encoding as it is, and click Save. So I have now defined my input, the Azure Event Hub. Next, I need to define my output. Before that, let's create the table that will store the information we are sending onto the Azure Event Hub. So here I am in SQL Server Management Studio. For my dedicated SQL pool, I'll right-click and hit New Query, and here I'll use a statement to create a table. It's a simple orders table with the order ID, the quantity, the unit price, and the discount category, which is similar to the information that my .NET program is sending.
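A sketch of such an orders table in the dedicated SQL pool could look like the following; the exact column names and data types are assumptions based on the fields I just described.

```sql
-- Sketch of the orders table in the dedicated SQL pool.
-- Column names and data types are assumptions based on the fields
-- described above (order ID, quantity, unit price, discount category).
CREATE TABLE dbo.orders
(
    orderid          VARCHAR(200),
    quantity         INT,
    unitprice        DECIMAL(10, 2),
    discountcategory VARCHAR(200)
);
```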
Each event has an order ID, a quantity, a unit price, and a discount category. So let me take the CREATE TABLE statement and hit Execute. Right, so I now have the table defined, and I should not have any data in it yet. Now let me go onto Outputs and add an output. I'm going to choose Azure Synapse Analytics. I'll give a name; I can use the name of our database table. I'll choose my database, and for the authentication mode I'll use connection string, so I need to enter the username and the password. Then, what is our table? It is the orders table, so I'll copy the name, place it here, and click Save. Now we get a connection test failure, because the job needs something known as a job storage account to run.
When we looked at Azure Data Factory, we saw that if we had to transfer data onto Azure Synapse, we needed a staging area in place, and that staging area in our case was an Azure Data Lake Gen2 storage account with a separate synapse container. The same thing needs to be in place for the Stream Analytics job. So here we have something known as the storage account settings. Let's go there and add a storage account. Here I'll provide an existing Azure Data Lake Gen2 storage account. Before that, let me go onto my Data Lake Gen2 storage account (I'll use our same account), go onto my containers, and create a new container.
I'll hit Create. Next, back in the storage account settings, I'll select my storage account, the same Data Lake Gen2 account, leave the connection string setting as it is, and click Save. So now we have our storage account settings in place. If I go back onto my output and test the connection again, you can see the connection is now successful. Once we have our input and our output defined, I'll go onto the Query section and take the query that I have here. This is the information that I want, so I'll just replace what's there; let me hide this pane for the moment. I'm selecting my order ID, my quantity, my unit price, and the discount category into my output alias.
So what is my output? It's orders. And what is my input? My input is app-hub. Once we have the query in place, let me hit Save.
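To make this concrete, the query looks something along these lines. Treat this as a sketch: the exact column names depend on what our .NET program sends, and the aliases must match the input (app-hub) and output (orders) we defined.

```sql
-- Sketch of the job query: read from the Event Hub input alias and
-- write the selected fields into the Synapse output alias.
SELECT
    orderid,
    quantity,
    unitprice,
    discountcategory
INTO
    [orders]
FROM
    [app-hub]
```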
Once this is saved, I'll expand this and go onto the Overview. Now, if you want this job to continuously take data from the Event Hub and stream it onto the table in our dedicated SQL pool, we need to start the job. Here I can put a custom start time, or I can set the start time as now. Let me leave it as now and click Start. It will now start our Stream Analytics job; it will take a couple of minutes, so let's wait for some time. Now, in Visual Studio, let me run my program again to send events onto the Event Hub. We'd seen this program earlier on, right? So the events are sent. If I go onto SQL Server Management Studio and execute the SELECT statement, you can see all of the order information in our table. So within no time, our data was taken from the Azure Event Hub and populated in our table. Now, at this point you'll definitely have a question: we had already sent events onto the Event Hub earlier, so why didn't our Stream Analytics job pick up the events that were already present in the Event Hub? We'll answer this question in the next chapter, because there are some other aspects I also want to discuss when it comes to the Stream Analytics job.
- Review of what we have seen so far
So what have we seen so far? We have configured an Azure Stream Analytics job to take events from an Azure Event Hub and send them onto a table in Azure Synapse. We had a .NET program to send events onto the Azure Event Hub. So the Azure Event Hub instance is ingesting our data in real time, and at the same time, within a fraction of a second, our Stream Analytics job is able to read the events from the Azure Event Hub and send them onto a table in Azure Synapse. Now, as I mentioned in the previous chapter, you might be thinking to yourselves that we have more events in the Event Hub. If I go onto my program, which is designed to read events from the Event Hub, and run it, we can see we have a total of ten events; but if we go onto our table and hit Execute, we only have the most recent five events.
Now, why is this the case? The main reason for this is time. Remember, when it came to Azure Event Hubs, I talked about the checkpoint feature that is available as part of the Azure SDK, wherein a program can ensure that it does not read old events and only reads events after a checkpoint. Similarly, when it comes to your Stream Analytics job, since this is real-time processing, we are talking about time here, and time is the important factor: from where in the timeline should the job actually start reading events? This is important. When we started our Stream Analytics job, there were two settings available: a setting to start it right now, and a custom setting.
So should we start the job at a particular point in time? Since I chose the option of now, the job didn't read the older events, because the older events carry timestamps from before the job's start time. So when you are processing data with Stream Analytics, you will see that one of the most important factors is time, because, as I said, this is real-time processing of your data. Now, if I go back onto my Stream Analytics job, let me just hide this. If you choose the input of app-hub, first of all, our job is running, so we can't change the query over here, but we can see a preview of the data in the Event Hub. I'll just change the zoom factor so that we can see the data. So in the input preview, you can see the order information; here you can see all of the events in the Event Hub.
Now remember, this is not the job itself running here; the job is running in the background, reading events from the Event Hub. This is like a test area where you can see the data from your input, and the data shown is based on the time range that you select here. So if the time range starts before the events were sent onto the Event Hub, they will show up here in the input preview. Having the input preview is very important because it gives you an understanding of the type of data, and of the data itself, that's coming in from the input stream. Sometimes you might not be aware of the structure of the data coming in from the input stream, and this gives you a very good idea of that. Now, in the input stream we can see our order ID, our quantity, our unit price, and our discount category.
Apart from that, you can also see some other columns: you have the EventProcessedUtcTime, the PartitionId, and one more known as the EventEnqueuedUtcTime. When it comes to Azure Event Hubs, these three pieces of information are also sent onto the Stream Analytics job. That's why, in my SELECT query, I specifically mention which columns I'm interested in. If I did a SELECT *, it would also take these columns into consideration and our job would fail, because our table in Azure Synapse does not have columns for the EventProcessedUtcTime, the PartitionId, and the EventEnqueuedUtcTime.
And it is basically this information that is used by the Azure Stream Analytics job to understand the time factor when it comes to the events. For a particular input stream, whether it's Azure Event Hubs, Azure IoT Hub, or Azure Blob storage, there are different properties appended onto the stream depending upon your input source. When it comes to Azure Event Hubs, they are the EventProcessedUtcTime, the PartitionId, and the EventEnqueuedUtcTime. Now, if you have an application that is sending data onto the Stream Analytics job, you could have your own time field inserted in the data, and Stream Analytics can then base its time handling on the time value carried in your data.
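As a quick illustration of that last point, Stream Analytics lets you mark a field in your payload as the event time using TIMESTAMP BY. This is only a sketch: ordertime is a hypothetical field that our current .NET program does not send.

```sql
-- Sketch: use an application-supplied time field as the event time.
-- 'ordertime' is a hypothetical field; our program does not send it,
-- so this only shows the TIMESTAMP BY pattern.
SELECT
    orderid,
    quantity,
    unitprice,
    discountcategory
INTO
    [orders]
FROM
    [app-hub] TIMESTAMP BY ordertime
```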
So in this chapter, I just wanted to discuss a couple of important points. The first was the time factor when it comes to processing your data. The next was understanding the query in the query window, and how you look at data from your input, which is also very important. Even when I am working with Stream Analytics, if something is not working and I'm trying to debug, I normally look at the input preview to understand how the data has been formatted and what the columns are; based on that, I can decide how to construct my query. This is important.
- Lab – Reading database diagnostic data – Setup
Now in this chapter, I want to show you an example of how you can continuously stream data onto Azure Event Hubs and then process that data with our Stream Analytics job. Here I have my AdventureWorks SQL database; this is a SQL database we had created earlier on, but if you have any SQL database in place, you can make use of it. In Azure, you have a feature which is available for most services, and that is diagnostic settings. With a diagnostic setting, log information and metrics can be sent on to a particular destination. So here I can add a diagnostic setting to send data onto Azure Event Hubs.
Before that, I'll go onto my Azure Event Hubs namespace, and here let me create a new Event Hub just for my database. I'll name it db-hub. I'll set the message retention to seven days and hit Create. So now we are creating a separate Event Hub that will take the diagnostic information for our SQL database. Once we have the Event Hub in place, I'll go onto the SQL database and click Add diagnostic setting. Here, on the left-hand side, I can decide what I want to send onto, or stream onto, my Azure Event Hub. I'll choose Basic when it comes to the metrics, which means that the basic metrics about the SQL database will now be sent on to a destination.
On the right-hand side, you can see the destination details. Here I can choose to stream onto the Event Hub, and I can choose my db-hub. I can leave the Event Hub policy as it is. Note that in order for your Azure SQL database to see your Event Hub namespace, the database and the Event Hub need to be in the same region. Currently, all my resources are in the North Europe location; that's why my Azure SQL database can see this Azure Event Hub. Next, I need to give a name for the setting and click Save. It will now start streaming the diagnostic data for the Azure SQL database onto the Azure Event Hub. It might take around five to ten minutes for the information to start showing up in the Event Hub, but after that it's a continuous stream of data. So let's come back after some time.