DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Event Hubs and Stream Analytics part 1
- Batch and Real-Time Processing
Now in this section, we are going to see how to start working with streaming data. In the earlier section, we looked at using Azure Data Factory as an ETL tool, an extract, transform and load tool. This is normally used for batch processing use cases. In batch processing, you normally take large amounts of data from the source, transform it, and then load it onto a destination data store. For example, you can take data from your Azure Data Lake Gen2 storage account, perform some transformations, and then put it onto a destination data store like a dedicated SQL pool in Azure Synapse. Here you could take your semi-structured files that we have seen, for example our JSON-based files, etc.
And then you can basically turn that into structured data on a dedicated SQL pool that can be analyzed at a later point in time. So, for example, you could have your web server logs copied onto Azure Data Lake Gen2 over the period of the day, and then a batch process, let's say, kicks off in the night that would take the data, process it, and then send it on to an analytical store for daily reporting purposes. Now, when it comes to the different operations in batch processing: first you have data storage, right? So you can have your data in Azure Data Lake. And then as part of the batch processing, you can take the data from Azure Data Lake and put it in an analytical data store such as Azure Synapse, right, in your dedicated SQL pool.
And you can use the Apache Spark clusters that are part of, let's say, Azure Data Factory, or even have Spark running on its own, when it comes to processing the data. When you want to orchestrate the data that is sent from the source onto the destination, we've seen how to use Azure Data Factory. And then you can use Power BI when it comes to the reporting of the data, right? So Power BI can work off the data in the analytical data store. This is when it comes to batch processing. Now, what about real-time processing? Here, streams of data are captured in real time and processed with minimal latency to generate real-time reports. Here, you have to ensure that whatever engine is taking in or processing the real-time information processes it as fast as possible, so that it does not block the incoming stream of data.
See, when you're looking at streams of data, you are looking at ingesting large amounts of data that is coming in at a very fast rate. So in such cases, you need to have the proper platforms in place that have the ability to ingest and process the data at a very fast rate. Now, when it comes to the different systems that are available for the different parts of real-time processing: first you have the systems that are used for real-time message ingestion. So you have systems such as Apache Kafka that can be used, or in Azure you can make use of Azure Event Hubs. And as part of this course, we'll be looking at Azure Event Hubs. When it comes to data storage, remember, your streams of data will just be ingested over here.
Then these systems can take the data and, let's say, store it in Azure Blob Storage or in Azure Data Lake Storage accounts. If you want to process the streams as they are coming in, you can use stream processing tools. For example, you could use Azure Stream Analytics, you could choose Apache Storm, or you could also use Spark Streaming. And you could also take all of the processed data and put it in an analytical data store such as Azure Synapse; you can use Spark, Hive, HBase. And then finally, again for reporting, you can use Power BI. So now in this section we are going to be looking at Azure Event Hubs, which can be used for ingesting data, and we will look at Azure Stream Analytics and how it can be used to process the data in real time.
- What are Azure Event Hubs
Now in this chapter, I'll just give an overview of Azure Event Hubs. And then in the next chapter, we'll actually build a resource based on Azure Event Hubs. The main aspect that we need to look at as part of this course is actually Azure Stream Analytics, but we can combine both Azure Event Hubs and Azure Stream Analytics. So first, let's cover Azure Event Hubs. This is a big data streaming platform. This service has the capability to receive and process millions of events per second. Here you can stream any type of data: it can be log data, your telemetry data, any sort of events, onto Azure Event Hubs. So here, giving a reference from the Microsoft documentation on the Event Hubs architecture: in Azure, you create something known as an Event Hubs namespace first, and then as part of the namespace, you can create an Event Hub, an Azure Event Hub.
Then you can have sources that actually produce the events and send them onto Azure Event Hubs. So you could have, let's say, your web servers that are continuously streaming, let's say, log data onto Azure Event Hubs, or streaming metrics about the application onto Azure Event Hubs. On the right-hand side, you can have a receiver. This can be a program that is taking in all the data that is being sent onto Azure Event Hubs. The event receivers can take the data and process it accordingly. In Azure Event Hubs, you have this concept of having multiple partitions. And again, with any system, the main purpose of having a partition is so that you can ingest more data at a time. You can have data in multiple partitions, and your event receivers can take in the data either from one partition or from all partitions.
So partitions not only help in taking in the data at a fast rate; you can also have multiple event receivers taking in the data in parallel from the different partitions, so you can also consume the data at a very fast rate, right? So, just giving an overview when it comes to the Azure Event Hubs architecture. Here you have the different components. You have the event producers: this is the entity that actually sends data onto the Event Hub, and there are different protocols available when it comes to publishing events onto the Event Hub. Your data can then be split across multiple partitions; this allows for better throughput of your data onto Azure Event Hubs. We have something known as consumer groups, right? This is just a view (a state, position, or offset) of the entire Event Hub. We'll not be going through consumer groups in much detail.
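To make the partitioning idea concrete: when a producer supplies a partition key, Event Hubs hashes that key so all events with the same key land on the same partition, which preserves their order. The sketch below is a minimal, hypothetical illustration in Python (the real service uses its own internal hash, and the partition count of four is just an assumption).

```python
import hashlib

PARTITION_COUNT = 4  # assumption: a hub created with 4 partitions


def assign_partition(partition_key: str, partition_count: int = PARTITION_COUNT) -> int:
    """Map a partition key to a partition via a stable hash mod partition count.
    Illustrative only -- not the actual Event Hubs hashing algorithm."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % partition_count


# Events sharing a key always land on the same partition,
# which is what preserves per-key ordering.
p1 = assign_partition("device-42")
p2 = assign_partition("device-42")
assert p1 == p2
print("device-42 ->", p1)
```

The point is that the mapping is deterministic: two receivers, or the same receiver on two runs, will agree on which partition a given key's events live in.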
See, we are only going to be using Event Hubs as part of this course for ingesting data, and we'll look at the important points from the exam perspective, because the main thing that we have to focus on is Azure Stream Analytics. Then we have something known as the throughput unit. This is the capacity that is available for Azure Event Hubs. And then you have the event receivers. This is the entity that actually reads the event data. Now, in this particular section, we are going to be looking at the use case wherein we'll have an instance of Azure Event Hub. Then we'll actually configure the diagnostics of an existing SQL database that we have to send metrics information onto Azure Event Hub. And then our receiver is actually going to be Azure Stream Analytics. So using Azure Stream Analytics, we'll actually look at the real-time data that is being sent by the diagnostic setting of the SQL database onto Azure Event Hubs. That's going to be our main use case scenario.
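As a preview of where this is heading: a Stream Analytics job reads from an input alias and writes to an output alias using a SQL-like query language. A minimal hypothetical query might look like the following (the alias names `eventhub-input` and `sql-output` are placeholders I've made up, not names from the upcoming lab):

```sql
-- Hypothetical Stream Analytics query: take the diagnostic events arriving
-- on the Event Hub input alias and route them all to the output alias.
SELECT
    *
INTO
    [sql-output]
FROM
    [eventhub-input]
```

We'll build the actual job, inputs, and outputs in the later chapters of this section.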
- Lab – Creating an instance of Event Hub
Now here we are in Azure. Let's go ahead and create an instance of Azure Event Hub. So here, in All resources, I'll hit on Create. I'll search for "event hub", I'll choose Event Hubs, and I'll hit on Create. So first we have to create something known as a namespace, which can encapsulate multiple Azure Event Hubs. Here I'll choose our existing resource group. I need to give a unique name for my namespace (if it already exists, I'll have to choose another). My location I'll keep as North Europe. Now, here, in terms of the pricing tier: if I go on to the pricing page for Azure Event Hubs, here we have the Basic tier, which we are choosing. Now, one throughput unit allows you to have one MB per second when it comes to the amount of data that can be ingested into Azure Event Hubs, and two MB per second when it comes to the amount of data that can be taken out of, or read from, Azure Event Hubs.
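Those throughput unit limits (1 MB/s ingress, 2 MB/s egress per unit) drive how many units you provision. A quick back-of-the-envelope sizing helper, using hypothetical workload numbers of my own:

```python
import math

# Per-throughput-unit limits stated on the Event Hubs pricing page.
INGRESS_PER_TU_MB = 1.0  # MB/s of data coming in, per throughput unit
EGRESS_PER_TU_MB = 2.0   # MB/s of data read out, per throughput unit


def throughput_units_needed(ingress_mb_s: float, egress_mb_s: float) -> int:
    """Smallest number of throughput units that covers both the expected
    ingress rate and the expected egress rate (minimum of one unit)."""
    return max(math.ceil(ingress_mb_s / INGRESS_PER_TU_MB),
               math.ceil(egress_mb_s / EGRESS_PER_TU_MB),
               1)


# Hypothetical workload: 1.5 MB/s flowing in, 2.5 MB/s read out by consumers.
print(throughput_units_needed(1.5, 2.5))  # -> 2
```

Here the egress side needs ceil(2.5 / 2) = 2 units and the ingress side also needs 2, so two throughput units cover both directions.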
Here you can see the price per hour when it comes to using one throughput unit. You can also see that when it comes to the Basic tier, you get this much storage retention, and the maximum retention period is one day. So, just a quick note when it comes to Azure Event Hubs: what is the meaning of that one-day retention? Let's say events are being sent onto Azure Event Hubs, and let's say a receiver reads an event from Azure Event Hubs. Now, here the messaging system is a little bit different. Normally in other messaging systems, when you read a message from the system, you can then delete the message so that no one else can read it again. But these are not actually messages, these are events. So there is no facility to delete the events from Azure Event Hubs.
That's because you could have multiple consumers that could be reading events for different purposes. For example, you might have the IT security department reading the events based on web server logs just for the purpose of security. Or you might have your monitoring team that is again reading the same set of events for monitoring purposes. That's where you can also have different consumer groups. So you can have one consumer group for your IT security department, which has one view of Azure Event Hubs when it comes to the events, and you could have your monitoring team, which could be consumer group number two. So the events will be retained here for a maximum duration, based on the Basic tier that you're choosing, of one day.
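The two behaviors described above (reads never delete events, and each consumer group keeps its own independent position) can be sketched with a toy, in-memory model of a single partition. This is purely illustrative Python of my own, not the Event Hubs service or SDK:

```python
class MiniEventHubPartition:
    """Toy sketch of one partition: reading never deletes events;
    each consumer group just tracks its own read position."""

    def __init__(self):
        self.events = []     # retained until they expire; never deleted on read
        self.positions = {}  # consumer group name -> next index to read

    def send(self, event):
        self.events.append(event)

    def receive(self, group):
        """Return everything this group hasn't seen yet, and advance
        only that group's position. Other groups are unaffected."""
        start = self.positions.get(group, 0)
        batch = self.events[start:]
        self.positions[group] = len(self.events)
        return batch


hub = MiniEventHubPartition()
for e in ["login", "error", "logout"]:
    hub.send(e)

# Two departments read the same events independently.
print(hub.receive("it-security"))  # -> ['login', 'error', 'logout']
print(hub.receive("monitoring"))   # -> ['login', 'error', 'logout']
print(hub.receive("it-security"))  # -> [] (caught up; events are still retained)
print(len(hub.events))             # -> 3
```

Notice that after both groups have read everything, the three events are still sitting in the partition; in the real service they disappear only when the retention period expires.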
And during this one day, you can have your consumers read the data. Obviously, if you want more data retention, you have to use a different pricing tier. So again, I just want to make this very, very clear when it comes to the storage of the events in Azure Event Hubs. Now, what I'll do is actually choose the Standard tier just for this purpose. You can go with Basic, but I'll choose the Standard tier because I want a greater maximum retention period. The reason is that, see, I'm building this course and I want the events to stay in place, so that if I want to reuse the same events at a later point in time, I can, during this seven-day duration. I leave it up to you as to which tier you want to use, whether Basic or Standard. Here you can see the price per hour when it comes to the throughput units.
I'm also just going to be adding one more throughput unit; I'll keep it as two throughput units for my entire namespace. So here I'll choose Standard, and let me put two throughput units just for the extra ingress and egress traffic. I'll go on to Next for tags, I'll go on to Review and Create, and let me go ahead and hit on Create. This will just take a couple of minutes. Now, once the deployment is complete, I can go on to the resource, and here we have an Event Hub namespace in place. Then we can click on adding a new Event Hub. Here I can give the name of the hub, and I can specify the number of partitions. Here, let me leave the partition count as one itself. And here, in terms of the message retention: so I mentioned that the message retention is the amount of time that the message will actually be retained in the Event Hub.
Remember, there is no manual way of deleting the events from the Event Hub. They need to expire based on the setting that you put for the message retention. For now, I'm putting the message retention at the maximum of seven days, which you get when you choose the Standard tier. When it comes to the Azure Event Hub, you can also turn on the Capture feature. This will give you the ability to also take the events and put them in a storage account. For now, I'll turn this off, and let me hit on Create to create an Event Hub. Once the deployment of the Event Hub is complete, if you go on to Event Hubs, here you can see your Event Hub in place. You can go on to it, and here you'll get an overview of the Event Hub. So in this chapter, we were looking at getting our Event Hub namespace in place, and also our Event Hub.
- Lab – Sending and Receiving Events
Now, in this chapter, I just want to show a simple example of how we can use a .NET program to send events onto the Event Hub, and how we can use another .NET program to receive events from the Event Hub. The entire purpose is just to show you how Event Hubs actually work. Then, in a subsequent chapter, we'll see how to continuously stream the diagnostic data from an Azure SQL database onto the Azure Event Hub. Now, I have ensured that both of my .NET projects are attached as zip files to this course. So, there are two applications in place. One is the azure event hub send application; you can double-click on this solution file and it will open up in Visual Studio. So I have one program for sending events onto the Event Hub, and another program for receiving events from the Event Hub, so we can open up both solutions. So here I have the program in place for sending events onto the Event Hub.
Again, we are not going into detail about this particular program, but what I'm trying to do is create a list of orders. This is order information, based on my Order class. Then I am sending all of this as events onto my Event Hub, so I can use the Event Hubs messaging packages that are available for Azure in Visual Studio. Now, here, what I need to do is change the connection string. I need to ensure that this program is authorized to send events onto the Event Hub. Now, how do we get this connection string? In Azure, you have to go on to your Event Hub. So remember, we created the Event Hub in the previous chapter. Now here we have to go on to Shared access policies, and here we have to create a new policy.
In the policy, I can give a name, and here, in terms of the permission, I'll choose Send. So this policy will have the permission of sending events onto the Event Hub. I'll then hit on Create. I'll then click on the policy, and here you will see the primary key, the secondary key, the connection string based on the primary key, and the connection string based on the secondary key. Now, you can copy either connection string. So I'll copy the first connection string, go on to my program, and just replace it here. And we have a program in place, and now I can run it. So, this program is now built to send events onto the Azure Event Hub. So the batch of events has been sent. Now, next, I have a program that is used to receive events from the Event Hub.
So here the program is reading from all of the partitions in the Event Hub. We only have one partition, so this should be simple. We also have something known as a consumer group. As I said, you can have different views of your Event Hub. So if you have one department that is consuming events from the Event Hub, they can have a separate consumer group, and if you have another department, they can have their own consumer group. You can actually create your consumer groups here itself. So if you go on to Consumer groups (this is for the Event Hub), here you can see the default consumer group in place, and you can just go ahead and create a consumer group and give it a name. Now again, for this program, we have to replace this connection string.
So here again, I have to go on to Shared access policies, and here I need to add a new policy. Now, this policy will be used to listen to events. So I'll choose the Listen permission and hit on Create. So here I'm creating different policies: one for sending events and the other for listening to events. Now, here again, I can take either connection string. So I'll take the primary connection string, replace it here, and then run this program. And now you can see that it has received events from the Event Hub. Now, our main data is here, right? So all of the order data that I'm sending as events is being sent on to the Event Hub. Apart from that, there are some other attributes that are being set by Azure Event Hub for each event.
And I'm just displaying those different attributes. First is the partition ID. Since I mentioned only one partition for our Event Hub, there is only one partition ID. And remember, I mentioned that you have these partitions in Azure Event Hub so that you get better throughput, both for the underlying ingestion of data and for reading data from the Azure Event Hub. Then you have the sequence number: each event has a sequence number. And you also have the data offset: here, each of my events is consuming 120 bytes of data, and that's why you can see the data offsets spaced the way they are. Now remember, I told you that when you send events onto the Azure Event Hub and we read all of the events, they are, as I said, not deleted.
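The relationship between those attributes is easy to reproduce: the sequence number counts events within a partition, while the offset is the byte position of the event in that partition's log. With fixed-size events of 120 bytes (as the lab output showed), offsets step by 120. A small illustrative sketch, assuming fixed-size events for simplicity:

```python
EVENT_SIZE_BYTES = 120  # size observed per event in the lab; real events vary


def annotate(events, size=EVENT_SIZE_BYTES):
    """Attach the sequence number and byte offset each event would get
    inside a single partition, assuming every event is the same size."""
    out = []
    offset = 0
    for seq, body in enumerate(events):
        out.append({"body": body, "sequence_number": seq, "offset": offset})
        offset += size
    return out


for e in annotate(["order-1", "order-2", "order-3"]):
    print(e)
# offsets come out as 0, 120, 240 -- the 120-byte spacing seen in the receiver output
```

So the offset is not an event counter; it grows with the actual bytes written, which is why a receiver can use it to seek to an exact position in the partition.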
So if I were to run this program again, you can see we are reading the same data again. Now, let's say that you have a process that is reading data from the Event Hub, like our program; maybe it is then taking that data and sending it on to another data store. So remember that Azure Event Hub is used to ingest data. It's like a temporary staging solution that first takes in the data, which can come from multiple data sources. Then you'll have another program that will take the data and, let's say, store it in an Azure Data Lake Gen2 storage account for further processing, or maybe in Azure Synapse. So now, let's say this program has read a set of data, and then the program stops, or it crashes. Now, when you start the program again, it should not read the same data set again, right? So that's important.
Now, there are features available in the .NET packages, basically in the Azure SDK, for something known as checkpoints. So here, once the program starts reading data, it will record a checkpoint, and then on the next run it can read data after that checkpoint. This checkpoint feature is implemented as part of the Azure SDK. Now, obviously, I'm not going into detail, because this is not a development-based course, but I just wanted to inform you of this feature, because I want you to understand, when it comes to Azure Event Hubs, that you can read your data at any point in time. And remember, this data will be in Azure Event Hubs for a duration of seven days, as per my retention period, right? So in this chapter, I just wanted to show you a very simple example of how you can send and receive data using .NET programs.
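The checkpoint idea can be sketched in a few lines. This is a toy, in-memory model of my own, not the Azure SDK (which, in the real .NET and Python SDKs, persists checkpoints externally, for example in a blob container, so they survive a crash):

```python
class CheckpointStore:
    """Toy checkpoint store: remembers the last offset a reader processed,
    so a restarted reader resumes after it instead of re-reading everything."""

    def __init__(self):
        self.last_offset = -1  # nothing processed yet


def read_events(events, store):
    """Process only events newer than the checkpoint, then advance it."""
    fresh = [e for e in events if e["offset"] > store.last_offset]
    if fresh:
        store.last_offset = fresh[-1]["offset"]
    return fresh


events = [{"offset": i, "body": f"event-{i}"} for i in range(5)]
store = CheckpointStore()

first_run = read_events(events[:3], store)  # reader handles 3 events, then "crashes"
second_run = read_events(events, store)     # restarted reader resumes at the checkpoint
print(len(first_run), len(second_run))      # -> 3 2
```

The events themselves stay in the hub until retention expires; the checkpoint only records how far this particular reader (within its consumer group) has gotten.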