DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Event Hubs and Stream Analytics part 6
- Custom Serialization Formats
Now, in this chapter I just want to give a note when it comes to the custom serialization feature that you have as part of Azure Stream Analytics. So when you go on to your inputs, if I go on to any one of our inputs here and scroll down to the event serialization format, this is the format of the events that are coming into the Stream Analytics job. Here you have the options of JSON, CSV and Avro. These are the only built-in formats that are available. So if you want to have other serialization formats in place, you can choose this option, but you need to implement something known as a deserializer in C# that can read the events in that format.
So this program will have the facility to deserialize these custom events, and then the job can process those events accordingly. Again, we are not going into the details of how to implement this. Instead, I am going to attach a link as a resource onto this chapter, wherein there is a tutorial on how to implement a custom .NET deserializer. Now, when it comes to the exam, you might just get a question on how you implement this and what the steps for the implementation are. It won't go deep into the entire implementation process. All you need to do is understand the different steps.
So normally the first step is that in Visual Studio you'll create something known as an Azure Stream Analytics custom deserializer project. Once you have that project in place, you then add all of the classes that are required for deserialization, and then you have to configure the Stream Analytics job. So the deserializer will be in the .NET project, and once you have all of that in place, you can then add that deserializer code onto your input in your Stream Analytics job; a rough idea of what those classes look like is sketched below. So as I said, when it comes to this particular topic, it is just important to understand the key steps that are involved when you want to implement your own custom deserializer for events coming into your Stream Analytics job.
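Just to make the steps concrete, here is a minimal sketch of the two classes such a .NET project typically contains, based on the linked tutorial. The class names, the event shape and the pipe-delimited format are made-up examples, and the base class and method signatures are written from memory of the Microsoft.Azure.StreamAnalytics packages that the tutorial references, so do verify them against the official sample.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.Azure.StreamAnalytics;                 // package names assumed from the tutorial
using Microsoft.Azure.StreamAnalytics.Serialization;

// The shape of a single event; these properties become the input columns in the job.
public class CustomEvent
{
    public string DeviceId { get; set; }
    public double Temperature { get; set; }
}

// Reads the raw payload from the input and turns it into CustomEvent objects.
public class CustomDeserializer : StreamDeserializer<CustomEvent>
{
    public override void Initialize(StreamingContext streamingContext)
    {
        // Diagnostics can be picked up from streamingContext here if needed.
    }

    public override IEnumerable<CustomEvent> Deserialize(Stream stream)
    {
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Hypothetical custom format: each line is "deviceId|temperature".
                var parts = line.Split('|');
                yield return new CustomEvent
                {
                    DeviceId = parts[0],
                    Temperature = double.Parse(parts[1])
                };
            }
        }
    }
}
```

Once this project is built, the compiled deserializer is what you point the Stream Analytics input at when you pick the custom serialization format.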
- Lab – Azure Event Hubs – Capture Feature
Now, in this chapter, I want to go through the Azure Event Hubs capture feature that is available. We have seen how we could stream our data from Azure Event Hubs using Azure Stream Analytics. Now, Azure Event Hubs also has this capture feature in place, wherein you can take the events and stream them onto an Azure Storage account, for example an Azure Data Lake storage account. So your events could be processed after they have been put into an Azure Data Lake storage account, and at the same time the events could also be processed by a Stream Analytics job. Your events can be processed by multiple consumers; remember, this is possible. Now, one of the reasons why you might want to stream events onto an Azure Data Lake Gen2 storage account is, let's say, that you want two different types of processing jobs in place.
One is in real time with the help of Azure Stream Analytics. Another could be based on processing a batch of events for a different purpose. So you might have your events being captured onto an Azure Data Lake Gen2 storage account, and then maybe Azure Data Factory might process those events once a day. So normally this is also referred to as your hot and cold path. The hot path is where you are processing the events in real time, and the cold path is where your events are being processed as a batch. Now, please note that when you enable the capture feature for Azure Event Hubs, the data is captured in the Apache Avro format.
So let's go ahead onto Azure and see how to enable the capture feature. Now, another quick note: when it comes to enabling the capture feature, this is only available when you choose the Standard pricing tier. Here, if we click on View full pricing details, you can see that as part of the Standard pricing tier you get the capture feature in place, and here you can see the estimated cost per month. Now, since I had created my Azure Event Hub earlier on based on the Standard pricing tier, let me go on to my Event Hubs, let me go on to my DB Hub, and here let me scroll down and go on to Capture. Now, before that, let me go on to my Azure Data Lake Gen2 storage account.
And here let me go on to Containers and create a new container to store all of that captured data. So I'll just give a name of event hub and hit on Create. Now, here I'll turn the capture feature on. Let me just hide this for the moment. So this capture feature will take the events either within a five-minute time window, which is the default, or when the size of the events reaches 300 MB. Based on whichever of these thresholds is hit first, it will take the events and then store them in your storage account. In terms of the capture provider, we can leave it as the Azure Storage account option; this also has support for Azure Data Lake. So here, if I go on to select the container, I can go on to my Data Lake 2000 storage account.
And here I can choose my event hub container and hit on Select. Here, Azure Data Lake Storage Gen1 is the earlier version of the Data Lake storage account; we should always ensure that we use the most recent version, the Azure Data Lake Gen2 storage account. And here you can see what the capture file format will be. The path is made up of the namespace, the name of the Event Hub, the partition ID, and then the year, month, day, hour, minute and second, and here it's also giving an example. So let me save the changes. So now, once this is done, your events will be captured onto your Data Lake Gen2 storage account every five minutes, or earlier if the 300 MB threshold is reached.
But remember, you can still read your events from the Event Hub using your Stream Analytics job. It can be done both ways, because your events will not be deleted from your Event Hub instance; that is based on the retention period that you specify. So let's come back after a duration of around ten minutes, once we have some data in place in our container. Now I've come back after a duration of time. If I go on to my Azure Data Lake Gen2 storage account and go on to the event hub container, here I can see my namespace, then DB Hub, then my partition ID, and then folders based on the date and the time. So here you can see the file in Avro format, and if I go on to another folder, I can see that a file is being generated at every five-minute interval.
So in this chapter, I wanted to show you how you can use the Event Hub capture feature to also ensure that your events are stored onto an Azure Data Lake Gen2 storage account.
- Lab – Azure Data Factory – Incremental Data Copy
Hi and welcome back. Now in this chapter, I want to show you how you can perform an incremental copy of your blobs from one container onto another. So earlier on, we had enabled the Event Hub capture feature to ensure that our events are being streamed onto a container. All of the files are in the Avro format. So here is our partition ID under DB Hub, and then it is organized in a date-wise format. So here is the 23rd, and this is based on the time; the latest is 06:00, and here we have all of the time frames. So now I want to copy a subset of this data and then ensure that, incrementally, the data keeps on getting copied. We're going to be using a trigger on the copy data activity that we will create in Azure Data Factory.
So firstly, in our Data Lake Gen2 storage account, in Containers, let me create a new container; I want to copy my contents onto this container. I'll hit on Create, and now we have this new hub container in place. Now I'll go on to Azure Data Factory, go on to the home screen, and here I'll choose the Ingest option; I'm going to be using the built-in copy task. Now, in addition to showing you an incremental copy of blobs, I also want to use the tumbling window trigger and explain what the features of the tumbling window trigger are. So here, when it comes to the start date, I can actually mention a date in the past; here I can mention 06:00. So the tumbling window trigger has the ability to take your backlog data as well.
This is something that is not available as part of the schedule trigger. In the schedule trigger, even if you put a past start date in place, it will still not work. Only the tumbling window trigger has the facility to backtrack and take events from a previous time or date. Now, next, in the recurrence I'm going to choose minutes, and here I'll choose the tumbling window to execute every ten minutes. Another feature of the tumbling window trigger is that it keeps some sort of state or session information in place, because it understands what data has been processed during each recurrence window. So every ten minutes it understands what data has been processed; you can see roughly how this trigger is defined in the sketch below.
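For reference, here is a hedged sketch of roughly what such a trigger looks like in the Data Factory JSON view. The trigger name, pipeline name, start time and parameter names are made up for illustration; the key pieces are the ten-minute recurrence, the past start time, and the window start and end values that get passed to the pipeline through @trigger().outputs.windowStartTime and windowEndTime.

```json
{
  "name": "TumblingWindowTrigger1",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Minute",
      "interval": 10,
      "startTime": "2021-07-23T06:00:00Z",
      "delay": "00:00:00",
      "maxConcurrency": 50
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "IncrementalCopyPipeline",
        "type": "PipelineReference"
      },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}
```

The windowStart and windowEnd parameters here are what you will later see against each pipeline run in the Monitor section.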
You can only associate a tumbling window trigger with a single pipeline, whereas a schedule trigger can actually be associated with multiple pipelines. So there are some core differences between the schedule trigger and the tumbling window trigger. Apart from that, under Advanced there are some other settings in place as well. For example, you can introduce a delay, so that there will be a delay before the data processing of the window actually starts. So there are some additional settings in place as well. Now I'll leave this as it is and go onto Next. Here I'm going to create a new connection; I'll choose Azure Data Lake Storage Gen2 and hit on Continue. Here I'll give the name, scroll down, choose my subscription, and choose my storage account.
I'll hit on Create. Now here I'll browse for my container: my event hub container, onto my namespace, onto my Event Hub, onto the partition, and let me hit on OK. So I want to copy the data from the folder 2021 onwards. Now, here in the file loading behavior, we can choose incremental loads. So this is an option that is available; we can do it either based on time-partitioned folder or file names, or based on the last modified date. I'm going to choose the last modified date when it comes to the incremental load. Now, since our files are in the Avro file format, I have to choose a binary copy when copying data from the source onto the destination, because from what I've seen in the Copy Data tool and in many forums, the Avro file format is not supported in this copy data activity scenario.
That's why I'm just copying it as a binary copy, which takes the data as-is from the source onto the destination. I'll ensure the recursive option is in place and I'll go onto Next. Now I need to choose my destination, so I'll choose a new connection. Again, I'll choose Azure Data Lake Storage Gen2 and hit on Continue. Here, let me give the name, choose the subscription, choose the storage account name, and hit on Create. Here I'll browse for my folder path; I'll choose new hub and hit on OK. Now here in the copy behavior, I'll ask it to preserve the hierarchy, which means that the same folder hierarchy which is present in the event hub container will be copied onto the new hub container. Now here, let me go onto Next, and let me give a name for the task.
Let me go onto Next, and onto Next again. This will create our datasets, create the pipeline and create the trigger; it will also start the trigger. Once this is done, I'll hit on Finish and go over onto the Monitor section. Here I'll just filter on my incremental pipeline, so on the pipeline name. Now here you can see that there were one, two, three, four, five, six pipeline runs that ran in parallel, and these separate pipeline runs are actually based on the six window frames that have been executed based on the tumbling window trigger. If you scroll to the right, here in the parameters, every pipeline run has its parameters. If I go on to one here, you can see the window start time and the window end time. So if I just maximize this, we can see the window start time and the window end time; there are ten minutes between the two.
So if I go on to new hub now and hit on Refresh, I can see the folder 2021, then 07, then the 23rd, as expected. So my recent time of 06:00 is there, and if I scroll down, I have all of these folders; this was all in the ten-minute window time frame. If I go a folder up and go on to 05:00, it was just this one folder that was copied in one of the ten-minute window intervals. So here you can see that in every ten-minute window interval, based on the timestamp of the files that have been uploaded onto the folders, it is ensuring to only copy those files and folders onto the new hub container. And you also saw that it ran pipelines in parallel; it ensured to take the backlog data as well, based on the past start time that we specified for the trigger. If you choose the normal schedule trigger, that will not work; it will not take the backdated data.
So at the moment I have six runs in place. I'll just wait for some time and come back. Now I've come back after around ten minutes, so let me hit on Refresh. Now I can see one, two, three, four, five, six, seven, so I have the seventh run as well. If I go on to new hub, I can see that we now have new folders in place at 07:00, so it has started copying these files as well. So now it's doing an incremental copy during every ten-minute window interval, and it is doing it automatically based on that trigger. Now, when you are done with this and you want to delete the pipeline, there's a process that you need to follow. One way is to go on to the Author section and go on to your pipeline.
Let me just hide this. Now here, go on to your trigger and click on New/Edit. Go and click on the trigger, and here first deactivate the trigger: mark Activated as No, hit on OK, hit on OK again, and then hit on Publish, and hit on Publish here as well. Once this is done, you can now detach the trigger from the pipeline. So again, click on New/Edit, click to detach the trigger, click on Detach, hit on Close, and hit on Publish; again, hit on Publish. And then finally, once this is done, you can go ahead and delete your incremental pipeline. So in this chapter, I wanted to go through these two concepts: one is the incremental copy of blobs and the other was the tumbling window trigger.
- Demo on Azure IoT Devkit
Hi and welcome back. Now in this chapter, I just want to show you a very simple example of how I had actually used an Azure IoT Dev Kit to stream metric data, telemetry data, onto Azure Stream Analytics. So I actually bought this Azure IoT Dev Kit a couple of years ago; I like to experiment with a lot of stuff, so I bought this from Amazon.com. With the help of this IoT Dev Kit, you have the ability to directly send data onto IoT Hub, which is a service that is available on Azure. So here, the first thing that we need to do is to actually create something known as an IoT Hub resource. This IoT Hub resource will take all of the telemetry data that is being sent by my IoT device, my IoT Dev Kit.
So here what I'd done is I had gone on to IoT devices and created something known as a new device. So here you can create a new device, and when you want a device to actually connect onto this IoT Hub, you basically use either the primary or the secondary connection string. So there is a way to ensure that you have your IoT device connect onto your local WiFi and then use the connection string to start sending data onto IoT Hub. Then in Azure Stream Analytics, what you can do, and I have already done this, is: if you go on to Inputs, you can add an IoT Hub as a stream input. If you go on to this, you can choose the IoT Hub resource that you would have created. Now here, let me go on to the query section.
Let me just hide this and let me choose IoT Hub. So you can see I'm getting some information. What's happening is that this hardware device, this IoT device, is continuously sending its telemetry data onto IoT Hub, and Azure Stream Analytics, in real time, can take all of this telemetry data. Then you can process the data and, let's say, store it in a table in your dedicated SQL pool, or maybe just store it in Azure Blob storage for further processing. Now, by default, the device records the temperature and the humidity and sends them on to IoT Hub. Here you will obviously see that there are some places where it is giving null values for the temperature and the humidity.
So, again, a very important point: when you are looking at your data, you need to check all of this. So here is one sample statement that I have; let me just put it here. I'm just taking the message ID, the temperature, the humidity and the event processed UTC time from IoT Hub, where the temperature is not null and the humidity is not null. So let me test the selected query. Now you can see you are only getting those results where the temperature is not null and the humidity is not null. And here I have a variation where I am taking my tumbling window; if you want to look at the average temperature during each tumbling window, that's something that you can do as well. So here, let me test the query. Here I am grouping it by the event enqueued UTC time; both statements are sketched below.
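For reference, here is a rough sketch of the two statements described above in the Stream Analytics query language. The input alias [iot-hub] and the one-minute window size are assumptions on my part, and the demo may well group directly on EventEnqueuedUtcTime rather than using it as the event timestamp, so treat this as an illustration rather than the exact query from the video.

```sql
-- Filter out events where the sensor readings came through as null
SELECT
    messageId,
    temperature,
    humidity,
    EventProcessedUtcTime
FROM
    [iot-hub]
WHERE
    temperature IS NOT NULL
    AND humidity IS NOT NULL

-- Variation: average temperature per tumbling window,
-- using the event enqueued time as the event timestamp
SELECT
    AVG(temperature) AS AvgTemperature,
    System.Timestamp() AS WindowEnd
FROM
    [iot-hub] TIMESTAMP BY EventEnqueuedUtcTime
WHERE
    temperature IS NOT NULL
GROUP BY
    TumblingWindow(minute, 1)
```

In the query editor you can highlight one statement at a time and test just the selected query, which is what is happening in the demo.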
So again, this is a property that is also available for those events which are streamed onto the IoT Hub, and here you can see you're getting the results as desired. So in this chapter, I just wanted to showcase what you can actually do with a Stream Analytics job. Again, if you have opportunities to explore these services, please don't hesitate. The best way to learn is to invest in yourself, and I invest in a lot of things just for learning purposes. I have done this all these years and it has given me a lot of experience. Yes, it did take time for me to understand how I could make the IoT device work in the first place, but in the end, once you have everything up and running, you just learn so much from it. So always learn as much as you can, and always try to take on projects, your own personal projects, to get better experience.
- What resources are we taking forward
Now again, a note on what we are going to take forward. I'm going to be leaving my Stream Analytics resource in place, because I want to go through some aspects when it comes to monitoring, the streaming units, and the partitions. So we'll go through all of this in the monitoring section. I'm going to keep this resource as it is, I'll keep my Azure Event Hubs as they are, and all my other resources as well, such as my dedicated SQL pool and other artifacts.