DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Event Hubs and Stream Analytics part 5
- Lab – Reading Network Security Group Logs – Server Setup
Now I want to show you another example of how we can stream data into Azure Stream Analytics. For this, I'm going to make use of a Windows virtual machine that I have in Azure. This is again based on the Windows Server 2019 operating system, and I used the same method to create the virtual machine. We had already seen this when working with the self-hosted integration runtime environment in the chapters on Azure Data Factory. Now that I have logged on to the machine, I am going to install the role of Internet Information Services on it. This will add a web server onto the machine itself. So in Server Manager, in the dashboard, I'm going to click on Add roles and features. I'll go on to Next, Next, Next again, and here I'm going to choose Web Server (IIS).
I'll add the features, click on Next, click on Next, and finally hit Install. So now it's installing the web server role on this machine. Next, for the virtual machine, I'm going to go on to the Networking section, and here I am going to add an inbound port rule to allow traffic on port 80. We had already added this rule in the lab on the Azure Data Factory self-hosted integration runtime, where we had NGINX in place on the VM. But in case you don't have the rule in place, you can create a rule over here to allow traffic on port 80, because the web server we are installing on the virtual machine, Internet Information Services, listens on port 80.
What I can also do here is choose the HTTP service; this ensures that the destination port range of port 80 is added. Then I can give a name and click on Add. So this ensures that traffic can reach this virtual machine on port 80. Now, once the feature installation is complete on the demo VM, I'll hit Close. I'll go back over here, and soon enough we should find that a rule has been added for port 80, right? So it's come up over here. Now I'll go on to the Overview of the machine, take the public IP address, and go on to a new tab. And here we should be seeing the home page for Internet Information Services, right? So this is the first part of our setup. Let's mark an end to this chapter and go into the next chapter for the next part of our setup.
- Lab – Reading Network Security Group Logs – Enabling NSG Flow Logs
Now, in the last chapter, I'd installed Internet Information Services on an Azure virtual machine. Next, I'm going to go on to a service known as Network Watcher, because I want to use the feature of NSG flow logs. What NSG flow logs do is record all the traffic that is flowing via a network security group. If I again go on to my virtual machine, go on to the Networking section and just hide this, you can see that all the traffic flowing into and out of the virtual machine actually flows via this network security group, which has these defined rules. This is like a filter for your traffic.
It's like having a basic network firewall in place to restrict what traffic comes into and goes out of this virtual machine. So when we enable NSG flow logs, a flow log will be created for all of the requests flowing in and out of the virtual machine. This is quite useful for your IT security department if they want to see what requests are flowing into your VM. So for that, we need to use the Network Watcher service. Here, in NSG flow logs, I'll hit Create. On the right, I am choosing the network security group that is attached to my virtual machine. I also need to choose a storage account that is going to store my NSG flow logs.
I'm going to be using my Azure Data Lake Gen2 storage account. Here you can mention the number of days to retain the data. I'll go on to Next for the configuration. Now, there are two versions of the flow log. Version 2 actually gives you much more data, but for the sake of this particular demo, I'm going to choose version 1 just to limit the amount of data that we have. I'll go on to Tags, then Review and Create, and let me go ahead and hit Create. So now it's creating something known as a flow log. Once the deployment is complete, we have the flow log setting in place. Now let's wait for around 10 to 15 minutes so that some flow logs get recorded.
After waiting for around 10 to 15 minutes, if I go on to my Data Lake Gen2 storage account and go on to Containers, you will see a new container that has been created: insights-logs-networksecuritygroupflowevent. If you go on to it (I'll just hide this), you can see the data is being segregated into multiple folders: based on my subscription, the resource group name, the provider (Microsoft.Network), network security groups, the name of my network security group, the year, the month, the day, and here is the hour. Make a note of this when it comes to structuring your file content in an Azure Data Lake Gen2 storage account.
If I go on to this particular folder, I can see my JSON-based files. If I go on to one and click on Edit, I can see that I again have some record information in place. One thing that is very important is this information, because it gives all of the traffic that is actually flowing into the virtual machine. All of these traffic flows are being denied by the default network security group rules in place, so all of this is denied traffic. If you look at a particular flow event, that flow event format is described in the Microsoft documentation for network security group flow logging, which gives the entire log format. If you scroll down, you can see the breakdown of the different flow events.
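To make that nesting easier to picture, here is a trimmed-down, purely illustrative sketch of what a version 1 flow log record looks like; the values below are made up and are not taken from this lab:

```json
{
  "records": [
    {
      "time": "2023-05-01T10:30:00.000Z",
      "category": "NetworkSecurityGroupFlowEvent",
      "resourceId": ".../NETWORKSECURITYGROUPS/DEMOVM-NSG",
      "properties": {
        "Version": 1,
        "flows": [
          {
            "rule": "DefaultRule_DenyAllInBound",
            "flows": [
              {
                "mac": "000D3AF87856",
                "flowTuples": [
                  "1682937000,185.156.73.54,10.0.0.4,55555,3389,T,I,D"
                ]
              }
            ]
          }
        ]
      }
    }
  ]
}
```

Each entry in flowTuples is a single comma-separated string: the Unix timestamp, the source IP, the destination IP, the source port, the destination port, the protocol (T or U), the direction (I or O) and the decision (A or D).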
So going back here, this field indicates the timestamp, this indicates the IP address the request is coming from, and this is the private IP address of our local virtual machine. And here, this indicates that it is an inbound network flow, that it is a deny request, et cetera. You can see there are so many IP addresses already trying to reach this particular virtual machine, and all of these requests are coming in on different port numbers, not port 80; you can see there is no port 80 anywhere in these destination port numbers. There are always bots on the Internet trying to infiltrate the virtual machines that are defined on cloud platforms.
When it comes to the Internet, you always have to understand all of these risks; there are all sorts of attacks that are possible. Here it's like one bot trying to get to a virtual machine by connecting to a port number that is not allowed, and all of this is being denied by the network security group. But this information is quite useful for your IT security team, so I now want to show you how we can use a query in Azure Stream Analytics to start extracting this information.
- Lab – Reading Network Security Group Logs – Processing the data
So now we have our data streaming in as blobs into our Data Lake Gen2 storage account, and I want to show you how you can use a query to process some of that data. We are going to see how to work with blobs. So far, we have been looking at Event Hubs inputs when it comes to continuous streaming. Now, what about continuous streaming when it comes to blobs stored in a Data Lake Gen2 storage account? The first thing I will do is specify an input, so I'll go on to Inputs. We need to add a new stream input, and here I'm going to choose Blob storage/ADLS Gen2. I'll give an alias name, choose my storage account, and then I need to choose my container.
It's the insights-logs container. For the authentication mode, I'll use a connection string. I won't choose any path pattern and I won't choose any partition key. The event serialization format is JSON. Let me click on Save. In this chapter, we are not going to define any sort of output; I only want to show you how we can work with the streaming data that is coming in the form of blobs in our Azure Data Lake Gen2 storage account. So I'm going to go on to the query, and let me hide this. Now I'll choose the NSG input, because I want to see what data I am getting from that input. Here I can see I'm getting a table of data, and if I go on to Raw (I'll just reduce the zoom), we can see we are getting our flow records in place.
This is what I want, but here you can see the complexity. We have an array of records; within each record we have some properties; within the properties, if I scroll down, there is another array; within that, another JSON object; within that, another array; and then finally I have the array over here that holds all of the flow records. So the data we want is nested quite deep. For me, the IP addresses are what matter: the IP addresses that are being denied. That is of much more relevance to the IT security department.
It's more relevant than just getting the system ID, the MAC address, et cetera; that's something we can also get if we want, but at this point in time I want to get that inner information. So let me again increase the zoom. Now, I've actually divided my code into multiple stages, so let's go through one stage at a time. I've divided it into multiple stages because, from my perspective, to get to that inner data I wanted to go step by step. The first step is to go into records and get all of the array elements. So let me just take this SELECT statement, I'll go here, I'll replace this, and let me test out my query.
So far I am getting the first set of elements, which is the time, and I am getting the flows in terms of the rules, et cetera. So I still haven't reached the actual flow details. Where have I reached so far? So far, I'm only getting the time and all of the rules; you can see all of the rules here. We still have some way to go. Now I want to embed this as a stage, so let me put this in as stage one. I'll just copy this, put it in as stage one, close this, and go back.
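As a rough sketch in the Stream Analytics query language, that first stage might look something like this; the input alias nsg and the stage name are assumptions on my part, and the property names follow the flow log layout shown earlier:

```sql
-- Stage 1: expand the top-level records array, one row per record,
-- keeping the event time and the outer flows array (the rule level)
WITH Stage1 AS
(
    SELECT
        flowRecord.ArrayValue.time              AS time,
        flowRecord.ArrayValue.properties.flows  AS flows
    FROM nsg
    CROSS APPLY GetArrayElements(nsg.records) AS flowRecord
)
SELECT * FROM Stage1
```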
Now let me select what I want from stage one. Here, I want to get the array element from flows, and I only want the first array element, the one at position zero, because everything is embedded under this rule. So I want to go into the flows, and that's why I'm getting the first array element, which starts from position zero. That's my next step. I'll copy this; as I said, I'm going step by step to try to understand it. Now I'll test my query, and we've gone one step further: now we have all of the rules. So what's next? I'm defining this as stage two, so let's do that first and then close this. Next is my stage three. What am I doing in stage three? I'm going into flows, and again, within that, into the inner flows array at element zero. If I go back here, I'm going into flows, and again within this flows, to get this next part.
Going back here, let me take this and test the query. Now I can see we are getting the MAC address, so we are coming closer to what we want to achieve. I'll embed this as stage three and then close this. Now, what do I need from stage three? Here I'm trying to get the exact flow records, so I'm taking the array values to get each flow tuple string, and I'm also getting the time. Let me run this and test the query, and now I can see I'm getting all of the flow tuple strings. So we've almost reached our goal.
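Sketched out in the same way, stages two through four might look roughly like this; it continues the chain from the stage one sketch above, and the stage and column names are again my own assumptions:

```sql
WITH Stage1 AS
(
    -- Stage 1: expand the records array (as in the previous sketch)
    SELECT
        flowRecord.ArrayValue.time              AS time,
        flowRecord.ArrayValue.properties.flows  AS flows
    FROM nsg
    CROSS APPLY GetArrayElements(nsg.records) AS flowRecord
),
Stage2 AS
(
    -- Stage 2: take the first element of the outer flows array (the rule level)
    SELECT time, GetArrayElement(flows, 0) AS flow
    FROM Stage1
),
Stage3 AS
(
    -- Stage 3: go one level deeper into the inner flows array (the MAC level)
    SELECT time, GetArrayElement(flow.flows, 0) AS innerflow
    FROM Stage2
),
Stage4 AS
(
    -- Stage 4: expand flowTuples so that each tuple string becomes its own row
    SELECT time, flowTuple.ArrayValue AS flowstring
    FROM Stage3
    CROSS APPLY GetArrayElements(innerflow.flowTuples) AS flowTuple
)
SELECT * FROM Stage4
```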
Now, in order to break this string up, there's a bit of an issue: there is no function in Azure Stream Analytics to split a string into an array based on a comma, the way you normally can in many programming languages using a separator. Since no such function is currently available here, what I've done is this: first of all, let me define this as stage four and close it. Then I'm selecting only a substring that starts from position number 13. What is position number 13? It's from here onwards, because I don't want the beginning of this string; remember, this is all just one string value. So now I'm using the string functions that are available in Azure Stream Analytics. Let me take this and test the query, and you can see we have this flow string in place. Let me define this as stage five and close it; now my final query is to get just the IP address and the action, which is Deny.
Here I'm again using string functions that are available in Azure Stream Analytics: one is again SUBSTRING, and I'm also using the CHARINDEX function to get the character position of the comma. So I'll select this and test the query, and now you can see I am getting the time, the IP address, and the action, which is Deny. Now, the main reason for having this chapter is twofold. One is to show you how you can stream your blob data. The second is, if you have a complicated structure for your data, how you go step by step in trying to extract the information that you want. And remember, you can then take this information and stream it on to a table.
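Pulling all of the stages together for reference, a hedged sketch of what the complete staged query might look like is below. The start position of 13 and the assumption that the action is the last character of the tuple both come from the walkthrough above; adjust them if your flow strings differ:

```sql
WITH Stage1 AS
(
    SELECT flowRecord.ArrayValue.time             AS time,
           flowRecord.ArrayValue.properties.flows AS flows
    FROM nsg
    CROSS APPLY GetArrayElements(nsg.records) AS flowRecord
),
Stage2 AS
(
    SELECT time, GetArrayElement(flows, 0) AS flow
    FROM Stage1
),
Stage3 AS
(
    SELECT time, GetArrayElement(flow.flows, 0) AS innerflow
    FROM Stage2
),
Stage4 AS
(
    SELECT time, flowTuple.ArrayValue AS flowstring
    FROM Stage3
    CROSS APPLY GetArrayElements(innerflow.flowTuples) AS flowTuple
),
Stage5 AS
(
    -- Stage 5: drop the leading timestamp so the string starts at the source IP
    SELECT time, SUBSTRING(flowstring, 13, LEN(flowstring)) AS flowstring
    FROM Stage4
)
-- Final query: the source IP is everything before the first comma,
-- and the action (A = allow, D = deny) is the last character of the tuple
SELECT
    time,
    SUBSTRING(flowstring, 1, CHARINDEX(',', flowstring) - 1) AS ip,
    SUBSTRING(flowstring, LEN(flowstring), 1)                AS action
FROM Stage5
```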
Next, I also want to show you how you can use windowing functions along with this data, because this is very useful. For example, if you want to look at the count of hits per IP address coming in within a 30-second tumbling window, for the action D, that is, Deny, we can use this. I can just replace it here. I also need to wrap what we had into a final stage; so we had stage five, and now we have our final stage, and then I'm selecting from that final stage. Let's test the query. So now I'm getting the count of hits coming in from certain IP addresses, and you can see that within one 30-second window there were five hits from this particular IP address.
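As a sketch, the tumbling window query might look something like this; FinalStage, ip and action are the names assumed in the earlier sketches, where FinalStage simply wraps the final time / ip / action projection:

```sql
-- Count denied hits per source IP in fixed, non-overlapping 30-second windows
SELECT
    ip,
    COUNT(*) AS hitcount
FROM FinalStage
WHERE action = 'D'
GROUP BY ip, TumblingWindow(second, 30)
```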
If you want, you can also use the hopping window, where every 10 seconds you want to see the count of hits per IP address within a 30-second interval. I can just test the selected query, and here I can see I'm getting the results as desired. And let's say you want a sliding window, where you only want to look at IP addresses whose hit count is greater than five; you can choose that as well. I'll test the query, and here you can see the desired results. Right? So in this chapter, I just wanted to show you another example of working with Azure Stream Analytics.
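For reference, here are hedged sketches of the hopping and sliding window variants described above, using the same assumed FinalStage, ip and action names; test them one at a time.

```sql
-- Hopping window: every 10 seconds, report the counts over the previous 30 seconds
SELECT ip, COUNT(*) AS hitcount
FROM FinalStage
WHERE action = 'D'
GROUP BY ip, HoppingWindow(second, 30, 10)
```

```sql
-- Sliding window: produce a result only when an IP crosses 5 denied hits
-- within any 30-second span
SELECT ip, COUNT(*) AS hitcount
FROM FinalStage
WHERE action = 'D'
GROUP BY ip, SlidingWindow(second, 30)
HAVING COUNT(*) > 5
```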
- Lab – User Defined Functions
Now, in the last chapter, we had seen how we could process the records that were coming in as part of our network security group flow logs. We had seen all these stages, and we had got the time, the IP address and the action. Please note that when we created this Azure virtual machine, we also installed Internet Information Services, and we ensured that there was a network security group rule allowing traffic on port 80. Even though here we are only looking at the denied traffic, you can always design your query in such a way that it looks at the requests that are coming in on port 80; there is a separate rule that we had defined for that. I had just given an example of how you can process the data that's coming in.
That was just a point I wanted to make. This chapter is where I want to show you the user-defined functions that you can create in Azure Stream Analytics. I told you that we were using the SUBSTRING function to get the data that we wanted, and we had to go through a lengthy exercise just to get what we wanted, because it was difficult to split the string based on commas, right? There was no predefined function that would allow us to split a bigger string into smaller strings based on commas. But what we can do is define our own function. In Azure Stream Analytics, you can actually define your own JavaScript user-defined functions.
You can also create C# user-defined functions as well. In this chapter, I want to give an example of how we can work with JavaScript user-defined functions. First of all, my job is in the stopped state, not the started state, and I'll also make sure to save my query. Now I'm going to go on to Functions, which is available here, and I am going to add a JavaScript UDF, that is, a user-defined function. Please note there are other options in place as well, but for the purpose of the exam, we have to understand JavaScript UDFs, or user-defined functions. So I'll choose this. Now, here we need to add some code, and we need to give a function alias, a name for the function. In my code, I have a very simple user-defined function in place.
First of all, I'll give a name for my function. I'll leave the output type as Any, but you can define a specific type for your output. Then I am going to take this function and replace it here. So what are we doing in this function? I am taking the entire flow log string, using the split method that is available by default in JavaScript to split that long string based on the commas, and then returning the string at a particular index position. So if I call this main function and say, please give me the string before the first comma, that is the string at index position zero; index position one gives the next one, and so on, and then I'm returning that string.
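As a sketch of that function, assuming the function alias is getItem and that it takes the flow string and an index as its two parameters (both assumptions based on how it is called below):

```javascript
// Splits a comma-separated flow tuple string and returns the value
// at the requested index position
function main(flowLog, index) {
    // split() turns the single string into an array of individual values
    var parts = flowLog.split(',');
    // Return the element at the given zero-based index
    return parts[index];
}
```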
So first, let me save this function. Once that's done, I'll go back on to my query just to explain what we want to do. I'll scroll down and change this slightly: I'll select the time, and instead of the substring I'll just select the entire flow string, which comes from stage five. Let me test my query; I'll have to choose the NSG input again, wait for the data to come in, and then test the query. Once I have some data in place, I'll hit Test query. So remember, we had our entire flow string, and I wanted to split everything based on the commas so that I get each individual piece of the string. This flow string is what we can now pass into our function as flowLog. That entire string will come in as flowLog, and then we can get an array of items by using the split method; as I said, this is already available in JavaScript, so it will split the entire string separated by the commas, give us an array of values, and then I can return the string at a particular index in the array.
Next, let me copy this entire set along with the last statement that I have, and replace it here. So what are we doing? Now I want to call my getItem user-defined function. You have to prefix it with UDF, so UDF.getItem, and here I am passing in our flow string. I want to get the string at array index position zero, and array index position zero will give me the source IP address that is trying to access the virtual machine. As I said, that entire flow string will now be passed on to our user-defined function, we'll get an array of individual strings, and we can access each individual string based on its index value. Similarly, index two will give me the source port, and then I'll get the direction and the action.
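A hedged sketch of that query is below; the index positions assume the flow string has already had the leading timestamp stripped off, as in the earlier stages, and getItem, flowstring and Stage5 are the names assumed in the previous sketches:

```sql
-- Pull individual fields out of the flow tuple string via the JavaScript UDF
SELECT
    time,
    UDF.getItem(flowstring, 0) AS sourceip,
    UDF.getItem(flowstring, 2) AS sourceport,
    UDF.getItem(flowstring, 5) AS direction,
    UDF.getItem(flowstring, 6) AS action
FROM Stage5
```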
So let me test the selected query, and here you can see you are getting the results as desired. We are getting the source IP address, the source port number, the direction, which is inbound, and the action, which is Deny. Now, in case you are getting an error saying that getItem is not defined, action is not defined, or direction is not defined, just test the selected query again; otherwise, make sure to save the query, refresh the page, and then test the selected query again. I have seen that sometimes, when testing the query, you might get this sort of error. But here I wanted to show the advantage of being able to define your own user-defined function: even though we had a limitation with the built-in functions in Azure Stream Analytics, we were able to define our own user-defined function to overcome it.