DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Scala, Notebooks and Spark part 4
- Lab – Spark Pool – Removing NULL values
Now, in this chapter I want to show you a couple of aspects. The first is the grouping of your data. This is similar to the GROUP BY clause that you have in SQL. And also how do you remove null values. Now, before that, I just wanted to let you know that one of the reasons I have set my session timeout to 30 minutes is that if the session is idle for 30 minutes, it will shut down and your state will no longer be ready over here. As long as your pool is in a ready state, it can actually take commands from your notebook. That’s the main reason. Now, in the last chapter we had created our data frame based on the data in our Log CSV file. So now we can start working with the data. So I can create a new code cell. And here I want to use a number of methods again. So let me take this and add it here.
So the first is the groupBy function. Very simple, we want to group our results based on the operation name. So we want to see how many delete SQL database operations there are, how many delete SQL pool operations there are, et cetera. So this is like actually performing SQL-based operations on your data. The only difference is that everything is working in Spark, right? And I said Spark is meant for big data processing. Here I just have rows in a Log CSV file which we have taken onto a data frame. You could be having a lot of data that needs to be processed. So I’ll scroll down. So we have a groupBy function. Then on the groupBy you normally perform an aggregation. So what is the aggregation that I want to perform? I want the count of the correlation ID.
So with every operation there is a correlation ID. So that will actually give me the number of operations based on the operation name. Then I’m using the alias function to actually give a name to the column that gets generated by the aggregate clause. And then I am doing an orderBy on that column of total operations. So here the alias is creating a new column of total operations, and here we are using the orderBy on that new column, in descending order. And then we are displaying the new data frame. So, let me run this. So here it’s saying that count is not defined. And again, why is that? Because we need to import the required classes or the required methods. So I can run this. So here you can see the operation name and the total operations.
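As a rough sketch of what that cell might look like (the data frame name df1 and the column names OperationName and CorrelationId are assumptions based on the log file described here):

from pyspark.sql.functions import count, desc

# Group by the operation name and count the correlation IDs in each group.
# Column and data frame names are assumptions based on the lab description.
df_grouped = (df1.groupBy("OperationName")
                 .agg(count("CorrelationId").alias("TotalOperations"))
                 .orderBy(desc("TotalOperations")))

# display is the notebook helper that renders the result as a table or chart.
display(df_grouped)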
Now, if I just scroll on top. Now for these entire results, which are in the table format, if you go on to the chart, something that is always available when it comes to notebooks is visualization. Now here I’ll go on to the view options. Here, in terms of the key, I’ll choose the operation name. And here, in terms of the value, I’ll use the correlation ID, and let me hit on Apply. The aggregation is count. So the same thing which we were trying to achieve before with the help of functions, I can also achieve here when it comes to the chart. But remember, this is only for display purposes. What I am creating below is a new data frame that can be taken forward.
Maybe now you want to do something with this data frame, right? So you have multiple data frames and you can perform operations on these different data frames. The chart over here is only for display purposes. Now, if I scroll down for this particular groupBy functionality that we just implemented, if I go on to the chart here, I can see that there are some nulls in place. So there are eight operations that don’t have any sort of, let’s say, operation name. We don’t want these null values in our data frame. So I’ll go on to my next Python snippet. Again, very simple. If you now only want to filter on the operation name, where it doesn’t have any null values, we can use the isNotNull method and assign the result onto a new data frame.
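A minimal sketch of that filter, assuming we apply it to the grouped data frame from the previous sketch (the data frame and column names are assumptions):

from pyspark.sql.functions import col

# Keep only the rows where the operation name is not null.
# df_grouped and the column name are assumptions carried over from the sketch above.
df_no_nulls = df_grouped.filter(col("OperationName").isNotNull())

display(df_no_nulls)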
I’ll copy this, I’ll go on to a new code cell. Let me run this. And here, if I go on to the chart, you can see we don’t have that null value being shown in the chart over here. So now you have seen how easy it was to remove null values from your data set. You don’t have to write lines and lines of code to remove these null values. And as I mentioned before, if you are a SQL developer, you know that all of this is also possible with SQL-based statements on your database. Again, remember, the core difference is that here we are trying to perform ad hoc analysis on data that could be in a different data source. Remember, when it comes to a SQL database, if you want to perform SQL queries on your data, it needs to be in the database.
But let’s say you have a log file and you want to perform some sort of analysis on that log file. You want to remove the null values, take the columns of interest, take the rows of interest, and then maybe take it onto a SQL data warehouse or even a database. So for that, you need a system to first filter out that data. There’s no point taking the entire chunk of data, because log data is normally very big, right? You don’t copy a chunk of data onto a SQL database, run the SQL commands, and then take your data of interest. Here you are performing exploratory analysis of your data. And here the data is in our Data Lake Gen2 storage account.
- Lab – Spark Pool – Using SQL statements
Now, if you are talking about SQL, that is something that is also available when it comes to Spark. So, Spark gives you the ability to also issue normal SQL statements against your data. So firstly, let me take these first three statements. Here, let me scroll down and create a new code cell. Now, in order to perform normal select operations, normal SQL statements, you first have to take your data frame. So here is my data frame. If I go ahead and scroll on top, my initial data frame is df1. Let me scroll down. So I’m taking df1, right? So on that data frame I call createOrReplaceTempView and give the name of that temporary view as logdata.
Once we have that view in place, we can then use spark.sql to start issuing SQL-based statements. When it comes to Spark SQL, we have to use the show command to display the results. So, if I execute this, here we can see the operations as required. It only shows the top 20 rows. Now, there is another way that I’ve shown over here on how you can perform select statements. So if you don’t want to use spark.sql and you just want to issue the select statement, let me copy this. I need to go on to a new cell. And here I’m entering the statement.
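A rough sketch of those statements, assuming the data frame df1 and the column names used earlier:

# Register the data frame as a temporary view so it can be queried with SQL.
df1.createOrReplaceTempView("logdata")

# Issue a SQL statement through the Spark session; show() prints the top 20 rows.
spark.sql("SELECT OperationName, COUNT(CorrelationId) AS TotalOperations "
          "FROM logdata GROUP BY OperationName").show()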
Now, what is this statement all about? This is one of the magic commands that is normally available in notebooks. See, our default language for this entire notebook is Python, right? So the implementation of Python that actually works on Spark is PySpark. But if you want to, let’s say, use another language, remember there is support for different languages in your notebook. This is the current default language for this particular notebook. If you want to just use another language for a particular cell, you can use this magic command to say, okay, now I want to use the SQL-based language. That’s the entire idea. And here, if I run this cell, again you can see the output as desired and you’re getting this in the table format. You can go on to the chart. And here, remember, again, you are getting null values, right? Because remember, we are working on our original data frame.
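For reference, a cell using the SQL magic command might look something like this; the %%sql magic switches just that one cell to Spark SQL, and the view and column names are the same assumptions used above:

%%sql
-- Runs this single cell in Spark SQL instead of the notebook's default PySpark.
SELECT OperationName, COUNT(CorrelationId) AS TotalOperations
FROM logdata
GROUP BY OperationName
ORDER BY TotalOperations DESC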
- Lab – Spark Pool – Write data to Azure Synapse
Now, in this chapter, we are going to see how to read data from our Log CSV file again and use the built-in connector that is available in Azure Synapse to copy or write the data onto our SQL Pool. So, I still have my dedicated SQL Pool in place. So there is a connector between the Spark Pool in Azure Synapse and the dedicated SQL Pool in Azure Synapse that allows you to write data directly onto the dedicated SQL Pool. Currently, this is not available when it comes to the serverless SQL Pool. With the serverless SQL Pool, you can only create external tables. Now, this also only works with Scala. It doesn’t work with any other programming language. You also need to import the following libraries when you want to make use of this connector.
So here, let me delete whatever cells I have. So I’m going on to my original cell. Let me delete this here, let me put my import statements, and let me change the cell language to Scala. Now, next I want to read the data from my Log CSV file. So now I’m actually using a different method. Over here when it comes to Spark, it’s kind of the same spark.read.format. Here I’m specifying that I want to read a CSV file. In the option I am saying, please make sure that the header is true, because the first row defines the column information, the column headers. And here I want to specify now a schema. I’ll explain about this. And here I’m using the load operation to directly load the data from our data lake.
Here, I am not specifying any sort of Spark configuration when it comes to the access key. So this is something that we can do. So if I just place the statement over here just for now. One of the reasons that we can actually do this is because, let me first of all publish this particular notebook so that it’s saved, so here if you go on to your Data section, earlier on we had actually linked our Azure Data Lake Gen2 storage account onto our Synapse workspace, so we could see everything over here in the data container. Because we have now linked our Data Lake Gen2 storage account onto our Azure Synapse workspace, it is now integrated without the need of specifying any sort of credentials to authorize ourselves to use that Azure Data Lake Gen2 storage account.
So really, Azure Synapse has that capability of bringing everything together so that you can actually work with your data from one system itself. Now, next, what I want to do is define my data schema. So if you want to ensure that the data frame adheres to a particular schema, you can define it here. So there are some data types that are available. For example, we have the integer type, the string type, and here we have the timestamp type. So with the help of the StructField, I can define a field as part of our schema. Here I’m giving the name, then the data type, and then whether it is nullable or not. So I’ll copy this and place it here. And the last thing I’m doing at this point in time is to print the schema of my data frame.
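A rough sketch of that Scala cell; the field names and the abfss path are assumptions based on the lab description and should be adjusted to your own log file and storage account:

import org.apache.spark.sql.types._

// Define the schema the data frame should adhere to.
// The field names here are assumptions based on the log file described in the lab.
val dataSchema = StructType(Array(
  StructField("Id", IntegerType, true),
  StructField("CorrelationId", StringType, true),
  StructField("OperationName", StringType, true),
  StructField("Time", TimestampType, true)
))

// Read the CSV file from the linked Data Lake Gen2 account.
// The container and account name in this path are placeholders.
val df = spark.read.format("csv")
  .option("header", "true")
  .schema(dataSchema)
  .load("abfss://data@datalakeaccount.dfs.core.windows.net/Log.csv")

df.printSchema()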
That’s it. So let me also ensure there are these import statements. I forgot to put them here. Let me run this particular cell. So now you can see we have our data frame in place. And here it is abiding by our schema, right? So we have the integer, we have the string, we have the timestamp. I’ll scroll down. Now here we are going to be using a very simple command to write our data onto the SQL pool. So, because this implementation is already available in notebooks in the Spark pool, we just need to write one very simple command: our dataframe.write.sqlanalytics, with the name of our database, the schema and the name of our table. The Constants.INTERNAL argument just says that this is a managed table in the dedicated SQL pool and not an external table.
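As a minimal sketch of that write, and of the corresponding read mentioned next, assuming a dedicated SQL pool named newpool and a table dbo.logdata, and assuming the import statements typically used with this older sqlanalytics connector:

import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

// Write the data frame as a managed (internal) table in the dedicated SQL pool.
// The pool and table names are assumptions; the table must not already exist.
df.write.sqlanalytics("newpool.dbo.logdata", Constants.INTERNAL)

// Reading an existing table from the dedicated SQL pool back into a data frame.
val dfFromPool = spark.read.sqlanalytics("newpool.dbo.logdata")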
Here I have just given the definition for writing data. Note that you could also read data from a table; I’ve given it here. Now, some aspects about this particular command when it comes to the dedicated SQL pool: firstly, the table must not exist. So there are some constraints. With every technology there are some constraints, but these constraints get removed as newer features start to open up. And let me tell you, when working with Spark on my local system, getting everything set up and trying to make things run is not so easy. But working with a notebook in the Spark pool is just so simple. There are some limitations when it comes to Azure, but these are very small limitations.
At this point in time, there is still so much that you can do, not only with the Spark pool, but also, as you’ll see, with Azure Databricks. So there is still a lot that you can actually do. It’s just that in this particular case there is a limitation wherein the table must not exist. So I’ll go on to my dedicated SQL pool. I’ll right-click and hit on New query. So first let me drop the logdata table. I’ll hit on Execute. So this is done. Now next, I’ll copy this command and I’ll create another cell. I’ll ensure to also copy the import statement; this is also required. Let me run this. Once this is done, so it took a minute, let me do a select star. And now I can see all of my data in place. So what you can do is you can actually work with your data in a notebook, you can analyze your data in the notebook, and once done, you can actually write it onto a dedicated SQL pool.
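The statements run against the dedicated SQL pool would look something like this (the table name dbo.logdata is the same assumption as above):

-- Drop the table first, since the connector requires that the table must not exist.
DROP TABLE dbo.logdata;

-- After the notebook cell finishes writing, verify the data has landed.
SELECT * FROM dbo.logdata;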
Now, just a quick note when it comes to the permissions: you should be able to write to that dedicated SQL Pool from that notebook in the Spark Pool. Now, for this, we are logged in as the Azure administrator. And the Azure administrator, if I go on to my workspace, if I go on to the SQL Active Directory admin, I am also the admin for this workspace itself. Also, you have to ensure that the user is given the Storage Blob Data Contributor role on the storage account that is attached onto the Synapse workspace. And this is automatically done when you create the Azure Synapse workspace. So in another tab, if you actually create a Synapse workspace, I’ll hit on Create.
So here you will see that it will automatically grant the Storage Blob Data Contributor role. And again, what is the Storage Blob Data Contributor role? So, if I go on to All resources, if I go on to the Data Lake storage account that is actually assigned onto my workspace, that’s this one here. If you go on to Access control here, if I go on to Role assignments, let me just hide this. So here the Storage Blob Data Contributor role is given both onto the workspace (its managed identity) and onto the Azure admin as well. Now, please note that when it comes to the security aspect, for this course we’ll go through both aspects: role-based access control and also how to work with Azure AD authentication when it comes to Azure Synapse.
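If you ever need to grant that role yourself, for example for another user, a hedged sketch with the Azure CLI could look like this; the object ID, subscription, resource group and storage account names are all placeholders for your own environment:

# Grant the Storage Blob Data Contributor role on the workspace's Data Lake Gen2 account.
az role assignment create \
  --assignee "<user-or-service-principal-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"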