DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Scala, Notebooks and Spark part 5
- Spark Pool – Combined Power
In this chapter, I just want to review the combined power that you have in Azure Synapse. So you have your dedicated SQL Pool, in which you can host your SQL data warehouses, and you have the Spark Pool, in which you can run your data processing workloads. Remember, as part of your dedicated SQL Pool, you have the Control Node. This accepts all of the queries and then it distributes them onto the different compute nodes. And in Spark it's the same thing. The only difference is that your applications are sent on to a driver node. The driver node then sends the application's jobs on to the executors.
In the Spark Pool, the instances are created when you connect to the Spark Pool, create a session and then run the job. And as I mentioned before, if you want to submit another job, if there is capacity in the pool and the Spark instance also has spare capacity, it will run the second job. If the pool has capacity, but the Spark instance doesn’t have the capacity, it will create a new Spark instance to run the second job. Right, just a quick view on the Azure Synapse power of having the dedicated SQL Pool and the Spark Pool.
- Lab – Spark Pool – Sharing tables
Now, in this chapter I want to talk about the ability to create Spark tables. So, we have tables that can be defined in our dedicated SQL Pool. We can also define something known as external tables, both in the dedicated SQL Pool and in our serverless SQL Pool. In the Spark Pool, we are making use of Spark to create RDDs or data frames, and then we can write that data onto, let's say, a table in our dedicated SQL Pool. We have seen all of this, but we can also create a Spark table on its own. This is actually stored in the metastore part of the Spark Pool. Even in Spark itself, you have something known as the Hive Metastore in which you can also work with tables. Here, in terms of the Spark Pool, you can create tables in the metastore of the Spark Pool. So here my commands are going to first read the data that I have in my log data table.
Remember, earlier on we had seen how to write data. Here we are using the Read API to read data. Then we are taking the data onto a data frame. And then we are using the write method again, with the mode of overwrite. So there are some modes that are available depending upon the destination that you write to. And here we are saying now save it as a table, save it as a Spark table, and then we can give the name of the Spark table. So, let me run this. So here in our existing cell, let me write this statement and I'll execute this. Once this is done, I can create a new code cell. And here, using the SQL language, I can now work with the data in that table, which is part of the Spark metastore. Now remember, you are not going to be storing your data in these tables.
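As a rough sketch of what that cell is doing, assuming the source table is read through the Synapse dedicated SQL pool connector's synapsesql method, and with the pool, schema and table names below being purely illustrative:

```scala
// Sketch only: read an existing dedicated SQL pool table into a DataFrame
// and persist it as a Spark table in the pool's metastore.
// The names newpool.dbo.logdata and logdatainternal are illustrative.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SqlAnalyticsConnector._ // brings in synapsesql; import path may vary by runtime version

val df = spark.read.synapsesql("newpool.dbo.logdata")

df.write
  .mode(SaveMode.Overwrite)   // the "overwrite" mode mentioned above
  .saveAsTable("logdatainternal")
```

Once it has been saved, a SQL cell can simply run a select star against that table name, since the table now lives in the Spark metastore's default database.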
This is just meant to be like having temporary tables in place. Ideally, if you want to perform analysis, your tables will be in your dedicated SQL Pool. But the benefit of creating a Spark table here is that the metastore is shared between not only your Spark Pool, but your serverless SQL Pool as well. So let me just publish this again to save everything. So now, here, if I create a new SQL script, and let me just hide this here, let me choose the Built-in serverless SQL Pool. And here in the database I'll choose default. Now, default is the default database that is created when it comes to the Spark tables. And now I can do a select star from the log data internal table here as well, and I'll click on Run. So, because Azure Synapse encapsulates both your Spark Pool and your SQL pools, you have this capability of sharing tables as well.
- Lab – Spark Pool – Creating tables
Now, in the last chapter, I have shown you how you can create a Spark table. This was actually based on the data that is already available in a table in a dedicated SQL pool. Now here I'll just show you how to create a table within a database. So the table that was created earlier was part of the default database. And in the earlier chapter you would have seen that, in the drop-down for the databases in the serverless SQL pool, there was also a database by the name of internal DB. Now, I have gone ahead and deleted that database and the tables, because I want to show you how you can create a database and create the Spark tables. So this is something that you can actually do in the Spark pool.
So firstly, I want to create a new database. I'll run this. Now, my Spark session has actually timed out, so let me come back once this command has finished executing. Now, once this command is complete, I can create a table. So here I am saying please create a table known as Customer in my internal DB database. I'm mentioning the columns of ID and name, and I'm specifying int and varchar. And here I'm using parquet. So these Spark tables are basically parquet-backed tables. The datetime format is not available when it comes to these Spark tables, so that's an important note. So let me create the table and then I'll go on to the next command to insert some data into the table.
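As a minimal sketch, these two steps could also be issued from a Scala cell with spark.sql (the lab uses a SQL cell; the effect is the same). The database and column names follow the lab, while the varchar length is just illustrative:

```scala
// Create the database for the Spark tables.
spark.sql("CREATE DATABASE IF NOT EXISTS internaldb")

// Create a parquet-backed Spark table with the two columns from the lab.
// The varchar length is illustrative; datetime columns are not supported here.
spark.sql("""
  CREATE TABLE IF NOT EXISTS internaldb.customer
  (id INT, name VARCHAR(200))
  USING PARQUET
""")
```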
Just one row. Once this is also done, then you can select the data from the table. So here we can see the data. Now, another variation is that I want to read the data from our Azure Data Lake storage account, the log CSV file. So earlier on, remember, we had written the contents onto a table in our dedicated SQL pool. But you can also write this onto an internal table. So here I'm reading the contents of the file, I'm loading it into a data frame and then I'm using the write method to write it as a table. Here I'm giving a table name that's part of my internal database. So this is being executed in PySpark. So here, let me replace this, and let me execute this. Let me also create a new code cell.
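The lab runs this step in PySpark; an equivalent hedged sketch in Scala could look like the following, where the inserted values, the abfss path (storage account, container, directory and file name) and the table name logdatanew are all illustrative placeholders pieced together from the lab's description:

```scala
// Insert a single placeholder row and read it back.
spark.sql("INSERT INTO internaldb.customer VALUES (1, 'UserA')")
spark.sql("SELECT * FROM internaldb.customer").show()

// Read the log CSV file from the Data Lake Gen2 account into a DataFrame
// and save it as a Spark table inside the internaldb database.
val logDf = spark.read
  .format("csv")
  .option("header", "true")                 // first row carries column names
  .load("abfss://data@datalake2000.dfs.core.windows.net/raw/Log.csv")

logDf.write
  .mode("overwrite")
  .saveAsTable("internaldb.logdatanew")
```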
So this is done. And as normal, I can use the SQL language to look at the data in the table. So you can see all of the data. And if you go on to the Data section, so here, let me refresh everything. So here we can see our internal database, and we can see our tables, the customer table and the log data new table. If you want to delete this database, you first have to delete the tables. So here we first need to drop the tables. So one is log data new, and then we have the customer table. See, with IntelliSense you can see what the tables in the database are. I'll drop this as well. And then we need to drop the entire database, so it will be internal DB. So we can drop the database as well. So again, in this chapter, I wanted to go through some extra concepts when it comes to Spark tables.
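As a small cleanup sketch, the drops can also be issued through spark.sql, using the same illustrative names as above:

```scala
// The database can only be dropped once the tables inside it are gone.
spark.sql("DROP TABLE IF EXISTS internaldb.logdatanew")
spark.sql("DROP TABLE IF EXISTS internaldb.customer")
spark.sql("DROP DATABASE IF EXISTS internaldb")
```

Spark SQL also supports DROP DATABASE with the CASCADE keyword to remove the tables in one go, but the lab drops the tables explicitly first.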
- Lab – Spark Pool – JSON files
Now, when it comes to the Spark pool, we'll see how to work with JSON-based files. So I'll be using two files which we had also seen earlier on in Azure Data Factory. These are JSON-based files that have information about customers. One file had arrays within the JSON objects, and the other had objects within a JSON object. So I've already uploaded those files onto my Azure Data Lake Gen2 storage account. So if I go on to Data Lake 2000 here, if I go on to containers, if I go on to my data container, if I go on to my raw directory, if I go on to my customer directory, we'll target first our customer array JSON file. If I go on to Edit, there are two objects. Each object has an array.
So this array lists the courses that could have been bought by this particular customer. Now, please note that there is a slight difference when it comes to the structure of this file from the file we had seen in Azure Data Factory. Here, everything is a standalone object. In the chapter for Azure Data Factory, everything would have been part of an array. But here, if you do that, you'll actually get an error if you want to parse the file as it is. So I have deleted those square brackets and kept it as simple objects. There is also no comma between the objects. I'll click on Save just to ensure that we have this file in place. Now, next, I am again reading the data, this time specifying it as the JSON format, into a data frame. Here I'm loading first the customer array JSON file.
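To make that shape concrete, a hypothetical sample of such a file might look like the lines below: standalone JSON objects, one after the other, with no enclosing array and no commas between them, which is the line-delimited form that Spark's JSON reader handles by default (objects spread across multiple lines would need the multiLine option). The field names and values are purely illustrative.

```json
{"customerid": 1, "customername": "UserA", "registered": true, "courses": ["Azure", "SQL"]}
{"customerid": 2, "customername": "UserB", "registered": false, "courses": ["Spark"]}
```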
So, let's take this. So in one of my cells, let me run this. So here, Spark means Scala in the end, so we don't need to actually add this magic command. Now, here you can see the Courses column is coming as an array. When it comes to the table in Azure Data Factory, we had seen how we could flatten this particular array so it comes as separate rows. We can do the same thing here as well. So for that, we need to use something known as the explode function, which is available. So let me take all of these statements and copy them here. So again, I'm loading my data frame. Now, here I am creating a new data frame. I am saying now please select the columns of interest. So here I am showing you another function: here I can select what columns I want from my original data frame.
So here I'm mentioning I want the customer ID, the customer name and registered. And then, when it comes to arrays, there is a function known as explode. So we can mention the column that we want to explode, in this particular case, and this will explode that array column. Let me run this. So here you can see the contents as desired. This is something that we had also seen in Azure Data Factory, but over there we were using the flatten transformation. Now, next, let me go back onto the data container. Now, here I'll go on to the customer object JSON file. If I click on Edit, remember, in addition to the courses, we also had another embedded object, that's Details. Here we have the mobile and the city. Again, here, notice the structure of the JSON file.
Again, it's different from what we had seen in Azure Data Factory. So let's look at the code. It's pretty similar for reading this content. And here, the only difference is that, in addition to exploding the courses, when I'm actually selecting the columns, since the city is part of the Details object over here, I can now get the city and the mobile from the Details object itself. So I can access each part of that JSON object. If I run this, again you can see you are getting the results as desired. If you scroll on to the right, you're getting the city and the mobile number. So in this chapter, I wanted to go through how you can actually read JSON-based files. So from the Spark Pool perspective, these are the chapters that we have covered. So now you can go ahead and delete this Spark Pool if it's not required.
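Putting the two JSON labs together, a hedged Scala sketch of the flow could look like this; the abfss paths, file names and column names are illustrative reconstructions of what the lab describes:

```scala
import org.apache.spark.sql.functions.{col, explode}

// customer_array.json: each customer object carries a courses array.
val arrayDf = spark.read
  .format("json")
  .load("abfss://data@datalake2000.dfs.core.windows.net/raw/customer/customer_array.json")

// explode turns every element of the courses array into its own row,
// similar to the flatten transformation in Azure Data Factory.
arrayDf.select(
    col("customerid"),
    col("customername"),
    col("registered"),
    explode(col("courses")).alias("course"))
  .show()

// customer_object.json: additionally has an embedded Details object.
val objectDf = spark.read
  .format("json")
  .load("abfss://data@datalake2000.dfs.core.windows.net/raw/customer/customer_object.json")

// Nested fields of the struct column are reached with dot notation.
objectDf.select(
    col("customerid"),
    col("customername"),
    col("details.mobile"),
    col("details.city"),
    explode(col("courses")).alias("course"))
  .show()
```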