DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Scala, Notebooks and Spark part 3
- Lab – Spark Pool – Starting out with Notebooks
Now in the last chapter we had gone ahead and created a Spark pool. Now in this chapter and in the subsequent chapters, we’ll see some simple examples when it comes to working with notebooks. And along the way I’ll also explain some of the important concepts when it actually comes to Spark. So here I have some commands in Scala. Now what I’ll do is that in Visual Studio Code, I’ll just go on to extensions, let me search for Scala and install the Scala syntax extension. So that’s done. So if I go back now onto my Scala programs, here I can see that it is recognizing this as a Scala program. I just want to have this in place as well. So let’s now create a notebook. So we are going to be doing this from Azure Synapse Studio.
So again, I’ll go on to my workspace and I’ll launch Synapse Studio. Now here, I’ll go on to the Develop section, let me hide this. And here I can create a new notebook. If you want, you can give a simple name for the notebook. Let me hide this and let me also collapse this. Now here it’s saying please select a Spark pool to attach before running the cell. So remember I showed you Jupyter notebooks earlier; there also you run everything in a cell. It’s the same concept over here. So in every notebook that you’ll actually see, the implementation goes into running commands in a cell. Now here, let me attach it onto a Spark pool. So we have attached the notebook onto the Spark pool, but it has not started. So we’ve still not started bearing any sort of cost when it comes to the Spark pool.
Here you can see the different languages available. So you have Python, that’s PySpark. You have Scala. You have .NET Spark, and you also have Spark SQL. Now, since we are looking at Scala, initially I’ll choose Spark (Scala) here. And now let me take these commands. So I’ll take the first command. So what we’re doing here is we’re using Scala and we are creating a simple array of values. Remember, when looking at Scala, I talked about a list of values. Here I’m just creating a simple array of integers. So I’ll place it here. What’s my next command? Now my next command is the parallelize method in Spark that helps to create something known as an RDD. An RDD is a resilient distributed data set. So, as I said, that’s what Spark does.
It takes your data set and then executes whatever you want to do on that data set on multiple executors. So these are done as individual tasks. Now when you have your data set split across these different compute nodes as tasks, that’s the entire purpose of the distributed part of the RDD. Your data is distributed across various compute nodes, then you can perform whatever transformation you want, it will be done much faster, and you get the result. The other part of the RDD, right, we have resilient distributed data set; so we’ve covered the distributed part and we understand the data set part, which leaves resilient. So here, the resilient nature of Spark means that even if one task or one node were to go down, it will still ensure that your data set is intact.
It has that internal capability. So as I said, as much as possible, for those students who are not aware of Spark itself, right, I’m just trying to explain some concepts. Now, please note that if I have to install Spark on my local system, which you can do, you need to have Java installed, you need to have Python installed, then you need to install Spark and you need to then do a whole lot of configuration. And this all takes time. So here in Azure Synapse, you can see that you can start working with Spark by just creating a pool. And now we’ll actually see how we can work with notebook commands. Now, here I am using the parallelize method. So this is available as part of sc; sc is nothing but the Spark context.
So in your Spark application, you have the availability of the Spark context that allows you to use the capabilities of Spark itself. So now I’m parallelizing my data and then assigning it to a new variable. So let me copy this and I’ll place it here. Now I’ll run this particular cell. Now, to get the output, it will take some time because, see, the first thing it is doing is starting an Apache Spark session. So first it is going to get the executors up and running, then it will start executing these commands on those executors. Now here I want to get the count of values in the RDD. So there’s a count method that is available, and if you want to get the elements of the RDD, you can use the collect method. Now, once we get an output, right, once a Spark session is in place, what have we got as the output?
So first is basically our data object or our data variable. It’s an array of integers. That’s something that we’ve already seen in Scala. But now when it comes to our distributed data, you can see this is of the type RDD. And here we can see it’s basically based on the integer values. Now, when it comes to Spark, operations such as parallelize are all lazy operations, right? They don’t do anything immediately. So Spark remembers what needs to be done, but it will only do something once you ask it to do something very specific, such as an action. So when you perform an action on Spark, that is the time that it will actually do something. So here with parallelize, it hasn’t actually done anything as such, it just keeps a note that yes, you want to have a resilient distributed data set in place.
Now, if I call, let’s say, an action of collect, I can run this now in a new code cell; you can hit the plus over here, choose code cell, and let’s run this. So now with the collect operation, it’s doing nothing much but just giving you your array of values. And if you do a count, so let’s run this, you can also see that you are getting IntelliSense when it comes to working with notebooks, which is very useful. I’ll run this so you can see the number of values in your RDD. And all of this is basically based on Scala, right? So in this chapter, I want to start simple first because I also want to go through the basics of Spark, because this is important when you are trying to understand future commands when it comes to working with notebooks in Spark.
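To recap the cells from this lab in one place, here is a minimal Scala sketch of what was run above. The variable names and values are placeholders of my own rather than the exact contents of the course file.

```scala
// A plain Scala array of integers (values are just examples)
val data = Array(10, 20, 30, 40, 50)

// parallelize is lazy: it only records that an RDD should be built from this data
val distData = sc.parallelize(data)   // org.apache.spark.rdd.RDD[Int]

// Actions are what trigger the actual work on the executors
distData.collect()                    // Array(10, 20, 30, 40, 50)
distData.count()                      // 5
```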
- Lab – Spark Pool – Spark DataFrames
Now, in this chapter, I’ll show you how to create a very simple data frame. So, going forward, we’ll actually be working with data frames. So we’ve already looked at RDDs, the resilient distributed data set. So this is the base abstraction when it comes to your data set in Spark. Remember, Spark is the underlying tool, the underlying engine that is used for processing your data. And your data is in the form of an RDD. It can also be in the form of a data set or something known as a data frame. So the Spark data set is a strongly typed collection of domain-specific objects. Here, the data can be transformed in parallel.
Normally, you will perform either transformations or actions on the data set. A transformation will normally produce a new data set. An action will actually trigger a computation and produce the required results. Even when it comes to RDDs, when you are actually learning Spark on its own, you will learn about something known as map transformations on your data. Now, the benefit of having a data set is you can use powerful transformations on the underlying data, and then you have the Spark data frame. Now, the data frame is nothing but a data set that is organized into named columns. It is like having a table in a relational database. So that’s something that we are very familiar with.
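To make that distinction a little more concrete, here is a small Scala sketch; the Course case class and its values are purely illustrative and not part of the lab files.

```scala
// Illustrative only: a typed Dataset versus an untyped DataFrame
case class Course(id: Int, name: String, price: Double)

// In a Synapse Scala notebook the spark session and its implicits are available,
// which is what enables toDS and toDF
import spark.implicits._

val ds = Seq(Course(1, "DP-203", 10.0), Course(2, "Spark", 20.0)).toDS() // Dataset[Course]
val df = ds.toDF()                                                       // DataFrame, i.e. Dataset[Row]

ds.filter(c => c.price > 15.0).show()   // typed access to fields, checked at compile time
df.filter("price > 15.0").show()        // columns referenced by name only
```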
You can also construct data frames based on data in external files. When it comes to data sets, the API for working with data sets is only available in Scala and Java. But when it comes to data frames, you can use Scala, Java, Python and even R. Now, here I am showing how you can convert an existing RDD into a data frame. So let me copy this command, place it here, and let me run this. So now we have something known as a Spark SQL data frame. Now, I can also display this data frame. So here we can see it as a table of values. So this is like having a normal table in a database; we are more familiar with this. We can also create a list of strings.
We can parallelize this. We can also convert this into a data frame. Using the show method is another way of seeing the details of a data frame. So let me take this, I’ll go on to a new cell, let me take the last command and I’ll run this. So here we can see again our data frame. This time we have all strings, and here we have the output, right? So in this chapter, I just wanted to briefly go through the concept of the data frame.
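Putting the cells from this lab together, a sketch in Scala would look roughly like this; again, the values are placeholders rather than the exact course data.

```scala
import spark.implicits._

// Convert an existing RDD of integers into a data frame
val rdd = sc.parallelize(Array(10, 20, 30, 40, 50))
val df  = rdd.toDF()        // a single column, named "value" by default
display(df)                 // Synapse renders the data frame as a table

// A list of strings, parallelized and converted into a data frame
val strings  = Seq("Azure", "Synapse", "Spark")
val stringDF = sc.parallelize(strings).toDF()
stringDF.show()             // the show method prints a plain-text table
```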
- Lab – Spark Pool – Sorting data
Now in this chapter I want to show you how we can take a set of objects and then create a data frame. So I’ll just copy this here. What I’ll do is that let me delete the cell, and I’ll delete the other cell as well. Here, let me change this. And now you can see that we are getting some squiggly lines over here, and that’s because these commands are Python-based. So we need to change the language from Scala to Python. As simple as that. If you had this on your local system, you could switch Jupyter notebooks between Python and Scala. But as I said before, there’s a lot of configuration that you have to do on your local machine. So what are we doing here? I again have kind of a courses object or a courses variable.
Here I have a list of different objects. So here I am saying that the first object has something known as the course ID, then the course name, and then the price. So here I am now telling Spark, please create a data frame directly from my data. Here I’m giving names to the individual columns that are going to represent my data frame. And then I’m showing my data frame. So let’s run this. So here you can see we have a data frame in place. You can also use the display method. So you can see in the table-like format we have our ID, the name and the price.
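The lab cell itself is in Python here, but the same idea expressed in Scala, to stay consistent with the earlier sketches, would look something like this; the course IDs, names and prices are placeholders.

```scala
import spark.implicits._

// Placeholder course data: (id, name, price)
val courses = Seq(
  (1, "DP-203", 10.0),
  (2, "Spark",  20.0),
  (3, "Scala",  15.0)
)

// Create a data frame directly from the data and name the individual columns
val courseDF = courses.toDF("id", "name", "price")

courseDF.show()       // plain-text table output
display(courseDF)     // table view in the Synapse notebook
```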
Now here I want to show you how you can start using the different methods and functions that are actually available when it comes to Spark. So here, now I want to sort my data based on the price. Let me just copy both of these into another cell. I’ll just copy this and, instead of show, let me just put display. So here I am now using the sort function. Here you can also see, right, the IntelliSense is showing what is required by the sort function. So you need to sort by a particular column. Now, to access a column, there are different ways. So here, col is used to represent a column that will be constructed based on the input column name. And there are different ways to access a column.
So you can use col, or you can use the dollar sign along with the column name. So there are different ways to refer to a column. So I’m saying please use the price column in the sort function and please perform it in descending order. Now, if I run this as is, I’m getting an error. It’s saying that col is not defined. So when you have different classes and you want to ensure that they are available at runtime, you need to ensure that those classes are imported in your code. Then it will actually load those classes at runtime so that your code actually works. Normally this is the case with most programming languages. So if you want to use the functionality of a particular class, and here we want to use the functionality of col.
So for that we need to ensure that it is imported. Normally it’s not good practice to import all of the classes at the same time; that will just make your program very heavy. Import only the classes that are required. So if you only want to have certain classes in place, you can use an import statement. So I’ll place it here and now let me run the cell. And now we’re getting the output as desired, and you can see it is sorted by the price. You can also click on the column itself and it will sort by the price; this table itself has quite a lot of functionality. But normally, when it comes to starting to work with your data, you are going to start using these different classes and methods that are available.
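In the PySpark cell the fix is the line from pyspark.sql.functions import col. For consistency with the other sketches, the Scala equivalent of the sort cell would be roughly the following, reusing the placeholder courseDF from the previous sketch.

```scala
import org.apache.spark.sql.functions.col   // makes col available at runtime
import spark.implicits._                     // enables the $"price" shorthand

display(courseDF.sort(col("price").desc))    // sort by the price column, descending
display(courseDF.sort($"price".desc))        // the same sort using the $ syntax
```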
- Lab – Spark Pool – Load data
Now in this chapter, we are going to see how to load data. So we are going to take our log CSV file, which is present in our container in our Azure Data Lake Gen2 storage account. So if you go on to our containers, if you go on to the data container and then on to the raw folder, here we have our log CSV file. So now we want to see this data in our pool. Maybe we want to perform some sort of action on it, some sort of transformation on it; but first we want to see the data. So here I have the required commands to go ahead and read the data from this particular file. Let me copy this. I’ll go on to our cell. So let me delete the second cell, and in the first cell, let me place everything.
Now let’s see what we have over here. So we have some variables which define the account name, the container name, and a relative path, which is basically raw. And here it is constructing a path with the container name, the account name and the relative path. Here the dollar sign will actually substitute the variable values based on their names: the container name will be replaced here, the account name will be replaced here, and the relative path will be replaced here. So we are going to have the protocol, then the container name, that’s data, then the at sign, the storage account name with the dfs.core.windows.net endpoint, then raw, and at the end a slash and the log CSV file name. Now here I am showing a way in which you can set the Spark configuration to authorize yourself to use the Azure Data Lake Gen2 storage account. Here we are making use of the access key. Here we are saying the authentication type is the shared key. Here again, the account name is going to be replaced, and the account name is going to be replaced here as well. And here is where we have the shared key. So where do we get the shared key from? If you go on to your Data Lake storage account and go on to Access keys, you can show the keys and you can take either key one or key two. So I’ve copied key one and I have just replaced it here.
So here we are setting the Spark configuration to authorize this code, this notebook, to read from the Azure Data Lake Gen2 storage account. Later on you will see how we don’t even need to set the account key, and that’s because of the integration of Azure Synapse with your Azure Data Lake Gen2 storage accounts. You’ll see that a little bit later on. Now here I am creating a data frame. I’m using the read method with options, which is available as part of Spark, to read our log CSV file. So our path is being set here, and there are many options that are available when reading that log CSV file. Here we are giving the option that the first row is the header row that contains the column names, and that the delimiter is a comma, and then we display our data frame.
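Put together, the cell looks roughly like this in Scala. The account name, access key and file name below are placeholders, so treat this as a sketch rather than the exact course file.

```scala
// Placeholders: substitute your own storage account name, key and file name
val accountName   = "<storage-account-name>"
val containerName = "data"
val relativePath  = "raw"
val path = s"abfss://$containerName@$accountName.dfs.core.windows.net/$relativePath/Log.csv"

// Authorize access to the Data Lake Gen2 account using the shared (access) key
spark.conf.set(s"fs.azure.account.auth.type.$accountName.dfs.core.windows.net", "SharedKey")
spark.conf.set(s"fs.azure.account.key.$accountName.dfs.core.windows.net", "<access-key>")

// First row is the header with the column names, values are comma delimited
val df = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .csv(path)

display(df)
```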
So let me run this. And here you can see all of the data in your log CSV file. Now, in other programming languages, if you had to read a simple CSV file that’s based on a particular format, you’d have to write many more lines of code just to achieve what we have done over here. Python and even Scala make programming much easier, and especially when it comes to Python, Scala and then Spark, all of these combined are actually geared towards working with data. They know what the different data formats available in the industry are, and they try to ensure that all of the underlying implementation is built around that.
And all of the common things that you do from a data analyst perspective are also implemented as functionality. Even if it’s not built into the core library of that language, such as Python or Scala, there will be other libraries that can be imported that are built around actually working with data sets. Because, see, here you have seen how easy it was just to read a log CSV file, a comma-separated file. Here we have used the csv option; if you use the json option, it can read a JSON-based file. So there is a lot of functionality that is available in Spark. Now, another note before I stop this particular chapter. Here you can see Spark has two executors and eight cores. So I mentioned that your code, right, is now running on these two executors as tasks.
If you go on to the settings for configuring the session, here you can see your application ID. So it’s being run as an application. Here you can see it is attached to the Spark pool. Here you can see you have three nodes, but the number of executors is only two. And note that for this particular notebook, this particular session, you have allocated both of the executors. That means if you try to open up another notebook, another session, you will not be able to run that notebook, because all of the executors are being dedicated to this particular session. If you want, you can reduce the number of executors to one; that’s something that you can do. And here you have the session timeout, right. So there are some other aspects to consider.
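As an aside, besides the session settings UI, Synapse notebooks also support the %%configure magic for session-level settings such as the number of executors. A hedged sketch is below; it has to run before the Spark session starts, and the numbers should be adjusted to your own pool size.

```
%%configure
{
    "numExecutors": 1,
    "executorCores": 4,
    "executorMemory": "28g"
}
```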