DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Scala, Notebooks and Spark part 2
- Scala – List collection
Now in this chapter I just want to briefly go through the list collection. There are many collections you can define in Scala, and one of them is the list. So here, when I’m defining a variable, instead of defining a simple integer, string or boolean value, I’m defining it as a list of integers. Normally when you’re dealing with data, you’re not dealing with just one piece of data, you might be dealing with a list of values, so you would typically use collections in your programs. So here I am defining a list of numbers. Now in Scala, once you have a collection in place, there are methods and properties that you can invoke on that list. Here I am using the head method to get the first value in the list.
And I’m using another method that is available, foreach, to print all of the numbers in my list. So if I copy this, place it here and execute it, you can see the head of the list is indeed ten, the first value in the list, and here we can print all of the values in that list. I can also define a list of strings. So here I have different strings, and I am asking what is the position of user B in that list. I can use the indexOf method to find the position of user B, and then the value of the user at index position number two. And then again, I’m using foreach to print all the users in my list. I’ll take this and let me run it. So here I can see the position of user B is number one.
That means the index starts from zero. So user A is at position number zero and user B is at position number one. And what is the user at index number two? It’s user C, right? So this ends the chapters where we are looking at Scala. As I said, these are just some very short videos to get started on this programming language. As with any programming language, there is a lot to learn. But one thing about Scala is that you can have short programs to fulfill what you want to achieve. In some programming languages you have to write a lot of code just to achieve something, but in Scala you’ll see that you don’t need that many lines of code to fulfill a requirement.
So as I said, we’ll be looking at developing notebooks in Scala when it comes to working with Azure Databricks, and even in some of the notebooks when it comes to working with Spark pools in Azure Synapse. At that time we’ll be working purely from a data perspective. For the notebooks, this was just to get you up and running, just for you to get a feel of working with Scala, right? So that you are not lost when we start working with notebooks. Even though we are not going to be working with the if construct or the while construct there, since we are looking purely from a data perspective, you now know that these constructs are available if you want to extend your notebook programs into something else. As I said, this was just to get you off the ground when it comes to working with Scala.
- Starting with Python
Hi and welcome back. Now, in this chapter, let’s have a short introduction to Python. We’ll see how to install Python and start using the same sort of constructs that we had seen when it came to Scala, and how we can use them in Python. We’ll be seeing the use of Python in some of the notebooks when it comes to the Spark pool in Azure Synapse. So Python is an interpreted, object-oriented, high-level programming language. It was first released in the year 1991 and has been here for quite some time. Python is normally used when it comes to data science and also when it comes to machine learning, so there are a lot of use cases for Python. Now, the core purpose of the language was to address certain aspects, which are captured in what is known as the Zen of Python. Some of them include: beautiful is better than ugly.
These are some of the principles the language tries to address: explicit is better than implicit, simple is better than complex, complex is better than complicated, and readability counts. These are just some of the aspects that Python was trying to address. And again, as with Scala, you will see that when you want to achieve something via this programming language, you don’t need to write so many lines of code. You don’t need bulky code just to execute something; Python just makes everything much simpler. Now to install Python, I’m going on to the official downloads page on Python.org. Here I’ll download Python version 3.9.6, hit Keep, and start the exe. I’ll choose to add Python 3.9 to the PATH and then I’ll hit Install.
Now, once Python is installed, you can go ahead and click on the option to disable the path length limit. Once Python is in place, here I am in Visual Studio Code. We can actually run our Python programs from Visual Studio Code itself. So if I open up our first Python file, here I can see the Python file in place. Now, if you want to run this Python-based file, you can go on to the Extensions view here, and in the marketplace let’s search for Python and hit Install. Once we have this in place, if you go back onto Explorer and open up the Python file, you can see the Run button now in place so that we can run the Python file in Visual Studio Code itself. So let’s mark an end to this chapter, and in the next chapter, let’s look at a very simple program to start working with Python.
- Python – A simple program
Now, with any programming language, the first thing that you should do is to understand the data types that are available for that programming language. Because in the world of data, this is also important: the type of data, whether it is in a program or in a data store, is always important. So here in Python, I have a simple variable to which I am assigning a value of hello, world. This denotes a simple string data type.
Then here we have a simple integer data type to represent a number, and then we have a float data type to indicate that we have a decimal-based value. Next I have a boolean data type, and then I am just printing all of the values over here. So if I run this simple Python-based program, it will run here in the console. Here we can see Hello World, we can see the number values, and the value of True. So here I am just trying to first illustrate that in Python we do have different data types in place.
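To make this concrete, here is a minimal sketch of such a program; the variable names and exact values are illustrative, not necessarily the ones used in the video.

```python
# A simple Python program showing the basic data types
# (variable names and values are illustrative).
message = "Hello, World"   # str
count = 1                  # int
price = 20.6               # float
is_valid = True            # bool

print(message)
print(count)
print(price)
print(is_valid)
```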
- Python – If construct
Now here we are just looking at a control flow statement, the if construct. We have seen this also in Scala. Here again, for a particular variable x, I am assigning a value of ten. Now, if the value of x is greater than or equal to ten, we print that it is greater than or equal to ten, or else we print that it is less than ten. Let’s run this, and here you can see the output of greater or equal to ten. So in this chapter I just wanted to go through the if control statement. Here you can choose which statement to run based on a particular condition.
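A minimal sketch of this if construct might look as follows; the printed messages are illustrative.

```python
# The if construct: choose which statement to run based on a condition.
x = 10

if x >= 10:
    print("greater or equal to ten")
else:
    print("less than ten")
```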
- Python – while construct
Now here I just wanted to go through the while construct which is available in Python. So here I am assigning a value of ten to a variable x, and I am saying while the value of x is greater than five, print the value of x and at the same time ensure to decrement the value of x. We need to decrement the value of x, otherwise this will result in an infinite loop. Let me run this program, and here you can see the values of ten, nine, eight, seven and six. Right, so in this chapter I just wanted to go through the while construct.
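Here is a minimal sketch of that loop, assuming the same starting value and condition.

```python
# The while construct: loop while the condition holds.
x = 10

while x > 5:
    print(x)   # prints 10, 9, 8, 7, 6
    x -= 1     # decrement x, otherwise this would be an infinite loop
```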
- Python – List collection
Now, next I just want to briefly explore the list collection that you have in Python. So here, in the variable x, I am defining a list of integers. I am then printing the list onto the terminal. Then I am printing the value in the list at index position number one. Then I am adding a value onto the list with the append method. So there are different methods again available for the list collection, and we can use the append method to append a number onto our list. And then I’m again printing the list of values. I can also define a list of courses.
And then I’m using the for construct to iterate through all of the string values that I have in my courses list. So here the first value, AZ-400, will be placed in the loop variable, then DP-203, and so on and so forth. If I run this, here I can see my list, and I can see what is the value at position one. That means again, in the list, the index starts from position zero. Here you can see I have appended the integer value of 20 onto the list, and I am displaying all of my course strings. Here you can see I’m displaying each string separately from my courses list.
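A minimal sketch of these list operations might look like this; the exact numbers and course names are illustrative.

```python
# List collection: indexing, appending and iterating
# (values and course names are illustrative).
x = [10, 20, 30, 40]

print(x)        # the whole list
print(x[1])     # value at index position 1 (indexing starts at 0)

x.append(20)    # append a value onto the end of the list
print(x)

courses = ["AZ-400", "DP-203", "AZ-104"]

# the for construct: iterate through all of the strings in the list
for course in courses:
    print(course)
```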
- Python – Functions
And finally I just want to go through adding functions. As I mentioned before, even when it came to Scala, if you have reusable code, then you can define it as part of functions. So here I am defining an add function, using the def keyword to define the function. Here I am taking the parameters x and y. You’ve seen that when we have been defining variables in Python, we are not explicitly defining what the data type for these variables is. This is all based on the dynamic typing that is available in the Python programming language. And here I’m using the return keyword to return the sum of these two numbers. So here I can say the sum is, and I can use str to convert this to a string, and then I’m invoking the function and sending in the two values of two and three.
Let me run this, and I can see the sum is five. But we can also extend this. See, here we are not explicitly defining a data type, right? We are only saying add the two variables. So I could also change this to, let’s say, hello and world. So now I’m passing in strings. Let me run this, and here you can see the sum is hello world. So because it is based on dynamic typing, right, and we have not explicitly defined the types of the variables, our function is not only reusable from a code perspective, it’s also reusable when it comes to the underlying types. As long as the types can actually be added or concatenated, you will get a result.
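A minimal sketch of this function, assuming the behavior described above (the exact argument values and strings are illustrative).

```python
# A reusable add function; thanks to dynamic typing, x and y can be
# numbers or strings, as long as they can be added or concatenated.
def add(x, y):
    return x + y

print("The sum is " + str(add(2, 3)))               # The sum is 5
print("The sum is " + str(add("hello ", "world")))  # The sum is hello world
```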
- Quick look at Jupyter Notebook
Now, in this chapter, we’ll just have a quick look at a Jupyter notebook. A Jupyter notebook is a web application that allows one to create and share documents. These documents actually contain live code, and they can also contain visualizations. Now, the normal use cases for notebooks are data cleaning activities, data transformation, modeling your data, and machine learning. To get started, you can install Jupyter on your local machine, or you could also run Jupyter in the web browser. If you do run it on your local system, it will actually use something known as a notebook kernel. The notebook kernel is a computational engine that actually executes the code in the document.
By default, it is the IPython kernel that runs in the notebook. You can also run notebooks that contain Scala-based commands against a Spark distribution on your local machine, but again, that needs quite a bit of configuration. So we’ll just look at Jupyter notebooks quickly, because we’ll be looking at commands in notebooks in detail when it comes to the Spark pool in Azure Synapse and when it comes to Azure Databricks. So here I’m on the home page for Jupyter. If I scroll down and look at the Jupyter notebook, I can choose the link to try it out in the browser. So here I can try the classic notebook, and here we can see the notebook in place.
Here we can see the amount of memory that is being allocated, and it’s currently running Python 3. So here everything runs in something known as cells. Let me go on to File and create a new Python 3 notebook. And here the first thing that we get is a cell, and in this cell we can now run Python-based commands. So if you look at our previous examples on Python, if we copy something, paste it here and click on Run, we can see the outputs of the commands. Now, we are going to be looking at how to run these commands in notebooks in a Spark pool and in Azure Databricks. So the entire purpose of notebooks is that you can actually run your code here itself.
It’s like having an integrated development environment all to yourself that runs in the web browser. But most importantly, when we look at subsequent chapters, we will see how we can run not only Python-based code, but Scala-based code, and we can also run SQL commands in a notebook. Those notebook commands can actually target a Spark cluster, so we can use the capabilities of the Spark cluster to run our commands. The entire purpose of this chapter was just to give a quick view of Jupyter notebooks. This is something that you might have heard of. Here you can add your code, and you can also collaborate on the code with other people as well.
- Lab – Azure Synapse – Creating a Spark pool
So now in this chapter, we are going to create an Apache Spark pool that is available in Azure Synapse. Apache Spark is a parallel processing framework that can be used in big data analytics applications. Azure Synapse brings Spark into the picture for your big data processing needs, and you can use Spark to process the data that is stored in Azure. That is what we’re going to see in the next set of chapters: how do we make use of Apache Spark in Azure Synapse? The first step is to create something known as a Spark pool. So, similar to our dedicated SQL pool, we can also create an Apache Spark pool. So, let’s go ahead. Here I am in my Azure Synapse workspace.
Here I can go onto Apache Spark pools and create a new Apache Spark pool. I need to give a pool name, and you can decide on the node size. The smallest node size is four cores and 32 gigs of memory. Here you can enable the autoscale setting. For the moment I’ll disable this and I’ll choose the minimum number of nodes, which is three. Here you can see the estimated cost per hour is zero USD. Now, if I go on to the pricing for Azure Synapse and go on to big data analytics with Apache Spark, here you can see that when it comes to memory optimized, the price is 0.143 USD per virtual core hour, since we are in the North Europe location. So I can choose North Europe.
So here you can see the price per virtual core per hour. Now, if I go back on to creating the Apache Spark pool, as per the documentation, this is also a serverless Spark pool. So here you are not actually charged based on the creation of the pool; you are charged when the underlying nodes start up to process your Spark jobs as part of the pool. That’s the time that you get charged. Now, when it comes to Spark, the main purpose of Spark is to take your large data sets and distribute the computation across various nodes. Again, a divide and conquer approach. So when you have multiple nodes that work on, let’s say, transformation of your large data set, it can work in parallel and give you the results much faster.
That’s the entire idea of Spark: to ensure that it can process large data sets. Now, when it comes to the Spark architecture, by default you have the driver node that actually accepts the incoming jobs, or your Spark applications, and then it will distribute the work onto executors that will actually execute the tasks that are part of your Spark application. It’s somewhat similar to our SQL data warehouse, the dedicated SQL pool. There we had a control node that would take all of the queries and then distribute the queries onto the compute nodes. So it is kind of a similar architecture, where the driver node takes the Spark applications, right, and then distributes the work onto its executors.
So this is roughly how the architecture is when it comes to Spark, and the same thing is also applicable over here. Here we have three nodes: one node will be reserved as the driver node and the remaining two nodes will be our executors. And remember, it’s only once we submit the Spark application as a job onto the Spark pool that it will create the nodes, and that is when we actually get charged. So now I’ll go on to additional settings. Now, the automatic pausing feature is very important at this point in time. I’ll leave it as 30 minutes. So what’s the biggest advantage of this? Let’s say that you’ve submitted a Spark job, basically a Spark application, onto the Spark pool.
Now your job has finished and your application has given you the results, and you are not sending any further jobs onto the Spark pool. At that point in time, you are still paying for the compute cost for those three nodes, the driver node plus your executor nodes. But here you have this automatic pausing feature, where it will pause all of the compute infrastructure, your driver and executor nodes, so that you don’t bear a compute cost. So if the cluster is idle for a duration of 30 minutes, it will actually go into a paused state. Then, when it needs to resume, there will be a small time frame wherein it will first start up and then take your Spark job.
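To see why automatic pausing matters, here is a rough back-of-the-envelope estimate in Python that puts together the figures mentioned earlier. The rate and node size come from this walkthrough and will vary by region and over time, so treat this purely as an illustration.

```python
# Hypothetical cost estimate for the pool described above (illustrative only).
price_per_vcore_hour = 0.143   # USD, memory optimized, North Europe (from the walkthrough)
vcores_per_node = 4            # small node size: 4 vCores / 32 GB
nodes = 3                      # 1 driver node + 2 executor nodes

cost_per_hour = price_per_vcore_hour * vcores_per_node * nodes
print(f"Approximate cost while the nodes are running: ${cost_per_hour:.2f} per hour")
# roughly $1.72 per hour, which you stop paying once the pool pauses
```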
Here you can see the version of Apache Spark that will be available, and you can see the other elements that will also be available on the Spark cluster that is created for you. So for example, when we create notebooks, we can target our notebooks with Python code that will use this particular version, 3.6. We can also use Scala, and we can also use .NET as well. Now I’ll go on to review and create, and let’s go ahead and hit on Create. So let’s come back once the Spark pool is in place. Once we have the Spark pool in place, we can go on to the resource, and that’s it, you have this pool in place. Now in the next chapter, let’s start working with notebooks that actually target our Spark pool.