DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 2
4. Lab – Creating a cluster
Now, in the last chapter we went ahead and created our workspace. In this chapter we are going to create a cluster. The cluster will have the underlying machines that are going to have Spark installed. As part of the Databricks implementation, Spark runs on the cluster and is used for processing your data. This is a completely managed service, wherein the underlying machines are managed for you. Now, when it comes to the exam, you can get questions based on Spark itself, for example on understanding DataFrames, and when it comes to Azure Databricks you can also expect a set of questions on clusters. That's why I place some emphasis on how clusters work and on the different types of clusters; it is important from an exam perspective.
Next, for those students who want to understand: if I want to run Spark on Azure for my jobs, should I use Azure Databricks, or should I use the Spark offering that is available in Azure Synapse? We have seen that if I go on to my Synapse workspace, on to my Apache Spark pools, we can create a Spark pool, which in the end runs Spark in the background for your data processing workloads. So there are some differences when it comes to the choice between Azure Databricks and the Apache Spark pool. The Apache Spark pool is Microsoft's own implementation, which ties Spark into the entire Azure Synapse ecosystem, whereas with Azure Databricks, the Databricks team is also responsible for the underlying Spark engine.
They are developing their entire ecosystem directed towards data engineering, data science and machine learning, everything in one complete package when it comes to working with Databricks. So there are some differences between the two platforms. One difference you will notice straight away is the Databricks runtime version: they have a specific runtime version, and along with it you can see a version of Scala and a version of Spark. Here the version of Spark is 3.1.1. If I go on to the Apache Spark pool and open the additional settings, you can see that, in terms of the version of Apache Spark, we were using version 2.4. That's a previous version, and version 3.0 is only in preview.
So when it comes to the version of Spark, you can be assured that Azure Databricks will always have the most recent version. That's because, as I said, the company that builds Databricks and makes it available on Azure is also responsible for the updates to the Spark engine. So you will always get the most recent version of Spark in Azure Databricks. Also, just a note: with the Apache Spark pool you can also run .NET notebooks. They have made this additional language available in Azure Synapse; it is not possible with Azure Databricks notebooks. So if you have .NET-based code that needs to run in notebooks, it can work in the Apache Spark pool. So here, let me give a cluster name.
Now, next is the cluster mode. Again, very important from an exam perspective. First we have the standard cluster mode. If you have single users working on the cluster, for example developing notebooks, they can use the standard cluster mode. When you choose the standard cluster mode, some other options are also made available to you. First is the ability to terminate the cluster after 120 minutes of inactivity. This is very important from a cost perspective. Remember that you are charged based on the compute of the underlying machines; these are the nodes that will be running your data processing workloads. So if the entire cluster is idle and you want to save on compute costs, with this option the cluster will automatically terminate once it has been idle for 120 minutes.
So this is a very useful option. Next, we also have the capability of auto scaling. In a standard cluster we have the driver node and the worker nodes. Remember, in the Spark cluster architecture you always have the driver node that takes the requests, your Spark applications and jobs, and then distributes the work to the executors running on the worker nodes. These worker nodes are the ones that actually run your data processing activities.
Here in Azure Databricks you have the option of auto scaling. Say you initially specify the minimum number of workers as two, so there will be two workers in place. Now suppose the Spark application you submit pushes those workers to their maximum capacity. Let's say you have a very large dataset, based on the RDD, the resilient distributed dataset. You are performing some sort of transformation on the RDD, and the RDD has been split across the worker nodes, across the executors. But because of the sheer size of the dataset and the transformation required, these two nodes are not enough to carry out that transformation.
With the help of auto scaling, Azure Databricks can spin up a new worker in the cluster, so that the load can be distributed across that additional worker. This is what the auto scaling feature in Azure Databricks gives you. As I said, I'm putting focus on the cluster concept because it is important from an exam perspective. Next, you can decide the worker type, that is, the instance size.
This determines the number of virtual cores and the amount of memory assigned to each machine that is spun up as a worker, and the same applies to the driver node. In terms of size, the minimum here is four cores; there is nothing smaller, because the expectation is that you will have a lot of data to process on this cluster. Now, next I want to talk about the pricing. Remember, when creating the workspace we had chosen the Trial (Premium) option, which gives 14 days of free DBUs, known as Databricks Units. If I go on to the pricing page for Azure Databricks, you can see that you are charged based on both the virtual machines that are provisioned in the cluster and the Databricks Units.
So there are two things you are charged for. Here you can see the amount charged per Databricks Unit per hour. Being on the free trial does not mean that everything is free once you start running your workloads in Azure Databricks; people sometimes get confused by this, so it is always important to check the pricing page. The DBU charge will be free, but you will still be charged for the underlying virtual machines. Remember, we are choosing a worker type, and this determines the size of the underlying virtual machines that will be running your notebooks and jobs; you are charged based on that. If you scroll down, you can see how much you are charged for the different sizes, including the pay-as-you-go total price for a particular instance. So there is a cost for both the Databricks Unit and the underlying machine. But because you have the terminate condition, and you can also terminate your cluster at any point in time, you can save on cost just by ensuring that the cluster is not left running when you don't need it (a rough cost calculation is sketched below). So, quite a lot to take in.
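To make the billing model concrete, here is a minimal sketch of how the two charges add up. The rates and DBU consumption figure below are hypothetical placeholders, not actual Azure prices; the real numbers come from the Azure Databricks pricing page.

```scala
// Hypothetical rates for illustration only - check the Azure Databricks pricing page for real values.
val vmRatePerHour  = 0.40  // assumed pay-as-you-go price of one node's VM (USD per hour)
val dbuRate        = 0.55  // assumed price per Databricks Unit (USD per DBU-hour)
val dbuPerNodeHour = 0.75  // assumed DBUs consumed by this instance size per hour

val nodes = 3              // for example, 1 driver + 2 workers
val hours = 2.0            // how long the cluster stays running

// Total cost = VM charge + DBU charge, summed over all nodes and hours
val totalCost = nodes * hours * (vmRatePerHour + dbuPerNodeHour * dbuRate)
println(f"Estimated cost: $$${totalCost}%.2f")
```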
Now, next we have something known as a high concurrency cluster. The high concurrency cluster allows many users to work on the same cluster. So if you have many users who need to run their workloads, their notebooks and their jobs, you can use the high concurrency cluster mode. You can see that, by default, the terminate-after option is not enabled here. Now, in our case there is another cluster mode, known as single node, and that is what I'm going to choose, because it will help us save on cost.
In this case we only have one node, and that one node behaves as both our driver node and, in the background, as our executor node as well. All of our workloads will run on that single node. As I said, we are choosing this cluster mode just to save on cost.
Obviously, in production environments you have to look at using either the standard mode or the high concurrency mode. With that in place, let me now go ahead and hit create on the cluster. Here you can see the number of workers and the number of drivers. The creation of the cluster will take a couple of minutes, so let's mark an end to this chapter and go on to the next chapter in this course.
5. Lab – Simple notebook
Now, once we have the cluster in place, I can see the green symbol here, which means it is in the running state. Please note where we are at this point in time: we are in the Compute section of our workspace. We have our cluster here, and at any point in time, if you want to terminate the cluster, you can do it here. You can also delete the cluster altogether. And then we have some other options for the cluster over here.
If you go on to any other section, say the Jobs section, you can come back to the Compute section, where we have the cluster in the running state. You can just click on the cluster and see all of its options. Now, next, let's create a notebook. I'll choose the default language as Scala. Here you can see that the available languages are Python, Scala, SQL and R. Remember, in the Spark pool in Azure Synapse you also have .NET as an option. So I'll choose Scala, and here I'm attaching my notebook to our cluster.
So whatever commands I have in the notebook are going to run on this particular cluster. I'll hit Create. Let me hide the menu options; I'll choose auto. So again, as we have seen with the Spark pool in Azure Synapse, we have a notebook with different cells, and we can run those cells. Those cells contain commands that are sent to the Spark engine running on our cluster. As always, let's start simple with our common example of creating an array of numbers using the parallelize method and getting the count. We can add this to the notebook and run the cell, which runs a Spark job, and here I can see the count. Finally, if you want to do a collect, we can create another cell, run that cell and get all of the values back (a sketch of these cells is shown below). So we have quickly created a notebook. Let's move on to the next chapter, where we'll work with DataFrames.
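For reference, a minimal sketch of the two notebook cells from this chapter, in Scala; the actual numbers are just sample values.

```scala
// Cell 1: create an RDD from a local collection using the parallelize method
val numbers = sc.parallelize(1 to 5)

// count() triggers a Spark job on the cluster and returns the number of elements
numbers.count()

// Cell 2: collect() brings all of the values back to the driver
numbers.collect()
```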
6. Lab – Using DataFrames
So in the last chapter we created our notebook based on Scala. Let me clear that cell, since I don't require it, and go on to our next program, where we are working with DataFrames. Here I am creating a sequence in Scala. I want to create a sequence that has information about courses: this is the price, this is the course name and this is just an ID for the course. I'm then creating an RDD, and from the RDD I am creating a DataFrame, and then I'm using the display method to display the DataFrame. So I can copy this, place it here and run the cell.
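A minimal sketch of what that cell can look like; the column names, column order and sample course values here are assumptions for illustration.

```scala
// Usually pre-imported in Databricks notebooks; needed for toDF on an RDD of tuples
import spark.implicits._

// A simple sequence of course records: (course ID, course name, course price)
val courses = Seq((1, "DP-203", 10.99), (2, "AZ-900", 8.99), (3, "AZ-104", 12.99))

// Build an RDD from the sequence, then turn it into a DataFrame with named columns
val coursesRdd = sc.parallelize(courses)
val coursesDf  = coursesRdd.toDF("CourseID", "CourseName", "CoursePrice")

// display is the Databricks notebook helper that renders the DataFrame as a table
display(coursesDf)
```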
So I have my DataFrame in place. Now, here I am showing another set of statements wherein we can specify a particular schema for our data. Here I am using StructType to define the different StructFields; these are going to be the fields of my dataset. And if you want to create a DataFrame directly from a sequence, you have to ensure that you use the Row type when creating the sequence. So here I am creating the DataFrame and also mentioning what the schema will be. I'll take this, copy it here and run the cell, and here you can see the course ID, the course name and the course price.
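A sketch of the schema-based version, reusing the assumed field names and sample rows from the previous example.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, DoubleType}

// Define an explicit schema using StructType and StructField
val courseSchema = StructType(Seq(
  StructField("CourseID", IntegerType, nullable = false),
  StructField("CourseName", StringType, nullable = false),
  StructField("CoursePrice", DoubleType, nullable = false)
))

// When building a DataFrame directly from a sequence with a schema, wrap each record in Row
val courseRows = Seq(
  Row(1, "DP-203", 10.99),
  Row(2, "AZ-900", 8.99),
  Row(3, "AZ-104", 12.99)
)

// createDataFrame takes an RDD of Row plus the schema
val coursesDf = spark.createDataFrame(sc.parallelize(courseRows), courseSchema)
display(coursesDf)
```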
Now I’m using the sort method and here I am saying please sort it based on the column of the course price in descending order. So I can take this, create a new cell, I’ll run the cell getting the output has desired, so the course price in descending order. Next, I can also use filtering based on the where condition. So, here I am now using the where method and saying please only give me where the course name is equal to DP 203. And then I’m displaying my new data frame so I can create a new cell, I can run the cell so I can see that also in place and finally I can also use aggregation. So I’m saying please give me the average when it comes to the coast price.
And finally, I can also use aggregation. Here I'm asking for the average of the course price, and then displaying it. I'll create a new cell, run it, and I get the average course price (a sketch of this cell follows at the end of this chapter). So again, we are just going through some commands in Spark. Obviously, if you want to learn everything about Spark, you should take a dedicated Spark-based course. Here I'm just giving some familiarization with some of the Spark commands, especially around DataFrames, because you can get a couple of questions here and there on working with DataFrames in the exam.
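The aggregation cell referenced above, again assuming the coursesDf DataFrame from the earlier sketches.

```scala
import org.apache.spark.sql.functions.avg

// Aggregate: compute the average course price across all rows
val avgPriceDf = coursesDf.agg(avg("CoursePrice"))
display(avgPriceDf)
```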