DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 8


24. Autoscaling a cluster

Hi and welcome back. Now, in this chapter, I just want to give some notes when it comes to autoscaling. So I've shown you that when you create your cluster in Azure Databricks, you have this autoscaling concept. When creating an Azure Databricks cluster, you can specify the minimum and the maximum number of workers for the cluster. Databricks will then choose the appropriate number of workers required to run the job. If a certain phase of your job requires more compute power, more workers will be assigned accordingly. Now, there are two types of autoscaling. With standard autoscaling, the cluster starts by adding eight nodes, and it scales down only when the cluster is completely idle and has been underutilized for the last ten minutes.

It scales down exponentially, starting with one node at a time. We also have something known as optimized autoscaling. This is only available with the Azure Databricks Premium plan. Here, the cluster can scale down even if it is not idle, by looking at something known as the shuffle file state. It can also scale down based on a percentage of the current nodes. On job clusters, it can scale down if the cluster is underutilized over the last 40 seconds. And on all-purpose clusters, it can scale down if the cluster is underutilized over the last 150 seconds. So in this chapter, I just wanted to go through some notes when it comes to autoscaling for Azure Databricks.
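As a quick reference, here is a rough sketch of how the minimum and maximum worker counts appear if you define a cluster through the Databricks Clusters API instead of the portal. The cluster name, Spark version and node type below are just placeholder values, not the ones used in this course.

```json
{
  "cluster_name": "app-cluster",
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 30
}
```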

25. Lab – Removing duplicate rows

Hi and welcome back. Now in this chapter, I'll show you how you can remove duplicate records in your data frames, very simply with the help of the dropDuplicates method. It's as simple as that. So firstly, I am going to upload a file that has duplicate records. And don't worry, I'll ensure that this file is available as a resource for this chapter. So firstly, I'm going to upload this onto my Azure Data Lake Gen2 storage account, in my raw directory.

So here I'll go on to Containers in my Data Lake Gen2 storage account, go on to my data container, go on to my raw folder, and here let me upload the file. So here, from my local temp directory, I'll upload the Log with duplicates file and hit on Upload. Now let me go on to my code. So firstly, I'll create a data frame based on that file. And here, using the groupBy method, I'm going to group by the ID. I want to get the count of rows for each ID, and I'm ordering the result by the ID itself. So let's take all of this, let me go on to a notebook, and here let me create a new cell.
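To give you an idea of what that code looks like, here is a minimal Scala sketch. The storage account name, container, file name and the Id column name are assumptions based on what was described, so adjust them to your own setup.

```scala
import org.apache.spark.sql.functions.col

// Read the uploaded CSV from the raw folder of the Data Lake Gen2 account.
// Assumes access to the storage account has already been configured on the cluster;
// the account, container and file names here are placeholders.
val logDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("abfss://data@datalakeaccount.dfs.core.windows.net/raw/LogWithDuplicates.csv")

// Count the number of rows per Id and order by the Id.
// Any count greater than one indicates a duplicate record.
display(logDf.groupBy("Id").count().orderBy(col("Id")))
```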

Let me run this. So I'm directly displaying the data frame. So here you can see that for the ID equal to one the count is two, for the ID equal to two the count is two, and so on. So there are actually duplicate records in place. What I've done is that for the IDs from one up to ten, I have just copied the rows again. After that, we only have unique rows in place. So it's only for the IDs from one to ten that we have duplicate records in place.

So now, if you want to remove these duplicate records, you can just easily go ahead and use the dropDuplicates method. So let me do that and then display the data frame again. So here, let me create a new cell and run the cell. And now you can see all the duplicate records have been removed; there is only one record now for each ID. You can also drop the duplicates based on a particular column value. So let's say you only want to drop duplicate rows if there is a duplicate value for the ID column. Here you can mention which column the dropDuplicates method should actually target when it comes to dropping the duplicate rows.
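A short sketch of both variants, again assuming the data frame and the Id column from the sketch above:

```scala
// Drop rows that are duplicated across every column.
val dedupedDf = logDf.dropDuplicates()
display(dedupedDf.orderBy("Id"))

// Drop rows that have a duplicate value in just the Id column,
// regardless of the values in the other columns.
val dedupedById = logDf.dropDuplicates("Id")
display(dedupedById.orderBy("Id"))
```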

26. Lab – Using the PIVOT command

Hi and welcome back. Now in this chapter I want to give a quick note when it comes to using the PIVOT clause in your SQL-based statements. So firstly, I want to create a data frame. This is actually based on a sequence of rows of values. So first of all, I am creating some data rows using a Seq collection; this is based on Scala. In terms of the data that I have, I have a date type value and then I have an integer type. So I have a date representation and the temperature recorded on that date. Now I have an entire set of values, so I have values for Jan, Feb, March, et cetera, all the way up to December.

So now, once I have these values in place, what I'm doing is I am creating a data frame using that particular schema, and then I am creating a table known as temperatures. Then I am selecting the year, using the year function, and I'm using the month function; we had seen these earlier on, applied to our date. So it'll give me the representation of the year and the representation of the month. And here I'm getting the temperature from the temperatures table. I'm using a WHERE condition to get all of the information between these dates. And now I'm using the PIVOT clause. So, using the PIVOT clause, I want to ensure that I now transpose my data onto these columns.
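To make this concrete, here is a rough Scala sketch of the same idea. The column names, dates and temperature values are placeholders rather than the exact values from the resource file for this chapter.

```scala
import java.sql.Date
import spark.implicits._

// Build a small set of (date, temperature) rows and register them as a table.
val temperatures = Seq(
  (Date.valueOf("2021-01-15"), 10),
  (Date.valueOf("2021-02-15"), 12),
  (Date.valueOf("2021-07-15"), 32),
  (Date.valueOf("2021-12-15"), 8)
).toDF("recorddate", "temperature")

temperatures.createOrReplaceTempView("temperatures")

// Transpose the month values into columns and take the average temperature,
// cast as a float, for each year.
display(spark.sql("""
  SELECT * FROM (
    SELECT year(recorddate) AS year, month(recorddate) AS month, temperature
    FROM temperatures
    WHERE recorddate BETWEEN '2021-01-01' AND '2021-12-31'
  )
  PIVOT (
    CAST(AVG(temperature) AS FLOAT)
    FOR month IN (1 AS JAN, 2 AS FEB, 3 AS MAR, 4 AS APR, 5 AS MAY, 6 AS JUN,
                  7 AS JUL, 8 AS AUG, 9 AS SEP, 10 AS OCT, 11 AS NOV, 12 AS DEC)
  )
  ORDER BY year
"""))
```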

And when I execute this statement, you'll actually understand what I am trying to achieve. So let me copy all of this. Here, I just have a notebook in place, so let me just create a new cell. So here I have my schema, I have all of my rows with the different temperatures, I'm creating a data frame, I am creating a table, and then I want to perform my SQL statement. So let me delete the SQL statement from this cell, because we need to run it in a new cell. So first let's run this cell. It has created the table, and here you can see you have the date and you have the temperature.

Now let me create a new cell. I'll take my SQL-based command and run this cell. So here what I'm getting now is my year, and using the PIVOT clause, I now have all of my months being represented as columns. So I'm transposing my rows into columns and getting the average temperature for each month. And here, based on my CAST operator, I am now taking the temperature as a float. Earlier on, if you look at my table, it was of the integer type, so now the average is being represented as a float. The entire purpose of this chapter was to just give you a use case scenario on using the PIVOT clause. Again, this is based on my experience giving the exam; I remember getting this sort of question on the exam when it came to using this particular clause in your statement.

27. Lab – Azure Databricks Table

And welcome back. Now in this chapter, I want to go through databases and tables that are available with Azure Databricks. So in Azure Databricks also, you can create databases and tables. In addition to using notebooks to look through other data sources, you can also create a database and tables within Azure Databricks. A table is nothing but a collection of structured data. You can then perform the operations that Apache Spark supports on data frames on the Azure Databricks tables as well. Now, there are two types of tables that you can create: one is the global table and the other is the local table. A global table is available across all of your clusters.

The global table is actually registered in the Azure Databricks Hive metastore or an external metastore. The local table is not accessible from other clusters and is not registered in the Hive metastore. So I'll quickly now go on to Azure Databricks and show you how to create a Databricks table. We can actually create an Azure Databricks table based on the data that we have, let's say, in our Log CSV file. So here, if I go on to the Data section, I can hit on Create Table. Now here I could choose different sources, and if I want to upload a file, let's say to create a table based on our Log CSV file, I can actually upload my file from here itself. So I click on Upload, here I have my Log CSV file, and I'll hit on Open.

Now here I'll create the table in a notebook. So here we have a cell which just has some Markdown in place, and in the next cells we have what is required to create the table. Now here I am going to make some changes. Firstly, I am going to set infer schema to true, and I am going to set the option to ensure that the first row is taken as the header, because that is how it is going to create our data frame. I'll scroll down. Here it is first going to create a view; let me just remove the underscore to give a proper name for the view. I'll scroll down. It will do a select star from that particular view, using the magic command to ensure that it uses SQL-based statements. I'll scroll down.
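The generated notebook may differ slightly depending on your workspace, but a Scala sketch of the equivalent steps would look roughly like this. The file name under /FileStore/tables/ and the view and table names are placeholders.

```scala
// Path where the UI upload typically lands the file (placeholder file name).
val fileLocation = "/FileStore/tables/Log.csv"

// The two options below are the changes mentioned above: infer the schema
// and treat the first row as the header.
val logDf = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("sep", ",")
  .load(fileLocation)

// Register a temporary view so the data can be queried with SQL
// (for example, SELECT * FROM logdata_view in a %sql cell).
logDf.createOrReplaceTempView("logdata_view")

// Uncommenting the equivalent of this line in the generated notebook
// saves the data frame as a permanent, Parquet-backed table.
// logDf.write.format("parquet").saveAsTable("logdata")
```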

If you want, you can then create a permanent table, an Azure Databricks table; you can remove the hash to ensure that it uses the data frame write method to write it, or save it, as a table. Here you can see the table will be created as a Parquet-based table. So now I can actually run all of the cells. Let me attach this notebook onto one of my clusters, and then let me run all the cells in the notebook. So now it's running this cell and displaying the information. It has also run the cell where we've done a select star. Now here it's giving us an error, saying that the Correlation ID column name contains invalid characters.

So here, because we are now using Parquet-based tables, the column names should not contain any sort of spaces. If I open up the Log CSV file that I have on my local system, you can see that I have spaces in the column names. So this can actually give you an error. Here, I have purposely ensured that there is a space in the column name just to show you this particular error. Otherwise, it's very straightforward to go ahead and create the table. And once you have the table in Azure Databricks, you can actually use this table to work with your other data sources as well. So we have multiple options on how you can reference your data.
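If you do run into this error with your own file, one possible fix, sketched below, is to strip the spaces out of the column names before saving the data frame as a table. The logDf data frame and the table name are the placeholders from the sketch above.

```scala
// Remove spaces from every column name, then save as a Parquet-backed table.
val cleanedDf = logDf.columns.foldLeft(logDf) { (df, colName) =>
  df.withColumnRenamed(colName, colName.replace(" ", ""))
}

cleanedDf.write.format("parquet").saveAsTable("logdata")
```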

28. Lab – Azure Data Factory – Running a notebook

Now in this chapter, I'll show you how you can run a notebook which you have in Azure Databricks from Azure Data Factory. This is an option that is also available. So here I have one of my sample notebooks which we had seen earlier on. This was used to take the data from our Log CSV file and copy it onto our log data table. So firstly, in SQL Server Management Studio, let me see if I have anything in my log data table. So I do have some information in place. Let me delete from the log data table, so now we don't have any rows in place. Now this particular notebook is actually in my workspace. If I go on to my workspace, it is in the Shared folder. I'd also shown you how you can actually move any one of your notebooks: if you click on Move, you can choose Shared, hit on Select, and it will move that notebook onto the Shared folder.

Now, in order to ensure that Azure Data Factory can pick up your notebooks which are present in Azure Databricks, you have to ensure that you give permission on Azure Databricks to the managed identity which is assigned to Azure Data Factory. So I'm going to go on to the workspace resource that is hosting my Azure Databricks service. Here I'll go on to Access Control, click on Add, and add a role assignment. I'm going to choose the Contributor role, and here I'll search for App factory; that's the name of my Azure Data Factory resource. I'll choose that and hit on Save. Let's now wait for a couple of minutes to ensure that this role assignment takes effect. Once I've waited for a couple of minutes, I'll go on to Azure Data Factory. Here, let me go on to the Author section and let me create a very simple pipeline. I'll just close the tabs which I have open. So here I'll create a new pipeline. Now here I can give a name for the pipeline, I can go on to Databricks, and here I can drag the Notebook activity.

Yeah, I can just leave the name as it is. I can go on to Azure Databricks, and here, in the Databricks linked service, let me create a new one. So here, if you want, you can change the name; then you can scroll down, choose your subscription, and choose your Databricks workspace name. Now I already have a cluster in place, so we have our App cluster. If I go on to Compute, I have my App cluster which is in the running state, so I can make use of this same App cluster.

So now the notebook will actually run on our cluster. So I can choose an existing interactive cluster; if you want, you can also choose a new job cluster. Here, in the authentication type, I'll choose Managed Service Identity, and here I should be able to see my cluster, so I'll choose App cluster. Now, in case you can't see your cluster, you can just switch to Existing instance pool and then switch back to Existing interactive cluster, so that it just refreshes the list of existing clusters. And then I'll hit on Create. So now it will create this linked service.
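For reference, the linked service that gets created behind the scenes looks roughly like the JSON below when it uses the managed identity and an existing interactive cluster. The workspace URL, resource ID and cluster ID shown here are placeholders.

```json
{
  "name": "AzureDatabricksLinkedService",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
      "authentication": "MSI",
      "workspaceResourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Databricks/workspaces/<workspace-name>",
      "existingClusterId": "<cluster-id>"
    }
  }
}
```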

Now I'll go on to Settings, and here I need to browse for my notebook. So I'll go on to the Shared folder, choose my App notebook, and hit on OK. Let me now publish my pipeline. Once the pipeline is published, I'll trigger the pipeline. So here I'm just giving a very simple example of how you can execute a notebook as part of your pipeline. Please note that we have seen various scenarios on how you can actually connect activities together in a pipeline.

So you might have an activity before this notebook activity and you might have another activity after it. Your notebook might perform a particular function, and you might have activities in your pipeline that run before or after it. Let me go on to the Monitor section, where I can see my Databricks pipeline has succeeded. So here, if I do a select star, I can see my data in place. This data has now been ingested with the help of the notebook that was run on my Databricks cluster.
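And for completeness, the Notebook activity inside the pipeline definition looks roughly like this. The activity name, linked service name and notebook path are placeholders based on the names used in this chapter.

```json
{
  "name": "RunAppNotebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": {
    "referenceName": "AzureDatabricksLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "notebookPath": "/Shared/AppNotebook"
  }
}
```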
