DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 4


12. Lab – Filtering on NULL values

Now in this chapter, I just want to quickly show you again how to filter on NULL values, something we had seen earlier on when working with the Spark pool in Azure Synapse. So just one thing that I want to confirm first. Let me display my data frame. Here I am looking at the Resource Group column, and note that there is a space between "Resource" and "Group". I just want to verify that, because the exact spelling and spacing of column names is important. So let me filter, and I can simply use isNull. Here I'm displaying the rows where the resource group is null. Let me put that in place and run the cell. If I click on the result, scroll down and go to the right, I can see all of the rows where the resource group is null.
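As a rough sketch of what that cell looks like, assuming the data frame is named activityDf and the column is literally named "Resource Group" (these names are placeholders, not the exact ones from the lab):

import org.apache.spark.sql.functions.col

// Keep only the rows where the "Resource Group" column is null.
val nullRowsDf = activityDf.filter(col("Resource Group").isNull)
display(nullRowsDf)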

If you want only the rows where the resource group is not null, you can create a new data frame using isNotNull and then use the display command to view it. You can then do a comparison by looking at the count of rows in both data frames. So in this chapter, I just wanted to quickly go through how to filter on NULL values.
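And a matching sketch for the not-null case, again with activityDf as an assumed name:

import org.apache.spark.sql.functions.col

// New data frame containing only rows where the resource group is populated.
val notNullDf = activityDf.filter(col("Resource Group").isNotNull)
display(notNullDf)

// Compare the row counts of the two data frames.
println(s"All rows: ${activityDf.count()}")
println(s"Non-null rows: ${notNullDf.count()}")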

13. Lab – Parquet-based files

So, in continuation of looking at how to read different file formats, in this chapter let's see how to read Parquet-based files. Again, very simple: the format is of the type parquet. So firstly, I can again go ahead and upload data, or click on Browse. Here I have a log Parquet file. I'll open it up, go on to Next and hit on Done. I can take the generated command, put it here, and display the data frame. Let me run this so we can see all of the details here.
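The generated command looks roughly like the sketch below; the DBFS path and file name are assumptions based on the upload-data flow, not the exact ones used in the lab:

// Read the uploaded Parquet file into a data frame and display it.
val logsDf = spark.read
  .format("parquet")
  .load("dbfs:/FileStore/tables/Log.parquet")

display(logsDf)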

The ID is an integer and the time is a timestamp, because this is how the Parquet file has been generated. If I quickly take the data frame and do a count (we don't need the display anymore), I can see the count of records. This is only a subset of the log data, which I've taken to generate a Parquet-based file. The entire purpose was just to show you how you can work with a Parquet-based file. And now, since everything comes in as a data frame, you can use your normal Spark data frame commands to work with it.
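A minimal sketch of that count, assuming logsDf is the data frame from above (the Id and Time column names are assumptions based on the schema described):

// Number of rows in this subset of the log data.
println(logsDf.count())

// The same data frame API works regardless of the source file format.
display(logsDf.select("Id", "Time"))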

14. Lab – JSON-based files

Now let's see how to work with JSON-based files. Again, very simple: we have the format of JSON here, and we are loading our files. I have both my files in place. One is the representation of an array within our customer objects, the JSON-based objects which we had seen earlier on, and the other has an object within an object. So here I can again upload my file. I'll click on Upload data and browse to my files. I'll go on to my array-based customer file, hit on Open and go on to Next. Here again you can copy the generated statement if required, then hit on Done. If you want to see the file: in the array-based customer JSON file, this is the list of my JSON objects, and here I have the array in terms of the courses. So if I go on to my Scala-based file, here I am reading the JSON-based file. Let me run it here.
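The read itself looks roughly like this sketch; the multiline option and the file name are assumptions about how the array-based customer file was saved:

// Read the JSON-based customer file into a data frame.
val customerDf = spark.read
  .format("json")
  .option("multiline", "true")
  .load("dbfs:/FileStore/tables/customer_array.json")

display(customerDf)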

So you can see this is coming in as an array. I can use the explode function, which we have also seen when working with the Spark pool. Here I'm exploding the courses column; this is the function that helps to expand the array. I'll run the cell, and you can see I'm getting the values as desired. Next we have our object, so our customer object JSON file. Here, in addition to the array, I have another object within the object itself. So again, if I go back on to my files, let's first upload this. I'll go on to my object-based file, hit on Next and hit on Done. Now let me first put this statement here, so we are defining our data frame first, and let me run this. Here, in addition to exploding the courses column, I am also accessing Details.Mobile and Details.City, and I'm using an alias to ensure the exploded column is displayed as courses. So again, this is very similar to what we've seen when working with the Apache Spark pool in Azure Synapse.
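A minimal sketch of the explode plus the nested-object access, assuming customerDf from above; the customerid, customername, courses and details column names are assumptions:

import org.apache.spark.sql.functions.{col, explode}

val flattenedDf = customerDf.select(
  col("customerid"),
  col("customername"),
  explode(col("courses")).alias("courses"),  // expand the array into one row per course
  col("details.mobile"),                     // reach into the nested object
  col("details.city")
)

display(flattenedDf)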

15. Lab – Structured Streaming – Let’s first understand our data

Now we want to look at a scenario wherein we can stream events from Azure Event Hubs on to Azure Databricks. Earlier on, we had enabled Azure SQL database diagnostic logs to send events on to Azure Event Hubs; this is something we had seen in the section on Azure Stream Analytics and Azure Event Hubs. We want to accomplish the same thing here in Azure Databricks, and then from Azure Databricks we are going to take the events from Azure Event Hubs and stream them directly on to our dedicated SQL pool in Azure Synapse. Now, as always, the first thing that I normally do is try to understand the data. When it comes to SQL diagnostic logs, we can stream them both on to Azure Event Hubs and on to an Azure Storage account, so initially I had also sent the logs on to an Azure Storage account, where a JSON-based file gets generated. I always do this so that I can understand the structure and the type of the data, so that I know how to process it in the required service; I never make assumptions about the data. That's why we are first going to see in Azure Databricks how we work with this JSON-based file. So, firstly, let me upload data. I'll upload the JSON-based file that I have, go on to Next and hit on Done. Here, if I go on to the command, let me copy all of this. I'm loading my JSON-based file and then selecting a certain number of columns: the count, the minimum, the maximum, the resource ID, the time, and the metric name. Let me run this. Everything seems as it should be, so I'm able to work with the data.
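A minimal sketch of that cell, with the file name and the exact column names as assumptions based on the typical Azure diagnostic metric schema:

// Read the diagnostic log file and pick out the metric columns of interest.
val metricsDf = spark.read
  .format("json")
  .load("dbfs:/FileStore/tables/sqlmetrics.json")
  .select("count", "minimum", "maximum", "resourceId", "time", "metricName")

display(metricsDf)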

Now, in the next chapter, we'll see how to continuously stream this data from Azure Event Hubs on to Azure Databricks and then on to our dedicated SQL pool. But as I mentioned before, the first step that I normally perform is to confirm whether I can read the data into a data frame and understand what data I'm getting. This is always step number one, and then we look at step number two in the next chapter. There are obviously some more things that we need to implement in order to stream events from Azure Event Hubs.

16. Lab – Structured Streaming – Streaming from Azure Event Hubs – Initial steps

Now in this chapter, we'll see how to stream events from Azure Event Hubs. Just as a refresher, earlier on, for my Adventure Works SQL database, I had gone on to the diagnostic settings, and there I had a diagnostic setting wherein I was streaming all of my metric data on to an Azure Event Hub. Now I want to consume this data in Azure Databricks. So I go on to my next Scala file, wherein I have the code that will be required for reading data from Azure Event Hubs. There is some work that we need to do in order to make this work: we need to install a library. One thing with Azure Databricks is that you can install external libraries, so if you want to use the functionality of an external library in your code, you can install that library in Azure Databricks.

So if I go on to Databricks, I have to go on to my Compute section, go on to my cluster, go on to Libraries, and there I can install a new library. If I have a JAR file, I can directly upload it here, but I want to search for a package that helps to extend the functionality of Azure Databricks to work with Azure Event Hubs. So here I'll choose Maven, click on Search Packages, and choose Maven Central. If I search for Event Hubs, I can see all of the packages that are available. I'll reduce the zoom, scroll down, and choose azure-eventhubs-spark_2.12. I'll hit on Select and then click on Install. So while this is being installed, just a quick note on libraries.

So you can install a library in Azure Databricks if you want to make third-party or custom code available in your notebooks. These libraries can be written in Python, Java, Scala or R. Now, there are some common libraries already available in the Databricks runtime itself which you can make use of directly, but if you want to use external libraries, there are multiple ways in which you can achieve this, and you can install the libraries in different modes. You have workspace libraries; remember, in the workspace we have seen something known as repos.

Workspace libraries can be used as a local repository from which you create cluster-installed libraries. And then we have cluster-based libraries, which is what we are using in this particular chapter; these are installed on the cluster and then available for all notebooks running on that cluster. So if you want a library that can be used across all of your clusters, you will create a workspace library; if you just want a library for a particular cluster, then you'll use a cluster library. Now, once the library is installed, I'll go back on to my notebooks and go on to the app notebook. In order to actually make use of that library, I need to choose the drop-down on the cluster, detach the notebook, and reattach it back on to the cluster. I'll hit on Confirm; this is the only way I can actually make use of the new library. Now, let me take the first set of statements, go on to my notebook and paste them here. First we need to define our connection string, so we need a connection string on to our DB Hub event hub. I'll go on to my Event Hubs namespace, go on to Event Hubs, go on to DB Hub, then go on to Shared access policies, and let me add a new policy.

So this policy is for listening to and also managing events. I'll just give it the name databricks and hit on Create. I'll go on to the policy and copy the connection string; we can copy either connection string. Then I'll replace it here. Next we are creating a new Event Hubs configuration; this is now making use of that external library. And then we can read our stream via our Spark context. Here I'm saying the format is eventhubs, so again this is making use of our library, and I'm passing in the Event Hubs configuration. So let me run the cell as it is. Now we can see that we are getting a Spark SQL data frame in place. What I'll do next is display that Event Hubs data frame. Let me run this. Now you can see it is initializing the stream, because we are now asking the Spark engine to go ahead and start reading the events from Event Hubs and display them to us.
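A minimal sketch of those statements, assuming the azure-eventhubs-spark_2.12 library is installed on the cluster; the connection string placeholder and the event hub name are to be replaced with your own values:

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}

// Connection string copied from the databricks shared access policy on the event hub.
val connectionString = ConnectionStringBuilder("<event-hubs-connection-string>")
  .setEventHubName("dbhub")
  .build

val ehConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)

// Read the stream from Event Hubs using the connector's format.
val eventHubsDf = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

display(eventHubsDf)  // starts the stream and shows incoming events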

So now we are telling Spark to go ahead and perform an action: display all of the events that are coming in from Event Hubs. It is waiting for events to arrive in Azure Event Hubs, so this is a continuous streaming process. After a minute I can see that I have my first row of data. Here I can see that the body contains some sort of encoded string, and I also have the partition within Azure Event Hubs and the offset.

If I go on to the right, I can see other aspects such as the sequence number, the enqueued time, et cetera; these are all properties of the event in Event Hubs. But I can't see my data. Remember, in Azure Event Hubs we had seen that we could get the diagnostic metric information, such as the count, the minimum, the maximum, et cetera, all the metrics for the underlying database. But here I only have some sort of encoded string when it comes to the body. So let's mark an end to this chapter and continue our exploration of how to stream events from Azure Event Hubs. For now, let me cancel this particular job, which will cancel the run in the notebook and the underlying Spark job, and let's move on to the next chapter.
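As a small preview of dealing with that encoded body, one common approach is to cast the binary column to a string; a minimal sketch, assuming eventHubsDf is the streaming data frame from above:

import org.apache.spark.sql.functions.col

// Cast the binary body column to a readable JSON string.
val decodedDf = eventHubsDf.select(col("body").cast("string").alias("body"))
display(decodedDf)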
