DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 5
17. Lab – Structured Streaming – Streaming from Azure Event Hubs – Implementation
So where did we leave off in the last chapter? We had run our notebook to read events from the event stream, but we were getting the body of each event in an encoded format. So I now have some code to display the string version of that encoded body. Here I am using another way to access the column: I take the body column and cast it to a string type. Let me take this statement, write it here, and run the cell again.
So again, it will wait for a streaming event. Now I'm getting an error saying I can't use the string type, so I just need to copy this import statement. Let me run this again; now it is initializing the stream. This import statement makes use of the Event Hubs library that we installed in the previous chapter to ensure that Azure Databricks can actually work with Azure Event Hubs. So let's wait till we have some data in place.
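The exact cell isn't shown here, but a minimal sketch of this step in Scala might look like the following. The DataFrame name `incomingStream` and the Spark SQL types import are assumptions; that import is what typically resolves a "cannot find StringType" style error.

```scala
// Sketch only: assumes `incomingStream` is the streaming DataFrame already
// created with spark.readStream.format("eventhubs") in an earlier cell.
import org.apache.spark.sql.types._   // brings StringType into scope
import spark.implicits._              // enables the $"column" syntax

// Cast the binary `body` column of each event to a readable string
val bodyAsString = incomingStream.select($"body".cast(StringType).alias("body"))

// display() is the Databricks notebook helper for rendering streaming results
display(bodyAsString)
```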
Now, after getting our first stream of data, we can see we've made some progress. In the body of the event I can see the records, and for each record in that array I'm getting values such as the count, the total, the minimum, et cetera. So we are able to get the string body of our event. Let me cancel this Spark job and go on to our next objective.
I'm going to copy all of this now; this code is used to display our data properly, and there is quite a lot of it. Again, I'm using my import statements, and I need to make sure I replace the placeholder with the proper connection string, so I'll copy that onto the clipboard and paste it in here. Now, for the Event Hub configuration, I'm using another method that is available when you want to set something on the Event Hubs configuration.
This setting controls where you want to start reading from in the event stream. For that you can use EventPosition, and there are different values you can specify; I'm giving a reference link to the different options here. In this case I'm saying: start from the very beginning of the stream in Azure Event Hubs.
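A minimal sketch of that configuration with the azure-eventhubs-spark connector is shown below; the placeholder connection string and the variable names are assumptions.

```scala
// Sketch only: replace the placeholders with your own namespace and hub details.
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}

val connectionString = ConnectionStringBuilder("<event-hub-connection-string>")
  .setEventHubName("<event-hub-name>")
  .build

// Start reading from the very beginning of the stream
val ehConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromStartOfStream)

val incomingStream = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()
```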
Next, you have seen our data. The body was a JSON document containing a records property, and inside records we had an array of JSON objects. When it comes to JSON, or other forms of nested data, it can be complex to read everything out, so I've added some code to get the value of each record. By no means is this code perfect, but the whole idea is to show you how you can start reading your streams of data in a notebook.
As a data engineer, let's say you want to get events from the Event Hub and perform some sort of transformation, or simply understand the data: you can spin up a notebook and start working on it using the underlying Spark engine. So here I am again reading the stream of data; the format and the options are the same. Then I select what I want from the stream. Remember, in addition to the body, the stream carries other properties set by Event Hubs, such as the partition ID, et cetera. But all I want is the body, that's it. I cast the body to a string, which we have seen before, and then I use a built-in function known as get_json_object, which extracts a JSON object from a JSON string. Currently the body of our event is a JSON string, but I want to get it as a JSON object.
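A sketch of that step, assuming the `incomingStream` DataFrame from the configuration above, could look like this (the JSON path "$.records" reflects the structure of the events described earlier; the variable names are assumptions):

```scala
// Sketch only: keep just the body, cast it to a string,
// then pull out the "records" property from the JSON document.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val messages = incomingStream
  .select($"body".cast(StringType).alias("body"))
  .select(get_json_object($"body", "$.records").alias("records"))
```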
Here I am trying to get only the records. I want to drill into that initial JSON document and pull out just the records array, which I alias as a column named records. Next I want to extract all of the JSON objects, and here is where it gets a little bit tricky. The records coming from Event Hubs are, as I said, an array, and that array can contain a number of JSON objects. Each JSON object has the count, the minimum, the maximum, et cetera. All of the information is in the JSON objects, but the number of JSON objects is not fixed.
You could use more intelligent logic to walk through the entire set of records within the records array. What I'm doing here is getting a maximum of 30 records: if there are 30 or fewer, go ahead and fetch them as JSON objects. That's the purpose of having a separate JSON elements value, to collect all of my JSON objects. Finally, I explode the entire array of elements, because as I said everything is in an array. Since I'm using a static value of 30, if there are only 20 records the remaining entries will be null, and I don't want those, so I filter them out. This particular logic is just a temporary solution to get hold of the records that are being fetched, as sketched below.
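A sketch of that "up to 30 elements" approach, assuming the `messages` DataFrame from the previous snippet (the limit of 30 comes from the description above; the column and variable names are assumptions):

```scala
// Sketch only: build an array of up to 30 entries pulled out of the records array
// by JSON path index, explode it into rows, and drop the null entries that appear
// when fewer than 30 records are present.
import org.apache.spark.sql.functions._
import spark.implicits._

val maxRecords = 30

val jsonElements = messages.select(
  array((0 until maxRecords).map(i => get_json_object($"records", s"$$[$i]")): _*).alias("elements")
)

val explodedRecords = jsonElements
  .select(explode($"elements").alias("record"))
  .filter($"record".isNotNull)
```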
This really depends on the complexity of the data you are working with, so you can add your own logic to make sure you get all of the records within that records array. Then I create my own schema, which we've seen before, and I create a new DataFrame from all of the records using that schema. Here from_json is used to parse a column that contains a JSON string. So I've used a collection of the built-in functions that are available, and wherever possible I've tried to explain which functions are being used in this code. Then I go ahead and select only the record columns, and finally I display the DataFrame.
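The final step could be sketched as follows; the field names and types in the schema are assumptions based on the metric values (count, total, minimum, and so on) mentioned earlier, so adjust them to match your own data:

```scala
// Sketch only: define a schema for one record, parse each JSON string with from_json,
// then flatten the struct into individual columns and display the result.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val recordSchema = new StructType()
  .add("time", StringType)        // assumed field names for a metric record
  .add("metricName", StringType)
  .add("count", LongType)
  .add("total", DoubleType)
  .add("minimum", DoubleType)
  .add("maximum", DoubleType)

val parsedRecords = explodedRecords
  .select(from_json($"record", recordSchema).alias("data"))
  .select($"data.*")

display(parsedRecords)
```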
Let me go ahead and run this cell; it's initializing the stream. There is some slight complexity in the code, and that's just because of the shape of the data that's coming in. You can see I'm already getting data, and why is that? Because, remember, I set the starting position to the start of the stream, so whatever events are already in the Event Hub are handed straight to my notebook. That's why I'm already seeing all of the events in the Event Hub, and you can see I'm now getting the data in its proper format. Since the output keeps refreshing I'm not able to scroll, but you can see all of the information here laid out properly. Getting to this point did take me some time.
As I said, this is not a Spark-based course, and it's not a Scala-based course; it's a course to understand the data engineering aspects of using the services that are in Azure. In this chapter I just wanted to go through that extra step of parsing the data that we are getting from Azure Event Hubs. I'm going to go ahead and cancel this Spark job.
18. Lab – Getting data from Azure Data Lake – Setup
In the last chapter we ended with streaming events from an Event Hub and parsing the data so that we get individual columns. Now I want to write this data to our log data table in our dedicated SQL pool in Azure Synapse. But before that, I want to show you how you can use a notebook to connect to Azure Data Lake Gen2. The reason is that when you start streaming data into Azure Synapse, you need to have a staging area in place; we've seen this earlier on. When data is transferred to Azure Synapse, it first lands in a staging area before it is copied into the dedicated SQL pool, and that staging area is going to be specified in our notebook.
That staging area is going to be an Azure Data Lake Gen2 storage account, and we already have one in place, so we're going to make use of that account. Now, I've gone to All resources, and something you will see is a lot of extra resources listed here: a virtual machine, a network interface, some disks, et cetera. All of this is your cluster. This machine is essentially your driver node; the machine running the Spark installation and executing all of your Spark jobs was created automatically by Databricks. When you delete your workspace, all of this infrastructure is automatically deleted as well.
So let's go to our Data Lake Gen2 storage account, datalake2000. We need to use this storage account, but before that, I want to show you how you can access an Azure Data Lake Gen2 storage account from your notebook. Here we are going to make use of access keys. Every storage account has access keys, and I want to use these keys for authorization. But there is an initial step we need to perform first: for security reasons, you shouldn't just embed the access key in your notebook. Instead, we'll use another service, known as Azure Key Vault, to store the access key.
Azure Key Vault is another service available in Azure. It helps to store artifacts such as your passwords (that is, your secrets), your certificates, and your encryption keys, and it provides a platform where you can manage the lifecycle of those certificates, keys, and secrets.
So the access key will be stored as a secret in Azure Key Vault. Then we are going to create something known as a Databricks secret scope to access that Key Vault. Remember, the whole purpose is that, from our notebook in Azure Databricks, we want to access, let's say, our log CSV file in our Azure Data Lake Gen2 storage account. For security purposes we need to ensure that Azure Databricks has the ability to access that storage account, and one way of doing it is via access keys.
So instead of embedding the access key in our notebook directly, we are going to store it as a secret in Azure Key Vault, and then create a secret scope in Azure Databricks so that our notebook can make use of that secret. All of this is done to have a more secure way of handling the access keys, because if a key gets into the wrong hands, anybody could read the data in your Azure Data Lake Gen2 storage account. So the first thing I'll do, in All resources, is create another resource. I'm going to search for the Key Vault service, choose Key Vault, and hit Create. Here I'll choose my resource group, set the location to North Europe, and give a unique Key Vault name. I'll leave the pricing tier as Standard, and the number of days to retain a deleted vault as seven days.
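Once the secret scope is in place (we'll create it in the next chapter), the notebook side of this could look roughly like the sketch below. The scope name, secret name, and storage account name here are assumptions; substitute your own values.

```scala
// Sketch only: read the storage account access key from a Key Vault-backed
// secret scope and hand it to Spark so the notebook can reach ADLS Gen2.
val accessKey = dbutils.secrets.get(scope = "datalake-scope", key = "datalake2000")

// Standard Spark configuration key for authenticating to an ADLS Gen2 account
spark.conf.set(
  "fs.azure.account.key.datalake2000.dfs.core.windows.net",
  accessKey
)
```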
I'm just setting it to seven days here. This is an extra feature available with Key Vault: if you delete the vault by mistake, you can still access its artifacts and actually restore the Key Vault. Next I'll go to the Access policy tab and leave everything as it is, then Networking, leave everything as it is, then Tags, then Review + create, and let's hit Create. This will just take a couple of minutes. Once we have the Key Vault in place, I'll go on to the resource.
Here is where you can define your keys (these are your encryption keys), your secrets such as passwords, and your certificates. I'll go to Secrets and create a new secret. I'll give it the name datalake2000, and for the value of the secret I'll go to my Data Lake Gen2 storage account and take either key1 or key2. I'll take key1, paste it here, and hit Create. So let's mark an end to this chapter and complete this process in the next one.