DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 7

22. Lab – Azure Data Lake Storage Credential Passthrough

Now in this chapter, I want to go through a scenario wherein you can make use of something known as Azure AD credential passthrough. Earlier on, we had seen that if we wanted to fetch data from a Data Lake Gen2 storage account, we had to ensure that we had the access keys defined in a key vault.
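
As a reminder, a minimal sketch of that earlier key-based approach in a Databricks notebook cell might look like the following; the secret scope name, key name, storage account, container and file name are all placeholders rather than values from this course.

```python
# Key-based access (the earlier approach): pull the account key from a Databricks
# secret scope backed by Azure Key Vault, then register it with Spark.
# "datalake-scope", "storage-key" and "mydatalake" are hypothetical names.
account_key = dbutils.secrets.get(scope="datalake-scope", key="storage-key")

spark.conf.set(
    "fs.azure.account.key.mydatalake.dfs.core.windows.net",
    account_key,
)

# With the key registered, the file can be read over abfss.
df = spark.read.csv(
    "abfss://data@mydatalake.dfs.core.windows.net/Log.csv",
    header=True,
)
```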

But we can also make use of a feature known as Azure Active Directory credential passthrough, wherein the user who is working with the notebook is authorized to access the data in the Azure Data Lake Gen2 storage account. This is a much more useful security feature. Here, the user who is executing the notebook does not need to go through the process of having the access keys in place; based on their credentials and their permissions, they will have access to the data in the Data Lake Gen2 storage account.

Now, I go through in detail how you can give access to your data in your Data Lake Gen2 storage account in the security section of this course. But in this chapter, we are going to see how to make use of that security feature, the Azure Active Directory credential passthrough feature. In order to have a clean slate to test this feature, I'm going to create a new storage account. It'll be a Data Lake Gen2 storage account. Here I'll choose my resource group, give a storage account name, choose North Europe, and make this locally redundant. I'll go on to Next for Advanced and enable the hierarchical namespace. Then networking, data protection, tags, review and create, and let me go ahead and hit Create. So here I have listed down all the steps that you need to perform.
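
If you prefer to script this step instead of clicking through the portal, a rough sketch with the azure-mgmt-storage Python SDK could look like the following; the subscription ID, resource group and account name are placeholders, and the exact model fields may vary slightly between SDK versions.

```python
# Sketch: create a Data Lake Gen2 account (StorageV2 with hierarchical namespace).
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"              # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    "<resource-group>",                            # placeholder
    "newdatalake",                                 # hypothetical account name
    {
        "location": "northeurope",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},           # locally redundant storage
        "is_hns_enabled": True,                    # hierarchical namespace = Gen2
    },
)
print(poller.result().provisioning_state)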

The first is creating a new data lake storage account. Next, we need to upload a file and give the required permissions. This includes ensuring that we give something known as the Reader role and the Storage Blob Data Reader role to the Azure admin user, and also something known as ACL permissions. So once we have the storage account in place, I'll go on to the resource, go on to my containers and create a data container. Now I'm going to upload the log CSV file. We've already seen this earlier on; I have this log CSV file in place. Next, we need to give permissions to the Azure admin account. When we run our notebooks here, I'm running as the Azure admin account, so I need to ensure that I give the right permissions.
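
The container creation and file upload just described could also be scripted; here is a small sketch using the azure-storage-file-datalake package, assuming the account is named newdatalake and the file is Log.csv (both of which are assumptions).

```python
# Sketch: create the "data" container (file system) and upload the log file.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://newdatalake.dfs.core.windows.net",  # hypothetical account
    credential=DefaultAzureCredential(),
)

file_system = service.create_file_system("data")

with open("Log.csv", "rb") as data:                          # assumed file name
    file_system.get_file_client("Log.csv").upload_data(data, overwrite=True)
```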

Now, on to my data in the Data Lake Gen2 storage account. Even though I'm the Azure admin, I still need to specifically give these permissions. The first thing I need to do is go on to my Data Lake storage account. Here I have to go on to Access Control, click on Add and add a role assignment. I need to choose the Reader role, search for the Azure admin user ID, and then click Save. Then I have to add another role.

So here, another role assignment. This time I need to choose the role of Storage Blob Data Reader, again search for my admin account, and click Save. Now, I also need to log into Azure Storage Explorer to give something known as access control list (ACL) permissions. As I mentioned, I go through all of these concepts in the security section. At this point in time, I have logged into Azure Storage Explorer as my Azure admin, so let me go on to the new data lake storage account.

I'm just waiting for my containers to load up. I'll go on to my blob containers, go on to my data container, right-click and choose Manage Access Control List. Here I'll click on Add, search for the user, choose my user ID (that's the first one) and click on Add. I'll choose the Access permission of Read and hit OK, and the permissions are saved successfully. What I did earlier on was choose the option of Manage Access Control List; now I'll choose Propagate Access Control List so that it propagates the access control onto all of the objects that are in this container, and I'll hit OK.
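
For reference, the same permissions could be granted programmatically rather than through the portal and Storage Explorer. The sketch below combines the two role assignments with the ACL and its recursive propagation; the subscription ID, resource group, account name, admin object ID and the built-in role definition GUIDs are assumptions and should be verified against your own environment before use.

```python
# Sketch: assign Reader + Storage Blob Data Reader, then set and propagate ACLs.
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.storage.filedatalake import DataLakeServiceClient

subscription_id = "<subscription-id>"
admin_object_id = "<azure-admin-object-id>"
credential = DefaultAzureCredential()

scope = (f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"
         "/providers/Microsoft.Storage/storageAccounts/newdatalake")

auth = AuthorizationManagementClient(credential, subscription_id)

# Built-in role definition IDs for Reader and Storage Blob Data Reader
# (verify these GUIDs in your tenant).
for role_id in ("acdd72a7-3385-48ef-bd42-f606fba81ae7",
                "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"):
    auth.role_assignments.create(
        scope,
        str(uuid.uuid4()),
        {
            "role_definition_id": (f"/subscriptions/{subscription_id}"
                                   "/providers/Microsoft.Authorization"
                                   f"/roleDefinitions/{role_id}"),
            "principal_id": admin_object_id,
        },
    )

# ACL: give the admin user read + execute on the "data" container, then propagate.
service = DataLakeServiceClient(
    account_url="https://newdatalake.dfs.core.windows.net", credential=credential)
root = service.get_file_system_client("data").get_directory_client("/")

# set_access_control replaces the ACL, so keep the base entries alongside the user entry.
root.set_access_control(
    acl=f"user::rwx,group::r-x,other::---,user:{admin_object_id}:r-x")

# Propagate just the user entry down to everything already in the container.
root.update_access_control_recursive(acl=f"user:{admin_object_id}:r-x")
```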

So normally this is a more secure way: you could have users defined in Azure Active Directory and give them selective access to the files in the Azure Data Lake Gen2 storage account. We've done this part, we've given the required permissions, and we've also assigned all of these roles. Next, we need to create a new cluster. One very important note: this is only available with the Premium plan of Azure Databricks (we are using the trial Premium plan), and it only works with Azure Data Lake Gen2 storage accounts. So now, in Azure Databricks, I'll go on to the Compute section, go on to my existing cluster and terminate the cluster.

So we have to create a new cluster. One thing to note is that we will not be able to have multiple clusters in place, because we might not be able to do so based on the number of virtual cores that we can create as part of our subscription. I know that I have a limit on the number of virtual cores that I can use in a region as part of my subscription, so if I try to create another cluster while one is running, I might get an error. So I'll go on to my clusters and create a cluster. I'll give a cluster name and again choose Single node. Now I have to go on to the advanced options, and here I need to enable credential passthrough for user-level data access. Here I am choosing my user, which is the login ID for my Azure admin user.
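
The checkbox in the UI essentially adds a passthrough Spark configuration to the cluster. As a rough illustration only, a single-node cluster with user-level credential passthrough might be created through the Clusters API along these lines; the runtime version, node type, user name and workspace URL are placeholders, and the exact spark_conf keys the UI generates can differ between Databricks runtime versions.

```python
# Sketch: create a single-node cluster with AAD credential passthrough enabled.
import requests

host = "https://<databricks-instance>.azuredatabricks.net"   # placeholder workspace URL
headers = {"Authorization": "Bearer <personal-access-token>"}

cluster_spec = {
    "cluster_name": "passthrough-cluster",
    "spark_version": "<supported-runtime-version>",          # pick a current runtime
    "node_type_id": "Standard_DS3_v2",                        # assumed VM size
    "num_workers": 0,                                         # single node
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
        # Enables AAD credential passthrough on this cluster.
        "spark.databricks.passthrough.enabled": "true",
    },
    # The Azure AD user whose credentials are passed through.
    "single_user_name": "<azure-ad-user@yourtenant.com>",
}

response = requests.post(f"{host}/api/2.0/clusters/create",
                         headers=headers, json=cluster_spec)
print(response.json())
```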

And then I'll create the cluster. Let's wait till we have the cluster in place. Once we have the cluster in place, I'll create a new notebook, choose the new cluster as its cluster and hit Create. Now, here I'll take the code to create a DataFrame and place it here. I need to replace the storage account name with the name of my new storage account, and make sure the rest is the same: I have my log CSV file, and it is in the data container.
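
The notebook cell itself stays very simple because no keys or secrets are involved. A minimal version, assuming the account is called newdatalake, the container is data and the file is Log.csv, might look like this.

```python
# Read the CSV file directly over abfss; authorization comes from the
# Azure AD identity of the user running the notebook (credential passthrough).
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("abfss://data@newdatalake.dfs.core.windows.net/Log.csv"))

display(df)
```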

Now, let me run this. Here you can see all of the data. As I said, the difference here is that we have not used any access keys: there are no keys assigned to the cluster, no keys as part of the notebook itself, and we are not making use of the secrets stored in Databricks. We are now purely basing our authorization on the user that is defined in Azure Active Directory. The same user that is running your notebook also has access to the data in your Data Lake Gen2 storage account. So, this is another secure way in which you can access your data from your notebooks.

23. Lab – Running an automated job

Now, in this chapter, I want to go through the jobs that are available in Azure Databricks. A job is a non-interactive way to run an application on an Azure Databricks cluster. You can run the job immediately or based on a schedule, you can run a notebook or a JAR in a job, and the job will run on a cluster. As an example, earlier on we had run this particular notebook that would take the streaming events from Azure Event Hubs into our table in our dedicated SQL pool. Now, let's say you want to run this as a job. The first thing I need to do is move this notebook, so I'll click on Move, choose a shared location, hit Select and confirm the move. Currently the notebook is in the detached state. Now, in another tab, let me go on to Jobs.

Let me create a new job. Here, the first thing I need to do is select my notebook, so I'll go on to Shared, choose my app notebook and hit Confirm. Now we need to choose the cluster on which to run our job. You have two options: you can run it on your existing cluster, or you can create a new job cluster. A job cluster is specific to running jobs. Since we don't want to reach any sort of limit on the number of virtual cores that we can assign to our clusters, I'll choose my existing app cluster. Now, if I just quickly go on to another tab and go on to my clusters, you can see we have been working with a couple of clusters in this particular section.

At any point in time, you can go on to a running cluster and terminate it to basically stop it, and then you can start the cluster again. This is a cluster we created earlier on; it had the library installed for Azure Event Hubs. If I go back onto clusters and go on to my terminated cluster, you can start this cluster again at any point in time. What Azure Databricks does is retain the configuration of your cluster for a period of 30 days after it has been terminated, so that you can start your cluster with the same configuration at any point in time.
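
Starting a terminated cluster, and pinning one so its configuration is kept beyond that retention window, can also be done through the Clusters API. A small sketch, with the workspace URL, token and cluster ID as placeholders:

```python
# Sketch: restart a terminated cluster and pin a cluster via the Clusters API.
import requests

host = "https://<databricks-instance>.azuredatabricks.net"   # placeholder
headers = {"Authorization": "Bearer <personal-access-token>"}

# Start a terminated cluster again; Databricks keeps its configuration for 30 days.
requests.post(f"{host}/api/2.0/clusters/start",
              headers=headers, json={"cluster_id": "<cluster-id>"})

# Pin a cluster so its configuration is retained beyond the 30-day window.
requests.post(f"{host}/api/2.0/clusters/pin",
              headers=headers, json={"cluster_id": "<cluster-id>"})
```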

If you want to retain the configuration of this cluster for a longer duration of time, you have to choose this icon to pin it in Azure Databricks. Now you can see you don't even have the option to delete this particular cluster. Let me unpin it, because you should have the option to delete the cluster at any point in time. So that's just a quick note when it comes to the clusters. Going back onto our Jobs page, we have everything in place. Let me give a job name. Here, in the schedule type, you can run the job based on a schedule or you can trigger it manually. We will manually trigger this particular job, so let me hit Create. Once we have the job in place, I'll go back onto Jobs. Now let me go ahead and start this particular job.
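
What the UI does here maps fairly directly onto the Jobs API. A hedged sketch of creating the same kind of job against an existing all-purpose cluster and then triggering it manually could look like this; the job name, notebook path, cluster ID and workspace details are placeholders.

```python
# Sketch: define a notebook job on an existing cluster, then trigger it manually.
import requests

host = "https://<databricks-instance>.azuredatabricks.net"   # placeholder
headers = {"Authorization": "Bearer <personal-access-token>"}

job_spec = {
    "name": "stream-to-dedicated-sql-pool",                   # hypothetical job name
    "tasks": [{
        "task_key": "run_app_notebook",
        "notebook_task": {"notebook_path": "/Shared/AppNotebook"},  # assumed path
        "existing_cluster_id": "<app-cluster-id>",
    }],
}

job_id = requests.post(f"{host}/api/2.1/jobs/create",
                       headers=headers, json=job_spec).json()["job_id"]

# Equivalent of clicking "Run now" in the Jobs UI.
run = requests.post(f"{host}/api/2.1/jobs/run-now",
                    headers=headers, json={"job_id": job_id}).json()
print("Started run:", run["run_id"])
```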

It has started the job, and you can go on to the job. Here I'm actually getting an error: I can see there's an internal error, and if I view the details, it's saying that the notebook is not found. That means we made a mistake in our job configuration. So I can go on to job A, go on to the configuration, and here let me select the proper notebook, the app notebook in the shared location. Let me hit Confirm. This is fine; I'll click on Save. Let me go back onto Jobs and run this job again. I'll go back onto job A, and now we can see it is in the running state.

We can see the duration, and over here we can click on View Details. It has submitted the command to the cluster for execution. Now we can see it is initializing the stream, and then that it is running the stream. Let me now check whether I have any data in my log table, and I can see the data in place. So this is actually running as a job on a general all-purpose cluster, but in a large organization, if you want to run jobs, you can run them on separate job clusters. Just for now, I'll go back onto the job and cancel this running job. Right, so in this chapter I just wanted to go through the job aspect that is available when it comes to Azure Databricks.
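
For completeness, checking on the run and cancelling it can also be done through the Jobs API rather than the UI; another small sketch with placeholder values:

```python
# Sketch: inspect a job run's state and then cancel it via the Jobs API.
import requests

host = "https://<databricks-instance>.azuredatabricks.net"   # placeholder
headers = {"Authorization": "Bearer <personal-access-token>"}
run_id = "<run-id>"                                           # from the run-now response

state = requests.get(f"{host}/api/2.1/jobs/runs/get",
                     headers=headers, params={"run_id": run_id}).json()
print(state["state"])                                         # life cycle / result state

requests.post(f"{host}/api/2.1/jobs/runs/cancel",
              headers=headers, json={"run_id": run_id})
```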
