DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks

  • June 22, 2023

1. What is Azure Databricks

Now, before we go into Azure Databricks, let me first explain the need for Databricks itself. Databricks is a company that was founded by the original creators of Apache Spark. The Databricks service makes use of Apache Spark to provide a unified analytics platform. So let’s understand the use case for Databricks. Let’s say that you want to make use of Apache Spark for your underlying processing needs.

The first thing you would need to do is provision machines. You would then install the Spark engine and the required libraries on these machines. Only then could you use Apache Spark for your data processing needs. In such a scenario, you are responsible for provisioning the underlying machines, and you are responsible for installing the Spark engine and the required libraries.

You also have the responsibility of maintaining the underlying infrastructure. If you need to scale the underlying machines to cater to your data processing needs, this is something you need to take care of yourself. With Databricks, however, you can create this entire environment with just a few clicks.

Databricks first of all creates the underlying compute infrastructure for you. It also works with the underlying storage layer: in addition to having the servers in place, it provides an abstraction layer that allows Spark to interact with the underlying storage service. It also installs Spark for you, along with other libraries and frameworks that add further capabilities to Spark. For example, you could include machine learning libraries.

So all of this can be done by Databricks itself. It also provides a workspace for you. In this workspace, you can create notebooks, users can collaborate on these notebooks, and you can create visualizations in the notebook itself. Now, you can launch Databricks either in AWS, that’s Amazon Web Services, or in Azure. And that’s where we come to Azure Databricks. Azure Databricks is a completely managed Databricks environment for you.

It makes use of the underlying compute infrastructure and the Virtual Network service that are already available in Azure. So Azure Databricks is an implementation of Databricks on Azure itself. You can also make use of Azure security features such as integration with Azure Active Directory and role-based access control. Right, so in this chapter, I just wanted to give an introduction to Azure Databricks.

2. Clusters in Azure Databricks

Hi and welcome back. In the previous chapter, I gave an introduction to Azure Databricks. In this chapter, I want to go through some important concepts before we look at Azure Databricks in the labs, just so that you have an idea of what we are going to do there. Again, Databricks allows you to create the underlying infrastructure, with machines that have Spark installed along with the libraries that allow you to perform your data analytics. With the Azure Databricks service, you create clusters in something known as a workspace. This cluster of machines has the Spark engine and other components installed. Now, when it comes to the cluster itself, there are two types of nodes that get created.

First, you have the worker nodes. These are the nodes that actually process the underlying tasks. Let’s say you send a particular command to the Spark engine. That command will be sent on to the worker nodes, and the worker nodes have the responsibility of performing the underlying tasks. Then you have the driver node. The driver node has the responsibility of distributing the tasks that we send to the Spark cluster across the worker nodes. So this is one of the key concepts in Azure Databricks: we create a cluster of nodes.
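The split between the driver and the workers can be illustrated with a small sketch. This is not Spark code: it uses Python threads as stand-in "workers", and the function names `driver` and `worker_task` are made up for the illustration. It only shows the idea of one coordinator splitting the work, handing it out, and combining the results.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(partition):
    """Stand-in for the work a worker node does: sum one partition of the data."""
    return sum(partition)


def driver(data, num_workers):
    """Stand-in for the driver: partition the data, distribute it, combine results."""
    # Give each worker every num_workers-th element, starting at its own offset.
    partitions = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as the partitions.
        partial_sums = list(pool.map(worker_task, partitions))
    return partial_sums, sum(partial_sums)


partial_sums, total = driver(list(range(1, 11)), num_workers=3)
print(partial_sums, total)  # [22, 15, 18] 55
```

In a real cluster the workers are separate machines and Spark decides how the data is partitioned, but the division of responsibility is the same.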

Now in Azure Databricks, there are two types of clusters. You can create something known as an interactive cluster, or you can create something known as a job cluster. With an interactive cluster, you can analyze your data with the help of interactive notebooks.

Also, multiple users can use the cluster and collaborate on the notebooks that get created. So this is an interactive way of analyzing your data. Whereas, let’s say you just want a job to run on the cluster and you don’t want any sort of interaction from a user; then you could run that job on a job cluster. When the job needs to run, Azure Databricks will automatically start the cluster and run the job. When the job is complete, the cluster will be terminated. So this is a cost-efficient way of running jobs on a cluster. Now, when it comes to interactive clusters, there are two types: you have a standard cluster and you have a high concurrency cluster.
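The difference between running a job on a job cluster and running it on an existing interactive cluster can be sketched as two job definitions. The field names below (`tasks`, `notebook_task`, `new_cluster`, `existing_cluster_id`) follow my understanding of the Databricks Jobs API and should be checked against the current documentation. The helper functions, job names, notebook path and the `<...>` placeholder values are hypothetical.

```python
def job_on_job_cluster(name, notebook_path, num_workers):
    """Job that asks for a fresh cluster: created for the run, terminated afterwards."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {
                    "spark_version": "<runtime-version>",  # placeholder, not a real value
                    "node_type_id": "<vm-size>",           # placeholder, not a real value
                    "num_workers": num_workers,
                },
            }
        ],
    }


def job_on_interactive_cluster(name, notebook_path, cluster_id):
    """Job that reuses an interactive cluster that is already defined in the workspace."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                "existing_cluster_id": cluster_id,
            }
        ],
    }


scheduled = job_on_job_cluster("nightly-load", "/Shared/nightly_load", num_workers=2)
adhoc = job_on_interactive_cluster("adhoc-check", "/Shared/nightly_load", "<cluster-id>")
print(sorted(scheduled["tasks"][0]))
print(sorted(adhoc["tasks"][0]))
```

The only structural difference is which cluster key the task carries, which is what makes the job cluster the cheaper option for unattended runs: nothing stays running once the job has finished.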

Now, the standard cluster is recommended if you are a single user working in Azure Databricks. Yes, you can have multiple users running workloads on a standard cluster, but there is no fault isolation. That means if a fault happens in a workload executed by one user, it might impact the workloads run by other users on the same cluster. Also, the resources of the cluster might get allocated to a single workload. If all of the resources are working on a single workload, other users who are trying to execute their workloads on the cluster might find that they do not run efficiently, because the resources are not being allocated to those workloads.

Now, when it comes to running your notebooks, a standard cluster has support for the programming languages Python, R, Scala and SQL. Then you have the high concurrency clusters. This is recommended for multiple users. So if you have multiple data engineering users who need to make use of a cluster in Azure Databricks, you can make use of a high concurrency cluster.

Here, you have aspects such as fault isolation. Also, the resources of the cluster are effectively shared across different user workloads. This has support for Python, R and SQL, so there is no support for Scala as of yet. In the high concurrency cluster, you also have something known as table access control. Here you can grant and revoke access to data from either Python or SQL. Right, so in this chapter, I just wanted to go through some important aspects when it comes to clusters in Azure Databricks.
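The comparison between the two interactive cluster modes can be summarized as a small lookup. This is only a restatement of the points from this chapter in code form. The dictionary layout and the helper `modes_supporting` are made up for the illustration and are not part of any Databricks API.

```python
# Properties of the two interactive cluster modes, as described in this chapter.
CLUSTER_MODES = {
    "standard": {
        "recommended_for": "single user",
        "fault_isolation": False,
        "languages": {"Python", "R", "Scala", "SQL"},
    },
    "high_concurrency": {
        "recommended_for": "multiple users",
        "fault_isolation": True,
        "languages": {"Python", "R", "SQL"},
    },
}


def modes_supporting(language, need_fault_isolation=False):
    """Return the cluster modes that support a language, optionally requiring fault isolation."""
    return [
        mode
        for mode, props in CLUSTER_MODES.items()
        if language in props["languages"]
        and (props["fault_isolation"] or not need_fault_isolation)
    ]


print(modes_supporting("Scala"))                              # ['standard']
print(modes_supporting("Python", need_fault_isolation=True))  # ['high_concurrency']
```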

3. Lab – Creating a workspace

So now, in this chapter, let’s start working with Azure Databricks. The first thing we need to do is create something known as an Azure Databricks workspace. So let’s do that. In All resources, I’ll hit on Create. Here, I will search for Azure Databricks, I’ll choose that, and I’ll hit on Create. Here, I’ll choose my resource group. Here, I need to give a workspace name. I have to choose my region, so here I’ll choose North Europe.

Now, in terms of the pricing tier, there are different pricing tiers in place. I’m going to choose the trial, which gives us the premium features along with 14 days of free DBUs. I’ll explain this concept at a later point in time, when creating the cluster in the workspace. For now, I’ll choose this pricing tier. I’ll go on to Networking and leave everything as is. I’ll go on to Advanced. I’ll go on to Tags. I’ll go on to Review and Create. And let’s hit on Create.

So this is now going to launch our Databricks workspace. Let’s come back once we have the workspace in place. Once we have the workspace in place, I’ll go on to the resource. Here, we need to scroll down and launch our workspace. Now, the workspace is where you’ll actually do all of your work: you’ll create clusters, you’ll create notebooks, and you can create Spark databases and tables. So you will do all of your data engineering work here.

In this particular workspace in Azure Databricks, you can see you have the ability to create a new notebook, create a table, create a cluster, or create something known as a job. In the menu options, you can again see that you can create a notebook, a table, a cluster or a job. Here you can see an overview of your workspace. Here you can see something known as Repos. You can look at your data. You can look at the compute options and at your jobs. Right, so in this chapter, I just wanted to start with creating a Databricks workspace.
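The portal choices made earlier in this lab (resource group, workspace name, region, pricing tier) can also be written down as a small workspace definition. This is a sketch under assumptions, not a deployable template: the resource type string and the `sku` field are modeled on my understanding of the Azure Resource Manager definition for a Databricks workspace. The set of tier names is an assumption about what the portal offers, and the workspace and resource group names are hypothetical.

```python
# Assumption: the pricing tiers offered when creating the workspace.
PRICING_TIERS = {"standard", "premium", "trial"}


def workspace_definition(name, resource_group, region, pricing_tier):
    """Collect the choices made in the portal into one dictionary."""
    if pricing_tier not in PRICING_TIERS:
        raise ValueError(f"unknown pricing tier: {pricing_tier!r}")
    return {
        "type": "Microsoft.Databricks/workspaces",
        "name": name,
        "resource_group": resource_group,
        "location": region,
        "sku": {"name": pricing_tier},
    }


# Hypothetical names; the region and tier match the choices made in the lab.
ws = workspace_definition("my-databricks-ws", "my-resource-group", "northeurope", "trial")
print(ws["location"], ws["sku"]["name"])  # northeurope trial
```

Keeping the choices in one place like this makes it easier to recreate the same workspace later with a template or a command-line tool instead of clicking through the portal again.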
