DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 9


29. Delta Lake Introduction

Hi and welcome back. In this chapter I just want to go through Delta Lake when it comes to Azure Databricks, and we'll see a couple of examples of Delta Lake in Databricks itself. With the help of Delta Lake, you get some additional features for tables that are stored in Azure Databricks. One of those features is ACID transactions. Here you can always ensure that you get consistent data. So when someone is, let's say, performing an update on records that are stored in a table, you can be sure that readers never see inconsistent data. So you now have the feature of having transactions on the data in your underlying tables.

Apart from this, you also have the ability to handle all the metadata for your data itself. In addition, a table can be used both for your batch jobs and for your streaming jobs as well. You also have schema enforcement to ensure that no bad records are inserted into your table. And you have the concept of time travel: when it comes to your data, you have data versioning that helps you perform rollbacks. We'll see an example of this in a later chapter. Finally, you can also perform upserts and deletes on your data. So here I just want to give a quick introduction to Delta Lake. In the subsequent chapters we'll see some examples of implementing Delta Lake.
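As a rough sketch of what an upsert on a Delta table can look like, here is a MERGE statement issued through PySpark; the table names (metrics, updated_metrics) and the columns used in the join condition are assumptions for illustration only.

```python
# Sketch of an upsert (MERGE) into a Delta table.
# Table names and columns are assumptions for illustration only.
spark.sql("""
    MERGE INTO metrics AS target
    USING updated_metrics AS source          -- hypothetical table/view with new records
    ON  target.MetricName = source.MetricName
    AND target.TimeGenerated = source.TimeGenerated
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```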

30. Lab – Creating a Delta Table

Now in this chapter, I'll show you how to create a Delta Lake table. So here I have the code in place. First of all, I want to take information from a JSON-based file that I have in my Azure Data Lake Gen2 storage account. So this is my Data Lake Gen2 storage account; you've seen this earlier. I have my raw directory and I have a JSON-based file. Again, this JSON-based file has metrics that are coming in from our database via diagnostic settings. I will ensure to keep this JSON file as a resource for this chapter, so you can upload it onto the raw directory. And if you've been following along, we already have the Databricks secret scope in place to access our Data Lake Gen2 storage account. Next, we want to create a table, so now we are trying to create a table in Azure Databricks. Here we are using the saveAsTable option to give the name of the table.
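As a rough sketch, the code could look something like this in PySpark; the storage account name, container, secret scope, key name, file name and table name are all assumptions and will differ in your environment.

```python
# Sketch only - the storage account, container, secret scope, key and table names
# below are assumptions for illustration.
storage_account = "datalakegen2xx"        # hypothetical storage account name
container = "data"                        # hypothetical container name

# Authenticate to the Data Lake Gen2 account using the Databricks secret scope
# created earlier.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="datalake-scope", key="storage-key")
)

# Read the JSON file with the database metrics from the raw directory.
df = spark.read.json(
    f"abfss://{container}@{storage_account}.dfs.core.windows.net/raw/metrics.json"
)

# Save the DataFrame as a Delta table named 'metrics'.
df.write.format("delta").saveAsTable("metrics")
```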

Now here, in terms of the format, we are saying go ahead and make this a Delta table. So let me take this and execute it in a cell in a notebook that is attached to my cluster. Once this is complete, we have a table in place. Let me now execute another cell, so we can issue SQL commands against this table. Remember, this is the same diagnostic-based information we had seen earlier on; these are all the metrics that were stored in that JSON-based file.
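For example, a quick query against the table could look like this (the table name metrics is the one assumed in the earlier sketch):

```python
# Query the Delta table with Spark SQL; a %sql cell would work just as well.
display(spark.sql("SELECT * FROM metrics LIMIT 10"))
```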

Now, if you want to partition your table, this is something that you can also do. Let's say that you have queries that use a WHERE clause to filter the data, and you are filtering based on the metric name. If you want faster performance for those queries, you can partition your data on the metric name. So if your query is looking at information where the metric name is equal to CPU_percent, then, because the data has been partitioned, Azure Databricks only has to go into the partition where the metric name is equal to CPU_percent. In this way you can make your queries much more efficient. Creating a table with partitions is very easy: use the partitionBy clause, and there you specify the column on which you want to create the partition.
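A sketch of creating the partitioned table, assuming the DataFrame from the earlier sketch and a metric name column called MetricName:

```python
# Create a second Delta table, partitioned on the metric name column.
# The column name and table name are assumptions for illustration.
(df.write
   .format("delta")
   .partitionBy("MetricName")
   .saveAsTable("partitionmetrics"))
```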

Here I am giving another table name of partitionmetrics. So let me take this, go on to a new cell and run it. While this is running, you can prepare further commands. Here I am selecting the metric name and a count from partitionmetrics and grouping by the metric name, as shown in the sketch below. Once the table is created, let me run this, making sure it is a SQL-based command. So we have all of this information in place. In this chapter, so far, we have only been looking at how to create a Delta table. In the next couple of chapters, we'll see some of the advantages of using a Delta table. Now, before I wrap this chapter up, I just want to give you a note. If I go on to the Compute section, we have our clusters in place. If I create a cluster here, you can see that, based on our runtime, the Databricks Runtime (8.x and above) automatically uses Delta Lake as the default table format. So when creating a table, you don't necessarily have to tell it to be a Delta table; this is the default table format that will be used. But from the perspective of the exam, these commands are important and you should know them. That's why I make sure to show you the commands to actually create a Delta table.
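The grouping query mentioned in this step could be written as follows, again assuming the column is called MetricName:

```python
# Count the number of rows per metric name in the partitioned table.
display(spark.sql("""
    SELECT MetricName, COUNT(*) AS MetricCount
    FROM partitionmetrics
    GROUP BY MetricName
"""))
```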

31. Lab – Streaming data into the table
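In this lab, data is streamed into the Delta table. As a minimal sketch of what that can look like with Structured Streaming, reusing the names from the earlier sketch (the source directory, checkpoint location and the use of toTable are assumptions):

```python
# Sketch of streaming JSON files into a Delta table with Structured Streaming.
# The source path, checkpoint location and table name are assumptions.
stream_df = (spark.readStream
    .format("json")
    .schema(df.schema)   # reuse the schema of the batch DataFrame read earlier
    .load(f"abfss://{container}@{storage_account}.dfs.core.windows.net/raw/stream/"))

(stream_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/metrics")   # hypothetical path
    .toTable("metrics"))   # toTable is available on recent Spark/Databricks runtimes
```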


32. Lab – Time Travel

So in the prior chapter, we saw how we could stream data onto a Delta Lake table. Now, in this chapter I just want to give a quick overview of the time travel feature that is available for your Delta Lake tables. Here, let me issue the SQL statement to describe the history of the new metrics table. When any change is made to the data in a Delta Lake table, a new version of that table is created. Here you can see, for each version, what operation was performed.
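The history command could look like this, assuming the table from the earlier sketches is named metrics:

```python
# Show the version history of the Delta table; every change creates a new version.
display(spark.sql("DESCRIBE HISTORY metrics"))
```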

So anytime there is a streaming update on the table, you can see the operation and the operation parameters. And if you want to select data from the table as of a particular version, that is something you can do as well. For example, if I select star from the metrics table as of, let's say, version one, I can see that I have no results, because there was probably no data in the table at that point. Let me go on to version two, and now you can see you have some data in place. So you can actually look at your data at different versions, at different points in time. This is the concept of time travel that is also available for your Delta Lake tables.
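Querying the table at a particular version could be done like this; the table name, version numbers and timestamp are assumptions:

```python
# Query the table as it existed at a specific version (time travel).
display(spark.sql("SELECT * FROM metrics VERSION AS OF 2"))

# Time travel also works with a timestamp instead of a version number.
display(spark.sql("SELECT * FROM metrics TIMESTAMP AS OF '2023-06-26 00:00:00'"))
```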

33. Quick note on deciding between Azure Synapse and Azure Databricks

So in this chapter, I just want to go through some quick points when it comes to comparing the use of the Spark pool in Azure Synapse with the Spark engine that is available in Azure Databricks. With Azure Synapse, you have the advantage of having everything in one place. You can host your data warehouse by creating a dedicated SQL pool. You can also create external tables that point to, let's say, data in an Azure Storage account. You can bring your storage accounts much closer in Azure Synapse by linking them, in this case Azure Data Lake Storage Gen2 accounts. Then you also have the Integrate section, where you can develop pipelines and use those pipelines to copy data from a source to a destination.

So you have everything in one place in Azure Synapse. Whereas in Azure Databricks, we have seen that a lot of functionality is available, and this is based on the underlying Spark engine. Also, when it comes to Azure Databricks, it's not only for data science; it can also be used for machine learning. A lot of the frameworks that are available for machine learning are also part of the Databricks service. So this is one complete solution if you are looking at data engineering, data science and machine learning. And as I mentioned before, because the people who made Spark also made Azure Databricks, whatever changes they make to the Spark engine will always be available in Azure Databricks. So in this chapter, I again wanted to go through a few points on both of these services to help you decide which one best suits your needs.

34. What resources are we taking forward

So, again, a quick note on what I'm taking forward. I still need to discuss some monitoring aspects when it comes to Azure Databricks, and we'll be covering this in the monitoring section. At this point, you can go ahead and delete your cluster if it's no longer required, and then recreate it when we get to the monitoring section. But we will revisit the monitoring part of the Azure Databricks service there.
