DP-203 Data Engineering on Microsoft Azure – Monitor and optimize data storage and data processing


1. Best practices for structuring files in your data lake

Now, in this chapter, I want to give some best practices when it comes to structuring your files, and this is when it comes to building your data lake. Normally, when you design a data lake, you might go ahead and create something known as multiple zones. These zones, for example, can map on to different containers that you have in an Azure Data Lake Gen2 storage account. For example, you might have one container which is defined as the raw zone. Here, the container would take all of the files that are being ingested into the Azure Data Lake Gen2 storage account.

So this contains the files in their original format, whether it be Avro, Parquet, JSON, etc. Then you might have another zone, or another container, where basic filtering has been carried out on the data in the raw zone. So, for example, you could remove columns that are not required, let's say in a Parquet-based file or in a JSON-based file. And then finally, you might have another container which represents your curated zone. This is the data on which you want to perform the analytics, so this is the data that you might, let's say, transfer on to your data warehouse. Now, apart from that, the hierarchy used for the storage of files is also important, because normally, when it comes to ingesting data, you will be ingesting data at a rapid pace.
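As a quick illustration, here is a minimal sketch of how these zones could be created as separate containers with the Python azure-storage-file-datalake package; the connection string is a placeholder and the zone names are illustrative, not prescribed by the course:

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder connection string for the Data Lake Gen2 storage account.
service = DataLakeServiceClient.from_connection_string("<connection-string>")

# One container (file system) per zone of the data lake.
for zone in ["raw", "filtered", "curated"]:
    service.create_file_system(file_system=zone)
```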

You might have data coming in every minute on to your Azure Data Lake Gen2 storage account. That's why it's very important to use a proper hierarchy when it comes to the storage of your files. Here is an example: you might have the department first, and then a raw zone for that department. Then you might have a folder which denotes the data source, because data can come from a variety of sources. Then you have the year, the month, and the day.

You could even have the hour and minute as well, and then you have the file itself, as shown in the sketch below. Next, look at using compressed file formats such as Parquet, because less time is then spent on the data transfer, and when it comes to loading data into Azure Synapse, the data warehouse can take care of the decompression for you. You should also use multiple source files, because with the MPP (massively parallel processing) architecture of the Azure Synapse dedicated SQL pool, you can split your source files into different parts, and each of the multiple compute nodes can then process one file at a time. So in this chapter, I just wanted to go through some of the best practices when it comes to structuring your files.
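To make the hierarchy concrete, here is a minimal sketch of how such a date-partitioned path could be built at ingestion time; the department, data source, and file names are purely hypothetical:

```python
from datetime import datetime, timezone

def raw_zone_path(department: str, data_source: str, filename: str) -> str:
    """Build a department/raw/source/year/month/day/hour/minute path."""
    now = datetime.now(timezone.utc)
    return (
        f"{department}/raw/{data_source}/"
        f"{now:%Y/%m/%d/%H/%M}/{filename}"
    )

# e.g. sales/raw/crm-exports/2023/06/30/09/15/orders.parquet
print(raw_zone_path("sales", "crm-exports", "orders.parquet"))
```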

2. Azure Storage accounts – Query acceleration

Now, in this chapter, I briefly want to go through something known as Azure Data Lake Storage query acceleration. I don't have any sort of lab on this; I just want to let you know about this particular feature. This feature is used when you have applications, such as .NET-based applications, that access the files in your Azure Data Lake Gen2 storage account. So when you're using a .NET program, and let's say you're using SQL to work with the data that is present in your Azure Data Lake Gen2 storage account, then in order to get faster results when it comes to row filtering predicates and column projections, you can make use of this feature by issuing a query acceleration request.

Currently, this only supports CSV- and JSON-based files. Now, when it comes to the exam, what's very important to understand is the purpose of the query acceleration feature and how you make it work. So let me scroll down on to the next steps, wherein we filter data by using this acceleration feature. If I scroll down here, in order to enable this query acceleration feature, you have to ensure that you register a resource provider, which you can do using PowerShell.

So here, the name of the provider is Microsoft.Storage, which you can register with the Register-AzResourceProvider cmdlet. This page has details on how you can use the query acceleration feature; since most of this is done from a development language, I'll let you go into the details of how to use the feature on your own. From an exam perspective, it is just important to understand how to enable this feature, which is why I'm keeping this a quick video. I'll ensure that these links are placed as an external resource on this chapter so that you can view these documentation pages.
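Although the course demonstrates this from .NET, here is a minimal sketch of the same idea in Python, using the azure-storage-blob package's query_blob method; the connection string, container, blob, and column names are placeholders:

```python
from azure.storage.blob import BlobClient, DelimitedTextDialect

blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="raw", blob_name="sales.csv"
)

# Tell the service how to parse the CSV and how to return the results.
input_format = DelimitedTextDialect(delimiter=",", quotechar='"', has_header=True)
output_format = DelimitedTextDialect(delimiter=",")

# Row filtering and column projection happen server-side, so only the
# matching data crosses the wire.
reader = blob.query_blob(
    "SELECT Price FROM BlobStorage WHERE Price > 100",
    blob_format=input_format,
    output_format=output_format,
)
print(reader.readall().decode("utf-8"))
```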

3. View on Azure Monitor

Hi and welcome back. Now, in this chapter, I just want to give an overview when it comes to the monitoring service that is available in Azure. So Azure has its own built-in monitoring service. For example, if you look at the resources we have created so far, let's say the database: in the overview itself, if you scroll down, you can see aspects such as the compute utilization, for example the maximum DWU percentage being used. All of these are actually coming in from a separate service known as Azure Monitor. Now, here I can search for the Monitor service and go on to it. If you go on to the Metrics section, you can plot the metrics for a particular resource. So, for example, let's say I want to plot the metrics for my dedicated SQL pool. I can choose new pool over here, hit on Apply, and let me just hide this. Then I can select the metric, that is, on what basis I want to see my metrics.

So here, let's say I want to look at the data warehousing units that have been used over a period of time. I can see that at one point I used a maximum of 19 data warehousing units. This also gives you a good idea of whether you should increase, or maybe even decrease, the number of data warehousing units that you are using for your dedicated SQL pool. In my example, I'm using the lowest tier when it comes to the dedicated SQL pool, so there's no way for me to decrease; the only way is to increase. But in your organization, you might have dedicated SQL pools that have been assigned a higher number of data warehousing units.
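As an aside, the same metric can also be pulled programmatically. Here is a minimal sketch using the azure-monitor-query package; the resource ID is a placeholder, and the metric name "DWUUsed" is an assumption you should verify against the Metrics blade for your own pool:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

client = MetricsQueryClient(DefaultAzureCredential())

# Placeholder resource ID of the dedicated SQL pool.
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Synapse/workspaces/<workspace>/sqlPools/newpool"
)

response = client.query_resource(
    resource_id,
    metric_names=["DWUUsed"],  # assumed metric name; check the Metrics blade
    timespan=timedelta(days=1),
    granularity=timedelta(minutes=5),
    aggregations=[MetricAggregationType.MAXIMUM],
)

# Print the maximum DWU usage per five-minute interval.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.maximum is not None:
                print(point.timestamp, point.maximum)
```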

If you want to reduce the cost, what you can do is look at the utilization of your pool over a period of time, and based on that, decide whether you want to decrease the number of data warehousing units that have been assigned to your dedicated SQL pool. You can also create alerts. Say that you want your IT administration team to be alerted when a threshold is reached for, let's say, the data warehousing units of your dedicated SQL pool. You can create a new alert rule from here, or you could also go on to Alerts and create the alert from there as well. If you create a new alert rule here, the scope has already been defined, and the condition is whenever the maximum data warehousing units is greater than some value; we have to define this condition.

So I can click on this, and if I scroll down, I can say that whenever the maximum DWU percentage goes beyond 60% over, let's say, a period of five minutes, this will be my condition. I can then hit on Done. Then I can scroll down and create something known as an action group. In an action group, you can define what action to take. Here you can give a name for the action group. Then, if you go next on to Notifications, you can choose a notification type. So if you want to email someone, you can specify what the email address should be, hit on OK, and just give a name for the notification.

Then you can go ahead and select an action. You could use an action to trigger something such as an Azure Automation runbook; there are automation-based tools available in Azure that you can make use of here. Then you can go on to Tags, go on to Review + create, and create the action group. Once created, this action group can be reused across multiple alerts. So now you have your action group in place, your condition in place, and your scope in place. When it comes to the cost, please note that there is a small cost when it comes to defining this alert rule.
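For completeness, here is a minimal sketch of creating a similar action group with the azure-mgmt-monitor SDK; the subscription ID, resource group, names, and email address are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import ActionGroupResource, EmailReceiver

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

action_group = ActionGroupResource(
    location="Global",              # action groups are a global resource
    group_short_name="dwu-alerts",  # shown in notifications (12 chars max)
    enabled=True,
    email_receivers=[
        EmailReceiver(name="it-admins", email_address="admin@example.com"),
    ],
)

client.action_groups.create_or_update(
    "<resource-group>", "dwu-action-group", action_group
)
```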

And here you can scroll down, give a name for the alert, and then create the alert rule. Here I'll say not to automatically resolve the alerts, and I'll create the alert rule. Now, whenever the data warehousing units go beyond that particular threshold, an alert will be generated and the email notification will be sent.
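The same alert rule could also be defined programmatically. Here is a minimal sketch under the same assumptions as before: placeholder resource IDs, and an assumed metric name of "DWUUsedPercent" that you should verify for your pool:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertAction,
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Condition: maximum DWU percentage above 60 over a five-minute window.
criteria = MetricAlertSingleResourceMultipleMetricCriteria(all_of=[
    MetricCriteria(
        name="HighDwuUsage",
        metric_name="DWUUsedPercent",  # assumed metric name; verify it
        operator="GreaterThan",
        threshold=60,
        time_aggregation="Maximum",
    )
])

alert = MetricAlertResource(
    location="global",
    description="DWU usage above 60%",
    severity=3,  # 3 = informational
    enabled=True,
    scopes=["<dedicated-sql-pool-resource-id>"],
    evaluation_frequency="PT5M",
    window_size="PT5M",
    criteria=criteria,
    actions=[MetricAlertAction(action_group_id="<action-group-resource-id>")],
    auto_mitigate=False,  # do not automatically resolve the alerts
)

client.metric_alerts.create_or_update("<resource-group>", "dwu-alert", alert)
```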

If you go on to the Activity log, and if I just hide this, it will give you all of the administrative activities, all of your control-plane activities, that occur as part of your Azure account. So, for example, if you create a storage account, that will come over here, and if you delete a storage account, that activity will come over here as well. All of these activities are recorded in the Activity log section in Azure Monitor. And apart from that, there are also a lot of other features available in the Azure Monitor service.
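If you ever need these control-plane events outside the portal, here is a minimal sketch that lists recent activity log entries with the azure-mgmt-monitor SDK; the subscription ID is a placeholder:

```python
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

client = MonitorManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Control-plane events from the last 24 hours.
start = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()

for event in client.activity_logs.list(filter=f"eventTimestamp ge '{start}'"):
    print(event.event_timestamp, event.operation_name.value, event.caller)
```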
