DP-203 Data Engineering on Microsoft Azure – Monitor and optimize data storage and data processing Part 9


24. Azure Stream Analytics – Diagnostic settings

And welcome back. Now in this chapter I just want to go through the diagnostic settings that are available for your Azure Stream Analytics job. Earlier on we had seen that we could direct the logs of our pipeline runs in Azure Data Factory onto a Log Analytics workspace, where all of the logs would be collected in the Logs section and appear as different tables under Log Management. Now, the same thing you can also do with your Azure Stream Analytics job. So here, if I go on to my Stream Analytics job, scroll down, and go on to Diagnostic settings, I have already enabled a diagnostic setting. Here I am directing the logs onto a different Log Analytics workspace.

Now please note that you can direct the logs onto the same Log Analytics workspace, but here I have just gone ahead and directed them onto a different one. If I click on Edit setting, you can see I have chosen the Execution log category, so everything relating to executing the Stream Analytics job is sent on to the Log Analytics workspace. So now if I go on to that Log Analytics workspace, which is the DB workspace, I'll go on to the Logs section, hide this, and expand this. When it comes to the diagnostic setting, the log data is written to a table called AzureDiagnostics.

Now, AzureDiagnostics is a table that can store the log information of various other resources as well, not only Azure Stream Analytics. So here, if I go on to AzureDiagnostics, let me just hide this and run the statement. Here we can see all of the information. Now if I expand one record, we can see the resource provider, which is Microsoft.StreamAnalytics. Remember, as I mentioned before, other resources based on other services can also send their logs onto the Log Analytics workspace.

So for example, your Azure SQL database and your Azure web apps can all send their logs onto the Log Analytics workspace. So if you only want to look at the Stream Analytics logs, then you have to ensure that your query against the AzureDiagnostics table filters on the resource provider.
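A minimal sketch of that filter is shown below (the resource provider value is written the way it typically appears in AzureDiagnostics, in upper case; the exact casing in your workspace may differ):

// Only keep Stream Analytics entries, ignoring logs from other services
// that share the AzureDiagnostics table.
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.STREAMANALYTICS"
| order by TimeGenerated desc
| take 100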

Now next, when it comes to Stream Analytics, there is also one more property here, properties_s. This gives the properties of that particular event that occurred in Azure Stream Analytics. Here I have an event wherein it could not deserialize the input event data, and we have properties such as the data error type, the error category and the error code. Now this property is in JSON, and the properties will not be the same for every type of event in Azure Stream Analytics. So for example, if I just close this, go on to another event and scroll down: this one was the operation of starting the streaming job, and if I go on to the properties here, I can see I have some different properties in place.

So depending upon the type of event that is occurring in the Azure Stream Analytics job, that information will be in properties_s. Here I just want to show an example of a query. I am saying: please only return the rows where the resource provider is Microsoft.StreamAnalytics. And next, if I only want to get those rows resulting from an input deserialization error, then I can check the error code, which is part of properties_s. So this is not a direct column in AzureDiagnostics.

It is part of an event, part of a row in this particular table in the Log Analytics workspace. So here I'm using the parse_json function to get all of the elements in this JSON object, together with the project operator. If you look at all of the records, you can see columns such as TimeGenerated, ResourceId, Category, ResourceGroup, et cetera. If I only want the columns of the time generated and the message, I can use the project operator here.
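A query along these lines would do it. Note that the specific field names and the error value inside properties_s (DataErrorType, Message, InputDeserializerError.InvalidData) are assumptions based on the deserialization example described above and may differ in your logs:

AzureDiagnostics
| where ResourceProvider == "MICROSOFT.STREAMANALYTICS"
| extend props = parse_json(properties_s)
| where tostring(props.DataErrorType) == "InputDeserializerError.InvalidData"
| project TimeGenerated, Message = tostring(props.Message)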

So if I run this now, I will see only those rows which have resulted from the input deserialization error. And here you can see I only have two columns: one is TimeGenerated and the other is the message property. So in this chapter, I just wanted to go through the diagnostic setting which is available for your Azure Stream Analytics job.

25. Azure Databricks – Monitoring

Now, in this chapter, I just want to go through the monitoring aspects that are available in Azure Databricks. So firstly, if you go on to the Compute section, go on to a running cluster, and go on to the event log, you will see all the events when it comes to the cluster itself: for example, when the cluster was terminated, when it started running, and the health of the driver node and even your executor nodes. If you go on to the driver logs, you will see all of the logs. So in terms of jobs, if you want to see whether a job has started, you can look at the Spark driver logs here, because all of our jobs are going to run on the driver.

Remember that in our cluster we only have one node in place, and that node is also working as the driver node. If you go on to the Spark UI (this feature is also available as part of your Spark pools in Azure Synapse), you will see all of the information about your jobs. So you have jobs, then you have stages within the jobs, and then you have your storage, so there are various aspects that you can see. Here I can see that I have my Delta tables, and if I scroll to the right, I can see the size in memory. If I go on to Structured Streaming, I can see all of the streaming jobs that have taken place on my cluster.

Now, apart from that, let's also look at how you can see the different stages when you execute your jobs in Spark. So if I go on to my workspace, let me go on to any notebook that I have and create a new cell. Here let me execute a very simple set of commands: I'm just getting information from one of my JSON-based files and displaying the data frame. So let me run the cell. The first thing that you can see is that it has run two Spark jobs in order to run this particular set of commands.
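A minimal sketch of that kind of cell is shown below. The file path is hypothetical; spark and display are provided by the Databricks notebook environment:

# Read a JSON-based file into a DataFrame and display it.
df = spark.read.json("/mnt/datalake/raw/metrics.json")
display(df)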

So now if I expand these two Spark jobs, you can see the different stages of each job. Let me view job number 60. Here you have something known as the DAG visualization; DAG stands for directed acyclic graph. This gives you the different stages, or the different events, that have taken place when it came to executing this job. If you click on this, you will see details of the DAG visualization. Here you can see it is converting it onto a SQL-based data frame, and you can see that it's mapping the partitions as it should. If I click on View for job number 61, here we have a very simple stage. If I click on this, it's a very simple stage wherein it is scanning the JSON-based file.

If I take another command here, let's say I'm writing onto a partitioned metrics table, so I'm creating a Delta table. Let me run the cell. You can see there are a lot of Spark jobs that are running here. If you again go on to each of the jobs, you can see there is a whole lot that's actually going on. You can also see an exchange happening here. So if I go on to View for job number 63, you can see that data is being transferred from one stage onto another. There's a lot that actually goes on in the background when it comes to how the jobs are getting executed. Here you can see there are some shuffle reads and writes, because now we are making use of partitions.
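As a sketch, writing such a table could look like the following; the table and column names here are hypothetical:

# Write the DataFrame as a Delta table, partitioned on the metric name column.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("metric_name")
   .saveAsTable("partitioned_metrics"))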

So if the table has been partitioned and data needs to be made available for a query, data shuffling can happen across the partitions to satisfy that query. Remember, all our partitions are based on the metric name here. So if you have, let's say, a SELECT * FROM the table WHERE the metric name is equal to just one metric name, it will only go on to that one partition that holds all of that metric name's information. But if you have a query that scans across multiple partitions, then you will definitely get shuffle reads, because the Spark engine needs to read data across multiple partitions.
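For example, a query like the one below (reusing the hypothetical table, column and metric name from the sketch above) only needs to touch a single partition:

# Filtering on the partition column lets Spark prune down to the matching partition.
single_metric = spark.sql(
    "SELECT * FROM partitioned_metrics WHERE metric_name = 'Memory percent'")
display(single_metric)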

So again, without going into too much detail about the DAG visualization, this is just to give you an idea of what happens in the background. If you actually want to see all of the partitions in your table, you can run a cell that lists them, as sketched after this paragraph. Here you can see all of the partitions; there are eleven partitions in this particular table. So all of the data where the metric name is equal to "Memory percent" will be in partition number one, and so on and so forth. So in this chapter, I just wanted to go through some of the important points when it comes to the monitoring aspects that are available for Spark in Azure Databricks.
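A minimal cell for listing those partitions, again using the hypothetical table name from the earlier sketch:

# SHOW PARTITIONS lists every partition of the partitioned Delta table.
display(spark.sql("SHOW PARTITIONS partitioned_metrics"))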
