DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 3
7. Lab – Reading a CSV file
Now, in this chapter, we’ll see how to process our Log CSV file. From the exam perspective, you should be able to understand how to work with CSV files, how to work with your JSON-based files, and how to work with Parquet-based files. That’s why I’ve covered all of these different types of files in each particular section, and we have to do the same over here as well. So, firstly, I’ll close all the cells that I have open. Now, here I can go on to the menu option that is available and click on Upload Data, so that I can upload my Log CSV file onto Azure Databricks.
So, Azure Databricks actually has an underlying Databricks File System in place. If you want to work with files locally, you can do so: you can upload your files and work with them there. Yes, Azure Databricks can also connect onto your Azure Data Lake Gen2 storage accounts and onto your normal Azure storage accounts, and you can also create mount points onto those storage accounts. But we’ll see that a little bit later on.
So here, I’ll just click on this so that I can browse for my file – my Log CSV file – and go on to Next. And here it’s giving the way that you can now access this file. So, whether you’re working in PySpark, in R, or in Scala, you can just copy this particular statement. Let me copy this, click on Done, remove everything in the cell, and place it here. So, we have our Databricks File System, and here we have our Log CSV file.
There are some folders in between. We’ve already seen this statement before, wherein we can load a CSV file. The format we are mentioning is CSV, and we are using the Spark context to read our file. Here we can also show the contents of the DataFrame, so let me run this. We can also do a display of our DataFrame. Here we can see again that our column names are coming as a row in our DataFrame, so we can change this. Let me copy these two statements to create a new DataFrame – it’s the same path – but this time I’m mentioning that the header is true, which means the first row has our column names. Let me run this, and we can now see our DataFrame being properly displayed. So in this chapter, I wanted to explain the concept of how you can read your CSV files, and at the same time, we’ve also been introduced to the Databricks File System.
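For reference, here is a minimal sketch of the two reads described above. The exact path under /FileStore is an assumption based on what the Upload Data dialog typically generates; substitute the path shown in your own workspace.

```python
# Assumed DBFS path - use the path that the Upload Data dialog shows for your file.
path = "dbfs:/FileStore/shared_uploads/user@example.com/Log.csv"

# Read the uploaded CSV without a header option - the first row
# (the column names) ends up as an ordinary data row.
df = spark.read.format("csv").load(path)
df.show()
display(df)

# Read it again, this time telling Spark that the first row is the header,
# so the column names are picked up correctly.
df2 = (spark.read.format("csv")
       .option("header", "true")
       .load(path))
display(df2)
```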
8. Databricks File System
So I just want to quickly cover some aspects when it comes to the Databricks File System. Your workspace gets a Databricks File System, which is an abstraction layer on top of scalable object storage. Under the covers, you are getting object storage which is scalable in nature, and if you want to interact with that object storage, you have the Databricks File System. Here you can store your objects using directories and the normal file semantics. These files also persist if the cluster is terminated, so if you terminate your cluster and recreate it, you still have access to those files. The default storage location is called the DBFS root.
Now, there are some predefined root locations. We have /FileStore, which is used for imported data files, generated plots, and uploaded libraries; we have /databricks-datasets, which is used for some sample public datasets; and we have /user/hive/warehouse, which holds the data and the metadata for non-external Hive tables. So here, if I go on to one of my files, there are some magic commands in place to actually look at the Databricks File System, right from the cell itself. So %fs is the magic command, and ls is to list all of the contents. I’ll just run the cell, and here you can see the path on the Databricks File System and the name. And if you want to create a new directory, we can create one and then again list the contents, and here we can see our new directory in place. So I just wanted to give you some more ideas when it comes to the Databricks File System.
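A minimal sketch of the DBFS operations referred to above, using the dbutils equivalent of the %fs magic command; the directory name new-dir is just an illustrative placeholder.

```python
# List the contents of the DBFS root (the %fs ls magic does the same thing).
display(dbutils.fs.ls("/"))

# Create a new directory on DBFS - "new-dir" is a placeholder name.
dbutils.fs.mkdirs("/new-dir")

# List again to confirm the new directory is in place.
display(dbutils.fs.ls("/"))
```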
9. Lab – The SQL Data Frame
Now, in continuation of working with DataFrames, let’s again see some commands when it comes to the SQL-like API that is available on top of your DataFrames and on top of your RDDs. So again, I’m reading my file here. If you only want to select some columns – you only want to see some columns in that particular DataFrame – so here, let me run this, and here we can see our output in place. Remember that earlier on, in our previous chapter, we had created a DataFrame df2, and we are reusing that same DataFrame. So here I am only selecting some of the columns. Now, we can also create a DataFrame which will actually infer the schema. So, let me do one thing: let me print the schema of the DataFrame.
So here, in terms of the schema, we can see that the ID is a string and the time is also a string. But we want Spark to actually infer the schema based on the underlying data, so let me copy this and run the cell. Now we can see that the ID is coming up as an integer and the time is coming up as a timestamp. If you only want to show the rows based on a particular filter – this is like having a WHERE condition in place – we can also use the display command as I’ve shown here, and we can see only the rows where the status is equal to Succeeded. And then finally, if you want to use a GROUP BY statement, that’s something that you can do as well. So here it’s grouped by the status. So again, there are different commands that are available to work with your DataFrames.
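As a hedged sketch of the operations walked through above – select, schema inference, filter, and groupBy – assuming the column names Id, Operationname, and Status and the DBFS path from the earlier lab:

```python
# Assumed path and column names - adjust to match your uploaded log file.
path = "dbfs:/FileStore/shared_uploads/user@example.com/Log.csv"

# Select only a few columns from the existing DataFrame df2.
display(df2.select("Id", "Operationname", "Status"))

# Re-read the file and let Spark infer the schema from the data,
# so Id comes back as an integer and Time as a timestamp.
df3 = (spark.read.format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load(path))
df3.printSchema()

# Filter - similar to a WHERE clause - to keep only succeeded operations.
display(df3.filter(df3.Status == "Succeeded"))

# Group by the status column and count the rows in each group.
display(df3.groupBy("Status").count())
```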
10. Visualizations
In this chapter, I just want to have a quick note when it comes to the visualizations that are available by default in the notebooks. So here I am displaying my DataFrame, and the entire DataFrame is coming in a tabular format. If I scroll down, I have the different visualizations available here. If I click on the bar chart, by default it’s stacking it up against the different IDs, and here I have the resource group and the resource type. You can expand the plot by dragging it here. If you go on to the plot options, by default it is plotting against the ID, and the keys are the resource group and the resource type. Let’s say you want to stack it against the operation name: you can drag the operation name onto the keys, and here you can see all of the operation names. It will go ahead and display it again, so here we have the count based on the different operation names. So this is the default visualization that you actually get in the notebooks in Azure Databricks.
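The built-in chart options appear on any cell that renders a DataFrame with display; a minimal sketch, assuming the df3 DataFrame from the previous lab:

```python
# Render the DataFrame in the notebook; the bar chart and plot options
# become available in the cell's visualization picker below the table.
display(df3)
```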
11. Lab – Few functions on dates
In this chapter, I just want to go through a few functions when it comes to working with dates. So if I go back onto our DataFrame and display it back in the tabular format, here I should be able to see the timestamps. So here we do have a column based on the time. Let me take this first set of statements. So what am I doing? Here I am selecting the time column, and I’m using the year function to display the year part of the time. The same goes for the month and for the day of year. These are the default functions that are available. To ensure that I can use these date-based functions, I’m using the import statement here.
And then I am selecting all of those different columns. So let’s run this, and here I can see the year, the month, and the day of year. If you want to give more meaningful names, you can actually use an alias – we’ve seen this earlier on to give meaningful names to the columns in the DataFrame. Let me run this; it’s now giving the different column names. And finally, if you want to convert the date to a particular format, you can use the to_date function. Let me run this, and here you can see all of the different dates in place. So in this chapter, I just wanted to go through some important functions when it comes to working with dates.
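A minimal sketch of the date functions described above, assuming the inferred-schema DataFrame df3 with a Time timestamp column from the earlier lab:

```python
from pyspark.sql.functions import year, month, dayofyear, to_date

# Pull out the year, month, and day-of-year parts of the Time column.
display(df3.select("Time", year("Time"), month("Time"), dayofyear("Time")))

# Use aliases to give the derived columns more meaningful names.
display(df3.select(
    "Time",
    year("Time").alias("Year"),
    month("Time").alias("Month"),
    dayofyear("Time").alias("DayOfYear")
))

# Convert the timestamp onto a plain date using the to_date function.
display(df3.select("Time", to_date("Time").alias("Date")))
```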