Amazon AWS Certified Machine Learning Specialty – Exploratory Data Analysis Part 3


20. Lab: Preparing Data for TF-IDF with Spark and EMR, Part 2

In this exercise, we're going to illustrate wrangling your data: manipulating it and preparing it for use in training. That's a lot of what this section is about. And while we're at it, we'll illustrate the use of Elastic MapReduce and running Apache Spark on an EMR cluster using Zeppelin. So let's dive in and get started. The first thing you need to do is sign in to your AWS Management Console and search for Elastic MapReduce, or EMR for short. Then let's go ahead and create a cluster. I'm using the word cluster loosely here, because we're just going to spin up one machine.

We're not doing anything terribly complicated here. First we need to give it a name. What we're going to be doing on this cluster is something called TF-IDF, the term frequency-inverse document frequency algorithm, and we're going to build a little search engine for Wikipedia using it. So let's just name the cluster Wikipedia; you can call it whatever you like. Leaving logging enabled is fine. We're going to use the Cluster launch mode, meaning that we want to log into the cluster and use it interactively.

The other option for EMR is Step execution, which lets you run a predefined series of steps and then shut the cluster down automatically when they finish. For applications, we'll select Spark, because we're going to be running Apache Spark on it. Spark is a very popular choice for wrangling, cleaning, and analyzing data, and it's built for handling big data sets at scale. With Apache Spark, you can run your data processing jobs across an entire cluster in parallel and prepare your data before you actually feed it into your machine learning algorithms.

We'll keep things cheap. Let's just use an m4.large; that should be more than enough for what we're doing. And we only need one instance. Again, it's not really a cluster with just one instance, but I know a lot of you want to save money, so to keep this as cheap as possible we'll use that single instance. You could use more than one if you wanted to, but what we're going to be doing isn't so taxing that you'd actually need a full cluster for it. Next up is the EC2 key pair.

You'll have to either use an existing key pair or create a new one. If you need to learn how to create one, you can click on the handy link here and it will walk you through the process; go ahead and pause and work through that if you don't already have an EC2 key pair. Just make sure you save the resulting PEM file someplace safe, because once the key pair has been created, there's no way to download that file again later.

I already have one made, so I'm just going to select it, and the default permissions are okay. You can provide custom permissions to the cluster through IAM if you want to, but we don't need that here; we'll just stick with the default roles, which give us all the permissions we need. Hit Create cluster, and now we just wait for it to spin up. You can see that it's out there provisioning the one master node that we asked for, and through the magic of video editing we'll come back when it's actually up and running. Okay, our cluster is now in running status, and our challenge now is to connect to the master node on that cluster and start using it to play around with Spark.
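
As an aside, if you'd rather script the cluster creation than click through the console, something like the following boto3 sketch would spin up an equivalent single-node Spark cluster. The region, key pair name, and log bucket here are placeholders, not values from the video:

import boto3

# A minimal sketch of creating a one-node EMR cluster with Spark and Zeppelin.
# Region, key pair, and log bucket are assumptions; substitute your own.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="Wikipedia",
    ReleaseLabel="emr-5.36.0",               # any recent EMR release with Spark
    Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
    LogUri="s3://your-log-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m4.large",
        "InstanceCount": 1,                   # one machine only, to keep costs down
        "KeepJobFlowAliveWhenNoSteps": True,  # stay up for interactive use
        "Ec2KeyName": "your-key-pair",
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # the default roles mentioned above
    ServiceRole="EMR_DefaultRole",
)

print("Cluster ID:", response["JobFlowId"])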

Now, this cluster comes with something called Zeppelin that lets us interact with Spark using a notebook format in Python; Python notebooks are a very ubiquitous way of working in the machine learning world. But first we need to figure out how to connect to this instance. We have an EC2 instance out there that's part of our EMR cluster, and there's plenty of security in place because we don't want just anybody logging into it; half the battle here is figuring out how to get into it ourselves. We need to enable a web connection to this cluster somehow so we can actually run Zeppelin. Let's try clicking on this handy "enable web connection" link. It tries to be helpful, but unfortunately there's more you need to do than what it tells you in these instructions.

Some of the instructions aren't really accurate, so I'm going to walk you through this. First of all, you're not going to be able to connect at all unless we open up a port to the actual machine. So let's close this for now, go down to "Security groups for Master", and click on the security group for the master node we want to connect to. Then go to the Inbound tab and hit Edit; I already did this earlier, but you'll have to do it yourself. Click Add Rule, choose SSH, change the source from Custom to My IP, and hit Save. That opens up port 22 for SSH from your own IP address so you can actually connect to the instance.

Once that's done, we can close the security group window and go back to the instructions for enabling a web connection. The first thing we need to do is log in to our master node and open up an SSH tunnel in the process. If you don't already have the PuTTY terminal installed, there's a link here to go get it and install it on Windows; if you're on Mac or Linux, you can follow those instructions instead. I'm on Windows, so I'll walk you through these. First, start up PuTTY. Copy the hostname field from the instructions and paste it into PuTTY's Host Name box; we're connecting via SSH. Now go to Connection, then SSH, open the Auth page underneath it, and navigate to the private key file for the key pair we chose for this EC2 host when we created the cluster. For me, that's my "big data" PPK file.

You'll have your own, and you may need to use a tool that comes with PuTTY to convert the PEM file that EC2 gave you into a PPK file that PuTTY can actually use. If you look under PuTTY in your Start menu, you should find the PuTTYgen application, which converts a PEM file into a PPK file. I've already done that, so we've got that set up. There is one more step, though: we need to go to SSH, then Tunnels, and set up a tunnel we can use to communicate over HTTP. In the source port field, the instructions say to type 8157; select Dynamic and Auto, leave the destination empty, and hit Add. Then click Open, which should hopefully connect us to our instance.

We can say yes, we do trust this host, and we're in. So not only are we logged into our master node at this point, we also have a tunnel set up so that Zeppelin can actually work. But we need one more step: we need a proxy management tool installed in our browser to use the tunnel we just set up. Now, if you follow the instructions that Amazon gives you, it tells you to go to a link to download FoxyProxy, a plug-in for Chrome or Firefox that can manage this for you. However, if you follow that link, you're not going to find the free version; they'll only tell you about their paid plans. There is a free version out there, though, you just need to know how to find it. In Chrome, for example, go to More Tools and then Extensions, and search the extension store from there. If you look up FoxyProxy, what you're looking for is FoxyProxy Basic. Go ahead and install that and make sure it's activated.

I've already done that step. Once the extension is installed, we can go back to the instructions, copy all of the settings XML they provide, and paste it into a text file named foxyproxy-settings.xml. So open WordPad or Notepad or whatever your favorite editor is and paste that text in there.

Save it as foxyproxy-settings.xml; I'll put it in my Documents folder so I know where it is. All right, now we need to configure FoxyProxy to use it. The instructions at this point are accurate: go to the FoxyProxy extension icon, open Options, choose Import/Export, pick the file we just created, and import it. I already did that step, so I'm going to cancel out of it, but go ahead and do it yourself.

Once you've done that, we can use FoxyProxy to enable the proxy connection we need, so we can close out of these instructions. Now I'm going to turn FoxyProxy on and select "Use proxy emr-socks-proxy for all URLs". This might mess up other pages, so make sure to disable FoxyProxy when you're done with this, so you can use the Internet normally again. For now, let's activate it, and once I do, the connection links in the EMR console should become active. Hit the refresh button, and there we have it. Now we can click on Zeppelin and get into it. All right, we're in Zeppelin. Very cool. That's actually the hard part, guys. Let's import our notebook so we can play around with it and do some data cleaning, shall we? Click on Import Note, and we're going to upload it from the course materials: choose Select JSON File, and from the course materials you should have a TF-IDF JSON file. Go ahead and select that.

21. Lab: Preparing Data for TF-IDF with Spark and EMR, Part 3

We'll click on it here in the list, and here we have an actual notebook that can run Apache Spark on our EMR cluster. Very cool. All right, let's walk through it. Keep in mind that you're never going to be asked to write, read, or understand code as part of the machine learning certification exam, so don't get hung up on the Python syntax or the details of the Python language itself here. If you want to learn Python, that's a whole other course. All you need to understand is what you can do here: what the capabilities of Spark are and some of the things you might want to do with it.

Let's start with the notebook format. We first need to specify what language every block is in, so every block here starts with a %spark.pyspark directive. The way notebooks work is that when you're inside one of these blocks, you can click into it and hit the little run icon to run it, or hit Shift+Enter as a shortcut. The first block contains code along the lines of raw_data = spark.read.options(...), which just means we're going to retrieve our original raw, unprocessed data set from S3 at the given S3 URL.
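
For reference, here's a minimal sketch of what that first block might look like inside a %spark.pyspark paragraph. The exact bucket path, file name, and options are my best guess from the narration, not a copy of the notebook:

# Runs inside a Zeppelin %spark.pyspark paragraph, where the `spark` session
# is already provided. Path and options are assumptions from the narration.
raw_data = (
    spark.read
         .option("sep", "\t")        # the Wikipedia export is tab-separated
         .option("header", "false")  # no header row, so columns come back as _c0.._c3
         .csv("s3://sundog-spark/subset-small.tsv")
)

raw_data.show(5)  # peek at the first few rows to see what we're up against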

It's in a Sundog Spark bucket that I've created for you, and it's called subset-small.tsv. This is an actual subset of real Wikipedia data that I've exported to a tab-separated value document, which is why the separator is specified as a tab character there. Then we'll just show the first few rows to see what we're up against. I'll hit Shift+Enter and it will execute that code and retrieve the data from S3 for me. The first block you run in a notebook usually takes a little extra time because it has to load a bunch of libraries and spin up Spark itself, so give it a few moments. All right, there we have it: you can see the raw data. I'll walk you through what you would actually do in the real world if you were trying to do something interesting with this data. The first step is to see what you're up against: take a look at the raw data you're given and decide whether it needs any further manipulation. The first problem we can see is that we don't have any meaningful column names; Spark just calls them _c0, _c1, _c2, and _c3.

That's because it doesn't know what these columns actually represent; that wasn't part of our data set, so it's up to us to figure out what this stuff means. Hopefully you could look it up somewhere, but it's pretty obvious from inspection that we have some sort of article ID, the article title, a date associated with it, and then the actual text of the article itself. Let's start by giving these columns some meaningful names. The way we do that in Spark is to take the original data frame and convert it to a new data frame with explicit column names, and then turn around and show it. In Python with Spark, you organize data into tables called data frames, which are basically tables that can be distributed across an entire cluster and processed in parallel. Let's go ahead and run that.
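
As a sketch, that renaming step could look like this, assuming the column order described above:

# Rename the four unnamed columns using toDF(); order assumed from the narration.
articles = raw_data.toDF("id", "title", "time", "document")

articles.show(5)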

We can see that we now have a new data frame called articles that contains the same data but with meaningful column names, and if we were actually running on a multi-node cluster, Spark could distribute this work across the entire cluster in parallel. All right, that works. Now we have id, title, time, and document columns that we can refer to in our future code. Next we need to clean our data somehow, and this is something you often have to iterate on. People talk about the lifecycle of machine learning: there's usually an iterative cycle where you try something out, try an algorithm, figure out what's wrong with it and what can be improved, go back and tweak things, and try again, iterating until it works the way you want. I've already done some of that for you here; I happen to know that I'll run into a crash later on if any of these documents are null.

Knowing that the TF-IDF algorithm can't handle a null document, let's start by checking for null documents and seeing how many there actually are. We call filter on the articles data frame to select the rows with a null document, and count them up; that's all that code is doing. It turns out there is in fact one article with a null document, and that will mess up my algorithm later on, so we need to get rid of that null. Now, how to deal with missing data is a big topic in itself. There are lots of approaches, and the simplest is to just drop the missing data; in this case, where only one article is affected, that's a completely reasonable thing to do. How would you even impute that data? There's no way to fabricate the article text. I suppose the best you could do would be to reuse the title as the actual document text, which might be better than nothing.
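
That null check, roughly sketched (the exact notebook code may differ slightly):

# Count how many rows have a null document column.
null_count = articles.filter(articles.document.isNull()).count()
print(null_count)  # in the lecture, exactly one article comes back null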

But given that it's only one article out of a huge corpus, we can get away with just dropping it. The idea of filling in substitute values for missing data is called imputation, and there are lots of ways of doing that, as we've talked about, but in this case it's not worth it, so we're just going to drop it. That's what this next block does: we filter the articles data frame for documents that are not null, put the result in a new data frame called cleaned_articles, and then count how many of the resulting articles still have null documents, just to make sure we actually cleaned them all out. Let's go ahead and run that. Again, Spark could distribute this across an entire cluster if you had a truly massive data set. We can see that we now have zero null documents.
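
And a sketch of the drop-and-verify step, with the variable name taken from the narration:

# Keep only rows whose document is not null, then re-check that nothing slipped through.
cleaned_articles = articles.filter(articles.document.isNotNull())

print(cleaned_articles.filter(cleaned_articles.document.isNull()).count())  # expect 0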

So our data has now been cleaned up, at least to the extent we needed in order to do something useful. But by no means are we done preparing our data. The TF-IDF algorithm wants numbers; it doesn't want to deal with words, it wants to deal with tokens. So the next thing we need to do is tokenize our data: split up each document into individual words, where each word is treated separately, and then hash each of those words to a numeric value, because our algorithm wants numbers, not words. It's much more efficient to deal with a number than a string; it takes less memory, it's faster, all sorts of reasons to do that.

We start by applying a tokenizer to split every document into its individual words, basically producing an array of words, and then we map all of those words into hash values, converting each word to a number using a hashing function. That's all this code is doing. Shift+Enter to run it. This is a beefier operation, obviously, but it will come back pretty quickly, and again, it could run on a whole cluster and make short work of a massive data set. When it's done, we call show on the resulting featurized data frame, and you can see that we now have two new columns that weren't there before.
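
A minimal sketch of that tokenize-and-hash step using Spark ML's Tokenizer and HashingTF; the column and variable names are assumptions based on what the notebook displays:

from pyspark.ml.feature import Tokenizer, HashingTF

# Split each document into lowercase words, then hash each word into a
# fixed-size feature space (262,144 buckets by default).
tokenizer = Tokenizer(inputCol="document", outputCol="words")
words_data = tokenizer.transform(cleaned_articles)

hashing_tf = HashingTF(inputCol="words", outputCol="rawFeatures")
featurized_data = hashing_tf.transform(words_data)

featurized_data.show(5)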

Our tokenizer added a new column called words, which is the list of words that make up each document. For example, a document that began "Autism is a brain..." is now an array containing the words "autism", "is", "a", "brain", and so on. You'll notice it also normalized everything to lowercase, so capitalization no longer matters for the purposes of the algorithm. Then the HashingTF transform turned around and converted those words into numbers by hashing them into numeric values; that's what the new rawFeatures column is. It produces what's called a sparse vector, which starts with how many hash buckets it can represent, followed by the entries for the buckets that the words in this document actually fall into. To save space, we don't store a giant array of 262,144 values on every single article, just the entries for the words that actually exist in that document.

This is called a sparse vector. It's not terribly important for the exam; I'm just trying to give you a feel for how this all works. Now we're going to start doing some actual TF-IDF work by using the IDF class from Spark's ML library, so we're using the machine learning capabilities built into Spark itself. Tying this all together: we've prepared our data, and now we're going to do something with it using a real machine learning algorithm. The IDF stage adds yet another column, which we'll call features, containing the TF-IDF weights of all the terms in every document. Just to recap how TF-IDF works: it measures the relevance of a term in a given document by taking the term frequency, which is how often the term appears within that document, and scaling it by the inverse document frequency, which is based on how many documents in the entire corpus contain that term. Together, those two pieces of information tell you how prevalent a term is within a given document and how unique it is across the whole document set, which is a good proxy for how relevant a given search term is to a given document. So we'll go ahead and run that; basically, it's just rescaling things into TF-IDF scores. That finished, and now we can take a look at the results.
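
A sketch of that rescaling step with Spark ML's IDF estimator, continuing with the hypothetical variable names from the sketches above:

from pyspark.ml.feature import IDF

# Fit IDF on the hashed term counts and rescale them into TF-IDF weights.
idf = IDF(inputCol="rawFeatures", outputCol="features")
idf_model = idf.fit(featurized_data)   # computes document frequencies per hash bucket
rescaled_data = idf_model.transform(featurized_data)

rescaled_data.select("title", "features").show(5)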

Our current data frame is now called rescaled_data, and we can see the new features column; those are the values we actually want to work with. So we've produced the data we need to build a search engine, if you will, on this Wikipedia data. Let's try it out as an experiment. We know the term "Gettysburg" appears here, because we have an article about Abraham Lincoln. If you're not familiar with United States history, Abraham Lincoln was a famous president, and he gave a famous speech called the Gettysburg Address. So let's search for "Gettysburg". Again, our first problem is that we're dealing with numbers here, not words, so the first thing we need to do is figure out what hash value corresponds to the term "Gettysburg". This is kind of awkward to do right now; in Spark 3 there will be a much easier way of getting that, but for now, in Spark 2, this is how you have to go about it.

Basically, we create a little one-row data frame that contains just the term "Gettysburg", call the hashing transform on it, collect the result back, and extract the hash value from the resulting sparse vector. That's all this code is doing. Again, I don't want to get into the details of how the Python code works, because that isn't important for the exam and I don't want to clutter your brain with it; there's some fancy Python going on here, but you don't need to understand it. The upshot is that the term "Gettysburg" maps to hash identifier 205433. Given that, we can build a new column in our data frame containing the relevancy score for the term "Gettysburg" for every single document, and we'll just call it score. All we're doing here, and again the Python itself isn't very important, is going through every row of our data frame and calling a custom function that extracts the TF-IDF score for the "Gettysburg" hash if it's present in that document's vector, and returns zero if it isn't.
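
For the curious, here's a rough Spark 2 style sketch of that lookup and scoring; the one-row data frame trick and the UDF are my own illustration, not the exact notebook code:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Find the hash bucket for "gettysburg" by pushing a one-word document
# through the same HashingTF transform used earlier.
term_df = spark.createDataFrame([(["gettysburg"],)], ["words"])
term_vec = hashing_tf.transform(term_df).collect()[0]["rawFeatures"]
gettysburg_id = int(term_vec.indices[0])  # the single populated index in the sparse vector

# UDF that pulls the TF-IDF weight for that bucket out of each document's
# sparse "features" vector; SparseVector indexing returns 0.0 when the term is absent.
def term_score(features):
    return float(features[gettysburg_id])

score_udf = udf(term_score, FloatType())
scored = rescaled_data.withColumn("score", score_udf("features"))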

Let's go ahead and run that and see what happens. There we have it, and the results seem to make sense: we have a relevancy score of zero for pretty much everything, but for the Abraham Lincoln article we get a relevancy score of 33.12 and some change. That tells us the term "Gettysburg" has some relevance to the Abraham Lincoln article, which is what we'd expect to see. This is promising. Now we can just sort the entire data frame by that Gettysburg relevance score and we should have our search results, right? Let's run that, using a little trick: show(truncate=100) widens the displayed columns so we can read more of each one and see what they're about. We can see that we got three hits out of the subset we processed. Abraham Lincoln is in fact the most relevant result, with a score of 33, because he was the guy who delivered the Gettysburg Address.
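
And a sketch of that final sort, continuing from the hypothetical scored data frame above:

from pyspark.sql.functions import desc

# Rank every article by its "Gettysburg" relevance score; truncate=100 widens
# the displayed columns so the titles are readable.
scored.orderBy(desc("score")).select("title", "score").show(10, truncate=100)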

Coming back to the results: Abner Doubleday, who was an army officer from the same era, also shows up; I assume he had some connection to Gettysburg. And the American Civil War article came up as well, because Gettysburg is associated with Abraham Lincoln and the Civil War. So these results make good sense, and there you have it. I've walked you through spinning up an EMR cluster, running Spark on it in a Zeppelin notebook, and cleaning and preparing our data for use with a more complicated machine learning algorithm. In this case, we used the TF-IDF implementation that's part of Spark's machine learning library to compute the relevance of search terms across a corpus of documents, and we've basically made a little do-it-yourself search engine for Wikipedia as a result. Now, we're not done yet. Remember that we're being charged by the hour for this cluster, so we need to remember to shut it down, or you're going to have a nasty surprise on your AWS bill at the end of the month.

We're done with this, so we can close out of the notebook and go back to our cluster. Hit the Terminate button and make sure the cluster actually terminates so we don't incur any further charges on it. Yes, I'm sure: terminate it. Now things are cleaned up and we should no longer be billed for this instance. Also, so you can use the Internet normally again, don't forget to turn off FoxyProxy: go up to the FoxyProxy extension icon and select Disable FoxyProxy so your browser goes back to normal as well. All right, there it is, guys. Congratulations: you've used EMR to clean and prepare some data, and run some machine learning on it as well, using Apache Spark.
