Amazon AWS Certified Machine Learning Specialty – Exploratory Data Analysis Part 2


18. Amazon SageMaker Ground Truth and Label Generation

Let’s finally talk about SageMaker’s Ground Truth service. This is a relatively new product from Amazon. And what is it? Well, it’s basically a way of using humans to label your data. I struggled with where to put this in the course; in the end I put it in feature engineering, because fundamentally it comes down to dealing with missing data. And in this case, it’s really more about missing labels than missing features, more often than not.

But you could actually apply this to missing features as well. Basically, the idea is that if you have missing data that can be inferred easily by a human, Ground Truth lets you farm those tasks out to actual human beings to fill in that missing data for you. The most common example of this is in the world of image classification. So, for example, if I’m training a new image classification model, someone’s going to have to go through and actually tag all of the training images with what’s actually in them, right? That’s not necessarily going to be an easy thing for a computer to do.

So, for example, in this picture here, we have someone classifying the picture as being of a basketball game or a soccer game. Someone’s got to go through and do all that work. What kind of a bird is this? Where are the birds in this image? Things like that. So Ground Truth will manage a fleet of human beings who will go through and label your data for training purposes. If you have a huge data set, like a bunch of pictures, and you need labels on them, sometimes human beings are the best way to actually get those labels, and Ground Truth is what manages that process for you. But it’s more than just managing the humans that label your data.

The thing that sets Ground Truth apart is that it creates its own model as those labels come in. That model learns as it receives more and more labeled data from the humans, so it gets better and better over time, and eventually only the ambiguous cases, the ones the model isn’t sure about, get farmed out to human labelers. This ends up reducing the cost of your labeling jobs by up to 70%, which is pretty substantial. So who are these people? Well, you have different choices of who Ground Truth can use. One is the Amazon Mechanical Turk workforce, a huge workforce of people around the world who will label your data and do a bunch of other simple tasks for you for a very small amount of money.

You can also choose to farm it out to your own internal team; if you’re dealing with very sensitive data, that might make sense for you. And there are professional labeling companies out there as well who do nothing but keep a fleet of human beings who label training data for a living. So if you want someone who’s a little more specialized in this sort of thing, you can spend even more money and use one of them. Now, there are other techniques than just using humans to generate training labels that you don’t have. One would be to use the AWS Rekognition service, which we’ll talk about more later on in the course. Basically, it’s an AWS service for image recognition. So if you do need to classify images, that could be for creating either labels or maybe even feature data; maybe you just have a feature in your data set for what is in a given image. Rekognition has a pretrained model for most common types of objects that you can just use, so if you only need classifications of general objects in the world, Rekognition might be able to do that for you.
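For instance, here’s a minimal, hypothetical boto3 sketch of asking Rekognition to label an image stored in S3; the bucket and object names are made up, and AWS credentials are assumed to be configured:

```python
import boto3

rekognition = boto3.client("rekognition")

# Ask Rekognition for up to 10 labels it is at least 80% confident about.
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-training-images", "Name": "photos/bird.jpg"}},
    MaxLabels=10,
    MinConfidence=80.0,
)

# Each detected label name could become a training label or a feature value.
labels = [label["Name"] for label in response["Labels"]]
print(labels)  # e.g. ["Bird", "Animal", "Finch"] for a picture of a finch
```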

If you’re dealing with textual information as opposed to images, Comprehend might be useful. Basically, that is an AWS service for doing text analysis and topic modeling, so you could automatically generate topics or sentiments for a given document that way. That might be a way of both creating labels and maybe even additional training features, right? So this is another example of feature engineering, where you might generate additional features for your model by using something like Comprehend. If I have a document, maybe I can use Comprehend to generate new features that include the topic of that document or the sentiment that document represents, and that might be useful information for training a machine learning model. That’s why this is in this section about feature engineering: you can apply these techniques of using existing services, or even human beings, to generate more information that your model can use. Really, any pretrained model or unsupervised learning technique can be helpful in generating new training labels or even new training features in your data set.
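Along the same lines, here’s a minimal, hypothetical sketch of using Comprehend to tack a sentiment feature onto a couple of made-up documents; again, this is an illustration rather than code from the course:

```python
import boto3

comprehend = boto3.client("comprehend")

def sentiment_feature(text):
    """Return Comprehend's sentiment label: POSITIVE, NEGATIVE, NEUTRAL, or MIXED."""
    response = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    return response["Sentiment"]

documents = [
    "I love certification exams.",
    "This product broke after one day.",
]

# Use the sentiment label as an extra feature column for each document.
sentiment_column = [sentiment_feature(doc) for doc in documents]
print(sentiment_column)  # e.g. ["POSITIVE", "NEGATIVE"]
```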

19. Lab: Preparing Data for TF-IDF with Spark and EMR, Part 1

So in this lab we’re going to practice preparing data at scale with Apache Spark on a cluster, and we’re going to do it in the context of preparing data for a TF-IDF algorithm. You might not know what TF-IDF is, so let’s start by covering that. TF-IDF stands for term frequency, inverse document frequency. Basically, it’s used in search algorithms for the most part. It’s a good way to figure out how relevant a given term is for a given document. So if you can compute the TF-IDF score for every term in a corpus of documents, you can use that to figure out which documents are most relevant to a given search term. It sounds fancy, but it’s actually a lot simpler than it sounds. Term frequency, or TF, just measures how often a word occurs within a document. That’s it. Term frequency is just what it sounds like: a word that occurs very frequently within a given document is probably pretty important to that document’s meaning. For example, a document that contains the phrase “machine learning” a lot is probably about machine learning.

Document frequency is just how often that word occurs across the entire set of documents. So, for example, maybe you’re looking across all of Wikipedia or every web page that’s being indexed by a search engine. The document frequency would tell you how common that word is across all documents, and that tends to flag common words like “a” and “the”, words that appear in every document but that aren’t really relevant to any specific document. They’re just common words. So by taking these two things together, the term frequency and the document frequency, you can figure out just how important a given term is to a given document. You can measure that relevancy by taking the term frequency and dividing it by the document frequency. Now, it turns out that division is mathematically the same thing as multiplying by the inverse, and that’s why we call this TF-IDF: it’s the same thing mathematically as term frequency times inverse document frequency. We just take how often the word appears in the document and divide that by how often it appears everywhere. That gives you a measure of just how special that word is for that given document. In practice, there are some nuances to how we actually use this. For example, we use the log of the inverse document frequency instead of the raw value.
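As a rough illustration of that math, here’s a toy Python sketch of computing a TF-IDF score for a single term and document. This isn’t the lab’s code, and real implementations add smoothing so the division never blows up on unseen terms, but it shows the shape of the calculation:

```python
import math

def tf_idf(term, doc, corpus):
    """Toy TF-IDF for `term` in `doc`, where `doc` is a list of words and
    `corpus` is a list of such documents."""
    tf = doc.count(term) / len(doc)                       # term frequency
    docs_with_term = sum(1 for d in corpus if term in d)  # document frequency
    idf = math.log(len(corpus) / docs_with_term)          # log of the inverse document frequency
    return tf * idf

corpus = [
    "i love certification exams".split(),
    "i love puppies".split(),
]
print(tf_idf("puppies", corpus[1], corpus))  # rarer word, higher score
print(tf_idf("love", corpus[1], corpus))     # appears in every document, scores zero
```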

We use the log of the inverse document frequency because word frequencies in reality tend to be distributed exponentially, so by taking a log we end up with a slightly better weighting of the words given their overall popularity. And there are some general limitations to TF-IDF. One is that we treat everything as a bag of words: every document is just a collection of words, and we don’t really pay attention to the relationships between the words. Obviously the real world isn’t that simple, and it turns out that parsing out those words is a good part of the work, as we’ll soon see. Also, what do you do about things like synonyms, tenses of words, abbreviations, and misspellings? We’ll see in our example that we’ll at least deal with capitalization by making everything lowercase, but there are more complicated things you could do to try to make this work better. Also, doing this at scale becomes challenging. Mapping all these terms and documents together into numbers that tie together is kind of hard to do, and that’s where Apache Spark is going to come in to help. We talked a little bit about the bag-of-words approach. One way around that limitation is to look at not just unigrams, or individual words in a document, but also things like bigrams or, more generally, n-grams.

So, look at words that occur together. For example, if I had a document that just contained the phrase “I love certification exams”, I could look at just the unigrams and consider the terms to be the individual words: I, love, certification, and exams. But I could also look at bigrams and consider those terms as well. That would be every grouping of two words that appears together. So, for example, the bigrams in this sentence are “I love”, “love certification”, and “certification exams”, that is, every consecutive pair of words that appears in that document. I could go further to trigrams, and those would be “I love certification” and “love certification exams”. I bring this up because one of the sample questions that AWS provides you for the certification exam gets into the guts of all this. So here’s an example of what a TF-IDF matrix might look like that considers both unigrams and bigrams. Let’s imagine that we have two documents in our entire corpus. One document contains only the sentence “I love certification exams”, and the other document contains only the sentence “I love puppies”.

This is what that matrix might look like when we actually compute TF-IDF for every individual term across every individual document. The documents appear on the rows of the matrix, and each individual term appears on the columns. Since we’re looking at both unigrams and bigrams, we start by taking each unique word that appears in this corpus as an individual term. That consists of the words I, love, certification, exams, and puppies. Since the words “I” and “love” appear in both documents, they aren’t counted twice; they get deduplicated, so we only consider them once in the matrix. Each term gets only one column. Then we look at the bigrams as well, and the unique bigrams across this entire document set are “I love”, “love certification”, “love puppies”, and “certification exams”. Again, the bigram “I love” is shared between these two documents, so it only appears once. And the job of TF-IDF would be to flesh out this matrix and actually compute the TF-IDF values for each document for each individual term.
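As a quick sanity check on those counts, here’s a tiny pure-Python sketch (not part of the lab itself) that builds the combined unigram and bigram vocabulary for those two documents:

```python
docs = ["I love certification exams", "I love puppies"]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

vocab = set()
for doc in docs:
    tokens = doc.lower().split()      # normalize capitalization, as discussed
    vocab.update(ngrams(tokens, 1))   # unigrams
    vocab.update(ngrams(tokens, 2))   # bigrams

print(len(vocab), sorted(vocab))      # 9 unique terms across the two documents
```

Running it yields nine unique terms: the five unigrams plus the four bigrams.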

And again, those terms can be unigrams or bigrams. In this particular example, the dimensions of this matrix are two by nine: we have two rows corresponding to the two documents, and nine columns corresponding to the nine unique terms, which consist of both the unigrams and bigrams that are uniquely represented within this document set. So how do you use TF-IDF? Well, you could make a very simple search algorithm using nothing but it.

You could just go through and compute TF-IDF ahead of time for every term in a corpus of data. Then, for a given search term, sort the documents by their TF-IDF score for that term, display the results, and you’re done. So let’s actually try that out. We’re going to apply TF-IDF to a subset of the Wikipedia data set; we’re going to build our own little search engine using TF-IDF on Wikipedia. And to make it interesting, we’re going to do it on AWS using Elastic MapReduce and Apache Spark. Our primary goal here is to show you how to preprocess that data ahead of time because, well, that’s what this section is really about.
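To make that plan a little more concrete, here’s a rough PySpark sketch of the kind of TF-IDF preparation Spark can handle for us, run against the two toy documents from earlier rather than the real Wikipedia extract. This is not the lab’s actual notebook; among other differences, HashingTF hashes terms into a fixed-size vector instead of building an explicit unigram/bigram vocabulary:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf-sketch").getOrCreate()

# Two toy documents standing in for the Wikipedia extract used in the lab.
docs = spark.createDataFrame(
    [(0, "I love certification exams"), (1, "I love puppies")],
    ["doc_id", "text"],
)

# Lowercase each document and split it into words.
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)

# Hash each word into a fixed-size vector of raw term frequencies.
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18).transform(words)

# Fit IDF over the whole corpus, then rescale the term frequencies into TF-IDF.
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("doc_id", "tfidf").show(truncate=False)
```

A simple search would then score each document by the TF-IDF weight of the query term and sort on that score.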
