Amazon AWS Certified Machine Learning Specialty – Modeling Part 9


23. Random Cut Forest in SageMaker

Next we’ll cover Random Cut Forest, and that’s Amazon’s algorithm for anomaly detection. It works in an unsupervised setting. So basically it’s looking at a series of data and trying to find things that are anomalous in that series: things that stick out as maybe being a little bit weird. It can look for things like breaks in periodicity, or things that are just unclassifiable. And for every data point, it will assign an anomaly score. Amazon developed this algorithm themselves. They wrote a paper on it, and they seem really proud of it, because it keeps creeping into all their different systems. So it’s a pretty safe bet you’ll see something about this on the exam, because Amazon is using it all over the place.

So what does it expect? Either CSV format or RecordIO-protobuf. You can use File or Pipe mode in either case. Optionally, you can have a test channel if you want to compute accuracy, precision, recall, or F1 on labeled data. Again, this is an unsupervised algorithm, so there is no real training, but you can provide a test channel and try to measure its accuracy against data where you already know what the anomalies are.
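Here’s a rough sketch of what that looks like with the SageMaker Python SDK, using the built-in RandomCutForest estimator and its record_set helper to build the train channel plus an optional labeled test channel. The IAM role and the toy numpy data are hypothetical placeholders, and parameter names can vary a bit between SDK versions:

```python
import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
rcf = RandomCutForest(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m4.xlarge",    # CPU instance; RCF doesn't use GPUs
    num_trees=50,
    num_samples_per_tree=256,
    sagemaker_session=session,
)

train_data = np.random.rand(1000, 1).astype("float32")   # stand-in for your series
# The optional test channel takes labeled data so accuracy/precision/recall/F1 can be computed.
test_data = np.random.rand(100, 1).astype("float32")
test_labels = np.zeros(100, dtype="float32")             # 1 = known anomaly, 0 = normal

rcf.fit([
    rcf.record_set(train_data),                                     # "train" channel
    rcf.record_set(test_data, labels=test_labels, channel="test"),  # optional "test" channel
])
```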

How does it work under the hood? Well, they have a whole video on this available on AWS, but all you need to know is that basically it’s creating a forest of trees, where every tree, like a decision tree, is a partition of the training data. And what it does is it looks at the expected change in the complexity of that tree as a result of adding a new point into it. So if you add a new data point into this decision tree and it causes a whole bunch of new branches to form, it says, well, this might be anomalous. Right? So it’s kind of a cool idea.
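To make that intuition concrete, here’s a tiny toy in Python. This is not Amazon’s actual Random Cut Forest implementation, just a one-dimensional, isolation-style illustration of the same idea: a point that’s far from everything else gets separated from the rest of the data after very few random cuts, while a normal point takes many more.

```python
import random

def cuts_to_isolate(point, data, depth=0, max_depth=12):
    # Keep making random cuts on the value range; count how many cuts it takes
    # until the point is alone on its side (or we hit max_depth).
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    cut = random.uniform(lo, hi)
    same_side = [x for x in data if (x < cut) == (point < cut)]
    return cuts_to_isolate(point, same_side, depth + 1, max_depth)

values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 55.0]   # 55.0 is the odd one out
for p in (10.0, 55.0):
    avg = sum(cuts_to_isolate(p, values) for _ in range(500)) / 500
    print(p, round(avg, 2))   # the outlier isolates after far fewer cuts on average
```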

They’re basically using the properties of a decision tree and saying, okay, the fact that a decision tree needs to make a bunch of new branches to accommodate some new data probably means there’s something weird about that data point. That’s the basic idea. The data is sampled randomly; that’s the “random cut” part of Random Cut Forest, and then we train it. So Random Cut Forest is showing up, like I said, all over the place. It also shows up in Kinesis Analytics, where it’s available for doing anomaly detection on streams of data.

So, yeah, it can actually work on streaming data as well. It’s not limited to working on batches of data, but under the hood, it’s based on looking at a series, maybe a time series of data, and flagging things that are anomalous as it goes. The important hyperparameters are num_trees and num_samples_per_tree; having more trees reduces noise in the anomaly scores. The guidance is that you should choose num_samples_per_tree such that 1 / num_samples_per_tree approximates the ratio of anomalous to normal data. So if you have a rough idea of how much of your data is anomalous ahead of time, you can tune num_samples_per_tree to do a better job of identifying those anomalies. It does not take advantage of GPUs. It’s really a fairly simple algorithm under the hood. So you want to use an ml.m4, ml.c4, or ml.c5 instance for training, and ml.c5.xl is recommended for inference. So that’s Random Cut Forest. Just remember that Random Cut Forest is for anomaly detection, and you should be okay.
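As a hypothetical worked example of that guidance: if you expect roughly 0.5% of your points to be anomalous, you’d pick num_samples_per_tree so that 1 / num_samples_per_tree is about 0.005.

```python
# Hypothetical numbers: suppose roughly 0.5% of the data points are expected to be anomalous.
expected_anomaly_ratio = 0.005
num_samples_per_tree = round(1 / expected_anomaly_ratio)   # 1/200 = 0.005 -> 200 samples per tree
num_trees = 100                                            # more trees -> less noisy anomaly scores
print(num_samples_per_tree, num_trees)                     # 200 100
```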

24. Neural Topic Model in SageMaker

Next we’ll dive into the world of topic modeling. That’s where we’re trying to figure out what a document is about, basically, at a high level. And these are unsupervised methods, so you can just throw a bunch of documents into these things and get topics out the other side. Now keep in mind, these are just artificial topics that the algorithm has come up with. They’re not necessarily going to be human-readable names that make sense. Anyway, SageMaker gives you a couple of ways of doing this, and one of them is Neural Topic Model. And again, we’re just trying to organize documents into topics, and we can use that for classification, or even for summarizing documents based on the topics that we think they represent. So it’s more than just TF-IDF, where we’re trying to search on specific terms. It’s actually grouping things together into higher-level concepts that those terms might represent.

So, for example, the terms bike, car, train, mileage, and speed might all be associated with a single topic that might represent transportation at a more general level. Again, the model isn’t going to know to call that topic “transportation.” It’s just going to group documents that talk about bikes and cars and trains and mileage and speed together into one topic. It’s going to be up to you to interpret that. It is an unsupervised algorithm, which again is why it doesn’t really know what to call these things, because we’re not training it on known topic names. But it works. The underlying algorithm is called neural variational inference, if you’re interested. For training, again, it’s unsupervised, so “training” is kind of not really the right word to use here.

But you can do a training pass on it and just pass in a validation or a test channel if you want to actually measure its performance on a set of known data where you know what the topics are.

You can pass in RecordIO-protobuf or CSV data to actually analyze a given set of documents. And those words first must be tokenized into integers, so you don’t just pass in raw text. You have to break up and convert those documents into tokens for each word, and also pass in a vocabulary file that maps those words to the numbers that represent them in the documents. So for every document, you have a count of every word in the vocabulary, often in a CSV file. The vocabulary is separate: the auxiliary channel is used for that vocabulary data. And you can use File or Pipe mode; either is fine, though obviously Pipe is always faster. What are some quirks about it? Well, you define how many topics you want at the end. So basically the main hyperparameter you have is how many topics you want to generate, and that will sort of control how high of a level it’s trying to organize things into. It will give you as many topics as you say. Again, those topics aren’t going to map to specific human-readable words; they’re just going to be higher-level concepts that are learned in an unsupervised manner.
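As a toy illustration of that tokenization step (the five-word vocabulary here is made up), each document becomes one row of counts, with one column per vocabulary word:

```python
from collections import Counter

# Hypothetical toy vocabulary mapping each word to an integer id.
vocab = {"bike": 0, "car": 1, "train": 2, "mileage": 3, "speed": 4}

def to_bow_row(document):
    # One CSV row per document: a count for every word in the vocabulary.
    counts = Counter(w for w in document.lower().split() if w in vocab)
    return [counts[word] for word in vocab]

print(to_bow_row("The car and the bike had similar mileage"))  # [1, 1, 0, 1, 0]
```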

Those topics are going to be what we call a latent representation, based on the top-ranking words in your documents. So yeah, it’s just trying to find these underlying relationships that are hard to uncover otherwise. And again, this is only one of two topic modeling algorithms that SageMaker offers. We’ll talk about the other one next, so you should try them both and see which one works best. Being a neural topic model, batch size and learning rate are obviously some important hyperparameters. And also, like we said, the number of topics that you want is really the main thing to tune; that will control just how high or low level your topics end up being. It can use GPU or CPU. A GPU is recommended for training because it is a neural network, but CPU is cheaper and will probably be adequate for inference.
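Putting that together, here’s a rough sketch of training NTM through the generic Estimator in the SageMaker Python SDK, pointing at the built-in NTM container. The role, bucket, and S3 paths are hypothetical placeholders, and the exact setup may differ in your account and SDK version:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
ntm = Estimator(
    image_uri=image_uris.retrieve("ntm", session.boto_region_name),  # built-in NTM container
    role="arn:aws:iam::123456789012:role/SageMakerRole",             # hypothetical
    instance_count=1,
    instance_type="ml.p3.2xlarge",            # GPU recommended for training the neural net
    output_path="s3://my-bucket/ntm-output",  # hypothetical bucket
    sagemaker_session=session,
)
# num_topics is the main dial; feature_dim must match the vocabulary size.
ntm.set_hyperparameters(num_topics=20, feature_dim=5000, mini_batch_size=256, learning_rate=0.01)
ntm.fit({
    "train": TrainingInput("s3://my-bucket/ntm/train.csv", content_type="text/csv"),
    "auxiliary": TrainingInput("s3://my-bucket/ntm/vocab.txt"),   # the vocabulary mapping
})
# For inference, a cheaper CPU instance is usually adequate, e.g.:
# predictor = ntm.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")
```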

25. Latent Dirichlet Allocation (LDA) in SageMaker

Let’s dive into SageMaker’s other topic modeling algorithm, LDA, which stands for Latent Dirichlet Allocation. There’s some debate over how to pronounce “Dirichlet”; it sounds French to me. Anyway, it’s another topic modeling algorithm that’s not based on deep learning, and again, it’s unsupervised, just like Neural Topic Model. It’s just not using a neural network under the hood, but otherwise it works the same way. So from a usage standpoint, it’s the same sort of thing. The topics themselves are unlabeled; they’re just groupings of documents based on what words those documents share in common. And you can actually use this for things other than words if you want to.

You could even cluster customers together based on their purchases, or do harmonic analysis in music if you want to, using the same algorithm. So even though it’s most commonly used for topic modeling on documents, it’s actually more general purpose. You can use it for other things too that are just based on unsupervised clustering, based on the things that each object has in common. So it takes in a training channel for actually analyzing the data, and optionally, you can pass in a test channel if you just want to measure accuracy. Again, it’s unsupervised, so it’s not really going to be testing anything. The input can be either RecordIO-protobuf or CSV.

And again, we need to tokenize that data first. Every document will just have a count for every word in the vocabulary. So we’re just going to pass in integer tokens that represent each word, and how often that word occurs in each individual document, not the documents themselves. Same deal as with Neural Topic Model.

And if you want to use Pipe mode, that’s only supported with the RecordIO format for LDA. Again, it’s unsupervised, and it will generate however many topics you specify, just like with Neural Topic Model. That num_topics parameter is going to be the main dial that you want to adjust here. Optionally, you can pass in that test channel if you just want to score the results; the score comes back as per-word log-likelihood, which is the metric we use for measuring how well LDA works. So functionally it’s the same thing as Neural Topic Model, but under the hood it’s a very different approach. It’s CPU based, not GPU based, so as a result it can be a little bit cheaper and more efficient, since you don’t need expensive GPU instances to train it.

Again, the number of topics is going to be the main hyperparameter that you care about; that will control how coarse or fine-grained the topics it generates will be. And again, these are not human-readable topics we’re talking about; they’re just groupings of documents. So we’re going to end up with however many num_topics groupings you specify, and it’s kind of up to you to figure out what they really mean. Another important hyperparameter that’s worth tuning with LDA is alpha0. Under the hood, that’s the initial guess for what they call the concentration parameter. Smaller values will generate sparser topic mixtures, whereas larger values produce more uniform mixtures.

But you probably don’t need that level of detail. Just remember that the number of topics is the main thing to tune with topic modeling, and in the case of LDA, we don’t have all the deep learning stuff to tune; we just have this alpha0 parameter. Again, it’s CPU only for training. And furthermore, it has to be a single instance, so you can’t parallelize the training of this. Just a single CPU node is all you need.
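For completeness, here’s a similar hedged sketch for LDA with the built-in container: a single CPU instance, num_topics and alpha0 as the hyperparameters of interest, and an optional test channel. The role and S3 paths are again hypothetical placeholders:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
lda = Estimator(
    image_uri=image_uris.retrieve("lda", session.boto_region_name),  # built-in LDA container
    role="arn:aws:iam::123456789012:role/SageMakerRole",             # hypothetical
    instance_count=1,                    # LDA training only supports a single instance
    instance_type="ml.c5.2xlarge",       # CPU only; no GPU needed
    output_path="s3://my-bucket/lda-output",   # hypothetical bucket
    sagemaker_session=session,
)
lda.set_hyperparameters(
    num_topics=10,        # the main dial: how coarse or fine-grained the topics are
    feature_dim=5000,     # vocabulary size
    mini_batch_size=200,  # number of documents in the training set
    alpha0=1.0,           # initial guess for the concentration parameter
)
lda.fit({
    "train": TrainingInput("s3://my-bucket/lda/train.csv", content_type="text/csv"),
    # Optional test channel, scored by per-word log-likelihood:
    "test": TrainingInput("s3://my-bucket/lda/test.csv", content_type="text/csv"),
})
```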
