Amazon AWS Certified Machine Learning Specialty – Modeling Part 10
26. K-Nearest-Neighbors (KNN) in SageMaker
Next up is KNN, which is probably the world's simplest machine learning algorithm out there: k-nearest neighbors. But SageMaker does take it to the next level. If you're not familiar with KNN and how it works, it's really simple. Basically, you plot your data points in some sort of feature space, and for a given point you just take the k closest points to that sample and return the label that occurs most frequently. So you're just looking at the k items that are most similar, based on some distance metric from your point to the other points in your training data, and returning the most frequent label from those similar data points. It turns out you can actually use KNN for regression as well, so it can do more than just classification; it's just a very simple twist on it. Instead of returning the most common classification label from the nearest neighbors, you return the average value of whatever feature you're trying to predict. So yeah, no big deal.
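Just to make that concrete, here's a tiny sketch of plain KNN in NumPy, nothing SageMaker-specific yet; the training points and the query are made-up values just for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3, regression=False):
    # Distance from the query point to every training point (Euclidean here,
    # but any distance metric would do)
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    if regression:
        return y_train[nearest].mean()           # regression: average the neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # classification: most frequent label

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, query=np.array([1.2, 1.9]), k=3))  # -> 0
```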
KNN is a very simple classification or regression algorithm. All it's doing is looking at the k nearest neighbors, quite literally, of your data point in some feature space. So you give it a train channel that contains your data, and the test channel can be used if you want to measure accuracy or MSE metrics during training. It can take recordIO-protobuf or CSV. For CSV, the first column would be the label data followed by all the features, and you can use file or pipe mode in either case. So like I said, SageMaker takes it to the next level. First, it actually samples the data, assuming that you have too much to deal with.
KNN does have some scaling problems otherwise, and furthermore, it does dimensionality reduction on that data as well. So if you do have a lot of features, it will try to boil that down first to avoid the curse of dimensionality. That way it becomes easier to compute those nearest neighbors if you're in a smaller dimensional space. This does come at the cost of noise and accuracy, of course, so you want to be a little bit careful about how you do that. Under the hood, it has a couple of options for that: sign or fjlt (the fast Johnson-Lindenstrauss transform). Once that's done, it builds up an index for looking up the neighbors quickly and serializes the model, and then at runtime you just query the model for a given value of k.
How many neighbors do I want to look at before I get back the average value or the classification that I'm looking for? Obviously the most important hyperparameter in KNN is k itself: how many neighbors do I look at? Tuning that is kind of a science in its own right. You kind of have to experiment and see where you start to get diminishing returns on higher values of k. Also, sample size is an important thing to tune here as well. Otherwise, though, it's straight-up KNN. If you've ever done machine learning before, you know what it's about. You can train it on either CPU or GPU instances, which I find kind of interesting; you can actually take advantage of the GPU for KNN. So either an ml.m5.2xlarge or an ml.p2.xlarge is recommended for training. For inference, CPU or GPU, whatever you want. CPU will probably give you lower latency, but GPU will give you higher throughput if you're trying to run KNN on large batches of data all at once.
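For reference, here's roughly what kicking off a KNN training job might look like with the SageMaker Python SDK (version 2.x). This is just a sketch: the role ARN is a placeholder and the training data is randomly generated; in practice you'd bring your own float32 feature and label arrays.

```python
import numpy as np
import sagemaker
from sagemaker import KNN

role = "arn:aws:iam::123456789012:role/MySageMakerRole"   # placeholder role ARN

# Made-up training data; record_set() expects float32 arrays
train_features = np.random.rand(1000, 20).astype("float32")
train_labels = np.random.randint(0, 2, size=1000).astype("float32")

knn = KNN(
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",       # CPU; ml.p2.xlarge if you want a GPU
    k=10,                                # the all-important number of neighbors
    sample_size=500,                     # how many training points to sample
    predictor_type="classifier",         # or "regressor" to average neighbor values
    dimension_reduction_type="sign",     # optional: "sign" or "fjlt"
    dimension_reduction_target=10,       # reduce down to this many dimensions
    sagemaker_session=sagemaker.Session(),
)

# record_set() converts the arrays to recordIO-protobuf and stages them in S3
knn.fit(knn.record_set(train_features, labels=train_labels, channel="train"))
```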
27. K-Means Clustering in SageMaker
Up next is k-means, another very common machine learning algorithm out there, and SageMaker provides it as well. And again, they take it to the next level. So what's k-means all about? Well, it's an unsupervised clustering technique. It's important to remember that KNN was technically a supervised technique, where we're learning from the labels of the training data and getting the nearest neighbors based on those labels. K-means, however, is unsupervised. So basically, it's just going to take a look at your data and try to divide it up in ways that make sense in the feature space. The way it works is it divides your data into k groups; that's where the k in k-means comes from, where each member of a group is as similar as possible to the other members. So it's trying to find where these clusters are centered in your feature space and how to divide those clusters up. It's up to you to define what similar means, how to actually measure how similar two points are to each other.
Often that's just a Euclidean distance in the space of normalized features, but you can make it whatever you want, really. So what SageMaker brings to the table is web-scale k-means clustering. Doing this at large scale can be challenging because we have to look at every single data point in there and figure it out, so SageMaker brings a way of doing this at very large scale. For the training input, you need a train channel, obviously, to actually analyze your data. But since it is unsupervised, the test channel is optional; that's only going to be used for measuring your results against some test data set that you might have. Now, when you do training and testing in SageMaker, you can specify how it interoperates with S3. For the train channel, you want to be using the ShardedByS3Key distribution, and for the test channel, you would want to use FullyReplicated. This is just saying whether I'm actually going to copy all of my data from S3 to every single training node or not. In the case of training, you don't want to copy everything over; if you want to scale better, you want to shard that data by the S3 key.
So if you're going to be using multiple machines for training, that allows it to be a lot more efficient. The format itself can be either recordIO-protobuf or CSV, and you can use file or pipe mode in either case. So how does k-means work under the hood? Well, we start off just like we would with regular k-means: we map every observation to n-dimensional space, where n is the number of features. So you can picture this multi-dimensional space where every dimension is a feature, and we're basically measuring how similar observations are based on their distance in that space. The job of k-means is just to optimize the centers of those k clusters, and in SageMaker we can actually specify extra cluster centers, which will further improve the accuracy. The number of clusters that we want to end up with is little k. The number of clusters that we actually work with during training, big K, is k times x, where x is the extra cluster centers term. So we start with big K, a larger set of clusters, and reduce it down to the number that we want when we're done. That's one little twist that SageMaker puts on things.
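Putting that together, here's a rough sketch of how a k-means training job and its S3 channels might be wired up with the SageMaker Python SDK. The role ARN, bucket paths, and feature_dim are placeholders, and I'm pulling the built-in algorithm image with image_uris so the documented hyperparameter names show up as-is.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image_uri = image_uris.retrieve(
    framework="kmeans", region=session.boto_region_name, version="1"
)

kmeans = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder
    instance_count=2,               # sharding only pays off with multiple instances
    instance_type="ml.c5.2xlarge",  # CPU is the recommended choice for K-Means
    sagemaker_session=session,
)
kmeans.set_hyperparameters(
    feature_dim=50,          # number of features per record (placeholder)
    k=10,                    # little k: the clusters you actually want back
    extra_center_factor=4,   # x: train with big K = k * x centers, then reduce
)

channels = {
    # Training data: shard by S3 key so each instance reads only its slice
    "train": TrainingInput(
        "s3://my-bucket/kmeans/train/",
        distribution="ShardedByS3Key",
        content_type="application/x-recordio-protobuf",
    ),
    # Optional test data: fully replicated so every instance sees the whole set
    "test": TrainingInput(
        "s3://my-bucket/kmeans/test/",
        distribution="FullyReplicated",
        content_type="application/x-recordio-protobuf",
    ),
}
kmeans.fit(channels)
```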
It's actually using more clusters at the beginning of the process and trying to consolidate that down to the number of clusters you want over time. Another extension it makes to the standard algorithm is in the process of determining the initial cluster centers. Standard k-means would just have you pick those initial cluster centers at random, but that can result in your clusters potentially being too close together. To avoid that problem, it uses something called k-means++, and all k-means++ does is try to make those initial cluster centers as far apart as possible, which makes the training work a little bit better. So we iterate over that training data, trying to calculate better and better cluster centers at each pass. At the same time, we're trying to reduce the clusters from big K, which takes those extra cluster centers into account, down to little k, the actual number of clusters that we want to end up with. If we're using k-means++, it specifically uses something called Lloyd's method to do that reduction.
Now, important hyperparameters. Obviously k: choosing the right value of k is tricky. Often people use what's called the elbow method, where you plot the within-cluster sum of squares as a function of k and just try to find the point where larger values of k aren't really producing any benefit, and go with that. You want to optimize for how tight your clusters are; that's what within-cluster sum of squares is measuring. You also want to look at the mini-batch size, the extra center factor (that's that x parameter), and the init method, which will be either random or k-means++, as ways of tuning your k-means model in SageMaker. It can actually use CPU or GPU instances, but they recommend CPU for k-means. If you are going to be using GPUs, only one GPU per instance is supported, so if you're going to use a GPU, just use a p*.xlarge; something bigger is not going to do any good. But CPU is probably the way to go with k-means. A quick sketch of the elbow method follows.
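Here's the elbow method in miniature. I'm using scikit-learn locally just to illustrate the idea (that's my own choice for the example, not something SageMaker requires); with the built-in algorithm you'd do the same thing by comparing the within-cluster sum of squares across training jobs run with different values of k.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))          # made-up data: 300 points, 5 features

# Train with a range of k values and record the within-cluster sum of squares
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)        # inertia_ = within-cluster sum of squares

# Plot (or just eyeball) k versus WCSS and pick the "elbow" where adding
# more clusters stops buying you much tighter clusters
for k, score in zip(range(1, 11), wcss):
    print(k, round(score, 1))
```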
28. Principal Component Analysis (PCA) in SageMaker
We just have a few more algorithms to get through here, guys, so stay with me. Next up is PCA, which stands for principal component analysis. PCA is a dimensionality reduction technique, so its job is to take higher dimensional data, data that contains a lot of different features or attributes, and boil it down into a lower dimensional space that's easier to work with. So again, we're trying to avoid the curse of dimensionality, if you will. PCA is a way of doing that, a way of distilling your features down into a smaller number of features that might not represent a specific thing, but they represent something that's important to your model. Basically, we are projecting that higher dimensional data into a lower dimensional space while minimizing the loss of information in the process.
That’s what PCA does. So you can imagine, for example, a data set that has a bunch of different features, and we project that down to a two dimensional feature space, and we can just end up with a 2d plot that maps those features down into the two most important dimensions that we care about for preserving the information in that data set. Those reduced dimensions are called components. So that’s why we’re talking about principal components. It’s trying to find the principal dimensions within your higher dimensional space that really matter.
And those dimensions may or may not align with existing features; each component is going to be some direction within that space, and the components are orthogonal to each other so that they preserve as much information as possible. Basically, the first component that it gives you back will have the largest possible variability. So we're trying to capture the variance in your data in as few dimensions as possible, and the first component, the first dimension that we get back, does the best job of capturing that variability in your data. The second component has the next largest possible variability, and so on and so forth. PCA is also cool because it's unsupervised.
You can just throw it at some data set; you don't need labels to train it, and it will distill that down into its fundamental components. What does it expect? Either recordIO-protobuf or CSV data, and you can use file or pipe mode with either, on pretty much arbitrary data. The way it works under the hood is by creating what's called a covariance matrix, and then it uses an algorithm called singular value decomposition, or SVD, to distill that down. It has two different modes of operation in SageMaker. One is regular mode, which is generally used for sparse data or if you have a moderate number of observations and features. But if you have a large number of observations and features, randomized mode might be a better choice; that uses an approximation algorithm instead and scales a little bit better.
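To make the idea concrete, here's the core math of PCA sketched out in plain NumPy on made-up data: center the features, run an SVD, and project onto the top components. SageMaker's implementation builds on the same idea, with the covariance matrix and an optional randomized approximation for scale.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 observations, 10 features

X_centered = X - X.mean(axis=0)          # the "subtract mean" step
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal components, ordered by how much variance they capture
explained_variance = (S ** 2) / (len(X) - 1)
top_two = Vt[:2]                         # keep the two most important components

X_reduced = X_centered @ top_two.T       # project the 10-D data down to 2-D
print(explained_variance[:2])            # variance captured by the top components
print(X_reduced.shape)                   # (200, 2)
```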
The main hyperparameters for PCA are the algorithm mode, like we just talked about, and also something called subtract mean, which has the effect of un-biasing the data up front, which can obviously be useful. For instance types, it can use either GPUs or CPUs. They're not very specific in the SageMaker Developer Guide about how to decide between the two; it says it depends on the specifics of the input data. So you probably want to experiment with GPU and CPU instances for actually running PCA if you care a lot about performance.
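And here's roughly what the built-in PCA estimator might look like through the Python SDK; the role ARN and the randomly generated input are placeholders, and the parameter values are just examples.

```python
import numpy as np
import sagemaker
from sagemaker import PCA

# Made-up data; record_set() expects float32
features = np.random.rand(5000, 50).astype("float32")

pca = PCA(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",    # try CPU vs. GPU; the guide says "it depends"
    num_components=10,               # how many principal components to keep
    algorithm_mode="randomized",     # or "regular" for sparse / moderate-sized data
    subtract_mean=True,              # un-bias the data up front
    sagemaker_session=sagemaker.Session(),
)

# record_set() converts the array to recordIO-protobuf and stages it in S3
pca.fit(pca.record_set(features))
```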
But again, the main thing to remember with PCA is that it's a dimensionality reduction technique. It can take data that has a large number of features and attributes and boil that down into a smaller number of features and attributes. Again, those features and attributes aren't going to have names per se, but they will represent dimensions in the data that effectively capture the variability of the data that you have. So PCA: all about dimensionality reduction. Bye.
29. Factorization Machines in SageMaker
The next stop in our zoo of algorithms is factorization machines. So what are factorization machines all about? Well, factorization machines specialize in classification or regression with sparse data. What do we mean by sparse data? An example that's often used for factorization machines is a recommender system. Why is a recommender system dealing with sparse data? Well, with a recommender system, we're trying to predict what pages or products a given user might like. But the problem is that you might have a huge number of pages or a huge number of products, and an individual user does not interact with the vast majority of them. So we can't really have a bunch of training data that says this user liked this product but didn't like that one, and so on, for everything in the catalog. In reality, we're only going to have information for a very small number of products out of the entire product catalog. So that's why we call this sparse data. We only know a few things about each individual user, and it's up to us to try to figure out what they might think about all these other things that they didn't interact with, to try to fill in that sparsity. Click prediction is another example of that, where an individual user or session will probably not interact with the vast majority of pages on the Internet.
However, we have some information about a few pages, and it's up to us to figure out which of the huge number of other pages they might click on next. So this is what we mean by sparse data. This is a supervised method, and again, it can do classification or regression. For example, you might be saying, I want to predict whether this person likes this product or not, or you might want to actually do a regression of a specific rating value that they might assign to that item. It is limited to pairwise interactions. As we'll see, factorization machines are talking about the factorization of a matrix, a two-dimensional matrix, so we need at least two dimensions here, where one dimension might be users and another dimension might be items, again getting back to the case of recommending items to users. The training data must be in recordIO-protobuf format with Float32 data.
Because we're talking about sparse data, CSV isn't really practical in this case. You would just end up with this huge list of commas because the vast majority of the items are not going to have any data associated with them for a given user. So CSV is a non-starter for this one. How does it work? Basically, we're trying to model this thing as a giant matrix. So we could have a matrix that maps users to items, filled in with whether they liked those things or not, or what their ratings were. And again, this is a very sparse matrix where the vast majority of those cells are going to be empty, because most users have not rated most items. So our job is to try to build up factors of that matrix that we can use to predict what a given rating might be for a given user-item pair that we don't know about.
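Here's a toy NumPy sketch of that matrix-factorization intuition: approximate the huge, sparse user-by-item matrix with two much smaller factor matrices, then predict an unseen cell as the dot product of a user's factor vector and an item's factor vector. Real factorization machines learn these factors (plus linear and bias terms) during training; the numbers here are just random placeholders to show the shapes involved.

```python
import numpy as np

num_users, num_items, num_factors = 1000, 5000, 8
rng = np.random.default_rng(1)

# In a real model these factor matrices are learned from the sparse ratings data;
# here they are random placeholders
user_factors = rng.normal(size=(num_users, num_factors))
item_factors = rng.normal(size=(num_items, num_factors))

def predict_rating(user_id, item_id):
    # Fill in one missing cell of the sparse user x item matrix
    return float(user_factors[user_id] @ item_factors[item_id])

print(predict_rating(user_id=42, item_id=1337))
```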
That's what factorization machines do. They're trying to find factors of the matrix that we can multiply together to figure out, given this matrix of the items a user liked, what the resulting ratings would be for things they haven't seen yet. Again, usually this is used in the context of recommender systems. So for the purpose of the exam, the main thing to remember is that if you're trying to pick an algorithm that's relevant to recommender systems, factorization machines is probably a really good choice to look at. On to the important hyperparameters: there are initialization methods for the bias, factor, and linear terms, and you can tune the properties of each one of those individually. They can be uniform, normal, or constant. Again, the details are probably not going to be important, but just for completeness, those are the main knobs and dials for factorization machines.
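For completeness, here's a sketch of the built-in Factorization Machines estimator in the Python SDK; the role ARN is a placeholder and the hyperparameter values are just examples, not recommendations.

```python
import sagemaker
from sagemaker import FactorizationMachines

fm = FactorizationMachines(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder
    instance_count=1,
    instance_type="ml.c5.2xlarge",        # CPU is recommended for sparse data
    num_factors=64,                       # size of the latent factor vectors
    predictor_type="binary_classifier",   # or "regressor" for predicting ratings
    bias_init_method="normal",            # init methods: "uniform", "normal", "constant"
    linear_init_method="normal",
    factors_init_method="normal",
    sagemaker_session=sagemaker.Session(),
)

# The training data has to be recordIO-protobuf with Float32 tensors; a SciPy
# sparse matrix can be converted with write_spmatrix_to_sparse_tensor from
# sagemaker.amazon.common before uploading, then call fm.fit() on that channel.
```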
It can use CPUs or GPUs, and they actually recommend CPUs for factorization machine training. GPUs are really only a benefit if you have dense data, and if you have dense data rather than sparse data, you should be asking yourself why you're looking at factorization machines in the first place. So again, key takeaways here: factorization machines are all about dealing with sparse data, doing classification or regression on sparse data. It works on pairs of data (user-item pairs, user-page clicks, things like that), and it's often used in the field of recommender systems.