Amazon AWS Certified Machine Learning Specialty – Modeling Part 6
14. Linear Learner in SageMaker
Let's start diving into the long list of built-in algorithms that SageMaker offers. You will actually be expected to know about these things and what's special and quirky about each one. So pay attention, guys. Take notes. It's all important. We'll start with Linear Learner, which is a pretty simple concept, right? I mean, linear regression, that's kind of like the first thing you learn about in machine learning. And all it is, is the idea of fitting a line to some training data set, right? So once I have a line fitted to some data, I can just say, okay, given an x value, I'll use that line equation to predict its y value. But Linear Learner does a whole lot more than just that. It is important to remember that it's linear. I mean, not all data sets are really suitable for a linear function, right? Not all data sets are separable by a line. Sometimes you need more of a curve to actually get at what's going on. So you want to make sure you're looking at your scatter plots and figuring out if a line really is an appropriate fit for what you're trying to do.
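Just to make that concrete, here's a tiny illustration, plain NumPy with made-up numbers and nothing SageMaker-specific, of fitting a line to some points and then using the line equation to predict y for a new x:

```python
import numpy as np

# Made-up training data: x values and the y values we observed for them
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a degree-1 polynomial, i.e. a line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted line equation to predict y for a new x value
new_x = 6.0
predicted_y = slope * new_x + intercept
print(predicted_y)
```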
But if so, it can actually do more than just linear regression. It can handle both regression and classification, which is a little bit counterintuitive. I mean, usually when we think about regression, we're thinking about numerical problems, not classification problems. But Linear Learner can actually do both by using what's called a linear threshold function. So in the classification case, it can do either binary or multiclass classification problems. So Linear Learner can do pretty much anything as long as a line will fit what you're trying to do. Its input format, and we're going to get into this with every single algorithm that we look at, would ideally be recordIO-wrapped protobuf with float32 data; that's going to be the most performant option for you.
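As a rough sketch of what that conversion can look like, assuming the write_numpy_to_dense_tensor helper from the SageMaker Python SDK and made-up data, you might serialize a NumPy array to recordIO-wrapped protobuf something like this:

```python
import io
import numpy as np
import sagemaker.amazon.common as smac

# Made-up feature matrix and labels, cast to float32 as Linear Learner prefers
features = np.random.rand(100, 10).astype("float32")
labels = np.random.randint(0, 2, size=100).astype("float32")

# Serialize to recordIO-wrapped protobuf in an in-memory buffer,
# which you would then upload to S3 as your training channel
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)
```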
But it can also take raw CSV data as well. In that case, it will assume that the first column is the label data, followed by all the feature data, and it supports both file and pipe mode in either case. So what do I mean by that? Well, in file mode, SageMaker will copy all of your training data over as a single file all at once to every training instance in your training fleet, whereas pipe mode will actually pipe it and stream it in from S3 as needed. Obviously, pipe mode is going to be more efficient, especially with larger training sets, right? So if you're having a problem where training is taking too long to even get started because it's copying data down from S3, a very simple optimization would be to use pipe mode instead of file mode. Remember that for the exam, guys. For every algorithm, we'll get into some things that are special about it. For Linear Learner, some things you have to know are that the training data should be normalized, and you can either do that yourself or let the Linear Learner algorithm do that for you automatically. But you have to remember to do one or the other: either normalize your data up front or remember to tell Linear Learner to actually normalize your data for you. You also need to make sure that your input data is shuffled. That's going to be very important for getting good results out of it.
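Here's a hedged sketch of how you'd ask for pipe mode with CSV data using the SageMaker Python SDK; the bucket and prefix are hypothetical:

```python
from sagemaker.inputs import TrainingInput

# Hypothetical bucket and prefix; the content type tells Linear Learner
# to expect CSV with the label in the first column
train_input = TrainingInput(
    "s3://my-bucket/linear-learner/train/",
    content_type="text/csv",
    input_mode="Pipe",  # stream from S3 as needed instead of copying everything up front
)
```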
Now, Linear Learner is actually doing some pretty complicated stuff under the hood. It's using SGD, stochastic gradient descent. You have many optimization algorithms to choose from, including variations of SGD like Adam and AdaGrad, and it actually trains multiple models in parallel and chooses the most optimal one during the validation step. So pretty sophisticated stuff. It also offers L1 and L2 regularization, which you can tune. Again, these are just ways to prevent overfitting, where L1 ends up doing feature selection, whereas L2 just weights your individual features more smoothly. Some important hyperparameters when you're actually tuning your Linear Learner model: one is balance_multiclass_weights, and you want to set that to give each class equal importance in the loss function. So it's important to remember to set that. Also, you can adjust the learning rate and batch size, just like we talked about when we talked about tuning neural networks; it very much works the same way. And you can also adjust the L1 regularization term as well as L2. L2 goes by the name of weight decay in this particular set of hyperparameters. The types of instances you should choose for Linear Learner are either single- or multi-machine CPU or GPU.
So it does help to have more than one machine, but it does not help to have more than one GPU on one machine. So it's important to remember that stuff. Just to drive home that this is not your father's linear regression here, one of the examples they give you is actually using Linear Learner for handwriting recognition on the MNIST dataset.
So that’s kind of cool. You can actually use Linear Learner as a classifier and just feed in raw pixel data and have it classify things and it actually works. I mean, that’s crazy, right? But again, if you look at how it’s working under the hood, it’s not that dissimilar from how we train neural networks. So even though it’s not a neural network, it works in some ways in a similar manner. So it’s kind of cool. So that’s Linear Learner in a nutshell and we’re just going to keep plowing through each of these algorithms and talk about the important points that might show up on the exam.
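To tie the Linear Learner pieces together, here's a rough, non-authoritative sketch using the SageMaker Python SDK; the role ARN, bucket, instance choice, and specific hyperparameter values are all assumptions for illustration:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical role ARN

# Look up the Linear Learner container image for the current region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role=role,
    instance_count=2,              # multiple machines can help
    instance_type="ml.c5.xlarge",  # single- or multi-machine CPU or GPU
    output_path="s3://my-bucket/linear-learner/output/",  # assumed bucket
    sagemaker_session=session,
)

linear.set_hyperparameters(
    predictor_type="multiclass_classifier",
    num_classes=10,
    balance_multiclass_weights=True,  # give each class equal importance in the loss
    normalize_data=True,              # or normalize your data yourself up front
    learning_rate=0.01,
    mini_batch_size=1000,
    l1=0.0,                           # L1 regularization term
    wd=0.0001,                        # L2 regularization, a.k.a. weight decay
)

# linear.fit({"train": train_input})
```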
15. XGBoost in SageMaker
Next, let's talk about XGBoost, another built-in algorithm for SageMaker. XGBoost is really hot these days. It stands for extreme gradient boosting. We touched on it a bit when we talked about boosting earlier. It is a boosted group of decision trees. That means that we have a series of decision trees, where we keep making new trees that are made to correct the errors of the previous trees, and they build on each other to make a better and better model. It's using gradient descent as it goes. That seems to be sort of the key to a lot of these algorithms, right? And it uses that to minimize the loss as those new trees are added in. XGBoost has been winning a lot of Kaggle competitions lately, so it's really one of the hottest algorithms out there right now.
And it's also pretty fast, so you're not going to pay a huge computational cost for it either. And although you tend to think of decision trees in the context of classification problems, it turns out you can also use it for regression, for predicting numerical values, and it uses something called regression trees in that case. XGBoost is kind of weird in the context of SageMaker because it wasn't an algorithm made for SageMaker; they are just using straight-up open source XGBoost. So as such, it's not really built to take recordIO protobuf format. It just takes CSV files or libsvm input, because that's what open source XGBoost takes in. So it's just XGBoost; there's nothing really special about it in the context of SageMaker. Because of that, your models just get serialized and deserialized using pickle with Python.
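Here's a minimal sketch of that idea using the open source xgboost library with made-up data, just to show a boosted ensemble of trees being trained and then pickled:

```python
import pickle
import numpy as np
import xgboost as xgb

# Made-up toy data: 100 rows, 5 features, binary labels
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Train a small boosted ensemble of decision trees with gradient boosting
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.2}
model = xgb.train(params, dtrain, num_boost_round=50)

# Models are serialized and deserialized with Python's pickle
with open("xgboost-model", "wb") as f:
    pickle.dump(model, f)
with open("xgboost-model", "rb") as f:
    restored = pickle.load(f)
```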
And you can also just use it as a framework within your SageMaker notebook using the SageMaker XGBoost module. So you don't even have to deploy it to training hosts, and you don't even have to use a Docker image with it. You can just use it within your notebook if you want to. Of course, it's just going to be running on your notebook instance in that case. But you can also use it as a built-in SageMaker algorithm and refer to the XGBoost Docker image in ECR and deploy that to a fleet of training hosts to do larger-scale training jobs. The thing with XGBoost, though, is that it has a lot of hyperparameters, and tuning them is really the main battle in getting good results out of it. So here are just a few of them. I mean, there's just a ton of them.
There's the subsample hyperparameter. You can use that to prevent overfitting. So if you're asked how to prevent overfitting in XGBoost, adjusting the subsample parameter is a good way of doing that. So is eta, which corresponds to the step size shrinkage and is also used for preventing overfitting. Other parameters include gamma, alpha, and lambda. Gamma corresponds to the minimum loss reduction needed to create a partition for a new branch of your tree. Alpha and lambda correspond to the L1 and L2 regularization terms, and in both cases larger values will yield a more conservative model. XGBoost only takes advantage of CPUs; it is not a GPU-based algorithm, so don't bother using a P2 or a P3 instance for this. It actually turns out that it's a memory-bound algorithm, not compute-bound, most of the time. So your best choice is going to be an M4 instance type when you're actually deploying XGBoost for training.
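As a rough sketch of running it as a built-in SageMaker algorithm with the hyperparameters above, assuming the SageMaker Python SDK and a hypothetical role ARN, bucket, and container version:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical role ARN

# Pull the XGBoost container image from ECR (the version here is an assumption)
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

xgb_estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",  # memory-bound, so an M-class CPU instance
    output_path="s3://my-bucket/xgboost/output/",  # assumed bucket
    sagemaker_session=session,
)

xgb_estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    subsample=0.8,  # row sampling per tree; lowering it helps prevent overfitting
    eta=0.2,        # step size shrinkage, also guards against overfitting
    gamma=1.0,      # minimum loss reduction required to make a new split
    alpha=0.1,      # L1 regularization term
    **{"lambda": 1.0},  # L2 regularization term ("lambda" is a Python keyword)
)

# xgb_estimator.fit({"train": train_input, "validation": validation_input})
```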
16. Seq2Seq in SageMaker
Next up, we'll talk about SageMaker's Seq2Seq algorithm. That stands for sequence to sequence, and that might ring a bell from when we talked about RNNs. This is an example of where you just take in a sequence of tokens and you output another sequence of tokens. And like we said before, a common use of that is machine translation. You can think of a sentence as just a sequence of tokens, a sequence of words, right? And your output would be a sequence of translated words based on that same phrase. It's also used for things like text summarization.
So maybe I can take in a sequence of tokens that corresponds to words in a document and output a sequence that corresponds to words of a summarization of that document. Speech to text is another example of that: I might have tokenized audio waveforms, and I want to output tokenized words in text instead. Under the hood, it can implement this with either RNNs or CNNs with attention, which is sort of an alternative method. So even though we talk about sequential data as being a fit for RNNs, it can also be a fit for CNNs as well. It expects recordIO protobuf as its input, where the tokens are integers. And that's a little bit unusual, because most of the algorithms in SageMaker want floating point data.
But if you think about it, the sequences are going to be tokens, right? So we're going to have some sort of a sequence of things, and those might be integer indices into vocabulary files for words or something else. And it really should be recordIO protobuf format, because that is custom-built for being a good input into a neural network like this. So you need to provide training input in the form of tokenized text files. You can't just pass in raw text files full of words or whatever, right? You need to actually build a vocabulary file that maps every word to a number, because computers like numbers, not words. So you're going to have to provide it with both that vocabulary file and the tokenized text files, or whatever else you're tokenizing, as the input.
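Here's a toy illustration, pure Python with made-up sentences, of what building that vocabulary and tokenizing text looks like:

```python
# Made-up sentences, just to show the vocabulary and tokenization step
sentences = ["the cat sat", "the dog sat"]

# Build a vocabulary that maps every word to an integer index
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Tokenize each sentence into a sequence of integer indices
tokenized = [[vocab[word] for word in sentence.split()] for sentence in sentences]

print(vocab)      # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(tokenized)  # [[0, 1, 2], [0, 3, 2]]
```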
There is sample code provided in SageMaker that shows you how to actually do that and convert everything to the recordIO protobuf format that it expects as input. It ends up looking a lot like that TF-IDF lab that we did earlier, actually. But at the end of the day, you provide training data, a validation data set, and a vocabulary file that it can use to actually map those tokenized files to actual words or whatever it is you're trying to encode. Now, as you can imagine, building a machine learning model that can translate between two different languages is a pretty tall order.
And yeah, it can take days to train these things, even using the power of SageMaker and huge fleets of training hosts. Fortunately, there are pretrained models available. So if you look at the example notebook they provide, they'll show you how to actually access pretrained models for translating from one language to another.
The details aren't important for the exam, but you just need to know that you can do that. And there are public training data sets out there as well that you can use for specific translation tasks. You don't need to go out there and build your own dictionary of every word in English and how that corresponds to every word in German, for example. There are people who have done that work for you already. So, again, a very common use of Seq2Seq is in machine translation. Usually that's the context you hear it talked about in. The hyperparameters for Seq2Seq are mostly the ones you would expect for a neural network: batch size, the optimizer type (Adam, SGD, or RMSprop), the learning rate.
How many layers do I have? The thing that's a little bit different about it, though, is what you're optimizing on. You can just optimize on accuracy. So you could provide a validation data set that says, here's what I think the correct translation of this sentence would be, and it will measure, well, did I get that right or not? But in the world of machine translation, the results tend to be a little bit more nuanced, right?
So you're more likely to use different metrics. For example, BLEU score can actually compare your translation against multiple reference translations, so it can have a little bit more wiggle room there as to what's considered correct. There's also a metric called perplexity that's used in this space; that's a cross-entropy metric.
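If you want to see what a BLEU score computation looks like, here's a quick sketch using NLTK; it assumes NLTK is installed, and the sentences are made up:

```python
from nltk.translate.bleu_score import sentence_bleu

# Two acceptable reference translations (tokenized), plus a candidate translation
references = [
    ["the", "cat", "is", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "cat", "sits", "on", "the", "mat"]

# BLEU compares the candidate against all of the references at once,
# which is where that extra wiggle room comes from
score = sentence_bleu(references, candidate)
print(score)
```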
But for the purposes of the exam, just remember that BLEU score and perplexity are well suited for measuring machine translation problems. Obviously, as a deep learning algorithm, it takes advantage of GPUs, and you should just use GPU instance types for this. As we said, it's a very heavy-duty algorithm, so you're probably going to want to throw a P3 node at it.
Unfortunately, you can only use a single machine for training, so it cannot actually be parallelized across multiple machines, which is kind of a bummer given how intensive this training is. However, the good news is that it can use multiple GPUs on the same machine, at least. So if you're doing a very large training job with Seq2Seq, use a very beefy P3 host with lots of GPUs.
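Putting that together, a hedged sketch of a Seq2Seq training setup with the SageMaker Python SDK might look like this; the role ARN, bucket, and instance size are assumptions:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical role ARN

container = image_uris.retrieve("seq2seq", session.boto_region_name)

seq2seq = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,               # single machine only; no multi-node training
    instance_type="ml.p3.8xlarge",  # but it can use every GPU on that one beefy box
    output_path="s3://my-bucket/seq2seq/output/",  # assumed bucket
    sagemaker_session=session,
)
```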