Amazon AWS Certified Machine Learning Specialty – Modeling Part 4

January 25, 2023

8. Grief with Gradients: The Vanishing Gradient Problem

Something that kind of surprised me on the exam was how much depth they expect you to have on various edge cases around training a neural network. A couple of them involve gradients and the numerical issues they can cause, so let's dive into that. Maybe you haven't heard of the vanishing gradient problem, so let's talk about it. Remember what we're looking at during gradient descent: the loss function on the y axis and the weights we're trying out on the x axis. The slope of that curve is just its first derivative, if you remember your calculus. As that slope approaches zero, which is exactly what happens at the bottom of the curve where it flattens out, you start working with very small numbers, and that can start causing problems numerically.
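To make that concrete, here's a tiny sketch of gradient descent on a one-dimensional loss curve. This is my own toy example, not something from the lecture; notice how the gradient itself shrinks toward zero as the weight approaches the minimum:

```python
import numpy as np

def loss(w):
    return (w - 3.0) ** 2          # a simple parabola with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)         # first derivative of the loss

w = 0.0                            # start far from the minimum
learning_rate = 0.1
for step in range(30):
    g = gradient(w)
    w -= learning_rate * g         # the basic gradient descent update
    if step % 10 == 0:
        print(f"step {step:2d}  w = {w:.4f}  gradient = {g:.6f}")
# The gradient starts at -6 and shrinks toward zero as w closes in on 3.
```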

Those tiny gradients can slow down your training, and they can even run into floating-point precision limits on your CPU. So as the slope becomes smaller and smaller, as you approach the bottom of the curve where you want to be, it can actually be a challenging place for a computer to operate. This becomes an especially big problem with deeper neural networks and with RNNs, because those vanishing gradients get propagated back through many layers and shrink further at each one. Think about what "vanishing gradient" is saying: the gradient, the slope, the first derivative at that point, is approaching zero. Basically it means we're reaching the bottom of one of those curves, which might be a local minimum, but it could be the correct answer as well.
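In a deep network the effect compounds, because backpropagation multiplies each layer's local derivative together. With a squashing activation like sigmoid, whose derivative never exceeds 0.25, the gradient shrinks exponentially with depth. Here's a rough numpy illustration (my own sketch, not from the lecture):

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)               # never larger than 0.25

grad = 1.0                             # pretend the gradient at the output layer is 1
for layer in range(20):                # backpropagate through 20 sigmoid layers
    grad *= sigmoid_derivative(0.0)    # 0.25, the best case sigmoid can do

print(grad)                            # 0.25 ** 20, roughly 9e-13 -- effectively zero
```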

So that's an issue, it turns out. There's also the opposite problem, exploding gradients, where the slope gets steeper and steeper instead, and that can be a numerical issue as well. So what do we do about it? There are a few things you can do to address the vanishing gradient problem. One is what's called a multi-level hierarchy: instead of training your entire deep neural network all at once, every single layer together, you split it into several sub-networks that are trained individually, which limits how far those vanishing gradients can propagate through your network. There are also specific architectures designed to address the problem. One is long short-term memory, or LSTM, which you might recall is a specific kind of recurrent neural network we talked about earlier. Another is the family of residual networks such as ResNet, a convolutional architecture that's very popular in object recognition these days; its skip connections give gradients a shortcut around layers, which is what the sketch below illustrates.
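Here's roughly what a residual (skip) connection looks like in code. This is a minimal sketch using the Keras API, which is my choice of framework for illustration, not something the lecture specifies:

```python
import tensorflow as tf

# A single residual block: the input gets added back onto the output of a couple
# of layers, so gradients can flow straight through the addition during backprop.
inputs = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(64)(x)
x = tf.keras.layers.Add()([x, inputs])          # the skip connection
outputs = tf.keras.layers.Activation("relu")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```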

Also, simply choosing a better activation function can work out well. The activation function is basically how an individual neuron decides whether or not to propagate its signal onward, and there are different functions you can use for that. One popular choice is ReLU, the rectified linear unit. For positive inputs it's just a straight 45-degree line, so its derivative there is a constant 1, which avoids the numerical problems you run into when taking the derivative of a squashing function. So just using ReLU can be a good way to avoid the vanishing gradient problem as well. Remember this stuff: if you do have the vanishing gradient problem, these are the ways to fix it, and they're the sorts of things people usually only learn through experience. ReLU is a solution to vanishing gradients, as are multi-level hierarchies and specific architectures such as LSTM and ResNet.
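To see why ReLU helps, here's a quick numpy comparison of the two derivatives (my own illustration): the gradient through a ReLU unit is exactly 1 for any positive input, while the sigmoid's gradient tops out at 0.25 and decays toward zero in both tails.

```python
import numpy as np

x = np.linspace(-3, 3, 7)                 # [-3, -2, -1, 0, 1, 2, 3]

relu_grad = (x > 0).astype(float)         # ReLU derivative: exactly 1 for any positive input
sig = 1.0 / (1.0 + np.exp(-x))
sig_grad = sig * (1.0 - sig)              # sigmoid derivative: peaks at 0.25, decays toward 0

print(relu_grad)                          # [0. 0. 0. 0. 1. 1. 1.]
print(sig_grad.round(3))                  # [0.045 0.105 0.197 0.25  0.197 0.105 0.045]
```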

Also on the topic of gradients, there's something called gradient checking. If you haven't heard of that, it's just a debugging technique. If you're actually developing a neural network framework, it's a good idea to numerically check the derivatives computed during training and make sure they are what they should be. So if you're validating the underlying code of how your neural network works, gradient checking is a good diagnostic tool to confirm that those gradients, those first derivatives of the loss function, are exactly what you expect them to be. Now, you're probably not going to be doing gradient checking yourself in industry, because it usually happens at a lower level than the code and frameworks you work with, but that's what gradient checking is in case the exam throws it out there as something you should recognize. All right, so that's the world of grief with gradients and the specific issues gradients can cause during training.
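Before we move on, here's a minimal sketch of the idea behind gradient checking. It's my own toy example: compare a hand-derived analytic gradient against a finite-difference approximation and make sure they agree.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

def loss(w):
    # A toy loss: mean squared error of a tiny linear model
    return np.mean((X @ w - y) ** 2)

def analytic_gradient(w):
    # Gradient derived by hand: d/dw mean((Xw - y)^2) = (2/n) * X^T (Xw - y)
    return 2.0 / len(y) * X.T @ (X @ w - y)

def numerical_gradient(w, eps=1e-6):
    # Central differences: nudge one weight at a time and see how the loss changes
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    return grad

w = np.array([0.1, -0.2])
print(analytic_gradient(w))
print(numerical_gradient(w))   # should agree to several decimal places
```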

9. L1 and L2 Regularization

Next, let's talk about L1 and L2 regularization. This is broader than just deep learning; it applies to the entire field of machine learning. But the exam will expect you to know the difference between the two and when you might apply one over the other. Again, this is a more advanced topic, but that's what this exam is all about: seeing how advanced you are in machine learning. The way it works is that a regularization term gets added to the loss as the weights of your neural network or machine learning model are being learned. L1 uses the sum of the absolute values of the weights, and L2 uses the sum of the squares of the weights. Graphically, the L1 penalty traces out a diamond shape, whereas L2, because it squares the weights, traces out a circle. You can apply the same idea to loss functions as well, not just to regularization during training. So what's the practical difference between the two? L1, as a sum of absolute weights, ends up performing feature selection: it can drive the coefficients of entire features all the way to zero, effectively choosing the features that matter most. It is, however, computationally inefficient, and it results in sparse output because it's removing information at the end of the day. L2, the sum of the squares of the weights, keeps all features in consideration; it just weights them differently. Nothing goes to zero, you just get very small or very large weights for different features depending on how things shake out. It's more efficient computationally and results in denser output because it's not discarding anything. So L2 sounds pretty cool, right? I want to keep my information, don't I? But there are cases where you'd want L1. Remember the curse of dimensionality from when we started talking about feature engineering and feature selection? L1 regularization is one way of doing feature selection automatically. In an extreme example, out of 100 features you have, maybe only ten would end up with non-zero coefficients under L1 regularization, and the resulting sparsity can make up for the computational inefficiency of L1 regularization itself.
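To make the definitions concrete, here's a tiny numpy sketch (my own illustration) of the two penalty terms; in practice each gets multiplied by a regularization strength and added onto the loss:

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # some learned weights, made up for illustration

l1_penalty = np.sum(np.abs(w))        # L1: sum of the absolute values of the weights
l2_penalty = np.sum(w ** 2)           # L2: sum of the squares of the weights

print(l1_penalty)   # 4.7
print(l2_penalty)   # 10.69
```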

So even though L1 regularization is a bit more intensive to compute, you end up with a much smaller set of features at the end of the day, which can speed up the training of your machine learning model considerably; in total training time it's probably a win. If you think you're in a world where some of your features might not matter and you actually want to reduce things down to a smaller subset of features, L1 is probably going to work out well for you. However, if you think all your features are important, go with L2 regularization, because it won't do feature selection; it won't wipe out entire features by driving that regularization term all the way down to zero, it will just weight them lower. So that's the difference: L1 does feature selection, L2 keeps everything around but weights features differently. And that's the main thing to remember about L1 and L2 regularization.
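If you want to see that feature-selection behavior for yourself, here's a short sketch using scikit-learn's Lasso (L1) and Ridge (L2) linear models. scikit-learn isn't something the lecture calls out; it's just a convenient way to illustrate the difference:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, only 10 of which are actually informative
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 regularization
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 regularization

print("L1 non-zero coefficients:", np.sum(lasso.coef_ != 0))   # typically close to 10
print("L2 non-zero coefficients:", np.sum(ridge.coef_ != 0))   # all 100 stay non-zero
```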

10. The Confusion Matrix

Something you'll see a lot on the exam is the concept of confusion matrices, so let's dive into what those are all about. What's a confusion matrix for? The thing is, sometimes accuracy doesn't tell the whole story, and a confusion matrix can help you understand the more nuanced results of your model. For example, a test for a rare disease could be 99.9% accurate just by guessing "no" all the time, by always saying you don't have it. A model that does that would look on paper to have very high accuracy, but in reality it's worse than useless. So with a case like this, you need to understand how important a true positive, a true negative, a false positive, or a false negative is to what you're trying to accomplish, and to be able to measure how good your model is at each one of those cases. A confusion matrix is just a way to illustrate those nuances in the accuracy of your model. The general format looks like this.

So imagine a binary situation where we're just predicting yes or no: I have this disease or I don't, I test positive for this drug or I don't, this image has a cat in it or it doesn't. On the rows we have predicted values, and in the columns we have actual values. Walk through it: if we predicted yes and it really is yes, that's a true positive. If we predicted yes but it's actually no, that's a false positive. If we predicted no but it's actually yes, that's a false negative.

And if we predicted no and it's actually no, that's a true negative. It gets a little confusing, but if you think it through, it all makes sense. In an actual confusion matrix, these cells contain counts of how often your model produced each of those outcomes on its test data set. Keep in mind, too, that you have to pay attention to the labels. There's no real convention for how this is ordered; sometimes you'll see the predicted values on one axis and sometimes on the other, as the sketch below shows.
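For example, scikit-learn's confusion_matrix function (my own illustration; the lecture doesn't reference any particular library) puts the true labels on the rows and the predicted labels on the columns, which is the opposite of the layout we just walked through:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = yes, 0 = no)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# scikit-learn's convention: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```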

Don't just jump in assuming that a given confusion matrix follows a particular format; pay attention to how it's labeled and make sure you understand what it's telling you before you draw conclusions from it. Something else worth noting: you typically want most of your counts to land along the diagonal of the matrix, because that's where most of your results should be.

That diagonal is where accuracy lives: it holds the true positives and the true negatives. You want those to be nice big numbers, and the false negatives and false positives to be comparably low numbers, hopefully. An accurate model has high counts along that diagonal. Let's plug in some actual numbers to see what that might look like. Say I have a machine learning model that's trying to figure out whether an image contains a picture of a cat. If we predicted it had a cat and it really did have a cat, that happened 50 times in my test set. Sometimes I predicted it was a cat but it wasn't; it was a dog or a fish or something; that happened 5 times. If I predicted it wasn't a cat but it really was a cat, that happened 10 times in this example. And if I said it was not a cat and it really was not a cat, that happened 100 times. So that's how you interpret a confusion matrix, and we'll talk shortly about how to turn this data into metrics that are more useful for analysis.
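Plugging those counts into the matrix layout described above (rows are predicted, columns are actual, following the lecture's convention), you can read the overall accuracy straight off the diagonal. A quick sketch:

```python
import numpy as np

# Rows are predicted (cat, not cat); columns are actual (cat, not cat)
confusion = np.array([[50,   5],    # predicted cat:     50 true positives,  5 false positives
                      [10, 100]])   # predicted not cat: 10 false negatives, 100 true negatives

accuracy = np.trace(confusion) / confusion.sum()   # (50 + 100) / 165
print(f"accuracy = {accuracy:.3f}")                # about 0.909
```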

Sometimes you'll see confusion matrices in a different format, where the counts are also totaled up along each row and column. All that is, is adding up how many actual no's we have, how many actual yeses, how many predicted no's, and how many predicted yeses in total, just so you've seen that format before. The inner part of it is the same confusion matrix we looked at already. And again, remember that things can be flipped as far as where the predicted and the actual values go, so make sure you pay attention to the labels on these things. What can I say? Confusion matrices can be confusing; there's no real standard surrounding them, unfortunately. Just make sure you understand what a given matrix is telling you before you answer any exam questions about it. The sketch below shows one easy way to produce that totals format.
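One easy way to get that row-and-column-totals format is pandas' crosstab with margins turned on. This is just my own illustration with made-up labels, not anything the lecture prescribes:

```python
import pandas as pd

# Hypothetical actual vs. predicted labels for a handful of test samples
actual    = pd.Series(["yes", "no", "yes", "no", "no",  "yes", "no", "yes"], name="Actual")
predicted = pd.Series(["yes", "no", "no",  "no", "yes", "yes", "no", "yes"], name="Predicted")

# margins=True adds an "All" row and column holding the totals for each label
print(pd.crosstab(actual, predicted, margins=True))
```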

You'll also see confusion matrices for multi-class models. Imagine a handwriting recognition system that's trying to identify the digits zero through nine. A more complicated confusion matrix for that has predicted labels on one axis and true labels on the other, and instead of just yes/no answers there are ten possible classifications, so the matrix is larger, but it works the same way. If I predicted something was a five and it really was a five, the cell where those meet holds that count. If I predicted it was a one but it was really an eight, that count lands in a different cell; maybe that happened 20 or so times in this example. These larger matrices are often drawn as heat maps: instead of just displaying numbers in the individual cells, each number is mapped to a color, where a darker color corresponds to a higher count. Ideally you'd expect to see a dark line running down the diagonal, representing good accuracy on each class, and sparser, lighter colors everywhere else. Each color still corresponds to an actual value; the heat map just makes it easy to visualize how your confusion matrix is laid out at a glance. All right, that's what a confusion matrix is all about. It can be a little confusing, but stare at a few examples and it should make sense to you.
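If you want to generate that kind of heat map yourself, here's a minimal sketch using scikit-learn and matplotlib on scikit-learn's built-in handwritten-digit data. The libraries and model choice are my own assumptions for illustration, not something the lecture specifies:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# Handwritten digits 0-9, similar to the multi-class example described above
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Plots a 10x10 confusion matrix as a heat map; the diagonal should come out dark
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap="Blues")
plt.show()
```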
