Amazon AWS Certified Machine Learning Specialty – Modeling Part 3
5. Deep Learning on EC2 and EMR
So we’ve talked a lot about the world of deep learning in a more general sense that’s not really specific to AWS. And again, that’s okay. You’ll find that most of the machine learning exam is not specific to AWS, but let’s tie it back in a little bit, shall we? So how do I actually do this stuff on AWS, using EC2 or Elastic MapReduce? Well, Elastic MapReduce supports Apache MXNet out of the box, along with different GPU instance types. So it’s useful to know what sorts of instance types are relevant here for deep learning. Like we said, deep learning is a really good fit for GPUs, so even though these are not cheap, if you can use a GPU instance, it’s going to accelerate your deep learning quite a bit. The main types here are the P3, which offers up to eight Tesla V100 GPUs, the P2, which offers up to 16 K80 GPUs, and the G3, which offers up to four M60 GPUs.
These are all Nvidia chips. Usually we’re talking about P2s or P3s in practice here. P2s are going to be cheaper than P3s, but it depends on how big your job is, how quickly you need it to run, and how much budget you have. Frankly, those P3s are pretty pricey. There are also Deep Learning AMIs available, so you can just deploy an EC2 instance that has TensorFlow or MXNet or whatever framework you want preinstalled. And if you deploy that to a P2 or P3 instance, you’ll have a nice little machine ready to run your TensorFlow application really, really quickly. And as we’ll see later in the course, SageMaker is also an alternative for doing your deep learning. But we’re not there yet. We’re getting there, though.
6. Tuning Neural Networks
Let’s talk a bit about tuning your neural networks. And again, this is getting into the part of the exam where they’re trying to weed out people who haven’t actually done this in the real world. This is not stuff that is typically taught, but I’m going to try to convey it as best as I can. This is very important stuff for the exam, guys. So let’s talk about learning rate. First of all, what do we mean by learning rate? Well, you need to understand how these neural networks are trained. They’re using a technique called gradient descent, or something similar to gradient descent. There are various flavors of it out there. The basic idea is that we start at some random set of weights in our neural network and we just sample different solutions, different sets of weights, trying to minimize some cost function that we define over several epochs. So those are the keywords there.
We have many epochs, or iterations, over which we train. At each epoch, we try a different set of weights on our neural network, trying to minimize some cost function, which might be based on how accurately it makes predictions on our validation set. So we need to have some sort of rhyme and reason as to how we make those samples of different solutions, different weights, if you will. If we were to boil this down into sort of a two-dimensional graph, maybe it would look something like this, where we’re just sampling different points here along a curve of solutions and we’re trying to find the one that minimizes the cost function. So that’s the y axis here. So what we’re trying to find is the lowest point on this graph, and we’re trying to get there by sampling it at different points and learning from each previous sample. That’s what gradient descent is all about. So the learning rate is all about how far apart those samples are. So you see here we might have started up here and our learning rate said, okay, I’m going to try another point here and try again here, so on and so forth until I finally find the lowest point along this curve and call that my best solution.
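To make that stepping idea concrete, here is a minimal Python sketch of plain gradient descent. The cost function (w - 3)^2, the starting weight, and the step count are all made-up assumptions for illustration, not anything from a real network. The only point is that the learning rate scales how far each step moves along the curve.

```python
def cost(w):
    """Toy cost function (an assumption) with its minimum at w = 3."""
    return (w - 3.0) ** 2


def cost_gradient(w):
    """Derivative of the toy cost function with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(learning_rate, start=0.0, steps=50):
    """Take `steps` gradient descent steps and return the final weight."""
    w = start
    for _ in range(steps):
        # Step against the gradient; the learning rate sets the step size.
        w -= learning_rate * cost_gradient(w)
    return w


if __name__ == "__main__":
    # Hypothetical learning rates: tiny, moderate, and too large.
    for lr in (0.001, 0.1, 1.1):
        w = gradient_descent(lr)
        print(f"learning_rate={lr}: w={w:.4f}, cost={cost(w):.4f}")
```

Running it with a few different learning rates shows how differently the same number of steps can play out on the same curve.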
So it’s not too hard to understand the effect of learning rate on your training, right? If you have too high of a learning rate, you might overshoot that solution entirely. So imagine my learning rate was huge and I went straight from here to here. I might miss that bottom point there entirely if my learning rate were too high. But you can see that if my learning rate is too small, I’m going to be sampling a whole lot of different points here and it’s going to take a lot of epochs, a lot of steps, to actually find that optimal solution. So too high of a learning rate might mean that I overshoot the correct solution entirely, but too small of a learning rate will mean that my training might take longer than it needs to. Now learning rate is an example of what we call hyperparameters.
It’s one of the knobs and dials that you use while training your deep learning model that can affect its end result. And oftentimes these hyperparameters can have just as much influence on the quality of your model as the topology of the model, the feature engineering you’ve done, and everything else. So it’s just another piece of the puzzle here that you need to arrive at experimentally. In addition to learning rate, another important hyperparameter is the batch size. And this is how many training samples are used in each training step within an epoch. Now hammer this into your heads, guys, because it’s kind of counterintuitive. You would think that a large batch size would be a good thing, right? The more data the better. But no, that’s not how it ends up working. It turns out that if you have a small batch size, it has a better ability to work its way out of what we call local minima. So in this example here, you can see that we have a minimum here, sort of a dip in the graph here where we have a pretty good, nice low loss function value here.
What we’re trying to optimize for is pretty good here. But there’s a risk during gradient descent that we get stuck in that local minimum when in fact the better solution is over here somewhere. So we want to make sure that during the process of gradient descent we have some ability to wiggle our way out of this thing and find that better solution. It turns out that smaller batch sizes can do that more effectively than larger ones. So a small batch size can wiggle its way out of these local minima, but a large batch size might end up getting stuck in there, like basically weighing it down, if you will.
So batch sizes that are too large can end up getting stuck in the wrong solution. And what’s even weirder is that because you will usually randomly shuffle your data at the beginning of each training epoch, this can end up manifesting itself as getting very inconsistent results from run to run. So if my batch size is just a little bit too big, maybe sometimes I’ll get stuck in this minimum and sometimes I won’t. And I’ll see that in the end results from run to run: sometimes I’ll get this answer and sometimes I’ll get that answer, right? So hammer this into your head, guys. It’s very important for the exam. Smaller batch sizes tend to not get stuck in local minima, but large batch sizes can converge on the wrong solution at random. A large learning rate can end up overshooting the correct solution, but small learning rates can increase the training time. So remember this, write it down. Important stuff. And again, it’s an example of things that most people just learn the hard way through experience, but I’m trying to teach it to you up front here.
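One way to get a feel for why small batches can wiggle more is to look at how much the gradient estimate jumps around from batch to batch. The Python sketch below is an illustration under assumptions: the per-sample gradients are just random numbers drawn from a normal distribution, and `batch_gradient_spread` is a hypothetical helper, not part of any framework. It shuffles the data, splits it into batches, and measures how spread out the per-batch average gradients are.

```python
import random
import statistics


def batch_gradient_spread(per_sample_grads, batch_size, seed=0):
    """Shuffle, split into full batches, and return the standard
    deviation of the per-batch mean gradients."""
    rng = random.Random(seed)
    grads = list(per_sample_grads)
    rng.shuffle(grads)  # mimics shuffling at the start of an epoch
    batch_means = [
        statistics.fmean(grads[i:i + batch_size])
        for i in range(0, len(grads) - batch_size + 1, batch_size)
    ]
    return statistics.pstdev(batch_means)


if __name__ == "__main__":
    # Made-up per-sample gradients: noise around a true gradient of 0.
    data_rng = random.Random(42)
    per_sample_grads = [data_rng.gauss(0.0, 1.0) for _ in range(4096)]
    for batch_size in (8, 64, 512):
        spread = batch_gradient_spread(per_sample_grads, batch_size)
        print(f"batch_size={batch_size}: spread of batch gradients={spread:.4f}")
```

Smaller batches give noisier steps, and that noise is one common explanation for how training can get bumped out of a shallow dip. It is a toy picture, not a proof.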
7. Regularization Techniques for Neural Networks (Dropout, Early Stopping)
Let’s dive into regularization techniques in the world of neural networks. Again, this is something that’s not usually taught, but it’s going to be something that the test touches on to try to separate out the people who have actually done this from the people who just read about it. So I’ll do what I can to teach you about it. What is regularization anyway? Well, basically regularization is any technique that is intended to prevent overfitting. What is overfitting? Well, if you have a model that’s good at making predictions on the data it was trained on, but it doesn’t do so well on new data that it hasn’t seen before, then we say that that model is overfitted. That means that it’s learned patterns in your training data that don’t really exist in the general sense in the real world. So if you see a high accuracy on your training data set, but a lower accuracy on your test set or your evaluation data set, that’s nature’s way of telling you that you might be overfitting. Let’s take a step back here. This is probably the first time I’ve used the term evaluation data set. Again, if you’re new to this world, in the world of deep learning, typically we talk about three different data sets.
So we have the training data set. This is the actual training data fed into your neural network from the bottom up. And that’s what we actually train the network on, right? And then as we’re training each epoch, we can evaluate the results of that network against an evaluation data set. So basically, that’s a subset of the training set that’s set aside to evaluate the results and the accuracy of your model as it’s being trained. And then we can also have a testing data set that lives outside of all of that. So once we have a fully trained model, we can then use our testing data set to evaluate the complete finished model, if you will. So again, if you’re seeing your training accuracy being a lot higher than the accuracy measured against your evaluation data or your testing data at the end, that probably means you’re overfitting to the training data. This graph on the right here makes it a little bit easier to understand. So imagine I’m trying to build a model that just separates out things that are blue from things that are red here.
So if you eyeball this data, your brain can pretty much figure out that there’s probably this curve that kind of separates where the bluish stuff is and where the reddish stuff is, right? But in the real world, data is messy. There’s a little bit of noise there too. So if a model were overfitting, it might actually learn that green curve there that’s actually snaking in and out of all the data to try to fit that training data exactly. But you know that’s just noise, right? I mean, just looking at it, your brain knows that that’s not correct, but your neural network doesn’t really have that intuition built into it. So we need regularization techniques to sort of prevent that from happening, to prevent a neural network or any machine learning model from curving and undulating and sort of making these higher-frequency paths out of its way to overfit its model to the data. All right, that’s what overfitting is, and this is a good way to visualize it. The so-called correct answer, the correct model, would be that black line, but an overfitted model would be more like the green line.
And this is actually something that really happens in neural networks. If you have a really deep neural network with lots of weights and connections and neurons that are built into it, it can totally pick up on complex patterns like that. So you do have to be careful with it. So that’s where the world of regularization techniques comes in. Let’s go into some. So a very simple thing might be that you just have too complex of a model. Maybe you have too many layers or too many neurons. So you could have a deep neural network that’s too deep or maybe too wide or maybe both, right? So by actually simplifying your model down, that restricts its ability to learn those more complicated patterns that might be overfitting. So for a very simple pattern that’s just a simple curve like that, which could probably be achieved through a regression, maybe you’re better off with a simpler model. And the simplest regularization technique is simply to use fewer neurons or use fewer layers. That is a totally valid thing to do. Sometimes you need to experiment with that.
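To get a rough sense of how much capacity you give up by trimming layers or neurons, you can count the trainable weights in a fully connected network. This Python sketch assumes plain dense layers with one bias per neuron. The layer sizes in the example (784 inputs, 10 outputs) are hypothetical and not taken from the lecture.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    `layer_sizes` lists the number of units per layer, starting with
    the input size and ending with the output size.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per connection, plus one bias per output unit.
        total += n_in * n_out + n_out
    return total


if __name__ == "__main__":
    wide_and_deep = [784, 512, 512, 10]  # hypothetical larger model
    simpler = [784, 64, 10]              # hypothetical smaller model
    print("larger model parameters: ", count_dense_parameters(wide_and_deep))
    print("simpler model parameters:", count_dense_parameters(simpler))
```

Fewer parameters means less room to memorize noise, which is the whole idea behind trying the simpler model first.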
So if you find that your model is overfitting, probably the simplest thing is just to use a simpler model, try fewer layers, try fewer neurons in each layer, and see what kind of impact that has. If you can still have the same accuracy in your test data set but not overfit to your training data set, then why use more neurons than you need? Another technique is called dropout, and this is kind of an interesting one. So the idea with a dropout layer is that it actually removes some of the neurons in your network at each training step. And this has the effect of basically forcing your model to learn and spread out its learning amongst the different neurons and layers within your network. So by dropping out specific neurons that are chosen at random in each training step, we’re basically forcing the learning to spread itself out more. And this has the effect of preventing any individual neuron from overfitting to a specific data point, right? So it’s a little bit counterintuitive that actually removing neurons from your neural network can actually make it train better. But that’s what happens. That prevents overfitting. So that’s what dropout is all about. Again, a very effective regularization technique. We see this a lot in, say, CNNs, for example. It’s pretty standard to have a pretty aggressive dropout layer, like maybe even 50% being held out for each training pass.
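Here is a minimal Python sketch of what a dropout layer does to one layer’s outputs during a training step. It uses the common "inverted dropout" convention, where the surviving activations are scaled up so their expected total stays the same. That scaling detail is an assumption added here, not something the lecture covers, and `dropout` is a hypothetical standalone function rather than any framework’s API.

```python
import random


def dropout(activations, drop_rate=0.5, training=True, rng=None):
    """Randomly zero out activations during training (inverted dropout).

    At inference time (training=False) the activations pass through
    unchanged.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if not training or drop_rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_rate
    # Keep each unit with probability keep_prob and scale it up,
    # otherwise drop it to zero for this training step.
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


if __name__ == "__main__":
    layer_output = [0.2, 1.5, 0.7, 0.9, 1.1, 0.3]  # made-up activations
    print("training: ", dropout(layer_output, drop_rate=0.5))
    print("inference:", dropout(layer_output, training=False))
```

A different random subset gets dropped on every call, so no single neuron can be relied on all the time.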
So that’s all dropout is: it’s just removing some neurons at random at each training step to force your model to spread its learning out a little bit better. And that has a regularization effect that prevents overfitting. Another very simple solution is called early stopping. So let’s take a look at this printout as we’re actually training a real neural network. So you can see that if you look at the accuracy on the validation set, that’s the right-hand column there. We’re going from 95% to 97% and things are getting better. And then all of a sudden we get up to like around 98% and things start to get weird. It starts to oscillate, right? So we can say just by looking at this, that after around epoch five, we’re not getting any more benefit by training further. In fact, we might be doing more harm than good because at this point, we’re probably starting to overfit. And indeed, if you look at the training set accuracy, that’s that first column of accuracy, the second column of numbers that you see in this display, the accuracy on the training set continues to increase as we train more and more epochs. But the accuracy on the validation set pretty much stopped getting better at around epoch five. So this is pretty clearly starting to overfit beyond the fifth epoch.
All right, early stopping is a way of automatically detecting that. It’s an algorithm that will just say, okay, the validation accuracy has leveled out. My training accuracy is still increasing. We should probably just stop now. So early stopping just means, okay, I know you wanted ten epochs, but I can see here that after five, things are just getting worse as far as overfitting goes. So we’re going to stop at five. Guys, we’re done here. That’s it. That’s all early stopping is about. It’s just making sure that you’re not training your neural network further than you should, and that prevents overfitting. Very simple solution there. We’ll talk about more regularization techniques later in the course, but those are two that are specific to neural networks and that’s what I want to focus on right now.
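As a rough sketch of how that check might look in Python, the function below watches validation accuracy and stops once it has gone a set number of epochs without improving. The `patience` and `min_delta` parameters and the accuracy numbers are assumptions for illustration. They loosely follow the pattern described above but are not the actual values from the printout, and `early_stopping` is a hypothetical helper rather than a framework call.

```python
def early_stopping(val_accuracies, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once validation accuracy has failed to improve by
    more than `min_delta` for `patience` epochs in a row. If that never
    happens, stop_epoch is simply the last epoch.
    """
    best_accuracy = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best_accuracy + min_delta:
            best_accuracy = accuracy
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch


if __name__ == "__main__":
    # Made-up validation accuracies: improving, then oscillating.
    history = [0.950, 0.962, 0.970, 0.976, 0.980,
               0.979, 0.980, 0.978, 0.979, 0.980]
    stop_epoch, best_epoch = early_stopping(history, patience=2)
    print(f"stop after epoch {stop_epoch}, best epoch was {best_epoch}")
```

In practice you would usually also keep the weights from the best epoch rather than the last one, but that bookkeeping is left out here.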