AWS Certified Machine Learning – Specialty: ML Implementation and Operations
1. Section Intro: Machine Learning Implementation and Operations
Our last domain to cover is machine learning implementation and operations. It's one thing to build a machine learning model and train it offline, but how do you deploy it into production? Not only do your models need to scale and perform reliably, they need to be secure as well. We'll cover the operational aspects of Amazon SageMaker here and how it interoperates with the containers that host your models. We'll talk about accelerating your machine learning systems using Elastic Inference and pushing your models to the edge and to devices using SageMaker Neo. We'll also dive into the intersection of SageMaker and AWS security, and how SageMaker interacts with IAM, KMS, and private VPCs. And we'll talk about choosing appropriate EC2 instance types for SageMaker and how to perform A/B tests in a production environment to try out new models on real-world data at scale. For our hands-on lab activity, we'll take the same convolutional neural network that we built in the previous section and integrate it into SageMaker for training, deployment, and model tuning, all in the cloud. We're in the home stretch here, guys, so let's get through this last domain and you'll be just about ready for the practice exam.
2. SageMaker’s Inner Details and Production Variants
Let's dive into the final domain of the exam: machine learning implementation and operations. We'll start off by going into more depth on how SageMaker interacts with Docker containers, which is a big part of this domain. This is actually going to be a fairly short domain in this course, because we've covered a lot of it as we've gone. It's hard to talk about all the different models that come with SageMaker without talking at least a little bit about how to deploy them, choose the right instances for them, and so forth. So in reality, you already know a lot of the implementation and operations domain of the exam; we're just going to fill in the blanks on what we haven't covered already, which is why this section is rather short. But that's good news. You're almost done, guys. Anyway, let's talk a little bit more about Docker and how it interacts with SageMaker. We touched on this at a high level earlier on. Basically, every machine learning model and deployment within SageMaker needs to be hosted inside of a Docker container that is registered with ECR. Now, that container can hold any number of things.
We could have a prebuilt deep learning model that's sitting in a Docker container. We could have a prebuilt scikit-learn or Spark ML model that's sitting in a Docker container. We could have our own code built on top of TensorFlow, MXNet, Chainer, or PyTorch; these are all supported frameworks, so you can just build a Docker container from them and have it work within SageMaker pretty much automatically. As we'll see in an example later on, you can actually write a little snippet of TensorFlow code and very easily package that up into a model with SageMaker. One thing worth mentioning here, though, is that TensorFlow does not get distributed across multiple machines automatically. So if you do need to distribute that training across multiple machines that might have multiple GPUs, there are a couple of ways to do that: one is a framework called Horovod, and the other is something called parameter servers. These are little bits of trivia and important words that you might want to remember. Hint, hint. Horovod is a way of actually distributing TensorFlow training across a fleet, as sketched below.
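Just to make that concrete, here's a rough sketch of how you might ask the SageMaker Python SDK's TensorFlow estimator for Horovod-style (MPI) or parameter-server distribution. This isn't from the course slides; the role ARN, script name, instance types, and framework version are placeholders, and the exact distribution dictionary you use depends on your SDK and TensorFlow versions.

# Hedged sketch: distributing TensorFlow training across multiple machines
# with the SageMaker Python SDK. All names and versions below are placeholders.
from sagemaker.tensorflow import TensorFlow

# Horovod-style distribution over MPI
horovod_estimator = TensorFlow(
    entry_point="train.py",                                # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder IAM role
    instance_count=2,                                      # multiple machines
    instance_type="ml.p3.2xlarge",                         # GPU instances
    framework_version="2.11",                              # example version
    py_version="py39",
    distribution={"mpi": {"enabled": True, "processes_per_host": 1}},
)

# Parameter-server-style distribution instead
ps_estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
    distribution={"parameter_server": {"enabled": True}},
)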
You can also develop your own training or inference code entirely from scratch, or you could take one of the prebuilt images for SageMaker's built-in algorithms, extend it, and build your own Docker image on top of that. This gives you a really flexible model, because all SageMaker expects is a Docker image that conforms to some specifications about where its model code lives and how it's called. And it can be anything: any script you can dream up, written in any runtime or any language you want, as long as it's in a Docker container and things are in the right place, it can work with SageMaker. These containers are all isolated and contain all of their dependencies and resources within them, so you can really use any technology you want within SageMaker, as long as it's wrapped up in a Docker container. It's really cool stuff. Time for another pretty picture.
We looked at something similar to this earlier, but it's worth reviewing again. A Docker container is created from a Docker image, and an image is built from a Dockerfile. The Dockerfile says how to put the image together. The resulting image, which includes all of your code and resources, gets saved into a repository, and when you're using SageMaker, that should be Amazon's Elastic Container Registry, or Amazon ECR. So again, let's review this diagram: we have Docker images that we've prepared ahead of time in Amazon ECR, and these can include both training and inference images. SageMaker plucks the training image out of ECR and uses that Docker container for training, pulling in training data from S3. Training jobs run on deployed copies of that Docker image with access to the S3 training data; the resulting trained model artifacts are then stored back into S3, and those S3 model artifacts are made accessible to the inference code in the model deployment stage.
So again, we have a Docker container that contains our inference code, and it consumes those stored model artifacts to generate inferences in real time. That model is deployed to a fleet of servers, and there are endpoints that expose it for runtime usage from your outside application, or what have you. One little point of trivia: there is a library available for making your container compatible with SageMaker. You can just run pip install sagemaker-containers as part of your Dockerfile to add those capabilities to your container. The actual structure of a training container looks like this, and it's worth remembering, guys: basically everything lives under the /opt/ml directory within your container.
/opt/ml/input should contain your configuration, meaning your hyperparameter and resource configuration files, as well as your data, which includes the channels that input data is coming from. The code directory is worth calling out: there's a /opt/ml/code subdirectory that should contain the script files that actually run your training. So the Python script, or whatever it is that actually does your training, should be deployed to /opt/ml/code. Remember that. Finally, output goes to the /opt/ml/output directory; any failure or error messages are expected to go in there. The /opt/ml/model directory is used for deployment, as we'll see here: the deployment container should contain an /opt/ml/model directory, and that's where the files your inference code uses for deployment should live. Pretty simple stuff. At a higher level, the entire Docker image looks like this: there's a working directory that contains nginx.conf, predictor.py, a serve directory that contains the deployment stuff, a train subdirectory that contains the training code, and a wsgi.py file. Let's go into more detail about what these all are. nginx.conf is the configuration file for the nginx front end; basically, we're going to be running a web server at deployment time, and that's how we configure that web server. predictor.py is the program that implements a Flask web server for making those predictions at runtime.
You're going to need to customize that code to actually perform predictions for your application in whatever way makes sense. The serve directory contains the program that's started when the container is launched for hosting; that file just launches the Gunicorn server, which runs multiple instances of the Flask application defined in your predictor.py script. The train directory contains the program that's invoked when you run the container for training, so to implement your own training algorithm, you would modify the program that lives in there.
Again, the structure of that training directory is what we covered earlier when we talked about the structure of your training image. Finally, wsgi.py is just a small wrapper that's used to invoke your Flask application for serving results. Now, you can have separate training and inference images if you want to, or you can combine them together into this structure. Either way works.
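To give a feel for what predictor.py might contain, here's a minimal hedged sketch of a Flask app that answers the two requests SageMaker sends to a hosted container: a GET on /ping for health checks and a POST on /invocations for predictions. The model loading and prediction logic here are pure placeholders, not the course's actual code.

# Minimal sketch of a predictor.py-style Flask app for a SageMaker inference
# container. SageMaker health-checks GET /ping and sends prediction requests
# to POST /invocations. The "model" here is a stub, not a real trained model.
import json
import flask

app = flask.Flask(__name__)

def load_model():
    # Placeholder: real code would load artifacts from /opt/ml/model
    return lambda features: [sum(row) for row in features]

model = load_model()

@app.route("/ping", methods=["GET"])
def ping():
    # Return 200 if the container is healthy and the model loaded successfully
    status = 200 if model is not None else 404
    return flask.Response(response="\n", status=status, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    # Assume JSON input like {"instances": [[1.0, 2.0], [3.0, 4.0]]}
    payload = json.loads(flask.request.data.decode("utf-8"))
    predictions = model(payload["instances"])
    return flask.Response(
        response=json.dumps({"predictions": predictions}),
        status=200,
        mimetype="application/json",
    )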
So to put it all together, your Dockerfile might look something like this. We start off with a FROM tensorflow/tensorflow line; that pulls down the TensorFlow image used to run the Python script, so basically we're saying this container depends on TensorFlow, and we're going to use the TensorFlow stuff to actually run this code. RUN pip install sagemaker-containers, like we talked about before, adds the common functionality necessary to create a container that's compatible with Amazon SageMaker. And the COPY command here copies the script that does our training into the spot where it's expected to be; like we talked about earlier, that training code needs to reside within the /opt/ml/code directory.
So that's just making sure our training script is copied into the spot where SageMaker expects it to be. Finally, we can define environment variables, and one of them is ENV SAGEMAKER_PROGRAM, which lets you define the specific script that actually runs the training. Setting it to train.py defines that as the name of the entry point script located in the /opt/ml/code folder for that container. That's the only environment variable you must specify when you're building your own container from scratch, but there are many other environment variables you might want to set as well. In addition to SAGEMAKER_PROGRAM, some other ones worth knowing are the training module and service module variables, which are basically where you load up the TensorFlow or MXNet or whatever framework modules. There's also SM_MODEL_DIR, which is where the model checkpoints are saved and then pushed into S3. Then we have the SM_CHANNEL variables, such as SM_CHANNEL_TRAIN, SM_CHANNEL_TEST, and SM_CHANNEL_VALIDATION; those are where your train, test, and validation channels come from and what the script expects to see there. And SM_HPS contains the hyperparameters.
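As a rough illustration of how a custom training script might pick these up at runtime, here's a hedged sketch of a train.py entry point; the default paths mirror the /opt/ml layout described above, and the actual training step is just a placeholder.

# Sketch of a train.py entry point reading the SM_* environment variables
# that SageMaker injects into the container. Paths fall back to the standard
# /opt/ml layout; the "training" itself is a placeholder.
import json
import os

model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
validation_dir = os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation")

# Hyperparameters arrive as a JSON string in SM_HPS (and also as a file
# under /opt/ml/input/config/hyperparameters.json).
hyperparameters = json.loads(os.environ.get("SM_HPS", "{}"))
learning_rate = float(hyperparameters.get("learning_rate", 0.01))

print(f"Training on data in {train_dir}, validating on {validation_dir}")
print(f"Using learning_rate={learning_rate}")

# ... real training code would go here ...

# Save whatever the framework considers a model into SM_MODEL_DIR so that
# SageMaker tars it up and pushes it to S3 as the model artifact.
os.makedirs(model_dir, exist_ok=True)
with open(os.path.join(model_dir, "model.json"), "w") as f:
    json.dump({"learning_rate": learning_rate}, f)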
Those hyperparameters are how you define the different knobs exposed by your algorithm, which in turn can be used by automatic model tuning in SageMaker. There are many more environment variables too, but you won't need that much detail for the exam. Here's a little code snippet of what it might look like to actually use your own image. It's very straightforward: you cd into the directory containing the Dockerfile (this could all be done within a notebook), call out to Docker and tell it to build that Dockerfile under whatever name you want, and then the code to actually invoke it is just from sagemaker.estimator import Estimator, constructing an Estimator with the image name you specified. So when I built that Docker image above, you can see that I called it foo; here I'm referring to the image name foo to say I want to load up the Docker image with that name and use it as an estimator within SageMaker. I can then call fit on that estimator to actually do the training, and we're done. It's just that simple.
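That whole flow might look roughly like the sketch below. Treat the image name, role ARN, bucket, and instance type as placeholders; the course's exact notebook code may differ.

# Hedged sketch: build and use your own container as a SageMaker estimator.
# Shell step (e.g. from a notebook cell), assuming the Dockerfile is in the
# current directory:
#   !docker build -t foo .
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="foo",                     # the image name we built above
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="local",               # local mode; in the cloud you'd push the
                                         # image to ECR and use an ml.* instance type
)

# Kick off training; the channel name and S3 path are placeholders.
estimator.fit({"train": "s3://my-example-bucket/training-data/"})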
There's a lot more detail in the SageMaker Developer Guide if you want to dig into it more, but you probably don't need a whole lot more for the exam. I also want to talk about production variants; this is an important concept. You can actually test out multiple models on live traffic in SageMaker using production variants. The idea is that you can have different versions of your model, and let's say you want to try them out on real traffic and see how they do.
Not all models can really be evaluated offline effectively. Recommender systems come to mind, for example, where accuracy on people's past behavior isn't always a good indication of performance on future, unseen behavior. So if you want to roll out a new model and test it without taking on a whole lot of risk, you can use a production variant to run the two models in parallel for a while and measure how they perform against each other. The mechanism for this is something called variant weights, which tell SageMaker how to distribute traffic among your different production variants.
So, for example, I might have a new iteration of my model and roll it out initially with a 10% variant weight; that would mean I'm sending 90% of my traffic to the existing model and 10% to the new model. I can then start ramping that 10% up over time as I gain more and more confidence in its performance, and eventually, once I'm sure the new variant is actually doing better than the old one, I can ramp it up to 100% and discard the older production variant. This allows you to do A/B tests and validate performance in a real-world setting. Again, offline validation isn't always enough, and it's always risky to launch new code in general, right? So this gives you a controlled way of rolling out a new model, and if it doesn't work out well, if some unforeseen problem comes up, you can very quickly roll back to the previous model just by changing those variant weights at runtime.
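To make the variant-weight idea concrete, here's a hedged boto3 sketch, with placeholder model, endpoint, and config names. It sets up two production variants behind one endpoint with a 90/10 traffic split, and then shows how you could shift the weights later without redeploying anything.

# Sketch: two production variants behind one endpoint, 90/10 traffic split,
# then shifting traffic to the new variant at runtime. Names are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "existing-model",
            "ModelName": "my-model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,   # 90% of traffic
        },
        {
            "VariantName": "new-model",
            "ModelName": "my-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,   # 10% of traffic
        },
    ],
)

sm.create_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-endpoint-config")

# Later, once the new variant looks good, ramp its weight up without redeploying:
sm.update_endpoint_weights_and_capacities(
    EndpointName="my-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "existing-model", "DesiredWeight": 0.0},
        {"VariantName": "new-model", "DesiredWeight": 1.0},
    ],
)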