Amazon AWS Certified Machine Learning Specialty – Modeling Part 7
17. DeepAR in SageMaker
Up next in our tour of SageMaker's built-in algorithms is DeepAR. DeepAR is used for forecasting one-dimensional time series data. So it's kind of the classic use case of an RNN, a recurrent neural network. We're looking at a sequence of data points over time and trying to predict what that sequence will do in the future. For example, looking at stock prices and trying to predict future stock prices. What's a little bit different about it compared to a straight-up RNN is that you can actually train DeepAR over several related time series at once. So it's not limited to training over a single time series. If you have many time series that are somehow interdependent, it can actually learn from the relationships between those time series to create a better model for predicting any individual time series.
It's able to find frequencies and seasonality, so it's going to be better than just a straight-up linear regression sort of approach. It can actually learn the nuances of your data over time and feed that forward into the future for you. For training input, it can take a wide variety of formats. It can take JSON Lines format, which can be gzipped, or Parquet format for even better performance. Basically, every record of that input data has to have a starting timestamp followed by the time series values. Optionally it can also include dynamic features, such as whether a promotion was applied to the product in a time series of product purchases.
It can also include categorical features as well, so if you have that sort of additional data, it can learn from that too. A little example of what the input format might look like is below. Basically you have start, containing the starting timestamp, and then target, which is the list of time series values for that time series. And then optionally you can have categorical features and dynamic features as well. So how is it used? What's quirky about it? Well, you always include the entire time series for training, testing, and inference. Even though you might only be interested in a certain window of it, you want to give it all the data you can so it can learn better. Always use the entire dataset as a test set, just removing the last time points (your prediction length) for training.
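To make that concrete, here is a hedged example of what a single record of the JSON Lines training input could look like. The start, target, cat, and dynamic_feat field names come from the DeepAR documentation; the actual values here are made up:

{"start": "2024-01-01 00:00:00", "target": [31.0, 28.5, 33.2, 30.1], "cat": [0, 1], "dynamic_feat": [[1, 0, 0, 1]]}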
Training and testing RNNs on time series data is a little bit weird that way: you want to evaluate on the data points you withheld, but still have the original data before that to actually use for the evaluation. You don't want to use very large values for prediction length, so don't try to go beyond, say, 400 data points that you're predicting into the future. Things get a little bit wiggly after that. And again, if you can train on many time series that are related together, and not just on one, that will unleash the full power of DeepAR for you. Some of the hyperparameters in DeepAR are the ones you'd expect with a neural network of any sort: the number of epochs for training, the batch size, the learning rate, and num_cells, which is just how many neurons we're using. We also have the context length, though, and this is basically the number of time points the model sees before making a prediction, and that can actually be smaller than your seasonality period.
But remember that the model also uses lagged values going back up to one year anyhow, so even if you're using a shorter context length, it's still going to be able to go back in time and pick up seasonalities on a scale of up to one year. That's why it's important to always give it the entire dataset for training and not just the window that you care about; that way it can pick up on those seasonal trends based on more historical data. DeepAR can actually use either a CPU or a GPU, which is kind of interesting. I mean, as an RNN, obviously it will benefit from a GPU, but given that it's pretty simple as far as deep learning goes, you can start off with a CPU if you want to see how it goes. An ml.c4.2xlarge or ml.c4.4xlarge might end up being cheaper than going with a P3 or a P2, so if you can get away with it, you might want to try that out first. For inference, it's pretty quick; you can get away with a CPU instance just for making inferences off of a pretrained model, but you may need larger instances while you're doing hyperparameter tuning jobs. One good thing, though, is that during training you can use a single machine or multiple machines, so it's easy to scale this out if you need to.
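To tie the hyperparameters and instance choices together, here is a minimal sketch of configuring a DeepAR training job with the SageMaker Python SDK. The bucket paths and hyperparameter values are placeholders for illustration, not recommendations:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes you're running inside SageMaker; otherwise pass an IAM role ARN

# Look up the DeepAR container image for the current region
image_uri = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.c4.2xlarge",   # CPU to start; switch to a P2/P3 instance if training is too slow
    output_path="s3://my-bucket/deepar/output",   # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    time_freq="H",           # hourly data
    prediction_length=48,    # keep this well under ~400 points
    context_length=48,       # time points the model sees before predicting
    epochs=100,
    num_cells=40,
    mini_batch_size=128,
    learning_rate=0.001,
)

# estimator.fit({"train": "s3://my-bucket/deepar/train/", "test": "s3://my-bucket/deepar/test/"})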
18. BlazingText in SageMaker
Next, let's cover the BlazingText algorithm. You would think with a name like BlazingText it would be this incredibly general-purpose, huge natural language processing thing, but in reality it's fairly constrained in what it can do. So don't let the name fool you. What is it for? Well, it can be used for a couple of different things. One is text classification: it can predict labels for a given sentence if you train the system with existing sentences and the labels you have associated with them. So this is a supervised learning system where you train it with a group of sentences and the labels you've associated with those sentences, and it can take that information and predict labels for new sentences that you pass in in the future. This is useful for things like web search or information retrieval. But remember, it's intended only for use with sentences, not entire documents. That's an important distinction to remember on the exam.
The other thing it can do is what we call Word2Vec. This is a word embedding layer that it creates: a vector representation of words where semantically similar words end up as vectors that are close to each other in the space of that embedding layer. Basically, it's finding words that are similar to each other. So you might ask yourself, well, how is that helpful? By itself it often isn't, but it can be a useful component for other natural language processing algorithms. You can use that embedding layer that gives you similar words when you're doing things like machine translation or sentiment analysis. But remember, Word2Vec only works on words. It does not work on entire sentences or entire documents. It's just giving you this embedding layer that puts similar words close to each other. That's all it does.
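To picture what "close to each other" means, here is a toy sketch of measuring similarity between word vectors with cosine similarity. The three-dimensional vectors are made up purely for illustration; real Word2Vec embeddings typically have a hundred or more dimensions:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up toy embeddings
king = np.array([0.90, 0.80, 0.10])
queen = np.array([0.85, 0.75, 0.20])
banana = np.array([0.10, 0.05, 0.90])

print(cosine_similarity(king, queen))   # high: semantically similar words sit close together
print(cosine_similarity(king, banana))  # low: unrelated words sit far apart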
So BlazingText does two things: supervised text classification, or Word2Vec, which is a word embedding layer for finding words that are similar to each other. Now, the training input that it expects is very specific. For supervised mode, when you're doing text classification, you're basically going to feed in one sentence per line, and the first so-called word in that sentence will be the string __label__ followed by the actual label for that sentence. It also accepts what we call augmented manifest text format, if you wish. And for Word2Vec mode, it just wants a text file with one training sentence per line. An example of what the input for supervised mode might look like is below. You see that we have __label__4, which means I'm going to associate label number four with the sentence that follows, "linux ready for prime time", and so on.
Notice that we've tokenized this so that every word is separated from other words by a space, and that also goes for punctuation. You'll notice that there are spaces around the commas and the periods as well. We've also lowercased everything to make sure we don't have to deal with case sensitivity in the algorithm either. So this is an example of what that actual input would look like. Remember that these input formats are actually important and could show up on the exam. The augmented manifest text format is also shown here, which is a little bit more structured: we have a source field and a label field, containing, again, the tokenized and preprocessed sentence and the actual label number.
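Here is a hedged reconstruction of what a couple of lines of that supervised-mode training file could look like (the first sentence is the one discussed above; the second is invented for illustration), followed by the same idea in augmented manifest format with its source and label fields:

__label__4  linux ready for prime time , intel says , despite all the hype
__label__1  stocks tumble as oil prices climb to a new high .

{"source": "linux ready for prime time , intel says , despite all the hype", "label": 4}
{"source": "stocks tumble as oil prices climb to a new high .", "label": 1}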
There are a few different modes of operation for Word2Vec. One is called CBOW, which is continuous bag of words. Envision, if you will, a bag of words; it's exactly like that. The idea is that these words aren't structured in a way where the order of the words matters. It's just a bag of disconnected words that it's learning relationships between, so the actual order of the words is thrown out during the training process. It's just the words themselves that matter. However, it also has two other modes, skip-gram and batch skip-gram, which are different. Batch skip-gram in particular can actually be distributed over many CPU nodes, which is kind of nice. With skip-gram, we're talking about n-grams there.
So the order actually does matter in that case. Important hyperparameters: again, this differs based on which mode you're using. For Word2Vec, where we're doing the embedding stuff, the mode is obviously very important: are you using continuous bag of words, skip-gram, or batch skip-gram? Also, learning rate, window size, vector dimension, and negative samples are all important parameters for tuning its performance. For text classification mode, the usual suspects for neural networks come into play: the number of epochs and the learning rate, and also word n-grams, which controls how many words it looks at together, and vector dimension. For the continuous bag of words and skip-gram modes of BlazingText, they recommend a P3 node; it supports a single CPU or single GPU instance, and it will take advantage of a GPU if you have one.
However, it can only use a single machine. Batch skip-gram, on the other hand, can use multiple CPU instances instead. So if you do need to scale horizontally, batch skip-gram would be the way to go, but you're going to be using CPUs instead of GPUs for that one. For text classification mode, they recommend a C5 node if you're using less than 2 GB of training data, and if you need something bigger, you could go up to an ml.p2.xlarge or an ml.p3.2xlarge. So it can go with either CPU or GPU on that one.
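As a rough sketch of how those hyperparameters come together, this is what configuring BlazingText in Word2Vec mode with the SageMaker Python SDK might look like; the bucket path and values are placeholders:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()
image_uri = image_uris.retrieve("blazingtext", session.boto_region_name)

bt = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,            # cbow and skipgram need a single instance; batch_skipgram can use several CPU instances
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/blazingtext/output",   # placeholder
    sagemaker_session=session,
)

bt.set_hyperparameters(
    mode="skipgram",             # "cbow", "skipgram", or "batch_skipgram"
    learning_rate=0.05,
    window_size=5,
    vector_dim=100,
    negative_samples=5,
)

# For supervised text classification you would instead use mode="supervised" with
# hyperparameters such as epochs, learning_rate, word_ngrams, and vector_dim.
# bt.fit({"train": "s3://my-bucket/blazingtext/train/"})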
19. Object2Vec in SageMaker
Next we'll cover Object2Vec. If you remember, when we talked about BlazingText we covered Word2Vec, and that was restricted to finding relationships between individual words in a sentence. Object2Vec is a similar idea: it's an embedding layer that shows you things that are similar to each other, but it's more general purpose. It works on arbitrary objects. So whereas Word2Vec operates at the individual word level, with Object2Vec you could actually operate on entire documents if you wanted to, or on entirely other kinds of things. Again, it's just an embedding layer. It's creating a low-dimensional, dense embedding of higher-dimensional objects. It's basically taking all the features and attributes of your objects and boiling them down to a lower-dimensional representation, where the vectors that represent each object capture how similar they are to each other. You can use this for a lot of things: computing the nearest neighbors of objects for clustering purposes (and you can visualize those clusters if you want to), or things like genre prediction, since things that are close together in this object embedding space are possibly in the same genre when you're dealing with movies or music or something like that. The usefulness of similar items or similar users also comes into play with recommendations and recommender systems.
When you're recommending stuff to people, obviously you want to show things that similar users also liked, or maybe show items that are similar to the items that you liked. Those are a couple of ways of doing that, but either way, for similar items or similar users, items and users could both be examples of objects that Object2Vec can find relationships between. And this is an unsupervised algorithm, so you don't need labeled data to train it; it can figure out what those similarities are based on the inherent data within its features. For the input, you have to tokenize the data into integers first. So you can't just throw an image or something in there and expect it to work. You need to give it a list of integers, basically, that represent something else.
What's a little unusual about Object2Vec is that the input consists of pairs of tokens or pairs of sequences of tokens. So you're dealing with things like sentence pairs, labels paired with sequences (for example, genres paired with a sequence of words that make up a description), customer pairs, product pairs, or user-item pairs. We're trying to find relationships between things based on these paired attributes, if you will. An example of what that actual format might look like is below, if you want to take a look at it.
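Here is a hedged example of what a couple of lines of that JSON Lines input could look like, with the two tokenized sequences of each pair in fields named in0 and in1 plus an optional relationship label; the integer tokens are made up:

{"label": 1, "in0": [6, 17, 606, 19, 53], "in1": [16, 21, 13, 45, 14]}
{"label": 0, "in0": [22, 1016, 32, 13, 25], "in1": [22, 32, 13, 25, 1016]}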
The way it works is that we basically have two paths going in parallel, one for each component of that pair. And remember, we can have pairs of different kinds of things there. So we have two input channels, and each input channel has its own encoder and its own input path. Those two input channels feed into a comparator, which generates the ultimate label. For each input path you can choose your own encoder, and there are a few choices here: average-pooled embeddings, a CNN, or a bidirectional LSTM. What works best for your data will vary. The comparator itself is followed by a feed-forward neural network, so at the end of the day you have a fairly complex system based on deep learning, and it's pretty impressive what it can do. You only need this high-level understanding in terms of the exam; the details are not going to be that important.
If you do need to tune this thing, the usual suspects for deep learning crop up again here: dropout, early stopping, the number of epochs, the learning rate, the batch size, how many layers you have, what activation function you're using, what optimizer you're using, what your weight decay is. I mean, it's just straight-up deep learning optimization here. There are also the enc0_network and enc1_network hyperparameters, and that's where you actually choose the encoder type for each input channel, which again can be a CNN, an LSTM, or a pooled embedding layer. As for what instance types you should choose for Object2Vec, it can only train on a single machine, although that machine can have multiple GPUs. So you can either go with a beefy CPU instance like an ml.m5.2xlarge, or you can use an ml.p2.xlarge.
And if you do need to step things up, you might want to go all the way up to an ml.m5.4xlarge or even an ml.m5.12xlarge. For inference they recommend using a P2 node, ml.p2.xlarge specifically. They also note that you want to use the INFERENCE_PREFERRED_MODE environment variable when you're setting up the image for inference. This environment variable allows you to optimize for encoder embeddings, which is what Object2Vec produces, as opposed to classification or regression. So it might be worth remembering that INFERENCE_PREFERRED_MODE is kind of an important environment variable when you're trying to tell your inference nodes what kind of work your neural network is doing: is it doing embeddings, or is it doing classification or regression?
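As a sketch of how those pieces might fit together, you could set the encoder hyperparameters at training time and then pass the INFERENCE_PREFERRED_MODE environment variable when creating the model for hosting; the bucket paths and values below are placeholders:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()
image_uri = image_uris.retrieve("object2vec", session.boto_region_name)

o2v = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,                 # Object2Vec trains on a single (possibly multi-GPU) machine
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/object2vec/output",   # placeholder
    sagemaker_session=session,
)

o2v.set_hyperparameters(
    enc0_network="bilstm",            # encoder for input channel 0: hcnn, bilstm, or pooled_embedding
    enc1_network="bilstm",            # encoder for input channel 1
    enc0_max_seq_len=50,
    enc1_max_seq_len=50,
    enc0_vocab_size=30000,
    enc1_vocab_size=30000,
    epochs=10,
    learning_rate=0.001,
)

# o2v.fit({"train": "s3://my-bucket/object2vec/train/"})

# When hosting for embeddings rather than classification or regression,
# set the environment variable on the model before deploying:
# model = o2v.create_model(env={"INFERENCE_PREFERRED_MODE": "embedding"})
# predictor = model.deploy(initial_instance_count=1, instance_type="ml.p2.xlarge")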