Amazon AWS Certified Machine Learning Specialty – Exploratory Data Analysis
17. Binning, Transforming, Encoding, Scaling, and Shuffling
Let’s quickly go through some other techniques you might use in the process of feature engineering. One is called binning. The idea here is just to take your numerical data and transform it into categorical data by binning these values together based on ranges of values. So as an example, maybe I have the ages of people in my data set. I might put everyone in their 20s into one bucket, everyone in their 30s into another bucket, and so on and so forth.
That would be an example of binning, where I'm just putting everyone in a given range into a certain category. So instead of training on the fact that you're 22 and three months old, I'm just going to bucket you into the bin of 20-year-olds, right? I've changed that number of 22-point-whatever into the category of 20-somethings. That's all binning is. Why would you want to do that? Well, there are a few reasons. One is that sometimes you have some uncertainty in your measurements.
So maybe your measurements aren't exactly precise, and you're not actually adding any information by saying this person is 22.37 years old versus 22.38 years old. Maybe some people remembered the wrong birthday, or you asked them on different days and got different values as a result. So binning is a way of covering up imprecision in your measurements.
That’s one reason. Another reason might be that you just really want to use a model that works on categorical data instead of numerical data. That’s kind of a questionable thing to be doing because you’re basically throwing some information away by binning, right? So if you’re doing that, you should think hard about why you’re doing that. The only really legitimate reason to do this is if there is uncertainty or errors in your actual underlying measurements that you’re trying to get rid of. There’s also something called quantile binning that you should understand. The nice thing about quantile binning is that it categorizes your data by their place in the data distribution.
So it ensures that every one of your bins has an equal number of samples within it. With quantile binning, I make sure my data is distributed in such a way that I have the same number of samples in each resulting bin. Sometimes that's a useful thing to do. So remember, quantile binning will give you even sizes in each bin.
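As a quick illustration, here's a minimal sketch of both kinds of binning using pandas, assuming a made-up list of ages: pd.cut does fixed-width binning by decade, while pd.qcut does quantile binning so each bin ends up with roughly the same number of samples.

```python
# A minimal sketch of fixed-width vs. quantile binning with pandas.
# The "age" values below are made up purely for illustration.
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 41, 47, 52, 58, 63, 70], name="age")

# Fixed-width binning: everyone in their 20s, 30s, 40s, ... lands in one bucket.
age_bins = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70, 80], right=False,
                  labels=["20s", "30s", "40s", "50s", "60s", "70s"])

# Quantile binning: bin edges are chosen so each bin holds roughly the same
# number of samples, regardless of how the values are spread out.
age_quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"age": ages, "fixed": age_bins, "quantile": age_quartiles}))
```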
Another thing we might do is transforming our data: applying some sort of function to our features to make them better suited for our algorithms. For example, if you have feature data with an exponential trend in it, it might benefit from a logarithmic transform to make that data look more linear, which might help your model actually find real trends in it. Sometimes models have difficulty with nonlinear data coming into them. A real-world example is YouTube; they published a paper on how their recommendations work, which is great reading, by the way. There's a reference to that in the slide here.
They have a whole section on feature engineering there that you might find useful. One thing they do is, for any numeric feature x that they have, for example, how long it's been since you watched a video, they also feed in the square of that and the square root of it. The idea is that they can learn super-linear and sub-linear functions in the underlying data that way. So they're not just throwing in raw values; they're also throwing in the square and the square root, just to see if there actually are nonlinear trends there that they should be picking up on. They found that this actually improved their results. So that's an example of transforming data. It's not necessarily replacing data with a transformation; sometimes you're actually creating a new feature by transforming an existing one. That's what's going on here: they're feeding in the original feature x along with x squared and the square root of x. You can see in this graph here why you might want to do that.
So if I'm starting off with a function of x here on the green line, you can see that by taking the ln, the natural logarithm of that, I end up with a linear relationship instead, which might be easier for models to pick up on. I could also raise it to a higher power, which would actually make things worse in this case, but sometimes more data is better. Again, we're talking about the curse of dimensionality, so there is a limit to that. But that's what feature engineering is all about: trying to find that balance between having just enough information and too much information.
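To make that concrete, here's a minimal sketch of this kind of feature transformation, assuming a hypothetical numeric feature called days_since_last_watch: we keep the raw value and add squared, square-root, and log versions as new columns rather than replacing the original.

```python
# A minimal sketch of adding transformed versions of a numeric feature.
# "days_since_last_watch" and its values are hypothetical examples.
import numpy as np
import pandas as pd

df = pd.DataFrame({"days_since_last_watch": [1, 3, 7, 14, 30, 90]})

x = df["days_since_last_watch"]
df["days_squared"] = x ** 2      # lets the model pick up super-linear trends
df["days_sqrt"] = np.sqrt(x)     # lets the model pick up sub-linear trends
df["days_log"] = np.log1p(x)     # log1p stays well-defined even at zero

print(df)
```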
Another very common thing you'll do while preparing your data is encoding, and you see this a lot in the world of deep learning. A lot of times your model will require a very specific kind of input, and you have to transform your data and encode it into the format that your model requires. A very common example is called one-hot encoding, so make sure you understand how this works. The idea is that I create a bucket for every category that I have, where a one represents that the category is present and a zero represents that it's not that category.
Let's look at this picture as an example. Let's say that I'm building a deep learning model that tries to do handwriting recognition on people drawing the numbers zero through nine. This is a very common example that we'll look at more later. To one-hot encode this information, I know that this image represents the number eight. To represent that in a one-hot encoded manner, I have ten different buckets, one for every possible digit it might represent: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Note that we start counting at zero here.
So you can see here that in the ninth slot there's a one that represents the number eight, and every other slot has a zero representing that it is not that category. That's all one-hot encoding is. So again, if I had a one in that first slot, that would represent the number zero. If I had a one in the second slot, that would represent the number one, and so on and so forth.
We do this because in deep learning, neurons generally are either activated or not activated. So I can't just feed the number eight or the number one into an input neuron and expect it to work; that's not how these things operate. Instead, I need this one-hot encoding scheme, where every single training label is actually going to be fed into ten different input neurons, and only one of them represents the actual category I have. So stare at that picture a little bit and make sure you understand it. If you're not familiar with one-hot encoding, that is probably something you'll see on the exam.
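Here's a minimal sketch of one-hot encoding those digit labels in Python; the np.eye trick shown here is just one convenient way to do it (scikit-learn's OneHotEncoder or pandas get_dummies would work equally well).

```python
# A minimal sketch of one-hot encoding digit labels 0-9.
import numpy as np

labels = np.array([8, 0, 3])     # example raw digit labels
one_hot = np.eye(10)[labels]     # row i of the 10x10 identity matrix is the one-hot vector for digit i

print(one_hot[0])                # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]  <- a 1 in the slot for 8
```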
We can also talk about scaling and normalizing your data. Again, pretty much every model requires this as well. A lot of models prefer their feature data to be normally distributed around zero, and this is also true of most deep learning and neural networks. At a minimum, most models will require that your feature data is at least scaled to comparable values. There are models out there that don't care so much, such as decision trees, but most of them will be sensitive to the scale of your input data. Otherwise, features with larger magnitudes will end up having more weight in your model than they should. Going back to the example of people: if I'm trying to train a system based on their income, which might be some very large number like 50,000, and also their age, which is a relatively small number like 30 or 40, and I weren't normalizing that data down to comparable ranges before training on it, that income would have a much higher impact on the model than their ages. And that's going to result in a model that doesn't do a very good job.
Now, it's very easy to do this, especially with scikit-learn in Python. It has a preprocessing module that helps you out with this sort of thing, including something called MinMaxScaler that will do it for you very easily. The only thing is, you have to remember to scale your results back up if what you're predicting is not just categories but actual numeric data.
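Here's a minimal sketch with MinMaxScaler, using made-up income and age columns to show the magnitude problem described above; note the inverse_transform call for mapping scaled values back to their original units.

```python
# A minimal sketch of scaling with scikit-learn's MinMaxScaler.
# The income/age values are made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[50000.0, 30.0],
              [82000.0, 45.0],
              [31000.0, 22.0]])      # income and age are on very different scales

scaler = MinMaxScaler()              # rescales each column to the 0-1 range
X_scaled = scaler.fit_transform(X)

# If your scaled values need to be reported in their original units,
# map them back with inverse_transform.
X_restored = scaler.inverse_transform(X_scaled)
print(X_scaled)
```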
So sometimes, if you're predicting something, you have to make sure to apply that scaling in reverse to actually get a meaningful result out of your model at the end of the day. Finally, we'll talk about shuffling. A lot of algorithms benefit from shuffling their training data. Otherwise, sometimes there's a residual signal in your training data resulting from the order in which that data was collected. So you want to make sure you're eliminating any byproducts of how the data was actually collected by shuffling it and randomizing the order in which it's fed into your model. Often that makes a difference in quality as well. There are a lot of stories I've seen where someone got a really bad result out of their machine learning model, but just by shuffling the input, things got a lot better. So don't forget to do that as well. And that's the world of feature engineering in a nutshell.