ISACA COBIT 5 – Measure (BOK V) Part 4
7. Sampling (BOK V.B.3)
Hey, welcome to this topic of sampling. Before we talk about sampling, let’s understand population. What is population? Population is the entire collection which we want to study. And this entire collection depends on what you want to study. So for example, if you want to study the height of grade ten students in a school, say school ABC, then that’s the population. So all the students in grade ten of school ABC are your population. But suppose you want to study the students in grade ten across the country; then your population is all the students in all the schools of the country who are studying in grade ten. So population is whatever you want to study. Now, depending on the population size, you might or might not be able to look at each and every item in that population.
So for example, if your population was all the grade ten students in the entire country, it might not be possible for you to measure the height of every one of them. That would be a big task and would cost you a lot of money. Whereas on the other hand, if you just want to study the height of students in grade ten of a particular school, you can measure the height of all the students. So you can study the full population in that case. So it depends whether you need to study the whole population or you need to take a sample out of that. And here, let me introduce another term, which is census. When you are studying the whole population — let’s say all the students in grade ten of a specific school, if that was your population, and you are measuring the height of all the students — then what you are doing is a census, as against sampling.
Sampling is when you take a limited number out of the full population. So here in this graphical form, the big circle is the population, the entire population which you want to study. And since the population is a huge one, you cannot look at each and every item in it. If you are measuring height, you cannot measure the height of every student. If you are looking at a chocolate manufacturing company, and this chocolate company is producing, let’s say, a million chocolates per day, you cannot study every chocolate; you cannot taste each one, you cannot take a weight measurement of each chocolate. What you will do is take a sample out of that, and based on that sample, you will decide whether your population — in this case, the million chocolates being produced per day — is conforming or not. So you take a sample, and the sample is represented here by a dark circle in the middle of the population.
When you take a sample, one thing you need to understand is that your sample should represent the population; it should be a representative selection from the population. And how can you ensure that? By making sure that the items you are picking are randomly selected. So if you randomly select, let’s say, 100 chocolates from those 1 million chocolates being produced, then you can say that your sample is a representative sample. Once you take a sample — and this becomes our sample on the right — you can study this sample. For example, in the case of the students in the class, if you wanted to measure their height, you measure the height of those sampled students; or if you wanted to check the weight of the chocolates being produced, you pick out 100 chocolates.
So let’s say the population was 1 million chocolates being produced per day, and the sample was, let’s say, 100 pieces. Once you take the measurements of those 100 pieces, which are a representation of the big population, any calculation you make from them — the average, the standard deviation or the range — is known as a statistic. So a statistic is something which you calculate from your sample. The average weight of the sampled chocolates or the average height of the sampled students will be a statistic. A statistic is a characteristic of a sample; as against that, the value which relates to the whole population is a parameter. So for example, if we have taken 100 chocolates and we found the average weight of those, that average weight would be a statistic.
But if we have those 1 million chocolates and we find out the average weight of that — which, as you know, is difficult and which you would practically never do — that number, that average, that standard deviation for the whole population is the parameter. So this is something which you need to understand here, the difference between statistic and parameter. A statistic is related to the sample and a parameter is related to the whole population. A parameter is something which, most of the time, you don’t know exactly; you don’t know what exactly is the average weight of those 1 million chocolates. You can only estimate it based on those 100 chocolates: from the statistic which you calculated, you assume that the parameter is going to be about the same. So whatever average weight you calculated based on those 100 chocolates, you will assume that the average weight of the 1 million chocolates also will be about the same. So the parameter most of the time is an unknown thing. Now, coming to how we denote a sample or a population.
For example, the number of items in the sample is represented by a small n. So the number of members in the sample is represented by a small n, but the number for the whole population is represented by a capital N. And when you calculate the mean, the mean of the sample is represented by x bar. So whenever you see x bar, that means it is the mean of the sample, but the mean of the whole population is denoted by the Greek letter mu. In the case of the standard deviation, for the sample it is shown as a small s, and for the whole population the standard deviation is sigma. And as I already told you, mu and sigma — the mean and the standard deviation for the whole population — many times are not possible to calculate. You estimate them based on the sample. Another important thing in sampling is, whenever you are taking a sample, make sure that the sample is free from bias. If it is free from bias, then it represents the population. So for a sample to represent the population, it has to be free from any bias. So after talking about sampling and population, why do we need sampling? I don’t think much justification or explanation needs to be provided here. You very well know that many times you cannot check the whole population. So for example, in the case of the chocolate factory, you cannot check the weight of each and every chocolate, with 1 million chocolates being produced per day.
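Putting this notation together, here is a small Python sketch of statistic versus parameter. The chocolate weights are simulated, made-up numbers, purely for illustration:

```python
import random
import statistics

# Hypothetical population: weights (in grams) of 1,000,000 chocolates.
# The numbers are simulated for illustration only.
random.seed(42)
population = [random.gauss(50.0, 2.0) for _ in range(1_000_000)]

N = len(population)                      # population size (capital N)
mu = statistics.mean(population)         # population mean (mu) -- a parameter
sigma = statistics.pstdev(population)    # population std dev (sigma) -- a parameter

sample = random.sample(population, 100)  # random, unbiased selection
n = len(sample)                          # sample size (small n)
x_bar = statistics.mean(sample)          # sample mean (x bar) -- a statistic
s = statistics.stdev(sample)             # sample std dev (s) -- a statistic

# In practice mu and sigma are unknown; x_bar and s are used to estimate them.
print(f"x_bar = {x_bar:.2f} estimates mu = {mu:.2f}")
```

In a real factory you would never have the full `population` list; only the sample line would exist, and `x_bar` and `s` would be your estimates of the unknown `mu` and `sigma`.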
So you need to take a sample and, based on the sample, make a judgment, an estimate about the population. That’s why you need sampling: sampling helps you reduce the cost and the time involved compared to studying the entire population. You cannot study the entire population, so you take a sample and based on that sample you make an estimate of the population. Now, on the next few slides we will look at the various ways to take a sample — the approaches to sampling. For sampling we have two main approaches.
One is probability sampling and another is non-probability sampling. What are these two? Probability sampling is a sampling approach in which every item in the population has an equal chance of being selected. So if you are selecting 100 chocolates out of 1 million chocolates, then every chocolate in that 1 million should have an equal chance of being selected. That said, you cannot simply select the first 100 chocolates produced on a day, because if you select those, then not every chocolate has an equal chance of being selected. Your 100 chocolates need to be randomly selected so that every chocolate has an equal chance. So that is probability sampling. Non-probability sampling is just the opposite: it is not probability sampling, meaning that not every item has an equal chance of being selected, or the probability of selection cannot be accurately determined. In the case of non-probability sampling, since not every item has an equal chance of being selected, your sample might not represent the whole population; your sample might be biased.
But then if that is the case, why do we have non-probability sampling at all? If there is that much of a problem with non-probability sampling, why do we use it in some places? Because sometimes it might be easy to do, and there might be less to lose. In both probability and non-probability sampling there are a number of approaches, so let us look at those approaches one by one on the next few slides. We will cover four types of probability sampling and three types of non-probability sampling. So here on this slide, the top four are probability sampling. And if you remember, probability sampling is where each item has an equal chance of being selected. In probability sampling we have simple random sampling, systematic random sampling, stratified random sampling and cluster sampling. So these are the four approaches for probability sampling.
And for non-probability sampling we could use accidental or convenience sampling, we could use judgmental sampling, or we could use quota sampling. Let’s study each of these, starting with simple random sampling. Simple random sampling is a type of sampling where every item in the population has an equal chance of being selected. And how do we do that? There are a number of ways. One way could be randomly picking. You might have seen a person drawing a lottery, putting all the slips in a hat and blindly picking one or two names from that hat, where all the names have been put on slips of paper. Every item has an equal chance of being selected. That is one approach. A second approach could be assigning each item a number and then randomly selecting numbers from a random number table. So in Microsoft Excel you can create a random number table, and based on that random number table, you can select the items which have been assigned those numbers.
This is the most common type of probability sampling, where you randomly select an item and every item has an equal chance of being selected. So after talking about simple random sampling, the next one is systematic random sampling. This one is also random, but it is systematic. Let’s take the example of the same chocolate manufacturing company. You have a million chocolates being produced per day and you want to draw a sample of 100 chocolates out of that 1 million. So the simplest thing could be to start picking a chocolate after every 10,000 chocolates produced. So you pick item number one, then you pick item number 10,001, then 20,001 and 30,001, and so on. And that way, out of a million, you would be able to pick 99 or 100 samples.
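Both of these approaches can be sketched in a few lines of Python; the chocolate IDs here are just illustrative numbers:

```python
import random

# A "population" of 1,000,000 chocolate IDs (illustrative only).
population = list(range(1, 1_000_001))

# Simple random sampling: every item has an equal chance of selection.
random.seed(0)
simple_sample = random.sample(population, 100)

# Systematic random sampling: one item at a regular interval --
# every 10,000th chocolate, as in the example above.
interval = 10_000
systematic_sample = population[::interval]  # items 1, 10001, 20001, ...

print(len(simple_sample), systematic_sample[:3])
```

Note that systematic sampling is usually made random by picking the starting point at random within the first interval; starting at item one, as here, is the simplest case.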
And this is systematic random sampling: here you pick one item at a regular interval. The third type of probability sampling is stratified random sampling. In stratified random sampling you make subgroups in the population and you select people from each subgroup. So for example, if you want to study some characteristic of the people in a town, you might decide that instead of randomly picking ten people from the town, you will pick five males and five females. Because if you randomly select ten people, you might end up with seven males and three females, or six males and four females, or some other combination. But if you need to make sure that men and women are equally represented in the sample, you might decide to select five males and five females. That way you will have equal representation of both in the sample. So what you actually did was subgroup the whole population: instead of the population of the whole town, you divided it into two subgroups, males and females.
So from each subgroup you selected five people. You might want to do this when your population is heterogeneous. When you don’t have a homogeneous group of people or a homogeneous population, you might want to divide that heterogeneous population into different subgroups, and from those subgroups you select a number of people or items. So this is stratified random sampling, where you divide the population into subgroups and select members or items from each subgroup. The next one is cluster sampling. In cluster sampling, instead of randomly picking from anywhere in the population, you pick as a cluster. For example, if you have to survey a store, instead of picking random customers all across the year, you might just pick a particular day, and on that particular day you study all the people who come to that store. So that way what you have done is you have clustered your selection.
By clustering your selection you can save time and money. Clustering could be by time, as in this store example, or clustering could be by geography. So rather than sampling the whole country’s population, you might just pick a city or a town and do your sampling from that town only, with the assumption that the people in this town represent the whole country as well. So that’s cluster sampling. After talking about four types of probability sampling, now let’s look at three types of non-probability sampling. The first one here is convenience sampling. This is also known as accidental sampling. Here the researcher is looking for convenience — the convenience of picking a sample which is easily available.
So if you have a room full of boxes and you enter the room, the first few things which you can easily reach are the ones you pick and take as a sample, rather than going deeper into the room and looking for random samples. So instead of random sampling, you just pick a few things which are easily accessible to you. Another case, as shown on this slide, is a researcher who goes to a mall and just selects the first five people he meets. This is convenience sampling. Those five people might not represent the whole population, but just for convenience, just to save some time and money, you might end up using this approach: just pick something which is convenient to you. The next type of non-probability sampling is judgmental sampling. Judgmental sampling is based on what the researcher thinks.
So for example, if you are doing an audit and you know beforehand that there might be some issue in the calibration process, then you might want to pick more samples for calibration. And if you know there is a problem with the pressure gauges, you might even want to go and pick only the calibration certificates for pressure gauges, because you have some background information and you want to pick the sample based on your own judgment, your own background knowledge. This knowledge could be from your past, or it could be something which you feel is important.
This is judgmental sampling: the judgment of the person who is selecting the sample. And the last type of sampling approach is quota sampling. Here you just fix a quota and you don’t look for any particular approach; you just need to fulfill that quota. For example, you just need to look at 2% of calibration records. Those might be random, might not be random; whatever way you select, you just need to meet that particular quota. So that’s quota sampling. This completes our discussion on different types of sampling approaches. We talked about probability sampling, where we covered four types, and we talked about three types of non-probability sampling approaches.
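As a quick recap of the stratified approach described earlier, here is a minimal Python sketch. The town of 1,000 people and their labels are entirely made up for illustration:

```python
import random

# Hypothetical town: 1,000 people, each randomly labeled male or female.
random.seed(1)
town = [(f"person{i}", random.choice(["male", "female"])) for i in range(1000)]

# Divide the population into two strata (subgroups).
strata = {"male": [], "female": []}
for name, gender in town:
    strata[gender].append(name)

# Pick five from each stratum so both are equally represented.
sample = random.sample(strata["male"], 5) + random.sample(strata["female"], 5)
print(sample)
```

A plain `random.sample(town, 10)` could easily return seven of one group and three of the other; sampling within each stratum is what guarantees the five-and-five split.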
8. Data Collection Plans and Methods (BOK V.B.4)
In this topic of data collection, earlier we talked about data types — nominal, ordinal, interval, ratio. We talked about qualitative and quantitative data. We talked about continuous and discrete types of data. We talked about sampling approaches and how we select a sample. Now let’s talk about the data collection plan. Because in the Measure phase of DMAIC you want to collect data, you need to make a plan for collecting that data. You need to decide beforehand what sort of data you are collecting and why you are collecting it; all these things you need to sort out before you start your measurement. That’s the reason you create a data collection plan. In a data collection plan, the first thing you need to understand is your goal and objective: why you are collecting this data and what you expect from it.
So, if your project, for example, is related to the diameter variation of a cable which you are producing, then you want to check the diameter of the pieces being produced. That’s your goal: to know how your process is performing. This is an example where you want to study how your cable production process is doing; there was variation in the diameter and you want to study that diameter. So that’s your goal. You need to be clear about why you want to collect data. And once you know why you want to collect data, then you need to have an operational definition for it. What is the operational definition of data? This tells exactly what needs to be measured.
So, for example, if an operator is picking a nut, picking a bolt and tightening them as a part of assembly, and you want to collect data related to the time it takes for the assembly, then you need to be clear that the time which you are looking at is from the moment the operator picks the first item to the moment the operator puts the assembled item in the tray. You need to be that specific. You just cannot say that the time for assembling a nut and bolt is such and such; you need to be clear about where that time starts, where it ends, et cetera. Because that is the operational definition of this specific data — how much, where — all these things you need to specify before you start collecting data.
You need to understand what the data type is: whether the data is qualitative or quantitative, whether it is continuous or discrete, whether it is nominal, ordinal, interval or ratio, and whether this data will be collected manually or automatically. Automatic collection of data generally could be more reliable if you have thought it out properly. Another thing you need to look at is whether you want to collect data now or you are looking at past data. Because if you are looking at past data, there is a good chance of someone manipulating that data, because someone could selectively pick data from the past. So you need to be clear whether you are looking at historical data or you are looking to collect data from future production or future items, and make sure that whatever you do, your data is reliable. Your data being reliable means that the sample should represent the population.
That is something which you need to make sure of when you are collecting data. So, what I have done on the next slide is I have made a data collection plan. Have a look at this sample data collection plan. This plan could be different in different organizations and could differ from project to project, but it will give you a good idea of what a data collection plan could look like. So here I have one example of a data collection plan. The first item here is the measurement — what the measurement is — and as I was saying on the previous slide, this is the time for assembly.
So, the measurement is the time for assembly, and the operational definition of this is the time from picking up the first piece to placing the assembled item in the tray. You need to be very specific when you are writing the operational definition, because the operational definition should be such that no one could misinterpret it. Then, how is this measured? This is measured using a stopwatch. What type of data is this? This is continuous data on a ratio scale. How are we sampling? It was decided that every 10th piece, every 10th assembly, will be sampled.
The operator will be the one to make this measurement. And for the data recording form, here I have just put a form number which this company might have. So the operator will be putting the value of the time taken for assembly in the assembly record with this number. And then there is a provision to put some comments also. So this is one example of how your data collection plan could look.
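If you wanted to keep such a plan in code or a configuration file, one row of it could be sketched as a simple record. All field values follow the lecture's example; the recording-form name is a placeholder, since the actual form number is company-specific:

```python
# One row of a data collection plan as a plain record (illustrative only).
plan_entry = {
    "measurement": "Time for assembly",
    "operational_definition": (
        "Time from picking up the first piece "
        "to placing the assembled item in the tray"
    ),
    "measurement_method": "Stopwatch",
    "data_type": "Continuous, ratio scale",
    "sampling": "Every 10th assembly",
    "measured_by": "Operator",
    "recording_form": "Assembly record (company form number)",
    "comments": "",
}
print(plan_entry["measurement"], "-", plan_entry["sampling"])
```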
9. Data Collection Plans and Methods – Data Coding (BOK V.B.4)
Hey, welcome back. On these next few slides, let’s talk about data coding. What is data coding? You change the data to make things easier for you. So suppose you have data something like 0.553, 0.557, 0.559, et cetera. If this was the data to be recorded by the operator every time, there is a good chance that the operator will make some mistake in recording it. Instead of this, we can ask the operator to forget about the leading 0.55 — because every item in this data starts with 0.55 — and just note down the last digit: three, seven, nine, et cetera. This makes things much simpler, much easier for the operator. Now there are fewer chances of making a mistake.
So what we have done here is data coding: instead of writing 0.553, the operator can just write three. What is the impact of this data coding on the mean and the standard deviation calculation? How does this work? Let’s understand that on the next two or three slides. One way we can do data coding is by adding or subtracting a number. Adding or subtracting a number means that instead of writing 1003, 1005, 1009, et cetera, we can just write three, five, nine. So what will be the impact? The main data was 1003, 1005, 1009, and instead of that we recorded three, five, nine. What is the impact if I calculate the mean of the original data versus the mean of the coded data? The three, five, nine is my coded data; the 1003, 1005, 1009 was my original data. So what is the impact when we calculate the mean and the standard deviation? Let’s understand that. So here in this example, I have some data.
The data is -95, -97, -98 and -90. To do the coding and make things easier, I have added 100 to each of these. That converted my data: -95 plus 100 gave me five, -97 plus 100 gave me three, -98 plus 100 gave me two, and -90 plus 100 gave me ten. So my coded data is 5, 3, 2, 10. Then I calculated the mean of that: adding everything and dividing by four, so five plus three plus two plus ten divided by four. This gave me a coded mean of five. So if the coded mean is five, what is the mean of the original, uncoded data? We work backwards from the coded mean.
Since for coding I added 100, now I can subtract 100 from the coded mean. The coded mean was five; if I subtract 100, this gives me -95. So -95 is the mean of the original data, which you can verify: the mean of -95, -97, -98 and -90 is -95. So the coded data helped me in quickly calculating the mean, and subtracting 100 from it gave me the uncoded, original data mean. So this was the case for the mean. What happens with the standard deviation? If you calculate the standard deviation of the coded data or the uncoded data, it remains the same. It doesn’t change — but only in the case of adding and subtracting. So if your coding was done by adding or subtracting a number, that doesn’t change the standard deviation. This will not be the case when we do multiplication or division.
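The whole addition example can be checked in a few lines of Python:

```python
import statistics

# The addition-coding example: code the data by adding 100.
original = [-95, -97, -98, -90]
coded = [x + 100 for x in original]          # [5, 3, 2, 10]

coded_mean = statistics.mean(coded)          # 5
original_mean = coded_mean - 100             # undo the coding: -95

# Adding or subtracting a constant shifts the mean but
# does not change the spread, so the two std devs match.
print(original_mean,
      round(statistics.stdev(original), 3),  # 3.559
      round(statistics.stdev(coded), 3))     # 3.559
```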
The standard deviation remains the same only when we do addition or subtraction. So if the coding which we did used addition or subtraction, the standard deviation doesn’t change. You can check that: take a calculator and find the standard deviation of -95, -97, -98 and -90, and then find the standard deviation of 5, 3, 2, 10. Both will be the same, which is 3.559. Let’s move on to multiplication and division. What will happen if we do coding using multiplication or division? Earlier we looked at data coding using addition or subtraction; now let’s look at using multiplication or division. So here I have my original data, which is 1.05, 1.03, 1.02 and 1.10. To make things convenient for the operator, suppose I decided to multiply everything by 100.
So if I multiply everything by 100, then my coded data becomes 105, which comes from 1.05 multiplied by 100, then 103, 102 and 110. So this is my coded data. It’s easy to calculate, easy to handle. Then I calculate the mean of this coded data. The mean of 105, 103, 102 and 110 comes out to be 105. Now, from this coded mean, how do I find the mean of the original data? Since I multiplied the original data by 100, I can just reverse that by dividing by 100. So if I divide 105, which was the mean of the coded data, by 100, this gives me the mean of the uncoded, original data, which is 1.05. This is just like addition and subtraction: if you added something to do the coding, then you subtract that from the coded mean to find the original mean.
Similarly, in multiplication and division: here we multiplied everything by 100, so at the end, if you divide by 100, you will get the mean of the original data. In the case of the standard deviation, there was no effect when we added or subtracted from the original data to do the coding. However, when we do multiplication or division, we need to do the reverse operation to find the standard deviation of the original data. Here we multiplied everything by 100, and from the coded data 105, 103, 102, 110 we got a standard deviation of 3.559. To find the standard deviation of the original data, you need to divide that by 100, which gives a standard deviation of 0.03559. So, moving forward: earlier we talked about the effect of addition and subtraction on coding, and we looked at multiplication and division. Now let’s look at a more complex example, where we have truncated the repetitive part.
And what does that mean? Suppose my data was 0.555, 0.553, 0.552 and 0.550. So this was my original data, and in this original data I see that 0.55 is repeating. If this part is repeating in all your data, then you might just want to drop it. You might ask the operator to just record the last digit, which is five, three, two and zero. So this is what the operator has done to make things easy: truncated 0.55 from all the data. What is the operator in fact doing? If you look at it, the operator is multiplying each number by 1000 and subtracting 550. For example, take 0.550: multiply it by 1000, which gives you 550, and then subtract 550 from that, which gives zero. So when the operator truncates 0.55 from every piece of data, the operator is effectively multiplying everything by 1000 and subtracting 550.
Now, from this truncated data we got a mean of 2.5, because five plus three plus two plus zero divided by four gives us a coded mean of 2.5. Then how do we find the mean of the original data? Since we subtracted 550, we add 550 back to compensate for that, and since we multiplied by 1000, we divide by 1000. And this gives us the mean of the original data as 0.5525. So this is how you find the mean of the original data from the coded data. Now coming back to the standard deviation.
As we said earlier, the standard deviation is not affected by addition or subtraction; it is affected by multiplication or division. So when we calculated the standard deviation of our coded data, which was 5, 3, 2, 0, using a calculator, it came out to be 2.0817. The subtraction of 550 doesn’t matter here; you can just ignore that. The only thing which you need to take care of is that because you multiplied your data by 1000, you need to divide the standard deviation by 1000. And this gives you the standard deviation of the original data as 0.0020817. So this is the standard deviation for your uncoded data.
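The whole truncation example — multiply by 1000, subtract 550, then decode the mean and standard deviation — can be sketched as:

```python
import statistics

# The truncation example: dropping the leading "0.55" is the
# same as multiplying by 1000 and then subtracting 550.
original = [0.555, 0.553, 0.552, 0.550]
coded = [round(x * 1000) - 550 for x in original]   # [5, 3, 2, 0]

coded_mean = statistics.mean(coded)                 # 2.5
original_mean = (coded_mean + 550) / 1000           # 0.5525

# Only the multiplication affects the spread, so divide the
# coded standard deviation by 1000 to decode it; the
# subtraction of 550 has no effect on it.
coded_sd = statistics.stdev(coded)                  # about 2.0817
original_sd = coded_sd / 1000                       # about 0.0020817
print(original_mean, round(original_sd, 7))
```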
10. Data Collection Plans and Methods – Missing Data (BOK V.B.4)
Another thing which we need to learn in this topic is data cleaning, for when you have missing data. In statistics, imputation is the process of replacing missing data with a substitute value. So one term you need to understand here is imputation: imputation is the process of replacing missing data. Let’s understand this with a simple example. This in itself could be a very complex subject, but for our level of understanding, let’s keep it simple. Suppose you went to a class and you noted down the height and weight of the students.
So you made a table with height here and weight here, and you wanted to establish a correlation between height and weight. So you noted down the heights: one height was 150 and the weight of that student was 60; the next student’s weight was 62; and so on. But suppose for one of the students, whose height was, let’s say, 150, you didn’t have the weight — somehow you missed noting down the weight of this student.
Say it was student number four. What do you do with that? If we just use common sense, you have two options here. One option would be to drop this data: you can just drop this row from the table and look at the rest. The other option could be to put in an approximate weight for this student: looking at all the students who are of height 150 cm, you take their average weight and put that average weight here. So either of these two things makes common sense.
How you do that, and what you do, will depend on the situation. The first thing you need to understand is that missing data can introduce bias. And for that you need to understand whether the data was missing randomly or there was a reason for it being missing. That is what will decide whether you delete this data or do something about it. If this was a randomly missing piece of information, okay, it’s not a big deal if you drop this data — out of 60 students, you would have 59 students in your table.
But sometimes there is a reason for the missing data. Suppose you make phone calls and ask people how many members are in their family and what their income is. You might see that people with higher incomes want to hide their income. If there is a specific reason like that, then the whole data table which you are making could be biased. So you note down the members in the family and the income: low income people will tell you their income, but very high income people will not. That will distort the facts in your survey. So that’s something you need to be aware of: whenever you see missing data, make sure that the missing data doesn’t introduce some sort of bias in the final result. This in itself is a complex subject, and we are not going into that depth. We are just limiting our discussion to this common sense approach, which is: either delete the data or replace it with an average value, and make sure that you don’t have a bias because of this missing information.
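The two common-sense options — drop the incomplete row, or impute the average of similar rows — can be sketched as follows. The heights and weights are made-up numbers for illustration:

```python
# Made-up class data; student 4's weight was not recorded (None).
heights = [150, 152, 150, 150, 155]
weights = [60, 62, 61, None, 64]

# Option 1: drop the incomplete record.
complete = [(h, w) for h, w in zip(heights, weights) if w is not None]

# Option 2: impute the missing weight with the average weight of
# students of the same height (simple mean imputation).
rows = []
for h, w in zip(heights, weights):
    if w is None:
        same_height = [wt for ht, wt in complete if ht == h]
        w = sum(same_height) / len(same_height)  # average of 60 and 61
    rows.append((h, w))

print(rows[3])  # the imputed record: (150, 60.5)
```

Either option is only safe when the data is missing at random; if there is a systematic reason for the gap, as in the income survey example, both dropping and mean imputation can bake the bias into your results.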