SAP-C02 Amazon AWS Certified Solutions Architect Professional – New Domain 5 – Continuous Improvement for Existing Solutions Part 12


56. AWS VPC Flow Logs

Hi everyone and welcome back to the Knowledge Portal video series. Today we are going to talk about a very important topic called flow logs. Basically, what flow logs allow us to do is see what type of traffic is coming to a particular network interface. Let me give you a very simple example, with a slide we have already looked at in the earlier video lectures. Here we have a security group that allows access on port 22 from one particular IP. So this is a genuine user: when he tries to SSH into this server, he will be allowed. However, there can be a lot of hackers as well who will also try to SSH, and as there is no security group rule allowing their IPs, the security group will block or deny their access.

Now, as security engineers, we should know what kind of packets are getting blocked at the security group level so that we can have a better understanding of where the malicious traffic is coming from. One of the amazing features that AWS provides allows us to see exactly what is coming in over here, what is getting accepted and what is getting blocked. We have already looked at this; I just wanted to add the slide to show that the security group is always associated at the network interface level. Basically, the flow log also works at the network interface level, and it allows us to check what type of traffic is coming in and whether that traffic is getting accepted or rejected by the security group. So let's go to our favorite AWS console, where we have an EC2 instance running in a public subnet with a public IP.

So if we just click on this particular interface, you see that there is a flow logs section over here. What this flow log allows us to do is monitor the traffic of this particular interface. Now, one important thing to remember is that, as there are a lot of interfaces that you see over here, AWS also allows us to enable flow logs at a more global level. So if, let's say, you have 100 servers, one option is to go into each interface and enable the flow log individually, or you can directly go to your VPC and enable the flow log there. In my case, I already have a flow log enabled. Let me just delete this particular flow log; I was using it for testing.

So what I'll do is create a new flow log, and the first thing which will be required over here is to set up the permissions. I'll click here, and basically you need to create a new IAM role. Amazon has already pre-filled the policy document. What this basically does is allow the VPC Flow Logs service to create a log group and put events inside that log group.
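For reference, here is a minimal boto3 sketch of roughly what this console wizard sets up. The role name, policy name, and VPC ID are hypothetical placeholders; the permissions mirror the pre-filled policy document just described.

import json
import boto3

iam = boto3.client("iam")

# Trust policy: let the VPC Flow Logs service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "vpc-flow-logs.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions: create the log group/streams and put log events into them.
permissions = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams",
        ],
        "Resource": "*",
    }],
}

role = iam.create_role(RoleName="flowlogsRole",  # hypothetical name
                       AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.put_role_policy(RoleName="flowlogsRole",
                    PolicyName="flowlogsPolicy",  # hypothetical name
                    PolicyDocument=json.dumps(permissions))

# Enable the flow log at the VPC level; "Subnet" and "NetworkInterface"
# are the other supported resource types.
ec2 = boto3.client("ec2")
ec2.create_flow_logs(ResourceType="VPC",
                     ResourceIds=["vpc-0123456789abcdef0"],  # placeholder VPC ID
                     TrafficType="ALL",  # ACCEPT, REJECT, or ALL
                     LogGroupName="kplabs-flow-logs",
                     DeliverLogsPermissionArn=role["Role"]["Arn"])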

So I just click on Allow over here. Okay, let me go back to my VPC and try again. I'll create the flow log, select the role, and give a destination log group name; let me type kplabs-flow-logs. Okay, I'll create the flow log, and now you see the status is active. It is also showing the CloudWatch log group as kplabs-flow-logs. So let's do one thing: let's go back to the EC2 instance, because we have to generate some kind of logging. I'll go to the EC2 instance, copy the public IP, and let's just verify the security group. The security group is only allowing port 80. So let's try to generate some traffic which we know will be blocked, for example ICMP traffic. If I just send a ping, you see it will not reach, because the security group is not allowing it. Let's generate one more packet, say a telnet on port 22.

We know that port 22 is not allowed, so it will not work. Let's also try to telnet on 3306, which will again not be allowed. So all of this traffic will be rejected at the security group level, and all of these entries will be present in the new log group of the flow log that we just created. In order to verify, just open up CloudWatch and go to Logs. At first you will see that there is no log entry created. Generally what happens is that the first time you create this log group in CloudWatch, it takes around four to five minutes to populate the data. So if you look over here, you see this particular log group is already created, but the data is not yet populated. Let me just try to open this and see if it works now.

Okay, there is some error; ideally it comes up in a minute or two, so let's wait a few more seconds. One thing that is really good here is that once you start to capture the flow logs, you'll see an amazing chemistry between your servers and the hackers: you'll actually get insights into what is happening and what the hackers are trying to do, which is very interesting. Let me try to refresh this page. Okay, it might take some time, so let me do one thing: let me pause this video for a while, and in a minute or two, check whether it is up and running. Okay, so it has taken around five minutes for the log group to be created, and it is now created.
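As an aside, instead of waiting in the console, you could also pull these entries programmatically. A small boto3 sketch, assuming the kplabs-flow-logs log group from this demo already has data:

import boto3

logs = boto3.client("logs")

# Fetch only the entries that the security group rejected; flow log
# records carry a literal ACCEPT or REJECT token we can filter on.
resp = logs.filter_log_events(logGroupName="kplabs-flow-logs",
                              filterPattern="REJECT")
for event in resp["events"]:
    print(event["message"])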

Now, let me open this particular log group. This is the interface of the public EC2 instance, and you see these are all the flow logs. Let's just tune it to the last five minutes, since it has already been about five minutes since we paused this video. One very interesting thing that you will see is that there are a lot of unknown IPs trying to connect to my public instance. Very interesting. So let's take this IP address and do an IP trace. As I said, there is a very interesting chemistry between hackers and the EC2 instance which you'll be able to see in the VPC flow logs, and you see this one is from Hong Kong; I believe it's China.

Generally you will find a tremendous amount of packets getting rejected from many countries, and China is at the top of the list; in fact, many enterprises just block a lot of Chinese subnets because of the amount of malicious traffic that comes from there. Anyway, coming back to our main topic, let's select one entry over here. Notice the one starting with 116: this is my IP, and if you remember, what we did was try to telnet to port 3306, which is basically the MySQL port. Here you see that the VPC flow log is saying that someone from this particular IP tried to connect to port 3306 and that it was rejected; that means the security group blocked this particular packet. Now, it is very important, even for the exam, that you understand exactly what each and every field within the log means. So let's go back to the presentation and understand each field from the log file. What I've done is copy a sample VPC flow log, so let's go through each field over here. The first field is the version, the VPC flow log version, which is 2. The second field is the account ID. The third field is the interface ID.

The fourth field that you see over here is the source IP address; this is the IP address the packet is coming from. The next field is the destination IP of the EC2 instance; just remember this will always be the private IP address of the EC2 instance. Next is the source port, and this is the destination port, followed by 6, which is the protocol number; 6 denotes TCP. Next is the number of packets transferred, followed by the number of bytes transferred. The next two fields are the start time and the end time in Unix seconds. The second-to-last field is the action, which can be either accept or reject.

In our case it is REJECT, and the last field is OK, which is basically the log status; it means that this particular entry was stored in the VPC flow log. So, two important things to remember as far as the exam is concerned: first, you need to be very thorough with what each and every field here means (the small sketch below lays the fields out against a sample record); second, remember that a flow log can be enabled at the individual interface level, at the subnet level, and at the VPC level. Let me show you the interface level.
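To make the layout concrete, here is a rough Python sketch that splits one record into named fields. The record itself is illustrative: the account ID, interface ID, addresses, and timestamps are made-up placeholders, but the field order matches the default version 2 format we just walked through.

# Default (version 2) flow log record layout, space separated.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

# Illustrative record: someone probing port 3306 gets rejected.
sample = ("2 123456789012 eni-0a1b2c3d 116.75.0.25 172.31.5.10 "
          "52044 3306 6 1 40 1563108188 1563108227 REJECT OK")

record = dict(zip(FIELDS, sample.split()))
print(record["srcaddr"], record["dstport"], record["action"])
# -> 116.75.0.25 3306 REJECT  (protocol 6 = TCP)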

So if I go back to the EC2 instance, there is an eth0 interface. Let me click here. Since we have already enabled flow logs at the VPC level, the VPC automatically enables them on all the interfaces which are connected. So this is one interface, and you see the flow log is already active. Again, you can enable the flow log at the interface level, at the subnet level, or at the VPC level. These are the things that are very important in real life as well as from an exam point of view.

57. AWS Elastic MapReduce

Hey everyone and welcome back to the Knowledge Portal video series. In today's lecture we will be speaking about Elastic MapReduce. This is quite an important service, specifically when it comes to the big data world. Before we understand what exactly EMR is all about, let's spend some time understanding what big data is all about. Big data is basically data that goes beyond generic storage capacity and generic processing power. What do I mean by this? From the term itself we can tell that big data is data which is going to be huge, and when it comes to huge amounts of data, traditional log monitoring or log processing tools are not good enough. Take, for example, the ELK stack: Elasticsearch, Logstash, and Kibana. When you put terabytes of data in and try to process it, they will not work as well. So there was a need to develop a new technology which would specifically solve the big data use case. Now, what do I mean when I talk about a huge amount of data?

Specifically, think about sensors, social networks, or even online shopping websites; these are some of the sources of big data. One example is the Twitter feed: if you want to count how many tweets are coming in every second or every minute, it will be huge, typically terabytes of data. The technology which can process those terabytes of data is basically completely different. If you want to see how it would work, take this simple example where you have large click-stream logging data, in terabytes. The way you process it is by splitting the entire log file into multiple small pieces. So on the left-hand side we have the large click-stream logging data; you split that data into multiple small chunks, process the small chunks individually in parallel, and then aggregate the results to get the final answer.
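The same split-process-aggregate pattern can be sketched in a few lines of plain Python, with no Hadoop involved. This is only a toy illustration: the log lines and the per-user counting are made up for the example.

from collections import Counter
from multiprocessing import Pool

def count_chunk(lines):
    """Map step: count events per user in one chunk of the log."""
    counts = Counter()
    for line in lines:
        user = line.split()[0]  # assume the first field is the user ID
        counts[user] += 1
    return counts

if __name__ == "__main__":
    # Toy click-stream log, then split it into small chunks.
    log = ["user%d clicked /page/%d" % (i % 3, i) for i in range(12)]
    chunks = [log[i:i + 4] for i in range(0, len(log), 4)]
    with Pool() as pool:
        partials = pool.map(count_chunk, chunks)  # process chunks in parallel
    total = sum(partials, Counter())              # reduce: aggregate the results
    print(total)  # Counter({'user0': 4, 'user1': 4, 'user2': 4})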

So let's assume that from all the Twitter feeds you want to get the activity of a specific user, and you only have a huge log file which is typically terabytes in size. You split that log file into smaller chunks, you create individual servers, and each server is responsible for processing a part of that log file. At the end, all the servers aggregate their results, and you come to know what an individual user did. This is how big data solutions really work: data is broken down into smaller portions, each portion is handled by an individual set of nodes, and the results are then aggregated. I hope you understood what the technology related to big data is and how it is built, because this is essentially how EMR, or Elastic MapReduce, really works. So with this, let's talk about EMR. EMR, or Elastic MapReduce, is a hosted version of the Apache Hadoop clustering software. There is a really nice piece of software called Apache Hadoop; let me just show you.

If you search for Apache Hadoop, you will see it is one of the big data frameworks which is commonly used everywhere, and there are certain components of the Hadoop software, which include MapReduce and the HDFS file system. So let's go back to our presentation to understand more about it. There are certain primary components of EMR, and since EMR is a hosted version of Apache Hadoop, a lot of things, in fact most things, will remain similar as far as the concepts are concerned. There are three primary components of an EMR cluster that we have to remember, and these are the master node, the core node, and the task node. Now, when I talk about EMR, I am basically talking about a cluster.

A cluster is basically a collection of EC2 instances which have the Hadoop software installed. In this diagram you see an Amazon EMR cluster; thanks to the AWS documentation, I was able to get the diagram instead of drawing the entire thing myself. You have a master node, on the left-hand side you have a core node, and on the right-hand side you have the task node. Now, the core node, if you look at it, has the HDFS file system; this is where the data can be stored. The task nodes are specifically for completing or processing tasks.

So looking back at the earlier example, when you divide large click-stream data into smaller chunks, there need to be some servers which will process or analyze those small chunks of data, and those servers are called the task nodes. One important thing to remember is that even core nodes can perform tasks; the task nodes themselves are optional. So core nodes can do tasks as well, and they are also assigned the HDFS file system where the data can be stored. These are the three components that we need to remember. Now, before we go ahead and understand more, let's do one thing: let's go to the console and select the EMR service. When you click on create cluster, let me just show you, here you have to give the cluster name, and if you go a bit down, you can select the version of EMR.

There are a lot of versions available; depending upon your requirement, you can select a version. Then comes the hardware configuration. You will see that the number of instances is user defined. A master node is compulsory, just remember that, and the number of core nodes can be changed. So if I change the instance count from three to five, what happens is that the master node stays at one and the core node count increases; there always has to be a master node present. Now let's go to the advanced configuration so that we can understand things in a much better way. Here you see I have one master node, one core node, and one task node, and each of these nodes is responsible for certain operations.

As we already discussed, the master node is responsible for coordinating and distributing data, and it will also keep track of whether all the other nodes are healthy or not. So the master node is responsible for checking whether the core nodes and the task nodes are healthy and running properly; that is the responsibility of the master node, and whenever you create an EMR cluster, a master node should always be present. Second is the core node, which, as we already discussed, can also be used like a task node to perform processing on the log files, and the core node is where the HDFS file system resides. And last is the task node, which is only and only responsible for task processing.

Now, each of these nodes can be either On-Demand or Spot, depending upon what you need, and in the exam there might be certain questions related to these purchasing options. The master node you should definitely never run as Spot, because if someone puts in a higher bid, your master node will be terminated; this is something you should never put as Spot. The core node, again, should ideally be On-Demand, because this is where your HDFS file system is present; if you're using its storage, then the core node should never be Spot. The task node is something that you can put on Spot: if I add, let's say, five task nodes and select Spot instances over here, then even if they get terminated, it is not a major issue for my cluster.
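To tie the purchasing options together, here is a rough boto3 sketch of launching such a cluster, with the master and core groups On-Demand and the task group on Spot. The cluster name, release label, instance types, counts, and bid price are hypothetical, and it assumes the default EMR roles already exist in the account.

import boto3

emr = boto3.client("emr")

cluster = emr.run_job_flow(
    Name="kplabs-emr",              # hypothetical cluster name
    ReleaseLabel="emr-5.20.0",      # pick a release available in your region
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            # Master: never Spot -- losing it kills the whole cluster.
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m4.large", "InstanceCount": 1},
            # Core: holds HDFS, so keep it On-Demand as well.
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m4.large", "InstanceCount": 2},
            # Task: pure processing, safe to run on Spot to save cost.
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m4.large", "InstanceCount": 5,
             "BidPrice": "0.10"},   # hypothetical bid
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(cluster["JobFlowId"])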

So this is something that you need to remember. Let's go back to the PowerPoint presentation. We already discussed that the master node is responsible for coordination and distribution of data: whenever data arrives, the master node distributes it to the core nodes as well as to the task nodes, and it also keeps track of the status and overall health of the cluster. Now, the core node runs both the DataNode and the TaskTracker daemons, which basically means it can store data in HDFS and can also run tasks on the data that is given to the EMR cluster. On the other hand, task nodes only run the TaskTracker daemon, so they can only perform tasks. Now, at times there can be a use case given to you in the exam where cost is one of the factors.

In cases where cost is a factor, you can select task nodes as Spot instances instead of On-Demand. A certain kind of scenario will be given to you where you have to decide whether the task nodes should be On-Demand or Spot, and whether the master node should be On-Demand or Spot; and now you know you cannot really have a master node as Spot, because that is too risky. As a sample EMR task, consider one where you have to calculate the number of repeated words in a specific text document. Now, there are two ways in which you can put data into an EMR cluster.

One of the easiest and most ideal ways is through an S3 bucket. What you can do is put your data, whatever log files you have, into an S3 bucket and fetch the data from that S3 bucket. And once the processing is done within the EMR cluster, you can put the analyzed output into a destination S3 bucket again.

This approach is much more flexible, because once the processing is done, you can delete the entire EMR cluster. Otherwise, if you are storing the results in HDFS, then you cannot really delete the EMR cluster, because the data is still present there. So fetching the input from an S3 bucket and storing the results in an S3 bucket is one of the ideal solutions that should be implemented.
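As a rough sketch of that pattern, the step below uses Hadoop streaming to run a word count whose input and output both live in S3. The job flow ID, bucket names, and the mapper script path are hypothetical placeholders; "aggregate" is Hadoop streaming's built-in counting reducer.

import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # ID returned when the cluster was created
    Steps=[{
        "Name": "word-count",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://kplabs-emr-code/wordcount.py",  # hypothetical mapper
                "-mapper", "wordcount.py",
                "-reducer", "aggregate",                    # built-in counting reducer
                "-input", "s3://kplabs-emr-input/",         # source bucket
                "-output", "s3://kplabs-emr-output/run1/",  # destination bucket
            ],
        },
    }],
)

Once the step finishes and the output lands in the destination bucket, the cluster itself can be terminated.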
