Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 6

Streams Intro: Let’s now introduce Apache Flink and, as always, here is a question that I’d like you to ponder as we discuss this technology. How, if at all, can MapReduce be used to maintain a running summary of the real-time data sent in from sensors? These sensors send in temperature readings every five minutes. You would like to calculate, say, the average of the temperature readings from all these sensors. And that average needs to be updated forever, in perpetuity. Your MapReduce job should never stop…

Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 5

Spark: We’ve now finished discussing Hadoop, Hive, and Pig in the Google Cloud Platform world. Hadoop, Hive, and Pig are all used via Dataproc, which is a managed Hadoop service. It turns out that Pig, Hive, and Spark are services which are available by default on every instance of a Dataproc cluster. So let’s close the loop by now discussing Spark, which is an incredibly popular technology these days. As usual, I have a question that I’d like you to think about as we discuss Spark. And that question is…

Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 4

Windowing Hive: We’ve explored partitioning and bucketing as ways to divvy up table data into more manageable chunks. We’ve also seen two types of join optimizations: the use of the smaller table in memory and the use of map-only joins. Let’s now turn our attention to the third bit of OLAP functionality offered by Hive, which is windowing functions. Windowing functions can be thought of as syntactic sugar. There is almost nothing that windowing functions can do for you which traditional queries cannot. But the real value and the importance…

Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 3

Hive vs. RDBMS: Picking the right technology for the right use case is really important these days. And so I’d like you to think about this question: why would we never use Hive, or BigQuery for that matter, for OLTP applications? OLTP stands for online transaction processing. These are where traditional databases tend to dominate. So, as we discuss the differences between Hive and a traditional RDBMS, do keep this question in mind. Also, keep in mind the converse of this question: why would we never do the reverse? Why…

Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 2

MapReduce: Let’s now move on to the next building block of Hadoop, and that is MapReduce, which of course is the parallel programming paradigm that really is at the heart of it all. While understanding MapReduce, I’d like you to try and answer this question: why is there such a strong need for some kind of SQL interface on top of MapReduce? Both Hive and BigQuery are exactly this. They are both SQL interfaces on top of MapReduce-like activities. And the question for you is why…

Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 1

Introducing the Hadoop Ecosystem: Hello and welcome to this module on the Hadoop ecosystem. We are going to spend a fair bit of time discussing Hadoop and some of the important components in that ecosystem. And there are two reasons why this is a good use of time. The first reason has to do with the genesis of the Hadoop ecosystem. Recall that Hadoop, MapReduce, and HDFS were actually born out of Google technologies. And so there is a close mapping between the Hadoop ecosystem and the corresponding tools…

Amazon AWS SysOps – Security and Compliance for SysOps part 5

MFA + IAM Credentials Report: So, as we all know, IAM can be integrated with MFA, and MFA is multi-factor authentication. Why would you use MFA? Well, you use it because it adds a level of security. That means that whenever you log in, you’re also prompted for a code, and you have to enter that code. And that code must be in your possession. So that just guarantees another level of security. If your password gets compromised, a hacker cannot also get access to your phone or your codes or…

Amazon AWS SysOps – Security and Compliance for SysOps part 4

KMS Overview + Encryption in Place: Okay, so now let’s talk about KMS. KMS stands for Key Management Service. So anytime you hear encryption in an AWS service, most likely this will involve KMS. And KMS is an easy way to control access to your data. The data is going to be encrypted by keys, and AWS KMS will manage these keys for us. So, KMS is a store; we have some control over it, but some things we cannot do with this store. And so that’s how…

Amazon AWS SysOps – Security and Compliance for SysOps part 3

GuardDuty: So GuardDuty is a very special service that’s kind of hard to understand because we don’t have to do much. But it is intelligent threat discovery, basically meant to protect your AWS accounts. That means that it’s going to run some analysis in the background. You don’t have to do anything. It will use the logs that are available to it and it will just make sure that it’s protecting you against malicious usage. So it will use machine learning algorithms, anomaly detection, and third-party…

Amazon AWS SysOps – Security and Compliance for SysOps part 2

AWS Inspector: Okay, so now let’s talk about AWS Inspector. So this is only for EC2 instances, and that is very important. The exam will try to trick you by asking: could you use Inspector on RDS? The answer is no, you cannot. The only way you can use Inspector is on EC2 instances. So what does Inspector do? Well, it helps you analyze the known vulnerabilities or the unintended network accessibility on your EC2 instances only. Why? Because you need to install an Inspector agent and you need…
