Professional Data Engineer: Professional Data Engineer on Google Cloud Platform Certification Video Training Course
The complete solution to prepare for your exam with the Professional Data Engineer: Professional Data Engineer on Google Cloud Platform certification video training course. The course contains a complete set of videos that will provide you with thorough knowledge of the key concepts. Top-notch prep including Google Professional Data Engineer exam dumps, study guide & practice test questions and answers.
Professional Data Engineer: Professional Data Engineer on Google Cloud Platform Certification Video Training Course Exam Curriculum
You, This Course and Us
- 1. You, This Course and Us (02:01)
Introduction
- 1. Theory, Practice and Tests (10:26)
- 2. Lab: Setting Up A GCP Account (07:00)
- 3. Lab: Using The Cloud Shell (06:01)
Compute
- 1. Compute Options (09:16)
- 2. Google Compute Engine (GCE) (07:38)
- 3. Lab: Creating a VM Instance (05:59)
- 4. More GCE (08:12)
- 5. Lab: Editing a VM Instance (04:45)
- 6. Lab: Creating a VM Instance Using The Command Line (04:43)
- 7. Lab: Creating And Attaching A Persistent Disk (04:00)
- 8. Google Container Engine - Kubernetes (GKE) (10:33)
- 9. More GKE (09:54)
- 10. Lab: Creating A Kubernetes Cluster And Deploying A Wordpress Container (06:55)
- 11. App Engine (06:48)
- 12. Contrasting App Engine, Compute Engine and Container Engine (06:03)
- 13. Lab: Deploy And Run An App Engine App (07:29)
Storage
- 1. Storage Options (09:48)
- 2. Quick Take (13:41)
- 3. Cloud Storage (10:37)
- 4. Lab: Working With Cloud Storage Buckets (05:25)
- 5. Lab: Bucket And Object Permissions (03:52)
- 6. Lab: Lifecycle Management On Buckets (03:12)
- 7. Lab: Running A Program On a VM Instance And Storing Results on Cloud Storage (07:09)
- 8. Transfer Service (05:07)
- 9. Lab: Migrating Data Using The Transfer Service (05:32)
- 10. Lab: Cloud Storage ACLs and API access with Service Account (07:50)
- 11. Lab: Cloud Storage Customer-Supplied Encryption Keys and Life-Cycle Management (09:28)
- 12. Lab: Cloud Storage Versioning, Directory Sync (08:42)
Cloud SQL, Cloud Spanner ~ OLTP ~ RDBMS
- 1. Cloud SQL (07:40)
- 2. Lab: Creating A Cloud SQL Instance (07:55)
- 3. Lab: Running Commands On Cloud SQL Instance (06:31)
- 4. Lab: Bulk Loading Data Into Cloud SQL Tables (09:09)
- 5. Cloud Spanner (07:25)
- 6. More Cloud Spanner (09:18)
- 7. Lab: Working With Cloud Spanner (06:49)
BigTable ~ HBase = Columnar Store
- 1. BigTable Intro (07:57)
- 2. Columnar Store (08:12)
- 3. Denormalised (09:02)
- 4. Column Families (08:10)
- 5. BigTable Performance (13:19)
- 6. Lab: BigTable demo (07:39)
Datastore ~ Document Database
- 1. Datastore (14:10)
- 2. Lab: Datastore demo (06:42)
BigQuery ~ Hive ~ OLAP
- 1. BigQuery Intro (11:03)
- 2. BigQuery Advanced (09:59)
- 3. Lab: Loading CSV Data Into Big Query (09:04)
- 4. Lab: Running Queries On Big Query (05:26)
- 5. Lab: Loading JSON Data With Nested Tables (07:28)
- 6. Lab: Public Datasets In Big Query (08:16)
- 7. Lab: Using Big Query Via The Command Line (07:45)
- 8. Lab: Aggregations And Conditionals In Aggregations (09:51)
- 9. Lab: Subqueries And Joins (05:44)
- 10. Lab: Regular Expressions In Legacy SQL (05:36)
- 11. Lab: Using The With Statement For SubQueries (10:45)
Dataflow ~ Apache Beam
- 1. Dataflow Intro (11:04)
- 2. Apache Beam (03:42)
- 3. Lab: Running A Python Dataflow Program (12:56)
- 4. Lab: Running A Java Dataflow Program (13:42)
- 5. Lab: Implementing Word Count In Dataflow Java (11:17)
- 6. Lab: Executing The Word Count Dataflow (04:37)
- 7. Lab: Executing MapReduce In Dataflow In Python (09:50)
- 8. Lab: Executing MapReduce In Dataflow In Java (06:08)
- 9. Lab: Dataflow With Big Query As Source And Side Inputs (15:50)
- 10. Lab: Dataflow With Big Query As Source And Side Inputs 2 (06:28)
Dataproc ~ Managed Hadoop
- 1. Dataproc (08:28)
- 2. Lab: Creating And Managing A Dataproc Cluster (08:11)
- 3. Lab: Creating A Firewall Rule To Access Dataproc (08:25)
- 4. Lab: Running A PySpark Job On Dataproc (07:39)
- 5. Lab: Running The PySpark REPL Shell And Pig Scripts On Dataproc (08:44)
- 6. Lab: Submitting A Spark Jar To Dataproc (02:10)
- 7. Lab: Working With Dataproc Using The GCloud CLI (08:19)
Pub/Sub for Streaming
- 1. Pub Sub (08:23)
- 2. Lab: Working With Pubsub On The Command Line (05:35)
- 3. Lab: Working With PubSub Using The Web Console (04:40)
- 4. Lab: Setting Up A Pubsub Publisher Using The Python Library (05:52)
- 5. Lab: Setting Up A Pubsub Subscriber Using The Python Library (04:08)
- 6. Lab: Publishing Streaming Data Into Pubsub (08:18)
- 7. Lab: Reading Streaming Data From PubSub And Writing To BigQuery (10:14)
- 8. Lab: Executing A Pipeline To Read Streaming Data And Write To BigQuery (05:54)
- 9. Lab: Pubsub Source BigQuery Sink (10:20)
Datalab ~ Jupyter
- 1. Data Lab (03:00)
- 2. Lab: Creating And Working On A Datalab Instance (04:01)
- 3. Lab: Importing And Exporting Data Using Datalab (12:14)
- 4. Lab: Using The Charting API In Datalab (06:43)
TensorFlow and Machine Learning
- 1. Introducing Machine Learning (08:04)
- 2. Representation Learning (10:27)
- 3. NN Introduced (07:35)
- 4. Introducing TF (07:16)
- 5. Lab: Simple Math Operations (08:46)
- 6. Computation Graph (10:17)
- 7. Tensors (09:02)
- 8. Lab: Tensors (05:03)
- 9. Linear Regression Intro (09:57)
- 10. Placeholders and Variables (08:44)
- 11. Lab: Placeholders (06:36)
- 12. Lab: Variables (07:49)
- 13. Lab: Linear Regression with Made-up Data (04:52)
- 14. Image Processing (08:05)
- 15. Images As Tensors (08:16)
- 16. Lab: Reading and Working with Images (08:06)
- 17. Lab: Image Transformations (06:37)
- 18. Introducing MNIST (04:13)
- 19. K-Nearest Neighbors (07:42)
- 20. One-hot Notation and L1 Distance (07:31)
- 21. Steps in the K-Nearest-Neighbors Implementation (09:32)
- 22. Lab: K-Nearest-Neighbors (14:14)
- 23. Learning Algorithm (10:58)
- 24. Individual Neuron (09:52)
- 25. Learning Regression (07:51)
- 26. Learning XOR (10:27)
- 27. XOR Trained (11:11)
Regression in TensorFlow
- 1. Lab: Access Data from Yahoo Finance (02:49)
- 2. Non TensorFlow Regression (05:53)
- 3. Lab: Linear Regression - Setting Up a Baseline (11:19)
- 4. Gradient Descent (09:56)
- 5. Lab: Linear Regression (14:42)
- 6. Lab: Multiple Regression in TensorFlow (09:15)
- 7. Logistic Regression Introduced (10:16)
- 8. Linear Classification (05:25)
- 9. Lab: Logistic Regression - Setting Up a Baseline (07:33)
- 10. Logit (08:33)
- 11. Softmax (11:55)
- 12. Argmax (12:13)
- 13. Lab: Logistic Regression (16:56)
- 14. Estimators (04:10)
- 15. Lab: Linear Regression using Estimators (07:49)
- 16. Lab: Logistic Regression using Estimators (04:54)
Vision, Translate, NLP and Speech: Trained ML APIs
- 1. Lab: Taxicab Prediction - Setting up the dataset (14:38)
- 2. Lab: Taxicab Prediction - Training and Running the model (11:22)
- 3. Lab: The Vision, Translate, NLP and Speech API (10:54)
- 4. Lab: The Vision API for Label and Landmark Detection (07:00)
Virtual Machines and Images
- 1. Live Migration (10:17)
- 2. Machine Types and Billing (09:21)
- 3. Sustained Use and Committed Use Discounts (07:03)
- 4. Rightsizing Recommendations (02:22)
- 5. RAM Disk (02:07)
- 6. Images (07:45)
- 7. Startup Scripts And Baked Images (07:31)
VPCs and Interconnecting Networks
- 1. VPCs And Subnets (11:14)
- 2. Global VPCs, Regional Subnets (11:19)
- 3. IP Addresses (11:39)
- 4. Lab: Working with Static IP Addresses (05:46)
- 5. Routes (07:36)
- 6. Firewall Rules (15:33)
- 7. Lab: Working with Firewalls (07:05)
- 8. Lab: Working with Auto Mode and Custom Mode Networks (19:32)
- 9. Lab: Bastion Host (07:10)
- 10. Cloud VPN (07:27)
- 11. Lab: Working with Cloud VPN (11:11)
- 12. Cloud Router (10:31)
- 13. Lab: Using Cloud Routers for Dynamic Routing (14:07)
- 14. Dedicated Interconnect Direct and Carrier Peering (08:10)
- 15. Shared VPCs (10:11)
- 16. Lab: Shared VPCs (06:17)
- 17. VPC Network Peering (10:10)
- 18. Lab: VPC Peering (07:17)
- 19. Cloud DNS And Legacy Networks (05:19)
Managed Instance Groups and Load Balancing
- 1. Managed and Unmanaged Instance Groups (10:53)
- 2. Types of Load Balancing (05:46)
- 3. Overview of HTTP(S) Load Balancing (09:20)
- 4. Forwarding Rules Target Proxy and Url Maps (08:31)
- 5. Backend Service and Backends (09:28)
- 6. Load Distribution and Firewall Rules (04:28)
- 7. Lab: HTTP(S) Load Balancing (11:21)
- 8. Lab: Content Based Load Balancing (07:06)
- 9. SSL Proxy and TCP Proxy Load Balancing (05:06)
- 10. Lab: SSL Proxy Load Balancing (07:49)
- 11. Network Load Balancing (05:08)
- 12. Internal Load Balancing (07:16)
- 13. Autoscalers (11:52)
- 14. Lab: Autoscaling with Managed Instance Groups (12:22)
Ops and Security
- 1. StackDriver (12:08)
- 2. StackDriver Logging (07:39)
- 3. Lab: Stackdriver Resource Monitoring (08:12)
- 4. Lab: Stackdriver Error Reporting and Debugging (05:52)
- 5. Cloud Deployment Manager (06:05)
- 6. Lab: Using Deployment Manager (05:10)
- 7. Lab: Deployment Manager and Stackdriver (08:27)
- 8. Cloud Endpoints (03:48)
- 9. Cloud IAM: User accounts, Service accounts, API Credentials (08:53)
- 10. Cloud IAM: Roles, Identity-Aware Proxy, Best Practices (09:31)
- 11. Lab: Cloud IAM (11:57)
- 12. Data Protection (12:02)
Appendix: Hadoop Ecosystem
- 1. Introducing the Hadoop Ecosystem (01:34)
- 2. Hadoop (09:43)
- 3. HDFS (10:55)
- 4. MapReduce (10:34)
- 5. Yarn (05:29)
- 6. Hive (07:19)
- 7. Hive vs. RDBMS (07:10)
- 8. HQL vs. SQL (07:36)
- 9. OLAP in Hive (07:34)
- 10. Windowing Hive (08:22)
- 11. Pig (08:04)
- 12. More Pig (06:38)
- 13. Spark (08:54)
- 14. More Spark (11:45)
- 15. Streams Intro (07:44)
- 16. Microbatches (05:40)
- 17. Window Types (05:46)
About Professional Data Engineer: Professional Data Engineer on Google Cloud Platform Certification Video Training Course
Professional Data Engineer: Professional Data Engineer on Google Cloud Platform certification video training course by PrepAway, along with practice test questions and answers, study guide, and exam dumps, provides the ultimate training package to help you pass.
BigTable ~ HBase = Columnar Store
1. BigTable Intro
Why is it easier to add columns on the fly in BigTable than in Cloud Spanner? This is a question that I'd like you to think about. Let's say you have a complex application and you now want to go ahead and change your database design. You'd like to add a whole bunch of columns to existing tables. That is very difficult to do in Cloud Spanner. BigTable makes this very simple. Why is that? We've been going on and on about the similarities between HBase and Cloud Spanner.
So let's now turn our conversation to HBase and its GCP equivalent, BigTable. Recall that BigTable is used when we want to carry out fast sequential scanning of data in columnar format. For fast sequential scanning with low latency, our NoSQL tool of choice on the Google Cloud Platform is BigTable. BigTable is nearly indistinguishable from HBase under the hood. Like HBase, it is a columnar database, which is good for sparse data. We will examine what exactly a columnar database is in a lot more detail in just a moment. Like Cloud Spanner, BigTable physically stores key values sequentially, in sorted order. And this means that BigTable, like Cloud Spanner, is sensitive to hot spots. If reads and writes are not evenly distributed, performance can take a bad hit.
For most intents and purposes, we can assume that BigTable and HBase are synonymous. We'll have a little more to say on the relationship between BigTable and HBase, but in general, like any cloud tool, BigTable has a bunch of advantages. It is a managed version of HBase, and that's a much closer relationship than, say, that between Hive and BigQuery; the underlying representations of data in Hive and BigQuery are quite different. The advantages of BigTable over HBase are exactly the ones that you would expect from a cloud platform: scalability, a low administrative and operational burden, the ability to resize clusters without downtime, and the ability to support many more column families before performance drops. We will get to that when we discuss the columnar data format. Because the connection between BigTable and HBase is so strong, it makes sense for us to thoroughly understand the properties of HBase. To begin with, it's a columnar data store, which means that effectively, the representation has just three columns.
Well, actually, four. It supports denormalized storage. This is quite different from an RDBMS. It focuses on the CRUD operations of create, read, update, and delete. That is a far more basic set of operations than most RDBMS support. And lastly, transaction support in HBase is pretty much nonexistent. The only operations where ACID properties are guaranteed are row-level operations. So again, this is worth remembering: HBase is ACID-compliant only at the row level. ACID stands for atomicity, consistency, isolation, and durability. Let's go through these ideas one by one and understand them in some detail. Let's start with the idea of a columnar data store. Let's say that you wish to store data for a notification service on an e-commerce website. Notifications have properties like the ID of the person to whom they were sent, the type of notification (this could be an offer or a sale notification), and the content of the notification message. In a traditional relational database, we would store this data in the form of a table with four columns and a bunch of rows.
In our relation, each row corresponds to one tuple, and this is the layout of any traditional RDBMS. The number of elements in each row corresponds to the length of the schema. Here, each row has four elements, and each of those elements has to conform in type to the schema that's specified. To understand the significance of any one value, any one piece of data in a relational database, we must link it to both its corresponding row ID and its column. For example, in the row where the notification ID is three, the value "jill" is the person to whom the notification was sent. Now let's check out how this exact same set of data items would be represented in a columnar data store. Effectively, in a columnar data store, there would only be three columns, and these three columns would map to the columns of our relational data as follows:
First up, there is an ID column. This is common between the columnar data store and the relational database representation. The second column in the columnar data store is a column identifier. The column identifier is going to contain values that correspond to the columns in the relational database. Effectively, what we've done is encode the columns from the RDBMS as fields in the columnar data store. And now, to complete the representation of any one row of data, we need a third column containing the cell values from the RDBMS tuple. Notice how those cell values are associated with the column identifiers. A couple of points jump out and quickly grab our attention. Every row from the relational database now has multiple rows in the columnar store; in fact, it has one row for each column from the RDBMS. The other bit that jumps out is that the columnar data store is clearly not normalized. For instance, notice how the column identifiers for type and content appear repeatedly in the columnar format. That is not normalized storage, and in a traditional RDBMS, that would be frowned upon.
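As a sketch of the mapping just described, here is a minimal Python illustration of how a relational notification table can be flattened into (row key, column identifier, value) rows. The table contents and the `to_cells` helper are made up for this example; actual BigTable/HBase physical storage is more involved.

```python
# Hypothetical relational table: one dict per tuple, four columns each.
NOTIFICATIONS = [
    {"id": 1, "sent_to": "jack", "type": "offer", "content": "20% off"},
    {"id": 2, "sent_to": "jill", "type": "sale", "content": "Ends today"},
]

def to_cells(rows):
    """Flatten each relational row into (row_key, column, value) cells."""
    cells = []
    for row in rows:
        for column, value in row.items():
            if column == "id":
                continue  # the id becomes the row key, not a cell
            cells.append((row["id"], column, value))
    return cells

cells = to_cells(NOTIFICATIONS)
# Each relational row becomes one cell per non-key column:
assert (1, "sent_to", "jack") in cells
assert len(cells) == 6  # 2 rows x 3 non-key columns
```

Note how the column identifiers (`sent_to`, `type`, `content`) repeat for every row key, which is exactly the denormalization the lecture points out.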
To make up for this, however, the columnar data store has a couple of powerful advantages. The first of these has to do with the ease with which it handles sparse data. If you have data with a lot of null values, you're not going to end up wasting much space. The other has to do with the dynamic nature of attributes in columns. Notice how, in a columnar data store, we can go ahead and add new columns on the fly without changing the schema of our data store. If we wanted to add a column in an RDBMS, we would have to carry out an ALTER TABLE operation, which carries a significant penalty. Let's come back to the question we posed at the start of this video. There are actually two separate answers here. The first relates to why it's difficult to add columns on the fly in Cloud Spanner, and the second is why it's easy to add columns in BigTable. Let's talk about BigTable first; this one is easier to understand. BigTable is a columnar database. So if you decide to add columns to some tables in your data set, all you need to do is insert new rows into your database. You do not need to change the schema in any way.
This is why adding columns dynamically is pretty easy in BigTable. Let's now talk about why it's difficult to add columns in Cloud Spanner. For one, Cloud Spanner is a relational database, so each time you add columns, that's going to change the schema, and there are going to be a whole bunch of database writes. These writes will require transaction support, which will hurt performance. So that's one reason. Another, more fundamental reason why adding columns is particularly difficult in Cloud Spanner is the nature of the underlying storage. Remember, Cloud Spanner uses interleaving: a complex physical layout in which related data items are grouped together. That gives rise to a whole bunch of practical difficulties when you want to change the schema and add columns.
2. Columnar Store
Generations of computer science students have grown up learning about the importance of normalization in database design. Why, then, is it that distributed databases often compromise on normalization? What are the drawbacks of normalization in the distributed world? This is a question that I would like you to ponder as you watch this video. We will come back to the answer at the end of the video. These two advantages of columnar data stores like BigTable and HBase are quite significant, so let's go ahead and make sure we really understand them.
Let's start out by understanding why columnar data stores are so much better at dealing with sparse data. Let's keep going with our discussions of notification data, because that's actually a good example of the kind of data where there are a bunch of missing values. Here, for instance, there might be notification types that have expiration dates.
These are offers that are going to expire at a certain point. Sale and offer notifications have expiration dates, but the other notification types do not. Also, it is entirely possible that order notifications have an "order status" field. This is a field that is specific to order notifications. If we wanted to accommodate all of these different types of notifications within a relational database table, we would effectively keep adding columns.
This would cause our table to get wider and wider, but these columns would only exist for a small subset of the total notification data. And so our data set, our relational database, would be filled up with more and more nulls, more and more empty values. It's also worth keeping in mind that columnar data stores like BigTable or HBase tend to operate on really large data sets, on the order of petabytes. This is perhaps several orders of magnitude greater than relational databases. In relational databases we can afford to ignore the space occupied by missing values, but we cannot when we are dealing with petabytes of data in a columnar data store. So the fact that we have an extremely large dataset, and that the data set is very sparse, with a lot of empty values in each row, can become a real problem as the data set explodes in size.
This is where columnar stores come in handy because, as we can see, we simply do not have a row corresponding to a null value. For instance, notice that here we have a notification that has an expiry field. That is because it's a sale or an offer notification, and these are the only notifications that will have such rows in our columnar data store. The notifications that lack this attribute will simply not have rows corresponding to this field, and the result is that there is no wastage on empty cells. This example also demonstrates the other great advantage of columnar data stores: the ability to add new attributes (new columns) dynamically, on the fly, as rows in our columnar data store. If we wanted to add a new field or column to a relational database, we would have to use the ALTER TABLE command and then add more columns.
Those columns would have null values for most of the existing data. None of these issues arise when we add columns to a columnar data store in the form of new attributes. So the dynamic addition of attributes is yet another advantage of this type of data representation. One important aside: here in this conversation about BigTable, and also in the preceding conversation about Cloud Spanner, we have been dealing with a lot of schematic diagrams of how data is laid out.
It should be noted that these are not necessarily accurate descriptions of how data is stored internally, but they are schematically correct, in the sense that you can use them as a good guide to how columnar data stores work or how interleaving in Cloud Spanner works. But don't get hung up on them, and don't attempt to reverse-engineer the actual physical storage of data in either of these technologies from them. What matters is the basic idea of a columnar data store and its benefits and drawbacks.
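The two advantages above, sparse data costing nothing and new columns needing no schema change, can be sketched in a few lines of Python, using a plain dict of dicts as a stand-in for a columnar store. The `put` helper and the column names are invented for illustration:

```python
# A toy columnar store: row_key -> {column: value}. There is no schema,
# so a "column" exists only in the rows that actually carry a value.
store = {}

def put(row_key, column, value):
    """Write one cell; no ALTER TABLE needed for a new column."""
    store.setdefault(row_key, {})[column] = value

put(1, "type", "offer")
put(1, "expiry", "2024-12-31")   # expiry exists only for offer/sale rows
put(2, "type", "shipment")       # no expiry cell here: nulls cost nothing

assert "expiry" not in store[2]          # sparse: missing cell, zero bytes
put(2, "order_status", "shipped")        # brand-new column, added on the fly
assert store[2]["order_status"] == "shipped"
```

The design point: because each cell is stored as its own (row key, column, value) entry, absent attributes simply have no entry, and a new attribute is just a new entry.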
Let's now move on and talk about denormalized storage. We've already discussed how storage in columnar data stores does not fit into the traditional definitions of normalization. In a traditional RDBMS, an important objective is minimizing redundancy, and that's what gave rise to the different normal forms, and in particular to the third normal form, which is what most RDBMS shoot for. Let's understand this with an example. Let's say that we wish to minimize redundancy when we are storing data that has to do with employee details. This data also includes subordinate and reporting relationships, as well as addresses. All of this is part of our dataset. Let's go ahead and see how we would design tables in the RDBMS world.
We would have one employee details table. This would contain information specific to an employee but separate from any subordinates or addresses, because one employee would have multiple subordinates and addresses. So subordinate and address information would reside in separate tables. For instance, there would be an employee-subordinate table.
This would link to the employee details table based on the ID column. In a similar manner, we would have an employee address table; once again, this would link back to the employee table based on the ID. And in this way, we would deal seamlessly with the situation where an employee has multiple subordinates and multiple addresses. Now let's focus a little bit on the ID column. This is the column that holds our data set together.
Why did we decide to keep all of the employee details in one table and separate out the subordinate and address data? Well, because if we had had multiple subordinates per employee, we would have had to repeat all of the employee's specific information, such as name, function, grade, and so on. By having a separate employee subordinate table that links entirely on the basis of that one ID column, we only need to repeat the ID.
We do not need to repeat any of the other data items for that employee. And this also means that we are going to refer to an employee no matter what table we're talking about using that ID column. In a sense, this ID column is the key to this data set. And we have made our data granular by splitting it across multiple tables. We have also eliminated redundancy.
But we do have a more complex data model because now the same column, ID, is logically and semantically linked across these three tables. This is normalization, and this is the way traditional RDBMS do things. Let's come back to the question we posed at the start of the video. I think it's a really interesting one.
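As an aside, the normalized layout we just walked through can be sketched in Python with three small in-memory "tables" linked by the ID column. The names and values are invented; the point is that reassembling one employee touches all three tables:

```python
# Three normalized "tables", linked only by the employee ID.
employee_details = {1: {"name": "Anna", "grade": "L5"}}
employee_subordinates = [(1, "Bob"), (1, "Carol")]   # (employee_id, subordinate)
employee_addresses = [(1, "Oakland", "94601")]       # (employee_id, city, zip)

def full_record(emp_id):
    """Reassembling one employee requires three separate lookups."""
    return {
        **employee_details[emp_id],
        "subordinates": [s for i, s in employee_subordinates if i == emp_id],
        "addresses": [(c, z) for i, c, z in employee_addresses if i == emp_id],
    }

record = full_record(1)
assert record["subordinates"] == ["Bob", "Carol"]
assert record["addresses"] == [("Oakland", "94601")]
```

On a single machine those three lookups are cheap; the next lecture explains why they become expensive once the tables live on different nodes.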
Normalization in traditional database design was largely driven by the need to save space, and that in turn was driven by the monolithic nature of database servers. You had one very big and powerful machine. A whole bunch of data had to be crammed into that machine, and so the bottleneck was the amount of data that you could fit onto it. In a distributed database, all of a sudden, bandwidth is the bottleneck.
The number of network accesses that you are going to need to perform and the number of different nodes that you will need to access in order to read data become the really expensive operations. And now, all of a sudden, normalization isn't such a great idea.
Let's say you normalize data and end up storing related data items on different, distant nodes; even if you save a few bytes, having to access the network three times instead of once will give you terrible performance. That is why, in a distributed world, disk seeks are more expensive than storage. As a result, denormalized data formats that group together all of the information that you require are becoming more popular.
3. Denormalised
Here's a question that I'd like you to think about as you go through the contents of this video. This is a true or false question: BigTable supports equijoins, but it does not support constraints or indices. Equijoins are joins where the join condition involves an equality check.
As we saw in the case of Datastore, joins and inequality filters carry restrictions in some technologies. Is BigTable one of those technologies? Is this statement true or false? Normalization and the traditional normal forms have existed as standards in RDBMS and database theory for decades. The basic idea there is to optimize the amount of storage.
As we've already seen, using the normalized forms allows us to store employee-specific details such as the name, grade, and so on just once. We do not have to repeat these for each subordinate or each address that the employee has. But the reality now is that we are working on a distributed file system, and here storage is actually very cheap because we have a large number of generic machines, each with a lot of attached storage.
What is really costly in a distributed file system is making a lot of disk seeks to servers, or to data that resides on different machines. Now, for instance, if you wanted to get all the information about one employee, a normalized storage form would require us to look up three different tables, which might reside in three very different parts of the network, and that could impose a terrible performance penalty in a distributed system.
This would be perfectly acceptable in a monolithic database server from the 1990s or 2000s, but columnar data stores eliminate the concept of normalization. They squish all of their data together so that all the data for one entity resides together. The immediate and obvious implication of this is that we have eliminated normalization. In fact, we have eliminated the first normal form with our subordinate data here, because for every subordinate of a given employee, we now have an array entry in the row corresponding to that employee, and arrays are compound data types, which violate the first normal form. We need to do something similar for address information, but address information is structured, and this requires us to use something even more complex than an array. Here we are going to need to make use of a structure.
Each address is going to consist of a city and a zip code, and for each employee, we are going to need an array of such structures containing the cities and zip codes of all of that employee's addresses. Notice here that everything that has to do with a particular employee is logically grouped together and effectively indexed by the row ID, the employee ID that is equal to one. The great advantage of this representation is that now, to get all the information about a particular employee, we just need to carry out a single disk seek. All of these data items will be logically and physically stored close to each other in a distributed system. This can lead to an incredible improvement in performance, particularly if you are smart about how you sort and store your data items. And that's exactly what BigTable and HBase do.
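Here is a minimal Python sketch of the denormalized layout just described, with everything about one employee grouped under a single row key. Field names are invented; real BigTable would group these cells into column families:

```python
# Denormalized: one row key holds the whole employee record, including
# an array (subordinates) and an array of structures (addresses).
denormalized = {
    1: {
        "name": "Anna",
        "grade": "L5",
        "subordinates": ["Bob", "Carol"],           # array: violates 1NF
        "addresses": [                               # structures in an array
            {"city": "Oakland", "zip": "94601"},
        ],
    }
}

row = denormalized[1]  # a single lookup instead of three table scans
assert row["addresses"][0]["city"] == "Oakland"
assert row["subordinates"] == ["Bob", "Carol"]
```

Contrast this with the three-table normalized version: the data is redundant-looking, but one read retrieves everything, which is the trade-off the lecture describes.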
Next up on the agenda, we need to understand the operations that HBase and BigTable support and do not support. Basically, HBase and BigTable only support CRUD operations. CRUD stands for create, read, update, and delete. Now, this is a far smaller set of operations than that supported by traditional databases or SQL, where, for instance, we are used to complex operations across rows such as joins, group-by operations, or sorting operations such as order-by. If you take a minute and stare at these three bits of functionality, which are supported in SQL and RDBMS but are not supported in HBase, what jumps out? What do these operations have in common? You guessed it: they all involve, in some capacity, some kind of comparison, sorting, or equality or inequality check across different rows in the same data set.
HBase is very row-centric; it basically scans data by a row key, and it doesn't really understand operations that compare groups of rows with each other. That's why HBase and BigTable are both NoSQL technologies: they do not support SQL. And the reason for this, of course, ties back to their underlying data representation, where a row is the basic unit of viewing the world. It turns out that HBase only allows a very limited set of operations. These are the CRUD operations that we've already discussed. It's okay to create data sets, it's okay to read data, it's okay to update specific data items, and it's okay to delete data items. As we shall see, most of these are indexed by a row key.
That's the only way that HBase knows how to access data. So CRUD operations are supported by HBase. More complex joins, order-by, and aggregation operations are not. This is an important point to keep in mind, particularly on the Google Cloud Platform, where BigQuery, which is Hive-like and offers a similar interface on top of cloud storage, does a lot better in terms of performance than Hive. So if you need to choose between BigQuery and BigTable, do remember these limitations of BigTable.
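The CRUD-by-row-key access model can be sketched as follows. This is a toy in-memory stand-in with invented helper names, not a real HBase/BigTable client; the point is that the row key is the only access path:

```python
# Toy row-keyed store supporting only CRUD, always addressed by row key.
table = {}

def create(row_key, cells):            # C
    table[row_key] = dict(cells)

def read(row_key):                     # R: row-key lookup is the only path
    return table.get(row_key)

def update(row_key, column, value):    # U
    table[row_key][column] = value

def delete(row_key):                   # D
    table.pop(row_key, None)

create("user#42", {"name": "jill"})
update("user#42", "name", "jack")
assert read("user#42") == {"name": "jack"}
delete("user#42")
assert read("user#42") is None
# Note: there is no join, group-by, or order-by across rows; any such
# operation would have to be built by the client, scanning row by row.
```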
BigTable will throw up its hands, and it will not support any operations involving multiple tables. It does not support indexes on tables other than the row key, which is the ID column, and it does not support constraints. None of these facilities are available to you if you want to use BigTable. And all of these restrictions, which could seem arbitrary, will make sense if you keep in mind that all data needs to be self-contained within one row. That is the basic underlying premise of columnar data stores like BigTable and HBase. Let's now move on to the next property of HBase that is important for us to understand: the fact that HBase only supports ACID at the row level. Recall that ACID stands for atomicity, consistency, isolation, and durability; this is transaction support as provided by a traditional RDBMS. Now, in HBase, updates to a single row are atomic.
So effectively, any operations that you carry out that affect a particular row ID will be all or nothing. Either all of the columns corresponding to that row will be affected, or none will be. However, this is only applicable to a single row. Updates to multiple rows are not atomic, even if your update is to the same column but on multiple rows. Again, this should come as no surprise to us once we've understood the underlying representation of data in HBase. The worldview of a columnar data store is restricted to groups of data with the same row ID. Once you cross the boundary of a row, all bets are off.
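Row-level atomicity can be sketched with a plain-Python model (this is not the real HBase/BigTable API): a batch of column mutations to one row either fully applies or fully fails, while nothing protects updates that span rows:

```python
import copy

# Toy store: two rows, one column each.
store = {"r1": {"a": 1}, "r2": {"a": 1}}

def mutate_row(row_key, mutations):
    """Apply all column mutations to ONE row, or none of them."""
    staged = copy.deepcopy(store[row_key])   # stage changes off to the side
    for column, value in mutations:
        if value is None:
            raise ValueError("bad mutation")  # abort: nothing is applied
        staged[column] = value
    store[row_key] = staged                  # commit all columns at once

mutate_row("r1", [("a", 2), ("b", 3)])       # atomic within row r1
assert store["r1"] == {"a": 2, "b": 3}

try:
    mutate_row("r2", [("a", 2), ("b", None)])  # fails partway through
except ValueError:
    pass
assert store["r2"] == {"a": 1}  # r2 untouched: all-or-nothing per row
```

There is no equivalent staging across two different row keys, which is exactly why multi-row updates in HBase carry no atomicity guarantee.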
So let's quickly summarize all of the differences between a traditional RDBMS and a columnar store like HBase. Remember that in an RDBMS, data is arranged in rows and columns, but in HBase, data is arranged in columns only. That's where the name "columnar data store" comes from. Traditional relational databases are SQL-compliant; in fact, they are, by definition, SQL databases. HBase is a prominent example of a NoSQL database. Specifically, it is a key-value store. Traditional RDBMS and database design place a high value on normalization. That's because it minimizes redundancy and optimizes the amount of space taken up by data. However, columnar data stores like HBase do not care about normalization.
In fact, they intentionally denormalize data in order to make it easier and faster to access related data items in a distributed file system. HBase operates at much larger dataset sizes than traditional RDBMS. That has implications for transaction support and ACID compliance. HBase is only going to support ACID properties at the row level. Multi-row operations are not ACID-compliant. Let's come back to the question that we posed. This statement is false. BigTable is about as NoSQL as it gets. It does not support any operations across tables. It does not support joins of any form, constraints, or indices. Again, BigTable is pretty hardcore NoSQL. It does not support any operations across tables. Everything is only at the level of rows and column families.
Prepaway's Professional Data Engineer: Professional Data Engineer on Google Cloud Platform video training course for passing certification exams is the only solution you need.
Pass Google Professional Data Engineer Exam in First Attempt Guaranteed!
Get 100% Latest Exam Questions, Accurate & Verified Answers As Seen in the Actual Exam!
30 Days Free Updates, Instant Download!
Professional Data Engineer Premium Bundle
- Premium File 319 Questions & Answers. Last update: Dec 16, 2024
- Training Course 201 Video Lectures
- Study Guide 543 Pages