Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 3
- Hive vs. RDBMS
Picking the right technology for the right use case is really important these days. So I'd like you to think about this question: why would we never use Hive, or BigQuery for that matter, for OLTP applications? OLTP stands for online transaction processing, and it is where traditional databases tend to dominate. As we discuss the differences between Hive and a traditional RDBMS, do keep this question in mind. Also keep in mind the converse: why would we never do the reverse? Why would we never use MySQL or PostgreSQL, or for that matter Cloud SQL or Cloud Spanner, for an OLAP use case? OLAP stands for online analytical processing.
This has to do with business intelligence and all of the advanced analytical querying capabilities that Hive and BigQuery bring with them. Why can't we simply use SQL or an RDBMS here? Let's try and answer all of these questions by examining the differences between Hive and a traditional RDBMS. Hive and traditional RDBMS differ along many important axes: the size of the data sets they work best with, their computation models, their approaches to and tolerance for latency (which are widely different as well), the operations that they support, the ACID guarantees and transaction support, and lastly, the capabilities of the query languages themselves.
That last point is the difference between HiveQL and SQL. In a nutshell, let's summarize the differences between Hive and RDBMS along all of these axes. Hive works best with very large data sets, on the order of petabytes. These are sets of files in a distributed file system, so they can get pretty large. RDBMS tend to work with much smaller data sets, on the order of terabytes or gigabytes. This has to do with big data applications, where all of the data just doesn't fit into the memory of any one machine. That is why Hive and tools like it are required for much larger data sets, and indeed why Hive needs to rely on parallel computation. MapReduce, after all, is a parallel programming paradigm.
RDBMS, on the other hand, are a decades-old technology which relies heavily on serial computation. This is a luxury they can afford because they deal with smaller data sets, and serial computations are that much easier to provide write guarantees and transaction guarantees with. We'll have more on that in a second. The latency in Hive, BigQuery, and tools like that tends to be higher, especially when the underlying storage technology is a distributed file system; this is true of Hive, but not of BigQuery. In contrast, relational databases tend to offer much lower latency, once again a luxury afforded by their small data sets.
Read operations really are what Hive is optimized for. The basic use case in Hive is that you have a lot of data sitting in HDFS and you would now like to read it and carry out some kind of smart analytics on it. RDBMS, on the other hand, need to support both read and write operations. This is once again a lot more complicated, because once you start writing data, you also have to make sure that various properties are satisfied. Those are the ACID properties: ACID stands for atomicity, consistency, isolation, and durability. These are the guarantees which relational databases make in order to support transaction processing. Relational databases are unambiguously ACID compliant. Hive, on the other hand, is not.
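To make the contrast concrete, here is a minimal sketch of the kind of transactional write an ACID-compliant RDBMS is built for and Hive is not; the table and column names are hypothetical.

```sql
-- Classic OLTP work: move funds between two accounts.
-- ACID guarantees that either both updates take effect or neither does,
-- even if the server crashes halfway through.
START TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

COMMIT;
```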
There's a lot of fine print around ACID support in Hive which we will not get into. Relational databases sometimes offer even stronger guarantees. Recall our conversation about Cloud Spanner and how it was "ACID++" along at least some axes, for instance in the fact that the serialization of transactions happened in timestamp order, and those timestamps could actually correspond to real time, which you could measure on a watch. All of these differences feed into the last difference, which is the one between HiveQL and SQL. We will have a separate conversation about the differences between these two query languages in just a moment, but here it is at a very high level.
You should know that SQL is optimized for the transactional kind of operations which you associate with an RDBMS. HiveQL offers very powerful analytical functions, because the whole point is to move the calculation of complex metrics out of software, that is, out of your application, and into Hive, and that is done by supporting a query language which in some ways is much richer. Operations such as partitioning, bucketing, and windowing are all much more easily accomplished in Hive than in an RDBMS.
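As a taste of that richness, here is a hedged sketch of the kind of windowed analytical query HiveQL makes easy; the orders table and its columns are hypothetical.

```sql
-- A running revenue total per state, ordered by date: a typical OLAP-style
-- question answered with partitioning and windowing inside a single query.
SELECT
  state,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY state ORDER BY order_date) AS running_revenue
FROM orders;
```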
Okay, let's now double-click on each one of these, starting with data set size. We've already spoken about how traditional RDBMS are a pre-big-data concept. A large data set which exists on HDFS can easily extend into terabytes or even petabytes. In the relational database world, the data is typically going to be small enough to fit in memory on a high-performance machine, and therefore it's going to be in megabytes or gigabytes. These data set sizes also play nicely into the use cases of the respective tools. You tend to use something like Hive when you'd like to calculate a trend or some kind of analytics from a really large, broad data set. On the other hand, you tend to use an RDBMS to access or update individual records.
For a needle-in-a-haystack kind of operation, an RDBMS will work much better than an OLAP processing tool like Hive. Moving on, let's talk about the parallelization inherent in these two tools. As we've discussed time and time again, Hive basically runs on top of Hadoop, which is a distributed system with many machines, maybe thousands of them. Relational databases typically run on servers: extremely powerful single computers, possibly with some failover replicas. Going back to our conversation about monolithic versus distributed computing, monolithic database servers are maybe the most successful still-existing form of monolithic computing.
Relational databases are the last holdout in that sense; everywhere else, distributed computing is winning. In Hive, the data is semi-structured, and it is partitioned across the distributed file system, HDFS, using the block storage format which we had previously discussed. Relational data is anything but semi-structured: it has to be very strictly structured into rows and columns with schemas, and those schemas are enforced. All of this typically takes the form of data in table files which exist on a single machine, that is, the database server.
- HQL vs. SQL
Hive and tools like it offer horizontal scaling; most traditional relational databases do not. The one exception was Cloud Spanner, which we had discussed in some detail. Recall that Cloud Spanner blurred the lines between a relational database and a big data technology. In a tool such as Hive, we get horizontal scaling because disk space is very cheap and it's abstracted using the distributed file system, so we can add space simply by adding more machines, that is, by adding more data nodes into our HDFS cluster. Most traditional databases do not offer horizontal scaling because they have a monolithic server of sorts, and disk space is very expensive and very difficult to attach to a single machine.
Recall, for instance, the maximum we could go up to while discussing local SSDs in the Compute Options section of this course: three terabytes. That's as far as local SSDs could take us; using persistent disk, or even persistent SSD, we could go up to 64 terabytes. Horizontal scaling is easy to achieve when one has parallelization and a distributed file system like HDFS; it's very difficult to achieve with a monolithic, supercomputer-like server. This also informs the relative approaches to latency. Unfortunately, in a distributed file system, files are all over the place: they are scattered on different machines, they are not even indexed, and so they cannot be accessed quickly.
This is why latency is very poor, that is, very high, in tools like Hive. I will point out, however, that Hive differs from GCP's BigQuery in this respect, because BigQuery uses an underlying columnar data format, which allows relatively faster access and much lower latency than Hive. In any case, relational databases have extremely low latency because they rely heavily on indices. Recall that tools which have indices, for example Datastore, which is a document-oriented database, tend to be great for needle-in-a-haystack style random access. Relational databases tend to win hands down versus Hive in use cases like this. Under the hood, what's going on in Hive is a MapReduce job.
And so, if you would like to perform an operation to fetch a single row, that is still going to require all of the overhead of MapReduce: the mappers and the reducers, the shuffling, and the other steps that go on in the MapReduce programming paradigm. In this way, fetching a single row using MapReduce can easily take minutes, whereas RDBMS queries have their latency measured in microseconds or milliseconds.
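To see the difference, compare the same needle-in-a-haystack lookup in both worlds; the table name is hypothetical, and the exact behavior depends on Hive version and settings.

```sql
-- In an RDBMS with an index on customer_id, this is an index seek,
-- answered in milliseconds:
SELECT * FROM customers WHERE customer_id = 42;

-- In Hive there is no index to seek into, so the same statement typically
-- means scanning the table's HDFS files; EXPLAIN shows the stages Hive
-- generates to do it:
EXPLAIN SELECT * FROM customers WHERE customer_id = 42;
```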
Moving on to read and write operations in Hive and RDBMS, this brings us to another crucial, and in fact new, point which we have not yet alluded to: Hive is not necessarily the owner of the data. After all, that data resides in files on a distributed file system like HDFS, and it is entirely possible for any number of technologies to go behind Hive and access those files directly; Hadoop, Pig, and Spark all do exactly this. This has yet another implication, which is that it's basically impossible to enforce a schema. After all, how can Hive make sure that some other application, such as a Hadoop MapReduce job or Spark, has not gone in and violated the schema in some way? It's simply impossible for schemas to be strictly enforced. This is completely different from relational databases, where of course the RDBMS is the sole gatekeeper for the data.
This is why schema enforcement in Hive is called schema on read. Schema on read basically implies that the number of columns, their types, and the constraints may be specified at table creation, but Hive can only make a best effort to impose that schema while the data is read. It is entirely possible that Hive will fail to enforce the schema; it will go ahead and pad the data with nulls and do whatever else is required for the read to succeed. In contrast, relational databases can go with a much stronger form of schema enforcement, and that is schema on write. They can confirm, at the point of writing data into the database, whether the schema and all of the associated constraints are being satisfied or not.
And if not, relational databases will reject the write operation. This difference in data ownership and schema enforcement between Hive and a traditional RDBMS also leads to differences in ACID support. Data can be dumped into Hive "tables" (the air quotes are because, after all, these are HDFS files) from any source at all, so there really is no realistic way for Hive to be ACID compliant. Relational databases, on the other hand, can choke off access and make sure that only data which satisfies the constraints and meets the ACID properties gets into the database.
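Here is a hedged HiveQL sketch of schema on read; the path, table, and column names are hypothetical.

```sql
-- The table definition is just a lens over files that already sit in HDFS;
-- Hive does not validate those files when this statement runs.
CREATE EXTERNAL TABLE customers_raw (
  customer_id BIGINT,
  name        STRING,
  state       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/customers/';

-- Only at read time does Hive try to apply the schema. Fields that do not
-- parse (say, a non-numeric customer_id) simply come back as NULL rather
-- than being rejected, which is exactly what schema on write would prevent.
SELECT * FROM customers_raw LIMIT 10;
```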
All of these differences between Hive and RDBMS inform the huge differences between HiveQL and SQL. In HiveQL, because schema on read is the only game in town, there really are no constraints at all. Relational databases, on the other hand, support a wide variety of constraints and enforce all of them: NOT NULL, UNIQUE, FOREIGN KEY, and other constraint types can all be enforced in a relational database. Hive doesn't really have any meaningful indexing support; we will not discuss the minimal support which it does have in much detail, but indices are clearly extremely important in a relational database. We can also see from this why row-level updates and deletes are only a special case, and most update and delete operations in Hive are just not supported. Hive is not really set up to write data.
It's mostly set up just to read and analyze data. Hive does support a very rich variety of built-in functions; it's almost like a little programming language going on internally. This is a much richer set than relational databases or SQL typically support, and it is important in order for analytics processing to be carried out using a tool like Hive. These rich built-in functions compensate to a certain extent for the fact that Hive only supports equi-joins; in most analytics operations, this is fine. Relational databases have no such restrictions on join conditions at all. You might remember that Datastore, the document-oriented database on the Google Cloud Platform, has similar restrictions on non-equi-joins and subqueries.
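As an illustration, assuming hypothetical orders, customers, and tax_rates tables, the first join below is the kind of equi-join Hive supports, while the second, a perfectly ordinary RDBMS join on an inequality, has historically been rejected by Hive.

```sql
-- Supported: an equi-join, with equality conditions only.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Historically not supported in Hive (non-equality join condition),
-- though any traditional RDBMS will happily run it:
-- SELECT o.order_id, r.tax_rate
-- FROM orders o
-- JOIN tax_rates r ON o.order_date BETWEEN r.valid_from AND r.valid_to;
```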
Hive likewise only supports specific types of subqueries, unlike relational databases, where it's pretty much possible to have any valid SQL statement run as a subquery. This conversation about the differences between Hive and traditional relational databases has hopefully made it explicitly and painfully clear why Hive or BigQuery would not be a good fit for transaction processing: there just is not the support, not even for the ACID properties. At the same time, we can also see why we would choose something like Hive or BigQuery for OLAP. There's just a lot more horsepower: we can consume big data sets and perform complex analytical processing on them. None of this would be possible using a traditional relational database.
- OLAP in Hive
Now that we've got a decent sense of the differences between Hive and a traditional relational database management system, let's explore some of the OLAP capabilities in Hive. As usual, while we are doing so, I'd like you to meditate upon this question: why are specific features like windowing and partitioning so important in an OLAP big data technology like Hive? Clearly, traditional relational database management systems don't emphasize windowing, partitioning, or other advanced querying functionality all that much. With that big-picture consideration in mind, let's plunge right into a conversation about partitioning and bucketing. In Hive, partitioning and bucketing share the same basic idea: they are both ways of splitting our data.
They are both ways of creating more manageable chunks that are easier to work with. In Hive, each of these is a feature that adds performance optimizations, and each is something that you, as the end user of Hive, can control. It is possible that in your application the data is logically decomposable into some units. Take, for instance, geographical data: if you are dealing with data for states in the US, it's pretty natural to partition based on the state. So if you have a customer information database and all of your customers are in the US, you can tell Hive that you would like to partition your data on the state column. The result will be that each of these partitions is stored in a different directory.
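A minimal sketch of what that might look like, with a hypothetical customer table partitioned by US state:

```sql
-- Hypothetical customer table, partitioned by US state. Each distinct
-- value of the partition column `state` gets its own HDFS directory.
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING,
  city        STRING
)
PARTITIONED BY (state STRING);

-- A query with a partition filter, for example:
SELECT COUNT(*) FROM customers WHERE state = 'CA';
```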
The obvious advantage and implication of this is that any time you specify a partition filter, the query can run only against the directory which contains the relevant data. This can be a big potential saving in a distributed system: Hive need not go hunting around HDFS through the directories for all of the irrelevant states. One important bit worth keeping in mind is that the partitions are unlikely to be of the same size. For instance, it is highly likely that the number of customers in California will far exceed the number of customers in, say, Oregon. This is just something that partitioning comes with: the partitions will not all be the same size.
If that poses a problem for you, if you are not happy with that behavior, you ought to consider going with bucketing instead. Bucketing is an alternative way of splitting data, and it has a much higher likelihood of creating splits which are equal in size. Here, effectively, the buckets are created on the hash of some column value, and the result of using a column value hash is usually to distribute data evenly across the different buckets. You could go with any column value: an address, a name, a timestamp, pretty much anything. We had seen, while discussing the design of HDFS, that file blocks were chosen to all be 128 megabytes, and that's because having splits of the same size can lead to some pretty important performance benefits.
This is why bucketing might be a good idea. Bucketing and partitioning are not mutually exclusive, because bucketing is implemented using separate files: inside a directory which corresponds to a partition, there is nothing preventing you from having different buckets, one for each split of the hash values in your column. Hive also has powerful functionality related to sampling data, which comes in handy when data sets get really large. Both sampling and join operations in Hive become a lot more efficient because of the presence of buckets. As you can imagine, equi-joins on bucketed values greatly reduce the search space, because Hive can very quickly zero in on the right bucket based on the hash value. Recall again that Hive only supports equi-joins, and you now have a hint as to why: bucketing is a way to make equi-joins very high performance, relatively speaking at least. A big part of really effectively using Hive is carrying out various query optimizations which do not operate on flat relational data; rather, they divide the data into much more manageable subsets. This is really important while working with big data, and it is why bucketing and partitioning are both so important. In the same logical category are various join optimizations, which also have to do with smartly divvying up data, as well as windowing functions. If you think about it, join optimizations and windowing would all be prohibitively expensive on a file which is petabytes in size, unless you have some smart ways of restricting the data that needs to be processed.
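Here is a hedged HiveQL sketch of bucketing (combined with the earlier partitioning) and of bucket-based sampling; the names and the bucket count are hypothetical.

```sql
-- Within each state partition, rows are hashed on customer_id into 32
-- bucket files of roughly equal size.
CREATE TABLE customers_bucketed (
  customer_id BIGINT,
  name        STRING,
  city        STRING
)
PARTITIONED BY (state STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

-- Bucket-based sampling: read roughly 1/32 of the data instead of all of it.
SELECT *
FROM customers_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 32 ON customer_id);
```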
In relational databases, where all or most of the data in a table is going to fit in memory, a join optimization might be a nice-to-have; in something like Hive, it is an absolute must-have. Remember that join operations, like everything else in Hive, are basically MapReduce jobs under the hood, so when you optimize a join, what you're really doing is reducing the amount of data that has to be held in memory. Indeed, this is why Hive only allows equality joins; it does not allow inequality conditions.
Recall that there was a similar condition in Datastore, the document-oriented NoSQL database service available on the Google Cloud Platform: only one inequality condition was allowed per query, and that also had to do with limiting the number of tuples which must be held in memory. Let's say that we wish to join a really large table, a 500-gigabyte table, with a really small table which is just five megabytes. The 500 GB table is going to consist of a large number of 128 MB file blocks, split across multiple machines in the underlying Hadoop cluster. Here, one table is going to be held in memory while the other is read from disk, and it makes sense that the smaller table, the five MB one, should be the one held completely in memory.
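A hedged sketch of how you might express that in HiveQL; the table names are hypothetical, while the MAPJOIN hint and the auto-conversion setting are standard Hive features.

```sql
-- Ask Hive to hold the small dimension table in memory on every mapper
-- (a "map join"), so the 500 GB fact table is only streamed from disk and
-- the big table never has to be shuffled. Recent Hive versions can make
-- this choice automatically:
SET hive.auto.convert.join = true;

SELECT /*+ MAPJOIN(r) */ o.order_id, r.region_name
FROM orders_500gb o
JOIN regions_5mb r ON o.region_id = r.region_id;
```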
Another way of achieving much better performance in the MapReduce paradigm is by eliminating the reduce operation, converting an operation into a map-only job. If you work a lot with MapReduce, this will instantly ring a bell; if not, you might find it a little difficult, so let's try and understand it with an example. Recall that all MapReduce operations have two phases, the map phase and the reduce phase. The output from the map phase is a set of key-value pairs, and the input into the reduce phase is grouped so that each reducer gets a list of values, all of which share the same key.
Now, if certain queries can be structured to have no reduce phase, that dramatically reduces the amount of time that has to be spent executing the query. As an example of such a query, consider a really simple SQL query with a filter: we only wish to report the rows in a particular table which satisfy a condition. All that the mapper needs to do is check whether that condition is satisfied and add the row to the output only if it is. One way of executing this is for the mapper to use null as its key, which forces all output values from all of the mappers to go to a single reducer; that reducer then doesn't have to do anything, it just outputs all of the input values.
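As a sketch of such a query, using the hypothetical orders table from earlier, a pure filter with no aggregation is exactly the shape Hive can run without any meaningful reduce work:

```sql
-- Each mapper just checks the predicate and emits the matching rows; there
-- is nothing to group or aggregate, so the reducers have nothing real to do.
SELECT order_id, amount
FROM orders
WHERE amount > 1000;
```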