Google Professional Data Engineer – Appendix: Hadoop Ecosystem part 6
- Streams Intro
Let’s now introduce Apache Flink and, as always, here is a question which I’d like you to ponder as we discuss this technology. How, if at all, can MapReduce be used to maintain a running summary of real time data sent in from sensors? These sensors send in temperature readings every five minutes. You would like to calculate, say, the average of the temperature readings from all these sensors, and that average needs to be updated forever. In perpetuity, your MapReduce job should never stop calculating this average temperature, no matter what. Applications like these are exactly where stream processing tools like Apache Flink, Apache Kafka or Spark Streaming come in handy. This requires us to understand a pretty fundamental distinction between bounded data sets and unbounded data sets.
If one has a bounded data set, one processes it in batches. If one has an unbounded data set, one processes it in streams. We will discuss this distinction between batches and streams in more detail in just a moment. First, let’s be sure that we understand what exactly unbounded data sets are. Unbounded data sets have two defining characteristics. The first is that data gets added to these data sets continuously and on an infinite basis. The other is that unbounded data sets require continuous processing. So the processing needs to continue for as long as the data is received, which again, is unto infinity. This does not lend itself well to batch processing. With batch processing, there is a finite set of data points.
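To make the sensor example concrete, here is a minimal Python sketch of such a never ending running average; the sensor_readings() generator is a made-up stand-in for a real sensor feed, not anything from the lecture.

```python
# A running average over an unbounded stream: one reading at a time,
# with no natural end. sensor_readings() is a hypothetical stand-in.
import itertools
import random

def sensor_readings():
    """Simulates an unbounded stream of temperature readings."""
    while True:
        yield 20.0 + random.uniform(-5.0, 5.0)

count, total = 0, 0.0
# islice() bounds the demo at 10 readings; a real streaming job never stops.
for reading in itertools.islice(sensor_readings(), 10):
    count += 1
    total += reading
    print(f"reading={reading:.1f}  running average={total / count:.2f}")
```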
Resources are used to perform some calculation on them, and then those resources are given up. Now that we’ve understood the defining characteristics of unbounded data sets, let’s understand this difference between batches and streams. Another key characteristic of batch processing is that the entire pipeline of transformations is rather slow. If one took all of the steps from the point where data is ingested to the point where it’s actually analyzed and some insights result, that is a long process. Stream processing, on the other hand, has real time requirements. The processing must be carried out immediately, as soon as the data is received. This is in contrast to batch processing, where a batch job might take hours or even days to return.
In batch processing, the results are typically calculated at some fixed periodicity as batches complete. As jobs complete, those results are emitted. With stream processing, however, there is a different model. Continuous updates flow out of the stream processing job. This is a constant and never ending process. Batch processing jobs typically do not care about the order in which data was received because all of it is going to be aggregated and operated on in a batch in any case. In stream processing, however, order is important. Any arrival of data out of order needs to be tracked and handled as a special case. Another important attribute of batch processing systems is that they have the luxury of relying upon a single state of the world. The data that they are operating on encapsulates that state of the world.
It acts as the source of truth. Stream processing applications do not have this luxury because, in a sense, their data set is constantly changing. So they do not have a global state that they can rely on. They only have a history of events that have been received. Let’s now understand some of the attributes of a stream processing system. The defining characteristic of a stream processing system, of course, is the presence of a stream: an unbounded data set in which data items are received in the form of continuous updates. Common examples include log messages, tweets, or climate sensor data. Such data satisfies both of our defining characteristics, because there is no end to such a data stream and because the results that we wish to calculate on it require continuous monitoring and updating.
To process this data, we need to work on it, maybe one entity at a time, or on the basis of a set of entities collected into a batch. There are a bunch of things that we might want to do with such a data stream. If it’s a log, we might want to filter out error messages and scrutinize those, maybe for some kind of intrusion detection. If it’s a Twitter stream, maybe what we need to do is search for references to the latest movies or to trending keywords. Or if it’s sensor data, then maybe we need to compute some kind of weather pattern, a weather warning, or some kind of alert. Let’s say this is an oil field.
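As a toy illustration of the first of these use cases, here is a Python sketch that filters error messages out of a log stream; the log_lines() source and its messages are made-up assumptions, not anything from the lecture.

```python
# Filter an (in principle unbounded) stream of log lines, keeping only
# error messages. log_lines() is a hypothetical stand-in for a real log source.
def log_lines():
    yield "INFO service started"
    yield "ERROR disk full"
    yield "INFO request served"
    yield "ERROR connection refused"

def errors_only(lines):
    """Acts on each entity as it arrives, keeping only error messages."""
    for line in lines:
        if line.startswith("ERROR"):
            yield line

for alert in errors_only(log_lines()):
    print("alert:", alert)
```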
As more and more data is collected, and as more and more powerful machine learning models sit on top of these big data collections, the need for stream processing is growing. Stream processing requires some kind of mechanism which is going to operate on additional entities as they keep coming in. This mechanism must be able to store, display and act on a set of filtered messages. Every one of these use cases is a stream processing use case. We get a stream of data items, we aggregate that stream in some way, and then we go ahead and process it. This requires us to clearly understand a couple of concepts. There is the data stream, which consists of all of the data items as they come in.
And then there are the operations which we perform on that stream, which constitute the stream processing we’ve already discussed. Traditional systems such as files or databases have a source of truth. They have a repository of data items which can be really large but is still going to be finite. And that is the source of truth. They can always go back and reperform a calculation on that same data set if they feel that something has gone wrong. This is not the case with a stream first architecture. In a stream first architecture, there are going to be data items which can come from a bunch of different sources. Some of these sources might be traditional sources such as files or databases, but at least one of them is going to be an unbounded data set, i.e., a stream.
All of these data items, from whatever source they are derived, will then be aggregated and buffered in some way. And that buffer is known as the message transport or the queue. The message transport or the queue has an extremely important role to play in a stream first architecture because, as far as the stream processing goes, that message transport or that buffer or that queue represents the stream. And that really is the source of truth. Message transport systems are technologies in their own right. Two common and well known message transport systems are Apache Kafka and MapR Streams. Now, once all of those data items have been buffered in the message transport, they will then be passed on to a stream processing system which has the ability to actually do something with them.
Examples of stream processing technologies include Apache Flink and Spark Streaming. Each of these two highlighted bits is an essential characteristic of any stream first architecture: both the message transport and the stream processing engine. Both of these are required for an architecture to really be called stream first. If we think about an architecture like Hadoop or MapReduce, it has neither of these elements. And that is why MapReduce just cannot be used for stream processing. It’s quite simple. It’s a pure batch processing system.
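To make the two halves concrete, here is a minimal Python sketch in which a plain in-process queue stands in for a real message transport like Kafka; the queue, the sentinel value and the event names are all illustrative assumptions.

```python
# The message transport decouples the sender from the stream processor:
# the producer only writes to the buffer, the processor only reads from it.
import queue
import threading

transport = queue.Queue()  # stand-in for the message transport / source of truth

def producer():
    for event in ["e1", "e2", "e3"]:
        transport.put(event)   # senders talk only to the buffer
    transport.put(None)        # sentinel marking the end of this demo stream

def stream_processor():
    while True:
        event = transport.get()  # processors read only from the buffer
        if event is None:
            break
        print("processed", event)

threading.Thread(target=producer).start()
stream_processor()
```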
- Microbatches
A stream first architecture has a message transport system and a stream processing system. That message transport system contains within it, at any point in time, a set of data items which constitute the source of truth. They are the stream. They are the world as far as the stream processing architecture goes. Let’s dig a little deeper and try to understand some of the attributes of that message transport. At heart, the message transport is a buffer. It’s a queue of sorts for the event data. This message transport needs to be persistent. That’s because entities should not be lost even if there is some kind of system failure. But in spite of this, it has to also be pretty high performing. So the message transport should be pretty low latency.
One of the important functions of the message transport is to decouple the sender and the receiver, and also to insulate both senders and receivers from crashes in the other. This also requires a decoupling of multiple sources from the processing, which is another important attribute. Message transport technologies include Kafka and MapR Streams. That’s as far as the message transport bit goes. Let’s also talk about the stream processing portion of the stream first architecture. This also needs to be high throughput and low latency. So be careful about retrofitting stream processing onto any batch processing system. The latency is likely to kill the performance of any such retrofitted system. In addition, the stream processing system also needs to be fault tolerant.
If any data items or packets or events are lost, there should be a way to recover them. And as always with fault tolerance, this recovery mechanism should not take up a huge amount of overhead. Another important requirement is that out of order events need to be managed. So if the data items appear out of order, our stream processing system needs to be smart enough to detect that and then also to reorder them so that the events are processed in the right order. Like any other bit of software, stream processing systems need to be easy to use and maintainable. In addition, this has the implication that they should have the ability to replay streams. This is because debugging a streaming application can be really hard.
So the ability to replay streams is really critical. Examples of common and important stream processing technologies include Spark Streaming, Apache Storm and Apache Flink. Now, if you are particularly sharp eyed, you might notice that Spark was not really a stream first architecture to begin with. You might recall that it had a cluster manager and a storage manager, and it looked pretty batch oriented in its architecture. The way Spark Streaming approximates stream processing is through the use of something known as micro batches. This actually is a fairly common approach adopted by bits of software which were built to deal with batch processing, but where stream processing now needs to be retrofitted on. Let’s understand how this would work.
Let’s say that the data items come in in the form of a stream as before, and it’s received by a batch processing system which needs to fake the effect of stream processing. The first thing that we can do is to group together these data items on some basis, such as the time at which they were received. Grouping these together into batches allows our batch processing system to deal with them. Now, we can make the batch sizes really small. If, in the extreme, the batch sizes are made infinitesimally small, what we get is stream processing. This trick is known as mimicking streams using micro batches, and it actually works really well. It has some important advantages. One important advantage is that it’s relatively easy to achieve something known as exactly once semantics.
Exactly once semantics is a guarantee that each item in the stream will be processed exactly once. This is easy to achieve using micro batching because we can replay the batches. Another benefit of using micro batches is that it allows us to pick the exact trade off between latency and throughput which works for us. Let’s say, for instance, that our particular application requires extremely low latency. It requires each data item to be processed as soon as it comes in on the stream. In a situation like this, we need to go with a really small batch size, because we need to move towards the use case of true stream processing. But it’s also possible that our use case might be okay to live with a relatively high level of latency.
So we can increase the batch size and begin to approach true batch processing. We can find the sweet spot in this trade off between latency and throughput based on the batch size which works best for us. There are a couple of pretty popular stream processing systems, Spark Streaming and Storm Trident, which both use the micro batch architecture. And so, in this way, neither of these really qualifies as a stream first architecture. An example of a stream first architecture is Apache Flink. This truly is stream first because the basic unit of processing there is the individual data item in a stream. And in Flink, it is batch processing which is retrofitted onto a stream first architecture. In effect, Flink fakes the idea of batches by grouping together individual data items in a stream.
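Here is a small Python sketch of the micro batch trick itself; the batch_size parameter is the knob in the latency versus throughput trade off described above, with a size of one approaching true stream processing and a large size approaching true batch processing.

```python
# Mimic a stream with micro batches: group incoming items into small
# fixed-size batches that a batch engine can process.
def micro_batches(stream, batch_size):
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch        # hand this group to the batch engine
            batch = []
    if batch:
        yield batch            # flush the final partial batch

events = range(1, 8)
for batch in micro_batches(events, batch_size=3):
    print("processing micro-batch:", list(batch))
```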
- Window Types
Whether you mimic real time streams using micro batches or use a stream first architecture, an important concept in streams is the concept of windowing your streaming entities. A window essentially groups together a bunch of entities in a stream in order to perform aggregations on them. There are three types of windows that we generally work with. These are the most important, though there are some other types of windows that you might find in certain applications as well. The three types that we are concerned with here are the tumbling window, the sliding window, and the session window. These group streaming entities in slightly different ways, and depending on the application or the product that you’re building, you might want to use one or the other.
Consider a stream of data that flows in from some source. The source can be Twitter, log messages, sensor data, anything. To perform aggregations on these streaming entities, you might choose the tumbling window. In a tumbling window, the window size is fixed at a specified time interval. The time intervals are non overlapping, as in one time interval does not overlap with another, which means that every streaming entity is present in exactly one interval. The number of entities within a window is not fixed. That means each tumbling window, each window interval, can have any number of entities within it. If you apply this tumbling window to Twitter messages that originated in the United States, the number of messages per tumbling window will be higher during the daytime, when people are awake and tweeting. At night, when it’s relatively quiet, there will be fewer messages per tumbling window interval. If you observe how it works, it’s called a tumbling window for obvious reasons: the window tumbles over the data in a non overlapping manner. Let’s say you want to perform the sum aggregation on the entities within a tumbling window. You will get a sum of 14 for the first window. The window will then tumble over, giving you a sum of seven. It can tumble once again to give you a sum of 19, and the last tumble will give you a sum of 21. Notice that every entity is included exactly once within a tumbling window.
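To make the tumbling window arithmetic concrete, here is a minimal Python sketch; the (timestamp, value) pairs and the ten second window interval are made-up assumptions, chosen so that the window sums come out to 14, 7, 19 and 21 as in the example above.

```python
# Tumbling window sum: non-overlapping windows, each entity in exactly one.
WINDOW = 10  # seconds (assumed interval)

events = [(1, 2), (4, 9), (8, 3),          # window [0, 10):  sum 14
          (12, 3), (15, 4),                # window [10, 20): sum 7
          (21, 7), (26, 5), (29, 7),       # window [20, 30): sum 19
          (33, 9), (38, 12)]               # window [30, 40): sum 21

sums = {}
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW  # assign each entity to one window
    sums[window_start] = sums.get(window_start, 0) + value

for start in sorted(sums):
    print(f"window [{start}, {start + WINDOW}) sum = {sums[start]}")
```

Contrast this with a sliding window.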
The sliding window also has a fixed time interval over which it aggregates the streaming entities. The window size is fixed, but as it moves over the data, there can be overlapping portions. Sliding windows have overlapping time, and this overlap time is called the sliding interval. This represents the amount of time that one window position overlaps with the next window position. Each window is of a fixed time interval, just like the tumbling window, and the number of entities within a window can be different. If the stream comes in spurts, there will be a different number of entities for each fixed interval. Let’s apply the same sum aggregation over these streaming entities, but this time with a sliding window, and see how the results differ.
The window interval is the same one that we used with the tumbling window. At the beginning, the window includes the first three elements. The sum is 14. This is exactly the same as before. The window then slides over, but there is an overlapping entity. The entity three is included in this new window position as well, to give you a sum of eight. As the window slides over, notice that there are certain entities that are present in more than one window interval and can be counted twice. Sliding windows are very common when we are looking at error spikes in logs. At any instant, we want to view the errors that occurred in the last n minutes, where n can be one to any number. This requires us to count errors over a sliding window.
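Here is the matching sliding window sketch; the ten second window, five second sliding interval and the (timestamp, value) pairs are assumptions, arranged so that the first two window sums are 14 and then 8, with the entity of value 3 counted in both.

```python
# Sliding window sum: fixed window size, overlapping positions, so entities
# in the overlap are counted in more than one window.
WINDOW, SLIDE = 10, 5  # assumed window interval and sliding interval

events = [(1, 2), (4, 9), (8, 3), (12, 5), (17, 4)]

last_ts = max(ts for ts, _ in events)
start = 0
while start <= last_ts:
    total = sum(v for ts, v in events if start <= ts < start + WINDOW)
    print(f"window [{start}, {start + WINDOW}) sum = {total}")
    start += SLIDE
```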
The third kind of window that we look at today is the session window. This differs from the sliding and the tumbling window in that it does not encompass a fixed time interval. The window size changes based on what a session is, and the length of a session is user defined. You typically specify the idle time between user activity, and all the user activity between those idle times gets grouped into a session. You can say that a user has to be idle for 30 seconds, and everything up to that idle period will be considered one session. This is why the window sizes for a session are not fixed and there is no overlapping time. Sessions are defined by the idle time between session activity. The number of entities within a session is also not fixed.
Imagine that you type really fast for any number of minutes and then you pause for 30 seconds or one minute. That might be a session. The session gap, which is user defined (a one minute or a 30 second interval of idle time), determines the size of a session window. In this particular example that you can see on screen, the first two windows had fewer entities, but the third window includes eight entities. This is because the gap that you see between those two entities marked with a red arrow is not large enough to start a brand new window.
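Finally, here is a sketch of session windows in Python; the timestamps are made up, and the 30 second session gap follows the lecture’s example of a user defined idle time.

```python
# Session windows: group entities until the gap between consecutive
# timestamps exceeds the user-defined idle time.
SESSION_GAP = 30  # seconds of idle time that closes a session

timestamps = [1, 3, 8, 50, 55, 120, 125, 140]

sessions, current = [], [timestamps[0]]
for ts in timestamps[1:]:
    if ts - current[-1] > SESSION_GAP:  # idle too long: close the session
        sessions.append(current)
        current = []
    current.append(ts)
sessions.append(current)

for i, s in enumerate(sessions, 1):
    print(f"session {i}: {len(s)} entities at {s}")
```

Tumbling windows, sliding windows and session windows are some ways in which we group together streaming entities in order to perform aggregations on them.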