Operational Hadoop and the Lambda Architecture for Streaming Data

With all the talk about the growing volumes of data today, it’s no surprise that more and more analytics technologies are emerging on the big data scene. After all, it’s difficult to consume huge volumes of data via traditional ways. And when you consider that much of the new data types today consists of a series of measurements or event records, viewing small chunks of the data is unenlightening. Contrast this to data in your ERP or CRM systems, in which sales invoices or customer call records can provide you with meaningful information that you can use to make specific business decisions.

  • Monday, 15th December 2014 Posted 10 years ago in by Phil Alsop

Streaming data, which refers to a continuous inflow of data, is one form of these rapidly growing data sets. Streaming data is typically time sensitive, meaning that the point in time for which the data points relate is critical to analysis. The term “time-series data” is often used to describe data sets that inherently have time as an important component of the data points.

Since streaming data tends to grow in unbounded ways, the technologies required to manage such data must be incrementally and linearly scalable on a distributed cluster of commodity hardware servers. One proven technology is Apache™ Hadoop®, which has its roots in internet companies, but is also applicable to almost any data-driven business. Another relevant technology is the class of databases known as NoSQL. NoSQL databases were built primarily for distributed scaling and a variety of data types. Hadoop starts with batch processing, but when combined with NoSQL, you have an operational Hadoop platform that delivers the insights of large-scale analytics along with real-time data reads and updates.


An Architecture for Streaming Data
The distributed model of an operational Hadoop deployment is not without challenges. The often-cited CAP Theorem describes the tradeoff you must accept in a distributed system. At the risk of oversimplifying the key points in the CAP Theorem, let’s summarize by saying that the complexities of distributed systems force you to choose between availability and consistency. Availability is the ability to always return an answer to a user query, even if that answer is out of date. Consistency is about either returning the correct, up-to-date answer, or a response that the correct answer is not available.


The Lambda Architecture is an implementation that reduces some of the complexities of distributed data systems by treating your data as immutable, or unchanging, and isolating complexity to a small window of your data. Nathan Marz, in his seminal blog post, points out that all data is essentially immutable, and treating data as such simplifies data management. This goes against our accepted view of how data changes over time, but Marz notes that data is always time-specific, so while data may seem to change, what actually happens is a new data point is added in the context of a new period of time. And since much of the complexity of deploying distributed systems is due to treating data as changing, a system that assumes non-changing data is easier to manage.


The Lambda Architecture is a great framework if you want to take advantage of stream-processing methods. Since streaming data is easily recognizable as the writing of new data points, and not the update or deletion of existing data, the tenets of the Lambda Architecture make sense. The Lambda Architecture consists of three main parts: a batch layer, a speed layer, and a serving layer. The batch layer, typically handled by Hadoop, is about creating an immutable master data set on which pre-computed views are built in batch. Since the value of streaming data sets increases with larger volumes, supporting a massive batch layer is important. Another important point is that the immutability offers human fault tolerance because the data is never modified and thus never corrupted. Errors in the batch views can be corrected by updating the batch computation code which is then reapplied to the master data. The speed layer, typically handled by a real-time computation engine such as Apache Storm, and often combined with a NoSQL database, is about real-time processing and querying on a small, recent window of data. This adds a real-time, operational component to your deployment. The serving layer combines the batch and speed layers to provide a complete queryable view of your data. The application of these layers to streaming data appears straightforward. The batch layer handles the historical data, the speed layer can process recent events and deliver immediate insights to you, and the serving layer combines the benefits of both.


Streaming Data at Work
Let’s look at a few use cases of streaming data. One cross-industry example is sensor data that is collected for the purposes of predictive maintenance, a specific implementation of predictive analytics. Predictive maintenance is, as its name suggests, the practice of predicting when you need to repair or replace machine parts. These machine parts have sensors that take frequent measurements during usage in a real operating environment. For example, a set of sensors in an oil field drill might regularly measure the temperature and rotational speed of a drill bit in a drilling rig. Predictive maintenance not only avoids costly downtime as a result of part failure, but also eliminates inefficient scheduled maintenance, which is especially valuable when part inspection is costly and time-consuming.


Predictive analytics can also be applied beyond maintenance scenarios, including product quality assurance analysis. In that use case, you can detect non-obvious flaws in work-in-progress products in an assembly line to improve yield and avoid wasted downstream assembly. Another example is supply chain optimization, where you can optimize deliveries by predicting the fastest routes for a given set of destinations, based on past delivery times along specific routes.


In one example implementation model, historical measurement data makes up the batch component of a Lambda Architecture. The speed layer handles the recent data to provide new machine learning input, as well as the alerting mechanism for events that need immediate action. The computations at both layers typically involve pattern and anomaly detection, to predict when parts will likely fail soon based on past usage data. In the serving layer, recent data is applied to the analytics models to alert users to imminent failure.


Another area where management of streaming data will play a critical role is the Internet of Things (IoT). The “things” can range from computer equipment to home appliances to consumer devices. In a sense, the prior use cases can be viewed as part of the Internet of Things due to the high rate and volume of incoming data. One distinction, though, is that in an Internet of Things scenario, you can derive value from aggregating many different types of data from different types of sources, versus a predictive maintenance environment where the data sets come from the same category of machines. Perhaps the most interesting example for the masses will be the use of consumer devices, and how they can potentially help your life on a daily basis. There are already wearable devices that help track health-related aspects of daily activities, and large-scale analysis of these data sets can potentially help with insights such as recognizing the positive impact of a more healthy and fit lifestyle.
A more general use case is the enterprise data hub. Sometimes referred to as a data lake, an enterprise data hub collects data from numerous sources for storage, processing, and analysis. As the name implies, enterprise data hubs are used by large organizations that want to make better use of all the available data. Data sources such as social media and web clickstream data are combined with other data sources such as sales data and customer call data to give you a more complete picture of your customers and your business. In this use case, capturing streaming data is less about handling high- speed, growing data, and more about the flexibility of creating a variety of pre-computed batch views of the data. The Lambda Architecture best practice of storing a canonical version of data lets you accommodate future, unanticipated queries with simple updates in your batch computations.


There’s so much more than can be said about streaming data, predictive analytics, the Lambda Architecture, and all the other subtopics in this article. Hopefully you have a few ideas on why you’d pursue streaming data deployments, and with the wealth of information on the internet, along with expertise from technology vendors, you can easily investigate these topics further. Also be sure to look for expertise and innovations in this area from big data technology vendors that can help your implementation plan, including from MapR Technologies, a leading distributor of Apache Hadoop.