Powering the enterprise data hub with Hadoop

By Dale Kim, Director of Industry Solutions at MapR.

  • Monday, 22nd September 2014. Posted by Phil Alsop

Enterprise data hubs are becoming a necessity for many businesses. In many environments, however, the cost of deploying an enterprise data hub using traditional solutions is prohibitive. This is a major reason why Apache Hadoop is being used as the foundation for enterprise data hubs. Because of its many cost advantages, including a deployment model that leverages commodity hardware, businesses are turning to Hadoop as a core technology for optimizing their existing data architectures.

 

What Is an Enterprise Data Hub?

An enterprise data hub (EDH) is a system that acts as a storage and processing center for data in a variety of formats from a variety of sources. An EDH can typically store, query, update, transform, and deliver huge volumes of data. Organizations deploy EDHs to achieve a variety of business objectives, including a 360-degree view of the customer, legacy system retirement, and especially data warehouse optimization.

 

Businesses are collecting and analyzing more data than ever. They reach a point at which upgrading their data warehouse environment could be cost-prohibitive. As an alternative to upgrading, they discard older data to make room for more recent data. As a result, they lose the value of the long-tail historical data they have collected over the years, which limits the range of analytics they can perform.

 

An EDH allows businesses to better exploit and even expand their big data use cases. Furthermore, an EDH is more than just a place to analyze historical data. True EDHs also provide comprehensive extract/transform/load (ETL) capabilities to create high-value data summaries that can be reloaded into the data warehouse for analysis.

 

How Hadoop Can Be Used as an Enterprise Data Hub

Hadoop is a powerful option when building an EDH. Features that are innate to the Hadoop architecture, such as its scalability and workload flexibility, make it well suited to the task. For a Hadoop-based EDH to be successful, however, it must support the same enterprise-grade capabilities that businesses have come to expect from their existing enterprise data architectures. Those capabilities include:

 

· High performance

· High availability

· Data security

· Data recovery and disaster recovery

 

These are important characteristics that no organization should ignore. Getting the right expertise to help plan out the deployment will ensure a successful EDH initiative. Let’s take a look at what you should plan for:

 

High Performance

With so much data to manage, you want the fastest system possible to process data in the least amount of time. Also, you want to get the most output for your money. To do so requires planning to make sure you have the right mix of resources, and are not inadvertently creating a bottleneck in your system. Maximizing performance requires sufficient investment in memory, CPU, network bandwidth, disk throughput, and software optimizations. If you compromise in one of these areas, you potentially defeat the investment you’ve made in the other areas.
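The bottleneck point can be illustrated with a minimal sketch. It assumes, as a simplification, that a node's effective throughput is capped by its slowest resource. The resource names and the MB/s figures are hypothetical and chosen only for illustration.

```python
# Hypothetical per-node throughput ceilings in MB/s (illustrative numbers only).
def effective_throughput(limits):
    """Return the limiting resource and the throughput it allows."""
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]

balanced = {"disk": 800, "network": 1000, "cpu": 900}
skimped = {"disk": 800, "network": 120, "cpu": 900}  # same node, cheaper network

print(effective_throughput(balanced))  # ('disk', 800)
print(effective_throughput(skimped))   # ('network', 120)
```

In the second case the investment in disk and CPU is largely wasted, because the network link caps what the node can deliver.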

 

That’s why selecting the right software is important for your EDH. If you choose a technology not built for high performance, you don’t get as much out of your hardware as you expected. If you’ve already made a commitment to a specific software system, then conventional wisdom tells you simply to spend more on hardware. But with Hadoop, it’s more viable simply to switch distributions. The standard tools and interfaces of Hadoop ensure easy migration from one vendor’s distribution to another, letting you “upgrade” if necessary to get more performance.

 

High Availability

High availability (HA) refers to a system’s ability to work continuously despite failures of components such as servers, disks, and power supplies. One way this is handled is via data replication, which eliminates single points of failure in a multi-node cluster. If a node in a cluster is rendered unusable due to a component failure, the system performs an automatic failover in which one or more other nodes serve the replicated data of the down node.
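As a rough sketch of why replication removes single points of failure, the toy model below places each block of data on several nodes and checks what remains readable after failures. The round-robin placement, the node and block names, and the replication factor of 3 are assumptions made for illustration, not a description of how any particular Hadoop distribution places data.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the chosen nodes are distinct.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement

def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a live node."""
    return {b for b, holders in placement.items()
            if any(n not in failed for n in holders)}

nodes = ["node1", "node2", "node3", "node4"]
blocks = ["b0", "b1", "b2", "b3", "b4"]
placement = place_replicas(blocks, nodes)

# Losing one node leaves every block readable from another replica.
print(readable_blocks(placement, failed={"node2"}) == set(blocks))  # True
```

Data only becomes unavailable in this model when every node holding a given block is down at the same time.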

 

This level of reliability is sufficient for some systems, but not all. For example, in an EDH where large processing jobs on huge volumes of data may take hours or even days to run, you shouldn’t allow loss of work. A production-grade EDH must ensure jobs run to completion to avoid losing all the work that had been done up to the point of the failure.

 

Data Security

Without the proper security, an EDH can unintentionally become an access point to a company’s most valuable and sensitive information. This is why enterprise-grade security features are essential for any system used as an EDH.

 

Security in Hadoop entails the following subtopics:

 

· Authentication - identification of users wanting to access the cluster, leveraging technologies like Kerberos

· Authorization - allowing or denying authenticated users access to data, with mechanisms like role-based access control lists

· Auditing - reporting on where data came from and how it’s used

· Encryption - additional protection against unauthorized viewing, especially for data-at-rest and for data transmission between nodes

 

Each of these capabilities must be implemented in Hadoop to ensure a secure EDH. In addition, they should be automated in the system as much as possible to avoid unnecessary overhead in accessing data.
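To make the relationship between authorization and auditing concrete, here is a minimal sketch of a role-based check that records every decision. It assumes the user has already been authenticated (for example, via Kerberos) and leaves encryption out entirely. The roles, users, and paths are hypothetical and do not correspond to any Hadoop API.

```python
from datetime import datetime, timezone

# Hypothetical roles, users, and paths, for illustration only.
ROLE_PERMISSIONS = {"analyst": {"read"}, "etl": {"read", "write"}}
USER_ROLES = {"alice": "analyst", "bob": "etl"}
audit_log = []

def authorize(user, action, path):
    """Allow or deny an already-authenticated user, and record the decision."""
    role = USER_ROLES.get(user)
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "path": path,
        "allowed": allowed,
    })
    return allowed

print(authorize("alice", "read", "/hub/sales"))   # True
print(authorize("alice", "write", "/hub/sales"))  # False
print(len(audit_log))                             # 2
```

Because the check and the audit record live in one function, callers cannot access data without leaving a trace, which is the kind of automation the paragraph above argues for.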

 

Data Recovery and Disaster Recovery

An EDH also requires data recovery and disaster recovery capabilities. Data recovery is about fixing a problem related to data corruption, typically caused by application or user error. Disaster recovery (DR) is about keeping operations running despite the failure of an entire data center.

 

Data recovery is often handled by backups, but it can also be handled via snapshots. A snapshot is a point-in-time view of data that reflects the exact state of the data at the time the snapshot is taken. The exactness of the state is referred to as consistency, which is a critical characteristic. If data corruption occurs, users or administrators can retrieve valid data from the snapshot to undo the corruption. Consistent snapshot technology is found in relational databases, storage systems, and virtual machines, and is important for Hadoop as well. However, snapshots are not consistent in all Hadoop distributions, meaning that in some distributions, the snapshot is only an approximation of the state of the data. With consistent snapshots, you know that you can recover a known state of your data if necessary, and that you can run a reliable audit.

 

DR can be addressed manually with backups, but is more effectively supported with replication or “mirroring.” For proper DR, replicas of the primary data center are created in geographically remote sites. And as with snapshots, DR replicas need to be consistent so that if a disaster occurs, a remote replica can be enabled with a known, valid state of the data. This means that the use of file copying tools does not work as a proper DR solution, since copying multiple files across the network provides no consistency guarantees at the replica site. A true DR system lets you recover from a disaster back to a known state.
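The consistency problem with file copying can be shown with a small simulation. It assumes a dataset of two hypothetical files whose values must always total 200, and an update that touches both. The in-memory dictionaries stand in for files; this is an illustration of the failure mode, not a model of any real copy tool or snapshot implementation.

```python
import copy

def transfer(data, amount):
    """An update that touches two files and keeps their total constant."""
    data["accounts_a"] -= amount
    data["accounts_b"] += amount

primary = {"accounts_a": 100, "accounts_b": 100}

# Point-in-time snapshot: the whole state is captured at one instant.
snapshot = copy.deepcopy(primary)

# File-by-file copy: an update lands between the two copy steps.
replica = {}
replica["accounts_a"] = primary["accounts_a"]  # copied before the update
transfer(primary, 30)
replica["accounts_b"] = primary["accounts_b"]  # copied after the update

print(sum(snapshot.values()))  # 200: a state the primary actually had
print(sum(replica.values()))   # 230: a state the primary never had
```

The file-by-file replica mixes the old version of one file with the new version of another, so recovering from it would restore data that was never valid on the primary.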

 

Why MapR for an EDH

Everything that I discussed above is addressed in the MapR Distribution including Apache Hadoop. MapR is designed for ensuring customer production success. It delivers high performance to get more work out of fewer hardware servers. It protects against downtime, data loss, and work loss. It provides security to ensure there is no unauthorized access to data. And it provides recovery capabilities at file, directory, volume, and cluster levels. Any business looking to pursue a Hadoop-based EDH would be wise to take a close look at MapR.

 

The Enterprise Data Hub Is Here

More and more organizations will turn to EDHs to achieve business objectives that rely on efficient and powerful data management capabilities. You likely have an environment full of high-end technologies that bring real value to your organization. As your data volumes grow, adding an EDH will make sense as a way to complement and optimize your existing data architecture.

 

When pursuing an EDH deployment, Hadoop is a great technology with which to start your investigation. As you plan your overall strategy, which should factor in performance, availability, security, and recovery, it’s worth closely examining the various Hadoop distributions to see how well they meet your enterprise deployment requirements.