In-Memory Data Grids and Hadoop: Sorting Your Big Data
The amount of business data that’s generated and moved across networks today is staggering. You’ve likely already heard the quote from former Google CEO Eric Schmidt, who in a talk about big data said that more exabytes of data are now generated every two days than were created from the beginning of human civilization to 2003. One of the hottest fields in technology today is planning and developing the systems that allow humans not just to sort through that tidal wave, but to derive actionable insights from it.
For a while now the de facto leader in making this happen has been Hadoop. As we’ve mentioned in past posts, Hadoop is an open-source software framework for processing huge amounts of data. While the Hadoop framework offers an effective method for data processing and analysis, a relatively new technology, the in-memory data grid (IMDG), is poised to turn the big data segment on its ear in terms of just what constitutes speed or efficiency.
How it Works
Frameworks like Hadoop are built upon a Google-crafted technology for distributed data processing called MapReduce (M/R). M/R handles immense data sets by breaking them down into constituent units of similar sorts of information and distributing them to different nodes on a network to process. In turn, those nodes break down their chunks into smaller subsets that get sent out to even more nodes on the network. The partial results are then combined into a final answer (the “reduce” half of the name). The net effect is that the lead time for data processing is drastically reduced.
By the late 2000s, companies like Cloudera had increased awareness of both Hadoop and the M/R data processing model as a way of addressing the concurrent rise in silos of big corporate data.
Contrary to what one might assume, an in-memory data grid is not a competitor to M/R itself, though it can be an alternative to Hadoop, which is one particular way of using the M/R model. In fact, one of the most exciting current developments is how data grid technology can supplement a traditional system using M/R. Like M/R, the data grid distributes parts of a massive data set across many different nodes. It does not propose a whole new method for doing so, or even a particularly different philosophy. Instead, IMDG software follows almost the same distributed data processing model, only greatly increased in scope and tweaked in terms of how the data is stored and transmitted across the network.
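To make the split-and-distribute idea concrete, here is a minimal single-process sketch of the M/R pattern in Python. It is not Hadoop's API: the function names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the "nodes" are just list slices rather than real machines.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, num_nodes):
    """Divide the input into one chunk per simulated node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_pairs):
    """Group emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, num_nodes=3):
    """Run the whole pipeline; in a real cluster each chunk is mapped on a different machine."""
    chunks = split_into_chunks(records, num_nodes)
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big grid", "data data"], num_nodes=2))
```

Because each chunk is mapped independently of the others, the map step is the part that a real framework can spread across many machines at once.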
Some of the key elements of an in-memory data grid:
- Data stored in server RAM
By storing data in RAM, a set of servers can completely bypass the delay created in transferring data to and from individual hard disks across the network. The transfer of data between different nodes or even servers, a frequent occurrence in a distributed data processing model, is also sped up. This makes response times on requests incredibly quick and can, theoretically, reduce the lead time on analytics results from 60 minutes to 60 seconds.
- Data shared across multiple servers
When running an in-memory data grid, it is not just nodes on an individual network but entire servers that become processing nodes for the data set to be analyzed: the titular “data grid.” Since an IMDG is software-based, each server runs its own instance of the application, and redundancy occurs because servers sometimes process the same parts of the data set independently. While that sounds like something that would decrease efficiency, it’s actually far faster than a model in which each server accesses information through a potential bottleneck like a centralized, shared database.
- Greater data resilience
Because of the data redundancy across the IMDG server cluster, a disruption on one server is less damaging than it would otherwise be, since the affected data is still available elsewhere in the cluster.
As part of the server-side backbone behind an enterprise’s user-facing analytics applications, in-memory data grid software offers the same business benefits as any system for big data analytics: greater visibility into past and current internal processes, value from data that would otherwise sit siloed, identification of emergent trends, and so on. Obviously, with increased speed and reliability for analytics comes increased agility.
Additionally, IMDG provides a simplified alternative to Hadoop. In spite of its many strengths, Hadoop’s parallel distribution model has also developed a reputation for being rather complicated, with a typical deployment requiring weeks of configuration in Java or other languages. Dr. William Bain, founder and CEO of ScaleOut Software, a data analytics company that works with in-memory data grids, pointed out some of the headaches that come with Hadoop as opposed to IMDG in a Datanami blog post.
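The three elements above (data held in RAM, spread across servers, and kept redundantly) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular IMDG product works: the class `InMemoryGrid`, its methods, and the "next server in the list holds the backup" placement rule are all hypothetical choices made for illustration.

```python
import zlib


class InMemoryGrid:
    """Toy data grid: each 'server' is a dict in RAM, and each key is stored on several of them."""

    def __init__(self, server_names, replicas=1):
        if replicas >= len(server_names):
            raise ValueError("need more servers than replicas")
        self.server_names = list(server_names)
        self.replicas = replicas
        self.stores = {name: {} for name in self.server_names}
        self.down = set()  # servers currently simulated as failed

    def _owners(self, key):
        """Pick a primary server by hashing the key, then the next servers as backups."""
        start = zlib.crc32(key.encode("utf-8")) % len(self.server_names)
        count = len(self.server_names)
        return [self.server_names[(start + i) % count] for i in range(self.replicas + 1)]

    def put(self, key, value):
        """Write the value to the primary and every backup (failure handling on writes is omitted)."""
        for name in self._owners(key):
            self.stores[name][key] = value

    def get(self, key):
        """Read from the first owner that is still up and holds the key."""
        for name in self._owners(key):
            if name not in self.down and key in self.stores[name]:
                return self.stores[name][key]
        raise RuntimeError("no live copy of %r" % key)

    def fail(self, name):
        """Simulate a server outage."""
        self.down.add(name)


if __name__ == "__main__":
    grid = InMemoryGrid(["server-a", "server-b", "server-c"], replicas=1)
    grid.put("order:1", {"total": 42})
    grid.fail(grid._owners("order:1")[0])  # take the primary offline
    print(grid.get("order:1"))             # still served from the backup copy
```

Reads never touch a disk or a central database here; each lookup is a hash plus a dictionary access on whichever server owns the key, which is the property the list above credits for the speedup.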
Several technology companies have already developed software solutions that take advantage of the in-memory data grid approach. Here’s a handful of the leaders:
Oracle Coherence
Coherence is Oracle’s middleware IMDG program. It provides application caching and enterprise-level data management for client applications written in a variety of different programming languages.
IBM WebSphere eXtreme Scale
Despite a name that looks like it was coined back in the early-to-mid 1990s, WebSphere eXtreme Scale is state of the art. Java and Microsoft .NET programs can integrate seamlessly with the in-memory data managed by WebSphere, allowing for enterprise-wide data sharing across multiple applications.
GridGain
Currently in its 4.0 iteration, GridGain, like Coherence, comes in three variants: Compute Grid, Data Grid and Big Data. The highest-end version, Big Data, offers support for integration with Hadoop and advanced abilities like data center replication, which ensures that data in each node is properly mirrored and backed up against all the other data centers in the cluster.