As a distributed framework for storing and analyzing big data (think petabytes), Apache Hadoop has a lot going for it: affordability, scalability and flexibility, to name a few. Yet it’s far from perfect, and of late many of its shortcomings have been amply profiled.
Some developers seem to have heeded the cries, however. Cloudera, the company at the forefront of packaging and providing support for Hadoop as a big data analytics and business intelligence engine, announced in October that it was working on Cloudera Impala, a new engine for interfacing with Hadoop. Like the original product, Impala is open source; the code can be viewed in its entirety on this GitHub page.
The aptly named engine, whose name evokes images of V8-engine muscle cars and nimble Sub-Saharan African antelope, addresses many of the complaints users had with the original system.
Key among the technical issues with Hadoop are slowness, a result of the way the system accesses different kinds of data, and varying degrees of poor compatibility with many of the bigger commercial-grade business intelligence solutions on the market.
Data handled by Hadoop is stored in the Hadoop Distributed File System (HDFS), which is layered on top of the normal filesystem used by the native operating system. HDFS can be used to store and scale the huge amounts of data that a BI program might use to build datasets for its insights. One of the downsides of Hadoop’s default filesystem, however, is that it’s not all that convenient for the sort of real-time queries that typically come out of an office environment, especially at the decision-maker levels (that is to say, the traditional consumers of the end product data visualizations that BI solutions produce).
One of the reasons for this is that HDFS is optimized to be used by MapReduce (M/R), the Google-crafted model of distributed computing that’s at the heart of Hadoop. M/R is adept at distributing the complex number crunching of big data work across the individual nodes of a network cluster, but it lacks the sort of “speed-of-thought” query response of query languages like SQL, which are traditionally used for database management work. Submitting an SQL query to Hadoop results in it being converted into an M/R job, a time-consuming process handled through Apache Hive, the project’s data warehousing infrastructure, which Impala seeks to partially replace.
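To make the SQL-to-M/R conversion concrete, here is a minimal single-process sketch of how an aggregate query such as SELECT region, SUM(amount) FROM sales GROUP BY region can be broken into map, shuffle and reduce phases. The sample rows and function names are hypothetical, and this is only an illustration of the general model, not how Hive actually plans or runs its jobs.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a "sales" table: (region, amount)
SALES = [
    ("east", 100),
    ("west", 250),
    ("east", 50),
    ("north", 75),
    ("west", 25),
]


def map_phase(rows):
    """Emit one (key, value) pair per input row, as a mapper would."""
    for region, amount in rows:
        yield region, amount


def shuffle_phase(pairs):
    """Group values by key; on a real cluster this step moves data between nodes."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def run_group_by_sum(rows):
    """Chain the three phases, mirroring GROUP BY region with SUM(amount)."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_group_by_sum(SALES))
```

On a real cluster each phase is scheduled as part of a batch job, and that per-job overhead is a large part of why this route feels slow for interactive queries.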
In contrast to the standard SQL-to-M/R-and-back-again model, Impala cuts out the middleman by implementing a native, distributed SQL engine, allowing for dramatically faster real-time queries of Hadoop-organized data. The result is a tool for making speedy data queries far more compatible with fast-paced enterprise environments and the promises of real-time responsiveness that many BI vendors make a central point of their marketing push. Mike Olson, CEO of Cloudera, has said that users could experience a performance improvement ranging anywhere from three to thirty times what they get through Hive queries.
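The idea of a native, distributed SQL engine can be sketched as a scatter-gather pattern: each node aggregates its own slice of the data and a coordinator merges the partial results in memory, with no intermediate batch job. The toy example below simulates nodes with threads; the partition data and function names are hypothetical, and it is an illustration of the concept rather than Impala's actual implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows stored on one node.
NODE_PARTITIONS = [
    [("east", 100), ("west", 250)],
    [("east", 50), ("north", 75)],
    [("west", 25)],
]


def local_aggregate(rows):
    """Run on each node: sum amounts per region over that node's rows only."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def coordinator_query(partitions):
    """Send the same query fragment to every node in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_aggregate, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)


if __name__ == "__main__":
    print(coordinator_query(NODE_PARTITIONS))
```

Because only small per-node totals travel to the coordinator, the answer can come back without first writing intermediate results to disk, which is the intuition behind the faster response times described above.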
As of the end of this year Impala is still in open beta status as Cloudera continues to hammer out the kinks. In the meantime, however, several vendors in the BI segment have announced that they are extending support to Cloudera’s new query engine in the hopes of enhancing the services they provide to their clients.
Here are the first four that have announced their support:
MicroStrategy is one of the older companies working in the BI space, having been founded in 1989.
In contrast to many of the younger Big Data startups that typically get the lion’s share of publicity in coverage of this segment, MicroStrategy is a publicly traded company and sports a history filled with endorsements and awards from the likes of Forbes, Gartner and PC Magazine.
Tableau’s Tableau Server Business Intelligence boasts a list of more than 50,000 BI agency users. The company’s customers include many huge names like Walmart, Goldman Sachs and Bank of America.
In addition to a heavy focus on real-time query responses and visualization options, Tableau also emphasizes mobility, with apps for access from Android devices and the iPad.
Pentaho Business Intelligence is one of the younger BI vendors looking to stand in contrast to the segment’s traditional, monolithic image, which it describes as “dominated by bureaucratic megavendors offering eye-wateringly expensive heavyweight products built on outdated technology platforms.”
The company recently went public about Impala, stating that while its solutions did see a 10X improvement in query response performance, overall speed still lags behind the expectations of the average BI user.
Swedish vendor QlikTech includes some rather impressive numbers as part of its sales pitch: 96 percent customer satisfaction, 186 percent return on investment (!), a 20 percent decrease in operating costs and a 34 percent increase in employee productivity.
By integrating Impala into its QlikTech Direct Discovery solution, QlikTech plans to offer a real-time data analysis tool that, true to the Impala specifications, will cut out the middleman process of downloading large datasets out of HDFS into memory. The company is planning to go live with QlikTech Direct Discovery in December.