Hadoop has come to be one of the most popular open source frameworks in the big data realm, but it’s not without performance and security problems, many of which can’t be quickly addressed by the open source community. We sat down with Zettaset president and CEO Jim Vogt to find out how the company’s Orchestrator solution seeks to tackle those problems and plug a gap that, so far, hasn’t been adequately addressed.
So Brian’s vision was that this just wasn’t ready for prime time and wasn’t ready for enterprise, yet it had a very powerful technical function in terms of the MapReduce technology. The way I put it simply is if you’re going to go rack and stack a new Hadoop implementation or Hadoop clusters within the datacenter it needs to adhere to the same requirements as an existing rack of equipment in terms of high availability, security, management, visibility, controls and performance. And those are things that we’re doing in our software without requiring an IT guy to have to go and train five Hadoop guys. They can actually redeploy those resources in other areas of the big data solution without having to address this with people all over the infrastructure–we can do it through our enterprise software. That was the vision.
Well, I think it’s really because it’s very good at what it does in terms of being able to take on unstructured data sources and structure them.
When we looked at our strategy as a company we took a longer-term view. You mentioned some of the early contributors within the community, and although the community is doing very good work around Hadoop, despite their efforts it takes time to get things through the community. And also, it’s not addressing all of the enterprise-ready concerns today, right? That’s why we’re transparent in terms of distributions and we add value on top of the distributions that plug these functional gaps.
We also look at today’s leading distributions–you look at Cloudera, you look at Hortonworks–and we don’t believe in the five, ten years’ timeframe they’re using. These bigger guys are seeing the market develop. They’re seeing the functional value within Hadoop and forming their own distributions and solutions. So maintaining transparency is a key part of our strategy with Hadoop. It lets us focus on security, performance, scalability, compliance and ease of use. These are the functional components that we can couple with the distribution.
Lastly, we’re focused on strategic relationships. Ours are a little bit different from the competition. I mean, we are partnering not only with the big players, but with quite a few of the switch vendors, mainly because those guys are setting up big data infrastructure implementations within the data center and that requires switching technology as well. So we’ve announced we’re working with Mellanox. We’re also working with Brocade. These are infrastructure vendors who are selling into the enterprises and selling into that footprint of the datacenter. And again, our mentality is a little different in terms of enterprise focus versus getting all hung up on the open source and other components of the solution. We complement those components and form a full solution for the end customer.
That’s a great topic to drill into. We developed a technology, but it wasn’t really a file system. It’s more of a translator through Apache HBase to HDFS. We did that because we wanted to improve performance without re-coding the Hadoop components themselves. And we looked at MapR and how they rewrote HDFS in C++. We do that outside of HDFS, transparently, through this piece of technology we patented that allows us to translate group files and create efficiency by translating that through HBase to HDFS.
Now that’s the first problem we were trying to solve. The added value component is that because we do it that way there are two advantages. One is that this is a great infusion point for increasing performance and also encrypting and securing the data. Secondly, we can support other file systems. So we actually have supported HDFS. We support GPFS. And we’re working on Lustre as well. So it allows us some flexibility and options for the end user. In these enterprises being able to be as adaptive as we can in the existing environment improves not only the distributions themselves but also the file systems.
That’s a good sampling there on the website. I would like to add merchandising to it as well, along with insurance. So anybody trying to shorten their decision processes and to also be able to open other sources of data they could not explore before like Syslog data. There’s a whole list of unstructured sources that would help in those decisions. That’s why we’re horizontal across many different verticals.
Where we actually distinguish through verticals is our strategic partners. When you look at Brocade, for example, they are basically installed in twenty-five hundred different enterprises. Across verticals they have a very good federal presence. So we differentiate by adding value to those infrastructure solutions. And secondly, since we basically stem the resources needed to address the infrastructure component, we’re very appealing to a wide variety of large system integrators (SIs) who actually have full big data practices they’re deploying for their customers.
That’s a great question. As I mentioned, our business is based on the gap in functionality and the time gap it takes to get something through the open source community. But there’s so many useful components within that working body that it just takes time. We embrace that. Our recent announcement of support for Intel’s distribution for Hadoop is a great example. Intel announced not only their distribution, but also a project called Project Rhino. It’s their attempt to guide the direction of security-related projects in that open source community and contribute that code back to Apache.
What’s interesting is that they’re supporting many of the same directions we’re taking in our enterprise code today. The difference is we can ship our code today whereas those projects will take eighteen to twenty-four months to get to fruition within the community. So we believe in contributing back to the community. We have a big sponsor in doing that with Intel and the Rhino Project, and it guides the community in the right direction in terms of where the code needs to move in terms of security and performance. But that’s going to take some time, and that’s the value-added that we already fill with our enterprise software right now.
We actually spent quite a bit of time with Intel ahead of the announcement on how we install and employ our software, our security and performance road map, things that are good to talk about in terms of what happens in their hardware components and their chipset, right? So we see a real opportunity to become part of their formal solution. To have the endorsement of a big player like Intel broadens our list of access to partners who we can go to market with. It endorses our direction in terms of our road map, our technology and our overall solution. That’s all good, good, good.
There’s two options for the customers. They can use analytics, which are more native to Hadoop, or they can use the Hadoop clusters, as in I ship data off there, I run my MapReduce jobs, I bring it back, then I use my existing analytics to actually analyze the data. And, by the way, a good example of that is still Excel, right?
The thing for SMBs though is that there’s a lot of rattle around hosting this infrastructure in the cloud or using Amazon Web Services (AWS) and those types of things. That brings you right back where we’re focused in terms of our development. The first concern of the cloud is security. People talk about performance all the time, and we’re addressing performance as well, but the security aspects, in terms of maintaining privacy, separating multitenant data, sensitive data, things like that, apply to all the infrastructure we’re talking about in a live datacenter, which may be on-premise or hosted, but with the same set of requirements. Where we’re headed in terms of performance is once you’ve brought the data into the Hadoop cluster you’re not going to have pure green field environments. So where did this data come from? Being able to tag and tier data based on usage allows us to essentially make more efficient use of media over time, on a learning basis, based on the jobs that are run and the frequency of those jobs. That allows us to really put the more frequently used data closer to the processing, on higher-speed media like flash or SSD, right? So those types of improvements will help the datacenter, and the hosted datacenter as well.
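To make the tag-and-tier idea above concrete, here is a minimal sketch in Python. It is an illustration only, not Zettaset's implementation: the tier names, the thresholds, the shape of the job log, and the functions `count_accesses` and `assign_tiers` are all assumptions made for the example. It counts how often jobs read each dataset and places the most frequently read data on the fastest tier.

```python
from collections import Counter

# Hypothetical tiers and thresholds, ordered fastest first.
# Each entry is (tier name, minimum number of jobs that read the dataset).
TIERS = [("ssd", 10), ("disk", 3)]
DEFAULT_TIER = "archive"  # hypothetical tier for rarely read data


def count_accesses(job_log):
    """Count how many jobs read each dataset.

    job_log is an iterable of (job_id, [dataset paths]) pairs; this
    format is an assumption for the sketch, not a real Hadoop log format.
    """
    counts = Counter()
    for _job_id, datasets in job_log:
        # A dataset read twice by the same job still counts once.
        counts.update(set(datasets))
    return counts


def assign_tiers(counts, all_datasets):
    """Map each dataset to the fastest tier whose threshold it meets."""
    placement = {}
    for dataset in all_datasets:
        reads = counts.get(dataset, 0)
        placement[dataset] = next(
            (tier for tier, minimum in TIERS if reads >= minimum),
            DEFAULT_TIER,
        )
    return placement


if __name__ == "__main__":
    # Twelve jobs read only the syslog data; four reports read both sets.
    job_log = [(f"job-{i}", ["/data/syslog"]) for i in range(12)]
    job_log += [(f"report-{i}", ["/data/sales", "/data/syslog"]) for i in range(4)]

    counts = count_accesses(job_log)
    datasets = ["/data/syslog", "/data/sales", "/data/old"]
    for dataset, tier in assign_tiers(counts, datasets).items():
        print(dataset, "->", tier)
```

Re-running the count over a sliding window of recent jobs would give the "learning" behaviour described in the answer, since placements would shift as job patterns change; how often to re-tier and how to move the data are left out of this sketch.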
I think there’s a place for the SMB, and if you look at SMBs, many of them are smaller entities, especially in health care, that have hard compliance issues but don’t have the resources to implement that infrastructure. So that’s where hosted comes in.
Well, two things: there’s a lot of noise in this space, and it’s interesting that we’ve gone from focusing on these few distributions last year to many more distributions this year. There’s a lot of noise, and we’re trying to stay above that, rise above the fray, if you will, and continue to add to the value we have at the enterprise software level.
A challenge that I welcome is that a lot of these enterprises who are through evaluation stages now are saying OK, I have 30 nodes running in my little lab experiment, how do I ramp up to a thousand, two thousand-plus distributed throughout the organization? That’s a good inflection point in the market for us because I think their concerns are, one, I don’t want to have a reliance on professional services. Two, I don’t want to have to train a lot of resources. And three, pro services won’t provide the security and performance aspects that I can get in an enterprise software solution.
So those challenges are good because those are the ones I believe our product absolutely supports.
There are two actual projects around next generation MapReduce, and that’s the problem in that the community is dominated by two guys, Cloudera and Hortonworks, who have opposing projects. What we envision as a company, though, is that it would be interesting to parse at a project level, to say hey, I like Hortonworks’ distribution, or I like Intel’s distribution, but can you add Cloudera Impala to it? That’s something we could actually do in the longer term–parse at a project level versus a distribution level.
Secondly, there are people out there who have other high performance, high availability data file systems like GPFS (General Parallel File System), like Lustre. So is there a way to mate those file systems with a distribution? We say yes, and we’ve actually done that.