This guest post comes courtesy of Tony Baer's OnStrategies blog. Tony is senior analyst at Ovum.

By Tony Baer

Hadoop remains a difficult platform for most enterprises to master. For now, skills are still hard to come by, both for data architects and engineers, and especially for data scientists. It still takes too much skill, tape, and baling wire to get a Hadoop cluster together. Not every enterprise is Google or Facebook, with armies of software engineers to throw at a problem. With some exceptions, most enterprises don’t deal with data on the scale of Google or Facebook either, but the bar is rising.
If 2011 was the year that the big IT data warehouse and analytic platform brand names discovered Hadoop, 2012 becomes the year when a tooling ecosystem starts emerging to make Hadoop more consumable for the enterprise. Let’s amend that: along with tools, Hadoop must also become a first-class citizen within enterprise IT infrastructure. Hadoop won’t cross over to the enterprise if it has to be treated as some special island. That means meshing with the practices and technology approaches that enterprises use to manage their data centers and cloud deployments: SQL, data integration, virtualization, storage strategy, and so on.
Admittedly, much of this cuts against the grain of early Hadoop deployments, which stressed open source and commodity infrastructure. Early adopters did so out of necessity: commercial software ran out of gas for Facebook when its daily data warehouse refreshes were breaking the terabyte range, and the cost of commercial licenses for such scaled-out analytic platforms wouldn’t have been trivial either. In any case, Hadoop’s linearity leverages scale-out of commodity blades and direct-attached disk as far as the eye can see, enabling an almost purely noncommercial approach. At the time, Google’s, Yahoo’s, and Facebook’s issues were considered rather unique (most enterprises don’t run global search engines), not to mention that their businesses were built on armies of software engineers.
Something's got to give

As we’ve previously noted, something’s got to give on the skills front. Hadoop in the enterprise faces limits: the data problems are getting bigger and more complex for sure, but resources and skills are far more finite. So we envision tools and solutions addressing two areas:
- Products that address “clusterophobia”: organizations that seek the scalable analytics of Hadoop but lack the appetite to erect infinite data centers out in the fields or hire the necessary skill sets. Obviously, using the cloud is one option, but the questions there revolve around whether corporate policies allow maintaining data off premises and, as data store size grows, whether the cloud is still economical.
- The other side of the coin is consumability: tools that simplify access to and manipulation of the data.
In the run-up to this year’s Hadoop Summit, a number of tooling announcements addressing clusterophobia and consumption are pouring out.
On the fear-of-clusters side, players like Oracle, EMC Greenplum, and Teradata Aster are already offering appliances that simplify deployment of Hadoop, typically in conjunction with an Advanced SQL analytic platform. Most vendors position this as a way for Hadoop to “extend” your data warehouse, so that you perform exploration in Hadoop but the serious analytics in SQL; we view appliances as more than a transitional strategy. The workloads are going to get more equitably distributed, and in the long run, we wouldn’t be surprised to see more Hadoop-only appliances, sort of like Oracle’s (for the record, they also bundle another NoSQL database).
Also addressing the same constituency are storage and virtualization, facts of life in the data center. For Hadoop to cross over to the enterprise, it, too, must get virtualization-friendly; storage is an open question. The need for virtualization becomes even more apparent because (1) the exploratory nature of Hadoop analytics demands the ability to try out queries offline without having to disrupt or physically build a new cluster, and (2) the variable nature of Hadoop processing suggests that workloads are likely to be elastic. So we’ve been waiting for VMware to make its move. VMware, also part of EMC, has announced a pair of initiatives. First, it is working with the Apache Hadoop project to make the core pieces (HDFS and MapReduce) virtualization-aware; separately, it is hosting its own open source project (Serengeti) for virtualizing Hadoop clusters. While Project Serengeti is not VMware-specific, there’s little doubt that this will be a VMware project (we’d be shocked if the Xen folks were to buy in).
Storage follows
Where there are virtualized servers, storage often closely follows. A few months back, EMC dropped the other shoe, finally unveiling a strategy for leveraging Isilon with the Greenplum HD platform; Isilon is the closest thing in NAS to the scale-out storage model popularized by Hadoop. This opens an argument over whether the scales of data in Hadoop make premium products such as Isilon unaffordable. The flip side, however, is the “open source tax”: either you hire the skills in your IT organization to manage and deploy scale-out storage, or you pay consultants to do it for you.
In the spirit of making Hadoop more consumable, we expect a lot of vibes from new players that are simplifying navigation of Hadoop and building SQL bridges. Datameer is bringing the pricing of its uber Hadoop spreadsheet down to personal and workgroup levels, courtesy of entry-level pricing from $299 to $2,999. Teradata Aster, which already offers a patented framework that translates SQL to MapReduce (there are also others out there), is now taking an early bet on the incubating Apache HCatalog metadata spec so that you can write SQL statements that go up against Hadoop. It joins approaches such as those from Hadapt, which hangs SQL tables off HDFS file nodes, and mainstream BI players such as Jaspersoft, which already provide translators that can grab reports directly from Hadoop.
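The SQL-to-MapReduce idea these vendors pursue can be illustrated with a toy sketch (plain Python, no Hadoop; the table, column names, and the in-memory shuffle are all hypothetical stand-ins): a `GROUP BY` count becomes a map phase that emits key/value pairs and a reduce phase that aggregates them after a shuffle groups pairs by key.

```python
from itertools import groupby
from operator import itemgetter

# Toy illustration of how SQL such as
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept
# could be translated into map and reduce phases.
# (Hypothetical data; a real framework runs this over HDFS splits.)

def map_phase(rows):
    # Emit one (key, 1) pair per row, keyed by the GROUP BY column.
    for row in rows:
        yield (row["dept"], 1)

def reduce_phase(pairs):
    # Stand-in for the shuffle/sort: group pairs by key, then sum counts.
    shuffled = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(shuffled, key=itemgetter(0))}

employees = [
    {"dept": "eng", "name": "a"},
    {"dept": "eng", "name": "b"},
    {"dept": "sales", "name": "c"},
]

counts = reduce_phase(map_phase(employees))
print(counts)  # {'eng': 2, 'sales': 1}
```

The point of the translation layer is that the analyst writes only the SQL; the framework generates the map and reduce functions and distributes them across the cluster.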
This doesn’t take away from the evolution of the Hadoop platform itself. Cloudera and Hortonworks are among those releasing new distributions that bundle their own mix of recent and current Apache Hadoop modules. While the Apache project has addressed the NameNode high-availability issue, it is still early in the game in bringing enterprise-grade manageability to MapReduce. That’s largely an academic issue, as the bulk of enterprises have yet to implement Hadoop. By the time enterprises are ready, many of the core issues should be resolved, although there will always be questions about the uptake of peripheral Hadoop projects.
What’s more important – and where the action will be – is in tools that allow enterprises to run and, more importantly, consume Hadoop. It’s a chicken-and-egg situation: enterprises won’t implement before tools are available, and vice versa.