This guest post comes courtesy of Tony Baer's OnStrategies blog. Tony is a senior analyst at Ovum.
By Tony Baer
With the Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of the announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, the Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on Hadoop data.
The opportunity and challenge of 
Big Data from new platforms such as 
Hadoop is that it opens a new range of analytics. On one hand, Big Data 
analytics have updated and revived programmatic access to data, which 
happened to be the norm prior to the advent of SQL. There are plenty of 
scenarios where taking a programmatic approach is far more efficient, such as dealing with time series data or using graph analysis to map many-to-many relationships.
The programmatic approach also leverages in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces, and others, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications where traditional paths to the database would have been I/O-constrained. Conversely, Advanced SQL platforms such as
Greenplum and 
Teradata Aster have provided support for 
MapReduce-like
 programming because, even with structured data, sometimes using a Java 
programmatic framework is a more efficient way to rapidly slice through 
volumes of data.
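To make the programmatic side of that trade-off concrete, here is a minimal sketch of a MapReduce job written against the standard Hadoop Java API. It simply counts events per device from tab-delimited time-series logs; the input layout, field positions, and class names are illustrative assumptions rather than anything tied to a particular vendor’s announcement.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts events per device ID from tab-delimited time-series logs in HDFS.
// Paths and field positions are illustrative assumptions.
public class EventCount {

  public static class EventMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text deviceId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assume each line looks like: timestamp <TAB> deviceId <TAB> reading
      String[] fields = value.toString().split("\t");
      if (fields.length >= 2) {
        deviceId.set(fields[1]);
        context.write(deviceId, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "event count"); // Job.getInstance() on newer releases
    job.setJarByClass(EventCount.class);
    job.setMapperClass(EventMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Even a trivial aggregation like this requires a compiled class, a jar, and a cluster submission – overhead that a one-line SQL aggregate avoids, which is precisely the tension the rest of this piece is about.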
Until now, Hadoop has not been for the SQL-minded. The initial path was to find someone to do data exploration inside Hadoop, but
 once you’re ready to do repeatable analysis, 
ETL (or ELT) it into a SQL
 
data warehouse. That’s been the pattern with 
Oracle Big Data Appliance
 (use Oracle loader and data integration tools), and most Advanced SQL 
platforms; most data integration tools provide Hadoop connectors that 
spawn their own MapReduce programs to ferry data out of Hadoop. Some 
integration tool providers, like 
Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potential of HCatalog, in actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL-friendly view of data residing inside Hadoop.
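Hive already hints at what that SQL-friendly view looks like in practice. The sketch below is a hedged illustration of querying data that lives in Hadoop through Hive’s JDBC driver rather than through hand-written MapReduce; the HiveServer endpoint, database, and weblogs table are assumptions made up for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Queries a table stored in Hadoop through Hive's JDBC driver.
// Host, port, and table name are illustrative assumptions.
public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer1-era driver class; HiveServer2 later uses org.apache.hive.jdbc.HiveDriver
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // Familiar SQL; Hive compiles it into one or more MapReduce jobs behind the scenes.
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```

The catch, at least as of this writing, is latency: Hive still turns such statements into batch MapReduce jobs, which is exactly the gap the interactive-query announcements discussed below aim to close.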
But when you talk analytics, you can’t simply write off the legions 
of SQL developers that populate enterprise IT shops. And beneath the 
veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach of programmatic transformation approaches that, in the long run, could likely be automated or packaged inside a tool.
At Ovum, we have long believed that for Big Data to cross over to the mainstream enterprise, it must become a first-class citizen with IT and the data center. The early pattern was skunk works projects led by elite, highly specialized teams of software engineers from Internet firms to solve Internet-style problems (e.g., ad placement, search optimization, customer online experience); those are not the problems of mainstream enterprises. Nor is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations. That means Big Data must be consumable by the mainstream of SQL developers.
Making Hadoop more SQL-like is hardly new 
Hive and 
Pig became Apache Hadoop 
projects because of the need for SQL-like 
metadata management and data 
transformation languages, respectively; 
HBase emerged because of the 
need for a table store to provide a more interactive face – although, as a very sparse, rudimentary column store, it does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data
into Hadoop, a use case that will grow more common as organizations look
 to Hadoop to provide scalable and cheaper storage than commercial SQL. 
While these Hadoop subprojects did not exactly make Hadoop look like SQL, they provided the building blocks that many of this week’s announcements leverage.
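As a concrete illustration of that “more interactive face,” the sketch below uses the standard HBase Java client to write and then read back a single cell by row key; the table and column family names are assumptions for the example, and the table is presumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Single-row write and read against HBase -- keyed, interactive access
// rather than a batch scan. Table and column family names are illustrative.
public class HBaseInteractiveExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "user_profiles"); // assumes the table already exists

    // Write one cell: row key = user ID, column family "info", qualifier "last_login"
    Put put = new Put(Bytes.toBytes("user42"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("last_login"), Bytes.toBytes("2012-10-24"));
    table.put(put);

    // Read it back by key -- no MapReduce job involved
    Get get = new Get(Bytes.toBytes("user42"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_login"));
    System.out.println("last_login = " + Bytes.toString(value));

    table.close();
  }
}
```

The point is the access pattern: a keyed read or write returns in milliseconds without launching a MapReduce job, even though HBase offers nothing like the optimizer or join machinery of a SQL database.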
Progress marches on 
One train of thought is that if Hadoop can look more like a SQL 
database, more operations could be performed inside Hadoop. That’s the 
theme behind Informatica’s long-awaited enhancement of its PowerCenter 
transformation tool to work natively inside Hadoop. Until now, 
PowerCenter could extract data from Hadoop, but the extracts would have 
to be moved to a staging server where the transformation would be performed before loading to the familiar SQL data warehouse target. The new
 offering, 
PowerCenter Big Data Edition,
 now supports an ELT pattern that uses the power of MapReduce processes 
inside Hadoop to perform transformations. The significance is that 
PowerCenter users now have a choice: load the transformed data to HBase,
 or continue loading to SQL.
There is growing support for packaging Hadoop inside a common 
hardware appliance with Advanced SQL. EMC Greenplum was the first out of the gate with DCA (Data Computing Appliance), which bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software-only product that is accompanied by a MapR Hadoop distro).
Teradata Aster has just joined the fray with 
Big Analytics Appliance, bundling the 
Hortonworks Data Platform
 Hadoop; this move was hardly surprising given their growing partnership
 around HCatalog, an enhancement of the SQL-like Hive metadata layer of 
Hadoop that adds features such as a cost optimizer and RESTful 
interfaces that make the metadata accessible without the need to learn 
MapReduce or Java. With HCatalog, data inside Hadoop looks like another 
Aster data table.
Not coincidentally, there is a growing array of analytic tools that 
are designed to execute natively inside Hadoop. For now they are from 
emerging players like 
Datameer (which provides a spreadsheet-like metaphor and just announced an app store-like marketplace for developers), Karmasphere (providing an application development tool for Hadoop analytic apps), or a more recent entry,
Platfora (which caches subsets of Hadoop data in memory with an optimized, high performance fractal index).
Yet, even with Hadoop analytic tooling, there will still be a desire 
to disguise Hadoop as a SQL data store, and not just for data mapping 
purposes. 
Hadapt has been promoting a variant where it squeezes SQL tables inside 
HDFS
 file structures – not exactly a no-brainer as it must shoehorn tables 
into a file system with arbitrary data block sizes. Hadapt’s approach 
sounds like the converse of object-relational stores, but in this case, 
it is dealing with a physical rather than a logical impedance mismatch.
Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does 
Cloudera. It has just announced 
Impala,
 a SQL-based alternative to MapReduce for querying the SQL-like Hive 
metadata store, supporting most but not all forms of SQL processing 
(based on 
SQL 92; Impala lacks triggers, which Cloudera deems low 
priority). Both Impala and MapReduce rely on parallel processing, but 
that’s where the similarity ends. MapReduce is a blunt instrument, 
requiring Java or other programming languages; it splits a job into multiple, concurrent, pipelined tasks that, at each step along the way, read data, process it, write it back to disk, and then pass the results to the next task.
Conversely, Impala takes a shared-nothing,
MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera 
claims roughly 4x performance against MapReduce; if the data is in 
HBase, Cloudera claims performance multiples up to a factor of 30. For 
now, Impala only supports row-based views, but with columnar (on 
Cloudera’s roadmap), performance could double. Cloudera plans to release
 a 
real-time query (RTQ) offering that, in effect, is a commercially 
supported version of Impala.
By contrast, Teradata Aster and Hortonworks promote a 
SQL MapReduce
 approach that leverages HCatalog, an incubating Apache project that is a
 superset of Hive that Cloudera does not currently include in its 
roadmap. For now, Cloudera claims bragging rights for performance with 
Impala; over time, Teradata Aster will promote the manageability of its single appliance and, with that appliance, has the opportunity to counter with hardware optimization.
The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension
 to Hadoop will be outside the Hadoop project. But again, that’s an 
argument for purists. What’s more important to enterprises is getting 
the right tool for the job – whether it is the flexibility of SQL or the raw power of programmatic approaches.
SQL convergence is the next major battleground for Hadoop. Cloudera 
is for now shunning HCatalog, an approach backed by Hortonworks and 
partner Teradata Aster. The open question is whether Hortonworks can 
instigate a stampede of third parties to overcome Cloudera’s resistance.
 It appears that beyond Hive, the SQL face of Hadoop will become a 
vendor-differentiated layer.
Part of the convergence will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross-train to pick up some of the Java or Java-like programmatic frameworks that will be emerging.
Tooling will help lower the bar, reducing the degree of specialized 
skills necessary.
As for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute-force, parallel, sequential processing.
 But the emerging 
YARN
 framework, which deconstructs MapReduce to generalize the resource 
management function, will provide the management umbrella for ensuring 
that different frameworks don’t crash into one another by trying to grab
the same resources. YARN, however, is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce, which means that YARN is not yet ready for Impala, or vice versa.
Of course, mainstreaming Hadoop – and Big Data platforms in general –
 is more than just a matter of making it all look like SQL. Big Data 
platforms must be manageable and operable by the people who are already 
in IT; they will need to learn some new skills and grow accustomed to some new
practices (like exploratory analytics), but the new platforms must also 
look and act familiar enough. Not all announcements this week were about
SQL; for instance, MapR is throwing down the gauntlet to the Apache usual
suspects by extending its management umbrella beyond the proprietary 
NFS-compatible file system that is its core IP to the MapReduce 
framework and HBase, making a similar promise of high performance.
On 
the horizon, EMC 
Isilon and 
NetApp
 are proposing alternatives promising a more efficient file system but 
at the “cost” of separating the storage from the analytic processing. 
And at some point, the Hadoop vendor community will have to come to 
grips with capacity utilization issues, because in the mainstream 
enterprise world, no CFO will approve the purchase of large clusters or 
grids that get only 10 – 15 percent utilization. Keep an eye on 
VMware’s 
Project Serengeti.
Big Data platforms must be good citizens in data centers that need to maximize resource utilization (e.g., virtualization, optimized storage); must comply with
existing data stewardship policies and practices; and must fully support
 existing enterprise data and platform security practices. These are all
 topics for another day.