Wednesday, March 18, 2009

Greenplum aims to eliminate massive data load 'choke points' with Scatter/Gather technology

Greenplum has taken massively parallel processing (MPP) of data to the next level with the introduction this week of its "MPP Scatter/Gather Streaming" (SG Streaming) technology, which manages the flow of data into all nodes of the database, eliminating the traditional bottlenecks of massive data loading.

The San Mateo, Calif., company, which provides large-scale analytics and data warehousing, says SG Streaming has allowed customers to achieve production-loading speeds of over four terabytes per hour with negligible impact on concurrent database operations. [Disclosure: Greenplum is a sponsor of BriefingsDirect podcasts.]

Under the "parallel everywhere" approach to loading, data flows from one or more source systems to every node of the database without any sequential choke points. This differs from the traditional "bulk loading" technologies used by most mainstream database and parallel-processing appliance vendors, which push data from a single source, often over a single channel or a small number of parallel channels, resulting in fundamental bottlenecks and ever-increasing load times.

The new technology "scatters" data from all source systems across hundreds or thousands of parallel streams that simultaneously flow to all nodes of the database. Performance scales with the number of nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations.

Data can be transformed and processed in-flight, utilizing all nodes of the database in parallel, for extremely high-performance extract-load-transform (ELT) and extract-transform-load-transform (ETLT) loading pipelines. Final 'gathering' and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed.
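To make the scatter/gather idea concrete, here is a toy sketch in Python. It is not Greenplum's implementation or API; the function names (`scatter`, `load_on_node`, `parallel_load`, `read_node`), the four-node cluster size, the CRC32 hash partitioning, and the uppercase "transform" are all assumptions chosen for illustration. It only shows the general pattern described above: split incoming rows into one stream per node, process every stream at the same time, and let each node transform, compress, and store its own partition.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical cluster size, for illustration only


def scatter(rows, num_nodes):
    """Split rows into one stream per node by hashing each row's key."""
    streams = [[] for _ in range(num_nodes)]
    for key, value in rows:
        # CRC32 gives a stable hash, so a given key always maps to the same node.
        node = zlib.crc32(str(key).encode()) % num_nodes
        streams[node].append((key, value))
    return streams


def load_on_node(stream):
    """Transform rows in flight, then 'gather' them into compressed storage."""
    # Stand-in transform: trim whitespace and uppercase the value.
    transformed = [f"{key},{value.strip().upper()}" for key, value in stream]
    payload = "\n".join(transformed).encode()
    return zlib.compress(payload)


def parallel_load(rows, num_nodes=NUM_NODES):
    """Scatter rows across nodes and run every node's load concurrently."""
    streams = scatter(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        # One worker per node; no single stream handles all the data.
        return list(pool.map(load_on_node, streams))


def read_node(blob):
    """Decompress one node's stored partition back into text rows."""
    text = zlib.decompress(blob).decode()
    return text.split("\n") if text else []
```

Calling `parallel_load([(i, f" row{i} ") for i in range(100)])` returns one compressed partition per simulated node, and `read_node` recovers the rows held by each. A real MPP database would run each node on separate hardware with its own network stream; threads in one process only stand in for that here.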

It was just six months ago that Greenplum publicly unveiled how it wrapped MapReduce approaches into the newest version of its data solution. That advance allowed users to combine SQL queries and MapReduce programs into unified tasks executed in parallel across thousands of cores.
