Open Sourced Greenplum Data Warehouse Now on GitHub
Pivotal Software Inc. followed through on its February promise to open source core components of its Big Data platform, placing the Greenplum data warehouse software on GitHub with an Apache 2 license.
"Today, Pivotal unveils the first massively parallel processing (MPP) data warehouse to open source," exec Gavin Sherry said in a blog post today. "Built over the past 10 years, with nearly 2 million lines of code, Pivotal has released the last of its core data solutions, Greenplum Database, under the Apache Software 2.0 license."
"The Greenplum Database is an advanced, fully featured, open source data warehouse," says the site on the popular source code repository. "It provides powerful and rapid analytics on petabyte-scale data volumes. Uniquely geared toward Big Data analytics, Greenplum Database is powered by the world's most advanced cost-based query optimizer delivering high analytical query performance on large data volumes."
Greenplum was originally based on the open source PostgreSQL object-relational database, the project's main Web site explains, with features added on such as MPP architecture, petabyte-scale loading, a query optimizer, polymorphic data storage and execution, and advanced machine learning via Apache MADLib technology.
"Setting the Greenplum Database apart, both architecturally and functionally, from any other open source data processing system such as Apache Hadoop, MySQL or even PostgreSQL, is Greenplum Database usage of MPP to execute complex SQL analytics on very large data sets at speeds multiple times faster than any other solution tested," Sherry said. "Powered by a next-generation query optimization technology that has never been available commercially outside of Pivotal, Greenplum Database comes packed with data management quality features, upgrade and expansion capabilities also not found in open source today."
Early this year Pivotal announced plans to open source Greenplum and other technology underlying its Big Data Suite as part of a move toward openness and interoperability that also saw the formation of the Open Data Platform to "promote Big Data technologies based on open source software from the Apache Hadoop ecosystem and optimize testing among and across the ecosystem's vendors."
"Our detailed plans are still being finalized, but we plan to begin release and incubation of Pivotal GemFire, Pivotal Hawq, and Pivotal Greenplum Database in a quarterly cadence," exec Michael Cucchi said in a February blog post. "We're closing in now on the structure of ownership of GemFire, Greenplum Database, and Hawq code to the most appropriate entity for working with the Big Data community."
That most appropriate entity turned out to be the Apache Software Foundation, overseeing code placed on GitHub, where Greenplum joins the GemFire NoSQL database and Hawq SQL-on-Hadoop project, under an Apache Software 2.0 license. Pivotal claimed that open sourcing the data warehousing technology would improve it, and yesterday reiterated its embrace of contributions to the project, which today sported 23,790 commits, one branch and 14 contributors.
"We believe that the release of such a proven and widely adopted data warehouse to open source will have substantial ripple effects throughout the industry," Sherry said. "Most importantly, with the barrier of entry to large-scale, real-time data analytics lowered, more companies are now equipped to tackle Big Data challenges. As a result, we expect to see greater success in large-scale, Big Data analytics across every industry."
Pivotal said the move indicates a "massive change" in the industry, with commercial vendors seeing open source as the best delivery mechanism and transformative driver of software development.
"This is better for vendors as well, given that so many customers simply will not consider software solutions outside of open source," Sherry said. "With major software vendors like Pivotal responding to market demands to open source all of their cloud and data products inside of 10 months -- that's nearly 10 million lines of code -- commercial vendors worldwide will have to accept that the days of a closed source, vendor-lock in, legacy models no longer suffice in today's modern era of computing."
David Ramel is an editor and writer for Converge360.