MapR To Integrate Drill and Spark Big Data Projects

Enterprise Hadoop distribution vendor MapR Technologies Inc. is seeking to integrate the open source Apache Drill and Apache Spark projects used for Big Data analytics in the Hadoop ecosystem.

In addition to its MapR Distribution for Apache Hadoop, the company has been leading development efforts on the Drill project, a low-latency SQL query engine for Hadoop and NoSQL data stores that it says provides real-time, self-service exploration of data residing in multiple data sources.

MapR now aims to combine that technology with Spark, an in-memory cluster computing framework for data analytics that it said offers advantages in speed, ease of programming and real-time processing. Spark is often described as an upgrade to MapReduce, the technology that was a mainstay of early Hadoop systems but was widely criticized for its limitations as the ecosystem evolved.

Spark is an increasingly popular project whose development has primarily been stewarded by Databricks Inc., though Hortonworks Inc. -- a primary competitor of MapR -- recently announced it was committing more developer resources to the project in advance of including it in Hortonworks' own Hadoop-based platform. MapR added Spark to its distribution in April. Databricks last week announced that Spark broke the record for large-scale sorting.

"The MapR initiative to integrate Apache Drill with Apache Spark's high-performance, in-memory data processing will provide a powerful combination," MapR quoted analyst John Webster at Evaluator Group as saying in its announcement yesterday. "MapR support for the complete Spark stack provides Drill users the ability to create advanced data pipelines that leverage Drill's data agility and Spark's batch processing capabilities."

A key feature of Drill is the ability to immediately run queries across complex data residing in native formats, even if that data is nested, lacks a schema or uses schemas that evolve rapidly. "Because SQL queries can run directly on various file formats, live data can be explored as it is coming in, versus spending weeks preparing and managing schemas and setting up ETL tasks," MapR said. "Additionally, Apache Drill supports ANSI SQL so users can easily leverage their SQL skills and existing investments in business intelligence (BI) tools."

Databricks, meanwhile, praised the MapR initiative to integrate the two technologies, just as it welcomed increased development efforts by Hortonworks. "As the driving force behind Spark, Databricks is pleased to see continued and expanded innovation around Spark to help users derive value from big data faster," said Ion Stoica, CEO of Databricks. "We are looking forward to MapR integrating Drill with Spark to enable enterprises to expand processing options and unlock deeper insights from their data faster."

About the Author

David Ramel is an editor and writer at Converge 360.