
MapR Adds Some Spark to its Hadoop Distribution

MapR Technologies Inc. yesterday announced it had added the Apache Spark technology stack to its distribution of Hadoop, one of the leading tools in the fast-growing arena of Big Data analytics.

Apache Spark, recently promoted to a top-level project by the open source Apache Software Foundation, is an in-memory, distributed computing framework that improves Big Data analytics processing with Hadoop. Often used as a superior replacement for the original batch-oriented MapReduce technology, it works with newer technologies found in Hadoop 2, such as the YARN resource manager, to boost processing of data in the Hadoop Distributed File System (HDFS) or any other Hadoop data store, such as HBase or Cassandra.

MapR has one of the leading Hadoop distributions, fighting with competitors such as Cloudera and Hortonworks to gain market share, secure funding and land big customers in the burgeoning Big Data market. MapR said the in-memory processing technology of Spark provides speed, easier programming and real-time processing capabilities. It's adding the Spark stack to its Hadoop distribution through a partnership with Databricks, a company founded last fall by the creators of Spark, which was originally developed at the University of California, Berkeley.

"We are now the only Hadoop distribution to support the complete Spark stack, including Spark, Spark Streaming (stream processing), Shark (Hive on Spark), MLLib (machine learning) and GraphX (graph processing)," said MapR executive Tomer Shiran in a blog post yesterday.

MapR said Spark provides two main benefits: application performance and developer productivity.

Developers are more productive, the company said, because Spark requires much less code to be written, as little as one-fifth of what's normally needed. Also, the simple programming abstraction it offers lets developers use multiple languages to design batch, interactive and streaming applications that operate on data collections. Developers can use Java, Scala and Python, with support for R reportedly coming.
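To give a rough sense of what operating on data collections looks like, here is a minimal word-count sketch in plain Python. It does not use Spark or any MapR API. It only mirrors the general shape of a collection pipeline of the kind Spark exposes through operations such as flatMap and reduceByKey, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count words across an iterable of text lines (local, single process)."""
    # Step 1 (loosely analogous to a flatMap): split each line into
    # lowercase words and flatten the results into one stream.
    words = chain.from_iterable(line.lower().split() for line in lines)
    # Step 2 (loosely analogous to a map followed by reduceByKey):
    # tally the number of occurrences of each word.
    return Counter(words)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = ["spark runs in memory", "hadoop runs on disk"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

The point of the sketch is brevity: the whole job is a couple of chained transformations on a collection rather than separate mapper and reducer classes. A real Spark job would distribute those steps across a cluster.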

"It has become clear that Apache Spark offers a combination of high-performance, in-memory data processing and multiple computation models that is well suited to serving as the basis of next-generation data processing platforms," MapR quoted 451 Research analyst Matt Aslett as saying. "MapR's support for the complete Spark stack, combined with its partnership with Databricks, should give Hadoop users the confidence to start developing applications to take advantage of Spark's performance and flexibility."

MapR said the addition of Spark, which incorporates the five separate Apache open source projects listed earlier by Shiran, brings the total number of such projects featured in its Hadoop distribution to more than 20.

The company said the new addition means its customers can get round-the-clock help for all Spark stack projects. It also said it will work with Databricks to define a roadmap for further development and increase the cadence of innovations. MapR claims to be the only Hadoop vendor offering a monthly release cadence for its distribution.

MapR and Databricks are conducting an April 29 webinar where developers can learn more about the benefits of using Spark in the MapR Hadoop distribution.

About the Author

David Ramel is an editor and writer for Converge360.