Spark Continues Big Data Ascension

Any doubts about the skyrocketing rise of Apache Spark technology in the Big Data ecosystem were put to rest at a recent conference, where it almost stole the show.

Big Data vendors large and small were trumpeting Spark-related news at the Strata + Hadoop World conference, led by Databricks Inc. -- the main commercial steward of the open source technology -- which was formed by creators of the original academic project that spawned it.

Spark is a Big Data processing engine, sometimes described as a cluster computing framework, that adds several kinds of enhancements leading to improved analytics, such as better iterative processing, in-memory primitives for speed, interactive queries, support of SQL queries and streaming, interoperability with different components and many more.

"The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters," Databricks says on its site. "It is used to perform ETL, interactive queries (SQL), advanced analytics (for example, machine learning) and streaming over large datasets in a wide range of data stores (for example, HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python and Scala."

Databricks last week announced a new API to simplify Big Data analytics for some users, along with a deal with Intel Corp. to optimize real-time analytics on Intel architecture.

The Spark ecosystem
[Click on image for larger view.] The Spark Ecosystem (source: Databricks Inc.)

Coming early next month the latest Databricks Spark version 1.3, the DataFrames API is designed to simplify distributed data processing for data scientists who are used to working with single-machine tools.

"The DataFrames API, inspired by data frames in R and Python (Pandas), provides a familiar and efficient interface for data scientists that they can now run on Big Data, making large-scale computing accessible to users with experience in these languages," Databricks said in a news release. "The API builds on the Spark SQL query optimizer to automatically execute code efficiently on a cluster of machines. Finally, through Spark SQL's external data sources API, DataFrames can access a wide array of third-party data sources such as databases and NoSQL stores."

It didn't take long for follow-on support of DataFrames, as Zoomdata, a Big Data analytics and visualization company, said it added Spark DataFrames engine support to its flagship platform in order to enhance high-scale business intelligence (BI).

In the Intel agreement, the companies will meld the Intel and Spark ecosystems. "The necessity for real-time analytics across Intel architecture is a vital piece of the Big Data puzzle to enable the extraction of prompt, actionable insights from large data sets," Databricks said.

Beyond Databricks, the Spark news last week emanated from many other vendors.

Tableau Software, the BI and data visualization specialist, announced a direct connector for Spark SQL, facilitating visualization of analytics results.

Another connector announcement was made by MemSQL, which focuses on real-time databases for transactions and Big Data analytics. "The powerful combination of an in-memory database from MemSQL and Spark's in-memory processing framework gives enterprises immediate access to transactions, analytics, and data exploration—opening up new opportunities for revenue and improved customer experiences," the company said of the open source connector.

Altiscale Inc., a Hadoop-as-a-Service (HaaS) company, announced that Spark is now available on the company's Data Cloud. "The Altiscale Data Cloud was created to provide organizations access to the only infrastructure 'purpose-built' for Hadoop, as well as the operational expertise needed to execute complex Hadoop projects," the company said.

Qubole Inc., calling itself a Big Data-as-a-Service (BDaaS) company, announced the addition of Spark to its Qubole Data Service platform, described as "a self-service platform for Big Data analytics that runs on the three major public clouds: Amazon AWS, Google Compute Engine and Microsoft Azure."

Paxata announced the Spring 2015 release of its Adaptive Data Preparation application and platform that uses Spark extensively. "The Paxata platform was built with a data management layer that persists data inside the Hadoop Distributed File System (HDFS) and a real-time columnar parallelized in-memory pipeline data prep engine powered by Intellifusion," the company said. "The data prep engine wraps Apache Spark v1.2 with additional functionality built to optimize Spark performance and responsiveness."

The flurry of Spark-related news and product releases further cements the project as the darling of the open source Big Data movement, showing a "hockey stick-like" growth in a chart measuring Spark awareness, according to a recent survey. It has been recognized as the most active Apache Software Foundation project and, indeed, most active Big Data open source project of any kind. Editor at Large John K. Waters recently wrote about enterprise development predictions for the new year in which Spark figured heavily in the estimation of Forrester Research Inc. Analyst Mike Gualtieri. Spark and the Internet of Things (IoT), he said, "will see a lot more beef behind their buzz."

"Apache Spark makes Hadoop stronger," Gualtieri continued. "Hadoop was designed for volume. Apache Spark was designed for speed. If you believe that opposites attract, then Hadoop and Spark belong together. They are both cluster computing platforms that share nodes."

About the Author

David Ramel is an editor and writer for Converge360.