Apache Spark 2.1 Improves Structured Streaming -- ADTmag

By David Ramel
January 11, 2017

Better streaming analytics, a hot topic in Big Data development right now, is the highlight of more than 1,200 improvements and bug fixes in the new Apache Spark 2.1.

Databricks Inc., the commercial steward of the popular open source Spark project, announced the availability of version 2.1 on its platform late last month.

"This release makes measurable strides in the production readiness of Structured Streaming, with added support for event-time watermarks and Apache Kafka 0.10 support," Databricks said. "In addition, the release focuses more on usability, stability, and refinement, resolving over 1,200 tickets, than previous Spark releases."

Streaming analytics, as opposed to the batch-processing model of the original MapReduce component of the Apache Hadoop ecosystem, was one of the main attractions of Spark when it debuted in 2014, along with in-memory processing and other features.

When Spark 2.0 was unveiled last July, Databricks said that version laid the foundation for continuous applications, which provide real-time analytics. Version 2.1 builds further on that foundation.

"Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications," Databricks said. "The main goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way."

Improvements to Structured Streaming in v2.1 include: support for all file-based formats, including JSON, text, Avro and CSV; support for Apache Kafka 0.10, which is often used for ingesting and managing data streams; and event-time watermarks, which identify events that arrive "too late" to be included in the current job.
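To make the watermark idea concrete, here is a minimal plain-Python sketch. It is not Spark code and not how Spark implements the feature (in Spark, watermarks are set with `withWatermark` on a streaming Dataset/DataFrame). The function name `filter_late_events`, the sample events and the 10-minute delay are all assumptions made for illustration, and the watermark here is updated per event for simplicity.

```python
from datetime import datetime, timedelta


def filter_late_events(events, delay):
    """Split (event_time, payload) pairs into accepted and too-late payloads.

    The watermark trails the largest event time seen so far by `delay`.
    An event whose timestamp is older than the watermark is treated as too late.
    Hypothetical helper for illustration only; not part of any Spark API.
    """
    max_event_time = None
    accepted, too_late = [], []
    for event_time, payload in events:
        # Compare against the watermark derived from events seen so far.
        if max_event_time is not None and event_time < max_event_time - delay:
            too_late.append(payload)
            continue
        accepted.append(payload)
        # Only newer timestamps can advance the watermark.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return accepted, too_late


base = datetime(2017, 1, 11, 12, 0)
events = [
    (base, "a"),
    (base + timedelta(minutes=20), "b"),
    (base + timedelta(minutes=15), "c"),  # out of order, but within the delay
    (base + timedelta(minutes=5), "d"),   # 15 minutes behind the newest event
]
accepted, too_late = filter_late_events(events, timedelta(minutes=10))
print(accepted, too_late)
```

In this sketch, event "c" arrives out of order but is still accepted because it falls within the allowed delay, while "d" lags the newest event by more than 10 minutes and is set aside as too late.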

The new version also addresses the "stringent" visibility and manageability requirements that streaming analytics demands from the underlying systems.

The following figure depicts the concerns usually handled by streaming engines and those needed in continuous applications:

Figure: Streaming and Continuous Applications *(source: Databricks)*

Other highlights of the new release include numerous enhancements to SQL functionality and Spark's core Dataset/DataFrame API, along with better advanced analytics.

The latter improvement stems from many new algorithms added to the MLlib machine learning library, to the GraphX API for graphs and graph-parallel computation, and to SparkR, an R-based package that is especially attractive to data scientists running jobs on large datasets from the R shell.

It's Structured Streaming that's the star of the new release, though, and Databricks promised more detailed and hands-on information to come on that technology.

"At Databricks, we religiously believe in dogfooding," the company said. "Using a release candidate version of Spark 2.1, we have ported some of our internal data pipelines, as well as worked with some of our customers to port their production pipelines using Structured Streaming. In coming weeks, we will be publishing a series of blog posts on various aspects of Structured Streaming, as well as our experience with it. Stay tuned for more deep dives."

Interested developers can check out the release notes for more detailed information on the changes in Apache Spark 2.1.

About the Author

David Ramel is an editor and writer at Converge 360.
