Apache Spark 2.1 Improves Structured Streaming -- ADTmag

By David Ramel
January 11, 2017

Better streaming analytics, a hot topic in Big Data development right now, is the highlight of more than 1,200 improvements and bug fixes in the new Apache Spark 2.1.

Databricks Inc., the commercial steward of the popular open source Spark project, announced the availability of version 2.1 on its platform late last month.

"This release makes measurable strides in the production readiness of Structured Streaming, with added support for event-time watermarks and Apache Kafka 0.10 support," Databricks said. "In addition, the release focuses more on usability, stability, and refinement, resolving over 1,200 tickets, than previous Spark releases."

Streaming analytics, as opposed to the batch-processing model of the original MapReduce component of the Apache Hadoop ecosystem, was one of the main attractions of Spark when it debuted in 2014, along with in-memory processing and other features.

When Spark 2.0 was unveiled last July, Databricks said that version laid the foundation for continuous applications, which provide real-time analytics. That foundation has now been solidified even more.

"Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications," Databricks said. "The main goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way."

New improvements to Structured Streaming in v2.1 include: support for all file-based formats, including JSON, text, Avro and CSV; support for Apache Kafka 0.10, which is often used for ingesting and managing data streams; and event-time watermarks, which identify events that arrive "too late" to be included in the current job.
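To make the watermark idea concrete, here is a minimal conceptual sketch in plain Python. It is not Spark code and not how Spark implements the feature: the `WatermarkTracker` class and `split_late_events` function are hypothetical names invented for illustration, and the sketch assumes the watermark is simply the latest event time seen so far minus an allowed delay, updated on every event. In Spark itself the feature is exposed through `withWatermark` on a streaming Dataset/DataFrame, and the engine advances the watermark between triggers rather than per event.

```python
from datetime import datetime, timedelta


class WatermarkTracker:
    """Toy event-time watermark: latest event time seen minus an allowed delay.

    Hypothetical helper for illustration only; not part of any Spark API.
    """

    def __init__(self, allowed_delay):
        self.allowed_delay = allowed_delay
        self.max_event_time = None  # no events seen yet

    @property
    def watermark(self):
        # Until the first event arrives there is no watermark.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_delay

    def accept(self, event_time):
        """Return True if the event is on time, False if it is too late."""
        current = self.watermark
        if current is not None and event_time < current:
            return False  # older than the watermark: treated as too late
        # Only on-time events can move the watermark forward.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True


def split_late_events(event_times, allowed_delay):
    """Split event times, in arrival order, into (on_time, too_late) lists."""
    tracker = WatermarkTracker(allowed_delay)
    on_time, too_late = [], []
    for event_time in event_times:
        if tracker.accept(event_time):
            on_time.append(event_time)
        else:
            too_late.append(event_time)
    return on_time, too_late


if __name__ == "__main__":
    base = datetime(2017, 1, 11, 12, 0)
    # Arrival order differs from event-time order, as in a real stream.
    arrivals = [base + timedelta(minutes=m) for m in (0, 10, 5, 30, 15)]
    on_time, too_late = split_late_events(arrivals, timedelta(minutes=10))
    print("on time:", [t.strftime("%H:%M") for t in on_time])
    print("too late:", [t.strftime("%H:%M") for t in too_late])
```

In this example the event stamped 12:05 arrives out of order but is still accepted, because it falls within the 10-minute allowance. The event stamped 12:15 arrives after one stamped 12:30 has pushed the watermark to 12:20, so it is classified as too late.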

The new version also addresses the "stringent" visibility and manageability requirements that streaming analytics demands from the underlying systems.

The following figure depicts the concerns usually handled by streaming engines and those needed in continuous applications:

[Figure: Streaming and Continuous Applications *(source: Databricks)*]

Other highlights of the new release include numerous enhancements to SQL functionality and Spark's core Dataset/DataFrame API, along with better advanced analytics.

The latter improvement stems from many new algorithms added to the MLlib machine learning library, to the GraphX API for graphs and graph-parallel computation, and to SparkR, an R package that is especially attractive to data scientists running jobs on large datasets from the R shell.

It's Structured Streaming that's the star of the new release, though, and Databricks promised more detailed and hands-on information to come on that technology.

"At Databricks, we religiously believe in dogfooding," the company said. "Using a release candidate version of Spark 2.1, we have ported some of our internal data pipelines, as well as worked with some of our customers to port their production pipelines using Structured Streaming. In coming weeks, we will be publishing a series of blog posts on various aspects of Structured Streaming, as well as our experience with it. Stay tuned for more deep dives."

Interested developers can check out the release notes for more detailed information on the changes in Apache Spark 2.1.

About the Author

David Ramel is an editor and writer at Converge 360.
