Apache Storm 2.0: Re-Architected in Pure Java

The Apache Storm community has announced a major milestone release of its eponymous open source, distributed, real-time computation system. Apache Storm 2.0 comes with a number of fixes and enhancements, but the most striking change in this release is that it has been re-architected in pure Java.

Large parts of the core functionality in previous releases were implemented in Clojure, a dynamic, general-purpose programming language that provides easy access to Java frameworks. The move to pure Java for the system's core functionality was meant to improve performance, the community says. The switch has also made Storm's internal APIs more maintainable and extensible.

"While Storm's Clojure implementation served it well for many years," the Project Management Committee (PMC) wrote in a blog post, "it was often cited as a barrier for entry to new contributors. Storm's codebase is now more accessible to developers who don't want to learn Clojure in order to contribute."

Storm is designed to make it easy to process unbounded streams of data, or, as the community puts it on its website, "doing for real-time processing what Hadoop did for batch processing." (The language can be confusing around this stuff, but in his O'Reilly publication, "Streaming 101: The World Beyond Batch," author Tyler Akidau offers two useful definitions: He refers to unbounded data as infinite streaming data sets, and bounded data as finite batch data sets.)

Storm is used in real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL, among other use cases.

The new core introduced in Storm 2.0.0 features a leaner threading model, a much faster messaging subsystem, and a lightweight back-pressure model. It's designed, the PMC says, to push boundaries on throughput, latency, and energy consumption while maintaining backward compatibility.
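For readers unfamiliar with the term, back pressure is the general idea that a slow consumer signals a fast producer to ease off instead of letting work pile up without limit. The following is a minimal sketch of that idea using a bounded queue from the Java standard library. It is not Storm's implementation, and the class and method names (BackPressureSketch, tryEmit, consumeOne) are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration of back pressure; not part of Storm. */
final class BackPressureSketch {
    // Bounded inbox standing in for a downstream component's receive queue.
    private final BlockingQueue<String> inbox;

    BackPressureSketch(int capacity) {
        this.inbox = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Tries to hand a tuple downstream without blocking.
     * Returns false when the inbox is full, which is the signal
     * for the producer to pause or retry later.
     */
    boolean tryEmit(String tuple) {
        return inbox.offer(tuple);
    }

    /** Consumes one tuple, freeing capacity; returns null if the inbox is empty. */
    String consumeOne() {
        return inbox.poll();
    }
}
```

With a capacity of 2, a third call to tryEmit is rejected until consumeOne has freed a slot, so the producer's rate ends up bounded by the consumer's.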

"The design was motivated by the observation that existing hardware remains capable of much more than what the best streaming engines can deliver," the committee wrote. "Storm 2.0 is the first streaming engine to break the 1 microsecond latency barrier."

This release also comes with a new typed API for expressing streaming computations more easily using functional-style operations. It builds on Storm's core spout and bolt APIs and automatically fuses multiple operations to optimize the pipeline. There's also an enhancement to the system's Windowing API, which can now save and restore the window state to the configured state backend, so that larger continuous windows can be supported. In addition, the window boundaries can now be accessed via the APIs.
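To give a rough sense of what "fusing" functional-style operations means, here is a small sketch in plain Java. It does not use Storm's API. FusedPipeline and its methods are hypothetical names invented for this illustration, and the sketch assumes that fusion simply means composing chained map and filter steps into one function, so each tuple is processed in a single pass and not handed between separate stages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical illustration of operator fusion; not Storm's Streams API. */
final class FusedPipeline<I, O> {
    // The single composed step. It returns null when a tuple is filtered out,
    // so this sketch assumes tuples themselves are never null.
    private final Function<I, O> step;

    private FusedPipeline(Function<I, O> step) {
        this.step = step;
    }

    /** Starts an empty pipeline that passes tuples through unchanged. */
    static <T> FusedPipeline<T, T> start() {
        return new FusedPipeline<>(t -> t);
    }

    /** Appends a transformation by composing it onto the existing step. */
    <R> FusedPipeline<I, R> map(Function<? super O, ? extends R> fn) {
        return new FusedPipeline<>(in -> {
            O mid = step.apply(in);
            return mid == null ? null : fn.apply(mid);
        });
    }

    /** Appends a filter by composing it onto the existing step. */
    FusedPipeline<I, O> filter(Predicate<? super O> keep) {
        return new FusedPipeline<>(in -> {
            O mid = step.apply(in);
            return (mid != null && keep.test(mid)) ? mid : null;
        });
    }

    /** Runs every source tuple through the one fused step. */
    List<O> run(Iterable<I> source) {
        List<O> out = new ArrayList<>();
        for (I item : source) {
            O result = step.apply(item);
            if (result != null) {
                out.add(result);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> out = FusedPipeline.<String>start()
                .map(String::length)   // "storm" -> 5, "is" -> 2, "java" -> 4
                .filter(n -> n > 3)    // drops 2
                .map(n -> n * 10)
                .run(Arrays.asList("storm", "is", "java"));
        System.out.println(out);       // prints [50, 40]
    }
}
```

The three chained calls read like separate operations, but run applies only one composed function per tuple. A real engine would make this decision across distributed components, which this sketch does not attempt to model.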

The list of updates in this release also includes the removal of storm-kafka, which is the most significant change to Storm's Kafka integration since release 1.x. The module was deprecated a while back, due to Kafka's deprecation of the underlying client library, the PMC says. Users will have to move to the storm-kafka-client module, which uses Kafka's kafka-clients library for integration.

With this release, the 1.0.x version line will no longer be maintained. The PMC strongly encourages 1.0.x users to upgrade to a more recent release. Java 7 support has also been dropped; Storm 2.0 requires Java 8.

The full list of changes in this release can be found here. Apache Storm 2.0 is available now for download.

About the Author

John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.