Apache Storm 2.0: Re-Architected in Pure Java
By John K. Waters
The Apache Storm community has announced a major milestone release of its eponymous open source, distributed, real-time computation system. Apache Storm 2.0 comes with a number of fixes and enhancements, but the most striking change in this release is that it has been re-architected in pure Java.
Large parts of the core functionality in previous releases were implemented in Clojure, a dynamic, general-purpose programming language that provides easy access to Java frameworks. The change to pure Java for the system's core functionality was about improving performance, the community says. Switching to Java has made Storm's internal APIs more maintainable and extensible.
"While Storm's Clojure implementation served it well for many years," the Project Management Committee (PMC) wrote in a blog post, "it was often cited as a barrier for entry to new contributors. Storm's codebase is now more accessible to developers who don't want to learn Clojure in order to contribute."
Storm is designed to make it easy to process unbounded streams of data, "doing for real-time processing what Hadoop did for batch processing," as the community puts it on its web site. (The terminology can be confusing, but in his O'Reilly publication, "Streaming 101: The World Beyond Batch," author Tyler Akidau offers two useful definitions: he refers to unbounded data as infinite streaming data sets, and bounded data as finite batch data sets.)
Storm is used in real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL, among other use cases.
The new core introduced in Storm 2.0.0 features a leaner threading model, a much faster messaging subsystem, and a lightweight back-pressure model. It's designed, the PMC says, to push boundaries on throughput, latency, and energy consumption while maintaining backward compatibility.
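For readers unfamiliar with the term, back pressure generally means that a slow consumer causes its producer to be held back instead of letting data pile up without limit. The following is a minimal, single-threaded sketch of that general idea using a bounded buffer. It is not Storm's implementation; the class name, the simulate method, and its tick-based timing are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a toy model of back pressure, not Storm's internal mechanism.
public class BackPressureSketch {

    /** Outcome of one simulated run. */
    static final class Result {
        final int delivered;       // items the consumer received
        final int throttledTicks;  // ticks on which the producer was held back

        Result(int delivered, int throttledTicks) {
            this.delivered = delivered;
            this.throttledTicks = throttledTicks;
        }
    }

    /**
     * The producer tries to send one item per tick; the consumer takes one item
     * every consumeEvery ticks. When the bounded buffer is full, the producer's
     * send is refused and it must retry on a later tick.
     * Assumes capacity >= 1, toSend >= 0, and consumeEvery >= 1.
     */
    static Result simulate(int capacity, int toSend, int consumeEvery) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        int sent = 0;
        int delivered = 0;
        int throttled = 0;

        for (int tick = 1; delivered < toSend; tick++) {
            if (sent < toSend) {
                // offer() returns false instead of blocking when the buffer is full
                if (buffer.offer(sent)) {
                    sent++;
                } else {
                    throttled++;
                }
            }
            if (tick % consumeEvery == 0 && buffer.poll() != null) {
                delivered++;
            }
        }
        return new Result(delivered, throttled);
    }

    public static void main(String[] args) {
        Result r = simulate(2, 4, 2);
        System.out.println("delivered=" + r.delivered + " throttledTicks=" + r.throttledTicks);
    }
}
```

In this toy model nothing is dropped: a small buffer simply forces the producer to wait, while a larger buffer absorbs the same burst without throttling.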
"The design was motivated by the observation that existing hardware remains capable of much more than what the best streaming engines can deliver," the committee wrote. "Storm 2.0 is the first streaming engine to break the 1 microsecond latency barrier."
This release also comes with a new typed API for expressing streaming computations more easily using functional-style operations. It builds on Storm's core spout and bolt APIs and automatically fuses multiple operations to optimize the pipeline. There's also an enhancement to the system's Windowing API, which can now save and restore the window state to the configured state backend, so that larger continuous windows can be supported. The window boundaries can also now be accessed via the APIs.
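To give a rough sense of what "fusing" chained functional operations can mean, here is a small sketch in plain Java. It does not use Storm's API: the Pipeline class and its map, filter, and run methods are hypothetical stand-ins, and the sketch assumes that fusion amounts to folding each chained operation into one per-element function rather than running a separate stage for each.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only: a toy typed pipeline, not Storm's stream API.
public class FusedPipelineSketch {

    /** Hypothetical typed pipeline whose chained operations collapse into one function. */
    static final class Pipeline<I, O> {
        // The single fused step: an empty Optional means the element was filtered out.
        private final Function<I, Optional<O>> fused;

        private Pipeline(Function<I, Optional<O>> fused) {
            this.fused = fused;
        }

        static <T> Pipeline<T, T> start() {
            return new Pipeline<T, T>(Optional::of);
        }

        /** Composes a transformation into the fused step; the output type changes to R. */
        <R> Pipeline<I, R> map(Function<O, R> fn) {
            return new Pipeline<I, R>(in -> fused.apply(in).map(fn));
        }

        /** Composes a predicate into the fused step; non-matching elements are dropped. */
        Pipeline<I, O> filter(Predicate<O> keep) {
            return new Pipeline<I, O>(in -> fused.apply(in).filter(keep));
        }

        /** Applies the one fused function to each input element, in a single pass. */
        List<O> run(List<I> input) {
            List<O> out = new ArrayList<>();
            for (I item : input) {
                fused.apply(item).ifPresent(out::add);
            }
            return out;
        }
    }

    /** Example: trim each word, drop empty ones, and emit the lengths. */
    static List<Integer> wordLengths(List<String> words) {
        return Pipeline.<String>start()
                .map(String::trim)
                .filter(w -> !w.isEmpty())
                .map(String::length)
                .run(words);
    }

    public static void main(String[] args) {
        System.out.println(wordLengths(Arrays.asList(" storm ", "", "java")));
    }
}
```

Because each operation carries its input and output types, a mismatched step (for example, calling a String method after the mapping to lengths) is rejected at compile time, which is the main appeal of a typed API over untyped tuples.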
The list of updates in this release also includes the removal of storm-kafka, which is the most significant change to Storm's Kafka integration since release 1.x. The module was deprecated a while back, due to Kafka's deprecation of the underlying client library, the PMC says. Users will have to move to the storm-kafka-client module, which uses Kafka's kafka-clients library for integration.
With this release, the 1.0.x version line will no longer be maintained, and the PMC strongly encourages 1.0.x users to upgrade to a more recent release. Java 7 support has also been dropped; Storm 2.0 requires Java 8.
The full list of changes in this release is published in the project's release notes. Apache Storm 2.0 is available now for download.
John has been covering the high-tech beat from Silicon Valley and the San Francisco Bay Area for nearly two decades. He serves as Editor-at-Large for Application Development Trends (www.ADTMag.com) and contributes regularly to Redmond Magazine, The Technology Horizons in Education Journal, and Campus Technology. He is the author of more than a dozen books, including The Everything Guide to Social Media; The Everything Computer Book; Blobitecture: Waveform Architecture and Digital Design; John Chambers and the Cisco Way; and Diablo: The Official Strategy Guide.