Gartner Big Data Analyst: Still Waiting for YARN to Move Hadoop Forward

Gartner analyst Merv Adrian was upbeat about the progress of Big Data and Hadoop during his recent keynote at Hadoop Summit North America, but later expressed some frustration and noted that "we aren't quite there yet."

Adrian on Wednesday wrote a recap of the summit in which he said the increase in attendance and sponsors--along with increased queries to him and observations of fellow analysts--all pointed to "steady growth" of Hadoop.

"But it's not all sweetness and light," Adrian wrote Wednesday. "There are issues." Some of those issues include stagnant corporate investment in Big Data, continuing confusion about Hadoop itself, and the failure of the technology dubbed "YARN" to quickly move things forward. "Much is expected--and we seem to be doomed to wait a while longer," he said of the last problem.

Adrian highlighted that issue by quoting a summary of YARN by Jeff Kelly, posted "after the summit":

"MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods. YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a One Trick Pony to a truly comprehensive Big Data management and analytics platform."

But the problem, Adrian pointed out, is the slow progress of YARN: the above summary was posted last August, after the 2012 summit, and things haven't moved forward much since.

Adrian noted that there has been adoption of YARN in the industry, but the Apache Software Foundation, the open source steward of Hadoop, still lists the technology as being in the "alpha" stage--before beta--and the MapR site still doesn't even mention YARN.

As for corporate investment stagnation, Adrian pointed out in his keynote that 31 percent of respondents to a 2013 Gartner survey indicated they have no plans to invest in Big Data, and that's the same percentage as reported in a 2012 survey.

Confusion about exactly what Hadoop is continues, Adrian said, as reflected by the number of people still asking him that question. The confusion is liable to continue, he said, because of the growing number of substitutes for core Hadoop components. "As YARN comes to market, other engines will be swappable for MapReduce," the analyst said. "Graph engines and 'closer to real-time' processing are next on the horizon, as Storm is getting great traction and several Summit presenters of real world case studies alluded to their use of it. Yahoo! has open sourced its Storm-YARN code, which it runs internally, so expect more productionization ahead. So the answer to 'what is Hadoop, exactly?' will become even more complicated."

During his keynote, Adrian supplied Gartner's answer to that question: "Apache Hadoop is a set of standard open source software projects that provide a framework for using massive amounts of data across a distributed network."

Despite that handy definition, Adrian said that the confusion about Hadoop might slow adoption, a problem compounded by the increase in component substitutes. "YARN will broaden the set of possible use cases, and raise many questions," he said. "Let’s hope it's ready to start answering them soon."

About the Author

David Ramel is an editor and writer for Converge360.