WatersWorks


Apache Hadoop Community Promotes YARN -- But Don't Call it MapReduce 2

The Hadoop community recently promoted YARN -- the next-gen Hadoop data processing framework -- to the status of "sub-project" of the Apache Hadoop Top Level Project. The promotion puts YARN on the same level as Hadoop Common, the Hadoop Distributed File System, and MapReduce. It had been part of the MapReduce project; the promotion means it'll now get the spotlight and developer attention its proponents believe it deserves.

"We now have consent from the community to separate YARN from MapReduce," says Arun C. Murthy. "Which is as it should be. YARN is not another generation of MapReduce, and I really don't like the 'MapReduce 2.0' label. This is a different paradigm. This is much more general and much more interesting."

Murthy ought to know: he has been a full-time contributor to the Hadoop MapReduce project since it got off the ground at Yahoo in early 2006. Back then, he and fellow Yahoo software engineer Owen O'Malley set a world data-sorting record (http://sortbenchmark.org/) using MapReduce: a terabyte in 60 seconds. Today, Murthy is a member of the Apache Hadoop Project Management Committee and a co-founder of Hortonworks, one of the chief providers of commercial support and services for Hadoop.

And he's been working on YARN full-time for about two and a half years.

"We knew that we were going to have to take Hadoop beyond MapReduce," Murthy says. "The programming model—the MapReduce algorithm—was limited. It can't support the very wide variety of use-cases we're now seeing for Hadoop. YARN turns Hadoop into a generic resource-management-and-distributed-application framework that lets you implement multiple customized apps. I expect to see MPI, graph-processing, simple services, all co-existing with MapReduce applications in a Hadoop YARN cluster. You can even run MapReduce now as an application for YARN."

Hadoop, of course, is the open-source framework for running applications on large data clusters built on commodity hardware (let's just say it: Big Data). I sometimes forget that Hadoop is actually a combination of two technologies: an implementation of Google's MapReduce model and HDFS. MapReduce is a programming model for processing large data sets that supports parallel computation on so-called unreliable clusters. HDFS is the storage component, designed to scale to petabytes and to run on top of the file systems of the underlying operating systems.
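
For readers who haven't seen the MapReduce model up close, here's a minimal single-machine sketch in Python of the classic word-count example. It is not Hadoop's API, and nothing in it comes from Murthy; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, chosen only for illustration. On a real cluster the map and reduce calls would run in parallel across many machines, with the framework handling the grouping step and any failures.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit one (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all the values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine (illustrative only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "big yarn"]))
```

Each map call looks at one document and each reduce call at one key, so neither depends on the others. That independence is what lets the model spread work across a cluster and rerun a piece when a machine fails.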

What Murthy and others are hoping to do is redefine Hadoop from "HDFS-plus-MapReduce" to "HDFS-plus-YARN."

"The users can now look at Hadoop as a much more general-purpose system," Murthy says. "And from a developer perspective, we've opened up Hadoop itself to the point where now anyone can implement their own applications without having to worry about the nitty-gritty details of how you manage resources in a cluster and what you do for fault tolerance. [Promoting it] will also help us get more users and more developers to build an ecosystem around YARN. I guarantee you that next year at this time, we will be looking at four or five ways of doing real-time processing on Hadoop."

And I had to ask: What does YARN stand for?

"We were sitting around at lunch one day, trying to come up with the most inane names for our product," Murthy confessed to me. "The result was 'Yet Another Resource Negotiator—YARN.' I know: it's a really bad name."

But really promising technology.

Hortonworks is publishing a still-unfolding series of blog posts by Murthy and Hortonworks product marketing director Jim Walker on YARN and its implications for Hadoop. And there's a new collaboration mailing list ([email protected]) for those who want to get involved in the project.

Posted by John K. Waters on August 15, 2012