Yahoo Releases Hadoop Source Code

Yahoo has released to developers the source code for its internal distribution of Hadoop, the Java-based open-source framework for data-intensive distributed computing.

The release is noteworthy because Hadoop currently runs on tens of thousands of servers inside Yahoo, making the company's deployment the largest in the world. Shelton Shugar, Yahoo's senior VP of engineering for cloud computing and data infrastructure, made the announcement this week at the company's second annual Hadoop Summit in Santa Clara, Calif.

The code for the Yahoo Distribution of Hadoop is available now on the company's Developer Network Web site. This download is based on the alpha version of the 0.20 release of the Apache Hadoop code. It's also available on GitHub, the source code hosting and collaborative development site.

The Hadoop framework is an open-source distributed computing platform designed to support parallel computations over large data sets on so-called unreliable computer clusters. It's based on Google's MapReduce, a programming model for processing and generating large data sets, which divides an application into multiple units of work, each of which can be executed on any node in a server cluster. Hadoop supports the Hadoop Distributed File System (HDFS), which is designed to scale to petabytes of storage and to run on top of the file systems of the underlying OS.
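
To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python of the classic word-count job. It is not Hadoop code and does not use any Hadoop API; the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and chosen only for illustration. Each input chunk stands in for a unit of work that a real cluster could hand to any node.

```python
from collections import defaultdict
from itertools import chain


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values seen for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce in sequence on one machine (illustration only)."""
    # Each chunk is an independent unit of work; a cluster could map them in parallel.
    mapped = chain.from_iterable(map_words(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample_chunks = ["big data big clusters", "data on clusters"]
    print(word_count(sample_chunks))
```

Because the map step looks at only one chunk at a time and the reduce step looks at only one key at a time, neither depends on where it runs, which is the property that lets a framework like Hadoop spread the work across many machines.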

Simply put, Hadoop is a system that can analyze a large amount of data in a small amount of time, Shugar explained. "There's still work to be done," he added. "There's lots of development ongoing for more programming languages, better debugging and scheduling tools, and there's talk about making it easier to deploy newer versions. But from the standpoint of having a core system that works well and is reliable, it's pretty solid -- solid enough to use in production."

This release of the Yahoo Distribution of Hadoop includes a scheduler that allows multiple users to share a cluster through separate queues, as well as code patches that Yahoo has added to the framework to improve the stability and performance of its own server clusters. Many of these patches have already been contributed back to the Apache project, but they might not be available in the current Apache release (0.18.3), the company says.

"The Yahoo distribution of Hadoop is running on the largest compute clusters in the universe," Eric Baldeschwieler, Yahoo's vice president of Hadoop software development, told summit attendees. "Now we're putting it out there on the Web."

Speaking to reporters after the keynotes, Shugar explained that this release is meant to encourage developers to create applications that exploit Hadoop's ability to deal with the massive amounts of distributed data that virtually all enterprises struggle with today. "I think Hadoop has become the de facto platform for scalable data processing," he said. "And it's ready for folks to run businesses on."

Rod Smith, IBM vice president for emerging Internet technologies, who spoke at the summit, told attendees that Big Blue is seeing a new kind of demand among its customers for self-service BI. "We see a kind of emerging content application that is more long-running that will be built on Hadoop," Smith said. "We think that Hadoop is going to set the tone for this new class of longer-running types of app. We think tools like this will be helpful in collecting and extracting content and letting users run operations on it over and over again. But so far this is still cookie dough -- it's not baked yet."

IBM is currently developing proof-of-concept tools for these kinds of big-data-set, long-running apps, which Smith said could run on more common configurations of business computers with a few dozen nodes. Called M2, that project uses Hadoop as a back-end engine for a suite of browser-based analytical and visualization tools, he said.

Although Yahoo is making the source code for its Hadoop distro available to developers, it is not currently providing any support for that code. Christopher Yeh, head of the Yahoo Developer Network, said that the company provides support in other ways: through videos, classes, posted notes from the community on the site, blogs, and other general Hadoop resources on the Developer Network site.



About the Author

John K. Waters is a freelance author and journalist based in Silicon Valley. His latest book is The Everything Guide to Social Media. Follow John on Twitter, read his blog on ADTmag.com, check out his author page on Amazon, or e-mail him at john@watersworks.com.

