Hadoop Summit 2012 Highlights -- ADTmag

WatersWorks

By John K. Waters

Hadoop Summit 2012 Highlights

The fifth annual Hadoop Summit brought an estimated 2,100 attendees to the Convention Center in downtown San Jose, Calif., last week. The two-day, big-data event was hosted by Yahoo, Hadoop's first large-scale user, and Hortonworks, a leading commercial support-and-services provider.

Among the announcements coming out of this year's summit were updates from the three leading commercial Hadoop distributors. Hortonworks unveiled the first general release of its Apache Hadoop software distro, Hortonworks Data Platform (HDP) 1.0, a day before the start of the show. The company bills the open source data management platform as "the next generation enterprise data architecture." Built on Apache Hadoop 1.0, this release includes a bundle of new provisioning, management, and monitoring capabilities built into the core platform. It also comes with an integration of the Talend Open Studio for Big Data tool.

Cloudera got a big jump on the competition by announcing a new release a week earlier, but the company showed on its new CDH4 and Cloudera Manager 4, which are part of Cloudera Enterprise 4.0, at the show. Version 4 of CDH, the company's open source Hadoop platform (on which Enterprise 4.0 is built), expands the number of computational processes executable under Hadoop and introduces a new feature designed to software programs to be embedded within the data itself. Dubbed "coprocessors," these programs are executed when certain pre-defined conditions are met.

MapR Technologies showed off version 2.0 of its Hadoop distro, the first to support multi-tenancy. The new version also comes with advanced monitoring management tools, isolation capabilities, and added security. MapR is offering this release in a basic edition (M3), and an advanced edition (M5). The MapR Hadoop Distribution M3 supports HBase, Pig, Hive, Mahout, Cascading, Sqoop and Flume. The M5 edition adds high availability features and additional security tools, including: JobTracker HA, Distributed NameNode HA, Snapshots and Mirroring.

Also, VMware launched a new open source project codenamed "Serengeti" at the show. The Web site describes the project's goal "to enable the rapid deployment of an Apache Hadoop cluster... on a virtual platform." VMware says the project aims to produce a virtualization-aware Hadoop configuration and management tool. VMware is partnering with Cloudera, Hortonworks, MapR and big data analysis company Greenplum on this project.

Apache Hadoop is an increasingly popular, Java-based, open-source framework for data-intensive distributed computing. They system is designed to analyze a large amount of data in a small amount of time. At its core, it is a combination of Google's MapReduce and the Hadoop Distributed File System (HDFS). MapReduce is a programming model for processing and generating large data sets. It supports parallel computations over large data sets on unreliable computer clusters. HDFS is designed to scale to petabytes of storage and to run on top of the file systems of the underlying OS.

Attendance at this year's Hadoop Summit set a record. The first event, held in 2008, drew an estimated 500 attendees. The Summit's sponsorship roster underscores the growing importance of the data analysis platform. Cisco, Facebook, IBM, Microsoft and VMware were among the heavy hitters adding their support to the event; there were 49 event sponsors total.

Speaking at the conference, Facebook engineer Andrew Ryan talked with attendees about his company's record-setting reliance on the HDFS clusters to store more than 100 petabytes of data. During his talk, Ryan explained how Facebook has worked around Hadoop's key weakness: its reliance on a single name server (Namenode) to send and receive all filesystem data via a pool of Datanodes. If a Datanode goes down there's little impact on the cluster, but if Namenode goes down, no clients can read or write to the HDFS. The fix: AvatarNode, a piece of software designed to provide a backup Namenode. Ryan laid out the details from his talk in a blog post.

Posted by John K. Waters on June 18, 2012