Apache Announces Hadoop Upgrade, Elevates Spark Project -- ADTmag

By David Ramel
February 28, 2014

The open source Apache Software Foundation this week voted to release a Hadoop upgrade that allows for in-memory caching of data and working with data from different storage classes. It also elevated Spark, the Big Data analytics project, to top-level status.

The Big Data industry has recently been flooded with in-memory analytics product announcements from numerous vendors. Now, in-memory caching of Hadoop Distributed File System (HDFS) data in the new Hadoop 2.3.0 release will help developers boost the performance of the baseline open source Hadoop distribution.

The problem addressed, according to release notes, is that "HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality."

That issue has been fixed, explained Arun Murthy, co-founder of Hortonworks Inc., a major Hadoop distributor. "It is now possible to use memory available in the Hadoop cluster to centrally cache and administer data sets in-memory in the datanode’s address space," Murthy said. "Applications such as MapReduce, Hive, Pig [and so on] can now request for memory to be cached ... and then read it directly off the datanode’s address space for extremely efficient scans by avoiding disk all together."

Cloudera Inc., another major Hadoop distributor, said the in-memory caching was developed by two of its engineers. By letting developers target certain files and directories for caching, the feature "enables memory-speed reads in HDFS," Cloudera said. "Preliminary benchmarks show that optimized applications can achieve read throughput on the order of gigabytes per second."
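As a rough illustration of what targeting files and directories for caching can look like, here is a minimal Python sketch that only assembles command lines for the `hdfs cacheadmin` tool that ships with this feature (`-addPool` to create a cache pool, `-addDirective` to cache a path in it). The pool name, the paths, and the helper function `cache_commands` are hypothetical and are not taken from the release notes or from either vendor. The sketch builds the commands and prints them; it does not run them.

```python
def cache_commands(pool, paths, replication=1):
    """Build `hdfs cacheadmin` command lines that create a cache pool
    and add one cache directive per path.

    `pool` and `paths` are illustrative values chosen by the caller;
    nothing is executed against a cluster here.
    """
    # A cache pool groups directives for administration.
    commands = [["hdfs", "cacheadmin", "-addPool", pool]]
    for path in paths:
        # One directive per file or directory to be held in datanode memory.
        commands.append([
            "hdfs", "cacheadmin", "-addDirective",
            "-path", path,
            "-pool", pool,
            "-replication", str(replication),
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical pool and paths, for illustration only.
    for cmd in cache_commands("hot-data", ["/warehouse/sales", "/logs/today"]):
        print(" ".join(cmd))
```

On a real cluster the printed lines would be run by an administrator with the appropriate HDFS permissions. Whether a given application benefits depends on it being written to read from the cached replicas.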

The other major improvement, Heterogeneous Storage Hierarchy, means developers can work with different kinds of storage in HDFS. "We now can take advantage of different storage types on the same Hadoop clusters," Murthy said. "Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, [solid-state drives], memory [and so on]."

Other improvements in the new Hadoop release include hundreds of bug fixes and new features such as "simplified distribution of MapReduce binaries via the YARN Distributed Cache," noted Cloudera.

In other news, Apache yesterday announced that Spark has been elevated from its previous incubator status to a top-level project. That means "the project's community and products have been well-governed under the ASF's meritocratic process and principles," Apache said.

Spark is a distributed computing framework that allows for advanced analytics in Hadoop. "Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source," Apache said.

About the Author

David Ramel is an editor and writer at Converge 360.
