Apache Announces Hadoop Upgrade, Elevates Spark Project -- ADTmag

By David Ramel
February 28, 2014

The open source Apache Software Foundation this week voted to release a Hadoop upgrade that allows for in-memory caching of data and working with data from different storage classes. It also elevated Spark, the Big Data analytics project, to top-level status.

The Big Data industry has recently been flooded with in-memory analytics product announcements from numerous vendors. Now, in-memory caching of Hadoop Distributed File System (HDFS) data in the new Hadoop 2.3.0 release will help developers boost the performance of the baseline open source Hadoop distribution.

The problem addressed, according to release notes, is that "HDFS currently has no support for managing or exposing in-memory caches at datanodes. This makes it harder for higher level application frameworks like Hive, Pig and Impala to effectively use cluster memory, because they cannot explicitly cache important datasets or place their tasks for memory locality."

That issue has been fixed, explained Arun Murthy, founder of Hortonworks Inc., a major Hadoop distributor. "It is now possible to use memory available in the Hadoop cluster to centrally cache and administer data sets in-memory in the datanode’s address space," Murthy said. "Applications such as MapReduce, Hive, Pig [and so on] can now request for memory to be cached ... and then read it directly off the datanode’s address space for extremely efficient scans by avoiding disk all together."

Cloudera Inc., another major Hadoop distributor, said the in-memory caching was developed by two of its engineers. By letting developers target certain files and directories for caching, the feature "enables memory-speed reads in HDFS," Cloudera said. "Preliminary benchmarks show that optimized applications can achieve read throughput on the order of gigabytes per second."
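Targeting a directory for caching can be sketched with the `hdfs cacheadmin` command-line tool that ships with the caching feature. The Python helper below only builds the two commands (create a cache pool, then add a cache directive for a path) and prints them without running anything. The function name, pool name and path are hypothetical and chosen purely for illustration, and the sketch assumes a cluster where caching is already configured on the datanodes.

```python
import shlex


def cache_commands(path, pool, replication=1):
    """Build `hdfs cacheadmin` argument lists that would cache `path` in `pool`.

    Nothing is executed here; the caller decides whether to run the commands
    (for example with subprocess.run) on a machine with the Hadoop client.
    """
    # Step 1: create a cache pool, the administrative grouping for directives.
    add_pool = ["hdfs", "cacheadmin", "-addPool", pool]
    # Step 2: add a directive asking datanodes to cache the given file or directory.
    add_directive = [
        "hdfs", "cacheadmin", "-addDirective",
        "-path", path,
        "-pool", pool,
        "-replication", str(replication),
    ]
    return [add_pool, add_directive]


if __name__ == "__main__":
    # Hypothetical pool name and HDFS path, not taken from the article.
    for cmd in cache_commands("/warehouse/hot_table", "analytics-pool"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning argument lists rather than one shell string keeps the commands safe to pass to a process launcher without extra quoting.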

The other major improvement, Heterogeneous Storage Hierarchy, means developers can work with different kinds of storage in HDFS. "We now can take advantage of different storage types on the same Hadoop clusters," Murthy said. "Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, [solid-state drives], memory [and so on]."

Other improvements in the new Hadoop release include hundreds of bug fixes and new features such as "simplified distribution of MapReduce binaries via the YARN Distributed Cache," noted Cloudera.

In other news, Apache yesterday announced that Spark has been elevated from its previous incubator status to a top-level project. That means "the project's community and products have been well-governed under the ASF's meritocratic process and principles," Apache said.

Spark is a distributed computing framework that allows for advanced analytics in Hadoop. "Spark is well suited for machine learning, interactive queries, and stream processing, and can read from HDFS, HBase, Cassandra, as well as any Hadoop data source," Apache said.

About the Author

David Ramel is an editor and writer at Converge 360.
