
Big Data Getting In-Memory Performance Boost: A Tech Round-Up

A flurry of new announcements from Cloudera, Red Hat, SGI and others highlights the growing trend toward real-time analytics of massive amounts of data.

As Big Data evolves into a more accessible, multi-dimensional mainstream technology, its tools are getting a big performance boost by incorporating in-memory computing.

Big Data and in-memory computing have been around for years but are relatively new trends when compared to more traditional data-related technologies such as relational database management systems (RDBMS). In fact, SQL Server, the granddaddy of RDBMSes, is only now getting built-in in-memory capabilities, with the upcoming 2014 release. The importance of increasing computing performance by bypassing disk I/O and taking advantage of falling RAM costs is not lost on Microsoft, which touts those built-in in-memory capabilities first and foremost in its SQL Server 2014 marketing.

And in the relatively upstart and rapidly growing Big Data ecosystem, a plethora of recent announcements showcases how vendors are using in-memory technology to help developers increase the performance of their analytical apps. But just how much of a performance increase can developers expect? Way back in December 2011, an Aberdeen Group study on the state of Big Data proclaimed that "organizations that had adopted in-memory computing were not only able to analyze larger amounts of data in less time than their competitors--they were literally orders of magnitude faster."

Now, more than two years later, Big Data companies are scrambling to help these organizations and their development teams stay competitive.

Cloudera Inc. just three days ago announced commercial support for Apache Spark, an open source project that promises 100x performance gains over Hadoop MapReduce in memory and 10x on disk. Spark attributes that advantage to "an advanced DAG execution engine that supports cyclic data flow and in-memory computing."

On Monday, Cloudera said, "With Spark, Cloudera users can now perform rapid, resilient processing of in-memory datasets stored in Hadoop, as well as general data processing." Cloudera, a major player in commercial distributions of the open source Hadoop technology almost synonymous with Big Data, made the announcement in conjunction with the unveiling of its new "enterprise data hub."
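For a developer, the pattern behind those numbers is simple: load a dataset once, pin it in memory, then run repeated operations against it without rereading it from disk. The following is a minimal, hypothetical PySpark sketch of that reuse pattern (the file path and filter terms are made up for illustration); it is not Cloudera's supported configuration.

# Minimal PySpark sketch (hypothetical path and filters) of Spark's in-memory
# reuse pattern: load once, cache in RAM, run multiple actions without
# rescanning the files on disk.
from pyspark import SparkContext

sc = SparkContext(appName="InMemoryRoundUpSketch")

# Load raw records from HDFS (path is illustrative only).
events = sc.textFile("hdfs:///data/events/*.log")

# Keep only the records of interest, then pin the result in memory.
errors = events.filter(lambda line: "ERROR" in line).cache()

# Each action below reuses the cached RDD rather than rereading the input,
# which is where the iterative and interactive gains come from.
total_errors = errors.count()
timeout_errors = errors.filter(lambda line: "timeout" in line).count()

print(total_errors, timeout_errors)
sc.stop()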

A day later, H2O announced it was partnering with Cloudera to offer "the world's fastest in-memory predictive analytics" to work with that data hub and enable those 100x performance gains in Big Data analysis. "The key to H2O's interactive performance is its fast in-memory parallel processing--never before have state-of-art algorithms been distributed at these speeds," the company said. The company also announced that its products were available on the Intel Corp. and MapR Technologies Inc. Hadoop distributions and that it had joined the Hortonworks Partner Program to work with that company's Hadoop platform. H2O said the YARN subproject of Hadoop allowed its products to run in Hadoop. YARN, short for "Yet Another Resource Negotiator," is an improvement on the much-maligned, batch-oriented MapReduce technology and arrived with the major Hadoop 2.0 revision last October.

Prime Dimensions LLC last week announced "a new offering in high-performance discovery analytics using a Big Data platform based on Hadoop 2.0, NoSQL databases and in-memory technology." Prime Dimensions describes "discovery analytics" as the combination of data discovery and analytics with visualizations. "By enabling extremely fast discovery of patterns and relationships in large datasets via massive in-memory graph analytics and multithreaded processing, clients will be able to mediate the complexity of our digital, data-driven world by revealing new insights in real-time," said company founder Michael Joseph.

Last month, ScaleOut Software Inc., which makes in-memory data grids, said its ScaleOut hServer, an in-memory execution engine for Hadoop MapReduce, is available in the AWS Marketplace. The company also announced a Windows version of the server "to perform Java-based, in-memory Hadoop MapReduce on Windows systems." ScaleOut Software claims its server eliminates disk I/O latencies and other bottlenecks to provide 20x faster execution times than the standard Apache Hadoop distribution.

Red Hat Inc. last week announced JBoss Data Grid 6.2, the latest version of its own in-memory data grid for use in Big Data analytics. "Many IT environments were designed before Big Data challenges became common," the company said. "JBoss Data Grid helps overcome these challenges without requiring enterprises to invest in new data infrastructure by providing a complementary layer to the database."
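That "complementary layer" idea is essentially a cache-aside arrangement: the application checks the in-memory grid first and falls through to the existing database only on a miss. Below is a generic, hypothetical Python sketch of that pattern, with a plain dictionary standing in for a data grid and a made-up query_database function; it is not the JBoss Data Grid API.

# Generic cache-aside sketch; a dict stands in for an in-memory data grid
# sitting in front of an existing database. Not any vendor's actual API.
import time

grid = {}  # stand-in for a distributed in-memory store

def query_database(key):
    # Hypothetical slow lookup against the existing relational database.
    time.sleep(0.05)  # simulate disk/network latency
    return {"id": key, "value": "row-" + str(key)}

def get(key):
    # 1. Try the in-memory layer first.
    if key in grid:
        return grid[key]
    # 2. On a miss, fall through to the database and populate the grid,
    #    so repeat reads are served from RAM.
    row = query_database(key)
    grid[key] = row
    return row

get("42")  # slow path: hits the database, warms the grid
get("42")  # fast path: served entirely from memory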

Another announcement last week came from MicroStrategy Inc., which unveiled Parallel Relational In-Memory Engine (PRIME), an "in-memory analytics service designed to deliver extremely high performance for complex analytical applications that have the largest data sets and highest user concurrency." The company said PRIME, an option for users of its "MicroStrategy Cloud" analytics platform, uses "a state-of-the-art visualization and dashboarding engine with an innovative massively parallel in-memory data store" to more quickly distribute Big Data insights to large numbers of users.

Besides software packages and integrated platforms, another way vendors are making Big Data analytics more accessible is through specialized hardware/software "appliances." Hardware vendor Silicon Graphics International Corp. last month announced it will develop "an in-memory appliance based on the SAP HANA platform" from SAP AG. The SAP HANA platform, released in 2010, features an in-memory, real-time, row-and-column store RDBMS.

Also getting in on the row-and-column store scene is MemSQL. The "NewSQL" vendor, which claims to produce "the world's fastest in-memory database," just today announced the upcoming version 3 of its Big Data platform, combining its in-memory row store "with a new highly compressed column store."
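The row-versus-column distinction those two products straddle is easy to see in miniature. The sketch below is purely illustrative Python (the table and figures are invented and mirror no vendor's engine): a row store keeps each record's fields together, which suits transactional writes, while a column store keeps each column's values together, so an analytic scan touches only the columns it needs and runs of similar values compress well.

# Illustrative-only comparison of row-store and column-store layouts;
# the data is made up and this reflects no particular vendor's engine.
rows = [
    {"order_id": 1, "region": "East", "amount": 120.0},
    {"order_id": 2, "region": "West", "amount": 80.0},
    {"order_id": 3, "region": "East", "amount": 45.5},
]

# Row store: each record's fields stored together -- good for point lookups
# and writes that touch whole rows.
row_store = rows

# Column store: each column's values stored together -- aggregates read only
# the columns they need, and repeated values (e.g. regions) compress well.
column_store = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

# An analytic query ("total amount for East") reads just two columns:
total_east = sum(
    amt for region, amt in zip(column_store["region"], column_store["amount"])
    if region == "East"
)
print(total_east)  # 165.5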

Another company in the RDBMS realm, VoltDB, last week announced version 4.0 of its "high-speed operational database with in-memory analytics." Company president and CEO Bruce Reading said, "VoltDB 4.0 makes possible high-performance, exceptionally fast business applications designed to realize the promise of Big Data for many industries."

The fact that all of the above was announced within the last month--most of it within the last week--speaks to how rapidly Big Data and in-memory computing are converging, and to how important it has become for organizations to stay competitive by harnessing the kind of performance gains described in that 2011 Aberdeen report.

Faster Big Data analytics applications don't necessarily translate into better Big Data insights for management, however. Case in point: yet another headline last week--based on a study using the aforementioned SAP HANA platform--read: "Denver Broncos to defeat Seattle Seahawks in Super Bowl: big data analysis."

Final score:

Denver Broncos: 8

Seattle Seahawks: 43

Ouch.