BI-on-Hadoop Benchmark Reveals Analytical Engine 'Sweet Spots' -- ADTmag

BI-on-Hadoop Benchmark Reveals Analytical Engine 'Sweet Spots'

By David Ramel
October 24, 2016

BI-on-Hadoop specialist AtScale Inc.'s recent analytical engine benchmark study concludes that organizations will probably need to use multiple such engines for a successful implementation able to handle varied workloads.

The company researched four analytics engines in the open source world -- Hive, Impala, Presto and Spark SQL -- in order to publish its new reference performance study: The Business Intelligence for Hadoop Benchmark Q4 2016.

The good news is the company found continuing innovation across all products, resulting in 2x to 4x performance gains in just the six months since the first edition of the study was conducted. The not-so-good news is that none is likely to provide one sole solution for all the varied Big Data analytical workloads. Depending upon the type of query and other factors, "enterprises will find that each engine has its own 'sweet spot,' " AtScale said.

"There is no single 'best engine,'" the study concluded. "Presto, Hive, Impala and Spark SQL were all able to effectively complete a range of queries on over 6 billion rows of data. The 'winning' engine for each of our benchmark queries was dependent on the query characteristics (join size, selectivity, group-bys)."

For example, one "sweet spot" for Presto and Hive is handling a large number of concurrent dashboard queries. "Production enterprise BI user-bases may be on the order of 100s or 1,000s of users," AtScale said. "As such, support for concurrent query workloads is critical. Our benchmarks showed that Presto and Impala performed best -- that is, showed the least query degradation -- as concurrent query workload increased. Presto, new to this edition of the benchmark, showed the best results in our user concurrency testing."

**[Click on image for larger view.]** Benchmark Query Results for Large Tables *(source: AtScale)*

For querying large data sets, the study said: "Consistent with the findings in the First Edition of the BI on Hadoop Benchmarks, Spark SQL and Impala tend to be faster than Hive. For many queries, the performance difference between Impala and Spark SQL is relatively small. Presto 0.152, a newcomer to the Second Edition of the BI on Hadoop Benchmarks, shows a performance profile that is very similar to that of Hive 2.1."

On the smaller side of things, Impala and Spark SQL perform better, the company said. "Impala and Spark SQL continue to shine for small data queries (queries against the AtScale Adaptive Cache). New in this edition, the latest release of Hive LLAP (Live Long and Process) shows suitable 'small data' query response times. Presto also shows promise on small, interactive queries."

The Presto analysis was added to the AtScale project after the first edition was published in February.

AtScale said it conducted the study to reveal the strengths and weaknesses of the industry’s most popular analytical engines, focusing on three criteria: how they perform when processing large amounts -- billions or trillions of rows -- of data; how fast they are on smaller, interactive queries; and how stable they are for many users.

After discovering that all four engines perform well for general use cases and workloads, AtScale concluded that "A successful BI-on-Hadoop architecture will likely require more than one SQL on Hadoop engine. Each engine has its strengths: Presto's and Impala's concurrency scaling support for quick metric queries, Spark SQL's handling of large joins, Hive's and Impala's consistency across multiple query types. Enterprises might consider leveraging different engines for different query patterns."

About the Author

David Ramel is an editor and writer at Converge 360.

Featured

AppTrends

Email Address*Country*

Please type the letters/numbers you see above.

Upcoming Training Events

0 AM

VSLive! 3-Day Hands-On Training Seminar: Master Modern JavaScript: Unlock the Full Potential of Your Code
June 2-4, 2025

VSLive! 2-Day Hands-On Training Seminar: Asynchronous and Parallel Programming in C#
June 24-25, 2025

VSLive! 4-Day Hands-On Training Seminar: Immersive .NET Full Stack Training: 4-Day Hands-On Experience
July 15-18, 2025

Securing IT in the AI Era
July 23, 2025

VSLive! 4-Hour In-Depth Workshop: Immersive .NET Full Stack Training: C# Interfaces: Effective Usage while Avoiding Pitfalls
July 29, 2025

Visual Studio Live! @ Microsoft HQ
August 4-8, 2025

4-Hour VSLive! Workshop: Testability in .NET
August 27, 2025

Visual Studio Live! San Diego
September 8-12, 2025

Live! 360 2-Day Hands-On Seminar: Swimming in the Lakes of Microsoft Fabric and AI – A Hands-on Experience
September 18-19, 2025

VSLive! 2-Day Hands-On Training Seminar: Hands-On with .NET Web Development in 2025
October 7-8, 2025

Live! 360 Orlando
November 16-21, 2025

Artificial Intelligence Live! Orlando
November 16-21, 2025

Cloud & Containers Live! Orlando
November 16-21, 2025

Cybersecurity & Ransomware Live! Orlando
November 16-21, 2025

Data Platform Live! Orlando
November 16-21, 2025

Visual Studio Live! Orlando
November 16-21, 2025

VSLive! 4-Day Hands-On Training Seminar: Immersive .NET Full Stack Training: 4-Day Hands-On Experience
December 16-19, 2025

Visual Studio Live! Las Vegas
March 16-20, 2026

Free White Papers

More Tech Library