eBay Open Sources Big Data Tool
It's a familiar story: Web giant needs analytical capabilities not found in existing products, uses tremendous resources to build homegrown solution and open sources the project for all to use and improve upon.
That's how the whole Big Data thing got started, and the latest iteration of the story features e-commerce giant eBay Inc. putting faster query speeds, SQL functionality and Multidimensional Online Analytical Processing (MOLAP) into its distributed analytics engine now open sourced as the project Kylin.
"Designed to accelerate analytics on Hadoop and allow the use of SQL-compatible tools, Kylin provides a SQL interface and MOLAP on Hadoop to support extremely large datasets," the eBay tech blog announced last week.
eBay said the technologies behind the Kylin approach aren't new, but it has taken Hadoop's distributed computing model -- including the Hadoop Distributed File System (HDFS) -- and used it to improve on existing techniques, such as pre-calculating and storing the values from a large query for later use.
"When data becomes bigger, the pre-calculation processing becomes impossible -- even with powerful hardware," eBay said. "However, with the benefit of Hadoop's distributed computing power, calculation jobs can leverage hundreds of thousands of nodes. This allows Kylin to perform these calculations in parallel and merge the final result, thereby significantly reducing the processing time."
Speeding up queries was important to internal eBay staffers, who wanted less latency on their analytics projects but also wanted to continue using familiar tools such as Microsoft Excel and those from Tableau, known for data visualization.
Finding no existing tool that met those needs, eBay developers created Kylin to meet the company's internal goals:
- Sub-second query latency on billions of rows.
- ANSI-standard SQL availability for those using SQL-compatible tools.
- Full OLAP capability to offer advanced functionality.
- Support for high cardinality and very large dimensions.
- High concurrency for thousands of users.
- Distributed and scale-out architecture for analysis in the terabyte to petabyte size range.
The resulting Kylin OLAP engine can run fast queries on data structures with more than 10 billion rows, the company said, and perform interactive queries of Hadoop data faster than the Apache Hive project, a data warehouse that uses a SQL-like query language called HiveQL.
Kylin works by reading data from Hive, running MapReduce for pre-calculations, storing cube data in HBase and using Zookeeper to coordinate jobs. HBase, part of the Hadoop ecosystem, is an open source, non-relational, distributed database.
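The pre-calculation step can be pictured with a toy example. The sketch below is a simplified assumption rather than Kylin's implementation: it pre-aggregates one measure for every combination of dimensions and then answers a filtered query with a lookup instead of a scan over the rows. A plain dictionary stands in for HBase, and the dimension and measure names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "country", "category")  # hypothetical dimensions

ROWS = [
    {"year": 2013, "country": "US", "category": "toys", "sales": 10},
    {"year": 2013, "country": "DE", "category": "toys", "sales": 4},
    {"year": 2014, "country": "US", "category": "books", "sales": 6},
    {"year": 2014, "country": "US", "category": "toys", "sales": 5},
]

def build_cube(rows, dimensions, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)

def query(cube, **filters):
    """Answer SUM(measure) for the given dimension values by lookup only."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    values = tuple(filters[d] for d in dims)
    return cube.get((dims, values), 0)

CUBE = build_cube(ROWS, DIMENSIONS, "sales")
```

Here `query(CUBE, country="US")` reads a single stored value instead of touching the four source rows. The trade-off is storage: three dimensions already produce eight combinations, and the number doubles with each added dimension.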
Components in the Kylin platform include: Metadata Manager; Job Engine; Storage Engine; REST Server; ODBC Driver; and Query Engine. The Metadata Manager is the key component, eBay said, as it manages the crucial cube metadata and all other metadata, supporting all the other components.
eBay has been refining the technology with the help of pilot customers and sponsors, including key employees of Hadoop distribution vendor Hortonworks Inc.
"Our largest use case is the analysis of more than 12 billion source records generating more than 14 TB cubes," eBay said. "Its 90 percent query latency is less than five seconds. Now, our use cases target analysts and business users, who can access analytics and get results through the Tableau dashboard very easily -- no more Hive query, shell command, and so on."
As evidenced on a developer Google Group, work continues on the project, with recent updates to a Docker container feature, Web front end, RESTful API and more.
With the open sourcing of Kylin, eBay said it's looking for developer contributions, specifically in the areas of a shell client, RPC server, job scheduler and tools. The "how to contribute" wish list on GitHub includes specific projects such as merging multiple HBase tables in the metadata component and implementing multi-column distinct counts in the query engine, among many others.
One major improvement on tap, eBay said, is support for hybrid OLAP.
"MOLAP is great to serve queries on historical data, but as more and more data needs to be processed in real time, there is a growing requirement to combine real-time/near-real-time and historical results for business decisions," eBay said. "Many in-memory technologies already work on Relational OLAP (ROLAP) to offer such capability. Kylin's next generation will be a Hybrid OLAP (HOLAP) to combine MOLAP and ROLAP together and offer a single entry point for front-end queries."
eBay also said it will propose Kylin as an Apache Software Foundation (ASF) incubator project, the initial step for ASF adoption.
David Ramel is the editor of Visual Studio Magazine.