New Projects Fill in Big Data Gaps

As some original Apache Hadoop projects mature and graduate to commercial stewardship for further refinement, the open source Apache Software Foundation (ASF) and private companies are continually incubating and launching new open source projects to fill in the gaps in Big Data analytics.

The ASF yesterday announced a new version of a Big Data warehousing solution providing increased "SQL-on-Hadoop" functionality: Apache Tajo.

"Apache Tajo is used for low-latency and scalable ad hoc queries, online aggregation and extract-transform-load process (ETL) on large data sets stored on the Hadoop Distributed File System (HDFS) and other data sources," the ASF said. "By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities."

The SQL-on-Hadoop movement is a primary driver in the maturing Big Data ecosystem, as it expands from its NoSQL roots to be more inclusive and accessible.

Key new features in Tajo v0.10.0 listed by the ASF include:

  • Oracle and PostgreSQL catalog store support.
  • Direct JSON file support.
  • HBase storage integration (allowing users to directly access HBase tables through Tajo).
  • Improved JDBC driver for easier use of JDBC applications.
  • Improved Amazon S3 support.

Early this year, the ASF announced Apache Flink had graduated to a top-level project, providing a system for "expressive, declarative, and efficient batch and streaming data processing and analysis."

"Apache Flink is an open source distributed data analysis engine for batch and streaming data," the ASF said. "It offers programming APIs in Java and Scala, as well as specialized APIs for graph processing, with more libraries in the making."

Also elevated to a top-level project in January was Apache Falcon, "an open Source Big Data processing and management solution for Apache Hadoop in use at Hortonworks, InMobi and Talend, among others." The project addresses data motion, data pipeline coordination, lifecycle management and data discovery, the ASF said.

The ASF also recently updated other key Big Data components, including Apache HBase, which last month reached v1.0, featuring a comprehensive API reorganization.

The private sector has also been releasing Big Data project updates. Last month, MapR Technologies Inc. -- one of the "big three" commercial vendors of Hadoop-based distributions -- teamed up with Mesosphere Inc. for a new Big Data framework called Myriad.

"Today, Mesosphere and MapR are proud to announce project Myriad, an open source framework for running YARN on Mesos that integrates the two major powerhouses in the datacenter -- Mesos and Hadoop -- and makes them fully compatible technologies," Mesosphere said.

Mesosphere -- which provides the Mesosphere Datacenter Operating System (DCOS) for managing large-scale datacenter and cloud resources -- said the project started out under the direction of the Apache Mesos project, but would be submitted to the Apache Incubator in the hope of becoming an independent project.

"Apache Mesos abstracts CPU, memory, storage and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively," the project's Web site states.

MapR further expounded on the Myriad project. "Based on an open source and collaborative development effort between MapR, Mesosphere and eBay, Myriad is an open source project built on the vision of consolidating Big Data with other workloads in the datacenter into a single pool of resources for greater utilization and operational efficiency," the company said.

With last month's announcement of an Open Data Platform (ODP), another organization will help fill in the gaps in Big Data analytics functionality. "The ODP will promote Big Data technologies based on open source software from the Apache Hadoop ecosystem and optimize testing among and across the ecosystem's vendors," cofounder Pivotal Software Inc. said in a news release. "These efforts will accelerate the ability of enterprises to build or implement data-driven applications."

About the Author

David Ramel is an editor and writer for Converge360.