Big Data Product Watch, 03/03/2016: GraphFrames, DataFlow, Data Funnel, More -- ADTmag

By David Ramel
March 3, 2016

Here's a roundup of this week's Big Data news, featuring an updated platform and new release cadence from Hortonworks; GraphFrames, a graph processing library for Apache Spark, from Databricks; the open sourcing of LinkedIn's WhereHows project, which provides a repository for metadata; and DMX Data Funnel from Syncsort, for data ingestion.

Hortonworks Inc. on Tuesday announced a new distribution strategy for its Hortonworks Data Platform (HDP) that sets separate release cadences for its core Apache Hadoop components and for its extended services.
Core services include the Hadoop Distributed File System (HDFS), MapReduce and YARN, which, along with Apache ZooKeeper, will be updated once per year in alignment with the ODPi consortium, "the open ecosystem of Big Data."

Extended services include Apache projects Spark, Hive, HBase, Ambari and others, all of which run on top of the core components. These will be grouped together logically and updated on a continuing basis as dictated by community project teams.

"As part of this rapid distribution model, Hortonworks today announced the general availability of Apache Spark 1.6, Apache Ambari 2.2 and SmartSense 1.2 in HDP 2.4, which is available immediately," the company said.

Hortonworks also introduced Hortonworks DataFlow (HDF) 1.2. "HDF is a data-in-motion platform for real-time streaming of data and is a cornerstone technology for the Internet of Anything to ingest data from any source to any destination," the company said. "HDF 1.2 now integrates streaming analytics engines Apache Kafka and Apache Storm for delivering actionable intelligence. HDF 1.2 will be available in Q1 of 2016."
Databricks Inc. today announced a graph-processing library for Apache Spark called GraphFrames, based on Spark DataFrames and built in conjunction with UC Berkeley and MIT.
"GraphFrames support general graph processing, similar to Apache Spark's GraphX library," Databricks said. "However, GraphFrames are built on top of Spark DataFrames, resulting in some key advantages:
- Python, Java and Scala APIs: GraphFrames provide uniform APIs for all three languages. For the first time, all algorithms in GraphX are available from Python and Java.
- Powerful queries: GraphFrames allow users to phrase queries in the familiar, powerful APIs of Spark SQL and DataFrames.
- Saving and loading graphs: GraphFrames fully support DataFrame data sources, allowing writing and reading graphs using many formats like Parquet, JSON and CSV."
"Graph-specific optimizations for DataFrames are under active research and development," the company said. "Watch Ankur Dave's Spark Summit East 2016 talk to learn more. We plan to include some of these optimizations in GraphFrames for its next release." You can get the code, under an Apache 2.0 license, on GitHub.
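
To make the table-based graph model more concrete, here is a minimal plain-Python sketch. It is not the GraphFrames API and does not use Spark. It only illustrates the idea of describing a graph as a vertex table plus an edge table, using the "id", "src" and "dst" column names that GraphFrames expects. The sample rows and the helper names (in_degrees, save_graph, load_graph) are hypothetical, invented for illustration.

```python
import json
from collections import Counter

# Vertex table: one row per vertex, keyed by "id" (sample data, hypothetical).
vertices = [
    {"id": "a", "name": "Alice"},
    {"id": "b", "name": "Bob"},
    {"id": "c", "name": "Carol"},
]

# Edge table: one row per directed edge, from "src" to "dst" (hypothetical).
edges = [
    {"src": "a", "dst": "b", "relationship": "follows"},
    {"src": "c", "dst": "b", "relationship": "follows"},
    {"src": "b", "dst": "a", "relationship": "follows"},
]


def in_degrees(edge_rows):
    """Count incoming edges per vertex: a group-by on the "dst" column."""
    return dict(Counter(row["dst"] for row in edge_rows))


def save_graph(vertex_rows, edge_rows):
    """Serialize both tables into one JSON string."""
    return json.dumps({"vertices": vertex_rows, "edges": edge_rows})


def load_graph(text):
    """Rebuild the (vertices, edges) pair from save_graph output."""
    data = json.loads(text)
    return data["vertices"], data["edges"]


degrees = in_degrees(edges)
restored_vertices, restored_edges = load_graph(save_graph(vertices, edges))
```

Because the graph is just two tables, a graph statistic such as in-degree becomes an ordinary aggregation, and saving or loading the graph reduces to reading and writing tabular data, which is the property the Databricks list highlights.
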
Business-oriented social site LinkedIn today announced it has open sourced a homegrown tool called WhereHows, which provides a repository for Big Data metadata. The project's name echoes the journalistic technique of building a news article around basic questions such as who, what, when, where and how. In this case, it answers questions such as:
- Where is the member profile data?
- How did it get here?
- What data are used to create inferred member skills data?
- Who owns that flow?
- When was the latest member profile data published by ETL on HDFS?
"Today, we are excited to announce that we are open sourcing WhereHows, a data discovery and lineage portal," the company said. "At LinkedIn, WhereHows integrates with all our data processing environments and extracts coarse and fine grain metadata from them. Then, it surfaces this information through two interfaces: (1) a Web application that enables navigation, search, lineage visualization, annotation, discussion, and community participation, and (2) an API endpoint that empowers automation of other data processes and applications."

In in-house use, LinkedIn said, the tool has captured the status of some 50,000 datasets, 14,000 comments and 35 million job executions, along with related lineage data. The company promised more work on the project, to integrate with more data systems and with data lifecycle management and provisioning systems -- along with new features suggested by analysts and engineers.
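
As a rough illustration of what a lineage question like "What data are used to create inferred member skills data?" involves, the following plain-Python sketch walks a small dependency map. It does not use WhereHows or its API. The dataset names, the UPSTREAM mapping and the upstream_datasets helper are all hypothetical, assumed only for this example.

```python
from collections import deque

# Hypothetical lineage records: each dataset maps to the datasets it is
# directly derived from. None of these names come from LinkedIn.
UPSTREAM = {
    "inferred_member_skills": ["member_profile", "skill_endorsements"],
    "member_profile": ["profile_events"],
    "skill_endorsements": [],
    "profile_events": [],
}


def upstream_datasets(dataset, upstream=UPSTREAM):
    """Return every dataset feeding `dataset`, directly or transitively.

    Results are listed in breadth-first order, each dataset once.
    """
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


sources = upstream_datasets("inferred_member_skills")
```

A real lineage portal would populate such a mapping from job execution metadata rather than a hand-written dictionary; the traversal itself stays the same.
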
Syncsort today announced new capabilities in its DMX-h data integration software. The company, which specializes in bringing Big Data processing to mainframe computers, also introduced DMX Data Funnel, which quickly ingests large amounts of data -- such as hundreds of database tables -- from repositories such as DB2. Syncsort said this reduces the time and effort required to populate enterprise data hubs.
The company said the DMX-h enhancements let enterprises work with mainframe data in its native format on data-processing platforms such as Hadoop or Spark.

"The largest organizations want to leverage the scalability and cost benefits of Big Data platforms like Apache Hadoop and Apache Spark to drive real-time insights from previously unattainable mainframe data, but they have faced significant challenges around accessing that data and adhering to compliance requirements," said exec Tendü Yoğurtçu. "Our customers tell us we have delivered a solution that will allow them to do things that were previously impossible. Not only do we simplify and secure the process of accessing and integrating mainframe data with Big Data platforms, but we also help organizations who need to maintain data lineage when loading mainframe data into Hadoop."