New Open Source Tool Formats Big Data for TensorFlow

LinkedIn today announced it has open sourced a tool it developed to convert Apache Spark-based Big Data into a format consumable by TensorFlow, the popular open source platform for machine learning.

The tool, called Avro2TF, removes the data-conversion hassle faced by many Big Data developers, freeing them to focus on building machine learning models, the business-oriented social site said.

Avro2TF is just one of the tools the company has donated to the community from its internal deep learning initiatives, which it has applied to its recommendation and search artificial intelligence (AI) systems. LinkedIn said it's on a mission to democratize machine learning.

"One of the important lessons we have learned from this journey is the importance of providing good deep learning platforms that help our modeling engineers become more efficient and productive," the LinkedIn engineering team said in a blog post today (April 4). "Avro2TF is part of this effort to reduce the complexity of data processing and improve the velocity of advanced modeling."

To work with TensorFlow, the tool converts datasets stored in the Apache Avro format, the most popular format used by company engineers, where feature data is commonly held in what the project calls a "sparse vector format."

"Based on the feedback from our users on the LinkedIn ML vertical teams, we needed a scalable solution focused on scalable data conversion," the team said. "More specifically, we needed a solution that converted our LinkedIn data types (e.g., sparse vector, dense vector, etc.) into a deep learning format (i.e., tensors)."
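To make the idea of turning a sparse vector into a tensor-style value concrete, here is a minimal Python sketch. It is not Avro2TF's code or API: the function name, the parallel index/value layout, and the fixed vector size are all assumptions chosen for illustration, and the real tool does this work at scale on Spark.

```python
from typing import List


def sparse_to_dense(indices: List[int], values: List[float], size: int) -> List[float]:
    """Expand a sparse vector into a dense list of length `size`.

    Hypothetical layout (an assumption, not Avro2TF's schema): `indices`
    holds the positions of the non-zero entries and `values` holds the
    entries themselves, in the same order.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")

    # Start from all zeros; only the listed positions are filled in.
    dense = [0.0] * size
    for index, value in zip(indices, values):
        if not 0 <= index < size:
            raise IndexError(f"index {index} is outside a vector of size {size}")
        dense[index] = value
    return dense


# Usage: a vector of size 6 with two non-zero entries.
dense_vector = sparse_to_dense([1, 4], [0.5, 2.0], 6)
print(dense_vector)
```

A dense list like this is the kind of fixed-shape value a deep learning framework can consume directly. In practice, very wide sparse features are often kept in a sparse representation rather than expanded, so treat the dense expansion as the simplest case.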

The tool supports all Spark-readable data formats, including ORC, which is also used by company engineers.

The project is now available on GitHub, and the company has published a tutorial on how to use it.

LinkedIn believes Avro2TF can help other organizations facing the same challenges it is addressing, with the GitHub site stating: "We believe that this is not only a LinkedIn problem, many companies have vast amount of ML data in similar sparse vector format, and Tensor format is still relatively new to many companies. Avro2TF bridges this gap by providing scalable Spark based transformation and extensions mechanism to efficiently convert the data into TF records that can be readily consumed by TensorFlow. With this technology, developers can improve their productivity by focusing on model building rather than data conversion."

About the Author

David Ramel is an editor and writer for Converge360.