Databricks Open Sources Project Aimed at Data Lake Reliability
- By Becky Nagel
- April 24, 2019
San Francisco, Calif.-based Databricks, the company founded by the original creators of Apache Spark, today announced the release of Delta Lake, an open source project designed to provide "reliability for both batch and streaming data" for data lakes.
Data lakes are large storage repositories, often used by enterprises, that keep data in its "raw" or "natural" format in a flat structure, with each item tagged with a unique identifier and metadata. Data warehouses, by contrast, are generally hierarchical and store data using folders or files. Data in a lake can then be pulled for a variety of uses, whether data mining, machine learning, analytics or something else.
According to Databricks, while the architecture of data lakes offers enterprises benefits, reliability often isn't one of them. "Data reliability challenges derive from failed writes, schema mismatches and data inconsistencies when mixing batch and streaming data, and supporting multiple writers and readers simultaneously," the company explained in its announcement of Delta Lake.
Databricks said that Delta Lake offers better reliability "by managing transactions across streaming and batch data and across multiple simultaneous readers and writers."
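The announcement does not spell out how Delta Lake manages those transactions. As a rough illustration of the general idea only, the sketch below shows one common way to let several writers and readers share a table safely: an append-only commit log with optimistic concurrency. Every name in it (try_commit, commit_with_retry, read_snapshot, the numbered JSON file layout) is hypothetical and should not be read as Delta Lake's actual implementation.

```python
import json
import os
import tempfile


def latest_version(log_dir: str) -> int:
    """Return the highest committed version number, or -1 for an empty log."""
    versions = [int(name[:-5]) for name in os.listdir(log_dir) if name.endswith(".json")]
    return max(versions, default=-1)


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to publish `actions` as commit `version`; False if another writer won."""
    final_path = os.path.join(log_dir, f"{version:08d}.json")
    # Write the full commit to a temporary file first, so readers never
    # observe a half-written commit under its final name.
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    try:
        # os.link fails if final_path already exists, so at most one
        # writer can claim a given version number.
        os.link(tmp_path, final_path)
        return True
    except FileExistsError:
        return False
    finally:
        os.remove(tmp_path)


def commit_with_retry(log_dir: str, actions: list, max_attempts: int = 5) -> int:
    """Optimistic concurrency: pick the next version, retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many conflicting writers")


def read_snapshot(log_dir: str) -> list:
    """Replay commits in version order to list the table's data files."""
    files = []
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".json"):
            with open(os.path.join(log_dir, name)) as f:
                files.extend(json.load(f))
    return files
```

In this toy model a failed write leaves no numbered commit behind, so readers that replay the log see either all of a writer's changes or none of them. Batch and streaming writers would both go through the same commit step.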
"Delta Lakes can be easily plugged into any Apache Spark job as a data source, enabling organizations to gain data reliability with minimal change to their data architectures," the company continued. "Organizations no longer need to spend resources building complex and fragile data pipelines to move data across systems. Instead, developers can have hundreds of applications reliably upload and query data at scale."
More information and the Delta Lake code are now available for download.
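To make the "data source" description above concrete, here is a minimal PySpark sketch of what plugging Delta Lake into a Spark job can look like. It assumes pyspark is installed and the Delta Lake package is available to the Spark session. The helper names (append_rows, load_table, stream_table) and any table path are hypothetical; the only Delta-specific piece is the "delta" format name passed to Spark's standard reader and writer.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Used for type hints only; a real run needs pyspark and the
    # Delta Lake package available to the Spark session.
    from pyspark.sql import DataFrame, SparkSession


def append_rows(spark: SparkSession, path: str) -> None:
    """Batch write: append a small demo DataFrame to a Delta table at `path`."""
    df = spark.range(0, 5)  # demo data: a single `id` column with values 0..4
    df.write.format("delta").mode("append").save(path)


def load_table(spark: SparkSession, path: str) -> DataFrame:
    """Batch read: load the current contents of the Delta table at `path`."""
    return spark.read.format("delta").load(path)


def stream_table(spark: SparkSession, path: str) -> DataFrame:
    """Streaming read: treat the same table as a Structured Streaming source."""
    return spark.readStream.format("delta").load(path)
```

An existing job that already writes Parquet with df.write.format("parquet") would, under these assumptions, mostly need the format string changed, which is the "minimal change" the company describes.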
About the Author
Becky Nagel is vice president of AI for 1105 Media, where she specializes in training internal and external customers on maximizing their business potential via a wide variety of generative AI technologies as well as developing cutting-edge AI content and events. She's the author of "ChatGPT Prompt 101 Guide for Business Uses," regularly leads research studies on generative AI business usage, and serves as the director of AI Boardroom, a new resource for C-level executives looking to excel in the AI era. Prior to her current position she was a technical leader for 1105 Media's Web, advertising and production teams as well as editorial director for a suite of enterprise technology publications, including serving as founding editor of PureAI.com. She has 20 years of enterprise technology journalism experience, and regularly speaks and writes about generative AI, AI, edge computing and other cutting-edge technologies.