Analysts Warn of Data Lake 'Fallacy' in Big Data Analytics
- By David Ramel
- July 28, 2014
Storing unstructured Big Data from disparate sources in one large "data lake" for integrated analysis is an unfulfilled promise, according to analysts from research firm Gartner Inc.
The company today issued a news release to publicize the recent research report, "The Data Lake Fallacy: All Water and Little Substance."
The news release -- titled, "Gartner Says Beware of the Data Lake Fallacy" -- states that while several vendors are publicizing data lakes as essential to capitalizing on Big Data analytics, there isn't a common view among these vendors about what a data lake is or how it can provide value.
A data lake is often described as a single repository for large amounts of data that can be unstructured and of different types, allowing for widespread analytics from various users and applications.
"In broad terms, data lakes are marketed as enterprisewide data management platforms for analyzing disparate sources of data in its native format," said Nick Heudecker, co-author of the report. "The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the up-front costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organization."
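Heudecker's "original format" point is what is often called schema-on-read: the lake skips up-front transformation and leaves interpretation to whoever analyzes the data later. A minimal sketch of the trade-off (all names hypothetical, not from any vendor's product) might look like this:

```python
import json

# Schema-on-write (warehouse style): transform and validate at ingest time.
def ingest_warehouse(table, record):
    row = {
        "id": int(record["id"]),          # bad data fails here, up front
        "amount": float(record["amount"]),
    }
    table.append(row)

# Schema-on-read (data lake style): store the raw bytes, defer interpretation.
def ingest_lake(lake, raw_bytes):
    lake.append(raw_bytes)                # no up-front transformation cost

def analyze_lake(lake):
    # Each analyst applies structure at read time -- and bears that cost.
    return [json.loads(b) for b in lake]

lake = []
ingest_lake(lake, b'{"id": "7", "amount": "19.99"}')
records = analyze_lake(lake)
print(records[0]["amount"])  # still the string "19.99": typing was deferred
```

The sketch shows where the cost moves rather than disappears: the lake's cheap ingest leaves every downstream reader to re-discover that `amount` was stored as a string, which is the "responsibility of the business end user" point White makes below.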
Co-author Andrew White said that while data lakes -- driven by the need for more agility and more accessible data analytics -- can help certain parts of an organization, organizations have yet to realize the data lake's promise of effective enterprisewide data management.
Data lakes address two problems, the analysts said. One is the old problem of eliminating data silos, or collections of data managed independently. The second is the new problem faced in Big Data initiatives where the types of data vary so much that putting the information into structured data warehouses or relational database management systems (RDBMSes) hinders analysis.
"Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used -- data is simply dumped into the data lake," White said. "However, getting value out of the data remains the responsibility of the business end user. Of course, technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place."
The main risk of using data lakes, Gartner said, is the absence of descriptive metadata and a mechanism to maintain it, a gap that can turn a data lake into a "data swamp." A data lake can intake any kind of data without oversight or governance, and the risk comes from being unable "to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake."
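The kind of metadata Gartner says is missing is essentially a catalog: a record of where each dataset came from and what it was derived from, so that quality and lineage questions can be answered later. A hedged sketch of that discipline (all names hypothetical, not a reference to any specific catalog product):

```python
import datetime

catalog = {}  # dataset name -> descriptive metadata

def register(name, source, description, derived_from=None):
    # Recording provenance at write time is what makes later
    # data-quality and lineage questions answerable at all.
    catalog[name] = {
        "source": source,
        "description": description,
        "derived_from": derived_from or [],
        "registered": datetime.date.today().isoformat(),
    }

def lineage(name):
    # Walk back through derived_from links to the raw inputs.
    parents = catalog[name]["derived_from"]
    return [name] + [n for p in parents for n in lineage(p)]

register("clickstream_raw", "web servers", "raw click logs")
register("sessions", "derived", "sessionized clicks",
         derived_from=["clickstream_raw"])
print(lineage("sessions"))  # ['sessions', 'clickstream_raw']
```

Without entries like these, a second analyst who finds `sessions` in the lake has no way to trace it back to `clickstream_raw` or judge its quality, which is the "data swamp" scenario in a nutshell.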
And the lack of semantic consistency and governed metadata can thwart the goal of having many different users and departments perform analytics on the data, because those users may not have the data manipulation or analytics skills of data scientists. Getting everyone up to speed with the same advanced skills is time-consuming and expensive, if it is possible at all.
Other risks include security and access control considerations, as data dumped into a lake might have associated privacy or regulatory requirements and shouldn't be exposed without oversight.
These risks, combined with performance considerations, led Gartner to advise companies to "focus on semantic consistency and performance in upstream applications and data stores instead of information consolidation in a data lake."
Reactions to the Gartner report from various vendors in the Big Data/Hadoop arena were mixed.
"The data lake is necessary for meaningful Big Data analytics -- for the first time you can bring together diverse multi-structured data (transactions, customer interactions and machine data) without months/years of IT boiling it down to small data," Ben Werther, founder and CEO of Big Data analytics company Platfora Inc., told this site. "It is necessary but not sufficient -- the missing piece is the native analytical tools that give frustrated business analysts the self-service iterative workflow to weave together that data for insights not possible with traditional BI tools."
Pivotal Software Inc., which is partnering with Capgemini to provide a "Business Data Lake," had one major beef with the Gartner report. "In general, we agree with a lot of the points made by Gartner, aside from the comment that data lakes solve a limited set of use cases," a Pivotal spokesperson told this site. "Pivotal and Capgemini formed their relationship back in December 2013 to address the agreed points, laid out in [our white paper], to address not narrow use cases, but the varied and constantly changing range of use cases for a Business Data Lake within enterprise customers."
Jack Norris of MapR Technologies Inc. also claimed that his company's solutions are addressing some of the issues noted by Gartner. MapR, often referred to as one of the "big three" vendors offering Apache Hadoop-based distributions, says its packaged solutions add value to the core open source Hadoop technology in the areas of security, disaster recovery, full data protection and high availability.
"The cost, efficiency and agility of Hadoop is driving the adoption of data lakes across industries," Norris told this site. "Gartner is rightly pointing out that not all Big Data and Hadoop solutions provide the performance, security and data protection capabilities that customers need. MapR is specifically architected to address these enterprise requirements enabling organizations across industries to successfully deploy data lakes."
Nevertheless, the Gartner analysts indicated some new thinking might be required around the concept of data lakes -- or in place of that concept.
"There is always value to be found in data but the question your organization has to address is this -- do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalize to a degree that effort, and try to sustain the value-generating skills we develop?" White said. "If the option is the former, it is quite likely that a data lake will appeal. If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy."
David Ramel is an editor and writer for Converge360.