In-Depth

Are You Building a Data Lake or Falling into a Data Swamp?

A data lake can be an asset to business intelligence systems. But in developing a data lake, it's important to avoid the pitfalls that can turn it into a data swamp.

Alex Gorelik, author of "The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science," offers guidance on avoiding the quagmire. As a consultant, he has worked with major companies including Unilever, Google, Royal Caribbean, LinkedIn, Kaiser Permanente, Goldman Sachs, GlaxoSmithKline and Fannie Mae.

"I wrote this book because I talk to a lot of companies," he explained in an interview. "For every successful data lake project, I talk to dozens who failed."

Problems arise when organizations are either unsure whether they even need a data lake, or are not sure how to build a successful one. Too often they end up building a data swamp and they don't know what to do with it.

Speaking at conferences, Gorelik asks for a show of hands to see where the attendees are in their attempts to build a successful data lake. A significant number are on their third or fourth attempt.

"So, people are really struggling with this idea of a data lake," he says.

While software and hardware companies and their marketing staffs may offer technology as the "solution," Gorelik talks about the need for an evolving cultural change where an organization makes data available and useful to more of its people.

"I really think of Big Data especially data lakes as IT's answer to sharing an economy," he explains. "It's so we can get data where we can share it and get volume out of it and do our jobs better. So, to be successful, it has to be used by many people. If you have two people using it, it's a sandbox."

But IT need not despair if they only have a sandbox to start. Like all major projects, data lakes can grow from modest beginnings.

"I define stages of maturity as data puddles," Gorelik says. "That's somewhere where you use the technology but it's like a data mart, but with Big Data. You have a small team. They know exactly what's in there. They use it for a single purpose or for a single team. Then you might grow to a data pond which is like a data warehouse in the cloud, by either moving your data warehouse to Big Data technology but having a number of puddles together. That's the traditional way data warehouses grow into a data mart. To have a real data lake, you have to get people to start sharing. You have to have self-service and a broad adoption. So, what's required is, you need scalability so lots of processing can be done, so when they share, they don't kill each other, otherwise they don't use it."

Future-proofing is another key concept in the evolution of Big Data and the creation of data lakes, Gorelik argues. To ensure that data scientists, analysts and business users will be able to use the data the organization is gathering now, he urges IT to avoid proprietary database systems or anything else that locks data up so it cannot be shared.

"You also need future-proofing because you're keeping history," he says. "If we don't start saving data now, we won't have it when we do our analytics. We don't even know what kind of analytics we need but without history we won't be able to do it. People have been saving it but if you put it inside of a proprietary database it's stuck in there. If you put it in a file system, any new project can use it."

As founder and CTO of Waterline Data of Mountain View, Calif., he practices what he preaches. He started the company five years ago with a mission to "help organizations discover and catalog data so they can connect the right people to the right data, accelerate their ability to take action and ensure compliance." He made sure the data would be available for future use regardless of whether the user was working with SQL, Spark, Java or any future tool. "The data is available for everybody to use," Gorelik says.

Besides being open and available, the data also needs to be secure if the data lake is going to be successful.

"One of the reasons these things fail is governance," Gorelik says. "A lot of data lakes I know fail with security, they let everybody load their data but nobody knows where it is because this is data for the future not for current use, so nobody knows where even the permission is. So, they let people go at it without even protecting it. If you have ten million files and nobody knows what's inside, how do you even use it? It's not very successful."

With a data lake, IT needs to guard the information with a secure permission policy, so it's not a swamp anyone can jump into. Both governance and cataloging of data are keys to making sure the data can be shared effectively, Gorelik says.

"Cataloging and making sense of data becomes critical to self-service and you need governance because if you know what data is you are able to protect it and you are able to expose it safely," he explains.

With a successful data lake, organizations can apply data science to real-world problems. Gorelik cites examples such as a major American city that is using data on street pavement issues to anticipate where potholes are likely to pop up and fix them proactively before they even appear. In another example, a transportation company used data to determine what was causing passenger train cars to be taken out of service. The data showed the most common problem was damage to the doors. The company was able to take steps to better protect the doors from damage and avoid having cars out of service for repairs.

These applications of data science were made possible by taking steps to ensure the data is available, cataloged and secure.

"If you get those things right and then you make it possible to share then you get a successful data lake," Gorelik says.