WatersWorks

'Big Data' Definition Evolving with Technology

While there's a lot of talk about big data these days, there is currently no good, authoritative definition of the term, according to Andrew Brust, Microsoft Regional Director and MVP.

"It's still working itself out," Brust says. "Like any product in a good hype cycle, the malleability of the term is being used by people to suit their agendas."

"That's okay," he continues. "There's a definition evolving."

Still, Brust, who will be speaking about big data and Microsoft at the upcoming Visual Studio Live! New York conference, says that a few consistent big data characteristics have emerged. For one, it can't be big data if it isn't...well...big.

"We're talking about at least hundreds of terabytes," Brust explains. "Definitely not gigabytes. If it's not petabytes, we're getting close, and people are talking about exabytes and zettabytes. For now at least, if it's too big for a transactional system, you can legitimately call it big data. But that threshold is going to change as transactional systems evolve."

But big data also has "velocity," meaning that it's coming in an unrelenting stream. And it comes from a wide range of sources, including unstructured, non-relational sources -- click-stream data from Web sites, blogs, tweets, follows, comments and all the assets that come out of social media, for example.

Also, the big data conversation almost always includes Hadoop, Brust says. The Hadoop framework is an open source distributed computing platform designed to run MapReduce implementations on large clusters of commodity hardware. MapReduce, introduced by Google, is a programming model for processing and generating large data sets; it supports parallel computation over those data sets on clusters of unreliable machines.
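For readers who haven't seen the MapReduce model up close, here is a minimal sketch of the idea in Python, using the classic word-count example. It runs in a single process and does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. In a real cluster, the framework would spread the map and reduce calls across many machines and handle the grouping step between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key.

    A framework such as Hadoop performs this step between map and reduce;
    here it is simulated in memory.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data keeps growing"]
    print(word_count(docs))
```

Because each map call looks at only one document and each reduce call looks at only one key, the calls are independent of one another, which is what lets a framework run them in parallel on commodity hardware and rerun any piece that fails.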

"The truth is, we've always had big data, we just haven't kept it," says Brust, who is also the founder and CEO of Blue Badge Insights. "It hasn't been archived and used for analysis later on. But because storage has become so much cheaper, and because of Hadoop, we can now use inexpensive commodity hardware to do distributed processing on that data, and it's now financially feasible to hold the data and analyze it."

"Ultimately, the value Microsoft is trying to provide is to connect the open-source big data world (Hadoop) with the more enterprise-friendly Microsoft BI (business intelligence) world," Brust says.

Posted by John K. Waters on April 10, 2012