The Hadoop community recently promoted YARN -- the next-gen Hadoop data processing framework -- to the status of "sub-project" of the Apache Hadoop Top Level Project. The promotion puts YARN on the same level as Hadoop Common, the Hadoop Distributed File System, and MapReduce. It had been part of the MapReduce project; the promotion means it'll now get the spotlight and developer attention its proponents believe it deserves.
"We now have consent from the community to separate YARN from MapReduce," says Arun C. Murthy. "Which is as it should be. YARN is not another generation of MapReduce, and I really don't like the 'MapReduce 2.0' label. This is a different paradigm. This is much more general and much more interesting."
Murthy ought to know: he has been a full-time contributor to the Hadoop MapReduce project since it got off the ground at Yahoo in early 2006. Back then, he and fellow Yahoo software engineer Owen O'Malley set a world data-sorting record (http://sortbenchmark.org/) using MapReduce: a terabyte in 60 seconds. Today, Murthy is a member of the Apache Hadoop Project Management Committee and a co-founder of Hortonworks, one of the chief providers of commercial support and services for Hadoop.
And he's been working on YARN full-time for about two and a half years.
"We knew that we were going to have to take Hadoop beyond MapReduce," Murthy says. "The programming model—the MapReduce algorithm—was limited. It can't support the very wide variety of use-cases we're now seeing for Hadoop. YARN turns Hadoop into a generic resource-management-and-distributed-application framework that lets you implement multiple customized apps. I expect to see MPI, graph-processing, simple services, all co-existing with MapReduce applications in a Hadoop YARN cluster. You can even run MapReduce now as an application for YARN."
Hadoop, of course, is the open-source framework for running applications on large data clusters built on commodity hardware (let's just say it: Big Data). I sometimes forget that Hadoop is actually a combination of two technologies: Google's MapReduce and HDFS. MapReduce is a programming model for processing large data sets that supports parallel computations on so-called unreliable clusters. HDFS is the storage component, designed to scale to petabytes and run on top of the file systems of the underlying operating systems.
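For readers who haven't seen the MapReduce model up close, here is a minimal, single-machine sketch of the idea in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_words`, `reduce_counts`, `run_mapreduce`) and the sample lines are invented for illustration. A real cluster would spread the map and reduce calls across many machines and rerun any that fail.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse all values seen for one key into a total."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Shuffle step: collect every mapped value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each key can be reduced independently, which is what lets a
    # cluster run the reducers in parallel.
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up sample input, one record per line.
lines = ["big data on hadoop", "hadoop runs mapreduce", "big clusters"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
```

Because no map call depends on another, and no key's reduction depends on another key, the work can be split across commodity machines. That independence is the property the model trades on.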
What Murthy and others are hoping to do is redefine Hadoop from "HDFS-plus-MapReduce" to "HDFS-plus-YARN."
"The users can now look at Hadoop as a much more general-purpose system," Murthy says. "And from a developer perspective, we've opened up Hadoop itself to the point where now anyone can implement their own applications without having to worry about the nitty-gritty details of how you manage resources in a cluster and what you do for fault tolerance. [Promoting it] will also help us get more users and more developers to build an ecosystem around YARN. I guarantee you that next year at this time, we will be looking at four or five ways of doing real-time processing on Hadoop."
And I had to ask: What does YARN stand for?
"We were sitting around at lunch one day, trying to come up with the most inane names for our product," Murthy confessed to me. "The result was 'Yet Another Resource Negotiator—YARN.' I know: it's a really bad name."
But really promising technology.
Hortonworks is in the process of publishing a still-unfolding series of blogs by Murthy and Hortonworks' product marketing director Jim Walker on the subject of YARN and its implications for Hadoop. And there's a new collaboration mailing list (firstname.lastname@example.org) for those who want to get involved in the project.
Posted by John K. Waters on 08/15/2012 at 10:53 AM 0 comments
This week a California court ordered both Google and Oracle to disclose the identities of any bloggers, commentators or journalists who were paid to write about the companies' courtroom Java battle.
"The Court is concerned that the parties and/or counsel herein may have retained or paid print or internet authors, journalists, commentators or bloggers who have and/or may publish comments on the issues in this case," wrote Judge William Alsup in a Tuesday filing.
The judge added that even though this particular case is almost over, "the disclosure required by this order would be of use on appeal or on any remand to make clear whether any treatise, article, commentary or analysis on the issues posed by this case are possibly influenced by financial relationships to the parties or counsel."
Both sides in the case were ordered to file a statement clearly identifying "all authors, journalists, commentators or bloggers who have reported or commented on any issues in this case and who have received money (other than normal subscription fees) from the party or its counsel during the pendency of this action." The two companies are required to file those statements by Friday, August 17.
Oracle had alleged that Google infringed on Java-related patents and copyrights when it developed its Android operating system. The jury in the case ruled unanimously in May that Google had not infringed on those patents. But in a partial verdict delivered on May 7, it held that Google had infringed on Oracle's copyrights in its use of 37 Java APIs, and deadlocked on whether that infringement could be considered "fair use."
Alsup is the judge who presided over the case in the U.S. District Court for the Northern District of California, and ruled in June that the Java APIs are not subject to copyright, though he kept his ruling narrow: "This order does not hold that Java API packages are free for all to use without license," Alsup wrote. "It does not hold that the structure, sequence and organization of all computer programs may be stolen. Rather, it holds on the specific facts of this case, the particular elements replicated by Google were free for all to use under the Copyright Act."
On April 18, blogger Florian Mueller, who writes the FOSS Patents blog and is a long-time follower of the Oracle v. Google case, disclosed to his readers a new consulting relationship with Oracle. Mueller wrote:
- "I have been following Oracle v. Google since the filing of the lawsuit in August 2010 and have read pretty much every line of each court filing in this litigation. My long-standing views on this matter are well-documented. As an independent analyst and blogger, I will express only my own opinions, which cannot be attributed to any one of my diversity of clients. I often say things none of them would agree with. That said, as a believer in transparency I would like to inform you that Oracle has very recently become a consulting client of mine. We intend to work together for the long haul on mostly competition-related topics including, for one example, FRAND licensing terms."
Mueller noted in that posting that he "vocally opposed Oracle's acquisition of Sun Microsystems."
Posted by John K. Waters on 08/08/2012 at 10:53 AM 0 comments
Looks like we won't be seeing the Java-native module system known as Project Jigsaw in the upcoming Java 8 release. In a blog posted this week, the chief architect of Oracle's Java Platform Group, Mark Reinhold, proposed to defer the project to the Java 9 release. Java 8 is currently on track for a September 2013 ship date. Java 9 is currently expected in 2015.
Although "steady progress is being made" on Jigsaw, some "significant technical challenges remain," Reinhold wrote, adding, "There is, more importantly, not enough time left for the broad evaluation, review, and feedback which such a profound change to the Platform demands."
Not to be confused with the weird puppet-head guy in the Saw movies, Project Jigsaw is the OpenJDK project focused on implementing a standard module system for Java Standard Edition (SE). Sponsored by the Java programming language Compiler Group, and originally aimed at modularizing just the JDK, the project will ultimately apply to the Java SE, EE, and ME platforms and the JDK.
"The growing demand for a truly standard module system for the Java Platform motivated expanding the scope of the [Jigsaw] Project," the sponsors explain on the OpenJDK Web site. The goal of the project is to "produce a module system that can ultimately become a JCP-approved part of the Java SE Platform and also serve the needs of the ME and EE Platforms."
When it is implemented, a modular system for Java will "ease the construction, maintenance, and distribution of large applications, at last allowing developers to escape the 'JAR hell' of the brittle and error-prone class-path mechanism," Reinhold wrote. Such a system will support customizable configurations that scale from large servers to embedded devices and "in the long term, enable the convergence of Java SE with the higher-end Java ME Platforms." Reinhold also pointed out that "Modular applications built on top of a modular platform can be downloaded more quickly, and the run-time performance of the code they contain can be optimized more effectively."
Reinhold expects Java 8 to include the much-anticipated Project Lambda (JSR 335), which adds closures and related features to the Java language to support programming in multicore environments. Java 8 will also include the new Date/Time API (JSR 310), Type Annotations (JSR 308), and "a selection of the smaller features already in progress," he said.
Work on Jigsaw will, in the meantime, proceed at full speed, he added.
In that same blog post, Reinhold also advocated for a regular two-year cycle for all future Java SE releases.
"In all the years I've worked on Java," Reinhold wrote, "I've heard repeatedly that developers, partners, and customers strongly prefer a regular and predictable release cycle. Developers want rapid innovation while enterprises want stability, and a cadence of about two years seems to strike the right balance. It's therefore saner for all involved -- those working on new features, and those who want to use the new features -- to structure the development process as a continuous pipeline of innovation that's only loosely coupled to the actual release process, which itself has a constant rhythm. If a major feature misses its intended release train then that's unfortunate but it's not the end of the world: It will be on the next train, which will also leave at a predictable time."
IDC analyst and long-time Java watcher Al Hilwa believes that the delayed release of Project Jigsaw is probably the right move.
"Java does not exist in a vacuum and delays in the Java modularity project of the JDK will no doubt hinder certain parts of the ecosystem," Hilwa told ADTmag. "However, under the circumstances, I think it is wise to prioritize schedule over features. The maturity of any development process is measured by the predictability of its schedule. Oracle has done a decent job of steering Java to be schedule driven, and kudos to the team for owning up at the right time, because the ecosystem needs to know as early as possible."
I would argue that the success of the annual Eclipse Release Train, now in its seventh year, offers an example of the value of predictable releases, both in terms of reassuring the commercial adopters and the community itself. Hilwa believes that the releases should come even faster.
"I would argue that, in the era of cloud services, social interaction, and mobile app stores, a faster cadence is needed," Hilwa added, "and the two-year cycle should give way to a more incremental and faster approach to development everywhere."
Posted by John K. Waters on 07/19/2012 at 10:53 AM 0 comments
Rod Johnson, who wrote the first version of the open-source, Java-based Spring framework, and later co-founded SpringSource, has left his position as SVP and GM of VMware's SpringSource product division. Johnson, then CEO of SpringSource, joined the Palo Alto, Calif.-based virtualization company when it acquired SpringSource in 2009.
In the blog post announcing his departure, Johnson gave no specific reasons for leaving the company, but described the past decade as "a wild and engrossing ride that I could never have imagined when I wrote the first lines of BeanFactory code in my study in London in 2001."
The Spring Framework is one of the most popular Java application frameworks on the market today. It's a layered Java/J2EE framework based on code published in Johnson's book Expert One-on-One J2EE Design and Development (Wrox Press, October 2002). He also wrote the first version of the framework. Although SpringSource has been a Java-focused operation, the company has ported its framework to .NET.
The open source Spring project was launched in 2003, and Johnson co-founded SpringSource in 2004. When the company was acquired by VMware in 2009, Johnson saw the merger as a joining of forces.
"Both of these companies grew up around great technology," he told ADTMag.com at the time. "We believe that the technology synergies are very, very strong, and that they will allow us to do incredibly exciting things with Platform as a Service and Java cloud technologies."
The VMware merger is responsible, at least in part, for the Spring Framework's expansion into management, runtimes, and non-Java development tools. In 2010 the company launched a lightweight version of its tc Server to provide a small footprint for running applications in virtualization and cloud-deployment architectures. The division also acquired data management vendor GemStone that year with plans to use that company's GemFire enterprise data fabric to give developers using the Spring Framework the infrastructure necessary for emerging cloud-centric applications.
Johnson served as a member of the Executive Committee (EC) of the Java Community Process (JCP) and was an outspoken critic of the JCP's slow progress toward resolution of problems with J2EE. In 2009, during the latest dustup in an ongoing conflict between Sun Microsystems and the Apache Software Foundation (ASF) over Sun's refusal to provide the Foundation with a license for a Technology Compatibility Kit (TCK), Johnson expressed his disappointment with the process to ADTmag: "This issue raises legitimate concerns about the credibility of the JCP as a whole," he said. "I mean, the JCP is either open or it's not. I have a lot of sympathy for the Foundation on this issue."
In 2011, Johnson told attendees at the annual JAX Conference in San Francisco that Java developers needed to "seize the lead in cloud computing." Developers would soon need to be able to build applications that "leverage a dynamic and changing infrastructure, access data in non-traditional storage formats, perform complex computations against large data sets, support access from a plethora of client platforms and do so more quickly than ever before without sacrificing scalability, reliability and performance," he declared. What's called for now is "an open, productive Java Platform-as-a-Service."
Mike Milinkovich, executive director of the Eclipse Foundation, believes that, whatever Johnson does next, he'll be remembered for his work on the Spring Framework and his efforts to simplify Java development.
"I think Rod will be remembered as one of the pioneers of the open source and Java community," Milinkovich said. "He showed how open source can be used to create innovative technology that is widely used by the enterprise Java community. His lasting legacy will be forcing the simplification of the enterprise Java middleware stack. In doing so, he played a very large part in making Java a success."
IDC analyst Al Hilwa sees Johnson as a "role model in entrepreneurship" who has had a big impact on Java developers during his tenure at the head of SpringSource.
"Even though at heart he is a developer, very few have been able to roll obscure developer frameworks into an acquisition of the size that VMware paid for SpringSource," Hilwa said. "We may continue to evaluate whether VMware's ventures in application platform will make a lasting business model or generate sustainable revenue, but Rod's impact on the life of developers is undisputedly sizeable. The ideas pioneered by the Spring Framework have had long-lasting impact in the Java world as well as wide adoption. What's more, they affected the way Java EE has evolved, which has absorbed many of these innovations."
In his blog post, Johnson expressed his satisfaction with the success of Spring as a means of simplifying Java development. "Spring was created to simplify enterprise Java development, and has succeeded in that goal," he wrote, adding that Spring has become the dominant programming model for enterprise. Johnson also pointed to the framework's evolution as enterprise technology "well beyond the scope of the original Spring Framework." He cited a range of "Spring-created technology at the forefront of enterprise development," including Spring for Apache Hadoop (Big Data), Spring Data (NoSQL and distributed datastores), Spring Social (social networking), and Spring Mobile (mobile development).
Johnson also sought to reassure members of the open source Spring community in his blog post: "Spring will continue to be driven forward by the Spring project leads, whom you've all come to know and trust over the past several years. Their experience, deep technical knowledge and innovative thinking will continue to guide Spring's development. I look forward to seeing what they'll create for the next decade, in partnership with their communities."
Posted by John K. Waters on 07/10/2012 at 10:53 AM 1 comment
The fifth annual Hadoop Summit brought an estimated 2,100 attendees to the Convention Center in downtown San Jose, Calif., last week. The two-day, big-data event was hosted by Yahoo, Hadoop's first large-scale user, and Hortonworks, a leading commercial support-and-services provider.
Among the announcements coming out of this year's summit were updates from the three leading commercial Hadoop distributors. Hortonworks unveiled the first general release of its Apache Hadoop software distro, Hortonworks Data Platform (HDP) 1.0, a day before the start of the show. The company bills the open source data management platform as "the next generation enterprise data architecture." Built on Apache Hadoop 1.0, this release includes a bundle of new provisioning, management, and monitoring capabilities built into the core platform. It also comes with an integration of the Talend Open Studio for Big Data tool.
Cloudera got a big jump on the competition by announcing a new release a week earlier, but the company showed off its new CDH4 and Cloudera Manager 4, which are part of Cloudera Enterprise 4.0, at the show. Version 4 of CDH, the company's open source Hadoop platform (on which Enterprise 4.0 is built), expands the number of computational processes executable under Hadoop and introduces a new feature designed to allow software programs to be embedded within the data itself. Dubbed "coprocessors," these programs are executed when certain pre-defined conditions are met.
MapR Technologies showed off version 2.0 of its Hadoop distro, the first to support multi-tenancy. The new version also comes with advanced monitoring management tools, isolation capabilities, and added security. MapR is offering this release in a basic edition (M3), and an advanced edition (M5). The MapR Hadoop Distribution M3 supports HBase, Pig, Hive, Mahout, Cascading, Sqoop and Flume. The M5 edition adds high availability features and additional security tools, including: JobTracker HA, Distributed NameNode HA, Snapshots and Mirroring.
Also, VMware launched a new open source project codenamed "Serengeti" at the show. The Web site describes the project's goal "to enable the rapid deployment of an Apache Hadoop cluster... on a virtual platform." VMware says the project aims to produce a virtualization-aware Hadoop configuration and management tool. VMware is partnering with Cloudera, Hortonworks, MapR and big data analysis company Greenplum on this project.
Apache Hadoop is an increasingly popular, Java-based, open-source framework for data-intensive distributed computing. The system is designed to analyze a large amount of data in a small amount of time. At its core, it is a combination of Google's MapReduce and the Hadoop Distributed File System (HDFS). MapReduce is a programming model for processing and generating large data sets. It supports parallel computations over large data sets on unreliable computer clusters. HDFS is designed to scale to petabytes of storage and to run on top of the file systems of the underlying OS.
Attendance at this year's Hadoop Summit set a record. The first event, held in 2008, drew an estimated 500 attendees. The Summit's sponsorship roster underscores the growing importance of the data analysis platform. Cisco, Facebook, IBM, Microsoft and VMware were among the heavy hitters adding their support to the event; there were 49 event sponsors total.
Speaking at the conference, Facebook engineer Andrew Ryan talked with attendees about his company's record-setting reliance on HDFS clusters to store more than 100 petabytes of data. During his talk, Ryan explained how Facebook has worked around Hadoop's key weakness: its reliance on a single name server (Namenode) to send and receive all filesystem data via a pool of Datanodes. If a Datanode goes down there's little impact on the cluster, but if the Namenode goes down, no clients can read or write to the HDFS. The fix: AvatarNode, a piece of software designed to provide a backup Namenode. Ryan laid out the details from his talk in a blog post.
Posted by John K. Waters on 06/18/2012 at 10:53 AM 0 comments
JNBridge, maker of tools that connect Java and .NET Framework-based components and apps, on Monday released a free interoperability kit for developers looking for new ways of connecting disparate technologies. This second JNBridge Lab demonstrates how to build and use .NET-based MapReducers with Apache Hadoop, the popular Java-based, open-source platform for data-intensive distributed computing.
The company began offering these kits in March. The first JNBridge Lab was an SSH Adapter for BizTalk Server designed to enable the secure access and manipulation of files over the network. This new Lab aims to provide a faster and better way to create heterogeneous Hadoop apps than other current alternatives, the company claims. All of the Labs come with pointers to documentation and links to source code.
The new Hadoop Lab shows developers how to write .NET-based Hadoop MapReducers against the Java-based Hadoop API, which avoids the overhead of the Hadoop streaming utility. The resulting .NET code can run directly inside Hadoop processes.
"Streaming works," said JNBridge CTO Wayne Citrin, "but it's kind of thin gruel. It really makes non-Java MapReducers into second-class citizens in the Hadoop world. You have to manage and configure a separate process. You have to parse the output and put it back together when you're done, which is another overhead cost. Then there's the overhead of going through sockets. It's not surprising that not that many people actually use .NET in this case."
The code provided in the Hadoop Lab can be run as an example, Citrin explained, or it can be used as a design pattern for users to develop their own Hadoop apps using C# or VB.NET.
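To make Citrin's complaint about streaming concrete, here is a rough Python sketch of what a non-Java MapReducer has to do under the Hadoop streaming convention: read lines of text, and hand back tab-separated key/value lines that the framework sorts by key before the reducer sees them. The function names and the sample input are invented for illustration, and `simulate_streaming` is only a local stand-in for the cluster, not part of Hadoop.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' text line per word, as streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Parse the text lines back into pairs and total the counts per word.

    Assumes the input is already sorted by key, which the framework
    guarantees between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate_streaming(lines: Iterable[str]) -> List[str]:
    """Hypothetical local run: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


# Made-up sample input.
result = simulate_streaming(["Hadoop streaming works", "streaming is thin gruel"])
```

In a real job, each function would live in its own script that reads standard input and prints to standard output, and the scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options. Note how much of the reducer is spent turning text back into data: that serialize-and-parse round trip, on top of a separate process, is the overhead Citrin describes.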
JNBridge started its Labs project earlier this year as part of the company's 10-year anniversary celebration.
"It was a way of showing people how to use the out-of-the-box functionality of JNBridgePro to do useful things that they may not have thought of, or that don't exist out there as products," Citrin said.
The company's flagship product, JNBridgePro, is a general purpose Java/.NET interoperability tool designed to bridge anything Java to .NET, and vice versa, allowing developers to access the entire API from either platform. Last year the company stepped into the cloud with JNBridgePro 6.0.
Why would anyone want to build MapReducers in .NET?
"For the same reasons you would want to use JNBridgePro in the first place," Citrin said. "Your organization might have .NET-based libraries they need or want to use in a Hadoop application. Your company might have more people skilled in .NET than Java. Or you might be working with Windows Azure, which supports Java, but the .NET tooling is better."
Citrin confesses that developers have yet to begin trampling each other to download the JNBridge Labs, but there has been enough interest and feedback to keep the project going.
The JNBridge Labs are available for download, free from the company's Web site. Although the kits are free, they require a JNBridgePro license for use beyond the trial period. The company announces new Lab releases on its blog.
Posted by John K. Waters on 05/21/2012 at 10:53 AM 0 comments
Brian Noyes didn't set out to become a software architect. He started writing code "to stimulate his brain," while he was flying F-14 Tomcat fighter aircraft for the U.S. Navy. As his software expertise developed, he found himself "going down a technical track" managing onboard mission computer software in the aircraft, and later, systems and ground support software for mission planning and controlling satellites.
"It was just a hobby," Noyes says, "but it led me to work that I still love to do."
Noyes left the Navy in 2000 and today is chief architect at IDesign, a .NET-focused architecture, design, consulting, and training company. He's also a Microsoft Regional Director and an MVP, and the author of several books, including: Data Binding with Windows Forms 2.0: Programming Smart Client Data Applications with .NET (Addison-Wesley Professional, 2006) and Developer's Guide to Microsoft Prism 4: Building Modular MVVM Applications with Windows Presentation Foundation and Microsoft Silverlight (Microsoft Press, 2011).
Noyes specializes in smart client architecture and development, presentation-tier technologies, ASP.NET, workflow and data access. He writes about all these topics and more on his blog, ".NET Ramblings."
Not surprisingly, Noyes is a fan of Microsoft's Extensible Application Markup Language (XAML). He says Microsoft got a lot of things right when it created this declarative, XML-based language for the .NET Framework back in 2005/2006.
"XAML provides a clean separation between the declarative structure and the code that supports it," Noyes says. "That can either come in the form of the code-behind that's inherently called to it in the way Visual Studio does it, or using the Model View ViewModel (MVVM) pattern to have even better separation. They put mechanisms into the bindings and control templates and data templates that just give you this nice separation of things -- if you want them."
"They really facilitated both ends of the spectrum," he continues. "They made it so you have a drag-and-droppy, RAD-development kind of approach, where you're not so concerned about the cleanliness of the code and how maintainable it is and you just want to get it done. Or, if you're more of a maintainability Nazi, as I am, and want absolutely clean code and separation of concerns and things like that, it facilitates that as well."
XAML shipped with .NET 3.0, along with the Windows Presentation Foundation (WPF), of which Noyes is also a fan. "One thing I always say about WPF is that they did a darned good job of getting it right the first time," he says, "because, since the first release, there has been very little change to the core framework. Whereas with Silverlight they've had to do substantial improvements with each release to inch it up closer to what WPF was capable of."
Noyes explores uses for all of these tools and technologies in his sessions scheduled for upcoming Visual Studio Live! conferences. "For events like this, it's about giving them knowledge they can take home and use in the trenches the very next day," he says. "I try to keep things close to the code."
Posted by John K. Waters on 05/11/2012 at 10:53 AM 0 comments
When the CSLA .NET framework made its first appearance in a book written by its creator, Rockford Lhotka, back in 1998, it was little more than a hunk of sample code -- at least that's how he saw it. But readers of that extremely popular book, VB6 Business Objects, saw it as something more.
"That first implementation was not really a framework per se," Lhotka recalls. "But after I published the book, I would get these e-mails from people who would say, 'Hey, I bought your book and I was using your framework and I wish it did this,' or, 'Your framework has a bug.' Initially I would respond that I don't have a framework. Over time I gave in and decided, hey, maybe I do have a framework."
Today CSLA is one of the most widely used open source software development frameworks for .NET. It's designed to help developers build a business logic layer for Windows, Web, service-oriented and workflow applications.
"It helps developers create a set of business objects that contain all of their business rules in a way that allows those objects to be reused to create many different kinds of user interfaces or user experiences," Lhotka explains. "And once you've created this business layer using CSLA, you can create a WPF interface, a Silverlight interface, a Web interface, or a service interface on top of it."
"But then it gets even more interesting," he continued, "because those same objects can work on a Windows Phone, an Android device, and the new Windows Runtime (WinRT). Even if you're not building distributed applications (which most developers are these days), the CSLA framework gives an application a lot of structure and organization, which leads to long-term maintainability."
Lhotka (Rocky to his friends), CTO of Magenic, will be holding workshops on "Full Application Lifecycle with TFS and CSLA .NET" at the upcoming Visual Studio Live! New York and Visual Studio Live! Redmond conferences, as well as sessions about other topics. Lhotka is both a Microsoft Regional Director, which is a designated technical expert and community leader who's not a Microsoft employee, and an MVP (Microsoft Most Valuable Professional).
Lhotka created the .NET implementation of CSLA in 1999. The framework was originally conceived in 1996 in the world of Microsoft's Component Object Model (COM) and Visual Basic 5, and dubbed "Component Based Scalable Logical Architecture." But when Lhotka re-implemented it for .NET, which is not component based, the name "CSLA" became "just an unpronounceable word," he says.
CSLA .NET is currently in version 4.2, which supports Visual Studio 2010, Microsoft .NET 4.0, Silverlight 4 and Windows Phone 7. Version 4.2 and higher supports Android, Linux and OS X through the use of Mono, MonoTouch and Mono for Android.
More information about the CSLA framework, including a FAQ page, a download page, documentation, and a blog, can be found on Lhotka's Web site.
Posted by John K. Waters on 05/07/2012 at 10:53 AM 1 comment
While there's lots of talk (a lot of talk) about big data these days, according to Andrew Brust, Microsoft Regional Director and MVP, there currently is no good, authoritative definition of big data.
"It's still working itself out," Brust says. "Like any product in a good hype cycle, the malleability of the term is being used by people to suit their agendas."
"That's okay," he continues, "There's a definition evolving."
Still, Brust, who will be speaking about big data and Microsoft at the upcoming Visual Studio Live! New York conference, says that a few consistent big data characteristics have emerged. For one, it can't be big data if it isn't...well...big.
"We're talking about at least hundreds of terabytes," Brust explains. "Definitely not gigabytes. If it's not petabytes, we're getting close, and people are talking about exabytes and zettabytes. For now at least, if it's too big for a transactional system, you can legitimately call it big data. But that threshold is going to change as transactional systems evolve."
But big data also has "velocity," meaning that it's coming in an unrelenting stream. And it comes from a wide range of sources, including unstructured, non-relational sources -- click-stream data from Web sites, blogs, tweets, follows, comments and all the assets that come out of social media, for example.
Also, the big data conversation almost always includes Hadoop, Brust says. The Hadoop Framework is an open source distributed computing platform designed to allow implementations of MapReduce to run on large clusters of commodity hardware. Google's MapReduce is a programming model for processing and generating large data sets. It supports parallel computations over large data sets on unreliable computer clusters.
"The truth is, we've always had Big Data, we just haven't kept it," says Brust, who is also the founder and CEO of Blue Badge Insights. "It hasn't been archived and used for analysis later on. But because storage has become so much cheaper, and because of Hadoop, we can now use inexpensive commodity hardware to do distributed processing on that data, and it's now financially feasible to hold the data and analyze it."
"Ultimately the value Microsoft is trying to provide is to connect the open-source Big Data world (Hadoop) with the more enterprise friendly Microsoft BI (business intelligence) world," Brust says.
Posted by John K. Waters on 04/10/2012 at 10:53 AM 1 comment