Q&A: On Hadoop's 10th Birthday, 10 Questions for One Who Was There
Today marks the 10th birthday of sorts for Apache Hadoop, as the first Hadoop cluster was put into production at Yahoo on Jan. 28, 2006. Since then, it has gone on to spawn the "Big Data" craze and transform enterprise analytics everywhere, turning the position of data scientist into the "best job in America" for 2016.
To look back on the history of one of the most disruptive technologies ever, we caught up with Raymie Stata, who was involved early on with the project at Yahoo -- actually hiring Hadoop co-creator Doug Cutting -- and who now serves as the CEO of Altiscale.
Stata's bio reads:
Raymie comes to Altiscale from Yahoo, where he was Chief Technical Officer. At Yahoo he played an instrumental role in algorithmic search, display advertising and cloud computing. He also helped set Yahoo's open source strategy and initiated its participation in the Apache Hadoop project. Prior to joining Yahoo, Raymie founded Stata Laboratories, maker of the Bloomba search-based e-mail client, which Yahoo acquired in 2004. He has also worked for Digital Equipment's Systems Research Center, where he contributed to the AltaVista search engine. Raymie received his PhD in computer science from MIT in 1996.
- How did you first get involved with Hadoop?
I got to know Doug Cutting through the Internet Archive, which we both supported as technologists. He asked me to join the Board of the Nutch Foundation, so I was involved with "Hadoop" way before it was called that. In 2004, when the MapReduce paper was published, a bunch of my colleagues at Yahoo thought it would be great for us to develop a version of it as an open source project. I knew that Doug and Mike Cafarella wanted to implement MapReduce for Nutch, so we gave them some funding as consultants to do so. That went well, so in 2005 I hired Doug into Yahoo and we put a larger investment behind what became Hadoop. Fast forward to 2006 and we were in production with 10 nodes.
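The MapReduce model described in that 2004 paper can be sketched in a few lines of plain Python. This is a toy illustration of the map, shuffle and reduce phases on a word-count job -- not the Hadoop API itself, which distributes these phases across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["hadoop scales", "hadoop stores data", "data scales"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"hadoop": 2, "scales": 2, "stores": 1, "data": 2}
```

In Hadoop proper, the same three phases run in parallel across many nodes, with the framework handling data partitioning, shuffling over the network and recovery from node failures.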
- How are you using it at Altiscale now?
At large Internet companies like Yahoo, Facebook and Twitter, employees have access to an amazing data infrastructure -- Hadoop at an unbelievable scale and loaded with all kinds of valuable data, with a large, supporting operations team to ensure job success. There is also a user community to help newbies become productive quickly. When I started looking around outside of Yahoo, I observed that most companies trying to use Hadoop didn't have the infrastructure, ops team and community to make success possible. Instead, they had sub-scale clusters, supported by a part-time staff, with end-users doing Web searches to figure out how to make it all work.
So we quickly formed a vision for how to help the broader enterprise achieve success with Big Data. Altiscale delivers, as a commercial service, the kind of Hadoop experience you get inside the large Internet companies. Working with Altiscale means that you have a team with the experience of running Hadoop at scale and in production at Web-scale companies. Our customers' developers, analysts and data scientists can stop wrangling complex Hadoop clusters and focus on their real strength of analyzing data and developing creative solutions that drive business value. We like to call this "Big Data without the swearing."
- How has it changed in that time?
Hadoop has matured significantly, and an entire supporting ecosystem of solutions has sprung up around it, such as Hive, Spark, Flink and the like. With this maturity has come an impressive number of use cases, from fraud detection in finance to ad tech optimization in the marketing world. As a result, it has moved beyond its roots in Silicon Valley to become a critical technology to businesses of all industries, all around the world.
- Why did the Yahoo team decide to open source the technology and contribute it to Apache?
When the MapReduce paper was published, it had a huge impact. People started talking about Google as if it had some kind of Black Magic that no other company could reproduce. I was on the Web Search team at Yahoo at the time, and we were making a serious run at challenging Google in search. We wanted to demystify what Google was doing, while at the same time making an amazing tool available to all.
Over the years, others inside Yahoo continued to question the level of investment we were making in an open source project, a project that started helping competitors like Facebook. I developed a five-point justification -- attracting talent, internal contributions, external contributions, technology alignment and software quality -- to justify our ongoing contributions. Further, I constantly reminded people that, ultimately, our competitive advantage as a company needed to come from what we did with Hadoop to make distinctive, compelling products, and not from Hadoop itself.
Interestingly, as I talk to companies today who feel that they need to build their own Hadoop operations team to "stay competitive," I find myself making the same argument.
- How has that move benefited Yahoo and the open source community?
For Yahoo, the key five benefits are:
- Attracting talent -- we could bring in really good talent by giving them the opportunity to work on a project like Hadoop.
- Internal contributions -- one of the problems with centrally-developed infrastructure in companies like Yahoo is that, if a client team needs a feature, they are stuck waiting in the central team's priority queue. With an open source project, client teams can develop their own features (to a degree), mitigating this priority problem.
- External contributions -- over time, as the external community grows, you get tremendous benefit from the investments of that community.
- Technology alignment -- over time, as Hadoop became a de facto standard, other projects (like HBase and, more recently, Spark) embraced it as a foundation. Because Yahoo is Hadoop-based, it becomes aligned with all this other work. If Hadoop had been kept proprietary, then we couldn't have imported all these other components. (Also, we started acquiring companies that were Hadoop-based, which eased integrations.)
- Software quality -- going open source meant there were more eyes on the code. Not only does that mean bugs get caught, but that the overall quality of the project stays high. With internal platforms, there's always a lot of pressure to ship features fast, and over time they inevitably become hairballs. Open source development puts pressure against cutting corners (too much), leading to a higher level of quality over time.
In general, for Yahoo, going open source meant that there were more eyes on the code and more hands on the code, so it developed more quickly and issues were identified and fixed far more rapidly. For the open source community, it also meant getting an industrial-strength solution quickly: because Hadoop had Yahoo as a corporate godparent providing armies of engineering talent against critical functionality, the solution was tested and proven in a real-world environment very quickly. It was a virtuous cycle for everyone involved. The involvement of Hortonworks, Cloudera, NTT, Huawei, Altiscale and Intel today, in addition to Yahoo, continues this virtuous cycle of the Apache Software Foundation and private companies working together out of mutual interest to create great features and capabilities.
- What are the primary attractions of Hadoop to the enterprise?
Hadoop provides enterprises with scalability, cost-effectiveness, flexibility, speed and resilience to failure that allow businesses to take advantage of volumes and types of data that they couldn't manage or analyze before. Big Data becomes something that more and more companies can do, not just the richest ones that can afford the expensive, proprietary infrastructure that was the only option available before Hadoop.
- Is Big Data over yet? If not, when will it be (and become just "data")?
While there may be some buzzword fatigue around the term "Big Data," it still refers to a very differentiated problem set from the one traditionally solved by the "enterprise data warehouse" or "MPP." We are still in the early years of the problem of dealing with multiple terabytes or petabytes of both structured and unstructured data. These massive data volumes represent a huge opportunity for businesses that can't be cracked by traditional means. That's why Hadoop exists, to solve this problem.
Also, Big Data solutions typically exist in silos today, solving unique problems. But we are seeing increasing interest in the enterprise in migrating Big Data into a unified infrastructure alongside existing enterprise data solutions, and in having that data consumed by business professionals, not just IT professionals. At Altiscale, we're working with BI players, like Birst and Tableau, to help provide this unified data infrastructure solution.
Ultimately, I believe that Big Data platforms will mature enough to handle all of the data in the enterprise. Based on our conversations with leading Fortune 500 companies, CIOs see this on the horizon, too. However, it will take some time to get there. By then, it will be just "data" and referring to it as "Big Data" might seem charmingly out of date.
- What's the deal with Spark? Will it overtake Hadoop and become the premier Big Data technology?
Spark is a rising in-memory processing solution that works well with fast-turn analytics on smaller (as in, not humongous) data sets and with machine learning use cases. For some reason, the growing popularity of Spark has created a misconception that Spark and Hadoop are somehow at odds. However, Spark actually runs on top of Hadoop. Hadoop is increasingly the enterprise platform of choice for Big Data projects which in turn has given rise to Spark. By definition, the two work best together.
- What other open source projects are emerging in the Hadoop ecosystem that enterprise developers should be aware of?
For applications that require the processing of streaming data, Kafka integrates very well with Hadoop and is becoming the tool of choice for capturing data streams. Many organizations want to leverage their existing SQL knowhow and use familiar visualization tools, so SQL on Hadoop is definitely going to be an integral component. There are a number of projects in this space that developers should be tracking: Spark SQL, Tez, Drill, HAWQ, Impala and Presto. With the popularity of Spark, it is helpful for developers to have a visual analysis environment to layer on top of the Spark engine. We've seen Zeppelin being used for this at a number of companies. Finally, Flink is a newer processing engine that is an intriguing alternative to MapReduce.
- What advice would you give developers interested in working with Hadoop?
There has never been a better time to get started building applications on Hadoop. We've come a long way in the last 10 years, and a great number of complementary tools and projects have been integrated into the Hadoop ecosystem. I've been involved with the Open Data Platform initiative (ODPi), which seeks to provide interoperability across Hadoop providers, including Altiscale, Hortonworks, IBM, and Pivotal. Developers can build an ODPi-compliant application and have confidence that it will run across any ODPi-compliant platform. This is huge. It not only takes away a lot of development hurdles, it also gives customers the confidence to adopt Hadoop-based applications more quickly.
The Hadoop operations hurdle has also been a source of friction for app developers. With companies like Altiscale providing both a robust Hadoop foundation and operational services, developers are free to focus on differentiating at the application level, where they want to spend their time.
- Where is Hadoop headed in the coming years?
While it has matured, Hadoop is still in a young, growth phase. It has become critical to decisions made at leading companies, and it runs applications that millions of people use or benefit from each day. And it keeps growing and maturing. I think that it will continue to develop in areas of security and governance that will allow it to serve as the sole data platform for the enterprise. You probably saw from the latest Gartner survey that Hadoop is moving increasingly to the cloud, even in the largest enterprises. A majority of survey respondents say that they're moving to or using cloud Hadoop. That's an exciting development, and it matches what we're seeing from our conversations with leading businesses. Companies that two years ago said "We'll never go to the cloud" are now saying "Everything is going to be in the cloud," and that certainly includes Hadoop and Spark.
- Do you have elephant fatigue?
I still like elephants. I have one of the best elephant t-shirt collections in the world. I'm sure only Doug Cutting beats me.
David Ramel is the editor of Visual Studio Magazine.