News

Cloudera Tool Provides End-to-End Python for Big Data

Major Hadoop distributor Cloudera Inc. today unveiled a new tool to provide a 100 percent, end-to-end Python stack for advanced Big Data analytics.

Incubated in-house by Cloudera Labs, the Ibis project is now available as a preview, installable as a standard Python package. Also, an Apache-licensed repository is available to the open source community on GitHub.

Cloudera, widely recognized as one of the top three commercial distributors of Apache Hadoop-based analytics software, said the Ibis project was just one of its new efforts to make Hadoop more usable and accessible to analytics pros. It also announced a new data science conference for this fall.

"Hadoop has evolved dramatically over the last decade, from a batch processing tool to an entire ecosystem that powers most of today's information architecture as well as traditional BI workloads," said Wes McKinney, a Cloudera software engineer who created the Python pandas data analysis library. "We want to build on this momentum and make Hadoop's infrastructure more accessible to the data science community. We're doing that by bringing Python more fully into the ecosystem and focusing on the real-world, practical applications of data science."

Cloudera noted that Python handles complex workflows better than many other languages, including data analysis stalwart SQL. Python has ascended in popularity in modern data science and was recently identified as one of the two most lucrative skills to learn, along with R. But Python has been limited to smaller data sets, forcing analysts to make analytic compromises, the company said. Now, it said, data scientists and their ilk can perform higher-scale analytics without suffering performance deficits or degrading the user experience.

Figure: How Ibis Fits In (source: Cloudera Inc.)

Ibis was designed to work with the Impala project, an open source, native analytic database for the Hadoop ecosystem.

"The initial version of Ibis provides an end-to-end Python experience with comprehensive support for the built-in analytic capabilities in Impala for simplified [extract, transform and load] ETL, data wrangling and analytics," Cloudera said in a statement. "Upcoming versions will allow users to leverage the full range of Python packages as well as express efficient custom logic using Python. By integrating with Impala, the leading [massively parallel processing] MPP database engine for Hadoop, Ibis can achieve the interactive performance and scalability necessary for Big Data."
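The general idea behind that integration, in which analysts write Python while a SQL engine such as Impala does the heavy lifting, can be illustrated with a toy sketch. The code below is not the Ibis API: the `Query` class, its methods and the `web_logs` table are hypothetical names invented for illustration. It only shows how a Python expression can be built up lazily and then compiled to a SQL string that a database engine could execute.

```python
# Toy illustration only: Query and its methods are hypothetical,
# and this is NOT how Ibis itself is implemented or used.
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Query:
    """An immutable, lazily built description of a query over one table."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()
    limit_rows: Optional[int] = None

    def select(self, *columns: str) -> "Query":
        # Return a new Query; nothing is executed yet.
        return replace(self, columns=columns)

    def where(self, predicate: str) -> "Query":
        # Predicates accumulate and are joined with AND at compile time.
        return replace(self, predicates=self.predicates + (predicate,))

    def limit(self, n: int) -> "Query":
        return replace(self, limit_rows=n)

    def compile(self) -> str:
        # Turn the accumulated description into a single SQL string.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.limit_rows is not None:
            sql += f" LIMIT {self.limit_rows}"
        return sql


# Build the expression step by step in Python (hypothetical table and columns).
query = (
    Query("web_logs")
    .select("user_id", "bytes")
    .where("bytes > 1024")
    .where("status = 200")
    .limit(10)
)
print(query.compile())
```

In this sketch, no data moves through Python at all; only the generated SQL text would be handed to the engine, which is the property that lets a Python front end scale with the database behind it.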

After today's initial release of the unsupported Ibis preview, Cloudera outlined the following goals going forward in a blog post authored by McKinney and Marcel Kornacker:

  • Ibis will enable more natural data modeling by leveraging Impala's upcoming support for nested types (expected by year's end).
  • Cloudera will add support for Python user-defined logic so Ibis will integrate with the existing Python data ecosystem -- enabling custom Python functions at scale.
  • The company will further accelerate performance through low-level integrations between Ibis and Impala with a new Python-friendly, in-memory columnar format and Python-to-LLVM code generation. These updates will accelerate Python to run at native hardware speed.

"Although the Ibis vision is not yet fully executed in this early release version, we're confident that it will give you adequate insight into what Ibis will become over time," McKinney and Kornacker said. "We look forward to bringing you more news about its progress and are excited to hear your feedback."

Cloudera also announced it will organize and host the first-ever Wrangle Conference on Oct. 22 in San Francisco as another part of its continuing mission to further Big Data analytics.

"In light of Hadoop's wide-ranging flexibility and practicality, and as data scientists can now leverage its power to solve some of today's most pressing problems, Cloudera has announced Wrangle, a single-day, single-track industry event that will dive into the principles, practice and application of data science from the startup to the enterprise," the company said. "Presenters include data scientists from Facebook, Salesforce, Uber and more, who will share the most challenging problems they've faced and what they've learned."

Registration for the conference is currently by invitation only, with public access promised soon.

About the Author

David Ramel is an editor and writer for Converge360.