IBM Announces Spark Development Environment in the Cloud

IBM, which last year announced a huge investment in Apache Spark technology as part of a mission to transform it into a kind of "analytics OS," today took that investment a step further by announcing a Spark development environment housed on its IBM Cloud Bluemix platform.

Described as the "first cloud-based development environment for near real-time, high performance analytics," the IBM Data Science Experience, now in preview, will provide some 250 curated data sets, a variety of open source tools and a collaborative workspace where data scientists can work with developers, "making it easier to rapidly develop applications that are infused with intelligence."

"Today we are excited to announce the IBM Data Science Experience, an environment that has everything a data scientist needs to be successful," says a project blog post published last week. "IBM Data Science Experience is an interactive, collaborative, cloud-based environment where data scientists can use multiple tools to activate their insights. Data scientists can use the best of open source, tap into IBM's unique features, grow their capabilities, and share their successes."

A year ago, IBM hopped on the Spark bandwagon in a big way -- promising to put more than 3,500 researchers and developers to work on related projects at labs around the world -- while calling it "potentially the most significant open source project of the next decade." The company said today's announcement is building on that "$300 million investment in developing Apache Spark as a type of 'analytics operating system.'"

"IBM's Data Science Experience is the killer enterprise app for Apache Spark, and gives data scientists new opportunities to deliver insight-driven models to developers, and opens the door for unprecedented innovation from the open source community," said Bob Picciano, senior vice president of IBM Analytics.

The IBM Data Science Experience (source: IBM)

The project site invites data scientists to get started by enrolling in a course or by starting a project, either from a provided sample or from scratch, using tools and languages such as RStudio, Jupyter Notebooks, Python, R and Scala. For now, though, most site links just present a message that says: "IBM Data Science Experience is in limited preview. We will be in touch shortly with new features and functionality." Interested data scientists can sign up to be added to the waitlist.

A key component of the project -- when it becomes operational -- will be RStudio, the open source IDE for statistical computing with the R programming language.

In addition to the project's open source capabilities, IBM said it's also adding new features and APIs, which the company describes as follows:

  • Sparkling.Data: Cleaning and preparing data for analysis are the tasks that data scientists typically spend the majority of their time on. We created a library that helps you discover the different file types and returns a data frame loaded with data (by default) from the file type that occurs the most. You can use it to infer the schema, discover data types, profile data sets, view range and distribution, reveal and fix bad data and much more.
  • Prescriptive Analytics: The Decision Optimization CPLEX Modeling library (DOcplex) contains modeling packages such as Mathematical Programming and Constraint Programming.
  • Shiny: Data scientists typically create visualizations to share their analysis with others. We include Shiny in the IBM Data Science Experience to allow you to create interactive analytic Web applications without coding any HTML, CSS, or JavaScript, using only R.
  • Data Connections: From the Notebook interface, you can set up data connections to Bluemix data services like Cloudant or dashDB or to on-premises or external services.
  • Schedule Jobs: From the Notebook interface, you can schedule jobs to run periodically.
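To make the Sparkling.Data description above more concrete, here is a minimal sketch in plain Python of the two ideas it mentions: finding the file type that occurs most often in a set of files, and inferring column types from the data. This is not the Sparkling.Data API, which the announcement does not detail; the function names (dominant_extension, infer_type, infer_schema) and the sample data are hypothetical and use only the Python standard library rather than Spark.

```python
import csv
import io
from collections import Counter
from pathlib import PurePath


def dominant_extension(filenames):
    """Return the most frequent file extension (lowercased, no dot), or None."""
    counts = Counter(
        PurePath(name).suffix.lower().lstrip(".")
        for name in filenames
        if PurePath(name).suffix  # skip names with no extension
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def infer_type(values):
    """Pick the narrowest of int, float, str that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "str"
    for type_name, caster in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "str"


def infer_schema(csv_text):
    """Map each CSV column name to an inferred type name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []  # fieldnames is None for empty input
    return {col: infer_type([row[col] or "" for row in rows]) for col in columns}


if __name__ == "__main__":
    # Hypothetical file listing and sample data, for illustration only.
    files = ["sales_q1.csv", "sales_q2.csv", "events.json", "README"]
    print(dominant_extension(files))  # csv

    sample = "id,price,city\n1,9.99,Austin\n2,,Boston\n"
    print(infer_schema(sample))  # {'id': 'int', 'price': 'float', 'city': 'str'}
```

The empty price in the second row is skipped during inference, so one missing value does not force the whole column to a string type. A real data-preparation library would presumably go further, for example by profiling value ranges and flagging bad records.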

Upon signing up and opening an account, data scientists will be provided with a deployed Spark-as-a-Service instance for analyzing data and 5 GB of object storage to store that data.

"Just as IBM played a critical role in the development of computer science, we can see many similarities today," said Picciano. "Computer science went mainstream with the introduction of the PC. With Data Science, the major roadblock is having access to large data sets and having the ability to work with so much data. With today's announcement, clients can have both."

Today's news came during the second day of the Spark Summit conference in San Francisco, where several other companies have issued Spark-based product announcements.

About the Author

David Ramel is an editor and writer for Converge360.