New Amazon Service Uses SQL To Query Streaming Big Data

At the birth of the Big Data revolution, there was Apache Hadoop, which leveraged the batch-oriented MapReduce processing engine and scale-out NoSQL databases.

Ever since, the technology has been evolving, with an emphasis on incorporating streaming data into the mix, a need driven by the growing Internet of Things (IoT) spewing petabytes of data from networked devices. Streaming data leads to interactive, real-time analytics. Meanwhile, more SQL functionality was introduced (exemplified in SQL-on-Hadoop solutions), so developers and data scientists don't have to learn new query languages or programmatic querying via customized APIs. Along the way, managed services emerged to take care of many of the details involved in running in-house, on-premises Big Data processing frameworks.

These trends have converged in services such as Amazon Kinesis Analytics -- just announced by Amazon Web Services Inc. (AWS) -- which leverages standard SQL to query streaming data.

The reason for the new service is simple, according to a blog post published yesterday by AWS spokesperson Jeff Barr.

"We want you, whether you are a procedural developer, a data scientist, or a SQL developer, to be able to process voluminous clickstreams from Web applications, telemetry and sensor reports from connected devices, server logs, and more using a standard query language, all in real time!" Barr said.

"Today I am happy to be able to announce the availability of Amazon Kinesis Analytics," Barr continued. "You can now run continuous SQL queries against your streaming data, filtering, transforming and summarizing the data as it arrives. You can focus on processing the data and extracting business value from it instead of wasting your time on infrastructure. You can build a powerful, end-to-end stream processing pipeline in 5 minutes without having to write anything more complex than a SQL query."

[Image: Amazon Kinesis Analytics (source: AWS)]

The tool uses two other Kinesis components -- Firehose and Streams -- to provide real-time analysis via SQL queries. Firehose is used to automatically load streaming data into AWS services such as S3 (cloud storage), Redshift (data warehouse) and Amazon Elasticsearch Service (a search and analytics engine). Streams, meanwhile, is used to build custom applications to work with streaming data for a variety of needs.

"Being able to continuously query and gain insights from this information in real-time -- as it arrives -- can allow companies to respond more quickly to business and customer needs," AWS said in a statement. "However, existing data processing and analytics solutions aren't able to continuously process this 'fast moving' data, so customers have had to develop streaming data processing applications -- which can take months to build and fine-tune -- and invest in infrastructure to handle high-speed, high-volume data streams that might include tens of millions of events per hour."

Kinesis Analytics follows a three-step workflow: configure an input stream from a console; write SQL queries with a built-in SQL editor and templates; and configure an output stream, specifying where the processed results should be loaded, such as the aforementioned S3, Redshift or Elasticsearch Service. Analytics tools can then be used to create alerts and respond to changing data, which is useful for IoT applications, for example. This can be done with the aid of built-in machine learning algorithms that provide stream processing functionality such as anomaly detection, top-K analysis and approximate distinct items, all exposed as SQL functions.
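To make the input-query-output pattern concrete, here is a minimal Python sketch of the same shape. It is an illustration only and does not use the Kinesis Analytics service or any AWS API: the names `input_stream`, `query`, `top_k` and `run_pipeline`, the sample sensor records and the temperature threshold are all hypothetical. The `top_k` helper counts exactly over a small list, whereas a streaming service would more likely rely on an approximate algorithm.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, Any]


def input_stream() -> Iterator[Record]:
    """Step 1 (hypothetical source): yield records as they 'arrive'."""
    sample = [
        {"device": "sensor-a", "temp": 21.5},
        {"device": "sensor-b", "temp": 48.0},
        {"device": "sensor-a", "temp": 22.0},
        {"device": "sensor-c", "temp": 51.2},
        {"device": "sensor-a", "temp": 49.9},
    ]
    yield from sample


def query(records: Iterable[Record], threshold: float) -> Iterator[Record]:
    """Step 2: roughly 'SELECT device, temp FROM stream WHERE temp > threshold'."""
    for record in records:
        if record["temp"] > threshold:
            yield {"device": record["device"], "temp": record["temp"]}


def top_k(records: Iterable[Record], key: str, k: int) -> List[Tuple[Any, int]]:
    """Exact top-K by frequency; a stand-in for an approximate streaming top-K."""
    counts = Counter(record[key] for record in records)
    return counts.most_common(k)


def run_pipeline(threshold: float, sink: List[Record]) -> None:
    """Step 3: deliver each query result to an output destination (here, a list)."""
    for row in query(input_stream(), threshold):
        sink.append(row)


if __name__ == "__main__":
    alerts: List[Record] = []
    run_pipeline(45.0, alerts)
    print(alerts)
    print(top_k(input_stream(), "device", 1))
```

Because each stage is a generator, records flow through the filter one at a time rather than being collected first, which loosely mirrors how a continuous query handles data as it arrives.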

Like other AWS services, Kinesis Analytics infrastructure can be scaled up and down as needed and users pay for what they use.

Along with IoT scenarios, Kinesis Analytics suits use cases such as serving up personalized content for Web surfers based on clickstream data, or placing appropriate ads in real time. The most common usage patterns, AWS said, are time-series analytics, real-time dashboards, and real-time alerts and notifications.

Barr provides a basic Kinesis Analytics example in his blog post, and Ryan Nienhuis provides more in-depth guidance in a blog post yesterday -- the first of a two-part series -- titled "Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1."

That post demonstrates how Kinesis Analytics uses processing "windows" to control the records used by a query. These windows come in three types: tumbling, used for periodic reports to summarize data over time, for example; sliding, for monitoring or other kinds of trend detection; and custom, when the best grouping isn't based on time series.
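The difference between the first two window types can be sketched in a few lines of Python. This is not the Kinesis Analytics SQL syntax from the Nienhuis post, only an illustration of the concept: the function names `tumbling_sums` and `sliding_sums`, the `(timestamp, value)` event format and the 10-second window size are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, value)


def tumbling_sums(events: Iterable[Event], size: int) -> Dict[int, float]:
    """Sum values in fixed, non-overlapping windows of `size` seconds.

    Each event belongs to exactly one window, keyed by the window's start time.
    """
    sums: Dict[int, float] = {}
    for ts, value in events:
        window_start = (ts // size) * size
        sums[window_start] = sums.get(window_start, 0.0) + value
    return sums


def sliding_sums(events: Iterable[Event], size: int) -> Iterator[Tuple[int, float]]:
    """For each event, yield the sum over the trailing `size` seconds.

    Windows overlap, so one event can contribute to several results.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events at or before ts - size; they left the trailing window.
        while window[0][0] <= ts - size:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total


if __name__ == "__main__":
    events = [(0, 1.0), (3, 2.0), (7, 4.0), (12, 8.0), (14, 16.0)]
    print(tumbling_sums(events, 10))       # {0: 7.0, 10: 24.0}
    print(list(sliding_sums(events, 10)))  # last entry is (14, 28.0)
```

The tumbling version produces one result per period, which fits periodic reports, while the sliding version emits an updated result with every event, which fits monitoring. Custom windows are left out here because their grouping rule depends on the data rather than on time.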

"Previously, real-time stream data processing was only accessible to those with the technical skills to build and manage a complex application," Nienhuis concluded. "With Amazon Kinesis Analytics, anyone familiar with the ANSI SQL standard can build and deploy a stream data processing application in minutes.

"This application you just built provides a managed and elastic data processing pipeline using Analytics that calculates useful results over streaming data. Results are calculated as they arrive, and you can configure a destination to deliver them to a persistent store like Amazon S3."

Nienhuis promised that part two of his blog series will delve into more advanced stream processing concepts.

About the Author

David Ramel is an editor and writer for Converge360.