
Google Hosts Public Datasets for BigQuery Analytics

So it turns out I have the seventh most popular first name in the U.S. but a terrible author ranking on Hacker News. You can find where you rank in these areas in the public datasets Google has made available via BigQuery, its fully managed interactive analytics service.

Querying the public datasets with BigQuery is free for the first 1 TB of data processed per month, after which pricing kicks in.
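The free-tier arithmetic can be sketched in Python. This is a rough illustration, not Google's billing logic: the helper names `billable_bytes` and `estimate_query_bytes` are hypothetical, counting 1 TB as 2**40 bytes is an assumption, and the dry-run estimate assumes the `google-cloud-bigquery` client library is installed with credentials configured.

```python
# Assumption: 1 TB is counted as 2**40 bytes; check the pricing page for the exact definition.
FREE_TIER_BYTES = 2 ** 40


def billable_bytes(processed_this_month: int, query_bytes: int,
                   free_tier: int = FREE_TIER_BYTES) -> int:
    """Return how many bytes of a query fall outside the monthly free allowance."""
    if processed_this_month < 0 or query_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    # Whatever is left of the free allowance absorbs the query first.
    remaining_free = max(free_tier - processed_this_month, 0)
    return max(query_bytes - remaining_free, 0)


def estimate_query_bytes(sql: str) -> int:
    """Ask BigQuery how many bytes a query would process, without running it."""
    # Imported lazily so billable_bytes works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # a dry run returns immediately
    return job.total_bytes_processed
```

Passing the result of `estimate_query_bytes` to `billable_bytes`, along with a running total you track yourself, gives a rough idea of whether a query would still fit in the free allowance.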

Google recently showcased some of the public datasets it's hosting, including USA Names (provided by the U.S. Social Security Administration); NYC TLC Trips (taxi trips); Hacker News (developer-oriented social media posts); USA Disease Surveillance (weekly nationally notifiable disease reports); GDELT Book Corpus (digitized books from the Internet Archive); and NOAA GSOD (climate data from the National Oceanic and Atmospheric Administration).

"A public dataset is any dataset that is stored in BigQuery and made available to the general public," Google says on its new public data site. "This page lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1 TB per month is free, subject to query pricing details)."

The datasets can be queried with SQL in several ways: through a Web UI, through a command-line tool, or from applications through the BigQuery REST API, using client libraries for languages including Java, .NET and Python.
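As a minimal sketch of the client-library route, the Python snippet below queries the USA Names data for the most common first names. It assumes the `google-cloud-bigquery` package is installed and credentials are configured. The table name `bigquery-public-data.usa_names.usa_1910_2013` and the `name` and `number` columns are assumptions about the dataset's layout, and the function names are hypothetical.

```python
def build_top_names_sql(limit: int = 10) -> str:
    """Return standard SQL that ranks first names by total occurrences."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Assumed table and column names; adjust them to the dataset's actual schema.
    return (
        "SELECT name, SUM(number) AS total\n"
        "FROM `bigquery-public-data.usa_names.usa_1910_2013`\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_top_names(limit: int = 10):
    """Run the query and return a list of (name, total) tuples."""
    # Imported lazily so the SQL builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    rows = client.query(build_top_names_sql(limit)).result()  # waits for the job
    return [(row["name"], row["total"]) for row in rows]
```

Calling `run_top_names(10)` from a BigQuery-enabled project would return the ten most common names, with the bytes processed counting toward the monthly free allowance.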

Along with the aforementioned showcase datasets listed by Google, many more are available, ranging from soccer data to cancer genomics. Developers can also share their own datasets simply by changing their permissions.

To query the datasets, developers must have a Google account with access to BigQuery-enabled projects. New projects have such access built in.

Google provides pages hosting sample queries, such as one for the USA Name Data dataset.
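A query of that kind, finding where a single first name ranks, might look like the following Python sketch. This is not one of Google's sample queries: the table and column names are assumptions about the dataset's layout, `RANK_SQL` and `name_rank` are hypothetical names, and the code assumes the `google-cloud-bigquery` package with configured credentials.

```python
# Assumed table and columns; the @name placeholder is a BigQuery named parameter.
RANK_SQL = """
WITH totals AS (
  SELECT name, SUM(number) AS total
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  GROUP BY name
),
ranked AS (
  SELECT name, total, RANK() OVER (ORDER BY total DESC) AS name_rank
  FROM totals
)
SELECT name, total, name_rank
FROM ranked
WHERE name = @name
"""


def name_rank(name: str):
    """Return the popularity rank of a first name, or None if it is absent."""
    # Imported lazily so RANK_SQL can be inspected without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("name", "STRING", name)]
    )
    rows = list(client.query(RANK_SQL, job_config=config).result())
    return rows[0]["name_rank"] if rows else None
```

Passing the name as a query parameter, rather than formatting it into the SQL string, avoids quoting problems. The comparison is an exact match, so the name must be spelled and capitalized as it is stored in the table.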

Google has also teamed up with Looker to provide immediate access to the datasets in playgrounds that let developers explore the data, such as this one for the USA names data, and this one that lists the top authors on Hacker News.

Top Hacker News Authors (source: Google/Looker)

Speaking of Hacker News, a post on that site about the new service drew many comments, including several from a reader with the handle "vgt," who said they work on BigQuery and provided more information on the project:

One large difference between this program and alternative programs is that data already resides in Google BigQuery:

- You do not need to spin up a database to work with BigQuery
- You can simply start writing SQL on top of BigQuery
- You may leverage Dataflow and MapReduce connectors to work with this data directly in Hadoop, Spark, or Dataflow
- BigQuery has a free tier - one Terabyte of data processed per month

Finally, for folks who would like to share their datasets, BigQuery offers free hosting and credits to help get a pipeline going.

The poster "vgt" also engaged in vigorous debate with developers who were hesitant to use such services from Google, citing a perception that the company cancels projects developers have come to depend on.

"This narrative gets repeated time and time again, and it really doesn't hold up to even surface debate," vgt responded in part to one post that said: "Sorry Google but my trust in cloud solutions reside in AWS and Azure. Why? Because when Amazon and Microsoft announce something I know there's a good chance it will still exist 24 months later."

About the Author

David Ramel is an editor and writer for Converge360.