BigQuery Tackles 1 Billion GitHub Files To Reveal Spaces vs. Tabs Developer Preference
Google developer advocate Felipe Hoffa showed off the capabilities of the company's cloud-based BigQuery data warehouse by analyzing some 1 billion files across 400,000 GitHub repositories to see if developers prefer tabs or spaces to indent their code.
That weighty issue has been plaguing the developer community for years, even being addressed in formal research such as last year's Stack Overflow developer survey that found 45 percent of respondents preferred tabs, while 33.6 percent preferred spaces.
Last week, Hoffa put BigQuery to the task to demonstrate its capabilities. He examined code written in 14 top programming languages -- some 14 TB in all -- contained in 1 billion files spanning 400,000 open source repositories.
"Analyzing each line of 133 GBs of code in 16 seconds?" he wrote. "That's why I love BigQuery."
The product's site explains it in a nutshell:
"BigQuery is Google's fully managed, petabyte scale, low cost analytics data warehouse. BigQuery is serverless, there is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of our pay-as-you-go model. BigQuery is a powerful Big Data analytics platform used by all types of organizations, from start-ups to Fortune 500 companies."
In June, both GitHub and Google announced the expansion of publicly available BigQuery tables containing GitHub data to the tune of more than 3 TB.
"It contains activity data for more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions," GitHub said.
Google explained those tables can also be queried with SQL.
"The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery," Google said. "Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision."
So, of course, Hoffa promptly put the new datasets to use to settle the tabs vs. spaces conundrum, listing the following rules for his query project:
- Data source: GitHub files stored in BigQuery.
- Stars matter: We'll only consider the top 400,000 repositories -- by number of stars they got on GitHub during the period Jan-May 2016.
- No small files: Files need to have at least 10 lines that start with a space or a tab.
- No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
- One vote per file: Some files use a mix of spaces or tabs. We'll count on which side depending on which method they use more.
- Top languages: We'll look into files with the extensions (.java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc and .go).
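To make the rules above concrete, here is a minimal Python sketch of the per-file vote. It is not Hoffa's query, which ran as SQL in BigQuery. The names file_vote and tally, the use of a SHA-1 content hash to spot duplicates, the reading of "at least 10 lines" as tab-led plus space-led lines combined, and the choice to count ties as spaces are all assumptions made for illustration. The stars rule is left out because it needs repository metadata.

```python
import hashlib
import os
from collections import Counter

# Extensions named in the "Top languages" rule.
EXTENSIONS = {".java", ".h", ".js", ".c", ".php", ".html", ".cs",
              ".json", ".py", ".cpp", ".xml", ".rb", ".cc", ".go"}
MIN_INDENTED_LINES = 10  # "No small files" threshold


def file_vote(content):
    """Return 'tabs', 'spaces', or None if the file is too small to count."""
    tabs = spaces = 0
    for line in content.splitlines():
        if line.startswith("\t"):
            tabs += 1
        elif line.startswith(" "):
            spaces += 1
    # Assumption: the 10-line minimum applies to tab-led and space-led lines combined.
    if tabs + spaces < MIN_INDENTED_LINES:
        return None
    # Assumption: a tie counts as spaces; the rules do not cover this case.
    return "tabs" if tabs > spaces else "spaces"


def tally(files):
    """Count one vote per unique file, grouped by extension.

    files is an iterable of (path, content) pairs.
    """
    seen = set()   # content hashes already counted ("No duplicates")
    results = {}   # extension -> Counter of votes
    for path, content in files:
        ext = os.path.splitext(path)[1]
        if ext not in EXTENSIONS:
            continue
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        vote = file_vote(content)
        if vote is not None:
            results.setdefault(ext, Counter())[vote] += 1
    return results
```

Calling tally on a list of (path, content) pairs returns a per-extension count of tab votes and space votes, which is the shape of result the rules describe.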
Using an existing table that lists the top 400,000 GitHub repositories, Hoffa extracted the files with the appropriate language extensions, a query that he acknowledged "took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents."
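The extraction step amounts to a filtered join. The sketch below shows the idea in plain Python on in-memory data, under a hypothetical schema: file rows of (repo, path, content_id) and a dictionary mapping content_id to file contents. The function name extract_files and that schema are assumptions for illustration, not the layout of the actual BigQuery tables or the query Hoffa ran.

```python
import os


def extract_files(file_rows, contents_by_id, top_repos, extensions):
    """Join file rows to their contents, keeping only top repos and chosen extensions.

    file_rows: iterable of (repo, path, content_id) tuples (hypothetical schema).
    contents_by_id: dict mapping content_id to file contents.
    """
    top = set(top_repos)     # set lookup keeps the repository filter cheap
    exts = set(extensions)
    selected = []
    for repo, path, content_id in file_rows:
        if repo not in top:
            continue
        if os.path.splitext(path)[1] not in exts:
            continue
        content = contents_by_id.get(content_id)
        if content is None:
            continue  # inner-join behavior: skip rows with no stored contents
        selected.append((repo, path, content))
    return selected
```

Filtering on repository and extension before looking up contents keeps the expensive side of the join small, which matters when the contents run to terabytes.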
Cleaning up the data and applying the aforementioned rules produced the final findings.
Hoffa reported no specific numbers in answering the tabs-vs.-spaces question, presenting the results in a couple of graphics instead.
David Ramel is the editor of Visual Studio Magazine.