BigQuery Tackles 1 Billion GitHub Files To Reveal Spaces vs. Tabs Developer Preference -- ADTmag

By David Ramel
September 6, 2016

Google developer advocate Felipe Hoffa showed off the capabilities of the company's cloud-based BigQuery data warehouse by analyzing some 1 billion files across 400,000 GitHub repositories to see if developers prefer tabs or spaces to indent their code.

That weighty issue has divided the developer community for years. It was even addressed in last year's Stack Overflow developer survey, which found that 45 percent of respondents preferred tabs, while 33.6 percent preferred spaces.

Last week, Hoffa put BigQuery to the task to demonstrate its capabilities. He examined code written in 14 top programming languages -- some 14 TB in all -- contained in 1 billion files spanning 400,000 open source repositories.

"Analyzing each line of 133 GBs of code in 16 seconds?" he wrote. "That's why I love BigQuery."

The product's site explains it in a nutshell:

"BigQuery is Google's fully managed, petabyte scale, low cost analytics data warehouse. BigQuery is serverless, there is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of our pay-as-you-go model. BigQuery is a powerful Big Data analytics platform used by all types of organizations, from start-ups to Fortune 500 companies."

In June, both GitHub and Google announced the expansion of publicly available BigQuery tables containing GitHub data to the tune of more than 3 TB.

"It contains activity data for more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions," GitHub said.

Google explained those tables can also be queried with SQL.

"The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery," Google said. "Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision."

So, of course, Hoffa promptly put the new datasets to use to settle the tabs vs. spaces conundrum, listing the following rules for his query project:

Data source: GitHub files stored in BigQuery.
Stars matter: We'll only consider the top 400,000 repositories -- by number of stars they got on GitHub during the period Jan-May 2016.
No small files: Files need to have at least 10 lines that start with a space or a tab.
No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
One vote per file: Some files use a mix of spaces or tabs. We'll count on which side depending on which method they use more.
Top languages: We'll look into files with the extensions (.java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc and .go).
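
To make the rules concrete, here is a minimal Python sketch of the per-file vote and the tally. It is not Hoffa's query, which ran as SQL in BigQuery; the function names, the SHA-1 content hash used to spot duplicates, and the choice to give tied files no vote are all assumptions made for illustration.

```python
import hashlib
import os
from collections import Counter, defaultdict

# Extensions taken from the "Top languages" rule.
EXTENSIONS = {".java", ".h", ".js", ".c", ".php", ".html", ".cs",
              ".json", ".py", ".cpp", ".xml", ".rb", ".cc", ".go"}
MIN_INDENTED_LINES = 10  # "No small files" rule


def classify_file(content):
    """Return 'tabs', 'spaces', or None if the file casts no vote."""
    tabs = spaces = 0
    for line in content.splitlines():
        if line.startswith("\t"):
            tabs += 1
        elif line.startswith(" "):
            spaces += 1
    if tabs + spaces < MIN_INDENTED_LINES:
        return None
    if tabs == spaces:
        return None  # assumption: the rules do not say how ties are handled
    return "tabs" if tabs > spaces else "spaces"


def tally(files):
    """Count one vote per unique file; `files` yields (path, content) pairs."""
    seen = set()
    votes = defaultdict(Counter)
    for path, content in files:
        ext = os.path.splitext(path)[1]
        if ext not in EXTENSIONS:
            continue
        # Assumption: identical content means "duplicate file".
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        vote = classify_file(content)
        if vote is not None:
            votes[ext][vote] += 1
    return votes
```

Calling `tally` on an iterable of `(path, content)` pairs returns a per-extension counter of tab and space votes. The "Stars matter" rule is left out here, since it depends on which repositories the files are drawn from.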

Using an existing table that lists the top 400,000 GitHub repositories, Hoffa extracted the files with the appropriate language extensions, a query that he acknowledged "took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents."
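
The shape of that extraction step can be sketched in a few lines of Python. This is an in-memory illustration only, not the BigQuery SQL Hoffa ran; the tuple layouts, the `content_id` key, and the function name are hypothetical stand-ins for the real tables.

```python
import os


def select_contents(top_repos, files, contents, extensions):
    """Keep the contents of files that live in a top repo and have a wanted extension.

    top_repos:  iterable of repository names
    files:      iterable of (repo_name, path, content_id) rows
    contents:   iterable of (content_id, text) rows
    extensions: set of extensions such as {".py", ".go"}
    """
    top = set(top_repos)
    wanted = {}  # content_id -> extension of the first matching path
    for repo, path, content_id in files:
        ext = os.path.splitext(path)[1]
        if repo in top and ext in extensions:
            wanted.setdefault(content_id, ext)
    # Join step: stream the contents rows and keep only the wanted ids.
    return [(cid, wanted[cid], text) for cid, text in contents if cid in wanted]
```

Building the lookup from the file rows first and then streaming the contents past it is a simple hash join. At the scale quoted above the same logic has to run in a distributed engine, which is the point of the BigQuery demonstration.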

Cleaning up the data and applying the aforementioned rules produced the final findings.

While Hoffa reported no specific numbers in answering the tabs-vs.-spaces question, a couple of graphics are worth thousands of words:

[Figure: Tabs vs. Spaces -- The Numbers *(source: Google)*]

About the Author

David Ramel is an editor and writer at Converge 360.
