SQL Holds Its Own in Lucrative Data Science Field
- By David Ramel
- March 1, 2017
Maybe Big Data isn't so unstructured after all.
Contradicting a popular view that sees programming languages such as R, Python and Java as the top tools for data science (which has been dubbed "The Sexiest Job of the 21st Century"), new research shows SQL holding its own.
That research was recently published at r4stats.com, a site that analyzes trends in data science software with -- as the name suggests -- a focus on the R programming language. As Bob Muenchen explained in a blog post yesterday, he recently updated his tracking methodology to include job advertisements from Indeed.com, a jobs site with extensive data mining capabilities.
Muenchen developed a protocol to focus on data scientist postings from Indeed.com, which is more complicated than it sounds.
While Muenchen's main takeaway was that "R Passes SAS, But Python Leaves Them Both Behind," the actual data shows SQL ranked higher than all of the other data science software packages reported.
SQL, however, rated only a single mention in the post: "Figure 1a shows that SQL is in the lead with nearly 18,000 jobs, followed by Python and Java in the 13,000's."
Here's that figure ("The number of data science jobs for the more popular software -- those with 250 jobs or more, 2/2017"):
Despite the surprise No. 1 ranking of SQL, many industry sources focus on Python, for example, or R vs. Python, with the occasional Python vs. Scala or Python vs. Java.
Those sources, along with a quick Web search, show that in the data science world, R and Python are consistently listed among the most prominent programming languages, with Java and Scala close behind as honorable mentions.
For example, here's how ASI recently graded those languages in terms of scalability and initial development speed in a post on "The Right Programming Language for Data Science":
However, other research findings show SQL figuring prominently in data science in particular and Big Data analytics in general, despite being a decades-old legacy language strongly associated with structured relational database management.
For example, we recently reported on how the growth of Apache Spark, arguably the most important Big Data software, is being driven by increased use of SQL in Big Data analytics, along with streaming and machine learning.
And early last year, we reported on CrowdFlower Inc. research about "What Skills Should Data Scientists Have in 2016?"
"The answer to our question, then, is get your SQL on," said CrowdFlower's Justin Tenuto. The following figure shows that CrowdFlower found SQL to be the most in-demand skill for data scientists in 2016:
KDnuggets published research in which SQL makes a strong showing among the "top 10 most popular tools in 2016," behind R and Python:
Yet another recent post -- this one a partial report published on Data Science Central, examining "Top programming languages for Data Science" -- shows SQL just slightly behind R:
"Even if it was developed in the early 1970s, SQL plays a key role still today (in second position of ranking with 49 percent of preferences)," Data Science Central said. "Although SQL is not designed for the task of handling unstructured datasets (typical of Big Data), there is still a strong need for analyze structured data in organizations, and SQL is a very popular choice for data crunching stage."
Taken together, these reports and many others show SQL isn't being relegated to the back burner in the Big Data and data science fields, despite industry pundits' warnings to that effect when NoSQL began disrupting the data space.
Which gets us back to the original point: Maybe Big Data analytics isn't so unstructured after all.
David Ramel is an editor and writer for Converge360.