
Data Scientists Love Jobs, Dislike What They Do Most: Clean Data

Paradoxically, data scientists love their jobs overall but dislike what they do most: cleaning and organizing data.

That's one of the main takeaways from a new report by CrowdFlower Inc. on what has been called the "sexiest job of the 21st century."

"Organizations that start prioritizing ways to help data scientists clean their data are going to find a data team with more time to work on more important -- and more fulfilling -- tasks," said CrowdFlower's Justin Tenuto in a blog post this week announcing the new "2016 Data Science Report" (available as a free PDF upon providing registration information).

The report was compiled early this year from surveys, interviews and in-house analytics of CrowdFlower's own platform, which, conveniently, provides a contributor network to help organizations "collect, clean and label data."

In its survey, CrowdFlower found almost the same percentage of respondents reported they spent most of their time cleaning data (60 percent) as those who reported that task to be the least enjoyable part of their job (57 percent). The next least enjoyable task was collecting data sets, a distant second place at 21 percent. That task was also reported as the second-most time-consuming part of the job (19 percent). As you can see below, responses to the two questions basically mirror each other -- yet most data scientists are quite happy in their positions.

[Figure: Time Consuming (source: CrowdFlower)]
[Figure: Least Enjoyable (source: CrowdFlower)]

Another key takeaway for developers thinking about getting in on this lucrative profession: the future of data science is machine learning (ML).

"What's next, to put it simply, is machine learning," the report states. "ML has already been adopted in some way, shape, or form, by most of the world's biggest companies, and with big players in the tech space like Google, Microsoft, Amazon, IBM, and Facebook open-sourcing their machine learning tools, the momentum is there for massive advancements."

Indeed, Google just this week made big waves at its NEXT conference by announcing a cloud-based service to take ML development mainstream, one of many such efforts, with other companies such as Hewlett Packard Enterprise launching their own.

"Over half our respondents noted machine learning had significant importance for their companies and their departments, while only one in ten marked that it wasn't very important at all," the report said. "We expect that 10 percent to shrink even further next year."

Other findings of the report include:

  • 83 percent of respondents said there weren't enough data scientists to go around, up from last year's number of 79 percent.
  • More than 80 percent of data scientists are really happy at work. On a scale of 1 to 5, 47 percent marked a 4 and 35 percent marked a 5.
  • When asked if they had access to the tools needed to do their job, a plurality, 46 percent, said they agreed. Another 21 percent strongly agreed, 19 percent were neutral, 13 percent disagreed, and only 1 percent strongly disagreed.

The report also included data on the top in-demand skills in data science, which was culled from examining LinkedIn job postings and which was released earlier this year by CrowdFlower.

The new report expands on that information, listing for each skill the number of job postings that mention it and the percentage of postings that do so:

[Figure: Most-Wanted Data Science Skills (source: CrowdFlower)]

"As more and more organizations adopt data as a key driver of decision making, the importance of streamlined, well-oiled data science teams is going to remain paramount," the report concluded. "But the current status quo probably isn't sustainable. On the one hand, we see a shortage of data scientists while on the other, they're spending too much time cleaning and munging data. This is time that could be much better served doing predictive analysis and building out machine learning practices.

"That's not to say that cleaning and labeling data isn't important, of course," the report continued. "Analysis on bad data is a garbage-in, garbage-out sort of scenario. Rather, organizations that want to get the most of their data should aim to fix the problems their teams have now. They should talk to them and find out exactly what takes up their time. By mitigating the effort their teams spend doing janitorial data work, they'll be able to empower their teams to do the valuable tasks that data scientists actually enjoy doing."

About the Author

David Ramel is an editor and writer at Converge 360.