Data Science: Harnessing Computing Power to Redefine Statistics

A few years ago, data science was an obscure term to many, even to those who worked with data. A probable response would have been “that’s statistics”. Fast forward to 2015 and everyone is talking about the new kid called data science. Whether it is a buzzword or the latest discipline, we all agree it is the hottest thing at the moment. Earlier this year, Glassdoor named data scientist the top job of 2016. Demand for data scientists is expected to produce 1,700 job openings and a lofty average pay of $116k this year. No wonder the Harvard Business Review calls data scientist the “sexiest job of the 21st century”.

The moment the term data science is mentioned, most of us think of statistics. So what makes data science different from statistics? Or is it just a spruced-up term for statistics? Thinking of data science as a field related to statistics is quite right. The American Statistical Association defines statistics as “the science of learning from data…” It is therefore no surprise that most learners think of data science as a rebranding of statistics. Outside academia, data science has not escaped the ridicule of internet humorists. One humorist quipped on Twitter, “A data scientist is a statistician who lives in San Francisco”. Big Data Borat, another Twitter humorist, wrote that “data science is statistics on a Mac”. Other pundits, trying to distinguish statistics from data science, opine that a data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician. Though the statement may look like a joke, it has an element of truth.

Is Data Science Any Different from Classical Statistics?

What differentiates statistics from data science is fairly complicated, with deep roots in computing. In the pre-computer era, statistics played a key role in analyzing empirical experiments based on small samples. The advent of supercomputers and personal computers heralded the birth of big data and large databases. These humongous amounts of data could not be manipulated and analyzed using conventional statistical methods. Thus arose the need for methods that are fast, accurate and efficient in dealing with large data sets and databases. Data science, therefore, is a response to new computing power. According to Peter Naur in his publication “Concise Survey of Computer Methods”, data science is not a discipline concerned only with analyzing data, as classical statistics is. It covers the manipulation and management of data as a whole, including cleaning, processing, storing, manipulating and analyzing data.

As the world grew in complexity and computing power increased, a need arose for sophisticated tools to deal with vast data sets. Researchers were increasingly using data sets that required advanced manipulation techniques. The early pioneers of data science borrowed heavily from machine learning and database management to create tools for manipulating these vast data sets. Consequently, it became easier to make predictions about erratic markets and consumer behavior and to analyze clinical trials.

Statistics, as a standalone field, has not changed dramatically in response to increased computing power. The field continues to rely on introductory statistics, probability theory, hypothesis testing and computing. This has not sat well with some statisticians, who feel that the field should adapt to a changing world. William Cleveland, a renowned statistician, advocated in 2001 for renaming statistics to data science. The new field, according to him, would place greater emphasis on computing and real data analysis. Nate Silver, on the other hand, argues that data science is no different from statistics. The well-known statistician, famous for correctly predicting the 2012 US presidential election, believes that data scientist is a “sexed up term for a statistician”. Silver argues strongly that data science is a patronizing fad and a replica of what statisticians have been doing over the years. To him, it is a buzzword whose time has come but which will soon fade. While it is true that no proper definition of data science has been put forward, it is difficult to deny that data science has redefined the way we deal with data.

Nathan Yau, a statistician and data visualizer, states that data scientists, unlike statisticians, have three major skill sets:
Statistics and machine learning
Programming skills