Introducing the man (or woman), the myth, the legend: STATISTICS!

Basic mathematics and statistics are the backbone of data science, and anyone working in this field should understand the concepts underlying the algorithms they use. Simply calling model.fit() and model.predict() does not make you a data scientist.

This article outlines the basic concepts of statistics. We try to provide a comprehensive list of important topics to learn, in the hope that it can act as a launchpad for a deep dive into machine learning models and techniques. Think of this as ground zero: the introduction to a series of articles planned for you all.

Introduction to Statistics – What is Statistics?

Matthew Reimherr, author of Introduction to Functional Data Analysis, defines statistics as the art of learning about phenomena through the collection and analysis of data. Matthew answers the million-dollar question that every pseudo data scientist asks (or at least, in my opinion, should!):

What is the difference between statistics and other areas of data science?

The emphasis on probabilistic modeling as well as the understanding and incorporation of different dynamics involved in the data collection process.

Matthew Reimherr

Matthew also talks about the breadth of statistics, calling it the most wonderful part of the field.

What I personally love about statistics is that it enables us to quantify uncertainty, thereby making it precise!
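To make "quantifying uncertainty" a little more concrete, here is a minimal sketch that turns a small sample into an estimate plus an approximate 95% interval. The sample values are made up purely for illustration, and the interval uses a normal approximation, which is an assumption on my part (with this few observations, a t-based interval would be somewhat wider).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample: ten daily sales counts (illustrative numbers only).
sample = [52, 47, 55, 49, 61, 50, 46, 58, 53, 49]

n = len(sample)
estimate = mean(sample)

# Standard error of the mean: sample standard deviation divided by sqrt(n).
std_error = stdev(sample) / sqrt(n)

# Normal approximation (an assumption): z is the 97.5th percentile, roughly 1.96.
z = NormalDist().inv_cdf(0.975)
lower, upper = estimate - z * std_error, estimate + z * std_error

print(f"mean = {estimate:.1f}, approx. 95% interval = ({lower:.1f}, {upper:.1f})")
```

The point is not the exact numbers but the shape of the answer: instead of a single figure, we report a range that reflects how much the estimate could move with a different sample.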

While statistics enables us to build insights or analyze the data, it is also important to understand how to interpret the information.

Statistical Interpretation

The world is full of data. Clive Humby, the renowned British mathematician and data science entrepreneur (and the latter half of the leading data science organization dunnhumby), coined the famous phrase:

Data is the new oil

Clive Humby, 2006

Ever since, this phrase has been repeated and rehashed by many others. The world has seen a boom in data collection, but interpreting that data is the real challenge.

Consider a hypothetical scenario – Natural’s Ice Cream introduced a new advertisement mid-May last year and saw a 42% uptick in sales in the three months that followed. Riddle me this, oh humbled reader. Would you consider this an effective advertisement campaign?
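One way to start thinking about the question is to ask what sales would have done without the advertisement. Ice cream sales may well rise at that time of year anyway, so the sketch below compares the uptick against the same windows in the previous year. Every figure here is hypothetical (only the 42% matches the scenario), and the year-over-year baseline is just one simple assumption about how to separate a seasonal lift from a campaign effect.

```python
# Hypothetical sales totals for the three months before and after mid-May.
# These figures are invented for illustration; only the 42% uptick
# mirrors the scenario described in the text.
sales = {
    "previous_year": {"before": 1000, "after": 1350},  # no new advertisement
    "campaign_year": {"before": 1100, "after": 1562},  # advertisement launched
}

def uplift(period):
    """Relative change in sales from the 'before' window to the 'after' window."""
    return (period["after"] - period["before"]) / period["before"]

baseline = uplift(sales["previous_year"])  # lift seen without the advertisement
observed = uplift(sales["campaign_year"])  # lift seen with the advertisement
excess = observed - baseline               # lift not explained by the baseline

print(f"observed uplift: {observed:.0%}")
print(f"baseline uplift: {baseline:.0%}")
print(f"excess over baseline: {excess:.0%}")
```

With these made-up numbers, most of the 42% would have happened anyway, and only the excess over the baseline is a candidate for the campaign's effect. Even that excess could have other explanations, which is why interpretation matters.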

Okay, here’s an example to help make things clearer.


Very strong negative correlation between Internet Explorer’s market share and the number of monthly active Facebook users between 2006 and 2011.

(Facebook data source)

Does that mean that Facebook was the cause for IE’s decline?!

If you looked at the data without knowing what the columns represent, you might be tempted to conclude that one drives the other, because the correlation is so strong. But the lesson here is that correlation does not imply causation.

There are many such ridiculous examples. Check out Spurious Correlations (Tyler Vigen).
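To see how easily this kind of correlation appears, here is a small sketch using two synthetic series that have nothing to do with each other except that one trends down and the other trends up over time. The series names, slopes, and noise levels are all assumptions chosen for illustration; this is not the actual Internet Explorer or Facebook data.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

random.seed(0)       # fixed seed so the sketch is reproducible
months = range(72)   # six years of monthly observations (synthetic)

# Two independent, made-up series that only share opposite time trends.
share = [60 - 0.5 * t + random.gauss(0, 2) for t in months]   # declining
users = [10 + 10 * t + random.gauss(0, 40) for t in months]   # growing

# Correlating the raw levels picks up the shared dependence on time.
r_levels = pearson(share, users)

# Month-over-month changes strip out the time trend from both series.
d_share = [b - a for a, b in zip(share, share[1:])]
d_users = [b - a for a, b in zip(users, users[1:])]
r_changes = pearson(d_share, d_users)

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The raw levels come out strongly negatively correlated even though neither series was generated from the other. Once the trend is removed, the correlation of the changes is far weaker, which hints that time itself was doing the work.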

Watch out for upcoming articles in the series

  • Article 1: Getting your way around Statistics (topics to cover: types of statistics, types of variables, population vs. sample, different sampling methods)
  • Article 2: Data the Explorer (topics to cover: Univariate/Multivariate analysis, probability and distributions – discrete and continuous, pdf/cdf)
  • Article 3: Test thy Hypothesis (topics to cover: confidence intervals, one-/two-tailed tests, t-tests, chi-squared tests, ANOVA, ANCOVA, errors and their types)

About Us

Data Science Discovery is a step on the path of your data science journey. Please follow us on LinkedIn!

Authors

Raghav Datta holds coding very close to his heart, but over a period of time he has realized that statistics and algorithms are equally important. Raghav holds a degree in Mathematics, as well as Operational Research, both of which are supplementing his journey in data science.