What is the Difference Between Machine Learning and Statistics?

Machine learning and statistics use very similar tools: probability distributions, representations of conditional probability, maximum likelihood estimation, Bayesian inference, etc.  What is the difference?  Statistics focuses on estimating the parameters of a statistical model and describing how ‘good’ those estimates are, while supervised machine learning is primarily concerned with generalization (out-of-sample, or test) error when doing prediction, and statistical models may or may not be used.

Statistical Models

A statistical model is a pair (S,\mathcal{P}), where S is the sample space and \mathcal{P} is a set of probability distributions (actually probability measures).  In parametric statistics, \mathcal{P}=\{P_\theta:\theta\in \Omega\}, where \theta is a vector of parameters of fixed length p, and \Omega is the space of possible parameters.  As a statistician, one’s goal is to find a ‘good’ \hat{\theta} to estimate the true \theta and thus describe the distribution of one’s data well.  One would like to answer the following questions about \hat{\theta}:

  • Is it consistent?  That is, in large samples does it converge in probability to the true parameter?
  • Is it normal in large samples?
  • Can we say anything about the mean squared error of this estimator compared to other estimators?
  • Can we say anything about the variance of this estimator?
  • Can we derive confidence intervals?
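
The first three questions can be written more formally (a brief sketch; here \hat{\theta}_n denotes the estimator computed from n samples and \theta_0 the true parameter, notation used only in this paragraph):

    \hat{\theta}_n \overset{p}{\longrightarrow} \theta_0, \qquad \sqrt{n}\,(\hat{\theta}_n - \theta_0) \overset{d}{\longrightarrow} N(0, \Sigma), \qquad \mathrm{MSE}(\hat{\theta}_n) = E\|\hat{\theta}_n - \theta_0\|^2 = \mathrm{tr}\,\mathrm{Var}(\hat{\theta}_n) + \|\mathrm{Bias}(\hat{\theta}_n)\|^2.

Confidence intervals are then sets computed from the data that cover \theta_0 with a prescribed probability, e.g. 95%.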

For example, in simple linear regression, we have random variables y_1,\cdots,y_n and fixed values x_1,\cdots,x_n, and we want to describe the density p(y_i|x_i).  We assume this density is parametrized by \beta, which specifies the parametric model, and we want to find some estimate \hat{\beta} of \beta that gives a ‘good’ approximation to p_{\beta}(y|x).  We’d then like to answer the same questions above about \hat{\beta}.
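
To make the inferential workflow concrete, here is a minimal sketch in Python (using numpy and statsmodels on simulated data; the numbers are illustrative only, not an example from this post):

    import numpy as np
    import statsmodels.api as sm

    # Simulate data from the assumed parametric model: y_i ~ N(beta0 + beta1 * x_i, sigma^2)
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 100)
    y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=x.size)

    # Estimate beta_hat by least squares (the maximum likelihood estimate under this model)
    X = sm.add_constant(x)            # design matrix with an intercept column
    fit = sm.OLS(y, X).fit()

    print(fit.params)                 # beta_hat: point estimates of (beta_0, beta_1)
    print(fit.bse)                    # standard errors of beta_hat
    print(fit.conf_int(alpha=0.05))   # 95% confidence intervals for beta

The output is entirely about \hat{\beta}: point estimates, standard errors, and confidence intervals.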

Notice that we haven’t said anything about prediction, and in fact if you take a class on linear regression, both applications and theory classes mostly focus on inference, with perhaps a small amount of prediction.

Probably Approximately Correct (PAC) Learning and Generalization Error

Now let’s look at the core idea of machine learning: PAC learning.  Heuristically, PAC learning tells us that, given the ability to draw new data points with binary labels, a PAC learning algorithm can in polynomial time, with arbitrarily high probability, learn a classifier with arbitrarily low average error on new data drawn from the same distribution.
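
One common way to state this more formally (a brief sketch of the realizable case, with notation introduced only for this paragraph): an algorithm PAC-learns a concept class \mathcal{C} if, for every target c \in \mathcal{C}, every distribution D over inputs, and every \epsilon, \delta \in (0,1), given m = \mathrm{poly}(1/\epsilon, 1/\delta, \ldots) i.i.d. labeled examples (x, c(x)) with x \sim D, it runs in time polynomial in m and outputs a classifier h with

    P\big(\mathrm{err}_D(h) \le \epsilon\big) \ge 1 - \delta, \qquad \text{where } \mathrm{err}_D(h) = P_{x \sim D}\big(h(x) \neq c(x)\big).

The quantity being controlled is the generalization error \mathrm{err}_D(h), not the parameters of any particular statistical model.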

Modern Statistics and Machine Learning Papers

How does this relate to modern work, whether in statistics papers or machine learning papers, including those that use deep neural networks?  If you look at statistics papers, many of them go as follows:

  • Write down a statistical model.  It could be an entirely new model, or it could be a setting of an old model where old techniques have problems (i.e. estimators for \theta won’t be ‘good’).
  • Give some idea for how to estimate \theta with \hat{\theta}.  This may be derived, or it may simply be some clever idea.
  • Show that \hat{\theta} is consistent and asymptotically normal, or show some other properties.
  • Do inference on some data.

There doesn’t need to be any prediction here.  Now consider machine learning papers, including modern papers using deep learning; many of them look like this:

  • Write down a prediction problem.
  • Write down a model for doing prediction.  This may or may not involve a statistical model.  For instance, probabilistic graphical models are statistical models, while support vector machines in their standard setup are not described as statistical models.  If our model for prediction involves a statistical model, then I would argue that this is statistical machine learning.  Otherwise I’d say that this is just machine learning.
  • Describe an algorithm for learning the parameters of your model.
  • Show that for various datasets, the approach has low generalization error, sometimes as a function of data size or number of training iterations (see the sketch after this list).
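
A rough sketch of the last two steps (using scikit-learn on synthetic data; the particular model and numbers are purely illustrative, not taken from any specific paper) might look like:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Synthetic binary classification data (illustrative only)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    # Hold out a test set to estimate generalization error
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Report held-out (test) error as a function of training-set size
    for n in (50, 100, 200, len(X_train)):
        clf = SVC(kernel="linear").fit(X_train[:n], y_train[:n])
        test_error = 1.0 - clf.score(X_test, y_test)
        print(f"n = {n}: test error = {test_error:.3f}")

The quantity reported is held-out error, echoing the PAC setup above; whether or not the classifier corresponds to a statistical model is incidental to the evaluation.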

Note the similarity to PAC learning in the last step.  Even if a model is not described as a statistical model, it may correspond exactly to one; but if the model and most of its properties are not derived within the framework of a statistical model, then I’d consider it not to fall under statistics.

Obviously the lines can get blurry and there is crossover.  One may see papers about hypothesis testing in machine learning venues, or papers on SVMs in statistics venues.  However, if you want to differentiate them, the use of an explicit statistical model is as good a yardstick as any for deciding what falls under statistics.

Conclusion

In summary:

  • Statistics involves inferring the parameters of a statistical model and describing how ‘good’ those estimates are.
  • If such a model is studied, either theoretically or empirically, in the context of generalization error, we have statistical machine learning.
  • If some other model that is not described as a statistical model is studied in the context of generalization error, we have machine learning.
