The Problem of Multicollinearity in Linear Regression

In this post we’ll talk about multicollinearity in linear regression.
Multicollinearity occurs when features are correlated with each other, and it causes the learned model to have very high variance. Consider, for example, predicting housing prices: if two of your features are the number of bedrooms and the size of the house in square feet, they are likely to be correlated. Now consider a correctly specified linear regression model. The mean squared error of the learned parameters \hat{\beta} can be decomposed into squared bias and variance. If we estimate \hat{\beta} via ordinary least squares (OLS), the bias is zero, so that

(1)   \begin{align*} \mathbb{E}((\hat{\beta}-\beta)^T(\hat{\beta}-\beta)|\boldsymbol{X})&=\mathrm{Tr}[\textrm{Var}(\hat{\beta}|\boldsymbol{X})]\\ &=\sigma^2 \mathrm{Tr}[(\boldsymbol{X}^T \boldsymbol{X})^{-1}] \end{align*}
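
To see where (1) comes from, here is a quick sketch of the standard derivation, assuming the usual OLS setup with \mathbb{E}(\epsilon|\boldsymbol{X})=0 and \textrm{Var}(\epsilon|\boldsymbol{X})=\sigma^2 I, and using that for an unbiased estimator the mean squared parameter error is the trace of its covariance matrix:

\begin{align*} \hat{\beta}&=(\boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T y=(\boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T(\boldsymbol{X}\beta+\epsilon)=\beta+(\boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\epsilon\\ \textrm{Var}(\hat{\beta}|\boldsymbol{X})&=(\boldsymbol{X}^T \boldsymbol{X})^{-1}\boldsymbol{X}^T(\sigma^2 I)\boldsymbol{X}(\boldsymbol{X}^T \boldsymbol{X})^{-1}=\sigma^2 (\boldsymbol{X}^T \boldsymbol{X})^{-1}\\ \mathbb{E}((\hat{\beta}-\beta)^T(\hat{\beta}-\beta)|\boldsymbol{X})&=\mathrm{Tr}[\textrm{Var}(\hat{\beta}|\boldsymbol{X})]=\sigma^2 \mathrm{Tr}[(\boldsymbol{X}^T \boldsymbol{X})^{-1}] \end{align*}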

Let’s investigate the effect of multicollinearity by looking at how accurately we learn the parameters under both multicollinear features and independent features. We will first look at the empirical performance and then turn to the theory that explains it. We will use the following linear model
y_i=\beta^T \boldsymbol{x}^{(i)}+\epsilon_i where \epsilon_i\sim \mathcal{N}(0,1) i.i.d. and \boldsymbol{x}^{(i)}=(x^{(i)}_1,x^{(i)}_2), i=1,\dots,n. For multicollinear data we use

(2)   \begin{align*} x^{(i)}_1&\sim \mathcal{N}(1,1)\\ x^{(i)}_2&=2x^{(i)}_1+\delta_i \end{align*}

Thus the two features exhibit multicollinearity, but the noise term \delta_i keeps them from being exactly linearly dependent, so the Gram matrix \boldsymbol{X}^T \boldsymbol{X} is still invertible. For independent data we use

(3)   \begin{align*}x^{(i)}_1&\sim \mathcal{N}(1,1)\\ x^{(i)}_2&\sim\mathcal{N}(2,0.01) \end{align*}
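
The code behind the experiment isn’t included in the post, so below is a minimal sketch in Python (NumPy) of how the two designs can be generated. The distribution of \delta_i and the true parameter vector \beta are not specified in the post, so \delta_i\sim\mathcal{N}(0,0.1^2) and \beta=(1,1) are assumptions made purely for illustration.

import numpy as np

# True parameters: assumed to be (1, 1) for illustration (not stated in the post).
beta_true = np.array([1.0, 1.0])

def sample_multicollinear(n, rng):
    # Equation (2): x2 is almost a linear function of x1.
    # delta ~ N(0, 0.1^2) is an assumption; the post only requires it to be small.
    x1 = rng.normal(1.0, 1.0, size=n)
    delta = rng.normal(0.0, 0.1, size=n)
    x2 = 2.0 * x1 + delta
    return np.column_stack([x1, x2])

def sample_independent(n, rng):
    # Equation (3): x1 and x2 are drawn independently
    # (0.01 is read as the variance of x2, i.e. standard deviation 0.1).
    x1 = rng.normal(1.0, 1.0, size=n)
    x2 = rng.normal(2.0, 0.1, size=n)
    return np.column_stack([x1, x2])

def sample_dataset(n, rng, multicollinear=True):
    # y_i = beta^T x^(i) + eps_i with eps_i ~ N(0, 1), as in the model above.
    X = sample_multicollinear(n, rng) if multicollinear else sample_independent(n, rng)
    eps = rng.normal(0.0, 1.0, size=n)
    y = X @ beta_true + eps
    return X, y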

We now sample 200 \boldsymbol{x},y pairs for each of the multicollinear and independent designs and perform OLS. We repeat this 30 times and plot density estimates of the learned parameters.
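
Continuing the sketch above (and still under the assumed \beta and \delta_i), the experiment might look roughly like this; np.linalg.lstsq gives the OLS fit and scipy.stats.gaussian_kde gives the density estimates of the learned \beta_1:

from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def run_experiment(multicollinear, n=200, repeats=30, seed=0):
    # Fit OLS on `repeats` independent datasets and collect the learned betas.
    rng = np.random.default_rng(seed)
    betas = []
    for _ in range(repeats):
        X, y = sample_dataset(n, rng, multicollinear=multicollinear)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimate
        betas.append(beta_hat)
    return np.array(betas)  # shape (repeats, 2)

betas_mc = run_experiment(multicollinear=True)
betas_ind = run_experiment(multicollinear=False)

# Kernel density estimates of the learned beta_1 under the two designs.
grid = np.linspace(-3.0, 5.0, 400)
plt.plot(grid, gaussian_kde(betas_mc[:, 0])(grid), label="multicollinear")
plt.plot(grid, gaussian_kde(betas_ind[:, 0])(grid), label="independent")
plt.axvline(beta_true[0], linestyle="--", color="gray", label="true beta_1")
plt.xlabel("learned beta_1")
plt.ylabel("density")
plt.legend()
plt.show()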

As we can see, the variance with multicollinear features is drastically higher.  In particular, with independent features we generally get a pretty good learned model, whereas with multicollinear features, some of the learned models are useless. Why is this? Recall that for a given dataset,

(4)   \begin{align*} \textrm{MSE}&=\mathbb{E}((\hat{\beta}-\beta)^T(\hat{\beta}-\beta)|\boldsymbol{X})\\ &=\sigma^2 \mathrm{Tr}[(\boldsymbol{X}^T \boldsymbol{X})^{-1}] \end{align*}

With multicollinearity, \boldsymbol{X}^T \boldsymbol{X} has “almost” linearly dependent columns, which means some of its eigenvalues are very small. Now

(5)   \begin{align*} \mathrm{Tr}[(\boldsymbol{X}^T \boldsymbol{X})^{-1}]&=\frac{1}{\lambda_1(\boldsymbol{X}^T \boldsymbol{X})}+\dots+\frac{1}{\lambda_p(\boldsymbol{X}^T \boldsymbol{X})}\\ &\geq \frac{1}{\lambda_{\min}(\boldsymbol{X}^T \boldsymbol{X})} \end{align*}

where \lambda_i is the ith eigenvalue, and all eigenvalues of \boldsymbol{X}^T \boldsymbol{X} are positive since it is positive definite whenever it is invertible. Thus even a single small eigenvalue leads to a very large mean squared parameter error. Let’s see what the smallest eigenvalues look like in practice under both the multicollinear and the independent data.
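
As a final piece of the sketch (again building on the generators above), the smallest eigenvalue of the Gram matrix can be compared directly:

def min_eigenvalue(X):
    # Smallest eigenvalue of the Gram matrix X^T X (symmetric, so eigvalsh applies).
    return np.linalg.eigvalsh(X.T @ X).min()

rng = np.random.default_rng(1)
min_mc = [min_eigenvalue(sample_dataset(200, rng, multicollinear=True)[0]) for _ in range(30)]
min_ind = [min_eigenvalue(sample_dataset(200, rng, multicollinear=False)[0]) for _ in range(30)]

print("median smallest eigenvalue, multicollinear:", np.median(min_mc))
print("median smallest eigenvalue, independent:  ", np.median(min_ind))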

As expected, the minimum eigenvalues are much smaller under multicollinearity. To sum up:

  1. Multicollinearity occurs when features are correlated with each other.
  2. It leads to high variance in the learned model.
  3. The reason is that the minimum eigenvalue of the Gram matrix of the features becomes very small when we have multicollinearity.

In the future, we will discuss how to detect multicollinearity and what to do about it.
