In this post we describe centering features in linear regression: you should do it because it changes the interpretation of the intercept in a very helpful way.
In linear regression, one has pairs of feature vectors $x_i \in \mathbb{R}^p$ and responses $y_i \in \mathbb{R}$, $i = 1, \dots, n$, and one relates them via the model

$$y_i = \beta_0 + x_i^T \beta + \varepsilon_i. \tag{1}$$
Often one is told to center each feature to have mean $0$. You should do this, as it changes the interpretation of the intercept $\beta_0$. The standard interpretation of the intercept is: it is the expected value of the response, holding all covariates fixed at $0$. This isn't super useful, however, as a covariate value of $0$ may rarely occur, or may not have any special interpretation. If we instead mean-center each covariate, the interpretation changes to: the expected value of the response, holding all covariates fixed at their average values.
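To see why, recall a standard identity for ordinary least squares with an intercept (this is a general fact, not anything specific to the example below): the fitted coefficients satisfy

$$\hat{\beta}_0 = \bar{y} - \bar{x}^T \hat{\beta},$$

where $\bar{y}$ and $\bar{x}$ are the sample means of the response and of the covariates. If every covariate has been centered so that $\bar{x} = 0$, this reduces to $\hat{\beta}_0 = \bar{y}$: the intercept is exactly the average response, i.e. the fitted value at the average covariate profile.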
An Example: The Boston Housing Dataset
Let’s try this on the Boston housing dataset in R. We first load the data
library(mlbench)
data(BostonHousing)
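As a quick aside (not needed for the fit itself), we can look at the sample means of the covariates we are about to use, since these are the values the centered intercept will refer to:

colMeans(BostonHousing[, c("crim", "rm", "tax", "lstat")])
# sample means of the four covariates used in the regressions below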
The regression without mean centering would be as follows:
summary(lm(log(medv) ~ crim + rm + tax + lstat, data = BostonHousing))

Call:
lm(formula = log(medv) ~ crim + rm + tax + lstat, data = BostonHousing)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72730 -0.13031 -0.01628  0.11215  0.92987

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.646e+00  1.256e-01  21.056  < 2e-16 ***
crim        -8.432e-03  1.406e-03  -5.998 3.82e-09 ***
rm           1.428e-01  1.738e-02   8.219 1.77e-15 ***
tax         -2.562e-04  7.599e-05  -3.372 0.000804 ***
lstat       -2.954e-02  1.987e-03 -14.867  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2158 on 501 degrees of freedom
Multiple R-squared:  0.7236,	Adjusted R-squared:  0.7214
F-statistic: 327.9 on 4 and 501 DF,  p-value: < 2.2e-16
The intercept is 2.65. What does this mean? It is the expected log-price of a house in a neighborhood with no crime, an average of zero rooms per dwelling (???), no property tax, and a proportion of lower-status people (not a nice phrasing, but I got it from the documentation) of 0. This is clearly not a realistic scenario, and it isn't very useful for us.
What if we mean-center our covariates? Then we have
summary(lm(log(medv) ~ scale(crim, scale = FALSE) + scale(rm, scale = FALSE) +
           scale(tax, scale = FALSE) + scale(lstat, scale = FALSE),
           data = BostonHousing))

Call:
lm(formula = log(medv) ~ scale(crim, scale = FALSE) + scale(rm,
    scale = FALSE) + scale(tax, scale = FALSE) + scale(lstat,
    scale = FALSE), data = BostonHousing)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72730 -0.13031 -0.01628  0.11215  0.92987

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  3.035e+00  9.592e-03 316.366  < 2e-16 ***
scale(crim, scale = FALSE)  -8.432e-03  1.406e-03  -5.998 3.82e-09 ***
scale(rm, scale = FALSE)     1.428e-01  1.738e-02   8.219 1.77e-15 ***
scale(tax, scale = FALSE)   -2.562e-04  7.599e-05  -3.372 0.000804 ***
scale(lstat, scale = FALSE) -2.954e-02  1.987e-03 -14.867  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2158 on 501 degrees of freedom
Multiple R-squared:  0.7236,	Adjusted R-squared:  0.7214
F-statistic: 327.9 on 4 and 501 DF,  p-value: < 2.2e-16
The expected log-price, holding crime, average number of rooms, tax rate, and proportion of lower-status people at their average values, is 3.035. This makes a lot more sense. As you may notice, the coefficients of the covariates don't change, so their interpretation stays the same.
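As a sanity check (this follows from the least-squares identity above, rather than from anything printed in the output), the intercept of the centered regression should equal the sample mean of the response:

mean(log(BostonHousing$medv))
# should match the centered-model intercept, about 3.035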
In conclusion, when you want your intercept to have a nice interpretation, you should center your covariates.