In this post we describe centering features in linear regression: you should do it because it changes the interpretation of the intercept in a very helpful way.
In linear regression, one has pairs of feature vectors $x_i \in \mathbb{R}^p$ and responses $y_i \in \mathbb{R}$, $i = 1, \dots, n$, and one relates them via the model

$$y_i = \beta_0 + x_i^T \beta + \varepsilon_i. \tag{1}$$
Often one is told to center each feature to have mean $0$. You should do this, as it changes the interpretation of the intercept $\beta_0$. The standard interpretation of the intercept is: it is the expected value of the response, holding all covariates fixed at $0$. This isn't super useful, however, as a covariate value of $0$ may rarely occur, or may not have any special interpretation. If we instead mean-center each covariate, the interpretation changes to: the expected value of the response, holding all covariates fixed at their average values.
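To see why, recall a standard identity for ordinary least squares with an intercept (this is a general fact, not anything specific to the example below): the fitted coefficients satisfy

$$\hat{\beta}_0 = \bar{y} - \bar{x}^T \hat{\beta},$$

where $\bar{y}$ and $\bar{x}$ are the sample means of the response and of the covariates. If every covariate has been centered so that $\bar{x} = 0$, this reduces to $\hat{\beta}_0 = \bar{y}$: the intercept is exactly the average response, i.e. the fitted value at the average covariate profile.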
An Example: The Boston Housing Dataset
Let’s try this on the Boston housing dataset in R. We first load the data
library(mlbench)
data(BostonHousing)
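As a quick aside (not needed for the fit itself), we can look at the sample means of the covariates we are about to use, since these are the values the centered intercept will refer to:

colMeans(BostonHousing[, c("crim", "rm", "tax", "lstat")])
# sample means of the four covariates used in the regressions below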
The regression without mean centering would be as follows:
summary(lm(log(medv) ~ crim + rm + tax + lstat, data = BostonHousing))

Call:
lm(formula = log(medv) ~ crim + rm + tax + lstat, data = BostonHousing)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72730 -0.13031 -0.01628  0.11215  0.92987

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.646e+00  1.256e-01  21.056  < 2e-16 ***
crim        -8.432e-03  1.406e-03  -5.998 3.82e-09 ***
rm           1.428e-01  1.738e-02   8.219 1.77e-15 ***
tax         -2.562e-04  7.599e-05  -3.372 0.000804 ***
lstat       -2.954e-02  1.987e-03 -14.867  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2158 on 501 degrees of freedom
Multiple R-squared:  0.7236,	Adjusted R-squared:  0.7214
F-statistic: 327.9 on 4 and 501 DF,  p-value: < 2.2e-16
The intercept is 2.65. What does this mean? It is the expected log-price of a house in a neighborhood with no crime, an average of zero rooms per dwelling (???), no property tax, and a proportion of lower-status people (not a nice phrasing, but I got it from the documentation) of 0. This is clearly not a realistic scenario, and it isn't very useful for us.
What if we mean-center our covariates? Then we have
summary(lm(log(medv) ~ scale(crim, scale = FALSE) + scale(rm, scale = FALSE) +
           scale(tax, scale = FALSE) + scale(lstat, scale = FALSE),
           data = BostonHousing))

Call:
lm(formula = log(medv) ~ scale(crim, scale = FALSE) + scale(rm,
    scale = FALSE) + scale(tax, scale = FALSE) + scale(lstat,
    scale = FALSE), data = BostonHousing)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72730 -0.13031 -0.01628  0.11215  0.92987

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                  3.035e+00  9.592e-03 316.366  < 2e-16 ***
scale(crim, scale = FALSE)  -8.432e-03  1.406e-03  -5.998 3.82e-09 ***
scale(rm, scale = FALSE)     1.428e-01  1.738e-02   8.219 1.77e-15 ***
scale(tax, scale = FALSE)   -2.562e-04  7.599e-05  -3.372 0.000804 ***
scale(lstat, scale = FALSE) -2.954e-02  1.987e-03 -14.867  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2158 on 501 degrees of freedom
Multiple R-squared:  0.7236,	Adjusted R-squared:  0.7214
F-statistic: 327.9 on 4 and 501 DF,  p-value: < 2.2e-16
The expected log-price, holding crime, average number of rooms, tax rate, and proportion of lower-status people at their average values, is 3.035. This makes a lot more sense. As you may notice, the coefficients of the covariates don't change, so their interpretation stays the same.
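As a sanity check (this follows from the least-squares identity above, rather than from anything printed in the output), the intercept of the centered regression should equal the sample mean of the response:

mean(log(BostonHousing$medv))
# should match the centered-model intercept, about 3.035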
In conclusion, when you want your intercept to have a nice interpretation, you should center your covariates.