In this post we describe how to do gradient descent with constraints. We first describe the problem, including why we can’t naively apply gradient descent and a few cases where this is necessary. Then we describe projected gradient descent and apply this to the popular Boston housing prices dataset to do non-negative least squares. We finish with a brief discussion.
We are interested in the problem of minimizing a convex, differentiable function over a compact convex set . This has many applications, including constrained least squares (linear regression with constraints on parameters), logistic regression with constraints, and certain types of boosting and SVMs. For example consider non-negative linear regression. In this case we know that parameters should all be non-negative. For instance, more rooms should have a positive effect on the prices of a house. While using ordinary least squares will eventually learn this, imposing the constraint structure can improving learning in small samples.
A general method for optimizing convex functions is gradient descent, which iteratively updates our as follows:
here is the step size. However even if is in the constraint set , there is no guarantee that is. There are two very common ways to handle this: one is to project onto , and another is to minimize a linear approximation to the function over .
Projected Gradient Descent for Non-negative Least Squares
Consider again non-negative least squares, where the coefficients cannot be negative. One applies this because of domain knowledge about the problem: for instance more rooms will not lower the price of a house, and similarly if the effect is a count it cannot be negative. If , the goal is to find
While we don’t need to enforce the constraint, it can help improve learning in small samples. One way to handle this is via projected gradient descent. The idea is to do one step of gradient descent, and then find the closest non-negative point in to that step. More formally, let and . Then projected gradient descent updates
Here the function simply zeroes out negative elements of a vector.
Application: Non-negative Least Squares for Boston Housing Prices
Let’s apply this to the Boston housing prices dataset. We will use four features: crime (per capita crime rate), rooms (average number of rooms per dwelling), tax (property tax rate), and percent of population of lower socioeconomic status. We expect the only positive coefficient to be rooms. Thus we flip the sign of all the other observations and then do projected gradient descent. We first load some libraries, write a function to calculate the gradient of the least squares objective function, and load the data.
from sklearn.datasets import load_boston import numpy as np np.set_printoptions(suppress=True) def deriv_f(beta,X,y): return -2*np.dot(X.T,y)+2*np.dot(np.dot(X.T,X),beta) dataset = load_boston() old_X,y=load_boston(True) dataset['feature_names']
Before running projected gradient descent, we can first run naive gradient descent and then compare the convergence empirically.
X = np.ones((old_X.shape,5)) X[:,1:5]=old_X[:,np.array([0,5,9,12])] p=np.shape(X) beta=np.zeros(p) grad = 100 i=0 while(np.linalg.norm(grad)>1e-2): grad = deriv_f(beta,X,y) beta = beta-1e-8*grad i+=1 if i%5000==0: print(i) print(beta) print(beta)
As a unit test, we can check the results against naively solving least squares via matrix inversion (note: you generally should not invert a matrix to solve linear least squares, however for relatively small problems where features are not highly correlated, it can be fine to do so).
print(i) print(beta) print(np.dot(np.linalg.inv(np.dot(X.T,X)),np.dot(X.T,y))) 122627056 [-1.41321706 -0.06157611 5.24849 -0.00501863 -0.53485144] [-1.41492797 -0.06157926 5.2487206 -0.00501849 -0.53483514]
Gradient descent took over 122 million iterations, and the results from gradient descent and directly solving are nearly identical (conclusion: you generally shouldn’t use gradient descent to solve least squares without a good reason). Now let’s try flipping the signs of relevant features and doing projected gradient descent to enforce non-negative coefficients.
X[:,np.array([0,1,3,4])]=-X[:,np.array([0,1,3,4])] beta=np.zeros(p) grad = 100 i=0 while(np.linalg.norm(grad)>1e-2): grad = deriv_f(beta,X,y) beta = beta-1e-8*grad #this is the projection step beta[beta<0]=0 i+=1 if i%5000==0: print(i) print(beta) print(beta)
We can undo the sign reversals to examine the learned coefficients:
beta[np.array([0,1,3,4])]=-beta[np.array([0,1,3,4])] X[:,np.array([0,1,3,4])]=-X[:,np.array([0,1,3,4])] print(i) print(beta) print(np.dot(np.linalg.inv(np.dot(X.T,X)),np.dot(X.T,y))) 116085845 [-1.41321706 -0.06157611 5.24849 -0.00501863 -0.53485144] [-1.41492797 -0.06157926 5.2487206 -0.00501849 -0.53483514]
This took approximately 116 million iterations, and also learned very close parameters to solving linear least squares using matrix inversion.
In this post we described projected gradient descent for constrained convex optimization. We described the special case algorithm for non-negative least squares and applied it to the Boston housing price dataset. We found the learned model very similar. Using projection for non-negative least squares is thus likely more useful when effects are small and noise is sufficiently high that the learned model is likely to get the coefficient sign incorrect.