Projected Gradient Descent for Constrained Optimization

In this post we describe how to do gradient descent with constraints. We first set up the problem, explain why we can’t naively apply gradient descent, and give a few cases where constraints are needed. We then describe projected gradient descent and apply it to the popular Boston housing prices dataset to do non-negative least squares. We finish with a brief discussion.

Problem

We are interested in the problem of minimizing a convex, differentiable function f(x) over a compact convex set \mathcal{K}. This has many applications, including constrained least squares (linear regression with constraints on the parameters), logistic regression with constraints, and certain types of boosting and SVMs. For example, consider non-negative linear regression, where we know the parameters should all be non-negative: more rooms, for instance, should have a positive effect on the price of a house. While ordinary least squares will eventually learn this, imposing the constraint structure can improve learning in small samples.

A general method for optimizing convex functions is gradient descent, which iteratively updates our x as follows:

(1)   \begin{align*}x_{t+1}&=x_t-\eta_t \nabla f(x_t)\end{align*}

Here \eta_t is the step size. However, even if x_t is in the constraint set \mathcal{K}, there is no guarantee that x_{t+1} is. There are two very common ways to handle this: one is to project x_t-\eta_t \nabla f(x_t) back onto \mathcal{K}, and another is to minimize a linear approximation of f around x_t over \mathcal{K}. In this post we use the first approach.
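
As a minimal sketch of the projection approach, here is what the update loop looks like in Python; grad_f and proj_K are hypothetical placeholders for \nabla f and the projection onto \mathcal{K}:

def projected_gradient_descent(x0, grad_f, proj_K, eta=1e-3, n_steps=1000):
    # grad_f(x): returns the gradient of f at x
    # proj_K(x): maps x to the nearest point in the constraint set K
    x = x0
    for t in range(n_steps):
        x = proj_K(x - eta*grad_f(x))  # gradient step, then project back onto K
    return x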

Projected Gradient Descent for Non-negative Least Squares

Consider again non-negative least squares, where the coefficients cannot be negative. One imposes this constraint because of domain knowledge about the problem: for instance, more rooms will not lower the price of a house, and similarly an effect that represents a count cannot be negative. If \boldsymbol{\beta}\in \mathbb{R}^p, the goal is to find

(2)   \begin{align*}\arg\min_{\boldsymbol{\beta}\in \mathbb{R}^p:\beta_i\geq 0,1\leq i\leq p}\Vert \boldsymbol{X}\boldsymbol{\beta}-\boldsymbol{y}\Vert^2\end{align*}

While we don’t need to enforce the constraint, doing so can help improve learning in small samples. One way to handle it is via projected gradient descent: take one gradient descent step, then find the closest non-negative point in \mathbb{R}^p to the result. More formally, let \mathcal{K}=\{\boldsymbol{\beta}\in \mathbb{R}^p:\beta_i\geq 0,1\leq i\leq p\} and let \Pi_{\mathcal{K}}(\boldsymbol{\beta})=\textrm{proj}_{\mathcal{K}}(\boldsymbol{\beta}) denote the projection onto \mathcal{K}. Then the projected gradient descent update is

(3)   \begin{align*}\boldsymbol{\beta}_{t+1}&=\Pi_{\mathcal{K}}(\boldsymbol{\beta}_t-\eta_t \nabla f(\boldsymbol{\beta}_t))\\&=[\boldsymbol{\beta}_t-\eta_t \nabla f(\boldsymbol{\beta}_t)]_+\end{align*}

Here the operator [\cdot]_+ simply zeroes out the negative elements of a vector.
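
For non-negative constraints the projection has a simple closed form: clip the negative entries to zero. As a quick illustration in numpy (the vector v below is made up purely for this example):

import numpy as np

v = np.array([1.5, -0.3, 0.0, -2.0])  # arbitrary example vector
projected = np.maximum(v, 0)          # [.]_+ as a new array: negatives become 0
v[v < 0] = 0                          # equivalent in-place version, used in the code below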

Application: Non-negative Least Squares for Boston Housing Prices

Let’s apply this to the Boston housing prices dataset. We will use four features: crime (per capita crime rate), rooms (average number of rooms per dwelling), tax (property tax rate), and lstat (percentage of the population of lower socioeconomic status). We expect the only positive coefficient to be rooms, so we flip the sign of every other column (including the intercept column) and then do projected gradient descent. We first load some libraries, write a function to calculate the gradient of the least squares objective, and load the data.

# note: load_boston is removed in recent scikit-learn releases
from sklearn.datasets import load_boston
import numpy as np
np.set_printoptions(suppress=True)

# gradient of the least squares objective ||X beta - y||^2
def deriv_f(beta, X, y):
    return -2*np.dot(X.T, y) + 2*np.dot(np.dot(X.T, X), beta)

dataset = load_boston()
old_X, y = dataset.data, dataset.target
print(dataset['feature_names'])

Before running projected gradient descent, we can first run naive gradient descent and then compare the convergence empirically.

# design matrix: a column of ones (intercept) plus crime, rooms, tax, and lstat
X = np.ones((old_X.shape[0], 5))
X[:, 1:5] = old_X[:, np.array([0, 5, 9, 12])]
p = np.shape(X)[1]

# plain (unconstrained) gradient descent with a fixed step size
beta = np.zeros(p)
grad = 100  # dummy value so the loop runs at least once
i = 0
while np.linalg.norm(grad) > 1e-2:
    grad = deriv_f(beta, X, y)
    beta = beta - 1e-8*grad
    i += 1
    if i % 5000 == 0:
        print(i)
        print(beta)
print(beta)

As a unit test, we can check the results against naively solving least squares via matrix inversion (note: you generally should not invert a matrix to solve linear least squares; however, for relatively small problems where the features are not highly correlated it can be fine to do so).

print(i)
print(beta)
print(np.dot(np.linalg.inv(np.dot(X.T,X)),np.dot(X.T,y)))

122627056
[-1.41321706 -0.06157611  5.24849    -0.00501863 -0.53485144]
[-1.41492797 -0.06157926  5.2487206  -0.00501849 -0.53483514]
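
As an aside, if you want the exact unconstrained solution, a more numerically stable route than explicit matrix inversion is a dedicated least-squares solver. A quick alternative check using numpy (not part of the workflow above, just a sanity check):

# same solution as the matrix-inversion check, computed via a stable factorization
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)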

Gradient descent took over 122 million iterations, and the results from gradient descent and directly solving are nearly identical (conclusion: you generally shouldn’t use gradient descent to solve least squares without a good reason). Now let’s try flipping the signs of relevant features and doing projected gradient descent to enforce non-negative coefficients.

# flip the sign of every column except rooms (including the intercept column)
X[:, np.array([0, 1, 3, 4])] = -X[:, np.array([0, 1, 3, 4])]

beta = np.zeros(p)
grad = 100  # dummy value so the loop runs at least once
i = 0
while np.linalg.norm(grad) > 1e-2:
    grad = deriv_f(beta, X, y)
    beta = beta - 1e-8*grad
    # projection step: zero out negative coefficients
    beta[beta < 0] = 0
    i += 1
    if i % 5000 == 0:
        print(i)
        print(beta)
print(beta)

We can undo the sign reversals to examine the learned coefficients:

# flip the affected coefficients (and columns) back to their original signs
beta[np.array([0, 1, 3, 4])] = -beta[np.array([0, 1, 3, 4])]
X[:, np.array([0, 1, 3, 4])] = -X[:, np.array([0, 1, 3, 4])]
print(i)
print(beta)
print(np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y)))

116085845
[-1.41321706 -0.06157611  5.24849    -0.00501863 -0.53485144]
[-1.41492797 -0.06157926  5.2487206  -0.00501849 -0.53483514]

This took approximately 116 million iterations and learned parameters very close to those obtained by solving linear least squares via matrix inversion.

Discussion

In this post we described projected gradient descent for constrained convex optimization. We described the special case algorithm for non-negative least squares and applied it to the Boston housing price dataset, where the constrained model we learned was very similar to the unconstrained least squares solution. Projection for non-negative least squares is therefore likely most useful when effects are small and the noise is high enough that an unconstrained model is likely to get a coefficient’s sign wrong.
