In this post we describe the high-level idea behind gradient descent for convex optimization. Much of the intuition comes from Nisheeth Vishnoi’s short course, but he provides a more theoretical treatment, while we aim to focus more on the intuition. We first describe why to use gradient descent. We next define convex sets and functions and then describe the intuitive idea behind gradient descent. We follow this with a toy example and some discussion.
Why Use Gradient Descent?
In many data analysis problems we want to minimize some functions: for instance the negative log-likelihood. However, the function either lacks a closed form solution or becomes very expensive to compute for large datasets. For instance, logistic regression lacks a closed form solution, while the naive closed form solution to linear regression requires solving a linear system, which may have stability issues. For the former, gradient descent provides us a method for solving the problem, while for the latter, gradient descent allows us to avoid these stability issues.
Convex Sets and Convex Functions
Much of the practical application and most of the theory for gradient descent involves convex sets and functions. Intuitively, a convex set is one where for any two points in a set, every point between them is also in the set. For example, consider a line in 1d from points





More formally, a convex set

![Rendered by QuickLaTeX.com \lambda\in [0,1],\lambda x+(1-\lambda)y\in \mathcal{K}](https://boostedml.com/wp-content/ql-cache/quicklatex.com-93db38b8e2d1fbfc6ff77ac57c1e3ca0_l3.png)
![Rendered by QuickLaTeX.com \mathcal{K}=[a,b]](https://boostedml.com/wp-content/ql-cache/quicklatex.com-f651e11fb842b8518a69560863c2ff17_l3.png)

In the 1d case, a convex function is one where if you draw a line segment between the function evaluated at any two points, the line lies at or above the function everywhere in between. Let’s look at an example: the function

We see that the line segment joining the points




![Rendered by QuickLaTeX.com \lambda\in [0,1]](https://boostedml.com/wp-content/ql-cache/quicklatex.com-d68729d1cbd0e41931a88fa0d2ee423f_l3.png)


(1)
here



An equivalent definition of convexity for differentiable functions is that the tangent line or hyperplane to the function at any point lies below the function everywhere. The following plot of the same function illustrates this

Again, we can see from looking at it that this property should hold for the function at any point. More formally, a differentiable function

(2)
where

Gradient Descent: Idea
The goal of gradient descent is to iteratively find the minimizer of




(3)
We want to minimize the right hand side in order to reduce






We can note that the gradient is the direction in which the function grows fastest. Since we want to minimize, we want to move in the opposite direction of the gradient. Thus we can set
(4)
where


Let’s now look at example. Consider again the function



Now considering the fixed



(5)
Why are Convex Functions Important for Gradient Descent?
For differentiable convex functions, the following three properties are equivalent:
is a local minimum of
is a global minimum of
For gradient descent this is helpful because we can check when to terminate the algorithm by looking at the derivative and checking its magnitude. Further, we know that under this termination we have achieved (close to) the global minimum.
A Gradient Descent Example
Let’s try minimizing
import numpy as np from matplotlib import pyplot as plt def f(x): return x**4 def deriv_f(x): return 4*(x**3)
Then we will write code to run and plot the gradients over iterations.
x=7.5 gradient_magnitudes=[] while(np.abs(deriv_f(x))>1e-4): grad = deriv_f(x) x=x-1e-3*grad gradient_magnitudes.append(grad) plt.plot(gradient_magnitudes) plt.xlabel('iteration') plt.ylabel('gradient magnitude') plt.title('Gradient Magnitude of Gradient Descent')

We can see fast convergence initially, but as the slope gets smaller, convergence gets very slow: the algorithm needs over 140,000 iterations with the learning rate we set. Further, if we increase the learning rate by a factor of 10, gradient descent diverges. In future posts we will investigate methods to deal with this.
Discussion
In this post we discussed the intuition behind gradient descent. We first defined convex sets and convex functions, then described the idea behind gradient descent: moving in the direction opposite the direction with the largest rate of increase. We then described why this is useful for convex functions, and finally showed a toy example.