The Nyström Method for Finding Eigenpairs of a Kernel Function

In this post we describe the Nyström method for finding the eigenvalues and eigenfunctions of a kernel function. This has applications in a wide range of settings, from Gaussian process modeling to functional data analysis to kernel methods. We first describe the motivation in terms of finding the posterior mean of a Gaussian process. Next we describe how access to the eigenvalues and eigenfunctions of the kernel enables a faster approximate solution. We then frame the eigenproblem as an integral equation and describe the Nyström method.

Motivation

In many machine learning problems involving kernels, the naive solution requires inverting a matrix computed by evaluating a kernel at all pairs of observed points. For instance, say you want to estimate the posterior mean of a Gaussian process, where the prior has mean 0 and covariance function K(t,s) for two time points t and s. If \boldsymbol{t} is the set of n training time points with observations \boldsymbol{y} and t_* is the set of test time points, then the posterior mean is

(1)   \begin{align*}\mu_{*|n}&=K(t_*,\boldsymbol{t})[\sigma^2 I+K(\boldsymbol{t},\boldsymbol{t})]^{-1}\boldsymbol{y}\end{align*}

However, this involves forming the covariance matrix K(\boldsymbol{t},\boldsymbol{t}) between all pairs of training time points and inverting it, an O(n^3) operation. The same issue comes up in kernel ridge regression, and very similar issues arise in other kernel methods.
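For concreteness, here is a sketch of the naive computation in numpy. The RBF kernel, the synthetic sine data, and all parameter values are illustrative choices, not part of the method itself:

```python
import numpy as np

# Hypothetical squared-exponential (RBF) kernel, for illustration only.
def k(t, s, lengthscale=1.0):
    return np.exp(-0.5 * np.subtract.outer(t, s) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
n = 200
t_train = np.sort(rng.uniform(0.0, 10.0, n))          # training time points
y = np.sin(t_train) + 0.1 * rng.standard_normal(n)    # noisy observations
t_star = np.linspace(0.0, 10.0, 50)                   # test time points
sigma2 = 0.1**2                                       # noise variance

# Naive posterior mean, equation (1): forms and solves with the full
# n x n kernel matrix, costing O(n^3).
K_nn = k(t_train, t_train)
mu_star = k(t_star, t_train) @ np.linalg.solve(sigma2 * np.eye(n) + K_nn, y)
```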

A reasonable question is: can we approximate the inverse [\sigma^2 I+K(\boldsymbol{t},\boldsymbol{t})]^{-1} with a lower order dependence on n?

Approximating the Inverse

Mercer’s Theorem

The key idea towards approximating the inverse efficiently is approximating the kernel. To do so we recall Mercer’s theorem, which describes how we can express a continuous symmetric positive semi-definite kernel in terms of eigenvalues and eigenfunctions.

Theorem: (Mercer’s Theorem) Let k be a continuous symmetric positive semi-definite kernel on \mathcal{T}\times\mathcal{T}, where \mathcal{T} is equipped with a measure \mu. Then there exists an orthonormal basis \phi_i of L^2(\mathcal{T},\mu) and corresponding non-negative weights \lambda_i such that

(2)   \begin{align*}k(s,t)&=\sum_{i=1}^\infty \lambda_i \phi_i(s)\phi_i(t)\end{align*}

where \lambda_i,\phi_i, the eigenvalues and eigenfunctions, respectively, are the solutions to

(3)   \begin{align*}\lambda \phi(t)&=\int_{\mathcal{T}}k(t,s)\phi(s)d\mu(s)\end{align*}
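As a concrete check of (3), the Brownian-motion covariance kernel k(s,t)=\min(s,t) on \mathcal{T}=[0,1] with Lebesgue measure is a standard example with known eigenpairs \lambda_i = 1/((i-\tfrac{1}{2})^2\pi^2) and \phi_i(t)=\sqrt{2}\sin((i-\tfrac{1}{2})\pi t). A short numerical sketch (midpoint-rule quadrature) verifies the integral equation at a test point:

```python
import numpy as np

# Known eigenpair of k(s, t) = min(s, t) on [0, 1] with Lebesgue measure:
# lambda_i = 1 / ((i - 1/2)^2 pi^2), phi_i(t) = sqrt(2) sin((i - 1/2) pi t).
i = 1
lam = 1.0 / ((i - 0.5) ** 2 * np.pi**2)
phi = lambda t: np.sqrt(2.0) * np.sin((i - 0.5) * np.pi * t)

# Midpoint-rule approximation of the integral in equation (3) at t0.
t0 = 0.37
s = (np.arange(200_000) + 0.5) / 200_000
lhs = lam * phi(t0)                          # lambda * phi(t0)
rhs = np.mean(np.minimum(t0, s) * phi(s))    # integral of k(t0, s) phi(s) ds
```

The two sides agree to quadrature accuracy, as equation (3) requires.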

Using Mercer’s Theorem to Approximate the Inverse

Since \lambda_i\rightarrow 0 (the eigenvalues must decay to zero for the series to converge), if we knew the eigenvalues and eigenfunctions we could approximate the kernel with a finite sum over the l leading terms

(4)   \begin{align*}k(s,t)\approx \sum_{i=1}^l\lambda_i \phi_i(s)\phi_i(t)\end{align*}

We can then express the matrix

(5)   \begin{align*}K(\boldsymbol{t},\boldsymbol{t})&\approx \sum_{i=1}^l \lambda_i \left(\begin{array}{c}\phi_i(t_1)\\\vdots\\\phi_i(t_n)\end{array}\right)\left(\begin{array}{ccc}\phi_i(t_1)& \cdots &\phi_i(t_n)\end{array}\right)\end{align*}

This is a sum of rank-one matrices, so \sigma^2 I + K(\boldsymbol{t},\boldsymbol{t}) can be cheaply inverted using the Woodbury matrix identity. However, we do not have direct access to the eigenvalues and eigenfunctions. The Nyström method approximates them by discretizing the integral in (3). We can then use the approximate eigenpairs to approximate the kernel matrix as the above sum of rank-one matrices, which can be cheaply inverted.
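To see why the low-rank form helps, here is a sketch of solving [\sigma^2 I+\Phi\Lambda\Phi^\top]x=\boldsymbol{y} via the Woodbury matrix identity, which trades the n \times n inverse for an l \times l one. The factors here are random placeholders standing in for the (as yet unknown) eigenvalues and eigenfunction evaluations:

```python
import numpy as np

def woodbury_solve(sigma2, Phi, lam, y):
    """Solve (sigma2 * I + Phi @ diag(lam) @ Phi.T) x = y in O(n l^2)
    via the Woodbury identity; Phi is n x l, lam has l positive entries."""
    # l x l "capacitance" matrix: diag(1/lam) + Phi^T Phi / sigma2
    C = np.diag(1.0 / lam) + Phi.T @ Phi / sigma2
    # Woodbury: (A + U C0 U^T)^{-1} = A^{-1} - A^{-1} U (C0^{-1} + U^T A^{-1} U)^{-1} U^T A^{-1}
    return y / sigma2 - Phi @ np.linalg.solve(C, Phi.T @ y) / sigma2**2

# Sanity check against a direct O(n^3) solve on random low-rank factors.
rng = np.random.default_rng(1)
n, l = 100, 5
Phi = rng.standard_normal((n, l))
lam = rng.uniform(0.5, 2.0, l)
y = rng.standard_normal(n)
sigma2 = 0.25
x_direct = np.linalg.solve(sigma2 * np.eye(n) + Phi @ np.diag(lam) @ Phi.T, y)
x_fast = woodbury_solve(sigma2, Phi, lam, y)
```

The fast solve only ever inverts the l \times l capacitance matrix, so the cost is O(nl^2) instead of O(n^3).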

Nyström Method

The Nyström method involves approximating the integral by Monte Carlo: sampling s_1,\ldots,s_l from \mu gives \int_{\mathcal{T}}k(t,s)\phi(s)d\mu(s)\approx \frac{1}{l}\sum_{i=1}^l k(t,s_i)\phi(s_i), and we then solve

(6)   \begin{align*}\lambda \phi(t)&=\frac{1}{l}\sum_{i=1}^l k(t,s_i)\phi(s_i)\end{align*}

Substituting t with s_1,\cdots,s_l gives us

(7)   \begin{align*}\frac{1}{l}\left(\begin{array}{ccc}k(s_1,s_1)& \cdots & k(s_1,s_l)\\\vdots & \ddots & \vdots\\k(s_l,s_1) &\cdots &k(s_l,s_l)\end{array}\right)\left(\begin{array}{c}\phi(s_1)\\\vdots\\\phi(s_l)\end{array}\right)&=\lambda \left(\begin{array}{c}\phi(s_1)\\\vdots\\\phi(s_l)\end{array}\right)\end{align*}

We can then find \tilde{\lambda}_i,\tilde{\phi}_i(s_j) for i,j=1,\cdots,l, approximations to the true eigenpairs, by solving for the eigenvalues and eigenvectors of the scaled kernel matrix. One detail: to match the normalization \int_{\mathcal{T}}\phi_i(s)^2 d\mu(s)=1 of the true eigenfunctions, each unit-norm eigenvector u_i should be rescaled so that \tilde{\phi}_i(s_j)=\sqrt{l}\,(u_i)_j, giving \frac{1}{l}\sum_{j=1}^l \tilde{\phi}_i(s_j)^2=1. However, we want \tilde{\phi}_i evaluated at arbitrary points t. We obtain this via

(8)   \begin{align*}\tilde{\phi}_i(t)&=\frac{1}{l\tilde{\lambda}_i}\sum_{j=1}^l k(t,s_j)\tilde{\phi}_i(s_j)\end{align*}
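A minimal numpy sketch of this procedure, assuming a hypothetical RBF kernel and the uniform measure on [0,1] (all kernel and parameter choices here are illustrative):

```python
import numpy as np

def k(t, s, lengthscale=0.5):
    # Hypothetical RBF kernel, for illustration only.
    return np.exp(-0.5 * np.subtract.outer(t, s) ** 2 / lengthscale**2)

rng = np.random.default_rng(2)
l = 50
s = rng.uniform(0.0, 1.0, l)                # samples s_1, ..., s_l from mu

# Eigendecomposition of the scaled l x l kernel matrix, equation (7).
lam_tilde, U = np.linalg.eigh(k(s, s) / l)
order = np.argsort(lam_tilde)[::-1]         # largest eigenvalues first
lam_tilde, U = lam_tilde[order], U[:, order]
# Rescale the unit-norm eigenvectors so (1/l) * sum_j phi_i(s_j)^2 = 1,
# matching the L^2(mu) normalization of the true eigenfunctions.
phi_at_s = np.sqrt(l) * U

def phi_tilde(t, i):
    """Nystrom extension of the i-th approximate eigenfunction, equation (8)."""
    return k(t, s) @ phi_at_s[:, i] / (l * lam_tilde[i])
```

Evaluating the extension at the sample points recovers the eigenvector values, as equation (8) requires.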

Putting it All Together

Recall that we approximated the kernel k(s,t)\approx \sum_{i=1}^l\lambda_i \phi_i(s)\phi_i(t) with a finite sum, but lacked the \lambda_i and \phi_i. We now have approximations \tilde{\lambda}_i and \tilde{\phi}_i from the Nyström method to plug in. We can then compute an approximate kernel matrix using

(9)   \begin{align*}\tilde{K}(\boldsymbol{t},\boldsymbol{t})&=\sum_{i=1}^l \tilde{\lambda}_i \left(\begin{array}{c}\tilde{\phi}_i(t_1)\\\vdots\\\tilde{\phi}_i(t_n)\end{array}\right)\left(\begin{array}{ccc}\tilde{\phi}_i(t_1)& \cdots &\tilde{\phi}_i(t_n)\end{array}\right)\end{align*}

Now recalling that

(10)   \begin{align*}\mu_{*|n}&=K(t_*,\boldsymbol{t})[\sigma^2 I+K(\boldsymbol{t},\boldsymbol{t})]^{-1}\boldsymbol{y}\end{align*}

we can approximate it as

(11)   \begin{align*}\mu_{*|n}&\approx K(t_*,\boldsymbol{t})[\sigma^2 I+\tilde{K}(\boldsymbol{t},\boldsymbol{t})]^{-1}\boldsymbol{y}\end{align*}

where the term [\sigma^2 I+\tilde{K}(\boldsymbol{t},\boldsymbol{t})]^{-1} is inexpensive to apply: using the Woodbury matrix identity on the rank-l approximation, it costs O(nl^2) rather than O(n^3).
