Notation
Assume we have the \(N \times D\) data matrix \(\mathbf{X}\), with \(N\) observations and \(D\) variables, where \(x_{i,j}\) represents the \(i\)th observation of the \(j\)th variable:
\[
\mathbf{X} = \begin{bmatrix}
x_{1,1} & \dots & x_{1,D} \\
\vdots & \ddots & \vdots \\
x_{i,1} & \dots & x_{i,D} \\
\vdots & \ddots & \vdots \\
x_{N,1} & \dots & x_{N,D}
\end{bmatrix}
\]
As in linear PCA, a more succinct way to represent this matrix is to write:
\[
\mathbf{X} = \begin{bmatrix} \vec{x}_1^T \\ \vdots \\ \vec{x}_i^T \\ \vdots \\ \vec{x}_N^T \end{bmatrix}
\]
We will once again treat \(\vec{x}_1, \dots, \vec{x}_N\) as i.i.d. samples of a random vector in \(\mathbb{R}^D\) from the same underlying distribution \(F\) for which \(E_{x \sim F} \|x\|^2 < \infty\), to ensure that the mean and covariance of the data matrix are well-defined. We use:
\[
\overline{x} = \frac{1}{N} \sum_{i = 1}^N \vec{x}_i
\]
to denote the sample mean of the \(\vec{x}_i\). To denote the covariance operator, we use:
\[
\mathbf{\Sigma}_x = \frac{1}{N} \sum_{i = 1}^N (\vec{x}_i - \overline{x})(\vec{x}_i - \overline{x})^T = \frac{1}{N} \left(\sum_{i = 1}^N \vec{x}_i \vec{x}_i^T \right) - \overline{x}\overline{x}^T = \frac{1}{N} \mathbf{X}^T \mathbf{X} - \overline{x} \overline{x}^T
\]
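As a quick numerical check of the identity above, here is a minimal NumPy sketch (the data matrix and dimensions are hypothetical) that computes the sample mean and confirms that the centered outer-product form of \(\mathbf{\Sigma}_x\) agrees with \(\frac{1}{N}\mathbf{X}^T\mathbf{X} - \overline{x}\,\overline{x}^T\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5                      # hypothetical sample size and dimension
X = rng.normal(size=(N, D))        # hypothetical N x D data matrix

x_bar = X.mean(axis=0)             # sample mean, shape (D,)

# Centered outer-product form: (1/N) * sum_i (x_i - x_bar)(x_i - x_bar)^T
Xc = X - x_bar
Sigma_centered = Xc.T @ Xc / N

# Equivalent form: (1/N) X^T X - x_bar x_bar^T
Sigma_shortcut = X.T @ X / N - np.outer(x_bar, x_bar)

assert np.allclose(Sigma_centered, Sigma_shortcut)
```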
This covariance operator will come into use later on. Additional notation includes:
- \(1_N\) is the column vector of all ones with length \(N\) (i.e. \(1_N \in \mathbb{R}^N\)).
- \(\| \vec{z} \|\) denotes the Euclidean norm (i.e. for \(\vec{z} \in \mathbb{R}^d\), \(\|\vec{z}\| = \left(\sum_{i = 1}^{d} z_i ^2 \right)^{1/2} = \sqrt{\vec{z}^T \vec{z}}\)).
- \(\Psi: \mathbb{R}^D \rightarrow \mathbb{R}^{d+1}\) is a projection map from space \(\mathbb{R}^D\) to \(\mathbb{R}^{d+1}\). Note: \(\Psi\) maps into \(\mathbb{R}^{d+1}\) rather than \(\mathbb{R}^{d}\) because a \(d\)-dimensional sphere sits inside a \((d+1)\)-dimensional affine subspace, one more dimension than the manifold itself has. One example is fitting a curve in \(\mathbb{R}^2\) with arcs of circles: while a circle sits in a 2-dimensional plane, the projected points live only along its 1-dimensional edge. As such, we include one extra dimension in the projection map to account for the ambient dimension of the sphere.
- \(d^2(\cdot, \cdot)\) is the squared distance between two points; that is, \(d^2(x, y) = \|x - y\|^2\).
Assumptions
There are three main assumptions made by Li and Dunson:
- Distributional Assumption. The data matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\) is generated as the transformation \(V^* \Lambda^{*\frac{1}{2}} \mathbf{Z}\) of a data matrix \(\mathbf{Z} \in \mathbb{R}^{N \times D}\) whose entries \(z_i\) are i.i.d. non-degenerate random variables.
- Moment Assumption. For the underlying random variables \(z_{i}\), \(E(z_{i}) = 0\), \(E(z_i^2) = 1\), and \(E(z_i^6) < \infty\). We can always choose the affine transformation \(V^*\) so that the underlying random variables satisfy the first two moment conditions. A similar assumption, with bounded fourth moments instead, is used by Lee et al. to prove convergence in high-dimensional PCA (S. Lee, Zou, and Wright 2010).
- Spike Population Model Assumption. If \(\lambda_1, \lambda_2, \dots, \lambda_D\) are the ordered eigenvalues of \(\Lambda^*\), as in a PCA setting, then there exists an integer \(m > d\) such that \(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge \lambda_{m+1} = \dots = \lambda_D = 1\).
The consequence of these assumptions is a theoretical guarantee that the estimated projection \(\hat{\Psi}\) converges in probability to \(\Psi^*\), uniformly over the manifold \(M\). In other words:
\[
\lim_{N \rightarrow \infty} P\left(\sup_{x \in M} \| \hat{\Psi}(x) - \Psi^*(x) \| > \epsilon\right) = 0, \quad \forall \epsilon > 0
\]
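To make these assumptions concrete, the sketch below simulates data from a spiked covariance model: i.i.d. standardized entries for \(\mathbf{Z}\), a diagonal \(\Lambda^*\) with \(m\) eigenvalues above 1 and the rest equal to 1, and an orthogonal \(V^*\). The specific spike sizes, the value of \(m\), and the Gaussian entries are illustrative choices, not part of Li and Dunson's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, m = 500, 10, 3                       # hypothetical sizes; m > d spikes

# Moment assumption: entries are i.i.d. with mean 0, variance 1,
# and finite sixth moment (standard normals satisfy this).
Z = rng.normal(size=(N, D))

# Spike population model: lambda_1 >= ... >= lambda_m >= lambda_{m+1} = ... = 1
lambdas = np.concatenate([np.array([8.0, 4.0, 2.0]), np.ones(D - m)])
Lambda_sqrt = np.diag(np.sqrt(lambdas))

# A random orthogonal V* (via QR), playing the role of the transformation.
V_star, _ = np.linalg.qr(rng.normal(size=(D, D)))

# Distributional assumption: each row satisfies x_i = V* Lambda*^{1/2} z_i.
X = Z @ Lambda_sqrt @ V_star.T
```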
Given a set of data \(\vec{x}_1, \dots, \vec{x}_N \in \mathbb{R}^D\) organized into data matrix \(\mathbf{X}\), spherical PCA estimates the best approximating sphere \(S_{V}(c,r)\), where \(c\) is the center, \(r\) is the radius, and \(V \in \mathbb{R}^{D \times (d+1)}\) is an orthonormal basis of the \((d+1)\)-dimensional affine subspace the sphere lives in.
For observation \(\vec{x}_i\), the closest point \(\vec{y}_i\) lying on the sphere \(S_V(c,r)\) is the point that minimizes the squared Euclidean distance \(d^2(\vec{x}_i, \vec{y}_i)\) between \(\vec{x}_i\) and \(\vec{y}_i\).
Continuing the convention that \(\Psi\) is an orthogonal projection from \(\mathbb{R}^D\) to the affine subspace \(c + V\), the estimated projection \(\hat{\Psi}\) is given by:
\[
\hat{\Psi}(\vec{x}_i) = \hat{c} + \frac{\hat{r}}{\| \hat{V}\hat{V}^T (\vec{x}_i - \hat{c}) \|}\hat{V}\hat{V}^T(\vec{x}_i - \hat{c})
\]
Note that \(\vec{x} - \Psi_{V,c}(\vec{x}) \perp \Psi_{V,c}(\vec{x}) - \vec{y}\) since \(\Psi\) is an orthogonal projection.
The optimal subspace \(V\) is given by \(\hat{V} = (\vec{v}_1, \dots, \vec{v}_{d+1})\), where \(\vec{v}_i\), \(i \in \{1,\dots,d+1\}\), is the \(i\)th eigenvector, ranked in descending order of eigenvalue, of \((\mathbf{X} - 1_N\bar{\mathbf{X}})^T(\mathbf{X} - 1_N\bar{\mathbf{X}})\). This mirrors linear PCA, where we eigendecompose the covariance matrix \(\mathbf{\Sigma}\) and use the eigenvectors, ordered by decreasing eigenvalue, to obtain the principal component loadings and scores. In fact, the matrix \((\mathbf{X} - 1_N\bar{\mathbf{X}})^T(\mathbf{X} - 1_N\bar{\mathbf{X}})\) is equivalent to:
\[
\begin{aligned}
(\mathbf{X} - 1_N\bar{\mathbf{X}})^T(\mathbf{X} - 1_N\bar{\mathbf{X}})&= \mathbf{X}^T \mathbf{X} - \bar{\mathbf{X}}^T1_N^T\mathbf{X} - \mathbf{X}^T 1_N \bar{\mathbf{X}} + \bar{\mathbf{X}}^T 1_N^T 1_N \bar{\mathbf{X}} \\
&= \mathbf{X}^T \mathbf{X} - N \overline{x}\,\overline{x}^T - N \overline{x}\,\overline{x}^T + N \overline{x}\,\overline{x}^T \\
&= \mathbf{X}^T \mathbf{X} - N \overline{x}\, \overline{x}^T \\
&= N \mathbf{\Sigma}_x
\end{aligned}
\]
In other words, the first step of spherical PCA is to eigendecompose the scaled covariance matrix \(N\mathbf{\Sigma}_x\)! This is where the “PCA” aspect of spherical PCA arises.
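A minimal sketch of this first step, assuming a data matrix `X` stored as an \(N \times D\) NumPy array and a hypothetical target dimension `d`:

```python
import numpy as np

def leading_subspace(X, d):
    """Return x_bar and V_hat = (v_1, ..., v_{d+1}), the top d+1 eigenvectors
    of (X - 1_N x_bar^T)^T (X - 1_N x_bar^T), ordered by decreasing eigenvalue."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    # eigh returns eigenvalues in ascending order for the symmetric matrix Xc^T Xc
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(eigvals)[::-1]        # descending eigenvalues
    V_hat = eigvecs[:, order[:d + 1]]        # D x (d+1) orthonormal basis
    return x_bar, V_hat
```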
If \(\vec{z}_i = \bar{\mathbf{X}} + \hat{V}\hat{V}^T(\vec{x}_i - \bar{\mathbf{X}})\) is the projection of \(\vec{x}_i\) onto the affine subspace \(\bar{\mathbf{X}} + \hat{V}\), then it can be shown that the minimizing pair \((\vec{\eta}^*, \xi^*)\) of the loss function \(g(\vec{\eta}, \xi) = \sum_{i = 1}^N\left(\vec{z}_i^T \vec{z}_i + \vec{\eta}^T \vec{z}_i + \xi\right)^2\) is:
\[
\begin{gathered}
\vec{\eta}^* = -H^{-1}\vec{\omega} \\
\xi^* = -\frac{1}{N} \sum_{i=1}^N\left(\vec{z}_i^T \vec{z}_i + \vec{\eta}^{*T}\vec{z}_i\right)
\end{gathered}
\]
where \(H\) and \(\vec{\omega}\) are defined as:
\[
\begin{aligned}
H &= \sum_{i = 1}^N (\vec{z}_i - \overline{z})(\vec{z}_i - \overline{z})^T \\
&= N \mathbf{\Sigma}_z \\
\vec{\omega} &= \sum_{i=1}^N \left( \vec{z}_i^T\vec{z}_i - \frac{1}{N}\sum_{j=1}^N\vec{z}_j^T\vec{z}_j \right) (\vec{z}_i - \overline{z})
\end{aligned}
\]
The matrix \(H\) is \(N\) times the sample covariance matrix of the projected coordinates \(\mathbf{Z}\). The vector \(\vec{\omega}\) is a weighted sum of the centered coordinates \(\vec{z}_i - \overline{z}\), where the \(i\)th weight is the centered squared magnitude \(\vec{z}_i^T\vec{z}_i - \frac{1}{N}\sum_{j}\vec{z}_j^T\vec{z}_j\).
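Continuing the sketch, one way to implement this step is to work in the \((d+1)\)-dimensional coordinates \(\hat{V}^T(\vec{x}_i - \overline{x})\) of the affine subspace; since \(\hat{V}\) is orthonormal, inner products and norms are preserved, and \(H\) becomes an invertible \((d+1)\times(d+1)\) matrix. The function below is an illustrative implementation of the formulas above under that choice.

```python
import numpy as np

def sphere_fit_coefficients(X, x_bar, V_hat):
    """Compute z_i (in subspace coordinates), eta*, and xi* for the loss g(eta, xi)."""
    # Coordinates of the projected points within the affine subspace x_bar + V_hat.
    Z = (X - x_bar) @ V_hat                    # shape (N, d+1)
    z_bar = Z.mean(axis=0)
    Zc = Z - z_bar

    sq = np.sum(Z * Z, axis=1)                 # z_i^T z_i for each observation
    H = Zc.T @ Zc                              # N * Sigma_z
    omega = Zc.T @ (sq - sq.mean())            # sum_i (z_i^T z_i - mean)(z_i - z_bar)

    eta = -np.linalg.solve(H, omega)           # eta* = -H^{-1} omega
    xi = -np.mean(sq + Z @ eta)                # xi* = -(1/N) sum_i (z_i^T z_i + eta*^T z_i)
    return Z, eta, xi
```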
The optimal parametrization \((\hat{V}, \hat{c}, \hat{r})\) of the projection of \(\mathbf{X} \in \mathbb{R}^{N \times D}\) onto the sphere \(S_V(c, r)\) is:
\[
\begin{gathered}
\hat{V} = (\vec{v}_1, \dots, \vec{v}_{d + 1}) \\
\hat{c} = -\frac{\vec{\eta}^*}{2} \\
\hat{r} = \frac{1}{N} \sum_{i = 1}^N \left\|\vec{z}_i - \hat{c} \right\|
\end{gathered}
\]
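In the same (assumed) subspace coordinates, the center and radius follow directly from the formulas above; the center is then mapped back into \(\mathbb{R}^D\) as \(\overline{x} + \hat{V}(-\vec{\eta}^*/2)\). The helper names reuse the hypothetical sketches above.

```python
import numpy as np

def sphere_parameters(X, d):
    """Fit S_V(c, r): returns V_hat, c_hat (in R^D), and r_hat."""
    x_bar, V_hat = leading_subspace(X, d)                  # sketch above
    Z, eta, xi = sphere_fit_coefficients(X, x_bar, V_hat)  # sketch above

    c_local = -eta / 2.0                                   # c_hat = -eta*/2 in subspace coordinates
    r_hat = np.mean(np.linalg.norm(Z - c_local, axis=1))   # (1/N) sum_i ||z_i - c_hat||
    c_hat = x_bar + V_hat @ c_local                        # center expressed in R^D
    return V_hat, c_hat, r_hat
```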
The projection map \(\hat{\Psi}\) of data matrix \(\mathbf{X}\) onto the sphere \(S_{\hat{V}}(\hat{c}, \hat{r})\) projects each point onto the affine subspace \(\hat{c} + \hat{V}\) and then rescales it to radius \(\hat{r}\) around \(\hat{c}\):
\[
\hat{\Psi}(\vec{x}_i) = \hat{c} + \frac{\hat{r}}{\| \hat{V}\hat{V}^T (\vec{x}_i - \hat{c}) \|}\hat{V}\hat{V}^T(\vec{x}_i - \hat{c})
\]
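Finally, a sketch of the projection map itself, applied row by row with the hypothetical helpers above:

```python
import numpy as np

def spherical_projection(X, V_hat, c_hat, r_hat):
    """Psi_hat(x_i) = c_hat + r_hat * V V^T (x_i - c_hat) / ||V V^T (x_i - c_hat)||."""
    diff = (X - c_hat) @ V_hat @ V_hat.T               # rows are V V^T (x_i - c_hat)
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    return c_hat + r_hat * diff / norms

# Usage with the sketches above (d is a hypothetical intrinsic dimension):
# V_hat, c_hat, r_hat = sphere_parameters(X, d=1)
# X_on_sphere = spherical_projection(X, V_hat, c_hat, r_hat)
```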