path: root/marginal.tex
author     Jaron Kent-Dobias <jaron@kent-dobias.com>    2024-06-27 16:54:56 +0200
committer  Jaron Kent-Dobias <jaron@kent-dobias.com>    2024-06-27 16:54:56 +0200
commit     f8e31ea038b207f4a5fee131923a09af952c41c7 (patch)
tree       2b53e48ae89e2f03540de93332c8a46f9c413248 /marginal.tex
parent     2914e5d4639a5a9b34d3026fcd86b851ddd63a6d (diff)
Small rewriting and edits in the first two sections.
Diffstat (limited to 'marginal.tex')
-rw-r--r--  marginal.tex  105
1 files changed, 56 insertions, 49 deletions
diff --git a/marginal.tex b/marginal.tex
index 59b92e0..c1b5a82 100644
--- a/marginal.tex
+++ b/marginal.tex
@@ -35,9 +35,9 @@
conditioning the statistics of stationary points in random landscapes on
their marginality, and apply it in three isotropic settings with
qualitatively different structure: in the spherical spin-glasses, where the
- energy is Gaussian and its Hessian is GOE; in a multispherical spin glasses,
- which are Gaussian but non-GOE; and in spherical random nonlinear sum of
- squares, which is non-Gaussian. In these problems we are able to fully
+ energy is Gaussian and its Hessian is GOE; in multispherical spin glasses,
+ which are Gaussian but non-GOE; and in sums of squared spherical random
+ functions, which are non-Gaussian. In these problems we are able to fully
characterize the distribution of marginal optima in the landscape, including
when they are in the minority.
\end{abstract}
@@ -47,23 +47,23 @@
\section{Introduction}
Systems with rugged landscapes are important across many disciplines, from the
-physics to glasses and spin-glasses to statistical inference problems. The
+physics of glasses and spin-glasses to statistical inference problems \cite{Ros_2023_High-dimensional}. The
behavior of these systems is best understood when equilibrium or optimal
solutions are studied and weighted averages can be taken statically over all
possible configurations. However, such systems are also infamous for their
tendency to defy equilibrium and optimal expectations in practice, due to the
presence of dynamic transitions or crossovers that leave physical or
-algorithmic dynamics stuck exploring only a subset of configurations.
+algorithmic dynamics stuck exploring only a subset of configurations \cite{Biroli_2013_Perspective, Krzakala_2007_Landscape}.
In mean-field settings, it was long thought that physical and many algorithmic
dynamics would get stuck at a specific energy level, called the threshold
energy. The threshold energy is the energy level at which level sets of the
landscape transition from containing mostly saddle points to containing mostly
-minima. At this threshold, the level set contains mostly \emph{marginal
+minima. The level set associated with this threshold energy contains mostly \emph{marginal
minima}, or minima that have a pseudogap in the spectrum of their Hessian.
However, recent work found that the threshold energy is not important even for
-simple gradient descent dynamics \cite{Folena_2020_Rethinking, Folena_2023_On}.
+simple gradient descent dynamics \cite{Folena_2020_Rethinking, Folena_2023_On, ElAlaoui_2020_Algorithmic}.
Depending on the initial condition of the system and the nature of the
dynamics, the energy reached can be above or below the threshold energy, while
in some models the threshold energy is completely inaccessible to any dynamics
@@ -71,16 +71,16 @@ in some models the threshold energy is completely inaccessible to any dynamics
energy level that many simple algorithms will reach, the results all share one
commonality: the minima found are still marginal, despite being in the minority
compared to stiff minima or saddle points. This ubiquity of behavior suggests
-that the distribution of marginal minima can bound out-of-equilibrium dynamical
+that the distribution of marginal minima can be used to bound out-of-equilibrium dynamical
behavior.
It is not straightforward to condition on the marginality of minima using the
traditional methods for analyzing the distribution of minima in rugged
landscapes. Using the method of a Legendre transformation of the Parisi
parameter corresponding to a set of real replicas, one can force the result to
-correspond with marginal minima by tuning the value of that parameter, but this
-results in only the threshold energy and cannot characterize marginal minima at
-energies where they are a minority \cite{Monasson_1995_Structural}.
+correspond with marginal minima by tuning the value of that parameter \cite{Monasson_1995_Structural}. However, this
+yields only a characterization of the threshold energy and cannot describe
+marginal minima at other energies where they are a minority.
The alternative approach, used to great success in the spherical spin glasses, is to
start by making a detailed understanding of the Hessian matrix at stationary
@@ -111,12 +111,12 @@ pseudogap, and is therefore marginal. We demonstrate the method on the
spherical spin glasses, where it is unnecessary but instructive, and on extensions of
the spherical models with non-GOE Hessians where the technique is more useful.
In a companion paper, we compare the marginal complexity with the performance
-of gradient descent and approximate message passing \cite{Kent-Dobias_2024_Algorithm-independent}.
+of gradient descent and approximate message passing algorithms \cite{Kent-Dobias_2024_Algorithm-independent}.
-In Section \ref{sec:eigenvalue} we introduce the technique for conditioning on
+An outline of this paper follows. In Section \ref{sec:eigenvalue} we introduce the technique for conditioning on
the smallest eigenvalue and how to extend it to further condition on the
presence of a pseudogap. We provide a simple but illustrative example using a
-shifted GOE matrix. In Section \ref{sec:marginal.complexity} we apply this
+GOE matrix with a shifted diagonal. In Section \ref{sec:marginal.complexity} we apply this
technique to the problem of characterizing marginal minima in random
landscapes. The following Section \ref{sec:examples} gives several examples of
the marginal complexity applied to specific models of increasing difficulty.
@@ -135,8 +135,9 @@ at the bottom of the spectrum.
\subsection{The general method}
-An arbitrary function $g$ of the minimum eigenvalue of a matrix $A$ can be
-expressed as
+Consider an $N\times N$ real matrix $A$. An arbitrary function $g$ of the
+minimum eigenvalue of $A$ can be expressed using integrals over $\mathbf
+s\in\mathbb R^N$ as
\begin{equation} \label{eq:λmin}
g(\lambda_\textrm{min}(A))
=\lim_{\beta\to\infty}\int
@@ -163,15 +164,16 @@ of $A$. This produces
&=g(\lambda_\mathrm{min}(A))
\end{aligned}
\end{equation}
+as desired.
The first relation extends a technique first introduced in
-\cite{Ikeda_2023_Bose-Einstein-like} and used in
+\cite{Ikeda_2023_Bose-Einstein-like} and later used in
\cite{Kent-Dobias_2024_Arrangement} in the context of random landscapes. A Boltzmann distribution is introduced
over a spherical model whose Hamiltonian is quadratic with interaction matrix
given by $A$. In the limit of zero temperature, the measure will concentrate on
the ground states of the model, which correspond with the eigenspace of $A$
associated with its minimum eigenvalue $\lambda_\mathrm{min}$. The second
-relation uses the fact that, once restricted to the sphere $\mathbf s^T\mathbf
-s=N$ and the minimum eigenspace, $\mathbf s^TA\mathbf s=N\lambda_\mathrm{min}(A)$.
+relation uses the fact that, once restricted to the sphere $\|\mathbf
+s\|^2=N$ and the minimum eigenspace, $\mathbf s^TA\mathbf s=\mathbf s^T\mathbf s\lambda_\mathrm{min}(A)=N\lambda_\mathrm{min}(A)$.
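As a rough numerical check of this zero-temperature concentration (an
illustrative sketch, not part of the derivation; the matrix normalization
below is an assumption), one can verify that the constrained minimum of
$\mathbf s^TA\mathbf s$ over the sphere equals $N\lambda_\mathrm{min}(A)$:
\begin{verbatim}
import numpy as np

# Sketch: for a random symmetric A, the minimum of s^T A s over the
# sphere |s|^2 = N equals N * lambda_min(A), the quantity on which the
# beta -> infinity Gibbs measure concentrates.
rng = np.random.default_rng(0)
N = 200
B = rng.normal(size=(N, N))
A = (B + B.T) / np.sqrt(2 * N)           # GOE-like normalization (assumed)

lam_min = np.linalg.eigvalsh(A)[0]

# Projected gradient descent on the sphere, mimicking the low-temperature
# concentration of the Gibbs measure onto the minimum eigenspace.
s = rng.normal(size=N)
s *= np.sqrt(N) / np.linalg.norm(s)
for _ in range(5000):
    s -= 0.1 * (A @ s)                   # gradient step on s^T A s / 2
    s *= np.sqrt(N) / np.linalg.norm(s)  # project back onto |s|^2 = N

print(s @ A @ s / N, lam_min)            # the two values should agree closely
\end{verbatim}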
The relationship is formal, but we can make use of the fact that the integral
expression with a Gibbs distribution can be manipulated with replica
@@ -196,11 +198,12 @@ while for $\mu>2\sigma$ the minimum eigenvalue would need to be a large
deviation from the typical spectrum and its likelihood will be exponentially
suppressed with $N$. For $\mu<2\sigma$, the bulk of the typical spectrum contains
zero and therefore a larger deviation of order $N^2$, moving an extensive number of
-eigenvalues, would be necessary. This final case cannot be quantified by this
+eigenvalues, would be necessary \cite{Dean_2006_Large}. This final case cannot be quantified by this
method, but instead the nonexistence of a large deviation linear in $N$ appears
-as the emergence of an imaginary part in the function.
+as the emergence of an imaginary part in the large deviation function.
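The two regimes can be illustrated empirically (a sketch assuming a GOE-like
matrix with bulk spectrum on $[\mu-2\sigma,\mu+2\sigma]$, matching the setup
described here) by counting how many eigenvalues of the shifted matrix lie
below zero:
\begin{verbatim}
import numpy as np

# Sketch: for mu < 2*sigma an extensive number of eigenvalues sit below
# zero, so forcing lambda_min = 0 requires moving O(N) eigenvalues (an
# O(N^2) deviation); for mu > 2*sigma at most the lowest one must move.
rng = np.random.default_rng(1)
N, sigma = 2000, 1.0
J = rng.normal(scale=sigma, size=(N, N)) / np.sqrt(N)
B = (J + J.T) / np.sqrt(2)               # GOE-like, bulk on [-2*sigma, 2*sigma]
for mu in (1.0, 2.0, 3.0):
    evals = np.linalg.eigvalsh(B + mu * np.eye(N))
    print(mu, np.sum(evals < 0))         # extensive count only for mu < 2*sigma
\end{verbatim}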
-As an example, we compute
+To compute this large deviation function, we will employ the method outlined in
+the previous subsection to calculate
\begin{equation} \label{eq:large.dev}
\begin{aligned}
e^{NG_{\lambda^*}(\mu)}
@@ -235,7 +238,8 @@ find
\begin{aligned}
&e^{NG_{\lambda^*}(\mu)}
=\lim_{\beta\to\infty}\lim_{m\to0}\int d\hat\lambda\prod_{\alpha=1}^m\left[d\mathbf s^\alpha\,\delta(N-\|\mathbf s^\alpha\|^2)\right] \\
- &\hspace{10em}\exp\left\{N\left[\hat\lambda(\lambda^*-\mu)-m\beta\mu\right]+\frac{\sigma^2}{N}\left[\beta^2\sum_{\alpha\gamma}^m(\mathbf s^\alpha\cdot\mathbf s^\gamma)^2
+ &\hspace{10em}
+ \times\exp\left\{N\left[\hat\lambda(\lambda^*-\mu)-m\beta\mu\right]+\frac{\sigma^2}{N}\left[\beta^2\sum_{\alpha\gamma}^m(\mathbf s^\alpha\cdot\mathbf s^\gamma)^2
+2\beta\hat\lambda\sum_\alpha^m(\mathbf s^\alpha\cdot\mathbf s^1)^2
+\hat\lambda^2N^2
\right]\right\}
@@ -243,26 +247,26 @@ find
\end{equation}
\end{widetext}
We make the Hubbard--Stratonovich transformation to the matrix field
-$Q^{\alpha\beta}=\frac1N\mathbf s^\alpha\cdot\mathbf s^\beta$. This gives
+$Q^{\alpha\beta}=\frac1N\mathbf s^\alpha\cdot\mathbf s^\beta$. This produces an integral expression of the form
\begin{equation}
e^{NG_{\lambda^*}(\mu)}
=\lim_{\beta\to\infty}\lim_{m\to0}\int d\hat\lambda\,dQ\,
e^{N\mathcal U_\mathrm{GOE}(\hat\lambda,Q\mid\beta,\lambda^*,\mu)}
\end{equation}
-where the effective action is given by
+where the effective action $\mathcal U_\mathrm{GOE}$ is given by
\begin{equation} \label{eq:goe.action}
\begin{aligned}
&\mathcal U_\textrm{GOE}(\hat\lambda, Q\mid\beta,\lambda^*,\mu)
- =\hat\lambda(\lambda^*-\mu)-m\beta\mu \\
+ =\hat\lambda(\lambda^*-\mu)+\lim_{m\to0}\bigg\{-m\beta\mu \\
&+\sigma^2\left[\beta^2\sum_{\alpha\gamma}^m(Q^{\alpha\gamma})^2
+2\beta\hat\lambda\sum_\alpha^m(Q^{1\alpha})^2
+\hat\lambda^2
- \right]+\frac12\log\det Q
+ \right]+\frac12\log\det Q\bigg\}
\end{aligned}
\end{equation}
and $Q^{\alpha\alpha}=1$ because of the spherical constraint. We can evaluate this
integral using the saddle point method. We make a replica symmetric ansatz for
-$Q$, because this is a 2-spin model, but with the first row singled out because
+$Q$, because this is a 2-spin spherical model, but with the first row singled out because
of its unique coupling with $\hat\lambda$. The resulting matrix has the form
\begin{equation} \label{eq:Q.structure}
Q=\begin{bmatrix}
@@ -273,11 +277,15 @@ of its unique coupling with $\hat\lambda$. The resulting matrix has the form
\tilde q_0&q_0&q_0&\cdots&1
\end{bmatrix}
\end{equation}
-The relevant expressions in the effective action produce $\sum_{\alpha\beta}(Q^{\alpha\beta})^2=m+2(m-1)\tilde q_0^2+(m-1)(m-2)q_0^2$, $\sum_\alpha(Q^{1\alpha})^2=1+(m-1)\tilde q_0^2$,
-and
-\begin{equation}
- \log\det Q=(m-2)\log(1-q_0)+\log(1+(m-2)q_0-(m-1)\tilde q_0^2)
-\end{equation}
+The relevant expressions in the effective action produce
+\begin{align}
+ &\sum_{\alpha\beta}(Q^{\alpha\beta})^2=m+2(m-1)\tilde q_0^2+(m-1)(m-2)q_0^2 \\
+ &\sum_\alpha(Q^{1\alpha})^2=1+(m-1)\tilde q_0^2 \\
+ &\begin{aligned}
+ &\log\det Q=(m-2)\log(1-q_0) \\
+ &\hspace{6em}+\log\big[1+(m-2)q_0-(m-1)\tilde q_0^2\big]
+ \end{aligned}
+\end{align}
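These identities are easy to verify symbolically at small replica number (an
illustrative check, not part of the calculation), for instance at $m=4$ for
the matrix structure shown above:
\begin{verbatim}
import sympy as sp

# Check the three expressions above for m = 4: ones on the diagonal,
# tilde q_0 along the first row and column, q_0 everywhere else.
m = 4
q0, qt0 = sp.symbols('q0 qt0')
Q = sp.Matrix(m, m, lambda a, b: 1 if a == b
              else (qt0 if 0 in (a, b) else q0))

sum_sq = sum(Q[a, b]**2 for a in range(m) for b in range(m))
assert sp.expand(sum_sq - (m + 2*(m-1)*qt0**2 + (m-1)*(m-2)*q0**2)) == 0

first_row = sum(Q[0, b]**2 for b in range(m))
assert sp.expand(first_row - (1 + (m-1)*qt0**2)) == 0

det_formula = (1 - q0)**(m-2) * (1 + (m-2)*q0 - (m-1)*qt0**2)
assert sp.expand(Q.det() - det_formula) == 0
\end{verbatim}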
Inserting these expressions into the effective action and taking the limit of
$m$ to zero, we arrive at
\begin{equation}
@@ -285,7 +293,7 @@ $m$ to zero, we arrive at
=\lim_{\beta\to\infty}\int d\hat\lambda\,dq_0\,d\tilde q_0\,
e^{N\mathcal U_\textrm{GOE}(\hat\lambda,q_0,\tilde q_0\mid\beta,\lambda^*,\mu)}
\end{equation}
-with the effective action
+with the new effective action
\begin{equation}
\begin{aligned}
&\mathcal U_\mathrm{GOE}(\hat\lambda,q_0,\tilde q_0\mid\beta,\lambda^*,\mu) \\
@@ -316,7 +324,7 @@ action that diverges with $\beta$. To cure this, we must take $\tilde y=y$. The
+\frac12\log\left(1-\frac{2\Delta z}{y^2}\right)
\end{aligned}
\end{equation}
-Extremizing this action over the new parameters $y$, $\Delta z$, and $\hat\lambda$, we have
+Extremizing this action over the new parameters $y$, $\Delta z$, and $\hat\lambda$, we find
\begin{align}
\hat\lambda&=\frac1\sigma\sqrt{\left(\frac{\mu-\lambda^*}{2\sigma}\right)^2-1}
\\
@@ -329,7 +337,7 @@ Extremizing this action over the new parameters $y$, $\Delta z$, and $\hat\lambd
-\frac{\mu-\lambda^*}{2\sigma}\sqrt{\left(\frac{\mu-\lambda^*}{2\sigma}\right)^2-1}
\right]
\end{align}
-Inserting this solution into the effective action we find
+Inserting this solution into the effective action we arrive at
\begin{equation} \label{eq:goe.large.dev}
\begin{aligned}
&G_{\lambda^*}(\mu)
@@ -341,15 +349,15 @@ Inserting this solution into the effective action we find
\right]
\end{aligned}
\end{equation}
-This function is plotted in Fig.~\ref{fig:large.dev} for $\lambda^*=0$. For $\mu<2\sigma$ $G_{0}(\mu)$ has an
-imaginary part. This indicates that the existence of a minimally zero eigenvalue when $\mu<2\sigma$ corresponds with a large deviation that grows faster than $N$,
+This function is plotted in Fig.~\ref{fig:large.dev} for $\lambda^*=0$. For $\mu<2\sigma$, $G_{0}(\mu)$ has an
+imaginary part. This indicates that the existence of a zero minimum eigenvalue when $\mu<2\sigma$ corresponds with a large deviation that grows faster than $N$,
more like $N^2$, since in this regime the bulk of the typical spectrum
overlaps zero and therefore extensively many eigenvalues have to have large
-deviations in order for the smallest eigenvalue to be zero. For
+deviations in order for the smallest eigenvalue to be zero \cite{Dean_2006_Large}. For
$\mu\geq2\sigma$ this function gives the large deviation function for the
probability of seeing a zero eigenvalue given the shift $\mu$.
The value $\mu=2\sigma$ is where the function attains its maximum real value, and
-corresponds to the intersection of the average spectrum with zero, i.e., a
+corresponds to the intersection of the typical bulk spectrum with zero, i.e., a
pseudogap.
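The location of this boundary is also visible directly in samples (a
numerical sketch under the same shifted-GOE assumptions as above): the
typical smallest eigenvalue of the shifted matrix crosses zero as $\mu$
passes $2\sigma$.
\begin{verbatim}
import numpy as np

# Sketch: the typical lambda_min tracks mu - 2*sigma, so the bulk of the
# spectrum touches zero (a pseudogap) at mu = 2*sigma.
rng = np.random.default_rng(2)
N, sigma = 2000, 1.0
J = rng.normal(scale=sigma, size=(N, N)) / np.sqrt(N)
B = (J + J.T) / np.sqrt(2)
for mu in (1.5, 2.0, 2.5):
    lam_min = np.linalg.eigvalsh(B + mu * np.eye(N))[0]
    print(mu, lam_min, mu - 2 * sigma)   # lam_min is close to mu - 2*sigma
\end{verbatim}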
\begin{figure}
@@ -377,11 +385,9 @@ pseudogap.
} \label{fig:large.dev}
\end{figure}
-Marginal spectra with a pseudogap and those with simple isolated eigenvalues
-are qualitatively different, and more attention may be focused on the former.
Here, we see what appears to be a general heuristic for identifying the saddle
parameters for which the spectrum is pseudogapped: the equivalent of this
-large-deviation functions will lie on the singular boundary between a purely
+large-deviation function will lie on the singular boundary between purely
real and complex values.
\subsection{Conditioning on a pseudogap}
@@ -389,7 +395,7 @@ real and complex value.
We have seen that this method effectively conditions a random matrix ensemble
on its lowest eigenvalue being zero. However, this does not correspond on its
-own to marginal minima. In the previous example, most values of $\mu$ where
+own to marginality. In the previous example, most values of $\mu$ where
the calculation was valid correspond to matrices with a single isolated
eigenvalue. However, the marginal minima we are concerned with have
pseudogapped spectra, where the continuous part of the spectral density has a
@@ -398,7 +404,7 @@ lower bound at zero.
Fortunately, our calculation can be modified to ensure that we consider only
pseudogapped spectra. First, we insert a shift $\mu$ by hand into the `natural'
spectrum of the problem at hand, conditioning the trace to have a specific
-value $\mu=\operatorname{Tr}A$. Then, we choose this artificial shift so that
+value $\mu=\frac1N\operatorname{Tr}A$. Then, we choose this artificial shift so that
the resulting conditioned spectra are pseudogapped. As seen in the previous
subsection, this can be done by starting from a sufficiently large $\mu$ and
decreasing it until the calculation develops an imaginary part, signaling the
@@ -406,9 +412,10 @@ breakdown of the large-deviation principle at order $N$.
In isotropic or zero-signal landscapes, there is another way to condition on a
pseudogap. In such landscapes, the typical spectrum does not have an isolated
-eigenvalue. Therefore, the condition associated with the bulk of the spectrum
-touching zero, i.e., the pseudogap, will always correspond to the most common
-configuration. We can therefore choose $\mu=\mu_\textrm m$ such that
+eigenvalue. Therefore, for a given $\mu$ the bottom of the spectrum can be located by looking for the value $\lambda^*$ that maximizes the (real) large deviation function.
+Inverting this reasoning, we can find the value $\mu=\mu_\textrm m$
+corresponding to a marginal spectrum by requiring that the large deviation
+function has a maximum in $\lambda^*$ at $\lambda^*=0$, or
\begin{equation}
0=\frac\partial{\partial\lambda^*}G_{\lambda^*}(\mu_\mathrm m)\bigg|_{\lambda^*=0}
\end{equation}
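For the GOE example of the previous subsection this condition can be checked
directly: since $\hat\lambda$ is conjugate to $\lambda^*$ in the effective
action \eqref{eq:goe.action}, at the saddle point (a brief sketch of the
reasoning)
\begin{equation}
  \frac{\partial G_{\lambda^*}(\mu)}{\partial\lambda^*}\bigg|_{\lambda^*=0}
  =\hat\lambda\,\Big|_{\lambda^*=0}
  =\frac1\sigma\sqrt{\left(\frac{\mu}{2\sigma}\right)^2-1}
\end{equation}
which vanishes precisely at $\mu=2\sigma$, recovering the pseudogap point
found above.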
@@ -1629,7 +1636,7 @@ still in its infancy \cite{Urbani_2024_Statistical}.
\label{sec:superspace}
The superspace $\mathbb R^{N|2D}$ is a vector space with $N$ real indices and
-$2D$ Grassmann indices $\bar\theta_1,\theta_1,\ldots,\bar\theta_D,\theta_D$.
+$2D$ Grassmann indices $\bar\theta_1,\theta_1,\ldots,\bar\theta_D,\theta_D$ \cite{DeWitt_1992_Supermanifolds}.
The Grassmann indices anticommute like fermions. Their integration is defined by
\begin{equation}
\int d\theta\,\theta=1