1. Introduction
Given a target Gibbs distribution with Hamiltonian function $\mathcal{H}$ and temperature $\epsilon$ , the Metropolis–Hastings (MH) algorithm is a popular and important Markov chain Monte Carlo algorithm that has been applied to various sampling and optimization problems in a wide range of disciplines, including but not limited to Bayesian computation, statistical physics, and theoretical computer science. While there are many improved variants of the MH algorithm that have been investigated in the literature, this paper centres around a method that is known as landscape modification. This technique is particularly suitable in the context of stochastic minimization with respect to $\mathcal{H}$ , either by running the MH algorithm at a low enough temperature $\epsilon$ or by driving the temperature $\epsilon$ to zero as in simulated annealing. The core idea of landscape modification relies on targeting a modified Hamiltonian function instead of $\mathcal{H}$ , where the modification or the transformation is based upon two parameters, namely f and the threshold parameter c. On the part of landscape of $\mathcal{H}$ that is below c, the original landscape is utilized, which allows for exploitation and concentration of the MH algorithm at low temperatures. On the other hand, on the region of the landscape of $\mathcal{H}$ which is above c, the function f is applied to transform this part of the landscape to encourage and facilitate exploration of the chain. More precisely, the acceptance–rejection probability on this part is increased by the transformation, which consequently leads to a higher transition rate for the modified MH chain, which promotes exploration.
The advantage of landscape modification stems from the notion of critical height, which measures the difficulty of the landscape in a broad sense. With appropriate tuning of both f and c, the critical height is reduced, which leads to an improved simulated annealing algorithm or, for instance, reduced mean crossover time or relaxation time in the Curie–Weiss (CW) model, which we shall discuss in detail.
We now summarize and highlight the key achievements and contributions of this paper:
-
1. We propose and analyse MH algorithms with landscape modification. In Section 2, we first define a new MH algorithm using a modified Hamiltonian function. In general, the corresponding acceptance–rejection probability in the modified MH is of integral form, but we show that this integral can be readily calculated upon specializing to various choices of f, such as linear, quadratic, or square root functions. This is then followed by an example of landscape modification in the Ehrenfest urn with a linear Hamiltonian in Section 2.4, where we prove an upper bound on the spectral gap with polynomial dependence on the dimension, whereas the same technique yields an exponential dependence on the dimension for the classical MH. We provide a discussion on possible strategies for tuning f and c in Section 2.5. In the final subsection, that is, Section 2.6, we elaborate on the similarities and differences between landscape modification and other acceleration techniques in the Markov chain Monte Carlo literature, such as Catoni’s energy transformation algorithm [Reference Catoni1, Reference Catoni2], preconditioning of the Hamiltonian, importance sampling, and quantum annealing [Reference Wang, Wu and Zou3].
-
2. We obtain improved mean crossover time and relaxation time in the CW model with landscape modification. In Section 3, we investigate the effect of landscape modification on the CW model. We first consider the CW model under a fixed external magnetic field in the subcritical regime, where the free energy landscape, as a function of the magnetization, has two local minima. Using a Glauber dynamics with landscape modification, we introduce a new mean-field equation, and with appropriate choice of c, we transform the free energy landscape from a double-well into a single-well landscape, while preserving the location of the global minimum on the modified landscape. As a result, running algorithms on the modified landscape can accelerate the convergence towards the ground state. We prove a subexponential mean tunnelling time in such a setting and discuss related metastability results. Similar results are then extended to the random-field CW model.
-
3. We obtain an improved simulated annealing algorithm with an improved logarithmic cooling schedule. In Section 4, we consider the simulated annealing setting by driving the temperature down to zero. We define a clipped critical height $c^*$ (that depends on the threshold parameter c) associated with the improved simulated annealing algorithm, and prove tight spectral gap asymptotics based on this parameter. This leads to similar asymptotic results concerning the total-variation mixing time and the mean tunnelling time in the low-temperature regime. Utilizing existing results concerning simulated annealing with time-dependent target function, we prove a convergence guarantee for the landscape-modified simulated annealing on a finite state space with an improved logarithmic cooling schedule. These theoretical results on improved convergence are corroborated by numerical experiments on a travelling salesman problem, in Section 4.1.
This paper can be seen as a sequel to an earlier work by the author [Reference Choi4]. The original motivation for landscape modification is from Fang et al. [Reference Fang, Qian and Gong5], who propose a variant of overdamped Langevin diffusion with state-dependent diffusion coefficient. Choi [Reference Choi4] applies this idea of state-dependent noise via landscape modification in the setting of kinetic simulated annealing and develops an improved kinetic simulated annealing algorithm with a convergence guarantee. In the present paper, we recognize that this idea of landscape modification can also be applied to the finite-state-space setting in MH and simulated annealing, and investigate the benefits and speed-ups that this technique can bring, in particular, to the analysis of the CW model and stochastic optimization. While the technique developed in Choi [Reference Choi4] can readily be applied to gradient-based continuous optimization algorithms, we emphasize that the landscape modification technique proposed in this paper can be analogously implemented in essentially all discrete optimization algorithms by a change of landscape and is not limited to simulated annealing or the CW model.
1.1. Notation
Throughout this paper, we adopt the following notation. For $x,y \in \mathbb{R}$ , we let $x_+ = \max\{x,0\}$ denote the non-negative part of x, and we write $x \wedge y = \min\{x,y\}$ . For two functions $g_1, g_2 \;:\; \mathbb{R} \to \mathbb{R}$ , we say that $g_1 = \mathcal{O}(g_2)$ if there exists a constant $C > 0$ such that for sufficiently large x, we have $|g_1(x)| \leqslant C g_2(x)$ . We write $g_1 = o(g_2)$ if $\lim_{x \to \infty} g_1(x)/g_2(x) = 0$ , and we write $g_1 \sim g_2$ if $\lim_{x \to \infty} g_1(x)/g_2(x) = 1$ . We say that $g_1(x)$ is a subexponential function if $\lim_{x \to \infty} \frac{1}{x} \log g_1(x) = 0$ .
2. Metropolis–Hastings with landscape modification
Let $\mathcal{X}$ be a finite state space, let $Q = (Q(x,y))_{x,y \in \mathcal{X}}$ be the transition matrix of a reversible proposal chain with respect to the probability measure $\mu = (\mu(x))_{x \in \mathcal{X}}$ , and let $\mathcal{H}\;:\;\mathcal{X} \to \mathbb{R}$ be the target Hamiltonian function. Denote by $M^{0} = (M^{0}(x,y))_{x,y \in \mathcal{X}}$ the infinitesimal generator of the continuized classical MH chain $X^0 = (X^0(t))_{t \geqslant 0}$ , with proposal chain Q, and with target distribution being the Gibbs distribution $\pi^{0}(x) \propto e^{-\frac{1}{\epsilon} \mathcal{H}(x)}\mu(x)$ at temperature $\epsilon > 0$ . Recall that its dynamics is given by
We shall explain the superscript of 0 in both $M^0$ and $\pi^0$ in Definition 1 below.
Let us denote the ground-state energy level or the global minimum value of $\mathcal{H}$ by $\mathcal{H}_{\textrm{min}} \;:\!=\; \min_{x \in \mathcal{X}} \mathcal{H}(x)$ . Instead of directly targeting the Hamiltonian $\mathcal{H}$ in the Gibbs distribution, we instead target the following modified or transformed function $\mathcal{H}_{\epsilon,c}^f$ at temperature $\epsilon$ :
where the function f and the parameter c are chosen to satisfy the following assumptions.
Assumption 1.
-
1. The function $f\;:\; \mathbb{R}^+ \to \mathbb{R}^+$ is differentiable and non-decreasing, and it satisfies
\begin{align*}f(0) = 0.\end{align*} -
2. The parameter c satisfies $c \geqslant \mathcal{H}_{\textrm{min}}$ .
While it is impossible to calculate $\mathcal{H}_{\epsilon,c}^f$ without knowing $\mathcal{H}_{\textrm{min}}$ a priori, in an MH chain what matters is the difference of the energy function. For $x,y \in \mathcal{X}$ , we see that
which does not depend on $\mathcal{H}_{\textrm{min}}$ . In the special case where we choose $f = 0$ , the above equation reduces to $\mathcal{H}_{\epsilon,c}^0(y) - \mathcal{H}_{\epsilon,c}^0(x) = \frac{1}{\epsilon}(\mathcal{H}(y) - \mathcal{H}(x))$ . On the other hand, in the case where $c < \mathcal{H}(x) < \mathcal{H}(y)$ and f is chosen such that $f(z) > 0$ whenever $z > 0$ , we have
Since f is assumed to be non-decreasing in Assumption 1, the greater the difference between $\mathcal{H}(x)$ and c, the smaller the upper bound is in the first inequality in the above equation; thus, the smaller we expect $\mathcal{H}_{\epsilon,c}^f(y) - \mathcal{H}_{\epsilon,c}^f(x)$ to be, and the higher the transition rate $Q(x,y)\exp\Big\{-(\mathcal{H}_{\epsilon,c}^f(y) - \mathcal{H}_{\epsilon,c}^f(x))\Big\}$ is. This is the sense in which the landscape is modified. Intuitively speaking, when the algorithm is above the threshold parameter c, the landscape is modified so that the transition rate is higher, to encourage exploration, while the original landscape is utilized for exploitation when the algorithm is below c.
To illustrate the effect of the proposed transformation, we plot the function
as in Monmarché [Reference Monmarché6] and compare this landscape with that of $\mathcal{H}_{\epsilon,c}^f$ in Figure 1, where we take $\epsilon \in \{0.25,0.5,0.75,1\}$ , $c = -1.5$ , and $f(z) = z$ . With these choices of parameters, we compute that
In view of the above equation, we plot and compare $\frac{1}{\epsilon} \mathcal{H}(x)$ and $\mathcal{H}^f_{\epsilon,c}(x) + \frac{1}{\epsilon} \mathcal{H}_{\textrm{min}}$ in Figure 1 on the domain $\mathcal{X} = \{-5 + \frac{k}{1000} 10;\; k = 1,2,\ldots,1000\}$ . The shift of $+\frac{1}{\epsilon}\mathcal{H}_{\textrm{min}}$ is necessary to make these two landscapes match exactly in the region where $\{x;\;\mathcal{H}(x) \leqslant c\}$ , so that they are on the same scale. When $x \in (\!-\!4,-2)$ in Figure 1, we can see that the gradient of $\mathcal{H}_{\epsilon,c}^f$ is smaller than that of $\frac{1}{\epsilon} \mathcal{H}$ ; thus it is easier to climb up the hill in this region (i.e. there is a higher transition rate). From the plot we also note that both $\frac{1}{\epsilon} \mathcal{H}$ (the solid red curve) and $\mathcal{H}_{\epsilon,c}^f$ (the dashed blue curve) have exactly the same set of stationary points.
Keeping in mind the ideas and notation above, we are now ready to introduce the MH chain with landscape modification.
Definition 1. (Metropolis–Hastings with landscape modification.) Let $\mathcal{H}$ be the target Hamiltonian function, and let $\mathcal{H}_{\epsilon,c}^f$ be the landscape-modified function at temperature $\epsilon$ introduced in (2.1), where f and c satisfy Assumption 1. The continuized Metropolis–Hastings chain with landscape modification $X^f_{\epsilon,c} = \big(X^f_{\epsilon,c}(t)\big)_{t \geqslant 0}$ has target distribution $\pi^f(x) = \pi^f_{\epsilon,c}(x) \propto e^{-\mathcal{H}_{\epsilon,c}^f(x)}\mu(x)$ and proposal chain Q, and its infinitesimal generator $M^f = \big(M^f(x,y)\big)_{x,y \in \mathcal{X}}$ is given by
Note that when $f = 0$ , the above dynamics reduces to the classical MH $X^0$ .
We now fix some notation and recall some important concepts and results from the literature on Markov chains. We endow the Hilbert space $\ell^2\big(\pi^f_{\epsilon,c}\big)$ with the usual inner product weighted by the invariant measure $\pi^f_{\epsilon,c}$ : for $g_1,g_2 \in \ell^2 \big(\pi^f_{\epsilon,c}\big)$ ,
and for $p > 1$ we denote the $\ell^p$ norm by $\left\lVert\cdot\right\rVert_{\ell^p(\pi^f_{\epsilon,c})}$ . We write $\lambda_2\big(\!-\!M^f_{\epsilon,c}\big)$ for the spectral gap of $M^f_{\epsilon,c}$ ; that is,
Analogously, we denote by $\lambda_2\big(\!-\!M^0_{\epsilon}\big)$ the spectral gap of $M^0_{\epsilon}$ .
In the upcoming sections, we shall investigate and compare the total-variation mixing time between the MH chains with and without landscape modification. For any probability measures $\nu_1,\nu_2$ with support on $\mathcal{X}$ , the total-variation distance between $\nu_1$ and $\nu_2$ is
The worst-case total-variation mixing time is defined to be
where $\big(P_t^{f} = e^{M^f_{\epsilon,c}t}\big)_{t \geqslant 0}$ is the transition semigroup of $X^f_{\epsilon,c}$ .
Besides the mixing time, we will also be interested in various hitting times of the MH chain. These variables naturally appear when we discuss metastability results in the CW model in Section 3 or in discrete simulated annealing in Section 4. For any $A \subset \mathcal{X}$ , we write $\tau^f_{A} \;:\!=\; \inf\big\{t \geqslant 0;\; X^f_{\epsilon,c}(t) \in A\big\}$ , and the usual convention that $\inf \emptyset = \infty$ applies. Similarly, we define $\tau^0_A$ to be the first hitting time of the set A for the MH chain $X^0$ . When $A = \{x\}$ , we shall simply write $\tau^f_x = \tau^f_{\{x\}}$ (resp. $\tau^0_x = \tau^0_{\{x\}}$ ). Also, we denote by $\mathbb{E}_x(\cdot)$ (resp. $\mathbb{P}_x(\cdot)$ ) the mathematical expectation (resp. probability) of the Markov chain with initial state at $x \in \mathcal{X}$ .
One crucial notion that quantifies the possible benefits of landscape modification is the concept of critical height, which in a broad sense measures the difficulty of the landscape. We now recall this classical notion, which originates from the literature on simulated annealing and metastability. A path from x to y is any sequence of points $x_0 = x, x_1, x_2,\ldots, x_n =y$ such that $Q(x_{i-1},x_i) > 0$ for $i = 1,2,\ldots,n$ . For any $x \neq y$ , such a path exists as the proposal chain Q is irreducible. We write $\chi^{x,y}$ for the set of paths from x to y, and elements of $\chi^{x,y}$ are denoted by $\gamma = (\gamma_i)_{i=0}^n$ . Given a target function $\mathcal{U}$ defined on $\mathcal{X}$ , the highest value of $\mathcal{U}$ along a path $\gamma \in \chi^{x,y}$ , known as the elevation, is defined to be
and the lowest possible highest elevation along paths from x to y is
The associated critical height of $\mathcal{U}$ is then defined to be
Another related notion is the clipped critical height $c^*$ , which is defined to be
One can understand $c^*$ as if we are optimizing with respect to the function $\mathcal{U} \wedge c$ . In Section 3, we shall consider $\mathcal{U}$ to be the free energy with and without landscape modification that arises in the CW model, while in Section 4, when we investigate an improved simulated annealing algorithm, we take $\mathcal{U}$ to be either $\mathcal{H}$ or $\mathcal{H}^f_{\epsilon,c}$ . We refer readers to Figure 4 in Section 4, where we offer a visual illustration of the notion of critical height.
To simulate $X^{f}_{\epsilon,c}$ practically, we would need to evaluate the acceptance–rejection probability, which amounts to the following integration:
In the following three subsections, we evaluate the above integrals (2.4) with f chosen to be a linear, quadratic, or cubic function, respectively.
2.1. Linear f: Metropolis–Hastings with logarithmic Hamiltonian and Catoni’s energy transformation algorithm
In this subsection, we specialize to $f(u) = u$ . It turns out we can understand the landscape modification as if the Hamiltonian is on a logarithmic scale whenever $\mathcal{H}(x) > c$ .
For $x,y \in \{\mathcal{H}(y) > \mathcal{H}(x) \geqslant c\}$ , since
substituting the expression into (2.4) gives
This resulting dynamics $M^f$ coincides with the energy transformation method introduced by Catoni [Reference Catoni1, Reference Catoni2] on $\{\mathcal{H}(y) > \mathcal{H}(x) > c\}$ , which is based on a logarithmic Hamiltonian. We refer readers to Section 2.6 for a more detailed account of the connection between landscape modification and Catoni’s energy transformation.
2.2. Quadratic f: Metropolis–Hastings with $\arctan$ Hamiltonian
In this subsection, we take $f(u) = u^2$ . In this case, the effect of the landscape modification gives an inverse-tangent-transformed Hamiltonian whenever $\mathcal{H}(x) > c$ .
For $x,y \in \{\mathcal{H}(y) > \mathcal{H}(x) \geqslant c\}$ , using the inverse-tangent difference formula we obtain
and substituting the above expression into (2.4) gives
2.3. Square root f: Metropolis–Hastings with sum of square root and logarithmic Hamiltonian
In the final example, we let $f(z) = \sqrt{z}$ for $z \geqslant 0$ . For $x,y \in \{\mathcal{H}(y) > \mathcal{H}(x) \geqslant c\}$ , consider the integral
Substituting the expression into (2.4) gives
Remark 1. (Use of landscape modification for sampling.) This paper focuses on investigating the acceleration effect of landscape modification in stochastic optimization and simulated annealing. However, the technique of landscape modification can also be applied to sampling from multimodal distributions. The paper Zhang and Choi [Reference Zhang and Choi7] analyses the use of landscape modification for sampling. We now briefly describe the setting therein. Suppose that we are interested in sampling from a multimodal distribution that we denote by $\nu(x) \propto e^{-\mathcal{H}(x)}$ . Applying the idea of landscape modification, we then construct an MH chain with the transformed Hamiltonian function
so that its stationary distribution is given by $\nu^f_{1,c,\alpha}(x) \propto e^{-\mathcal{H}^f_{1,c,\alpha}(x)}$ . We note that the parameter $\alpha \geqslant 0$ , introduced in (2.5), controls the bias between the distribution $\nu$ and its landscape-modified counterpart $\nu^f_{1,c,\alpha}$ . If we anneal this parameter by sending $\alpha_t \to 0$ as $t \to \infty$ , then $\nu^f_{1,c,\alpha_t}$ converges weakly to the target distribution $\nu$ . As a result, we are able to construct a non-homogeneous MH chain that converges to $\nu$ in the long run while enjoying the benefits of landscape modification.
2.4. A Metropolized Ehrenfest urn with landscape modification
In this section, we discuss a Metropolized Ehrenfest urn model; our exposition closely follows that of Deuschel and Mazza ([Reference Deuschel and Mazza8], Section 5.1.2). Let us first briefly fix the setting. We consider the state space $\mathcal{X} = \{0,1,\ldots,d\}$ with $d \in \mathbb{N}$ and take a linear Hamiltonian $\mathcal{H}(x) = x$ , where $\mathcal{H}_{\textrm{min}} = 0$ . The proposal birth–death chain has generator Q given by $Q(x,x+1) = 1 - x/d$ , $Q(x,x-1) = x/d$ , $Q(x,x) = -1$ and zero otherwise, and we note that the stationary measure of Q is $\mu(x) \propto 2^{-d} (\substack{d \\[5pt] x})$ . With these choices, the classical Metropolized dynamics is $M^0_{\epsilon}(x,x+1) = Q(x,x+1) e^{-\frac{1}{\epsilon}}$ , $M^0_{\epsilon}(x,x-1) = Q(x,x-1)$ , and $\pi^0(x) \propto e^{-\frac{1}{\epsilon} x} \mu(x)$ . It is shown in Deuschel and Mazza ([Reference Deuschel and Mazza8], Equation (5.1.1)) that
At a fixed temperature $\epsilon$ , we note that the upper bound in (2.6) is exponential in d.
Now, we consider the landscape-modified MH with $f(z) = z$ and $c = 1$ . With these parameters, we compute
Using Deuschel and Mazza ([Reference Deuschel and Mazza8], Equation (5.1.1)) leads to
where we use Stirling’s formula, which gives, for large enough d,
As a result, the upper bound in (2.7) gives a polynomial dependency on d, at the tradeoff of an exponential dependency on $\frac{1}{\epsilon}$ . In retrospect this result is not surprising, since we are working with a logarithmic Hamiltonian and hence the asymptotics is in the polynomial of d instead of $2^d$ . This section serves as a warm-up example to prepare us for discussing the more complicated CW model in Section 3.
2.5. Tuning strategies for f and c
In this section, we discuss tuning strategies for f and the threshold parameter c in the context of using the MH chain with landscape modification for stochastic optimization.
First, the function f controls how the landscape is transformed above the threshold parameter c. For example, taking a linear $f(z) = z$ gives a logarithmic transformation, while taking a quadratic $f(z) = z^2$ yields an $\arctan$ transformation, as shown in Sections 2.1 and 2.2. In general, we recommend using $f(z) = z^2$ , since the $\arctan$ transformation is uniformly bounded above by $\pi/2$ , which facilitates exploration on the part of the landscape above c by giving a higher transition rate compared with the choice of $f(z) = z$ . However, we suspect that in numerical investigations, and depending on the target Hamiltonian $\mathcal{H}$ , there are possibilities of using a linear f to yield improved convergence towards the global minimum.
Second, as we shall see in Section 4, the threshold parameter c controls the clipped critical height $c^*$ , and ideally it should be set as close to $\mathcal{H}_{\textrm{min}}$ as possible. This is possible if we have information about the value $\mathcal{H}_{\textrm{min}}$ , which is the case for some statistical physics and theoretical computer science models [Reference Nardi and Zocca9, Reference Zocca10]. In general, however, we may not have access to the value $\mathcal{H}_{\textrm{min}}$ , and one general method is to tune the threshold parameter adaptively by setting the value at time t to be the running minimum up to time t generated by the chain. However, the resulting process becomes non-Markovian because of the adaptive tuning. Another adaptive tuning strategy is to set the value of c at time t to be $c_t = \mathcal{H}(y_t) - d$ , where $y_t$ is the proposed state at time t generated by the proposal chain and $d \geqslant 0$ is a fixed number. In this way, the transition rate of the landscape-modified MH chain is always greater than or equal to that of the MH chain without landscape modification. We shall numerically investigate this tuning strategy and report positive results in Section 4.1.
As far as this paper is concerned, we assume a fixed c in all of the main theoretical results. We shall postpone to future work a systematic numerical study of investigating various choices of f and tuning strategies for c on benchmark functions, as well as a theoretical analysis of adaptively tuning c by the running minimum generated by the algorithm in an MH chain with landscape modification.
2.6. Connections between landscape modification and other acceleration techniques
In this section, we outline conceptual similarities and differences between MH with landscape modification and other common acceleration techniques in the literature for MH and simulated annealing.
2.6.1. Catoni’s energy transformation algorithm
Let $\alpha_1 \geqslant 0$ , $\alpha_2 \geqslant 0$ , $\alpha_3 > - \mathcal{H}_{\textrm{min}}$ be three parameters. In Catoni [Reference Catoni1, Reference Catoni2], the author introduces the energy transformation algorithm by transforming the Hamiltonian $\mathcal{H}$ to
Recall from Section 2.1 that MH with landscape modification can be considered as a state-dependent version of energy transformation if we take $f(z) = z$ , $\alpha_3 = -c + \epsilon$ , with $\alpha_1$ , $\alpha_2$ chosen in a state-dependent manner:
Note that MH with landscape modification can give rise to other kinds of energy transformation through different choices of f; see for example the quadratic case or the square root case in Section 2.2 or 2.3, respectively. The idea of mapping or transforming the function from $\mathcal{H}$ to $F(\mathcal{H})$ with F being strictly increasing and concave can be dated back to R. Azencott.
2.6.2. Preconditioning of the Hamiltonian
Landscape modification can be understood as a state-dependent preconditioning of the Hamiltonian $\mathcal{H}$ . Recall that in (2.4) we compute the acceptance–rejection probability in MH by
On the set $\{c\geqslant \mathcal{H}(y) > \mathcal{H}(x)\}$ , the acceptance–rejection probability is the same as the original MH to allow for exploitation, while on the set $\{\mathcal{H}(y) > c \geqslant \mathcal{H}(x)\}$ and $\{\mathcal{H}(y) > \mathcal{H}(x) > c\}$ , the acceptance–rejection probability is higher than that of the original MH to encourage exploration of the landscape. Therefore, there is a higher transition rate of moving to other states when the algorithm is above the threshold c.
2.6.3. Importance sampling
In importance sampling, the target distribution is altered for possible benefits and speed-ups such as variance reduction. In landscape modification, the target distribution in MH is altered from the original Gibbs distribution $\pi^0(x) \propto e^{-\frac{1}{\epsilon} \mathcal{H}(x)}$ to $\pi^f_{\epsilon,c}(x) \propto e^{-\mathcal{H}_{\epsilon,c}^f(x)}$ , while the set of stationary points is preserved in the sense that $\mathcal{H}$ and $\mathcal{H}^f_{\epsilon,c}$ share the same set of stationary points. In importance sampling, however, the set of stationary points need not be preserved between the altered Hamiltonian and the original Hamiltonian.
2.6.4 Quantum annealing
In quantum annealing [Reference Wang, Wu and Zou3], given a target Hamiltonian $\mathcal{H}$ and an initial Hamiltonian $\mathcal{H}_{init}$ that is usually easy to optimize, we optimize a time-dependent function $Q_t$ defined, for $t \in [0,T]$ , by
where A(t) and B(t) are smooth annealing schedules that satisfy $A(T) = B(0) = 0$ and T is the total annealing time. We also choose A(t) to be decreasing and B(t) to be increasing on the interval [0,T].
In simulated annealing with landscape modification, we also optimize a time-dependent function $\mathcal{H}^f_{\epsilon_t,c}$ which shares the same set of stationary points as the target $\mathcal{H}$ . In quantum annealing, $\mathcal{H}_{\textrm{init}}$ and $\mathcal{H}$ do not necessarily share the same set of stationary points. We mention the works Del Moral and Miclo [Reference Del Moral and Miclo11], Löwe [Reference Löwe12], and Frigerio and Grillo [Reference Frigerio and Grillo13] for simulated annealing with time-dependent energy function.
3. The Curie–Weiss model with landscape modification
In this section, we demonstrate the power of landscape modification by revisiting the Curie–Weiss (CW) model. With an appropriate choice of parameters, the landscape of the CW free energy is modified and the local minimum is eliminated while the global minimum is preserved on the transformed function. As a result, landscape modification convexifies the free energy from a double-well to a single-well as a function of the magnetization.
Let us first recall the setting of the CW model of a ferromagnet with external field $h \in \mathbb{R}$ and fix some notation. We shall follow the setting of Bovier and den Hollander ([Reference Bovier and den Hollander14], Chapters 13–14). Let $\mathcal{X} = \{-1,1\}^N$ be the set of possible configurations of the CW model with $N \in \mathbb{N}$ . The CW Hamiltonian is given, for $\sigma = (\sigma_i)_{i=1}^N \in \mathcal{X}$ , by
where $m_N(\sigma) = (1/N)\sum_{i=1}^N \sigma_i$ is the empirical magnetization. Consider the continuized Glauber dynamics by picking a node uniformly at random and flipping the sign of the selected spin, while targeting the Gibbs distribution with the CW Hamiltonian $H_N$ at temperature $\epsilon$ . The resulting Metropolis dynamics is given by
where $\left\lVert\cdot\right\rVert_1$ is the $l^1$ norm on $\mathcal{X}$ .
The dynamics of the empirical magnetization $(m^0(t))_{t \geqslant 0}$ can be described by lumping the Glauber dynamics to give
on the state space $\Gamma_N \;:\!=\; \{-1,-1+2N^{-1},\ldots,1-2N^{-1},1\}$ with the image Gibbs distribution
as the stationary distribution. Note that the dependency on $\epsilon$ is suppressed in the notation of $M^0_N$ and $\pi^0_N$ . Define
where I(m) is the Cramér rate function for coin tossing. Then the image Gibbs distribution can be written as
The quantity $g_{\epsilon}$ is called the free energy of the CW model. The stationary points of $g_{\epsilon}$ satisfy the classical mean-field equation
To seek the ground state(s) of the free energy, we consider modifying the landscape of the CW Hamiltonian from $H_N$ to
where $d \in \mathbb{R}$ can be chosen arbitrarily, since we are only interested in the difference of $E^f_{\epsilon,c}$ . The infinitesimal generator of the magnetization $(m^f(t))_{t \geqslant 0}$ is
with stationary distribution
where
By taking the limit $N \to \infty$ , we find that the free energy in the landscape-modified CW model is
Setting the derivative of $g^f_{\epsilon,c}$ equal to zero gives the landscape-modified mean-field equation:
Observe that if we take $f = 0$ , then (3.4) reduces to the classical mean-field equation in (3.1).
3.1. Main results
Without loss of generality, assume the external magnetic field is $h < 0$ . In the subcritical regime where $\frac{1}{\epsilon} > 1$ , it is known that there are two local minima of $g_{\epsilon}$ . We denote the global minimum of $g_{\epsilon}$ by $m_{-}^* < 0$ and the other local minimum by $m_+^* > 0$ , where $|m_-^*| > m_+^*$ , and let $z^*$ be the saddle point between $m_-^*$ and $m_+^*$ . We also write $m_-^*(N)$ (resp. $m_+^*(N)$ ) for the closest point in Euclidean distance to $m_-^*$ (resp. $m_+^*$ ) on $\Gamma_N$ .
Theorem 1. (Landscape modification in the subcritical regime.) Suppose $\frac{1}{\epsilon} > 1$ , $h < 0$ , and f,c are chosen as in Assumption 1.
-
1. (Convexification of the free energy $g_{\epsilon}$ and subexponential mean crossover time.) If we choose $c \in [E(m_-^*), E(m_+^*))$ , $c < h^2/2$ , $-h - \sqrt{h^2 - 2c} \leqslant z^*$ , and for $m \in [-h - \sqrt{h^2 - 2c},-h + \sqrt{h^2 - 2c}]$ ,
\begin{align*}m > \tanh\!\bigg(\dfrac{m+h}{f((E(m)-c)_+) + \epsilon}\bigg),\end{align*}then $m_-^*$ is the only stationary point of the modified free energy $g^f_{\epsilon,c}$ , which is a global minimum. Consequently, we have subexponential mean crossover time on the modified landscape(3.5) \begin{align} \lim_{N \to \infty} \dfrac{1}{N} \log \mathbb{E}_{m_+^*(N)} \Big(\tau^f_{m_-^*(N)}\Big) = 0, \end{align}while the mean crossover time on the original landscape is exponential in N, $1/\epsilon$ , and the original critical height $g_{\epsilon}(z^*) - g_{\epsilon}(m_+^*)$ with\begin{align*} \lim_{N \to \infty} \dfrac{1}{N} \log \mathbb{E}_{m_+^*(N)} \Big(\tau^0_{m_-^*(N)}\Big) = \frac{1}{\epsilon} (g_{\epsilon}(z^*) - g_{\epsilon}(m_+^*)). \end{align*} -
2. If we choose $c \in [E(m_+^*), E(z^*)]$ and assume in addition that f is twice differentiable and satisfies $f^{\prime}(0) = f^{\prime \prime}(0) = 0$ , then there exists
\begin{align*}\textbf{z}^* = \arg \max_{m_-^* \leqslant m \leqslant m_+^*} g^f_{\epsilon,c}(m),\end{align*}and as $N \to \infty$ ,\begin{align*} \mathbb{E}_{m_+^*(N)} \Big(\tau^f_{m_-^*(N)}\Big) &= \exp\!\Big(N \Big(g_{\epsilon,c}^f(\textbf{z}^{*}) - g_{\epsilon,c}^f(m_+^*)\Big)\Big) \\[5pt] &\quad \times \frac{2}{1-|\textbf{z}^{*}|} \sqrt{\frac{1-\textbf{z}^{*2}}{1-m_{+}^{* 2}}} \frac{2 \pi N / 4}{ \sqrt{\left(\!-\!\Big(g_{\epsilon,c}^{f}\Big)^{\prime \prime}\left(\textbf{z}^{*}\right)\right) \Big(g_{\epsilon,c}^{f}\Big)^{\prime \prime}\left(m_{+}^{*}\right)}} (1+o(1)). \end{align*}Consequently,\begin{align*} \lim_{N \to \infty} \dfrac{1}{N} \log \mathbb{E}_{m_+^*(N)} \Big(\tau^f_{m_-^*(N)}\Big) &= g_{\epsilon,c}^f(\textbf{z}^{*}) - g_{\epsilon,c}^f(m_+^*) \\[5pt] &= \frac{1}{\epsilon} (c - E(m_+^*)) + \int_c^{E(\textbf{z}^{*})} \dfrac{1}{f((u-c)_+) + \epsilon} \,du \\[5pt] & \quad + \left(I(\textbf{z}^{*}) - I(m_+^*)\right) \\[5pt] &\leqslant \frac{1}{\epsilon} (g_{\epsilon}(z^*) - g_{\epsilon}(m_+^*)) = \lim_{N \to \infty} \dfrac{1}{N} \log \mathbb{E}_{m_+^*(N)} \Big(\tau^0_{m_-^*(N)}\Big). \end{align*}
Before we present the proof, we interpret the results in Theorem 1 intuitively: in Item 1, on the one hand we would like to choose c small enough so that the mapping
is flattened and only intersects the straight line $m \mapsto m$ at the global minimum $m_-^*$ . In this way the landscape of $g^f_{\epsilon,c}$ is transformed from a double-well to a single-well landscape, while the location of the global minimum at $m_-^*$ is preserved as that in the original landscape $g_{\epsilon}$ . This is illustrated in Figures 2 and 3. On the other hand, we cannot choose c to be too small if we are interested in seeking the ground state of $g_{\epsilon}$ , since otherwise, if $c < E(m_-^*)$ , then $m_-^*$ may no longer be the global minimum in the transformed free energy $g^f_{\epsilon,c}$ . This consequently yields a mean crossover time on the modified landscape that is subexponential in N, while the original mean crossover time is exponential in N, $1/\epsilon$ , and the original critical height $g_{\epsilon}(z^*) - g_{\epsilon}(m_+^*)$ . In Item 2 of Theorem 1, we choose a larger value of c compared with that in Item 1. Although the transformed free energy $g^f_{\epsilon,c}$ is not a convex function, it has a smaller critical height than the original free energy $g_{\epsilon}$ . This gives a reduced exponential dependence on the modified mean crossover time compared with the original mean crossover time.
The power of landscape modification or energy transformation lies in tuning the parameter c appropriately. One way to tune c is to use the running minimum generated by the algorithm on the original free energy $g_{\epsilon}$ . Suppose we start in the well containing the local minimum $m_+^*$ ; setting c to be the running minimum eventually gives $c = E(m_+^*)$ , and hence Item 2 of Theorem 1 can be applied and the critical height on the modified landscape is reduced.
We illustrate Theorem 1 with a concrete numerical example in Figures 2 and 3, where we take $h = -0.05$ and $f(z) = z$ at temperature $\epsilon = 1/1.5$ . We numerically compute that $m_-^* = -0.8863$ , $m_+^* = 0.8188$ , and $z^* = 0.1524$ . As a result we have $E(m_-^*) = -0.4371$ , $E(m_+^*) = -0.2943$ , and $E(z^*) = -0.004$ . In the leftmost plots of Figures 2 and 3, we choose $c = -0.4 \in [E(m_-^*), E(m_+^*))$ . We numerically check that the conditions in Item 1 of Theorem 1 are satisfied, and we see that the location of the global minimum is the same for the blue curve as for the orange curve. In the rightmost plots of Figures 2 and 3, we choose $c = -0.2 \in [E(m_+^*), E(z^*)]$ . We see that for the blue curve and the red curve, the locations of the two local minima are the same, while the critical height is smaller than that in the original landscape $g_{\epsilon}$ .
3.2. Proof of Theorem 1
Before we give the proof, let us first recall the concept of critical height and the notation L as introduced in Section 2. In this section, we are interested in the CW model with and without landscape modification, with free energy $g^f_{\epsilon,c,N}$ and $g_{\epsilon,N}$ respectively. We therefore define the analogous concepts of critical height by inserting a subscript of N. This leads us to
where we recall that G is introduced in (2.3).
We now proceed with the proof.
Proof of Theorem 1. First we prove Item 1. We observe that $\{E(m) \geqslant c\} = \{m \in [-h - \sqrt{h^2 - 2c},-h + \sqrt{h^2 - 2c}]\}$ . On this interval,
and hence the modified free energy is strictly increasing on this interval. On the interval $\{m > -h + \sqrt{h^2 - 2c}\}$ ,
as the original free energy is strictly increasing. On the interval $\{m < -h - \sqrt{h^2 - 2c}\}$ , we also have
Thus, with these parameter choices, the only stationary point of $g^f_{\epsilon,c}$ is $m^*_-$ , which is the global minimum.
Next we proceed to proving (3.5). According to Löwe ([Reference Löwe12], Theorem 2.1), for $\xi_1(N)$ a polynomial function in N, we have
Using the random target lemma (Aldous and Fill, [Reference Aldous and Fill15], Section 4.2) and the above inequality leads to
Now, let $\overline{m}(N) \;:\!=\; \arg \min g^f_{\epsilon,c,N}(m)$ and compute
where we use Bovier and den Hollander ([Reference Bovier and den Hollander14], Equation (13.2.5)) in the third equality, and $m_-^*(N), \overline{m}(N) \to m_-^*$ as $N \to \infty$ . The above computation combined with (3.6) yields
On the other hand, as the magnetization $(m^f(t))_{t \geqslant 0}$ is a birth–death process, using Equation (13.2.2) of Bovier and den Hollander [Reference Bovier and den Hollander14], the mean hitting time can be calculated explicitly as
As a result, as $m_+^*(N) \to m_+^*$ we have
Using both (3.8) and (3.7) gives (3.5).
Next, we prove Item 2, the argument for which closely follows that of Bovier and den Hollander ([Reference Bovier and den Hollander14], Theorem 13.1). Since we choose $c \in [E(m_+^*), E(z^*)]$ , we have $E(m_-^*) - c < E(m_+^*) - c \leqslant 0$ , and hence the landscape-modified mean-field equation (3.4) has at least two solutions $m_+^*$ and $m_-^*$ , which are exactly the same as those of the original mean-field equation (3.1). As the landscape-modified mean-field equation is continuous in m, there exists $\textbf{z}^* = \arg \max_{m_-^* \leqslant m \leqslant m_+^*} g^f_{\epsilon,c}(m)$ which also satisfies (3.4). Now, for $m \in \Gamma_N$ we consider
If we take $N \to \infty$ and $m \rightarrow \textbf{z}^*$ , we obtain
since, if $\textbf{z}^* > 0$ , then $\textbf{z}^* + h > 0$ and satisfies the mean-field equation (3.4). Using the mean hitting time formula again leads to the following, for any $\delta > 0$ :
Using Stirling’s formula and the same argument as in Bovier and den Hollander ([Reference Bovier and den Hollander14], Equation (13.2.7)) yields
Substitution into (3.9) gives
We use a Laplace method argument to handle the sum in (3.10). Note that as f is assumed to be twice differentiable with $f(0) = f^{\prime}(0) = f^{\prime \prime}(0) = 0$ , this implies that $g^f_{\epsilon,c}$ is three times differentiable, and applying the third-order Taylor expansion gives
where we use $(g_{\epsilon,c}^{f})^{\prime}(\textbf{z}^*) = (g_{\epsilon,c}^{f})^{\prime}(m_{+}^{*}) = 0$ . Now we observe that the sum in (3.10) is
since $\big(g_{\epsilon,c}^{f}\big)^{\prime \prime}\left(\textbf{z}^{*}\right) < 0$ and $\big(g_{\epsilon,c}^{f}\big)^{\prime \prime}\left(m_{+}^{*}\right) > 0$ .
3.3. Extension to the random-field Curie–Weiss model
In Section 3.1, we discussed the classical CW model with landscape modification under a fixed magnetic field. In this section, we consider the random-field CW model with landscape modification. We discuss related metastability results and the ground-state free energy in such a setting, with the aim of illustrating that landscape modification can also be applied in the setting of a random energy landscape. Let us begin by recalling the random-field CW model. We shall adapt the setting of Mathieu and Picco [Reference Mathieu and Picco16]. Let $(h_i)_{i \in \mathbb{N}}$ be a sequence of independent and identically distributed random variables with $\mathbb{P}(h_i = 1) = \mathbb{P}(h_i = -1) = 1/2$ . We consider the random Hamiltonian function given, for a fixed $\theta > 0$ and $\sigma \in \{-1,1\}^N$ , by
where
In the sequel, we shall suppress the dependency on $\omega$ . Denote the Gibbs distribution at temperature $\epsilon$ on $\{-1,1\}^N$ by
Let $N^+ \;:\!=\; |\{i;\;h_i = +1\}|$ , $N^- \;:\!=\; |\{i;\;h_i = -1\}|$ , and define the random set
For $\textbf{m} = (\textbf{m}^+, \textbf{m}^-) \in \boldsymbol{\Gamma}_N$ , with slight abuse of notation we write
where $\textbf{m}^+$ and $\textbf{m}^-$ are the magnetization among the sites i where respectively $h_i = 1$ and $h_i = -1$ . Let $\boldsymbol{\pi}_{\epsilon,N}^0$ denote the image Gibbs distribution of $\boldsymbol{\nu}_N$ by $\boldsymbol{\Gamma}_{N}$ , where
As $N \to \infty$ , by the strong law of large numbers, $\textbf{g}_{\epsilon,N}$ converges almost surely to the free energy given by
where I(m) is the Cramér rate function as introduced in Section 3. The critical points of $\textbf{g}_{\epsilon}$ satisfy
In this section, we shall only consider the subcritical regime where $\frac{1}{\epsilon} > \cosh^2(\frac{1}{\epsilon} \theta)$ . It can be shown (see e.g. [Reference Mathieu and Picco16]) that there are exactly three critical points. Let $\textbf{m}_{*} > 0$ be the unique positive solution to the mean-field equation
The three critical points of $\textbf{g}_{\epsilon}$ are given by
where $\textbf{m}_0$ is the saddle point and $\textbf{m}_1, \textbf{m}_2$ are the two global minima. Consider the continuized Glauber dynamics $(\sigma_N(t))_{t \geqslant 0}$ by picking a node uniformly at random and changing the sign of the selected spin, while targeting the Gibbs distribution $\boldsymbol{\nu}_N$ at temperature $\epsilon$ . Denote by $\textbf{m}_N(t) \;:\!=\; \textbf{m}_N(\sigma_N(t))$ the induced dynamics on the magnetization, and denote its infinitesimal generator by $\textbf{M}^0_{\epsilon,N}$ . This is proven to be a Markov chain in Mathieu and Picco [Reference Mathieu and Picco16], with stationary measure $\boldsymbol{\pi}^0_{\epsilon,N}$ .
Now let us consider the landscape-modified Hamiltonian on $\boldsymbol{\Gamma}_N$ ,
where $d \in \mathbb{R}$ can be chosen arbitrarily since we are only interested in the difference of $\textbf{E}^f_{\epsilon,c}$ . The transformed image Gibbs distribution is therefore
The strong law of large numbers yields that as $N \to \infty$ , $\textbf{g}^f_{\epsilon,c,N}$ converges almost surely to the transformed free energy
The critical points of $\textbf{g}^f_{\epsilon,c}$ satisfy the following landscape-modified mean-field equations:
Note that (3.17) and (3.18) reduce to the classical case (3.13) and (3.14) if we take $f = 0$ . Consider the continuized Glauber dynamics $(\sigma_N^f(t))_{t \geqslant 0}$ by picking a node uniformly at random and changing the sign of the selected spin, while targeting the Gibbs distribution with Hamiltonian $\epsilon \textbf{H}^f_{\epsilon,c,N}$ at temperature $\epsilon$ . Denote by $\textbf{m}_N^f(t) \;:\!=\; \textbf{m}^f_N(\sigma_N^f(t))$ the induced dynamics on the magnetization, and denote by $\textbf{M}^f_{\epsilon,c,N}$ its infinitesimal generator, which is a Markov chain with stationary measure $\boldsymbol{\pi}^f_{\epsilon,c,N}$ .
In the following, we shall consider the case where $c \in [\textbf{E}(\textbf{m}_1), \textbf{E}(\textbf{m}_0)]$ . It can be seen that the two global minima of $\textbf{g}^f_{\epsilon,c}$ remain at $\textbf{m}_1, \textbf{m}_2$ with this choice of c. For any path $\gamma^{\textbf{m}_1,\textbf{m}_0}$ connecting $\textbf{m}_1$ and $\textbf{m}_0$ , we define
where $\Delta \textbf{g}^f_{\epsilon,c}$ is the critical height on the modified landscape. We also write $\Delta \textbf{g}_{\epsilon}$ to denote the critical height on the original landscape. Suppose that $\Delta \textbf{g}_{\epsilon}$ is attained at $\textbf{m}_4$ , so that $\Delta \textbf{g}_{\epsilon} = \textbf{g}_{\epsilon}(\textbf{m}_4) - \textbf{g}_{\epsilon}(\textbf{m}_0)$ ; we deduce
In other words, the critical height of the free energy in the modified landscape is bounded above by $\frac{1}{\epsilon}$ times the critical height of the free energy in the original landscape.
A direct application of Mathieu and Picco ([Reference Mathieu and Picco16], Theorem 2.7) yields the following result on the asymptotics of the spectral gap.
Theorem 2. (Asymptotics of the spectral gap.) Suppose $\theta > 0$ , $\frac{1}{\epsilon} > \cosh^2\left(\frac{1}{\epsilon} \theta\right)$ are fixed, and $\textbf{m}_0$ is the saddle point, while $\textbf{m}_1, \textbf{m}_2$ are the two global minima on the original free energy landscape $\textbf{g}_{\epsilon}$ . For $c \in [\textbf{E}(\textbf{m}_1), \textbf{E}(\textbf{m}_0)]$ , we have $\mathbb{P}$ -almost surely that
In essence, the relaxation time in the mean-field limit of the transformed generator $\textbf{M}^f_{\epsilon,c,N}$ is asymptotically less than or equal to that of the original generator $\textbf{M}^0_{\epsilon,N}$ .
This yields a reduced exponential dependence of the relaxation time on the modified landscape, compared with the relaxation time on the original landscape.
4. Discrete simulated annealing with landscape modification
Unlike in previous sections of this paper, where the temperature parameter is fixed, in this section we consider the non-homogeneous MH with landscape modification where the temperature schedule $(\epsilon_t)_{t \geqslant 0}$ is time-dependent and non-increasing and goes to zero as $t \to \infty$ .
We first recall the concept of critical height as introduced in Section 2. To be precise, we define
where $H^f_{\epsilon,c}$ is the critical height associated with the modified landscape, $H^0$ is the critical height associated with the original landscape $\mathcal{H}$ , and $c^*$ is the clipped critical height. We shall see in the main results of this section that both $H^f_{\epsilon,c}$ and $c^*$ play fundamental roles in the relaxation time in the low-temperature regime and in determining the cooling schedule of an improved simulated annealing algorithm running on the modified landscape.
As an illustration to calculate and compare these critical heights, we consider a simple one-dimensional landscape with a saddle point at s, a local (but not global) minimum at m, and a single global minimum. At temperature $\epsilon = 1$ , the original critical height is attained at $H^0 = \mathcal{H}(s) - \mathcal{H}(m)$ while the modified critical height is $H^f_{1,c} = \mathcal{H}^f_{1,c}(s) - \mathcal{H}^f_{1,c}(m) \leqslant \mathcal{H}^0$ . In this setting, depending on whether c is above or below $\mathcal{H}(m)$ , the clipped critical height $c^*$ is
These critical heights are illustrated in Figure 4.
Our first result gives the asymptotic order of the spectral gap $\lambda_2\Big(\!-\!M^f_{\epsilon,c}\Big)$ in terms of $c^*$ in the low-temperature regime, which will be proven to be essential in obtaining a convergence result for simulated annealing.
Theorem 3. Assume that f and $\min \mathcal{H} \leqslant c \leqslant \max \mathcal{H}$ satisfy Assumption 1, and in addition for all small enough $z > 0$ we have $f(z) \geqslant z$ . Then there exist positive constants $C_2, C_3, C_4$ , depending on the state space $\mathcal{X}$ and the proposal generator Q but not on the temperature $\epsilon$ , and a subexponential function
where $\delta \;:\!=\; \min_{x;\; \mathcal{H}(x) > c} \{\mathcal{H}(x) - c\}$ , such that
where $H^f_{\epsilon,c}$ is introduced in (4.1) and $c^*$ is defined in (4.3). This leads to
As a corollary of Theorem 3, using the asymptotics of the spectral gap, we derive similar asymptotics for the mixing time and tunnelling time on the modified landscape.
Corollary 1. (Asymptotics of mixing and tunnelling times in the low-temperature regime.) Assume the same setting as in Theorem 3. Let $S_{\textrm{min}} \;:\!=\; \arg \min \mathcal{H}(x)$ be the set of global minima of $\mathcal{H}$ , let $\eta \in S_{\textrm{min}}$ , and assume that $\sigma,\eta$ attain $H^0$ such that $H^0 = G^0(\sigma,\eta) - \mathcal{H}(\sigma)$ . Then the following statements hold:
-
1.
\begin{align*}\lim_{\epsilon \to 0} \epsilon \log t_{mix}(M^f_{\epsilon,c},1/4) = c^*.\end{align*} -
2.
\begin{align*}\lim_{\epsilon \to 0} \epsilon \log \mathbb{E}_{\sigma}(\tau^f_\eta) = c^* \leqslant H^0 = \lim_{\epsilon \to 0} \epsilon \log \mathbb{E}_{\sigma}(\tau^0_\eta).\end{align*}
In particular, when $c = \mathcal{H}_{\textrm{min}}$ , we have subexponential tunnelling time, as $\lim_{\epsilon \to 0} \epsilon \log \mathbb{E}_{\sigma}(\tau^f_\eta) = 0 = c^*$ .
Note that in the case where both $\sigma, \eta \in S_{\textrm{min}}$ with initial state $\sigma$ , it is a reasonable choice to pick the parameter $c = \mathcal{H}(X^f(0)) = \mathcal{H}(\sigma) = \mathcal{H}_{\textrm{min}}$ , and in this setting we have subexponential tunnelling time on the modified landscape. For instance, in applications we may know about a global minimizer $\sigma$ , and by setting $c = \mathcal{H}(\sigma)$ we can search for other possible global minimizers owing to the subexponential tunnelling in the low-temperature regime. For a concrete example, in the Widom–Rowlinson model with $m \in \mathbb{N}$ particle types, $S_{\textrm{min}}$ is precisely the set of configurations in which all sites are occupied by particles of the same type, and hence both $S_{\textrm{min}}$ and $\mathcal{H}_{\textrm{min}}$ are known in this model. In this direction, we refer interested readers to Nardi and Zocca [Reference Nardi and Zocca9] and Zocca [Reference Zocca10] for work on the energy landscape analysis of various statistical physics models.
To prove the convergence result for simulated annealing with landscape modification, as our target function $\mathcal{H}^f_{\epsilon,c}$ depends on time through the cooling schedule, we are in the setting of simulated annealing with time-dependent energy function as in Löwe [Reference Löwe12]. We first present the following auxiliary lemma, where we verify various assumptions from Löwe [Reference Löwe12] in our setting. We have included the statement of the lemma in this section, rather than with the proofs later on, since it will be helpful for understanding the convergence result in Theorem 4 below.
Lemma 1. Assume the same setting as in Theorem 3. Let $M \;:\!=\; \max \mathcal{H} - \min \mathcal{H}$ , $\beta_t \;:\!=\; 1/\epsilon_t$ , and assume the cooling schedule is
for $\epsilon$ small enough so that $M + \max \mathcal{H} - c > \epsilon > 0$ and $t \geqslant 0$ . Then we have the following:
-
1. For all $x \in \mathcal{X}$ and all $t \geqslant 0$ ,
\begin{align*}0 \leqslant \epsilon_t \mathcal{H}^f_{\epsilon_t,c}(x) \leqslant M.\end{align*} -
2. For all $x \in \mathcal{X}$ ,
\begin{align*}\left|\dfrac{\partial}{\partial t} \epsilon_t \mathcal{H}^f_{\epsilon_t,c}(x)\right| \leqslant \dfrac{2M}{(\ln(1+t))(1+t)}.\end{align*} -
3. Let $R_t \;:\!=\; \sup_x \frac{\partial}{\partial t} \epsilon_t \mathcal{H}^f_{\epsilon_t,c}(x)$ and $B \;:\!=\; 6M/(c^*+\epsilon)$ . For all $t \geqslant 0$ ,
\begin{align*}\beta_t^{\prime} M + \beta_t R_t \leqslant \dfrac{3M}{(c^*+\epsilon)(1+t)} = \dfrac{B}{2(1+t)}.\end{align*} -
4. Let
\begin{align*}p \;:\!=\; \dfrac{2M}{M + \max \mathcal{H} - \epsilon - c} > 2\end{align*}and\begin{align*}A \;:\!=\; \begin{cases} \dfrac{1}{C_2 (\!\min_x \mu(x))^{(p-2)/p}}\exp\bigg\{\dfrac{1}{f(\delta)} \left(\max \mathcal{H} - \min \mathcal{H}\right)\bigg\} &\textit{if} \, c < \max \mathcal{H}, \\[11pt] \dfrac{1}{C_2 (\!\min_x \mu(x))^{(p-2)/p}} &\textit{if}\, c = \max \mathcal{H}, \end{cases}\end{align*}where $C_2, \delta$ are as in Theorem 3, and we recall that $\mu$ is the stationary measure of the proposal generator Q. For $g \in \ell^p(\pi_{\epsilon_t,c}^f)$ , we have\begin{align*}\left\lVert g - \pi_{\epsilon_t,c}^f(g)\right\rVert_{\ell^p(\pi^f_{\epsilon_t,c})}^2 \leqslant A (1+t) \langle -M^f_{\epsilon_t,c}g,g \rangle_{\pi^f_{\epsilon_t,c}}.\end{align*}
Remark 2. Items 1 and 2 correspond respectively to Equations (2.11) and (2.12) in Löwe [Reference Löwe12], while the lower bound of the spectral gap in Theorem 3 verifies Equation (2.13) in Löwe [Reference Löwe12]. Assumptions (A1) and (A2) in Löwe [Reference Löwe12] are checked in Items 3 and 4, respectively.
With the above lemma and the notation introduced there, we are ready to give one of the main results of this paper, which concerns the large-time convergence of discrete simulated annealing with landscape modification. The gain from landscape modification in simulated annealing can be seen by operating a possibly faster logarithmic cooling schedule with clipped critical height $c^*$ , while in classical simulated annealing the critical height is $H^0$ . The possible benefit thus depends on the tuning of c, since $c^* \leqslant c - \mathcal{H}_{\textrm{min}}$ . This result is analogous to the result obtained in Choi [Reference Choi4] for improved kinetic simulated annealing. A similar improvement of a logarithmic cooling schedule by means of reduction in critical height can be found in the infinite swapping algorithm [Reference Menz, Schlichting, Tang and Wu17].
Theorem 4. Assume the same setting as in Theorem 3. Let A,B,p be the quantities introduced in Lemma 1. Define
Consider a cooling schedule of the following form: for any $\epsilon > 0$ as in Lemma 1,
Under this cooling schedule, for any $x \in \mathcal{X} \backslash S_{\textrm{min}}$ and $t \geqslant e^{1/\overline{\epsilon}}-1$ , we have
Note that
Remark 3. (On tuning the threshold parameter c.) There are various ways to tune the parameter c for improved convergence. In Choi [Reference Choi4], we proposed to use the running minimum generated by the algorithm to tune c. Note that in the setting of the CW model, the parameter c can be tuned as explained earlier, in the second paragraph below Theorem 1.
4.1. Numerical illustrations
Before discussing the proofs of the main results above, we illustrate the convergence performance of simulated annealing with landscape modification, which we call improved simulated annealing (ISA), and compare it to the performance of the classical simulated annealing algorithm (SA) for the travelling salesman problem (TSP). We first state the parameters that we used to generate the numerical results:
-
• TSP and its objective function. We select 50 nodes uniformly at random on the grid $[0,100] \times [0,100]$ . The objective is to find a configuration that minimizes the total Euclidean distance with the same starting and ending point. Each node can only be visited once.
-
• Initial configuration. Both ISA and SA have the same initialization. They are initialized using the output of the nearest-neighbour algorithm: a node is randomly chosen as the starting point, which is then connected to the closest unvisited node. This is repeated until every node has been visited, and subsequently the last node is connected back to the starting node.
-
• Proposal chain. Both ISA and SA share the same proposal chain: at each step, a proposal move is generated using the 2-OPT algorithm [Reference Croes18].
-
• Acceptance–rejection mechanism. In SA, the proposed move is accepted with probability
\begin{align*}\min \left\{ 1,e^{\beta(\mathcal{H}(x)-\mathcal{H}(y))} \right\},\end{align*}while in ISA, the acceptance probability is computed as in Section 2.1. In other words, we use a linear f in ISA. Both SA and ISA have the same source of randomness. -
• Cooling schedule. Both ISA and SA use the same logarithmic cooling schedule, of the form
\begin{align*}\epsilon_t = \dfrac{\sqrt{50}}{\ln(t+1)}.\end{align*} -
• Choice of c in ISA. In this experiment, if we denote the proposal configuration at time t by $y_t$ , we set $c = c_t$ to be
\begin{align*}c_t = \mathcal{H}(y_t) - 5.\end{align*}This tuning strategy has been discussed in Section 2.5. -
• Number of iterations. We run both ISA and SA for 100,000 iterations.
We generate 1000 random TSP instances. For each instance we compute what we call the improvement percentage ( $\textrm{IP}$ ) of ISA over SA, defined by
The summary statistics of $\textrm{IP}$ are provided in Table 1, while its histogram over these 1000 instances can be found in Figure 5. The code for reproducing these results can be found at https://github.com/mchchoi/Improved-discrete-simulated-annealing.
The summary statistics in Table 1 and the histogram in Figure 5 offer empirical evidence for using ISA over SA: out of the 1000 TSP instances, there are 798 instances in which the $\textrm{IP}$ is non-negative. The sample mean of the $\textrm{IP}$ is approximately 1.87%, while its sample median is 1.47%.
Next, in Figure 6, we investigate the difference between SA and ISA in a particular instance. We see that SA (the blue curve) is stuck at a local minimum, while ISA (the orange curve) is able to escape the local minimum, owing to the increased acceptance probability compared with SA, and it reaches regions where the objective value is smaller than that of SA.
The rest of this section is devoted to the proofs of Theorem 3, Corollary 1, Lemma 1, and Theorem 4.
4.2 Proof of Theorem 3
First, using the classical result of Holley and Stroock ([Reference Holley and Stroock19], Theorem 2.1), it is immediate that
For any arbitrary $x_1, x_2 \in \{\mathcal{H}(x_1) \geqslant \mathcal{H}(x_2)\}$ , we deduce the following upper bound:
As a result, $C_2^{-1}e^{H^f_{\epsilon,c}} \leqslant e^{\frac{1}{\epsilon} c^*} C_1(\epsilon)$ . On the other hand, we have the following lower bound:
and hence $e^{H^f_{\epsilon,c}} \geqslant e^{\frac{1}{\epsilon} c^*} C_4^{-1}$ .
4.3. Proof of Corollary 1
We first prove Item 1. For a continuous-time reversible Markov chain, by Levin and Peres ([Reference Levin and Peres20], Theorems 12.5 and 20.6), we can bound the total-variation mixing time by relaxation time via
where
for some $x^* \in S_{\textrm{min}}$ , and
is the normalization constant. Note that since $ \ln Z^f_{\epsilon,c} \to \ln \mu(S_{\textrm{min}})$ , we have
Item 1 follows from collecting the above results together with Theorem 3.
Next, we prove Item 2. First, using the random target lemma ([Reference Aldous and Fill15], Section 4.2), we have
Since $\mathcal{H}(\eta) = 0$ and $Z^f_{\epsilon,c} \leqslant 1$ , upon rearranging and using Theorem 2 we obtain
Define the equilibrium potential and capacity of the pair $(\sigma,\eta)$ as in Bovier and den Hollander ([Reference Bovier and den Hollander14], Chapter 7.2) to be respectively
If we prove that
then the mean hitting time formula combined with the equilibrium potential and capacity leads to
and the desired result follows, since
It therefore remains to prove (4.4). Define
Let $\textbf{1}_{A}$ denote the indicator function of the set A; the Dirichlet principle of capacity gives
where in the last inequality we use the fact that $G^0(\sigma,\eta)$ is the lowest possible highest elevation between $\sigma$ and $\eta$ .
4.4. Proof of Lemma 1
We first prove Item 1. The lower bound is immediate, while the upper bound can be deduced via
Next, we prove Item 2. We consider
This leads to
Thirdly, we prove Item 3. Using Item 2, we calculate that
Finally, we prove Item 4. Following exactly the same calculation as in the proof of Löwe ([Reference Löwe12], Lemma 3.5), we see that
The desired result follows if we let
so that
4.5 Proof of Theorem 4
We would like to invoke the results in Löwe [Reference Löwe12] for a time-dependent target function in simulated annealing. In Lemma 1, Items 1, 2, 3, and 4, we verify that Equations (11) and (12) and Assumptions (A1) and (A2), respectively, from Löwe [Reference Löwe12] hold. Consequently, if we let $h_t(y) \;:\!=\; \mathbb{P}_x(X^f_{\epsilon_t,c}(t) = y)/\pi^f_{\epsilon_t,c}(y)$ , then according to Holley and Stroock ([Reference Holley and Stroock19], Lemma 1.7), its $\ell^2$ norm is bounded by
for $t \geqslant e^{1/\overline{\epsilon}}-1$ . The desired result follows from exactly the same argument as in Löwe ([Reference Löwe12], Theorem 3.8).
Now we calculate $\pi^f_{\epsilon_t,c}(\mathcal{X} \backslash S_{\textrm{min}})$ . For $c \geqslant \underline{d}$ , we compute that
On the other hand, for $\mathcal{H}_{\textrm{min}} \leqslant c < \underline{d}$ ,
Acknowledgements
We thank Laurent Miclo for pointers to the work of Olivier Catoni, and we thank the two reviewers for careful reading and constructive feedback.
Funding information
The author acknowledges financial support from the startup grant of the National University of Singapore, Yale-NUS College, and a Singapore MoE Tier 1 grant entitled ‘MAPLE’.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.