1. Introduction
The search for a global minimiser $v^*$ of a potentially non-convex and non-smooth cost function $f\,:\,\mathbb{R}^d\rightarrow \mathbb{R}$ holds significant importance in a variety of applications throughout applied mathematics, science and technology, engineering, and machine learning. Historically, a class of methods known as metaheuristics [Reference Bäck, Fogel and Michalewicz1, Reference Blum and Roli2] has been developed to address this inherently challenging and, in general, NP-hard problem. Examples include evolutionary programming [Reference Fogel3], genetic algorithms [Reference Holland4], particle swarm optimisation (PSO) [Reference Kennedy and Eberhart5], simulated annealing [Reference Aarts and Korst6] and many others. These methods combine local improvement procedures and global strategies by orchestrating deterministic and stochastic advances, with the aim of creating a method capable of robustly and efficiently finding the globally minimising argument $v^*$ of $f$ . However, despite their empirical success and widespread adoption in practice, most metaheuristics lack a solid mathematical foundation that could guarantee their robust convergence to global minimisers under reasonable assumptions.
Motivated by the urge to devise algorithms which converge provably, a novel class of metaheuristics, so-called consensus-based optimisation (CBO), originally proposed by the authors of [Reference Pinnau, Totzeck, Tse and Martin7], has recently emerged in the literature. Due to the inherent simplicity in the design of CBO, this class of optimisation algorithms lends itself to a rigorous theoretical analysis, as demonstrated in particular in the works [Reference Carrillo, Choi, Totzeck and Tse8–Reference Ko, Ha, Jin and Kim14]. However, this recent line of research does not just offer a promising avenue for establishing a thorough mathematical framework for understanding the numerically observed successes of CBO methods [Reference Carrillo, Jin, Li and Zhu9, Reference Fornasier, Klock, Riedl, Laredo, Hidalgo and Babaagba11, Reference Carrillo, Trillos, Li and Zhu15–Reference Riedl17], but beyond that allows one to explain the effective use of conceptually similar and widespread methods such as PSO as well as, at first glance, completely different optimisation algorithms such as stochastic gradient descent (SGD). While the first connection is to be expected and by now made fairly rigorous [Reference Cipriani, Huang and Qiu18–Reference Huang, Qiu and Riedl20] due to CBO indisputably taking PSO as inspiration, the second observation is somewhat surprising, as it builds a bridge between derivative-free metaheuristics and gradient-based learning algorithms. Despite CBO solely relying on evaluations of the objective function, recent work [Reference Riedl, Klock, Geldhauser and Fornasier21] reveals an intrinsic SGD-like behaviour of CBO itself by interpreting it as a certain stochastic relaxation of gradient descent, which provably overcomes energy barriers of non-convex functions.
These perspectives, and, in particular, the already well-investigated convergence behaviour of standard CBO, encourage the exploration of improvements to the method in order to allow overcoming the limitations of traditional metaheuristics mentioned at the start. For recent surveys on CBO, we refer to [Reference Grassi, Huang, Pareschi and Qiu22, Reference Totzeck23].
The original CBO model [Reference Pinnau, Totzeck, Tse and Martin7] has been adapted to solve constrained optimisation problems [Reference Bae, Ha, Kang, Lim, Min and Yoo24–Reference Carrillo, Totzeck and Vaes26], optimisation problems on manifolds [Reference Fornasier, Huang, Pareschi and Sünnen16, Reference Fornasier, Huang, Pareschi and Sünnen27–Reference Kim, Kang, Kim, Ha and Yang30], multi-objective optimisation problems [Reference Borghi, Herty and Pareschi31–Reference Klamroth, Stiglmayr and Totzeck33], saddle point problems [Reference Huang, Qiu and Riedl34] or the task of sampling [Reference Carrillo, Hoffmann, Stuart and Vaes35], and has been extended to make use of memory mechanisms [Reference Riedl17, Reference Borghi, Grassi and Pareschi36, Reference Totzeck and Wolfram37], gradient information [Reference Riedl17, Reference Schillings, Totzeck and Wacker38], momentum [Reference Chen, Jin and Lyu39], jump-diffusion processes [Reference Kalise, Sharma and Tretyakov40] or localisation kernels for polarisation [Reference Bungert, Wacker and Roith41]. In this work, we focus on a variation of the original model that incorporates a truncation in the noise term of the dynamics. More formally, given a time horizon $T\gt 0$ , a time discretisation $t_0 = 0 \lt \Delta t \lt \cdots \lt K \Delta t = t_K = T$ of $[0,T]$ , and user-specified parameters $\alpha,\lambda,\sigma \gt 0$ as well as $v_b\in \mathbb{R}^d$ and $R\gt 0$ , we consider the interacting particle system
where $(( B^i_{{k,\Delta t}})_{k=0,\ldots,K-1})_{i=1,\ldots,N}$ are independent, identically distributed Gaussian random vectors in $\mathbb{R}^d$ with zero mean and covariance matrix $\Delta t \mathsf{Id}_d$ . Equation (1) originates from a simple Euler–Maruyama time discretisation [Reference Higham42, Reference Platen43] of the system of stochastic differential equations (SDEs), expressed in Itô’s form as
where $((B_t^i)_{t\geq 0})_{i = 1,\ldots,N}$ are now independent standard Brownian motions in $\mathbb{R}^d$ . The empirical measure of the particles at time $t$ is denoted by $\widehat \rho ^{\,N}_{t} \,:\!=\, \frac{1}{N} \sum _{i=1}^{N} \delta _{V_t^i}$ . Moreover, $\mathcal{P}_{v_b,R}$ is the projection onto $B_R(v_b)$ defined as
As a crucial assumption in this paper, the map $\mathcal{P}_{v_b,R}$ depends on $R$ and $v_b$ in such a way that $v^*\in B_R(v_b)$ . Setting the parameters can be feasible under specific circumstances, as exemplified by the regularised optimisation problem $f(v)\,:\!=\,\operatorname{Loss}(v)+ \Lambda \!\left \| v \right \|_2$ , wherein $v^*\in B_{\operatorname{Loss}(0)/ \Lambda }(0)$ ; indeed, provided the loss is non-negative, $\Lambda \!\left \| v^* \right \|_2\leq f(v^*)\leq f(0)=\operatorname{Loss}(0)$ . In the absence of prior knowledge regarding $v_b$ and $R$ , a practical approach is to choose $v_b=0$ and assign a sufficiently large value to $R$ . The first terms in (1) and (3), respectively, impose a deterministic drift of each particle towards the possibly projected instantaneous consensus point $v_{\alpha }\big({\widehat \rho ^{\,N}_{t}}\big)$ , which is a weighted average of the particles’ positions and computed according to
The weights $ \omega _{\alpha }(v)\,:\!=\,\exp \!({-}\alpha{f}(v))$ are motivated by the well-known Laplace principle [Reference Dembo and Zeitouni44], which states for any absolutely continuous probability distribution $\varrho$ on $\mathbb{R}^d$ that
and thus justifies that $v_{\alpha }\big({\widehat \rho ^{\,N}_{t}}\big)$ serves as a suitable proxy for the global minimiser $v^*$ given the currently available information of the particles $(V^i_t)_{i=1,\dots,N}$ . The second terms in (1) and (3), respectively, encode the diffusion or exploration mechanism of the algorithm, where, in contrast to standard CBO, we truncate the noise by some fixed constant $M\gt 0$ .
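The role of the weights can be illustrated numerically. The following is a minimal sketch, assuming the Laplace principle in its standard form $\lim _{\alpha \rightarrow \infty }\big ({-}\tfrac{1}{\alpha }\log \int \omega _{\alpha }\,d\varrho \big )=\inf _{v\in \operatorname{supp}(\varrho )}f(v)$ and replacing $\varrho$ by a uniform empirical measure on a grid. The objective `f`, the grid and the function name `laplace_approximation` are hypothetical choices made for illustration only.

```python
import math


def laplace_approximation(f, samples, alpha):
    """Return -(1/alpha) * log(mean_i exp(-alpha * f(x_i))).

    The mean is evaluated with a log-sum-exp shift so that large
    values of alpha do not cause numerical underflow.
    """
    values = [-alpha * f(x) for x in samples]
    shift = max(values)
    mean_exp = sum(math.exp(v - shift) for v in values) / len(values)
    return -(shift + math.log(mean_exp)) / alpha


def f(x):
    # Hypothetical non-convex test objective with global minimum f(1) = 0.
    return (x * x - 1.0) ** 2 + 0.3 * (x - 1.0) ** 2


# Uniform grid on [-2, 2] playing the role of the measure (an assumption).
grid = [-2.0 + 4.0 * k / 400 for k in range(401)]
true_min = min(f(x) for x in grid)

alphas = (1.0, 10.0, 100.0, 1000.0)
approximations = {a: laplace_approximation(f, grid, a) for a in alphas}

for a in alphas:
    print(f"alpha = {a:7.1f}: approximation = {approximations[a]:.5f}")
```

As $\alpha$ grows, the printed values decrease towards the minimal value of the objective on the grid, which is the mechanism that makes the consensus point a proxy for $v^*$ .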
We conclude and re-iterate that both the introduction of the projection $\mathcal{P}_{v_b,R}\!\left (v_{\alpha }\!\left(\widehat{\rho }_t^N\right)\right )$ of the consensus point and the employment of truncation of the noise variance $\left (\,\!\left \| V^i_t-v_{\alpha }\left(\widehat \rho ^{\,N}_{t}\right) \right \|_2\wedge M\right )$ are the main innovations to the original CBO method. We shall explain and justify these modifications in the following paragraph.
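To make the two modifications concrete, the following is a minimal sketch of one iteration of a scheme of the type (1), assuming the update $V^i_{k+1}=V^i_k-\Delta t\,\lambda \big (V^i_k-\mathcal{P}_{v_b,R}(v_\alpha )\big )+\sigma \big (\|V^i_k-v_\alpha \|_2\wedge M\big )B^i_k$ with isotropic Gaussian increments $B^i_k$ of covariance $\Delta t\,\mathsf{Id}_d$ . It is not the authors' Matlab implementation; the function names, the quadratic test objective and all parameter values are hypothetical.

```python
import math
import random


def consensus_point(particles, f, alpha):
    """Weighted average with weights exp(-alpha * f(v)), shifted for stability."""
    energies = [f(v) for v in particles]
    f_min = min(energies)
    weights = [math.exp(-alpha * (e - f_min)) for e in energies]
    total = sum(weights)
    dim = len(particles[0])
    return [sum(w * v[j] for w, v in zip(weights, particles)) / total
            for j in range(dim)]


def project_ball(v, center, radius):
    """Euclidean projection of v onto the closed ball B_radius(center)."""
    diff = [a - c for a, c in zip(v, center)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm <= radius:
        return list(v)
    return [c + radius * x / norm for c, x in zip(center, diff)]


def cbo_step(particles, f, alpha, lam, sigma, dt, M, v_b, R, rng):
    """One Euler-Maruyama step with projected drift target and truncated noise."""
    v_alpha = consensus_point(particles, f, alpha)
    target = project_ball(v_alpha, v_b, R)  # drift uses the projected point
    updated = []
    for v in particles:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, v_alpha)))
        scale = sigma * min(dist, M) * math.sqrt(dt)  # truncation at M
        updated.append([a - dt * lam * (a - p) + scale * rng.gauss(0.0, 1.0)
                        for a, p in zip(v, target)])
    return updated


def objective(v):
    # Hypothetical test objective with minimiser v* = (1, 1).
    return sum((x - 1.0) ** 2 for x in v)


rng = random.Random(0)
particles = [[rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)] for _ in range(200)]
for _ in range(200):
    particles = cbo_step(particles, objective, alpha=50.0, lam=1.0, sigma=0.5,
                         dt=0.05, M=1.0, v_b=[0.0, 0.0], R=5.0, rng=rng)
estimate = consensus_point(particles, objective, alpha=50.0)
print("consensus point after 200 steps:", estimate)
```

Note that the drift is directed towards the projected consensus point, whereas the noise amplitude is computed from the distance to the unprojected one, in line with the description above.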
Despite these technical improvements, the approach to analyse the convergence behaviour of the implementable scheme (1) follows a similar route already explored in [Reference Carrillo, Choi, Totzeck and Tse8–Reference Fornasier, Klock, Riedl, Laredo, Hidalgo and Babaagba11]. In particular, the convergence behaviour of the method to the global minimiser $v^*$ of the objective $f$ is investigated on the level of the mean-field limit [Reference Fornasier, Klock and Riedl10, Reference Huang and Qiu45] of the system (3). More precisely, we study the macroscopic behaviour of the agent density $\rho \in{\mathcal{C}}([0,T],{\mathcal{P}}(\mathbb{R}^d))$ , where $\rho _t=\textrm{Law}(\overline{V}_t)$ with
and initial data $\overline{V}_0\sim \rho _0$ . Afterwards, by establishing a quantitative estimate on the mean-field approximation, that is, the proximity of the mean-field system (8) to the interacting particle system (3) and combining the two results, we obtain a convergence result for the CBO algorithm (1) with truncated noise.
1.1. Motivation for using truncated noise
In what follows, we provide a heuristic explanation of the theoretical benefits of employing a truncation in the noise of CBO as in (1), (3) and (8). Let us therefore first recall that the standard variant of CBO [Reference Pinnau, Totzeck, Tse and Martin7] can be retrieved from the model considered in this paper by setting $v_b=0$ , $ R=\infty$ and $M=\infty$ . For instance, in place of the mean-field dynamics (8), we would have
Owing to the Laplace principle (7), it holds $v_{\alpha }({\rho ^{\text{CBO}}_t})\approx v^*$ for $\alpha$ sufficiently large, that is, as $\alpha \rightarrow \infty$ , the former dynamics converges to
First, observe that here the first term imposes a direct drift to the global minimiser $v^*$ and thereby induces a contracting behaviour, which is on the other hand counteracted by the diffusion term, which contributes a stochastic exploration around this point. In particular, with $\overline{Y}^{\text{CBO}}_t$ approaching $v^*$ , the exploration vanishes so that $\overline{Y}^{\text{CBO}}_t$ converges eventually deterministically to $v^*$ . Conversely, as long as $\overline{Y}^{\text{CBO}}_t$ is far away from $v^*$ , the random exploration is strong. By Itô’s formula, we have
and thus
for any $p\geq 1$ . Denoting with $\mu ^{\text{CBO}}_t$ the law of $\overline{Y}^{\text{CBO}}_t$ , this means that, given any $\lambda,\sigma \gt 0$ , there is some threshold exponent $p^*=p^*(\lambda,\sigma,d)$ , such that
for $p\lt p^*$ , while for $p\gt p^*$ it holds
Recalling that the distribution of a random variable $Y$ has heavy tails if and only if the moment generating function $M_Y(s)\,:\!=\,\mathbb{E}\!\left [\exp \!(sY)\right ] =\mathbb{E}\!\left [\sum _{p=0}^\infty (sY)^p/p!\right ]$ is infinite for all $s\gt 0$ , these computations suggest that the distribution of $\mu ^{\text{CBO}}_t$ exhibits characteristics of heavy tails as $t\to \infty$ , thereby increasing the likelihood of encountering outliers in a sample drawn from $\mu ^{\text{CBO}}_t$ for large $t$ .
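To make the threshold explicit, the following computation is a sketch under the assumption that the limiting dynamics reads $d\overline{Y}^{\text{CBO}}_t=-\lambda \big (\overline{Y}^{\text{CBO}}_t-v^*\big )\,dt+\sigma \big \| \overline{Y}^{\text{CBO}}_t-v^* \big \|_2\,dB_t$ , that is, standard isotropic CBO with the consensus point replaced by $v^*$ . Applying Itô’s formula to $\left \|{\,\cdot \,}-v^*\right \|_2^p$ and taking expectations gives
\begin{align} \frac{d}{dt}\mathbb{E}\!\left [\big \| \overline{Y}^{\text{CBO}}_t-v^* \big \|_2^p\right ] = p\left ({-}\lambda +\frac{\sigma ^2}{2}\left (p+d-2\right )\right )\mathbb{E}\!\left [\big \| \overline{Y}^{\text{CBO}}_t-v^* \big \|_2^p\right ], \end{align}
so that, under this assumption, the $p$ -th moment decays exponentially if $p\lt 2+2\lambda/\sigma ^2-d$ and grows exponentially if $p\gt 2+2\lambda/\sigma ^2-d$ . For $p=2$ , the growth condition reduces to $\sigma ^2d\gt 2\lambda$ .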
On the contrary, for CBO with truncated noise (8), we get, thanks once again to the Laplace principle as $\alpha \rightarrow \infty$ , that (8) converges to
for which we can compute
for any $p\geq 2$ . Notice that to obtain the second inequality we used Young’s inequality as well as Jensen’s inequality. By means of Grönwall’s inequality, we then have
and therefore, denoting with $\mu _t$ the law of $\overline{Y}_t$ ,
for any $p\geq 2$ .
In conclusion, we observe from Equation (10) that the standard CBO dynamics as described in Equation (9) diverges in the setting $\sigma ^2d\gt 2\lambda$ when considering the Wasserstein- $2$ distance $W_2$ . Contrarily, according to Equation (12), the CBO dynamics with truncated noise as presented in Equation (11) converges with exponential rate towards a neighbourhood of $v^*$ , with radius $\sigma M\sqrt{d}/\sqrt{\lambda }$ . This implies that for a relatively small value of $M$ , the CBO dynamics with truncated noise exhibits greater robustness in relation to the parameter $\sigma ^2d/\lambda$ . This effect is confirmed numerically in Figure 1.
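The boundedness of the second moment under truncation can be probed with a small Monte Carlo experiment. The following is a minimal sketch, assuming the one-dimensional limiting dynamics $d\overline{Y}_t=-\lambda (\overline{Y}_t-v^*)\,dt+\sigma (|\overline{Y}_t-v^*|\wedge M)\,dB_t$ with $v^*=0$ , discretised by an Euler–Maruyama scheme. The parameter values are hypothetical and chosen such that $\sigma ^2d\gt 2\lambda$ , the regime in which the second moment of the untruncated dynamics grows.

```python
import math
import random


def truncated_second_moment(lam, sigma, M, y0, T, dt, n_paths, seed=0):
    """Monte Carlo estimate of E[|Y_T|^2] for the truncated dynamics (v* = 0)."""
    rng = random.Random(seed)
    steps = int(round(T / dt))
    sqrt_dt = math.sqrt(dt)
    total = 0.0
    for _ in range(n_paths):
        y = y0
        for _ in range(steps):
            noise = sigma * min(abs(y), M) * sqrt_dt * rng.gauss(0.0, 1.0)
            y += -lam * y * dt + noise
        total += y * y
    return total / n_paths


# Hypothetical parameters with sigma^2 * d = 4 > 2 * lam = 2.
lam, sigma, M, d = 1.0, 2.0, 0.5, 1

# Squared radius sigma^2 M^2 d / lambda of the neighbourhood stated above.
radius_squared = sigma ** 2 * M ** 2 * d / lam

moment = truncated_second_moment(lam, sigma, M, y0=3.0, T=5.0, dt=0.01,
                                 n_paths=1000)

# Exponential rate of the second moment without truncation (positive here).
standard_growth_rate = sigma ** 2 * d - 2.0 * lam

print("estimated second moment at T:", moment)
print("squared radius sigma^2 M^2 d / lambda:", radius_squared)
print("second-moment growth rate without truncation:", standard_growth_rate)
```

A sample average of the untruncated dynamics is deliberately not reported: because of the heavy tails discussed above, an empirical mean over a moderate number of paths is not a reliable estimate of a moment that is dominated by rare outliers.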
Remark 1 (Sub-Gaussianity of truncated CBO). An application of Itô’s formula allows one to show that, for some $\kappa \gt 0$ , $\mathbb{E}\!\left [\exp \!\left (\!\left \| \mkern 1.5mu\overline{\mkern -1.5muY\mkern -1.5mu}\mkern 0mu_t-v^* \right \|_2^2/ \kappa ^2\right )\right ]\lt \infty$ , provided $\mathbb{E}\!\left [\exp \!\left (\!\left \| \mkern 1.5mu\overline{\mkern -1.5muY\mkern -1.5mu}\mkern 0mu_0-v^* \right \|_2^2/ \kappa ^2\right )\right ]\lt \infty$ . Thus, by incorporating a truncation in the noise term of the CBO dynamics, we ensure that the resulting distribution $\mu _t$ exhibits sub-Gaussian behaviour and therefore we enhance the regularity and well-behavedness of the statistics of $\mu _t$ . As a consequence, more reliable and stable results when analysing the properties and characteristics of the dynamics are to be expected.
1.2. Contributions
In view of the aforementioned enhanced regularity and well-behavedness of the statistics of CBO with truncated noise compared to standard CBO [Reference Pinnau, Totzeck, Tse and Martin7] together with the numerically observed improved performance as depicted in Figure 1, a rigorous convergence analysis of the implementable CBO algorithm with truncated noise as given in (1) is of theoretical interest. In this work, we provide theoretical guarantees of global convergence of (1) to the global minimiser $v^*$ for possibly non-convex and non-smooth objective functions $f$ . The approach to analyse the convergence behaviour of the implementable scheme (1) follows a similar route as initiated and explored by the authors of [Reference Carrillo, Choi, Totzeck and Tse8–Reference Fornasier, Klock, Riedl, Laredo, Hidalgo and Babaagba11]. In particular, we first investigate the mean-field behaviour (8) of the system (3). Then, by establishing a quantitative estimate on the mean-field approximation, that is, the proximity of the mean-field system (8) to the interacting particle system (3), we obtain a convergence result for the CBO algorithm (1) with truncated noise. Our proof technique nevertheless differs in crucial parts from the one in [Reference Fornasier, Klock and Riedl10, Reference Fornasier, Klock, Riedl, Laredo, Hidalgo and Babaagba11] as, on the one hand, we do take advantage of the truncations, and, on the other hand, we require additional technical effort to exploit and deal with the enhanced flexibility of the truncated model. Specifically, the central novelty can be identified in the proof of sub-Gaussianity of the process, see Lemma 8.
1.3. Organisation
In Section 2, we present and discuss our main theoretical contribution about the global convergence of CBO with truncated noise in probability and expectation. Section 3 collects the necessary proof details for this result. In Section 4, we numerically demonstrate the benefits of using truncated noise, before we provide a conclusion of the paper in Section 5. For the sake of reproducible research, in the GitHub repository https://github.com/KonstantinRiedl/CBOGlobalConvergenceAnalysis, we provide the Matlab code implementing CBO with truncated noise.
1.4. Notation
We use $\left \|{\,\cdot \,}\right \|_2$ to denote the Euclidean norm on $\mathbb{R}^d$ . Euclidean balls are denoted as $B_{r}(u) \,:\!=\, \{v \in \mathbb{R}^d\,:\, \|{v-u}\|_2 \leq r\}$ . For the space of continuous functions $f\,:\,X\rightarrow Y$ , we write ${\mathcal{C}}(X,Y)$ , with $X\subset \mathbb{R}^n$ and a suitable topological space $Y$ . For an open set $X\subset \mathbb{R}^n$ and for $Y=\mathbb{R}^m$ , the spaces ${\mathcal{C}}^k_{c}(X,Y)$ and ${\mathcal{C}}^k_{b}(X,Y)$ contain functions $f\in{\mathcal{C}}(X,Y)$ that are $k$ -times continuously differentiable and have compact support or are bounded, respectively. We omit $Y$ in the real-valued case. All stochastic processes are considered on the probability space $\left (\Omega,\mathscr{F},\mathbb{P}\right )$ . The main objects of study are laws of such processes, $\rho \in{\mathcal{C}}([0,T],{\mathcal{P}}(\mathbb{R}^d))$ , where the set ${\mathcal{P}}(\mathbb{R}^d)$ contains all Borel probability measures over $\mathbb{R}^d$ . With $\rho _t\in{\mathcal{P}}(\mathbb{R}^d)$ , we refer to a snapshot of such law at time $t$ . Measures $\varrho \in{\mathcal{P}}(\mathbb{R}^d)$ with finite $p$ -th moment $\int \|{v}\|_2^p\,d\varrho (v)$ are collected in ${\mathcal{P}}_p(\mathbb{R}^d)$ . For any $1\leq p\lt \infty$ , $W_p$ denotes the Wasserstein- $p$ distance between two Borel probability measures $\varrho _1,\varrho _2\in{\mathcal{P}}_p(\mathbb{R}^d)$ , see, for example, [Reference Ambrosio, Gigli and Savaré46]. $\mathbb{E}\!\left [\cdot \right ]$ denotes the expectation.
2. Global convergence of CBO with truncated noise
We now present the main theoretical result of this work about the global convergence of CBO with truncated noise for objective functions that satisfy the following conditions.
Definition 2 (Assumptions). Throughout we are interested in functions ${f} \in{\mathcal{C}}(\mathbb{R}^d)$ , for which
A1 there exist $v^*\in \mathbb{R}^d$ such that ${f}(v^*)=\inf _{v\in \mathbb{R}^d}{f}(v)\,=\!:\,\underline{f}$ and $\underline{\alpha },L_u\gt 0$ such that
\begin{align} \sup _{v\in \mathbb{R}^d}\left \| ve^{-\alpha (f(v)-\underline{f})} \right \|_2\,=\!:\,L_u\lt \infty \tag{13} \end{align}
for any $\alpha \geq \underline{\alpha }$ and any $v\in \mathbb{R}^d$ ,
A2 there exist ${f}_{\infty },R_0,\nu,L_\nu \gt 0$ such that
\begin{align} \left \|{v-v^*}\right \|_2 &\leq \frac{1}{L_\nu }\big({f}(v)-\underline{f}\big)^{\nu } \quad \text{ for all } v \in B_{R_0}(v^*), \tag{14}\\ {f}_{\infty } &\lt{f}(v)-\underline{f}\quad \text{ for all } v \in \big (B_{R_0}(v^*)\big )^c, \tag{15} \end{align}
A3 there exist $L_{\gamma }\gt 0,\gamma \in [0,1]$ such that
\begin{align} \left |{f(v)-f(w)}\right | &\leq L_{\gamma }\!\left(\,\!\left \| v-v^* \right \|_2^{\gamma }+\left \| w-v^* \right \|_2^{\gamma }\right)\left \| v-w \right \|_2 \quad \text{ for all } v, w \in \mathbb{R}^d, \tag{16}\\ f(v)-\underline{f} &\leq L_{\gamma }\!\left (1+\left \| v-v^* \right \|_2^{1+\gamma }\right ) \quad \text{ for all } v \in \mathbb{R}^d. \tag{17} \end{align}
A few comments are in order: Condition A1 establishes the existence of a minimiser $v^*$ and requires a certain growth of the function $f$ . Condition A2 ensures that the value of the function $f$ at a point $v$ can locally be an indicator of the distance between $v$ and the minimiser $v^*$ . This error-bound condition was first introduced in [Reference Fornasier, Klock and Riedl10] under the name inverse continuity condition. It in particular guarantees the uniqueness of the global minimiser $v^*$ . Condition A3 sets controllable bounds on the local Lipschitz constant of $f$ and on the growth of $f$ , which is required to be at most quadratic. A similar requirement appears also in [Reference Carrillo, Choi, Totzeck and Tse8, Reference Fornasier, Klock and Riedl10], but there also a quadratic lower bound was imposed.
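As a simple sanity check of these conditions, consider the hypothetical one-dimensional example $f(v)=v^2$ with $v^*=0$ and $\underline{f}=0$ , which is not taken from the paper. Under this assumption, A2 holds with $\nu =1/2$ and $L_\nu =1$ , A3 holds with $\gamma =1$ and $L_\gamma =1$ , and the supremum in A1 equals $1/\sqrt{2\alpha e}$ for each fixed $\alpha \gt 0$ . The following sketch verifies these claims on a grid.

```python
import math


def f(v):
    # Hypothetical example objective with minimiser v* = 0 and minimum 0.
    return v * v


grid = [-5.0 + 10.0 * k / 2000 for k in range(2001)]
alpha = 2.0

# A1: sup_v |v| exp(-alpha f(v)) is finite; for f(v) = v^2 it is attained
# at v = 1 / sqrt(2 alpha) and equals 1 / sqrt(2 alpha e).
sup_a1 = max(abs(v) * math.exp(-alpha * f(v)) for v in grid)
sup_a1_exact = 1.0 / math.sqrt(2.0 * alpha * math.e)

# A2 (first inequality) with nu = 1/2 and L_nu = 1: |v - v*| <= f(v)^(1/2).
a2_holds = all(abs(v) <= math.sqrt(f(v)) + 1e-12 for v in grid)

# A3 with gamma = 1 and L_gamma = 1, checked on a coarser grid of pairs.
coarse = grid[::50]
a3_lipschitz = all(
    abs(f(v) - f(w)) <= (abs(v) + abs(w)) * abs(v - w) + 1e-12
    for v in coarse for w in coarse
)
a3_growth = all(f(v) <= 1.0 + abs(v) ** 2 for v in grid)

print("A1 supremum on grid:", sup_a1, "closed form:", sup_a1_exact)
print("A2 holds:", a2_holds)
print("A3 holds:", a3_lipschitz and a3_growth)
```

A grid check is of course no proof; here each inequality can also be verified by hand, for instance $|v^2-w^2|=|v+w|\,|v-w|\leq (|v|+|w|)\,|v-w|$ for the first part of A3.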
2.1. Main result
We can now state the main result of the paper. Its proof is deferred to Section 3.
Theorem 3. Let ${f} \in{\mathcal{C}}(\mathbb{R}^d)$ satisfy A1, A2 and A3. Moreover, let $\rho _0\in{\mathcal{P}}_4(\mathbb{R}^d)$ with $v^*\in \operatorname{supp}(\rho _0)$ . Let $V^i_{ 0,\Delta t}$ be sampled i.i.d. from $\rho _0$ and denote by $((V^i_{{k,\Delta t}})_{k=1,\dots,K})_{i=1,\dots,N}$ the iterations generated by the numerical scheme (1). Fix any $\epsilon \in \left(0,W_2^2\left (\rho _0,\delta _{v^*}\right )\right )$ , define the time horizon
and let $K \in \mathbb{N}$ and $\Delta t$ satisfy ${{K\Delta t}}=T^*$ . Moreover, let $R\in \big (\!\left \| v_b-v^* \right \|_2+\sqrt{\epsilon/2},\infty \big )$ , $M\in (0,\infty )$ and $\lambda,\sigma \gt 0$ be such that $\lambda \geq 2\sigma ^2d$ or $\sigma ^2M^2d=\mathcal{O}(\epsilon )$ . Then, by choosing $\alpha$ sufficiently large and $N\geq \left(16\alpha L_{\gamma }\sigma ^2M^2\right)/\lambda$ , it holds
up to a generic constant. Here, $C_{\textrm{NA}}$ depends linearly on the dimension $d$ and the number of particles $N$ and exponentially on the time horizon $T^*$ , $m$ is the order of accuracy of the numerical scheme (for the Euler–Maruyama scheme $m = 1/2$ ), and $C_{\textrm{MFA}} = C_{\textrm{MFA}}(\lambda,\sigma,d,\alpha,L_{\nu },\nu,L_{\gamma },L_u,T^*,R,v_b,v^*,M)$ .
Remark 4. In the statement of Theorem 3, the parameters $R$ and $v_b$ play a crucial role. We already mentioned how they can be chosen in an example after Equation (5). The role of these parameters is bolstered in particular in the proof of Theorem 3, where it is demonstrated that, by selecting a sufficiently large $\alpha$ depending on $R$ and $v_b$ , the dynamics (8) can be set equal to
where $\delta$ represents a small value. For the dynamics (3), we can analogously establish its equivalence to
with high probability, contingent upon the selection of sufficiently large values for both $\alpha$ and $N$ .
Remark 5. The convergence result in the form of Theorem 3 obtained in this work differs from the one presented in [Reference Fornasier, Klock and Riedl10, Theorem 14] in the sense that we obtain convergence in expectation, while in [Reference Fornasier, Klock and Riedl10] convergence with high probability is established. This distinction arises from the truncation of the noise term employed in our algorithm.
3. Proof details for Section 2
3.1. Well-posedness of Equations (1) and (3)
With the projection map ${\mathcal{P}}_{v_b,R}$ being $1$ -Lipschitz, existence and uniqueness of strong solutions to the SDEs (1) and (3) are assured by essentially analogous proofs as in [Reference Carrillo, Choi, Totzeck and Tse8, Theorems 2.1, 3.1 and 3.2]. The details shall be omitted. Let us remark, however, that due to the presence of the truncation and the projection map, we do not require the function $f$ to be bounded from above or exhibit quadratic growth outside a ball, as required in [Reference Carrillo, Choi, Totzeck and Tse8, Theorems 2.1, 3.1 and 3.2].
3.2. Proof details for Theorem 3
Remark 6. Since adding some constant offset to $f$ does not affect the dynamics of Equations (3) and (8), we will assume $\underline{f}=0$ in the proofs for simplicity but without loss of generality.
Let us first provide a sketch of the proof of Theorem 3. For the approximation error (18), we have the error decomposition
where $\big(\big(\overline{V}_t^i\big)_{t\geq 0}\big)_{i=1,\dots,N}$ denote $N$ independent copies of the mean-field process $\big(\overline{V}_t\big)_{t\geq 0}$ satisfying Equation (8).
In what follows, we investigate each of the three terms separately. Term $I$ can be bounded by $C_{\textrm{NA}}\!\left (\Delta t\right )^{2m}$ using classical results on the convergence of numerical schemes for SDEs, as mentioned for instance in [Reference Platen43]. The second and third terms, respectively, are analysed in separate subsections, providing detailed explanations and bounds for each of the two terms $II$ and $III$ .
Before doing so, let us provide a concise guide for reading the proofs. As the proofs are quite technical, we start for the reader’s convenience by presenting the main building blocks of the result first and collect the more technical steps in subsequent lemmas. This arrangement should allow the reader to grasp the structure of the proof more easily and to dig deeper into the details while reading.
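For orientation, a decomposition consistent with the description of the three terms can be written as follows. It is a reconstruction under the assumption that the approximation error is measured for the empirical mean of the iterates, and it should not be read as a verbatim restatement of (19). By the triangle inequality and $(a+b+c)^2\leq 3(a^2+b^2+c^2)$ , it holds up to a generic constant
\begin{align} \mathbb{E}\!\left [\Big \| \frac{1}{N}\sum _{i=1}^N V^i_{K,\Delta t}-v^* \Big \|_2^2\right ] \lesssim \underbrace{\mathbb{E}\!\left [\Big \| \frac{1}{N}\sum _{i=1}^N \big (V^i_{K,\Delta t}-V^i_{T^*}\big ) \Big \|_2^2\right ]}_{I} + \underbrace{\mathbb{E}\!\left [\Big \| \frac{1}{N}\sum _{i=1}^N \big (V^i_{T^*}-\overline{V}^i_{T^*}\big ) \Big \|_2^2\right ]}_{II} + \underbrace{\mathbb{E}\!\left [\Big \| \frac{1}{N}\sum _{i=1}^N \overline{V}^i_{T^*}-v^* \Big \|_2^2\right ]}_{III}, \end{align}
where Term $I$ is the error of the numerical approximation, Term $II$ the error of the mean-field approximation and Term $III$ the optimisation error of the mean-field dynamics.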
3.2.1. Upper bound for the second term in (19)
For Term $II$ of the error decomposition (19), we have the following upper bound.
Proposition 7. Let ${f} \in{\mathcal{C}}(\mathbb{R}^d)$ satisfy A1, A2 and A3. Moreover, let $R$ and $M$ be finite such that $R\geq \left \| v_b-v^* \right \|_2$ and let $N\geq (16\alpha L_{\gamma }\sigma ^2M^2)/\lambda$ . Then, we have
where $C_{\textrm{MFA}} = C_{\textrm{MFA}}(\lambda,\sigma,d,\alpha,L_{\nu },\nu,L_{\gamma },L_u,T^*,R,v_b,v^*,M)$ .
Proof. By a synchronous coupling, we have
with coinciding Brownian motions. Moreover, recall that $\textrm{Law}\big(\overline{V}_t^i\big)=\rho _t$ and $\widehat \rho ^{\,N}_{t}={1}/{N}\sum _{i=1}^N\delta _{V_t^i}$ . By Itô’s formula, we then have
and after taking the expectation on both sides
Here, let us remark that the last (stochastic) term in (21) disappears after taking the expectation. This is due to $\mathbb{E}\!\left [\left \| \overline{V}_t^i-V_t^i \right \|_2^2\right ]\lt \infty$ , which can be derived from Lemma 8 after noticing that Lemma 8 also holds for processes $V_t^i$ . Since by Young’s inequality, it holds
and
we obtain
after inserting the former two inequalities into Equation (22). For the term $\mathbb{E}\!\left [\left \| v_{\alpha }(\rho _t)-{v}_{\alpha }\!\left(\widehat \rho ^{\,N}_{t}\right) \right \|_2^2\right ]$ , we can decompose
where we denote
For the first term in Equation (24), by Lemma 11, we have
for some constant $C_0$ depending on $\lambda,\sigma,d,\alpha,L_{\gamma },L_u,T^*,R,v_b,v^*$ and $M$ . For the second term in Equation (24), by combining [Reference Carrillo, Choi, Totzeck and Tse8, Lemma 3.2] and Lemma 8, we obtain
for some constant $C_1$ depending on $\lambda,\sigma,d,\alpha,L_u,R$ and $M$ . Combining these estimates, we conclude
After an application of Grönwall’s inequality and noting that $\overline{V}^i_0=V^i_0$ for all $i=1,\dots,N$ , we have
for any $t\in [0,T^*]$ . Finally, by Jensen’s inequality and letting $t=T^*$ , we have
where the constant $C_{\textrm{MFA}}$ depends on $\lambda,\sigma,d,\alpha,L_u,L_{\gamma },T^*,R,v_b,v^*$ and $M$ .
In the next lemma, we show that the distribution of $\overline{V}_t$ is sub-Gaussian.
Lemma 8. Let $R$ and $M$ be finite with $R\geq \left \| v_b-v^* \right \|_2$ . For any $ \kappa \gt 0$ , let $N$ satisfy $N\geq{\left(4 \sigma ^2 M^2\right)}/{\left(\lambda \kappa ^2\right)}$ . Then, provided that $\mathbb{E}\!\left [\exp \!\left({\sum _{i=1}^N\big\| \overline{V}^i_0-v^* \big\|_2^2}/{\left(N \kappa ^2\right)}\right)\right ]\lt \infty$ , it holds
where $C_{ \kappa }$ depends on $ \kappa,\lambda,\sigma,d,R,M$ and $T^*$ , and where
for $i=1,\dots,N$ with $B^i_t$ being independent of each other and $\textrm{Law}\big(\overline{V}_t^i\big)=\rho _t$ .
Proof. To apply Itô’s formula, we need to truncate the function $\exp \!\left({\left \| v \right \|_2^2}/{ \kappa ^2}\right)$ from above. For this, define for $W\gt 0$ the function
It is easy to verify that $G_W$ is a $\mathcal{C}^2$ approximation of the function $x\wedge W$ satisfying $ G_W\in \mathcal{C}^2(\mathbb{R}^+)$ , $G_W(x)\leq x\wedge W$ , $ G_W^{\prime }\in [0,1]$ and $G_W^{\prime \prime }\leq 0$ .
Since $G_{W,N, \kappa }(t)\,:\!=\,\exp \!\left(G_W\!\left(\sum _{i=1}^N\big\| \overline{V}^i_t-v^* \big\|_2^2/N\right)/{ \kappa ^2}\right)$ is upper-bounded, we can apply Itô’s formula to it. We abbreviate $G_W^{\prime }\,:\!=\,G_W^{\prime }\left(\!\sum _{i=1}^N\big\| \overline{V}_t^i -v^* \big\|_2^2/N\right)$ and $G_W^{\prime \prime }\,:\!=\,G_W^{\prime \prime }\left(\!\sum _{i=1}^N\big\| \overline{V}_t^i -v^* \big\|_2^2/N\right)$ in what follows. With the notation $Y_t\,:\!=\,\left (\big(\overline{V}_t^1\big)^{\top },\cdots,\big(\overline{V}_t^N\big)^{\top }\right )^{\top }$ , the $Nd$ -dimensional process $Y_t$ satisfies $ dY_t=-\lambda \big (Y_t-\overline{\mathcal{P}_{v_b,R}(\rho _t)}\big )\,dt+\mathcal{M}dB_t$ , where $\overline{\mathcal{P}_{v_b,R}(\rho _t)}=\left ({\mathcal{P}_{v_b,R}(\rho _t)}^{\top },\ldots,{\mathcal{P}_{v_b,R}(\rho _t)}^{\top }\right )^{\top }$ , $\mathcal{M}=\operatorname{diag}\!\left (\mathcal{M}_1,\ldots,\mathcal{M}_N\right )$ with $\mathcal{M}_i=\sigma \left (\big\| \overline{V}^i_t-v_\alpha \!\left (\rho _t\right ) \big\|_2\wedge M\right ) \textrm{I}_{d}$ and $B_t$ the $Nd$ -dimensional Brownian motion. We then have $G_{W,N, \kappa }(t)=\exp \!\left (G_W\big (\!\left \| Y_t \right \|_2^2/N\big )/ \kappa ^2\right )$ and
The first term on the right-hand side of (28) can be expanded as
Notice additionally that
as $v^*$ and $\mathcal{P}_{v_b,R}(v_\alpha (\rho _t))$ belong to the same ball $B_R(v_b)$ around $v_b$ of radius $R$ . Similarly, we can expand the coefficient of the second term. According to the properties $G_W^{\prime }\in [0,1]$ and $G_W^{\prime \prime }\leq 0$ , we can bound it from above yielding
By taking expectations in (28) and combining it with (29), (30) and (31), we obtain
Rearranging the former yields
Since by Young’s inequality, it holds $4R\big\| \overline{V}^i_t-v^* \big\|_2\leq 4R^2+\big\| \overline{V}^i_t-v^* \big\|_2^2$ , we can continue Estimate (32) by
with $A\,:\!=\,\frac{\lambda }{N \kappa ^2}-\frac{2\sigma ^2M^2}{N^2 \kappa ^4}$ and $B\,:\!=\,\frac{\sigma ^2M^2d+4\lambda R^2}{ \kappa ^2}$ . Now, if $\sum _{i=1}^N\big\| \overline{V}^i_t-v^* \big\|_2^2\geq ({B-1})/{A}$ , we have
while, if $\sum _{i=1}^N\big\| \overline{V}^i_t-v^* \big\|_2^2\leq ({B-1})/{A}$ , we have
Thus, the latter inequality always holds true and consequently we have with (33)
which gives after integration
Letting $W\to \infty$ , we eventually obtain
provided that $\mathbb{E}\!\left [\exp \!\left({\sum _{i=1}^N\big\| \overline{V}^i_0-v^* \big\|_2^2}/{N \kappa ^2}\right)\right ]\lt \infty$ .
If $N\geq{(4 \sigma ^2 M^2)}/{(\lambda \kappa ^2)}$ , we have
Thus, $C_{ \kappa }$ is upper-bounded and independent of $N$ .
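The role of the lower bound on $N$ can be traced back to the constant $A$ defined in the proof. As a short check using only the definitions above, write
\begin{align} A=\frac{\lambda }{N \kappa ^2}-\frac{2\sigma ^2M^2}{N^2 \kappa ^4} =\frac{\lambda }{N \kappa ^2}\left (1-\frac{2\sigma ^2M^2}{\lambda N \kappa ^2}\right ). \end{align}
If $N\geq{(4 \sigma ^2 M^2)}/{(\lambda \kappa ^2)}$ , then ${2\sigma ^2M^2}/{(\lambda N \kappa ^2)}\leq 1/2$ and hence $A\geq \lambda/(2N \kappa ^2)\gt 0$ . With the choice $\kappa ^2=1/(4\alpha L_{\gamma })$ made in the proof of Lemma 11, the condition on $N$ becomes $N\geq (16\alpha L_{\gamma }\sigma ^2M^2)/\lambda$ , which is the requirement appearing in Theorem 3, Proposition 7 and Lemma 11.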
Remark 9. The sub-Gaussianity of $\overline{V}_t$ follows from Lemma 8 by noticing that the statement can be applied in the setting $N=1$ when choosing $\kappa$ sufficiently large.
Remark 10. In Lemma 8, as the number of particles $N$ increases, the condition for $ \kappa$ to ensure $C_{ \kappa }\lt \infty$ becomes more relaxed. Specifically, the value of $ \kappa$ can be as small as one needs as $N$ increases. This phenomenon can be easily understood by considering the limit as $N$ approaches infinity. In this case, $C_{ \kappa }$ tends to $\sup _{t\in [0,T^*]}\exp \!\left(\mathbb{E}\!\left [\big\| \overline{V}_t-v^* \big\|_2^2\right ]/ \kappa ^2\right)$ . Therefore, as one shows an upper bound on the second moment of $\overline{V}_t$ , it becomes evident that $C_{ \kappa }$ remains finite as $N$ tends to infinity.
With the help of Lemma 8, we can now prove the following lemma.
Lemma 11. Let ${f} \in{\mathcal{C}}(\mathbb{R}^d)$ satisfy A1 and A3. Then, for any $t\in [0,T^*]$ , $M$ and $R$ with $R\geq \left \| v_b-v^* \right \|_2$ finite, and $N$ satisfying $N\geq (16\alpha L_{\gamma }\sigma ^2M^2)/\lambda$ , we have
where $C_0\,:\!=\,C_0(\lambda,\sigma,d,\alpha,L_{\gamma },L_u,T^*,R,v_b,v^*,M)$ .
Proof. Without loss of generality, we assume $v^*=0$ and recall that we assumed $\underline{f}=0$ in the proofs as per Remark 6. We have
where we defined
In the following, we upper-bound the terms $T_1, T_2$ and $ T_3$ separately. First, recall that by Lemma 8 we have for $t\in [0,T^*]$ that
where $C_{ \kappa }$ only depends on $ \kappa,\lambda,\sigma,d,R,M$ and $T^*$ . With this,
where we set $ \kappa ^2={1}/{(4\alpha L_{\gamma })}$ in the next-to-last step and where $N$ should satisfy $N\geq (16\alpha L_{\gamma }\sigma ^2M^2)/\lambda$ .
Second, we have
where $\left (\overline{Z}_t^i\,:\!=\,\overline{V}_t^ie^{-\alpha f(\mkern 1.5mu\overline{\mkern -1.5muV\mkern -1.5mu}\mkern 0mu_t^i)}-{\int _{\mathbb{R}^d}v e^{-\alpha f(v)}d\rho _t(v)}\right )_{i=1,\dots,N}$ are i.i.d. and have zero mean. Thus,
Similarly, we can derive
Collecting the bounds for the terms $T_1$ , $T_2$ and $T_3$ and inserting them in (36), we obtain
Since by Lemmas 14, 16 and 17, we know that $\left \| v_{\alpha }(\rho _t) \right \|_2$ can be uniformly bounded by a constant depending on $\alpha,\lambda,\sigma,d,R,v_b,v^*,M,L_{\nu }$ and $\nu$ (see in particular Equation (48) that combines the aforementioned lemmas), we can conclude (38) with
for some constant $C_0$ depending on $\lambda,\sigma,d,\alpha,L_{\nu },\nu,L_{\gamma },L_u,T^*,R,v_b,v^*$ and $M$ .
3.2.2. Upper bound for the third term in (19)
In this section, we bound Term $III$ of the error decomposition (19). Before stating the main result of this section, Proposition 15, we first need to provide two auxiliary lemmas, Lemma 12 and Lemma 14.
Lemma 12. Let $R,M\in (0,\infty )$ . Then, it holds
If further $\lambda \geq 2\sigma ^2d$ , we have
Proof. By Itô’s formula, we have
which, after taking the expectation on both sides, yields
For the term $\mathbb{E}\!\left [\left \| \overline{V}_t-\mathcal{P}_{v_b,R}\!\left (v_{\alpha }(\rho _t)\right ) \right \|_2^2\right ]$ , we notice that
which, inserted into Equation (42), allows us to derive
From this, we get for any $\lambda$ and $\sigma$ that
as well as
If $\lambda \geq 2\sigma ^2d$ , by Equation (44), we get
Remark 13. When $R=M=\infty$ , we can show
If further $\lambda \geq \sigma ^2d$ , we have
This differs from [Reference Fornasier, Klock and Riedl10, Lemma 18].
The next result is a quantitative version of the Laplace principle as established in [Reference Fornasier, Klock and Riedl10, Proposition 21].
Lemma 14. For any $r\gt 0$ , define $f_r\,:\!=\, \sup _{v \in B_r\left (v^*\right )} f(v)$ . Then, under the inverse continuity condition A2, for any $r \in \left (0, R_0\right ]$ and $q\gt 0$ such that $q+f_r \leq f_{\infty }$ , it holds
With the above preparation, we can now upper bound Term $III$ . We have by Jensen’s inequality
that is, it is enough to upper-bound $\mathbb{E}\!\left [\left \| \overline{V}_{T^*}-v^* \right \|_2^2\right ]$ , which is the content of the next statement.
Proposition 15. Let ${f} \in{\mathcal{C}}(\mathbb{R}^d)$ satisfy A1, A2 and A3. Moreover, let $\rho _0\in{\mathcal{P}}_4(\mathbb{R}^d)$ with $v^*\in \operatorname{supp}(\rho _0)$ . Fix any $\epsilon \in (0,W_2^2(\rho _0,\delta _{v^*}))$ and define the time horizon
Moreover, let $R\in (\!\left \| v_b-v^* \right \|_2+\sqrt{\epsilon/2},\infty )$ , $M\in (0,\infty )$ and $\lambda,\sigma \gt 0$ be such that $\lambda \geq 2\sigma ^2d$ or $\sigma ^2M^2d=\mathcal{O}(\epsilon )$ . Then, we can choose $\alpha$ sufficiently large, depending on $\lambda,\sigma,d,T^*,R,v_b,M,\epsilon$ and properties of $f$ , such that $ \mathbb{E}\!\left [\left \| \overline{V}_{T^*}-v^* \right \|_2^2\right ] =\mathcal{O}(\epsilon )$ .
Proof. We only prove the case $\lambda \geq 2\sigma ^2d$ in detail. The case $\sigma ^2M^2d=\mathcal{O}(\epsilon )$ follows similarly.
According to Lemmas 14 and 17, we have
where $C_{2}\,:\!=\,(\!\exp{q^{\prime}T^*})/{C_{4}}\lt \infty$ , $q^{\prime}$ and $C_4$ are from Lemma 17, and where, as of Lemma 16, $C_{3}\,:\!=\,\sup _{[0,T^*]}\mathbb{E}\!\left [\big\| \overline{V}_t-v^* \big\|_2\right ]\lt \infty$ . In what follows, let us deal with the two terms on the right-hand side of (48). For the term ${\left (q+f_r\right )^\nu }/{L_{\nu }}$ , let $q=f_r$ . Then, by A2 and A3, we can choose a suitable $r$ such that $2(L_{\nu } r)^{{1}/{\nu }}\leq 2f_r\leq f_{\infty }$ . Further, by A3, we have
so if
we can bound
For term ${\exp \!({-}\alpha q)}C_{2}C_{3}$ , we can choose $\alpha$ large enough such that
With these choices of $r$ and $\alpha$ and by integrating them into Equation (48), we obtain
for all $t\in [0,T^*]$ , and thus
Consequently, by Lemma 12, we have
since now $\mathcal{P}_{v_b,R}(v_{\alpha }(\rho _t))=v_{\alpha }(\rho _t)$ . Finally by Grönwall’s inequality, $\mathbb{E}\!\left [\left \| \overline{V}_{T^*}-v^* \right \|_2^2\right ]\leq \epsilon$ .
Lemma 16. Let $\left \| v_b-v^* \right \|_2\lt R\lt \infty$ and $0\lt M\lt \infty$ . Then, it holds
Proof. By Equation (42), we have
yielding
after an application of Grönwall’s inequality for any $t\geq 0$ .
Lemma 17. For any $M\in (0,\infty )$ , $\tau \geq 1$ , $r\gt 0$ and $R\in (\!\left \| v_b-v^* \right \|_2+r,\infty )$ , it holds
where
and where $q^{\prime}$ depends on $\tau,\lambda,\sigma,d,r,R,v_b$ and $M$ .
Proof. Recall that the law $\rho _t$ of $\overline{V}_t$ satisfies the Fokker–Planck equation
Let us first define for $\tau \geq 1$ the test function
for which it is easy to verify that $\phi _r^{\tau }\in \mathcal{C}_c^1(\mathbb{R}^d,[0,1])$ . Since $\textrm{Im}\,\phi _r^{\tau }\subset [0,1]$ , we have $\rho _t(B_r(v^*))\geq \int _{B_r(v^*)}\phi _r^{\tau }(v-v^*)\,d\rho _t(v)$ . To lower bound $\rho _t(B_r(v^*))$ , it is thus sufficient to establish a lower bound on $\int _{B_r(v^*)}\phi _r^{\tau }(v-v^*)\,d\rho _t(v)$ . By Green’s formula,
For simplicity, let us abbreviate
We can choose $\epsilon _1$ small enough, depending on $\tau$ and $d$ , such that when ${\left \| v-v^* \right \|_2}/{r}\gt 1-\epsilon _1$ , we have
where the last inequality works if ${\left \| v-v^* \right \|_2}/{r}\geq 1-1/(6(d+\tau -2))$ .
If $v_{\alpha }(\rho _t)\not \in B_{R}(v_b)$ , we have $\left |{\left \langle v-\mathcal{P}_{v_b,R}\!\left (v_{\alpha }(\rho _t)\right ), v-v^* \right \rangle }\right |/r^2\leq C(r,R,v_b)$ and, since $R\gt \left \| v_b-v^* \right \|_2+r$ , $({\left \| v-v_{\alpha }(\rho _t) \right \|_2^2\wedge M^2})/{r^2}\geq C(r,M,R,v_b)$ , which allows us to choose $\epsilon _2$ small enough, depending on $\lambda,r,\sigma,R,v_b$ and $M$ , such that $\Theta \gt 0$ when ${\left \| v-v^* \right \|_2}/{r}\gt 1-\min \{\epsilon _1,\epsilon _2\}$ .
If $v_{\alpha }(\rho _t)\in B_{R}(v_b)$ and $\left \| v-v_{\alpha }(\rho _t) \right \|_2\leq M$ , we have by Lemma 18
when $\left \| v-v^* \right \|_2/r\in \left [1-2\sigma ^2/(3\lambda ),1\right ]$ .
If $v_{\alpha }(\rho _t)\in B_{R}(v_b)$ and $\left \| v-v_{\alpha }(\rho _t) \right \|_2\gt M$ , we have
that is, we can choose $\epsilon _3$ small enough, depending on $\lambda,r,\sigma,R,v_b$ and $M$ , such that $\Theta \geq 0$ when ${\left \| v-v^* \right \|_2}/{r}\gt 1-\min \{\epsilon _1,\epsilon _2,\epsilon _3,{2\sigma ^2}/{3\lambda }\}$ .
Combining the cases from above, we conclude that $\Theta \geq 0$ when ${\left \| v-v^* \right \|_2}/{r}\geq 1-\min \{\epsilon _1,\epsilon _2,\epsilon _3,{2\sigma ^2}/{3\lambda }\}$ . On the other hand, when ${\left \| v-v^* \right \|_2}/{r}\leq 1-\min \{\epsilon _1,\epsilon _2,\epsilon _3,{2\sigma ^2}/{3\lambda }\}$ , we have
for some constant $C_5$ depending on $r,R,M,v_b,\lambda,\sigma,d$ and $\tau$ , since $\left |{\Theta }\right |$ is upper-bounded and $\phi _r^{\tau }(v-v^*)\geq \phi _r^{\tau }((1-\min \{\epsilon _1,\epsilon _2,\epsilon _3,{2\sigma ^2}/{3\lambda }\})r)\gt 0$ for any $v$ satisfying ${\left \| v-v^* \right \|_2}/{r}\leq 1-\min \{\epsilon _1,\epsilon _2,\epsilon _3,{2\sigma ^2}/{3\lambda }\}$ .
All in all we have
where $q^{\prime}\,:\!=\,\max \{C_5,0\}$ . By Grönwall’s inequality, we thus have
which concludes the proof.
Lemma 18. Let $a,b\gt 0$ . Then, we have
for any $x\in [1-{2a}/{b},1]\cap (0,\infty )$ and $y\geq 0$ .
Proof. For $y=0$ , this is true. For $y\gt 0$ , divide both sides by $ay^2$ and denote $c={b}/{a}$ . Then the lemma is equivalent to showing $(1+c(1-x))\left (x/y\right )^2-(2+c(1-x))x/y+1\geq 0$ , that is, it is enough to show $\min _{r\geq 0}\!(1+c(1-x))r^2-(2+c(1-x))r+1\geq 0$ , when $x\in [1-{2}/{c},1]$ . We have
and thus
when $x\in [1-{2}/{c},1]$ . This finishes the proof.
4. Numerical experiments
In this section, we numerically demonstrate the benefit of using CBO with truncated noise. For isotropic [Reference Pinnau, Totzeck, Tse and Martin7, Reference Carrillo, Choi, Totzeck and Tse8, Reference Fornasier, Klock and Riedl10] and anisotropic noise [Reference Carrillo, Jin, Li and Zhu9, Reference Fornasier, Klock, Riedl, Laredo, Hidalgo and Babaagba11], we compare the CBO method with truncation $M=1$ to standard CBO for several benchmark problems in optimisation, which are summarised in Table 1.
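For orientation, two of the benchmark problems can be sketched in code. The definitions below are the standard textbook forms of the Ackley and Rastrigin functions, both with global minimiser $v^*=0$ and minimal value $0$; this is an assumption, and the exact variants listed in Table 1 may differ by a shift or scaling.

```python
import math


def ackley(v):
    """Standard Ackley function (assumed form); global minimum 0 at the origin."""
    d = len(v)
    mean_sq = sum(x * x for x in v) / d
    mean_cos = sum(math.cos(2.0 * math.pi * x) for x in v) / d
    return (-20.0 * math.exp(-0.2 * math.sqrt(mean_sq))
            - math.exp(mean_cos) + 20.0 + math.e)


def rastrigin(v):
    """Standard Rastrigin function (assumed form); global minimum 0 at the origin."""
    return sum(x * x - 10.0 * math.cos(2.0 * math.pi * x) + 10.0 for x in v)
```

Both functions are highly multimodal, with many local minimisers surrounding the global one, which is what makes them useful stress tests for a global optimiser.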
In the subsequent tables, we report comparison results for the two methods for the different benchmark functions as well as different numbers of particles $N$ and, potentially, different numbers of steps $K$ . Throughout, we set $v_b=0$ and $R=\infty$ for convenience; any sufficiently large but finite choice of $R$ yields identical results.
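The role of $R$ can be made concrete with a minimal sketch of the map $\mathcal{P}_{v_b,R}$ , assuming it is the Euclidean projection onto the closed ball of radius $R$ around $v_b$ . The function name `project_ball` is hypothetical and chosen only for illustration.

```python
import math


def project_ball(v, v_b, R):
    """Euclidean projection of v onto the closed ball of radius R around v_b.

    Assumed form of P_{v_b,R}; with R = math.inf the map is the identity.
    """
    diff = [x - c for x, c in zip(v, v_b)]
    dist = math.sqrt(sum(t * t for t in diff))
    if dist <= R:  # point already inside the ball (always true for R = inf)
        return list(v)
    scale = R / dist  # shrink the offset from v_b to length R
    return [c + scale * t for c, t in zip(v_b, diff)]
```

Under this assumption, the projection acts as the identity whenever the consensus point stays inside the ball, which is why a sufficiently large finite $R$ and $R=\infty$ produce the same iterates.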
The success criterion is defined by achieving the condition $\left \| \frac{1}{N}\sum _{i=1}^N V^i_{{K,\Delta t}}-v^* \right \|_2\leq 0.1$ , which ensures that the algorithm has reached the basin of attraction of the global minimiser. The success rate is averaged over $1000$ runs.
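The criterion above can be sketched as follows. The helper names `is_success` and `success_rate`, as well as the convention that a run is a function mapping a seed to the final particle positions, are assumptions made for illustration.

```python
import math


def is_success(particles, v_star, tol=0.1):
    """True if the Euclidean distance between the particle mean and v_star is at most tol."""
    n = len(particles)
    d = len(v_star)
    mean = [sum(p[j] for p in particles) / n for j in range(d)]
    err = math.sqrt(sum((m - s) ** 2 for m, s in zip(mean, v_star)))
    return err <= tol


def success_rate(run_once, v_star, runs=1000, tol=0.1):
    """Fraction of successful runs; run_once(seed) returns the final particles."""
    hits = sum(is_success(run_once(seed), v_star, tol) for seed in range(runs))
    return hits / runs
```

Note that the criterion is evaluated on the empirical mean of the particles after the final step, not on the best particle or on the consensus point.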
4.1. Isotropic case
Let $d=15$ . In the case of isotropic noise, we always set $\lambda =1$ , $\sigma =0.3$ , $\alpha =10^5$ and step size $\Delta t=0.02$ . The initial positions $(V_0^i)_{i=1,\dots,N}$ are sampled i.i.d. from $\rho _0=\mathcal{N}(0,I_d)$ . In Table 2, we report results comparing the isotropic CBO method with truncation $M=1$ and the original isotropic CBO method [Reference Pinnau, Totzeck, Tse and Martin7, Reference Carrillo, Choi, Totzeck and Tse8, Reference Fornasier, Klock and Riedl10] ( $M=+\infty$ ) for the Ackley, Griewank and Salomon functions. Each algorithm is run for $K=200$ steps.
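A minimal sketch of such an experiment, under stated assumptions, is given below. It assumes an Euler–Maruyama discretisation in which each particle drifts towards the consensus point $v_{\alpha }$ with rate $\lambda$ and receives isotropic Gaussian noise of magnitude $\sigma (\left \| V-v_{\alpha } \right \|_2\wedge M)$ ; the projection is omitted since $R=\infty$ . Because $\alpha =10^5$ makes the weights $e^{-\alpha f}$ underflow, the sketch subtracts the minimal objective value before exponentiating, which leaves the consensus point unchanged. The function names and the default particle number are hypothetical.

```python
import math
import random


def consensus_point(particles, f, alpha):
    """Weighted mean with weights exp(-alpha * f), computed in a shifted, stable form."""
    values = [f(p) for p in particles]
    f_min = min(values)
    # Shifting by f_min keeps the largest weight equal to 1 and avoids 0/0.
    weights = [math.exp(-alpha * (val - f_min)) for val in values]
    total = sum(weights)
    d = len(particles[0])
    return [sum(w * p[j] for w, p in zip(weights, particles)) / total
            for j in range(d)]


def cbo_truncated_isotropic(f, d=15, n=100, steps=200, lam=1.0, sigma=0.3,
                            alpha=1e5, dt=0.02, M=1.0, seed=0):
    """Assumed Euler-Maruyama scheme for isotropic CBO with truncated noise."""
    rng = random.Random(seed)
    # Initial positions sampled i.i.d. from N(0, I_d).
    particles = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    sqrt_dt = math.sqrt(dt)
    for _ in range(steps):
        v_alpha = consensus_point(particles, f, alpha)
        updated = []
        for p in particles:
            diff = [x - c for x, c in zip(p, v_alpha)]
            dist = math.sqrt(sum(t * t for t in diff))
            # Truncation: the noise magnitude never exceeds sigma * M.
            noise_scale = sigma * min(dist, M) * sqrt_dt
            updated.append([x - lam * dt * t + noise_scale * rng.gauss(0.0, 1.0)
                            for x, t in zip(p, diff)])
        particles = updated
    return particles
```

Passing `M=math.inf` recovers the untruncated scheme within the same code path, so both methods of the comparison can share one implementation.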
Since the benchmark functions Rastrigin and Alpine are more challenging, we use more particles $N$ and a larger number of steps $K$ , namely $K=200$ and $K=500$ . We report the results in Table 3.
4.2. Anisotropic case
Let $d=20$ . In the case of anisotropic noise, we set $\lambda =1,\sigma =5,\alpha =10^5$ and step size $\Delta t=0.02$ . The initial positions of the particles are initialised with $\rho _0=\mathcal{N}(0,100I_d)$ . In Table 4, we report results comparing the anisotropic CBO method with truncation $M=1$ and the original anisotropic CBO method [Reference Carrillo, Jin, Li and Zhu9, Reference Fornasier, Klock, Riedl, Laredo, Hidalgo and Babaagba11] ( $M=+\infty$ ) for the Rastrigin, Ackley, Griewank and Salomon functions. Each algorithm is run for $K=200$ steps.
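To illustrate how the anisotropic variant differs, the following sketch shows a single particle update. It assumes that the noise acts componentwise and that the truncation is applied to each coordinate separately, i.e. the $j$ -th coordinate receives noise of magnitude $\sigma (|V_j-(v_{\alpha })_j|\wedge M)$ ; this componentwise truncation is an assumption and may not match the exact scheme behind Table 4. The function name `anisotropic_step` is hypothetical.

```python
import math
import random


def anisotropic_step(p, v_alpha, rng, lam=1.0, sigma=5.0, dt=0.02, M=1.0):
    """One assumed Euler-Maruyama step with componentwise truncated noise.

    p and v_alpha are lists of floats; rng provides gauss(mu, sigma).
    """
    sqrt_dt = math.sqrt(dt)
    updated = []
    for x, c in zip(p, v_alpha):
        diff = x - c
        # Each coordinate is scaled by its own truncated distance to the consensus.
        noise_scale = sigma * min(abs(diff), M) * sqrt_dt
        updated.append(x - lam * dt * diff + noise_scale * rng.gauss(0.0, 1.0))
    return updated
```

With $\sigma =5$ and the wide initialisation $\mathcal{N}(0,100I_d)$ , the untruncated noise scale grows linearly with the distance to the consensus point, whereas under this sketch with $M=1$ it is capped at $\sigma \sqrt{\Delta t}\approx 0.71$ per coordinate and step.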
Since the benchmark function Alpine is more challenging and neither algorithm works in the previous setting, we reduce the dimensionality to $d=15$ , choose $\sigma =1$ , use $\rho _0=\mathcal{N}(0,I_d)$ to initialise, employ more particles and use a larger number of steps $K$ , namely $K=200$ , $K=500$ and $K=1000$ . We report the results in Table 5.
5. Conclusions
In this paper, we establish the convergence to a global minimiser of a potentially non-convex and non-smooth objective function for a variant of CBO which incorporates truncated noise. Truncating the noise in CBO enhances the well-behavedness of the statistics of the law of the dynamics, which improves convergence performance and, in particular, allows greater flexibility in choosing the noise parameter of the method, as we observe numerically. To rigorously prove the convergence of the implementable algorithm to the global minimiser of the objective, we follow the route devised in [Reference Fornasier, Klock and Riedl10].
Acknowledgements
MF acknowledges the support of the Munich Center for Machine Learning. PR acknowledges the support of the Extreme Computing Research Center at KAUST. KR acknowledges the support of the Munich Center for Machine Learning and the financial support from the Technical University of Munich – Institute for Ethics in Artificial Intelligence (IEAI). LS acknowledges the support of the KAUST Optimization and Machine Learning Lab. LS also thanks the Chair of Applied Numerical Analysis of the Technical University of Munich for its hospitality and for discussions that contributed to the finalisation of this work.
Funding statement
This work has been funded by the KAUST Baseline Research Scheme, the German Federal Ministry of Education and Research, and the Bavarian State Ministry for Science and the Arts.
Competing interests
None.