1. Introduction
We are given a diffusion process
\begin{align*}dX_t = \alpha(X_t)\,dt + \beta(X_t)\,dW_t, \qquad t\geq 0, \tag{1.1}\end{align*}
where $X_0=x_0\in\mathbb{R}^d$ is given, $\alpha\,:\,\mathbb{R}^d\rightarrow\mathbb{R}^d$ , $\beta\,:\,\mathbb{R}^d\rightarrow\mathbb{R}^{d\times d}$ , and $\{W_t\}_{t\geq 0}$ is a standard d-dimensional Brownian motion. We consider the setting in which this diffusion is a latent process that is observed only through a sequence of data which are discrete and regular in time; the structure of the observations is a special case of that of a state-space or hidden Markov model (HMM). More specifically, we suppose that data are observed at regular, unit-spaced times, forming a sequence of random variables $(Y_1,Y_2,\dots)$ . At time k we assume that, conditional on all other random variables in the sequence and on the diffusion process $\{X_t\}_{t\geq 0}$ , $Y_k$ has a probability density function that depends on $x_k$ only and is denoted by $g(x_k,y_k)$ . These models have a wide range of applications in finance, econometrics, and engineering; see [Reference Beskos3, Reference Cappé, Moulines and Rydén5, Reference Del Moral6, Reference Golightly and Wilkinson12, Reference Ryder, Golightly, McGough and Prangle22] for some particular examples in various contexts.
In this paper we consider the problems of filtering and normalizing constant (marginal likelihood) estimation, that is, the recursive-in-time (k being the time index) computation of expectations with respect to the conditional distribution of $X_k|y_{1},\dots,y_k$ (filtering) and the calculation of the associated marginal likelihood of the data $(y_1,\dots,y_k)$ . For standard HMMs, that is, where the latent process is a discrete-time Markov chain, this is a notoriously challenging problem, requiring the application of advanced numerical (Monte Carlo) methods such as the particle filter; see e.g. [Reference Cappé, Moulines and Rydén5, Reference Del Moral6] for a survey. The particle filter simulates a collection of N samples in parallel, with the samples undergoing sampling and resampling transitions. The first of these propagates the samples using the transition of the hidden Markov chain, and the second resamples with replacement amongst the collection of samples according to weights proportional to $g(x_k,y_k)$ . The scenario which we consider is even more challenging, as typically the transition density associated to (1.1), assuming it exists, is intractable. This can limit the applicability of particle filters; although there are some exceptions [Reference Fearnhead, Papaspiliopoulos and Roberts8] and exact simulation methods [Reference Beskos and Roberts1, Reference Blanchet and Zhang4], these are often not general enough or too expensive to be of practical use in the filtering problem. As a result, we focus on the case where one discretizes the process (1.1) in time.
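To fix ideas, the following is a minimal sketch, in Python, of a single assimilation step of a bootstrap-type particle filter. The functions `propagate` (simulating, or time-discretizing, the latent dynamics over one unit of time) and `g` (the observation density $g(x_k,y_k)$) are placeholders introduced purely for this illustration; they are not the schemes developed later in the paper.

```python
import numpy as np

def particle_filter_step(particles, y_k, propagate, g, rng):
    """One assimilation step of a bootstrap particle filter.

    particles : array of shape (N, d), the particle cloud at time k-1
    y_k       : observation at time k
    propagate : callable moving an (N, d) array through the latent dynamics
    g         : callable returning the observation density g(x, y)
    """
    # Sampling: move every particle through the (discretized) dynamics.
    particles = propagate(particles, rng)
    # Weighting: weights proportional to the observation density.
    w = np.array([g(x, y_k) for x in particles])
    w /= w.sum()
    # Self-normalized estimate of the filter mean E[X_k | y_{1:k}].
    filter_mean = (w[:, None] * particles).sum(axis=0)
    # Resampling: multinomial resampling with replacement.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], filter_mean
```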
In recent years, one of the most successful methods for improving Monte-Carlo-based estimators associated to probability laws under time-discretization has been the multilevel Monte Carlo (MLMC) method [Reference Giles9, Reference Giles10, Reference Heinrich13]. This is an approach that considers a collapsing-sum representation of an expectation with respect to a probability law at a given level of discretization. The collapsing element is associated to differences of expectations at consecutive, increasingly coarse discretization levels, with a final (single) expectation at the coarsest level. Then, if one can sample appropriate couplings of the probability laws at consecutive levels, it is possible to reduce the cost to achieve a certain mean squared error (MSE) in several examples, diffusions being one of them. This method has been combined with the particle filter in several articles, resulting in the multilevel particle filter (MLPF); see [Reference Jasra, Kamatani, Law and Zhou14, Reference Jasra, Kamatani, Osei and Zhou15, Reference Jasra, Law and Osei16], as well as [Reference Jasra, Law and Suciu17] for a review and [Reference Jasra, Law and Yu18] for extensions.
The method of [Reference Jasra, Kamatani, Law and Zhou14] is intrinsically based upon the ability to sample couplings of discretized diffusion processes. In almost all of the applications that we are aware of, this is based upon the synchronous coupling of Brownian motion for an Euler or Milstein scheme. These particular couplings inherit properties of the strong error of the associated time-discretization. The importance of the strong error rate is that it can help determine the efficiency gain of any MLMC approach, of which the MLPF is one. As is well known, in dimensions larger than two ( $d> 2$ ), the Milstein scheme, which is of higher (strong) order than the Euler method, can be difficult to implement numerically because of the need to simulate Lévy areas. Giles and Szpruch [Reference Giles and Szpruch11] consider computing expectations associated to the law of (1.1) at a given terminal time. They show that by eliminating the Lévy area and including an antithetic-type modification of the traditional MLMC estimator, one can maintain the strong error rate of the Milstein scheme even in a multidimensional setting. Moreover, the cost of the simulation is of the same order as for Euler discretizations, which are the ones most often used in multidimensional cases.
We develop a new particle filter based upon the antithetic truncated Milstein scheme of [Reference Giles and Szpruch11]. We show empirically for a class of diffusion problems that, for $\epsilon>0$ given, the cost to produce an MSE of $\mathcal{O}(\epsilon^2)$ in the estimation of the filter is $\mathcal{O}\big(\epsilon^{-2}\log(\epsilon)^2\big)$ . In the case of multidimensional diffusions with non-constant diffusion coefficient, the method of [Reference Jasra, Kamatani, Law and Zhou14] requires a cost of $\mathcal{O}\big(\epsilon^{-2.5}\big)$ to achieve the same MSE. We also show how this new particle filter can be used to compute the normalizing constant recursively in time, but we do not prove anything about the efficiency gains. Our theoretical results are not optimal and show only that the cost of the new method is $\mathcal{O}\big(\epsilon^{-2.5}\big)$ for the MSE of $\mathcal{O}(\epsilon^2)$ . We highlight exactly where the proof loses the rate needed.
To provide a more detailed summary of the contribution of this paper, we state the following. First, the main point is to understand how the rather clever discretization scheme of [Reference Giles and Szpruch11] can be leveraged in the context of filtering and the multilevel method. Generally, for ordinary discretization schemes in MLMC, one has to simulate coupled samples at two levels of time-discretization, one that is fine and another that is coarse. Then this can be extended to the case of filtering diffusion processes, as was done in [Reference Jasra, Kamatani, Law and Zhou14] by constructing a coupling in the resampling mechanism of particle filters. In the method of [Reference Giles and Szpruch11], the authors sample a third ‘antithetic’ process at the finer level. We show how such a process can be incorporated into the context of particle filters with, to the best of our knowledge, a new resampling scheme. Second, there are significant technical challenges in showing that this new antithetic particle filter can provide the improvements in terms of cost and MSE that the approach of [Reference Giles and Szpruch11] does for ordinary diffusion processes. Unfortunately, we obtain a non-optimal rate; again, we highlight where the proof misses the extra rate.
The article is structured as follows. In Section 2 we give details on the model to be considered and the time-discretization associated to the process (1.1). In Section 3 we present our algorithm. Section 4 details our mathematical results and their algorithmic implications. In Section 5 we give numerical results that support our theory. Most of our mathematical proofs can be found in the appendices at the end of the article.
2. Model and discretization
2.1. State-space model
Our objective is to consider the filtering problem and normalizing constant estimation for a specific class of state-space models associated to the diffusion process (1.1). In particular, we will assume that (1.1) is subject to a certain assumption (A1) which is described in Section 4. This assumption is certainly strong enough to guarantee that (1.1) has a unique solution and in addition that the diffusion has a transition probability, which we denote, over one unit of time, by P(x, dx ′). We are then interested in filtering associated to the state-space model
where $y_{1:n}=(y_1,\dots,y_n)^{\top}\in\mathsf{Y}^n$ are observations, each with conditional density $g(x_k,\cdot)$ . The filter associated to this measure is, for $k\in\mathbb{N}$ ,
where $\mathsf{X}=\mathbb{R}^d$ . The denominator
is the normalizing constant or marginal likelihood. This latter object is often used in statistics for model selection.
In practice, we assume that working directly with P is not possible, because of either intractability or cost of simulation. We therefore propose to work with an alternative collection of filters based upon time-discretization, which is what we now describe.
2.2. Time-discretization
Typically one must time-discretize (1.1). We consider a time-discretization at equally spaced times, separated by $\Delta_l=2^{-l}$ . To continue with our exposition, we define the d-vector $H\,:\,\mathbb{R}^{2d}\times\mathbb{R}^+\rightarrow\mathbb{R}^d$ , $H_{\Delta}(x,z)=(H_{\Delta,1}(x,z),\dots,H_{\Delta,d}(x,z))^{\top}$ , where, for $i\in\{1,\dots,d\}$ ,
We denote by $\mathcal{N}_d(\mu,\Sigma)$ the d-dimensional Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$ ; if $d=1$ we drop the subscript d. $I_d$ is the $d\times d$ identity matrix. A single-level version of the truncated Milstein scheme, which is the focus of this article, is presented in Algorithm 1. Ultimately, we will consider the antithetic truncated Milstein scheme in [Reference Giles and Szpruch11], which can be simulated as described in Algorithm 2. The method in Algorithm 2 is a means of approximating differences of expectations with respect to discretized laws at consecutive levels, as is used in MLMC; this latter approach will be detailed in Section 3.
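As an illustration of what Algorithm 2 produces, the following Python sketch simulates a triple $\big(X_1^{l},X_1^{l,a},X_1^{l-1}\big)$ over one unit of time. Here `milstein_step` is a placeholder for a single step of the single-level truncated Milstein scheme of Algorithm 1 (whose increment is given by H above), and the coupling follows the construction of [Reference Giles and Szpruch11]: the antithetic path reuses the fine Brownian increments with each consecutive pair swapped, while the coarse path uses their sums. This is only a sketch of the idea and not a verbatim transcription of Algorithm 2.

```python
import numpy as np

def antithetic_coupled_paths(x0, level, milstein_step, rng):
    """Simulate (X_1^l, X_1^{l,a}, X_1^{l-1}) over one unit of time.

    milstein_step(x, dt, dW) is assumed to perform one step of the
    single-level truncated Milstein scheme (Algorithm 1).
    """
    dt = 2.0 ** (-level)                  # fine step Delta_l
    xf = np.array(x0, dtype=float)        # level-l (fine) path
    xa = np.array(x0, dtype=float)        # level-l antithetic path
    xc = np.array(x0, dtype=float)        # level-(l-1) (coarse) path
    d = xf.size
    for _ in range(2 ** (level - 1)):     # coarse steps covering [0, 1]
        dW1 = np.sqrt(dt) * rng.standard_normal(d)
        dW2 = np.sqrt(dt) * rng.standard_normal(d)
        xf = milstein_step(milstein_step(xf, dt, dW1), dt, dW2)
        # Antithetic path: same increments, consecutive pair swapped.
        xa = milstein_step(milstein_step(xa, dt, dW2), dt, dW1)
        # Coarse path: one step of size 2 * Delta_l with the summed increment.
        xc = milstein_step(xc, 2.0 * dt, dW1 + dW2)
    return xf, xa, xc
```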
To understand the use of Algorithms 1 and 2, consider computing the expectation
which is generally approximated by considering $\int_{\mathsf{X}} x_1 P^l(x_0,dx_1)$ , where we denote the transition kernel induced by Algorithm 1 by $P^l(x_0,dx_1)$ . This latter integral can certainly be approximated using Monte Carlo and Algorithm 1. However, there is an alternative using both Algorithms 1 and 2, which has been established by [Reference Giles and Szpruch11] as more efficient. Clearly one can write
\begin{align*}\int_{\mathsf{X}} x_1 P^l(x_0,dx_1) = \left[\int_{\mathsf{X}} x_1 P^l(x_0,dx_1)-\int_{\mathsf{X}} x_1 P^{l-1}(x_0,dx_1)\right] + \int_{\mathsf{X}} x_1 P^{l-1}(x_0,dx_1).\end{align*}
Now the rightmost term on the right-hand side of the above equation can be approximated by Monte Carlo and running Algorithm 1 at level $l-1$ . We then consider the difference on the right-hand side, independently, which can be approximated using Algorithm 2 N times:
where the superscript i denotes the ith sample generated. The reason such an approximation works so well is that marginally $X_1^{i,l-1}$ has the same distribution as any sample generated by running Algorithm 1 at level $l-1$ ; similarly, both $X_1^{i,l}$ and $X_1^{i,l,a}$ have the same distribution as any sample generated by running Algorithm 1 at level l. Moreover, the second moment of the term, which converges to zero as l increases,
\begin{align*}\left\|\tfrac{1}{2}\left(X_1^{l}+X_1^{l,a}\right)-X_1^{l-1}\right\|, \tag{2.1}\end{align*}
where $\|\cdot\|$ is the usual Euclidean norm, is of the same order (as a function of l) as if one considered a synchronous coupling of two discretizations of the exact Milstein scheme—that is, if one could sample an exact Milstein discretization at level l $\big(X_1^{l,M}\big)$ and using the same Brownian motion as at level $l-1$ $\big(X_1^{l-1,M}\big)$ , and consider the second moment of $\big\|X_1^{l,M}-X_1^{l-1,M}\big\|$ . The reason why this is so important is that the rate of decrease of the aforementioned quantities is what enables MLMC (here we have described two levels) to improve over ordinary Monte Carlo. The key is that by using antithetic variates of the quantity (2.1) (in terms of square expectation) one has replicated the Milstein scheme without ever having to compute intractable Lévy areas.
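In Monte Carlo terms, writing the output of Algorithm 2 as independent triples $\big(X_1^{i,l},X_1^{i,l,a},X_1^{i,l-1}\big)$, $i\in\{1,\dots,N\}$, the difference of expectations above is estimated as in the following short Python sketch; `phi` is the test function ($\varphi(x)=x$ in the discussion above) and the triples are assumed to have been generated already, e.g. by the illustrative sketch given after the description of Algorithms 1 and 2.

```python
def antithetic_difference(triples, phi):
    """Monte Carlo estimate of the difference of expectations at levels
    l and l-1 from N independent antithetic triples (X^l, X^{l,a}, X^{l-1})."""
    N = len(triples)
    return sum(0.5 * (phi(xf) + phi(xa)) - phi(xc) for xf, xa, xc in triples) / N
```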
For a given $l\in\mathbb{N}$ , we will be concerned with the filter induced by the following joint measure:
The filter associated to this measure is, for $k\in\mathbb{N}$ ,
The associated normalizing constant is
Ultimately, we will seek to approximate expectations with respect to $\pi_k^l$ and to estimate $p^l(y_{1:k})$ as k increases, then study the MSE of a numerical approximation relative to considering the exact filter $\pi_k$ and associated normalizing constant $p(y_{1:k})$ .
3. Antithetic multilevel particle filters
3.1. Multilevel particle filter
Let $\varphi\in\mathcal{B}_b(\mathsf{X})$ , the latter denoting the collection of bounded and measurable real-valued functions; we write $\pi_k^l(\varphi)=\int_{\mathsf{X}}\varphi(x_k)\pi_k^l(dx_k)$ . Let $(\underline{L},\overline{L})\in\mathbb{N}_0\times\mathbb{N}$ , with $\underline{L}<\overline{L}$ given. Our objective is the approximation of $\pi_k^{\overline{L}}(\varphi)$ sequentially in time; this is the filtering problem. This can be achieved using the MLMC identity
\begin{align*}\pi_k^{\overline{L}}(\varphi) = \pi_k^{\underline{L}}(\varphi) + \sum_{l=\underline{L}+1}^{\overline{L}}\big[\pi_{k}^l-\pi_{k}^{l-1}\big](\varphi). \tag{3.1}\end{align*}
As noted in the introduction, the computational cost of approximating the right-hand side of (3.1) can be lower than that of approximating the left-hand side, when one is seeking to achieve a pre-specified MSE. The right-hand side of (3.1) can be approximated by the method in [Reference Jasra, Kamatani, Law and Zhou14], but we shall consider a modification which can improve on the computational effort to achieve a given MSE. Often $\underline{L}$ is set very low (e.g. $\underline{L}=1$) in the literature, but we have found in previous work that one needs to set it higher for the aforementioned improvement in computational effort.
We begin by approximating $\pi_k^{\underline{L}}(\varphi)$ , which can be done using the standard particle filter as described in Algorithm 3. Algorithm 3 generates $N_{\underline{L}}$ samples in parallel, and these samples undergo sampling operations (Steps 1 and 3) and resampling operations (Step 2). The sampling step uses Algorithm 1, and the resampling step is the well-known multinomial method. We refer the reader to e.g. [Reference Del Moral6] for an introduction to particle filter methods. It is by now standard in the literature that
will converge almost surely to $\pi_k^{\underline{L}}(\varphi)$ , where the samples $X_k^{i,\underline{L}}$ are after Step 1 or 3 of Algorithm 3. Therefore, recalling (3.1), at this stage we can approximate $\pi_k^{\underline{L}}(\varphi)$ recursively in time.
In reference to (3.1), we will approximate the differences $\big[\pi_{k}^l-\pi_{k}^{l-1}\big](\varphi)$ sequentially in time, independently for each l and independently of the estimate produced by Algorithm 3. As stated in the introduction, our objective is to leverage the work of [Reference Giles and Szpruch11] as described in Algorithm 2 and to insert this within a particle filter as given in Algorithm 3. The particle filter has two main operations, sampling and resampling. The sampling mechanism is to be achieved by using Algorithm 2. The problem is now the resampling approach. The main idea is to ensure that we resample the level-l coordinate just as if we had done so using the particle filter at level l; similarly, we want the same for the antithetic coordinate at level l, and finally we resample the level- $(l-1)$ coordinate just as if we had used the particle filter at level $l-1$ . Critically, the resampling for these three coordinates should be correlated, because if this is not the case, the effect of using Algorithm 2 is lost; see [Reference Jasra and Yu19] for a theoretical justification of this point. The reason we want to maintain the resampling as if the particle filter had been used is that we know this will produce consistent estimates of the filter. We seek to correlate the resampling because this can help ensure that we reduce the cost of achieving a pre-specified MSE, relative to using the particle filter.
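To make these objectives concrete, the following Python sketch shows one way of drawing correlated resampling indices for the three coordinates: each coordinate is marginally resampled multinomially with its own weights, and the three draws are coupled through the common part of the weight vectors. This maximal-coupling-type construction is given purely as an illustration and is not claimed to coincide with Algorithm 4 below.

```python
import numpy as np

def coupled_resample(Wl, Wa, Wc, rng):
    """Draw N coupled triples of resampling indices for the level-l,
    antithetic, and level-(l-1) coordinates.

    Wl, Wa, Wc : normalized weight vectors of length N.
    Marginally, each coordinate is resampled with its own weights; the
    draws are correlated through the common part min(Wl, Wa, Wc).
    """
    N = len(Wl)
    common = np.minimum(np.minimum(Wl, Wa), Wc)
    p_common = common.sum()
    idx_l, idx_a, idx_c = np.empty(N, int), np.empty(N, int), np.empty(N, int)
    for i in range(N):
        if rng.uniform() < p_common:
            # Coupled move: all three coordinates select the same ancestor.
            j = rng.choice(N, p=common / p_common)
            idx_l[i] = idx_a[i] = idx_c[i] = j
        else:
            # Independent moves from the residual parts of the weights.
            idx_l[i] = rng.choice(N, p=(Wl - common) / (1.0 - p_common))
            idx_a[i] = rng.choice(N, p=(Wa - common) / (1.0 - p_common))
            idx_c[i] = rng.choice(N, p=(Wc - common) / (1.0 - p_common))
    return idx_l, idx_a, idx_c
```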
Algorithm 4 presents a new resampling method which we shall employ. This resampling method achieves the objectives discussed above and gives rise to the new coupled particle filter, which is described in Algorithm 5. Algorithm 5 consists of propagating $N_l$ samples of the three coordinates sequentially in time, using Algorithm 2 for the sampling and Algorithm 4 for the resampling. In this context one can define, for $(k,l,N_l,\varphi)\in\mathbb{N}^3\times\mathcal{B}_b(\mathsf{X})$ ,
and estimate $\big[\pi_{k}^l-\pi_{k}^{l-1}\big](\varphi)$ as
\begin{align*}\big[\pi_{k}^l-\pi_{k}^{l-1}\big]^{N_l}(\varphi) = \frac{1}{2}\Big(\pi_k^{N_l,l}(\varphi)+\pi_k^{N_l,l,a}(\varphi)\Big)-\bar{\pi}_k^{N_l,l-1}(\varphi).\end{align*}
The samples in $\Big(\pi_k^{N_l,l}(\varphi), \pi_k^{N_l,l,a}(\varphi),\bar{\pi}_k^{N_l,l-1}(\varphi)\Big)$ are obtained after Step 1 or Step 3 in Algorithm 5.
3.2. Final algorithm
To conclude, the antithetic multilevel particle filter (AMLPF) is as follows:
1. At level $\underline{L}$ run Algorithm 3 sequentially in time.
2. At level $l\in\{\underline{L}+1, \underline{L}+2,\dots,\overline{L}\}$ , independently of all other l and of Step 1, run Algorithm 5 sequentially in time.
To approximate the right-hand side of (3.1) at any time k, one can use the estimator
\begin{align*}\pi_k^{N_{\underline{L}},\underline{L}}(\varphi) + \sum_{l=\underline{L}+1}^{\overline{L}}\big[\pi_{k}^l-\pi_{k}^{l-1}\big]^{N_l}(\varphi),\end{align*}
where $\pi_k^{N_{\underline{L}},\underline{L}}(\varphi)$ has the definition (3.3) when $l=\underline{L}$ and has been obtained using Algorithm 3.
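Schematically, and with `particle_filter` and `coupled_filter` as assumed placeholders for the output of Algorithms 3 and 5, the estimator above is assembled as in the following Python sketch.

```python
def amlpf_estimate(L_low, L_high, N, x0, data, phi, particle_filter, coupled_filter):
    """Assemble the AMLPF estimate of the filter expectation at the final time.

    particle_filter(level, N_level, x0, data, phi) is assumed to return the
    standard particle filter estimate at the given level (Algorithm 3);
    coupled_filter(level, N_level, x0, data, phi) is assumed to return the
    estimate of the increment between levels `level` and `level - 1`
    (Algorithm 5).  N is a dict mapping each level to its sample size.
    """
    estimate = particle_filter(L_low, N[L_low], x0, data, phi)
    for level in range(L_low + 1, L_high + 1):
        # Each increment is estimated independently of all the others.
        estimate += coupled_filter(level, N[level], x0, data, phi)
    return estimate
```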
The normalizing constant $p^{\overline{L}}(y_{1:k})$ can also be approximated using the AMLPF. For any $(k,l,N_l,\varphi)\in\mathbb{N}^3\times\mathcal{B}_b(\mathsf{X})$ , we define
Then we have the following approximation:
where $g_j(x)=g(x,y_j)$ is used as shorthand, $g_j$ is assumed bounded and measurable for each $j\in\mathbb{N}$ , and $\eta_k^{N_{\underline{L}},\underline{L}}(\varphi)$ has the definition (3.4) when $l=\underline{L}$ and has been obtained using Algorithm 3. This new estimator approximates the quantity
as was done in [Reference Jasra, Kamatani, Osei and Zhou15], and essentially uses the ordinary particle filter normalizing constant estimator to approximate $p^{\underline{L}}(y_{1:k})$ (see e.g. [Reference Del Moral6]). The approximation of the increment $\big[p^{l}(y_{1:k})-p^{l-1}(y_{1:k})\big]$ follows the approach used for the filter, based upon the output from the AMLPF. This estimator is unbiased for $p^{\overline{L}}(y_{1:k})$ (see e.g. [Reference Del Moral6]) and is consistent as the number of samples used at each level grows. We note, however, that there is nothing in the estimator that ensures that it is almost surely positive, even though it approximates a positive quantity.
4. Mathematical results
We now present our main mathematical result, which relates to the $\mathbb{L}_2$ -convergence of our estimator of the filter. We remark that we will consider the estimation of the normalizing constant in a sequel paper. The result is proved under two assumptions (A1) and (A2), which are stated below. We write $\mathsf{X}_2=\mathbb{R}^{d\times d}$ . By $\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ we denote the collection of twice continuously differentiable functions from $\mathsf{X}$ to $\mathbb{R}$ with bounded derivatives of orders 1 and 2. Similarly, $\mathcal{C}^2(\mathsf{X},\mathsf{X})$ and $\mathcal{C}^2(\mathsf{X},\mathsf{X}_2)$ are respectively the collections of twice continuously differentiable functions from $\mathsf{X}$ to $\mathsf{X}$ and from $\mathsf{X}$ to $\mathsf{X}_2$ . For $\varphi\in\mathcal{B}_b(\mathsf{X})$ we use the notation $\pi_k^{\overline{L}}(\varphi)=\int_{\mathsf{X}}\varphi(x)\pi_k^{\overline{L}}(dx)$ . By $\mathbb{E}$ we denote the expectation with respect to the law used to generate the AMLPF. We drop the dependence upon the data in $g(x_k,y_k)$ and simply write $g_k(x_k)$ in this section.
Our assumptions are as follows:
(A1)
• For each $(i,j)\in\{1,\dots,d\}^2$ , $\alpha_i\in\mathcal{B}_b(\mathsf{X})$ , $\beta_{ij}\in\mathcal{B}_b(\mathsf{X})$ .
• We have $\alpha\in\mathcal{C}^2(\mathsf{X},\mathsf{X})$ , $\beta\in\mathcal{C}^2(\mathsf{X},\mathsf{X}_2)$ .
• We have that $\beta(x)\beta(x)^{\top}$ is uniformly positive definite.
• There exists a $C<+\infty$ such that for any $(x,i,j,k,m)\in\mathsf{X}\times\{1,\dots,d\}^{4}$ ,
\begin{align*}\max\left\{\left|\frac{\partial \alpha_i}{\partial x_m}(x)\right|,\left|\frac{\partial \beta_{ij}}{\partial x_m}(x)\right|,\left|\frac{\partial h_{ijk}}{\partial x_m}(x)\right|,\left|\frac{\partial^2 \alpha_i}{\partial x_{k}\partial x_m}(x)\right|,\left|\frac{\partial^2 \beta_{ij}}{\partial x_{k}\partial x_m}(x)\right|\right\} \leq C.\end{align*}
(A2)
• For each $k\in\mathbb{N}$ , $g_k\in\mathcal{B}_b(\mathsf{X})\cap\mathcal{C}^2(\mathsf{X},\mathbb{R})$ .
• For each $k\in\mathbb{N}$ there exists a $0<C<+\infty$ such that for any $x\in\mathsf{X}$ $g_k(x)\geq C$ .
• For each $k\in\mathbb{N}$ there exists a $0<C<+\infty$ such that for any $(x,j,m)\in\mathsf{X}\times\{1,\dots,d\}^2$ ,
\begin{align*}\max\left\{\bigg|\frac{\partial g_{k}}{\partial x_j}(x)\bigg|,\bigg|\frac{\partial^2 g_{k}}{\partial x_j\partial x_m}(x)\bigg|\right\} \leq C.\end{align*}
Theorem 1. Assume (A1)–(A2). Then, for any $(k,\varphi)\in\mathbb{N}\times\mathcal{B}_b(\mathsf{X})\cap\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ , there exists a $C<+\infty$ such that for any $\big(\underline{L},\overline{L},l,N_{\underline{L}},\dots,N_{\overline{L}}\big)\in\mathbb{N}_0^2\times\{\underline{L},\dots,\overline{L}\}\times\mathbb{N}^{\overline{L}-\underline{L}+1}$ with $\underline{L}<\overline{L}$ ,
where $\mathsf{D}_{\underline{L},\overline{L}}=\{(l,q)\in\{\underline{L},\dots,\overline{L}\}^2\,:\,l\neq q\}$ .
Proof. This follows by using the $C_2$ -inequality to separate the terms $\pi_k^{N_{\underline{L}},\underline{L}}(\varphi) -\pi_k^{\underline{L}}(\varphi)$ and $\sum_{l=\underline{L}+1}^{\overline{L}}\big[\pi_{k}^l-\pi_{k}^{l-1}\big]^{N_l}(\varphi)-\sum_{l=\underline{L}+1}^{\overline{L}}\big[\pi_{k}^l-\pi_{k}^{l-1}\big](\varphi)$ . Then one uses standard results for particle filters (see e.g. [Reference Beskos2, Lemma A.3]) for the $\pi_k^{N_{\underline{L}},\underline{L}}(\varphi) -\pi_k^{\underline{L}}(\varphi)$ , and one multiplies out the brackets and uses Theorems 2 and 3 for the other term.
The implication of this result is as follows. We know that the bias $\big|\pi_k^{\overline{L}}(\varphi)-\pi_k(\varphi)\big|$ is $\mathcal{O}(\Delta_{\overline{L}})$ . This can be proved in a similar manner to [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2], using the fact that the truncated Milstein scheme is a first-order (weak) method. Therefore, for $\epsilon>0$ given, one can choose $\overline{L}$ such that $\Delta_{\overline{L}}=\mathcal{O}(\epsilon)$ (so that the squared bias is $\mathcal{O}(\epsilon^2)$ ). Then, to choose $N_{\underline{L}},\dots,N_{\overline{L}}$ , one can set $N_{\underline{L}}=\mathcal{O}(\epsilon^{-2})$ and, just as in e.g. [Reference Giles9], $N_l=\mathcal{O}\big(\epsilon^{-2}\Delta_l^{3/4}\Delta_{\overline{L}}^{-1/4}\big)$ , $l\in\{\underline{L}+1,\dots,\overline{L}\}$ . If one does this, one can verify that the upper bound in Theorem 1 is $\mathcal{O}(\epsilon^2)$ , so the MSE is also $\mathcal{O}(\epsilon^2)$ . The cost to run the algorithm per time step is $\mathcal{O}\big(\!\sum_{l=\underline{L}}^{\overline{L}}N_l\Delta_{l}^{-1}\big) = \mathcal{O}\big(\epsilon^{-2.5}\big)$ . However, as we see in our simulations, the cost should be lower, at $\mathcal{O}\big(\epsilon^{-2}\log(\epsilon)^2\big)$ ; this indicates that our proof is not optimal. In the case that (1.1) has a non-constant diffusion coefficient with $d> 2$ , the approach in [Reference Jasra, Kamatani, Law and Zhou14], based upon using Euler discretizations, would obtain an MSE of $\mathcal{O}(\epsilon^2)$ at a cost of $\mathcal{O}\big(\epsilon^{-2.5}\big)$ .
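For illustration, the following short Python computation carries out the choices just described for a given $\epsilon$ (with all the implied constants set to one, an assumption made purely for illustration): $\overline{L}$ is set from the bias condition, the $N_l$ from the allocation above, and the per-time-step cost $\sum_l N_l\Delta_l^{-1}$ is reported.

```python
import math

def amlpf_allocation(eps, L_low=1):
    """Level and sample-size choices implied by the discussion above,
    with all constants set to one (illustration only)."""
    # Bias O(Delta_{L_high}) = O(eps)  =>  2^{-L_high} ~ eps.
    L_high = max(L_low + 1, math.ceil(math.log2(1.0 / eps)))
    delta = {l: 2.0 ** (-l) for l in range(L_low, L_high + 1)}
    N = {L_low: math.ceil(eps ** (-2))}
    for l in range(L_low + 1, L_high + 1):
        # N_l = O(eps^{-2} Delta_l^{3/4} Delta_{L_high}^{-1/4}).
        N[l] = math.ceil(eps ** (-2) * delta[l] ** 0.75 * delta[L_high] ** (-0.25))
    cost_per_step = sum(N[l] / delta[l] for l in range(L_low, L_high + 1))
    return L_high, N, cost_per_step

# Example: as eps decreases, cost_per_step grows roughly like eps^{-2.5}.
print(amlpf_allocation(0.01))
```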
5. Numerical results
5.1. Models
We consider three different models for our numerical experiments.
5.1.1. Model 1: geometric Brownian motion (GBM) process
The first model we use is
\begin{align*}dX_t = \mu X_t\,dt + \sigma X_t\,dW_t, \qquad t\geq 0.\end{align*}
We set $Y_n|X_{n}=x \sim \mathcal{N}\big(\log(x),\tau^2\big)$ , where $\tau^2=0.02$ and $\mathcal{N}(m,\tau^2)$ denotes the Gaussian distribution with mean m and variance $\tau^2$ . We choose $x_0 = 1$ , $\mu = 0.02$ , and $\sigma = 0.2$ .
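As a concrete illustration, the following Python sketch generates a latent trajectory and observations from Model 1 with the stated parameters, assuming the geometric Brownian motion dynamics $dX_t=\mu X_t\,dt+\sigma X_t\,dW_t$ and using the exact GBM transition over unit time (so no discretization error enters the data generation).

```python
import numpy as np

def simulate_model1(T, x0=1.0, mu=0.02, sigma=0.2, tau2=0.02, seed=0):
    """Generate (x_1, ..., x_T) and (y_1, ..., y_T) from Model 1 using the
    exact geometric Brownian motion transition over unit time."""
    rng = np.random.default_rng(seed)
    x, xs, ys = x0, [], []
    for _ in range(T):
        # Exact GBM transition over one unit of time.
        x = x * np.exp((mu - 0.5 * sigma ** 2) + sigma * rng.standard_normal())
        xs.append(x)
        # Gaussian observation of log(x_k) with variance tau2.
        ys.append(np.log(x) + np.sqrt(tau2) * rng.standard_normal())
    return np.array(xs), np.array(ys)
```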
5.1.2. Model 2: Clark–Cameron SDE
The second model we consider is the Clark–Cameron stochastic differential equation (SDE) model (e.g. [Reference Giles and Szpruch11]) with initial conditions $x_0=(0,0)^{\top}$ :
where $X_{j,t}$ denotes the jth dimension of $X_t$ , $j\in\{1,\dots,d\}$ . In addition, $Y_n|X_n=x\sim \mathcal{N}\big( \tfrac{x_{1} + x_{2}}{2},\tau^2\big)$ , where $\tau^2=0.1$ .
5.1.3. Model 3: multidimensional SDEs with a nonlinear diffusion term (NLMs)
For our last model we use the following multidimensional SDEs, with $x_0=(0,0)^{\top}$ :
We set $Y_n|X_n=x\sim \mathcal{L}\left(\tfrac{x_{1} + x_{2}}{2},s\right)$ , where $\mathcal{L}(m,s)$ is the Laplace distribution with location m and scale s. The values of the parameters that we choose are $(\theta_{1}, \theta_{2}) = (1,1)$ , $(\mu_{1}, \mu_{2}) = (0,0)$ , $(\sigma_{1}, \sigma_{2}) = (1,1)$ , and $s= \sqrt{0.1}$ .
5.2. Simulation settings
We will consider a comparison of the particle filter, the MLPF (as in [Reference Jasra, Kamatani, Law and Zhou14]), and the AMLPF. For our numerical experiments, we consider multilevel estimators at levels $l \in \{3,4,5,6,7\}$ . Resampling is performed adaptively. For the particle filters, resampling is done when the effective sample size (ESS) is less than $1/2$ of the number of particles. Suppose that we have a collection of weights which sum to one, $(W^1,\dots,W^N)$ , for instance $\big(W_k^{1,l},\dots,W_k^{N_l,l}\big)$ in Algorithm 5. Then the ESS can be calculated using the formula
\begin{align*}\textrm{ESS} = \Bigg(\sum_{i=1}^N \big(W^i\big)^2\Bigg)^{-1}.\end{align*}
This is a number between 1 and N and is a measure of the variability of the weights. Typically one would like this measure to be close to N. For the coupled filters, we use the ESS of the coarse filter as the measure of discrepancy, and resampling occurs at the same threshold as the particle filter. Each simulation is repeated 100 times.
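A minimal Python sketch of the adaptive rule just described; the resampling itself (multinomial for the single-level filters, the coupled scheme of Algorithm 4 for the coupled filters) is carried out only when the rule fires.

```python
import numpy as np

def ess(W):
    """Effective sample size of a normalized weight vector."""
    W = np.asarray(W, dtype=float)
    return 1.0 / np.sum(W ** 2)

def resample_now(W_coarse, threshold=0.5):
    """Adaptive rule used in the experiments: trigger resampling (of all
    coordinates, in the coupled filters) when the ESS of the coarse-filter
    weights drops below threshold * N."""
    return ess(W_coarse) < threshold * len(W_coarse)
```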
5.3. Simulation results
We now present our numerical simulations to show the benefits of the AMLPF for the three models under consideration. In Figures 1 and 2 we present the cost against the MSE for the estimators of the filter and of the normalizing constant. The cost is computed using the CPU time, although we found similar results using the cost formula of Section 4. The figures show that as we increase the level from $l=3$ to $l=7$ , the differences between the costs of the methods increase. In particular, the figures show the advantage and accuracy of using the AMLPF. Table 1 presents the estimated rates of the cost with respect to the MSE for the normalizing constant and filter. The results agree with the theory of [Reference Jasra, Kamatani, Law and Zhou14] and of this article, which predicts a rate of $-1.5$ for the particle filter, $-1.25$ for the MLPF, and $-1$ for the AMLPF.
Appendix A. Introduction and some notation
The purpose of this appendix is to provide the necessary technical results to prove Theorem 1. For a full understanding of the arguments, we advise the reader to read the proofs in order. It is possible to skip Section B and simply refer to the proofs, but the reader may miss some notation or nuances.
The structure of this appendix is as follows. We first consider some properties of the antithetic truncated Milstein scheme of [Reference Giles and Szpruch11]. We present several new results which are needed directly for the subsequent proofs. In Section C we consider the coupled particle filter as in Algorithm 5. This section is split into three subsections. The first of these, Section C.1, considers the application of the results in Section B to a coupled particle filter. These results feed into the final two subsections, which consider the convergence of $\big[\pi_{k}^l-\pi_{k}^{l-1}\big]^{N_l}(\varphi)$ in $\mathbb{L}_p$ (Section C.2) and the associated bias (Section C.3). This is the culmination of the work in this appendix and is summarized in Theorems 2 and 3. To prove our results we use two major assumptions, (A1) and (A2), which can be found in Section 4 of the main text.
A.1. Some notation
Let $(\mathsf{V},\mathcal{V})$ be a measurable space. We write $\mathcal{B}_b(\mathsf{V})$ for the collection of bounded, measurable functions $\varphi\,:\,\mathsf{V}\rightarrow\mathbb{R}$ . Let $\textrm{Lip}(\mathbb{R}^{d})$ denote the collection of real-valued functions on $\mathbb{R}^d$ that are Lipschitz with respect to $\|\cdot\|$ (where $\|\cdot\|$ denotes the $\mathbb{L}_2$ -norm of a vector $x\in\mathbb{R}^d$ ). That is, $\varphi\in\textrm{Lip}(\mathbb{R}^{d})$ if there exists a $C<+\infty$ such that for any $(x,y)\in\mathbb{R}^{2d}$ ,
\begin{align*}|\varphi(x)-\varphi(y)| \leq C\|x-y\|.\end{align*}
For $\varphi\in\mathcal{B}_b(\mathsf{V})$ , we write the supremum norm as $\|\varphi\|_{\infty}=\sup_{x\in\mathsf{V}}|\varphi(x)|$ . For a measure $\mu$ on $(\mathsf{V},\mathcal{V})$ and a function $\varphi\in\mathcal{B}_b(\mathsf{V})$ , the notation $\mu(\varphi)=\int_{\mathsf{V}}\varphi(x)\mu(dx)$ is used. For $A\in\mathcal{V}$ , the indicator function is written as $\mathbb{I}_A(x)$ . If $K\,:\,\mathsf{V}\times\mathcal{V}\rightarrow[0,\infty)$ is a non-negative operator and $\mu$ is a measure, we use the notation $\mu K(dy) = \int_{\mathsf{V}}\mu(dx) K(x,dy),$ and for $\varphi\in\mathcal{B}_b(\mathsf{V})$ , $K(\varphi)(x) = \int_{\mathsf{V}} \varphi(y) K(x,dy).$ Throughout, we use C to denote a generic finite constant whose value may change at each appearance and whose dependencies (on model and simulation parameters) are clear from the context.
Appendix B. Proofs for antithetic truncated Milstein scheme
The proofs of this section focus on the antithetic truncated Milstein discretization over a unit of time (i.e. as in Algorithm 2). The case we consider is almost identical to that in [Reference Giles and Szpruch11], except that we allow our initial points $\big(x_0^l,x_0^{l-1},x_0^{l,a}\big)$ to differ. This relaxation is important in our subsequent proofs for the coupled (and multilevel) particle filter. Our objective is to prove a result similar to [Reference Giles and Szpruch11, Theorem 4.10], and to that end we make the assumption (A1), which is stronger than [Reference Giles and Szpruch11, Assumption 4.1]. The stronger assumption relates to the boundedness of the drift and diffusion coefficients of (1.1). The reason we require this is that it greatly simplifies our subsequent proofs if the constants in the results below do not depend on the initial points $\big(x_0^l,x_0^{l-1},x_0^{l,a}\big)$ ; otherwise they would. In this section, $\mathbb{E}$ denotes the expectation with respect to the law associated to Algorithm 2.
The final result in this section, Proposition B.1, is our adaptation of [Reference Giles and Szpruch11, Theorem 4.10] and is proved using simple modifications of Lemmata 4.6, 4.7, and 4.9 and Corollary 4.8 in [Reference Giles and Szpruch11]. The proofs of Lemma 4.7 and Corollary 4.8 in [Reference Giles and Szpruch11] need not be modified, so we proceed to prove analogues of Lemmata 4.6 and 4.9.
Lemma B.1. Assume (A1). Then for any $p\in[1,\infty)$ there exists a $C<+\infty$ such that for any $l\in\mathbb{N}$ ,
Proof. Using the $C_p$ -inequality, we have that
Denoting by $X_t^x$ the solution of (1.1) with initial point x at time t, we have
The proof is then easily concluded via the strong convergence result [Reference Giles and Szpruch11, Lemma 4.2] and standard results for diffusion processes (e.g. via Gronwall and [Reference Rogers and Williams20, Corollary V.11.7]).
Remark B.1. Note that the proof also establishes the following: for any $p\in[1,\infty)$ there exists a $C<+\infty$ such that for any $l\in\mathbb{N}$ ,
and
To state our next result, which mirrors [Reference Giles and Szpruch11, Lemma 4.9], we need to introduce a significant amount of notation directly from [Reference Giles and Szpruch11]. Below, we set $\overline{X}_{k\Delta_l}^l=\tfrac{1}{2}\big(X_{k\Delta_l}^l+X_{k\Delta_l}^{l,a}\big)$ , $k\in\big\{0,\dots,\Delta_l^{-1}\big\}$ . For $(i,k)\in\{1,\dots,d\}\times\big\{2,4,\dots,\Delta_l^{-1}\big\}$ we define
where, in $R_{i,k}^l$ , $\xi_1^l$ is some point which lies on the line between $X_{k\Delta_l}^l$ and $X_{(k+1)\Delta_l}^l$ . In the case of $R_{i,k}^l$ and $M_{i,k}^{(j,l)}$ , $j\in\{1,2,3\}$ , one can substitute $X_{k\Delta_l}^l,\xi_1^{l},X_{(k+1)\Delta_l}^l$ for $X_{k\Delta_l}^{l,a},\xi_1^{l,a},X_{(k+1)\Delta_l}^{l,a}$ $\Big($ where $\xi_1^{l,a}$ is some point which lies on the line between $X_{k\Delta_l}^{l,a}$ and $X_{(k+1)\Delta_l}^{l,a}\Big)$ ; when we do so, we use the notation $R_{i,k}^{l,a}$ and $M_{i,k}^{(j,l,a)}$ , $j\in\{1,2,3\}$ . Finally we set
We are now in a position to give our analogue of [Reference Giles and Szpruch11, Lemma 4.9].
Lemma B.2. Assume (A1). Then the following hold:
For any $p\in[1,\infty)$ there exists a $C<+\infty$ such that for any $l\in\mathbb{N}$ ,
\begin{align*}\max_{k\in\{2,4,\dots,\Delta_l^{-1}\}}\mathbb{E}[|R_k|^p] \leq C\Delta_l^p\left(\Delta_l^p + \big\|x_0^{l}-x_0^{l-1}\big\|^{2p}+\big\|x_0^{l,a}-x_0^{l-1}\big\|^{2p}\right).\end{align*}
For any $p\in[1,\infty)$ there exists a $C<+\infty$ such that for any $l\in\mathbb{N}$ ,
\begin{align*}\max_{k\in\{2,4,\dots,\Delta_l^{-1}\}}\mathbb{E}[|M_k|^p] \leq C\left(\Delta_l^{\tfrac{3p}{2}}+\Delta_l^{\tfrac{p}{2}}\Big\{\big\|x_0^{l}-x_0^{l-1}\big\|^{2p}+\big\|x_0^{l,a}-x_0^{l-1}\big\|^{2p}\Big\} +\Delta_l^{p}\Big\{\big\|x_0^{l}-x_0^{l-1}\big\|^{p}+\big\|x_0^{l,a}-x_0^{l-1}\big\|^{p}\Big\}\right).\end{align*}
Proof. The proof of this result essentially follows [Reference Giles and Szpruch11], controlling the relevant terms in $\mathbb{L}_p$ . The expressions $\frac{1}{2}\sum_{j=1}^3\Big\{M_k^{(j,l)} + M_k^{(j,l,a)}\Big\}$ and $\frac{1}{2}\big\{R_{i,k}^l+R_{i,k}^{l,a}\big\}$ can be dealt with exactly as in [Reference Giles and Szpruch11, Lemma 4.7], so we need only consider the terms $\sum_{j=1}^3 M_k^{(j)}$ and $R_k^{(1)}$ . It will suffice to control in $\mathbb{L}_p$ any of the d coordinates of the aforementioned vectors, which is what we do below.
Beginning with $R_{i,k}^{(1)}$ , using the second-order Taylor expansion in [Reference Giles and Szpruch11], one has
where $\xi_1^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^l_{k\Delta_l}$ and $\xi_2^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^{l,a}_{k\Delta_l}$ . Then it follows easily that
Application of Lemma B.1 yields that
which is the desired result.
For $M_{i,k}^{(1)}$ , again one has
where $\xi_3^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^l_{k\Delta_l}$ and $\xi_4^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^{l,a}_{k\Delta_l}$ . Then, using the independence of the Brownian increment from the random variables $\Big(X_{m,k\Delta_l}^l - X_{m,k\Delta_l}^{l,a}\Big)\Big(X_{n,k\Delta_l}^l - X_{n,k\Delta_l}^{l,a}\Big)$ and the same approach as above, one has
and applying Lemma B.1 yields
For $M_{i,k}^{(2)}$ one has
where $\xi_5^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^l_{k\Delta_l}$ and $\xi_6^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^{l,a}_{k\Delta_l}$ . As the Brownian increments are independent of $\Big(X_{n,k\Delta_l}^l - X_{n,k\Delta_l}^{l,a}\Big)$ and the dimensions are also independent of each other, one obtains
Lemma B.1 gives
For $M_{i,k}^{(3)}$ , one has
where $\xi_7^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^l_{k\Delta_l}$ and $\xi_8^l$ is some point between $\overline{X}^l_{k\Delta_l}$ and $X^{l,a}_{k\Delta_l}$ . Using essentially the same approach as for $M_{i,k}^{(2)}$ , one obtains
which concludes the proof.
Proposition B.1. Assume (A1). Then for any $p\in[1,\infty)$ there exists a $C<+\infty$ such that for any $l\in\mathbb{N}$ ,
Proof. We omit the proof, because it is identical to that of [Reference Giles and Szpruch11, Theorem 4.10], except that one uses Lemma B.2 above instead of [Reference Giles and Szpruch11, Lemma 4.9].
Appendix C. Proofs for coupled particle filter
In this section, $\mathbb{E}$ denotes the expectation with respect to the law that generates the AMLPF.
C.1. Rate proofs
Our analysis will apply to any coupling used in the second bullet point of Algorithm 4, Step 2. At any time point k of Algorithm 5, we will denote the resampled index of particle $i\in\{1,\dots,N_l\}$ by $I_k^{i,l}$ (level l), $I_k^{i,l-1}$ (level $l-1$ ), and $I_k^{i,l,a}$ (level l antithetic). Now let $I_k^l(i)=I_k^{i,l}$ , $I_k^{l-1}(i)=I_k^{i,l-1}$ , $I_k^{l,a}(i)=I_k^{i,l,a}$ , and define $\mathsf{S}_k^l$ as the collection of indices that choose the same ancestor at each resampling step, i.e.
We use the convention that $\mathsf{S}_0^l=\{1,\dots,N_l\}$ . Denote the $\sigma$ -field generated by the simulated samples, resampled samples, and resampled indices up to time k by $\hat{\mathcal{F}}_k^{l}$ , and denote the $\sigma$ -field which does the same, except excluding the resampled samples and indices, by $\mathcal{F}_k^{l}$ .
Lemma C.1. Assume (A1)–(A2). Then for any $(p,k)\in[1,\infty)\times\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
Proof. The case $k=1$ follows from the work in [Reference Giles and Szpruch11] (the case $p\in[1,2)$ can be adapted from that paper), so we assume $k\geq 2$ . By conditioning on $\hat{\mathcal{F}}_{k-1}^{l}$ and applying Lemma B.1 (see also Remark B.1), we have
Using exchangeability, one can write the expectation on the right-hand side as
The proof from here is then essentially identical (up to the fact that one has three indices instead of two) to that of [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.3] and is hence omitted.
Lemma C.2. Assume (A1)–(A2). Then for any $(p,k)\in[1,\infty)\times\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
Proof. Following the start of the proof of Lemma C.1, except using Proposition B.1 instead of Lemma B.1, one can deduce that
From here, one can follow the calculations in [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.3] to deduce that
Application of Lemma C.1 concludes the result.
Lemma C.3. Assume (A1)–(A2). Then for any $k\in\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
Proof. We have the elementary decomposition
So application of the triangle inequality along with the lower and upper bounds on $g_k$ (which are uniform in x) yields
Using the Lipschitz property of $g_k$ along with Lemma C.1 allows one to conclude the result.
The following is a useful lemma that we will need below. Its proof is a matter of direct computation and is hence omitted.
Lemma C.4. Let $\big(G^l,f^l,G^a,f^a,G^{l-1},f^{l-1}\big)\in\mathbb{R}^6$ with $\big(f^l,f^a,f^{l-1}\big)$ non-zero; then
Lemma C.5. Assume (A1)–(A2). Then for any $k\in\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
Proof. One can apply Lemma C.4 to see that
where
The terms $T_1$ (resp. $T_2$ ) and $T_4$ (resp. $T_3$ ) can be dealt with in a similar manner, so we only consider $T_1$ (resp. $T_2$ ).
For the case of $T_1$ we have the upper bound
Then one can apply [Reference Giles and Szpruch11, Lemma 2.2] along with Lemmata C.1 and C.2 to deduce that
For the case of $T_2$ we have the upper bound
For the first term on the right-hand side one can use the Lipschitz property of $g_{k}$ , Cauchy–Schwarz, and Lemma C.1, and for the second term one can use the boundedness of $g_k$ , to obtain
which concludes the proof.
Lemma C.6. Assume (A1)–(A2). Then for any $k\in\mathbb{N}_0$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
Proof. We note that
As our proof strategy is via induction (the initialization is trivially true since $\mathsf{S}_{0}^l=\{1,\dots,N_l\}$ ), we need to focus on the first term on the right-hand side of (C.1).
Now, applying twice the result that $\min\{a,b\} = \tfrac{1}{2}(a+b-|a-b|)$ for any $(a,b)\in\mathbb{R}^2$ , we easily obtain that
To shorten the subsequent notation, we set
In this notation, we have that
Then by induction, via Lemmata C.3 and C.5, it trivially follows that
Remark C.1. Lemma C.6 is the place where the rate is lost. It is clear that this particular quantity cannot fall faster than $\mathcal{O}\big(\Delta_l^{1/2}\big)$ , as the right-hand side of (C.2) is exactly equal to
The expectation of this quantity falls exactly at $\mathcal{O}\big(\Delta_l^{1/2}\big)$ . We could not find a way to enhance the proof to deal with this point, so we leave it to future work.
Lemma C.7. Assume (A1)–(A2). Then for any $k\in\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l,i)\in\mathbb{N}^2\times\{1,\dots,N_l\}$ ,
Proof. This follows from Lemmata C.2 and C.6; see e.g. the corresponding proof in [Reference Jasra, Kamatani, Law and Zhou14, Theorem D.5].
Remark C.2. In a similar manner to the proofs of Lemma C.7 and [Reference Jasra, Kamatani, Law and Zhou14, Theorem D.5], one can establish, assuming (A1)–(A2), that the following hold:
For any $k\in\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l,i)\in\mathbb{N}^2\times\{1,\dots,N_l\}$ ,
\begin{align*}\mathbb{E}\left[\min\Big\{\Big\|X_k^{i,l}-X_k^{i,l-1}\Big\|^p,1\Big\}\right] \leq C\Delta_l^{1/2}.\end{align*}
For any $k\in\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l,i)\in\mathbb{N}^2\times\{1,\dots,N_l\}$ ,
\begin{align*}\mathbb{E}\left[\min\Big\{\Big\|X_k^{i,l}-X_k^{i,l,a}\Big\|^p,1\Big\}\right] \leq C\Delta_l^{1/2}.\end{align*}
C.2. Particle convergence proofs: $\mathbb{L}_p$ -bounds
For a function $\varphi\in\mathcal{B}_b(\mathsf{X})\cap\textrm{Lip}(\mathsf{X})$ , we set $\overline{\|\varphi\|}=\max\{\|\varphi\|_{\infty},\|\varphi\|_{\textrm{Lip}}\}$ , where $\|\cdot\|_{\textrm{Lip}}$ is the Lipschitz constant. We denote the $N_l$ -empirical measures by
where $k\in\mathbb{N}$ and $\delta_A(dx)$ is the Dirac mass on a set A. We define the predictors as follows: for $l\in\mathbb{N}_0$ , $\eta_1^{l}(dx) = P^l(x_0,dx)$ , and for $k\geq 2$ , $\eta_k^{l}(dx) =\eta_{k-1}^l(g_{k-1}P^l(dx))/\eta_{k-1}^l(g_{k-1})$ .
Lemma C.8. Assume (A1)–(A2). Then for any $(k,p)\in\mathbb{N}\times[1,\infty)$ there exists a $C<+\infty$ such that for any $(l,N_l,\varphi)\in\mathbb{N}^2\times\mathcal{B}_b(\mathsf{X})\cap\textit{Lip}(\mathsf{X})$ ,
Proof. The proof is by induction, with initialization following easily from the Marcinkiewicz–Zygmund (MZ) inequality (the case $p\in[1,2)$ can be dealt with using the bound $p\in[2,\infty)$ and Jensen’s inequality) and strong convergence results for Euler discretizations. We therefore proceed to the induction step. We have that
where
We deal with the two terms in turn. For the case of $T_1$ , one can apply the (conditional) MZ inequality to obtain
Using $\varphi\in\mathcal{B}_b(\mathsf{X})\cap\textrm{Lip}(\mathsf{X})$ along with Remark C.2 yields
For $T_2$ , we first note that
where we recall that $P^l$ (resp. $P^{l-1}$ ) represents the truncated Milstein kernel over unit time with discretization level $\Delta_l$ (resp. $\Delta_{l-1}$ ). This can be further decomposed to
Therefore we have that
where
The quantities $T_3$ and $T_4$ can be treated using similar calculations, so we only consider $T_3$ . For the case of $T_3$ , clearly we have the upper bound
where
For $T_6$ we have that
By using standard results for particle filters (e.g. [Reference Jasra, Kamatani, Law and Zhou14, Proposition C.6]) we have
Then, using standard results for weak errors (the truncated Milstein scheme is a first-order method), we have
For $T_7$ , again using weak errors, we have
and again using standard results for particle filters, we obtain
Hence we have shown that
For $T_5$ , one can apply [Reference Jasra, Kamatani, Law and Zhou14, Lemma C.5] to show that
where
Since $T_8$ and $T_9$ can be bounded using similar approaches, we consider only $T_8$ . Similarly, $T_{10},\dots,T_{13}$ can be bounded in almost the same way, so we consider only $T_{10}$ . For $T_8$ we have the upper bound
Since $g_{k-1}P(\varphi)\in\mathcal{B}_b(\mathsf{X})\cap\textrm{Lip}(\mathsf{X})$ (see e.g. [Reference Del Moral, Jacod and Protter7, Equation (2.6)]), it follows by the induction hypothesis that
For $T_{10}$ , we have the upper bound
Then, using [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2] and standard results for particle filters, we have
Therefore we deduce that
The proof is then easily completed by combining (C.4), (C.6), and (C.7).
Remark C.3. One can also deduce the following result using an argument that is similar to (but simpler than) the proof of Lemma C.8. Assume (A1)–(A2). Then for any $(k,p)\in\mathbb{N}\times[1,\infty)$ there exists a $C<+\infty$ such that for any $(l,N_l,\varphi)\in\mathbb{N}^2\times\mathcal{B}_b(\mathsf{X})\cap\textit{Lip}(\mathsf{X})$ ,
Below, $\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ denotes the collection of twice continuously differentiable functions from $\mathsf{X}$ to $\mathbb{R}$ with bounded derivatives of orders 1 and 2.
Lemma C.9. Assume (A1)–(A2). Then for any $(k,p,\varphi)\in\mathbb{N}\times[1,\infty)\times\mathcal{B}_b(\mathsf{X})\cap\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
Proof. The proof is by induction. Initialization follows easily by the MZ inequality and [Reference Giles and Szpruch11, Lemma 2.2, Theorem 4.10] (the case $p\in[1,2)$ can also be covered with the theory of [Reference Giles and Szpruch11]), so we proceed immediately to the induction step.
As in the proof of Lemma C.8, one can add and subtract the conditional expectation to obtain an upper bound
where
For $T_1$ one can use the conditional MZ inequality along with the boundedness of $\varphi$ , [Reference Giles and Szpruch11, Lemma 2.2], Lemma C.7, and Remark C.2 to deduce that
The case of $T_2$ is more challenging. We have the decomposition
where
For $T_3$ one has the upper bound
where
Each of these terms can be controlled (almost) exactly as is done for (C.5) in the proof of Lemma C.8, so we do not give the proof; rather, we simply state that
For $T_4$ one can apply Lemma C.4 along with Minkowski to deduce that
where
As the arguments for dealing with $T_8$ (resp. $T_9$ ) and $T_{11}$ (resp. $T_{10}$ ) are similar, we only prove bounds for $T_8$ (resp. $T_9$ ). For $T_8$ we have the upper bound
where
For $T_{12}$ one can use the lower bound on $g_{k-1}$ along with $g_{k-1}P(\varphi)\in\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ (see e.g. [Reference Stroock23, Corollary 2.2.8]) and the induction hypothesis to obtain
For $T_{13}$ we apply [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2] and standard results for particle filters to obtain
Thus
For $T_9$ we have the upper bound
where
For $T_{14}$ one can use the lower bound on $g_{k-1}$ , Cauchy–Schwarz, Remark C.3, and Lemma C.8 to obtain
For $T_{15}$ one can use the lower bound on $g_{k-1}$ , [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2], and Remark C.3 to obtain
thus,
Therefore we have shown that
The proof can be completed by combining the bounds (C.8), (C.10), and (C.11).
Theorem 2. Assume (A1)–(A2). Then for any $(k,p,\varphi)\in\mathbb{N}\times[1,\infty)\times\mathcal{B}_b(\mathsf{X})\cap\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
C.3. Particle convergence proofs: bias bounds
Lemma C.10. Assume (A1)–(A2). Then for any $k\in\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l,\varphi)\in\mathbb{N}^2\times\mathcal{B}_b(\mathsf{X})\cap\textit{Lip}(\mathsf{X})$ ,
Proof. The proof is by induction. The case $k=1$ is trivial as there is no bias, so we proceed to the induction step. Following the proof of Lemma C.8, we have the decomposition
where
$T_1$ and $T_2$ can be bounded using almost the same calculations, so we consider only $T_1$ . The latter can easily be bounded above by $\sum_{j=1}^4 T_{j+3}$ , where
For both $T_4$ and $T_6$ , one can use the Cauchy–Schwarz inequality, along with [Reference Jasra, Kamatani, Law and Zhou14, Proposition C.6], the lower bound on $g_{k-1}$ , and weak-error results for diffusions to show that
For $T_5$ and $T_7$ one can use standard bias bounds for particle filters (see e.g. the proof of [Reference Jasra, Kamatani, Law and Zhou14, Lemma C.3]), the lower bound on $g_{k-1}$ , and weak-error results for diffusions to obtain
hence
For $T_3$ , using [Reference Jasra, Kamatani, Law and Zhou14, Lemma C.5] we have the upper bound $T_3\leq\sum_{j=1}^6 T_{j+7}$ , where
Since $T_8$ and $T_9$ can be bounded using similar approaches, we consider only $T_8$ . Similarly, $T_{10},\dots,T_{13}$ can be bounded in almost the same way, so we consider only $T_{10}$ . For $T_8$ one has $T_8\leq T_{14}+T_{15}$ , where
For $T_{14}$ one can use Cauchy–Schwarz, the lower bound on $g_{k-1}$ , [Reference Jasra, Kamatani, Law and Zhou14, Proposition C.6], and Lemma C.8. For $T_{15}$ we can apply the lower bound on $g_{k-1}$ and the induction hypothesis. These two arguments give that
and hence that
For $T_{10}$ we have the upper bound $T_{10}\leq T_{16}+T_{17}$ , where
For $T_{16}$ one can use Cauchy–Schwarz, the lower bound on $g_{k-1}$ , [Reference Jasra, Kamatani, Law and Zhou14, Proposition C.6] (twice), and [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2]. For $T_{17}$ we can apply the lower bound on $g_{k-1}$ , [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2], and bias results for particle filters. These two arguments give
and thus
Thus we have proved that
Remark C.4. Using the approach in the proof of Lemma C.10, one can also prove the following result. Assume (A1)–(A2). Then for any $k\in\mathbb{N}$ there exists a $C<+\infty$ such that for any $(l,N_l,\varphi)\in\mathbb{N}^2\times\mathcal{B}_b(\mathsf{X})\cap\textit{Lip}(\mathsf{X})$ ,
Lemma C.11 . Assume (A1)–(A2). Then for any $(k,\varphi)\in\mathbb{N}\times\mathcal{B}_b(\mathsf{X})\cap\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ there exists a $C<+\infty$ such that for any $(l,N_l,\varepsilon)\in\mathbb{N}^2\times(0,1/2)$ ,
Proof. The proof is by induction. The case $k=1$ is trivial as there is no bias, so we proceed to the induction step. Following the proof of Lemma C.9, we have the decomposition
where
For $T_1$ one can match the empirical and limit terms across the level-l, antithetic, and level- $(l-1)$ coordinates and adopt the same proof approach as used for $T_1$ in the proof of Lemma C.10; since the proof would be repeated, we omit it and remark only that
For $T_2$ , using Lemma C.4 and the triangle inequality, we have that $T_2\leq\sum_{j=1}^4 T_{j+2}$ , where
As $T_3$ (resp. $T_4$ ) and $T_{6}$ (resp. $T_{5}$ ) can be dealt with using similar arguments, we only prove bounds for $T_3$ (resp. $T_4$ ). For $T_3$ we have the upper bound $T_3\leq\sum_{j=1}^4 T_{j+6}$ , where
For $T_7$ one can use Cauchy–Schwarz, the lower bound on $g_{k-1}$ , [Reference Jasra, Kamatani, Law and Zhou14, Proposition C.6], and Lemma C.9. For $T_8$ we can apply the lower bound on $g_{k-1}$ and the induction hypothesis. These arguments yield
For $T_9$ we can use [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2], Cauchy–Schwarz, the lower bound on $g_{k-1}$ , and [Reference Jasra, Kamatani, Law and Zhou14, Proposition C.6] twice. For $T_{10}$ we can use [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2], the lower bound on $g_{k-1}$ , and standard bias results for particle filters. This gives
Collecting the above arguments, we have shown that
For $T_4$ we have the upper bound $T_4\leq\sum_{j=1}^3 T_{j+10}$ , where
For $T_{11}$ one can use the lower bound on $g_{k-1}$ , Cauchy–Schwarz, Remark C.3, and Lemma C.9. For $T_{12}$ we can apply [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2], Cauchy–Schwarz, the lower bound on $g_{k-1}$ , [Reference Jasra, Kamatani, Law and Zhou14, Proposition C.6], and Remark C.3. For $T_{13}$ we can use the lower bound on $g_{k-1}$ , Remark C.4, and [Reference Jasra, Kamatani, Law and Zhou14, Lemma D.2]. Putting these results together, we have established that
and thus we can deduce the same upper bound (up to a constant) for $T_4$ . As a result of the above arguments, we have established that
Theorem 3. Assume (A1)–(A2). Then for any $(k,\varphi)\in\mathbb{N}\times\mathcal{B}_b(\mathsf{X})\cap\mathcal{C}_b^2(\mathsf{X},\mathbb{R})$ there exists a $C<+\infty$ such that for any $(l,N_l)\in\mathbb{N}^2$ ,
Acknowledgements
All of the authors were supported by KAUST baseline funding. We thank two referees and an editor for their valuable comments, which greatly improved the article.
Funding information
There are no funding bodies to thank in relation to the creation of this article.
Competing interests
There were no competing interests to declare which arose during the preparation or publication process of this article.