
Bounding the difference between the values of robust and non-robust Markov decision problems

Published online by Cambridge University Press:  11 November 2024

Ariel Neufeld*
Affiliation:
Nanyang Technological University Singapore
Julian Sester**
Affiliation:
National University of Singapore
*Postal address: Division of Mathematical Sciences, 21 Nanyang Link, Singapore 637371. Email: ariel.neufeld@ntu.edu.sg
**Postal address: Department of Mathematics, 21 Lower Kent Ridge Road, Singapore 119077. Email: jul_ses@nus.edu.sg

Abstract

In this note we provide an upper bound for the difference between the value function of a distributionally robust Markov decision problem and the value function of a non-robust Markov decision problem, where the ambiguity set of probability kernels of the distributionally robust Markov decision process is described by a Wasserstein ball around some reference kernel whereas the non-robust Markov decision process behaves according to a fixed probability kernel contained in the ambiguity set. Our derived upper bound for the difference between the value functions is dimension-free and depends linearly on the radius of the Wasserstein ball.

Type
Original Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

Markov decision processes enable the modeling of non-deterministic interactions between an agent and its environment within a tractable stochastic framework. At each time $t$ an agent observes the current state and takes an action which leads to an immediate reward. The goal of the agent then is to optimize its expected cumulative reward. Mathematically, Markov decision problems are solved based on a dynamic programming principle, whose framework is the foundation of many reinforcement learning algorithms such as the Q-learning algorithm. See [5, 10, 25, 26] for the theory of Markov decision processes, and [1, 6, 7, 11, 12, 15, 20, 29, 33] for their applications, especially in the field of reinforcement learning.

In the classical setup for Markov decision problems, the transition kernel describing the transition probabilities of the underlying Markov decision process is given. Economically, this means that the agent possesses knowledge of the true distribution of the underlying process, an assumption which typically cannot be justified in practice. To address this issue, researchers have recently introduced a robust version of the Markov decision problem accounting for a possible misspecification of the assumed underlying probability kernel that describes the dynamics of the state process. Typically, one assumes that the agent possesses a good estimate of the true, but to the agent unknown, probability kernel and, due to this uncertainty, decides to consider the worst case among all laws which lie within a ball of a certain radius around the estimated probability kernel with respect to some distance, e.g. the Wasserstein distance or the Kullback–Leibler divergence. See [3, 4, 8, 9, 16–18, 21, 23, 24, 27, 28, 30, 32, 34–37, 39] for robust Markov decision problems and corresponding reinforcement-learning-based algorithms to solve them.

In this note, the goal is to analyze the difference between the value function of the Markov decision problem with respect to the true (but to the agent unknown) probability kernel and that of the robust Markov decision problem defined with respect to some Wasserstein ball around the transition kernel estimated by the agent. Note that the estimated transition kernel need not coincide with the true probability kernel; however, we assume that the agent's estimate is good enough that the true probability kernel lies within the Wasserstein ball around the estimated probability kernel.

Similar, though not identical, research questions have been studied in [2, 13, 14, 19, 38], mainly focusing on establishing stability results for value functions with respect to the choice of the underlying transition probability. In [19, Theorem 4.2], the author presents a state-dependent bound on the difference between iterations of value functions (obtained via the so-called value iteration algorithm) of two Markov decision processes, implying that these iterations depend continuously on the transition kernels. As a refinement of [19, Theorem 4.2] and also of the related result obtained in [13, Theorem 2.2.8], the result from [38, Theorem 6.2] shows that in a finite time horizon setting the difference between the value functions of two Markov decision processes with different transition probabilities can be bounded by an expression depending on a certain tailored distance between the transition probabilities. In [14], the authors propose a semi-metric for Markov processes which allows one to determine bounds for certain types of linear stochastic optimization problems; cf. [14, Theorem 3]. The authors of [2] study the sensitivity of multi-period stochastic optimization problems over a finite time horizon with respect to the underlying probability distribution in the so-called adapted Wasserstein distance. They show in [2, Theorem 2.4] that the value function of their robust optimization problem, with the corresponding ambiguity set being a Wasserstein ball around a reference measure, can be approximated by the corresponding value function of the non-robust optimization problem defined with respect to the reference measure plus an explicit correction term. Vaguely speaking (as the optimization problem in [2] is, technically speaking, not comparable to our setting), this is similar to our analysis in the special case where our reference measure coincides with the true measure.

Under some mild assumptions, we obtain in Theorem 3.1 an explicit upper bound for the difference between the value functions of the robust and the non-robust Markov decision problem which only depends on the radius $\varepsilon$ of the Wasserstein ball, the discount factor $\alpha$, and the Lipschitz constants of the reward function and the true transition kernel. In particular, we obtain that the difference between the two value functions grows at most linearly in the radius $\varepsilon$ and does not depend on the dimensions of the underlying state and action spaces.

The remainder of this note is structured as follows. In Section 2 we introduce the underlying setting used to derive our main result, which is reported in Section 3. The proof of the main result and the auxiliary results necessary for it are reported in Section 4.

2. Setting

We first present the underlying setting used to define both the robust and the non-robust Markov decision process; we then compare their respective value functions.

2.1. Setting

As the state space we consider a closed subset $\mathcal{X} \subseteq\mathbb{R}^d$ for some $d\in \mathbb{N}$ , equipped with its Borel $\sigma$ -field $\mathcal{F}_{\mathcal{X}}$ , which we use to define the infinite Cartesian product $\Omega \,:\!=\, {{\mathcal{X}}^{{\mathbb{N}_0}}} = {\mathcal{X}} \times {\mathcal{X}} \times \cdots$ and the $\sigma$ -field $\mathcal{F}\,:\!=\,{\mathcal{F}_{\mathcal{X}}}\otimes \mathcal{F}_{\mathcal{X}} \otimes \cdots$ . For any $q\in \mathbb{N}$ , we denote by $\mathcal{M}_1^q(\mathcal{X})$ the set of probability measures on $\mathcal{X}$ with finite q-moments and write $\mathcal{M}_1(\mathcal{X})\,:\!=\,\mathcal{M}_1^1(\mathcal{X})$ for brevity. We define on $\Omega$ the infinite-horizon stochastic process $(X_{t})_{t\in\mathbb{N}_0}$ via the canonical process $X_t((\omega_0,\omega_1,\dots,\omega_t,\dots))\,:\!=\,\omega_t$ for $(\omega_0,\omega_1,\dots,\omega_t,\dots) \in \Omega$ , $t \in \mathbb{N}_0$ .

To define the set of controls (also called actions) we fix a compact set $A \subseteq \mathbb{R}^m$ for some $m \in \mathbb{N}$ , and set

\begin{align*} \mathcal{A} & \,:\!=\, \{{\mathbf{a}}=(a_t)_{t\in\mathbb{N}_0}\mid(a_t)_{t\in\mathbb{N}_0}\colon\Omega\rightarrow A;\, a_t\text{ is }\sigma(X_{t})\text{-measurable for all } t \in \mathbb{N}_0\} \\ &\ = \{(a_t(X_t))_{t\in\mathbb{N}_0}\mid a_t\colon\mathcal{X}\rightarrow A \text{ Borel measurable for all } t \in \mathbb{N}_0\}.\end{align*}

Next, we define the q-Wasserstein distance $d_{W_q}(\cdot,\cdot)$ for some $q\in \mathbb{N}$ . For any $\mathbb{P}_1,\mathbb{P}_2 \in \mathcal{M}_1^q(\mathcal{X})$ let $d_{W_q}(\mathbb{P}_1,\mathbb{P}_2)$ be defined as

\[d_{W_q}(\mathbb{P}_1,\mathbb{P}_2)\,:\!=\,\bigg(\inf_{\pi \in \Pi(\mathbb{P}_1,\mathbb{P}_2)}\int_{\mathcal{X} \times \mathcal{X}} \|x-y\|^q \,\mathrm{d}\pi(x,y)\bigg)^{1/q},\]

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$, and where $\Pi(\mathbb{P}_1,\mathbb{P}_2)$ denotes the set of couplings of $\mathbb{P}_1$ and $\mathbb{P}_2$, i.e. of joint distributions on $\mathcal{X}\times\mathcal{X}$ with first marginal $\mathbb{P}_1$ and second marginal $\mathbb{P}_2$. Moreover, we denote by $\tau_q$ the Wasserstein q-topology induced by the convergence with respect to $d_{W_q}$.

To define an ambiguity set of probability kernels, we first fix throughout this paper some $q\in \mathbb{N}$ and $\varepsilon>0$ . Then, we define, as an ambiguity set of probability kernels,

(2.1) \begin{equation} \mathcal{X} \times A \ni (x,a) \twoheadrightarrow \mathcal{P}(x,a) \,:\!=\, \mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x,a)) \,:\!=\, \big\{\mathbb{P}\in\mathcal{M}_1(\mathcal{X})\mid d_{W_q}(\mathbb{P},\widehat{\mathbb{P}}(x,a)) \leq \varepsilon\big\}\end{equation}

with respect to some center $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in\big(\mathcal{M}^q_1(\mathcal{X}),\tau_q\big)$ , meaning that $\mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x,a))$ denotes the q-Wasserstein ball (also called the Wasserstein ball of order q) with radius $\varepsilon$ and center $\widehat{\mathbb{P}}(x,a)$ .
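For orientation, when $q=1$ and the measures are supported on the real line, both the distance $d_{W_1}$ and membership in the ball $\mathcal{B}^{(1)}_\varepsilon(\widehat{\mathbb{P}}(x,a))$ can be estimated from samples. The following is a minimal sketch (the function in_wasserstein_ball and all sample sizes are our own illustration, not part of the results); scipy.stats.wasserstein_distance computes the 1-Wasserstein distance between empirical measures.

import numpy as np
from scipy.stats import wasserstein_distance  # 1-Wasserstein distance for samples on the real line

def in_wasserstein_ball(samples_p, samples_p_hat, eps):
    # Empirical check of d_{W_1}(P, P_hat) <= eps, with both measures
    # represented by i.i.d. samples.
    return wasserstein_distance(samples_p, samples_p_hat) <= eps

rng = np.random.default_rng(0)
p_hat = rng.binomial(10, 0.5, size=10_000)   # reference kernel (cf. Example 3.1)
p = rng.binomial(10, 0.55, size=10_000)      # slightly biased candidate kernel
print(in_wasserstein_ball(p, p_hat, eps=0.6))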

Under these assumptions we define, for every $x \in \mathcal{X}$ and ${\mathbf{a}} \in \mathcal{A}$ , the set of admissible measures on $(\Omega, \mathcal{F})$ by

\begin{align*} \mathfrak{P}_{x,{\mathbf{a}}} \,:\!=\, \{\delta_x\otimes\mathbb{P}_0\otimes\mathbb{P}_1\otimes\cdots \mid & \text{ for all }t\in\mathbb{N}_0\colon\mathbb{P}_t:\mathcal{X}\rightarrow\mathcal{M}_1(\mathcal{X}) \text{ Borel measurable,} \\ & \text{ and }\mathbb{P}_t(\omega_t)\in\mathcal{P}(\omega_t,a_t(\omega_t)) \text{ for all }\omega_t\in\mathcal{X}\},\end{align*}

where the notation $\mathbb{P}=\delta_x \otimes\mathbb{P}_0\otimes \mathbb{P}_1 \otimes\cdots \in \mathfrak{P}_{x,{\mathbf{a}}}$ abbreviates

\[ \mathbb{P}(B)\,:\!=\,\int_{\mathcal{X}}\cdots \int_{\mathcal{X}} \cdots \mathbf{1}_{B}((\omega_t)_{t\in \mathbb{N}_0}) \cdots \mathbb{P}_{t-1}(\omega_{t-1};{\mathop{}\!\mathrm{d}}\omega_t)\cdots \mathbb{P}_0(\omega_0;{\mathop{}\!\mathrm{d}}\omega_1) \delta_x({\mathop{}\!\mathrm{d}} \omega_0),\quad B \in \mathcal{F}.\]
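For simulation purposes, the product measure $\delta_x\otimes\mathbb{P}_0\otimes\mathbb{P}_1\otimes\cdots$ corresponds to drawing the states sequentially, each next state from the kernel evaluated at the current state-action pair. A minimal sketch (the names sample_path, policy, and kernel_sampler are ours and purely illustrative):

import numpy as np

def sample_path(x0, policy, kernel_sampler, T, rng):
    # Draws (X_0, ..., X_T) under delta_{x0} (x) P_0 (x) ... (x) P_{T-1}:
    # X_0 = x0, and X_{t+1} is sampled from the kernel at (X_t, a_t(X_t)).
    path = [x0]
    for t in range(T):
        a_t = policy(t, path[-1])  # a_t is sigma(X_t)-measurable
        path.append(kernel_sampler(path[-1], a_t, rng))
    return np.array(path)

# e.g. with the state-independent binomial kernel of Example 3.1 below:
rng = np.random.default_rng(0)
print(sample_path(5, lambda t, x: 0, lambda x, a, g: g.binomial(10, 0.5), 10, rng))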

2.2. Problem formulation and standing assumptions

Let $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ be some reward function. We assume from now on that it fulfils the following assumptions.

Assumption 2.1. (Assumptions on the reward function) The reward function $r\colon\mathcal{X}\times A\times\mathcal{X}\rightarrow\mathbb{R}$ satisfies the following:

  (i) The map $\mathcal{X}\times A\times\mathcal{X}\ni(x_0,a,x_1)\mapsto r(x_0,a,x_1)\in\mathbb{R}$ is Lipschitz continuous with constant $L_r>0$.

  (ii) If $\mathcal{X}$ is unbounded and $q\in\mathbb{N}$ as defined in (2.1) satisfies $q=1$, then we additionally assume that $\sup_{x_0,x_1\in \mathcal{X}, a\in A} |r(x_0,a,x_1)|<\infty$.

Note that Assumption 2.1(i) implies that the reward r is bounded whenever $\mathcal{X}$ is bounded.

Next, we impose the following standing assumption on our reference probability kernel modeled by the center of the q-Wasserstein ball.

Assumption 2.2. (Assumption on the center of the ambiguity set.) Let $q\in \mathbb{N}$ be defined in (2.1). Then the center $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in(\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ satisfies the following:

  (i) The map $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in(\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ is continuous.

  (ii) If the reward function r is unbounded, then we assume, instead of (i), the stronger assumption that $\widehat{\mathbb{P}}$ is Lipschitz continuous, i.e. that there exists $L_{\widehat{\mathbb{P}}}>0$ such that

    \begin{equation*} d_{W_q}(\widehat{\mathbb{P}}(x,a),\widehat{\mathbb{P}}(x^{\prime},a^{\prime})) \leq L_{\widehat{\mathbb{P}}}\big(\Vert x-x^{\prime} \Vert + \Vert a-a^{\prime} \Vert\big) \quad \text{for all } x,x^{\prime} \in \mathcal{X},\,a,a^{\prime}\in A.\end{equation*}

Finally, we assume the following on the discount factor $\alpha\in (0,1)$ .

Assumption 2.3. (Assumption on the discount factor) Let $q\in \mathbb{N}$ , $\varepsilon>0$ be defined as in (2.1), and $\mathcal{X}\times A\ni(x,a)\mapsto\widehat{\mathbb{P}}(x,a)\in(\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ be as defined in Assumption 2.2. Then the discount factor $\alpha$ satisfies $0<\alpha<{1}/{C_P}$ , where $1\leq C_P<\infty$ is defined by

\begin{align*} C_P = \begin{cases} \max\bigg\{1 + \varepsilon + \sup\limits_{a\in A}\inf\limits_{x\in\mathcal{X}}\bigg\{ \displaystyle\int_{\mathcal{X}}\|z\|\,\widehat{\mathbb{P}}(x,a)({\mathop{}\!\mathrm{d}}z) + L_{\widehat{\mathbb{P}}}\|x\|\bigg\},\,L_{\widehat{\mathbb{P}}}\bigg\} & \text{if } r \text{ is unbounded}, \\ 1 & \text{otherwise}. \end{cases} \end{align*}

Our goal is to compare the value of the robust Markov decision problem with the value of the non-robust Markov decision problem. To define the robust value function, for every initial value $x\in \mathcal{X}$ , we maximize the expected value of $\sum_{t=0}^\infty \alpha^tr(X_{t},a_t,X_{t+1})$ under the worst-case measure from $\mathfrak{P}_{x,{\mathbf{a}}}$ over all possible actions ${\mathbf{a}} \in \mathcal{A}$ . More precisely, we introduce the robust value function by

(2.2) \begin{equation} \mathcal{X} \ni x \mapsto V(x) \,:\!=\, \sup_{{\mathbf{a}} \in \mathcal{A}}\inf_{\mathbb{P} \in \mathfrak{P}_{x,{\mathbf{a}}}} \Bigg(\mathbb{E}_{\mathbb{P}}\Bigg[\sum_{t=0}^\infty \alpha^tr(X_{t},a_t,X_{t+1})\Bigg]\Bigg).\end{equation}

To define the non-robust value function under the true, but to the agent unknown, probability kernel $\mathbb{P}^{\operatorname{true}}$ contained in the ambiguity set $\mathcal{P}$ , we impose the following assumptions on $\mathbb{P}^{\operatorname{true}}$ .

Assumption 2.4. (Assumptions on the true probability kernel) Let $q\in \mathbb{N}$ be as defined in (2.1). Then the true (but unknown) probability kernel $\mathcal{X}\times A\ni(x,a)\mapsto\mathbb{P}^{\mathrm{true}}(x,a)\in (\mathcal{M}^q_1(\mathcal{X}),\tau_q)$ satisfies the following:

  (i) $\mathbb{P}^{\operatorname{true}}(x,a) \in \mathcal{P}(x,a)$ for all $(x,a)\in \mathcal{X} \times A$.

  (ii) $\mathbb{P}^{\operatorname{true}}$ is $L_P$-Lipschitz with constant $0\leq L_P < {1}/{\alpha}$, where $0<\alpha<1$ is defined in Assumption 2.3, i.e. we have

    \begin{equation*} d_{W_q}(\mathbb{P}^{\operatorname{true}}(x,a),\mathbb{P}^{\operatorname{true}}(x^{\prime},a^{\prime})) \leq L_P(\|x-x^{\prime}\|+\|a-a^{\prime}\|) \end{equation*}
    for all $ x,x^{\prime} \in \mathcal{X}$ , $a,a^{\prime}\in A$ .

Then, we introduce the non-robust value function under the true (but to the agent unknown) transition kernel by

(2.3) \begin{equation} \mathcal{X} \ni x \mapsto V^{\operatorname{true}}(x) \,:\!=\, \sup_{{\mathbf{a}} \in \mathcal{A}}\Bigg(\mathbb{E}_{{\mathbb{P}}^{\operatorname{true}}_{x,{\mathbf{a}}}}\Bigg[ \sum_{t=0}^\infty \alpha^tr(X_{t},a_t,X_{t+1})\Bigg]\Bigg),\end{equation}

where we write, for any $x \in \mathcal{X}$ and ${\mathbf{a}} \in \mathcal{A}$ ,

\begin{align*}\mathbb{P}_{x,{\mathbf{a}}}^{\operatorname{true}}\,:\!=\,\delta_{x} \otimes \mathbb{P}^{\operatorname{true}} \otimes \mathbb{P}^{\operatorname{true}} \otimes \mathbb{P}^{\operatorname{true}} \otimes \mathbb{P}^{\operatorname{true}} \cdots \in \mathcal{M}_1(\Omega).\end{align*}

Note that Assumptions 2.1–2.4 ensure that the dynamic programming principle holds for both the robust and the non-robust Markov decision problem; see [23, Theorem 2.7].
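Since the ambiguity set (2.1) is an entire Wasserstein ball, the fixed-point iteration behind this dynamic programming principle cannot be run directly in general; in practice one resorts, e.g., to the robust Q-learning algorithm of [21]. Purely for illustration, the following sketch (our own simplification, not a method from this note) runs robust value iteration on a finite state space with the ball replaced by a finite family of candidate kernels:

import numpy as np

def robust_value_iteration(r, kernels, alpha, n_iter=500):
    # r[x, a, y]: reward for the transition x -> y under action a.
    # kernels: list of transition arrays P[x, a, y] (rows summing to 1),
    # standing in for a finite subset of the ambiguity set P(x, a).
    v = np.zeros(r.shape[0])
    for _ in range(n_iter):
        # q[k, x, a] = E_{P_k(x,a)}[ r(x, a, X_1) + alpha * v(X_1) ]
        q = np.stack([np.einsum('xay,xay->xa', P, r + alpha * v[None, None, :])
                      for P in kernels])
        v = q.min(axis=0).max(axis=1)  # inf over kernels, then sup over actions
    return v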

3. Main result

As our main result we establish a bound on the difference between the value function of the Markov decision process with the fixed (true) transition kernel defined in (2.3) and the value function of the robust Markov decision process defined in (2.2).

Theorem 3.1. Let Assumptions 2.1–2.4 hold.

  (i) Then, for any $x_0\in \mathcal{X}$,

    (3.1) \begin{equation} 0 \leq V^{\operatorname{true}}(x_0)-V(x_0) \leq 2 L_r \varepsilon(1+\alpha)\sum_{i=0}^\infty \alpha^i \sum_{j=0}^i(L_P)^j < \infty. \end{equation}
  (ii) Moreover, in the special case that $\mathbb{P}^{\operatorname{true}} = \widehat{\mathbb{P}}$, for any $x_0\in \mathcal{X}$,

    (3.2) \begin{equation} 0 \leq V^{\operatorname{true}}(x_0)-V(x_0) \leq L_r \varepsilon(1+\alpha)\sum_{i=0}^\infty \alpha^i \sum_{j=0}^i(L_P)^j < \infty. \end{equation}

We highlight that the upper bound from (3.1) depends only on $\varepsilon$, $\alpha$, and the Lipschitz constants $L_r$ and $L_P$. In particular, the upper bound depends linearly on the radius $\varepsilon$ of the Wasserstein ball and is independent of the current state $x_0$ and of the dimensions d and m of the state and action space, respectively.
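For quick numerics, note that the double series in (3.1) can be summed in closed form, namely $\sum_{i=0}^\infty \alpha^i \sum_{j=0}^i (L_P)^j = 1/((1-\alpha)(1-\alpha L_P))$ whenever $\alpha<1$ and $\alpha L_P<1$ (see the remark after the proof of Lemma 4.3 below). A minimal sketch evaluating the bound (the function name upper_bound is ours):

def upper_bound(L_r, eps, alpha, L_P, same_center=False):
    # Bound (3.1); with same_center=True the factor 2 is dropped, as in (3.2).
    # The closed form of the double sum requires alpha < 1 and alpha * L_P < 1.
    assert 0 < alpha < 1 and alpha * L_P < 1
    series = 1.0 / ((1.0 - alpha) * (1.0 - alpha * L_P))
    return (1.0 if same_center else 2.0) * L_r * eps * (1.0 + alpha) * series

# sanity check of the closed form against a truncated double sum
alpha, L_P = 0.45, 0.3
trunc = sum(alpha**i * sum(L_P**j for j in range(i + 1)) for i in range(200))
assert abs(trunc - 1.0 / ((1 - alpha) * (1 - alpha * L_P))) < 1e-12

print(upper_bound(L_r=1.0, eps=0.1, alpha=0.45, L_P=0.0, same_center=True))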

Remark 3.1. The assertion of Theorem 3.1 also carries over to the case of autocorrelated time series, where one assumes that the past $h \in \mathbb{N} \cap [2,\infty)$ values of a time series $(Y_{t})_{t \in \{-h,\ldots,-1,0,1,\ldots\}}$ taking values in some closed subset $\mathcal{Y}$ of $\mathbb{R}^{D}$ for some $D\in \mathbb{N}$ may have an influence on the next value. This can be modeled by defining the state process $X_t\,:\!=\,(Y_{t-h+1},\dots,Y_t) \in \mathcal{Y}^h\,=\!:\,\mathcal{X}$, $t\in\mathbb{N}_0$. In this setting, the subsequent state $X_{t+1}=(Y_{t-h+2},\dots,Y_{t+1})$ shares $h-1$ components with the preceding state $X_t=(Y_{t-h+1},\dots,Y_t)$, and uncertainty is only inherent in the last component $Y_{t+1}$. Thus, we consider a reference kernel of the form $\mathcal{X}\times A\ni(x,a)\mapsto\mathbb{P}^{\mathrm{true}}(x,a) = \delta_{\pi(x)}\otimes\widetilde{\mathbb{P}}^{\mathrm{true}}(x,a)\in\mathcal{M}_1(\mathcal{X})$, where $\widetilde{\mathbb{P}}^{\mathrm{true}}(x,a)\in\mathcal{M}_1(\mathcal{Y})$ and $\mathcal{X} \ni x=(x_1,\dots,x_h) \mapsto \pi(x)\,:\!=\, (x_2,\dots,x_h)$ denotes the projection onto the last $h-1$ components. In this setting, for $q\in \mathbb{N}$ and $\varepsilon>0$, the ambiguity set is given by

\begin{equation*} \begin{aligned} \mathcal{X} \times A \ni (x,a) \twoheadrightarrow \mathcal{P}(x,a) \,:\!=\, \big\{ & \mathbb{P}\in\mathcal{M}_1(\mathcal{X})\text{ such that } \mathbb{P}=\delta_{\pi(x)} \otimes \widetilde{\mathbb{P}} \\ & \text{for some } \widetilde{\mathbb{P}} \in \mathcal{M}_1(\mathcal{Y}) \text{ with } W_q(\widetilde{\mathbb{P}},\widetilde{\mathbb{P}}^{\mathrm{true}}(x,a))\leq \varepsilon\big\}. \end{aligned} \end{equation*}

The described setting is discussed in more detail in [23, Section 3.3] and [21, Section 2.2]. Typical applications can be found in finance and include portfolio optimization; cf. [23, Section 4].
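As a small illustration of the sliding-window construction above (the function name stacked_states is our own):

import numpy as np

def stacked_states(y, h):
    # Build the Markov states X_t = (Y_{t-h+1}, ..., Y_t) of Remark 3.1
    # from a scalar time series (a sliding window of length h).
    y = np.asarray(y)
    return np.stack([y[t:t + h] for t in range(len(y) - h + 1)])

print(stacked_states([1, 2, 3, 4, 5], h=3))  # rows: (1,2,3), (2,3,4), (3,4,5)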

Example 3.1. (Coin toss.) To illustrate the applicability of Theorem 3.1, we study an example similar to the one provided in [21, Example 4.1]. To this end, we consider an agent who at each time tosses 10 coins and observes the number of heads. Thus, we model the environment by the state space $\mathcal{X}\,:\!=\,\{0,\dots,10\}$. Prior to the toss, the agent can bet whether in the next toss of 10 coins the number of heads will be smaller ($a=-1$) or larger $(a=1)$ than in the previous toss. She gains $\$1$ if the bet is correct, and in turn has to pay $\$1$ if it is not (without being rewarded/punished if the number of heads remains the same). Moreover, the agent can also decide not to bet on the toss (by choosing $a=0$). We model this via the reward function

\[ \mathcal{X} \times A \times \mathcal{X} \ni (x,a,x^{\prime}) \mapsto r(x,a,x^{\prime}) \,:\!=\, a \mathbf{1}_{\{x<x^{\prime}\}}-a \mathbf{1}_{\{x>x^{\prime}\}}, \]

where the possible actions are given by $A\,:\!=\,\{-1,0,1\}$. The reference measure in this setting assumes a fair coin and therefore (independently of the state-action pair) is a binomial distribution with $n=10$, $p=0.5$, i.e.

\begin{align*}\mathcal{X} \times A \ni (x,a) \mapsto \mathbb{P}^{\mathrm{true}}(x,a) = \widehat{\mathbb{P}}(x,a)\,:\!=\, \operatorname{Bin}(10,0.5).\end{align*}

In the described setting it is easy to see that r is Lipschitz continuous with Lipschitz constant

\begin{align*} L_r = \Bigg(\max_{\substack{y_0,y^{\prime}_0,x_1,x^{\prime}_1\in \mathcal{X},\,b,b^{\prime}\in A\\ (y_0,b,x_1) \neq (y^{\prime}_0,b^{\prime},x^{\prime}_1)}} \frac{|r(y_0,b,x_1)-r\big(y^{\prime}_0,b^{\prime},x^{\prime}_1\big)|}{\|y_0-y^{\prime}_0\|+\|b-b^{\prime}\|+\|x_1-x^{\prime}_1\|}\Bigg) = 1. \end{align*}

Moreover, we have $L_P=0$. In Figure 1 we plot the corresponding upper bound from (3.2) against the difference $V^{\mathrm{true}}(x_0)-V(x_0)$ for different initial values $x_0$ and different levels of $\varepsilon$ used for the computation of V, with $\alpha =0.45$. The value functions are computed using the robust Q-learning algorithm proposed in [21]. The code used can be found at https://github.com/juliansester/MDP_Bound.
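For reference, the non-robust value function $V^{\mathrm{true}}$ of this example can be reproduced with a few lines of value iteration; the sketch below (our own, covering only the non-robust part; the robust value V requires the Q-learning algorithm from [21], available in the linked repository) exploits that the kernel $\operatorname{Bin}(10,0.5)$ does not depend on the state-action pair.

import numpy as np
from scipy.stats import binom

states, actions, alpha = np.arange(11), np.array([-1, 0, 1]), 0.45
p = binom.pmf(states, 10, 0.5)  # P^true = Bin(10, 0.5)

def reward(x, a, x1):
    return a * (x < x1) - a * (x > x1)

v = np.zeros(11)
for _ in range(200):  # value iteration for V^true in (2.3)
    # q[x, a] = sum_{x1} p(x1) * (r(x, a, x1) + alpha * v(x1))
    q = np.array([[np.sum(p * (reward(x, a, states) + alpha * v))
                   for a in actions] for x in states])
    v = q.max(axis=1)
print(np.round(v, 4))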

Figure 1. The difference between the non-robust and the robust value function compared with the upper bound from (3.2) in the setting described in Example 3.1 with $\varepsilon>0$ and for different initial values of the Markov decision process. Initial values larger than 5 are omitted due to the setting-specific symmetry $V(x_0)-V^{\mathrm{true}}(x_0) = V(10-x_0)-V^{\mathrm{true}}(10-x_0)$ for $x_0\in\{0,1,\dots,10\}$ .

4. Proof of the main result

In Section 4.1 we provide several auxiliary lemmas which are necessary to establish the proof of Theorem 3.1, which is reported in Section 4.2.

4.1. Auxiliary results

Lemma 4.1. Let $r\colon\mathcal{X}\times A\times\mathcal{X}\rightarrow\mathbb{R}$ satisfy Assumption 2.1. Let $\mathcal{X}\times A\ni(x,a)\mapsto\mathbb{P}^{\mathrm{true}}(x,a)(\mathrm{d} x_1) \in (\mathcal{M}_1^q(\mathcal{X}),\tau_q)$ satisfy Assumption 2.4. For any $v \in C_b(\mathcal{X}, \mathbb{R})$ , where here and in the following we denote by $C_b(\mathcal{X}, \mathbb{R})$ the set of continuous and bounded functions from $\mathcal{X}$ to $\mathbb{R}$ , define

(4.1) \begin{equation} \mathcal{T}^{\mathrm{true}}{v}(x_0) \,:\!=\, \sup_{a \in A}{\int_{\mathcal{X}}}(r(x_0,a,{x_1})+\alpha v({x_1})) {\mathbb{P}^{\mathrm{true}}(x_0,a)(\mathrm{d}x_1)},\qquad x_0 \in \mathcal{X}. \end{equation}

Then, for any $v \in C_b(\mathcal{X}, \mathbb{R})$ that is $L_r$-Lipschitz, any $n \in \mathbb{N}$, and any $x_0,x^{\prime}_0\in \mathcal{X}$, we have

(4.2) \begin{equation} \big|(\mathcal{T}^{\mathrm{true}})^nv(x_0) - (\mathcal{T}^{\mathrm{true}})^nv\big(x^{\prime}_0\big)\big| \leq L_r\Bigg(1+L_P(1+\alpha)\sum_{i=0}^{n-1}\alpha^i L_P^i\Bigg)\|x_0-x^{\prime}_0\|. \end{equation}

Proof. For any $x_0,x^{\prime}_0\in \mathcal{X}$ and $a \in A$ , let ${\Pi^{\mathrm{true}}}(\mathrm{d}{x_1},\mathrm{d}{x^{\prime}_1})\in\mathcal{M}_1(\mathcal{X}\times\mathcal{X})$ denote an optimal coupling between $\mathbb{P}^{\mathrm{true}}(x_0,a)$ and $\mathbb{P}^{\mathrm{true}}(x^{\prime}_0,a)$ with respect to $d_{W_1}$ , i.e.

\begin{equation*} \begin{aligned} \int_{\mathcal{X}\times\mathcal{X}}\|{x_1}-{x^{\prime}_1}\|\Pi^{\mathrm{true}}(\mathrm{d}{x_1},\mathrm{d}{x^{\prime}_1}) & = d_{W_1}(\mathbb{P}^{\mathrm{true}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x^{\prime}_0,a)) \\ & \leq d_{W_q}(\mathbb{P}^{\mathrm{true}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x^{\prime}_0,a)), \end{aligned} \end{equation*}

where the inequality follows from Hölder's inequality (see, e.g., [31, Remark 6.6]). We prove the claim by induction. We start with the base case $n=1$, and compute, by using the Lipschitz continuity of the functions r and v, and of $\mathbb{P}^{\mathrm{true}}$, that

\begin{align*} \big|(\mathcal{T}^{\mathrm{true}})v(x_0)-(\mathcal{T}^{\mathrm{true}})v\left(x^{\prime}_0\right)\big| & = \bigg|\sup_{a\in A}\int_{\mathcal{X}}(r(x_0,a,x_1)+\alpha v(x_1))\mathbb{P}^{\mathrm{true}}(x_0,a) (\mathrm{d}x_1) \\ & \qquad - \sup_{a\in A}\int_{\mathcal{X}}\left(r\left(x^{\prime}_0,a,x^{\prime}_1\right)+\alpha v\left(x^{\prime}_1\right)\right) \mathbb{P}^{\mathrm{true}}\left(x^{\prime}_0,a\right)\left(\mathrm{d}x^{\prime}_1\right) \bigg| \\ & \leq \sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}|r(x_0,a,x_1)+\alpha v(x_1)-r\left(x^{\prime}_0,a,x^{\prime}_1\right)-\alpha v\left(x^{\prime}_1\right)| \Pi^{\mathrm{true}}\left(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\right) \\ & \leq L_r\|x_0-x^{\prime}_0\|+L_r(1+\alpha)\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\| \Pi^{\mathrm{true}}\left(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\right) \\ & \leq L_r\|x_0-x^{\prime}_0\|+L_r(1+\alpha)\sup_{a\in A} d_{W_q}\left(\mathbb{P}^{\mathrm{true}}(x_0,a),\mathbb{P}^{\mathrm{true}}\left(x^{\prime}_0,a\right)\right) \\ & \leq L_r \|x_0-x^{\prime}_0\|+L_r(1+\alpha)L_P\|x_0-x^{\prime}_0\| \\ & = L_r(1+(1+\alpha)L_P)\|x_0-x^{\prime}_0\|. \end{align*}

We continue with the induction step. Hence, let $n\in\mathbb{N}\cap[2,\infty)$ be arbitrary and assume that (4.2) holds for $n-1$ . Then, we compute

\begin{align*} \big|(\mathcal{T}^{\mathrm{true}})^nv(x_0)-(\mathcal{T}^{\mathrm{true}})^nv\left(x^{\prime}_0\right)\big| & \leq \sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\big|r(x_0,a,{x_1}) + \alpha({\mathcal{T}^{\mathrm{true}}})^{n-1}v({x_1}) \\ & \qquad\qquad\qquad - r\left(x^{\prime}_0,a,{x^{\prime}_1}\right) - \alpha({\mathcal{T}^{\mathrm{true}}})^{n-1}v\left({x^{\prime}_1}\right)\big| \Pi^{\mathrm{true}}\left(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\right) \\ & \leq L_r\|x_0-x^{\prime}_0\| + L_r\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\| \Pi^{\mathrm{true}}\left(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\right) \\ & \quad + \alpha\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\big|({\mathcal{T}^{\mathrm{true}}})^{n-1}v({x_1}) - ({\mathcal{T}^{\mathrm{true}}})^{n-1}v\left({x^{\prime}_1}\right)\big|{\Pi^{\mathrm{true}}}\left(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\right). \end{align*}

Applying the induction hypothesis to the last term therefore yields

\begin{align*} & \big|(\mathcal{T}^{\mathrm{true}})^nv(x_0) - (\mathcal{T}^{\mathrm{true}})^nv\big(x^{\prime}_0\big)\big| \\[5pt] & \leq L_r\|x_0-x^{\prime}_0\| + L_r\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\| \Pi^{\mathrm{true}}\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \\[5pt] & \quad + \alpha L_r\Bigg(1+L_P(1+\alpha)\sum_{i=0}^{n-2}\alpha^i L_P^i \Bigg) \sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\|\Pi^{\mathrm{true}}\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \\[5pt] & \leq L_r\|x_0-x^{\prime}_0\| + L_r\cdot L_P\|x_0-x^{\prime}_0\| + \alpha L_r\Bigg(1+L_P(1+\alpha)\sum_{i=0}^{n-2}\alpha^i L_P^i\Bigg)L_P\|x_0-x^{\prime}_0\| \\[5pt] & = L_r\Bigg(1+ (1+\alpha)L_P +L_P(1+\alpha)\sum_{i=0}^{n-2}\alpha^{i+1} L_P^{i+1}\Bigg) \|x_0-x^{\prime}_0\| \\[5pt] & = L_r\Bigg(1+L_P(1+\alpha)\sum_{i=0}^{n-1}\alpha^i L_P^{i} \Bigg) \|x_0-x^{\prime}_0\|. \end{align*}

Lemma 4.2. Let Assumptions 2.1 and 2.4 hold. Moreover, let $\mathcal{X} \times A \ni (x,a) \mapsto \mathbb{P}^{\mathrm{wc}}(x,a) \in \mathcal{P}(x,a)$ denote another probability kernel contained in $\mathcal{P}(x,a)$ for each $(x,a) \in \mathcal{X} \times A$. Furthermore, for any $v \in C_b(\mathcal{X},\mathbb{R})$, define

(4.3) \begin{equation} \mathcal{T}^{\mathrm{wc}}v(x_0) \,:\!=\, \sup_{a\in A}\int_{\mathcal{X}}(r(x_0,a,{x_1})+\alpha v({x_1})) \mathbb{P}^{\mathrm{wc}}(x_0,a)(\mathrm{d}x_1), \qquad x_0 \in \mathcal{X}. \end{equation}
  (i) Then, for any $v \in C_b(\mathcal{X}, \mathbb{R})$ that is $L_r$-Lipschitz, $n \in \mathbb{N}$, and $x_0\in \mathcal{X}$,

    (4.4) \begin{equation} \big|(\mathcal{T}^{\mathrm{wc}})^nv(x_0)-(\mathcal{T}^{\mathrm{true}})^nv(x_0)\big| \leq 2L_r\varepsilon(1+\alpha)\sum_{i=0}^{n-1}\alpha^i\sum_{j=0}^i(L_P)^j, \end{equation}
    where $\mathcal{T}^{\mathrm{true}}$ is defined in (4.1).
  (ii) Moreover, in the special case that $\mathbb{P}^{\mathrm{true}} = \widehat{\mathbb{P}}$, we obtain, for any $x_0\in \mathcal{X}$,

    (4.5) \begin{equation} \big|(\mathcal{T}^{\mathrm{wc}})^nv(x_0)-(\mathcal{T}^{\mathrm{true}})^nv(x_0)\big| \leq L_r\varepsilon(1+\alpha)\sum_{i=0}^{n-1}\alpha^i\sum_{j=0}^i(L_P)^j. \end{equation}

Proof. (i) For any $x_0\in \mathcal{X}$ and $a \in A$ , let $\Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big)\in\mathcal{M}_1(\mathcal{X}\times\mathcal{X})$ denote an optimal coupling between $\mathbb{P}^{\mathrm{wc}}(x_0,a)$ and $\mathbb{P}^{\mathrm{true}}(x_0,a)$ with respect to $d_{W_1}$ . Then, since both $\mathbb{P}^{\mathrm{wc}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x_0,a) \in \mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x_0,a))$ we have

(4.6) \begin{align} \int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\|\Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) & = d_{W_1}(\mathbb{P}^{\mathrm{wc}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x_0,a)) \nonumber \\ & \leq d_{W_q}(\mathbb{P}^{\mathrm{wc}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x_0,a)) \leq 2\varepsilon, \end{align}

where the first inequality follows from Hölder's inequality (see, e.g., [31, Remark 6.6]). We prove the claim by induction. To this end, we start with the base case $n=1$, and compute, by using (4.6) and the Lipschitz continuity of r, v, and $\mathbb{P}^{\mathrm{true}}$, that

\begin{align*} \big|(\mathcal{T}^{\mathrm{wc}})v(x_0)-(\mathcal{T}^{\mathrm{true}})v(x_0)\big| & = \bigg|\sup_{a\in A}\int_{\mathcal{X}}(r(x_0,a,x_1)+\alpha v(x_1))\mathbb{P}^{\mathrm{wc}}(x_0,a)(\mathrm{d}x_1) \\ & \qquad - \sup_{a\in A}\int_{\mathcal{X}}\big(r\big(x_0,a,x^{\prime}_1\big)+\alpha v\big(x^{\prime}_1\big)\big) \mathbb{P}^{\mathrm{true}}(x_0,a)\big(\mathrm{d}x^{\prime}_1\big)\bigg| \\ & \leq \sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\!|r(x_0,a,x_1)+\alpha v(x_1)-r\big(x_0,a,x^{\prime}_1\big)-\alpha v\big(x^{\prime}_1\big)| \Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \\ & \leq L_r(1+\alpha)\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\|\Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \\ & \leq L_r(1+\alpha)\sup_{a\in A}d_{W_q}(\mathbb{P}^{\mathrm{wc}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x_0,a)) \leq L_r(1+\alpha) \cdot 2\varepsilon. \end{align*}

We continue with the induction step. Therefore, let $n\in\mathbb{N}\cap[2,\infty)$ be arbitrary and assume that (4.4) holds for $n-1$ . Then, we compute

(4.7) \begin{align} & \big|(\mathcal{T}^{\mathrm{wc}})^nv(x_0)-(\mathcal{T}^{\mathrm{true}})^nv(x_0)\big| \nonumber \\ & \quad \leq \sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\big|r(x_0,a,{x_1}) + \alpha(\mathcal{T}^{\mathrm{wc}})^{n-1}v({x_1}) \nonumber \\ & \qquad\qquad\qquad\quad - r\big(x_0,a,x^{\prime}_1\big)-\alpha(\mathcal{T}^{\mathrm{true}})^{n-1}v\big(x^{\prime}_1\big)\big| \Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \nonumber \\ & \quad \leq \sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}|r(x_0,a,{x_1})-r\big(x_0,a,{x^{\prime}_1}\big)| \Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \nonumber \\ & \qquad + \alpha\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\big|(\mathcal{T}^{\mathrm{true}})^{n-1}v({x_1}) - (\mathcal{T}^{\mathrm{true}})^{n-1}v\big(x^{\prime}_1\big)\big| \Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \end{align}
(4.8) \begin{align} & \qquad + \alpha\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\big|(\mathcal{T}^{\mathrm{wc}})^{n-1}v({x_1}) - (\mathcal{T}^{\mathrm{true}})^{n-1}v(x_1)\big|\Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big). \end{align}

Applying Lemma 4.1 to (4.7) and the induction hypothesis to (4.8) together with (4.6) therefore yields

\begin{align*} &\big|(\mathcal{T}^{\mathrm{wc}})^nv(x_0) - (\mathcal{T}^{\mathrm{true}})^nv(x_0)\big| \\ & \quad \leq L_r\sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\| \Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \\ & \qquad + \alpha L_r\Bigg(1+L_P(1+\alpha)\sum_{i=0}^{n-2}\alpha^i L_P^i\Bigg) \sup_{a\in A}\int_{\mathcal{X}\times\mathcal{X}}\|x_1-x^{\prime}_1\|\Pi\big(\mathrm{d}x_1,\mathrm{d}x^{\prime}_1\big) \\ & \qquad + \alpha\Bigg(2L_r\varepsilon(1+\alpha)\sum_{i=0}^{n-2}\alpha^i\sum_{j=0}^i L_P^j\Bigg) \\ & \quad \leq L_r \cdot 2\varepsilon+\alpha L_r\Bigg(1+L_P(1+\alpha)\sum_{i=0}^{n-2} \alpha^iL_P^i\Bigg) 2\varepsilon + \alpha\Bigg(L_r \cdot 2\varepsilon(1+\alpha)\sum_{i=0}^{n-2}\alpha^i\sum_{j=0}^i L_P^j\Bigg) \\ & \quad = 2L_r\varepsilon(1+\alpha)\Bigg(1+\alpha L_P\sum_{i=0}^{n-2}\alpha^iL_P^i + \sum_{i=0}^{n-2}\alpha^{i+1}\sum_{j=0}^i L_P^j\Bigg) \\ & \quad = 2L_r\varepsilon(1+\alpha)\Bigg(\sum_{i=0}^{n-1}\alpha^iL_P^i + \sum_{i=1}^{n-1}\alpha^{i}\sum_{j=0}^{i-1} L_P^j\Bigg) = 2L_r\varepsilon(1+\alpha)\Bigg(\sum_{i=0}^{n-1} \alpha^{i} \sum_{j=0}^{i}L_P^j\Bigg). \end{align*}

(ii) In the case $\mathbb{P}^{\mathrm{true}} = \widehat{\mathbb{P}}$ we have, for any $x_0\in\mathcal{X}$ and $a \in A$, that

(4.9) \begin{equation} d_{W_q}(\mathbb{P}^{\mathrm{wc}}(x_0,a),\mathbb{P}^{\mathrm{true}}(x_0,a))\leq \varepsilon, \end{equation}

since the ambiguity set $\mathcal{P}(x_0,a)$ is centered around $\mathbb{P}^{\mathrm{true}}(x_0,a)=\widehat{\mathbb{P}}(x_0,a)$ . Hence, replacing (4.6) by (4.9) and then following the proof of (i) shows the assertion.

Lemma 4.3. Let $0<\alpha<1$ and $L_P\geq 0$ satisfy $\alpha \cdot L_P <1$ . Then

\begin{equation*} \sum_{i=0}^\infty \alpha^i \sum_{j=0}^i (L_P)^j < \infty. \end{equation*}

Proof. Note that

(4.10) \begin{equation} 0 \leq \sum_{i=0}^\infty\alpha^i\sum_{j=0}^i(L_P)^j \leq \sum_{i=0}^\infty(i+1)\cdot\alpha^i\max\{1,L_P\}^i \,=\!:\, \sum_{i=0}^\infty a_i, \end{equation}

with $a_i = (i+1) \cdot \alpha^i \max\{1,L_P\}^i $ . Moreover,

\[ \frac{a_{i+1}}{a_i} = \frac{(i+2)\cdot\alpha^{i+1}\max\{1,L_P\}^{i+1}}{(i+1)\cdot\alpha^i\max\{1,L_P\}^i} = \frac{i+2}{i+1}\cdot\alpha\cdot\max\{1,L_P\} \rightarrow \alpha\cdot\max\{1,L_P\} < 1 \]

as $i \rightarrow \infty$. Hence, d'Alembert's criterion implies that $\sum_{i=0}^{\infty}a_i$ converges absolutely. Thus, by (4.10), we have $\sum_{i=0}^\infty \alpha^i \sum_{j=0}^i (L_P)^j <\infty$.
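Moreover, since all summands are nonnegative, interchanging the order of summation even yields the closed form

\begin{equation*} \sum_{i=0}^\infty \alpha^i \sum_{j=0}^i (L_P)^j = \sum_{j=0}^\infty (L_P)^j \sum_{i=j}^\infty \alpha^i = \frac{1}{1-\alpha}\sum_{j=0}^\infty (\alpha L_P)^j = \frac{1}{(1-\alpha)(1-\alpha L_P)}, \end{equation*}

which is not needed for the proofs, but is convenient for evaluating the bounds (3.1) and (3.2) numerically.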

Lemma 4.4. Let Assumptions 2.1–2.3 hold. Then $\mathcal{P}(x,a)\,:\!=\,\mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x,a))$ as defined in (2.1) satisfies [23, Standing Assumption 2.2], and the reward function $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ together with the discount factor $0<\alpha<1$ satisfy [23, Standing Assumption 2.4]. As a consequence, [23, Theorem 2.7] then directly implies that the dynamic programming principle holds for the robust Markov decision problem defined in (2.2).

Proof. First, if $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ is bounded, then Assumptions 2.1–2.3 allow us to use [23, Proposition 3.1], which immediately ensures that the result holds true with respect to $p=0$ and $C_P=1$ in the notation of [23, Standing Assumptions 2.2 and 2.4].

Now, assume for the rest of this proof that $r\colon\mathcal{X} \times A \times \mathcal{X} \rightarrow \mathbb{R}$ is unbounded. Then, by Assumption 2.1(ii), we have that $q\in[2,\infty)\cap\mathbb{N}$. In this case, let $p=1$ in the notation of [23, Standing Assumptions 2.2 and 2.4]. Then our Assumptions 2.1 and 2.3 immediately ensure that [23, Standing Assumption 2.4] holds. Moreover, by our Assumption 2.2, we directly obtain from [22, Proposition 4.1] that [23, Standing Assumption 2.2(i)] holds. Therefore, it remains to verify [23, Standing Assumption 2.2(ii)]. To that end, let

(4.11) \begin{equation} C_P \,:\!=\, \max\Bigg\{1 + \varepsilon + \sup_{a\in A}\inf_{x^{\prime}\in\mathcal{X}} \bigg\{\int_{\mathcal{X}}\|z\|\,\widehat{\mathbb{P}}(x^{\prime},a)(\mathrm{d}z) + L_{\widehat{\mathbb{P}}}\|x^{\prime}\|\bigg\}, L_{\widehat{\mathbb{P}}}\Bigg\} < \infty. \end{equation}

Indeed, note that $C_P<\infty$ , as Assumption 2.2 ensures that the map

\begin{equation*} \mathcal{X}\times A\ni(x^{\prime},a)\mapsto\int_{\mathcal{X}}\|z\|\,\widehat{\mathbb{P}}(x^{\prime},a)(\mathrm{d}z) + L_{\widehat{\mathbb{P}}}\|x^{\prime}\| \in [0,\infty) \end{equation*}

is continuous. This implies that the map

\begin{equation*} A\ni a\mapsto\inf_{x^{\prime}\in\mathcal{X}}\bigg\{\int_{\mathcal{X}}\|z\|\,\widehat{\mathbb{P}}(x^{\prime},a)(\mathrm{d}z) + L_{\widehat{\mathbb{P}}}\|x^{\prime}\|\bigg\} \in [0,\infty) \end{equation*}

is upper semicontinuous, which in turn ensures that $C_P$ is finite as A is compact. Now, let $(x,a)\in\mathcal{X} \times A$ and $\mathbb{P}\in\mathcal{P}(x,a)=\mathcal{B}^{(q)}_\varepsilon(\widehat{\mathbb{P}}(x,a))$ be arbitrarily chosen. Then, by following the calculations in [22, Proof of Proposition 4.1, (6.34)] (with $p=1$ in the notation of [22]) and using the Lipschitz continuity of $\widehat{\mathbb{P}}$, we obtain, for any arbitrary $x^{\prime}\in \mathcal{X}$, that

\begin{equation*} \int_{\mathcal{X}}(1+\|y\|)\,\mathbb{P}(\mathrm{d}y) \leq 1 + \varepsilon + \int_{\mathcal{X}}\|z\|\,\widehat{\mathbb{P}}(x^{\prime},a)(\mathrm{d}z) + L_{\widehat{\mathbb{P}}}(\|x^{\prime}\| + \|x\|). \end{equation*}

Since $x^{\prime}\in \mathcal{X}$ was arbitrarily chosen, we see from (4.11) that

\begin{equation*} \int_{\mathcal{X}}(1+\|y\|)\,\mathbb{P}(\mathrm{d}y) \leq 1 + \varepsilon + \sup_{a\in A}\inf_{x^{\prime}\in\mathcal{X}}\bigg\{ \int_{\mathcal{X}}\|z\|\,\widehat{\mathbb{P}}(x^{\prime},a)(\mathrm{d}z) + L_{\widehat{\mathbb{P}}}\|x^{\prime}\|\bigg\} + L_{\widehat{\mathbb{P}}}\|x\| \leq C_P(1+ \|x\|), \end{equation*}

which shows that [23, Standing Assumption 2.2(ii)] indeed holds.

4.2. Proof of Theorem 3.1

Proof. (i) First note that as, by assumption, $\mathbb{P}^{\mathrm{true}}(x,a)\in\mathcal{P}(x,a)$ for all $(x,a)\in\mathcal{X}\times A$, we have $0 \leq V^{\mathrm{true}}(x_0)-V(x_0)$ for all $x_0\in\mathcal{X}$. To compute the upper bound, we fix any $v\in C_b(\mathcal{X},\mathbb{R})$ which is $L_r$-Lipschitz and define the operator $\mathcal{T}^{\mathrm{true}}$ by (4.1). Then, by Lemma 4.4 and [23, Theorem 2.7(ii)], we have

(4.12) \begin{equation} V^{\mathrm{true}}(x_0) = \lim_{n\rightarrow\infty}(\mathcal{T}^{\mathrm{true}})^nv(x_0), \qquad V(x_0) = \lim_{n\rightarrow\infty}(\mathcal{T})^nv(x_0) \end{equation}

for all $x_0 \in \mathcal{X}$ and for $\mathcal{T}$ as defined in [23, (8)]. Moreover, by [23, Theorem 2.7(iii)], there exists a worst-case transition kernel $\mathcal{X} \times A \ni (x,a)\mapsto \mathbb{P}^{\mathrm{wc}}(x,a)$ with $\mathbb{P}^{\mathrm{wc}}(x,a) \in \mathcal{P}(x,a)$ for all $(x,a) \in \mathcal{X} \times A$ such that, denoting, for any ${\mathbf{a}} =(a_t)_{t\in \mathbb{N}_0}\in \mathcal{A}$,

\begin{align*} \mathbb{P}_{x_0,{\mathbf{a}}}^{\mathrm{wc}} \,:\!=\, \delta_{x_0}\otimes\mathbb{P}^{\mathrm{wc}}\otimes\mathbb{P}^{\mathrm{wc}} \otimes\mathbb{P}^{\mathrm{wc}}\otimes\mathbb{P}^{\mathrm{wc}}\cdots\in\mathcal{M}_1(\Omega), \end{align*}

we have

(4.13) \begin{equation} V(x_0) = \sup_{{\mathbf{a}}\in\mathcal{A}}\mathbb{E}_{\mathbb{P}_{x_0,{\mathbf{a}}}^{\mathrm{wc}}} \Bigg[\sum_{t=0}^\infty\alpha^t r(X_t,a_t(X_t),X_{t+1})\Bigg] = \lim_{n\rightarrow\infty}(\mathcal{T}^{\mathrm{wc}})^nv(x_0), \qquad x_0 \in \mathcal{X}, \end{equation}

where $\mathcal{T}^{\mathrm{wc}}$ is as defined in (4.3). Therefore, by (4.12), (4.13), Lemma 4.2, and Lemma 4.3, we have, for all $x_0 \in \mathcal{X}$ ,

(4.14) \begin{align} V^{\mathrm{true}}(x_0) - V(x_0) & = \lim_{n\rightarrow\infty}(\mathcal{T}^{\mathrm{true}})^nv(x_0) - \lim_{n\rightarrow\infty}(\mathcal{T}^{\mathrm{wc}})^nv(x_0) \nonumber \\ & \leq \lim_{n\rightarrow\infty}\big|(\mathcal{T}^{\mathrm{true}})^nv(x_0) - (\mathcal{T}^{\mathrm{wc}})^nv(x_0)\big| \nonumber \\ & \leq 2L_r\varepsilon(1+\alpha)\lim_{n\rightarrow\infty}\sum_{i=0}^{n-1}\alpha^i\sum_{j=0}^i L_P^j \nonumber \\ & = 2L_r\varepsilon(1+\alpha)\sum_{i=0}^\infty\alpha^i\sum_{j=0}^i L_P^j < \infty. \end{align}

(ii) In the case $\mathbb{P}^{\mathrm{true}} = \widehat{\mathbb{P}}$ , due to Lemma 4.2(ii), we may use (4.5) and replace the inequality (4.14) by

\begin{align*} V^{\mathrm{true}}(x_0) - V(x_0) \leq L_r\varepsilon(1+\alpha)\lim_{n\rightarrow\infty}\sum_{i=0}^{n-1}\alpha^i \sum_{j=0}^i L_P^j = L_r\varepsilon(1+\alpha)\sum_{i=0}^\infty\alpha^i\sum_{j=0}^i L_P^j < \infty. \end{align*}

Acknowledgements

We thank an anonymous referee of [21] who raised a question that led to this note.

Funding information

The first author gratefully acknowledges financial support by the MOE AcRF Tier 1 Grant RG74/21 and by the Nanyang Assistant Professorship Grant (NAP Grant) Machine Learning based Algorithms in Finance and Insurance. The second author gratefully acknowledges financial support by the NUS Start-Up Grant Tackling model uncertainty in Finance with machine learning.

Competing interests

The authors have no competing interests to declare that arose during the preparation or publication of this article.

Supplementary material

The supplementary material for this article (Python code for Example 3.1) can be found at https://github.com/juliansester/MDP_Bound.

References

[1] Angiuli, A., Fouque, J.-P. and Lauriere, M. (2021). Reinforcement learning for mean field games, with applications to economics. Preprint, arXiv:2106.13755.
[2] Bartl, D. and Wiesel, J. (2022). Sensitivity of multiperiod optimization problems in adapted Wasserstein distance. Preprint, arXiv:2208.05656.
[3] Bäuerle, N. and Glauner, A. (2021). Q-learning for distributionally robust Markov decision processes. In Modern Trends in Controlled Stochastic Processes, eds A. Piunovsky and Y. Zhang. Springer, New York, pp. 108–128.
[4] Bäuerle, N. and Glauner, A. (2022). Distributionally robust Markov decision processes and their connection to risk measures. Math. Operat. Res. 47, 1757–1780.
[5] Bäuerle, N. and Rieder, U. (2011). Markov Decision Processes with Applications to Finance. Springer, New York.
[6] Cao, J., Chen, J., Hull, J. and Poulos, Z. (2021). Deep hedging of derivatives using reinforcement learning. J. Financial Data Sci. 3, 10–27.
[7] Charpentier, A., Elie, R. and Remlinger, C. (2021). Reinforcement learning in economics and finance. Comput. Economics 62, 425–462.
[8] Chen, Z., Yu, P. and Haskell, W. B. (2019). Distributionally robust optimization for sequential decision-making. Optimization 68, 2397–2426.
[9] El Ghaoui, L. and Nilim, A. (2005). Robust solutions to Markov decision problems with uncertain transition matrices. Operat. Res. 53, 780–798.
[10] Feinberg, E. A. and Shwartz, A. (2012). Handbook of Markov Decision Processes: Methods and Applications (Int. Ser. Operat. Res. Manag. Sci. 40). Springer, New York.
[11] Kaelbling, L. P., Littman, M. L. and Moore, A. W. (1996). Reinforcement learning: A survey. J. Artificial Intell. Res. 4, 237–285.
[12] Kallus, N. and Uehara, M. (2020). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. J. Mach. Learn. Res. 21, 6742–6804.
[13] Kern, P. (2020). Sensitivity and statistical inference in Markov decision models and collective risk models. PhD dissertation, Saarland University.
[14] Kiszka, A. and Wozabal, D. (2022). A stability result for linear Markovian stochastic optimization problems. Math. Program. 191, 871–906.
[15] Levin, E., Pieraccini, R. and Eckert, W. (1998). Using Markov decision process for learning dialogue strategies. In Proc. 1998 IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vol. 1. IEEE, Piscataway, NJ, pp. 201–204.
[16] Li, M., Sutter, T. and Kuhn, D. (2023). Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets. Preprint, arXiv:2305.19004.
[17] Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z. and Zhou, Z. (2022). Distributionally robust Q-learning. Proc. Mach. Learn. Res. 162, 13623–13643.
[18] Mannor, S., Mebel, O. and Xu, H. (2016). Robust MDPs with k-rectangular uncertainty. Math. Operat. Res. 41, 1484–1509.
[19] Müller, A. (1997). How does the value function of a Markov decision process depend on the transition probabilities? Math. Operat. Res. 22, 872–885.
[20] Natarajan, M. and Kolobov, A. (2022). Planning with Markov Decision Processes: An AI Perspective. Springer, New York.
[21] Neufeld, A. and Sester, J. (2024). Robust Q-learning algorithm for Markov decision processes under Wasserstein uncertainty. Automatica 168, 111825.
[22] Neufeld, A. and Sester, J. (2024). Non-concave distributionally robust stochastic control in a discrete time finite horizon setting. Preprint, arXiv:2404.05230.
[23] Neufeld, A., Sester, J. and Šikić, M. (2023). Markov decision processes under model uncertainty. Math. Finance 33, 618–665.
[24] Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. Proc. Mach. Learn. Res. 151, 9582–9602.
[25] Puterman, M. L. (1990). Markov decision processes. In Handbooks in Operations Research and Management Science, Vol. 2. Elsevier, Amsterdam, pp. 331–434.
[26] Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley, Chichester.
[27] Si, N., Zhang, F., Zhou, Z. and Blanchet, J. (2020). Distributionally robust policy evaluation and learning in offline contextual bandits. Proc. Mach. Learn. Res. 119, 8884–8894.
[28] Si, N., Zhang, F., Zhou, Z. and Blanchet, J. (2023). Distributional robust batch contextual bandits. Manag. Sci. 69, 5772–5793.
[29] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[30] Uğurlu, K. (2018). Robust optimal control using conditional risk mappings in infinite horizon. J. Comput. Appl. Math. 344, 275–287.
[31] Villani, C. (2008). Optimal Transport: Old and New (Grundlehren der mathematischen Wissenschaften 338). Springer, Berlin.
[32] Wang, Y. and Zou, S. (2022). Policy gradient method for robust reinforcement learning. Preprint, arXiv:2205.07344.
[33] White, D. J. (1993). A survey of applications of Markov decision processes. J. Operat. Res. Soc. 44, 1073–1096.
[34] Wiesemann, W., Kuhn, D. and Rustem, B. (2013). Robust Markov decision processes. Math. Operat. Res. 38, 153–183.
[35] Wiesemann, W., Kuhn, D. and Sim, M. (2014). Distributionally robust convex optimization. Operat. Res. 62, 1358–1376.
[36] Xu, X. and Mannor, S. (2012). Distributionally robust Markov decision processes. Math. Operat. Res. 37, 288–300.
[37] Yang, W., Zhang, L. and Zhang, Z. (2022). Towards theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. Ann. Statist. 50, 3223–3248.
[38] Zähle, H. (2022). A concept of copula robustness and its applications in quantitative risk management. Finance Stoch. 26, 825–875.
[39] Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J. and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. Proc. Mach. Learn. Res. 130, 3331–3339.