
Extended Laplace principle for empirical measures of a Markov chain

Published online by Cambridge University Press:  22 July 2019

Stephan Eckstein*
Affiliation:
University of Konstanz
*
*Postal address: Department of Mathematics, University of Konstanz, 78464 Konstanz, Germany. Email address: stephan.eckstein@uni-konstanz.de

Abstract

We consider discrete-time Markov chains with Polish state space. The large deviations principle for empirical measures of a Markov chain can equivalently be stated in Laplace principle form, which builds on the convex dual pair of relative entropy (or Kullback–Leibler divergence) and cumulant generating functional f ↦ ln ∫ exp(f). Following the approach by Lacker (2016) in the independent and identically distributed case, we generalize the Laplace principle to a greater class of convex dual pairs. We present in depth one application arising from this extension, which includes large deviation results and a weak law of large numbers for certain robust Markov chains—similar to Markov set chains—where we model robustness via the first Wasserstein distance. The setting and proof of the extended Laplace principle are based on the weak convergence approach to large deviations by Dupuis and Ellis (2011).

Type
Original Article
Copyright
© Applied Probability Trust 2019 

1. Introduction

Throughout the paper (E, d) denotes a Polish space, $\mathcal{P}(E)$ denotes the space of Borel probability measures on E endowed with the topology of weak convergence, and $C_b(E)$ denotes the space of continuous and bounded functions mapping E into ℝ. Let a Markov chain with state space E be given by its initial distribution $\pi_0 \in \mathcal{P}(E)$ and Borel measurable transition kernel $\pi\colon E \rightarrow \mathcal{P}(E)$, and denote by $\pi_n \in\mathcal{P}(E^n)$ the joint distribution of the first n steps of the Markov chain. Define the empirical measure map $L_n \colon E^n \rightarrow \mathcal{P}(E)$ by

\[ L_n(x_1,\ldots,x_n) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}, \]

and recall the relative entropy $R\colon \mathcal{P}(E) \times \mathcal{P}(E) \rightarrow [0,\infty]$ given by

\[ R(\nu,\mu) = \int_E \log\Big(\frac{\text{d}\nu}{\text{d}\mu}\Big) \text{d}\nu \quad \text{if } \nu \ll \mu, \qquad R(\nu,\mu) = \infty \quad \text{otherwise}. \]
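To make these two objects concrete, the following minimal sketch computes the empirical measure and the relative entropy on a finite alphabet (purely illustrative; the results below concern general Polish E):

```python
import numpy as np

# Empirical measure L_n(x_1, ..., x_n) of a path on the finite alphabet {0, ..., m-1},
# represented as a probability vector of length m.
def empirical_measure(path, m):
    return np.bincount(path, minlength=m) / len(path)

# Relative entropy R(nu, mu) = int log(d(nu)/d(mu)) d(nu), with R = infinity
# unless nu is absolutely continuous with respect to mu.
def relative_entropy(nu, mu):
    if np.any((mu == 0) & (nu > 0)):
        return np.inf
    mask = nu > 0
    return float(np.sum(nu[mask] * np.log(nu[mask] / mu[mask])))

path = np.array([0, 1, 1, 2, 0, 1])            # a path of length n = 6
L_n = empirical_measure(path, m=3)             # -> [1/3, 1/2, 1/6]
print(L_n, relative_entropy(L_n, np.array([0.25, 0.5, 0.25])))
```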

The main goal of this paper is to generalize the large deviations result for empirical measures of a Markov chain in its Laplace principle form. Under suitable assumptions on the Markov chain, the usual Laplace principle for empirical measures of a Markov chain states that, for all $F \in C_b(\mathcal{P}(E))$,

(1.1)\begin{align} \label{intro1} \lim_{n\rightarrow \infty} \frac{1}{n} \ln \int_{E^n} \exp(n F \circ L_n) \text{d}\pi_n = \sup_{\nu \in \mathcal{P}(E)} (F(\nu) - I(\nu)). \end{align}

Here, $I \colon \mathcal{P}(E) \rightarrow [0,\infty]$ is the rate function, given in the setting of [20, Chapter 8] by

\[ I(\nu) = \inf_{q \colon \nu q = \nu} \int_E R(q(x),\pi(x)) \nu(\text{d} x), \]

where the infimum is over all stochastic kernels q on E that have ν as an invariant measure. In this paper a stochastic kernel q on E is a Borel measurable mapping $q\colon E \rightarrow \mathcal{P}(E)$, and $\nu q \in \mathcal{P}(E)$ is defined by $\nu q(A) := \int_E q(x, A)\nu(\text{d} x)$ for $\nu \in \mathcal{P}(E)$, where we write q(x, A) = q(x)(A) for x ∈ E and Borel sets A ⊆ E. The Laplace principle (1.1)—in the mentioned setting of [20]—is equivalent to the more commonly used form of the large deviations result for empirical measures of a Markov chain, which states that, for all Borel sets $A\subseteq \mathcal{P}(E)$,

\[ -\inf_{\nu \in \textit{\AA}} I(\nu) \leq \liminf_{n\rightarrow\infty} \frac{1}{n} \ln\pi_n(L_n \in \textit{\AA}) \leq \limsup_{n\rightarrow\infty} \frac{1}{n} \ln\pi_n(L_n \in \bar{A}) \leq -\inf_{\nu \in \bar{A}} I(\nu), \]

where Å denotes the interior and Ā the closure of A. Large deviation probabilities of Markov chains have been studied in a variety of settings and under different assumptions; see e.g. [13], [16], [17], [18], [30], and [36].
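As a quick consistency check of the rate function (not needed in the sequel): in the i.i.d. case π(x) = μ for all x ∈ E, convexity of R(·, μ) and Jensen's inequality give, for every stochastic kernel q with νq = ν,

\[ \int_E R(q(x),\mu) \nu(\text{d} x) \;\geq\; R(\nu q,\mu) \;=\; R(\nu,\mu), \]

with equality for the constant kernel q(x) = ν, so I(ν) = R(ν, μ) and (1.1) reduces to Sanov's theorem in Laplace form.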

The way we generalize the Laplace principle is by using the fact that both sides of the Laplace principle (1.1) can be stated solely in terms of relative entropy, its chain rule, and its convex dual pair. Equation (1.1) can therefore be formulated analogously for functionals resembling the relative entropy, in the sense that these functionals have to satisfy the same type of chain rule and duality. The kind of convex duality referred to is Fenchel–Moreau duality, which is often studied in the context of convex risk measures; see, for example, [1], [4], [12], and [33].

The original idea for extensions of Laplace principles of this form is due to Lacker [34], who pursued this in the context of independent and identically distributed (i.i.d.) sequences of random variables instead of Markov chains. The initial goal was to provide a setting to study more than just exponential tail behavior of random variables, as is given by large deviations theory. The extension of Sanov’s theorem he proved [34, Theorem 3.1] can be used to derive many interesting results, such as polynomial large deviation upper bounds, robust large deviation bounds, robust laws of large numbers, asymptotics of optimal transport problems, and more, while several possibilities remain unexplored.

In this paper the same type of extension for Markov chains is obtained. To this end, we work in a setting similar to that of [20, Chapter 8]. In particular, the results from [20, Chapter 8] are a special case of Theorem 1.1. To showcase the potential implications of Theorem 1.1, we focus on one broad application related to robust Markov chains, summarized in Theorems 1.2 and 1.3.

1.1. Main results

Let $\beta \colon\mathcal{P}(E) \times \mathcal{P}(E) \rightarrow (\!-\!\infty,\infty]$ be a Borel measurable function which is bounded from below and satisfies β(ν, ν) = 0 for all $\nu \in \mathcal{P}(E)$. One may think of β(·, ·) = R(·, ·). To state the chain rule, we introduce the following notation for the decomposition of an n-dimensional measure $\nu \in \mathcal{P}(E^n)$ into kernels $\nu_{i,i+1} \colon E^i \rightarrow \mathcal{P}(E)$ for i = 1, …, n − 1 and $\nu_{0,1} \in \mathcal{P}(E)$:

\[ \nu(\text{d} x_1,\ldots,\text{d} x_n) = \nu_{0,1}(\text{d} x_1) \prod_{i=1}^{n-1} \nu_{i,i+1}(x_1,\ldots,x_i,\text{d} x_{i+1}). \]

For $\theta \in \mathcal{P}(E)$, define $\beta^{\theta}_n \colon \mathcal{P}(E^n) \rightarrow (\!-\!\infty,\infty]$ by

\[ \beta^{\theta}_n(\nu) = \beta(\nu_{0,1},\theta) + \int_{E^n} \sum_{i=1}^{n-1} \beta(\nu_{i,i+1}(x_1,\ldots,x_{i}),\pi(x_i)) \nu(\text{d} x_1,\ldots,\text{d} x_n), \]

where in the case in which β(·, ·) = R(·, ·) we obtain $\beta^{\pi_0}_n(\nu) = R(\nu,\pi_n)$ for $\nu \in \mathcal{P}(E^n)$ by the chain rule for relative entropy. Note that $\beta_n^{\cdot}(\cdot)$ is well defined as the term inside the integral is Borel measurable, for example, by [6, Proposition 7.27]. Define $\rho^{\theta}_n$ as the convex dual of $\beta_n^{\theta}$ by

\[ \rho^{\theta}_n(\:f) = \sup_{\mu \in \mathcal{P}(E^n)} \bigg(\int_{E^n} f \text{d} \mu - \beta^{\theta}_n(\mu)\!\bigg) \]

for Borel measurable functions f : E^n → ℝ, where we adopt the convention ∞ − ∞ := −∞. For β(·, ·) = R(·, ·), we obtain $\rho_n^{\pi_0}(\:f) = \ln \int_{E^n} \exp(\:f) \text{d}\pi_n$ by the Donsker–Varadhan variational formula for the relative entropy. In the above definitions, θ is a placeholder for variable initial distributions, which is required as a tool in the proof. For the actual statement, only $\beta_n^{\pi_0}$ and $\rho_n := \rho^{\pi_0}_n$ are needed. We write $\rho := \rho_1$ and $\rho^{\theta} := \rho^{\theta}_1$.
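On a finite state space the Donsker–Varadhan identity can be checked numerically. A minimal sketch (illustrative only): the supremum defining the dual is attained at the Gibbs measure with density proportional to exp(f) with respect to π_0.

```python
import numpy as np

def relative_entropy(nu, mu):
    mask = nu > 0
    return float(np.sum(nu[mask] * np.log(nu[mask] / mu[mask])))

f = np.array([0.3, -1.0, 2.0])                 # a bounded function on a 3-point space
pi0 = np.array([0.2, 0.5, 0.3])                # reference measure

lhs = np.log(np.sum(np.exp(f) * pi0))          # ln int exp(f) d(pi0)

gibbs = np.exp(f) * pi0 / np.sum(np.exp(f) * pi0)   # optimizer of the dual problem
rhs = np.sum(f * gibbs) - relative_entropy(gibbs, pi0)

assert np.isclose(lhs, rhs)                    # rho(f) = sup_mu (int f dmu - R(mu, pi0))
```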

The assumptions for the main theorem are stated below. Assumption M is [20, Condition 8.4.1], and Assumption T is a direct generalization of [20, Condition 8.2.2].

Assumption M

The following conditions hold on the Markov chain.

  1. (M1) Define the k-step transition kernel $\pi^{(k)}$ of the Markov chain recursively by $\pi^{(k)}(x, A) := \int_E \pi(y, A)\pi^{(k-1)}(x, \text{d} y)$ for x ∈ E and Borel sets A ⊆ E, with $\pi^{(1)} := \pi$.

    Assume that there exist $l_0, n_0 \in \mathbb{N}$ such that, for all x, y ∈ E,

    \[ \sum_{i=l_0}^{\infty} \frac{1}{2^i} \pi^{(i)}(x) \ll \sum_{j=n_0}^{\infty} \frac{1}{2^j} \pi^{(j)}(y). \]
  2. (M2) π has an invariant measure, i.e. there exists $\mu^* \in \mathcal{P}(E)$ such that $\mu^* \pi = \mu^*$.

Assumption B

The following assumptions hold on β.

  1. (B1) The mapping $\mathcal{P}(E)\times\mathcal{P}(E^2)\ni(\theta,\mu) \mapsto \beta^{\theta}_2(\mu)$ is convex.

  2. (B2) The mapping $\mathcal{P}(E)\times\mathcal{P}(E^2)\ni(\theta,\mu) \mapsto \beta^{\theta}_2(\mu)$ is lower semi-continuous.

  3. (B3) If ν is not absolutely continuous with respect to μ then β(ν, μ) = ∞.

Assumption T

At least one of the following assumptions must hold in order to guarantee the tightness of certain families of random variables.

  1. (T1) There exists a Borel measurable function U : E → [0, ∞) such that the following conditions hold:

  • (a) $\inf_{x \in E} (U(x) - \rho^{\pi(x)}(U)) > -\infty$;

  • (b) $\{x \in E \colon U(x) - \rho^{\pi(x)}(U) \leq M\}$ is a relatively compact subset of E for all M ∈ ℝ;

  • (c) ρ(U) < ∞.

  2. (T1’) E is compact.

In the case in which β(·, ·) = R(·, ·), we usually impose another condition on π in the form of the Feller property, i.e. continuity of x ↦ π(x); see, e.g. [20, Condition 8.3.1]. Here, this is implicitly included in Condition (B2). Indeed, we can quickly check that, for (B2) to hold in the case in which β(·, ·) = R(·, ·), the following is sufficient: if $\theta_n \buildrel w \over \longrightarrow \theta \in \mathcal{P}(E)$ then $\theta_n \otimes \pi \buildrel w \over \longrightarrow \theta \otimes \pi \in \mathcal{P}(E^2)$ has to hold as well. The Feller property implies this; see [20, Lemma 8.3.2].

The following extension of the Laplace principle for empirical measures of a Markov chain is the main result.

Theorem 1.1

Define the rate function $I\colon \mathcal{P}(E) \rightarrow (\!-\!\infty,\infty]$ by

(1.2)\begin{align} \label{eq::RateFunction} I(\nu) := \inf_{q : \nu q = \nu} \int_E \beta(q(x),\pi(x)) \nu(\text{d} x) = \inf_{q : \nu q = \nu} \beta_2^{\nu}(\nu\otimes q). \end{align}

Under Conditions (B1) and (B2) and Assumption T, the upper bound

\[ \limsup_{n \rightarrow \infty} \frac{1}{n} \rho_n(n F\circ L_n) \leq \sup_{\nu \in \mathcal{P}(E)}( F(\nu) - I(\nu)) \]

holds for all upper semicontinuous and bounded functions $F \colon \mathcal{P}(E) \rightarrow \mathbb{R}$.

Under Conditions (M1), (M2), (B1), and (B3), the lower bound

\[ \liminf_{n \rightarrow \infty} \frac{1}{n} \rho_n(n F\circ L_n) \geq \sup_{\nu \in \mathcal{P}(E)}( F(\nu) - I(\nu)) \]

holds for all $F \in C_b(\mathcal{P}(E))$.

Intuition, applicability, and difficulties in dealing with the above result are very similar to the i.i.d. case and are described in detail in the introduction of [34]. The main differences for Markov chains are conditions (B1) and (B2). To verify these conditions, we would ideally like to have a better expression for $\beta_2^{\cdot}(\cdot)$ than is given by the definition, which is often not trivial. In the applications of this paper the choices of β are convenient in this regard. Some of the applications pursued in the i.i.d. case, e.g. [34, Chapters 4 and 6], appear more difficult to obtain for Markov chains. A thorough analysis of the range of applications of Theorem 1.1 remains incomplete for now, as the goal of this work is to give a detailed account of one application of Theorem 1.1 to robust Markov chains.

The following corollary complements Theorem 1.1.

Corollary 1.1

  1. (a) If $ (\theta, \nu) \mapsto \beta_2^{\theta}(\nu) $ is lower semicontinuous then I is lower semicontinuous. If $ (\theta, \nu) \mapsto \beta_2^{\theta}(\nu) $ is convex then I is convex.

  2. (b) If the upper bound of Theorem 1.1 holds and, additionally, I has compact sub-level sets, then the upper bound extends to all functions $F\colon \mathcal{P}(E) \rightarrow [ \!-\!\infty,\infty)$ which are upper semicontinuous and bounded from above.

1.2. Applications to robust Markov chains

In this paper robustness broadly refers to uncertainty about the correct model specification of the Markov chain. This type of uncertainty is often studied in terms of nonlinear expectations (see, e.g. [10], [35], [38], and [39]) and distributional robustness (see, e.g. [8], [23], [25], and [27]). Here, the main point is to take uncertainty with respect to the transition kernel π into consideration. Conceptually, a robust transition kernel is the following. If the Markov chain is at point x ∈ E, the next step of the Markov chain is not necessarily determined by a fixed measure π(x), but rather can be determined by any measure $\hat{\pi} \in P(x) \subseteq \mathcal{P}(E)$. In our context, P(x) will be defined as a neighborhood of π(x) with respect to the first Wasserstein distance.

The existing literature on robust Markov chains focuses on finite state spaces, where transition probabilities are uncertain in some convex and closed sets, usually expressed via matrix intervals. For example, Škulj [42] gave a good overview of the field. Robust Markov chains are studied under the names of Markov set chains (see, e.g. [28], [29], and [32]), imprecise Markov chains (see, e.g. [14]), as well as Markov chains with interval probabilities (see, e.g. [41] and [42]). Recently, continuous-time versions have also received attention (see, e.g. [22] and [43]). While different types of laws of large numbers are studied frequently, large deviations theory seems to be absent in the current literature on robust Markov chains. Robust Markov chains find applications in several areas in machine learning and operations research, as well as in reinforcement learning [37], [46], [47], and [48], network control [22], [40], and server assignment [31].

In the following, the asymptotic behavior of such Markov chains is analyzed. The type of asymptotics studied is the worst-case behavior over all possible distributions, in the sense of large deviation probabilities (Theorem 1.2) and a law of large numbers (Theorem 1.3) for empirical measures of robust Markov chains. Worst-case behavior for large deviations means that the slowest possible rate of convergence to 0 of a tail event is identified. For laws of large numbers, we give upper bounds—or, by changing signs, lower bounds—for law of large numbers type limits.

Define the first Wasserstein distance $d_W$ on $\mathcal{P}(E)$ by

\[ d_W(\mu,\nu) = \inf_{\tau \in \Pi(\mu,\nu)} \int_{E^2} d(x,y)\tau(\text{d} x,\text{d} y) \]

for $\mu,\nu \in \mathcal{P}(E)$, where $\Pi(\mu,\nu) \subseteq \mathcal{P}(E^2)$ denotes the set of measures with first marginal μ and second marginal ν. See, for example, [26] for an overview regarding the Wasserstein distance. In order to avoid complications with respect to compatibility of the weak convergence and Wasserstein distance, we assume that E is compact for the applications.
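For distributions supported on a subset of the real line, SciPy computes the first Wasserstein distance directly; a brief illustrative sketch (assuming scipy is available):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# d_W between two measures on E = {0, 1, 2} with metric d(x, y) = |x - y|.
support = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
print(wasserstein_distance(support, support, u_weights=mu, v_weights=nu))  # 0.6
```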

Fix r ≥ 0. The set of possible joint distributions of the robust Markov chain up to step n is characterized by $M_n(\pi_0) \subseteq \mathcal{P}(E^n)$ defined by

\begin{multline*} M_n(\pi_0) := \{\nu \in \mathcal{P}(E^n) \colon d_W(\nu_{0,1},\pi_0)\leq r \text{ and } d_W(\nu_{i,i+1}(x_1,\ldots,x_i),\pi(x_i)) \leq r\ \nu\text{-a.s.} \\ \text{ for } i=1,\ldots ,n-1\}. \end{multline*}

Note that elements of $M_n(\pi_0)$ do not have to be Markov chains. In this context the Markov property only applies to the evolution of the set of measures, but each individual measure can depend on the entire path.
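On a finite state space, the defining constraints of $M_n(\pi_0)$ can be checked row by row for a time-homogeneous candidate kernel; a small sketch along these lines (the kernels are hypothetical, and the check covers only path-independent kernels):

```python
import numpy as np
from scipy.stats import wasserstein_distance

support = np.array([0.0, 1.0, 2.0])
r = 0.1

def in_ball(candidate, center):
    return wasserstein_distance(support, support,
                                u_weights=candidate, v_weights=center) <= r

pi = {0: np.array([0.6, 0.3, 0.1]),            # reference kernel pi(x)
      1: np.array([0.2, 0.6, 0.2]),
      2: np.array([0.1, 0.3, 0.6])}

# Perturbed kernel: each row must stay in the d_W-ball of radius r around pi(x).
shift = np.array([-0.04, 0.0, 0.04])
pi_hat = {x: p + shift for x, p in pi.items()}
print(all(in_ball(pi_hat[x], pi[x]) for x in pi))   # True: d_W = 0.08 <= r per row
```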

For technical reasons related to Condition (B3), we also consider the following modification:

\[ \underline{M}_{n}(\pi_0) := \{ \nu \in M_n(\pi_0) \colon \nu \ll \pi_0\otimes \pi \otimes\cdots\otimes \pi \}. \]

Both definitions above can of course be stated for arbitrary $\theta \in \mathcal{P}(E)$ instead of $\pi_0$. We show that

\[ \beta(\nu,\mu) := \inf_{\hat{\mu} \in M_1(\mu)} R(\nu,\hat{\mu}) \]

satisfies the assumptions for the upper bound of Theorem 1.1, and that

\[ \underline{\beta}(\nu,\mu) := \inf_{\hat{\mu} \in \underline{M}_1(\mu)} R(\nu,\hat{\mu}) \]

satisfies the assumptions for the lower bound of Theorem 1.1. In Lemma 3.1 and Lemma 3.5 we will characterize $\beta_n^\theta$ and $\underline{\beta}_n^\theta$ in terms of $M_n(\theta)$ and $\underline{M}_n(\theta)$.

The rate functions I and $\underline{I}$ for β and $\underline{\beta}$, respectively, are given by (1.2) and can be expressed as

\begin{align} I(\nu) = \inf_{q\colon \nu q = \nu} \inf_{\mu \in M_2(\nu)} R(\nu \otimes q, \mu), \qquad \underline{I}(\nu) = \inf_{q\colon \nu q = \nu} \inf_{\mu \in \underline{M}_2(\nu)} R(\nu \otimes q, \mu). \end{align}

In the r = 0 case, these rate functions simplify to those used in [20, Chapter 8], since, for r = 0, it holds that $M_2(\nu) = \underline{M}_2(\nu) = \{\nu \otimes \pi\}$.

Theorem 1.1 yields the following result.

Theorem 1.2

Assume that (E, d) is compact. Let β, $\underline{\beta}$ and $M_n(\theta)$, $\underline{M}_n(\theta)$ for $\theta \in \mathcal{P}(E)$ be given as above. Let I and $\underline{I}$ denote the rate functions for β and $\underline{\beta}$, respectively, as given by (1.2).

  1. (a) If π satisfies the Feller property, it holds that, for Borel sets $A \subseteq \mathcal{P}(E)$,

    \[ \limsup_{n\rightarrow \infty} \sup_{\mu \in M_n(\pi_0)} \frac{1}{n} \ln \mu(L_n \in \bar{A}) \leq -\inf_{\nu \in \bar{A}} I(\nu). \]
  2. (b) If π satisfies (M), it holds that, for Borel sets $A \subseteq \mathcal{P}(E)$,

    \[ \liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \frac{1}{n} \ln \mu(L_n \in \textit{\AA}) \geq -\inf_{\nu \in \textit{\AA}} \underline{I}(\nu). \]

For a (numerical) illustration of the above result, see Example 3.1. Among other things, the example showcases that one can identify conditions such that there is no difference between the upper and lower bounds, and, thus, the above identifies precise asymptotic rates. Note that in finite state spaces one can guarantee $M_n(\theta) = \underline{M}_n(\theta)$ by assuming that π(x)(y) > 0 for all x, y ∈ E.

The following is the law of large numbers result for robust Markov chains, which is based on the choices

\begin{align} \beta(\mu,\nu) := \begin{cases} 0 & \text{if } d_W(\mu,\nu)\leq r, \\ \infty & \text{otherwise,} \end{cases} \qquad \underline{\beta}(\mu,\nu) := \begin{cases} 0 & \text{if } d_W(\mu,\nu)\leq r \text{ and } \mu \ll \nu, \\ \infty & \text{otherwise,} \end{cases} \end{align}

again for fixed r ≥ 0.

Theorem 1.3

Assume that (E, d) is compact. Let $M_n(\theta)$, $\underline{M}_n(\theta)$ for $\theta \in \mathcal{P}(E)$ be given as above.

  1. (a) If π satisfies the Feller property, it holds that, for all $F\colon \mathcal{P}(E) \rightarrow [\!-\!\infty,\infty)$, which are upper semicontinuous and bounded from above,

    \[ \limsup_{n\rightarrow\infty} \sup_{\mu \in M_n(\pi_0)} \int_{E^n} F\circ L_n \text{d}\mu \leq \sup_{\substack{\nu \in \mathcal{P}(E) \colon \text{there exists }q, \nu q = \nu\colon \nu \otimes q \in M_2(\nu)}} F(\nu). \]
  2. (b) If π satisfies (M), it holds that, for all $F \in C_b(\mathcal{P}(E))$,

    \[ \liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \int_{E^n} F \circ L_n \text{d}\mu \geq \sup_{\substack{\nu \in \mathcal{P}(E) \colon \text{there exists } q, \nu q = \nu\colon \nu \otimes q \in \underline{M}_2(\nu)}} F(\nu). \]

This result is most easily interpreted by looking at the r = 0 case. If both the upper and lower bounds hold, the above states that

$$\pi_n \circ L_n^{-1} \buildrel w \over \longrightarrow \delta_{\mu^*} \in \mathcal{P}(\mathcal{P}(E)),$$

where $\mu^*$ is the unique invariant measure under the Markov chain transition kernel π, which—under Assumption M—always exists (see [20, Lemma 8.6.2(a)]).

Specifically, the choices $F(\nu) := \int_E f \,\text{d}\nu$ for $f \in C_b(E)$ in the above theorem can be interpreted as a robust Cesàro limit of a Markov chain. Indeed, for r = 0, this yields

$$\frac{1}{n}\sum_{i=1}^{n} \pi_0 \pi^{(i-1)} \buildrel w \over \longrightarrow \mu^* .$$

For r > 0, however, we obtain a result which strongly resembles, e.g. [28, Theorem 4.1], but in a more general state space.
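A minimal numerical sketch of the r = 0 Cesàro limit above, for a hypothetical three-state chain (illustrative only):

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],     # transition kernel pi as a stochastic matrix
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
pi0 = np.array([1.0, 0.0, 0.0])    # initial distribution pi_0

# Cesaro average (1/n) sum_{i=1}^n pi0 P^{i-1}
n = 2000
avg, dist = np.zeros(3), pi0.copy()
for _ in range(n):
    avg += dist / n
    dist = dist @ P

# invariant measure mu* as the normalized left Perron eigenvector of P
w, v = np.linalg.eig(P.T)
mu_star = np.real(v[:, np.argmax(np.real(w))])
mu_star /= mu_star.sum()

print(np.max(np.abs(avg - mu_star)))   # small: the Cesaro averages approach mu*
```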

1.3. Generalizations and relation to the literature

In this paper robustness is modeled via the first Wasserstein distance because it is both tractable and frequently used. Nevertheless, the question arises whether the presented approach can be applied more generally, specifically related to the existing literature in finite state spaces. In this section we roughly outline potential extensions.

In the existing literature regarding robust Markov chains in finite state spaces—where we mainly refer to [28] and [42] as references—the starting point is a robust transition kernel $P \colon E \rightarrow 2^{\mathcal{P}(E)}$, satisfying certain convexity and closedness conditions. For our approach, however, we start with both a transition kernel $\pi \colon E \rightarrow \mathcal{P}(E)$ and a mapping $U\colon \mathcal{P}(E) \rightarrow 2^{\mathcal{P}(E)}$, with the relation of the approaches being P = U ∘ π.

In Section 1.2 we used $U(\mu) = \{\hat{\mu} \in \mathcal{P}(E) \colon d_W(\mu,\hat{\mu})\leq r\}$. The setting of Section 1.2 translates to $\beta(\nu,\mu) = \inf_{\hat{\mu}\in U(\mu)} R(\nu,\hat{\mu})$ for large deviation results (Theorem 1.2) and $\beta(\nu,\mu) = \infty \cdot {\bf 1}_{U(\mu)^C}(\nu)$ for law of large number results (Theorem 1.3). Furthermore, $M_n(\theta) = \{\mu \in \mathcal{P}(E^n) \colon \mu_{0,1} \in U(\theta), \mu_{i,i+1}(x_1,\ldots,x_i) \in U(\pi(x_i))$ μ-a.s. for i = 1, …, n − 1} for $\theta \in \mathcal{P}(E)$. In general, the following conditions on U would allow for a similar type of proof for analogs of Theorems 1.2 and 1.3, where the assumptions on E (compactness) and π (Feller property and/or Assumption M) stay the same.

  1. (a) μ ∈ U(μ) for all $\mu \in \mathcal{P}(E)$.

  2. (b) The graph of U, i.e. $\{(\mu,\hat{\mu}) \in \mathcal{P}(E)^2 : \hat{\mu} \in U(\mu)\}$, is closed and convex.

Here, (a) implies that β(μ, μ) = 0 for all $\mu \in \mathcal{P}(E)$. That the graph of U is convex implies Condition (B1); see Lemma 3.2 and the subsequent paragraph, as well as Lemma 3.7. Closedness of the graph is used to verify Condition (B2); see Lemmas 3.3, 3.4, and 3.7. For the large deviations result, closedness of the graph also guarantees a representation of $\beta_n^{\theta}$ in terms of M n(θ); see Lemmas 3.1 and 3.5.

The assumption that E has to be compact can likely be loosened by assuming that U is compact valued instead, even though an analog of Lemma 3.3 is then more difficult to obtain.

1.4. Further ideas and outlook

There are several possibilities for further directions in which the main result, Theorem 1.1, can be used. We refer again to the paper by Lacker [34] in which a range of applications is discussed in detail for the i.i.d. setting. For Markov chains, further natural choices for β that may lead to interesting applications are, for example, the following.

  • φ-divergences, i.e. for a convex function $\varphi \colon \mathbb{R}_+ \rightarrow \mathbb{R}$,

    \[ \beta(\nu,\mu) := \int_E \varphi\Big(\frac{\text{d}\nu}{\text{d}\mu}\Big) \text{d}\mu, \]

    which is understood to be ∞ if ν is not absolutely continuous with respect to μ.

  • Transport costs, i.e. for a lower semicontinuous $c \colon E^2 \rightarrow [0, \infty]$,

    \[ \beta(\nu,\mu) := \inf_{\tau \in \Pi(\nu,\mu)} \int_{E^2} c \,\text{d}\tau. \]
  • The $L^p$ norm of the Radon–Nikodym derivative

    \[ \beta(\nu,\mu) := \Big\|\frac{\text{d}\nu}{\text{d}\mu}\Big\|_{L^p(\mu)}. \]
  • The sum of different choices for β, i.e. for $\beta^{(i)}$ (i = 1, …, K), define

    \[ \beta(\nu,\mu) := \sum_{i=1}^K \beta^{(i)}(\nu,\mu). \]

The first choice of φ-divergences is certainly interesting, and the key conditions (B1) and (B2) appear obtainable. Yet, the manifestations of this choice are difficult to interpret, since the resulting functionals $\beta_n$ and $\rho_n$ are very complex; see also [34, Chapter 1.5]. The second choice of transport costs is analyzed in detail in [34, Chapter 6] in the i.i.d. case. For Markov chains, conditions (B1) and (B2) are satisfied in compact spaces. This is shown in detail in the supplementary material to this paper [21], as it nicely illustrates standard methods used to apply Theorem 1.1. Condition (B3) for the lower bound is, in general, not satisfied (consider E = ℝ and the Euclidean distance c), but can be satisfied for specific c (e.g. c(x, y) = 0 if x = y and c(x, y) = ∞ otherwise). The resulting limit theorem is loosely related to all kinds of optimization problems, including optimal transport balls; see, e.g. [2], [5], [7], [8], [23], and [25]. The third idea leads to polynomial large deviation bounds in the i.i.d. case (see [34, Chapter 4]). For Markov chains, while β is a very natural choice, β is in general not jointly convex, and, hence, Condition (B1) is not satisfied. Nevertheless, one should still keep this choice in mind, as there might be possible adaptations to make it applicable. The fourth point arises from the observation that all major assumptions on β, (B1), (B2), and (B3), are closed under finite summation, and, furthermore, the resulting dual $\rho_n$ is tractable via the inf-convolution of the individual duals; see, e.g. [11, Chapter 1.6] for background.
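As a concrete instance of the first choice, a small sketch of the χ²-divergence (φ(t) = (t − 1)²) on a finite space (illustrative only):

```python
import numpy as np

# phi-divergence beta(nu, mu) = int phi(d(nu)/d(mu)) d(mu) with phi(t) = (t - 1)^2,
# set to infinity when nu is not absolutely continuous with respect to mu.
def phi_divergence(nu, mu, phi=lambda t: (t - 1.0) ** 2):
    if np.any((mu == 0) & (nu > 0)):
        return np.inf
    mask = mu > 0
    return float(np.sum(phi(nu[mask] / mu[mask]) * mu[mask]))

mu = np.array([0.25, 0.5, 0.25])
print(phi_divergence(mu, mu))                        # 0.0: beta(nu, nu) = 0 since phi(1) = 0
print(phi_divergence(np.array([0.4, 0.4, 0.2]), mu)) # 0.12
```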

Some choices of β have direct implications for applications, e.g. the large deviation bounds for robust Markov chains from Theorem 1.2 can be used to analyze and fine-tune systems using robust Markov chains; see, e.g. [22], [31], [40], [46], [47], and [48]. On the other hand, there may be some hidden applications that are more nuanced. One example is to use the results from Section 1.2 to analyze certain complex stochastic processes, which are themselves not Markovian (or even precisely determined at all), but can be shown to be close to a certain Markov chain (in the sense of Section 1.2). Processes like this may arise naturally, for example, in numerical implementations of certain Markovian algorithms. The stochastic process corresponding to the numerical implementation, including numerical inaccuracies and other potential sources of error, may be difficult to model precisely. However, it is natural to think of this process as close to the process corresponding to the theoretical algorithm, and, hence, the asymptotic behavior may be analyzed using the results from Section 1.2. The reason this requires the results from this paper as opposed to the results from the existing literature on robust Markov chains is that usually finite state spaces are too restrictive, while the assumption of compact state spaces (as in this paper) is often reasonable.

1.5. Structure of the paper

In Section 2 we prove Theorem 1.1 and Corollary 1.1. The method of proof follows [20, Chapters 8 and 9], while also using tools from convex duality and measurable selection. In Section 2.1 we provide results relating to the lower bound and its proof, and in Section 2.2 we provide results relating to the upper bound and its proof. In Section 2.3 we provide the proof of Corollary 1.1.

In Section 3 we present in depth the applications to robust Markov chains. Aside from using Theorem 1.1 and Corollary 1.1, Section 3 is self-contained, so readers who prefer to read Section 3 before Section 2 can easily do so. A large part of Section 3 is devoted to verifying conditions (B1) and (B2) for the different choices of β. Furthermore, the large deviation results obtained are illustrated in Example 3.1.

Many of the smaller results not listed in the introduction are interesting in their own right, e.g. Lemmas 2.1, 2.2, and 3.3.

In the supplementary material [21] to this paper, applicability of the main theorem for the choice of β as a transport cost is shown.

2. Proofs of Theorem 1.1 and Corollary 1.1

2.1. The lower bound of Theorem 1.1

At some points in this section it is necessary to evaluate $\rho^{\theta}_n$ at universally measurable functions, which is still well defined. More precisely, upper semi-analytic functions are the object of interest, for reasons made obvious in Lemma 2.1. In particular, upper semi-analytic functions are universally measurable; see, e.g. [6, Chapter 7] for background.

2.1.1. Preliminary results

Lemma 2.1

(See also [34, Proposition A.1].) For $\theta \in \mathcal{P}(E)$, f : E^n → ℝ upper semi-analytic, and 0 < k < n, it holds that

\[ \rho_n^{\theta}(\:f) = \rho_k^{\theta}(g), \]

where g : E^k → ℝ is defined by

\[ g(x_1,\ldots,x_k) = \rho_{n-k}^{\pi(x_k)}(\:f(x_1,\ldots,x_k,\cdot)). \]

Furthermore, g is upper semi-analytic.

Proof. First, let $\nu \in \mathcal{P}(E^k)$ and let $K\colon E^k \rightarrow \mathcal{P}(E^{n-k})$ be a stochastic kernel. For notational purposes, we write $\skew1\bar{x} = (x_1,\ldots,x_k)$ for $x_1,\ldots,x_k \in E$ and

\[ K(x_1,\ldots,x_k) = K(\skew1\bar{x}) = K^{\skew1\bar{x}}. \]

Denote the decomposition of K in the usual way, i.e.

\[ K^{\skew1\bar{x}} = K^{\skew1\bar{x}}_{0,1} \otimes K^{\skew1\bar{x}}_{1,2} \otimes \cdots \otimes K^{\skew1\bar{x}}_{n-k-1,n-k}. \]

For the decompositions of ν and νK, the trivial νK-almost-sure equalities hold:

\begin{gather*} \nu_{i,i+1}(x_1,\ldots,x_i) = (\nu \otimes K)_{i,i+1}(x_1,\ldots,x_i) \quad\text{ for } i = 0,\ldots,k-1, \\ K^{\skew1\bar{x}}_{i,i+1}(x_{k+1},\ldots,x_{k+i}) = (\nu \otimes K)_{k+i,k+i+1}(x_1,\ldots,x_{k+i}) \quad\text{ for } i = 0,\ldots,n-k-1. \end{gather*}

Hence,

\begin{align} &\beta_k^{\theta}(\nu) + \int_{E^k} \beta_{n-k}^{\pi(x_k)}(K^{\skew1\bar{x}}) \nu(\text{d} x_1,\ldots,\text{d} x_k) \\ &\quad \quad= \int_{E^n} \beta(\nu_{0,1},\theta) + \bigg(\sum_{i=1}^{k-1} \beta(\nu_{i,i+1}(x_1,\ldots,x_i),\pi(x_i))\bigg) + \beta(K^{\skew1\bar{x}}_{0,1},\pi(x_k)) \\ &\quad \quad + \bigg(\sum_{i=1}^{n-k-1} \beta(K^{\skew1\bar{x}}_{i,i+1}(x_{k+1},\ldots,x_{k+i}), \pi(x_{k+i}))\bigg) \\ &\quad \quad \times K^{\skew1\bar{x}}(dx_{k+1},\ldots,dx_n) \nu(\text{d} x_1,\ldots,\text{d} x_k) \\ &\quad \quad= \beta_n^{\theta}(\nu\otimes K)\!. \end{align}

Using the above and a standard measurable selection argument [6, Proposition 7.50], we obtain

\begin{align} \rho_k^{\theta}(g) &= \sup_{\nu\in \mathcal{P}(E^k)} \bigg( \int_{E^k} g \text{d}\nu - \beta_k^{\theta}(\nu)\bigg) \\ &= \sup_{\nu \in \mathcal{P}(E^k)} \biggl( \int_{E^k} \sup_{\mu \in \mathcal{P}(E^{n-k})} \bigg( \int_{E^{n-k}} f(x_1,\ldots,x_n) \mu(\text{d} x_{k+1},\ldots,\text{d} x_n) - \beta_{n-k}^{\pi(x_k)}(\mu)\bigg) \\ &\quad \quad\times\nu(\text{d} x_1,\ldots,\text{d} x_k) - \beta_k^{\theta}(\nu)\bigg) \\ &= \sup_{\nu \in \mathcal{P}(E^k)}\sup_{\substack{K\colon E^k \rightarrow \mathcal{P}(E^{n-k}),\\K\text{ Borel}}} \bigg( \int_{E^n} f \text{d}\nu \otimes K - \beta_n^{\theta}(\nu \otimes K)\bigg) \\ &= \rho_n^{\theta}(\:f). \end{align}

That g is upper semi-analytic can be shown as follows. Both the mappings

\begin{align} (x_k,\nu) \mapsto -\beta_{n-k}^{\pi(x_k)}(\nu),\quad \quad (x_1,\ldots,x_k,\nu) \mapsto \int_{E^{n-k}} f(x_1,\ldots,x_k,\cdot) \text{d}\nu \end{align}

are upper semi-analytic by [6, Proposition 7.48], where, for the first mapping, we implicitly have to use [6, Proposition 7.27] as mentioned after the definition of $\beta_n^{\cdot}(\cdot)$. The sum of these mappings is therefore still upper semi-analytic (see, e.g. [6, Lemma 7.30(4)]) and, hence, by [6, Proposition 7.47], g is upper semi-analytic.

Lemma 2.2

Under Condition (B3), for all $\theta \in \mathcal{P}(E)$ and f : E^n → ℝ upper semi-analytic, it holds that

\[ \rho_n^{\theta}(\:f) \geq \int_E \rho_n^{\delta_x}(\:f) \theta(\text{d} x). \]

Proof. By Condition (B3), it holds that, for all x ∈ E,

\begin{align} \rho_n^{\delta_x}(\:f) &= \sup_{\nu \in \mathcal{P}(E^n)} \Bigg( \int_{E^n} f \text{d}\nu - \beta_n^{\delta_x}(\nu)\Bigg) \\ &= \sup_{\substack{\nu \in \mathcal{P}(E^n) \colon \nu_{0,1} = \delta_x}} \Bigg( \int_{E^n} f \text{d}\nu - \beta_n^{\delta_x}(\nu)\Bigg) \\ &= \sup_{\nu \in \mathcal{P}(E^{n-1})} \Bigg( \int_{E^n} f \text{d}(\delta_x \otimes \nu) - \beta_n^{\delta_x}(\delta_x \otimes \nu)\Bigg). \end{align}

Hence, we obtain, for $\theta \in \mathcal{P}(E)$,

\begin{align} &\int_E \rho_n^{\delta_{x_1}}(\:f) \theta(\text{d}x_1) \\ &\quad= \int_E \sup_{\nu \in \mathcal{P}(E^{n-1})} \Bigg( \int_{E^n} f \text{d}(\delta_{x_1} \otimes \nu) - \beta_n^{\delta_{x_1}}(\delta_{x_1} \otimes \nu)\Bigg) \theta(\text{d} x_1) \\ &\quad= \int_E \sup_{\nu \in \mathcal{P}(E^{n-1})} \biggl( \int_{E^{n-1}} f(x_1,\cdot) \text{d}\nu \\ &\quad \quad- \int_{E^{n-1}} \sum_{k=2}^n \beta(\nu_{k-2,k-1}(x_2,\ldots,x_{k-1}),\pi(x_{k-1})) \nu(\text{d} x_2,\ldots,\text{d} x_n)\biggr) \theta(\text{d} x_1) \\ &\quad\stackrel{(*)}{=} \sup_{\substack{K \colon E \rightarrow \mathcal{P}(E^{n-1}),\\ K \text{ Borel}}}\Bigg( \int_{E^n} f \text{d}\theta \otimes K - \beta_n^{\theta}(\theta \otimes K)\Bigg) \\ &\quad\leq \sup_{\nu \in \mathcal{P}(E^n)} \Bigg(\int_{E^n} f \text{d}\nu - \beta_n^{\theta}(\nu)\Bigg) \\ &\quad= \rho_n^{\theta}(\:f). \end{align}

Here, (*) follows by a standard measurable selection argument; see, for example, [6, Proposition 7.50].

Lemma 2.3

Let $(X_i)_{i\in\mathbb{N}}$ be an E-valued sequence of random variables such that $\lim_{n\rightarrow \infty}({1}/{n}) \sum_{i= 1}^{n} F(X_i) = \mathbb{E}[F(X_1)]$ holds almost surely for all $F \in C_b(E)$. Let $\nu^{(n)} = \mathbb{P} \circ (X_1,\ldots,X_n)^{-1}$ be the distribution of $(X_1,\ldots,X_n)$ for n ∈ ℕ. Then $\nu^{(n)} \circ L_n^{-1} \buildrel w \over \longrightarrow \delta_{\nu^{(1)}}$.

Proof. By a standard separability argument (cf. [44, Proof of Theorem 3.1]), it follows that $L_n(X_1,\ldots,X_n) \buildrel w \over \longrightarrow \nu^{(1)}$ holds ℙ-a.s. Hence, by dominated convergence,

$$\nu^{(n)} \circ L_n^{-1} \buildrel w \over \longrightarrow \delta_{\nu^{(1)}}.$$

For the following results, note that, under Assumption M, π has a unique invariant measure, which we denote by $\mu^*$; see [20, Lemma 8.6.2(a)]. Furthermore, recall that, for k ∈ ℕ, we denote by $\pi^{(k)}$ the k-step transition kernel of the Markov chain, as defined in Condition (M1).

Lemma 2.4

([20, Lemma 8.6.2(b)].) Let Assumption M be satisfied. Let A ⊆ E be a Borel set such that $\pi^{(l_0)}(x_0, A) > 0$ for some $x_0 \in E$. Then $\mu^*(A) > 0$, where $\mu^*$ is the unique invariant measure under π.

Lemma 2.5

(Adapted version of [20, Lemma 8.6.2(c)].) Let Assumption M and Condition (B3) be satisfied. Let $\nu \in \mathcal{P}(E)$ satisfy $\beta_2^{\nu}(\nu\otimes p) < \infty$ for some stochastic kernel p on E such that νp = ν. Then it holds that $\nu \ll \mu^*$, where $\mu^*$ is the unique invariant measure under π.

Proof. Let $\Omega_0 \subseteq E$ be a Borel set such that $\nu(\Omega_0) = 1$ and p(x) ≪ π(x) for all $x \in \Omega_0$, which we can choose by (B3) and since $\beta_2^{\nu}(\nu \otimes p) < \infty$. Define $\tilde{p}(x) := {\bf 1}_{\Omega_0}(x) p(x) + {\bf 1}_{\Omega_0^C}(x) \pi(x)$. Since $\tilde{p}(x) \ll \pi(x)$ for all x ∈ E, we have $\tilde{p}^{(l_0)}(x) \ll \pi^{(l_0)}(x)$ for all x ∈ E, where $l_0$ is the constant from Condition (M1).

Now choose a Borel set A ⊆ E such that ν(A) > 0. By iterating $\nu\tilde{p} = \nu$, we obtain a Borel set B ⊆ E with ν(B) > 0 and $\tilde{p}^{(l_0)}(x, A) > 0$ for all x ∈ B. Hence, $\pi^{(l_0)}(x, A) > 0$ for all x ∈ B and, therefore, by Lemma 2.4, $\mu^*(A) > 0$.

2.1.2. Proof of Theorem 1.1: the lower bound. Let $F\in C_b(\mathcal{P}(E))$ and ε > 0 be fixed. We have to show that

\begin{align}\label{lowerboundmain} \liminf_{n\rightarrow \infty} \frac{1}{n} \rho_n(n F \circ L_n) \geq \sup_{\nu \in \mathcal{P}(E)} ( F(\nu) - I(\nu)) - 4 \varepsilon. \end{align}

We do this by showing that every subsequence has a further subsequence which satisfies this inequality. So we fix a subsequence and relabel it n ∈ ℕ. Labeling subsequences by the same index as the original sequence will be a common practice throughout the remainder of this paper.

Outline of the proof. First, we show that there exists a Borel set Φ ⊆ E such that $\pi^{(l_0)}(y, \Phi) = 1$ for all y ∈ E, and, for all x ∈ Φ,

(2.1)\begin{align} \label{ineqlb1} \liminf_{n\rightarrow \infty} \frac{1}{n} \rho_{n-l_0}^{\delta_x}(n F\circ L_n(x_1,\ldots,x_{l_0}, \cdot)) \geq \sup_{\nu \in \mathcal{P}(E)} (F(\nu)-I(\nu)) - 3\varepsilon \end{align}

holds for all $x_1,\ldots,x_{l_0} \in E$ and a further subsequence (the same subsequence for all $x_1,\ldots,x_{l_0}$). This subsequence then remains fixed for the rest of the proof and is again labeled by n ∈ ℕ.

The next step is to use Lemma 2.1, i.e. for all $f \in C_b(E^n)$,

\[ \rho_n(\:f) = \rho_{l_0}((x_1,\ldots,x_{l_0})\mapsto \rho_{n-{l_0}}^{\pi(x_{l_0})}(\:f(x_1,\ldots,x_{l_0},\cdot))), \]

where $l_0$ is the constant from Condition (M1). This is used together with Lemma 2.2, i.e. for all $f \in C_b(E^n)$ and $\theta \in \mathcal{P}(E)$,

\[ \rho_n^{\theta}(\:f) \geq \int_E \rho_n^{\delta_x}(\:f) \theta(\text{d} x). \]

We then use these two results to show that

(2.2)\begin{align} \label{ineqlb2} \rho_n(nF\circ L_n) \geq \rho_{l_0}(g_n), \end{align}

where

\[g_n(x_1,\ldots,x_{l_0}) = \int_{\Phi} \rho_{n-l_0}^{\delta_x}(nF \circ L_n(x_1,\ldots,x_{l_0},\cdot)) \pi^{(l_0)}(x_{l_0},\text{d} x).\]

We conclude by combining the first limit result (2.1) and inequality (2.2), which works by Fatou’s lemma, using monotonicity of $\rho_n$ and the fact that $\rho_n(c) \geq c$ for all c ∈ ℝ.

Step 1. We show that (2.1) holds for all x ∈ Φ and $x_1,\ldots,x_{l_0} \in E$, where Φ and the required further subsequence are specified later.

We can, without loss of generality, choose $\nu_0 \in \mathcal{P}(E)$ such that

\[ -\infty \lt \sup_{\nu \in \mathcal{P}(E)}( F(\nu) - I(\nu)) \leq F(\nu_0)-I(\nu_0) + \varepsilon \lt \infty, \]

since if the supremum equals −∞, there is nothing to show. Then

\[\inf_{q \colon \nu_0 q = \nu_0} \int_E \beta(q(x),\pi(x)) \nu_0(\text{d} x) = I(\nu_0) < \infty. \]

Choose a stochastic kernel p on E with $\nu_0 p = \nu_0$ such that

\[ \infty > I(\nu_0) + \varepsilon \geq \int_E \beta(p(x),\pi(x)) \nu_0(\text{d} x) = \beta_2^{\nu_0}(\nu_0 \otimes p). \]

By (B3), we can choose a Borel set N ⊆ E with $\nu_0(N) = 0$ such that p(x) ≪ π(x) for all $x \in N^C$. Define the stochastic kernel $p_0$ on E by $p_0(x) := {\bf 1}_N(x)\pi(x) + {\bf 1}_{N^C}(x)p(x)$ for x ∈ E and note that

\[ \infty > I(\nu_0) + \varepsilon \geq \beta_2^{\nu_0}(\nu_0 \otimes p) = \beta_2^{\nu_0}(\nu_0 \otimes p_0). \]

For all x ∈ E, $p_0(x) \ll \pi(x)$ holds. Next, we replace $\nu_0$ and $p_0$ by $\nu_1$ and $p_1$, such that $F(\nu_1) - \beta_2^{\nu_1}(\nu_1 \otimes p_1) \geq F(\nu_0) - \beta_2^{\nu_0}(\nu_0 \otimes p_0) - 2\varepsilon$ and, additionally, $p_1$ is pointwise equivalent to π.

By conditions (M1) and (M2), π has a unique invariant measure, denoted by $\mu^*$; see [20, Lemma 8.6.2(a)]. By the lower boundedness of β, we can choose $\kappa_0 \in (0, 1)$ such that

\[ (1-\kappa_0)\beta_2^{\nu_0}(\nu_0\otimes p_0) \leq \beta_2^{\nu_0}(\nu_0\otimes p_0)+\varepsilon.\]

By continuity of F, we can further choose $\kappa_1 > 0$ such that, for all $0 \leq \hat{\kappa} \leq \kappa_1$,

\[F((1-\hat{\kappa})\nu_0 + \hat{\kappa} \mu^&#x002A;) \geq F(\nu_0)-\varepsilon.\]

Choose $\kappa := \min\{\kappa_0, \kappa_1\}$ and define $\nu_1 := (1 - \kappa)\nu_0 + \kappa\mu^*$ and

\[ p_1(x) = \frac{\text{d}\nu_0}{\text{d}\nu_1}(x) (1-\kappa) p_0(x) + \frac{\text{d}\mu^&#x002A;}{\text{d}\nu_1}(x) \kappa \pi(x). \]

Then one quickly checks that $\nu_1 p_1 = (1 - \kappa)(\nu_0 p_0) + \kappa(\mu^* \pi)$, in particular, therefore, $\nu_1 p_1 = \nu_1$. By the convexity of $\beta_2^{\cdot}(\cdot)$ and the assumption that β(ν, ν) = 0 for all $\nu \in \mathcal{P}(E)$, it holds that

\begin{align} \beta_2^{\nu_1}(\nu_1\otimes p_1) \leq (1-\kappa)\beta_2^{\nu_0}(\nu_0\otimes p_0) + \kappa \beta_2^{\mu^*}(\mu^*\otimes \pi) \leq \beta_2^{\nu_0}(\nu_0\otimes p_0) + \varepsilon, \end{align}

and, thus,

\[F(\nu_1) - \beta_2^{\nu_1}(\nu_1 \otimes p_1) \geq F(\nu_0) - \beta_2^{\nu_0}(\nu_0 \otimes p_0) - 2\varepsilon. \]

Since $\beta_2^{\nu_1}(\nu_1 \otimes p_1) < \infty$, without loss of generality, $p_1(x) \ll \pi(x)$ for all x ∈ E. By Lemma 2.5 (which yields $\nu_1 \ll \mu^*$ and, hence, $\text{d}\mu^*/\text{d}\nu_1(x) > 0$ for $\nu_1$-almost all x ∈ E) and by the construction of $p_1$, it also holds that $\pi(x) \ll p_1(x)$, again without loss of generality for all x ∈ E.

So $p_1$ also satisfies (M1), as every kernel which is pointwise equivalent to π satisfies (M1), notably with the same constants $l_0$ and $n_0$.

It follows that the Markov chain with initial distribution $\nu_1$ and transition kernel $p_1$ is ergodic; see [20, Lemma 8.6.2(a)]. It follows from the pointwise ergodic theorem (for both ergodic theorems used, see [20, Appendix A.4] or the references therein, i.e. [9, Corollaries 6.23 and 6.25]) that the sequence

\[(\mu^{(n)})_{n\in\mathbb{N}} := \bigg(\nu_1 \otimes \Big(\bigotimes_{i=1}^{n-1} p_1\Big)\bigg)_{n\in\mathbb{N}} \]

satisfies the conditions of Lemma 2.3 and, thus, $\mu^{(n)}\circ L_n^{-1} \buildrel w \over \longrightarrow \delta_{\nu_1}$. This yields

(2.3)\begin{align}\label{eqlim1} \lim_{n\rightarrow \infty} \int_{E^n} | F\circ L_n - F(\nu_1)| \text{d}\mu^{(n)} = \lim_{n\rightarrow \infty} \int_{\mathcal{P}(E)} | F - F(\nu_1)| \text{d}\mu^{(n)} \circ L_n^{-1} = 0. \end{align}

Let $(X_n)_{n\in\mathbb{N}}$ be a sequence of E-valued random variables such that $(X_1,\ldots,X_n) \sim \mu^{(n)}$ for all n ∈ ℕ. We see that

\begin{gather*} \mathbb{E}[\beta(p_1(X_1),\pi(X_1))] = \beta_2^{\nu_1}(\nu_1 \otimes p_1), \\ \mathbb{E}[| \beta(p_1(X_1),\pi(X_1))|] \leq \Big| \inf_{\tau\in \mathcal{P}(E)^2} \beta(\tau)\Big| + \beta_2^{\nu_1}(\nu_1 \otimes p_1) < \infty, \end{gather*}

and, thus, by the $L^1$-ergodic theorem,

(2.4)\begin{gather} \lim_{n\rightarrow \infty} \mathbb{E}\bigg[ \bigg| \frac{1}{n} \sum_{i=1}^{n-1} \beta(p_1(X_i),\pi(X_i))-\beta_2^{\nu_1}(\nu_1 \otimes p_1)\bigg|\bigg] = 0 \label{eqlim2} \\ \Longleftrightarrow\quad \lim_{n\rightarrow \infty}\int_{E^n} \bigg| \frac{1}{n} \sum_{i=1}^{n-1} \beta(p_1(x_i),\pi(x_i)) - \beta_2^{\nu_1}(\nu_1 \otimes p_1)\bigg| \mu^{(n)}(\text{d} x_1,\ldots,\text{d} x_n) = 0.\label{eqlim3} \end{gather}

For $\theta \in \mathcal{P}(E)$ and a stochastic kernel $q\colon E\rightarrow \mathcal{P}(E)$, we define

\begin{align} (\mu^{(\theta,q,n)})_{n\in\mathbb{N}} := \bigg(\theta \otimes \Big(\bigotimes_{i=1}^{n-1} q\Big)\bigg)_{n\in\mathbb{N}}. \end{align}

By the above limits (2.3) and (2.4), and the fact that $L^1$-convergence implies almost-sure convergence of a subsequence, we can choose a Borel set Φ ⊆ E with $\nu_1(\Phi) = 1$ such that, for all x ∈ Φ and a subsequence (again labeled by n ∈ ℕ),

(2.5)\begin{align}\label{Feqdx} \lim_{n\rightarrow \infty} \int_{E^n} | F\circ L_n - F(\nu_1) | \mu^{(\delta_x,p_1,n)}(\text{d} x_1,\ldots,\text{d} x_n) = 0 \end{align}

and

\begin{gather*} \lim_{n\rightarrow \infty} \int_{E^n} \bigg| \frac{1}{n} \sum_{i=1}^{n-1} \beta(p_1(x_i),\pi(x_i)) - \beta_2^{\nu_1}(\nu_1 \otimes p_1)\bigg| \mu^{(\delta_x,p_1,n)}(\text{d} x_1,\ldots,\text{d} x_n) = 0 \\ \Longrightarrow\quad \lim_{n\rightarrow \infty} \frac{1}{n}\beta_n^{\delta_x}(\mu^{(\delta_x,p_1,n)}) = \beta_2^{\nu_1}(\nu_1\otimes p_1). \label{betaeqdx} \end{gather*}

Since $\nu_1$ and $\mu^*$ are equivalent ($\nu_1 \ll \mu^*$ by Lemma 2.5, and $\mu^* \ll \nu_1$ by the definition of $\nu_1$), $\mu^*(\Phi) = 1$. Since $\mu^*(\Phi) = 1$, $\pi^{(l_0)}(y, \Phi) = 1$ holds for all y ∈ E, as otherwise Lemma 2.4 would imply that $\mu^*(\Phi^C) > 0$. So we have found the set mentioned at the beginning of the proof and the required subsequence. It remains to show that (2.1) holds for all x ∈ Φ and $x_1,\ldots,x_{l_0} \in E$.

Let $x_1,\ldots,x_{l_0} \in E$. By (2.5), dominated convergence, and the triangle inequality,

\begin{align} &\int_{E^{n-l_0}} | F\circ L_n(x_1,\ldots,x_n) - F(\nu_1) | \mu^{(\delta_x,p_1,n-l_0)}(\text{d} x_{l_0+1},\ldots,\text{d} x_n) \\ &\quad \quad\leq \int_{E^{n-l_0}} | F\circ L_n(x_1,\ldots,x_n) - F\circ L_{n-l_0}(x_{l_0+1},\ldots,x_n)| \mu^{(\delta_x,p_1,n-l_0)}(\text{d} x_{l_0+1},\ldots,\text{d} x_n) \\ &\quad \quad + \int_{E^{n-l_0}} | F\circ L_{n-l_0}(x_{l_0+1},\ldots,x_n) - F(\nu_1)| \mu^{(\delta_x,p_1,n-l_0)}(\text{d} x_{l_0+1},\ldots,\text{d} x_n) \\ &\quad \quad\rightarrow 0, \end{align}

since F is continuous and $\|L_n(x_1,\ldots,x_{l_0},\cdot) - L_{n-l_0}\|_v \leq 2l_0/n \rightarrow 0$, where $\|\cdot\|_v$ denotes the total variation norm. Thus,

\begin{align} \int_{E^{n-l_0}} F\circ L_n(x_1,\ldots,x_n) \mu^{(\delta_x,p_1,n-l_0)}(\text{d} x_{l_0+1},\ldots,\text{d} x_n) \rightarrow F(\nu_1). \end{align}
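(The total variation bound used above can be spelled out:

\[ L_n(x_1,\ldots,x_n) - L_{n-l_0}(x_{l_0+1},\ldots,x_n) = \frac{1}{n}\sum_{i=1}^{l_0} \delta_{x_i} - \frac{l_0}{n(n-l_0)} \sum_{i=l_0+1}^{n} \delta_{x_i}, \]

and each of the two signed parts has total mass $l_0/n$.)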

Finally, it follows that

\begin{align} &\liminf_{n\rightarrow \infty} \frac{1}{n} \rho_{n-l_0}^{\delta_x}(n F \circ L_n(x_1,\ldots,x_{l_0},\cdot)) \\ &\quad \quad = \liminf_{n\rightarrow \infty} \sup_{\nu \in \mathcal{P}(E^{n-l_0})} \bigg( \int_{E^{n-l_0}} F \circ L_n(x_1,\ldots,x_n) \nu(\text{d} x_{l_0+1},\ldots,\text{d} x_n) - \frac{1}{n}\beta_{n-l_0}^{\delta_x}(\nu) \bigg) \\ &\quad \quad \geq \liminf_{n\rightarrow \infty} \bigg(\int_{E^{n-l_0}} F \circ L_n(x_1,\ldots,x_n) \mu^{(\delta_x,p_1,n-l_0)}(\text{d} x_{l_0+1},\ldots,\text{d} x_n) \\ &\quad \quad \quad \quad \quad - \frac{1}{n}\beta_{n-l_0}^{\delta_x}(\mu^{(\delta_x,p_1,n-l_0)})\bigg) \\ &\quad \quad = F(\nu_1) - \beta_2^{\nu_1}(\nu_1 \otimes p_1) \\ &\quad \quad \geq \sup_{\nu\in\mathcal{P}(E)}( F(\nu) - I(\nu)) - 3 \varepsilon. \end{align}

Step 2. First, define $g_n \colon E^{l_0} \rightarrow \mathbb{R}$ for $n > l_0$ by

$$g_n(x_1,\ldots,x_{l_0}) = \int_{\Phi} \rho_{n-l_0}^{\delta_x}(nF \circ L_n(x_1,\ldots,x_{l_0},\cdot)) \pi^{(l_0)}(x_{l_0},\text{d} x).$$

Then $g_n$ is upper semi-analytic, since $(x,x_1,\ldots,x_{l_0}) \mapsto \rho_{n-l_0}^{\delta_x}(n F \circ L_n(x_1,\ldots,x_{l_0},\cdot))$ is (by [6, Propositions 7.47 and 7.48]; see also Lemma 2.1) and, thus, $g_n$ is as well (by [6, Proposition 7.48]).

By Fatou’s lemma (applicable since $\rho_{n-l_0}^{\delta_x}(n F\circ L_n(x_1,\ldots,x_{l_0},\cdot))/n \geq -\|F\|_{\infty}$), for all $x_1,\ldots,x_{l_0} \in E$,

\begin{align} \liminf_{n\rightarrow \infty} \frac{1}{n} g_n(x_1,\ldots,x_{l_0}) &\geq \int_{\Phi} \liminf_{n\rightarrow \infty} \frac{1}{n} \rho_{n-l_0}^{\delta_x}(n F \circ L_n(x_1,\ldots,x_{l_0},\cdot)) \pi^{(l_0)}(x_{l_0},\text{d} x) \\ &\geq \sup_{\nu\in\mathcal{P}(E)}(F(\nu)-I(\nu))-3\varepsilon. \end{align}

We define the sets

$$\Omega_n := \Big\{ (x_1,\ldots,x_{l_0}) \in E^{l_0} \colon \frac{1}{j} g_j(x_1,\ldots,x_{l_0}) \geq \sup_{\nu \in \mathcal{P}(E)} (F(\nu) - I(\nu)) - 4\varepsilon \text{ for all } j \geq n \Big\}$$

for n ∈ ℕ, which are universally measurable and satisfy $\Omega_1 \subseteq \Omega_2 \subseteq \Omega_3 \subseteq \cdots$ with $\bigcup_{i=1}^{\infty} \Omega_i = E^{l_0}$. For n ∈ ℕ, let $p_n := \mu^{(\pi_0,\pi,l_0)}(\Omega_n)$. Then, by continuity from below, $p_n \rightarrow 1$ for n → ∞. We have, by Lemmas 2.1 and 2.2, and the monotonicity of $\rho_{l_0}$,

\begin{align} \liminf_{n\rightarrow \infty} \frac{1}{n} \rho_n(nF\circ L_n) &\geq \liminf_{n\rightarrow \infty} \frac{1}{n} \rho_{l_0}(g_n)\\ &\geq \liminf_{n\rightarrow \infty} \frac{1}{n} \rho_{l_0}\Big({\bf 1}_{\Omega_n} n\Big(\sup_{\nu \in \mathcal{P}(E)} (F(\nu)-I(\nu)) - 4\varepsilon\Big) - {\bf 1}_{\Omega_n^C} n \|F\|_{\infty}\Big)\\ &\geq \liminf_{n\rightarrow \infty} \Big(p_n \Big(\sup_{\nu \in \mathcal{P}(E)} (F(\nu)-I(\nu)) - 4\varepsilon\Big) - (1-p_n) \|F\|_{\infty}\Big)\\ &= \sup_{\nu \in \mathcal{P}(E)} (F(\nu)-I(\nu)) - 4\varepsilon, \end{align}

where the last inequality uses β(ν, ν) = 0 for all $\nu \in \mathcal{P}(E)$, which implies that $\beta_{l_0}^{\pi_0}(\mu^{(\pi_0, \pi, l_0)}) = 0$ and, hence, $\rho_{l_0}(\:f) \geq \int_{E^{l_0}} f \,\text{d}\mu^{(\pi_0,\pi,l_0)}$ for all bounded Borel measurable functions f on $E^{l_0}$.

2.2. The upper bound of Theorem 1.1

2.2.1. Preliminary results. The following theorem is essential for the proof of the upper bound. It is based on Proposition 8.2.5 and Theorem 8.2.8 of [20].

Theorem 2.1

Suppose that Assumption T holds, and let $(\mu^{(n)})_{n\in\mathbb{N}}$ with $\mu^{(n)} \in \mathcal{P}(E^n)$ be a sequence of measures such that

$$\sup_{n\in\mathbb{N}} \frac{1}{n}\beta^{\pi_0}_n(\mu^{(n)}) < \infty.$$

For n ∈ ℕ, let $X_n = (X_{n,1},\ldots,X_{n,n})$ be $E^n$-valued random variables with distribution $\mu^{(n)}$. Define the sequence of $\mathcal{P}(E\times E)$-valued random variables $(\gamma_n)_{n\in\mathbb{N}}$ by

$$\gamma_{n-1} := \frac{1}{n-1}\sum_{i=1}^{n-1} \delta_{X_{n,i}} \otimes \mu^{(n)}_{i,i+1}(X_{n,1},\ldots,X_{n,i}).$$

It holds that

  1. (i) $(\gamma_n)_{n\in\mathbb{N}}$ is tight;

  2. (ii) for every convergent (in distribution) subsequence of (γ n)n∈ℕ, there exists a probability space $(\bar{\Omega},\bar{\mathcal{F}},\bar{\mathbb{P}})$ such that, on this space, there exist random variables $\bar{\gamma}_n \sim \gamma_n$ and $\bar{\gamma}\sim \gamma $ with $\bar{\gamma}_n \buildrel w \over \longrightarrow \bar{\gamma}$, $\bar{\mathbb{P}}$-a.s. Furthermore, $\bar{\gamma}^{(1)} = \bar{\gamma}^{(2)}$, $\bar{\mathbb{P}}$-a.s., where $\bar{\gamma}^{(1)}$ and $\bar{\gamma}^{(2)}$ are the first and second marginals of $\bar{\gamma}$, respectively.

Proof. For the proof of (i), there is nothing to show if (T1’) holds. So we only consider the case in which (T1) holds. Define the sequence of first marginals $(\tilde{L}_n)_{n\in\mathbb{N}} := (\gamma_n^{(1)})_{n\in\mathbb{N}}$. We first show that $(\tilde{L}_n)_{n\in\mathbb{N}}$ is tight. The idea is to use (T1), which yields a tightness function c on E defined by

\[ c(x) := U(x) - \rho^{\pi(x)}(U), \]

and, thus, a tightness function G on $\mathcal{P}(E)$ defined by

\[ G(\theta) := \int_E c \,\text{d} \theta, \]

where we refer to Appendix A.3.17 of [20] and the preceding definition, as well as Lemma 8.2.4 of [20], for properties of a tightness function. In the following, we show that $\mathbb{E}[\int_E c \,\text{d}\tilde{L}_n] \leq K \in \mathbb{R}$ uniformly in n ∈ ℕ, which is sufficient to yield the claim since

$$\mathbb{E}\bigg[\int_E c \text{d}\tilde{L}_n \bigg] = \int_{\mathcal{P}(E)} \bigg(\int_E c \text{d}\theta\bigg) \mathbb{P}\circ\tilde{L}_n^{-1}(\text{d}\theta)$$

and the set $\{Q\in \mathcal{P}(\mathcal{P}(E)) \colon \int_{\mathcal{P}(E)} G(\theta) Q(\text{d}\theta) \leq M \}$ is tight for every M ∈ ℝ by Lemma 8.2.4 of [20].

In a first step, we assume that U is bounded. Then, for all x ∈ E, by the definition of $\rho^{\pi(x)}$, for all $\nu \in \mathcal{P}(E)$,

(2.6)\begin{align}\label{eqhl1} \beta(\nu,\pi(x)) \geq \int_E U \text{d}\nu -\rho^{\pi(x)}(U). \end{align}

For i ∈ {1, 2, …, n − 1}, $\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i})$ is a regular conditional distribution of $X_{n,i+1}$ given $\sigma(X_{n,1},\ldots,X_{n,i})$ and, therefore (see, for example, [19, Theorem 10.2.5]; recall that U is bounded for now),

$$\mathbb{E}[U(X_{n,i+1}) \mid X_{n,1},\ldots,X_{n,i}] = \int_E U \text{d}\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}).$$

We calculate

\begin{align} &\mathbb{E}[U(X_{n,i+1}) - U(X_{n,i})] \\ &\quad \quad= \mathbb{E}[\mathbb{E}[U(X_{n,i+1}) \mid X_{n,1},\ldots,X_{n,i}]-U(X_{n,i})] \\ &\quad \quad= \mathbb{E}\bigg[\int_E U \,\text{d}\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}) - U(X_{n,i})\bigg] \\ &\quad \quad= \mathbb{E}\bigg[\int_E U \,\text{d}\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}) - \rho^{\pi(X_{n,i})}(U)\bigg] + \mathbb{E}[\rho^{\pi(X_{n,i})}(U) - U(X_{n,i})] \\ &\quad \quad\stackrel{(2.6)}{\leq} \mathbb{E}[\beta(\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}), \pi(X_{n,i}))] - \mathbb{E}[c(X_{n,i})]. \end{align}

Summing the above inequalities over i ∈ {1, 2, …, n − 1} yields

\begin{align} &\mathbb{E}[U(X_{n,n}) - U(X_{n,1})] \leq \sum_{i=1}^{n-1} ( \mathbb{E}[\beta(\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}), \pi(X_{n,i}))] - \mathbb{E}[c(X_{n,i})] ) \\& \quad \quad \quad \quad \quad \quad \Rightarrow \sum_{i=1}^{n-1} \mathbb{E}[c(X_{n,i})] \\ &\quad \quad \quad \quad \quad \quad \leq \mathbb{E}[U(X_{n,1})] + \sum_{i=1}^{n-1} \mathbb{E}[\beta(\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}), \pi(X_{n,i}))],\end{align}

where $\mathbb{E}[U(X_{n,n})] \geq 0$ is used. Dividing the above inequality by (n − 1), we obtain

\begin{align} &\mathbb{E}\Bigg[\int_E c d\tilde{L}_{n-1}\Bigg] \\ &\quad \quad= \frac{1}{n-1} \sum_{i=1}^{n-1} \mathbb{E}[c(X_{n,i})] \\ &\quad \quad\leq \frac{1}{n-1} \bigg(\mathbb{E}[U(X_{n,1})] + \sum_{i=1}^{n-1} \mathbb{E}[\beta(\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}), \pi(X_{n,i}))]\bigg) \\ &\quad \quad\leq \frac{1}{n-1} \bigg(\beta(\mu_{0,1}^{(n)},\pi_0) + \rho^{\pi_0}(U) + \sum_{i=1}^{n-1} \mathbb{E}[\beta(\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i}), \pi(X_{n,i}))]\bigg) \\ &\quad \quad= \frac{1}{n-1} \beta^{\pi_0}_n(\mu^{(n)}) + \frac{1}{n-1}\rho^{\pi_0}(U). \end{align}

The last term of the above inequality chain is uniformly bounded for all n ≥ 2 by assumption and part (c) of (T1), and we denote this bound by K ∈ ℝ.

Now, let us show that the above holds for unbounded U. Let $U_k := U \wedge k$ (for k ∈ ℕ) and $c_k(x) := U_k(x) - \rho^{\pi(x)}(U_k)$. We have shown that

$$\mathbb{E}\bigg[\int_E c_k \,\text{d}\tilde{L}_{n-1}\bigg] \leq \frac{1}{n-1}\beta_n^{\pi_0}(\mu^{(n)}) + \frac{1}{n-1}\rho^{\pi_0}(U_k) \leq \frac{1}{n-1}\beta_n^{\pi_0}(\mu^{(n)}) + \frac{1}{n-1}\rho^{\pi_0}(U).$$

One quickly verifies that $c_k \geq c\wedge(\inf_{\tau \in \mathcal{P}(E)^2} \beta(\tau))$. Indeed, $c_k(x) \geq \inf_{\tau \in \mathcal{P}(E)^2} \beta(\tau)$ if U(x) ≥ k, and $c_k(x) \geq c(x)$ if U(x) ≤ k. Hence, the $c_k$ are uniformly bounded below by a constant owing to the lower boundedness of β and (T1). Furthermore, for all x ∈ E, $c(x) = \lim_{k\rightarrow\infty} c_k(x)$ by monotone convergence and, therefore, by Fatou’s lemma,

$$\mathbb{E}\bigg[\int_E c \,\text{d}\tilde{L}_{n-1}\bigg] \leq \liminf_{k\rightarrow\infty} \mathbb{E}\bigg[\int_E c_k \,\text{d}\tilde{L}_{n-1}\bigg] \leq \frac{1}{n-1}\beta_n^{\pi_0}(\mu^{(n)}) + \frac{1}{n-1}\rho^{\pi_0}(U) \leq K.$$

This shows that $(\tilde{L}_n)_{n\in\mathbb{N}}$ is tight.

Next, we show that the sequence of second marginals of $(\gamma_n)_{n\in\mathbb{N}}$ is tight, i.e. we prove tightness of the sequence $(\gamma_n^{(2)})_{n\in\mathbb{N}}$ given by $\gamma_{n-1}^{(2)} = ({1}/{(n-1)}) \sum_{i=1}^{n-1} \mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i})$. This follows from

\begin{align} \mathbb{E}\bigg[\int_E c \,\text{d}\gamma_n^{(2)}\bigg] &= \frac{1}{n}\sum_{i = 1}^n \mathbb{E}\bigg[\int_E c \,\text{d}\mu_{i,i+1}^{(n+1)}(X_{n+1,1},\ldots, X_{n+1,i})\bigg] \\ &\stackrel{(*)}{=} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[\mathbb{E}[c(X_{n+1,i}) \mid X_{n+1,1},\ldots,X_{n+1,i}]] \\ &= \frac{1}{n}\sum_{i=1}^n \mathbb{E}[ c(X_{n+1,i})] \\ &= \mathbb{E}\bigg[\int_E c \,\text{d}\tilde{L}_n\bigg] \\ &\leq K, \end{align}

where the last inequality holds uniformly in n ∈ ℕ, as shown above. Note that while equality (*) requires integrability, we can circumvent this requirement by using the same arguments as above, in that we first assume that U is bounded and use Fatou’s lemma for the transition to the general case.

Tightness of (γ n)n∈ℕ now follows from tightness of the marginals ${(\gamma _n^{(2)})_{n \in {\mathbb{N}}}}$ and ( n)n∈ℕ. For part (ii), choose any subsequence still denoted by (γ n)n∈ℕ that converges in distribution, which means there exists a $\mathcal{P}(E\times E)$-valued random variable γ such that

$$\mathbb{P}\circ \gamma_n^{-1} \buildrel w \over \longrightarrow \mathbb{P}\circ \gamma^{-1}.$$

With Skorokhod’s representation theorem (see, e.g. [Reference Ethier and Kurtz24, page 102]), we can pass to a probability space $(\bar{\Omega},\bar{\mathcal{F}},\bar{\mathbb{P}})$ on which there exist random variables $\bar{\gamma}_n \sim \gamma_n$ and $\bar{\gamma}\sim \gamma$ with $\bar{\gamma}_n \buildrel w \over \longrightarrow \bar{\gamma}$, $\bar{\mathbb{P}}$-a.s.

It only remains to show that $\bar{\gamma}^{(1)} = \bar{\gamma}^{(2)}$ holds $\bar{\mathbb{P}}$-a.s. Since $\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i})$ is a regular conditional distribution of X n,i+1 given X n,1, …, X n,i,

$$\label{eq2hl} \mathbb{E}\bigg[\bigg(\:f(X_{n,i+1}) - \int_E f \text{d}\mu_{i,i+1}^{(n)}(X_{n,1},\ldots,X_{n,i})\bigg) \biggm| X_{n,1},\ldots,X_{n,i} \bigg] = 0$$

for fC b(E), n ∈ ℕ, and i ∈ {1, … , n−1}. This means that the terms inside the expectation form (for fixed n) a martingale difference sequence. For ease of notation, we write

\begin{align} a_{n,i} := f(X_{n,i}), \quad\text{and}\quad b_{n,i} := \int_E f \text{d}\mu_{i-1,i}^{(n)}(X_{n,1},\ldots,X_{n,i-1}), \end{align}

and get for n ≥ 2,

\begin{align} &\bar{\mathbb{E}}\bigg[\bigg(\int_E f \,\text{d} \bar{\gamma}_{n-1}^{(1)} - \int_E f \,\text{d} \bar{\gamma}_{n-1}^{(2)}\bigg)^2 \bigg] \\ &\quad \quad=\mathbb{E}\bigg[\bigg(\int_E f \,\text{d} \gamma_{n-1}^{(1)} - \int_E f \,\text{d} \gamma_{n-1}^{(2)}\bigg)^2 \bigg] \\ &\quad \quad= \mathbb{E}\bigg[\bigg( \frac{1}{n-1} \sum_{i=1}^{n-1} (a_{n,i} - b_{n,i+1})\bigg)^2\bigg] \\ &\quad \quad= \frac{1}{(n-1)^2} \mathbb{E}\bigg[ \bigg((b_{n,1}-b_{n,n})+\sum_{i=1}^{n-1} (a_{n,i}-b_{n,i})\bigg)^2\bigg] \\ &\quad \quad= \frac{1}{(n-1)^2} \mathbb{E}\bigg[ (b_{n,1}-b_{n,n})^2 + 2(b_{n,1}-b_{n,n})\sum_{i=1}^{n-1} (a_{n,i}-b_{n,i}) + \sum_{i=1}^{n-1} (a_{n,i}-b_{n,i})^2 \bigg] \\ &\quad \quad\leq \frac{4+8(n-1)+4(n-1)}{(n-1)^2} \|f\|_{\infty}^2,

\end{align}

where the last equality uses the martingale difference property, which makes the cross terms of the squared sum vanish, i.e. $\mathbb{E}[(\sum_{i=1}^{n-1}(a_{n,i}-b_{n,i}))^2] = \sum_{i=1}^{n-1}\mathbb{E}[(a_{n,i}-b_{n,i})^2]$. The bound converges to 0 as n → ∞. By the triangle inequality in $L^2(\bar{\mathbb{P}})$ and dominated convergence (f is bounded and $\bar{\gamma}_{n-1} \buildrel w \over \longrightarrow \bar{\gamma}$, $\bar{\mathbb{P}}$-a.s.),

\begin{align} \bar{\mathbb{E}}\bigg[\bigg(\int_E f \text{d} \bar{\gamma}^{(1)} - \int_E f \text{d}\bar{\gamma}^{(2)}\bigg)^2\bigg] = 0, \end{align}

which implies that $\int_E f \text{d}\bar{\gamma}^{(1)} = \int_E f \text{d}\bar{\gamma}^{(2)}$, $\bar{\mathbb{P}}$-a.s. for every fC b(E). By a standard separability argument (cf. [Reference Varadarajan44, Proof of Theorem 3.1]), it follows that $\bar{\gamma}^{(1)} = \bar{\gamma}^{(2)}$, $\mathbb{\bar{P}}$-a.s.

2.2.2. Proof of Theorem 1.1: the upper bound. Let $F\colon \mathcal{P}(E) \rightarrow \mathbb{R}$ be bounded and upper semi-continuous. By definition,

$$\frac {1}{n} \rho_n(nF\circ L_n) = \sup_{\mu \in \mathcal{P}(E^n)}\bigg( \int_{E^n} F\circ L_n \text{d}\mu - \frac{1}{n} \beta^{\pi_0}_n(\mu)\bigg).$$

Using the boundedness of F, the lower boundedness of β, and the fact that β(ν, ν) = 0 for all $\nu \in \mathcal{P}(E)$, we can show that the right-hand side of the above equation is bounded below by $-\|F\|_{\infty}$ and bounded above by $\| F \|_{\infty} + |{\inf_{\tau \in \mathcal{P}(E)^2} \beta(\tau)}|$. Thus, for each n ∈ ℕ, we can choose $\mu^{(n)} \in \mathcal{P}(E^n)$ such that

(2.7)\begin{align} \label{mteq1} \frac{1}{n} \rho_n(nF\circ L_n) - \frac{1}{n} \leq \int_{E^n} F\circ L_n \text{d}\mu^{(n)} - \frac{1}{n} \beta^{\pi_0}_n(\mu^{(n)}) \end{align}

and

$$\sup_{n\in\mathbb{N}} \frac{1}{n}\beta^{\pi_0}_n(\mu^{(n)}) < \infty.$$

The latter will be used to apply Theorem 2.1 in a few moments. First, we use β(ν, ν) = 0 for all $\nu \in \mathcal{P}(E)$ and the convexity of $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$ to calculate

(2.8)\begin{align}\label{eq::ubeq2} \frac{1}{n} \beta^{\pi_0}_n(\mu^{(n)}) &= \frac{1}{n} \beta(\mu_{0,1}^{(n)},\pi_0) + \frac{1}{n}\sum_{i=1}^{n-1} \int_{E^n} \beta(\mu^{(n)}_{i,i+1}(x_1,\ldots,x_i),\pi(x_i)) \mu^{(n)}(\text{d} x_1,\ldots,\text{d} x_n) \\ &= \frac{1}{n} \beta^{\pi_0}_2(\mu_{0,1}^{(n)}\otimes \pi) + \int_{E^n}\frac{1}{n} \sum_{i=1}^{n-1} \beta^{\delta_{x_i}}_2(\delta_{x_i}\otimes \mu^{(n)}_{i,i+1}(x_1,\ldots,x_i)) \mu^{(n)}(\text{d} x_1,\ldots,\text{d} x_n) \\ &\geq \int_{E^n} \beta_2^{(\pi_0 + \sum_{i=1}^{n-1}\delta_{x_i})/{n}}\bigg(\frac{1}{n} \bigg(\mu^{(n)}_{0,1}\otimes\pi+\sum_{i=1}^{n-1} \delta_{x_i} \otimes \mu^{(n)}_{i,i+1}(x_1,\ldots,x_i)\bigg)\!\bigg) \\ &\times\mu^{(n)}(\text{d} x_1,\ldots,\text{d} x_n), \end{align}

where ‘⊗’ denotes the product measure if both arguments are measures.

For n ∈ ℕ, let X n = (X n,1, …, X n,n) be E n-valued random variables with distribution μ (n). Define the sequence of $\mathcal{P}(E\times E)$-valued random variables (γ n)n∈ℕ by

$$\gamma_{n-1} := \frac{1}{n-1}\sum_{i=1}^{n-1} \delta_{X_{n,i}} \otimes \mu^{(n)}_{i,i+1}(X_{n,1},\ldots,X_{n,i}).$$

For any subsequence, Theorem 2.1(i) yields a further subsequence (again labeled by n ∈ ℕ and fixed for the rest of the proof of the upper bound) such that (γ n)n∈ℕ converges in distribution. By Theorem 2.1(ii), there exists a probability space $(\bar{\Omega},\bar{\mathcal{F}},\bar{\mathbb{P}})$, such that on this space, there exist random variables $\bar{\gamma}_n \sim \gamma_n$ and $\bar{\gamma}\sim \gamma $ with $\bar{\gamma}_n \buildrel w \over \longrightarrow \bar{\gamma}$, $\bar{\mathbb{P}}$-a.s. Furthermore, $\bar{\gamma}^{(1)} = \bar{\gamma}^{(2)}$, $\bar{\mathbb{P}}$-a.s., where $\bar{\gamma}^{(1)}$ and $\bar{\gamma}^{(2)}$ are the first and second marginals of $\bar{\gamma}$, respectively.

Define the sequence of first marginals of $(\bar{\gamma}_n)_{n \in \mathbb{N}}$ as $(\bar{L}_n)_{n \in \mathbb{N}} := (\bar{\gamma}_n^{(1)})_{n \in \mathbb{N}}$ and $\bar{L} := \bar{\gamma}^{(1)}$, and note that $\bar{L}_n \buildrel w \over \longrightarrow \bar{L}$, $\bar{\mathbb{P}}$-a.s. With these definitions, (2.7), and (2.8), we obtain

\begin{align} &\frac{1}{n} \rho_n(nF\circ L_n) - \frac{1}{n} \\ &\quad \quad\leq \bar{\mathbb{E}}\Big[F\Big(\frac{n-1}{n}\bar{L}_{n-1} + \frac{1}{n} \delta_{\skew1\bar{x}_{n,n}}\Big) - \beta^{{\pi_0}/{n} + {(n-1)}\bar{L}_{n-1}/{n}}_2\Big(\frac{\mu_{0,1}^{(n)}\otimes \pi}{n} + \frac{n-1}{n}\bar{\gamma}_{n-1} \Big) \Big], \end{align}

where the $\skew1\bar{x}_{n,n}$ are (redefined) random variables on $(\bar{\Omega},\bar{\mathcal{F}},\bar{\mathbb{P}})$ such that $(X_{n,n},\gamma_{n-1}) \sim (\skew1\bar{x}_{n,n},\bar{\gamma}_{n-1})$ for all n ∈ ℕ. For ease of notation, define

\begin{gather*} t_{n,0} := \frac{n-1}{n}\bar{L}_{n-1} + \frac{1}{n} \delta_{\skew1\bar{x}_{n,n}},\quad \quad t_{n,1} := \frac{\pi_0}{n} + \frac{n-1}{n} \bar{L}_{n-1}, \\ t_{n,2} := \frac{\mu_{0,1}^{(n)}\otimes \pi}{n} + \frac{n-1}{n}\bar{\gamma}_{n-1}, \end{gather*}

and note that $t_{n,0} \buildrel w \over \longrightarrow \bar{L}$, $t_{n,1} \buildrel w \over \longrightarrow \bar{L}$, and $t_{n,2} \buildrel w \over \longrightarrow \bar{\gamma}$, all $\bar{\mathbb{P}}$-a.s. Therefore, by the upper semi-continuity of F and $-\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$,

\begin{align} \limsup_{n\rightarrow \infty} \frac{1}{n}\rho_n(n F \circ L_n) &\leq \limsup_{n\rightarrow \infty} \bar{\mathbb{E}}[F(t_{n,0}) - \beta_2^{t_{n,1}}(t_{n,2})] \\ &\leq \bar{\mathbb{E}}[F \circ \bar{L} - \beta^{\bar{L}}_2(\bar{\gamma})]\\ & = \bar{\mathbb{E}}\bigg[ F\circ \bar{L} - \int_E \beta(\bar{\gamma}_{1,2}(x),\pi(x)) \bar{L}(\text{d} x)\bigg]\\ &\leq \sup_{\nu \in \mathcal{P}(E)}\bigg( F(\nu) - \inf_{q\colon \nu q = \nu}\int_{E} \beta(q(x),\pi(x)) \nu(\text{d} x)\!\bigg), \end{align}

where in the last inequality we used the fact that $\bar{\gamma}^{(1)} = \bar{\gamma}^{(2)}$ holds $\bar{\mathbb{P}}$-a.s. We have shown that every subsequence has a further subsequence such that this inequality holds, which implies it also holds for the whole sequence.

2.3. Proof of Corollary 1.1

Claim 2.1

If $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$ is lower semicontinuous then I is lower semicontinuous. If $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$ is convex then I is convex.

Proof. To prove the lower semicontinuity, let $\nu_n \buildrel w \over \longrightarrow \nu \in \mathcal{P}(E)$. We have to show that

$\liminf_{n\rightarrow \infty} I(\nu_n) \geq I(\nu).$

Note that I is bounded below. If the left-hand side of the above inequality equals ∞ then there is nothing to prove. So, for any subsequence, we can choose a further subsequence still denoted by (ν n)n∈ℕ such that I(ν n) < ∞ for all n. Thus, we can choose stochastic kernels q n such that

$$\beta_2^{\nu_n}(\nu_n \otimes q_n) \leq I(\nu_n) + \frac{1}{n} \quad\text{and}\quad \nu_n q_n = \nu_n.$$

Since $\nu_n q_n = \nu_n$ and the sequence $(\nu_n)_{n\in\mathbb{N}}$ is tight by Prokhorov’s theorem, the sequence $(\nu_n \otimes q_n)_{n\in\mathbb{N}}$ is tight as well. We pass to a further subsequence, still denoted by $(\nu_n \otimes q_n)_{n\in\mathbb{N}}$, such that $\nu_n \otimes q_n \buildrel w \over \longrightarrow \nu \otimes q$ for some stochastic kernel q on E (obtained by disintegrating the limit), where νq = ν follows by convergence of the marginals. By the lower semicontinuity of $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$,

$$\liminf_{n\rightarrow \infty} I(\nu_n) \geq \liminf_{n\rightarrow \infty} \Big(\beta_2^{\nu_n}(\nu_n \otimes q_n) - \frac{1}{n}\Big) \geq \beta_2^{\nu}(\nu \otimes q) \geq I(\nu).$$

To prove the convexity, note that $I(\nu) = \inf_{\tau \in \mathcal{P}(E^2)\colon\tau_1 = \tau_2 = \nu} \beta_2^{\nu}(\tau)$, where $\tau_1$ and $\tau_2$ denote the two marginals of τ. Let $\nu_1, \nu_2 \in \mathcal{P}(E)$, $\lambda \in (0,1)$, and $\tau^{(1)}, \tau^{(2)} \in \mathcal{P}(E^2)$ with $\tau^{(1)}_1 = \tau^{(1)}_2 = \nu_1$ and $\tau^{(2)}_1 = \tau^{(2)}_2 = \nu_2$. Then

\begin{align} \lambda \beta_2^{\nu_1}(\tau^{(1)}) + (1-\lambda) \beta_2^{\nu_2}(\tau^{(2)}) &\geq \beta_2^{\lambda \nu_1 + (1-\lambda) \nu_2}(\lambda \tau^{(1)} + (1-\lambda) \tau^{(2)}) \\ &\geq \inf_{\substack{\tau \in \mathcal{P}(E^2)\colon \tau_1 = \tau_2 = \lambda \nu_1 + (1-\lambda) \nu_2}} \beta_2^{\lambda \nu_1 + (1-\lambda) \nu_2}(\tau) \\ &= I(\lambda \nu_1 + (1-\lambda) \nu_2). \end{align}

Taking the infimum on the left-hand side over all such τ (1) and τ (2) yields the claim.

Claim 2.2

If the upper bound of Theorem 1.1 holds, and, additionally, I has compact sublevel sets, then the upper bound extends to all functions $F\colon \mathcal{P}(E) \rightarrow [\!-\!\infty,\infty)$ which are upper semicontinuous and bounded from above.

Proof. Let $F\colon \mathcal{P}(E) \rightarrow [\!-\!\infty,\infty)$ be upper semicontinuous and bounded from above. Define $F_m := (-m) \vee F$ (m ∈ ℕ). By assumption, for all m ∈ ℕ,

$$\limsup_{n\rightarrow \infty} \frac{1}{n} \rho_n(nF \circ L_n) \leq \limsup_{n\rightarrow \infty} \frac{1}{n} \rho_n(nF_m \circ L_n) \leq \sup_{\nu \in \mathcal{P}(E)} ( F_m(\nu) - I(\nu)),$$

so it only remains to show that

$$\limsup_{m \to \infty} S_m := \limsup_{m \to \infty} \sup_{\nu \in \mathcal{P}(E)} ( F_m(\nu) - I(\nu)) \leq \sup_{\nu \in \mathcal{P}(E)} (F(\nu) - I(\nu)) =: S.$$

The S m are decreasing (for increasing m). If S m → −∞ there is nothing to show. So assume that the S m are bounded below by C ∈ ℝ. Choose $\nu_m \in \mathcal{P}(E)$ such that

$$F_m(\nu_m) - I(\nu_m) \geq S_m - \frac{1}{m} \geq C-1.$$

So the $I(\nu_m)$ are uniformly bounded. Since I has compact sublevel sets, for any subsequence we can choose a further subsequence, still denoted by $(\nu_m)_{m\in\mathbb{N}}$, such that $\nu_m \buildrel w \over \longrightarrow \nu_\infty$ for some $\nu_\infty \in \mathcal{P}(E)$. For $j \leq m$ we have $F_m \leq F_j$, so the upper semicontinuity of $F_j$ and $-I$ yields

$$\limsup_{m \to \infty} (F_m(\nu_m) - I(\nu_m)) \leq F_j(\nu_\infty) - I(\nu_\infty) \quad\text{for all } j \in \mathbb{N}.$$

Letting $j \to \infty$ gives $\limsup_{m \to \infty} (F_m(\nu_m) - I(\nu_m)) \leq F(\nu_\infty) - I(\nu_\infty) \leq S$ and, therefore, $\limsup_{m\to\infty} S_m \leq S$.

3. Applications to robust Markov chains

3.1. Robust large deviations

In this section (E, d) is assumed to be compact. The main goal of this section is to show Theorem 1.2 and illustrate it in Example 3.1. To this end, we show the respective upper bound in Theorem 3.1 and the respective lower bound in Lemma 3.6. The intermediate results in this section are concerned with representation formulae for the functionals β n (see Lemmas 3.1 and 3.5) and the verification of conditions (B1) and (B2) (see Lemmas 3.2, 3.3, and 3.4).

In the following part leading up to Theorem 3.1, we assume that π satisfies the Feller property. We work with

$$\beta(\nu,\mu) := \inf_{\hat{\mu} \colon d_W(\mu,\hat{\mu}) \leq r} R(\nu,\hat{\mu}) = \inf_{\hat{\mu} \in M_1(\mu)} R(\nu,\hat{\mu})$$

for some fixed r ≥ 0. Recall that

\begin{multline*} M_n(\theta) := \{\nu \in \mathcal{P}(E^n) \colon d_W(\nu_{0,1},\theta)\leq r \text{ and } d_W(\nu_{i,i+1} (x_1,\ldots,x_i),\pi(x_i)) \leq r, \nu\text{-a.s.} \\ \text{for } i=1,\ldots,n-1\}. \end{multline*}

To be precise, the above definition requires the condition $d_W(\nu_{i,i+1}(x_1, \ldots, x_i), \pi(x_i)) \leq r$ to hold for ν-almost all $(x_1, \ldots, x_n) \in E^n$ for every decomposition of ν, where the respective ν-null set may depend on the given decomposition. Equivalently, the definition could state that there has to exist one decomposition of ν for which this condition holds pointwise. That these two notions are equivalent follows from the fact that decompositions of ν are unique only up to ν-almost-sure equality.
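To make these constraints concrete, the following minimal Python sketch (our illustration, not part of the formal development; all function names and numerical values are ours) evaluates $d_W$ between two discrete distributions as the standard optimal-transport linear program and uses it to test the kernel condition $d_W(K(x), \pi(x)) \leq r$ appearing in the definition of $M_n(\theta)$; on a finite space with the discrete metric, $d_W$ coincides with the total variation distance.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(p, q, D):
    """First Wasserstein distance between discrete distributions p and q on a
    finite metric space with distance matrix D, via the optimal-transport LP:
    minimize sum_ij D[i, j] * coupling[i, j] over couplings of p and q."""
    n = len(p)
    A_eq, b_eq = [], []
    for i in range(n):                      # row marginals: sum_j pi[i, j] = p[i]
        row = np.zeros(n * n); row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row); b_eq.append(p[i])
    for j in range(n):                      # column marginals: sum_i pi[i, j] = q[j]
        col = np.zeros(n * n); col[j::n] = 1.0
        A_eq.append(col); b_eq.append(q[j])
    res = linprog(D.reshape(-1), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (n * n))
    return res.fun

# Three-point space with the discrete metric, as in Example 3.1 below.
D = 1.0 - np.eye(3)
pi_x = np.array([0.6, 0.2, 0.2])            # a row pi(x) of a transition kernel
K_x = np.array([0.55, 0.2, 0.25])           # a candidate kernel row K(x)
r = 0.05
print(wasserstein1(K_x, pi_x, D) <= r + 1e-9)   # True: K(x) is admissible
```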

Lemma 3.1

(See also [Reference Bartl3, Lemma 4.4] and [Reference Lacker34, Proposition 5.2].) For all n ∈ ℕ, it holds that

$$\beta_n^{\theta}(\nu) = \inf_{\hat{\mu} \in M_n(\theta)} R(\nu,\hat{\mu}).$$

Proof. Fix $\theta \in \mathcal{P}(E)$. Define the sets

$$Q_0 := \{\hat{\mu} \in \mathcal{P}(E) \colon d_W(\hat{\mu},\theta) \leq r\}$$

and, for i = 1, …, n − 1 and x 1, …, x iE,

$$Q_i(x_1,\ldots,x_i) := \{\hat{\mu} \in \mathcal{P}(E) \colon d_W(\hat{\mu},\pi(x_i)) \leq r\}.$$

We note that $M_n(\theta) = Q_0 \otimes Q_1 \otimes \cdots \otimes Q_{n-1}$, where $Q_0 \otimes Q_1 \otimes \cdots \otimes Q_{n-1}$ is defined as the set of measures $\mu = K_0 \otimes K_1 \otimes K_2 \otimes \cdots \otimes K_{n-1} \in \mathcal{P}(E^n)$, where $K_0 \in Q_0$ and $K_i \colon E^{i} \rightarrow \mathcal{P}(E)$ are Borel measurable kernels such that $K_i(x_1,\ldots,x_i) \in Q_i(x_1,\ldots,x_i)$ for μ-almost all $x_1, \ldots, x_i$. Since, for all $i = 1, \ldots, n-1$, the set $\{(x_1,\ldots,x_i,\hat{\mu}) \in E^i \times \mathcal{P}(E) \colon \hat{\mu} \in Q_i(x_1,\ldots,x_i)\}$ is Borel, a measurable selection argument (e.g. [Reference Bertsekas and Shreve6, Proposition 7.50]) yields, for $\nu \in \mathcal{P}(E^n)$,

\begin{align} &\inf_{\hat{\mu} \in M_n(\theta)} R(\nu,\hat{\mu}) \\ &\quad \quad= \inf_{K_0\otimes \cdots \otimes K_{n-1} \in Q_0 \otimes \cdots \otimes Q_{n-1}} \sum_{i=0}^{n-1} \int_{E^n} R(\nu_{i,i+1}(x_1,\ldots,x_i),K_i(x_1,\ldots,x_i)) \nu(\text{d} x_1,\ldots,\text{d} x_n) \\ &\quad \quad\stackrel{(*)}{=} \sum_{i=0}^{n-1} \int_{E^n} \inf_{\hat{\mu} \in Q_i(x_1,\ldots,x_i)} R(\nu_{i,i+1}(x_1,\ldots,x_i),\hat{\mu}) \nu(\text{d} x_1,\ldots,\text{d} x_n) \\ &\quad \quad= \beta(\nu_{0,1},\theta) + \sum_{i=1}^{n-1} \int_{E^n} \beta(\nu_{i,i+1}(x_1,\ldots,x_i),\pi(x_i)) \nu(\text{d} x_1,\ldots,\text{d} x_n) \\ &\quad \quad= \beta_n^{\theta}(\nu), \end{align}

where rigorously step (*) works inductively; see the proofs of [Reference Bartl3, Lemma 4.4] and [Reference Lacker34, Proposition 5.2].

Lemma 3.2

Let $\theta_1, \theta_2 \in \mathcal{P}(E)$, $\nu_1 \in M_2(\theta_1)$, $\nu_2 \in M_2(\theta_2)$, and λ ∈ (0, 1). Then

$$\lambda \nu_1 + (1-\lambda) \nu_2 \in M_2(\lambda \theta_1 + (1-\lambda) \theta_2).$$

Proof. Write $\nu_1 = \mu_1 \otimes K_1$ and $\nu_2 = \mu_2 \otimes K_2$ for some $\mu_1, \mu_2 \in \mathcal{P}(E)$ and stochastic kernels $K_1, K_2$ on E. Furthermore, $K_1$ and $K_2$ can be chosen such that $d_W(K_i(x), \pi(x)) \leq r$ for all x ∈ E and i ∈ {1, 2}. We have the equality

(3.1)\begin{align}\label{eqa1} \lambda \nu_1 + (1-\lambda)\nu_2 = (\lambda \mu_1 + (1-\lambda)\mu_2) \otimes K, \end{align}

where $K\colon E \rightarrow \mathcal{P}(E)$ is defined by

$$K(x) = \frac{\text{d}\mu_1}{\text{d}(\lambda\mu_1 + (1-\lambda)\mu_2)}(x)\, \lambda K_1(x) + \frac{\text{d}\mu_2}{\text{d}(\lambda\mu_1 + (1-\lambda)\mu_2)}(x)\, (1-\lambda) K_2(x) =: \lambda_x K_1(x) + (1-\lambda_x) K_2(x).$$

Equation (3.1) obviously holds for Borel sets of the form $A \times B \subseteq E^2$, and the equality extends to arbitrary Borel sets by the uniqueness part of Carathéodory’s extension theorem. So K is a pointwise convex combination of $K_1$ and $K_2$. Since, for the first Wasserstein distance, the Kantorovich duality (see, e.g. [Reference Villani45, Chapter 5]) implies that

\begin{align} d_W(\lambda \mu_1 + (1-\lambda) \mu_2, \lambda \theta_1 + (1-\lambda) \theta_2) \leq \lambda d_W(\mu_1,\theta_1) + (1-\lambda) d_W(\mu_2,\theta_2) \leq r \end{align}

and, for all xE,

\begin{align} &d_W(\lambda_x K_1(x) + (1-\lambda_x) K_2(x), \lambda_x \pi(x) + (1-\lambda_x) \pi(x)) \\ &\quad \quad\leq \lambda_x d_W(K_1(x),\pi(x)) + (1-\lambda_x) d_W(K_2(x),\pi(x)) \\ &\quad \quad\leq r, \end{align}

the claim follows.
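For the reader’s convenience, the computation behind (3.1) on product sets reads as follows; it uses only the definition of $\lambda_x$, which gives $\lambda_x \,\text{d}(\lambda\mu_1 + (1-\lambda)\mu_2) = \lambda \,\text{d}\mu_1$ and $(1-\lambda_x) \,\text{d}(\lambda\mu_1 + (1-\lambda)\mu_2) = (1-\lambda) \,\text{d}\mu_2$. For Borel sets A, B ⊆ E,

\begin{align} ((\lambda \mu_1 + (1-\lambda)\mu_2) \otimes K)(A\times B) &= \int_A \big(\lambda_x K_1(x,B) + (1-\lambda_x) K_2(x,B)\big) (\lambda \mu_1 + (1-\lambda)\mu_2)(\text{d} x) \\ &= \lambda \int_A K_1(x,B)\, \mu_1(\text{d} x) + (1-\lambda) \int_A K_2(x,B)\, \mu_2(\text{d} x) \\ &= (\lambda \nu_1 + (1-\lambda) \nu_2)(A \times B). \end{align}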

That $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$ is convex follows by the previous lemma and the convexity of R(·, ·), since

\begin{align} &\beta_2^{\lambda \theta_1 + (1-\lambda) \theta_2}(\lambda \nu_1 + (1-\lambda) \nu_2) \\ &\quad \quad= \inf_{\hat{\mu} \in M_2(\lambda \theta_1 + (1-\lambda) \theta_2)} R(\lambda \nu_1 + (1-\lambda) \nu_2,\hat{\mu}) \\ &\quad \quad\stackrel{\text{(Lemma 3.2)}}{\leq} \inf_{\hat{\mu}_1 \in M_2(\theta_1),\, \hat{\mu}_2 \in M_2(\theta_2)} R(\lambda \nu_1 + (1-\lambda) \nu_2, \lambda \hat{\mu}_1 + (1-\lambda) \hat{\mu}_2) \\ &\quad \quad\leq \inf_{\hat{\mu}_1 \in M_2(\theta_1),\, \hat{\mu}_2 \in M_2(\theta_2)} (\lambda R(\nu_1,\hat{\mu}_1) + (1-\lambda) R(\nu_2,\hat{\mu}_2)) \\ &\quad \quad= \lambda \beta_2^{\theta_1}(\nu_1) + (1-\lambda) \beta_2^{\theta_2}(\nu_2). \end{align}

It remains to show that $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$ is lower semicontinuous. To this end, we first show the following.

Lemma 3.3

If π satisfies the Feller property then M 2(θ) is closed.

Proof. Recall that μKM 2(θ) if and only if both

(3.2)\begin{align} d_W(\mu,\theta) &\leq r \label{firstcon}, \end{align}
(3.3)\begin{align} d_W(K(x),\pi(x)) &\leq r \quad\text{for } \mu \text{-a.a. } x\in E. \label{secondcon} \end{align}

Condition (3.2) defines a closed set (this is obvious once it is rewritten via the Kantorovich duality), so we focus on condition (3.3). Since, by assumption, (E, d) is compact and thus totally bounded, the set of Lipschitz-1 functions mapping E into ℝ that are bounded in absolute value by 1 (denoted by Lip1) is separable with respect to the sup-norm (this follows since the space of bounded continuous functions on a compact metric space is separable and every subset of a separable metric space is again separable). We denote by {f 1, f 2, …} ⊆ Lip1 a countable dense subset. Furthermore, we are going to use the fact that, for bounded and measurable functions h: E → ℝ and $\nu \in \mathcal{P}(E)$,

\begin{align} (h \geq 0, \nu\text{-a.s.}) \quad\Longleftrightarrow\quad \bigg(\text{for all } g\in C_b(E),g\geq 0\colon \int_E g(x) h(x) \nu(\text{d} x) \geq 0\bigg), \end{align}

which holds because E is a Polish space and thus the indicator function $\mathbf{1}_A$ of the Borel set A := {h < 0} can be approximated in L 1 (ν) by a sequence of nonnegative, continuous, and bounded functions.

We can rewrite condition (3.3) as follows:

\begin{align} &d_W(K(x),\pi(x)) \leq r \text{ for } \mu \text{-a.a. } x\in E \\ &\quad \quad\Longleftrightarrow\quad \biggl(\text{for all } f \in \text{Lip}_1\colon \int_E f \,\text{d} K(x) - \int_E f \,\text{d} \pi(x) \leq r\biggr) \quad\text{for } \mu \text{-a.a. } x\in E \\ &\quad \quad\Longleftrightarrow\quad\biggl( \text{for all }i \in \mathbb{N}\colon \int_E f_i \,\text{d} K(x) - \int_E f_i \,\text{d} \pi(x) \leq r\biggr) \quad\text{for } \mu \text{-a.a. } x\in E \\ &\quad \quad\Longleftrightarrow\quad\biggl( \text{for all } i \in \mathbb{N} \text{ and all } g \in C_b(E),\, g\geq 0 \colon \\ &\quad \quad\quad \quad \int_E g(x) \biggl( \int_E f_i(y)\, K(x,\text{d} y) - \int_E f_i(y)\, \pi(x,\text{d} y) - r \biggr) \mu(\text{d} x) \leq 0 \biggr) \\ &\quad \quad\Longleftrightarrow\quad\biggl( \text{for all } i \in \mathbb{N} \text{ and all } g \in C_b(E),\, g\geq 0 \colon \int_{E^2} g(x) f_i(y)\, \mu \otimes K (\text{d} x,\text{d} y) \\ &\quad \quad\quad \quad - \int_E g(x) \biggl(\int_E f_i \,\text{d} \pi(x) + r \biggr) \mu(\text{d} x) \leq 0 \biggr). \end{align}

The last line expresses a closed condition if π satisfies the Feller property, which guarantees that x ↦ ∫E f dπ(x) is continuous for all fC b(E).

Lemma 3.4

The function $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$ is lower semicontinuous.

Proof. Let $(\theta_n,\nu_n) \buildrel w \over \longrightarrow (\theta,\nu) \in \mathcal{P}(E) \times \mathcal{P}(E^2)$ as n → ∞. We have to show that

$$\liminf_{n\rightarrow \infty} \beta_2^{\theta_n}(\nu_n) \geq \beta_2^{\theta}(\nu),$$

which is done by choosing an arbitrary subsequence and showing that there exists a further subsequence along which the inequality holds. So we start with a subsequence still denoted by $(\theta_n, \nu_n)_{n\in\mathbb{N}}$. Let $\hat{\mu}_n \in M_2(\theta_n)$ be such that

$$\beta_2^{\theta_n}(\nu_n) \geq R(\nu_n,\hat{\mu}_n) - \frac{1}{n},$$

and choose a further subsequence still denoted by (θ n, ν n)n∈ℕ such that d W(θ n, θ) ≤ 1/n and $\hat{\mu}_n$ converges weakly to some $\hat{\mu} \in \mathcal{P}(E^2)$. We show that $\hat{\mu} \in M_2(\theta)$. To this end, define

$$M_2^{r,n}(\theta) := \left\{ \mu \otimes K \in \mathcal{P}(E^2) \colon d_W(\mu,\theta) \leq r+\frac{1}{n},~ d_W(K(x),\pi(x))\leq r \text{ for } \mu\text{-a.a.~} x\in E\right\},$$

which is closed, as the proof of the previous lemma trivially carries over to this set. We see that $\hat{\mu}_m \in M_2^{r,n}(\theta)$ for all $m \geq n$, and, therefore, $\hat{\mu} \in M_2^{r,n}(\theta)$ for all n ∈ ℕ, which yields $\hat{\mu} \in M_2(\theta)$. Finally, by the lower semicontinuity of R(·, ·), we obtain

$$\liminf_{n \to \infty} \beta_2^{\theta_n}(\nu_n) \geq \liminf_{n \to \infty} R(\nu_n,\hat{\mu}_n) \geq R(\nu,\hat{\mu}) \geq \inf_{\mu \in M_2(\theta)} R(\nu,\mu) = \beta_2^{\theta}(\nu).$$

The rate function I corresponding to the choice of β as defined at the beginning of the section is given by

$$I(\nu) := \inf_{q \colon \nu q = \nu} \int_E \inf_{K_x \in M_1(\pi(x))} R(q(x),K_x)\, \nu(\text{d} x) \quad\text{for } \nu \in \mathcal{P}(E).$$

Using the above observations to apply the main theorem, we obtain the following result.

Theorem 3.1

For all functions $F \colon \mathcal{P}(E) \rightarrow [\!-\!\infty,\infty)$ which are upper semicontinuous and bounded from above, it holds that

$$\limsup_{n\rightarrow \infty} \sup_{\mu \in M_n(\pi_0)} \frac{1}{n} \ln \int_{E^n} \exp(n F \circ L_n) \text{d}\mu \leq \sup_{\nu \in \mathcal{P}(E)} ( F(\nu) - I(\nu) ).$$

Furthermore, for all closed sets $A \subseteq \mathcal{P}(E)$, it holds that

$$\limsup_{n\rightarrow \infty} \sup_{\mu \in M_n(\pi_0)} \frac{1}{n} \ln \mu(L_n \in A) \leq -\inf_{\nu \in A} I(\nu).$$

Proof. For the first claim, apply Theorem 1.1, which, by the compactness of E and Corollary 1.1, extends to all functions $F \colon \mathcal{P}(E) \rightarrow [\!-\!\infty,\infty)$ that are upper semicontinuous and bounded from above. To arrive at the given form of $\rho_n$, we use Lemma 3.1 to obtain, for $f \in C_b(E^n)$,

\begin{align} \rho_n(f) &= \sup_{\nu\in\mathcal{P}(E^n)} \bigg( \int_{E^n} f \,\text{d}\nu - \inf_{\mu \in M_n(\pi_0)} R(\nu,\mu) \bigg) \\ &= \sup_{\mu \in M_n(\pi_0)} \sup_{\nu\in\mathcal{P}(E^n)} \bigg(\int_{E^n} f \,\text{d}\nu - R(\nu,\mu) \bigg)\\ &= \sup_{\mu \in M_n(\pi_0)} \ln \int_{E^n} \exp(f) \,\text{d}\mu, \end{align}

where the last step follows by the Gibbs variational formula for the relative entropy.
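As a quick sanity check of the Gibbs variational formula, the following Python snippet (our illustration; the reference distribution and test function are arbitrary) verifies the identity $\ln \int \exp(f) \,\text{d}\mu = \sup_{\nu}(\int f \,\text{d}\nu - R(\nu,\mu))$ on a finite space, using the fact that the supremum is attained by the Gibbs measure with density $\text{d}\nu^*/\text{d}\mu = \exp(f)/\int \exp(f) \,\text{d}\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.dirichlet(np.ones(5))          # a reference distribution on 5 points
f = rng.normal(size=5)                  # a bounded "test function"

lhs = np.log(np.sum(np.exp(f) * mu))    # ln int exp(f) dmu

nu_star = np.exp(f) * mu                # Gibbs measure, density e^f w.r.t. mu
nu_star /= nu_star.sum()
rel_ent = np.sum(nu_star * np.log(nu_star / mu))    # R(nu*, mu)
rhs = np.sum(nu_star * f) - rel_ent     # value of the supremum at nu*

print(np.isclose(lhs, rhs))             # True
```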

With the first claim established, the second claim follows by choosing $F = -\infty \cdot \mathbf{1}_{A^C}$ for a closed set $A \subseteq \mathcal{P}(E)$.
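Spelling out this last step: with the convention exp(−∞) = 0 we have $\exp(n F \circ L_n) = \mathbf{1}_{\{L_n \in A\}}$, the function F is upper semicontinuous and bounded from above because A is closed, and

$$\frac{1}{n} \ln \int_{E^n} \exp(n F \circ L_n) \,\text{d}\mu = \frac{1}{n} \ln \mu(L_n \in A), \quad\quad \sup_{\nu \in \mathcal{P}(E)} ( F(\nu) - I(\nu) ) = \sup_{\nu \in A} (-I(\nu)) = -\inf_{\nu \in A} I(\nu).$$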

For the large deviations bound in Theorem 3.1 to be nonvacuous for a closed set $A \subseteq \mathcal{P}(E)$, it is required that

(3.4)\begin{align} \label{sense} \inf_{\nu \in A} I(\nu) > 0. \end{align}

Intuitively, (3.4) holds if and only if, for all pairs ν ∈ A and q with νq = ν, there is some Borel set S ⊆ E with ν(S) > 0 such that $d_W(q(x), \pi(x)) > r$ for all x ∈ S.

To properly address the question of whether the attained bound is sharp, we need a lower bound in accordance with the upper bound. The choice of β that leads to Theorem 3.1 cannot yield a lower bound with our approach, since Condition (B3) is not satisfied for r > 0 and, hence, the lower bound of Theorem 1.1 cannot be applied.

In the following we therefore consider the functional $\underline{\beta}$, which is chosen such that it resembles β and satisfies (B3), albeit at the cost of not satisfying (B2). This will lead to the lower bound of Theorem 1.2 proven in Lemma 3.6. Define

\begin{gather*} \underline{\beta}(\nu,\mu) := \inf_{\substack{\hat{\mu} \colon d_W(\mu,\hat{\mu}) \leq r,\, \hat{\mu} \ll \mu}} R(\nu,\hat{\mu}), \quad \quad \underline{M}_n(\theta) := \{\nu \in M_n(\theta) \colon \nu \ll \theta \otimes \pi \otimes \cdots \otimes \pi\}, \\ \underline{I}(\nu) := \inf_{q \colon \nu q = \nu} \int_E\inf_{K_x \in \underline{M}_1(\pi(x))} R(q(x),K_x)\, \nu(\text{d} x). \end{gather*}

Furthermore, we assume that, for the analysis of the lower bound, π satisfies Assumption M, but it no longer has to satisfy the Feller property.

Lemma 3.5

(See also [Reference Bartl3, Lemma 4.4] and [Reference Lacker34, Proposition 5.2].) For all n ∈ ℕ, it holds that

$$\underline{\beta}_n^{\theta}(\nu) = \inf_{\mu \in \underline{M}_n(\theta)} R(\nu,\mu).$$

Proof. The proof is the same as that of Lemma 3.1, except here we need measurability of the sets

$$S_i := \{ (x_1,\ldots,x_i,\hat{\mu}) \in E^i \times \mathcal{P}(E) : d_W(\hat{\mu},\pi(x_i)) \leq r \text{ and } \hat{\mu} \ll \pi(x_i)\}$$

for i ∈ {1, …, n − 1}. That these sets are indeed Borel measurable can be seen as follows: Define the function $g \colon \mathcal{P}(E) \times \mathcal{P}(E) \times E \rightarrow \mathbb{R}_+$ by

$$g(\mu,\nu,x) = \frac{\text{d}\mu_{|\nu}}{\text{d}\nu}(x).$$

Here μ |ν denotes the absolutely continuous part of μ with respect to ν as given by Lebesgue’s decomposition theorem. Then g is Borel as shown in [Reference Dellacherie and Meyer15, V.58 and subsequent remark]. We have μν if and only if ∫E g(μ, ν,· ) dν = 1, which shows that S i is Borel (as the other conditions that define S i are trivially Borel).

To arrive at the given form of $\rho_n$, we use the following equivalence for measures $\nu_1,\nu_2\in\mathcal{P}(E)$ and stochastic kernels $K_1,K_2\colon E \rightarrow \mathcal{P}(E)$ (see, e.g. [Reference Bartl3, Lemma A.2]):

\begin{align} &(\nu_1 \otimes K_1 \ll \nu_2 \otimes K_2 \text{ in } \mathcal{P}(E^2)) \\ &\quad \quad\Longleftrightarrow\quad(\nu_1 \ll \nu_2\text{ and }K_1(x) \ll K_2(x)\text{ for }\nu_1\text{-almost all }x\in E).\tag*{\qedhere} \end{align}

In complete analogy to the choice of β leading to Theorem 3.1, we see that $\underline{\beta}$ satisfies (B1), which is a consequence of Lemma 3.5 in combination with Lemma 3.2, where one additionally uses the fact that

\begin{align} &(\mu_1 \ll \theta_1 \text{ and } \mu_2 \ll \theta_2) \\ &\quad \quad\Longrightarrow\quad \lambda \mu_1 + (1-\lambda) \mu_2 \ll \lambda \theta_1 + (1-\lambda) \theta_2 \quad\text{for } \mu_1,\mu_2,\theta_1,\theta_2 \in \mathcal{P}(E),\,\lambda \in (0,1). \end{align}

As (B3) and Assumption M are satisfied as well, Theorem 1.1 yields, for all $F\in C_b(\mathcal{P}(E))$,

$$\liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \frac{1}{n} \ln \int_{E^n} \exp(n F \circ L_n) \,\text{d}\mu \geq \sup_{\nu \in \mathcal{P}(E)} ( F(\nu) - \underline{I}(\nu)),$$

which leads to the following lemma.

Lemma 3.6

Let Assumption M be satisfied. For $G \subseteq \mathcal{P}(E)$ open, it holds that

$$\liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \frac{1}{n} \ln \mu(L_n \in G) \geq - \inf_{\nu \in G} \underline{I}(\nu).$$

Proof. The proof is an adapted version of [Reference Dupuis and Ellis20, Theorem 1.2.3].

We work with the Laplace principle lower bound stated just before the lemma. Without loss of generality, assume that $\inf_{\nu \in G} \underline{I}(\nu) < \infty$. Let ν ∈ G be such that $\underline{I}(\nu) < \infty$. Choose M ∈ ℝ such that $\underline{I}(\nu) < M$ and k ∈ ℕ such that $B(\nu,{1}/{k}) := \{ \mu \in \mathcal{P}(E) \colon \skew3\hat{d}(\mu,\nu) \leq {1}/{k} \} \subseteq G$, where $\skew3\hat{d}$ is some metric on $\mathcal{P}(E)$ compatible with weak convergence; for brevity, set δ := 1/k. Define

$$h(\theta) := -M((\skew3\hat{d}(\nu,\theta) \cdot k) \wedge 1).$$

We find that $-M \leq h \leq 0$, h(ν) = 0, and h(θ) = −M for θ ∈ B(ν, δ)C. Thus, for any $\mu \in \mathcal{P}(E^n)$,

$$\int_{E^n} \exp(n h\circ L_n) \,\text{d}\mu \leq \exp(-nM) + \mu(L_n \in B(\nu,\delta)) \leq \max\{2 \exp(-nM),\, 2\mu(L_n \in B(\nu,\delta))\}.$$

Therefore,

\begin{align} &\max\biggl\{ \liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \frac{1}{n} \ln \mu(L_n \in B(\nu,\delta)) , -M \biggr\} \\ &\quad \quad\geq \liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \frac{1}{n} \ln \int_{E^n} \exp(n h\circ L_n) \,\text{d}\mu \\ &\quad \quad\geq \sup_{\hat{\nu}\in\mathcal{P}(E)} ( h(\hat{\nu}) - \underline{I}(\hat{\nu})) \\ &\quad \quad\geq h(\nu)-\underline{I}(\nu) \\ &\quad \quad= -\underline{I}(\nu). \end{align}

Since $M > \underline{I}(\nu)$,

$$\liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \frac{1}{n} \ln \mu(L_n \in B(\nu,\delta)) \geq -\underline{I}(\nu),$$

and using the fact that B(ν, δ) ⊆ G together with the observation that the above reasoning applies to every ν ∈ G with $\underline{I}(\nu) < \infty$, we obtain the claim.

The proof of Theorem 1.2 is now complete, as it follows from Theorem 3.1 and Lemma 3.6.

The following illustrates the obtained results. Note that, to calculate the rates, the minimization of I over A that is required, as is usual in large deviations theory, can be carried out efficiently (at least in principle) for convex sets A, since I is convex.

Example 3.1

Consider the state space {1, 2, 3} with discrete metric, i.e. d(i, j) = 0 if i = j and d(i, j) = 1 otherwise. The Markov chain is given by its initial distribution π 0 = δ 3 and transition kernel π with matrix representation

$$\begin{bmatrix} 0.6 & 0.2 & 0.2 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0.3 & 0.7 \end{bmatrix}\!.$$

Suppose that we are interested in the tail event that the empirical measure L n under the Markov chain is close (in a certain sense) to the initial distribution π 0. We are uncertain of the precise model specification of the Markov chain and want to find the worst case (i.e. slowest possible) convergence rate to 0 of this tail event.

Formally, let r = 0.05 and take, for κ = 0.2, the set of measures $A = B_{d_W}(\delta_3, \kappa)$, i.e. the Wasserstein-1 ball around $\delta_3$ with radius κ. The set $\{L_n \in A\}$ models the abovementioned tail event. What is the (exponential) asymptotic rate of convergence of

(3.5)\begin{align} \label{eqrate} \sup_{\mu \in M_n(\delta_3)} \mu(L_n \in A) \rightarrow 0 \end{align}

as n → ∞? Note that r and the transition kernel are as always implicitly included in M n(δ 3).

Calculating the upper bound of Theorem 1.2 yields a worst-case exponential rate:

$$r_{\text{worst case}} \approx 0.0511.$$

This is significantly lower than the normal rate for the Markov chain without the robustness (i.e. the r = 0 case), which is

$$r_{\text{normal}} \approx 0.0910.$$

In Figure 1 we present the difference in convergence speed. Notably, the optimizer of the optimization problem to obtain the worst-case rate also yields a kernel $\hat{\pi}$ such that $\pi_0 \otimes \hat{\pi} \otimes \cdots \otimes \hat{\pi} \in M_n(\pi_0)$ and the Markov chain with transition kernel $\hat{\pi}$ attains the worst-case rate, i.e.

$$\pi_0 \otimes \hat{\pi} \otimes \cdots \otimes \hat{\pi} (L_n \in A) \sim \exp(\!-\!n \cdot r_{\text{worst case}}).$$

FIGURE 1: Illustration of convergence rates, simulated (100 paths) realized convergence, and the stationary distributions under the normal Markov chain and the robust worst-case Markov chain.

In other words, the worst-case rate in (3.5) is attained and one sequence of optimal measures is Markovian, with transition kernel $\hat{\pi}$ given by the matrix

$$\begin{bmatrix} 0.6-r & 0.2 & 0.2+r \\ 0.3-r & 0.4 & 0.3+r \\ 0 & 0.3-r & 0.7+r \end{bmatrix}\!.$$

In Figure 1 we also present a simulated convergence rate for both the initial Markov chain and the Markov chain with worst-case transition kernel $\hat{\pi}$ (100 paths simulated), and a comparison of the respective stationary distributions.
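A minimal Monte Carlo sketch in the spirit of Figure 1 (our illustration: the matrices are those of this example, while the sample sizes, the seed, and the function names are ours). Under the discrete metric, $d_W(L_n, \delta_3) = 1 - L_n(\{3\})$, so $\{L_n \in A\} = \{L_n(\{3\}) \geq 0.8\}$, and the snippet estimates this probability under both π and $\hat{\pi}$; the empirical slopes $-({1}/{n}) \ln \mu(L_n \in A)$ approach the asymptotic rates 0.0910 and 0.0511 only slowly, and very small probabilities require many paths.

```python
import numpy as np

rng = np.random.default_rng(1)
r = 0.05
P = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3],
              [0.0, 0.3, 0.7]])
P_hat = np.array([[0.6 - r, 0.2, 0.2 + r],        # worst-case kernel of the example
                  [0.3 - r, 0.4, 0.3 + r],
                  [0.0, 0.3 - r, 0.7 + r]])

def tail_prob(P, n, paths=200_000):
    """Monte Carlo estimate of mu(L_n in A) = P(L_n({3}) >= 0.8) for the
    chain with transition matrix P started in state 3 (pi_0 = delta_3)."""
    cum = P.cumsum(axis=1)
    x = np.full(paths, 2)                         # state 3 carries index 2
    visits = np.zeros(paths)
    for _ in range(n):                            # count visits along x_1, ..., x_n
        visits += (x == 2)
        x = np.minimum((rng.random(paths)[:, None] > cum[x]).sum(axis=1), 2)
    return np.mean(visits / n >= 0.8)

for n in (25, 50, 100):
    p, p_hat = tail_prob(P, n), tail_prob(P_hat, n)
    print(n, -np.log(max(p, 1e-12)) / n, -np.log(max(p_hat, 1e-12)) / n)
```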

Note that in the above example the rates are asymptotically sharp, as the worst-case kernel $\hat{\pi}$ for the rate function is already absolutely continuous with respect to π, so using $\underline{I}$ instead of I yields the same rate.

Using the above example, we can get an idea of when the upper and lower bounds of Theorem 1.2 may not coincide. If we do not restrict ourselves to $\underline{M}_n$, it may happen that no optimal kernel $\hat{\pi}$ is absolutely continuous with respect to the initial kernel π. In this case, we can no longer guarantee that some near-optimal kernel $\hat{\pi}$ satisfies condition (M1), which is also needed in the nonrobust case to show the large deviations lower bound.

3.2. Robust weak law of large numbers

Let (E, d) be compact. In this section, Theorem 1.3 is proven. We first show the upper bound in Theorem 3.2 and then explain how to obtain the lower bound.

Up to Theorem 3.2, let π satisfy the Feller property. Define

$$\beta(\mu,\nu) := \begin{cases} 0 & \text{if } d_W(\mu,\nu)\leq r, \\ \infty & \text{otherwise}, \end{cases}$$

for some r ≥ 0 and obtain

$$\beta_n^{\theta}(\nu) = \infty \cdot {\bf 1}_{(M_n(\theta))^C}(\nu).$$
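Spelling this out: the penalty acts as a hard constraint, so that, for $f \in C_b(E^n)$,

$$\rho_n(f) = \sup_{\nu \in \mathcal{P}(E^n)} \bigg( \int_{E^n} f \,\text{d}\nu - \beta_n^{\pi_0}(\nu) \bigg) = \sup_{\nu \in M_n(\pi_0)} \int_{E^n} f \,\text{d}\nu,$$

which explains the form of the bound in Theorem 3.2 below.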

Lemma 3.7

The function $\beta_2^{\cdot}(\kern-1pt\cdot\kern-1pt)$ is convex and lower semicontinuous.

Proof. We first show convexity. Let $\theta_1, \theta_2 \in \mathcal{P}(E)$, $\nu_1, \nu_2 \in \mathcal{P}(E^2)$, and λ ∈ (0, 1). We have to show that

$$\beta_2^{\lambda \theta_1 + (1-\lambda) \theta_2}(\lambda \nu_1 + (1-\lambda) \nu_2) \leq \lambda \beta_2^{\theta_1}(\nu_1) + (1-\lambda) \beta_2^{\theta_2}(\nu_2).$$

To this end, it suffices to show that if the right-hand side is 0, the left-hand side has to be 0 as well. If the right-hand side is 0 then both ν 1M 2(θ 1) and ν 2M 2(θ 2). It follows by Lemma 3.2 that $\lambda\nu_1 + (1 - \lambda)\nu_2 \in M_2(\lambda\theta_1 + (1 - \lambda)\theta_2)$ and, thus, the left-hand side is also 0.

We now show lower semicontinuity. Let $ (\theta_n,\nu_n) \buildrel w \over \longrightarrow (\theta,\nu) \in \mathcal{P}(E)\times \mathcal{P}(E^2) $. We have to show that

$$\liminf_{n\rightarrow \infty} \beta_2^{\theta_n}(\nu_n) \geq \beta_2^{\theta}(\nu).$$

Without loss of generality, the left-hand side is not equal to ∞; we have to show that the right-hand side is 0. We first choose an arbitrary subsequence and then a further subsequence, still denoted by $(\theta_n, \nu_n)_{n\in\mathbb{N}}$, such that, for all n ∈ ℕ,

\begin{align} \beta_2^{\theta_n}(\nu_n) < \infty,\quad \quad d_W(\theta_n,\theta) \leq \frac{1}{n}. \end{align}

It follows that $\nu_n \in M_2(\theta_n)$ for all n ∈ ℕ; with the same notation and arguments as in the proof of Lemma 3.4, we obtain $\nu_m \in M_2^{r,n}(\theta)$ for all $m \geq n$ and, thus, νM 2(θ), i.e. $\beta_2^{\theta}(\nu) = 0$.

By applying Theorem 1.1 and Corollary 1.1 we obtain the following result.

Theorem 3.2

For all functions $F\colon \mathcal{P}(E) \rightarrow [\!-\!\infty,\infty)$ that are upper semicontinuous and bounded from above, it holds that

$$\limsup_{n\rightarrow\infty} \sup_{\mu \in M_n(\pi_0)} \int_{E^n} F\circ L_n \text{d}\mu \leq \sup_{\substack{\nu \in \mathcal{P}(E) \colon \text{there exists } q, \nu q = \nu \text{ such that }\nu \otimes q \in M_2(\nu)}} F(\nu).$$
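On a finite state space with the discrete metric (so that $d_W$ is the total variation distance, as in Example 3.1), membership of ν in the constraint set on the right-hand side of Theorem 3.2 reduces to a linear programming feasibility problem. The following Python sketch (our illustration; the function name and the encoding are ours) checks whether a given ν admits a stochastic kernel q with νq = ν and $\text{TV}(q(x), \pi(x)) \leq r$ for every state x. Imposing the constraint on all states, rather than merely ν-almost surely, is harmless, since on ν-null states one may always take q(x) = π(x).

```python
import numpy as np
from scipy.optimize import linprog

def admits_invariant_kernel(nu, P, r):
    """LP feasibility check: is there a stochastic kernel q with nu q = nu and
    TV(q(x), P(x)) <= r for every state x?  Variables are q[x, y] together with
    auxiliary t[x, y] >= |q[x, y] - P[x, y]| encoding the TV constraints."""
    n = len(P)
    nvar = 2 * n * n
    q_idx = lambda x, y: x * n + y
    t_idx = lambda x, y: n * n + x * n + y
    A_eq, b_eq, A_ub, b_ub = [], [], [], []
    for x in range(n):                       # rows of q are probability vectors
        row = np.zeros(nvar)
        for y in range(n): row[q_idx(x, y)] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for y in range(n):                       # invariance: sum_x nu[x] q[x, y] = nu[y]
        row = np.zeros(nvar)
        for x in range(n): row[q_idx(x, y)] = nu[x]
        A_eq.append(row); b_eq.append(nu[y])
    for x in range(n):
        for y in range(n):                   # t >= q - P and t >= P - q
            up = np.zeros(nvar); up[q_idx(x, y)] = 1.0; up[t_idx(x, y)] = -1.0
            A_ub.append(up); b_ub.append(P[x, y])
            lo = np.zeros(nvar); lo[q_idx(x, y)] = -1.0; lo[t_idx(x, y)] = -1.0
            A_ub.append(lo); b_ub.append(-P[x, y])
        row = np.zeros(nvar)                 # TV(q(x), P(x)) = 0.5 * sum_y t <= r
        for y in range(n): row[t_idx(x, y)] = 0.5
        A_ub.append(row); b_ub.append(r)
    res = linprog(np.zeros(nvar), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * nvar)
    return res.status == 0                   # status 0 means a feasible q exists

# The stationary distribution (3/13, 4/13, 6/13) of the kernel from Example 3.1
# qualifies for every r >= 0, since q = P itself is feasible.
P = np.array([[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.0, 0.3, 0.7]])
print(admits_invariant_kernel(np.array([3.0, 4.0, 6.0]) / 13, P, 0.0))   # True
```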

We now focus on the lower bound in Theorem 1.3. Let π satisfy Assumption M (it no longer has to satisfy the Feller property). We define

$$\underline{\beta}(\mu,\nu) := \begin{cases} 0 & \text{if } d_W(\mu,\nu) \leq r \text{ and } \mu \ll \nu, \\ \infty & \text{otherwise}, \end{cases}$$

so that (B3) holds. We obtain

$$\underline{\beta}_n^{\theta}(\nu) = \infty \cdot {\bf 1}_{(\underline{M}_n(\theta))^C}(\nu).$$

Proving (B1) for $\underline{\beta}$ works completely analogously to the case of β in Lemma 3.7, replacing $M_n$ by $\underline{M}_n$. Applying Theorem 1.1 yields

$$\liminf_{n\rightarrow \infty} \sup_{\mu \in \underline{M}_n(\pi_0)} \int_{E^n} F \circ L_n \text{d}\mu \geq \sup_{\substack{\nu \in \mathcal{P}(E) \colon \text{there exists } q, \nu q = \nu \text{ such that } \nu \otimes q \in \underline{M}_2(\nu)}} F(\nu)$$

for all $F\in C_b(\mathcal{P}(E))$. This proves Theorem 1.3.

Acknowledgements

The author would like to thank Daniel Bartl, Michael Kupper, Daniel Lacker, and an anonymous referee for their valuable comments and suggestions. In particular, thanks to Daniel Lacker for providing the proof of Lemma 3.3.

Footnotes

The supplementary material for this article can be found at http://doi.org/10.1017/apr.2019.6.

References

Acciaio, B. and Penner, I. (2011). Dynamic risk measures. In Advanced Mathematical Methods for Finance, Springer, Heidelberg, pp. 1–34.
Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43, 904–924.
Bartl, D. (2016). Exponential utility maximization under model uncertainty for unbounded endowments. Preprint. Available at https://arxiv.org/abs/1610.00999.
Bartl, D. (2016). Pointwise dual representation of dynamic convex expectations. Preprint. Available at https://arxiv.org/abs/1612.09103v1.
Bartl, D., Drapeau, S. and Tangpi, L. (2017). Computational aspects of robust optimized certainty equivalents. Preprint. Available at https://arxiv.org/abs/1706.10186v1.
Bertsekas, D. P. and Shreve, S. E. (1996). Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific.
Blanchet, A. and Carlier, G. (2016). Optimal transport and Cournot–Nash equilibria. Math. Operat. Res. 41, 125–145.
Blanchet, J. and Murthy, K. R. A. (2016). Quantifying distributional model risk via optimal transport. Preprint. Available at https://arxiv.org/abs/1604.01446.
Breiman, L. (1992). Probability (Classics Appl. Math. 7). Society for Industrial and Applied Mathematics, Philadelphia, PA.
Cerreia-Vioglio, S., Maccheroni, F. and Marinacci, M. (2016). Ergodic theorems for lower probabilities. Proc. Amer. Math. Soc. 144, 3381–3396.
Cheridito, P. (2013). Convex analysis. Princeton University lecture notes.
Cheridito, P. and Kupper, M. (2011). Composition of time-consistent dynamic monetary risk measures in discrete time. Internat. J. Theoret. Appl. Finance 14, 137–162.
De Acosta, A. (1990). Large deviations for empirical measures of Markov chains. J. Theoret. Prob. 3, 395–431.
De Cooman, G., Hermans, F. and Quaeghebeur, E. (2009). Imprecise Markov chains and their limit behavior. Prob. Eng. Inf. Sci. 23, 597–635.
Dellacherie, C. and Meyer, P.-A. (1982). Probability and Potential B: Theory of Martingales. Elsevier, Amsterdam.
Dembo, A. and Zeitouni, O. (2010). Large Deviations Techniques and Applications (Stoch. Modelling Appl. Prob. 38). Springer, Berlin.
Donsker, M. D. and Varadhan, S. R. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 28, 1–47.
Donsker, M. D. and Varadhan, S. R. S. (1976). Asymptotic evaluation of certain Markov process expectations for large time, III. Commun. Pure Appl. Math. 29, 389–461.
Dudley, R. M. (2002). Real Analysis and Probability (Cambridge Studies in Advanced Mathematics 74). Cambridge University Press.
Dupuis, P. and Ellis, R. S. (2011). A Weak Convergence Approach to the Theory of Large Deviations. John Wiley, Hoboken, NJ.
Eckstein, S. (2019). Extended Laplace principle for empirical measures of a Markov chain. Supplementary material. Available at http://doi.org/10.1017/apr.2019.6.
Erreygers, A., Rottondi, C., Verticale, G. and De Bock, J. (2018). Imprecise Markov models for scalable and robust performance evaluation of flexi-grid spectrum allocation policies. Preprint. Available at https://arxiv.org/abs/1801.05700.
Esfahani, P. M. and Kuhn, D. (2015). Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Preprint. Available at https://arxiv.org/abs/1505.05116.
Ethier, S. N. and Kurtz, T. G. (1986). Markov Processes. John Wiley, New York.
Gao, R. and Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. Preprint. Available at https://arxiv.org/abs/1604.02199.
Gibbs, A. L. and Su, F. E. (2002). On choosing and bounding probability metrics. Internat. Statist. Rev. 70, 419–435.
Hanasusanto, G. A., Roitch, V., Kuhn, D. and Wiesemann, W. (2015). A distributionally robust perspective on uncertainty quantification and chance constrained programming. Math. Program. 151, 35–62.
Hartfiel, D. and Seneta, E. (1994). On the theory of Markov set-chains. Adv. Appl. Prob. 26, 947–964.
Hartfiel, D. J. (2006). Markov Set-Chains. Springer, Heidelberg.
Jain, N. C. (1990). Large deviation lower bounds for additive functionals of Markov processes. Ann. Prob. 18, 1071–1098.
Kirkizlar, E., Andradóttir, S. and Ayhan, H. (2010). Robustness of efficient server assignment policies to service time distributions in finite-buffered lines. Naval Res. Logistics 57, 563–582.
Kurano, M., Song, J., Hosaka, M. and Huang, Y. (1998). Controlled Markov set-chains with discounting. J. Appl. Prob. 35, 293–302.
Lacker, D. (2015). Law invariant risk measures and information divergences. Preprint. Available at https://arxiv.org/abs/1510.07030.
Lacker, D. (2016). A non-exponential extension of Sanov's theorem via convex duality. Preprint. Available at https://arxiv.org/abs/1609.04744.
Lan, Y. and Zhang, N. (2017). Strong limit theorems for weighted sums of negatively associated random variables in nonlinear probability. Preprint. Available at https://arxiv.org/abs/1706.05788.
Ney, P. and Nummelin, E. (1987). Markov additive processes II. Large deviations. Ann. Prob. 15, 593–609.
Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operat. Res. 53, 780–798.
Peng, S. (2009). Survey on normal distributions, central limit theorem, Brownian motion and the related stochastic calculus under sublinear expectations. Sci. China Ser. A 52, 1391–1411.
Peng, S. (2010). Nonlinear expectations and stochastic calculus under uncertainty. Preprint. Available at https://arxiv.org/abs/1002.4546.
Rottondi, C., Erreygers, A., Verticale, G. and De Bock, J. (2017). Modelling spectrum assignment in a two-service flexi-grid optical link with imprecise continuous-time Markov chains. In 13th Internat. Conf. on Design of Reliable Communication Networks (DRCN 2017), Munich, pp. 1–8.
Škulj, D. (2006). Finite discrete time Markov chains with interval probabilities. In Soft Methods for Integrated Uncertainty Modelling (Adv. Soft Comput. 37), Springer, Heidelberg, pp. 299–306.
Škulj, D. (2009). Discrete time Markov chains with interval probabilities. Internat. J. Approx. Reason. 50, 1314–1329.
Škulj, D. (2015). Efficient computation of the bounds of continuous time imprecise Markov chains. Appl. Math. Comput. 250, 165–180.
Varadarajan, V. S. (1958). Weak convergence of measures on separable metric spaces. Sankhyā 19, 15–22.
Villani, C. (2008). Optimal Transport: Old and New, Vol. 338. Springer, Heidelberg.
Wiesemann, W., Kuhn, D. and Rustem, B. (2013). Robust Markov decision processes. Math. Operat. Res. 38, 153–183.
Yang, I. (2017). A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance. IEEE Control Systems Lett. 1, 164–169.
Yu, P. and Xu, H. (2016). Distributionally robust counterpart in Markov decision processes. IEEE Trans. Automatic Control 61, 2538–2543.
