
A non-exponential extension of Sanov’s theorem via convex duality

Published online by Cambridge University Press:  29 April 2020

Daniel Lacker*
Affiliation:
Columbia University
*
*Postal address: 306 Mudd, 500 West 120th St, New York, NY 10027, USA. Email address: daniel.lacker@columbia.edu

Abstract

This work is devoted to a vast extension of Sanov’s theorem, in Laplace principle form, based on alternatives to the classical convex dual pair of relative entropy and cumulant generating functional. The abstract results give rise to a number of probabilistic limit theorems and asymptotics. For instance, widely applicable non-exponential large deviation upper bounds are derived for empirical distributions and averages of independent and identically distributed samples under minimal integrability assumptions, notably accommodating heavy-tailed distributions. Other interesting manifestations of the abstract results include new results on the rate of convergence of empirical measures in Wasserstein distance, uniform large deviation bounds, and variational problems involving optimal transport costs, as well as an application to error estimates for approximate solutions of stochastic optimization problems. The proofs build on the Dupuis–Ellis weak convergence approach to large deviations as well as the duality theory for convex risk measures.

Type
Original Article
Copyright
© Applied Probability Trust 2020

1. Introduction

An original goal of this paper was to extend the weak convergence methodology of Dupuis and Ellis [Reference Dupuis and Ellis22] to the context of non-exponential (e.g. heavy-tailed) large deviations. While we claim only modest success in this regard, we do find some general-purpose large deviation upper bounds which can be seen as polynomial-rate analogs of the upper bounds in the classical theorems of Sanov and Cramér. At least as interesting, however, are the abstract principles behind these bounds, which have broad implications beyond the realm of large deviations. Let us first describe these abstract principles before specializing them in various ways.

Let E be a Polish space, and let $\mathcal{P}(E)$ denote the set of Borel probability measures on E endowed with the topology of weak convergence. Let B(E) (resp. $C_b(E)$ ) denote the set of measurable (resp. continuous) and bounded real-valued functions on E. For $n \ge 1$ and $\nu \in \mathcal{P}(E^n)$ , define $\nu_{0,1} \in \mathcal{P}(E)$ and measurable maps $\nu_{k-1,k} \colon E^{k-1} \rightarrow \mathcal{P}(E)$ for $k=2,\ldots,n$ via the disintegration

\[\nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n) = \nu_{0,1}({\mathrm{d}} x_1)\prod_{k=2}^n\nu_{k-1,k}(x_1,\ldots,x_{k-1})({\mathrm{d}} x_k).\]

In other words, if $(X_1,\ldots,X_n)$ is an $E^n$ -valued random variable with law $\nu$ , then $\nu_{0,1}$ is the law of $X_1$ , and $\nu_{k-1,k}(X_1,\ldots,X_{k-1})$ is the conditional law of $X_k$ given $(X_1,\ldots,X_{k-1})$ . Of course, $\nu_{k-1,k}$ are uniquely defined up to $\nu$ -almost sure equality.
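On a finite space the disintegration is elementary to compute. The following Python sketch (with a made-up joint law on the two-point space $E=\{0,1\}$, purely for illustration) shows how $\nu_{0,1}$ and $\nu_{1,2}$ reassemble $\nu$:

```python
from collections import defaultdict

# A toy joint law nu on E^2 with E = {0, 1}, stored as {(x1, x2): prob}.
nu = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# First marginal nu_{0,1}: the law of X_1.
nu_01 = defaultdict(float)
for (x1, x2), p in nu.items():
    nu_01[x1] += p

# Conditional kernel nu_{1,2}(x1): the law of X_2 given X_1 = x1.
def nu_12(x1):
    return {x2: nu[(x1, x2)] / nu_01[x1] for x2 in (0, 1)}

# The disintegration reassembles nu: nu(x1, x2) = nu_01(x1) * nu_12(x1)(x2).
reassembled = {(x1, x2): nu_01[x1] * nu_12(x1)[x2]
               for x1 in (0, 1) for x2 in (0, 1)}
```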

The protagonist of the paper is a proper (i.e. not identically $\infty$ ) convex function $\alpha \colon \mathcal{P}(E) \rightarrow ({-}\infty,\infty]$ with compact sub-level sets; that is, $\{\nu \in \mathcal{P}(E) \colon \alpha(\nu) \le c\}$ is compact for every $c \in {\mathbb R}$ . For $n \ge 1$ define $\alpha_n \colon \mathcal{P}(E^n) \rightarrow ({-}\infty,\infty]$ by

\[\alpha_n(\nu) = \int_{E^n}\sum_{k=1}^n\alpha(\nu_{k-1,k}(x_1,\ldots,x_{k-1}))\,\nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n),\]

and note that $\alpha_1 \equiv \alpha$ . Define the convex conjugate $\rho_n \colon B(E^n) \rightarrow {\mathbb R}$ by

(1.1) \begin{equation}\rho_n(\,f) = \sup_{\nu \in \mathcal{P}(E^n)}\bigg(\int_{E^n}f\,{\mathrm{d}} \nu - \alpha_n(\nu)\bigg) \quad \text{and} \quad \rho \equiv \rho_1.\end{equation}

Our main interest is in evaluating $\rho_n$ at functions of the empirical measure $L_n \colon E^n \rightarrow \mathcal{P}(E)$ defined by

\[L_n(x_1,\ldots,x_n) = \dfrac{1}{n}\sum_{i=1}^n\delta_{x_i}.\]

The main abstract result of the paper is the following extension of Sanov’s theorem, proved in more generality in Section 2.2 by adapting the weak convergence techniques of Dupuis and Ellis [Reference Dupuis and Ellis22].

Theorem 1.1. For $F \in C_b(\mathcal{P}(E))$ ,

\[\lim_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF \circ L_n) = \sup_{\nu \in \mathcal{P}(E)}(F(\nu) - \alpha(\nu)).\]

The guiding example is the relative entropy, $\alpha(\!\cdot\!) = H(\cdot \mid \mu)$ , where $\mu \in \mathcal{P}(E)$ is a fixed reference measure, and H is defined by

(1.2) \begin{equation}H(\nu \mid \mu) = \int_E\log({\mathrm{d}} \nu/{\mathrm{d}} \mu)\,{\mathrm{d}} \nu \quad \text{for } \nu \ll \mu, \qquad H(\nu \mid \mu) = \infty \quad \text{otherwise}.\end{equation}

Letting $\mu^n$ denote the n-fold product measure, it turns out that $\alpha_n(\!\cdot\!) = H(\cdot \mid \mu^n)$ , by the so-called chain rule of relative entropy [Reference Dupuis and Ellis22, Theorem B.2.1]. The dual $\rho_n$ is well known to be $\rho_n(\,f) = \log\int_{E^n} {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu^n$ , and the duality formulas relating $\rho_n$ and $\alpha_n$ are often known as the Gibbs variational principle or the Donsker–Varadhan formula [Reference Dupuis and Ellis22, Proposition 1.4.2 and Lemma 1.4.3]. In this case Theorem 1.1 reduces to the Laplace principle form of Sanov’s theorem:

\[\lim_{n\rightarrow\infty}\dfrac{1}{n}\log\int_{E^n} {\mathrm{e}}^{nF\circ L_n}\,{\mathrm{d}} \mu^n = \sup_{\nu \in \mathcal{P}(E)}(F(\nu) - H(\nu \mid \mu)).\]
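The duality behind this reduction can be checked directly on a finite space: the supremum is attained by the Gibbs measure proportional to ${\mathrm{e}}^{f}\,{\mathrm{d}}\mu$. A minimal numerical sketch (the three-point measure and function below are illustrative choices, not from the paper):

```python
import math

# Reference measure mu and a bounded f on the three-point space {0, 1, 2}.
mu = [0.5, 0.3, 0.2]
f = [1.0, -0.5, 2.0]

def H(nu, mu):
    """Relative entropy H(nu | mu) for discrete measures with nu << mu."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)

# Left side of the duality: rho(f) = log integral of e^f d(mu).
Z = sum(math.exp(fi) * mi for fi, mi in zip(f, mu))
rho_f = math.log(Z)

# The supremum in the Donsker-Varadhan formula is attained by the Gibbs
# measure nu* proportional to e^f d(mu).
nu_star = [math.exp(fi) * mi / Z for fi, mi in zip(f, mu)]
sup_side = sum(fi * ni for fi, ni in zip(f, nu_star)) - H(nu_star, mu)
```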

Well-known theorems of Varadhan and of Dupuis and Ellis (see [Reference Dupuis and Ellis22, Theorems 1.2.1 and 1.2.3]) assert the equivalence of this form of Sanov’s theorem with the more common form: for every Borel set $A \subset \mathcal{P}(E)$ with closure $\overline{A}$ and interior $A^\circ$ ,

(1.3) \begin{align}-\inf_{\nu \in A^\circ}H(\nu \mid \mu) &\le \liminf_{n\rightarrow\infty}\dfrac{1}{n}\log\mu^n(L_n \in A) \notag \\*&\le \limsup_{n\rightarrow\infty}\dfrac{1}{n}\log\mu^n(L_n \in A) \le -\inf_{\nu \in \overline{A}}H(\nu \mid \mu).\end{align}

To derive this heuristically, apply Theorem 1.1 to the function

\begin{align*} F(\nu) = \begin{cases}0 &\text{if } \nu \in A, \\-\infty &\text{otherwise}.\end{cases}\end{align*}

For general $\alpha$ , Theorem 1.1 does not permit an analogous equivalent formulation in terms of deviation probabilities. In fact, for many $\alpha$ , Theorem 1.1 has nothing to do with large deviations (see Sections 1.3 and 1.4 below). Nonetheless, for certain $\alpha$ , Theorem 1.1 implies interesting large deviation upper bounds, which we prove by formalizing the aforementioned heuristic. While many $\alpha$ admit fairly explicit known formulas for the dual $\rho$ , the recurring challenge in applying Theorem 1.1 is finding a useful expression for $\rho_n$ , and herein lies but one of many instances of the wonderful tractability of relative entropy. The examples to follow do admit good expressions for $\rho_n$ , or at least workable one-sided bounds, but we also catalog in Section 1.5 some natural alternative choices of $\alpha$ for which we did not find useful bounds or expressions for $\rho_n$ .

The functional $\rho$ is (up to a sign change) a convex risk measure, in the language of Föllmer and Schied [Reference Föllmer and Schied28]. A rich duality theory for convex risk measures has emerged over the past two decades, primarily geared toward applications in financial mathematics and optimization. We take advantage of this theory in Section 2 to demonstrate how $\alpha$ can be reconstructed from $\rho$ , which shows that $\rho$ could be taken as the starting point instead of $\alpha$ . Additionally, the theory of risk measures provides insight into how to deal with the subtleties that arise in extending the domain of $\rho$ (and Theorem 1.1) to accommodate unbounded functions or stronger topologies on $\mathcal{P}(E)$ . Section 1.6 briefly reinterprets Theorem 1.1 in a language more consistent with the risk measure literature. The reader familiar with risk measures may notice a time-consistent dynamic risk measure (see [Reference Acciaio, Penner, Di Nunno and Øksendal1] for definitions and survey) hidden in the definition of $\rho_n$ above.

We will make no use of the interpretation in terms of dynamic risk measures, but it did inspire a recursive formula for $\rho_n$ (similar to a result of [Reference Cheridito and Kupper14]). To state it loosely, if $f \in B(E^n)$ then we may write

(1.4) \begin{equation}\rho_n(\,f) = \rho_{n-1}(g), \quad \text{where } g(x_1,\ldots,x_{n-1}) \,:\!= \rho(\,f(x_1,\ldots,x_{n-1},\cdot)).\end{equation}

To make rigorous sense of this, we must note that $g \colon E^{n-1} \rightarrow {\mathbb R}$ is merely upper semianalytic and not Borel-measurable in general, and argue that $\rho$ is well-defined for such functions. We make this precise in Proposition A.1. This recursive formula is not essential for any of the main arguments but is convenient for some calculations.
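In the entropic case the recursion (1.4) is transparent, since $\rho_n(\,f) = \log\int_{E^n}{\mathrm{e}}^{\,f}\,{\mathrm{d}}\mu^n$. The following sketch verifies it for $n=2$ on a two-point space (the measure and function are toy choices):

```python
import math

# Entropic case on the two-point space: rho(h) = log integral e^h d(mu).
mu = [0.6, 0.4]
f = [[0.3, -1.2], [2.0, 0.5]]   # f(x1, x2) stored as a 2x2 table

def rho(h):
    return math.log(sum(math.exp(h[x]) * mu[x] for x in (0, 1)))

# Direct computation of rho_2(f) = log integral e^f d(mu x mu).
rho2_direct = math.log(sum(math.exp(f[x1][x2]) * mu[x1] * mu[x2]
                           for x1 in (0, 1) for x2 in (0, 1)))

# Recursive computation via (1.4): first g(x1) = rho(f(x1, .)), then rho(g).
g = [rho(f[x1]) for x1 in (0, 1)]
rho2_recursive = rho(g)
```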

1.1. Non-exponential large deviations

Our first application, and the one we discuss in the most detail, comes from applying (an extension of) Theorem 1.1 with

(1.5) \begin{equation}\alpha(\nu) = \|{\mathrm{d}} \nu/{\mathrm{d}} \mu\|_{L^p(\mu)}-1 \quad \text{for } \nu \ll \mu, \qquad \alpha(\nu) = \infty \quad \text{otherwise},\end{equation}

where $\mu \in \mathcal{P}(E)$ is fixed. We state the abstract result first. For a continuous function $\psi \colon E \rightarrow {\mathbb R}_+ \,:\!= [0,\infty)$ , let $\mathcal{P}_\psi(E)$ denote the set of $\nu \in \mathcal{P}(E)$ satisfying $\int\psi\,{\mathrm{d}} \nu \lt \infty$ . Equip $\mathcal{P}_\psi(E)$ with the topology induced by the linear maps $\nu \mapsto \int f\,{\mathrm{d}} \nu$ , where $f \colon E \rightarrow {\mathbb R}$ is continuous and $|\,f| \le 1+\psi$ . Recall in the following that $\mu^n$ denotes the n-fold product measure.

Theorem 1.2. Let $q \in (1,\infty)$ , and let $p=q/(q-1)$ denote the conjugate exponent. Let $\mu \in \mathcal{P}(E)$ , and suppose $\int\psi^q\,{\mathrm{d}} \mu \lt \infty$ for some continuous $\psi \colon E \rightarrow {\mathbb R}_+$ . Then, for every closed set $A \subset \mathcal{P}_\psi(E)$ ,

\[\limsup_{n\rightarrow\infty}\,n^{q-1}\mu^n(L_n \in A) \le \bigg(\inf_{\nu \in A}\|{\mathrm{d}} \nu/{\mathrm{d}} \mu\|_{L^p(\mu)}-1\bigg)^{-q}.\]

We view Theorem 1.2 as a non-exponential version of the upper bound of Sanov’s theorem, and the proof is given in Section 4.1. At this level of generality, there cannot be a matching lower bound for open sets as in the classical case (1.3), as will be explained more in Section 1.1.2. Of course, Sanov’s theorem applies without any moment assumptions, but the upper bound provides no information in many heavy-tailed contexts. We illustrate this with three applications below, all of which take advantage of the crucial fact that Theorem 1.2 applies to arbitrary closed sets A, which enables a natural contraction principle (i.e. continuous mapping). The first example gives new results on the rate of convergence of empirical measures in Wasserstein distance. Second, we derive non-exponential upper bounds analogous to Cramér’s theorem for sums of independent and identically distributed (i.i.d.) random variables with values in Banach spaces. Lastly, we derive error bounds for the usual Monte Carlo scheme in stochastic optimization, essentially providing a heavy-tailed analog of the results of [Reference Kaniovski, King and Wets37].

1.1.1. Rate of convergence of empirical measures in Wasserstein distance

First, some terminology: a compatible metric on E is any metric on E which generates the given Polish topology. For $\mu,\nu \in \mathcal{P}(E)$ and $q \ge 1$ define the q-Wasserstein distance $\mathcal{W}_q(\mu,\nu)$ by

(1.6) \begin{equation}\mathcal{W}^q_q(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)}\int_{E \times E} d^q(x,y) \pi({\mathrm{d}} x,{\mathrm{d}} y),\end{equation}

where $\Pi(\mu,\nu)$ is the set of probability measures on $E \times E$ with first marginal $\mu$ and second marginal $\nu$ . In Section 4.2 we will prove the following.

Corollary 1.1. (Wasserstein convergence rate.) Let d be any compatible metric on E. Let $q \gt r \ge 1$ , and let $\mu \in \mathcal{P}(E)$ satisfy $\int_E d^q(x,x_0)\mu({\mathrm{d}} x) \lt \infty$ for some (equivalently, for any) $x_0 \in E$ . Then, for each $a \gt 0$ ,

\begin{equation*} \limsup_{n\to\infty} n^{{q}/{r}-1}\mu^n (\mathcal{W}_r(L_n,\mu) \ge a) < \infty.\end{equation*}

In particular,

\begin{equation*}\limsup_{n\to\infty} n^{q-1}\mu^n (\mathcal{W}_1(L_n,\mu) \ge a) < \infty.\end{equation*}

In other words, $\mu^n (\mathcal{W}_r(L_n,\mu) \ge a) = {\mathrm{O}}(n^{1-q/r})$ . In the $r=1$ case, a comparison with a more classical setting reveals that this rate is the right one, in a sense. Suppose $X_i$ are i.i.d. real-valued random variables with law $\mu$ , mean zero, and ${\mathbb E}|X_1|^q \lt \infty$ . Then the a.s. inequality

\[\dfrac{1}{n}\sum_{i=1}^nX_i \le \mathcal{W}_1\Bigg(\dfrac{1}{n}\sum_{i=1}^n\delta_{X_i},\mu\Bigg)\]

and Corollary 1.1 give ${\mathbb P}(X_1+\cdots + X_n \ge an) = {\mathrm{O}}(n^{1-q})$ for each $a \gt 0$ . It is known in this context that ${\mathbb P}(X_1 + \cdots + X_n \gt an) = {\mathrm{o}}(n^{1-q})$ , and this exponent cannot be improved under the sole assumption of a finite qth moment [Reference Petrov48, Chapter IX, Theorems 27 and 28]. A similar argument in the case r > 1 indicates that the exponent $q/r - 1$ is sharp in the first claim of Corollary 1.1.
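The displayed inequality is the Kantorovich–Rubinstein bound with the 1-Lipschitz test function $f(x)=x$, and it holds in the two-sided form $|\frac{1}{n}\sum_{i=1}^n X_i - \int x\,\mu({\mathrm{d}} x)| \le \mathcal{W}_1(L_n,\mu)$. A quick numerical sanity check on a made-up discrete law (the data and the one-dimensional $\mathcal{W}_1$ formula via CDFs are our own illustrative choices):

```python
import random

def w1(pa, wa, pb, wb):
    """W_1 between two discrete laws on R: the integral of |F_a - F_b|."""
    xs = sorted(set(pa) | set(pb))
    cdf = lambda pts, wts, x: sum(w for p, w in zip(pts, wts) if p <= x)
    return sum(abs(cdf(pa, wa, x) - cdf(pb, wb, x)) * (y - x)
               for x, y in zip(xs[:-1], xs[1:]))

random.seed(0)
support, weights = [-1.0, 0.0, 2.0], [0.5, 0.3, 0.2]
mean_mu = sum(p * w for p, w in zip(support, weights))

# Draw n i.i.d. samples and form the empirical measure L_n.
n = 200
sample = random.choices(support, weights=weights, k=n)
emp_pts = sorted(set(sample))
emp_wts = [sample.count(p) / n for p in emp_pts]
mean_n = sum(sample) / n

# The deviation of the empirical mean is dominated by W_1(L_n, mu).
gap = w1(emp_pts, emp_wts, support, weights)
```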

There is now a substantial literature on rates of convergence of empirical measures of i.i.d. sequences in Wasserstein distance, and we refer to the recent paper [Reference Fournier and Guillin30] for the state of the art and an overview of the many applications in quantization, interacting particle systems, etc. Yet, our result seems quite new in several respects. First, while the $n\to\infty$ convergence rate of the expected distance

\[M_n^{(r)}\,:\!=\int_{E^n}\mathcal{W}_r^r(L_n,\mu)\,{\mathrm{d}} \mu^n\]

is well understood, the (asymptotic) rate of convergence in n of the deviation probabilities given in Corollary 1.1 appears to be new. Case (3) in Theorem 2 of [Reference Fournier and Guillin30] gives some non-asymptotic bounds on these probabilities which are worse than ours in the $n\to\infty$ regime; the closest counterpart among their results is a bound of ${\mathrm{O}}(n^{1-q+\epsilon})$ for any $\epsilon \gt 0$ , but it is given only for $a \gt 1$ and $r \lt q/2$ .

A second novelty of Corollary 1.1 is that it is valid in arbitrary Polish spaces, whereas most of the prior literature deals with Euclidean spaces. In the setting of tail probability bounds, a notable exception is the work of Boissard [Reference Boissard12], which shows exponential decay in n of the probabilities $\mu^n (\mathcal{W}_1(L_n,\mu) \ge a)$ but under assumptions that the measure $\mu$ has finite exponential moments or satisfies a transport inequality. In the study of the expected distances $M_n^{(r)}$ , it is well known that the dimension of the underlying space (or more generally a notion of metric entropy as in [Reference Dudley19] and [Reference Weed and Bach54]) must absolutely come into play. For example, in Euclidean space $E \subset {\mathbb R}^d$ with $d \gt 2$ , $M_n^{(1)}$ is known to be asymptotic to $n^{-1/d}$ , at least when E is compact and $\mu$ is absolutely continuous with respect to Lebesgue measure [Reference Dudley19]. Corollary 1.1 shows that this dimension dependence disappears from the probabilistic rate of convergence. Note that this leads to no contradiction: writing

\[M_n^{(1)} = \int_0^\infty\mu^n (\mathcal{W}_1(L_n,\mu) \ge a) \,{\mathrm{d}} a\]

and applying Corollary 1.1 does not imply $n^{q-1}M_n^{(1)}\to 0$ , as the dominated convergence theorem does not apply here.

1.1.2. Cramér’s upper bound

While Cramér’s theorem in full generality, like Sanov’s, does not require any finite moments, the upper bound is often vacuous when the underlying random variables have heavy tails. This simple observation has driven a large and growing literature on large deviation asymptotics for sums of i.i.d. random variables, to be reviewed shortly. This literature is full of precise asymptotics, mostly out of reach of our abstract framework. However, from Theorem 1.2 we can derive a modest alternative to Cramér’s upper bound which is notable in its wide applicability. See Section 4.2 for a proof of the following.

Corollary 1.2. (Cramér upper bound.) Let $q \in (1,\infty)$ , and let E be a separable Banach space. Let $(X_i)_{i=1}^\infty$ be i.i.d. E-valued random variables with ${\mathbb E}\|X_1\|^q \lt \infty$ . Define $\Lambda \colon E^* \rightarrow {\mathbb R} \cup \{\infty\}$ by

\[\Lambda(x^*) = \inf \{m \in {\mathbb R} \colon {\mathbb E} [[(1+\langle x^*,X_1\rangle - m)^+]^q ] \le 1 \},\]

and define $\Lambda^*(x) = \sup_{x^* \in E^*} (\langle x^*,x\rangle - \Lambda(x^*))$ for $x \in E$ . Then, for every closed set $A \subset E$ ,

\[\limsup_{n\rightarrow\infty}\,n^{q-1}\,{\mathbb P}\Bigg(\dfrac{1}{n}\sum_{i=1}^nX_i \in A\Bigg) \le \bigg(\inf_{x \in A}\Lambda^*(x)\bigg)^{-q}.\]

Here, $E^*$ denotes the continuous dual of E.
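Since $m \mapsto {\mathbb E}[[(1+\langle x^*,X_1\rangle-m)^+]^q]$ is continuous and nonincreasing, $\Lambda$ can be computed by bisection in one dimension. A sketch for a finitely supported scalar law (all numerical choices here are illustrative, not from the paper):

```python
q = 2.0
support, weights = [-1.0, 0.0, 3.0], [0.4, 0.4, 0.2]   # a toy asymmetric law

def shortfall(lmbda, m):
    """E[ ((1 + lambda*X - m)^+)^q ] for the toy law."""
    return sum(w * max(1.0 + lmbda * x - m, 0.0) ** q
               for x, w in zip(support, weights))

def Lambda(lmbda, lo=-100.0, hi=100.0, tol=1e-10):
    """inf{ m : shortfall(lmbda, m) <= 1 } by bisection, using that the
    shortfall is continuous and nonincreasing in m."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if shortfall(lmbda, mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi
```

As basic sanity checks, $\Lambda(0)=0$, and by Jensen's inequality $\Lambda(\lambda) \ge \lambda\,{\mathbb E} X_1$, mimicking properties of the cumulant generating function.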

In analogy with the classical Cramér’s theorem, the function $\Lambda$ in Corollary 1.2 plays the role of the cumulant generating function. In both Theorem 1.2 and Corollary 1.2, notice that as soon as the constant on the right-hand side is finite we may conclude that the probabilities in question are ${\mathrm{O}}(n^{1-q})$ , consistent with some now-standard results on one-dimensional heavy-tailed sums for events of the form $A=[r,\infty)$ , for $r \gt 0$ . For instance, as we mentioned in the previous subsection, if $(X_i)_{i=1}^\infty$ are i.i.d. real-valued random variables with mean zero and ${\mathbb E}|X_1|^q \lt \infty$ , then the sharpest result possible under these assumptions is ${\mathbb P}(X_1 + \cdots + X_n \gt nr) = {\mathrm{o}}(n^{1-q})$ . For $q \gt 2$ , the Fuk–Nagaev inequality gives a related non-asymptotic bound; see [Reference Nagaev43, Corollary 1.8], or [Reference Einmahl and Li25] for a Banach space version.

In general, we cannot expect a matching lower bound in Corollary 1.2, and thus we cannot expect one in Theorem 1.2. If stronger assumptions are made on $X_i$ , such as regular variation, then corresponding lower bounds are known for certain sets A, but it remains unclear whether or not our abstract approach can recover such lower bounds. Refer to [Reference Borovkov and Borovkov13], [Reference Foss, Korshunov and Zachary29], and [Reference Mikosch and Nagaev41] for detailed overviews of such results, as well as the more recent [Reference Denisov, Dieker and Shneer17], [Reference Rhee, Blanchet and Zwart49], and references therein. Indeed, precise asymptotics require detailed assumptions on the shape of the tails of $X_i$ , and this is especially true in multivariate and infinite-dimensional contexts. An interesting recent line of work extends the theory of regular variation to metric spaces [Reference De Haan and Lin15, Reference Hult and Lindskog34, Reference Hult, Lindskog, Mikosch and Samorodnitsky35, Reference Lindskog, Resnick and Roy40], but again the assumptions on the underlying $\mu$ are much stronger than mere existence of a finite moment.

The only real strength of our Corollary 1.2, compared to the deep literature on sums of heavy-tailed random variables, is its broad applicability. It requires only finite moments, applies in general (separable) Banach spaces, and allows for arbitrary closed sets A, the latter point being useful in that it enables contraction principle arguments.

Before turning to the next application, it is worth mentioning a few more loosely related papers. In connection with concentration of measure, the papers of Bobkov and Ding [Reference Bobkov and Ding11, Reference Ding18] studied transport inequalities involving functionals like (1.5), resulting in characterizations of certain non-exponential tail bounds. Less closely related, Atar et al. [Reference Atar, Chowdhary and Dupuis4] exploited a variational representation for exponential integrals involving the functional (1.5) and showed how to use it to bound, for example, a large deviation probability for one model in terms of an alternative more tractable model; their work does not, however, appear to be applicable to situations with heavy tails.

1.1.3. Stochastic optimization

Let $\mathcal{X}$ be another Polish space. Consider a continuous function $h \colon \mathcal{X} \times E \rightarrow {\mathbb R}$ bounded from below, and define $V \colon \mathcal{P}(E) \rightarrow {\mathbb R}$ by

\[V(\nu) = \inf_{x \in \mathcal{X}}\int_Eh(x,w)\nu({\mathrm{d}} w).\]

Fix $\mu \in \mathcal{P}(E)$ again as a reference measure. The most common and natural approach to solving the optimization problem $V(\mu)$ numerically is to construct i.i.d. samples $X_1,X_2,\ldots$ with law $\mu$ and instead study $V(L_n(X_1,\ldots,X_n))$ , where as usual

\[L_n(X_1,\ldots,X_n)=\dfrac{1}{n}\sum_{i=1}^n\delta_{X_i}.\]

The two obvious questions are then as follows.

  (A) Does $V(L_n(X_1,\ldots,X_n))$ converge to $V(\mu)$ ?

  (B) Do the minimizers of $V(L_n(X_1,\ldots,X_n))$ converge to those of $V(\mu)$ in some sense?

The answers to these questions are known to be affirmative in very general settings, using a form of set-convergence for question (B); see [Reference Dupacová and Wets21], [Reference Kall and Guddat36], and [Reference King and Wets38]. Given this, we then hope to quantify the rate of convergence for both of these questions. This is done in the language of large deviations in a paper of Kaniovski et al. [Reference Kaniovski, King and Wets37], under a strong exponential integrability assumption derived from Cramér’s condition. In this section we complement their results by showing that under weaker integrability assumptions we can still obtain polynomial rates of convergence.

Theorem 1.3. Suppose $\mathcal{X}$ is compact. Suppose the function h is jointly continuous, and its sub-level sets are compact. Let $q \in (1,\infty)$ and $\mu \in \mathcal{P}(E)$ be such that, if

\[\psi(w) \,:\!= \Big(\sup_{x \in \mathcal{X}}h(x,w)\Big)^+,\]

then $\int_E\psi^q\,{\mathrm{d}} \mu \lt \infty$ . Then, for each $\epsilon \gt 0$ ,

\[\limsup_{n\rightarrow\infty}n^{q-1}\mu^n(|V(L_n)-V(\mu)| \ge \epsilon) < \infty.\]

The proof is given in Section 4.3, where we also present a related result on the rate of convergence of the optimizers themselves, addressing question (B) above.
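The sample-average scheme described above is straightforward to simulate. The following sketch uses the illustrative cost $h(x,w)=(x-w)^2$ with $\mathcal{X}$ a grid in $[0,1]$ and a made-up three-point $\mu$ (none of these choices come from the paper); in this case $V(\mu)$ is the variance of $\mu$:

```python
import random

# Sample-average approximation of V(nu) = inf_x integral h(x, w) nu(dw),
# with the illustrative quadratic cost h(x, w) = (x - w)^2.
X_grid = [i / 100 for i in range(101)]
support, weights = [0.1, 0.5, 0.9], [0.25, 0.5, 0.25]

def V(points, probs):
    return min(sum(p * (x - w) ** 2 for w, p in zip(points, probs))
               for x in X_grid)

V_mu = V(support, weights)   # equals Var(mu) = 0.08, attained at x = 0.5

# Replace mu by the empirical measure of n i.i.d. samples.
random.seed(1)
n = 2000
sample = random.choices(support, weights=weights, k=n)
pts = sorted(set(sample))
probs = [sample.count(p) / n for p in pts]
V_n = V(pts, probs)
```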

1.2. Uniform upper bounds and martingales

Certain classes of dependent sequences admit uniform upper bounds, which we derive from Theorem 1.1 by working with

\begin{equation*} \alpha(\nu) = \inf_{\mu \in M}H(\nu \mid \mu),\end{equation*}

for a given convex weakly compact set $M \subset \mathcal{P}(E)$ . The conjugate $\rho$ , unsurprisingly, is $\rho(\,f) = \sup_{\mu \in M}\log\int {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu$ , and $\rho_n$ turns out to be tractable as well, that is,

\begin{equation*} \rho_n(\,f) = \sup_{\mu \in M_n}\log\int_{E^n} {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu,\end{equation*}

where $M_n$ is defined as the set of laws $\mu \in \mathcal{P}(E^n)$ with $\mu_{k-1,k} \in M$ for each $k=1,\ldots,n$ , $\mu$ -almost surely; in other words, $M_n$ is the set of laws of $E^n$ -valued random variables $(X_1,\ldots,X_n)$ , when the law of $X_1$ belongs to M and so does the conditional law of $X_k$ given $(X_1,\ldots,X_{k-1})$ , almost surely, for each $k=2,\ldots,n$ . Theorem 1.1 becomes

\[\lim_{n\rightarrow\infty}\dfrac{1}{n}\log\sup_{\mu \in M_n}\int_{E^n} {\mathrm{e}}^{nF\circ L_n}\,{\mathrm{d}} \mu = \sup_{\mu \in M, \ \nu \in \mathcal{P}(E)}(F(\nu) - H(\nu \mid \mu)) \quad \text{for } F \in C_b(\mathcal{P}(E)).\]

From this we derive a uniform large deviation upper bound, for closed sets $A \subset \mathcal{P}(E)$ :

(1.7) \begin{equation}\limsup_{n\rightarrow\infty}\dfrac{1}{n}\log\sup_{\mu \in M_n}\mu(L_n \in A) \le -\inf_{\mu \in M, \nu \in A}H(\nu \mid \mu).\end{equation}

With a prudent choice of M, this specializes to an asymptotic relative of the Azuma–Hoeffding inequality. The novel feature here is that we can work with arbitrary closed sets and in multiple dimensions.

Theorem 1.4. Let $\varphi \colon {\mathbb R}^d \rightarrow {\mathbb R}$ , and define $\mathcal{S}_{d,\varphi}$ to be the set of ${\mathbb R}^d$ -valued martingales $(S_k)_{k=0}^n$ , defined on a common but arbitrary probability space, satisfying $S_0=0$ and

\[{\mathbb E} [\!\exp (\langle y, S_k-S_{k-1}\rangle)\mid S_0,\ldots,S_{k-1}] \le {\mathrm{e}}^{\varphi(y)} \ \ \textit{a.s.}\quad \textit{for } k=1,\ldots,n, \ y \in {\mathbb R}^d.\]

Then, for closed sets $A \subset {\mathbb R}^d$ , we have

\[\limsup_{n\rightarrow\infty}\sup_{(S_k)_{k=0}^n \in \mathcal{S}_{d,\varphi}}\dfrac{1}{n}\log{\mathbb P}(S_n/n \in A) \le -\inf_{x \in A}\varphi^*(x),\]

where $\varphi^*(x) = \sup_{y \in {\mathbb R}^d}(\langle x,y\rangle - \varphi(y))$ .

By taking $(S_k)_{k=0}^n$ to be a random walk (i.e. the increments are i.i.d.) such that $\varphi(y) = \log{\mathbb E}[{\mathrm{e}}^{\langle y,S_1-S_0\rangle}] \lt \infty$ , it is readily checked that the bound of Theorem 1.4 coincides with the upper bound from Cramér’s theorem and is thus sharp. Föllmer and Knispel [Reference Föllmer and Knispel26] found some results which loosely resemble (1.7) (see Corollary 5.3 therein), based on an analysis of the same risk measure $\rho$ . See also [Reference Hu33] and [Reference Fuqing and Mingzhou31] for somewhat related results on large deviations for capacities.
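For a concrete instance of the conjugate: with $\varphi(y)=\sigma^2\|y\|^2/2$ (the sub-Gaussian bound supplied, for example, by Hoeffding's lemma for bounded increments), one has $\varphi^*(x)=\|x\|^2/(2\sigma^2)$, giving a Gaussian-type rate in Theorem 1.4. A brute-force grid check in $d=1$ (the value of $\sigma^2$ is arbitrary):

```python
sigma2 = 2.0

def phi(y):
    # Sub-Gaussian log-moment bound phi(y) = sigma^2 * y^2 / 2.
    return 0.5 * sigma2 * y * y

def phi_star(x, lo=-50.0, hi=50.0, steps=200001):
    """sup_y (x*y - phi(y)), approximated over a fine grid of y."""
    best = float("-inf")
    for i in range(steps):
        y = lo + (hi - lo) * i / (steps - 1)
        best = max(best, x * y - phi(y))
    return best
```

The exact conjugate is $x^2/(2\sigma^2)$, attained at $y = x/\sigma^2$, which the grid approximation should reproduce to high accuracy.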

1.3. Laws of large numbers

Some specializations of Theorem 1.1 appear to have nothing to do with large deviations. For example, suppose $M \subset \mathcal{P}(E)$ is convex and compact, and let

\[\alpha(\nu) = \begin{cases}0 &\text{if } \nu \in M, \\\infty &\text{otherwise}.\end{cases}\]

It can be shown that $\rho_n(\,f) = \sup_{\mu \in M_n}\int_{E^n}f\,{\mathrm{d}} \mu$ , where $M_n$ is defined as in Section 1.2, for instance by a direct computation using (1.4). Theorem 1.1 then becomes

(1.8) \begin{equation}\lim_{n\rightarrow\infty}\sup_{\mu \in M_n}\int_{E^n}F \circ L_n\,{\mathrm{d}} \mu = \sup_{\mu \in M}F(\mu)\quad \text{for each } F \in C_b(\mathcal{P}(E)).\end{equation}

When $M =\{\mu\}$ is a singleton, so is $M_n = \{\mu^n\}$ , and this simply expresses the weak convergence $\mu^n \circ L_n^{-1} \rightarrow \delta_\mu$ . The general case can be interpreted as a robust law of large numbers, where ‘robust’ refers to perturbations of the joint law of an i.i.d. sequence. More precisely, noting that $\sup_{\mu \in M}F(\mu) = \sup_{Q \in \mathcal{P}(M)}\int F\,{\mathrm{d}} Q$ , one can derive from (1.8) certain forms of set-convergence (e.g. Painlevé–Kuratowski) of the sequence $\{\mu \circ L_n^{-1} \colon \mu \in M_n\}$ toward $\mathcal{P}(M) \,:\!= \{Q \in \mathcal{P}(\mathcal{P}(E)) \colon Q(M)=1\}$ , though we refrain from lengthening the paper with further details. In another direction, (1.8) is closely related to laws of large numbers under nonlinear expectations [Reference Peng46].

1.4. Optimal transport costs

Another interesting consequence of Theorem 1.1 comes from choosing $\alpha$ as an optimal transport cost. Fix $\mu \in \mathcal{P}(E)$ and a lower semicontinuous function $c \colon E^2 \rightarrow [0,\infty]$ , and define

\[\alpha(\nu) = \inf_{\pi \in \Pi(\mu,\nu)}\int c\,{\mathrm{d}} \pi,\]

where $\Pi(\mu,\nu)$ was defined immediately after (1.6). Under a modest additional assumption on c (stated shortly in Corollary 1.3, proved later in Lemma 6.2), $\alpha$ satisfies our standing assumptions.

The dual $\rho$ can be identified using Kantorovich duality, and $\rho_n$ turns out to be the value of a stochastic optimal control problem. To illustrate this, it is convenient to work with probabilistic notation. Suppose $(X_i)_{i=1}^\infty$ is a sequence of i.i.d. E-valued random variables with common law $\mu$ , defined on some fixed probability space. For each n, let $\mathcal{Y}_n$ denote the set of $E^n$ -valued random variables $(Y_1,\ldots,Y_n)$ where $Y_k$ is $(X_1,\ldots,X_k)$ -measurable for each $k=1,\ldots,n$ . We think of elements of $\mathcal{Y}_n$ as adapted control processes. For each $n \ge 1$ and each $f \in B(E^n)$ , we show in Proposition 6.1 that

(1.9) \begin{equation}\rho_n(\,f) = \sup_{(Y_1,\ldots,Y_n) \in \mathcal{Y}_n}{\mathbb E}\Bigg[\,f(Y_1,\ldots,Y_n) - \sum_{i=1}^nc(X_i,Y_i)\Bigg].\end{equation}

The expression (1.9) yields the following corollary of Theorem 1.1.

Corollary 1.3. Suppose that for each compact set $K \subset E$ , the function $h_K(y) \,:\!= \inf_{x \in K}c(x,y)$ has pre-compact sub-level sets. That is, the closure of $\{y \in E \colon h_K(y) \le m\}$ is compact for each $m \ge 0$ . This assumption holds, for example, if E is a subset of Euclidean space and there exists $y_0 \in E$ such that $c(x,y) \rightarrow \infty$ as $d(y,y_0) \rightarrow \infty$ , uniformly for x in compacts. For each $F \in C_b(\mathcal{P}(E))$ , we have

(1.10) \begin{align}& \lim_{n\rightarrow\infty}\sup_{(Y_k)_{k=1}^n \in \mathcal{Y}_n}{\mathbb E}\Bigg[F(L_n(Y_1,\ldots,Y_n)) - \dfrac{1}{n}\sum_{i=1}^nc(X_i,Y_i)\Bigg]\notag \\* & \quad = \sup_{\nu \in \mathcal{P}(E)}(F(\nu) - \alpha(\nu)) \notag \\*& \quad = \sup_{\pi \in \Pi(\mu)}\bigg(F(\pi(E \times \cdot)) - \int_{E \times E}c\,{\mathrm{d}} \pi\bigg),\end{align}

where $\Pi(\mu) = \cup_{\nu \in \mathcal{P}(E)}\Pi(\mu,\nu)$ .

This can be seen as a long-time limit of the optimal value of the control problems. However, the renormalization in n is a bit peculiar in that it enters inside the terminal cost F, and there does not seem to be a direct connection with ergodic control. A direct proof of (1.10) is possible but seems to be no simpler and potentially narrower in scope.

While the pre-limit expression in (1.10) may look peculiar, we include this example in part because it is remarkably tractable and in part because the limiting object is quite ubiquitous, encompassing a wide variety of variational problems involving optimal transport costs. Two notable recent examples can be found in the study of Cournot–Nash equilibria in large-population games [Reference Blanchet and Carlier10] and in the theory of Wasserstein barycenters [Reference Agueh and Carlier2].
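In the simplest case $n=1$, the representation (1.9) reduces (on a finite space, by optimizing $Y_1$ pointwise in $X_1$) to $\rho(\,f) = \int_E \sup_y (\,f(y) - c(x,y))\,\mu({\mathrm{d}} x)$, which can be compared with the defining supremum over $\nu$. A sketch on a two-point space (the cost matrix and data below are toy choices):

```python
# Finite-space sanity check of the control representation (1.9) for n = 1.
mu = [0.6, 0.4]                      # law of X_1 on E = {0, 1}
c = [[0.0, 2.0], [1.0, 0.0]]         # transport cost c(x, y)
f = [1.0, 0.0]

# Control side: sup over X_1-measurable Y_1 of E[f(Y_1) - c(X_1, Y_1)],
# attained by optimizing pointwise in x.
control_value = sum(mu[x] * max(f[y] - c[x][y] for y in (0, 1))
                    for x in (0, 1))

def alpha(t):
    """Optimal transport cost from mu to nu = (t, 1 - t): couplings form a
    one-parameter family in pi_00, and the cost is linear in pi_00."""
    lo, hi = max(0.0, mu[0] + t - 1.0), min(mu[0], t)
    def cost(p00):
        pi = [[p00, mu[0] - p00], [t - p00, 1.0 - mu[0] - t + p00]]
        return sum(c[x][y] * pi[x][y] for x in (0, 1) for y in (0, 1))
    return min(cost(lo), cost(hi))

# Dual side: rho(f) = sup_nu ( integral f d(nu) - alpha(nu) ), over a grid.
dual_value = max(t * f[0] + (1.0 - t) * f[1] - alpha(t)
                 for t in (i / 1000 for i in range(1001)))
```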

1.5. Alternative choices of $\alpha$

There are many other natural choices of $\alpha$ for which the implications of Theorem 1.1 remain unclear. For example, consider the $\varphi$ -divergence

\[\alpha(\nu) = \int_E\varphi({\mathrm{d}} \nu/{\mathrm{d}} \mu)\,{\mathrm{d}} \mu \quad \text{for } \nu \ll \mu, \qquad \alpha(\nu)=\infty \quad \text{otherwise},\]

where $\mu \in \mathcal{P}(E)$ and $\varphi \colon {\mathbb R}_+ \rightarrow {\mathbb R}$ is convex and satisfies $\varphi(x)/x \rightarrow \infty$ as $x \rightarrow \infty$ . This $\alpha$ has weakly compact sub-level sets, according to [16, Lemma 6.2.16], and it is clearly convex. The dual, known in the risk literature as the optimized certainty equivalent, was computed by Ben-Tal and Teboulle [Reference Ben-Tal and Teboulle7, Reference Ben-Tal and Teboulle8] to be

\[\rho(\,f) = \inf_{m \in {\mathbb R}}\bigg(\int_E\varphi^*(\,f(x)-m)\mu({\mathrm{d}} x) + m\bigg),\]

where $\varphi^*(x) = \sup_{y \in {\mathbb R}}(xy - \varphi(y))$ is the convex conjugate. We did not find any good expressions or estimates for $\rho_n$ or $\alpha_n$ , so the interpretation of the main Theorem 1.1 eludes us in this case.
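The reduction to the entropic case can at least be confirmed numerically for one particular $\varphi$: taking $\varphi(x)=x\log x - x + 1$ gives $\varphi^*(y)={\mathrm{e}}^y-1$, and a short calculus exercise shows the infimum over m is attained at $m = \log\int {\mathrm{e}}^{\,f}\,{\mathrm{d}}\mu$, so the optimized certainty equivalent collapses to the entropic functional. A sketch with toy data:

```python
import math

mu = [0.3, 0.3, 0.4]
f = [0.2, -1.0, 1.5]

def phi_conj(y):
    # Conjugate of phi(x) = x*log(x) - x + 1, namely phi*(y) = e^y - 1.
    return math.exp(y) - 1.0

def oce(f, mu, lo=-10.0, hi=10.0, steps=100001):
    """inf_m ( integral phi*(f - m) d(mu) + m ) over a fine grid of m."""
    best = float("inf")
    for i in range(steps):
        m = lo + (hi - lo) * i / (steps - 1)
        best = min(best, sum(phi_conj(fi - m) * mi
                             for fi, mi in zip(f, mu)) + m)
    return best

entropic = math.log(sum(math.exp(fi) * mi for fi, mi in zip(f, mu)))
```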

A related choice is the shortfall risk measure introduced by Föllmer and Schied [Reference Föllmer and Schied27]:

\begin{equation*} \rho(\,f) = \inf\bigg\{m \in {\mathbb R} \colon \int_E\ell(\,f(x)-m)\mu({\mathrm{d}} x) \le 1\bigg\}.\end{equation*}

This choice of $\rho$ and the corresponding (tractable!) $\alpha$ are discussed briefly in Section 4.1. The choice of $\ell(x) = [(1+x)^+]^q$ corresponds to (1.5), and we make extensive use of this in Section 4, as was discussed in Section 1.1. The choice of $\ell(x) = {\mathrm{e}}^x$ recovers the classical case $\rho(\,f) = \log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu$ . Aside from these two examples, for general $\ell$ , we found no useful expressions or estimates for $\rho_n$ or $\alpha_n$ . In connection with tails of random variables, shortfall risk measures have an intuitive appeal stemming from the following simple analog of Chernoff’s bound, observed in [Reference Lacker39, Proposition 3.3]. If $\gamma(\lambda) = \rho(\lambda f)$ for all $\lambda \ge 0$ , where f is some given measurable function, then $\mu(\,f \gt t) \le 1/\ell(\gamma^*(t))$ for all $t \ge 0$ , where $\gamma^*(t) = \sup_{\lambda \ge 0}(\lambda t - \gamma(\lambda))$ .
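Since $\ell$ is increasing, the map $m \mapsto \int\ell(\,f-m)\,{\mathrm{d}} \mu$ is non-increasing, so the defining infimum of the shortfall risk can be computed by bisection. The following sketch (the discrete measure and the bracket $[-50,50]$ are ad hoc illustrative choices) confirms that $\ell(x)={\mathrm{e}}^x$ recovers $\log\int_E{\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu$ :

```python
import numpy as np

def shortfall_risk(f, w, loss, lo=-50.0, hi=50.0):
    # rho(f) = inf{ m : int loss(f - m) dmu <= 1 }; since loss is increasing,
    # the constraint is non-increasing in m, so bisection applies.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.dot(w, loss(f - mid)) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

x = np.array([0.0, 1.0, 2.0])
w = np.array([0.5, 0.3, 0.2])   # weights of a toy measure mu
f = x ** 2

rho = shortfall_risk(f, w, np.exp)          # loss(x) = e^x
entropic = np.log(np.dot(w, np.exp(f)))     # log int e^f dmu
print(rho, entropic)                        # the two agree
```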

It is worth pointing out the natural but ultimately fruitless idea of working with

\[\rho(\,f) = \varphi^{-1}\bigg(\int_E\varphi(\,f)\,{\mathrm{d}} \mu\bigg),\]

where $\varphi$ is increasing. Such functionals were first studied, it seems, by Hardy, Littlewood, and Pólya [Reference Hardy, Littlewood and Pólya32, Chapter 3], who gave necessary and sufficient conditions for $\rho$ to be convex (rediscovered in [Reference Ben-Tal and Teboulle7]). Using the formula (1.4) to compute $\rho_n$ , this choice would lead to the exceptionally pleasant formula

\[\rho_n(\,f) = \varphi^{-1}\bigg(\int_{E^n}\varphi(\,f)\,{\mathrm{d}} \mu^n\bigg),\]

which we observed already in the classical case $\varphi(x)= {\mathrm{e}}^x$ . Unfortunately, however, such a $\rho$ cannot come from a functional $\alpha$ on $\mathcal{P}(E)$ , in the sense that (1.1) cannot hold unless $\varphi$ is affine or exponential. The problem, as is known in the risk measure literature, is that the additivity property $\rho(\,f+c)=\rho(\,f)+c$ for all $c \in {\mathbb R}$ and $f \in B(E)$ fails in every other case (cf. [Reference Föllmer and Schied28, Proposition 2.46]).
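The failure of this additivity for non-exponential, non-affine $\varphi$ is easy to witness numerically. In the sketch below (a toy two-point measure; the choices are purely illustrative), $\varphi(t)={\mathrm{e}}^t$ satisfies $\rho(\,f+c)=\rho(\,f)+c$ exactly, while the increasing convex choice $\varphi(t)=t^3$ on ${\mathbb R}_+$ does not:

```python
import numpy as np

# Two-point measure mu = (1/2, 1/2) on {0, 1}; c is a cash shift.
w = np.array([0.5, 0.5])
f = np.array([0.0, 1.0])
c = 2.0

def cert_equiv(phi, phi_inv, g):
    # rho(g) = phi^{-1}( int phi(g) dmu )
    return phi_inv(np.dot(w, phi(g)))

ce_exp = lambda g: cert_equiv(np.exp, np.log, g)           # exponential phi
ce_cube = lambda g: cert_equiv(lambda t: t ** 3,           # phi(t) = t^3 on R_+
                               lambda t: t ** (1.0 / 3.0), g)

d_exp = ce_exp(f + c) - (ce_exp(f) + c)     # additivity holds: ~0
d_cube = ce_cube(f + c) - (ce_cube(f) + c)  # additivity fails: visibly nonzero
print(d_exp, d_cube)
```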

1.6. Interpreting Theorem 1.1 in terms of risk measures

It is straightforward to rewrite Theorem 1.1 in a language more in line with the literature on convex risk measures, for which we again defer to [Reference Föllmer and Schied28] for background. Let $(\Omega,\mathcal{F})$ be a measurable space, and suppose $\varphi$ is a convex risk measure on the set $B(\Omega,\mathcal{F})$ of bounded measurable functions. That is, $\varphi \colon B(\Omega,\mathcal{F}) \rightarrow {\mathbb R}$ is convex, $\varphi(\,f + c) = \varphi(\,f)+c$ for all $f \in B(\Omega,\mathcal{F})$ and $c \in {\mathbb R}$ , and $\varphi(\,f) \ge \varphi(g)$ whenever $f \ge g$ pointwise. Suppose we are given a sequence of E-valued random variables $(X_i)_{i=1}^\infty$ , i.e. measurable maps $X_i \colon \Omega \rightarrow E$ . Assume the $X_i$ have the following independence property, identical to Peng’s notion of independence under nonlinear expectations [Reference Peng47]: for $n \ge 1$ and $f \in B(E^n)$ ,

\begin{equation*} \varphi(\,f(X_1,\ldots,X_n)) = \varphi [\varphi(\,f(X_1,\ldots,X_{n-1},x)) |_{x = X_n}].\end{equation*}

In particular, $\varphi(\,f(X_i))=\varphi(\,f(X_1))$ for all i. Define $\alpha \colon \mathcal{P}(E) \rightarrow ({-}\infty,\infty]$ by

\[\alpha(\nu) = \sup_{f \in B(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \varphi(\,f(X_1))\bigg).\]

Additional assumptions on $\varphi$ (see e.g. Theorem 2.2 below) can ensure that $\alpha$ has weakly compact sub-level sets, so that Theorem 1.1 applies. Then, for $F \in C_b(\mathcal{P}(E))$ ,

(1.11) \begin{equation}\lim_{n\rightarrow\infty}\dfrac{1}{n}\varphi(nF(L_n(X_1,\ldots,X_n))) = \sup_{\nu \in \mathcal{P}(E)}(F(\nu)-\alpha(\nu)){.}\end{equation}

Indeed, in our previous notation, $\rho_n(\,f)=\varphi(\,f(X_1,\ldots,X_n))$ for $f \in B(E^n)$ .

In the risk measure literature, one thinks of $\varphi(\,f)$ as the risk associated with an uncertain financial loss $f \in B(\Omega,\mathcal{F})$ . With this in mind, and with $Z_n=F(L_n(X_1,\ldots,X_n))$ , the quantity $\varphi(nZ_n)$ appearing in (1.11) is the risk of an investment in n units of $Z_n$ . One might interpret $Z_n$ as capturing the composition of the investment, while the multiplicative factor n represents the size of the investment. As n increases, say to $n+1$ , the investment is ‘rebalanced’ in the sense that one additional independent component, $X_{n+1}$ , is incorporated and the size of the total investment is increased by one unit. The limit in (1.11) is then an asymptotic evaluation of the risk per unit of this rebalancing scheme.

1.7. Extensions

Broadly speaking, the book of Dupuis and Ellis [Reference Dupuis and Ellis22] and numerous subsequent works illustrate how the classical convex duality between relative entropy and cumulant generating functions can serve as a foundation from which to derive an impressive range of large deviation principles. Similarly, each alternative dual pair $(\alpha,\rho)$ should provide an alternative foundation for a potentially equally wide range of limit theorems. From this perspective, our work raises more questions than it answers by restricting attention to analogs of the two large deviation principles of Sanov and Cramér. It is possible, for instance, that an analog of Mogulskii’s theorem (see [Reference Mogul’skii42] or [Reference Dupuis and Ellis22, Section 3]) holds in our context, though one must not expect any such analog to look too much like a heavy-tailed large deviation principle, in light of the negative result of [Reference Rhee, Blanchet and Zwart49, Section 4.4]. These speculations are pursued no further but are meant to convey the versatility of our framework. In fact, extensions and applications of our framework have appeared since the first version of this paper. First, [Reference Eckstein23] extended the ideas beyond the i.i.d. setting, to the study of occupation measures of Markov chains. More recently, [Reference Backhoff-Veraguas, Lacker and Tangpi5] applied Theorem 1.1 to obtain new limit theorems for Brownian motion, with connections to Schilder’s theorem, vanishing noise limits of BSDEs and PDEs, and Schrödinger problems.

1.8. Outline of the paper

The remainder of the paper is organized as follows. Section 2 begins by clarifying the $(\alpha,\rho)$ duality, explaining some useful properties of $\rho$ and $\rho_n$ and extending their definitions to unbounded functions. In Section 2.2 we state and prove an extension of Theorem 1.1, which contains Theorem 1.1 as a special case but accommodates stronger topologies and unbounded functions F. See also Section 2.3 for abstract analogs of the contraction principle and Cramér’s theorem. Section 3 elaborates on the additional topological assumptions needed for the extension of the main theorem. Then, Section 4 focuses on the particular choice of $\alpha$ in (1.5), providing proofs of the claims of Section 1.1. Sections 5 and 6 elaborate on the examples of Sections 1.2 and 1.4, respectively. Appendix A proves the alternative representations of $\rho_n$ given in (1.4). Finally, two minor technical results are relegated to Appendix B.

2. Convex duality and an extension of Theorem 1.1

We begin by outlining the key features of the $(\alpha,\rho)$ duality, as a first step toward stating and proving an extension of the main theorem as well as an abstract analog of Cramér’s theorem. The first two theorems below are borrowed from the literature on convex risk measures, for which an excellent reference is the book of Föllmer and Schied [Reference Föllmer and Schied28]. While we will make use of some of the properties listed in Theorem 2.1, the goal of the first two theorems is more to illustrate how one can make $\rho$ the starting point rather than $\alpha$ . In particular, Theorem 2.2 will not be needed below. For proofs of Theorems 2.1 and 2.2, refer to Bartl [Reference Bartl6, Theorem 2.6].

Theorem 2.1. Suppose $\alpha \colon \mathcal{P}(E) \rightarrow ({-}\infty,\infty]$ is convex and has weakly compact sub-level sets. Define $\rho \colon B(E) \rightarrow {\mathbb R}$ as in (1.1). Then the following hold.

  1. (R1) If $f \ge g$ pointwise then $\rho(\,f) \ge \rho(g)$ .

  2. (R2) If $f \in B(E)$ and $c \in {\mathbb R}$ , then $\rho(\,f+c)=\rho(\,f)+c$ .

  3. (R3) If $f,f_n \in B(E)$ with $f_n \uparrow f$ pointwise, then $\rho(\,f_n) \uparrow \rho(\,f)$ .

  4. (R4) If $f_n \in C_b(E)$ and $f \in B(E)$ with $f_n \downarrow f$ pointwise, then $\rho(\,f_n) \downarrow \rho(\,f)$ .

Moreover, for $\nu \in \mathcal{P}(E)$ we have

(2.1) \begin{equation}\alpha(\nu) = \sup_{f \in C_b(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \rho(\,f)\bigg).\end{equation}

Theorem 2.2. Suppose $\rho \colon B(E) \rightarrow {\mathbb R}$ is convex and satisfies properties (R1–R4) of Theorem 2.1. Define $\alpha \colon \mathcal{P}(E) \rightarrow ({-}\infty,\infty]$ by (2.1). Then $\alpha$ is convex and has weakly compact sub-level sets. Moreover, the identity (1.1) holds.
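On a finite state space, the duality of Theorems 2.1 and 2.2 can be verified directly. The sketch below (illustrative only; the two measures are arbitrary) takes the classical pair $\alpha(\cdot)=H(\cdot \mid \mu)$ and $\rho(\,f)=\log\int{\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu$ and checks that the supremum in (2.1) is attained at $f=\log({\mathrm{d}} \nu/{\mathrm{d}} \mu)$ , while other choices of f give smaller values:

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

H = np.dot(nu, np.log(nu / mu))                  # relative entropy H(nu | mu)
rho = lambda f: np.log(np.dot(mu, np.exp(f)))    # entropic rho, dual of H(. | mu)

# The supremum in (2.1) is attained at f* = log(dnu/dmu):
f_star = np.log(nu / mu)
attained = np.dot(nu, f_star) - rho(f_star)
print(attained, H)                               # equal

# Any other f gives a smaller value:
for f in [np.zeros(3), np.array([1.0, -1.0, 0.5])]:
    print(np.dot(nu, f) - rho(f) <= H)
```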

For the rest of the paper, unless stated otherwise, we work at all times with the standing assumptions on $\alpha$ described in the Introduction.

Standing assumptions. The function $\alpha \colon \mathcal{P}(E) \rightarrow ({-}\infty,\infty]$ is convex, has weakly compact sub-level sets, and is not identically equal to $\infty$ . Lastly, $\rho$ is defined as in (1.1).

We next extend the domain of $\rho$ and $\rho_n$ to unbounded functions. Let $\overline{{\mathbb R}} = {\mathbb R} \cup \{-\infty,\infty\}$ . We adopt the convention that $\infty - \infty \,:\!= -\infty$ , although this will have few consequences aside from streamlined definitions. In particular, if $\nu \in \mathcal{P}(E^n)$ and a measurable function $f \colon E^n \rightarrow \overline{{\mathbb R}}$ satisfies $\int f^-\,{\mathrm{d}} \nu = \int f^+\,{\mathrm{d}} \nu = \infty$ , we define $\int f\,{\mathrm{d}} \nu = -\infty$ .

Definition 2.1. For $n \ge 1$ and measurable $f \colon E^n \rightarrow \overline{{\mathbb R}}$ , define

\[\rho_n(\,f) = \sup_{\nu \in \mathcal{P}(E^n)}\bigg(\int_{E^n}f\,{\mathrm{d}} \nu - \alpha_n(\nu)\bigg).\]

As usual, abbreviate $\rho \equiv \rho_1$ . It is worth emphasizing that while $\rho(\,f)$ is finite for bounded f, it can be either $+\infty$ or $-\infty$ when f is unbounded.

2.1. Stronger topologies on $\mathcal{P}(E)$

As a last preparation, we discuss a well-known class of topologies on subsets of $\mathcal{P}(E)$ with which we will work frequently. Given a continuous function $\psi \colon E \rightarrow {\mathbb R}_+ \,:\!= [0,\infty)$ , define

\[\mathcal{P}_\psi(E) = \bigg\{\mu \in \mathcal{P}(E) \colon \int_E\psi\,{\mathrm{d}} \mu < \infty\bigg\}.\]

Endow $\mathcal{P}_\psi(E)$ with the (Polish) topology generated by the maps $\nu \mapsto \int_Ef\,{\mathrm{d}} \nu$ , where $f \colon E \rightarrow {\mathbb R}$ is continuous and $|\,f| \le 1+\psi$ ; we call this the $\psi$ -weak topology. A useful fact about this topology is that a set $M \subset \mathcal{P}_\psi(E)$ is pre-compact if and only if for every $\epsilon \gt 0$ there exists a compact set $K \subset E$ such that

\[\sup_{\mu \in M}\int_{K^c}\psi\,{\mathrm{d}} \mu \le \epsilon.\]

This is easily proved directly using Prokhorov’s theorem, or refer to [Reference Föllmer and Schied28, Corollary A.47]. It is worth noting that if d is a compatible metric on E and $\psi(x)=d^p(x,x_0)$ for some fixed $x_0 \in E$ and $p \ge 1$ , then the $\psi$ -weak topology is simply the p-Wasserstein topology associated with the metric d [Reference Villani52, Theorem 7.12].
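To see why the $\psi$ -weak topology is genuinely stronger than the weak topology, consider the standard example (purely illustrative, not from the text) $\nu_n = (1-1/n)\delta_0 + (1/n)\delta_n$ on $E={\mathbb R}$ with $\psi(x)=|x|$ : the sequence converges weakly to $\delta_0$ , yet $\int\psi\,{\mathrm{d}} \nu_n = 1$ for every n, so $\psi$ -mass escapes every compact set and there is no $\psi$ -weak convergence.

```python
import numpy as np

# nu_n = (1 - 1/n) delta_0 + (1/n) delta_n: converges weakly to delta_0,
# but int psi d(nu_n) = 1 for psi(x) = |x|, so not psi-weakly.
for n in [10, 100, 1000]:
    atoms = np.array([0.0, float(n)])
    weights = np.array([1.0 - 1.0 / n, 1.0 / n])
    bounded = np.dot(weights, np.minimum(atoms, 1.0))  # against min(x,1): -> 0
    psi_int = np.dot(weights, np.abs(atoms))           # int |x| d(nu_n): stays 1
    print(n, bounded, psi_int)
```

This is precisely the phenomenon that the pre-compactness criterion above rules out.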

2.2. An extension of Theorem 1.1

In this section we state and prove a useful generalization of Theorem 1.1 for stronger topologies and unbounded functions, taking advantage of the preparations of the previous sections. At all times in this section, the standing assumptions on $(\alpha,\rho)$ (stated early in Section 2) are in force. Part of Theorem 2.3 below requires the assumption that the sub-level sets of $\alpha$ are pre-compact in $\mathcal{P}_\psi(E)$ , and this rather opaque assumption will be explored in more detail in Section 3.1.

Theorem 2.3. Let $\psi \colon E \rightarrow {\mathbb R}_+$ be continuous. If $F \colon \mathcal{P}_\psi(E) \rightarrow {\mathbb R} \cup \{\infty\}$ is lower semicontinuous (with respect to the $\psi$ -weak topology) and bounded from below, then

\[\liminf_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF \circ L_n) \ge \sup_{\nu \in \mathcal{P}_\psi(E)}(F(\nu) - \alpha(\nu)).\]

Suppose also that the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ . If $F \colon \mathcal{P}_\psi(E) \rightarrow {\mathbb R} \cup \{-\infty\}$ is upper semicontinuous and bounded from above, then

\[\limsup_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF \circ L_n) \le \sup_{\nu \in \mathcal{P}_\psi(E)}(F(\nu) - \alpha(\nu)).\]

Proof of lower bound. Let us first prove the lower bound. It is immediate from the definition that $n^{-1}\alpha_n(\nu^n) = \alpha(\nu)$ for each $\nu \in \mathcal{P}(E)$ , recalling that $\nu^n$ denotes the n-fold product measure. Thus

(2.2) \begin{align}\dfrac{1}{n}\rho_n(nF(L_n)) &= \sup_{\nu \in \mathcal{P}(E^n)}\bigg\{\int_{E^n} F \circ L_n \,{\mathrm{d}} \nu - \dfrac{1}{n}\alpha_n(\nu)\bigg\} \notag \\* &\ge \sup_{\nu \in \mathcal{P}(E)}\bigg\{\int_{E^n} F \circ L_n \,{\mathrm{d}} \nu^n - \dfrac{1}{n}\alpha_n(\nu^n)\bigg\} \notag \\* &= \sup_{\nu \in \mathcal{P}(E)}\bigg\{\int_{E^n} F \circ L_n \,{\mathrm{d}} \nu^n - \alpha(\nu)\bigg\}.\end{align}

For $\nu \in \mathcal{P}(E)$ , the law of large numbers (see [Reference Dudley20, Theorem 11.4.1]) implies $\nu^n \circ L_n^{-1} \rightarrow \delta_\nu$ weakly, i.e. in $\mathcal{P}(\mathcal{P}(E))$ . For $\nu \in \mathcal{P}_{\psi}(E)$ , the convergence takes place in $\mathcal{P}(\mathcal{P}_{\psi}(E))$ . Lower semicontinuity of F on $\mathcal{P}_\psi(E)$ then implies (e.g. by [Reference Dupuis and Ellis22, Theorem A.3.12]), for each $\nu \in \mathcal{P}_\psi(E)$ ,

\begin{align*}\liminf_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF(L_n)) &\ge \liminf_{n\rightarrow\infty}\int_{E^n} F \circ L_n \,{\mathrm{d}} \nu^n - \alpha(\nu) \\* &\ge F(\nu) - \alpha(\nu).\end{align*}

Take the supremum over $\nu$ to complete the proof of the lower bound.
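The law-of-large-numbers step in the lower bound can be seen concretely on a two-point space, where $\int F \circ L_n\,{\mathrm{d}} \nu^n$ is an exact finite sum. In the sketch below (the parameter p and the function F are illustrative choices), $L_n$ is determined by the number of ones in the sample, so the integral reduces to a binomial expectation converging to $F(\nu)$ :

```python
import numpy as np
from scipy.stats import binom

# E = {0,1}, nu = Bernoulli(p); L_n is determined by the number of ones.
p = 0.3
F = lambda mean: np.exp(-mean ** 2)   # a continuous F depending on nu via its mean

def int_F_Ln(n):
    # int F(L_n) dnu^n as an exact binomial sum over the number of ones
    k = np.arange(n + 1)
    return np.dot(binom.pmf(k, n, p), F(k / n))

for n in [10, 100, 1000]:
    print(n, int_F_Ln(n))
print("limit F(nu) =", F(p))   # the values above approach this limit
```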

Proof of upper bound, F bounded. The upper bound is more involved. First we prove it in four steps under the assumption that F is bounded.

Step 1. First we simplify the expression somewhat. For each $\nu \in \mathcal{P}(E^n)$ the definition of $\alpha_n$ and convexity of $\alpha$ imply

\begin{align*}\dfrac{1}{n}\alpha_n(\nu) &= \dfrac{1}{n}\sum_{k=1}^n\int_{E^n}\alpha(\nu_{k-1,k}(x_1,\ldots,x_{k-1}) )\nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n) \\* &\ge \int_{E^n}\alpha\Bigg(\dfrac{1}{n}\sum_{k=1}^n\nu_{k-1,k}(x_1,\ldots,x_{k-1})\Bigg)\nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n).\end{align*}

Combine this with (2.2) to get

(2.3) \begin{equation}\dfrac{1}{n}\rho_n(nF(L_n)) \le \sup_{\nu \in \mathcal{P}(E^n)}\int_{E^n}\Bigg[F(L_n) - \alpha\Bigg(\dfrac{1}{n}\sum_{k=1}^n\nu_{k-1,k}\Bigg)\Bigg]\,{\mathrm{d}} \nu.\end{equation}

Now choose arbitrarily some $\mu_f$ such that $\alpha(\mu_f) \lt \infty$ . The choice $\nu = \mu_f^n$ and boundedness of F show that the supremum in (2.3) is bounded below by $-\|F\|_{\infty} - \alpha(\mu_f)$ , where $\|F\|_\infty \,:\!= \sup_{\nu \in \mathcal{P}_\psi(E)}|F(\nu)|$ . For each n, choose $\nu^{(n)} \in \mathcal{P}(E^n)$ attaining the supremum in (2.3) to within $1/n$ . Then

(2.4) \begin{equation}\int_{E^n}\alpha\Bigg(\dfrac{1}{n}\sum_{k=1}^n\nu^{(n)}_{k-1,k}\Bigg)\,{\mathrm{d}} \nu^{(n)} \le 2\|F\|_{\infty} + \alpha(\mu_f) + \dfrac{1}{n}.\end{equation}

It is convenient to switch now to a probabilistic notation. On some sufficiently rich probability space, find an $E^n$ -valued random variable $(Y^n_1,\ldots,Y^n_n)$ with law $\nu^{(n)}$ . Define the random measures

\[S_n \,:\!= \dfrac{1}{n}\sum_{k=1}^n\nu^{(n)}_{k-1,k}(Y^n_1,\ldots,Y^n_{k-1}), \quad \skew3\widetilde{S}_n \,:\!= \dfrac{1}{n}\sum_{k=1}^n\delta_{Y^n_k}.\]

Use (2.3) and unwrap the definitions to find

(2.5) \begin{equation}\dfrac{1}{n}\rho_n(nF(L_n)) \le {\mathbb E}[F(\skew3\widetilde{S}_n) - \alpha(S_n)] + 1/n.\end{equation}

Moreover, (2.4) implies

(2.6) \begin{equation}\sup_n{\mathbb E}[\alpha(S_n)] \le 2\|F\|_\infty + \alpha(\mu_f) + 1 < \infty.\end{equation}

Step 2. We next show that the sequence $(S_n,\skew3\widetilde{S}_n)$ is tight, viewed as $\mathcal{P}_\psi(E)\times\mathcal{P}_\psi(E)$ -valued random variables. Here we use the assumption that the sub-level sets of $\alpha$ are $\psi$ -weakly compact subsets of $\mathcal{P}_\psi(E)$ . It then follows from (2.6) that $(S_n)$ is tight (see e.g. [Reference Dupuis and Ellis22, Theorem A.3.17]).

To see that the pair $(S_n,\skew3\widetilde{S}_n)$ is tight, it remains to check that $(\skew3\widetilde{S}_n)_n$ is tight. To this end, we first notice that $S_n$ and $\skew3\widetilde{S}_n$ have the same mean measure for each n, in the sense that for every $f \in B(E)$ we have

(2.7) \begin{align} {\mathbb E}\bigg[\int_Ef\,{\mathrm{d}} S_n\bigg] &= {\mathbb E}\Bigg[\dfrac{1}{n}\sum_{k=1}^n{\mathbb E} [ f(Y^n_k) \mid Y^n_1,\ldots,Y^n_{k-1}]\Bigg] \notag \\* & = {\mathbb E}\Bigg[\dfrac{1}{n}\sum_{k=1}^nf(Y^n_k)\Bigg] \notag \\ &= {\mathbb E}\bigg[\int_Ef\,{\mathrm{d}}\skew3\widetilde{S}_n\bigg].\end{align}

To prove $(\skew3\widetilde{S}_n)$ is tight, it suffices (by Prokhorov’s theorem) to show that for all $\epsilon \gt 0$ there exists a $\psi$ -weakly compact set $K \subset \mathcal{P}_\psi(E)$ such that $P(\skew3\widetilde{S}_n \notin K) \le \epsilon$ for all n. We will look for K of the form

\[K = \cap_{k=1}^\infty\bigg\{\nu \colon \int_{C_k^c}\psi\,{\mathrm{d}} \nu \le 1/k\bigg\},\]

where $(C_k)_{k=1}^\infty$ is a sequence of compact subsets of E to be specified later; indeed, sets K of this form are pre-compact in $\mathcal{P}_\psi(E)$ according to a form of Prokhorov’s theorem discussed in Section 2.1 (see also [Reference Föllmer and Schied28, Corollary A.47]). For such a set K, use Markov’s inequality and (2.7) to compute

(2.8) \begin{align}P (\skew3\widetilde{S}_n \notin K ) &\le \sum_{k=1}^\infty P\bigg(\int_{C_k^c}\psi\,{\mathrm{d}}\skew3\widetilde{S}_n> 1/k\bigg) \notag \\*&\le \sum_{k=1}^\infty k\,{\mathbb E}\int_{C_k^c}\psi\,{\mathrm{d}}\skew3\widetilde{S}_n \notag \\*& = \sum_{k=1}^\infty k\,{\mathbb E}\int_{C_k^c}\psi\,{\mathrm{d}} S_n.\end{align}

By a form of Jensen’s inequality (see Lemma B.2),

\[\sup_n\alpha({\mathbb E} S_n) \le \sup_n{\mathbb E}[\alpha(S_n)] < \infty,\]

where ${\mathbb E} S_n$ is the probability measure on E defined by $({\mathbb E} S_n)(A) = {\mathbb E}[S_n(A)]$ . Hence, the sequence $({\mathbb E} S_n)$ is pre-compact in $\mathcal{P}_\psi(E)$ , thanks to the assumption that sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ . It follows that for every $\epsilon \gt 0$ there exists a compact set $C \subset E$ such that $\sup_n{\mathbb E}\int_{C^c}\psi\,{\mathrm{d}} S_n \le \epsilon$ . With this in mind, we may choose each $C_k$ so that $\sup_n{\mathbb E}\int_{C_k^c}\psi\,{\mathrm{d}} S_n \le \epsilon 2^{-k}/k$ , which makes the right-hand side of (2.8) at most $\sum_{k=1}^\infty \epsilon 2^{-k} = \epsilon$ , uniformly in n. This shows that $(\skew3\widetilde{S}_n)$ is tight, completing Step 2.

Step 3. We next show that every limit in distribution of $(S_n,\skew3\widetilde{S}_n)$ is concentrated on the diagonal $\{(\nu,\nu) \colon \nu \in \mathcal{P}_\psi(E)\}$ . By definition of $\nu^{(n)}_{k-1,k}$ , we have

\[{\mathbb E}\bigg[ f(Y^n_k) - \int_E f\,{\mathrm{d}} \nu^{(n)}_{k-1,k}(Y^n_1,\ldots,Y^n_{k-1})\mid Y^n_1,\ldots,Y^n_{k-1}\bigg] = 0 \quad \text{for } k=1,\ldots,n\]

for every $f \in B(E)$ . That is, the terms inside the expectation form a martingale difference sequence. Thus, for $f \in B(E)$ , we have

(2.9) \begin{align}{\mathbb E}\bigg[\bigg(\int_E f\,{\mathrm{d}} S_n - \int_E f\,{\mathrm{d}}\skew3\widetilde{S}_n\bigg)^2 \bigg] &= {\mathbb E}\Bigg[\Bigg(\dfrac{1}{n}\sum_{k=1}^n\bigg( f(Y^n_k) - \int_E f\,{\mathrm{d}} \nu^{(n)}_{k-1,k}(Y^n_1,\ldots,Y^n_{k-1})\bigg)\Bigg)^2\Bigg] \notag \\* &= \dfrac{1}{n^2}\sum_{k=1}^n{\mathbb E}\bigg[\bigg( f(Y^n_k) - \int_E f\,{\mathrm{d}} \nu^{(n)}_{k-1,k}(Y^n_1,\ldots,Y^n_{k-1})\bigg)^2\bigg] \notag \\* &\le 2\|\,f\|_{\infty}^2/n,\end{align}

where $\|\,f\|_\infty \,:\!= \sup_{x \in E}|\,f(x)|$ . It is straightforward to check that (2.9) implies that every weak limit of $(S_n,\skew3\widetilde{S}_n)$ is concentrated on (i.e. almost surely belongs to) the diagonal $\{(\nu,\nu) \colon \nu \in \mathcal{P}_\psi(E)\}$ (cf. [Reference Dupuis and Ellis22, Lemma 2.5.1(b)]). Indeed, if $(S,\skew3\widetilde{S})$ is some $\mathcal{P}_\psi(E) \times \mathcal{P}_\psi(E)$ -valued random variable such that $(S_{n_k},\skew3\widetilde{S}_{n_k})$ converges in law to $(S,\skew3\widetilde{S})$ , then (2.9) implies

\begin{equation*}{\mathbb E}\bigg[\bigg(\int_E f\,{\mathrm{d}} S - \int_E f\,{\mathrm{d}}\skew3\widetilde{S}\bigg)^2\bigg] = 0,\end{equation*}

for each $f \in C_b(E)$ , by continuity of the map

\[ \mathcal{P}_\psi(E) \times \mathcal{P}_\psi(E) \ni (\nu,\widetilde{\nu}) \mapsto \bigg(\int_E f\,{\mathrm{d}} \nu - \int_E f\,{\mathrm{d}}\widetilde{\nu}\bigg)^2.\]

Hence, $\int_E f\,{\mathrm{d}} S = \int_E f\,{\mathrm{d}}\skew3\widetilde{S}$ a.s. for each $f \in C_b(E)$ , and arguing with a countable separating family from $C_b(E)$ (see e.g. [Reference Parthasarathy45, Theorem 6.6]) allows us to deduce that $S=\skew3\widetilde{S}$ a.s.

Step 4. We can now complete the proof of the upper bound. With Step 3 in mind, fix a subsequence and a $\mathcal{P}_\psi(E)$ -valued random variable $\eta$ such that $(S_n,\skew3\widetilde{S}_n) \rightarrow (\eta,\eta)$ in distribution (where we relabeled the subsequence). Recall that $\alpha$ is bounded from below and $\psi$ -weakly lower semicontinuous, whereas F is upper semicontinuous and bounded. Returning to (2.5), we conclude now that

\begin{align*}\limsup_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF(L_n)) &\le \limsup_{n\rightarrow\infty}{\mathbb E} [F(\skew3\widetilde{S}_n) - \alpha(S_n)] \\* &\le {\mathbb E}[F(\eta) - \alpha(\eta)] \\* &\le \sup_{\nu \in \mathcal{P}_\psi(E)} \{F(\nu) - \alpha(\nu)\}.\end{align*}

Of course, we abused notation by relabeling the subsequences, but we have argued that for every subsequence there exists a further subsequence for which this bound holds, which proves the upper bound for F bounded.

Proof of upper bound, unbounded F. With the proof complete for bounded F, we now remove the boundedness assumption using a natural truncation procedure. Let $F \colon \mathcal{P}_\psi(E) \rightarrow {\mathbb R} \cup \{-\infty\}$ be upper semicontinuous and bounded from above. For $m \gt 0$ let $F_m \,:\!= F \vee ({-}m)$ . Since $F_m$ is bounded and upper semicontinuous, the previous step yields

\[\limsup_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF_m(L_n)) \le \sup_{\nu \in \mathcal{P}_\psi(E)} \{F_m(\nu) - \alpha(\nu)\} =\!:\, S_m,\]

for each $m \gt 0$ . Since $F_m \ge F$ , we have

\[\rho_n(nF_m(L_n)) \ge \rho_n(nF(L_n))\]

for each m, and it remains only to show that

(2.10) \begin{equation}\lim_{m \rightarrow \infty}S_m = \sup_{\nu \in \mathcal{P}_\psi(E)}\{F(\nu) - \alpha(\nu)\} =\!:\, S.\end{equation}

Clearly $S_m \ge S$ , since $F_m \ge F$ . Note that $S \lt \infty$ , as F and $\alpha$ are bounded from above and from below, respectively. If $S = -\infty$ , then $F(\nu) = -\infty$ whenever $\alpha(\nu) \lt \infty$ , and we conclude that, as $m\rightarrow\infty$ ,

\[S_m \le -m - \inf_{\nu \in \mathcal{P}(E)}\alpha(\nu) \ \rightarrow \ -\infty = S.\]

Now suppose instead that S is finite. Fix $\epsilon \gt 0$ . For each $m \gt 0$ , find $\nu_m \in \mathcal{P}_\psi(E)$ such that

(2.11) \begin{equation}F_m(\nu_m) - \alpha(\nu_m) + \epsilon \ge S_m \ge S.\end{equation}

Since F is bounded from above and $S \gt -\infty$ , it follows that $\sup_m\alpha(\nu_m) \lt\infty$ . The sub-level sets of $\alpha$ are $\psi$ -weakly compact, and thus the sequence $(\nu_m)$ has a limit point (in $\mathcal{P}_\psi(E)$ ). Let $\nu_\infty$ denote any limit point, and suppose $\nu_{m_k} \rightarrow \nu_\infty$ . Note that $\inf_m F_m(\nu_m) \gt -\infty$ in light of (2.11), because $\alpha$ is bounded from below. Hence, for all sufficiently large m, we have $F_m(\nu_m)=F(\nu_m)$ . Thus

\begin{equation*}\limsup_{k\rightarrow\infty} \{F_{m_k}(\nu_{m_k}) - \alpha(\nu_{m_k})\} \le F(\nu_\infty) - \alpha(\nu_\infty)\le S,\end{equation*}

where the second inequality follows from upper semicontinuity of F and lower semicontinuity of $\alpha$ . This holds for any limit point of the pre-compact sequence $(\nu_m)$ , and it follows from (2.11) that

\[S \le \limsup_{m\rightarrow\infty}S_m \le \limsup_{m\rightarrow\infty}\{F_m(\nu_m) - \alpha(\nu_m)\} + \epsilon \le S + \epsilon.\]

Since $\epsilon \gt 0$ was arbitrary, this proves (2.10).

Remark 2.1. Several natural choices of $\alpha$ in fact have sub-level sets which are compact in the topology induced by bounded measurable test functions, i.e. the topology on $\mathcal{P}(E)$ generated by the maps $\nu \mapsto \int_E f\,{\mathrm{d}} \nu$ , where $f \in B(E)$ . While this topology is stronger than the usual weak convergence topology, the conclusion of Theorem 1.1 will likely still hold for bounded functions F which are continuous in this stronger (non-metrizable) topology. This is known to be true in the classical case $\alpha(\!\cdot\!)=H(\cdot \mid \mu)$ (see e.g. [Reference Dembo and Zeitouni16, Section 6.2]), where we recall the definition of relative entropy H from (1.2). For the sake of brevity, we do not pursue this generalization.

2.3. Contraction principles and an abstract form of Cramér’s theorem

Viewing Theorem 2.3 as an abstract form of Sanov’s theorem, we may derive from it a form of Cramér’s theorem. The key tool is an analog of the contraction principle from classical large deviations (cf. [Reference Dembo and Zeitouni16, Theorem 4.2.1]). In its simplest form, if $\varphi \colon \mathcal{P}(E) \rightarrow E'$ is continuous for some topological space $E'$ , then for $F \in C_b(E')$ we have, from Theorem 1.1,

\begin{equation*}\lim_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF \circ \varphi \circ L_n) = \sup_{\nu \in \mathcal{P}(E)} (F(\varphi(\nu)) - \alpha(\nu)) = \sup_{x \in E'}(F(x) - \alpha_\varphi(x)),\end{equation*}

where we define $\alpha_\varphi \colon E' \rightarrow ({-}\infty,\infty]$ by

\[\alpha_\varphi(x) \,:\!= \inf \{\alpha(\nu) \colon \nu \in \mathcal{P}(E), \ \varphi(\nu) = x \}.\]

This line of reasoning leads to the following extension of Cramér’s theorem.

Theorem 2.4. Let $(E,\|\cdot\|)$ be a separable Banach space with continuous dual $E^*$ . Define $\Lambda^* \colon E \rightarrow {\mathbb R} \cup \{\infty\}$ by

\begin{equation*}\Lambda^*(x) = \sup_{x^* \in E^*} (\langle x^*,x\rangle - \rho(x^*)).\end{equation*}

Define $S_n \colon E^n \rightarrow E$ by

\[S_n(x_1,\ldots,x_n) = \dfrac{1}{n}\sum_{i=1}^nx_i.\]

If $F \colon E \rightarrow {\mathbb R} \cup \{\infty\}$ is lower semicontinuous and bounded from below, then

\begin{equation*}\liminf_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF \circ S_n) \ge \sup_{x \in E}(F(x)-\Lambda^*(x)).\end{equation*}

Suppose also that the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ , for $\psi(x) \,:\!= \|x\|$ . If $F \colon E \rightarrow {\mathbb R} \cup \{-\infty\}$ is upper semicontinuous and bounded from above, then

\begin{equation*}\limsup_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF \circ S_n) \le \sup_{x \in E}(F(x)-\Lambda^*(x)).\end{equation*}

The proof makes use of a proposition, interesting in its own right, which generalizes the well-known result that the functions

\[t \mapsto \log\int_{\mathbb R} {\mathrm{e}}^{tx}\,\mu({\mathrm{d}} x) \quad \text{and} \quad t \mapsto \inf \bigg\{H(\nu \mid \mu) \colon \nu \in \mathcal{P}({\mathbb R}), \ \int_{\mathbb R} x\,\nu({\mathrm{d}} x) = t \bigg\}\]

are convex conjugates of each other (see e.g. [Reference Dupuis and Ellis22, Lemma 3.3.3]).
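For a concrete instance of this conjugacy (the classical case $\alpha(\cdot)=H(\cdot \mid \mu)$ with $\mu$ standard Gaussian, so that $\log\int{\mathrm{e}}^{tx}\,\mu({\mathrm{d}} x) = t^2/2$ with convex conjugate $x^2/2$ ), the Legendre transform can be checked by a brute-force supremum over a grid. The sketch below is purely illustrative, with ad hoc grid bounds:

```python
import numpy as np

# Classical case: mu = N(0,1), so Lambda(t) = log E[e^{tX}] = t^2/2,
# whose convex conjugate is Lambda*(x) = x^2/2.
t = np.linspace(-10.0, 10.0, 200001)   # ad hoc grid for the supremum
Lam = t ** 2 / 2.0

def conjugate(x):
    return np.max(x * t - Lam)         # sup_t ( t x - Lambda(t) ) over the grid

for x in [0.0, 0.5, 1.0, 2.0]:
    print(x, conjugate(x), x ** 2 / 2.0)   # grid supremum vs exact x^2/2
```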

Proposition 2.1. Let $(E,\|\cdot\|)$ be a separable Banach space, and let $\psi(x) = \|x\|$ . Suppose the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ . Define $\Psi \colon E \rightarrow {\mathbb R} \cup \{\infty\}$ by

\[\Psi(x) = \inf\bigg\{\alpha(\nu) \colon \nu \in \mathcal{P}_\psi(E), \ \int_Ez\,\nu({\mathrm{d}} z) = x\bigg\},\]

where the integral is in the sense of Bochner. Define $\Psi^*$ on the continuous dual $E^*$ by

\[\Psi^*(x^*) = \sup_{x \in E}(\langle x^*,x\rangle - \Psi(x)).\]

Then $\Psi$ is convex and lower semicontinuous, and $\Psi^*(x^*) = \rho(x^*)$ for every $x^* \in E^*$ . In particular,

(2.12) \begin{equation}\Psi(x) = \sup_{x^* \in E^*}(\langle x^*,x\rangle - \rho(x^*)).\end{equation}

Proof. We first show that $\Psi$ is convex. Let $t \in (0,1)$ and $x_1,x_2 \in E$ . Fix $\epsilon \gt 0$ , and find $\nu_1,\nu_2 \in \mathcal{P}_\psi(E)$ such that $\int_Ez\nu_i({\mathrm{d}} z)=x_i$ and $\alpha(\nu_i) \le \Psi(x_i) + \epsilon$ . Convexity of $\alpha$ yields

\begin{align*} \Psi(tx_1 + (1-t)x_2) &\le \alpha(t\nu_1+(1-t)\nu_2) \\* & \le t\alpha(\nu_1) + (1-t)\alpha(\nu_2) \\* &\le t\Psi(x_1) + (1-t)\Psi(x_2) + \epsilon.\end{align*}

To prove that $\Psi$ is lower semicontinuous, first note that $\Psi$ is bounded from below since $\alpha$ is. Let $x_n \rightarrow x$ in E, and find $\nu_n \in \mathcal{P}_\psi(E)$ such that $\alpha(\nu_n) \le \Psi(x_n) + 1/n$ and $\int_Ez\nu_n({\mathrm{d}} z)=x_n$ for each n. Fix a subsequence $\{x_{n_k}\}$ such that $\Psi(x_{n_k}) \lt \infty$ for all k and $\Psi(x_{n_k})$ converges to a finite value (if no such subsequence exists, then there is nothing to prove, as $\Psi(x_n) \rightarrow \infty$ ). Then $\sup_{k}\alpha(\nu_{n_k}) \lt \infty$ , and because $\alpha$ has $\psi$ -weakly compact sub-level sets there exists a further subsequence (again denoted $n_k$ ) and some $\nu_\infty \in \mathcal{P}_\psi(E)$ such that $\nu_{n_k}\rightarrow\nu_\infty$ . The convergence $\nu_{n_k}\rightarrow\nu_\infty$ in the $\psi$ -weak topology implies

\[x = \lim_{k\rightarrow\infty}x_{n_k} = \lim_{k\rightarrow\infty}\int_Ez\nu_{n_k}({\mathrm{d}} z) = \int_Ez\,\nu_\infty({\mathrm{d}} z).\]

Using lower semicontinuity of $\alpha$ we conclude

(2.13) \begin{equation}\Psi(x) \le \alpha(\nu_\infty) \le \liminf_{k\rightarrow\infty}\alpha(\nu_{n_k}) \le \liminf_{k\rightarrow\infty}\Psi(x_{n_k}).\end{equation}

For every sequence $(x_n)$ in E and any subsequence thereof, this argument shows that there exists a further subsequence for which (2.13) holds, and this proves that $\Psi$ is lower semicontinuous. Next, compute $\Psi^*$ as follows:

\begin{align*}\Psi^*(x^*) &= \sup_{x \in E} (\langle x^*,x\rangle - \Psi(x)) \\* &= \sup_{x \in E}\,\sup\bigg\{\langle x^*,x\rangle - \alpha(\nu) \colon \nu \in \mathcal{P}_\psi(E), \ \int_Ez\nu({\mathrm{d}} z)=x \bigg\} \\ &= \sup_{\nu \in \mathcal{P}_\psi(E)}\bigg(\bigg\langle x^*,\int_Ez\nu({\mathrm{d}} z)\bigg\rangle - \alpha(\nu)\bigg) \\ &= \sup_{\nu \in \mathcal{P}_\psi(E)}\bigg(\int_E\langle x^*,z\rangle\nu({\mathrm{d}} z) - \alpha(\nu)\bigg) \\* &= \rho(x^*).\end{align*}

Indeed, we can take the supremum equivalently over $\mathcal{P}_\psi(E)$ or over $\mathcal{P}(E)$ in the last two steps, thanks to the assumption that $\alpha = \infty$ away from $\mathcal{P}_\psi(E)$ and our convention $\infty-\infty=-\infty$ . Because $\Psi$ is lower semicontinuous and convex, we conclude from the Fenchel–Moreau theorem [Reference Zalinescu55, Theorem 2.3.3] that it is equal to its biconjugate, which is precisely what (2.12) says.

Proof of Theorem 2.4. The map

\[\mathcal{P}_\psi(E) \ni \mu \mapsto F\bigg(\int_Ez\,\mu({\mathrm{d}} z)\bigg)\]

is upper (resp. lower) semicontinuous as soon as F is upper (resp. lower) semicontinuous. The claims then follow from Theorem 2.3 and Proposition 2.1.

3. Compactness of sub-level sets of $\alpha$ in $\mathcal{P}_\psi(E)$

Several results of the previous section, such as the upper bound of Theorem 2.3, operate under the assumption that the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ . This section compiles some related properties of $(\rho,\alpha)$ which will be useful when we encounter specific examples later in the paper.

3.1. Cramér’s condition

A first useful result is a condition under which the effective domain of $\alpha$ is contained in $\mathcal{P}_\psi(E)$ .

Proposition 3.1. Fix a measurable function $\psi \colon E \rightarrow {\mathbb R}_+$ . Suppose $\rho(\lambda \psi) \lt \infty$ for some $\lambda \gt 0$ . Then, for each $\nu \in \mathcal{P}(E)$ satisfying $\alpha(\nu) \lt \infty$ , we have $\int \psi\,{\mathrm{d}} \nu \lt \infty$ .

Proof. By definition, for each $\nu \in \mathcal{P}(E)$ ,

\[\infty \gt \rho(\lambda\psi) \ge \lambda\int \psi\,{\mathrm{d}} \nu - \alpha(\nu).\]

If $\alpha(\nu) \lt \infty$ then certainly $\int\psi\,{\mathrm{d}} \nu \lt \infty$ .

The next and more important proposition identifies a condition under which the sub-level sets of $\alpha$ are not only weakly compact (which we always assume) but also $\psi$ -weakly compact.

Proposition 3.2. Fix a continuous function $\psi \colon E \rightarrow {\mathbb R}_+$ . Suppose

(3.1) \begin{equation}\lim_{m\rightarrow\infty}\rho(\lambda\psi 1_{\{\psi \ge m\}}) = \rho(0) \quad \text{{for all} } \lambda \gt 0.\end{equation}

Then, for each $c \in {\mathbb R}$ , the weak and $\psi$ -weak topologies coincide on $\{\nu \in \mathcal{P}(E) \colon \alpha(\nu) \le c\} \subset \mathcal{P}_\psi(E)$ ; in particular, the sub-level sets of $\alpha$ are $\psi$ -weakly compact.

A first step in the proof comes from the following simple lemma, worth stating separately for emphasis.

Lemma 3.1. Fix a continuous function $\psi \colon E \rightarrow {\mathbb R}_+$ . Suppose (3.1) holds. Then $\rho(\lambda\psi) \lt \infty$ for every $\lambda \ge 0$ . In particular, for each $\nu \in \mathcal{P}(E)$ satisfying $\alpha(\nu) \lt \infty$ , we have $\int\psi\,{\mathrm{d}} \nu \lt \infty$ .

Proof. The second claim is just Proposition 3.1. For $m,\lambda \gt 0$ we have $\lambda\psi \le \lambda m + \lambda\psi1_{\{\psi \ge m\}}$ , and thus properties (R1) and (R2) of Theorem 2.1 imply

\[\rho(\lambda \psi) \le \lambda m + \rho(\lambda\psi 1_{\{\psi \ge m\}}).\]

By (3.1), for m sufficiently large the right-hand side is finite.

Proof of Proposition 3.2. Fix $c \in {\mathbb R}$ , and abbreviate $S = \{\nu \in \mathcal{P}(E) \colon \alpha(\nu) \le c\}$ . Assume $S \neq \emptyset$ . Note that Lemma 3.1 implies $S \subset \mathcal{P}_\psi(E)$ . It suffices to prove that the map $\nu \mapsto \int_Ef\,{\mathrm{d}} \nu$ is weakly continuous on S for every continuous $f \colon E \to {\mathbb R}$ with $|\,f| \le 1 + \psi$ . Note that for $\eta_n,\eta \in \mathcal{P}({\mathbb R})$ with $\eta_n\to\eta$ weakly, we have $\int g\,{\mathrm{d}}\eta_n \to \int g\,{\mathrm{d}}\eta$ for each continuous function g which is uniformly integrable, in the sense that

\[\lim_{m\rightarrow\infty}\sup_n\int_{\{|g| \ge m\}}|g|\,{\mathrm{d}}\eta_n = 0.\]

(See [Reference Dupuis and Ellis22, Theorem A.3.19].) Applying this to the image measures $\{\nu \circ f^{-1} \colon \nu \in S\}$ for f as above, we find that it suffices to prove the uniform integrability condition

\[\lim_{m\rightarrow\infty}\sup_{\nu \in S}\int_{\{\psi \ge m\}}\psi\,{\mathrm{d}} \nu = 0.\]

By definition of $\rho$ , for $m \gt 0$ and $\nu \in S$ ,

(3.2) \begin{equation}\lambda\int_{\{\psi \ge m\}}\psi\,{\mathrm{d}} \nu \le \rho(\lambda\psi 1_{\{\psi \ge m\}}) + \alpha(\nu) \le \rho(\lambda\psi 1_{\{\psi \ge m\}}) + c {.}\end{equation}

Given $\epsilon \gt 0$ , choose $\lambda \gt 0$ large enough that $(\epsilon + \rho(0) + c)/\lambda \le \epsilon$ . Then choose m large enough that $\rho(\lambda\psi 1_{\{\psi \ge m\}}) \le \epsilon + \rho(0)$ , which is possible because of assumption (3.1). It then follows from (3.2) that $\int_{\{\psi \ge m\}}\psi\,{\mathrm{d}} \nu \le \epsilon$ , and the proof is complete.

We refer to (3.1) as the strong Cramér condition; conditions of this type underlie several extensions of the classical form of Sanov’s theorem to stronger topologies. For instance, if $\psi \colon E \rightarrow {\mathbb R}_+$ is continuous, the results of Schied [Reference Schied50] indicate that Sanov’s theorem can be extended to the $\psi$ -weak topology if (and essentially only if) $\log\int_E {\mathrm{e}}^{\lambda\psi}\,{\mathrm{d}} \mu \lt \infty$ for every $\lambda \ge 0$ ; see also [Reference Wang, Wang and Wu53] and [Reference Eichelsbacher and Schmock24].

The form of our strong Cramér condition (3.1) was heavily inspired by the work of Owari [Reference Owari44] on continuous extensions of monotone convex functionals. In several cases of interest (namely, Propositions 4.1 and 5.3 below), it turns out that a converse to Lemma 3.1 is true, that is, the strong Cramér condition (3.1) is equivalent to the statement that $\rho(\lambda\psi) \lt \infty$ for all $\lambda \gt 0$ . In general, however, the strong Cramér condition is strictly stronger. Consider the following simple example, borrowed from [Reference Owari44, Example 3.7]. Let $E = \{0,1,\ldots\}$ be the set of natural numbers, and define $\mu_n \in \mathcal{P}(E)$ by $\mu_1\{0\}=1$ and, for $n \ge 2$ , $\mu_n\{0\} = 1-1/n$ and $\mu_n\{n\} = 1/n$ . Let M denote the closed convex hull of $(\mu_n)$ , so that M is convex and weakly compact. Define $\alpha(\mu) = 0$ for $\mu \in M$ and $\alpha(\mu)=\infty$ otherwise. Then $\alpha$ satisfies our standing assumptions, and $\rho(\,f) = \sup_{\mu \in M}\int f\,{\mathrm{d}} \mu = \sup_n\int f\,{\mathrm{d}} \mu_n$ . Finally, let $\psi(x)=x$ for $x \in E$ . Then $\rho(\lambda\psi) = \lambda \lt \infty$ because $\int \psi\,{\mathrm{d}} \mu_n = 1$ for all $n \ge 2$ , and similarly $\rho(\lambda \psi1_{\{\psi \ge m\}}) = \lambda$ because $\int\psi1_{\{\psi \ge m\}}\,{\mathrm{d}} \mu_n = 1_{\{n \ge m\}}$ for $n \ge 2$ . In particular, $\rho(\lambda\psi) \lt \infty$ for all $\lambda \gt 0$ , but the strong Cramér condition fails.
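To make the counterexample concrete, the following sketch evaluates $\rho$ numerically for this family $(\mu_n)$ : since $f \mapsto \int f\,{\mathrm{d}}\mu$ is linear, its supremum over the closed convex hull M is attained among the generators $\mu_n$ , so $\rho(\,f) = \sup_n \int f\,{\mathrm{d}}\mu_n$ can be approximated by a finite maximum. The code and its truncation level are illustrative only.

```python
# Illustration of the counterexample: mu_1 = delta_0 and, for n >= 2,
# mu_n{0} = 1 - 1/n, mu_n{n} = 1/n, with psi(x) = x.
# rho(f) = sup_n int f dmu_n, since the linear functional f -> int f dmu
# attains its supremum over the closed convex hull M at some mu_n.

def integral(f, n):
    """Compute int f dmu_n."""
    if n == 1:
        return f(0)
    return (1 - 1 / n) * f(0) + (1 / n) * f(n)

def rho(f, n_max=10**4):
    # Finite truncation of sup over n; n_max is an arbitrary cutoff.
    return max(integral(f, n) for n in range(1, n_max + 1))

lam = 2.0
psi = lambda x: x

# rho(lam * psi) = lam, since int psi dmu_n <= 1 for every n.
print(rho(lambda x: lam * psi(x)))

# The tail functionals do not vanish: rho(lam * psi * 1_{psi >= m}) = lam
# for every m, because int psi 1_{psi >= m} dmu_n = 1 whenever n >= m >= 2.
# Hence (3.1) fails even though rho(lam * psi) < infinity.
for m in [10, 100, 1000]:
    print(rho(lambda x: lam * psi(x) * (1 if psi(x) >= m else 0)))
```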

Finally, we remark that it is conceivable that a converse to Proposition 3.2 might hold, that is, the strong Cramér condition (3.1) may be equivalent to the pre-compactness of the sub-level sets of $\alpha$ in $\mathcal{P}_\psi(E)$ . Indeed, the results of Schied [Reference Schied50, Theorem 2] and Owari [Reference Owari44, Theorem 3.8] suggest that this may be the case. But this remains an open problem.

3.2. Implications of $\psi$ -weakly compact sub-level sets

This section contains two results to be used occasionally below. First is a useful lemma that aids in the computation of $\rho(\,f)$ for certain unbounded f in Section 4.

Lemma 3.2. If $f \colon E \to {\mathbb R}$ is upper semicontinuous and bounded from above, then

\begin{equation*}\rho(\,f) = \lim_{m\to\infty}\rho(\,f \vee ({-}m)) = \inf_{m \ge 0}\rho(\,f \vee ({-}m)).\end{equation*}

If $f \colon E \to {\mathbb R}$ is measurable and bounded from below, then

\begin{equation*}\rho(\,f) = \lim_{m\to\infty}\rho(\,f \wedge m) = \sup_{m \ge 0}\rho(\,f \wedge m).\end{equation*}

Lastly, let $\psi \colon E \rightarrow {\mathbb R}_+$ be continuous. If the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ , and if $f \colon E \to {\mathbb R}$ is measurable with $f \ge -c(1+\psi)$ pointwise for some $c \ge 0$ , then

\begin{equation*}\rho(\,f) = \lim_{m\to\infty}\rho(\,f \wedge m) = \sup_{m \ge 0}\rho(\,f \wedge m).\end{equation*}

Proof. The second claim is a special case of the final claim with $\psi \equiv 0$ . To prove the final claim, note first that $\rho(\,f \wedge m)$ is non-decreasing in m (see (R1) of Theorem 2.1). We find

\begin{align*} \rho(\,f) &= \sup_{\nu \in \mathcal{P}_\psi(E)}\bigg(\int_E f\,{\mathrm{d}} \nu - \alpha(\nu)\bigg) = \sup_{m \ge 0}\sup_{\nu \in \mathcal{P}_\psi(E)}\bigg(\int_E f \wedge m\,{\mathrm{d}} \nu - \alpha(\nu)\bigg) = \sup_{m \ge 0}\rho(\,f \wedge m).\end{align*}

Indeed, for each $\nu \in \mathcal{P}_\psi(E)$ , the monotone convergence theorem applies because $f \wedge m$ for $m \ge 0$ are bounded from below by the $\nu$ -integrable function $-c(1+\psi)$ . To prove the first claim, abbreviate $f_m= f \vee ({-}m)$ for $m \ge 0$ . Monotonicity of $\rho$ implies $\inf_{m \ge 0}\rho(\,f_m) \ge \rho(\,f)$ , so we need only prove the reverse inequality. Assume without loss of generality that $\inf_{m \ge 0}\rho(\,f_m) \gt -\infty$ . For each n, we may find some $\nu_n \in \mathcal{P}_\psi(E)$ such that

(3.3) \begin{equation}-\infty < \inf_{m \ge 0}\rho(\,f_m) \le \rho(\,f_n) \le \int_Ef_n\,{\mathrm{d}} \nu_n - \alpha(\nu_n) + 1/n.\end{equation}

This implies $\sup_n\alpha(\nu_n) \lt \infty$ , because f is bounded from above. Pre-compactness of the sub-level sets of $\alpha$ allows us to extract a subsequence ${n_k}$ and $\nu_\infty \in \mathcal{P}(E)$ such that $\nu_{n_k} \rightarrow \nu_\infty$ weakly. By Skorokhod’s representation, we may construct random variables $X_k$ and $X_\infty$ with respective laws $\nu_{n_k}$ and $\nu_\infty$ such that $X_k \rightarrow X_\infty$ a.s. The upper semicontinuity assumption implies $\limsup_{k\rightarrow\infty}f_{n_k}(X_k) \le f(X_\infty)$ almost surely. We then conclude from Fatou’s lemma that

\[\limsup_{k\rightarrow\infty}\int_Ef_{n_k}\,{\mathrm{d}} \nu_{n_k} = \limsup_{k\rightarrow\infty}{\mathbb E}[\,f_{n_k}(X_k)] \le {\mathbb E}[\,f(X_\infty)] = \int_Ef\,{\mathrm{d}} \nu_\infty.\]

Since $\alpha$ is weakly lower semicontinuous, we conclude from (3.3) that

\begin{equation}\inf_{m \ge 0}\rho(\,f_m) \le \int_Ef\,{\mathrm{d}} \nu_\infty - \alpha(\nu_\infty) \le \sup_{\nu \in \mathcal{P}(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \alpha(\nu)\bigg) = \rho(\,f).\end{equation}

4. Non-exponential large deviations

The goal of this section is to prove Theorem 1.2 and its consequences detailed in Section 1.1, but along the way we will explore a particularly interesting class of $(\alpha,\rho)$ pairs.

4.1. Shortfall risk measures

Fix $\mu \in \mathcal{P}(E)$ and a non-decreasing, non-constant, convex function $\ell \colon {\mathbb R} \rightarrow {\mathbb R}_+$ satisfying $\ell(x) \lt 1$ for all $x \lt 0$ . Let $\ell^*(y) = \sup_{x \in {\mathbb R}}(xy - \ell(x))$ denote the convex conjugate, and define $\alpha \colon \mathcal{P}(E) \rightarrow [0,\infty]$ by

(4.1) \begin{align}\alpha(\nu) = \begin{cases}\displaystyle \inf_{t \gt 0}\dfrac{1}{t}\bigg(1 + \int_E\ell^*\bigg(t\dfrac{{\mathrm{d}} \nu}{{\mathrm{d}} \mu}\bigg)\,{\mathrm{d}} \mu\bigg) &\text{if } \nu \ll \mu, \\[7pt]\infty &\text{otherwise}.\end{cases}\end{align}

Note that $\ell^*(x) \ge - \ell(0) \ge -1$ , by assumption and by continuity of $\ell$ , so that $\alpha \ge 0$ . Define $\rho$ as usual by (1.1). It is known [Reference Föllmer and Schied28, Proposition 4.115] that, for $f \in B(E)$ ,

(4.2) \begin{equation}\rho(\,f) = \inf\bigg\{m \in {\mathbb R} \colon \int_E\ell(\,f(x)-m)\mu({\mathrm{d}} x) \le 1\bigg\}.\end{equation}

Refer to the book of Föllmer and Schied [Reference Föllmer and Schied28, Section 4.9] for a thorough study of the properties of $\rho$ . Notably, they show that $\rho$ satisfies all of properties (R1–R4) of Theorem 2.1, and that both dual formulas hold:

\begin{equation*}\rho(\,f) = \sup_{\nu \in \mathcal{P}(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \alpha(\nu)\bigg), \quad\alpha(\nu) = \sup_{f \in B(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \rho(\,f)\bigg).\end{equation*}

If $\ell(x)= {\mathrm{e}}^x$ we recover $\rho(\,f)=\log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu$ and $\alpha(\nu) = H(\nu \mid \mu)$ . If $\ell(x) = [(1+x)^+]^q$ for some $q \ge 1$ , then

(4.3) \begin{equation}\alpha(\nu) = \|{\mathrm{d}} \nu/{\mathrm{d}} \mu\|_{L^p(\mu)}-1\quad \text{for } \nu \ll \mu, \qquad \alpha(\nu) = \infty \quad \text{otherwise},\end{equation}

where $p=q/(q-1)$ , and where of course

\[\|\,f\|_{L^p(\mu)} = \bigg(\int |\,f|^p\,{\mathrm{d}} \mu\bigg)^{1/p}{.}\]

See [Reference Föllmer and Schied28, Example 4.118] or [Reference Lacker39, Section 3.1] for this computation. The $-1$ is a convenient normalization, ensuring that $\alpha(\nu)=0$ if and only if $\nu=\mu$ .
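As a quick numerical sanity check (not part of the argument), on a finite sample space the representation (4.2) can be solved by one-dimensional root-finding in m, and with the exponential loss $\ell(x)={\mathrm{e}}^x$ it indeed reproduces the entropic functional $\log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu$ . The bisection solver, bracket, and data below are illustrative choices.

```python
# Finite-space check that the shortfall risk formula (4.2) with the
# exponential loss l(x) = e^x reduces to rho(f) = log int e^f dmu.
import math

def shortfall_rho(f, mu, loss, lo=-50.0, hi=50.0, tol=1e-12):
    """Solve inf{m : int loss(f - m) dmu <= 1} by bisection.

    f, mu: lists of values f(x_i) and weights mu{x_i} on a finite space.
    Assumes loss is continuous and non-decreasing, and the root lies in
    [lo, hi], so that m -> int loss(f - m) dmu - 1 crosses zero there.
    """
    def excess(m):
        return sum(p * loss(v - m) for v, p in zip(f, mu)) - 1.0

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) <= 0.0:
            hi = mid  # constraint satisfied: a smaller m may still work
        else:
            lo = mid
    return hi

f = [0.3, -1.2, 2.5, 0.0]    # arbitrary illustrative values of f
mu = [0.1, 0.4, 0.2, 0.3]    # arbitrary probability weights

entropic = math.log(sum(p * math.exp(v) for v, p in zip(f, mu)))
bisected = shortfall_rho(f, mu, math.exp)
print(bisected, entropic)  # the two agree to solver tolerance
```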

In the rest of this subsection we work with $\alpha$ and $\rho$ given as in (4.1) and (4.2). The following result shows how the strong Cramér condition (3.1) simplifies in the present context. It is essentially contained in [Reference Owari44, Proposition 7.3], but we include the short proof.

Proposition 4.1. Let $\psi \colon E \rightarrow {\mathbb R}_+$ be measurable. Suppose

\[ \int_E\ell(\lambda\psi(x))\mu({\mathrm{d}} x) < \infty\quad \text{for all } \lambda \gt 0.\]

Then the strong Cramér condition holds,

\[ \lim_{m\rightarrow\infty}\rho(\lambda\psi 1_{\{\psi \ge m\}}) = \rho(0)\quad \text{for each } \lambda \gt 0.\]

In particular, the sub-level sets of $\alpha$ are compact subsets of $\mathcal{P}_\psi(E)$ .

Proof. The final claim is simply an application of Proposition 3.2. Fix $\epsilon \gt 0$ and $\lambda \gt 0$ . Since $\ell$ is non-decreasing, $\ell(\lambda\psi - \epsilon) \le \ell(\lambda\psi)$ is $\mu$ -integrable by assumption, and so the following two limits hold:

\begin{equation*}\lim_{m\rightarrow\infty}\mu(\psi < m) = 1, \quad \lim_{m\rightarrow\infty}\int_{\{\psi \ge m\}}\ell (\lambda\psi(x) - \epsilon )\mu({\mathrm{d}} x) = 0.\end{equation*}

Since $\ell({-}\epsilon) \lt 1$ , it follows that, for sufficiently large m,

\begin{align*}1 &\ge \ell({-}\epsilon)\mu(\psi < m) + \int_{\{\psi \ge m\}}\ell (\lambda\psi(x) - \epsilon )\mu({\mathrm{d}} x) \\* &= \int_E\ell(\lambda\psi(x)1_{\{\psi \ge m\}}(x) - \epsilon)\mu({\mathrm{d}} x).\end{align*}

Next, the second assertion of Lemma 3.2 implies $\rho(\lambda\psi 1_{\{\psi \ge m\}}) = \sup_{n\ge 0}\rho(n \wedge (\lambda\psi 1_{\{\psi \ge m\}}))$ . For each n, we use the identity (4.2), which is valid for bounded f, to get, for sufficiently large m,

\begin{align}\rho(\lambda\psi 1_{\{\psi \ge m\}}) &= \sup_{n\ge 0}\rho(n \wedge (\lambda\psi 1_{\{\psi \ge m\}})) \nonumber\\ &= \sup_{n \ge 0}\inf\bigg\{c \in {\mathbb R} \colon \int_E\ell (n \wedge (\lambda\psi 1_{\{\psi \ge m\}}) - c )\,{\mathrm{d}} \mu \le 1 \bigg\} \nonumber\displaybreak\\ &\le \inf\bigg\{c \in {\mathbb R} \colon \int_E\ell ( \lambda\psi 1_{\{\psi \ge m\}} - c )\,{\mathrm{d}} \mu \le 1 \bigg\} \nonumber\\ &\le \epsilon.\end{align}

Note that (4.2) is only valid, a priori, for bounded f, although the expression on the right-hand side certainly makes sense for unbounded f. The next results provide some useful cases for which the identity (4.2) carries over to unbounded functions, and these will be needed in the proof of Corollary 1.2. In the following, define $\ell(\pm \infty) = \lim_{x \rightarrow \pm\infty}\ell(x)$ .

Proposition 4.2. Let $\psi \colon E \rightarrow {\mathbb R}_+$ be continuous, and suppose $\int_E\ell(\lambda\psi(x))\mu({\mathrm{d}} x) \lt \infty$ for all $\lambda \gt 0$ . Suppose $f \colon E \rightarrow {\mathbb R}$ is continuous with $|\,f| \le c(1+\psi)$ pointwise for some $c \ge 0$ . Then the identity (4.2) holds.

Proof. Let H(f) denote the right-hand side of (4.2), well-defined for any measurable function $f \colon E \to {\mathbb R}$ . We must show $\rho(\,f)=H(\,f)$ for f as in the statement of the proposition. As was mentioned above, it is known from [Reference Föllmer and Schied28, Proposition 4.115] that $\rho(\,f)=H(\,f)$ for bounded f.

Step 1. Assume first that f is continuous and bounded from above, with $|\,f| \le c(1+\psi)$ . Let $f_n = f \vee ({-}n)$ for $n \ge 0$ . Since $f_n$ is bounded for each n, we have $\rho(\,f_n)=H(\,f_n)$ . The first assertion of Lemma 3.2 then implies $\rho(\,f) = \lim_{n\to\infty}\rho(\,f_n) = \lim_{n\to\infty}H(\,f_n)$ . It remains to show $H(\,f_n)\to H(\,f)$ . Clearly $H(\,f_n) \ge H(\,f_{n+1}) \ge H(\,f)$ for each n since $f_n \ge f_{n+1}$ pointwise and $\ell$ is non-decreasing, so the sequence $H(\,f_n)$ has a limit. As $\ell$ is continuous and strictly increasing in a neighborhood of the origin, note that H(f) is the unique solution $c \in {\mathbb R}$ of the equation

(4.4) \begin{equation}\int_E\ell(\,f(x)-c)\mu({\mathrm{d}} x) = 1.\end{equation}

Similarly, $H(\,f_n)$ uniquely solves

\[\int_E\ell(\,f_n(x)-H(\,f_n))\mu({\mathrm{d}} x) = 1.\]

Let $c = \lim_{n\to\infty}H(\,f_n)$ , and note that the integrands $\ell(\,f_n(x)-H(\,f_n))$ are uniformly bounded and converge pointwise to $\ell(\,f(x)-c)$ . Passing to the limit using dominated convergence shows that c solves the equation (4.4), which implies $c=H(\,f)$ .

Step 2. We now turn to general continuous f satisfying $|\,f| \le c(1+\psi)$ . Define $f_n = f \wedge n$ for $n \ge 0$ , so that $f_n$ is bounded from above. By Step 1, $\rho(\,f_n)=H(\,f_n)$ for each n. By Proposition 4.1, the sub-level sets of $\alpha$ are $\psi$ -weakly compact, and the third assertion of Lemma 3.2 yields $\rho(\,f_n) \to \rho(\,f)$ . It remains to show $H(\,f_n)\to H(\,f)$ . To see this, note first that $H(\,f_n) \le H(\,f_{n+1}) \le H(\,f)$ for each n since $f_n \le f_{n+1}$ pointwise. Let $\epsilon \gt 0$ and $c = H(\,f)-\epsilon$ , and note that the definition of H and monotonicity of $\ell$ imply $\int_E\ell(\,f(x) - c)\mu({\mathrm{d}} x) \gt 1$ . By monotone convergence, there exists n such that $\int_E\ell(\,f_n(x) - c)\mu({\mathrm{d}} x) \gt 1$ . The definition of H now implies $H(\,f_n) \gt c = H(\,f) - \epsilon$ . As $\epsilon \gt 0$ was arbitrary, we conclude $H(\,f_n)\to H(\,f)$ .

We record here for later use a simple but useful lemma.

Lemma 4.1. Define $\alpha$ as in (4.3). Let $\psi \colon E \rightarrow {\mathbb R}_+$ be continuous, and suppose $\int_E \psi^q\,{\mathrm{d}} \mu \lt \infty$ . Suppose $A \subset \mathcal{P}_\psi(E)$ is closed (in the $\psi$ -weak topology), and $\mu \notin A$ . Then $\inf_{\nu \in A}\alpha(\nu) \gt 0$ .

Proof. Recall that $\alpha$ as in (4.3) is the special case of (4.1) corresponding to $\ell(x) = [(1+x)^+]^q$ . Thus Proposition 4.1 and the assumption $\int_E \psi^q\,{\mathrm{d}} \mu \lt \infty$ ensure that the sub-level sets of $\alpha$ are $\psi$ -weakly compact. If $\inf_{\nu \in A}\alpha(\nu) =0$ , we may find $\nu_n \in A$ such that $\alpha(\nu_n) \rightarrow 0$ . The sequence $(\nu_n)$ admits a $\psi$ -weak limit point $\nu^*$ , which must of course belong to the $\psi$ -weakly closed set A. Lower semicontinuity and non-negativity of $\alpha$ imply $\alpha(\nu^*) = 0$ . This implies $\nu^*=\mu$ , as $t \mapsto t^p$ is strictly convex, and this contradicts the assumption that $\mu \notin A$ .

4.2. Proofs of Theorem 1.2, Corollary 1.1, and Corollary 1.2

With these generalities in hand, we now turn toward the proof of Theorem 1.2. The idea is to apply Theorem 2.3 with $\alpha$ defined as in (4.3). The following estimate is crucial.

Lemma 4.2. Let $q \in [1,\infty]$ , and let $p=q/(q-1)$ denote the conjugate exponent. Let $\alpha$ be as in (4.3). Then, for each $n \ge 1$ and $\nu \in \mathcal{P}(E^n)$ with $\nu \ll \mu^n$ ,

(4.5) \begin{equation}\alpha_n(\nu) \le n^{1/q}\|{\mathrm{d}} \nu/{\mathrm{d}} \mu^n\|_{L^p(\mu^n)}.\end{equation}

Proof. The case $p=\infty$ and $q=1$ follows by sending $p \rightarrow\infty$ in (4.5), so we prove only the case $p \lt \infty$ . As we will be working with conditional expectations, it is convenient to work with a more probabilistic notation. Fix n, and endow $\Omega = E^n$ with its Borel $\sigma$ -field as well as the probability $P = \mu^n$ . Let $X_i \colon E^n \rightarrow E$ denote the natural projections, and let $\mathcal{F}_k = \sigma(X_1,\ldots,X_k)$ denote the natural filtration, for $k=1,\ldots,n$ , with $\mathcal{F}_0 \,:\!= \{\emptyset,\Omega\}$ . For $\nu \in \mathcal{P}(E^n)$ and $k=1,\ldots,n$ , let $\nu_k$ denote a version of the regular conditional law of $X_k$ given $\mathcal{F}_{k-1}$ under $\nu$ , or symbolically $\nu_k \,:\!= \nu(X_k \in \cdot \mid \mathcal{F}_{k-1})$ . Let ${\mathbb E}^\nu$ denote integration with respect to $\nu$ . Since $P(X_k \in \cdot \mid \mathcal{F}_{k-1}) = \mu$ a.s., if $\nu \ll P$ then

\[\dfrac{{\mathrm{d}} \nu_k}{{\mathrm{d}} \mu} = \dfrac{{\mathbb E}^P[{\mathrm{d}} \nu/{\mathrm{d}} P \mid \mathcal{F}_k]}{{\mathbb E}^P[{\mathrm{d}} \nu/{\mathrm{d}} P \mid \mathcal{F}_{k-1}]} =\!:\, \dfrac{M_k}{M_{k-1}} \ \ \text{a.s.}, \quad\text{where } \dfrac{0}{0} \,:\!= 0.\]

Therefore

\[\alpha(\nu_k) = {\mathbb E}^P\bigg[\bigg(\dfrac{M_k}{M_{k-1}}\bigg)^p\mid \mathcal{F}_{k-1}\bigg]^{1/p}-1.\]

Note that $(M_k)_{k=0}^n$ is a non-negative martingale, with $M_0 = 1$ and $M_n = {\mathrm{d}} \nu/{\mathrm{d}} P$ . Then

\begin{align*} \alpha_n(\nu) &= {\mathbb E}^\nu\Bigg[\sum_{k=1}^n\alpha(\nu_k)\Bigg] \\* &= {\mathbb E}^P\Bigg[M_n\sum_{k=1}^n\bigg({\mathbb E}^P\bigg[\bigg(\dfrac{M_k}{M_{k-1}}\bigg)^p\mid \mathcal{F}_{k-1}\bigg]^{1/p}-1\bigg)\Bigg] \\* &= {\mathbb E}^P\Bigg[\sum_{k=1}^n ({\mathbb E}^P[ M_k^p \mid \mathcal{F}_{k-1}]^{1/p}-M_{k-1} )\Bigg].\end{align*}

Subadditivity of $x \mapsto x^{1/p}$ implies

\[({\mathbb E}^P[M_k^p \mid \mathcal{F}_{k-1}])^{1/p} \le ({\mathbb E}^P[M_k^p - M_{k-1}^p \mid \mathcal{F}_{k-1}])^{1/p} + M_{k-1},\]

where the right-hand side is well-defined because

\[{\mathbb E}^P[M_k^p \mid \mathcal{F}_{k-1}] \ge{\mathbb E}^P[M_k \mid \mathcal{F}_{k-1}]^p = M_{k-1}^p.\]

Concavity of $x \mapsto x^{1/p}$ and Jensen’s inequality yield

\begin{align} \alpha_n(\nu) & \le {\mathbb E}^P\Bigg[\sum_{k=1}^n ({\mathbb E}^P[M_k^p - M_{k-1}^p \mid \mathcal{F}_{k-1}])^{1/p}\Bigg] \nonumber\\ &\le n^{1-{1}/{p}}\Bigg({\mathbb E}^P\Bigg[\sum_{k=1}^n{\mathbb E}^P[M_k^p - M_{k-1}^p \mid \mathcal{F}_{k-1}]\Bigg]\Bigg)^{1/p} \nonumber\\ &= n^{1/q}({\mathbb E}^P[M_n^p - M_0^p])^{1/p} \nonumber\\ &\le n^{1/q}({\mathbb E}^P[M_n^p])^{1/p}.\end{align}
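The estimate (4.5) can be checked numerically in a toy case. Using the chain-rule formula $\alpha_n(\nu) = {\mathbb E}^\nu[\sum_{k=1}^n\alpha(\nu_k)]$ from the proof, the sketch below takes $n=2$ , $E=\{0,1\}$ , and arbitrary illustrative choices of $\mu$ , $\nu$ , and q, and verifies the inequality $\alpha_2(\nu) \le 2^{1/q}\|{\mathrm{d}} \nu/{\mathrm{d}} \mu^2\|_{L^p(\mu^2)}$ .

```python
# Check of (4.5) for n = 2 on E = {0, 1}: alpha_2(nu) <= 2^{1/q} ||dnu/dmu^2||_p,
# with alpha(nu) = ||dnu/dmu||_{L^p(mu)} - 1 as in (4.3). Data are arbitrary.
q = 2.0
p = q / (q - 1.0)
mu = [0.5, 0.5]                                        # reference measure on {0, 1}
nu = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def alpha(nu_vec, mu_vec):
    """||dnu/dmu||_{L^p(mu)} - 1 on a finite space (assumes mu_vec > 0)."""
    return sum(m * (n / m) ** p for n, m in zip(nu_vec, mu_vec)) ** (1 / p) - 1

# Chain rule: law of X_1 and conditional laws of X_2 given X_1 under nu.
nu1 = [nu[(0, 0)] + nu[(0, 1)], nu[(1, 0)] + nu[(1, 1)]]
cond = [[nu[(x1, x2)] / nu1[x1] for x2 in (0, 1)] for x1 in (0, 1)]
alpha_2 = alpha(nu1, mu) + sum(nu1[x1] * alpha(cond[x1], mu) for x1 in (0, 1))

# Right-hand side of (4.5): n^{1/q} times the L^p norm of dnu/dmu^2.
mu2 = {(a, b): mu[a] * mu[b] for a in (0, 1) for b in (0, 1)}
lp = sum(mu2[k] * (nu[k] / mu2[k]) ** p for k in nu) ** (1 / p)

print(alpha_2, 2 ** (1 / q) * lp)  # first value is the smaller one
```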

Proof of Theorem 1.2. Again, let $q \in (1,\infty)$ and $p=q/(q-1)$ , and let $\alpha$ be as in (4.3), noting that it corresponds to (4.1) with $\ell(x) = [(1+x)^+]^q$ . Then Proposition 4.1 and the assumption that $\int\psi^q\,{\mathrm{d}} \mu \lt \infty$ imply that the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ . Hence, Theorem 2.3 applies to the $\psi$ -weakly upper semicontinuous function $F \colon \mathcal{P}_\psi(E) \rightarrow [-\infty,0]$ defined by $F(\nu) = 0$ if $\nu \in A$ and $F(\nu) = -\infty$ otherwise. This yields

(4.6) \begin{equation}\limsup_{n\rightarrow\infty}\dfrac{1}{n}\rho_n(nF \circ L_n) \le -\inf_{\nu \in A}\alpha(\nu).\end{equation}

Now use Lemma 4.2, noting that $({1}/{n})n^{1/q} = n^{-1/p}$ , to get

\begin{align*}\dfrac{1}{n}\rho_n(nF\circ L_n) &= \sup_{\nu \in \mathcal{P}(E^n)}\bigg(\int_{E^n} F\circ L_n\,{\mathrm{d}} \nu- \dfrac{1}{n}\alpha_n(\nu)\bigg) \\* &= -\inf\bigg\{\dfrac{1}{n}\alpha_n(\nu) \colon \nu \in \mathcal{P}(E^n), \ \nu(L_n \in A)=1\bigg\} \\* &\ge -\inf\{n^{-1/p}\|{\mathrm{d}} \nu/{\mathrm{d}} \mu^n\|_{L^p(\mu^n)} \colon \nu \in \mathcal{P}(E^n), \ \nu \ll \mu^n, \ \nu(L_n \in A)=1\}.\end{align*}

Set $B_n = \{x \in E^n \colon L_n(x) \in A\}$ , and define $\nu \ll \mu^n$ by ${\mathrm{d}} \nu/{\mathrm{d}} \mu^n = 1_{B_n}/\mu^n(B_n)$ . A quick computation yields

\[\|{\mathrm{d}} \nu/{\mathrm{d}} \mu^n\|_{L^p(\mu^n)} = \mu^n(B_n)^{(1-p)/p} = \mu^n(B_n)^{-1/q}.\]

Thus

\[\dfrac{1}{n}\rho_n(nF \circ L_n) \ge - (n^{1/p}\mu^n(B_n)^{1/q})^{-1}.\]

Combine this with (4.6) to get

\begin{equation*}\limsup_{n\rightarrow\infty}-(n^{1/p}\mu^n(L_n \in A)^{1/q})^{-1} \le -\inf_{\nu \in A}\alpha(\nu).\end{equation*}

Recalling the definition of $\alpha$ from (4.3) and noting that $q/p = q-1$ , this inequality can be rewritten as the desired result.
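The $L^p$ -norm computation used in the proof — conditioning $\mu^n$ on $B_n$ yields a density with $\|{\mathrm{d}} \nu/{\mathrm{d}} \mu^n\|_{L^p(\mu^n)} = \mu^n(B_n)^{-1/q}$ — is easy to verify numerically on a toy finite space; the weights and event below are arbitrary illustrative data.

```python
# Check: if dnu/dmu = 1_B / mu(B), then ||dnu/dmu||_{L^p(mu)} = mu(B)^{-1/q},
# where 1/p + 1/q = 1. Finite probability space with arbitrary data.
q = 3.0
p = q / (q - 1.0)

mu = [0.2, 0.5, 0.3]             # a probability vector
B = [True, False, True]          # an event with mu(B) = 0.5

mu_B = sum(m for m, b in zip(mu, B) if b)
density = [(1.0 / mu_B if b else 0.0) for b in B]   # dnu/dmu
lp_norm = sum(m * d ** p for m, d in zip(mu, density)) ** (1.0 / p)

print(lp_norm, mu_B ** (-1.0 / q))  # the two coincide
```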

Proof of Corollary 1.1. Define a continuous function $\psi \colon E \to {\mathbb R}_+$ by $\psi(x) = d^r(x,x_0)$ . Note that $\mathcal{W}_r$ then metrizes $\mathcal{P}_\psi(E)$ (see [Reference Villani52, Theorem 7.12]). Hence, the set

\[A = \{\nu \in \mathcal{P}_\psi(E) \colon \mathcal{W}_r(\nu,\mu) \ge a\}\]

is closed in $\mathcal{P}_\psi(E)$ . Because $\int \psi^{q/r}\,{\mathrm{d}} \mu = \int d^q(x,x_0)\,\mu({\mathrm{d}} x) \lt \infty$ by assumption, we may apply Theorem 1.2 with $q/r$ in place of q to get

\begin{equation*}\limsup_{n\to\infty} n^{{q}/{r}-1}\mu^n(\mathcal{W}_r(L_n,\mu) \ge a) \le \bigg(\inf_{\nu \in A} \alpha(\nu)\bigg)^{-q/r},\end{equation*}

where $\alpha$ is defined as in (4.3) with $p=(q/r)/(q/r-1)$ . It remains to show that $\inf_{\nu \in A} \alpha(\nu) \gt 0$ . But this follows from Lemma 4.1, since A is closed, $\mu \notin A$ , and $\int \psi^{q/r}\,{\mathrm{d}} \mu \lt \infty$ .

Proof of Corollary 1.2. Again, let $\alpha$ be as in (4.3), and note that it corresponds to the shortfall risk measure (4.2) with $\ell(x) = [(1+x)^+]^q$ . Let $\psi(x) = \|x\|$ , and consider the $\mathcal{P}_\psi(E)$ -closed set

\[B = \bigg\{\mu \in \mathcal{P}_\psi(E) \colon \int_Ez\,\mu({\mathrm{d}} z) \in A\bigg\},\]

where the integral is defined in the Bochner sense. Proposition 4.1 and the assumption that $\int\psi^q\,{\mathrm{d}} \mu = {\mathbb E}[\|X_1\|^q] \lt \infty$ imply that the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ . We may then apply Theorem 1.2 to get

\[\limsup_{n\rightarrow\infty}n^{q-1}{\mathbb P}\Bigg(\dfrac{1}{n}\sum_{i=1}^nX_i \in A\Bigg) \le \bigg(\inf_{\nu \in B}\alpha(\nu)\bigg)^{-q},\]

where again $\alpha$ is as in (4.3). It remains to simplify the right-hand side. Proposition 2.1 yields

\[\sup_{x^* \in E^*}(\langle x^*,x\rangle - \rho(x^*)) = \inf\bigg\{\alpha(\nu) \colon \nu \in \mathcal{P}_\psi(E), \ \int_Ez\,\nu({\mathrm{d}} z)=x\bigg\}\quad \text{for } x \in E.\]

Infimize over $x \in A$ on both sides to get

(4.7) \begin{equation}\inf_{\nu \in B}\alpha(\nu) = \inf_{x \in A}\sup_{x^* \in E^*}(\langle x^*,x\rangle - \rho(x^*)).\end{equation}

According to Proposition 4.2, for $x^* \in E^*$ we have

\begin{equation*}\rho(x^*) = \inf\bigg\{m \in {\mathbb R} \colon \int_E [(1+x^*(x)-m)^+]^q\mu({\mathrm{d}} x) \le 1\bigg\} = \Lambda(x^*),\end{equation*}

where the latter equality is simply the definition of $\Lambda$ given in the statement of Corollary 1.2. Hence, the identity (4.7) becomes $\inf_{\nu \in B}\alpha(\nu) = \inf_{x \in A}\Lambda^*(x)$ , and the proof is complete.

4.3. Stochastic optimization with heavy tails

This section elaborates on the application discussed in Section 1.1.3, concerning the convergence of Monte Carlo estimates for stochastic optimization problems. We use the notation of Section 1.1.3, and we begin with the proof of Theorem 1.3.

Proof of Theorem 1.3. Let $A = \{\nu \in \mathcal{P}_\psi(E) \colon |V(\nu)-V(\mu)| \ge \epsilon\}$ . The map

\[\mathcal{X} \times \mathcal{P}_\psi(E) \ni (x,\nu) \mapsto \int_Eh(x,w)\nu({\mathrm{d}} w)\]

is jointly continuous. By Berge’s theorem [Reference Aliprantis and Border3, Theorem 17.31], V is continuous on $\mathcal{P}_\psi(E)$ , so A is closed. Theorem 1.2 implies

\begin{equation*}\limsup_{n\rightarrow\infty}n^{q-1}\mu^n(|V(L_n)-V(\mu)| \ge \epsilon) = \limsup_{n\rightarrow\infty}n^{q-1}\mu^n(L_n \in A) \le \bigg(\inf_{\nu \in A}\alpha(\nu)\bigg)^{-q}.\end{equation*}

Note that $q/p=q-1$ , and finally use Lemma 4.1 to conclude $\inf_{\nu \in A}\alpha(\nu) \gt 0$ .

Remark 4.1. The joint continuity and compactness assumptions in Theorem 1.3 could likely be weakened, but we focus on the more novel integrability issues to ease the exposition.

Now that we have shown that the optimal value converges, we turn to the convergence of the optimizers themselves.

Theorem 4.1. Grant the assumptions of Theorem 1.3. Let $\hat{x} \colon \mathcal{P}_\psi(E) \rightarrow \mathcal{X}$ be any measurable function satisfying

\[\hat{x}(\nu) \in \arg\min_{x \in \mathcal{X}}\int_Eh(x,w)\nu({\mathrm{d}} w)\quad \text{for each } \nu.\]

Suppose there exist a measurable function $\varphi \colon {\mathbb R} \rightarrow {\mathbb R}$ and a compatible metric d on $\mathcal{X}$ such that

\[\varphi(d(\hat{x}(\mu),x)) \le \int_Eh(x,w)\mu({\mathrm{d}} w) - \int_Eh(\hat{x}(\mu),w)\mu({\mathrm{d}} w)\quad \text{for all } x \in \mathcal{X}.\]

Then, for any $\epsilon \gt 0$ ,

\[\limsup_{n\rightarrow\infty}n^{q-1}\mu^n(\varphi(d(\hat{x}(\mu),\hat{x}(L_n))) \ge \epsilon) < \infty.\]

In particular, if $\varphi$ is strictly increasing with $\varphi(0)=0$ , then for any $\epsilon \gt 0$ ,

\[\limsup_{n\rightarrow\infty}n^{q-1}\mu^n(d(\hat{x}(\mu),\hat{x}(L_n)) \ge \epsilon) < \infty.\]

Such a function $\hat{x}$ exists because $(x,\nu) \mapsto \int_Eh(x,w)\nu({\mathrm{d}} w)$ is measurable in $\nu$ and continuous in x; see e.g. [Reference Aliprantis and Border3, Theorem 18.19].

Proof. Note that for $\epsilon \gt 0$ , on the event $\{\varphi(d(\hat{x}(\mu),\hat{x}(L_n))) \ge \epsilon\}$ we have

\begin{align*} \epsilon &\le \varphi(d(\hat{x}(\mu),\hat{x}(L_n))) \\* & \le \int_Eh(\hat{x}(L_n),w)\mu({\mathrm{d}} w) - \int_Eh(\hat{x}(\mu),w)\mu({\mathrm{d}} w) \\* & \le |V(L_n)-V(\mu)| + \sup_{x \in \mathcal{X}}\int_Eh(x,w)[\mu-L_n]({\mathrm{d}} w).\end{align*}

The first term converges at the right rate, thanks to Theorem 1.3, and it remains to check that

\[\limsup_{n\rightarrow\infty}n^{q-1}\mu^n\Bigg(\sup_{x \in \mathcal{X}}\int_Eh(x,w)[\mu-L_n]({\mathrm{d}} w) \ge \epsilon\Bigg) < \infty.\]

The map $(x,\nu) \mapsto \int_Eh(x,w)\nu({\mathrm{d}} w)$ is continuous on $\mathcal{X} \times \mathcal{P}_\psi(E)$ , so the map

\[\mathcal{P}_\psi(E) \ni \nu \mapsto \sup_{x \in \mathcal{X}}\int_Eh(x,w)[\mu-\nu]({\mathrm{d}} w)\]

is continuous by Berge’s theorem [Reference Aliprantis and Border3, Theorem 17.31]. Hence, the set

\[B \,:\!= \bigg\{\nu \in \mathcal{P}_\psi(E) \colon \sup_{x \in \mathcal{X}}\int_Eh(x,w)[\mu-\nu]({\mathrm{d}} w) \ge \epsilon\bigg\}\]

is closed in $\mathcal{P}_\psi(E)$ . Theorem 1.2 then implies

\begin{equation*}\limsup_{n\rightarrow\infty}n^{q-1}\mu^n\Bigg(\sup_{x \in \mathcal{X}}\int_Eh(x,w)[\mu-L_n]({\mathrm{d}} w) \ge \epsilon\Bigg) \le \bigg(\inf_{\nu \in B}\alpha(\nu)\bigg)^{-q},\end{equation*}

where $\alpha$ is defined as in (4.3). Finally, Lemma 4.1 implies that $\inf_{\nu \in B}\alpha(\nu) \gt 0$ .

Under the assumption $\int_E\psi^q\,{\mathrm{d}} \mu \lt \infty$ , we see that the value $V(L_n)$ always converges to $V(\mu)$ with the polynomial rate $n^{1-q}$ . To see when Theorem 4.1 applies, notice that in many situations $\mathcal{X}$ is a convex subset of a normed vector space, and we have uniform convexity in the following form: there exists a strictly increasing function $\varphi$ such that $\varphi(0)=0$ and, for all $t \in (0,1)$ and $x,y \in \mathcal{X}$ ,

\begin{align*}&\int_Eh(tx + (1-t)y,w)\mu({\mathrm{d}} w) \\* & \quad\, \le t\int_Eh(x,w)\mu({\mathrm{d}} w) + (1-t)\int_Eh(y,w)\mu({\mathrm{d}} w) - t(1-t)\varphi(\|x-y\|).\end{align*}

See [Reference Kaniovski, King and Wets37, pages 202–203] for more on this.
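As an illustration, for the quadratic loss $h(x,w)=(x-w)^2$ with $\mathcal{X} \subset {\mathbb R}$ the displayed inequality holds with $\varphi(t)=t^2$ , in fact with equality, since $t\,h(x,w)+(1-t)h(y,w)-h(tx+(1-t)y,w) = t(1-t)(x-y)^2$ pointwise in w. The following snippet checks this identity with an empirical measure standing in for $\mu$ ; the sample and the points x, y, t are arbitrary.

```python
# For h(x, w) = (x - w)^2, uniform convexity holds with phi(t) = t^2, with
# equality: t*h(x,.) + (1-t)*h(y,.) - h(tx+(1-t)y,.) = t(1-t)(x-y)^2.
import random

random.seed(0)
w_samples = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for mu

def cost(x):
    """int h(x, w) mu(dw) for the empirical measure above."""
    return sum((x - w) ** 2 for w in w_samples) / len(w_samples)

x, y, t = 1.3, -0.7, 0.25
gap = t * cost(x) + (1 - t) * cost(y) - cost(t * x + (1 - t) * y)
print(gap, t * (1 - t) * (x - y) ** 2)  # equal up to round-off
```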

5. Uniform large deviations and martingales

This section returns to the example of Section 1.2. We first record a useful abstract theorem of Föllmer and Schied [Reference Föllmer and Schied28], which will allow us to verify tightness of the sub-level sets of $\alpha$ by checking a property of $\rho$ , before knowing that $\alpha$ is convex.

Proposition 5.1. (Proposition 4.30 of [Reference Föllmer and Schied28].) Suppose a functional $\rho \colon B(E) \rightarrow {\mathbb R}$ admits the representation

\[\rho(\,f) = \sup_{\nu \in \mathcal{P}(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \alpha(\nu)\bigg)\quad \text{for } f \in C_b(E),\]

for some functional $\alpha \colon \mathcal{P}(E) \rightarrow ({-}\infty,\infty]$ . Suppose also that there is a sequence $(K_n)$ of compact subsets of E such that

\[\lim_{n\rightarrow\infty}\rho(\lambda 1_{K_n}) = \rho(\lambda)\quad \text{for all } \lambda \ge 1.\]

Then $\alpha$ has tight sub-level sets.

Fix a convex weakly compact family of probability measures $M \subset \mathcal{P}(E)$ . Define

(5.1) \begin{equation}\alpha(\nu) = \inf_{\mu \in M}H(\nu \mid \mu),\end{equation}

where the relative entropy was defined in (1.2). In light of the classical formula [Reference Dupuis and Ellis22, Proposition 1.4.2]

\[\sup_{\nu \in \mathcal{P}(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - H(\nu \mid \mu)\bigg) = \log \int_E {\mathrm{e}}^{\,f} \,{\mathrm{d}} \mu,\]

the $\rho$ corresponding to the functional $\alpha$ given by (5.1) is then

(5.2) \begin{align} \rho(\,f) &\,:\!= \sup_{\nu \in \mathcal{P}(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \alpha(\nu)\bigg) \notag \\* &= \sup_{\nu \in \mathcal{P}(E)}\sup_{\mu \in M}\bigg(\int_Ef\,{\mathrm{d}} \nu - H(\nu \mid \mu)\bigg) \notag \\* &= \sup_{\mu \in M}\log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu.\end{align}

Let us also take note of the famous Donsker–Varadhan formula [Reference Dupuis and Ellis22, Lemma 1.4.3]

(5.3) \begin{equation}H(\nu \mid \mu) = \sup_{f \in C_b(E)}\bigg( \int_Ef\,{\mathrm{d}} \nu - \log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu\bigg).\end{equation}
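On a finite state space, both sides of the Donsker–Varadhan formula (5.3) can be computed directly: there $H(\nu \mid \mu) = \sum_i \nu_i\log(\nu_i/\mu_i)$ , and the supremum is attained at $f = \log({\mathrm{d}}\nu/{\mathrm{d}}\mu)$ . The following sketch checks this on a two-point space (the measures are arbitrary strictly positive illustrative choices):

```python
import math, itertools

# Two-point space E = {0, 1}; nu and mu are arbitrary strictly positive choices.
nu = [0.3, 0.7]
mu = [0.6, 0.4]

# Left-hand side of (5.3): the relative entropy H(nu | mu).
H = sum(n * math.log(n / m) for n, m in zip(nu, mu))

# Right-hand side: sup over f of  int f dnu - log int e^f dmu,
# approximated by a grid search over functions f = (f0, f1).
grid = [i / 25 for i in range(-100, 101)]  # f-values in [-4, 4]
best = max(
    f0 * nu[0] + f1 * nu[1]
    - math.log(math.exp(f0) * mu[0] + math.exp(f1) * mu[1])
    for f0, f1 in itertools.product(grid, repeat=2)
)
print(H, best)  # the grid supremum approaches H from below
```

Every candidate f gives a lower bound on $H(\nu \mid \mu)$ , so the grid search approaches the relative entropy from below, with the optimizer near $f=\log(\nu/\mu)$ .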

Lemma 5.1. The functional $\alpha$ defined in (5.1) satisfies the standing assumptions. That is, it is convex and bounded from below, and its sub-level sets are weakly compact.

Proof. Note first that $-\log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu$ is convex and weakly continuous in $\mu$ as well as concave and sup-norm continuous in f. Thus, using (5.3) and Sion’s minimax theorem [Reference Sion51], we find

\begin{align*}\alpha(\nu) &= \inf_{\mu \in M}\sup_{f \in C_b(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu\bigg) \\* &= \sup_{f \in C_b(E)}\inf_{\mu \in M}\bigg(\int_Ef\,{\mathrm{d}} \nu - \log\int_E {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu\bigg) \\* &= \sup_{f \in C_b(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \rho(\,f)\bigg).\end{align*}

This shows that $\alpha$ is convex and lower semicontinuous. It remains to prove that $\alpha$ has tight sub-level sets, which will follow from Proposition 5.1 once we check the second assumption therein. By Prokhorov’s theorem, there exist compact sets $K_1 \subset K_2 \subset \cdots$ such that $\sup_{\mu \in M}\mu(K_n^c) \le 1/n$ . Then, for $\lambda \ge 0$ , using the formula for $\rho$ of (5.2),

\begin{align*}\lambda \ge \rho(\lambda 1_{K_n}) &= \sup_{ \mu \in M}\log\int_E\exp(\lambda 1_{K_n})\,{\mathrm{d}} \mu \\* &= \sup_{ \mu \in M}\log [({\mathrm{e}}^\lambda - 1)\mu(K_n) + 1] \\* &\ge \log [({\mathrm{e}}^\lambda - 1)(1-1/n) + 1].\end{align*}

As $n\rightarrow\infty$ , the right-hand side converges to $\lambda$ , which shows $\rho(\lambda 1_{K_n})\rightarrow \lambda = \rho(\lambda)$ .
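The convergence at the end of this proof is concrete enough to evaluate directly; for an arbitrary choice of $\lambda$ , the lower bound $\log[({\mathrm{e}}^\lambda - 1)(1-1/n) + 1]$ increases to $\lambda$ :

```python
import math

lam = 3.0  # arbitrary illustrative choice of lambda
vals = [math.log((math.exp(lam) - 1) * (1 - 1 / n) + 1) for n in (10, 100, 1000)]
print(vals)  # increases toward lam
```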

To compute $\rho_n$ , recall that for $M \subset \mathcal{P}(E)$ we define $M_n$ as the set of $\mu \in \mathcal{P}(E^n)$ satisfying $\mu_{0,1} \in M$ and $\mu_{k-1,k}(x_1,\ldots,x_{k-1}) \in M$ for all $k=2,\ldots,n$ and $x_1,\ldots,x_{n-1} \in E$ . (Recall that the conditional measures $\mu_{k-1,k}$ were defined in the Introduction.) Notice that $M_1=M$ .

Proposition 5.2. For each $n \ge 1$ , $\alpha_n(\nu) = \inf_{\mu \in M_n}H(\nu \mid \mu)$ . Moreover, for each measurable $f \colon E^n \rightarrow {\mathbb R} \cup \{-\infty\}$ satisfying $\int_{E^n} {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu \lt \infty$ for every $\mu \in M_n$ ,

\begin{equation*} \rho_n(\,f) = \sup_{\mu \in M_n}\log\int_{E^n} {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu.\end{equation*}

Proof. Given the first claim, the second follows from the well-known duality

\[\sup_{\nu \in \mathcal{P}(E^n)}\bigg(\int_{E^n}f\,{\mathrm{d}} \nu - H(\nu \mid \mu)\bigg) = \log\int_{E^n} {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu,\]

which holds for $\mu \in \mathcal{P}(E^n)$ as long as ${\mathrm{e}}^{\,f}$ is $\mu$ -integrable (see e.g. the proof of [Reference Dupuis and Ellis22, Proposition 1.4.2]). Indeed, this implies

\begin{align*} \rho_n(\,f) &= \sup_{\nu \in \mathcal{P}(E^n)}\bigg(\int_{E^n}f\,{\mathrm{d}} \nu - \alpha_n(\nu)\bigg) \\* & = \sup_{\mu \in M_n}\sup_{\nu \in \mathcal{P}(E^n)}\bigg(\int_{E^n}f\,{\mathrm{d}} \nu - H(\nu \mid \mu)\bigg) \\* &= \sup_{\mu \in M_n}\log\int_{E^n} {\mathrm{e}}^{\,f}\,{\mathrm{d}} \mu.\end{align*}

To prove the first claim, note that by definition

\begin{equation*}\alpha_n(\nu) = \sum_{k=1}^n \int_{E^n}\inf_{\mu \in M}H(\nu_{k-1,k}(x_1,\ldots,x_{k-1}) \mid \mu) \nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n).\end{equation*}

For $k=2,\ldots,n$ , let $\mathcal{Y}_k$ denote the set of measurable maps from $E^{k-1}$ to M, and let $\mathcal{Y}_1 = M$ . Then the usual measurable selection argument [Reference Bertsekas and Shreve9, Proposition 7.50] yields

\begin{equation*}\alpha_n(\nu) = \sum_{k=1}^n \inf_{\eta_k \in \mathcal{Y}_k}\int_{E^n}H(\nu_{k-1,k}(x_1,\ldots,x_{k-1}) \mid \eta_k(x_1,\ldots,x_{k-1})) \nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n).\end{equation*}

Now, if $(\eta_1,\ldots,\eta_n) \in \prod_{k=1}^n\mathcal{Y}_k$ , then the measure

\[\mu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n) = \eta_1({\mathrm{d}} x_1)\prod_{k=2}^n\eta_k(x_1,\ldots,x_{k-1})({\mathrm{d}} x_k)\]

is in $M_n$ , and $\mu_{k-1,k} = \eta_k$ is a version of the conditional law. Thus

\begin{equation*}\alpha_n(\nu) \ge \inf_{\mu \in M_n}\sum_{k=1}^n\int_{E^n}H(\nu_{k-1,k}(x_1,\ldots,x_{k-1}) \mid \mu_{k-1,k}(x_1,\ldots,x_{k-1})) \nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n).\end{equation*}

On the other hand, for every $\mu \in M_n$ , the vector $(\mu_{0,1},\mu_{1,2},\ldots,\mu_{n-1,n})$ belongs to $\prod_{k=1}^n\mathcal{Y}_k$ , and we deduce the opposite inequality. Hence

\begin{align*}\alpha_n(\nu) &= \inf_{\mu \in M_n}\sum_{k=1}^n\int_{E^n}H(\nu_{k-1,k}(x_1,\ldots,x_{k-1}) \mid \mu_{k-1,k}(x_1,\ldots,x_{k-1})) \nu({\mathrm{d}} x_1,\ldots,{\mathrm{d}} x_n) \\* &= \inf_{\mu \in M_n}H(\nu \mid \mu),\end{align*}

where the last equality follows from the chain rule for relative entropy [Reference Dupuis and Ellis22, Theorem B.2.1].

Theorem 2.3 now leads to the following uniform large deviation bound.

Corollary 5.1. For $F \in C_b(\mathcal{P}(E))$ , we have

\begin{equation*}\lim_{n\rightarrow\infty}\sup_{\mu \in M_n}\dfrac{1}{n}\log\int_{E^n} {\mathrm{e}}^{nF \circ L_n}\,{\mathrm{d}} \mu = \sup_{\nu \in \mathcal{P}(E), \ \mu \in M} (F(\nu) - H(\nu \mid \mu)).\end{equation*}

For closed sets $A \subset \mathcal{P}(E)$ , we have

\begin{equation*}\limsup_{n\rightarrow\infty}\sup_{\mu \in M_n}\dfrac{1}{n}\log\mu(L_n \in A) \le -\inf \{H(\nu \mid \mu) \colon \nu \in A, \ \mu \in M\}.\end{equation*}

Proof. The first claim is an immediate consequence of Theorem 2.3 and the calculation of $\rho_n$ in Proposition 5.2. To prove the second claim, define F on $\mathcal{P}(E)$ by

\begin{align*}F(\nu) = \begin{cases}0 &\text{if } \nu \in A, \\-\infty &\text{otherwise.}\end{cases}\end{align*}

Then F is upper semicontinuous and bounded from above. Use Proposition 5.2 to compute

\begin{equation*}\rho_n(nF \circ L_n) = \sup_{\mu \in M_n}\log\int_{E^n}\exp(nF \circ L_n)\,{\mathrm{d}} \mu = \sup_{\mu \in M_n}\log\mu(L_n \in A).\end{equation*}

The proof is completed by applying Theorem 2.3 with this function F.
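To see the second bound of Corollary 5.1 in a case where everything is explicit, take the singleton family $M=\{\text{Bernoulli}(q)\}$ on $E=\{0,1\}$ and $A=\{\nu \colon \nu(\{1\}) \ge a\}$ with $a \gt q$ (all choices here are illustrative, not from the text). The rate is then $H(\text{Ber}(a) \mid \text{Ber}(q))$ , and the classical Chernoff bound ${\mathbb P}(\text{Bin}(n,q) \ge an) \le {\mathrm{e}}^{-nH}$ gives a non-asymptotic version of the corollary:

```python
import math

# Illustrative choices: M = {Bernoulli(q)} on E = {0, 1}, A = {nu : nu({1}) >= a}.
n, q, a = 50, 0.6, 0.8

# Rate from Corollary 5.1: H(Ber(a) | Ber(q)).
H = a * math.log(a / q) + (1 - a) * math.log((1 - a) / (1 - q))

# Exact tail probability P(L_n in A) = P(Bin(n, q) >= a*n) under i.i.d. sampling.
k_min = round(a * n)
tail = sum(math.comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(k_min, n + 1))

# Chernoff: tail <= exp(-n * H), so (1/n) log P(L_n in A) <= -H for every n.
print(math.log(tail) / n, -H)
```

Here the upper bound holds for every finite n, not merely in the limit, which is special to the binomial tail.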

The following proposition simplifies the strong Cramér condition (3.1) in the present context.

Proposition 5.3. Let $\psi \colon E \rightarrow {\mathbb R}_+$ be measurable. Suppose that for every $\lambda \gt 0$ we have

(5.4) \begin{equation}\sup_{\mu \in M}\int_E {\mathrm{e}}^{\lambda\psi}\,{\mathrm{d}} \mu < \infty.\end{equation}

Then the strong Cramér condition holds, i.e. $\lim_{m\rightarrow\infty}\rho(\lambda\psi 1_{\{\psi \ge m\}}) = 0$ for all $\lambda \gt 0$ . In particular, the sub-level sets of $\alpha$ are pre-compact subsets of $\mathcal{P}_\psi(E)$ .

Proof. Because ${\mathrm{e}}^{\lambda\psi}$ is $\mu$ -integrable for each $\mu\in M$ and $\lambda \gt 0$ , Proposition 5.2 implies

\begin{align*}\rho(\lambda\psi 1_{\{\psi \ge m\}}) &= \sup_{\mu \in M}\log\int_E\exp (\lambda\psi 1_{\{\psi \ge m\}})\,{\mathrm{d}} \mu \\* &\le \sup_{\mu \in M}\log\bigg( 1 + \int_{\{\psi \ge m\}} {\mathrm{e}}^{\lambda\psi}\,{\mathrm{d}} \mu\bigg).\end{align*}

Now note that

\[1_{\{\psi \ge m\}} \le \dfrac{\psi}{m} \le \dfrac{1}{m} {\mathrm{e}}^\psi\]

pointwise, and thus the assumption (5.4) yields

\begin{equation*}\limsup_{m\to\infty}\sup_{\mu \in M}\int_{\{\psi \ge m\}} {\mathrm{e}}^{\lambda\psi}\,{\mathrm{d}} \mu \le \lim_{m\to\infty}\sup_{\mu \in M}\dfrac{1}{m}\int_E {\mathrm{e}}^{(1+\lambda)\psi}\,{\mathrm{d}} \mu = 0.\end{equation*}

We are finally ready to specialize Corollary 5.1 to prove Theorem 1.4, similarly to how we specialized Theorem 1.2 to prove Corollary 1.2 in Section 4.

Proof of Theorem 1.4. Define

\[M = \bigg\{ \mu \in \mathcal{P}({\mathbb R}^d) \colon \log\int_{{\mathbb R}^d} {\mathrm{e}}^{\langle y,x\rangle}\mu({\mathrm{d}} x) \le \varphi(y)\ \ \text{for all } y \in {\mathbb R}^d\bigg\}.\]

We claim that M is weakly compact. Indeed, it is clearly convex, and closedness follows from Fatou’s lemma (cf. [Reference Dupuis and Ellis22, Theorem A.3.12]). To prove tightness, let $e_1,\ldots,e_d$ denote the standard basis vectors in ${\mathbb R}^d$ . Write $x=(x_1,\ldots,x_d)$ for a generic element of ${\mathbb R}^d$ . For each $\mu \in M$ and $t \gt 0$ , Markov’s inequality yields

\begin{align*}\mu \{x\in {\mathbb R}^d \colon \max_{i=1,\ldots,d}|x_i| \gt t \} &\le \sum_{i=1}^d (\mu\{x \in {\mathbb R}^d \colon x_i \gt t/2\} + \mu\{x \in {\mathbb R}^d \colon -x_i \gt t/2\}) \\* &\le \sum_{i=1}^d {\mathrm{e}}^{-t/2}\int_{{\mathbb R}^d} ({\mathrm{e}}^{x_i} + {\mathrm{e}}^{-x_i})\,\mu({\mathrm{d}} x) \\* &\le {\mathrm{e}}^{-t/2}\sum_{i=1}^d ({\mathrm{e}}^{\varphi(e_i)} + {\mathrm{e}}^{\varphi({-}e_i)}),\end{align*}

and we deduce that M is tight. Now define $\psi(x) = \sum_{i=1}^d|x_i|$ and note that

\begin{equation*}\sup_{\mu \in M}\int_{{\mathbb R}^d}\exp(\lambda\psi)\,{\mathrm{d}} \mu < \infty\quad \text{for all } \lambda \ge 0.\end{equation*}

Proposition 5.3 then shows that the strong Cramér condition holds. Define a closed set $B \subset \mathcal{P}_\psi({\mathbb R}^d)$ by

\[B = \bigg\{\nu \in \mathcal{P}_\psi({\mathbb R}^d) \colon \int_{{\mathbb R}^d} z\,\nu({\mathrm{d}} z) \in A\bigg\},\]

where A was the given closed subset of ${\mathbb R}^d$ . Corollary 5.1 yields

\begin{equation*}\limsup_{n\rightarrow\infty}\sup_{\mu \in M_n}\dfrac{1}{n}\log\mu(L_n \in B) \le -\inf\bigg\{\alpha(\nu) \colon \nu \in \mathcal{P}_\psi({\mathbb R}^d), \ \int x\,\nu({\mathrm{d}} x) \in A\bigg\}.\end{equation*}

Now let $(S_0,\ldots,S_n) \in \mathcal{S}_{d,\varphi}$ . The law of $S_1$ belongs to M, and the conditional law of $S_k-S_{k-1}$ given $S_1,\ldots,S_{k-1}$ belongs almost surely to M, for each k, and so the law of $(S_1,S_2-S_1,\ldots,S_n-S_{n-1})$ belongs to $M_n$ . Thus

\[{\mathbb P} (S_n/n \in A) \le \sup_{\mu \in M_n}\mu(L_n \in B),\]

and all that remains is to prove that

\begin{equation*}\inf\bigg\{\alpha(\nu) \colon \nu \in \mathcal{P}_\psi({\mathbb R}^d), \ \int z\,\nu({\mathrm{d}} z) \in A\bigg\} \ge \inf_{x \in A}\varphi^*(x).\end{equation*}

To prove this, it suffices to show $\Psi(x) \ge \varphi^*(x)$ for every $x \in {\mathbb R}^d$ , where

\begin{equation*} \Psi(x) \,:\!= \inf\bigg\{\alpha(\nu) \colon \nu \in \mathcal{P}_\psi({\mathbb R}^d), \ \int z\,\nu({\mathrm{d}} z) =x\bigg\}.\end{equation*}

To this end, note that for all $y \in {\mathbb R}^d$

\[\rho(\langle \cdot,y\rangle) = \sup_{\mu \in M}\log\int_{{\mathbb R}^d} {\mathrm{e}}^{\langle z,y\rangle}\mu({\mathrm{d}} z) \le \varphi(y),\]

and then use the representation (2.1) to get

\begin{equation*}\Psi(x) = \sup_{y \in {\mathbb R}^d} (\langle x,y\rangle - \rho(\langle \cdot,y\rangle)) \ge \sup_{y \in {\mathbb R}^d}(\langle x,y\rangle - \varphi(y)) = \varphi^*(x).\tag*{$\Box$}\end{equation*}
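The convex conjugate $\varphi^*$ appearing in this proof is straightforward to approximate numerically. For the illustrative choice $\varphi(y)=y^2/2$ in dimension one (so that $\varphi^*(x)=x^2/2$ ), a grid search recovers the conjugate:

```python
# Grid-search approximation of the convex conjugate
# phi*(x) = sup_y (x*y - phi(y)), for the illustrative choice
# phi(y) = y^2 / 2, whose conjugate is x^2 / 2.

def phi(y):
    return y * y / 2

ys = [i / 100 for i in range(-500, 501)]  # grid over [-5, 5]

def conjugate(x):
    return max(x * y - phi(y) for y in ys)

print(conjugate(1.5), 1.5 ** 2 / 2)  # both equal 1.125
```

The grid is exact here because the maximizer $y=x$ lies on the grid; in general the approximation error is controlled by the grid spacing and the curvature of $\varphi$ .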

6. Optimal transport and control

This section discusses the example of Section 1.4 in more detail. Again let E be a Polish space, and fix a lower semicontinuous function $c \colon E^2 \rightarrow [0,\infty]$ which is not identically equal to $\infty$ . Fix $\mu \in \mathcal{P}(E)$ , and define

\[\alpha(\nu) = \inf_{\pi \in \Pi(\mu,\nu)}\int c\,{\mathrm{d}} \pi,\]

where $\Pi(\mu,\nu)$ is the set of probability measures on $E \times E$ with first marginal $\mu$ and second marginal $\nu$ . Assume that $\int_Ec(x,x)\mu({\mathrm{d}} x) \lt \infty$ ; in many practical cases, $c(x,x)=0$ for all x, so this is not a restrictive assumption and merely ensures that $\alpha(\mu) \lt \infty$ . Kantorovich duality [Reference Villani52, Theorem 1.3] shows that

\begin{equation*}\alpha(\nu) = \sup\bigg\{\int_Ef\,{\mathrm{d}} \nu - \int_Eg\,{\mathrm{d}} \mu \colon f,g \in C_b(E), \ f(y) - g(x) \le c(x,y) \ \ \text{for all } x,y\bigg\}.\end{equation*}

This immediately shows that $\alpha$ is convex and weakly lower semicontinuous. The next two lemmas identify, respectively, the dual $\rho$ and the modest conditions that ensure that $\alpha$ has compact sub-level sets.

Lemma 6.1. Given $\alpha$ as above, and defining $\rho$ as usual by (1.1), we have

(6.1) \begin{equation}\rho(\,f) = \int_ER_cf\,{\mathrm{d}} \mu\quad \text{for all}\ f \in B(E),\end{equation}

where $R_cf \colon E \rightarrow {\mathbb R}$ is defined by

\[R_cf(x) = \sup_{y \in E} (\,f(y) - c(x,y)).\]

Proof. Note that $R_cf$ is universally measurable (e.g. by [Reference Bertsekas and Shreve9, Proposition 7.50]), so the integral in (6.1) makes sense. Now compute

\begin{align*}\rho(\,f) &= \sup_{\nu \in \mathcal{P}(E)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \alpha(\nu)\bigg) \\* &= \sup_{\nu \in \mathcal{P}(E)}\sup_{\pi \in \Pi(\mu,\nu)}\bigg(\int_Ef\,{\mathrm{d}} \nu - \int_{E^2}c\,{\mathrm{d}} \pi\bigg) \\* &= \sup_{\pi \in \Pi(\mu)}\int_{E^2} (\,f(y) - c(x,y))\pi({\mathrm{d}} x,{\mathrm{d}} y),\end{align*}

where $\Pi(\mu)$ is the set of $\pi \in \mathcal{P}(E \times E)$ with first marginal $\mu$ . Use the standard measurable selection theorem [Reference Bertsekas and Shreve9, Proposition 7.50] to find a measurable map $Y \colon E \rightarrow E$ such that $R_cf(x) = f(Y(x)) - c(x,Y(x))$ for $\mu$ -a.e. x. Then, choosing $\pi({\mathrm{d}} x,{\mathrm{d}} y) = \mu({\mathrm{d}} x)\delta_{Y(x)}({\mathrm{d}} y)$ shows

\[\rho(\,f) \ge \int_E (\,f(Y(x))-c(x,Y(x)))\mu({\mathrm{d}} x) = \int_ER_cf\,{\mathrm{d}} \mu.\]

On the other hand, it is clear that for every $\pi \in \Pi(\mu)$ we have

\begin{equation*}\int_{E^2}(\,f(y) - c(x,y))\pi({\mathrm{d}} x,{\mathrm{d}} y) \le \int_{E}\sup_{y \in E}(\,f(y) - c(x,y))\mu({\mathrm{d}} x) = \int_ER_cf\,{\mathrm{d}} \mu.\end{equation*}
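On a two-point space, the identity (6.1) can be verified by brute force: the coupling set $\Pi(\mu,\nu)$ is parametrized by the single number $\pi(\{(0,0)\})$ , over which the cost is linear, and the supremum over $\nu$ can be scanned on a grid. All numerical choices below are illustrative:

```python
# E = {0, 1}; arbitrary illustrative choices of mu, cost c, and f.
mu = [0.7, 0.3]
c = [[0.0, 1.0], [2.0, 0.0]]   # c[x][y]
f = [0.5, 1.8]

# Right-hand side of (6.1): int_E R_c f dmu, with R_c f(x) = max_y (f(y) - c(x, y)).
rhs = sum(mu[x] * max(f[y] - c[x][y] for y in (0, 1)) for x in (0, 1))

def ot_cost(p):
    # Minimal transport cost between mu and nu = (p, 1 - p): the coupling is
    # determined by a = pi({(0, 0)}), which ranges over an interval, and the
    # cost is linear in a, so the minimum sits at an endpoint.
    lo, hi = max(0.0, mu[0] - (1 - p)), min(mu[0], p)
    def cost(a):
        return (a * c[0][0] + (mu[0] - a) * c[0][1]
                + (p - a) * c[1][0] + (1 - p - mu[0] + a) * c[1][1])
    return min(cost(lo), cost(hi))

# Left-hand side: rho(f) = sup_nu ( int f dnu - alpha(nu) ), via a grid on nu.
lhs = max(p * f[0] + (1 - p) * f[1] - ot_cost(p)
          for p in (i / 1000 for i in range(1001)))
print(lhs, rhs)  # the two sides agree
```

The optimal $\nu$ found by the scan is exactly the pushforward of $\mu$ under the pointwise maximizer $Y(x)$ from the proof above.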

Lemma 6.2. Suppose that for each compact set $K \subset E$ , the function $h_K(y) \,:\!= \inf_{x \in K}c(x,y)$ has pre-compact sub-level sets. Then $\alpha$ has compact sub-level sets. In fact, since c is lower semicontinuous, so is $h_K$ (see [Reference Aliprantis and Border3, Lemma 17.30]). Thus, our assumption is equivalent to requiring $\{y \in E \colon h_K(y) \le m\}$ to be compact for each $m \ge 0$ .

Proof. We already know that $\alpha$ has closed sub-level sets, so we must show only that they are tight. Fix $\nu \in \mathcal{P}(E)$ such that $\alpha(\nu) \lt \infty$ (noting that such $\nu$ certainly exist, as $\mu$ is one example). Fix $\epsilon \gt 0$ , and find $\pi \in \Pi(\mu,\nu)$ such that

(6.2) \begin{equation}\int c\,{\mathrm{d}} \pi \le \alpha(\nu) + \epsilon < \infty.\end{equation}

As finite measures on Polish spaces are tight, we may find a compact set $K \subset E$ such that $\mu(K^c) \le \epsilon$ . Set $K_n \,:\!= \{y \in E \colon h_K(y) \lt n\}$ for each n, and note that this set is pre-compact by assumption. Disintegrate $\pi$ by finding a measurable map $E \ni x \mapsto \pi_x \in \mathcal{P}(E)$ such that $\pi({\mathrm{d}} x,{\mathrm{d}} y) = \mu({\mathrm{d}} x)\pi_x({\mathrm{d}} y)$ . By Markov’s inequality, for each $n \gt 0$ and each $x \in K$ we have

\begin{equation*}\pi_x(K_n^c) \le \pi_x\{y \in E \colon c(x,y) \ge n\} \le \dfrac{1}{n}\int_E c(x,y)\pi_x({\mathrm{d}} y).\end{equation*}

Using this and inequality (6.2) along with the assumption that c is non-negative,

\begin{align*}\nu(K_n^c) &= \int_E\mu({\mathrm{d}} x)\pi_x(K_n^c) \\* &\le \mu(K^c) + \int_K\mu({\mathrm{d}} x)\pi_x(K_n^c) \\ &\le \epsilon + \dfrac{1}{n}\int_K\mu({\mathrm{d}} x)\int_E\pi_x({\mathrm{d}} y) c(x,y) \\ &\le \epsilon + \dfrac{1}{n}\int_{E \times E}c\,{\mathrm{d}} \pi \\* &\le \bigg(1 + \dfrac{1}{n}\bigg)\epsilon + \dfrac{1}{n}\alpha(\nu).\end{align*}

As $\epsilon$ was arbitrary, we have $\nu(K_n^c) \le \alpha(\nu)/n$ . Thus, for each $m \gt 0$ , the sub-level set $\{\nu \in \mathcal{P}(E) \colon \alpha(\nu) \le m\}$ is contained in the tight set

\begin{equation*}\bigcap_{n=1}^\infty \{\nu \in \mathcal{P}(E) \colon \nu(K_n^c) \le m/n\}.\end{equation*}

Let us now compute $\rho_n$ . It is convenient to work with more probabilistic notation, so let us suppose $(X_i)_{i=1}^\infty$ is a sequence of i.i.d. E-valued random variables with common law $\mu$ , defined on some fixed probability space. For each n, let $\mathcal{Y}_n$ denote the set of equivalence classes of a.s. equal $E^n$ -valued random variables $(Y_1,\ldots,Y_n)$ where $Y_k$ is $(X_1,\ldots,X_k)$ -measurable for each $k=1,\ldots,n$ .

Proposition 6.1. For each $n \ge 1$ and each $f \in B(E^n)$ ,

\[\rho_n(\,f) = \sup_{(Y_1,\ldots,Y_n) \in \mathcal{Y}_n}{\mathbb E}\Bigg[\,f(Y_1,\ldots,Y_n) - \sum_{i=1}^nc(X_i,Y_i)\Bigg].\]

Proof. The proof is by induction. Let us first rewrite $\rho$ in our probabilistic notation:

\[\rho(\,f) = {\mathbb E}\bigg[\sup_{y \in E}[\,f(y)-c(X_1,y)]\bigg].\]

Using a standard measurable selection argument [Reference Bertsekas and Shreve9, Proposition 7.50], we deduce that

\[\rho(\,f) = \sup_{Y_1 \in \mathcal{Y}_1}{\mathbb E}[\,f(Y_1)-c(X_1,Y_1)].\]

The inductive step proceeds as follows. Suppose we have proved the claim for a given n. Fix $f \in B(E^{n+1})$ and define $g \in B(E^n)$ by

\[g(x_1,\ldots,x_n) \,:\!= \rho(\,f(x_1,\ldots,x_n,\cdot)),\]

so that by Proposition A.1 we have $\rho_{n+1}(\,f)=\rho_n(g)$ . Since $X_1$ and $X_{n+1}$ have the same distribution, we may relabel to find

\begin{align*}g(x_1,\ldots,x_n) &= \sup_{Y_1 \in \mathcal{Y}_1}{\mathbb E}[\,f(x_1,\ldots,x_n,Y_1)-c(X_1,Y_1)] \\* &= \sup_{Y_{n+1} \in \mathcal{Y}_{n+1}^1}{\mathbb E}[\,f(x_1,\ldots,x_n,Y_{n+1})-c(X_{n+1},Y_{n+1})],\end{align*}

where we define $\mathcal{Y}^1_{n+1}$ to be the set of $X_{n+1}$ -measurable E-valued random variables. Now note that any $(Y_1,\ldots,Y_n)$ in $\mathcal{Y}_n$ is $(X_1,\ldots,X_n)$ -measurable, and independence of $(X_i)_{i=1}^\infty$ implies

\[g(Y_1,\ldots,Y_n) = \sup_{Y_{n+1} \in \mathcal{Y}_{n+1}^1}{\mathbb E}[\,f(Y_1,\ldots,Y_n,Y_{n+1})-c(X_{n+1},Y_{n+1})\mid Y_1,\ldots,Y_n].\]

We claim that

(6.3) \begin{equation}{\mathbb E}[g(Y_1,\ldots,Y_n)] = \sup_{Y_{n+1}}{\mathbb E}[\,f(Y_1,\ldots,Y_n,Y_{n+1}) - c(X_{n+1},Y_{n+1})],\end{equation}

where the supremum is over $(X_1,\ldots,X_{n+1})$ -measurable E-valued random variables $Y_{n+1}$ . Indeed, once this is established, we conclude as desired (using Proposition A.1) that

\begin{align*} \rho_{n+1}(\,f) &= \rho_n(g) \\* & = \sup_{(Y_1,\ldots,Y_n) \in \mathcal{Y}_n}{\mathbb E}\Bigg[g(Y_1,\ldots,Y_n) - \sum_{i=1}^nc(X_i,Y_i)\Bigg] \\* &= \sup_{(Y_1,\ldots,Y_n) \in \mathcal{Y}_n}\sup_{Y_{n+1}}{\mathbb E}\Bigg[\,f(Y_1,\ldots,Y_n,Y_{n+1}) - \sum_{i=1}^{n+1}c(X_i,Y_i)\Bigg].\end{align*}

Hence, the rest of the proof is devoted to justifying (6.3), which is really an interchange of supremum and expectation.

Note that $\mathcal{Y}_{n+1}^1$ is a Polish space when topologized by convergence in measure. The function $h \colon E^n \times \mathcal{Y}^1_{n+1} \rightarrow {\mathbb R}$ given by

\[h(x_1,\ldots,x_n;\ Y_{n+1}) \,:\!= {\mathbb E} [\,f(x_1,\ldots,x_n,Y_{n+1})-c(X_{n+1},Y_{n+1}) ]\]

is jointly measurable. Note as before that independence implies that for every $(Y_1,\ldots,Y_n) \in \mathcal{Y}_n$ and $Y_{n+1} \in \mathcal{Y}^1_{n+1}$ we have, for a.e. $\omega$ ,

(6.4) \begin{equation}h(Y_1(\omega),\ldots,Y_n(\omega);\ Y_{n+1})= {\mathbb E}[ f(Y_1,\ldots,Y_n,Y_{n+1})-c(X_{n+1},Y_{n+1})\mid Y_1,\ldots,Y_n](\omega).\end{equation}

Using the usual measurable selection theorem [Reference Bertsekas and Shreve9, Proposition 7.50] we get

\begin{align*}{\mathbb E}[g(Y_1,\ldots,Y_n)] &= {\mathbb E}\bigg[\sup_{Y_{n+1} \in \mathcal{Y}_{n+1}^1}h(Y_1(\!\cdot\!),\ldots,Y_n(\!\cdot\!);\ Y_{n+1})\bigg] \\* &= \sup_{H \in \widetilde{\mathcal{Y}}_{n+1}^1}{\mathbb E}[h(Y_1(\!\cdot\!),\ldots,Y_n(\!\cdot\!);\ H(Y_1,\ldots,Y_n))],\end{align*}

where $\widetilde{\mathcal{Y}}_{n+1}^1$ denotes the set of measurable maps $H \colon E^n \rightarrow \mathcal{Y}^1_{n+1}$ . But a measurable map $H \colon E^n \rightarrow \mathcal{Y}^1_{n+1}$ can be identified almost everywhere with an $(X_1,\ldots,X_{n+1})$ -measurable random variable $Y_{n+1}$ . Precisely, by Lemma B.1 (in the appendix) there exists a jointly measurable map $\varphi \colon E^{n+1} \rightarrow E$ such that, for $\mu^n$ -a.e. $(x_1,\ldots,x_n) \in E^n$ , we have

\[\varphi(x_1,\ldots,x_{n+1}) = H(x_1,\ldots,x_n)(x_{n+1})\quad \text{for $ \mu$-a.e.\ $ x_{n+1} \in E$}.\]

Define $Y_{n+1} = \varphi(X_1,\ldots,X_{n+1})$ , and note that (6.4) implies, for a.e. $\omega$ ,

\begin{align*}& h(Y_1(\omega),\ldots,Y_n(\omega);\ H(Y_1,\ldots,Y_n)) \\*&\quad\, = {\mathbb E} [\,f(Y_1,\ldots,Y_n,Y_{n+1})-c(X_{n+1},Y_{n+1})\mid Y_1,\ldots,Y_n ](\omega).\end{align*}

This identification of $\widetilde{\mathcal{Y}}_{n+1}^1$ and the tower property of conditional expectations leads to (6.3).
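For a finite E, the control representation of Proposition 6.1 can be cross-checked against the recursion $\rho_2(\,f)=\rho(g)$ , $g(x)=\rho(\,f(x,\cdot))$ , of Proposition A.1 by enumerating every adapted pair $(Y_1,Y_2)$ . The space, cost, measure, and f below are arbitrary illustrative choices:

```python
import itertools

# E = {0, 1}; arbitrary illustrative choices of mu, cost c, and f on E^2.
E = (0, 1)
mu = {0: 0.6, 1: 0.4}
c = {(x, y): 1.5 * abs(x - y) for x in E for y in E}
f = {(y1, y2): 0.3 * y1 + 1.0 * y2 + 0.4 * y1 * y2 for y1 in E for y2 in E}

def rho(phi):
    # rho(phi) = E[ max_y (phi(y) - c(X, y)) ] for X distributed as mu
    return sum(mu[x] * max(phi[y] - c[x, y] for y in E) for x in E)

# Recursive computation: rho_2(f) = rho(g) with g(x) = rho(f(x, .)).
g = {x: rho({y: f[x, y] for y in E}) for x in E}
rho2_recursive = rho(g)

# Brute force over adapted controls: Y1 = y1(X1) and Y2 = y2(X1, X2).
best = max(
    sum(mu[x1] * mu[x2] * (f[y1[x1], y2[x1, x2]]
                           - c[x1, y1[x1]] - c[x2, y2[x1, x2]])
        for x1 in E for x2 in E)
    for y1_vals in itertools.product(E, repeat=2)
    for y2_vals in itertools.product(E, repeat=4)
    for y1 in [dict(zip(E, y1_vals))]
    for y2 in [dict(zip(itertools.product(E, E), y2_vals))]
)
print(rho2_recursive, best)  # the two computations agree
```

The brute-force search runs over $2^2 \cdot 2^4 = 64$ adapted pairs, which is exactly the finite-space analog of the supremum over $\mathcal{Y}_2$ in the proposition.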

Appendix A. A recursive formula for $\rho_n$

In this brief section we make rigorous the claim in (1.4). To do so requires a brief review of analytic sets, needed only for this section. A subset of a Polish space is analytic if it is the image of a Borel subset of another Polish space through a Borel-measurable function. A real-valued function f on a Polish space is upper semianalytic if $\{\,f \ge c\}$ is an analytic set for each $c \in {\mathbb R}$ . It is well known that every analytic set is universally measurable [Reference Bertsekas and Shreve9, Corollary 7.42.1], and thus every upper semianalytic function is universally measurable. The defining formula for $\rho_n$ given in (1.1) makes sense even when $f \colon E^n \rightarrow {\mathbb R}$ is bounded and universally measurable, or in particular when f is upper semianalytic.

Proposition A.1 Let $n \gt 1$ . Suppose $f \colon E^n \rightarrow {\mathbb R}$ is upper semianalytic. Define $g \colon E^{n-1} \rightarrow {\mathbb R}$ by

\[g(x_1,\ldots,x_{n-1}) = \rho (\,f(x_1,\ldots,x_{n-1},\cdot)).\]

Then g is upper semianalytic, and $\rho_n(\,f) = \rho_{n-1}(g)$ .

Proof. To show that g is upper semianalytic, note that

\begin{align*}g(x_1,\ldots,x_{n-1}) &= \rho(\,f(x_1,\ldots,x_{n-1},\cdot)) \\* &= \sup_{\nu \in \mathcal{P}(E)}\bigg(\int_Ef(x_1,\ldots,x_{n-1},\cdot)\,{\mathrm{d}} \nu - \alpha(\nu)\bigg).\end{align*}

Clearly $\alpha$ is Borel-measurable: its sub-level sets are compact, so it is lower semicontinuous. It follows from [Reference Bertsekas and Shreve9, Proposition 7.48] that the term in parentheses is upper semianalytic as a function of $(x_1,\ldots,x_{n-1},\nu)$ . Hence, g is itself upper semianalytic, by [Reference Bertsekas and Shreve9, Proposition 7.47].

We now turn toward the proof of the recursive formula for $\rho_n$ . Note first that the definition of $\alpha_n$ can be written recursively by setting $\alpha_1 = \alpha$ and, for $\nu \in \mathcal{P}(E^n)$ and a kernel K from $E^n$ to E (i.e. a Borel-measurable map $x \mapsto K_x$ from $E^n$ to $\mathcal{P}(E)$ ), setting

(A.1) \begin{equation}\alpha_{n+1}(\nu({\mathrm{d}} x)K_x({\mathrm{d}} x_{n+1})) = \int_{E^n}\alpha(K_x)\nu({\mathrm{d}} x) + \alpha_n(\nu).\end{equation}

Fix $f \in B(E^{n+1})$ , and note that $g(x_1,\ldots,x_n) \,:\!= \rho(\,f(x_1,\ldots,x_n,\cdot))$ is upper semianalytic by the above argument. By definition,

(A.2) \begin{equation}\rho_n(g) = \sup_{\nu \in \mathcal{P}(E^n)}\bigg\{\int_{E^n}g\,{\mathrm{d}} \nu - \alpha_n(\nu)\bigg\}.\end{equation}

By a well-known measurable selection argument [Reference Bertsekas and Shreve9, Proposition 7.50], for each $\nu \in \mathcal{P}(E^n)$ it holds that

\begin{align*}\int_{E^n}g\,{\mathrm{d}} \nu &= \int_{E^n}\sup_{\eta \in \mathcal{P}(E)}\bigg(\int_Ef(x_1,\ldots,x_n,x_{n+1})\eta({\mathrm{d}} x_{n+1}) - \alpha(\eta) \bigg)\nu({\mathrm{d}} x) \\* &= \sup_{K}\bigg(\int_{E^n}\int_Ef(x_1,\ldots,x_{n+1})K_{x}({\mathrm{d}} x_{n+1})\nu({\mathrm{d}} x) - \int_{E^n}\alpha(K_x) \nu({\mathrm{d}} x)\bigg),\end{align*}

where we have abbreviated $x=(x_1,\ldots,x_n)$ , and where the supremum is over all kernels from $E^n$ to E, i.e. all Borel-measurable maps from $E^n$ to $\mathcal{P}(E)$ . A priori, the supremum should be taken over maps K from $E^n$ to $\mathcal{P}(E)$ which are measurable with respect to the smallest $\sigma$ -field containing the analytic sets. But any such map is universally measurable and thus agrees $\nu$ -a.e. with a Borel-measurable map. Every probability measure on $E^{n+1}$ can be written as $\nu({\mathrm{d}} x)K_x({\mathrm{d}} x_{n+1})$ for some $\nu \in \mathcal{P}(E^n)$ and some kernel K from $E^n$ to E. Thus, in light of (A.1) and (A.2),

\begin{align*} \rho_n(g) &= \sup_{\nu \in \mathcal{P}(E^n)}\sup_{K}\Bigg[\int_{E^n}\int_Ef(x_1,\ldots,x_{n+1})K_{x}({\mathrm{d}} x_{n+1})\nu({\mathrm{d}} x) - \int_{E^n}\alpha(K_x) \nu({\mathrm{d}} x) - \alpha_n(\nu)\Bigg] \\ &= \sup_{\nu \in \mathcal{P}(E^{n+1})}\bigg(\int_{E^{n+1}}f\,{\mathrm{d}}\nu - \alpha_{n+1}(\nu)\bigg) \\ &= \rho_{n+1}(\,f).\end{align*}

In general, the function g in Proposition A.1 can fail to be Borel-measurable. For instance, if E is compact and $\alpha \equiv 0$ , then our standing assumptions hold. In this case $\rho(\,f) = \sup_{x \in E}f(x)$ for $f \in B(E)$ . For $f \in B(E^2)$ we have $\rho(\,f(x,\cdot)) = \sup_{y \in E}f(x,y)$ . If $f(x,y)=1_A(x,y)$ for a Borel set $A \subset E^2$ whose projections are not Borel, then $\rho(\,f(x,\cdot))$ is not Borel. Credit is due to Daniel Bartl for pointing out this simple counterexample to an inaccurate claim in an earlier version of the paper; his paper [Reference Bartl6] shows why semianalytic functions are essential in this context.

Appendix B. Two technical lemmas

Here we state and prove two technical lemmas: one that was used in the proof of Proposition 6.1, and a simple extension of Jensen’s inequality to convex functions of random measures. The first lemma essentially says that if $f=f(x,y)$ is a function of two variables such that the map $x \mapsto f(x,\cdot)$ is measurable, from E into an appropriate function space, then f is essentially jointly measurable.

Lemma B.1. Let $(\Omega,\mathcal{F},P)$ be a standard Borel probability space, let E be a Polish space, and let $\mu \in \mathcal{P}(E)$ . Let $L^0$ denote the set of equivalence classes of $\mu$ -a.e. equal measurable functions from E to E, and endow $L^0$ with the topology of convergence in measure. If $H \colon \Omega \rightarrow L^0$ is measurable, then there exists a jointly measurable function $h \colon \Omega \times E \rightarrow E$ such that, for P-a.e. $\omega$ , we have $H(\omega)(x) = h(\omega,x)$ for $\mu$ -a.e. $x \in E$ .

Proof. By Borel isomorphism, we may assume without loss of generality that $\Omega = E = [0,1]$ . In particular, $H(\omega)(x) \in [0,1]$ for all $\omega,x \in [0,1]$ . Let $L^1$ denote the set of $P\times\mu$ -integrable (equivalence classes of a.s. equal) measurable functions from $[0,1]^2$ to ${\mathbb R}$ . Define a linear functional $T \colon L^1 \rightarrow {\mathbb R}$ by

\[T(\varphi) = \int P({\mathrm{d}}\omega)\int\mu({\mathrm{d}} x)H(\omega)(x)\varphi(\omega,x).\]

This is well-defined because the function

\[\omega \mapsto \int\mu({\mathrm{d}} x)H(\omega)(x)\varphi(\omega,x)\]

is measurable; indeed, this is easily checked for $\varphi$ of the form $\varphi(\omega,x)=f(\omega)g(x)$ , for f and g bounded and measurable, and the general case follows from a monotone class argument. Because $|H(\omega)(x)|\le 1$ , it is readily checked that T is continuous. Thus T belongs to the continuous dual of $L^1$ , and there exists a bounded measurable function $h \colon [0,1]^2 \rightarrow {\mathbb R}$ such that

\[T(\varphi) = \int P({\mathrm{d}}\omega)\int\mu({\mathrm{d}} x)h(\omega,x)\varphi(\omega,x),\]

for all $\varphi \in L^1$ . It is straightforward to check that this h has the desired property.

Our final lemma, an infinite-dimensional form of Jensen’s inequality, is surely known, but we were unable to locate a precise reference, and the proof is quite short.

Lemma B.2. Fix $P \in \mathcal{P}(\mathcal{P}(E))$ , and define the mean measure $\overline{P} \in \mathcal{P}(E)$ by

\[\overline{P}(A) = \int_{\mathcal{P}(E)}m(A)\,P({\mathrm{d}} m).\]

Then, for any function $G \colon \mathcal{P}(E) \rightarrow ({-}\infty,\infty]$ which is convex, bounded from below, and weakly lower semicontinuous, we have

\[G(\overline{P}) \le \int_{\mathcal{P}(E)}G \,{\mathrm{d}} P.\]

Proof. Define (on some probability space) i.i.d. $\mathcal{P}(E)$ -valued random variables $(\mu_i)_{i=1}^\infty$ with common law P. Define the partial averages

\[S_n = \dfrac{1}{n}\sum_{i=1}^n\mu_i.\]

For any $f \in C_b(E)$ , the law of large numbers implies

\[\lim_{n\rightarrow \infty}\int_Ef\,{\mathrm{d}} S_n = \lim_{n\rightarrow \infty}\dfrac{1}{n}\sum_{i=1}^n\int_Ef\,{\mathrm{d}} \mu_i = {\mathbb E}\int_Ef\,{\mathrm{d}} \mu_1 = \int_Ef\,{\mathrm{d}}\overline{P} \ \ \text{a.s.}\]

This easily shows that $S_n \rightarrow \overline{P}$ weakly a.s. Use Fatou’s lemma and the assumptions on G to get

\begin{equation*}G(\overline{P}) \le \liminf_{n\rightarrow \infty}{\mathbb E}[G(S_n)] \le \liminf_{n\rightarrow \infty}{\mathbb E}\Bigg[\dfrac{1}{n}\sum_{i=1}^nG(\mu_i)\Bigg] = {\mathbb E}[G(\mu_1)] = \int_{\mathcal{P}(E)}G \,{\mathrm{d}} P.\tag*{$\Box$}\end{equation*}
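A finite toy version of Lemma B.2 is easy to check directly: take P supported on two measures on a two-point space and $G(m) = (\int f\,{\mathrm{d}} m)^2$ , which is convex and weakly continuous. All numbers below are arbitrary illustrative choices:

```python
# Toy check of Lemma B.2 on E = {0, 1}: P = (delta_{m1} + delta_{m2}) / 2,
# G(m) = (int f dm)^2, which is convex and weakly continuous.
m1 = [0.2, 0.8]
m2 = [0.9, 0.1]
f = [1.0, 3.0]

mean_measure = [(a + b) / 2 for a, b in zip(m1, m2)]  # the mean measure P-bar

def G(m):
    return sum(fi * mi for fi, mi in zip(f, m)) ** 2

lhs = G(mean_measure)          # G(P-bar)
rhs = (G(m1) + G(m2)) / 2      # int G dP
print(lhs, rhs)  # lhs <= rhs
```

Here the gap $\textrm{rhs} - \textrm{lhs}$ is exactly the variance of $\int f\,{\mathrm{d}}\mu_1$ under P, which vanishes precisely when P is a point mass.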

Acknowledgements

The author is indebted to Stephan Eckstein and Daniel Bartl as well as two anonymous referees for their careful feedback, which greatly improved the exposition and accuracy of the paper.

References

Acciaio, B. and Penner, I. (2011). Dynamic risk measures. In Advanced Mathematical Methods for Finance, eds Di Nunno, G. and Øksendal, B., pp. 1–34. Springer.
Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43 (2), 904–924.
Aliprantis, C. and Border, K. (2007). Infinite Dimensional Analysis: A Hitchhiker’s Guide, 3rd edn. Springer.
Atar, R., Chowdhary, K. and Dupuis, P. (2015). Robust bounds on risk-sensitive functionals via Rényi divergence. SIAM/ASA J. Uncertain. Quantif. 3 (1), 18–33.
Backhoff-Veraguas, J., Lacker, D. and Tangpi, L. (2018). Non-exponential Sanov and Schilder theorems on Wiener space: BSDEs, Schrödinger problems and control. Available at arXiv:1810.01980.
Bartl, D. (2019). Conditional nonlinear expectations. Stoch. Process. Appl. Available at arXiv:1612.09103v2.
Ben-Tal, A. and Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Manag. Sci. 32 (11), 1445–1466.
Ben-Tal, A. and Teboulle, M. (2007). An old–new concept of convex risk measures: the optimized certainty equivalent. Math. Finance 17 (3), 449–476.
Bertsekas, D. and Shreve, S. (1996). Stochastic Optimal Control: The Discrete Time Case. Athena Scientific.
Blanchet, A. and Carlier, G. (2015). Optimal transport and Cournot–Nash equilibria. Math. Operat. Res. 41 (1), 125–145.
Bobkov, S. and Ding, Y. (2014). Optimal transport and Rényi informational divergence. Preprint.
Boissard, E. (2011). Simple bounds for the convergence of empirical and occupation measures in 1-Wasserstein distance. Electron. J. Prob. 16, 2296–2333.
Borovkov, A. A. and Borovkov, K. A. (2008). Asymptotic Analysis of Random Walks: Heavy-Tailed Distributions (Encyclopedia Math. Appl. 118). Cambridge University Press.
Cheridito, P. and Kupper, M. (2011). Composition of time-consistent dynamic monetary risk measures in discrete time. Internat. J. Theoret. Appl. Finance 14 (1), 137–162.
De Haan, L. and Lin, T. (2001). On convergence toward an extreme value distribution in C[0, 1]. Ann. Prob. 29, 467–483.
Dembo, A. and Zeitouni, O. (2009). Large Deviations Techniques and Applications (Stoch. Model. Appl. Prob. 38). Springer Science & Business Media.
Denisov, D., Dieker, A. B. and Shneer, V. (2008). Large deviations for random walks under subexponentiality: the big-jump domain. Ann. Prob. 36 (5), 1946–1991.
Ding, Y. (2014). Wasserstein-divergence transportation inequalities and polynomial concentration inequalities. Statist. Prob. Lett. 94, 77–85.
Dudley, R. M. (1969). The speed of mean Glivenko–Cantelli convergence. Ann. Math. Statist. 40 (1), 40–50.
Dudley, R. M. (2018). Real Analysis and Probability. CRC Press.
Dupacová, J. and Wets, R. J. B. (1988). Asymptotic behavior of statistical estimators and of optimal solutions of stochastic optimization problems. Ann. Statist. 16, 1517–1549.
Dupuis, P. and Ellis, R. S. (2011). A Weak Convergence Approach to the Theory of Large Deviations (Wiley Series Prob. Statist. 902). John Wiley.
Eckstein, S. (2019). Extended Laplace principle for empirical measures of a Markov chain. Adv. Appl. Prob. 51 (1), 136–167.
Eichelsbacher, P. and Schmock, U. (1996). Large deviations of products of empirical measures and U-empirical measures in strong topologies. Sonderforschungsbereich 343, Diskrete Strukturen in der Mathematik, Universität Bielefeld.
Einmahl, U. and Li, D. (2008). Characterization of LIL behavior in Banach space. Trans. Amer. Math. Soc. 360 (12), 6677–6693.
Föllmer, H. and Knispel, T. (2011). Entropic risk measures: coherence vs. convexity, model ambiguity and robust large deviations. Stoch. Dynamics 11 (2–3), 333–351.
Föllmer, H. and Schied, A. (2002). Convex measures of risk and trading constraints. Finance Stochast. 6 (4), 429–447.
Föllmer, H. and Schied, A. (2011). Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter.
Foss, S., Korshunov, D. andZachary, S. (2011). An Introduction to Heavy-Tailed and Subexponential Distributions (Springer Series Operat. Research Financial Eng. 6). Springer.Google Scholar
Fournier, N. andGuillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Prob. Theory Relat. Fields 162 (3–4), 707738.CrossRefGoogle Scholar
Fuqing, G. andMingzhou, X. (2012). Relative entropy and large deviations under sublinear expectations. Acta Math. Sci. 32 (5), 18261834.CrossRefGoogle Scholar
Hardy, G. H., Littlewood, J. E. andPólya, G. (1952). Inequalities. Cambridge University Press.Google Scholar
Hu, F. (2010). On Cramér’s theorem for capacities. Comptes Rendus Mathématique 348 (17), 10091013.CrossRefGoogle Scholar
Hult, H. andLindskog, F. (2006). Regular variation for measures on metric spaces. Publ. Inst. Math. (Beograd) (NS) 80 (94), 121140.CrossRefGoogle Scholar
Hult, H., Lindskog, F., Mikosch, T. andSamorodnitsky, G. (2005). Functional large deviations for multivariate regularly varying random walks. Ann. Appl. Prob. 15 (4), 26512680.CrossRefGoogle Scholar
Kall, P. (1987). On Approximations and Stability in Stochastic Programming. In Parametric Optimization and Related Topics, eds Guddat, J.et al. pp. 387407. Akademie, Berlin.Google Scholar
Kaniovski, Y. M., King, A. J. andWets, R. J. B. (1995). Probabilistic bounds (via large deviations) for the solutions of stochastic programming problems. Ann. Operat. Res. 56 (1), 189208.CrossRefGoogle Scholar
King, A. J. andWets, R. J. B. (1991). Epi-consistency of convex stochastic programs. Stoch. Stoch. Reports 34 (1–2), 8392.CrossRefGoogle Scholar
Lacker, D. (2018). Liquidity, risk measures, and concentration of measure. Math. Operat. Res. 34 (3), 693–1050. Available at arXiv:1510.07033.
Lindskog, F., Resnick, S. I. and Roy, J. (2014). Regularly varying measures on metric spaces: hidden regular variation and hidden jumps. Prob. Surveys 11, 270–314.
Mikosch, T. and Nagaev, A. V. (1998). Large deviations of heavy-tailed sums with applications in insurance. Extremes 1 (1), 81–110.
Mogul'skii, A. A. (1977). Large deviations for trajectories of multi-dimensional random walks. Theory Prob. Appl. 21 (2), 300–315.
Nagaev, S. V. (1979). Large deviations of sums of independent random variables. Ann. Prob. 7 (5), 745–789.
Owari, K. (2014). Maximum Lebesgue extension of monotone convex functions. J. Funct. Anal. 266 (6), 3572–3611.
Parthasarathy, K. R. (2005). Probability Measures on Metric Spaces (Prob. Math. Statist. 352). American Mathematical Society.
Peng, S. (2007). Law of large numbers and central limit theorem under nonlinear expectations. Available at arXiv:math/0702358.
Peng, S. (2010). Nonlinear expectations and stochastic calculus under uncertainty. Available at arXiv:1002.4546.
Petrov, V. (2012). Sums of Independent Random Variables. Springer Science & Business Media.
Rhee, C.-H., Blanchet, J. and Zwart, B. (2016). Sample path large deviations for heavy-tailed Lévy processes and random walks. Available at arXiv:1606.02795.
Schied, A. (1998). Cramer's condition and Sanov's theorem. Statist. Prob. Lett. 39 (1), 55–60.
Sion, M. (1958). On general minimax theorems. Pacific J. Math. 8 (1), 171–176.
Villani, C. (2003). Topics in Optimal Transportation (Graduate Studies Math. 58). American Mathematical Society.
Wang, R., Wang, X. and Wu, L. (2010). Sanov's theorem in the Wasserstein distance: a necessary and sufficient condition. Statist. Prob. Lett. 80 (5), 505–512.
Weed, J. and Bach, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 25 (4A), 2620–2648. Available at arXiv:1707.00087.
Zalinescu, C. (2002). Convex Analysis in General Vector Spaces. World Scientific.