1. Introduction
Following the work of Wolfgang Doeblin ([6], [7], [8]), many classical results from Markov chain theory have built on fundamental connections between total variation distance, Markov chains, and couplings. For some models, however, total variation convergence is too strong. In response, researchers have developed an alternative methodology based on monotonicity ([1], [10], [36]). In this line of research, transition probabilities are assumed to have a form of monotonicity not required in the classical theory. At the same time, mixing conditions are generally weaker, as is the notion of convergence to the stationary distribution. Further contributions to this approach can be found in [2], [16], [20], and [30]. For some recent extensions and applications in economics, see [21].
To give one example, consider a Markov chain {X_t} defined by
X_{t+1} = (X_t + W_{t+1})/2, t ≥ 0, (1.1)
where {W_t}_{t≥1} is an independent and identically distributed (i.i.d.) Bernoulli(1/2) random sequence, taking values 0 and 1 with equal probability. For the state space take S = [0, 1]. Let P^t(x, ·) be the distribution of X_t given X_0 = x ∈ S. Clearly, if X_t is a rational number in S then so is X_{t+1}. Similarly, if X_t is irrational then so is X_{t+1}. Thus, if x and y are rational and irrational, respectively, the distributions P^t(x, ·) and P^t(y, ·) are concentrated on disjoint sets, and, hence, when ‖ · ‖ is the total variation norm,
‖P^t(x, ·) − P^t(y, ·)‖ = 2 for all t ≥ 1.
Total variation convergence fails for this class of models.
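A minimal simulation sketch may help fix ideas. It assumes the law of motion displayed in (1.1); the helper name `simulate` is ours, not from the paper. The monotonicity check drives two seeds with a common shock sequence, and the final print illustrates (in floating point, which can only approximate the exact rational/irrational dichotomy) how seeds of different arithmetic type generate trajectories on disjoint sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(x0, shocks):
    """Iterate X_{t+1} = (X_t + W_{t+1}) / 2 from X_0 = x0."""
    path = [x0]
    for w in shocks:
        path.append((path[-1] + w) / 2.0)
    return np.array(path)

T = 20
shocks = rng.integers(0, 2, size=T)  # i.i.d. Bernoulli(1/2) draws in {0, 1}

# Monotonicity: with a common shock sequence, paths preserve the order
# of their initial conditions.
lo, hi = simulate(0.0, shocks), simulate(1.0, shocks)
assert np.all(lo <= hi)

# A rational seed yields rational iterates and an irrational seed yields
# irrational iterates, so the exact time-t distributions from 0 and from
# 1/sqrt(2) sit on disjoint sets; floating point only illustrates this.
print(simulate(0.0, shocks)[-1], simulate(1.0 / np.sqrt(2), shocks)[-1])
```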
At the same time, the right-hand side of (1.1) is increasing in the current state for each fixed value of the shock W_{t+1}. Moreover, trajectories mix in a monotone sense: a trajectory starting at X_0 = 0 can approach 1 under a suitable string of shocks, and a trajectory starting at 1 can approach 0. Using these facts, one can show, using the results in [1], say, that a unique stationary distribution exists and that the distribution of X_t converges to it in a metric on the Borel probability measures that is complete but weaker than total variation.
This is one example where monotone methods can be used to establish some form of stability even though the classical conditions based around total variation convergence fail. Conversely, there are many models that the monotone methods developed in [1] and related papers cannot accommodate, while the classical theory based around total variation convergence handles them easily. One example is the simple ‘inventory’ model
where x^+ := max{x, 0}. Again, {W_t} is i.i.d. Assume that ln W_t is standard normal. The state space we take to be S = [0, K]. Figure 1 shows a typical trajectory when K = 100 and X_0 = 50.
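A simulation sketch of an inventory recursion of this general type is below. The restock-at-zero, order-up-to-K trigger in `step` is a hypothetical stand-in, not the exact specification in (1.2); what matters for the discussion is the kink from x^+ and the restocking jump, which is what breaks monotonicity.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 100.0  # capacity / order-up-to level

def step(x, w):
    """One period of a stylized inventory rule: meet demand w from stock x
    (the kink x^+ = max{x, 0}), and restock to K when stock runs out.  The
    restock-at-zero trigger is an assumption for illustration; the jump it
    creates is what breaks monotonicity."""
    y = max(x - w, 0.0)
    return K if y == 0.0 else y

x, path = 50.0, [50.0]
for _ in range(250):
    x = step(x, np.exp(rng.standard_normal()))  # ln W_t standard normal
    path.append(x)

print(min(path), max(path))  # the trajectory stays in S = [0, K]
```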
On the one hand, the monotone methods in [1] cannot be applied here because monotonicity fails with respect to the standard order on ℝ. On the other hand, the classical theory based around total variation convergence is straightforward to apply. For example, one can use Doeblin’s condition (see, e.g. [25, Theorem 16.2.3]) to show the existence of a unique stationary distribution to which the distribution of X_t converges in total variation, regardless of the distribution of X_0. In the terminology of [25], the process is uniformly ergodic.
The purpose of this paper is to show that both of these stability results (i.e. the two sets of results concerning the two models (1.1) and (1.2)), hitherto derived from two separate approaches, can be obtained within a single theoretical framework. More generally, we construct stability results that encompass all uniformly ergodic models in the sense of [25] and all monotone models shown to be stable in [1], while also extending to other monotone or partially monotone models on state spaces other than ℝ^n.
We begin our analysis by introducing what is shown to be a complete metric γ on the set of Borel probability measures on a partially ordered Polish space; it includes total variation distance, the Kolmogorov uniform distance, and the Bhattacharya distance ([1], [3]) as special cases. We show that many fundamental concepts from conventional Markov chain theory based on total variation distance and coupling have direct generalizations to the partially ordered setting when this new metric is adopted. Then, by varying the choice of partial order, we recover key aspects of both classical total-variation-based stability theory and monotone methods as special cases.
Before proceeding, we note an additional line of existing research that deals with Markov models for which the classical conditions of irreducibility and total variation convergence fail. Irreducibility is replaced by the assumption that the law of motion for the state is itself contracting ‘on average’, and this contractivity is then passed on to an underlying metric over distributions that conforms in some way to the topology of the state space. See, for example, [4], [17], or [35]. Such results can be used to show stability of our first example, which contracts on average with respect to the usual metric on ℝ. On the other hand, they cannot be directly applied to our second (i.e. inventory) example using the same metric, since the law of motion contains a jump. In [1] and [16] one can find other applications where monotone methods (including the results developed here) can be used, while the ‘contraction on average’ conditions of [4] and [35] do not in general hold. See, for example, the growth model studied in Section 6.B of [16], where a high marginal product of capital near zero generates rapid growth when the capital stock is low. One consequence is that the law of motion does not typically exhibit contraction on average in a neighbourhood of zero.
After preliminaries, we begin with a discussion of ‘ordered’ affinity, which generalizes the usual notion of affinity for measures. The concept of ordered affinity is then used to define the total ordered variation metric. Throughout the paper, longer proofs are deferred to Appendix A. The conclusion contains suggestions for future work.
2. Preliminaries
Let S be a Polish (i.e. separable and completely metrizable) space, let $\cal O$ be the open sets, let $\mathfrak C$ be the closed sets, and let $\cal B$ be the Borel sets. Let ${\cal M}_s$ denote the set of all finite signed measures on $(S, \cal B)$. In other words, ${\cal M}_s$ is the set of all countably additive set functions from $\cal B$ to ℝ. Let $\cal M$ and $\cal P$ be the finite measures and probability measures in ${\cal M}_s$, respectively. If κ and λ are in ${\cal M}_s$, then κ ≤ λ means that κ(B) ≤ λ(B) for all $B \in \cal B$. The symbol δ_x denotes the probability measure concentrated on x ∈ S.
Let bS be the set of all bounded $\cal B$-measurable functions from S into ℝ. If h ∈ bS and $\lambda \in \cal M_s$, then λ(h) := ∫ h dλ. For f and g in bS, the statement f ≤ g means that f(x) ≤ g(x) for all x ∈ S. Let
H := {h ∈ bS : −1 ≤ h ≤ 1} and H_0 := {h ∈ bS : 0 ≤ h ≤ 1}.
The total variation norm of $\lambda \in \cal M_s$ is ‖λ‖ := sup_{h∈H} |λ(h)|. Given μ and ν in $\cal P$, a random element (X, Y) taking values in S × S and defined on a common probability space (Ω, $\cal F$, ℙ) is called a coupling of (μ, ν) if μ = ℙ ∘ X^{−1} and ν = ℙ ∘ Y^{−1} (i.e. if the distribution of (X, Y) has marginals μ and ν, respectively; see, e.g. [22] or [34]). The set of all couplings of (μ, ν) is denoted below by $\cal C(\mu, \nu)$. A sequence $\{\mu_n\} \subset \cal P$ converges to $\mu \in \cal P$ weakly if μ_n(h) → μ(h) as n → ∞ for all continuous h ∈ bS. In this case we write $\mu_n {\buildrel {\rm w} \over \longrightarrow} \mu$.
Given μ and $\nu \in \cal M$, their measure-theoretic infimum μ ∧ ν is the largest element of $\cal M$ dominated by both μ and ν. It can be constructed by taking f and g to be densities of μ and ν, respectively, with respect to the dominating measure λ := μ + ν and setting (μ ∧ ν)(B) := ∫_B min{f(x), g(x)} λ(dx) for all $B \in \cal B$. The total variation distance between μ and ν is related to μ ∧ ν via ‖μ − ν‖ = ‖μ‖ + ‖ν‖ − 2‖μ ∧ ν‖. See, for example, [29]. For probability measures, we also have
$\|\mu - \nu\| = 2 \sup_{B \in \cal B} \{\mu(B) - \nu(B)\}.$ (2.1)
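For measures on a finite state space the densities f and g are just the probability vectors, so μ ∧ ν and the identities above can be checked directly. A minimal sketch (the vectors are arbitrary illustrations):

```python
import numpy as np

# Two probability vectors on a common finite state space.
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

inf_measure = np.minimum(mu, nu)   # mu ∧ nu: pointwise minimum of densities
tv = np.abs(mu - nu).sum()         # total variation norm ||mu - nu||

# The identity ||mu - nu|| = ||mu|| + ||nu|| - 2 ||mu ∧ nu||.
assert np.isclose(tv, mu.sum() + nu.sum() - 2 * inf_measure.sum())
# The identity (2.1): on a finite space the supremum over sets is
# attained at B = {z : mu(z) > nu(z)}.
assert np.isclose(tv, 2 * np.maximum(mu - nu, 0).sum())
print(tv, inf_measure.sum())       # 0.6 and 0.7
```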
The affinity between two measures μ, ν in $\cal M$ is the value α(μ, ν) := (μ ∧ ν)(S). The following properties are elementary.
Lemma 2.1. For all $ (\mu, \nu) \in \cal M \times {\cal M}$, we have
(a) 0 ≤ α(μ, ν) ≤ min{μ(S), ν(S)},
(b) α(μ, ν) = μ(S) = ν(S) if and only if μ = ν,
(c) α(cμ, cν) = cα(μ, ν) for all c ≥ 0.
There are several other common representations of affinity. For example, when μ and ν are both probability measures, we have
$\alpha(\mu, \nu) = 1 - \sup_{B \in \cal B} \{\mu(B) - \nu(B)\} = \max_{(X,Y) \in \cal C(\mu, \nu)} \mathbb P\{X = Y\}.$ (2.2)
(See, e.g. [22] and [29].) The second equality in (2.2) states that, if $(X,Y) \in \cal C(\mu, \nu)$, then ℙ{X = Y} ≤ α(μ, ν), and, moreover, there exists an $(X, Y) \in \cal C(\mu, \nu)$ such that equality is attained. Any such coupling is called a maximal or gamma coupling. See Theorem 5.2 of [22]. From (2.1) and (2.2) we obtain
‖μ − ν‖ = 2(1 − α(μ, ν)). (2.3)
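For discrete distributions, a gamma coupling can be sampled by the standard two-branch construction: with probability α draw X = Y from the normalized common part μ ∧ ν, and otherwise draw X and Y independently from the residuals, which have disjoint supports. A sketch under the assumption of finite supports (the function name `gamma_coupling` is ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def gamma_coupling(mu, nu, n_draws):
    """Sample from a maximal coupling of probability vectors mu, nu:
    P{X = Y} equals the affinity alpha = sum_z min(mu(z), nu(z))."""
    common = np.minimum(mu, nu)
    alpha = common.sum()
    res_mu, res_nu = mu - common, nu - common   # disjointly supported residuals
    xs, ys = [], []
    for _ in range(n_draws):
        if rng.random() < alpha:                # couple: X = Y ~ common/alpha
            z = rng.choice(len(mu), p=common / alpha)
            xs.append(z); ys.append(z)
        else:                                   # residual branch: X != Y a.s.
            xs.append(rng.choice(len(mu), p=res_mu / (1 - alpha)))
            ys.append(rng.choice(len(nu), p=res_nu / (1 - alpha)))
    return np.array(xs), np.array(ys)

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
X, Y = gamma_coupling(mu, nu, 100_000)
print(np.mean(X == Y), np.minimum(mu, nu).sum())  # both close to 0.7
```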
3. Ordered affinity
We next introduce a generalization of affinity when S has a partial order. We investigate its properties in detail, since both our metric and the stability theory presented below rely on this concept.
3.1. Preliminaries
As before, let S be a Polish space. A closed partial order ‘⪯’ on S is a partial order such that its graph
$\mathbb G := \{(x, y) \in S \times S \colon x \preceq y\}$
is closed in the product topology. Henceforth, a partially ordered Polish space is any such pair (S, ⪯), where S is nonempty and Polish, and ‘⪯’ is a closed partial order on S. When no confusion arises, we denote it simply by S. The partial order is assumed to be closed in the theory developed below because we build a metric over Borel probability measures that depends on this partial order and closedness of the order is required to show that the metric is complete.
For such a space S, we call I ⊂ S increasing if x ∈ I and x ⪯ y implies that y ∈ I. We call h: S → ℝ increasing if x ⪯ y implies that h(x) ≤ h(y). We let $i\cal B,\,i\cal O$, and $i\mathfrak C$ denote the increasing Borel, open, and closed sets, respectively, while ibS is the increasing functions in bS. In addition,
• iH := H ∩ ibS = {h ∈ ibS : −1 ≤ h ≤ 1}, and
• iH_0 := H_0 ∩ ibS = {h ∈ ibS : 0 ≤ h ≤ 1}.
If $B \in \cal B$ then i(B) is the set of all y ∈ S such that x ⪯ y for some x ∈ B, while d(B) is the set of all y ∈ S such that y ⪯ x for some x ∈ B. Given μ and ν in $\cal M$, we say that μ is stochastically dominated by ν and write μ ⪯sd ν if μ(S) = ν(S) and μ(I) ≤ ν(I) for all $I \in i\cal B$. Equivalently, μ(S) = ν(S) and μ(h) ≤ ν(h) for all h in iH or iH_0. See [18].
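On S = ℝ with the usual order, the increasing Borel sets reduce to upper rays, so stochastic dominance between distributions on a finite grid x_0 < … < x_n amounts to a pointwise comparison of CDFs. A small check along these lines (this ray characterization on ℝ is standard; the helper name is ours):

```python
import numpy as np

def stochastically_dominated(mu, nu):
    """mu ⪯sd nu for probability vectors on ordered points x_0 < ... < x_n:
    every upper ray I = {x_k, ..., x_n} must satisfy mu(I) <= nu(I),
    i.e. the CDF of mu dominates the CDF of nu pointwise."""
    return bool(np.all(np.cumsum(mu) >= np.cumsum(nu)))

mu = np.array([0.4, 0.4, 0.2])   # mass concentrated low
nu = np.array([0.1, 0.4, 0.5])   # mass concentrated high
print(stochastically_dominated(mu, nu))  # True
print(stochastically_dominated(nu, mu))  # False
```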
One important partial order on S is the identity order, where x ⪯ y if and only if x = y. Then $i\cal B = \cal B$, ibS = bS, iH = H, iH_0 = H_0, and μ ⪯sd ν if and only if μ = ν.
Remark 3.1. Since S is a partially ordered Polish space, for any μ, ν in $\cal P$, we have μ = ν whenever μ(C) = ν(C) for all $C \in i\mathfrak C$, or, equivalently, μ(h) = ν(h) for all continuous h ∈ ibS. See [18, Lemma 1]. Hence, μ ⪯sd ν and ν ⪯sd μ imply that μ = ν.
Lemma 3.1. If $\lambda \in \cal M_s$ then $\sup_{I \in i\cal B} \lambda(I) = \sup_{h \in iH_0} \lambda(h)$ and
$\sup_{h \in iH} |\lambda(h)| = \max\Big\{\sup_{h \in iH} \lambda(h),\ \sup_{h \in iH} (-\lambda)(h)\Big\}.$ (3.1)
The proof is given in Appendix A.1. We can easily check that
3.2. Definition of ordered affinity
For each pair $(\mu, \nu) \in {\cal M} \times \cal M$, let
$\Phi(\mu, \nu) := \{(\mu', \nu') \in \cal M \times \cal M \colon \mu' \leq \mu,\ \nu' \leq \nu,\ \mu' \preceq_{sd} \nu'\}.$
We call Φ(μ, ν) the set of ordered component pairs for (μ, ν). Here ‘ordered’ means ordered by stochastic dominance. The set of ordered component pairs is always nonempty, since (μ ∧ ν, μ ∧ ν) is an element of Φ(μ, ν).
Example 3.1. If μ and ν are two measures satisfying μ ⪯sd ν, then (μ, ν) ∈ Φ (μ, ν).
Example 3.2. Given the two-point distributions μ = (δ_1 + δ_2)/2 and ν = (δ_0 + δ_1)/2, we have (μ′, ν′) ∈ Φ(μ, ν) when μ′ = ν′ = δ_1/2.
We call an ordered component pair (μ′, ν′) ∈ Φ(μ, ν) a maximal ordered component pair if no other pair has greater mass; that is, if
μ′(S) ≥ μ″(S) for all (μ″, ν″) ∈ Φ(μ, ν).
(We can restate this by replacing μ′(S) and μ″(S) with ν′(S) and ν″(S), respectively, since the two components of an ordered pair have equal mass by the definition of stochastic dominance.) We let Φ∗(μ, ν) denote the set of maximal ordered component pairs for (μ, ν). Thus, if
$\alpha_O(\mu, \nu) := \sup\{\mu'(S) \colon (\mu', \nu') \in \Phi(\mu, \nu)\},$ (3.3)
then
$\Phi^*(\mu, \nu) = \{(\mu', \nu') \in \Phi(\mu, \nu) \colon \mu'(S) = \alpha_O(\mu, \nu)\}.$
Using the Polish space assumption, one can show that maximal ordered component pairs always exist.
Proposition 3.1. The set Φ∗(μ, ν) is nonempty for all ${(\mu, \nu) \in \cal M \times \cal M}$.
Proof. Fix $(\mu, \nu) \in \cal M \times \cal M$, and let s := α_O(μ, ν). From the definition, we can take sequences $\{\mu'_n\}$ and $\{\nu'_n\}$ in $\cal M$ such that $(\mu'_n, \nu'_n) \in \Phi(\mu, \nu)$ for all n ∈ ℕ and $\mu'_n(S) \uparrow s$. Since $\mu'_n \leq \mu$ and $\nu'_n \leq \nu$ for all n ∈ ℕ, Prokhorov’s theorem [11, Theorem 11.5.4] implies that these sequences have convergent subsequences with $\mu'_{n_k} \buildrel {\rm w} \over \longrightarrow \mu'$ and $\nu'_{n_k} \buildrel {\rm w} \over \longrightarrow \nu'$ for some $\mu',\, \nu' \in \cal M$. We claim that (μ′, ν′) is a maximal ordered component pair.
Since $\mu'_{n_k} \buildrel {\rm w} \over \longrightarrow \mu'$ and $\mu'_n \leq \mu$ for all n ∈ ℕ, Theorem 1.5.5 of [15] implies that $\mu'_{n_k}(B) \to \mu'(B)$ for any Borel set B. Hence, μ′(B) ≤ μ(B) for all $B \in \cal B$; that is, μ′ ≤ μ. An analogous argument gives ν′ ≤ ν. Moreover, the definition of Φ(μ, ν) and stochastic dominance imply that $\mu'_n(S) = \nu'_n(S)$ for all n ∈ ℕ, and, therefore, μ′(S) = ν′(S). Also, for any $I \in i\cal B$, the fact that $\mu'_n(I) \leq \nu'_n(I)$ for all n ∈ ℕ gives μ′(I) ≤ ν′(I). Thus, μ′ ⪯sd ν′. Finally, μ′(S) = s, since $\mu'_n(S) \uparrow s$. Hence, (μ′, ν′) lies in Φ∗(μ, ν).
The value α_O(μ, ν) defined in (3.3) gives the mass of a maximal ordered component pair. We call it the ordered affinity from μ to ν. On an intuitive level, we can think of α_O(μ, ν) as the ‘degree’ to which μ is dominated by ν in the sense of stochastic dominance. Since (μ ∧ ν, μ ∧ ν) is an ordered component pair for (μ, ν), we have
α(μ, ν) ≤ α_O(μ, ν), (3.4)
where α(μ, ν) is the standard affinity defined in Section 2. In fact, α_O(μ, ν) generalizes the standard notion of affinity by extending it to arbitrary partial orders, as shown in the next lemma.
Lemma 3.2. If ‘⪯’ is the identity order then $\alpha_O = \alpha$ on $\cal M \times \cal M$.
Proof. Fix $(\mu, \nu) \in \cal M \times \cal M$, and let ‘⪯’ be the identity order (x ⪯ y if and only if x = y). Then ‘⪯sd’ also reduces to equality, from which it follows that the supremum in (3.3) is attained by (μ ∧ ν, μ ∧ ν). Hence, α_O(μ, ν) = α(μ, ν).
3.3. Properties of ordered affinity
Let us list some elementary properties of αO. The following list should be compared with Lemma 2.1. It shows that analogous results hold for αO as hold for α. (Lemma 2.1 is in fact a special case of Lemma 3.3 with the partial order set to the identity order.)
Lemma 3.3. For all ${ (\mu, \nu) \in \cal M \times \cal M}$, we have
(a) 0 ≤ α_O(μ, ν) ≤ min{μ(S), ν(S)},
(b) α_O(μ, ν) = μ(S) = ν(S) if and only if μ ⪯sd ν, and
(c) cα_O(μ, ν) = α_O(cμ, cν) whenever c ≥ 0.
Proof. Fix $(\mu, \nu) \in \cal M \times \cal M$. Claim (a) follows directly from the definitions. Regarding claim (b), suppose first that μ ⪯sd ν. Then (μ, ν) ∈ Φ(μ, ν) and, hence, α_O(μ, ν) = μ(S). Conversely, if α_O(μ, ν) = μ(S) then, since the only component μ′ ≤ μ with μ′(S) = μ(S) is μ itself, we must have (μ, ν′) ∈ Φ(μ, ν) for some ν′ ≤ ν with μ ⪯sd ν′. But then μ(I) ≤ ν′(I) ≤ ν(I) for any $I \in i\cal B$. Hence, μ ⪯sd ν.
Claim (c) is trivial if c = 0, so suppose instead that c > 0. Fix (μ′, ν′) ∈ Φ(μ, ν) such that α_O(μ, ν) = μ′(S). It is clear that (cμ′, cν′) ∈ Φ(cμ, cν), implying that
α_O(cμ, cν) ≥ cμ′(S) = cα_O(μ, ν). (3.5)
For the reverse inequality, we can apply (3.5) again to get
$c\alpha_O(\mu, \nu) = c\alpha_O(c^{-1}(c\mu), c^{-1}(c\nu)) \geq c \cdot c^{-1} \alpha_O(c\mu, c\nu) = \alpha_O(c\mu, c\nu).$
3.4. Equivalent representations
In (2.2) we noted that the affinity between two measures has several alternative representations. In our setting these results generalize as follows.
Theorem 3.1. For all $(\mu, \nu) \in \cal P \times \cal P$, we have
$\alpha_O(\mu, \nu) = 1 - \sup_{I \in i\cal B} \{\mu(I) - \nu(I)\} = \max_{(X,Y) \in \cal C(\mu, \nu)} \mathbb P\{X \preceq Y\}.$ (3.6)
Evidently (2.2) is a special case of (3.6), because (3.6) reduces to (2.2) when ‘⪯’ is set to equality. For example, when ‘⪯’ is equality,
$\alpha_O(\mu, \nu) = 1 - \sup_{B \in \cal B} \{\mu(B) - \nu(B)\} = 1 - \tfrac12 \|\mu - \nu\| = \alpha(\mu, \nu),$
where the last two steps are from (2.1) and (2.3). Note also that, as shown in the proof of Theorem 3.1, the supremum can also be written in terms of the open increasing sets $i\cal O$ or the closed decreasing sets $d\mathfrak C$. In particular,
$\sup_{I \in i\cal B} \{\mu(I) - \nu(I)\} = \sup_{G \in i\cal O} \{\mu(G) - \nu(G)\} = \sup_{D \in d\mathfrak C} \{\nu(D) - \mu(D)\}.$
One of the assertions of Theorem 3.1 is the existence of a coupling $(X, Y) \in \cal C(\mu, \nu)$ attaining ℙ{X ⪯ Y} = α_O(μ, ν). Let us refer to any such coupling as an order maximal coupling for (μ, ν).
Example 3.3. For (x, y) ∈ S × S, we have
$\alpha_O(\delta_x, \delta_y) = {\bf 1}_{\mathbb G}(x, y),$
as can easily be verified from the definition or either of the alternative representations in (3.6). The map $(x, y) \mapsto {\bf 1}_{\mathbb G}(x, y)$ is measurable because $\mathbb G$ is closed. As a result, for any $(X, Y) \in \cal C(\mu, \nu)$, we have
$\mathbb P\{X \preceq Y\} = \mathbb E\, {\bf 1}_{\mathbb G}(X, Y) \leq \alpha_O(\mu, \nu),$
with equality when (X, Y) is an order maximal coupling.
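On a finite state space, the coupling representation in (3.6) makes α_O directly computable by linear programming: maximize the coupling mass placed on the graph of ‘⪯’ subject to the marginal constraints. A sketch using scipy; the function name and the matrix `leq` encoding the order are our own conventions, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def ordered_affinity(mu, nu, leq):
    """alpha_O(mu, nu) = max P{X ⪯ Y} over couplings of (mu, nu), as in
    (3.6).  mu, nu are probability vectors on states 0..n-1, and
    leq[i, j] = 1 indicates i ⪯ j."""
    n = len(mu)
    c = -leq.astype(float).ravel()        # maximize mass on the order graph
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # row sums: X has distribution mu
        A_eq[n + i, i::n] = 1.0           # column sums: Y has distribution nu
    b_eq = np.concatenate([mu, nu])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return -res.fun

# Example 3.2: mu = (d_1 + d_2)/2 and nu = (d_0 + d_1)/2 on {0, 1, 2}.
order = np.triu(np.ones((3, 3)))          # the usual order: i <= j
mu = np.array([0.0, 0.5, 0.5])
nu = np.array([0.5, 0.5, 0.0])
print(ordered_affinity(mu, nu, order))    # 0.5, attained with X = Y = 1
```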
3.5. Comments on Theorem 3.1
The existence of an order maximal coupling shown in Theorem 3.1 implies two well-known results that are usually treated separately. One is the Nachbin–Strassen theorem (see, e.g. Theorem 1 of [19] or Chapter IV of [22]), which asserts the existence of a coupling $(X, Y) \in \cal C(\mu, \nu)$ attaining ℙ{X ⪯ Y} = 1 whenever μ ⪯sd ν. The existence of an order maximal coupling for each (μ, ν) in $\cal P \times \cal P$ implies this statement, since, under the hypothesis that μ ⪯sd ν, we also have α_O(μ, ν) = 1. Hence, any order maximal coupling satisfies ℙ{X ⪯ Y} = 1.
The other familiar result implied by the existence of an order maximal coupling is the existence of a maximal coupling in the standard sense (see the discussion of maximal couplings after (2.2) and the result on page 19 of [22]). Indeed, if we take ‘⪯’ to be the identity order then (3.6) reduces to (2.2), as already discussed.
4. Total ordered variation
Let S be a partially ordered Polish space. Consider the function on $\cal P \times \cal P$ given by
$\gamma(\mu, \nu) := (1 - \alpha_O(\mu, \nu)) + (1 - \alpha_O(\nu, \mu)).$
We call γ(μ, ν) the total ordered variation distance between μ and ν. The natural comparison is with (2.3), which yields the same value when α_O is replaced by α. In particular, when ‘⪯’ is equality, ordered affinity reduces to affinity, and total ordered variation distance reduces to total variation distance. Since ordered affinities dominate affinities (see (3.4)), we have γ(μ, ν) ≤ ‖μ − ν‖ for all $(\mu, \nu) \in \cal P \times \cal P$.
Other, equivalent, representations of γ are available. For example, in view of (3.6), for any $(\mu, \nu) \in \cal P \times \cal P$, we have
$\gamma(\mu, \nu) = \sup_{I \in i\cal B} \{\mu(I) - \nu(I)\} + \sup_{I \in i\cal B} \{\nu(I) - \mu(I)\}.$ (4.1)
By combining Lemma 3.1 and (3.2), we also have
$\gamma(\mu, \nu) = \sup_{h \in iH_0} \{\mu(h) - \nu(h)\} + \sup_{h \in iH_0} \{\nu(h) - \mu(h)\}.$ (4.2)
It is straightforward to show that
$\sup_{I \in i\cal B} |\mu(I) - \nu(I)| \leq \gamma(\mu, \nu)$ for all $(\mu, \nu) \in \cal P \times \cal P$. (4.3)
Lemma 4.1. The function γ is a metric on $\cal P$.
Proof. The claim that γ is a metric follows in a straightforward way from the definition or the alternative representation (4.1). For example, the triangle inequality is easy to verify using (4.1). Also, γ (μ, ν) = 0 implies that μ = ν by (4.1) and Remark 3.1.
4.1. Connection to other modes of convergence
As well as total variation, the metric γ is closely related to the so-called Bhattacharya metric, which is given by
$\beta(\mu, \nu) := \sup_{h \in iH} |\mu(h) - \nu(h)|.$ (4.4)
See [1] and [3]. (In [3] the metric is defined by taking the supremum over iH_0 rather than iH, but the two definitions differ only by a positive scalar.) The Bhattacharya metric can be thought of as an alternative way to generalize total variation distance, in the sense that, like γ, the metric β reduces to total variation distance when ‘⪯’ is the identity order (since iH equals H under this order). From (3.1) we have
$\beta(\mu, \nu) = 2 \max\Big\{\sup_{I \in i\cal B}\{\mu(I) - \nu(I)\},\ \sup_{I \in i\cal B}\{\nu(I) - \mu(I)\}\Big\},$
and from this and (4.2) we have
γ(μ, ν) ≤ β(μ, ν) ≤ 2γ(μ, ν). (4.5)
Hence, β and γ are equivalent metrics.
The metric γ is also connected to the Wasserstein metric ([12], [13]). If ρ metrizes the topology on S then the Wasserstein distance between probability measures μ and ν is
$W(\mu, \nu) := \inf_{(X,Y) \in \cal C(\mu, \nu)} \mathbb E\, \rho(X, Y).$
The total ordered variation metric can be compared as follows. Consider the ‘directed semimetric’ $\hat \rho(x, y) := {\bf 1}\{x \npreceq y\}$. In view of (3.6) we have
$1 - \alpha_O(\mu, \nu) = \min_{(X,Y) \in \cal C(\mu, \nu)} \mathbb E\, \hat\rho(X, Y).$
Thus, γ(μ, ν) is obtained by summing two such ‘directed Wasserstein deviations’, one taken from each direction; the sum of the two directed deviations is a metric.
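On S = ℝ with the usual order, the increasing Borel sets are upper rays, so each directed term in γ is a one-sided Kolmogorov deviation between CDFs. A sketch for discrete distributions on a common grid, assuming the representation (4.1) (the helper name is ours):

```python
import numpy as np

def gamma_metric(mu, nu):
    """Total ordered variation on a grid x_0 < ... < x_n via (4.1): the
    increasing sets are upper rays, so each directed term is a one-sided
    Kolmogorov deviation between the CDFs."""
    diff = np.cumsum(mu) - np.cumsum(nu)   # F_mu - F_nu on the grid
    # Over upper rays, mu(I) - nu(I) equals F_nu - F_mu at the cut point.
    return max(np.max(-diff), 0.0) + max(np.max(diff), 0.0)

mu = np.array([0.4, 0.4, 0.2])   # mass concentrated low
nu = np.array([0.1, 0.4, 0.5])   # mass concentrated high
# Here mu ⪯sd nu, so one directed term vanishes and gamma = 0.3.
print(gamma_metric(mu, nu))
```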
Proposition 4.1. If $\{\mu_n\}_{n \geq 0} \subset \cal P$ is tight and γ(μ_n, μ_0) → 0, then $\mu_n \buildrel {\rm w} \over \longrightarrow \mu_0$.
Proof. Let {μ_n} and μ := μ_0 satisfy the conditions of the proposition. Take any subsequence of {μ_n} and observe that, by Prokhorov’s theorem, this subsequence has a subsubsequence converging weakly to some $\nu \in \cal P$. Along this subsubsequence, for any continuous h ∈ ibS, we have both μ_n(h) → μ(h) and μ_n(h) → ν(h). This is sufficient for ν = μ by Remark 3.1. Thus, every subsequence of {μ_n} has a subsubsequence converging weakly to μ, and hence so does the entire sequence.
4.2. Completeness
To obtain completeness of ($\cal P,\,\gamma$), we adopt the following additional assumption, which is satisfied if, say, compact sets are order bounded (i.e. lie in order intervals) and order intervals are compact. (For example, ℝn with the usual pointwise partial order has this property.)
Assumption 4.1. If K ⊂ S is compact then i(K) ∩ d(K) is also compact.
Theorem 4.1. If Assumption 4.1 holds then ($\cal P$, γ) is complete.
Remark 4.1. In [3] it was shown that β is a complete metric when S = ℝ^n. Given the equivalence of the metrics, Theorem 4.1 extends this result to partially ordered Polish spaces satisfying Assumption 4.1.
5. Applications
In this section we show that many results in classical and monotone Markov chain theory, hitherto treated separately, can be derived from the same set of results based around total ordered variation and ordered affinity.
Regarding notation, if {S_i} are partially ordered Polish spaces over i = 0, 1, 2, …, we often use a common symbol ‘⪯’ for the partial order on any of these spaces. On products of these spaces we use the product topology and the pointwise partial order, again denoted by ‘⪯’. For example, if (x_0, x_1) and (y_0, y_1) are points in S_0 × S_1, then (x_0, x_1) ⪯ (y_0, y_1) means that x_0 ⪯ y_0 and x_1 ⪯ y_1.
A function $P \colon S_0 \times \cal B_1 \to [0, 1]$ is called a Markov kernel from S_0 to S_1 if x ↦ P(x, B) is $\cal B_0$-measurable for each $B \in \cal B_1$ and B ↦ P(x, B) is in $\cal P_1$ for all x ∈ S_0. If S_0 = S_1 = S, we will call P a Markov kernel on S, or just a Markov kernel. Following standard conventions (see, e.g. [25]), for any Markov kernel P from S_0 to S_1, any h ∈ bS_1, and $\mu \in \cal P_0$, we define $\mu P \in \cal P_1$ and Ph ∈ bS_0 via
$(\mu P)(B) := \int P(x, B)\, \mu(dx)$ and $(Ph)(x) := \int h(y)\, P(x, dy).$
Also, μ ⊗ P denotes the joint distribution on S_0 × S_1 defined by
$(\mu \otimes P)(A \times B) := \int_A P(x, B)\, \mu(dx)$ for $A \in \cal B_0$ and $B \in \cal B_1$.
To simplify notation, we use P_x to represent the measure δ_x P = P(x, ·); P^m is the mth composition of P with itself.
5.1. Ordered affinity and monotone Markov kernels
Let S_0 and S_1 be Polish spaces, each partially ordered by ‘⪯’. A Markov kernel P from S_0 to S_1 is called monotone if Ph ∈ ibS_0 whenever h ∈ ibS_1. An equivalent condition is that μP ⪯sd νP whenever μ ⪯sd ν, or just that P(x, ·) ⪯sd P(y, ·) whenever x ⪯ y. It is well known (see, e.g. Proposition 1 of [19]) that if μ ⪯sd ν and P is monotone, then μ ⊗ P ⪯sd ν ⊗ P. Note that, when ‘⪯’ is the identity order, every Markov kernel is monotone.
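In the finite, totally ordered case, monotonicity of a kernel is a simple check: the condition P(x, ·) ⪯sd P(y, ·) for x ≤ y means the row CDFs fall as the current state rises. A sketch (the matrices are arbitrary illustrations; the helper name is ours):

```python
import numpy as np

def is_monotone(P):
    """Check monotonicity of a stochastic matrix on states 0 < 1 < ... < n-1:
    P(x, .) ⪯sd P(y, .) whenever x <= y, i.e. row CDFs fall as x rises."""
    cdfs = np.cumsum(P, axis=1)
    return bool(np.all(cdfs[:-1] >= cdfs[1:]))  # adjacent rows suffice

P_good = np.array([[0.7, 0.3, 0.0],
                   [0.2, 0.5, 0.3],
                   [0.0, 0.3, 0.7]])
P_bad = np.array([[0.0, 0.0, 1.0],   # state 0 jumps to the top:
                  [0.2, 0.5, 0.3],   # a restocking-style jump, not monotone
                  [0.0, 0.3, 0.7]])
print(is_monotone(P_good), is_monotone(P_bad))  # True False
```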
Lemma 5.1. If P is a monotone Markov kernel from S_0 to S_1 and μ, μ′, ν, and ν′ are probabilities in $\cal P_0$, then
$\alpha_O(\mu P, \nu P) \geq \alpha_O(\mu' P, \nu' P)$ whenever μ ⪯sd μ′ and ν′ ⪯sd ν.
Proof. Let P, μ, μ′, ν, and ν′ have the stated properties. In view of the equivalent representation in (3.6), the claim will be established if
$\sup_{I \in i\cal B} \{(\mu P)(I) - (\nu P)(I)\} \leq \sup_{I \in i\cal B} \{(\mu' P)(I) - (\nu' P)(I)\}.$
This holds by the monotonicity of P and the order of μ, μ′, ν, and ν′.
Lemma 5.2. If P is a monotone Markov kernel from S_0 to S_1, then
$\alpha_O(\mu P, \nu P) \geq \alpha_O(\mu, \nu)$ for all μ, ν in $\cal P_0$.
Proof. Fix μ, ν in $\cal P_0$, and let $(\hat \mu, \hat \nu)$ be a maximal ordered component pair for (μ, ν). From monotonicity of P and the fact that Markov kernels preserve the mass of measures, it is clear that $(\hat \mu P, \hat \nu P)$ is an ordered component pair for (μP, νP). Hence,
$\alpha_O(\mu P, \nu P) \geq (\hat\mu P)(S_1) = \hat\mu(S_0) = \alpha_O(\mu, \nu).$
On the other hand, for the joint distribution, the ordered affinity of the initial pair is preserved.
Lemma 5.3. If P is a monotone Markov kernel from S_0 to S_1, then
$\alpha_O(\mu \otimes P, \nu \otimes P) = \alpha_O(\mu, \nu)$ for all μ, ν in $\cal P_0$.
Proof. Fix μ, ν in $\cal P_0$, and let (X_0, X_1) and (Y_0, Y_1) be random pairs with distributions μ ⊗ P and ν ⊗ P, respectively. We have
$\mathbb P\{(X_0, X_1) \preceq (Y_0, Y_1)\} \leq \mathbb P\{X_0 \preceq Y_0\} \leq \alpha_O(\mu, \nu).$
Taking the supremum over all couplings in $\cal C(\mu \otimes P, \nu \otimes P)$ shows that α_O(μ ⊗ P, ν ⊗ P) is dominated by α_O(μ, ν).
To see the reverse inequality, let $(\hat \mu, \hat \nu)$ be a maximal ordered component pair for (μ, ν). Monotonicity of P now gives $\hat \mu \otimes P \preceq_{sd} \hat \nu \otimes P$. Using this and the fact that Markov kernels preserve the mass of measures, we see that $(\hat \mu \otimes P, \hat \nu \otimes P)$ is an ordered component pair for (μ ⊗ P, ν ⊗ P). Hence,
$\alpha_O(\mu \otimes P, \nu \otimes P) \geq (\hat\mu \otimes P)(S_0 \times S_1) = \hat\mu(S_0) = \alpha_O(\mu, \nu).$
5.2. Monotone Markov chains
Given $\mu \in \cal P$ and a Markov kernel P on S, a stochastic process {X_t}_{t≥0} taking values in $S^{\infty} := \times_{t=0}^\infty S$ will be called a Markov chain with initial distribution μ and kernel P if the distribution of {X_t} on S^∞ is
$Q_\mu := \mu \otimes P \otimes P \otimes \cdots.$
(The meaning of the right-hand side is clarified in [22, Section III.8], [19, p. 903], and [25, Section 3.4], for example.) If P is a monotone Markov kernel then $(x, B) \mapsto Q_x(B) := Q_{\delta_x}(B)$ is a monotone Markov kernel from S to S^∞. See Propositions 1 and 2 of [19].
There are various useful results about representations of Markov chains that are ordered almost surely. One is that, if the initial conditions satisfy μ ⪯sd ν and P is a monotone Markov kernel, then we can find Markov chains {X_t} and {Y_t} with initial distributions μ and ν and kernel P such that X_t ⪯ Y_t for all t almost surely. (See, e.g. Theorem 2 of [19].) This result can be generalized beyond the case where μ and ν are stochastically ordered, using the results presented above. For example, let μ and ν be arbitrary initial distributions and let P be monotone, so that Q_x is likewise monotone. By Lemma 5.3 we have
$\alpha_O(Q_\mu, Q_\nu) = \alpha_O(\mu, \nu).$
In other words, the ordered affinity of the entire processes is given by the ordered affinity of the initial distributions. It now follows from Theorem 3.1 that there exist Markov chains {X_t} and {Y_t} with initial distributions μ and ν and Markov kernel P such that
$\mathbb P\{X_t \preceq Y_t \text{ for all } t \geq 0\} = \alpha_O(\mu, \nu).$
The standard result is a special case, since μ ⪯sd ν implies that α_O(μ, ν) = 1, and, hence, the sequences are ordered almost surely.
5.3. Nonexpansiveness
It is well known that every Markov kernel is nonexpansive with respect to the total variation norm, so that
$\|\mu P - \nu P\| \leq \|\mu - \nu\|$ for all $(\mu, \nu) \in \cal P \times \cal P$. (5.1)
An analogous result is true for γ when P is monotone. That is,
$\gamma(\mu P, \nu P) \leq \gamma(\mu, \nu)$ for all $(\mu, \nu) \in \cal P \times \cal P$. (5.2)
The bound (5.2) follows directly from Lemma 5.2. Evidently (5.1) can be recovered from (5.2) by setting ‘⪯’ to equality.
Nonexpansiveness is interesting partly in its own right (we apply it in proofs below) and partly because it suggests that, with some additional assumptions, we can strengthen it to contractiveness. We expand on this idea below.
5.4. An order coupling bound for Markov chains
Doeblin [7] established and exploited the coupling inequality
$\|\mu_t - \nu_t\| \leq 2\, \mathbb P\{X_j \neq Y_j \text{ for any } j \leq t\},$ (5.3)
where μ_t and ν_t are the time t distributions of Markov chains {X_t} and {Y_t} generated by a common Markov kernel P. Even when the state space is uncountable, the right-hand side of (5.3) can often be shown to converge to zero by manipulating the joint distribution of (X_j, Y_j) to increase the chance of a meeting ([9], [14], [22], [25], [26], [27], [28], [31], [34]).
Consider the following generalization: given a monotone Markov kernel P on S and arbitrary $\mu, \nu \in \cal P$, we can construct Markov chains {X_t} and {Y_t} with common kernel P and respective initial conditions μ and ν such that
$\gamma(\mu P^t, \nu P^t) \leq \mathbb P\{X_j \npreceq Y_j \text{ for any } j \leq t\} + \mathbb P\{Y_j \npreceq X_j \text{ for any } j \leq t\}.$ (5.4)
This generalizes (5.3) because both the right- and left-hand sides of (5.4) reduce to the corresponding sides of (5.3) if we take ‘⪯’ to be equality.
To prove (5.4), we need only show that
$1 - \alpha_O(\mu P^t, \nu P^t) \leq \mathbb P\{X_j \npreceq Y_j \text{ for any } j \leq t\},$ (5.5)
since, once (5.5) is established, we can reverse the roles of {X_t} and {Y_t} in (5.5) to obtain $1 - \alpha_O( \nu P^t, \mu P^t) \leq \mathbb P\{Y_j \npreceq X_j \;\text {for any}\; j \leq t\}$ and then add this inequality to (5.5) to produce (5.4).
If {X_t} and {Y_t} are Markov chains with kernel P and initial conditions μ and ν, then (3.6) yields α_O(μP^t, νP^t) ≥ ℙ{X_t ⪯ Y_t}. Therefore, we need only construct such chains with the additional property that
$\mathbb P\{X_j \npreceq Y_j \text{ for any } j \leq t\} = 1 - \mathbb P\{X_t \preceq Y_t\}.$ (5.6)
Intuitively, we can do so by using a ‘conditional’ version of the Nachbin–Strassen theorem, producing chains that, once ordered, remain ordered almost surely. This can be formalized as follows. By [23, Theorem 2.3], there exists a Markov kernel M on S × S such that $\mathbb G$ is absorbing for M (i.e. $M((x,y), \mathbb G) = 1$ for all (x, y) in $\mathbb G$) and
$M((x, y), A \times S) = P(x, A)$ and $M((x, y), S \times B) = P(y, B)$
for all (x, y) ∈ S × S and all $A, B \in \cal B$. Given M, let η be a distribution on S × S with marginals μ and ν, let $Q_\eta := \eta \otimes M \otimes M \otimes \cdots$ be the induced joint distribution, and let {(X_t, Y_t)} have distribution Q_η on (S × S)^∞. By construction, X_t has distribution μP^t and Y_t has distribution νP^t. Moreover, (5.6) is valid because $\mathbb G$ is absorbing for M, and, hence, $\mathbb P\{(X_j, Y_j) \notin \mathbb G \text{ for any } j \leq t\} = \mathbb P\{(X_t, Y_t) \notin \mathbb G\}$.
5.5. Uniform ergodicity
Let S be a partially ordered Polish space satisfying Assumption 4.1, and let P be a monotone Markov kernel on S. A distribution π is called stationary for P if πP = π. Consider the value
$\sigma(P) := \inf_{(x, y) \in S \times S} \alpha_O(P_x, P_y),$ (5.7)
which can be understood as an order-theoretic extension of the Markov–Dobrushin coefficient of ergodicity ([5], [32]). It reduces to the usual notion when ‘⪯’ is equality.
Theorem 5.1. If P is monotone then
$\gamma(\mu P, \nu P) \leq (1 - \sigma(P))\, \gamma(\mu, \nu)$ for all $(\mu, \nu) \in \cal P \times \cal P$. (5.8)
Thus, strict positivity of σ(P) implies that μ ↦ μP is a contraction map on $ (\cal P, \gamma ) $. Moreover, in many settings, the bound in (5.8) cannot be improved upon, as we now see.
Lemma 5.4. If P is monotone, S is not a singleton, and any x, y in S have a lower bound in S, then, for each ξ > σ(P), there exist μ, ν in $\cal P$ such that
$\gamma(\mu P, \nu P) > (1 - \xi)\, \gamma(\mu, \nu).$ (5.9)
The significance of Theorem 5.1 is summarized in the next corollary.
Corollary 5.1. Let P be monotone, and let S satisfy Assumption 4.1. If there exists an m ∈ ℕ such that σ(P^m) > 0, then P has a unique stationary distribution π in $\cal P$, and
$\gamma(\mu P^t, \pi) \leq (1 - \sigma(P^m))^{\lfloor t/m \rfloor}\, \gamma(\mu, \pi)$ for all $\mu \in \cal P$ and t ∈ ℕ. (5.10)
Here ⌊x⌋ is the largest n ∈ ℕ with n ≤ x.
Proof of Corollary 5.1. Let P and μ be as in the statement of the corollary. The existence of a fixed point of μ ↦ μP, and hence a stationary distribution $\pi \in \cal P$, follows from Theorem 5.1 applied to P^m, Banach’s contraction mapping theorem, and the completeness of ($\cal P$, γ) shown in Theorem 4.1. The bound in (5.10) follows from (5.8) applied to P^m and the nonexpansiveness of P in the metric γ (see (5.2)).
5.6. Applications from the introduction
Let us see how Corollary 5.1 can be used to show stability of the two models discussed in the introduction, beginning with the monotone model in (1.1). The state space is S = [0, 1] with its standard order, and all assumptions are as per the discussion immediately following (1.1). We let P be the corresponding Markov kernel and consider the following coupling of (P_1, P_0). Let W be a draw from the Bernoulli(1/2) distribution, and let V = 1 − W. Then P_1 and P_0 are respectively the distributions of
X := (1 + W)/2 and Y := V/2.
Since X ≤ Y if and only if W = 0, we have, by Lemma 5.1 and (3.6),
$\alpha_O(P_x, P_y) \geq \alpha_O(P_1, P_0) \geq \mathbb P\{X \leq Y\} = \tfrac12$
for all x, y ∈ S. From the definition in (5.7) we then have $\sigma(P) \geq \tfrac12$, and global stability in the γ metric follows from Corollary 5.1.
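For this model the contraction rate can be checked exactly. Unfolding the recursion in (1.1) shows that the time-t distribution from seed x is uniform on 2^t equally spaced points; the sketch below (assuming that law of motion, with helper names of our own) computes γ between the time-t distributions from seeds 0 and 1 via the upper-ray representation on ℝ and compares it with the bound (1 − σ(P))^t from Corollary 5.1. The two coincide, so the bound is tight here.

```python
import numpy as np

def dist_t(x, t):
    """Exact time-t distribution of (1.1) from X_0 = x: unfolding the
    recursion shows X_t is uniform on the 2^t points (x + k) / 2^t."""
    k = np.arange(2 ** t)
    return (x + k) / 2 ** t, np.full(2 ** t, 2.0 ** -t)

def gamma_on_R(p_mu, w_mu, p_nu, w_nu):
    """gamma via one-sided Kolmogorov deviations on the merged grid."""
    grid = np.union1d(p_mu, p_nu)
    F_mu = np.array([w_mu[p_mu <= g].sum() for g in grid])
    F_nu = np.array([w_nu[p_nu <= g].sum() for g in grid])
    d = F_mu - F_nu
    return max(np.max(d), 0.0) + max(np.max(-d), 0.0)

for t in range(1, 8):
    (p0, w0), (p1, w1) = dist_t(0.0, t), dist_t(1.0, t)
    # gamma(P_0^t, P_1^t) equals the bound (1 - sigma)^t with sigma = 1/2.
    print(t, gamma_on_R(p0, w0, p1, w1), 0.5 ** t)
```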
Next we turn to the inventory model in (1.2), with state space S = [0, K] and {W_t} an i.i.d. process satisfying ℙ{W_t ≥ w} > 0 for every real w. Let {X_t} and {Y_t} be generated by (1.2), with common shock sequence {W_t} and respective initial conditions x, y in S. In view of (2.2), we have
$\alpha(P_x, P_y) \geq \mathbb P\{X_1 = Y_1\} \geq \mathbb P\{W_1 \geq K\} =: \kappa > 0.$
Letting ‘⪯’ be equality on S, we have α_O(P_x, P_y) = α(P_x, P_y) ≥ κ, and hence σ(P) ≥ κ. Global stability in the γ metric again follows from Corollary 5.1.
5.7. General applications
Next we show that Theorem 5.1, and in particular Corollary 5.1, covers as special cases two well-known results on Markov chain stability: one from the classical literature and one from the more recent monotone Markov chain literature.
Consider first the standard notion of uniform ergodicity. A Markov kernel P on S is called uniformly ergodic if it has a stationary distribution π and $\sup_{x \in S} \| P_x^t - \pi \| \to 0$ as t → ∞. Uniform ergodicity was studied by Markov [24] in a countable state space, and by Doeblin [6], Yosida and Kakutani [37], Doob [9], and many subsequent authors in a general state space. It is defined and reviewed in Chapter 16 of [25]. One of the most familiar equivalent conditions for uniform ergodicity [25, Theorem 16.0.2] is the existence of an m ∈ ℕ and a nontrivial $\phi \in \cal M$ such that $P^m_x \geq \phi$ for all x in S.
One can recover this result using Corollary 5.1. Take ‘⪯’ to be equality, in which case every Markov kernel is monotone, γ is total variation distance, and Assumption 4.1 is always satisfied. Moreover, σ(P^m) reduces to the ordinary ergodicity coefficient of P^m, evaluated using the standard notion of affinity, and, hence,
$\sigma(P^m) = \inf_{(x,y) \in S \times S} \alpha(P^m_x, P^m_y) \geq \phi(S) > 0.$
Thus, all the conditions of Corollary 5.1 are satisfied, and
$\sup_{x \in S} \|P^t_x - \pi\| \leq 2(1 - \sigma(P^m))^{\lfloor t/m \rfloor} \to 0$ as t → ∞.
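In the finite-state case the classical coefficient is directly computable and the contraction is easy to observe numerically. A sketch, with an arbitrary illustrative transition matrix and helper names of our own:

```python
import numpy as np

def dobrushin_alpha(P):
    """Classical ergodicity coefficient: the minimum over row pairs of the
    affinity sum_z min(P(x, z), P(y, z)).  With the identity order this
    is sigma(P)."""
    n = len(P)
    return min(np.minimum(P[x], P[y]).sum() for x in range(n) for y in range(n))

P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4],
              [0.1, 0.5, 0.4]])
sigma = dobrushin_alpha(P)                 # 0.6 for this matrix

pi = np.linalg.matrix_power(P, 200)[0]     # stationary distribution, numerically
mu = np.array([1.0, 0.0, 0.0])
for t in range(1, 6):
    mu = mu @ P
    # TV distance to pi versus the bound 2 (1 - sigma)^t from (5.10).
    print(t, np.abs(mu - pi).sum(), 2 * (1 - sigma) ** t)
```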
Now consider the setting of Bhattacharya and Lee [1], where S = ℝ^n, ‘⪯’ is the usual pointwise partial order ‘≤’ for vectors, and {g_t} is a sequence of i.i.d. random maps from S to itself, generating {X_t} via X_t = g_t(X_{t−1}) = g_t ∘ ⋯ ∘ g_1(X_0). The corresponding Markov kernel is P(x, B) = ℙ{g_1(x) ∈ B}. The random maps are assumed to be order preserving on S, so that P is monotone. Bhattacharya and Lee used a ‘splitting condition’, which assumes the existence of an x̄ ∈ S and an m ∈ ℕ such that
(a) s_1 := ℙ{g_m ∘ ⋯ ∘ g_1(y) ≤ x̄ for all y ∈ S} > 0, and
(b) s_2 := ℙ{g_m ∘ ⋯ ∘ g_1(y) ≥ x̄ for all y ∈ S} > 0.
Under these assumptions, they showed that $\sup_{x \in S} \beta(P^t_x, \, \pi)$ converges to zero exponentially fast in t, where β is the Bhattacharya metric introduced in (4.4). This finding extends earlier results by Dubins and Freedman [10] and Yahav [36] to multiple dimensions.
This result can be obtained as a special case of Corollary 5.1. Certainly S is a partially ordered Polish space and Assumption 4.1 is satisfied. Moreover, the ordered ergodicity coefficient σ(P^m) is strictly positive. To see this, suppose that the splitting condition is satisfied at m ∈ ℕ. Pick any x, y ∈ S, and let {X_t} and {Y_t} be independent copies of the Markov chain, starting at x and y, respectively. We have
$\alpha_O(P^m_x, P^m_y) \geq \mathbb P\{X_m \preceq Y_m\} \geq \mathbb P\{X_m \leq \bar x \leq Y_m\} \geq s_1 s_2.$
The last term is strictly positive by assumption. Hence, all the conditions of Corollary 5.1 are satisfied, a unique stationary distribution π exists, and $\sup_{x \in S} \gamma(P^t_x, \, \pi)$ converges to zero exponentially fast in t. We showed in (4.5) that β ≤ 2γ, so the same convergence holds for the Bhattacharya metric.
We can also recover a related convergence result due to Hopenhayn and Prescott [16, Theorem 2] that is routinely applied to stochastic stability problems in economics. They assumed that S is a compact metric space with a closed partial order, a least element a, and a greatest element b. They supposed that P is monotone and that there exist an x̄ in S and an m ∈ ℕ such that
$P^m_a(i(\bar x)) > 0$ and $P^m_b(d(\bar x)) > 0.$ (5.11)
In this setting, they showed that P has a unique stationary distribution π and $\mu P^t \buildrel {\rm w} \over \longrightarrow \pi$ as t → ∞ for any $\mu \in \cal P$. This result can be obtained from Corollary 5.1. Under the stated assumptions, S is Polish and Assumption 4.1 is satisfied. The coefficient σ(P^m) is strictly positive because, if we let {X_t} and {Y_t} be independent copies of the Markov chain starting at b and a, respectively, then, since $ (X_m, Y_m) \in {\cal C}(P^m_b, P^m_a)$, we have
$\alpha_O(P^m_b, P^m_a) \geq \mathbb P\{X_m \preceq Y_m\} \geq \mathbb P\{X_m \preceq \bar x\}\, \mathbb P\{\bar x \preceq Y_m\} = P^m_b(d(\bar x))\, P^m_a(i(\bar x)).$
The last term is strictly positive by (5.11). Positivity of σ(P^m) now follows from Lemma 5.1, since a ⪯ x, y ⪯ b for all x, y ∈ S. Hence, by Corollary 5.1, there exists a unique stationary distribution π and $\gamma(\mu P^t, \pi) \to 0$ as t → ∞ for any $\mu \in \cal P$. This convergence implies weak convergence by Proposition 4.1 and compactness of S.
Appendix A
In this appendix we collect the remaining proofs. Throughout, in addition to the notation defined above, cbS_0 denotes the set of all continuous functions h: S → [0, 1], while
$g(\mu, \nu) := \mu(S) - \alpha_O(\mu, \nu)$ for μ, ν in $\cal M$,
so that γ(μ, ν) = g(μ, ν) + g(ν, μ) on $\cal P \times \cal P$.
A.1. Proofs of Section 3 results
Proof of Lemma 3.1. For the first equality, fix $\lambda \in \cal M_s$ and let
$b(\lambda) := \sup_{h \in iH_0} \lambda(h)$ and $s(\lambda) := \sup_{I \in i\cal B} \lambda(I).$
Since 1_I ∈ iH_0 for all $I \in i\cal B$, we have b(λ) ≥ s(λ). To see the reverse inequality, let h ∈ iH_0. Fix n ∈ ℕ. Let r_j := j/n for j = 0, …, n. Define h_n ∈ iH_0 by
$h_n(x) := \max\{r_j \colon r_j \leq h(x)\}.$
Since h ≤ h_n + 1/n, we have
For j = 0, …, n, let $I_j := \{x \in S \colon h_n(x) \geq r_j\} \in i\cal B$. Note that
We have
We define f_0, …, f_{n−1} ∈ iH_0 and $A_0, \ldots, A_{n-1} \in i\cal B$ as follows. Let f_0 = h_n and A_0 = I_n. Evidently,
Now suppose that, for some j ∈ {0, 1, …, n − 2}, we have
If λ(I_{n−j−1} \ A_j) > 0 then define
Note that in this case
If λ(I_{n−j−1} \ A_j) ≤ 0 then define
In this case λ(f_{j+1}) − λ(f_j) = (r_{n−j−2} − r_{n−j−1}) λ(I_{n−j−1} \ A_j) ≥ 0. We also have (A.1) by construction. Thus, in both cases, we have (A.3) with j replaced by j + 1. Continuing in this way, we see that (A.3) holds for all j = 0, …, n − 1.
Let j = n − 1 in (A.3). From the definition of r_j and (A.2) we have r_0 = 0 and I_0 = S. Thus,
Since f_{n−1} = 1_{A_{n−1}} and A_{n−1} = I_j for some j ∈ {0, …, n − 1}, recalling (A.1) we have
Applying $\sup_{h \in iH_0}$ to the leftmost side, we see that b(λ) − 1/n ≤ s(λ). Since this is true for any n ∈ ℕ, we obtain b(λ) ≤ s(λ).
The claim (3.1) follows from |a| = max{a, −a} and interchange of max and sup.
Proof of Theorem 3.1. Let (X, Y) be a coupling of (μ, ν), and define
$\mu'(B) := \mathbb P\{X \in B,\ X \preceq Y\}$ and $\nu'(B) := \mathbb P\{Y \in B,\ X \preceq Y\}$ for $B \in \cal B$.
Clearly, μ′ ≤ μ, ν′ ≤ ν, and μ′(S) = ℙ{X ⪯ Y} = ν′(S). Moreover, for any increasing set $I \in i\cal B$, we clearly have μ′(I) ≤ ν′(I). Hence, (μ′, ν′) ∈ Φ(μ, ν) and ℙ{X ⪯ Y} = μ′(S) ≤ α_O(μ, ν). We now exhibit a coupling such that equality is attained. In doing so, we can assume that a := α_O(μ, ν) > 0; if not then, for any $(X,Y) \in \cal C(\mu, \nu)$, we have 0 ≤ ℙ{X ⪯ Y} ≤ α_O(μ, ν) = 0.
To begin, observe that, by Proposition 3.1, there exists a pair (μ′, ν′) ∈ Φ(μ, ν) with μ′(S) = ν′(S) = a. (If a = 1 then μ = μ′ ⪯sd ν′ = ν and the Nachbin–Strassen theorem applies directly, so we may assume a < 1.) Let
$\mu^r := (\mu - \mu')/(1 - a)$ and $\nu^r := (\nu - \nu')/(1 - a).$
By construction, μ^r, ν^r, μ′/a, and ν′/a are probability measures satisfying
$\mu = a(\mu'/a) + (1 - a)\mu^r$ and $\nu = a(\nu'/a) + (1 - a)\nu^r.$
We construct a coupling (X, Y) as follows. Let U, X′, Y′, X^r, and Y^r be random variables on a common probability space such that
(a) $X' {\mathop = \limits^{\rm D}} \mu' / a$, $Y' {\mathop = \limits^{\rm D}} \nu' / a$, $X^r {\mathop = \limits^{\rm D}} \mu^r$, and $Y^r {\mathop = \limits^{\rm D}} \nu^r$,
(b) U is uniform on [0, 1] and independent of (X′, Y′, Xr, Yr), and
(c) ℙ{X′ ⪯ Y′} = 1.
The pair in (c) can be constructed via the Nachbin–Strassen theorem [19, Theorem 1], since μ′/a ⪯sd ν′/a. Now let
(X, Y) := (X′, Y′) if U ≤ a and (X, Y) := (X^r, Y^r) otherwise.
Evidently, $(X,Y) \in \cal C(\mu, \nu)$. Moreover, for this pair, we have
$\mathbb P\{X \preceq Y\} \geq \mathbb P\{X' \preceq Y',\ U \leq a\}.$
By independence, the right-hand side is equal to ℙ{X′ ⪯ Y′} ℙ{U ≤ a} = a, so
$\mathbb P\{X \preceq Y\} \geq a = \alpha_O(\mu, \nu).$
We conclude that
$\alpha_O(\mu, \nu) = \max_{(X,Y) \in \cal C(\mu, \nu)} \mathbb P\{X \preceq Y\}.$ (A.4)
Next, observe that, for any $(X, Y) \in \cal C(\mu, \nu)$ and h ∈ iH_0, we have
$\mu(h) - \nu(h) = \mathbb E[h(X) - h(Y)] \leq \mathbb E[(h(X) - h(Y))\, {\bf 1}\{X \npreceq Y\}] \leq \mathbb P\{X \npreceq Y\}.$
Specializing to h = 1_I for some $I \in i\cal B$, we have $\mu(I) - \nu(I) \leq \mathbb P\{ X \npreceq Y \} = 1 - \mathbb P\{ X \preceq Y \}$. From this bound and (A.4), the proof of (3.6) will be complete if we can show that
$\sup_{(X,Y) \in \cal C(\mu, \nu)} \mathbb P\{X \preceq Y\} \geq 1 - \sup_{I \in i\cal B}\{\mu(I) - \nu(I)\}.$ (A.5)
To prove (A.5), let $\cal B \otimes \cal B$ be the product σ-algebra on S × S and let π_i be the ith coordinate projection, so that π_1(x, y) = x and π_2(x, y) = y for any (x, y) ∈ S × S. As usual, given Q ⊂ S × S, we let π_1(Q) be the set of all x ∈ S such that (x, y) ∈ Q for some y ∈ S, and similarly for π_2. Recall that $\mathfrak C$ is the closed sets in S and $d\mathfrak C$ is the decreasing sets in $\mathfrak C$. Strassen’s theorem [33] implies that, for any ∊ ≥ 0 and any closed set K ⊂ S × S, there exists a probability measure ξ on $(S \times S, \cal B \otimes \cal B)$ with marginals μ and ν such that ξ(K) ≥ 1 − ∊ whenever
$\nu(F) \leq \mu(\pi_1(K \cap (S \times F))) + \epsilon$ for all $F \in \mathfrak C$.
Note that if $F \in \mathfrak C$ then, since ‘⪯’ is a closed partial order, the smallest decreasing set d(F) containing F is also closed. Let $\epsilon := \sup_{D \in d\mathfrak C} \{ \nu(D) - \mu(D) \}$, so that
$\nu(D) \leq \mu(D) + \epsilon$ for all $D \in d\mathfrak C$.
Noting that d(F) can be expressed as $\pi_1(\mathbb G \cap (S \times F))$, it follows that, for any $F \in \mathfrak C$,
$\nu(F) \leq \nu(d(F)) \leq \mu(d(F)) + \epsilon = \mu(\pi_1(\mathbb G \cap (S \times F))) + \epsilon.$
Since ‘⪯’ is closed, $\mathbb G$ is closed, and Strassen’s theorem applies. From this theorem we obtain a probability measure ξ on the product space S × S such that $\xi(\mathbb G) \geq 1 - \epsilon$ and ξ has marginals μ and ν.
Because complements of increasing sets are decreasing and vice versa, we have
$\epsilon = \sup_{D \in d\mathfrak C} \{\nu(D) - \mu(D)\} = \sup_{G \in i\cal O} \{\mu(G) - \nu(G)\} \leq \sup_{I \in i\cal B} \{\mu(I) - \nu(I)\}.$ (A.6)
Now consider the probability space $(\Omega, \cal F, \mathbb P) = (S \times S, \cal B \otimes \cal B, \xi)$, and let X = π_1 and Y = π_2. We then have $\xi(\mathbb G) = \xi\{(x,y) \in S \times S \colon x \preceq y\} = \mathbb P\{X \preceq Y\}$. Combining this equality with (A.6) implies (A.5).
A.2. Proofs of Section 4 results
We begin with an elementary lemma.
Lemma A.1. For any μ, $\nu \in \cal M$, we have μ ≤ ν whenever μ(h) ≤ ν(h) for all h ∈ cbS_0.
Proof. Suppose that μ(h) ≤ ν(h) for all h ∈ cbS_0. We claim that
μ(F) ≤ ν(F) for all $F \in \mathfrak C$. (A.7)
To see this, let ρ be a metric compatible with the topology of S and let F be any closed subset of S. Let f_∊(x) := max{1 − ρ(x, F)/∊, 0} for ∊ > 0, x ∈ S, where ρ(x, F) := inf_{y∈F} ρ(x, y). Since ρ(·, F) is continuous and 0 ≤ f_∊ ≤ 1, we have f_∊ ∈ cbS_0. Let F_∊ := {x ∈ S : ρ(x, F) < ∊} for ∊ > 0. Note that f_∊(x) = 1 for all x ∈ F, and that f_∊(x) = 0 for all x ∉ F_∊. Thus,
μ(F) ≤ μ(f_∊) ≤ ν(f_∊) ≤ ν(F_∊). (A.8)
Since F = ∩_{∊>0} F_∊, we have lim_{∊↓0} ν(F_∊) = ν(F), so letting ∊ ↓ 0 in (A.8) yields μ(F) ≤ ν(F). Hence, (A.7) holds.
Let $B \in \cal B$ and fix ∊ > 0. Since all probability measures on a Polish space are regular, there exists a closed set F ⊂ B such that μ(B) ≤ μ(F) + ∊. Thus, by (A.7), we have μ(B) ≤ μ(F) + ∊ ≤ ν(F) + ∊ ≤ ν(B) + ∊. Since ∊ > 0 is arbitrary, this yields μ(B) ≤ ν(B). Hence, μ ≤ ν.
Proof of Theorem 4.1. Let {μ_n} be a Cauchy sequence in ($\cal P, \gamma$). Our first claim is that {μ_n} is tight. To show this, fix ∊ > 0. Let μ := μ_N be such that
γ(μ_n, μ) < ∊ for all n ≥ N. (A.9)
Let K be a compact subset of S such that μ(K) ≥ 1 − ∊, and let K̄ := i(K) ∩ d(K). We have
μ(i(K)) ≥ μ(K) ≥ 1 − ∊ and μ(d(K)) ≥ μ(K) ≥ 1 − ∊.
For n ≥ N, this bound, (4.3), (A.9), and the definition of K yield
μ_n(K̄) ≥ 1 − 4∊.
Hence, {μ_n}_{n≥N} is tight, because K̄ is compact by Assumption 4.1. It follows that {μ_n}_{n≥1} is likewise tight. As a result, by Prokhorov’s theorem, it has a subsequence that converges weakly to some $\mu^{*} \in \cal P$. We aim to show that γ(μ_n, μ*) → 0.
To this end, fix ϵ > 0 and let n_ϵ be such that γ(μ_m, μ_{n_ϵ}) < ϵ whenever m ≥ n_ϵ. Fix m ≥ n_ϵ and let ν := μ_m. For all n ≥ n_ϵ, we have γ(ν, μ_n) < ϵ. Fixing any such n ≥ n_ϵ, we observe that, since g(μ_n, ν) < ϵ, there exists $(\tilde{\mu}_{n}, \tilde{\nu}_{n}) \in \Phi(\mu_{n}, \nu)$ with $\|\tilde{\mu}_{n}\| = \|\tilde{\nu}_{n}\| > 1-\epsilon$. Multiplying $\tilde{\mu}_{n}$ and $\tilde{\nu}_{n}$ by $ (1-\epsilon)/\|\tilde{\mu}_n\| < 1 $, and denoting the resulting measures by $\tilde{\mu}_{n}$ and $\tilde{\nu}_{n}$ again, we have
$(\tilde{\mu}_{n}, \tilde{\nu}_{n}) \in \Phi(\mu_{n}, \nu)$ with $\|\tilde{\mu}_{n}\| = \|\tilde{\nu}_{n}\| = 1 - \epsilon$. (A.10)
Note that $\{\tilde{\nu}_{n}\}$ is tight. Since {μ_n} is tight, so is $\{\tilde{\mu}_{n}\}$. Thus, there exist subsequences $\{\mu_{n_{i}}\}_{i \in \mathbb N}$, $\{\tilde{\mu}_{n_{i}}\}_{i \in \mathbb N}$, and $\{\tilde{\nu}_{n_{i}}\}_{i \in \mathbb N}$ of {μ_n}, $\{\tilde{\mu}_{n}\}$, and $\{\tilde{\nu}_{n}\}$, respectively, such that, for some $\tilde{\mu}^{*}, \tilde{\nu}^{*} \in \cal M$ with $\|\tilde{\mu}^{*}\| = \|\tilde{\nu}^{*}\| = 1-\epsilon$, we have
$\mu_{n_i} \buildrel {\rm w} \over \longrightarrow \mu^*, \quad \tilde{\mu}_{n_i} \buildrel {\rm w} \over \longrightarrow \tilde{\mu}^*, \quad \tilde{\nu}_{n_i} \buildrel {\rm w} \over \longrightarrow \tilde{\nu}^*.$
Given h ∈ cbS_0, since $\tilde{\mu}_{n_{i}} (h) \leq \mu_{n_{i}} (h)$ and $\tilde{\nu}_{n_{i}} (h) \leq \nu (h)$ for all $i \in \mathbb N$ by (A.10), we have $\tilde{\mu}^{*} (h) \leq \mu^{*} (h)$ and $\tilde{\nu}^{*} (h) \leq \nu (h)$ by weak convergence. Thus, $\tilde{\mu}^{*} \leq \mu^{*}$ and $\tilde{\nu}^{*} \leq \nu$ by Lemma A.1. We have $\tilde{\mu}^{*} \preceq_{sd} \tilde{\nu}^{*}$ by [19, Proposition 3]. It follows that $ (\tilde{\mu}^{*}, \tilde{\nu}^{*}) \in \Phi(\mu^{*}, \nu) $. We have $g(\mu^{*}, \nu) \leq 1 - \|\tilde{\mu}^{*}\| = \epsilon$.
By a symmetric argument, we also have g(ν, μ*) ≤ ϵ. Hence, γ(ν, μ*) ≤ 2ϵ. Recalling the definition of ν, we have now shown that γ(μ_m, μ*) ≤ 2ϵ for all m ≥ n_ϵ. Since ϵ was arbitrary, this concludes the proof.
A.3. Proofs of Section 5 results
We begin with some lemmas.
Lemma A.2. If P is monotone then $\sigma(P) = \inf_{(\mu, \nu) \in \cal P \times \cal P} \alpha_O(\mu P, \nu P)$.
Proof. Let P be a monotone Markov kernel. It suffices to show that the inequality $\sigma(P) \leq \inf_{(\mu, \nu) \in \cal P \times \cal P} \alpha_O(\mu P, \nu P)$ holds, since the reverse inequality is obvious. By the definition of σ(P) and the identities in (3.6), the claim will be established if we can show that
$\sup_{x, y} \sup_{I \in i\cal B} \{P(x, I) - P(y, I)\} \geq \sup_{\mu, \nu} \sup_{I \in i\cal B} \{(\mu P)(I) - (\nu P)(I)\},$ (A.11)
where x and y are chosen from S and μ and ν are chosen from $\cal P$. Let s be the value of the right-hand side of (A.11) and let ϵ > 0. Fix μ, $\nu \in \cal P$ and $I \in i\cal B$ such that (μP)(I) − (νP)(I) > s − ϵ, or, equivalently,
$\int P(x, I)\, \mu(dx) - \int P(y, I)\, \nu(dy) > s - \epsilon.$
From this expression we see that there are x̄, ȳ ∈ S such that P(x̄, I) − P(ȳ, I) > s − ϵ. Since ϵ > 0 was arbitrary, it follows that $\sup_{x,y} \sup_{I \in i\cal B} \{ P(x, I) - P(y, I) \} \geq s$, as was to be shown.
Lemma A.3. Let P be a monotone Markov kernel. If μ, $\nu \in \cal M$ and $(\tilde{\mu}, \tilde{\nu})$ is an ordered component pair of (μ, ν), then
$\alpha_O(\mu P, \nu P) \geq \tilde\mu(S) + \alpha_O((\mu - \tilde\mu)P, (\nu - \tilde\nu)P).$
Proof. Fix μ, ν in $\cal M$ and $(\tilde{\mu}, \tilde{\nu}) \in \Phi(\mu, \nu)$. Consider the residual measures $\hat{\mu} := \mu - \tilde{\mu}$ and $\hat{\nu} := \nu - \tilde{\nu}$. Let (μ′, ν′) be a maximal ordered component pair of ($\hat{\mu} P, \hat{\nu} P$), and define
$\mu^* := \tilde\mu P + \mu'$ and $\nu^* := \tilde\nu P + \nu'.$
We claim that (μ*, ν*) is an ordered component pair for (μP, νP). To see this, note that
$\mu^* = \tilde\mu P + \mu' \leq \tilde\mu P + \hat\mu P = \mu P,$
and, similarly, ν* ≤ νP. The measures μ* and ν* also have the same mass, since
$\mu^*(S) = \tilde\mu(S) + \mu'(S) = \tilde\nu(S) + \nu'(S) = \nu^*(S).$
Moreover, since $\tilde{\mu} \preceq_{sd} \tilde{\nu}$ and P is monotone, we have $\tilde{\mu} P \preceq_{sd} \tilde{\nu} P$. Hence, μ* ⪯sd ν*, completing the claim that (μ*, ν*) is an ordered component pair for (μP, νP). As a result,
$\alpha_O(\mu P, \nu P) \geq \mu^*(S) = \tilde\mu(S) + \alpha_O(\hat\mu P, \hat\nu P).$
Proof of Theorem 5.1. Let μ, $\nu \in \cal P$. Let ($\tilde{\mu}, \tilde{\nu}$) be a maximal ordered component pair for (μ, ν). Let $\hat{\mu} = \mu - \tilde{\mu}$ and $\hat{\nu} = \nu - \tilde{\nu}$ be the residuals. Since $\|\tilde{\mu}\| = \|\tilde{\nu}\|$, we have $\|\hat{\mu}\| = \|\hat{\nu}\|$. Suppose first that $\|\hat{\mu}\| > 0$. Then $\hat{\mu} P/\|\hat{\mu}\|$ and $\hat{\nu} P/\|\hat{\mu}\|$ are both in $\cal P$. Thus, by Lemma A.2,
$\alpha_O(\hat\mu P/\|\hat\mu\|,\ \hat\nu P/\|\hat\mu\|) \geq \sigma(P).$
Applying the positive homogeneity property in Lemma 3.3 yields
$\alpha_O(\hat\mu P, \hat\nu P) \geq \sigma(P)\, \|\hat\mu\|.$
Note that this inequality trivially holds if $\|\hat{\mu}\| = 0$. From the definition of g, we can write the same inequality as $g(\hat{\mu} P, \hat{\nu} P) \leq (1 - \sigma(P)) g(\mu, \nu)$. If we apply Lemma A.3 to the latter we obtain
$g(\mu P, \nu P) \leq (1 - \sigma(P))\, g(\mu, \nu).$
Reversing the roles of μ and ν, we also have g(νP, μP) ≤ (1 − σ(P)) g(ν, μ). Thus,
$\gamma(\mu P, \nu P) = g(\mu P, \nu P) + g(\nu P, \mu P) \leq (1 - \sigma(P))\, \gamma(\mu, \nu),$
verifying the claim in (5.8).
Proof of Lemma 5.4. To see that (5.9) holds, fix ξ > σ(P) and suppose first that σ(P) = 1. Then (5.9) holds because the right-hand side of (5.9) can be made strictly negative by choosing μ, $\nu \in \cal P$ to be distinct. Now suppose that σ(P) < 1. It suffices to show that, for each ϵ > 0,
there exist x, y ∈ S with y ⪯ x and $x \npreceq y$ such that $\alpha_O(P_x, P_y) < \sigma(P) + \epsilon$. (A.12)
Indeed, if we take (A.12) as valid, set ϵ := ξ − σ(P), and choose x and y to satisfy the conditions in (A.12), then μ := δ_x and ν := δ_y give γ(μ, ν) = 1 and
$\gamma(\mu P, \nu P) \geq 1 - \alpha_O(P_x, P_y) > 1 - \sigma(P) - \epsilon = 1 - \xi = (1 - \xi)\, \gamma(\mu, \nu).$
Therefore, (5.9) holds.
To show that (A.12) holds, fix ϵ > 0. We can use σ(P) < 1 and the definition of σ(P) as an infimum to choose a δ ∈ (0, ϵ) and points x̄, ȳ ∈ S such that α_O(P_x̄, P_ȳ) < σ(P) + δ < 1. Note that x̄ ⪯ ȳ cannot hold here, because then we would have α_O(P_x̄, P_ȳ) = 1, a contradiction. Hence $\bar x \npreceq \bar y$. Let y be a lower bound of x̄ and ȳ, and let x := x̄. We claim that (A.12) holds for the pair (x, y).
To see this, observe that, by the monotonicity result in Lemma 5.1 and y ⪯ ȳ, we have
$\alpha_O(P_x, P_y) \leq \alpha_O(P_{\bar x}, P_{\bar y}) < \sigma(P) + \delta < \sigma(P) + \epsilon.$
Moreover, y ⪯ x because x = x̄ and y is by definition a lower bound of x̄. Finally, $x \npreceq y$ because if not then x̄ = x ⪯ y ⪯ ȳ, contradicting our assumption that $\bar x \npreceq \bar y$.