
On the rate of convergence for the length of the longest common subsequences in hidden Markov models

Published online by Cambridge University Press:  30 July 2019

C. Houdré*
Affiliation:
Georgia Institute of Technology
G. Kerchev*
Affiliation:
Georgia Institute of Technology
*Postal address: School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA 30332-0160, USA.

Abstract

Let (X, Y) = (Xn, Yn)n≥1 be the output process generated by a hidden chain Z = (Zn)n≥1, where Z is a finite-state, aperiodic, time-homogeneous, and irreducible Markov chain. Let LCn be the length of the longest common subsequences of X1,..., Xn and Y1,..., Yn. Under a mixing hypothesis, a rate of convergence result is obtained for E[LCn]/n.

Type
Research Papers
Copyright
© Applied Probability Trust 2019 

1. Introduction

Longest common subsequences are often a key measure of similarity between two strings of letters. For two finite sequences (X 1,..., Xn) and (Y 1,..., Ym) taking values in a finite alphabet 𝒜, the object of study is LCS(X 1,..., Xn; Y 1,..., Ym), the length of the longest common subsequences of X 1,..., Xn and Y 1,..., Ym, which is abbreviated as LCn when n = m. Clearly, LCn is the largest k such that there exist 1 ≤ i 1 < ⋯ < ik ≤ n and 1 ≤ j 1 < ⋯ < jk ≤ n with

$${X_{{i_s}}} = {Y_{{j_s}}},\quad {\rm{for all}}\,s = 1,2,3, \ldots ,k.$$
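
For concreteness, LCn can be computed by the standard dynamic-programming recursion; the short Python sketch below (ours, purely illustrative and not part of the argument) returns the length of the longest common subsequences of two finite words.

```python
def lcs_length(x, y):
    """Length of the longest common subsequences of the words x and y,
    via the classical O(len(x) * len(y)) dynamic program."""
    m, n = len(x), len(y)
    prev = [0] * (n + 1)          # prev[j] = LCS(x[:i-1], y[:j])
    for i in range(1, m + 1):
        curr = [0] * (n + 1)      # curr[j] = LCS(x[:i], y[:j])
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]

# Example over the alphabet {A, C, G, T}:
print(lcs_length("ACGTACG", "CGATCGA"))   # prints 5 (e.g. the common subsequence CGTCG)
```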

For two independent random words with letters sampled independently and uniformly at random from the alphabet, Chvátal and Sankoff [Reference ChváTal and Sankoff6] proved that limn →∞ E[LCn]/n = γ* and provided upper and lower bounds on γ*. This was followed by Alexander [Reference Alexander1], who obtained, for independent and identically distributed (i.i.d.) draws, the following generic rate of convergence result:

(1.1) $$n{\gamma ^*} - C\sqrt {n\log n} \le {\rm{E}}[L{C_n}] \le n{\gamma ^*},$$

where C > 0 is an absolute constant.

From a practical point of view the independence assumptions, both between words and also among draws, have to be relaxed as they are often lacking. One such instance is in the field of computational biology, where we compare similarities between two biological sequences. In particular, alignments of those sequences need to be qualified as occurring by chance or because of a structural relation. One way to generate alignments is with a hidden Markov model (HMM). The states of the hidden chain account for a match between two elements in X and Y or for an alignment of an element with a gap. Given X and Y we can find the most probable alignment using the Viterbi algorithm. This model is particularly useful when the similarity between X and Y is weak. In this case standard methods for pairwise alignment often fail to identify the correct alignment or test for its significance. With a hidden Markov model we can evaluate the total probability that X and Y are aligned by summing over all alignments, and this sum can be efficiently computed with the Forward algorithm. For more information the reader is referred to [Reference Durbin, Eddy, Krogh and Mitchison9, Chapter 4].
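
As an illustration of the summation just described, the following Python sketch (ours, with hypothetical names) implements the Forward algorithm for a single observed sequence; the pair-HMM variant used for alignments in [Reference Durbin, Eddy, Krogh and Mitchison9, Chapter 4] runs the same recursion over alignment states.

```python
import numpy as np

def forward_log_prob(obs, init, trans, emis):
    """Total probability P(V_1, ..., V_n) of an observed sequence under an HMM,
    summed over all hidden paths (forward algorithm, in the log domain).

    obs   : list of observation indices
    init  : (S,) initial distribution over hidden states
    trans : (S, S) transition matrix, trans[i, j] = P(Z_{t+1} = j | Z_t = i)
    emis  : (S, A) emission matrix, emis[i, a] = P(V_t = a | Z_t = i)
    """
    log_alpha = np.log(init) + np.log(emis[:, obs[0]])
    for v in obs[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * trans[i, j] * emis[j, v]
        log_alpha = (np.logaddexp.reduce(log_alpha[:, None] + np.log(trans), axis=0)
                     + np.log(emis[:, v]))
    return np.logaddexp.reduce(log_alpha)

# Toy example: two hidden states, binary observations.
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emis = np.array([[0.7, 0.3], [0.1, 0.9]])
print(forward_log_prob([0, 1, 1, 0], init, trans, emis))
```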

There are very few results on the asymptotics of the longest common subsequences in a model exhibiting dependence properties. A rare instance is due to Steele [Reference Steele18], who showed the convergence of E[LCn]/n when (X, Y) is a random sequence for which there is a stationary ergodic coupling, e.g. an irreducible, aperiodic, positive recurrent Markov chain. The present paper studies the longest common subsequences for strings exhibiting a different Markov relation; namely, we study the case when (X, Y) is emitted by a latent Markov chain Z, i.e. when (Z, (X, Y)) is a hidden Markov model. Note that this framework includes the special case when (Z′, X) and (Z″, Y) are hidden Markov models, with the same parameters, while Z′ and Z″ are independent. In this setting, mean convergence is quickly proved in Section 2. Then, the main contribution is a rate of convergence result, obtained in Section 3, which recovers, in particular, (1.1).

Throughout this manuscript the probability space (Ω, ℱ, P) is assumed to be rich enough to consider all the random variables being studied.

2. Mean convergence

Recall that a hidden Markov model (Z, V) consists of a Markov chain Z = (Zn)n ≥1 which emits the observed variables V = (Vn)n ≥1. The possible states in Z are each associated with a distribution on the values of V. In other words, the observation V is a mixture model where the choice of the mixture component for each observation depends on the component of the previous observation. The mixture components are given by the sequence Z. Note also that given Z, V is a Markov chain. For such a model the first easy result asserts the mean convergence of LCn.
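
To make this setup concrete, here is a minimal simulation sketch (ours, with hypothetical names and parameters): a finite hidden chain Z is sampled, and each state Zi emits a pair (Xi, Yi) from the joint distribution attached to that state, as in Proposition 2.1 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm_pairs(n, mu, P, emission, alphabet_size):
    """Sample Z_1, ..., Z_n from a finite Markov chain with initial law mu and
    transition matrix P; each Z_i then emits a pair (X_i, Y_i) from the joint
    law emission[Z_i], where emission[z][x, y] = P(X_i = x, Y_i = y | Z_i = z)."""
    S, A = len(mu), alphabet_size
    Z = np.empty(n, dtype=int)
    X = np.empty(n, dtype=int)
    Y = np.empty(n, dtype=int)
    Z[0] = rng.choice(S, p=mu)
    for i in range(n):
        if i > 0:
            Z[i] = rng.choice(S, p=P[Z[i - 1]])
        k = rng.choice(A * A, p=emission[Z[i]].ravel())   # joint draw of (X_i, Y_i)
        X[i], Y[i] = divmod(k, A)
    return Z, X, Y

# Two hidden states, binary alphabet; state 0 favours matches, state 1 is uniform.
mu = np.array([1.0, 0.0])
P = np.array([[0.8, 0.2], [0.3, 0.7]])
emission = [np.array([[0.45, 0.05], [0.05, 0.45]]),   # symmetric, mostly X_i = Y_i
            np.full((2, 2), 0.25)]                     # uniform, X_i and Y_i independent
Z, X, Y = sample_hmm_pairs(10, mu, P, emission, alphabet_size=2)
print(X, Y)
```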

Proposition 2.1. Let Z be an aperiodic, irreducible, time-homogeneous, finite-state-space Markov chain. Let μ, P, and π be respectively the initial distribution, transition matrix, and stationary distribution of Z. Let each Zn, n ≥ 1, generate a pair (Xn, Yn) according to a distribution associated with the state of Zn, i.e. let (Z, (X, Y)) be a hidden Markov model, where X = (Xn)n ≥1 and Y = (Yn)n ≥1. Further, for all i ≥ 1 and j ≥ 1, let Xi and Yj take their values in the common finite alphabet 𝒜 and let there exist a ∊ 𝒜 such that P(Xi = Yj = a) > 0 for some i ≥ 1 and j ≥ 1. Then

$$\mathop {\lim }\limits_{n \to \infty } {{{\rm{E}}[L{C_n}]} \over n} = {\gamma ^*},$$

where γ* ∊ (0, 1].

Proof. If μ = π, the sequence (X, Y) is stationary, and therefore superadditivity together with Fekete’s lemma, or Kingman’s subadditivity theorem (see [Reference Steele19]), implies that

$$\mathop {\lim }\limits_{n \to \infty } {{{\rm{E}}[L{C_n}]} \over n} = \mathop {\sup }\limits_{k \ge 1} {{{\rm{E}}[L{C_k}]} \over k} = {\gamma ^*}$$

for some γ* ∊ (0, 1]. When μ ≠ π, a coupling technique will prove the result. Let $$\overline Z $$ be a Markov chain with initial and stationary distribution π and having the same transition matrix P as the chain Z. Assume, further, that the emission probabilities are the same for Z and $$\overline Z $$, and denote by $$(\overline Z ,(\overline X ,\overline Y ))$$ the corresponding HMM. Next, consider the coupling $$(Z,\overline Z )$$ where the two chains stay together after the first time i for which $${Z_i} = {\overline Z _i}$$, and let τ be the meeting time of Z and $$\overline Z $$. Now, and throughout, let X (n) ≔ (X 1,..., Xn), and similarly for Y (n), $${\overline X ^{(n)}}$$, and $${\overline Y ^{(n)}}$$. Since $$LCS({X^{(n)}};{Y^{(n)}}) - LCS({\overline X ^{(n)}};{\overline Y ^{(n)}}) \le n$$, then for any K > 0,

(2.1) $$\matrix{ {|{\rm{E}}[LCS({X^{(n)}};{Y^{(n)}}) - LCS({{\overline X }^{(n)}};{{\overline Y }^{(n)}})]|} \hfill \cr { = |{\rm{E}}[[LCS({X^{(n)}};{Y^{(n)}}) - LCS({{\overline X }^{(n)}};{{\overline Y }^{(n)}})]{{\bf{1}}_{\tau > K}}]} \hfill \cr { + {\rm{E}}[[LCS({X^{(n)}};{Y^{(n)}}) - LCS({{\overline X }^{(n)}};{{\overline Y }^{(n)}})]{{\bf{1}}_{\tau \le K}}]|} \hfill \cr { \le n{\rm{P}}(\tau > K) + K + |{\rm{E}}[[LC{S^K}({X^{(n)}};{Y^{(n)}}) - LC{S^K}({{\overline X }^{(n)}};{{\overline Y }^{(n)}})]{{\bf{1}}_{\tau \le K}}]|} \hfill \cr { \le n{\rm{P}}(\tau > K) + K,} \hfill \cr } $$

where LCSK( · ; · ) is now the length of the longest common subsequences restricted to the letters Xi and Yi for i > K, noting also that when τ ≤ K, then LCSK(X (n);Y (n)) and $$LC{S^K}({\overline X ^{(n)}};{\overline Y ^{(n)}})$$ are identically distributed. If K ∊ (mk, (m + 1)k], for some m ≥ 0, by an argument going back to Doeblin [Reference Doeblin7] (see also [Reference Thorisson20]),

(2.2) $$\matrix{ {{\rm{P}}(\tau > K)} \hfill \cr {{\rm{ }} \le {\rm{P}}({Z_k} \ne \overline {{Z_k}} ,{Z_{2k}} \ne \overline {{Z_{2k}}} , \ldots ,{Z_{mk}} \ne \overline {{Z_{mk}}} )} \hfill \cr {{\rm{ }} = {\rm{P}}({Z_k} \ne \overline {{Z_k}} ){\rm{P}}({Z_{2k}} \ne \overline {{Z_{2k}}} |{Z_k} \ne \overline {{Z_k}} ) \cdots {\rm{P}}({Z_{mk}} \ne \overline {{Z_{mk}}} |{Z_{(m - 1)k}} \ne \overline {{Z_{(m - 1)k}}} )} \hfill \cr {{\rm{ }} \le {{(1 - \varepsilon )}^{m - 1}}} \hfill \cr {{\rm{ }} \le c{\alpha ^K},} \hfill \cr } $$

where $$\alpha = \root k \of {1 - \varepsilon } \in (0,1)$$ and c = 1/(1 − ε)². Therefore, τ is finite with probability one. Choosing $$K = \sqrt n $$ yields P(τ > K) + K/n → 0 and finally E[LCn]/n → γ* as n → ∞.
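
To spell out the last inequality of (2.2): since α = (1 − ε)1/k ∊ (0, 1) and K ≤ (m + 1)k,

$${(1 - \varepsilon )^{m - 1}} = {\alpha ^{k(m - 1)}} = {\alpha ^{(m + 1)k}}{\alpha ^{ - 2k}} \le {\alpha ^K}{(1 - \varepsilon )^{ - 2}} = c{\alpha ^K}.$$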

Clearly, E[LCn] ≤ n; to see that γ* > 0, note first that, by aperiodicity and irreducibility, Pk ≥ ε for some fixed k and ε > 0, i.e. all the entries of the matrix Pk are larger than some positive quantity ε. Therefore, P(X 1 = Yk +1) > p for some p = p(k, ε) > 0. Now,

(2.3) $$L{C_{nk + 1}} \ge {{\bf{1}}_{{X_1} = {Y_{k + 1}}}} + {{\bf{1}}_{{X_{k + 1}} = {Y_{2k + 1}}}} + \cdots + {{\bf{1}}_{{X_{(n - 1)k + 1}} = {Y_{nk + 1}}}},$$

and hence

$${{np} \over {nk + 1}} \le {{{\rm{E}}[L{C_{nk + 1}}]} \over {nk + 1}}.$$

Letting n → ∞ implies that γ* ∊ [p/(k + 1), 1] ⊂ (0, 1], since p > 0.

Remark 2.1. (i) Under a further assumption, we can show that γ* > P(X 1 = Y 1). Indeed, assume that for all x, y ∊ 𝒜, z ∊ S, P(Xi = x, Yi = y | Zi = z) = P(Xi = y, Yi = x | Zi = z) > 0, and let Z be started at the stationary distribution. Then, for any n ≥ 2,

$$\matrix{ {{\rm{E}}[L{C_n}] \ge {\rm{E}}[L{C_{n - 2}}{{\bf{1}}_{{X_n} = {Y_n},{X_{n - 1}} = {Y_{n - 1}}}}] + 2{\rm{P}}({X_n} = {Y_n},{X_{n - 1}} = {Y_{n - 1}})} \hfill \cr {{\rm{ }} + {\rm{E}}[L{C_{n - 2}}{{\bf{1}}_{{X_n} = {Y_n},{X_{n - 1}} \ne {Y_{n - 1}}}}] + {\rm{P}}({X_n} \ne {Y_n},{X_{n - 1}} = {Y_{n - 1}})} \hfill \cr {{\rm{ }} + {\rm{E}}[L{C_{n - 2}}{{\bf{1}}_{{X_n} \ne {Y_n},{X_{n - 1}} = {Y_{n - 1}}}}] + {\rm{P}}({X_n} = {Y_n},{X_{n - 1}} \ne {Y_{n - 1}})} \hfill \cr {{\rm{ }} + {\rm{E}}[L{C_{n - 2}}{{\bf{1}}_{{X_n} \ne {Y_n},{X_{n - 1}} \ne {Y_{n - 1}}}}] + {\rm{P}}({X_n} \ne {Y_n},{X_{n - 1}} \ne {Y_{n - 1}},{X_n} = {Y_{n - 1}})} \hfill \cr {{\rm{ }} > {\rm{E}}[L{C_{n - 2}}] + {\rm{P}}({X_n} = {Y_n}) + {\rm{P}}({X_{n - 1}} = {Y_{n - 1}})} \hfill \cr {{\rm{ }} = {\rm{E}}[L{C_{n - 2}}] + 2{\rm{P}}({X_1} = {Y_1})} \hfill \cr } $$

by stationarity. Therefore, iterating, still using stationarity, and since E[LC 0] = 0 while E[LC 1] = P(X 1 = Y 1), it follows that for n ≥ 2, E[LCn] > nP(X 1 = Y 1). Finally,

$${\gamma ^*} > {\rm{P}}({X_1} = {Y_1}) = \sum\limits_{\alpha \in {\cal A}} {\rm{P}}({X_1} = \alpha ){\rm{P}}({Y_1} = \alpha ),$$

and this inequality is strict since Fekete’s lemma (see, e.g., [Reference Steele19]) ensures that γ* = supn E[LCn]/n.

(ii) Steele’s general result, see [Reference Steele18], asserts that Proposition 2.1 holds if there is a stationary ergodic coupling for (X, Y). Such an example is when the sequences X and Y are generated by two independent aperiodic, homogeneous, and irreducible hidden Markov chains with the same parameters (and so the same emission probabilities). Indeed, at first, when the hidden chains ZX and ZY generating respectively X and Y are started at the stationary distribution, convergence of E[LCn]/n towards γ* follows from superadditivity and Fekete’s lemma (see [Reference Steele19]). As previously, γ* > 0 since the properties of the hidden chains imply (2.3). Then, when the initial distribution is not the stationary distribution, we can proceed with arguments as above. In particular, let τ 1 and τ 2 be the respective meeting times of the chains $$({Z_X},\overline {{Z_X}} )$$ and $$({Z_Y},\overline {{Z_Y}} )$$, and let τ = max(τ 1, τ 2). Then equation (2.1) continues to hold:

(2.4) $$\matrix{ {|{\rm{E}}[LCS(X;Y) - LCS(\overline X ;\overline Y )]|} \hfill & \le \hfill & {n{\rm{P}}(\tau > K) + K} \hfill \cr {} \hfill & \le \hfill & {2n{\rm{P}}({\tau _1} > K) + K.} \hfill \cr } $$

Taking $$K = \sqrt n $$ and noting the exponential decay of P(τ 1 > K) finishes the corresponding proof.

3. Rate of convergence

The previous section gives a mean convergence result; we now deal with its rate. Again, let (X, Y) be the outcome of a hidden Markov chain Z with μ, P, and π as initial distribution, transition matrix, and stationary distribution respectively. In this section we impose the additional restriction that the emission distributions for all states in the hidden chain are symmetric (this is discussed further in Proposition 3.1 and in the Appendix): for all x, y ∊ 𝒜 and all z ∊ S, P(Xi = x, Yi = y | Zi = z) = P(Xi = y, Yi = x | Zi = z). Symmetry clearly implies that the conditional law of X given Z and of Y given Z are the same since, for all x, y and z,

$$\matrix{ {{\rm{P}}({X_i} = x|{Z_i} = z)} \hfill & = \hfill & {\sum\limits_{y \in {\cal A}} {\rm{P}}({X_i} = x,{Y_i} = y|{Z_i} = z) = \sum\limits_{y \in {\cal A}} {\rm{P}}({X_i} = y,{Y_i} = x|{Z_i} = z)} \hfill \cr {} \hfill & = \hfill & {{\rm{P}}({Y_i} = x|{Z_i} = z).} \hfill \cr } $$

In turn this implies that Xi and Yi are identically distributed.

Moreover, we need to control the dependency between X and Y, and a way to do so is via the β-mixing coefficient as given in Definition 3.3 of [Reference Bradley5], which we now recall.

Definition 3.1. Let ℱ1 and ℱ2 be two σ-fields ⊂ ℱ. Then the β-mixing coefficient associated with these sub-σ-fields of ℱ is given by:

$$\beta ({{\cal F}_1},{{\cal F}_2}): = {1 \over 2}\sup \sum\limits_{i = 1}^I \sum\limits_{j = 1}^J |{\rm{P}}({A_i} \cap {B_j}) - {\rm{P}}({A_i}){\rm{P}}({B_j})|,$$

where the supremum is taken over all pairs of finite partitions {A 1,..., AI} and {B 1,..., BJ} of Ω such that Ai ∊ ℱ1, for all i ∊ {1,..., I}, I ≥ 1, and Bj ∊ ℱ2 for all j ∊ {1,..., J}, J ≥ 1.

In our case the above notion of the β-mixing coefficient is adopted for the σ -fields generated by two sequences. Moreover, by [Reference Bradley5, Proposition 3.21], for a fixed n ≥ 1, and since X (n) = (X 1,..., Xn) and Y (n) = (Y 1,..., Yn) are discrete random vectors,

$$\matrix{ {\beta (n)} \hfill & {: = } \hfill & {\beta (\sigma ({X^{(n)}}),\sigma ({Y^{(n)}}))} \hfill \cr {} \hfill & = \hfill & {{1 \over 2}\sum\limits_{u \in {{\cal A}^n}} \sum\limits_{v \in {{\cal A}^n}} |{\rm{P}}({X^{(n)}} = u,{Y^{(n)}} = v) - {\rm{P}}({X^{(n)}} = u){\rm{P}}({Y^{(n)}} = v)|,} \hfill \cr } $$

where σ(X (n)) and σ(Y (n)) are the σ-fields generated by X (n) and Y (n). Clearly X (n) and Y (n) are independent if and only if β(n) = 0. Further, set β * ≔ limn →∞ β(n), where the limit exists since β(n) is non-decreasing in n and β(n) ∊ [0, 1] (see [Reference Bradley5, Section 5]).
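
For very small n, the coefficient β(n) can be evaluated directly from the displayed formula by enumerating 𝒜n × 𝒜n; the brute-force Python sketch below (ours, purely illustrative and exponential in n) does exactly that.

```python
import itertools
import numpy as np

def beta_n(joint_prob, alphabet, n):
    """beta(sigma(X^(n)), sigma(Y^(n))) computed by brute force from a function
    joint_prob(u, v) = P(X^(n) = u, Y^(n) = v) for words u, v in alphabet^n.
    Exponential in n, so only usable for very small n."""
    words = list(itertools.product(alphabet, repeat=n))
    J = np.array([[joint_prob(u, v) for v in words] for u in words])
    px = J.sum(axis=1)                      # marginal law of X^(n)
    py = J.sum(axis=0)                      # marginal law of Y^(n)
    return 0.5 * np.abs(J - np.outer(px, py)).sum()

# Sanity checks on a binary alphabet with n = 2.
indep = lambda u, v: 0.25 ** len(u)                        # X^(n), Y^(n) independent uniform
equal = lambda u, v: (0.5 ** len(u)) if u == v else 0.0    # X_i = Y_i, uniform letters
print(beta_n(indep, [0, 1], 2))   # 0.0
print(beta_n(equal, [0, 1], 2))   # 0.75, i.e. 1 - 1/|A|^n with |A| = 2, n = 2
```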

Remark 3.1. (i) Another definition of the β-mixing coefficient based on ‘past’ and ‘future’ is often studied in the literature: see, for instance, [Reference Bradley4, Section 2]. For a single sequence of random variables S = (Sk)k ∊ℤ and for −∞ ≤ J ≤ L ≤ ∞, let

$${\cal F}_J^L: = \sigma ({S_k},J \le k \le L),$$

and for each n ≥ 1, let

$${\beta _n}: = \mathop {\sup }\limits_{j \in \mathbb Z} \beta ({\cal F}_{ - \infty }^j,{\cal F}_{j + n}^\infty ).$$

In particular, [Reference Bradley4, Theorem 3.2] implies that if S is a strictly stationary, finite-state Markov chain that is also irreducible and aperiodic, βn → 0 as n → ∞. The mixing definition relevant to the approach here is different and this limiting behavior does not follow. A further discussion of the values of β(n) is included in Remark 3.3 (i).

(ii) One might also be interested in using the α-mixing coefficient, defined for σ-fields 𝒮 and 𝒯 as:

$$\alpha ({\cal S},{\cal T}) = 2\sup \{ |{\rm{Cov}}({{\bf{1}}_S},{{\bf{1}}_T})|{\kern 1pt} :{\kern 1pt} (S,T) \in {\cal S} \times {\cal T}\} .$$

Suppose further that 𝒯 has exactly N atoms. The following holds (see [Reference Bradley4] and [Reference Bradley3, Theorem 1]):

$$2\alpha ({\cal S},{\cal T}) \le \beta ({\cal S},{\cal T}) \le {(8N)^{1/2}}\alpha ({\cal S},{\cal T}).$$

However, for this setting the number of atoms N will be |𝒜|n, and since α(n) ≔ α(σ (X (n)), σ(Y (n))) is increasing, a bound on β(n) using the inequality above is useless.

The following rate of convergence is the main result:

Theorem 3.1. Let (Z, (X, Y)) be a hidden Markov model, where the sequence Z is an aperiodic time-homogeneous and irreducible Markov chain with finite state space S. Let the distribution of the pairs (Xi, Yi),i = 1, 2, 3,..., be symmetric for all states in Z. Then, for all n ≥ 2,

(3.1) $${{{\rm{E}}[L{C_n}]} \over n} \ge {\gamma ^*} - 2{\beta ^*} - C\sqrt {{{\ln n} \over n}} - {2 \over n} - (1 - {{\bf{1}}_{\mu = \pi }})({1 \over {\sqrt n }} + c{\alpha ^{\sqrt n }}),$$

where α ∊ (0, 1), c > 0 are constants as in (2.2) and C > 0. All constants depend on the parameters of the model but not on n. Moreover, with the same α and c,

(3.2) $${{{\rm{E}}[L{C_n}]} \over n} \le {\gamma ^*} + (1 - {{\bf{1}}_{\mu = \pi }})({1 \over {\sqrt n }} + c{\alpha ^{\sqrt n }}).$$

A key ingredient in proving Theorem 3.1 is a Hoeffding-type inequality for Markov chains, a particular case of a result due to Paulin [Reference Paulin14], which is now recalled. It relies on the mixing time τ(ε) of the Markov chain Z given by

$$\tau (\varepsilon ): = \min \{ t \in \mathbb N :{\bar d_Z}(t) \le \varepsilon \} ,$$

where

$${\overline d _Z}(t): = \mathop {\max }\limits_{1 \le i \le N - t} \mathop {\sup }\limits_{x,y \in {\Lambda _i}} {d_{{\rm{T}}V}}({\cal L}({Z_{i + t}}|{Z_i} = x),{\cal L}({Z_{i + t}}|{Z_i} = y)),$$

and where $${d_{{\rm{T}}V}}(\mu ,\nu ) = {1 \over 2}\sum\nolimits_{x \in \Omega } |\mu (x) - \nu (x)|$$ is the total variation distance between the two probability measures μ and ν on the finite set Ω.
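
For a time-homogeneous chain, d̄Z(t) does not depend on i and is simply the maximal total variation distance between two rows of the t-step transition matrix; the Python sketch below (ours, with hypothetical names) computes τ(ε) this way for a small example, assuming the chain is irreducible and aperiodic so that the loop terminates.

```python
import numpy as np

def mixing_time(P, eps):
    """tau(eps) for a time-homogeneous chain with transition matrix P: the smallest t
    such that max_{x,y} d_TV(P^t(x, .), P^t(y, .)) <= eps.  Assumes P is the transition
    matrix of an irreducible, aperiodic chain (otherwise the loop may not terminate)."""
    def d_bar(t):
        Pt = np.linalg.matrix_power(P, t)
        return max(0.5 * np.abs(Pt[x] - Pt[y]).sum()
                   for x in range(len(P)) for y in range(len(P)))
    t = 1
    while d_bar(t) > eps:
        t += 1
    return t

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(mixing_time(P, 0.25))   # smallest t with d_bar(t) <= 1/4
```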

Lemma 3.1. Let M ≔ (M 1,..., MN) be a (not necessarily time-homogeneous) Markov chain, taking values in a Polish space Λ = Λ1 × ⋯ × ΛN, with mixing time τ(ε), 0 ≤ ε ≤ 1. Let

$${\tau _{{\rm{m}}in}}: = \mathop {\inf }\limits_{0 \le \varepsilon \lt 1} \tau (\varepsilon ){\Big({{2 - \varepsilon } \over {1 - \varepsilon }}\Big)^2},$$

and let f : Λ → ℝ be such that there is $$c \in \mathbb R_ + ^N$$ with $$|{\kern 1pt} f(u) - f(v)| \le \sum\nolimits_{i = 1}^N {c_i}{{\bf{1}}_{{u_i} \ne {v_i}}}$$. Then, for any t ≥ 0,

(3.3) $${\rm{P}}(f(M) - {\rm{E}}{\kern 1pt} f(M) \ge t) \le \exp ({{ - 2{t^2}} \over {{\tau _{{\rm{m}}in}}\sum\limits_{i = 1}^N c_i^2}}).$$

For our purposes, the Hoeffding-type inequality used below follows directly from (3.3) once we note that (Zi, Xi, Yi)i≥1 is jointly a Markov chain on a bigger state space. Let τ(ε) be the mixing time of this chain. Taking f to be the length of the longest common subsequences of X 1,..., Xn and Y 1,..., Yn we have c = ((0,..., 0), (1,..., 1)) ∊ ℝn × ℝ2n, since f is a function of Z, X, and Y whose values do not depend on Z. Letting $$A: = \sqrt {{\tau _{{\rm{m}}in}}/2} $$, (3.3) becomes

(3.4) $${\rm{P}}(L{C_n} - {\rm{E}}[L{C_n}] \ge t) \le \exp [{{ - {t^2}} \over {{A^2}n}}]$$

for all t ≥ 0.

Remark 3.2. (i) When X and Y are generated by two independent hidden chains ZX and ZY, the same reasoning yields (3.4), where now $$\tilde \tau (\varepsilon )$$ is the mixing time of the chain $${(Z_n^X,Z_n^Y,{X_n},{Y_n})_{n \ge 1}}$$.

(ii) The mixing time τ(ε) of (Zn, Xn, Yn)n ≥1 is the same as the mixing time $$\tilde \tau (\varepsilon )$$ of the chain (Zn)n ≥1. Two proofs of this fact are provided in the Appendix.

Proof of Theorem 3.1. First, recall a result of Berbee [Reference Berbee2] (see also [Reference Doukhan8, Theorem 1, Section 1.2.1], [Reference Rio16, Chapter 5], and [Reference Goldstein11]), asserting that on our probability space, which is rich enough, there exists $${Y^{*(n)}}: = (Y_1^*, \ldots ,Y_n^*)$$, independent of (Z, X)(n) = ((Z 1, X 1),..., (Zn, Xn)), having the same law as Y (n) = (Y 1,..., Yn) and such that

(3.5) $${\rm{P}}({Y^{(n)}} \ne {Y^{*(n)}}) = \beta (n),$$

where β(n) = β(σ((Z, X)(n)), σ(Y (n))) is the β-mixing coefficient of (Z, X)(n) and Y (n). Note also that if (Yi)i ≥1 is stationary, then $$(Y_1^*, \ldots ,Y_k^*)$$ and $$(Y_\ell ^*, \ldots ,Y_{\ell + k - 1}^*)$$ are identically distributed for every ℓ, k ≥ 1, and that if (X (n), Y (n)) is symmetric, then so is (X (n), Y *(n)), where X (n) = (X 1,..., Xn). Note finally that this implies that Y *(n) is independent of both X (n) and Z (n) = (Z 1,..., Zn).

Next, fix k ∊ ℕ; the idea of the proof is to relate E[LCkn] to E[LC 2n]. For k = 4, this is done in the i.i.d. case in [Reference Rhee15]. However, we wish to take k → ∞ and therefore follow arguments presented for the i.i.d. case in [Reference Lember, Matzinger and Torres12]. Call (ν, τ) ≔ (ν 1,..., νr +1, τ 1,..., τr +1) an r-partition with k ≤ r ≤ ⌈2kn/(2n − 1)⌉ if

(3.6) $$\eqalign{ & 1 = {\nu _1} \le {\nu _2} \le \cdots \le {\nu _{r + 1}} = kn + 1, \cr & 1 = {\tau _1} \le {\tau _2} \le \cdots \le {\tau _{r + 1}} = kn + 1, \cr & ({\nu _{j + 1}} - {\nu _j}) + ({\tau _{j + 1}} - {\tau _j}) \in \{ 2n - 1,2n\} ,\quad {\rm{for}}\,j \in [1,r - 1], \cr & ({\nu _{r + 1}} - {\nu _r}) + ({\tau _{r + 1}} - {\tau _r}) \lt 2n. \cr} $$

Let $${\cal B}_{k,n}^r$$ be the set of all r-partitions defined as above, and let

$${{\cal B}_{k,n}} = \bigcup\limits_{r = k}^{\lceil 2kn/(2n - 1)\rceil } {\cal B}_{k,n}^r.$$

If (ν, τ) is an r-partition, then setting

$$L{C_{kn}}(\nu ,\tau ): = \sum\limits_{i = 1}^r LCS({X_{{\nu _i}}}, \ldots ,{X_{{\nu _{i + 1}} - 1}};{Y_{{\tau _i}}}, \ldots ,{Y_{{\tau _{i + 1}} - 1}})$$

we have:

$$L{C_{kn}} = \mathop {\max }\limits_{(\nu ,\tau ) \in {{\cal B}_{k,n}}} L{C_{kn}}(\nu ,\tau ).$$

Let νi +1 − νi = n − m, τi +1 − τi ≤ n + m for m ∊ (−n, n), and let ℓ ≔ τi − νi + 1. Then

(3.7) $$\matrix{ {{\rm{E}}[LCS({X_{{\nu _i}}}, \ldots ,{X_{{\nu _{i + 1}} - 1}}{\kern 1pt} ;{\kern 1pt} {Y_{{\tau _i}}}, \ldots ,{Y_{{\tau _{i + 1}} - 1}})]{\rm{ }}} \cr { = {\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}}{\kern 1pt} ;{\kern 1pt} {Y_\ell }, \ldots ,{Y_{\ell + n + m - 1}})]{\rm{ }}} \cr { \le {\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}}{\kern 1pt} ;{\kern 1pt} Y_\ell ^*, \ldots ,Y_{\ell + n + m - 1}^*){{\bf{1}}_{{Y^{(kn)}} = {Y^{*(kn)}}}}]} \cr } $$
(3.8) $$\quad + \min (n - m,n + m){\rm{P}}({Y^{(kn)}} \ne {Y^{*(kn)}})$$
(3.9) $$ \le {\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}}{\kern 1pt} ;{\kern 1pt} Y_\ell ^*, \ldots ,Y_{\ell + n + m - 1}^*)] + n\beta (kn).$$

In the last expression the LCS is now a function of two independent sequences. Stationarity implies (3.7), and $$LCS({X_1}, \ldots ,{X_{n - m}}{\kern 1pt} ;{\kern 1pt} Y_\ell ^*, \ldots ,Y_{\ell + n + m - 1}^*) \le \min (n - m,n + m)$$ leads to (3.8).

The error term nβ(kn) in (3.9) follows from an application of Berbee’s result (3.5). The same properties also imply that

(3.10) $$\matrix{ {{\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}};Y_\ell ^*, \ldots ,Y_{\ell + n + m - 1}^*)]{\rm{ }}} \cr { = {\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}};Y_1^*, \ldots ,Y_{n + m}^*)]{\rm{ }}} \cr { \le {\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}};{Y_1}, \ldots ,{Y_{n + m}})] + n\beta (kn),} \cr } $$

and

(3.11) $$\matrix{ {{\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}};Y_\ell ^*, \ldots ,Y_{\ell + n + m - 1}^*)]} \cr { = {\rm{E}}[LCS({X_1}, \ldots ,{X_{n + m}};Y_1^*, \ldots ,Y_{n - m}^*)]} \cr } $$
(3.12) $$ \le {\rm{E}}[LCS({X_{n - m + 1}}, \ldots ,{X_{2n}};{Y_{n + m + 1}}, \ldots ,{Y_{2n}})] + n\beta (kn),$$

where the symmetry of the distributions of X and Y * is used to get (3.11). Next, by the superadditivity of the LCSs as well as (3.9), (3.10), and (3.12),

(3.13) $$\matrix{ {{\rm{E}}[LCS({X_{{\nu _i}}}, \ldots ,{X_{{\nu _{i + 1}} - 1}};{Y_{{\tau _i}}}, \ldots ,{Y_{{\tau _{i + 1}} - 1}})]} \hfill \cr {{\rm{ }} \le {1 \over 2}({\rm{E}}[LCS({X_1}, \ldots ,{X_{n - m}};{Y_1}, \ldots ,{Y_{n + m}})]} \hfill \cr {\quad {\rm{ }} + {\rm{E}}[LCS({X_{n - m + 1}}, \ldots ,{X_{2n}};{Y_{n + m + 1}}, \ldots ,{Y_{2n}})] + 2n\beta (kn)) + n\beta (kn)} \hfill \cr {{\rm{ }} \le {1 \over 2}({\rm{E}}[L{C_{2n}}] + 2n\beta (kn)) + n\beta (kn)} \hfill \cr {{\rm{ }} = {1 \over 2}{\rm{E}}[L{C_{2n}}] + 2n\beta (kn).} \hfill \cr } $$

This inequality is key to the proof, since it yields an upper bound on E[LCkn(ν, τ)] in terms of E[LC 2n], a quantity that does not depend on the partitioning (ν, τ). A similar result is central to the proof of the rate of convergence in the independent setting [Reference Alexander1]. However, independence allows one to get (3.13) directly, without the need to introduce β-mixing coefficients. Moreover, the approach here is more direct. Applying Hoeffding’s inequality and summing over all partitions provides a relation between E[LCkn] and E[LC 2n] which can be used to get the rate of convergence. Indeed,

$${\rm{E}}[L{C_{kn}}(\nu ,\tau )] \le {r \over 2}({\rm{E}}[L{C_{2n}}] + 4n\beta (kn)) \le {1 \over 2}{{2kn} \over {2n - 1}}({\rm{E}}[L{C_{2n}}] + 4n\beta (kn)).$$

In addition, for t > 0,

$$\matrix{ {{\rm{P}}(L{C_{kn}}(\nu ,\tau ) - {1 \over 2}{{2kn} \over {2n - 1}}({\rm{E}}[L{C_{2n}}] + 4n\beta (kn)) > tkn)} \cr { \le {\rm{P}}(L{C_{kn}}(\nu ,\tau ) - {\rm{E}}[L{C_{kn}}(\nu ,\tau )] > tkn){\rm{ }}} \cr { \le \exp [ - {{{t^2}kn} \over {{A^2}}}],{\rm{ }}} \cr } $$

where the second inequality follows from Lemma 3.1. Next, note that

$$\matrix{ {{\rm{P}}(L{C_{kn}} - {1 \over 2}{{2kn} \over {2n - 1}}({\rm{E}}[L{C_{2n}}] + 4n\beta (kn)) > tkn){\rm{ }}} \cr { = \sum\limits_{(\nu ,\tau ) \in {{\cal B}_{k,n}}} {\rm{P}}(L{C_{kn}}(\nu ,\tau ) - {1 \over 2}{{2kn} \over {2n - 1}}({\rm{E}}[L{C_{2n}}] + 4n\beta (kn)) > tkn)} \cr { \le |{{\cal B}_{k,n}}|\exp [ - {{{t^2}kn} \over {{A^2}}}].{\rm{ }}} \cr } $$

The above can be rewritten as:

$${\rm{P}}({{L{C_{kn}}} \over {kn}} > t + {1 \over k}{{2kn} \over {2n - 1}}({{{\rm{E}}[L{C_{2n}}]} \over {2n}} + 2\beta (kn))) \le |{{\cal B}_{k,n}}|\exp [ - {{{t^2}kn} \over {{A^2}}}].$$

Then, since LCknkn,

(3.14) $$\matrix{ {{\rm{E}}[{{L{C_{kn}}} \over {kn}}]} & \le & {t + {1 \over k}{{2kn} \over {2n - 1}}({{{\rm{E}}[L{C_{2n}}]} \over {2n}} + 2\beta (kn))} \cr {} & {} & { + {\rm{P}}({{L{C_{kn}}} \over {kn}} > t + {1 \over k}{{2kn} \over {2n - 1}}{{{\rm{E}}[L{C_{2n}}]} \over {2n}})} \cr {} & \le & {t + {1 \over k}{{2kn} \over {2n - 1}}({{{\rm{E}}[L{C_{2n}}]} \over {2n}} + 2\beta (kn)) + |{{\cal B}_{k,n}}|\exp [ - {{{t^2}kn} \over {{A^2}}}].} \cr } $$

Next, a bound on |ℬk,n| is obtained using methods as in [Reference Lember, Matzinger and Torres12]. Recall that k ≤ r ≤ ⌈2kn/(2n − 1)⌉ and that $${{\cal B}_{k,n}} = \bigcup\nolimits_{r = k}^{\lceil 2kn/(2n - 1)\rceil } {\cal B}_{k,n}^r$$. Now

(3.15) $$|{\cal B}_{k,n}^r| \le {2^{r - 1}}2n\left( {\matrix{ {nk + r - 1} \cr {r - 1} \cr } } \right).$$

Indeed, the sizes of the partitions on the X side should sum to nk, which gives a factor of less than $$\left( {\matrix{ {nk + r - 1} \cr {r - 1} \cr } } \right)$$. Also, for each choice of the first r − 1 elements of the partition on the X side we have at most 2 choices on the Y side. The last interval can take at most 2n values, as per (3.6). Recall Stirling’s formula (see [Reference Feller10]): for n ≥ 1,

$${n^n}{{\rm{e}}^{ - n}}\sqrt {2\pi n} {{\rm{e}}^{1/(12n + 1)}} \le n! \le {n^n}{{\rm{e}}^{ - n}}\sqrt {2\pi n} {{\rm{e}}^{1/12n}}.$$
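
As a quick numerical illustration (ours) of how tight this two-sided bound already is, for n = 5 it gives

$${5^5}{{\rm{e}}^{ - 5}}\sqrt {10\pi } \,{{\rm{e}}^{1/61}} \approx 119.97 \le 5! = 120 \le {5^5}{{\rm{e}}^{ - 5}}\sqrt {10\pi } \,{{\rm{e}}^{1/60}} \approx 120.003.$$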

Since k → ∞ at the end of the proof, this bound can be used in (3.15) to obtain:

$$\eqalign{ |{\cal B}_{k,n}^r| & \le ({2^{r - 1}} \cdot 2n){{{{(nk + r - 1)}^{nk + r - 1}}\sqrt {2\pi (nk + r - 1)} \,{{\rm{e}}^{1/(12(nk + r - 1))}}} \over {{{(r - 1)}^{r - 1}}\sqrt {2\pi (r - 1)} \,{{\rm{e}}^{1/(12(r - 1) + 1)}}{{(nk)}^{nk}}\sqrt {2\pi nk} \,{{\rm{e}}^{1/(12nk + 1)}}}} \cr & \le {2^r}n{{{{(nk + r - 1)}^{nk + r - 1}}} \over {{{(r - 1)}^{r - 1}}{{(nk)}^{nk}}}} \cr & \le {2^r}n{\Big(1 + {{nk} \over {r - 1}}\Big)^{r - 1}}{\Big(1 + {2 \over {2n - 1}}\Big)^{nk}} \cr & \le {2^r}n{\Big(1 + n + {n \over {k - 1}}\Big)^{{{2nk} \over {2n - 1}}}}{\Big({{2n + 1} \over {2n - 1}}\Big)^{nk}}. \cr} $$

The last inequality in the above expression holds true since k ≤ r ≤ ⌈2kn/(2n − 1)⌉. Then, for |ℬk,n| we get:

$$\eqalign{ |{{\cal B}_{k,n}}| & \le \Big({{2nk} \over {2n - 1}} - k + 2\Big)\mathop {\max }\limits_r |{\cal B}_{k,n}^r| \cr & \le \Big({k \over {2n - 1}} + 2\Big){2^r}n{\Big(1 + n + {n \over {k - 1}}\Big)^{{{2nk} \over {2n - 1}}}}{\Big({{2n + 1} \over {2n - 1}}\Big)^{nk}} \cr & \le \exp \Big(\Big({{\ln ({k \over {2n - 1}} + 2)} \over {nk}} + {{r\ln 2 + \ln n} \over {nk}} + {2 \over {2n - 1}}\ln (2n) + \ln \Big({{2n + 1} \over {2n - 1}}\Big)\Big)nk\Big) \cr & \le \exp \Big(\Big({{\ln k} \over k} + {2 \over {2n - 1}}\ln 2 + {2 \over {2n - 1}}\ln (2n) + \ln \Big({{2n + 1} \over {2n - 1}}\Big)\Big)nk\Big) \cr & \le \exp \Big(\Big({{\ln k} \over k} + {4 \over {2n - 1}}\ln 2 + {2 \over {2n - 1}}\ln n + \ln \Big({{2n + 1} \over {2n - 1}}\Big)\Big)nk\Big) \cr & \le \exp (10k\ln n), \cr} $$

where the last inequality holds for large k, in particular k > n, and since ln (1 + x) ≤ x for x > 0. Let $$t = 2A\sqrt {10} \sqrt {\ln n/n} $$. Then

$$\matrix{ {|{{\cal B}_{k,n}}|\exp ( - {{{t^2}kn} \over {{A^2}}})} \hfill & { \le \exp (10k\ln n)\exp ( - {{{t^2}kn} \over {{A^2}}})} \hfill \cr {} \hfill & { \le \exp ( - 30k\ln n).} \hfill \cr } $$
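
Indeed, with this choice of t the two exponents combine as

$${{{t^2}kn} \over {{A^2}}} = {{4{A^2} \cdot 10 \cdot ({{\ln n} / n}) \cdot kn} \over {{A^2}}} = 40k\ln n,\qquad {\rm{so\ that}}\qquad \exp (10k\ln n - 40k\ln n) = \exp ( - 30k\ln n).$$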

Next, note that, as k → ∞, E[LCkn/(kn)] → γ* and that

$${1 \over k}{{2kn} \over {2n - 1}} \le {1 \over k}({{2kn} \over {2n - 1}} + 1) \to {{2n} \over {2n - 1}}.$$

Recall also that β * = limn →∞ β(n) = limk →∞ β(kn). Then (3.14) implies that

(3.16) $${{2n} \over {2n - 1}}({{{\rm{E}}[L{C_{2n}}]} \over {2n}} + 2{\beta ^*}) \ge {\gamma ^*} - 2A\sqrt {10} \sqrt {{{\ln n} \over n}} ,$$

and, finally,

$$\matrix{ {{{{\rm{E}}[L{C_{2n}}]} \over {2n}}} \hfill & \ge \hfill & {{{2n - 1} \over {2n}}({\gamma ^*} - 2A\sqrt {10} \sqrt {{{\ln n} \over n}} ) - 2{\beta ^*}} \hfill \cr {} \hfill & \ge \hfill & {{\gamma ^*} - 2{\beta ^*} - 2A\sqrt {10} \sqrt {{{\ln n} \over n}} - {1 \over {2n}}.} \hfill \cr } $$

To get the result for words of odd length note that, by (3.16),

$$\matrix{ {{{{\rm{E}}[L{C_{2n + 1}}]} \over {2n + 1}}} & \ge & {{{{\rm{E}}[L{C_{2n}}]} \over {2n + 1}}{\rm{ }}} \cr {} & \ge & {{{2n - 1} \over {2n + 1}}({\gamma ^*} - 2A\sqrt {10} \sqrt {{{\ln n} \over n}} ) - {{2n} \over {2n + 1}}2{\beta ^*}} \cr {} & \ge & {{\gamma ^*} - 2{\beta ^*} - 2A\sqrt {10} \sqrt {{{\ln n} \over n}} - {2 \over {2n + 1}}.} \cr } $$

Of course, these last bounds are only of interest, for n large enough, if γ* > 2β *. Otherwise we get the trivial lower bound 0 (see Remark 3.3 below). We are then left with slightly modifying the constants to get (3.1). The extra term on the right-hand side in (3.1) accounts for the difference in initial distributions, as in (2.1).

The proof of the upper bound (3.2), where symmetry is not needed, follows by combining Fekete’s lemma (see [Reference Steele19]) with (2.1) and (2.2).

Remark 3.3. (i) Recall that the β-mixing coefficient β(n) is a measure of the dependency between (X 1,..., Xn) and (Y 1,..., Yn). The bounds in Theorem 3.1 rely on β * ≔ limn →∞ β(n), which somehow quantifies a weak dependency requirement, and β * ≠ 0 unless the sequences X and Y are independent. Note also that the lower bound in Theorem 3.1 is meaningful only if 2β * < γ*. Besides the independent case, there are instances for which this condition is satisfied. For example, let X and Y both be Markov chains with L states and with the same transition matrix P, where some rows of P are equal to (1, 1, 1,..., 1)/L, i.e. such that there exists a set of states such that the transition probability between each one of these states is uniform. Let the initial distribution of X 1 be μ with μ(x) = 0 if x is not in this set of states, and assume that Y 1 = X 1. Then the sequence $$\tilde Y$$ defined, for all n, via $${\tilde Y_i} = {Y_i}$$ for i ≥ 2, while $${\tilde Y_1}$$ is distributed according to μ independently of everything else, will be such that $${\tilde Y^{(n)}}$$ and Y (n) have the same distribution. Moreover, for all n, $${\tilde Y^{(n)}}$$ and X (n) will be independent and $${\rm{P}}({\tilde Y^{(n)}} \ne {Y^{(n)}}) \ge \beta (n)$$, but $${\rm{P}}({\tilde Y^{(n)}} \ne {Y^{(n)}}) = {\rm{P}}({Y_1} \ne {\tilde Y_1})$$, which can be made as small as desired for a suitable choice of μ. Thus the lower bound in Theorem 3.1 holds and is meaningful.

(ii) There are instances when the lower bound in Theorem 3.1 is vacuous. Such a case is when Xi = Yi for all i ≥ 1 and the Xi are independent and uniformly distributed over the letters in 𝒜. Then it is clear that γ* = 1, whereas we can show that

$$\beta (n) = 1 - {1 \over {|{\cal A}{|^n}}},$$

and so β * = 1. In this case the lower bound in (3.1) is a negative quantity.
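
Indeed, in this case P(X (n) = u, Y (n) = v) = 1u=v/|𝒜|n, so that (a short verification included here for completeness)

$$\beta (n) = {1 \over 2}\sum\limits_{u,v \in {{\cal A}^n}} \Big|{{{{\bf{1}}_{u = v}}} \over {|{\cal A}{|^n}}} - {1 \over {|{\cal A}{|^{2n}}}}\Big| = {1 \over 2}\Big[|{\cal A}{|^n}\Big({1 \over {|{\cal A}{|^n}}} - {1 \over {|{\cal A}{|^{2n}}}}\Big) + (|{\cal A}{|^{2n}} - |{\cal A}{|^n}){1 \over {|{\cal A}{|^{2n}}}}\Big] = 1 - {1 \over {|{\cal A}{|^n}}}.$$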

(iii) Theorem 3.1 continues to hold for Markov chains with a general state space Λ. Indeed, the Hoeffding inequality (3.4) is true when Λ is a Polish space. The exponential decay (2.2) holds when Λ is petite, i.e. when there exist a positive integer n 0, ε > 0, and a probability measure ν on Λ such that Pn0(x, A) ≥ εν(A) for every measurable A and x ∊ Λ, and where Pn0(x, A) is the n 0-step transition law of the Markov chain (see [Reference Roberts and Rosenthal17, Theorem 8]).

When X and Y are generated by independent hidden Markov models the following variant of Theorem 3.1 holds (for a sketch of the proof, see the Appendix).

Corollary 3.1. Let (ZX, X) and (ZY, Y) be two independent hidden Markov models, where the latent chains ZX and ZY have the same initial distribution, transition matrix, and emission probabilities. Then, for all n ≥ 2,

$${{{\rm{E}}[L{C_n}]} \over n} \ge {\gamma ^*} - C\sqrt {{{\ln n} \over n}} - {2 \over n} - (1 - {{\bf{1}}_{\mu = \pi }})({1 \over {\sqrt n }} + c{\alpha ^{\sqrt n }}),$$

where α ∊ (0, 1), c > 0 are constants as in (2.2) and C > 0. All constants depend on the parameters of the model but not on n. Moreover, with the same α and c,

$${{{\rm{E}}[L{C_n}]} \over n} \le {\gamma ^*} + (1 - {{\bf{1}}_{\mu = \pi }})({1 \over {\sqrt n }} + c{\alpha ^{\sqrt n }}).$$

As mentioned at the end of the proof of Theorem 3.1, the symmetry of the distribution of (Xi, Yi) is used only for proving the lower bound. Let

$$h(n): = \mathop {\max }\limits_{m \in [ - n,n]} (2\sum\limits_{i = 1}^{n - m} {\rm{P}}({X_i} \ne {Y_i}) + \sum\limits_{i = n - m + 1}^{n + m} {\rm{P}}({X_i} \ne {Y_i})).$$

Then the following result holds.

Proposition 3.1. Let (Z, (X, Y)) be a hidden Markov model, where the sequence Z is an aperiodic time-homogeneous and irreducible Markov chain with finite state space S. Then, for all n ≥ 2,

$${{{\rm{E}}[L{C_n}]} \over n} \ge {\gamma ^*} - {{h(n)} \over n} - 2{\beta ^*} - C\sqrt {{{\ln n} \over n}} - {2 \over n} - (1 - {{\bf{1}}_{\mu = \pi }})({1 \over {\sqrt n }} + c{\alpha ^{\sqrt n }}).$$

Appendix A

First, as asserted in Remark 3.2 (ii), this appendix provides two proofs of the fact that the mixing time τ(ε) of (Zn, Xn, Yn)n ≥1 is the same as the mixing time $$\tilde \tau (\varepsilon )$$ of the chain (Zn)n ≥1.

Proof 1. Let $$\tilde T = ({\tilde T_n}{)_{n \ge 1}}$$ be a Markov chain with finite state space 𝒮. Each $${\tilde T_i}$$ emits an observed variable Ti according to some probability distribution that depends only on the state $${\tilde T_i}$$. Let T = (Tn)n ≥1 and assume Ti ∊ 𝒜, a finite alphabet. Note that $$(\tilde T,T)$$ is a Markov chain; let τ(ε) be its mixing time, and let $$\tilde \tau (\varepsilon )$$ be the mixing time for the hidden chain $$\tilde T$$. Then

$$\eqalign{ & {d_{{\rm{T}}V}}({\cal L}(({{\tilde T}_{i + t}},{T_{i + t}})|({{\tilde T}_i},{T_i}) = (x,u)),{\cal L}(({{\tilde T}_{i + t}},{T_{i + t}})|({{\tilde T}_i},{T_i}) = (y,v))) \cr & \quad = {1 \over 2}\sum\limits_{(z,w) \in {\cal S} \times {\cal A}} |{\rm{P}}(({{\tilde T}_{i + t}},{T_{i + t}}) = (z,w)|({{\tilde T}_i},{T_i}) = (x,u)) - {\rm{P}}(({{\tilde T}_{i + t}},{T_{i + t}}) = (z,w)|({{\tilde T}_i},{T_i}) = (y,v))| \cr & \quad = {1 \over 2}\sum\limits_{(z,w)} |{\rm{P}}({{\tilde T}_{i + t}} = z|{{\tilde T}_i} = x){\rm{P}}(z \to w) - {\rm{P}}({{\tilde T}_{i + t}} = z|{{\tilde T}_i} = y){\rm{P}}(z \to w)| \cr & \quad = {1 \over 2}\sum\limits_{(z,w)} {\rm{P}}(z \to w)|{\rm{P}}({{\tilde T}_{i + t}} = z|{{\tilde T}_i} = x) - {\rm{P}}({{\tilde T}_{i + t}} = z|{{\tilde T}_i} = y)| \cr & \quad = {1 \over 2}\sum\limits_{z \in {\cal S}} |{\rm{P}}({{\tilde T}_{i + t}} = z|{{\tilde T}_i} = x) - {\rm{P}}({{\tilde T}_{i + t}} = z|{{\tilde T}_i} = y)| \cr & \quad = {d_{{\rm{T}}V}}({\cal L}({{\tilde T}_{i + t}}|{{\tilde T}_i} = x),{\cal L}({{\tilde T}_{i + t}}|{{\tilde T}_i} = y)), \cr} $$

where $${\rm{P}}(z \to w): = {\rm{P}}({T_i} = w|{\tilde T_i} = z)$$, i.e. the probability that a state with value z ∊ 𝒮 emits w ∊ 𝒜. By the definitions of T and $$\tilde T$$ this last probability does not depend on i. Then ∑w ∊𝒜 P(z → w) = 1. Therefore, $${\overline d _{(\tilde T,T)}}(t) = {\overline d _{\tilde T}}(t)$$ and $$\tau (\varepsilon ) = \tilde \tau (\varepsilon )$$.

Proof 2. An alternative approach to proving the result of Remark 3.2 (ii) relies on coupling arguments and was kindly suggested by D. Paulin in personal communications with the authors. First, recall the following classical result [Reference Levin, Peres and Wilmer13, Proposition 4.7].

Lemma A.1. Let μ and ν be two probability distributions on Ω. Then

$${d_{{\rm{T}}V}}(\mu ,\nu ) = \inf \{ {\rm{P}}(X \ne Y)\,:\,(X,Y){\rm{\ is\ a\ coupling\ of\ }}\mu {\rm{\ and\ }}\nu \} .$$

Moreover, there is a coupling (X, Y) which attains the infimum and such a coupling is called optimal.

Let $$({\tilde T^1},{\tilde T^2})$$ be an optimal coupling according to $${d_{{\rm{T}}V}}({\cal L}({\tilde T_t}|{\tilde T_1} = x),{\cal L}({\tilde T_t}|{\tilde T_1} = y))$$ for some x, y ∊ 𝒮, i.e. $${\tilde T^1}$$ and $${\tilde T^2}$$ are Markov chains with the same transition probability as $$\tilde T,{\rm{ }}\tilde T_1^1 = x,{\rm{ }}\tilde T_1^2 = y$$, and

(A.1) $${\rm{P}}(\tilde T_t^1 \ne \tilde T_t^2) = \quad {d_{{\rm{T}}V}}({\cal L}({\tilde T_t}|{\tilde T_1} = x),{\cal L}({\tilde T_t}|{\tilde T_1} = y)).$$

Next, let $$T_t^1$$ and $$T_t^2$$ be respectively distributed according to the distributions associated with $$\tilde T_t^1$$ and $$\tilde T_t^2$$ and be independent of all the other random variables. In addition, if for some $$t \ge 1,{\rm{ }}\tilde T_t^1 = \tilde T_t^2$$, then $$T_t^1 = T_t^2$$. Then

$${\rm{P}}(\tilde T_t^1 \ne \tilde T_t^2) = {\rm{P}}((\tilde T_t^1,T_t^1) \ne (\tilde T_t^2,T_t^2)),$$

and by Lemma A.1, for any u, v ∊ 𝒜 and any i ≥ 1,

$$\matrix{ {{\rm{P}}((\tilde T_t^1,T_t^1) \ne (\tilde T_t^2,T_t^2)){\rm{ }}} \cr { \ge \quad {d_{{\rm{T}}V}}({\cal L}(({{\tilde T}_{i + t}},{T_{i + t}})|({{\tilde T}_i},{T_i}) = (x,u)),{\cal L}(({{\tilde T}_{i + t}},{T_{i + t}})|({{\tilde T}_i},{T_i}) = (y,v))).} \cr } $$

Together with (A.1), the above yields

$$\matrix{ {{d_{{\rm{T}}V}}({\cal L}({{\tilde T}_t}|{{\tilde T}_1} = x),{\cal L}({{\tilde T}_t}|{{\tilde T}_1} = y))} \cr { \ge \quad {d_{{\rm{T}}V}}({\cal L}(({{\tilde T}_{i + t}},{T_{i + t}})|({{\tilde T}_i},{T_i}) = (x,u)),{\cal L}(({{\tilde T}_{i + t}},{T_{i + t}})|({{\tilde T}_i},{T_i}) = (y,v))).} \cr } $$

Taking the sup over x, y, u, v gives $${\overline d _{(\tilde T,T)}}(t) \le {\overline d _{\tilde T}}(t)$$.

For the reverse inequality, consider the optimal coupling $$(({\tilde T^1},{T^1}),({\tilde T^2},{T^2}))$$ according to $${d_{{\rm{T}}V}}({\cal L}(({\tilde T_t},{T_t})|({\tilde T_1},{T_1}) = (x,u)),{\cal L}(({\tilde T_t},{T_t})|({\tilde T_1},{T_1}) = (y,v)))$$, for some x, y ∊ 𝒮 and u, v ∊ 𝒜. Then

(A.2) $${\rm{P}}((\tilde T_t^1,T_t^1) \ne (\tilde T_t^2,T_t^2)) = {d_{{\rm{T}}V}}({\cal L}(({\tilde T_t},{T_t})|({\tilde T_1},{T_1}) = (x,u)),{\cal L}(({\tilde T_t},{T_t})|({\tilde T_1},{T_1}) = (y,v)))$$

and

$${\rm{P}}((\tilde T_t^1,T_t^1) \ne (\tilde T_t^2,T_t^2)) \ge {\rm{P}}(\tilde T_t^1 \ne \tilde T_t^2).$$

However, by Lemma A.1, for any i ≥ 1,

(A.3) $${\rm{P}}(\tilde T_t^1 \ne \tilde T_t^2) \ge {d_{{\rm{T}}V}}({\cal L}({\tilde T_{i + t}}|{\tilde T_i} = x),{\cal L}({\tilde T_{i + t}}|{\tilde T_i} = y)).$$

Taking the sup in (A.2), (A.3) gives $${\overline d _{(\tilde T,T)}}(t) \ge {\overline d _{\tilde T}}(t)$$, and then $${\overline d _{(\tilde T,T)}}(t) = {\overline d _{\tilde T}}(t)$$.

Proof of Corollary 3.1. The Hoeffding inequality (3.4) holds as long as (Z, X, Y) is a Markov chain. In addition, (X, Y) has to be symmetric (see the proof of Proposition 3.1) in order for (3.13) to hold. Again, one such setting is when X and Y are two independent HMMs with the same transition matrix for the latent chain and the same emission probabilities. A rate of convergence result then follows from arguments as in Section 3. The bound on |ℬk,n| is the same, and there is a Hoeffding-type inequality for this model as per Remark 3.2 (i). One thing that differs is the bound in (3.13). In the present case it is much easier to get. When started at the stationary distribution, by stationarity, independence, and symmetry, we have:

$$\matrix{ {LCS({X_{{\nu _i}}}, \ldots ,{X_{{\nu _{i + 1}} - 1}};{Y_{{\tau _i}}}, \ldots ,{Y_{{\tau _{i + 1}} - 1}})} \hfill & { \le LCS({X_1}, \ldots ,{X_{n - m}};{Y_1}, \ldots ,{Y_{n + m}})} \hfill \cr {} \hfill & { = LCS({X_1}, \ldots ,{X_{n + m}};{Y_1}, \ldots ,{Y_{n - m}})} \hfill \cr {} \hfill & { \le {1 \over 2}L{C_{2n}}.} \hfill \cr } $$

In particular, there is no need to introduce mixing coefficients in this case (β = 0). When the hidden chains are not started at the stationary distribution we get an error as in (2.4). Then, Theorem 3.1 holds but with constants depending on the new model. Moreover, this setting reduces to the one where X and Y are independent Markov chains by letting each state of the hidden chains emit a unique letter, which can further recover the i.i.d. case originally obtained in [Reference Alexander1].

Proof of Proposition 3.1. The symmetry of the distribution of (X, Y) is only used to get (3.11), which implies that for any m ∊ {−n + 1,..., n − 1}, LCS(X 1,..., X n−m; Y 1,..., Y n+m) and LCS(X 1,..., X n+m; Y 1,..., Y n−m) are identically distributed and bounded above by half of LC 2n. Such a result yields a comparison between E[LC 2n] and E[LCkn], leading as k → ∞ to a lower bound on E[LC 2n] involving γ*. Without assuming symmetry, the step (3.11) in obtaining (3.13) needs to be modified. One way to do so is to make use of the Lipschitz property of the LCS to get the following estimate:

$$\matrix{ {LCS({X_1}, \ldots ,{X_{n - m}};{\kern 1pt} {Y_1}, \ldots ,{Y_{n + m}})} \hfill \cr {{\rm{ }} = LCS({X_1}, \ldots ,{X_{n + m}};{\kern 1pt} {Y_1}, \ldots ,{Y_{n - m}}) + (LCS({X_1}, \ldots ,{X_{n - m}};{\kern 1pt} {Y_1}, \ldots ,{Y_{n + m}})} \hfill \cr {\quad {\rm{ }} - LCS({X_1}, \ldots ,{X_{n + m}};{\kern 1pt} {Y_1}, \ldots ,{Y_{n - m}}))} \hfill \cr {{\rm{ }} \le LCS({Y_1}, \ldots ,{Y_{n - m}};{\kern 1pt} {X_1}, \ldots ,{X_{n + m}}) + 2\sum\limits_{i = 1}^{n - m} {{\bf{1}}_{{X_i} \ne {Y_i}}} + \sum\limits_{i = n - m + 1}^{n + m} {{\bf{1}}_{{X_i} \ne {Y_i}}}.} \hfill \cr } $$

Taking expectations, (3.13) becomes

$${\rm{E}}[LCS({X_{{\nu _i}}}, \ldots ,{X_{{\nu _{i + 1}} - 1}};{\kern 1pt} {Y_{{\tau _i}}}, \ldots ,{Y_{{\tau _{i + 1}} - 1}})] \le {1 \over 2}({\rm{E}}[L{C_{2n}}] + h(n)) + 2n\beta (kn),$$

where

$$h(n): = \mathop {\max }\limits_{m \in [ - n,n]} (2\sum\limits_{i = 1}^{n - m} {\rm{P}}({X_i} \ne {Y_i}) + \sum\limits_{i = n - m + 1}^{n + m} {\rm{P}}({X_i} \ne {Y_i})).$$

This leads to a non-symmetric version of (3.1), namely

(A.4) $${{{\rm{E}}[L{C_n}]} \over n} \ge {\gamma ^*} - C\sqrt {{{\ln n} \over n}} - {{h(n)} \over n} - 2{\beta ^*} - {1 \over {n - 2}} - (1 - {{\bf{1}}_{\mu = \pi }})({1 \over {\sqrt n }} + c{\alpha ^{\sqrt n }}).$$

This completes the proof.

If $$h(n) = O(\sqrt {n\ln n} )$$ then the rate in Proposition 3.1 or (A.4) will be the same as in (3.1). Such will be the case when (Z′, X) and (Z″, Y) are two independent hidden Markov models and Z = (Z′, Z″) is a coupling of the two latent chains such that if $$Z'_i = Z''_i$$ then $$Z'_j = Z''_j$$ for any j > i. Then, (Z, (X, Y)) is a hidden Markov model where Xi = Yi once the two latent chains have met, and, by (2.2), $$h(n) = O(\sqrt {n\ln n} )$$. However, h(n) can be much larger, e.g. of order n. A case in point is when the Xi and Yi are i.i.d. Bernoulli random variables with parameters 1/3 and 1/2 respectively. Then P(Xi ≠ Yi) = P(Xi = 0, Yi = 1) + P(Xi = 1, Yi = 0) = 2/6 + 1/6 = 1/2, for all i ≥ 1, and

$$2\sum\limits_{i = 1}^{n - m} {\rm{P}}({X_i} \ne {Y_i}) + \sum\limits_{i = n - m + 1}^{n + m} {\rm{P}}({X_i} \ne {Y_i}) = 2(n - m) \cdot {\textstyle{1 \over 2}} + 2m \cdot {\textstyle{1 \over 2}} = n.$$

Note also that when X = (Xi)i≥1 and Y = (Yi)i≥1 are independent sequences of random variables, the symmetry assumption is equivalent to X and Y being identically distributed.

Acknowledgements

This material is based in part upon work supported by the National Science Foundation under Grant No. 1440140, while the first author was in residence at the Mathematical Sciences Research Institute in Berkeley, California, during the Fall semester of 2017. His research was also supported in part by grants #246283 and #524678 from the Simons Foundation. The second author was partially supported by the TRIAD NSF grant (award 1740776). Both authors are grateful to J. Spouge and S. Eddy for encouragement and correspondence on the relevance of the hidden Markov models in computational biology.

References

Alexander, K. (1994). The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Prob. 4, 1074–1082.
Berbee, H. C. P. (1979). Random Walks with Stationary Increments and Renewal Theory. Mathematical Centre Tracts, 112. Mathematisch Centrum, Amsterdam.
Bradley, R. (1983). Approximation theorems for strongly mixing random variables. Michigan Math. J. 30, 69–81.
Bradley, R. (2005). Basic properties of strong mixing conditions. A survey and some open questions. Prob. Surveys 2, 104–144.
Bradley, R. (2007). Introduction to Strong Mixing Conditions, Vol. 1. Kendrick Press, Heber City, UT.
Chvátal, V. and Sankoff, D. (1975). Longest common subsequences of two random sequences. J. Appl. Prob. 12, 306–315.
Doeblin, W. (1938). Exposé de la théorie des chaînes simples constantes de Markoff à un nombre fini d’états. Revue Math. de l’Union Interbalkanique 2, 77–105.
Doukhan, P. (1994). Mixing. Properties and Examples. Lecture Notes Statist., 85. Springer, New York.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, Vol. 1, 3rd edn. John Wiley, New York.
Goldstein, S. (1979). Maximal coupling. Z. Wahrsch. verw. Gebiete 46, 193–204.
Lember, J., Matzinger, H. and Torres, F. (2012). The rate of convergence of the mean score in random sequence comparison. Ann. Appl. Prob. 22, 1046–1058.
Levin, D., Peres, Y. and Wilmer, E. (2008). Markov Chains and Mixing Times. AMS, Providence, RI.
Paulin, D. (2015). Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Prob. 20, no. 79.
Rhee, W. (1995). On rates of convergence for common subsequences and first passage time. Ann. Appl. Prob. 5, 44–48.
Rio, E. (2017). Asymptotic Theory of Weakly Dependent Random Processes. Probability Theory and Stochastic Modelling, 80. Springer, Berlin.
Roberts, G. and Rosenthal, J. (2004). General state space Markov chains and MCMC algorithms. Prob. Surveys 1, 20–71.
Steele, M. (1982). Long common subsequences and the proximity of two random strings. SIAM J. Appl. Math. 42, 731–737.
Steele, M. (1997). Probability Theory and Combinatorial Optimization. SIAM, Philadelphia, PA, pp. 18–21.
Thorisson, H. (2000). Coupling, Stationarity and Regeneration. Probability and its Applications. Springer, New York.