1. Introduction: framework and motivation
The well-known Pearson chi-squared goodness-of-fit test is a statistical test applied to categorical data to establish whether an observed frequency distribution differs from a theoretical probability distribution. In this test the observations are assumed to be independent and identically distributed (i.i.d.). Under this hypothesis, in a multinomial sample of size N, the chi-squared statistic

$$\chi^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}=N\sum_{i=1}^{k}\frac{(\widehat{p}_i-p_i)^2}{p_i} \tag{1}$$

(where k is the number of possible values, and $O_{i}$ , $E_{i}$ , $\widehat{p}_i=O_i/N$ , and ${p}_i=E_i/N$ are the observed and expected absolute and relative frequencies, respectively) is equal to N multiplied by the chi-squared distance between the observed and expected probabilities. Therefore, the goodness-of-fit test based on this statistic is highly sensitive to the sample size N (see, for instance, [Reference Bergh8, Reference Knoke, Bohrnstedt and Potter Mee36]): the larger N, the more significant a small value of the chi-squared distance. More precisely, the value of the chi-squared distance has to be compared with the ‘critical’ value $\chi^2_{1-\theta}(k-1)/N$ , where $\chi^2_{1-\theta}(k-1)$ denotes the quantile of order $1-\theta$ of the chi-squared distribution $\chi^2(k-1)$ with $k-1$ degrees of freedom. Hence, it is clear that the larger N, the easier the rejection of $H_0$ . As a consequence, in the context of ‘big data’ (e.g. [Reference Bergh8, Reference Bertoni11]), where one often works with correlated, noisy data, suitable generative models and related chi-squared goodness-of-fit tests are needed.
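To make the sensitivity to N concrete, the following minimal Python sketch (our illustration, assuming SciPy is available; all numbers are arbitrary) computes the critical value $\chi^2_{1-\theta}(k-1)/N$ for a fixed chi-squared distance and increasing sample sizes:

```python
from scipy.stats import chi2

# The critical value chi2_{1-theta}(k-1)/N shrinks like 1/N, so a fixed
# small chi-squared distance eventually leads to rejection of H0.
k, theta = 5, 0.05
quantile = chi2.ppf(1 - theta, df=k - 1)   # chi^2_{1-theta}(k-1)
distance = 0.001                           # a fixed (small) chi-squared distance

for N in [10**2, 10**3, 10**4, 10**5]:
    critical = quantile / N
    print(f"N={N:>6}  critical={critical:.6f}  reject H0: {distance > critical}")
```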
Different types of correlation have been taken into account and different techniques have been developed to control the performance of the goodness-of-fit test based on (1) (see, among others, [Reference Bergh8, Reference Chanda14, Reference Gasser26, Reference Gleser and Moore29, Reference Pan43, Reference Pei, Tang and Guo44, Reference Radlow and Alf46, Reference Tang, Pei, Wong and Li52], where some form of correlation is introduced in the sample, and variants of the chi-squared statistic are proposed and analyzed mainly by means of simulations). Our approach differs from the one adopted in the previously quoted papers. Indeed, our starting point is that a natural way to get a positive correlation between events of the same type is to deal with the Dirichlet-multinomial (D-M) distribution: briefly, the parameters of the D-M distribution are randomized a priori with a Dirichlet distribution, yielding an exchangeable (not independent) sequence. The variance–covariance matrix of the D-M distribution is equal to that of the multinomial (M) distribution, multiplied by a fixed constant greater than 1: precisely, given the k parameters $\textbf{b}_{\textbf{0}} = ({b_{0\,1}}, \ldots,{b_{0\,k}})$ of the D-M distribution and setting $|\textbf{b}_{\textbf{0}}| =\sum_{i=1}^k {b_{0\,i}}$ , we have

$$\mathrm{Cov}_{\text{D-M}}(O_i,O_j)=\frac{N+|\textbf{b}_{\textbf{0}}|}{1+|\textbf{b}_{\textbf{0}}|}\,\mathrm{Cov}_{\text{M}}(O_i,O_j).$$
Therefore, if we set $|\textbf{b}_{\textbf{0}}| = \frac{1-\rho^2}{\rho^2}$ , we have, for any $i,j \in \{1, \ldots, k\}$ ,

$$\mathrm{Cov}_{\text{D-M}}(O_i,O_j)=\big(1+(N-1)\rho^2\big)\,\mathrm{Cov}_{\text{M}}(O_i,O_j), \tag{2}$$
where $\rho$ represents a correlation parameter. Roughly speaking, the D-M model adds variance to the multinomial model by taking a mixture or, equivalently, by adding a positive correlation. The property (2) is fundamental for our purpose. In fact, as highlighted in [Reference Rao and Scott47], the two conditions (i) $\widehat{p}_i=O_i/N \to p_i$ almost surely for $N\to\infty$ and (ii) $Cov(O_i,O_j)=\lambda Cov_{\text{M}}(O_i,O_j)$ with $\lambda>1$ imply that the statistic $\chi^2$ , defined in (1), is asymptotically distributed as $\chi^2(k-1)\lambda$ (see [Reference Rao and Scott47, Corollary 2]), so that the critical value for the chi-squared distance becomes $ \chi^2_{1-\theta}(k-1)\lambda/N$ , where $\lambda$ mitigates the effect of N. As already observed, the D-M model satisfies (ii), but it is well known that it does not meet the condition (i). In this paper we present a variant of the D-M model such that the condition (i) holds true and the chi-squared statistic (1) is asymptotically distributed as $\chi^2(k-1)\lambda$ with $\lambda>1$ .
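As a quick numerical sanity check of (2) as reconstructed above (an illustrative sketch, assuming NumPy; the parameter values are arbitrary), one can draw Dirichlet-multinomial samples and compare the empirical variances of the $O_i$ with the inflated multinomial variances:

```python
import numpy as np

rng = np.random.default_rng(0)
k, N, rho2 = 3, 50, 0.1
b0 = np.full(k, ((1 - rho2) / rho2) / k)   # so that |b0| = (1 - rho^2)/rho^2
p0 = b0 / b0.sum()

# Dirichlet-multinomial: the multinomial parameters are randomized a priori.
probs = rng.dirichlet(b0, size=100_000)
O = np.array([rng.multinomial(N, p) for p in probs])

inflation = O.var(axis=0) / (N * p0 * (1 - p0))  # empirical vs. multinomial variance
print(inflation)                                  # each entry close to 1 + (N-1)*rho2
print(1 + (N - 1) * rho2)
```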
The D-M distribution may be generated by means of the standard Eggenberger–Pólya urn (see [Reference Eggenberger and Pólya25, Reference Mahmoud38]), a model that has been widely studied and generalized (some recent variants can be found in [Reference Aletti, Ghiglietti and Paganoni5, Reference Aletti, Ghiglietti and Rosenberger6, Reference Aletti, Ghiglietti and Vidyashankar7, Reference Berti, Crimaldi, Pratelli and Rigo10, Reference Caldarelli, Chessa, Crimaldi and Pammolli12, Reference Caron13, Reference Chen and Kuba15, Reference Collevecchio, Cotar and LiCalzi17, Reference Crimaldi18, Reference Ghiglietti and Paganoni27, Reference Ghiglietti, Vidyashankar and Rosenberger28, Reference Laruelle and Pagés37]). This urn model with k colors works as follows. An urn contains $N_{0\, i}$ balls of color i, for $i=1,\dots,\, k$ . At each discrete time, a ball is drawn from the urn and then placed in the urn again, together with $\alpha>0$ additional balls of the same color. Therefore, if we denote by $N_{n\, i}$ the number of balls of color i in the urn at time n, we have for $n\geq 1$

$$N_{n\,i}=N_{n-1\,i}+\alpha\,\xi_{n\,i},$$
where $\xi_{n\,i}=1$ if the extracted ball at time n is of color i, and $\xi_{n\,i}=0$ otherwise. The parameter $\alpha$ regulates the reinforcement mechanism: the greater $\alpha$ , the greater the dependence of $N_{n\,i}$ on $\sum_{m=1}^n\xi_{m\,i}$ . In addition, it is well known that the conditional expectation of the sequential extractions, i.e. $E[\xi_{n+1\,i} |\, \mbox{`past'}]$ , also known as the predictive mean, converges almost surely to a beta-distributed random variable, forcing the empirical mean $\bar{\xi}_{N\,i} = \sum_{n=1}^N\xi_{n\,i}/N$ to converge almost surely to the same limit.
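For intuition, here is a minimal simulation sketch of the standard Eggenberger–Pólya urn (our illustration with two colors and arbitrary parameter values): the predictive mean settles around a random, run-dependent limit, and the empirical mean follows it.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_steps = 1.0, 20_000
N = np.array([2.0, 3.0])               # initial balls of the two colors

draws = np.empty(n_steps, dtype=int)
for n in range(n_steps):
    i = rng.choice(2, p=N / N.sum())   # draw a ball at random
    N[i] += alpha                      # replace it with alpha additional balls of its color
    draws[n] = i

print("predictive mean of color 0:", N[0] / N.sum())      # limit is Beta-distributed across runs
print("empirical mean of color 0:", np.mean(draws == 0))  # converges a.s. to the same limit
```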
In this work we exhibit an urn model that preserves the relevant aspects of the models above: a reinforcement mechanism, together with a global almost sure convergence of the empirical mean of the sequential extractions toward a fixed limit. However, unlike the previous models, for a certain choice of the parameters, the predictive mean $E[\xi_{n+1\,i} |\, \mbox{`past'}]$ randomly fluctuates without converging almost surely, asymptotically forming a stationary ergodic process. As a consequence, since the classical martingale approach and standard stochastic approximation require or imply the convergence of $E[\xi_{n+1\,i} |\, \mbox{`past'}]$ (e.g. [Reference Aletti, Crimaldi and Ghiglietti1, Reference Berti, Crimaldi, Pratelli and Rigo9, Reference Laruelle and Pagés37]), in order to prove asymptotic results for the new urn model introduced here, we need mathematical methods that are unusual in the urn-modeling literature.
Rescaled Pólya urn
We introduce a new variant of the Eggenberger–Pólya urn with k colors, which we call the rescaled Pólya (RP) urn model. In this model, the almost sure limit of the empirical mean of the draws will play the rôle of an intrinsic long-run characteristic of the process, while a local mechanism generates persistent fluctuations. More precisely, the RP urn model is characterized by the introduction of the parameter $\beta$ , together with the initial parameters $(b_{0\, i})_{i=1,\dots,\,k}$ and $(B_{0\, i})_{i=1,\dots,\,k}$ , next to the parameter $\alpha$ of the original model, so that

$$N_{n\,i}=b_{0\,i}+B_{n\,i},\qquad\text{with}\quad B_{n\,i}=\beta\,B_{n-1\,i}+\alpha\,\xi_{n\,i}\ \ \text{for } n\geq 1. \tag{3}$$
Therefore, the urn initially contains $b_{0\,i}+B_{0\,i}$ balls of color i, and the parameter $\beta\geq 0$ , together with $\alpha>0$ , regulates the reinforcement mechanism. More precisely, $N_{n\,i}$ is the sum of three terms:
• the term $b_{0\,i}$ , which remains constant over time;

• the term $\beta B_{n-1\,i}$ , which links $N_{n\,i}$ to the ‘configuration’ at time $n-1$ , through the ‘scaling’ parameter $\beta$ that tunes the dependence on this factor;

• the term $\alpha\xi_{n\,i}$ , which links $N_{n\,i}$ to the outcome of the extraction at time n, through the parameter $\alpha$ that tunes the dependence on this factor.
Note that the case $\beta=1$ corresponds to the standard Eggenberger–Pólya urn with an initial number $N_{0\,i}=b_{0\,i}+B_{0\,i}$ of balls of color i, while, when $\beta\neq 1$ , the RP urn does not fall among the variants of the Eggenberger–Pólya urn discussed in [Reference Pemantle45, Section 3.2], and, as explained in detail in Section 2, it does not belong to the class of reinforced stochastic processes studied in [Reference Aletti, Crimaldi and Ghiglietti1, Reference Aletti, Crimaldi and Ghiglietti3, Reference Aletti, Crimaldi and Ghiglietti2, Reference Crimaldi, Dai Pra, Louis and Minelli20, Reference Crimaldi, Dai Pra and Minelli21, Reference Dai Pra, Louis and Minelli23, Reference Sahasrabudhe50].
The quantities $p_{0\,1},\ldots,p_{0\,k}$ defined as

$$p_{0\,i}=\frac{b_{0\,i}}{|\textbf{b}_{\textbf{0}}|},\qquad i=1,\dots,\,k, \tag{4}$$
can be seen as an intrinsic probability distribution on the possible values (colors) $\{1,\dots,\,k\}$ , which remains constant over time and will be related to the long-term characteristics of the process, while the random variables $(B_{n\,1},\dots,\, B_{n\,k})$ model random fluctuations over time, so that the probability distribution on the set of the k possible values at time n is given by

$$\psi_{n\,i}=\frac{N_{n\,i}}{|\textbf{N}_{\textbf{n}}|}=\frac{b_{0\,i}+B_{n\,i}}{|\textbf{b}_{\textbf{0}}|+|\textbf{B}_{\textbf{n}}|}.$$
Assuming for $B_{n\,i}$ the dynamics (3) with $\beta>0$ , the probability $\psi_{n\,i}$ increases with the number of times we have observed the value i (see Equation (13) below), and so the random variables $\xi_{n\,i}$ are generated according to a reinforcement mechanism. In particular, when $\beta<1$ , the reinforcement at time n associated with the observation $\xi_{m\,i}$ , with $m=1,\dots,\,n$ , increases exponentially with m (we refer again to (13) below), leaving the fluctuations to be driven by the most recent draws. We refer to this feature as ‘local’ reinforcement. The case $\beta=0$ is an extreme case, in which $\psi_{n\,i}$ depends only on the last draw $\xi_{n\,i}$ (and not on $\xi_{m\,i}$ with $m=1,\dots,\,n-1$ ). Hence, we are mainly interested in the case $\beta\in [0,1)$ , because in this case the RP urn exhibits the following distinctive characteristics:
(a) for each i, the process $(\psi_{n\,i})_n$ randomly fluctuates, driven by the most recent observations (‘local’ reinforcement), and does not converge almost surely;

(b) for each i, the empirical mean $\bar{\xi}_{N\,i} = \sum_{n=1}^N\xi_{n\,i}/N$ , that is, the empirical frequency $O_i/N$ , converges almost surely to the deterministic limit $p_i=p_{0\,i}$ ;

(c) the chi-squared statistic (1) is asymptotically distributed as $\chi^2(k-1)\lambda$ with $\lambda>1$ .
As stated before, owing to (a), the usual methods adopted in the urn literature do not work for $\beta < 1$ , and so different techniques are needed for the study of the RP urn model.
We have also considered the asymptotic results for $\beta>1$ , to complete the study of the RP urn model. In this situation, the process $(\psi_{n\,i})_n$ converges exponentially fast to a random limit, and so even faster than in the classical Eggenberger–Pólya urn. Therefore, in this case, we may apply the usual martingale technique (e.g. [Reference Aletti, Crimaldi and Ghiglietti1, Reference Berti, Crimaldi, Pratelli and Rigo9, Reference Laruelle and Pagés37]). Note that at $\beta=1$ we have a phase transition: for the empirical means, we pass from a deterministic limit (the case $\beta<1$ ) to a random limit (the case $\beta\geq 1$ ), and for the predictive means, we pass from fluctuations (the case $\beta<1$ ) to almost sure convergence (the case $\beta\geq 1$ ).
In Figure 1 we show the properties (a) and (b) for $\beta = 0$ and $\beta \in (0, 1)$ (Figure 1(A) and Figure 1(B), respectively) compared with the classical behavior of the processes for $\beta = 1$ and $\beta > 1$ (Figure 1(C) and Figure 1(D), respectively).
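The following minimal simulation sketch of the RP urn dynamics (3) (our illustration; the parameter values are arbitrary) reproduces qualitatively the behaviors (a) and (b) for $\beta\in(0,1)$ : the predictive mean keeps fluctuating, while the empirical mean settles at $\textbf{p}_{\textbf{0}}$ .

```python
import numpy as np

def rp_urn(alpha, beta, b0, B0, n_steps, seed=0):
    """Simulate the rescaled Polya urn; return the draws and the predictive means."""
    rng = np.random.default_rng(seed)
    b0, B = np.asarray(b0, float), np.asarray(B0, float).copy()
    k = len(b0)
    draws, psis = np.empty(n_steps, dtype=int), np.empty((n_steps, k))
    for n in range(n_steps):
        psi = (b0 + B) / (b0.sum() + B.sum())  # predictive mean, as in (10)
        i = rng.choice(k, p=psi)
        B = beta * B + alpha * np.eye(k)[i]    # dynamics (3)
        draws[n], psis[n] = i, psi
    return draws, psis

draws, psis = rp_urn(alpha=1.0, beta=0.8, b0=[1, 2], B0=[1, 1], n_steps=50_000)
print("empirical mean of color 0:", np.mean(draws == 0))          # close to p0[0] = 1/3: (b)
print("std of psi_0 over the last half:", psis[25_000:, 0].std())  # persistent fluctuations: (a)
```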
Goodness-of-fit result and its statistical application
Given a sample $(\boldsymbol{\xi_{\textbf{1}}}, \ldots, \boldsymbol{\xi_{\textbf{N}}})$ (where $\boldsymbol{\xi}_{\textbf{n}}$ denotes the random vector with components $\xi_{n\,i}$ , $i=1,\dots,\,k$ ) generated by an RP urn, the statistic

$$O_i=\sum_{n=1}^{N}\xi_{n\,i}$$

counts the number of times we have observed the value i. When $\beta\in[0,1)$ , the theorem below states the almost sure convergence of the empirical mean $\overline{\xi}_{N\,i}=O_i/N$ toward the probability $p_{0\,i}$ , together with a chi-squared goodness-of-fit result for the long-term probabilities $p_{0\,1},\dots,\, p_{0\,k}$ . More precisely, we prove the following result.
Theorem 1. Assume $p_{0\,i}>0$ for all $i=1,\dots,\,k$ and $\beta \in[0,1)$ . Define the constants $\gamma$ and $\lambda$ as

$$\gamma=\frac{\beta(1-\beta)|\textbf{b}_{\textbf{0}}|+\alpha}{(1-\beta)|\textbf{b}_{\textbf{0}}|+\alpha} \tag{5}$$

and

$$\lambda=\frac{(1-\beta)^2(1+\gamma)}{(1-\gamma)\big(1-\gamma^2+(\gamma-\beta)^2\big)}. \tag{6}$$
Then $O_i/N\stackrel{a.s.}\longrightarrow p_{0\,i}$ and

$$W_N=N\sum_{i=1}^{k}\frac{\big(O_i/N-p_{0\,i}\big)^{2}}{p_{0\,i}}\ \mathop{\longrightarrow}\limits^{d}\ W_{*}=\lambda\,W_{0},$$
where $W_{0}$ has distribution $\chi^2(k-1)=\Gamma\!\left(\frac{(k-1)}{2},\frac{1}{2} \right)$ , and consequently $W_{*}$ has distribution $\Gamma\!\left(\frac{k-1}{2},\frac{1}{2\lambda}\right)$ .
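In practice, Theorem 1 amounts to replacing the usual $\chi^2(k-1)$ reference distribution with $\Gamma\!\left(\frac{k-1}{2},\frac{1}{2\lambda}\right)$ . A minimal sketch of the resulting test (our illustration, assuming SciPy; the counts and the value of $\lambda$ are arbitrary):

```python
import numpy as np
from scipy.stats import gamma

def rp_chi2_pvalue(O, p0, lam):
    """p-value under Theorem 1: W_N is asymptotically Gamma((k-1)/2, rate 1/(2*lam))."""
    O, p0 = np.asarray(O, float), np.asarray(p0, float)
    N, k = O.sum(), len(O)
    W = N * np.sum((O / N - p0) ** 2 / p0)   # N times the chi-squared distance, as in (1)
    # SciPy uses the scale parametrization: rate 1/(2*lam) means scale 2*lam.
    return W, gamma.sf(W, a=(k - 1) / 2, scale=2 * lam)

W, p = rp_chi2_pvalue(O=[480, 260, 260], p0=[0.5, 0.25, 0.25], lam=2.5)
print(f"W = {W:.3f}, p-value = {p:.4f}")     # lam = 1 recovers the classical Pearson test
```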
From a mathematical point of view, the proof of the above result easily follows from Theorem 2, whose proof occupies the main part of the present work.
A possible application we have in mind was inspired by [Reference Bertoni11, Reference Micheletti41] and is the following. We suppose we have a sample $\{\boldsymbol{\xi}_n\,:\,n=1,\dots,\,N\}$ , where the observations cannot be assumed i.i.d., but they exhibit a structure in clusters, with independence between clusters and with correlation inside each cluster. This is a common situation in many applications (e.g. [Reference Chessa, Crimaldi, Riccaboni and Trapin16, Reference Ieva, Paganoni, Pigoli and Vitelli34, Reference Tharwat53, Reference Xu and Tian55]). More precisely, we consider the situation when inside each cluster the probability that a certain unit chooses the value i is affected by the number of units in the same cluster that have already chosen the value i, hence according to a reinforcement rule. For example, we can imagine that our dataset collects messages from the online social network Twitter: ‘tweets’ referring to different topics can be placed in different clusters. If the topics are distant from each other, we can assume independence between clusters. Inside each cluster, the tweets are temporally ordered and the associated ‘sentiment’ is observed to be driven by a local reinforcement mechanism: the probability of having a tweet with positive sentiment is increasing in the number of past tweets with positive sentiment, but the reinforcement is mostly driven by the most recent tweets, leading to fluctuations of the predictive means [Reference Aletti, Crimaldi and Saracco4]. A different clustering of the tweets can be obtained with different slots of time, sufficiently far from each other.
Another example is the following. Each cluster corresponds to an agent. The agents act independently of each other (independence between clusters). At each time-step, each agent has to choose between k brands that are related to a loyalty program: the more often the agent selects the same brand, the more loyalty points he/she gains. This fact induces the reinforcement mechanism, and it could make sense that the reinforcement is mostly driven by the most recent actions. Finally, we can have the case where clusters are associated with some products, and at each time-step a customer has to give a vote to each product on an online platform. Each cluster collects the votes for the corresponding product. If the products belong to very different categories, we can assume independence between clusters; while, if the customers can see the votes given by the previous customers, we may have a reinforcement mechanism, mainly based on the last observations.
Formally, we suppose that the N units are ordered so that we have the following L clusters of units:

$$C_1=\{1,\dots,\,N_1\},\qquad C_\ell=\Big\{\textstyle\sum_{j=1}^{\ell-1}N_j+1,\dots,\,\sum_{j=1}^{\ell}N_j\Big\}\ \ \text{for }\ell=2,\dots,\,L,\qquad\text{with }\textstyle\sum_{\ell=1}^{L}N_\ell=N.$$
Therefore, the cardinality of each cluster $C_\ell$ is $N_\ell$ . We assume that the units in different clusters are independent; that is,

$$\big(\boldsymbol{\xi}_{\textbf{n}}\big)_{n\in C_1},\ \big(\boldsymbol{\xi}_{\textbf{n}}\big)_{n\in C_2},\ \dots,\ \big(\boldsymbol{\xi}_{\textbf{n}}\big)_{n\in C_L}$$
are L independent multidimensional random variables. Moreover, we assume that the observations inside each cluster can be modeled as an RP urn with $\beta\in [0,1)$ . We denote by $ p_{0\,1}(\ell),\dots,\,p_{0\,k}(\ell)$ the intrinsic long-run probabilities for the cluster $C_\ell$ , which we assume strictly positive, and we assume the same parameter $\lambda>1$ for each cluster (but not necessarily the same parameters $\alpha$ and $\beta$ ), so that all of the L random variables

$$Q_\ell=N_\ell\sum_{i=1}^{k}\frac{\Big(\sum_{n\in C_\ell}\xi_{n\,i}/N_\ell-p_{0\,i}(\ell)\Big)^{2}}{p_{0\,i}(\ell)},\qquad \ell=1,\dots,\,L,$$
are asymptotically distributed as $\Gamma\big(\frac{k-1}{2},\frac{1}{2\lambda}\big)$ . Since $Q_1,\dots,\,Q_L$ are independent because they refer to different clusters, when all the cluster sizes $N_\ell$ are large, we can estimate the parameter $\lambda$ by means of the (asymptotic) maximum likelihood and obtain

$$\widehat{\lambda}=\frac{\sum_{\ell=1}^{L}Q_\ell}{L(k-1)}.$$
Note that $E[\widehat{\lambda}]=\lambda$ ; that is, the estimator is unbiased. Moreover, $\widehat{\lambda}/\lambda$ has asymptotic distribution $\Gamma\!\left(\frac{L(k-1)}{2}, \frac{L(k-1)}{2}\right)$ (which does not depend on $\lambda$ ), and so it can be used to construct asymptotic confidence intervals for $\lambda$ . Furthermore, given certain (strictly positive) intrinsic probabilities $p_{0\,1}^*(\ell),\dots,\,p_{0\,k}^*(\ell)$ for each cluster $C_\ell$ , we can use the above procedure with $p_{0\,i}(\ell)=p_{0\,i}^*(\ell)$ for $i=1,\dots,\, k$ and $\ell=1,\dots,\,L$ to obtain an estimate $\widehat{\lambda}^*$ of $\lambda$ , and then use the statistic $Q_\ell$ with $p_{0\,i}(\ell)=p_{0\,i}^*(\ell)$ and the corresponding asymptotic distribution $\Gamma\!\left(\frac{k-1}{2},\frac{1}{2\widehat{\lambda}^*}\right)$ to perform a $\chi^2$ -test with null hypothesis

$$H_0\,:\quad p_{0\,i}(\ell)=p_{0\,i}^*(\ell)\qquad\text{for } i=1,\dots,\,k.$$
Regarding the probabilities $p_{0\,i}^*(\ell)$ , some possibilities are as follows:
• we can take $p_{0\,i}^*(\ell)=1/k$ for all $i=1,\dots,\,k$ if we want to test possible differences in the probabilities for the k different values;

• we can suppose that we have two different periods of time, and so two samples, say $\big\{\boldsymbol{\xi^{(1)}_{\textbf{n}}}\,:\,n=1,\dots,\,N\big\}$ and $\big\{\boldsymbol{\xi^{(2)}_{\textbf{n}}}\,:\,n=1,\dots,\,N\big\}$ ; take $p_{0\,i}^*(\ell)=\sum_{n\in C_\ell}\xi^{(1)}_{n\,i}/N_\ell$ for all $i=1,\dots,\, k$ ; and perform the test on the second sample in order to check for possible changes in the intrinsic long-run probabilities;

• we can take one of the clusters as a benchmark, say $\ell^*$ ; set $p_{0\,i}^*(\ell)=\sum_{n\in C_{\ell^*}}\xi_{n\,i}/N_{\ell^*}$ for all $i=1,\dots,\, k$ and $\ell\neq\ell^*$ ; and perform the test for the other $L-1$ clusters in order to check differences with the benchmark cluster $\ell^*$ .
Structure of the paper
Summing up, the sequel of the paper is structured as follows. In Section 2 we set up our notation and formally define the RP urn model with parameters $\alpha>0$ and $\beta\geq 0$ . In Section 3 we provide a complete characterization of the RP urn for the three cases $\beta=0$ , $\beta\in (0, 1)$ , and $\beta>1$ . (We do not deal with the case $\beta=1$ because, as stated before, it coincides with the standard Eggenberger–Pólya urn, whose properties are well known.) In particular we show that for each i, the empirical mean of the $\xi_{n\,i}$ almost surely converges to the intrinsic probability $p_{0\,i}$ when $\beta\in [0,1)$ , while it almost surely converges to a random limit when $\beta> 1$ . We obtain also the corresponding central limit theorems (CLTs), which, in particular for $\beta\in [0,1)$ , are the basis for the proof of Theorem 1. For completeness, we also describe the case $\alpha = 0$ , which generates a sequence of independent draws. Section 4 contains the proof of Theorem 1, which makes it possible to construct a chi-squared test for the intrinsic long-run probabilities when the observed sample is assumed to be generated by an RP urn with $\beta\in [0,1)$ . Finally, the paper contains an appendix: in Section A.1 we state and prove a general CLT for Markov chains with a compact state space $S\subset\mathbb{R}^k$ , under a certain condition which we call the ‘linearity’ condition, and in Section A.2 we explain a fundamental coupling technique used in the proof of the CLT for $\beta\in (0, 1)$ .
2. The rescaled Pólya urn model
In all the sequel (unless otherwise specified) we suppose given two parameters $\alpha>0$ and $\beta \geq 0$ . Given a vector $\textbf{x}=(x_1, \ldots, x_k)^\top\in \mathbb{R}^k$ , we set $ |\textbf{x}| =\sum_{i=1}^k |x_i| $ and $ \|\textbf{x}\|^2 = \textbf{x}^\top \textbf{x}=\sum_{i=1}^k |x_i|^2 $ . Moreover we denote by $\textbf{1}$ and $\textbf{0}$ the vectors with all the components equal to 1 and equal to 0, respectively, and by $\{\textbf{e}_{\textbf{1}}, \ldots, \textbf{e}_{\textbf{k}}\}$ the canonical base of $\mathbb{R}^k$ .
To formally work with the RP urn model presented in the introduction, we add here some notation. Throughout the sequel, the expression number of balls is not to be understood literally; all the quantities are real numbers, not necessarily integers. The urn initially contains $b_{0\,i}+B_{0\,i}>0$ distinct balls of color i, with $i=1,\dots,\,k$ . We set $\textbf{b}_{\textbf{0}}=(b_{0\,1},\dots,\,b_{0\,k})^{\top}$ and $\textbf{B}_{\textbf{0}}=(B_{0\,1},\dots,\,B_{0\,k})^{\top}$ . In all the sequel (unless otherwise specified) we assume $|\textbf{b}_{\textbf{0}}|>0$ . Consistently with (4), we set $\textbf{p}_{\textbf{0}} =\frac{\textbf{b}_{\textbf{0}}}{|\textbf{b}_{\textbf{0}}|}$ . At each discrete time $(n+1)\geq 1$ , a ball is drawn at random from the urn, producing the random vector $\boldsymbol{\xi}_{\textbf{n}+\textbf{1}} = (\xi_{n+1\,1}, \ldots, \xi_{n+1\,k})^\top$ defined as

$$\xi_{n+1\,i}={\unicode{x1D7D9}}_{\{\text{the ball extracted at time } n+1 \text{ is of color } i\}},\qquad i=1,\dots,\,k, \tag{7}$$
and the number of balls in the urn is updated as follows:

$$\textbf{B}_{\textbf{n}+\textbf{1}}=\beta\,\textbf{B}_{\textbf{n}}+\alpha\,\boldsymbol{\xi}_{\textbf{n}+\textbf{1}},\qquad \textbf{N}_{\textbf{n}+\textbf{1}}=\textbf{b}_{\textbf{0}}+\textbf{B}_{\textbf{n}+\textbf{1}}, \tag{8}$$
which gives (since $|\boldsymbol{\xi}_{\textbf{n}+\textbf{1}}|=1$ )

$$|\textbf{B}_{\textbf{n}+\textbf{1}}|=\beta\,|\textbf{B}_{\textbf{n}}|+\alpha. \tag{9}$$
Therefore, setting $r^*_{n} = |{\textbf{N}_{\textbf{n}}}|= |\textbf{b}_{\textbf{0}}|+|\textbf{B}_{\textbf{n}}|$ , we get

$$r^*_{n+1}=\beta\,r^*_{n}+(1-\beta)\,|\textbf{b}_{\textbf{0}}|+\alpha.$$
Moreover, setting $\mathcal{F}_0$ equal to the trivial $\sigma$ -field and $\mathcal{F}_n=\sigma(\boldsymbol{\xi_{\textbf{1}}},\dots,\,\boldsymbol{\xi}_{\textbf{n}})$ for $n\geq 1$ , the conditional probabilities $\boldsymbol{\psi}_{\!\textbf{n}}= (\psi_{n\,1}, \ldots,\psi_{n\,k})^\top$ of the extraction process, also called predictive means, are

$$\boldsymbol{\psi}_{\!\textbf{n}}=E\big[\boldsymbol{\xi}_{\textbf{n}+\textbf{1}}\,\big|\,\mathcal{F}_n\big]=\frac{\textbf{N}_{\textbf{n}}}{r^*_n}. \tag{10}$$
It is obvious that we have $|\boldsymbol{\psi}_{\!\textbf{n}}|=1$ . Finally, for the sequel, we set ${\overline{\boldsymbol{\xi}}_{\textbf{N}}}=\sum_{n=1}^N\boldsymbol{\xi}_{\textbf{n}}/N$ .
We note that, by means of (10), together with (7) and (9), we have

$$\boldsymbol{\psi}_{\!\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}=-\frac{(1-\beta)\,|\textbf{b}_{\textbf{0}}|}{r^*_{n}}\big(\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}-\textbf{p}_{\textbf{0}}\big)+\frac{\alpha}{r^*_{n}}\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big). \tag{11}$$
As previously stated, the RP urn for $\beta=1$ coincides with the well-known standard Eggenberger–Pólya urn, and so we will exclude it from the following analyses. When $\beta\neq 1$ , because of the first term in the right-hand side of the above relation, the RP urn does not belong to the class of reinforced stochastic processes (RSPs) studied in [Reference Aletti, Crimaldi and Ghiglietti1, Reference Aletti, Crimaldi and Ghiglietti3, Reference Aletti, Crimaldi and Ghiglietti2, Reference Crimaldi, Dai Pra, Louis and Minelli20, Reference Crimaldi, Dai Pra and Minelli21, Reference Dai Pra, Louis and Minelli23]. Generally speaking, by reinforcement in a stochastic dynamics we mean any mechanism for which the probability that a given event occurs, i.e. the predictive mean, has an increasing dependence on the number of times that the same event occurred in the past. This ‘reinforcement mechanism’, also known as a ‘preferential attachment rule’, ‘rich get richer rule’, or ‘Matthew effect’, is a key feature governing the dynamics of many biological, economic, and social systems (see e.g. [Reference Pemantle45]). The RSPs are characterized by a ‘strict’ reinforcement mechanism such that, at each time-step, we have a strictly positive increment of the predictive mean associated with the extracted color. As an immediate consequence, the ‘general’ reinforcement mechanism is satisfied; that is, the predictive mean for a given color has an increasing dependence on the number of past extractions of that color. When $\beta\neq 1$ , the RP urn model does not satisfy the ‘strict’ reinforcement mechanism, because the first term in the right-hand side of (11) is positive or negative according to the sign of $(1-\beta)$ and of $\big(\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}-\textbf{p}_{\textbf{0}}\big)$ . However, when $\alpha,\,\beta>0$ , it satisfies the general reinforcement mechanism. Indeed, by (7), (8), (9), and (10), using $\sum_{m=0}^{n-1}x^m=(1-x^{n})/(1-x)$ , we have

$$r^*_n=\beta^n\,|\textbf{B}_{\textbf{0}}|+|\textbf{b}_{\textbf{0}}|+\alpha\,\frac{1-\beta^n}{1-\beta} \tag{12}$$

and

$$\boldsymbol{\psi}_{\!\textbf{n}}=\frac{\textbf{b}_{\textbf{0}}+\beta^n\,\textbf{B}_{\textbf{0}}+\alpha\sum_{m=1}^{n}\beta^{n-m}\boldsymbol{\xi}_{\textbf{m}}}{r^*_n}. \tag{13}$$
In particular, for $\beta>1$ , the dependence of $\boldsymbol{\psi}_{\!\textbf{n}}$ on $\boldsymbol{\xi_{\textbf{m}}}$ exponentially decreases with m, because of the factor $\beta^{-m}$ . For $\beta<1$ we have the opposite behavior; that is, the dependence of $\boldsymbol{\psi}_{\!\textbf{n}}$ on $\boldsymbol{\xi_{\textbf{m}}}$ exponentially increases with m, because of the factor $\beta^{n-m}$ , and so the main contribution is given by the most recent extractions. We refer to this phenomenon as ‘local’ reinforcement. The case $\beta=0$ is an extreme case, for which $\boldsymbol{\psi}_{\!\textbf{n}}$ depends only on the last extraction $\boldsymbol{\xi}_{\textbf{n}}$ : at each time-step $n+1\geq 2$ we extract a ball from an urn with $b_{0\,i}+\alpha$ balls of color i, if i is the color extracted at time n, and $b_{0\,j}$ balls for each color $j\neq i$ . This particular case corresponds to a version of the so-called ‘memory-1 senile reinforced random walk’ on a star-shaped graph introduced in [Reference Holmes and Sakai33], but the study done in that paper differs from ours. Finally, we observe that Equation (11) recalls the dynamics of an RSP with a ‘forcing input’ (see [Reference Aletti, Crimaldi and Ghiglietti1, Reference Crimaldi, Dai Pra, Louis and Minelli20, Reference Sahasrabudhe50]), but the main difference lies in the fact that, for the RP urn, the sequence $\big(r_n^*\big)$ is such that $r_n^*\to r^*>0$ , and so $\sum_n1/r_n^*=+\infty$ and $\sum_n 1/\big(r_n^*\big)^2=+\infty$ , when $\beta\in[0,1)$ , and such that $\sum_n 1/r_n^*<+\infty$ (and $\sum_n 1/\big(r_n^*\big)^2<+\infty$ ) when $\beta>1$ . These facts lead to different asymptotic behavior for $(\boldsymbol{\psi}_{\!\textbf{n}})$ . Specifically, for the RP urn with $\beta\in [0,1)$ , the predictive mean $\boldsymbol{\psi}_{\!\textbf{n}}$ randomly fluctuates and does not converge almost surely, while, for the RP urn with $\beta>1$ , the sequence $(\boldsymbol{\psi}_{\!\textbf{n}})$ almost surely converges to a random variable $\boldsymbol{\psi}_{\!\boldsymbol{\infty}}$ and $|\boldsymbol{\psi}_{\!\textbf{n}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}|=O(\beta^{-n})$ . By contrast, for the RSP with a ‘forcing input’, the almost sure convergence of $\boldsymbol{\psi}_{\!\textbf{n}}$ toward the forcing input (which is a constant) holds true, and the corresponding rate of convergence depends on a model parameter $\gamma\in (1/2,1]$ and equals $n^{-\gamma/2}$ .
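As a small numerical illustration of the ‘local’ reinforcement encoded in (13) (a sketch with arbitrary parameter values), the weight that $\boldsymbol{\psi}_{\!\textbf{n}}$ assigns to the draw at time m is $\alpha\beta^{n-m}/r^*_n$ , which for $\beta<1$ grows with m:

```python
import numpy as np

alpha, beta, n = 1.0, 0.8, 20
b0_tot, B0_tot = 3.0, 2.0
r_n = beta**n * B0_tot + b0_tot + alpha * (1 - beta**n) / (1 - beta)  # r*_n, as in (12)

m = np.arange(1, n + 1)
weights = alpha * beta ** (n - m) / r_n   # weight of the draw xi_m in psi_n, from (13)
print(np.round(weights, 4))               # increasing in m: recent draws dominate
```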
3. Properties of the rescaled Pólya urn model
We study separately the three cases $\beta=0$ , $\beta\in (0, 1)$ , and $\beta>1$ . In particular, the case $\beta=0$ is placed at the end of this section, because it is elementary.
3.1. The case $\beta \in (0, 1)$
In this case, we have $\lim_n\beta^n=0$ and $\sum_{n\geq 1}\beta^n=\beta/(1-\beta)$ . Therefore, setting $r=\tfrac{\alpha}{1-\beta}$ and $r^*=|\textbf{b}_{\textbf{0}}|+r$ , we have by (8) and (9)

$$r^*_n=r^*+\beta^n\big(|\textbf{B}_{\textbf{0}}|-r\big), \tag{14}$$
and so we have that the denominator $r^*_n$ in $\boldsymbol{\psi}_{\!\textbf{n}}$ (see Equation (10)) goes exponentially fast to the limit $r^*$ . Moreover, recalling the definition of the constant $\gamma$ in (5), we have $\beta < \gamma < 1$ (remember that $|\textbf{b}_{\textbf{0}}|>0$ by assumption),

$$1-\gamma=\frac{(1-\beta)\,|\textbf{b}_{\textbf{0}}|}{r^*},\qquad\quad \gamma-\beta=\frac{\alpha}{r^*}. \tag{15}$$
Therefore, by (14), the terms $\frac{(1-\beta)|\textbf{b}_{\textbf{0}}|}{r_{n}^*}$ and $\frac{\alpha}{r_{n}^*}$ in the dynamics (11) converge exponentially fast to $(1-\gamma)$ and $(\gamma-\beta)$ , respectively. Furthermore, as we will see, the fact that the constant $\gamma$ is strictly smaller than 1 will play a central rôle, because it will imply the existence of a contraction of the process $\boldsymbol{\psi}=(\boldsymbol{\psi}_{\!\textbf{n}})_n$ in a proper metric space with (sharp) constant $\gamma$ . Consequently, it is not a surprise that this constant enters naturally in the parameters of the asymptotic distribution, given in the following result.
Theorem 2. We have ${\overline{\boldsymbol{\xi}}_{\textbf{N}}} \mathop{\longrightarrow}\limits^{a.s.}\textbf{p}_{\textbf{0}}$ and

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\textbf{p}_{\textbf{0}}\big)\ \mathop{\longrightarrow}\limits^{d}\ \mathcal{N}\big(\textbf{0},\Sigma^2\big),$$

where

$$\Sigma^2=\lambda\,\big(\mathrm{diag}(\textbf{p}_{\textbf{0}})-\textbf{p}_{\textbf{0}}{\textbf{p}_{\textbf{0}}}^\top\big),$$
with $\lambda$ defined in (6) as a function of $\beta$ and $\gamma$ .
Remark 1. Note that in Theorem 2 we do not assume $b_{0\,i}>0$ for all i, but only $|\textbf{b}_{\textbf{0}}|>0$ (as stated in Section 2). The asymptotic result given above fails when $\textbf{b}_{\textbf{0}} = \textbf{0}$ , and a different behavior is observed. Indeed, from (11), we have

$$\boldsymbol{\psi}_{\!\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}=\frac{\alpha}{r^*_n}\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big), \tag{16}$$
and hence $(\boldsymbol{\psi}_{\!\textbf{n}})$ is a martingale. Therefore the bounded process $(\boldsymbol{\psi}_{\!\textbf{n}})$ converges almost surely (and in mean) to a bounded random variable $\boldsymbol{\psi}_{\!\boldsymbol{\infty}}$ . In addition, since $r_{n}^* \to r=\alpha/(1-\beta)$ , from (16), we obtain that the unique possible limits ${\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}$ are those for which $\boldsymbol{\xi}_{\textbf{n}} = {\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}$ eventually. Hence $\boldsymbol{\psi}_{\!\boldsymbol{\infty}}$ takes values in $\{\textbf{e}_{\textbf{1}},\dots,\,\textbf{e}_{\textbf{k}}\}$ , and since we have $E[{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}] = E[\boldsymbol{\psi_{0}}] =\textbf{B}_{\textbf{0}}/{|\textbf{B}_{\textbf{0}}|}$ , we get $P( {\boldsymbol{\psi}_{\!\boldsymbol{\infty}}} = \textbf{e}_{\textbf{i}} ) =\frac{{B_{0\,i}}}{ |\textbf{B}_{\textbf{0}}|}$ for all $i=1,\dots,\,k$ .
We will split the proof of Theorem 2 into two main steps: first, we will prove that the convergence behavior of ${\overline{\boldsymbol{\xi}}_{\textbf{N}}}$ does not depend on the initial constant $|\textbf{B}_0|$ , and then, without loss of generality, we will assume $|\textbf{B}_0|=r$ and we will give the proof of the theorem under this assumption.
3.1.1. Independence of the asymptotic properties of the empirical mean from $|\textbf{B}_{\textbf{0}}|$
We use a coupling method to prove that the convergence results stated in Theorem 2 are not affected by the value of the initial constant $|\textbf{B}_{\textbf{0}}|$ .
Set $\boldsymbol{\xi^{(1)}_{\textbf{n}}} =\boldsymbol{\xi}_{\textbf{n}} $ and $\boldsymbol{\psi^{(1)}_{\textbf{n}}} =\boldsymbol{\psi}_{\!\textbf{n}} $ , which follows the dynamics (11), together with (10), starting from a certain initial point ${\boldsymbol{\psi}_{\!\textbf{0}}^{(\textbf{1})}}=\boldsymbol{\psi}_{\!\textbf{0}}$ . By (7) and the relations (15), we can write
where
Since, by (14), we have $r^*/r_{n+1}^*-1=O\big(\beta^{n+1}\big)$ and $r_{n}^*/r_{n+1}^* -1 =O\big(\beta^{n+1}\big)$ , we get $|{\textbf{l}_{\textbf{n}+\textbf{1}}^{(\textbf{1})}}|=O\big(\beta^{n+1}\big)$ . Now, take $\boldsymbol{\xi^{(2)}}=\big(\boldsymbol{\xi^{(2)}_{\textbf{n}}}\big)_n$ and $\boldsymbol{\psi^{(2)}} =\big(\boldsymbol{\psi^{(2)}_{\textbf{n}}} \big)_n$ following the same dynamics given in (10) and (11), but starting from an initial point with $|\boldsymbol{\textbf{B}_0^{(2)}}|=r$ . Therefore, we have

$$\boldsymbol{\psi}^{(\textbf{2})}_{\textbf{n}+\textbf{1}}=\beta\,\boldsymbol{\psi}^{(\textbf{2})}_{\textbf{n}}+(\gamma-\beta)\,\boldsymbol{\xi}^{(\textbf{2})}_{\textbf{n}+\textbf{1}}+(1-\gamma)\,\textbf{p}_{\textbf{0}}.$$
Both dynamics are of the form (A.9) with $a_0 = \beta$ , $a_1 = (\gamma-\beta)$ , $\textbf{c}= (1-\gamma)\textbf{p}_{\textbf{0}}$ , $c_n^{(1)}=\beta^{n}$ , and $c_{n}^{(2)}=0$ , and, by (10), the condition (A.10) holds true. Hence we can apply Theorem 10 so that there exist two stochastic processes ${\widetilde{\boldsymbol\psi}^{(1)}}$ and ${\widetilde{\boldsymbol\psi}^{(2)}}$ , following the dynamics (A.11) (with the same specifications as above), together with (A.12), starting from the same initial points and such that (A.14) holds true; that is,
Since $\gamma<1$ , if we subtract (A.11) with $\ell=2$ from (A.11) with $\ell=1$ , we obtain that
which implies $\sum_{n=0}^{+\infty}\Big|{\widetilde{\boldsymbol\xi}^{(1)}_{\textbf{n}}} -{\widetilde{\boldsymbol\xi}^{(2)}_{\textbf{n}}} \Big| <+\infty$ almost surely. Recall that ${\widetilde{\boldsymbol\xi}^{(1)}_{\textbf{n}}}\in \{0,1\} $ and $ {\widetilde{\boldsymbol\xi}^{(2)}_{\textbf{n}}}\in \{0,1\}$ , and hence, almost surely,

$${\widetilde{\boldsymbol\xi}^{(1)}_{\textbf{n}}}={\widetilde{\boldsymbol\xi}^{(2)}_{\textbf{n}}}\qquad\text{eventually}. \tag{17}$$
Therefore, if we prove some asymptotic results for $\boldsymbol{\xi^{(2)}}$ , then they hold true also for ${\widetilde{\boldsymbol\xi}^{(2)}}$ (since they have the same joint distribution); then they hold true also for ${\widetilde{\boldsymbol\xi}^{(1)}}$ (by (17)); and finally they hold true also for $\boldsymbol{{\xi}^{(1)}}$ (since they have the same joint distribution). Summing up, without loss of generality, we may prove Theorem 2 under the additional assumption $|\textbf{B}_{\textbf{0}}| = r$ .
3.1.2. The case $|\textbf{B}_{\boldsymbol{0}}|=r$
Thanks to what we have observed in the previous subsection, we here assume that $|\textbf{B}_{\textbf{0}}|=r=\alpha/(1-\beta)$ , which implies $|\textbf{B}_{\textbf{n}}|=r$ and

$$r^*_n=|\textbf{b}_{\textbf{0}}|+|\textbf{B}_{\textbf{n}}|=|\textbf{b}_{\textbf{0}}|+r=r^*$$

for any n. Hence, we can simplify (10) as

$$\boldsymbol{\psi}_{\!\textbf{n}}=\frac{\textbf{b}_{\textbf{0}}+\textbf{B}_{\textbf{n}}}{r^*}. \tag{18}$$
The process $\boldsymbol{\psi}=(\boldsymbol{\psi}_{\!\textbf{n}})_{n\geq 0}$ is then a Markov chain with state space

$$S=\bigg\{\frac{\textbf{b}_{\textbf{0}}+\textbf{B}}{r^*}\,:\,\textbf{B}\in[0,r]^k,\ |\textbf{B}|=r\bigg\},$$
which, endowed with the distance induced by the norm $|\cdot|$ , is a compact metric space.
In the sequel, according to the context, since we work with a Markov chain with state space $S\subset\mathbb{R}^k$ , the notation P will be used for the following:
• a kernel $P\,:\, S\times \mathcal{B}(S)\to [0,1]$ , where $\mathcal{B}(S)$ is the Borel $\sigma$ -field on S and we will use the notation $P(\textbf{x},\textbf{dy})$ in the integrals;

• an operator $P\,:\,C(S) \to M(S)$ , where C(S) and M(S) denote the spaces of the continuous and measurable functions on S, respectively, defined as

\begin{equation*}(Pf)(\textbf{x}) = \int_S f(\textbf{y}) P(\textbf{x},\textbf{dy}).\end{equation*}

In addition, when f is the identity map, that is, $f(\textbf{y})=\textbf{y}$ , we will write $(Pid)(\textbf{x})$ or $(P\textbf{y})(\textbf{x})$ .
Moreover, we set $P^0f = f$ and $P^nf = P\big(P^{n-1}f\big)$ .
By (11), together with (15) and (18), the process $\boldsymbol{\psi}=(\boldsymbol{\psi}_{\!\textbf{n}})_n$ follows the dynamics

$$\boldsymbol{\psi}_{\!\textbf{n}+\textbf{1}}=\beta\,\boldsymbol{\psi}_{\!\textbf{n}}+(\gamma-\beta)\,\boldsymbol{\xi}_{\textbf{n}+\textbf{1}}+(1-\gamma)\,\textbf{p}_{\textbf{0}}. \tag{19}$$
Therefore, given $\textbf{z} = (z_1, \ldots, z_k)^T$ and setting

$$\textbf{z}^{(\textbf{i})}=\beta\,\textbf{z}+(\gamma-\beta)\,\textbf{e}_{\textbf{i}}+(1-\gamma)\,\textbf{p}_{\textbf{0}}$$

for any $i=1, \ldots, k$ , we get

$$(Pf)(\textbf{z})=\sum_{i=1}^{k} z_i\,f\big(\textbf{z}^{(\textbf{i})}\big). \tag{20}$$
In particular, from the above equality, we get

$$(P\textbf{y})(\textbf{z})-\textbf{p}_{\textbf{0}}=\sum_{i=1}^{k} z_i\,\textbf{z}^{(\textbf{i})}-\textbf{p}_{\textbf{0}}=\gamma\,(\textbf{z}-\textbf{p}_{\textbf{0}}). \tag{21}$$
We now show that $\boldsymbol{\psi}$ is an irreducible, aperiodic, compact Markov chain (see Definitions 3 and 4).
Checking that $\boldsymbol{\psi}$ is a compact Markov chain: By Lemma 1, it is sufficient to show that P defined in (20) is weak Feller (Definition 1) and that it is a semi-contractive operator on Lip(S) (Definition 2). From (20), we have immediately that the function Pf is continuous whenever f is continuous and hence P is weak Feller. In order to prove the contractive property, we start by observing that the dynamics (19) of $\boldsymbol{\psi}$ is of the form (A.9) with $a_0=\beta$ , $a_1=(\gamma-\beta)$ , $\textbf{c}=\textbf{p}_{\textbf{0}}(1-\gamma)$ , and $\textbf{l}_{\textbf{n}}\equiv\textbf{0}$ for each n. Moreover, by (10), the condition (A.10) holds true. Then we let $\boldsymbol{\psi^{(1)}}$ and $\boldsymbol{\psi^{(2)}}$ be two stochastic processes following the dynamics (A.9) with the same specifications as above, together with (A.10), and starting, respectively, from the points $\textbf{x}$ and $\textbf{y}$ . Applying Theorem 10, we get two stochastic processes ${\widetilde{\boldsymbol\psi}^{\boldsymbol{(1)}}}$ and ${\widetilde{\boldsymbol\psi}^{\boldsymbol{(2)}}}$ , evolving according to (A.11), together with (A.12), starting from the same initial points $\textbf{x}$ and $\textbf{y}$ and such that

$$E\Big[\big|{\widetilde{\boldsymbol\psi}^{(\textbf{1})}_{\textbf{n}}}-{\widetilde{\boldsymbol\psi}^{(\textbf{2})}_{\textbf{n}}}\big|\Big]\leq\gamma^n\,|\textbf{x}-\textbf{y}|.$$
Therefore, if we take $f \in Lip(S)$ with Lipschitz constant $|f|_{Lip}$ , we obtain

$$\big|(Pf)(\textbf{x})-(Pf)(\textbf{y})\big|=\Big|E\Big[f\big({\widetilde{\boldsymbol\psi}^{(\textbf{1})}_{\textbf{1}}}\big)\Big]-E\Big[f\big({\widetilde{\boldsymbol\psi}^{(\textbf{2})}_{\textbf{1}}}\big)\Big]\Big|\leq |f|_{Lip}\,E\Big[\big|{\widetilde{\boldsymbol\psi}^{(\textbf{1})}_{\textbf{1}}}-{\widetilde{\boldsymbol\psi}^{(\textbf{2})}_{\textbf{1}}}\big|\Big]\leq\gamma\,|f|_{Lip}\,|\textbf{x}-\textbf{y}|,$$

and so

$$|Pf|_{Lip}\leq\gamma\,|f|_{Lip},$$
with $\gamma <1 $ , as desired.
Checking that $\boldsymbol{\psi}$ is irreducible and aperiodic: We prove the irreducibility and aperiodicity condition stated in Definition 4, using Theorem 7. Therefore, let us denote by $\pi$ an invariant probability measure for P. Moreover, let $\boldsymbol{\psi^{(1)}}$ and $\boldsymbol{\psi^{(2)}}$ be two processes that follow the same dynamics (19) of $\boldsymbol{\psi}$ , but for the first process, we set the initial distribution equal to $\pi$ , while for the second process, we take any other initial distribution $\nu$ on S. Again, as above, since (19) is of the form (A.9) with $a_0=\beta$ , $a_1=(\gamma-\beta)$ , $\textbf{c}=\textbf{p}_{\textbf{0}}(1-\gamma)$ , and $\textbf{l}_{\textbf{n}}\equiv\textbf{0}$ for each n, and, by (10), the condition (A.10) holds true, we can apply Theorem 10 and obtain two stochastic processes $\boldsymbol{\widetilde\psi^{(1)}}$ and ${\widetilde{\boldsymbol\psi}^{\boldsymbol{(2)}}}$ , evolving according to (A.11) (with the same specifications as above), together with (A.12), starting from the same initial random variables $\boldsymbol{\psi^{(1)}_0}$ (with distribution $\pi$ ) and $\boldsymbol{\psi^{(2)}_0}$ (with distribution $\nu$ ) and such that

$$E\Big[\big|{\widetilde{\boldsymbol\psi}^{(\textbf{1})}_{\textbf{n}}}-{\widetilde{\boldsymbol\psi}^{(\textbf{2})}_{\textbf{n}}}\big|\Big]\leq\gamma^n\,E\Big[\big|\boldsymbol{\psi^{(1)}_0}-\boldsymbol{\psi^{(2)}_0}\big|\Big].$$
Hence, since $\gamma<1$ , we have $E\Big[\Big|{\widetilde{\boldsymbol\psi}^{(1)}_{\textbf{n}}}-{\widetilde{\boldsymbol\psi}^{(2)}_{\textbf{n}}}\Big|\Big]\longrightarrow 0 $ , and, since the distribution of ${\widetilde{\boldsymbol\psi}^{(1)}_{\textbf{n}}}$ is always $\pi$ (by the definition of an invariant probability measure), we can conclude that $\boldsymbol{\widetilde\psi^{(2)}_{\textbf{n}}}$ , and so $\boldsymbol{\psi^{(2)}_{\textbf{n}}}$ (because they have the same distribution), converge in distribution to $\pi$ .
Proof of the almost sure convergence: We have already proven that the Markov chain $\boldsymbol{\psi}$ has one invariant probability measure $\pi$ . Furthermore, from (19) we get

$$\boldsymbol{\xi}_{\textbf{n}+\textbf{1}}=\frac{\boldsymbol{\psi}_{\!\textbf{n}+\textbf{1}}-\beta\,\boldsymbol{\psi}_{\!\textbf{n}}-(1-\gamma)\,\textbf{p}_{\textbf{0}}}{\gamma-\beta}. \tag{22}$$
Therefore, applying [Reference Hairer31, Corollary 5.3 and Corollary 5.12], we obtain

$${\overline{\boldsymbol{\xi}}_{\textbf{N}}}=\frac{1}{N}\sum_{n=1}^{N}\frac{\boldsymbol{\psi}_{\!\textbf{n}}-\beta\,\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}-(1-\gamma)\,\textbf{p}_{\textbf{0}}}{\gamma-\beta}\ \stackrel{a.s.}\longrightarrow\ E\bigg[\frac{{\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{1}}}-\beta\,{\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{0}}}-(1-\gamma)\,\textbf{p}_{\textbf{0}}}{\gamma-\beta}\bigg],$$
where $\big({\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}}\big)_{n\geq 0}$ is a Markov chain with transition kernel P and initial distribution $\pi$ , and so stationary. From (21), we have

$$E\big[{\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}}-\textbf{p}_{\textbf{0}}\big]=\gamma^n\,E\big[{\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{0}}}-\textbf{p}_{\textbf{0}}\big]=\gamma^n\,E\big[{\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}}-\textbf{p}_{\textbf{0}}\big],$$
and so, since $\gamma<1$ , we get $E\big[ {\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}} - \textbf{p}_{\textbf{0}} \big]=0$ . This means that $\int_S\textbf{x}\,\pi(d\textbf{x})=\textbf{p}_{\textbf{0}}$ and ${\overline{\boldsymbol{\xi}}_{\textbf{N}}}\stackrel{a.s.}\longrightarrow \textbf{p}_{\textbf{0}}$ .
Proof of the CLT: We apply Theorem 9, taking into account that we have already proven that $\boldsymbol{\psi}$ is an irreducible and aperiodic compact Markov chain. By (22) and what we have already proven before, we have

$${\textbf{f}}(\textbf{x},\textbf{y})=\frac{\textbf{y}-\beta\,\textbf{x}-(1-\gamma)\,\textbf{p}_{\textbf{0}}}{\gamma-\beta}-\textbf{p}_{\textbf{0}}=-\frac{\beta}{\gamma-\beta}\,(\textbf{x}-\textbf{p}_{\textbf{0}})+\frac{1}{\gamma-\beta}\,(\textbf{y}-\textbf{p}_{\textbf{0}})$$
and $\textbf{p}_{\textbf{0}}=\int_S\textbf{x}\,\pi(d\textbf{x})$ . Hence ${\textbf{f}}$ and P form a linear model as defined in Definition 5 (see also Remark 5). Indeed, we have $A_{1} = -\frac{\beta}{\gamma-\beta} Id$ and $A_{2} =\frac{1}{\gamma-\beta} Id$ . Moreover, by (21), we have $P(id)(\textbf{x})-\textbf{p}_{\textbf{0}}=\gamma( \textbf{x} -\textbf{p}_{\textbf{0}} )$ , which means $A_{P} = \gamma Id$ . Therefore Theorem 9 holds true with
and so, after some computations, with
In order to conclude, we take a Markov chain $\big({\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}}\big)_{n\geq 0}$ with transition kernel P and initial distribution $\pi$ and we set
Then we observe that, by (A.4) and (A.5), we have
3.2. The case $\beta>1$
In this case, $\lim_n \beta^{n}=+\infty$ and $\sum_{n\geq 1}\beta^{-n}= 1/(\beta-1)$ . Moreover, by (12), $r_n^*$ increases exponentially to $+\infty$ . Hence, the following results hold true.
Theorem 3. We have

$$\boldsymbol{\psi}_{\!\textbf{N}}\ \stackrel{a.s.}\longrightarrow\ \boldsymbol{\psi}_{\!\boldsymbol{\infty}}=\frac{\textbf{B}_{\textbf{0}}+\alpha\sum_{n\geq 1}\beta^{-n}\boldsymbol{\xi}_{\textbf{n}}}{|\textbf{B}_{\textbf{0}}|+\alpha/(\beta-1)}$$

and

$$\big|\boldsymbol{\psi}_{\!\textbf{N}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big|=O\big(\beta^{-N}\big)\qquad\text{almost surely}.$$
Moreover, $\boldsymbol{\psi}_{\!\boldsymbol{\infty}}$ takes values in $\big\{\textbf{x}\in [0,1]^k\,:\,|\textbf{x}|=1\big\}$ , and if $B_{0\,i}>0$ for all $i=1,\dots,\, k$ , then $P\{ \psi_{\infty\, i}\in (0, 1)\}=1$ for each i.
Note that $\boldsymbol{\psi}_{\!\boldsymbol{\infty}}$ is a function of ${\boldsymbol \xi}=(\boldsymbol{\xi}_{\textbf{n}})_{n\geq 1}$ which takes values in $\big(\big\{\textbf{x}\in\{0,1\}^k\,:\, |\textbf{x}|=1\big\}\big)^\infty$ .
Proof. By (13), we have

$$\boldsymbol{\psi}_{\!\textbf{N}}=\frac{\beta^{-N}\,\textbf{b}_{\textbf{0}}+\textbf{B}_{\textbf{0}}+\alpha\sum_{n=1}^{N}\beta^{-n}\boldsymbol{\xi}_{\textbf{n}}}{\beta^{-N}\,r^*_N}.$$
Hence, the almost sure convergence immediately follows because $\big|\sum_{n\geq 1}\beta^{-n} \boldsymbol{\xi}_{\textbf{n}}\big|\leq \sum_{n\geq 1}\beta^{-n}<+\infty$ . Moreover, after some computations, we have
where
Therefore, since $|\sum_{n\geq N+1}\beta^{-n}\boldsymbol{\xi}_{\textbf{n}}|\leq\sum_{n\geq N+1}\beta^{-n}=\beta^{-N}/(\beta-1)$ and $|{\textbf{R}}|$ is bounded by a constant, we obtain that

$$\big|\boldsymbol{\psi}_{\!\textbf{N}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big|=O\big(\beta^{-N}\big)\qquad\text{almost surely}.$$
In order to conclude, it is enough to recall that, by definition, we have $\boldsymbol{\psi}_{\!\textbf{N}}\in [0,1]^k$ with $|\boldsymbol{\psi}_{\!\textbf{N}}|=1$ and observe that, if $B_{0\,i}>0$ for all i, then we have

$$\psi_{\infty\,i}\geq\frac{B_{0\,i}}{|\textbf{B}_{\textbf{0}}|+\alpha/(\beta-1)}>0,\qquad\text{and hence also}\quad \psi_{\infty\,i}=1-\sum_{j\neq i}\psi_{\infty\,j}<1.$$
Theorem 4. We have ${\overline{\boldsymbol{\xi}}_{\textbf{N}}}\stackrel{a.s.}\longrightarrow{{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}}$ , and

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big)\ \mathop{\longrightarrow}\limits^{s}\ \mathcal{N}\big(\textbf{0},\Sigma^2\big)\qquad\text{and}\qquad \sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\boldsymbol{\psi}_{\!\textbf{N}}\big)\ \mathop{\longrightarrow}\limits^{s}\ \mathcal{N}\big(\textbf{0},\Sigma^2\big),$$
where $\Sigma^2 = \mathrm{diag} ({{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}}) -{{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}} {\boldsymbol{\psi}_{\!\boldsymbol{\infty}}^\top}$ and $\mathop{\longrightarrow}\limits^{s}$ means stable convergence.
Note that if $B_{0\,i}>0$ for all $i=1,\dots,\,k$ , then, almost surely, $\Sigma^2_{i,i}\in (0, 1)$ for each i and $\Sigma^2_{i,j}\neq 0$ for each pair (i, j).
Stable convergence was introduced in [Reference Rényi48]; for its definition and properties, we refer to [Reference Crimaldi19, Reference Crimaldi, Letta and Pratelli22, Reference Hall and Heyde32].
Proof. The almost sure convergence of ${\overline{\boldsymbol{\xi}}_{\textbf{N}}}$ to $\boldsymbol{\psi}_{\!\boldsymbol{\infty}}$ follows by the usual martingale arguments (see, for instance, [Reference Berti, Crimaldi, Pratelli and Rigo9, Lemma 2]) because $E\big[\boldsymbol{\xi_{\textbf{n}+1}}|\mathcal{F}_n\big]=\boldsymbol{\psi}_{\!\textbf{n}}\to \boldsymbol{\psi}_{\!\boldsymbol{\infty}}$ almost surely and $\sum_{n\geq 1} E\big[\|\boldsymbol{\xi}_{\textbf{n}}\|^2 \big]n^{-2}\leq \sum_{n\geq 1}n^{-2}<+\infty$ .
Regarding the CLTs, we observe that, by means of (11), we can write

$$\boldsymbol{\psi}_{\!\textbf{n}+\textbf{1}}-\boldsymbol{\psi}_{\!\textbf{n}}=\frac{\textbf{H}(\boldsymbol{\psi}_{\!\textbf{n}})+\Delta\textbf{M}_{\textbf{n}+\textbf{1}}}{r^*_{n+1}}, \tag{25}$$
where $\textbf{H}(\textbf{x})=(\beta-1)|\textbf{b}_{\textbf{0}}|(\textbf{x}-\textbf{p}_{\textbf{0}})$ and $\Delta{\textbf{M}_{\textbf{n}+1}}=\alpha\big({\boldsymbol{\xi_{\textbf{n}+1}}}-\boldsymbol{\psi}_{\!\textbf{n}}\big)$ . Therefore, we get

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big)=\sum_{n=1}^{N}\textbf{Y}_{\textbf{N},\textbf{n}}+\textbf{Q}_{\textbf{N}},$$

where

$$\textbf{Y}_{\textbf{N},\textbf{n}}=\frac{\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}}{\sqrt{N}}$$

and

$$\textbf{Q}_{\textbf{N}}=\frac{1}{\sqrt{N}}\sum_{n=1}^{N}\big(\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big).$$
Since $\sum_{n\geq 1} n/{r^*_{n}}<+\infty$ and $|\textbf{H}\big({\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}}\big)| + |\Delta \textbf{M}_{\textbf{n}}|$ is uniformly bounded by a constant, we have that $\textbf{Q}_{\textbf{N}}$ converges to zero almost surely. Therefore it is enough to prove that $\sum_{n=1}^N\textbf{Y}_{\textbf{N},\textbf{n}}$ converges stably to the desired Gaussian kernel. For this purpose we observe that $E\big[\textbf{Y}_{\textbf{N},\textbf{n}}|\mathcal{F}_{n-1}\big]=\textbf{0}$ , and so, to prove the stable convergence, we have to check the following conditions (see [Reference Crimaldi, Letta and Pratelli22, Corollary 7] or [Reference Crimaldi19, Corollary 5.5.2]):
(c1) $E\!\left[\,\max_{1\leq n\leq N} |\textbf{Y}_{\textbf{N},\textbf{n}} |\,\right]\to 0$ , and

(c2) $\sum_{n=1}^N \textbf{Y}_{\textbf{N},\textbf{n}} {\textbf{Y}_{{\textbf{N},\textbf{n}}}^\top}\mathop{\longrightarrow}\limits^{P}\Sigma^2$ .
Regarding (c1), we observe that

$$E\Big[\max_{1\leq n\leq N}\big|\textbf{Y}_{\textbf{N},\textbf{n}}\big|\Big]\leq\frac{2}{\sqrt{N}}\ \longrightarrow\ 0.$$
In order to conclude, we have to prove the condition (c2), that is,

$$\frac{1}{N}\sum_{n=1}^{N}\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)^\top\ \mathop{\longrightarrow}\limits^{P}\ \Sigma^2.$$
The above convergence holds true even almost surely by standard martingale arguments (see, for instance, [Reference Berti, Crimaldi, Pratelli and Rigo9, Lemma 2]). Indeed, we have $\sum_{n\geq 1}E\Big[\big\|\boldsymbol{\xi}_{\textbf{n}}-{\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}}\big\|^2\Big]/n^2\leq \sum_{n\geq 1}n^{-2}<+\infty$ and

$$E\big[\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)^\top\,\big|\,\mathcal{F}_{n-1}\big]=\mathrm{diag}\big(\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}{\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}^\top}\ \stackrel{a.s.}\longrightarrow\ \mathrm{diag}(\boldsymbol{\psi}_{\!\boldsymbol{\infty}})-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}^\top}=\Sigma^2.$$
The last stable convergence follows from the equality

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\boldsymbol{\psi}_{\!\textbf{N}}\big)=\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big)+\sqrt{N}\,\big(\boldsymbol{\psi}_{\!\boldsymbol{\infty}}-\boldsymbol{\psi}_{\!\textbf{N}}\big),$$
where the last term converges almost surely to zero by Theorem 3.
Remark 2. Equation (25) implies that the bounded stochastic process $\boldsymbol{\psi}= (\boldsymbol{\psi}_{\!\textbf{n}})_n$ is a positive (i.e. nonnegative) almost supermartingale [Reference Robbins and Siegmund49] and also a quasi-martingale [Reference Métivier39], because $\textbf{H}({\boldsymbol{\psi}_{\!\textbf{n}}})$ is uniformly bounded by a constant and $\sum_{n\geq 1} 1/{r^*_{n+1}}<+\infty$ .
3.3. Trivial cases
We here consider the two elementary cases $\beta=0$ and $\alpha=0$ .
3.3.1. The case $\beta=0$
In this case, by (7), (8), and (9), we have for all $i = 1, \ldots, k$

$$\boldsymbol{\psi}_{\!\textbf{n}}=\frac{\textbf{b}_{\textbf{0}}+\alpha\,\boldsymbol{\xi}_{\textbf{n}}}{|\textbf{b}_{\textbf{0}}|+\alpha}\qquad\text{for } n\geq 1. \tag{26}$$
We now focus on $\boldsymbol{\psi}_{\!\textbf{n}}$ for $n\geq 1$ . The process $(\boldsymbol{\psi}_{\!\textbf{n}})_{n\geq 1}$ is a k-dimensional Markov chain with a finite state space $ S=\{\textbf{s}_{\textbf{1}},\ldots, \textbf{s}_{\textbf{k}}\} $ , where

$$\textbf{s}_{\textbf{i}}=\frac{\textbf{b}_{\textbf{0}}+\alpha\,\textbf{e}_{\textbf{i}}}{|\textbf{b}_{\textbf{0}}|+\alpha},\qquad i=1,\dots,\,k,$$
and transition probability matrix

$$P=\frac{1}{|\textbf{b}_{\textbf{0}}|+\alpha}\big(\textbf{1}_k\,{\textbf{b}_{\textbf{0}}}^\top+\alpha\,Id\big)=(1-\gamma)\,\textbf{1}_k\,{\textbf{p}_{\textbf{0}}}^\top+\gamma\,Id,$$
which is irreducible and aperiodic. Now, since $\textbf{1}_k \,{\textbf{p}_{\textbf{0}}}^\top $ is idempotent and commutes with the identity, we have

$$P^n=(1-\gamma^n)\,\textbf{1}_k\,{\textbf{p}_{\textbf{0}}}^\top+\gamma^n\,Id, \tag{27}$$
where $\gamma$ is the constant given in (5), which becomes equal to $\frac{\alpha}{|\textbf{b}_{\textbf{0}}| + {\alpha}}$ for $\beta=0$ . We note that $\gamma<1$ (since $|\textbf{b}_{\textbf{0}}|>0$ by assumption) and so $P^n \to \textbf{1}_k \, {\textbf{p}_{\textbf{0}}}^\top $ , and the unique invariant probability measure on S is hence $ \boldsymbol{\pi} = \textbf{p}_{\textbf{0}}$ .
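A quick numerical check of this convergence (our sketch, with arbitrary $\textbf{b}_{\textbf{0}}$ and $\alpha$ ):

```python
import numpy as np

alpha = 2.0
b0 = np.array([1.0, 2.0, 3.0])
k, p0 = len(b0), b0 / b0.sum()
gamma_ = alpha / (b0.sum() + alpha)                  # the constant gamma for beta = 0

P = (1 - gamma_) * np.ones((k, 1)) @ p0[None, :] + gamma_ * np.eye(k)
Pn = np.linalg.matrix_power(P, 50)
print(np.abs(Pn - np.ones((k, 1)) @ p0[None, :]).max())  # of order gamma_^50, essentially 0
```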
Theorem 5. We have ${\overline{\boldsymbol{\xi}}_{\textbf{N}}}\stackrel{a.s.}\longrightarrow \textbf{p}_{\textbf{0}}$ and

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\textbf{p}_{\textbf{0}}\big)\ \mathop{\longrightarrow}\limits^{d}\ \mathcal{N}\big(\textbf{0},\Sigma^2\big),$$

where

$$\Sigma^2=\lambda\,\big(\mathrm{diag}(\textbf{p}_{\textbf{0}})-\textbf{p}_{\textbf{0}}{\textbf{p}_{\textbf{0}}}^\top\big), \tag{28}$$
with $\lambda$ defined in (6) (taking $\beta=0$ ).
Proof. We observe that, by (26), we have for each $n\geq 0$

$$\boldsymbol{\psi}_{\!\textbf{n}+\textbf{1}}=\textbf{s}_{\textbf{i}}\quad\Longleftrightarrow\quad \boldsymbol{\xi}_{\textbf{n}+\textbf{1}}=\textbf{e}_{\textbf{i}}.$$
Therefore, the strong law of large numbers for Markov chains immediately yields

$${\overline{\boldsymbol{\xi}}_{\textbf{N}}}\ \stackrel{a.s.}\longrightarrow\ \textbf{p}_{\textbf{0}}.$$
Take a vector $\textbf{c} = (c_1,\ldots,c_k)^T$ and define $g(\textbf{x}) =\textbf{c}^T (\textbf{x}-\textbf{p}_{\textbf{0}})$ . Recall that $g(\boldsymbol{\xi}_n)=g\big({\unicode{x1D7D9}}_{\{\boldsymbol{\psi}_{\!\textbf{n}} = \textbf{s}_{\textbf{1}}\}}, \ldots,{\unicode{x1D7D9}}_{\{\boldsymbol{\psi}_{\!\textbf{n}} = \textbf{s}_{\textbf{k}}\}} \big)$ and apply the CLT for uniformly ergodic Markov chains (see, for instance, [Reference Meyn and Tweedie40, Theorem 17.0.1]): the sequence

$$\frac{1}{\sqrt{N}}\sum_{n=1}^{N} g(\boldsymbol{\xi}_{\textbf{n}})$$
converges in distribution to the Gaussian distribution $\mathcal{N}\big(0,\sigma^2_{\textbf{c}}\big)$ , with

$$\sigma^2_{\textbf{c}}=\mathrm{Var}\big[g\big(\boldsymbol{\xi^{(\pi)}_0}\big)\big]+2\sum_{n\geq 1}\mathrm{Cov}\big(g\big(\boldsymbol{\xi^{(\pi)}_0}\big),\,g\big(\boldsymbol{\xi^{(\pi)}_{\textbf{n}}}\big)\big),$$
where

$$\boldsymbol{\xi^{(\pi)}_{\textbf{n}}}=\Big({\unicode{x1D7D9}}_{\{{\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}}=\textbf{s}_{\textbf{1}}\}},\ \ldots,\ {\unicode{x1D7D9}}_{\{{\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}}=\textbf{s}_{\textbf{k}}\}}\Big)^{\top}$$
and $\big({\boldsymbol{\psi}^{(\boldsymbol{\pi})}_{\textbf{n}}}\big)_{n\geq 0}$ is a Markov chain with transition matrix P and initial distribution $\boldsymbol{\pi}$ , that is, $\textbf{p}_{\textbf{0}}$ . Now, by definition, $E\big[\boldsymbol{\xi^{(\pi)}_0}\big]=\textbf{p}_{\textbf{0}}$ and $E\big[\boldsymbol{\xi^{(\pi)}_0}{\boldsymbol{\xi}^{(\pi)\,\top}_{0}}\big]=\mathrm{diag}(\textbf{p}_{\textbf{0}})$ , and hence $\mathrm{Var}\big[\boldsymbol{{\xi}^{(\pi)}_0} \big] = \mathrm{diag}(\textbf{p}_{\textbf{0}}) -\textbf{p}_{\textbf{0}}\textbf{p}_{\textbf{0}}^T$ . Moreover, by means of (27),

$$\mathrm{Cov}\big(g\big(\boldsymbol{\xi^{(\pi)}_0}\big),\,g\big(\boldsymbol{\xi^{(\pi)}_{\textbf{n}}}\big)\big)=\gamma^n\,\textbf{c}^T\big(\mathrm{diag}(\textbf{p}_{\textbf{0}})-\textbf{p}_{\textbf{0}}\textbf{p}_{\textbf{0}}^T\big)\textbf{c}.$$
Hence, since $\gamma < 1$ , we have $\sum_{n\geq 1}\gamma^n=\gamma/(1-\gamma)$ and so

$$\sigma^2_{\textbf{c}}=\Big(1+\frac{2\gamma}{1-\gamma}\Big)\,\textbf{c}^T\big(\mathrm{diag}(\textbf{p}_{\textbf{0}})-\textbf{p}_{\textbf{0}}\textbf{p}_{\textbf{0}}^T\big)\textbf{c}=\frac{1+\gamma}{1-\gamma}\,\textbf{c}^T\big(\mathrm{diag}(\textbf{p}_{\textbf{0}})-\textbf{p}_{\textbf{0}}\textbf{p}_{\textbf{0}}^T\big)\textbf{c}.$$
By the Cramér–Wold device, the theorem is proved with $\Sigma^2$ given in (28).
Remark 3. Note that in Theorem 5 we do not assume $b_{0\,i}>0$ for all i, but only $|\textbf{b}_{\textbf{0}}|>0$ (as stated in Section 2). A different behavior is observed when $\textbf{b}_{\textbf{0}} = \textbf{0}$ . In this case, (26) gives $\boldsymbol{\psi}_{\!\textbf{n}} = \boldsymbol{\xi}_{\textbf{n}} $ for $n\geq 1$ . Since $\psi_{n\,i}=P(\xi_{n+1\,i}=1|\mathcal{F}_n)$ , the above equality implies recursively $\boldsymbol{\psi}_{\!\textbf{n}} = \boldsymbol{\xi}_{\textbf{n}} = \boldsymbol{\xi_{\textbf{1}}}$ for each $n\geq 1$ . In other words, the process of extractions $\boldsymbol{\xi}=(\boldsymbol{\xi}_{\textbf{n}})_{n\geq 1}$ is constant, with $P(\xi_{1\,i} = 1)= \psi_{0\,i} = {B_{0\,i}} / |\textbf{B}_{\textbf{0}}|$ .
3.3.2. The case $\alpha=0$
The model introduced above for $\alpha>0$ also makes sense when $\alpha=0$ . For completeness, in this section we discuss this case. Recall that we are assuming $|\textbf{b}_{\textbf{0}}|>0$ and $b_{0\,i}+B_{0\,i}>0$ (see Section 2). For the case $\beta>1$ , we here assume also $|\textbf{B}_{\textbf{0}}|>0$ .
When $\alpha=0$ , the random vectors $\boldsymbol{\xi}_{\textbf{n}}$ are independent with
Therefore, we have

$$\boldsymbol{\psi}_{\!\textbf{n}}=\frac{\textbf{b}_{\textbf{0}}+\textbf{B}_{\textbf{0}}}{|\textbf{b}_{\textbf{0}}|+|\textbf{B}_{\textbf{0}}|}$$

for all n if $\beta=1$ (which corresponds to the classical multinomial model) and

$$\boldsymbol{\psi}_{\!\textbf{n}}\ \longrightarrow\ \begin{cases}\textbf{p}_{\textbf{0}} & \text{if }\beta\in[0,1),\\[4pt] \textbf{B}_{\textbf{0}}/|\textbf{B}_{\textbf{0}}| & \text{if }\beta>1.\end{cases}$$
Moreover, the following result holds true.
Theorem 6. We have

$${\overline{\boldsymbol{\xi}}_{\textbf{N}}}\ \stackrel{a.s.}\longrightarrow\ \boldsymbol{\psi}_{\!\boldsymbol{\infty}}=\lim_n\boldsymbol{\psi}_{\!\textbf{n}}.$$

Moreover, we have

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big)\ \mathop{\longrightarrow}\limits^{s}\ \mathcal{N}\big(\textbf{0},\Sigma^2\big),$$
where $\Sigma^2 = \mathrm{diag} ({{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}}) -{{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}}} {\boldsymbol{\psi}_{\!\boldsymbol{\infty}}^\top}$ and $\mathop{\longrightarrow}\limits^{s}$ means stable convergence.
Proof. The almost sure convergence follows from the Borel–Cantelli lemmas (see, for instance, [Reference Williams54, Section 12.15]). Indeed, we have the following:
• if $\sum_{n\geq 0}\psi_{n\,i}<+\infty$ , then $\sum_{n=1}^N\xi_{n\,i}\stackrel{a.s.}\longrightarrow \xi_{\infty\,i}$ , with $P(\xi_{\infty\,i}<+\infty)=1$ ;

• if $\sum_{n\geq 0}\psi_{n\,i}=+\infty$ , then $\sum_{n=1}^N\xi_{n\,i}/\sum_{n=1}^N\psi_{n-1\,i}\stackrel{a.s.}\longrightarrow 1$ .
Hence, the statement of the theorem follows because
(i) if $\beta=1$ , then $\sum_{n\geq 0}\psi_{n\,i}=+\infty$ and
\begin{equation*}\sum_{n=1}^{N}\psi_{n-1\,i}\sim \frac{b_{0\,i}+B_{0\,i}}{|\textbf{b}_{\textbf{0}}|+|\textbf{B}_{\textbf{0}}|} N;\end{equation*}

(ii) if $\beta\in [0,1)$ and $b_{0\,i}>0$ , then $\sum_{n\geq 0}\psi_{n\,i}=+\infty$ and
\begin{equation*}\sum_{n=1}^{N}\psi_{n-1\,i}\sim \frac{b_{0\,i}}{|\textbf{b}_{\textbf{0}}|} N=p_{0\,i} N;\end{equation*}

(iii) if $\beta\in [0,1)$ and $b_{0\,i}=0$ , then
\begin{equation*}\sum_{n\geq 0}\psi_{n\,i}\leq \frac{B_{0\,i}}{|\textbf{b}_{\textbf{0}}|}\sum_{n\geq 0}\beta^n <+\infty\end{equation*}
and so $\sum_{n\geq 1}\xi_{n\,i}<+\infty$ almost surely; that is, $\xi_{n\,i}=0$ eventually with probability one;

(iv) if $\beta>1$ and $B_{0\,i}> 0$ , then $\sum_{n\geq 0}\psi_{n\,i}=+\infty$ and $\sum_{n=1}^{N}\psi_{n-1\,i}\sim \frac{B_{0\,i}}{|\textbf{B}_{\textbf{0}}|} N$ ;

(v) if $\beta>1$ and $B_{0\,i}=0$ , then
\begin{equation*}\sum_{n\geq 0}\psi_{n\,i}\leq \frac{b_{0\,i}}{|\textbf{B}_{\textbf{0}}|}\sum_{n\geq 0}\beta^{-n}<+\infty\end{equation*}
and so $\sum_{n\geq 1}\xi_{n\,i}<+\infty$ almost surely; that is, $\xi_{n\,i}=0$ eventually with probability one.
For the CLT we argue as in the proof of Theorem 4. Indeed, we set $\textbf{Y}_{\textbf{N},\textbf{n}}=\frac{\boldsymbol{\xi}_{\textbf{n}}-{\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}}}{\sqrt{N}}$ so that we have

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}_{\textbf{N}}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big)=\sum_{n=1}^{N}\textbf{Y}_{\textbf{N},\textbf{n}}+\frac{1}{\sqrt{N}}\sum_{n=1}^{N}\big(\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big),$$

where the second term converges to zero because, in each of the cases (i)–(v) above, $\big|\boldsymbol{\psi}_{\!\textbf{n}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big|$ decays geometrically and so

$$\frac{1}{\sqrt{N}}\sum_{n=1}^{N}\big|\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big|\leq\frac{1}{\sqrt{N}}\sum_{n\geq 1}\big|\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}\big|\ \longrightarrow\ 0.$$
Therefore it is enough to prove that $\sum_{n=1}^N\textbf{Y}_{\textbf{N},\textbf{n}}$ converges stably to the desired Gaussian kernel. To this end we observe that $E[\textbf{Y}_{\textbf{N},\textbf{n}}|\mathcal{F}_{n-1}]=\textbf{0}$ , and so, to prove the stable convergence, we have to check the following conditions (see [Reference Crimaldi, Letta and Pratelli22, Corollary 7] or [Reference Crimaldi19, Corollary 5.5.2]):
(c1) $E\!\left[\,\max_{1\leq n\leq N} |\textbf{Y}_{\textbf{N},\textbf{n}} |\,\right]\to 0$ , and

(c2) $\sum_{n=1}^N \textbf{Y}_{\textbf{N},\textbf{n}} {\textbf{Y}_{\textbf{N},\textbf{n}}^\top}\mathop{\longrightarrow}\limits^{P}\Sigma^2$ .
Regarding (c1), we note that

$$E\Big[\max_{1\leq n\leq N}\big|\textbf{Y}_{\textbf{N},\textbf{n}}\big|\Big]\leq\frac{2}{\sqrt{N}}\ \longrightarrow\ 0.$$
Regarding (c2), we observe that

$$\sum_{n=1}^{N}\textbf{Y}_{\textbf{N},\textbf{n}}{\textbf{Y}_{\textbf{N},\textbf{n}}^\top}=\frac{1}{N}\sum_{n=1}^{N}\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)^\top\ \stackrel{a.s.}\longrightarrow\ \Sigma^2,$$

because (see, for instance, [Reference Berti, Crimaldi, Pratelli and Rigo9, Lemma 2]) $\sum_{n\geq 1} E\big[\big\|\boldsymbol{\xi}_{\textbf{n}}-{\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}}\big\|^2\big]/n^2\leq\sum_{n\geq 1} n^{-2}<+\infty$ and

$$E\big[\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)\big(\boldsymbol{\xi}_{\textbf{n}}-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)^\top\,\big|\,\mathcal{F}_{n-1}\big]=\mathrm{diag}\big(\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}\big)-\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}{\boldsymbol{\psi}_{\!\textbf{n}-\textbf{1}}^\top}\ \longrightarrow\ \mathrm{diag}(\boldsymbol{\psi}_{\!\boldsymbol{\infty}})-\boldsymbol{\psi}_{\!\boldsymbol{\infty}}{\boldsymbol{\psi}_{\!\boldsymbol{\infty}}^\top}=\Sigma^2.$$
4. Proof of the goodness-of-fit result (Theorem 1)
The proof is based on Theorem 2 (for $0<\beta<1$ ) and Theorem 5 (for $\beta = 0$ ), whose proofs are in Subsections 3.1 and 3.3.1, respectively. The almost sure convergence of $O_i/N$ immediately follows since $O_i/N=\overline{\xi}_{N\,i}$ . In order to prove the stated convergence in distribution, we mimic the classical proof for the Pearson chi-squared test based on the Sherman–Morrison formula (see [Reference Sherman and Morrison51]); see also [Reference Rao and Scott47, Corollary 2].
Proof. We start by recalling the Sherman–Morrison formula: if A is an invertible square matrix and $ 1 - \textbf{v}^\top A^{-1} \textbf{u} \neq 0 $ , then

$$\big(A-\textbf{u}\textbf{v}^\top\big)^{-1}=A^{-1}+\frac{A^{-1}\textbf{u}\textbf{v}^\top A^{-1}}{1-\textbf{v}^\top A^{-1}\textbf{u}}.$$
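A quick numerical sanity check of the formula (our sketch, with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.diag(rng.uniform(0.5, 2.0, size=4))   # an invertible matrix A
u, v = rng.normal(size=4), rng.normal(size=4)
Ainv = np.linalg.inv(A)

lhs = np.linalg.inv(A - np.outer(u, v))
rhs = Ainv + (Ainv @ np.outer(u, v) @ Ainv) / (1 - v @ Ainv @ u)
print(np.allclose(lhs, rhs))                 # True
```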
Given the observation $\boldsymbol{\xi}_{\textbf{n}}=(\xi_{n\,1},\dots,\,\xi_{n\,k})^{\top}$ , we define the ‘truncated’ vector $\boldsymbol{\xi}^*_{\textbf{n}}= \big(\xi^*_{n\, 1}, \ldots,\xi^*_{n\, k-1}\big)^\top$ , given by the first $k-1$ components of $\boldsymbol{\xi}_{\textbf{n}}$ . Theorem 2 (for $\beta \in (0, 1)$ ) and Theorem 5 (for $\beta = 0$ ) give the CLT for $(\boldsymbol{\xi}_{\textbf{n}})_n$ , which immediately implies

$$\sqrt{N}\,\big({\overline{\boldsymbol{\xi}}^*_{\textbf{N}}}-\textbf{p}^*\big)\ \mathop{\longrightarrow}\limits^{d}\ \mathcal{N}\big(\textbf{0},\Sigma_*^2\big), \tag{29}$$

where ${\overline{\boldsymbol{\xi}}^*_{\textbf{N}}}=\sum_{n=1}^N\boldsymbol{\xi}^*_{\textbf{n}}/N$ , $\textbf{p}^*$ is given by the first $k-1$ components of $\textbf{p}_{\textbf{0}}$ , and

$$\Sigma_*^2=\lambda\,\big(\mathrm{diag}(\textbf{p}^*)-\textbf{p}^*{\textbf{p}^*}^\top\big). \tag{30}$$
By assumption $p_{0\,i}>0$ for all $i=1,\dots,\,k$ and so $\mathrm{diag}(\textbf{p}^*)$ is invertible with inverse $\mathrm{diag} (\textbf{p}^*)^{-1}= \mathrm{diag} \Big(\frac{1}{p_{0\,1}}, \ldots, \frac{1}{p_{0\,k-1}} \Big)$ ; since $ \big(\mathrm{diag} (\textbf{p}^*)^{-1} \big) \textbf{p}^* =\textbf{1}\in\mathbb{R}^{k-1}$ , we have

$$1-{\textbf{p}^*}^\top\,\mathrm{diag}(\textbf{p}^*)^{-1}\,\textbf{p}^*=1-\sum_{i=1}^{k-1}p_{0\,i}=p_{0\,k}>0.$$
Therefore we can use the Sherman–Morrison formula with $A =\mathrm{diag} (\textbf{p}^*)$ and $\textbf{u}= \textbf{v} = \textbf{p}^*$ , and we obtain

$$\big(\mathrm{diag}(\textbf{p}^*)-\textbf{p}^*{\textbf{p}^*}^\top\big)^{-1}=\mathrm{diag}(\textbf{p}^*)^{-1}+\frac{\textbf{1}\,\textbf{1}^\top}{p_{0\,k}}.$$
Now, since $\sum_{i=1}^{k} \big(\overline{\xi}_{N\,i} - {p_{0\,i}}\big) = 0$ , we have $\overline{\xi}_{N\,k} - {p_{0\,k}} = -\sum_{i=1}^{k-1}\big(\overline{\xi}_{N\,i} - {p_{0\,i}}\big)$ , and so we get

$$\sum_{i_1,i_2=1}^{k-1}\big(\overline{\xi}_{N\,i_1}-p_{0\,i_1}\big)\bigg(\frac{\delta_{i_1}^{i_2}}{p_{0\,i_1}}+\frac{1}{p_{0\,k}}\bigg)\big(\overline{\xi}_{N\,i_2}-p_{0\,i_2}\big)=\sum_{i=1}^{k}\frac{\big(\overline{\xi}_{N\,i}-p_{0\,i}\big)^{2}}{p_{0\,i}},$$
where $\delta_{i_1}^{i_2}$ is equal to 1 if $i_1=i_2$ and equal to zero otherwise. Finally, from the above equalities, recalling (29) and (30), we obtain

$$W_N=N\sum_{i=1}^{k}\frac{\big(\overline{\xi}_{N\,i}-p_{0\,i}\big)^{2}}{p_{0\,i}}\ \mathop{\longrightarrow}\limits^{d}\ W_*=\lambda\,W_0,$$

where $W_0$ is a random variable with distribution $\chi^2(k-1)=\Gamma((k-1)/2,1/2)$ , where $\Gamma(a,b)$ denotes the gamma distribution with density function

$$f(x)=\frac{b^a}{\Gamma(a)}\,x^{a-1}\,e^{-bx}\,{\unicode{x1D7D9}}_{(0,+\infty)}(x).$$
As a consequence, $W_*$ has distribution $\Gamma((k-1)/2,1/(2\lambda))$ .
Appendix A
A.1. A central limit theorem for a multidimensional compact Markov chain
In this section we prove the general CLT for Markov chains, used for the proof of Theorem 2.
Let (S, d) be a compact metric space and denote by C(S) the space of continuous real functions on S, by Lip(S) the space of Lipschitz continuous real functions on S, and by $Lip(S\times S)$ the space of Lipschitz continuous real functions on $S\times S$ . Moreover, we define $\|f\|_\infty=\sup_{x\in S}|f(x)|$ for each f in C(S) and, for each f in Lip(S),

$$|f|_{Lip}=\sup_{x\neq y}\frac{|f(x)-f(y)|}{d(x,y)}\qquad\text{and}\qquad \|f\|_{Lip}=\|f\|_\infty+|f|_{Lip}.$$
Let P(x, dy) be a Markovian kernel on S and set $(Pf)(x) = \int_Sf(y)P(x,dy)$ . We now recall some definitions and results concerning Markov chains with values in S.
Definition 1. We say that P is weak Feller if $(Pf)(x)=\int_S f(y)P(x,dy)$ defines a linear operator $P\,:\,C(S)\to C(S)$ . A Markov chain with a weak Feller transition kernel is called a weak Feller Markov chain.
Remark 4. If P is weak Feller, then the sequence $(P^n)_{n\geq 1}$ of operators from C(S) to C(S) is uniformly bounded with respect to $\|\cdot\|_\infty$ : indeed, we simply have

$$\|P^nf\|_\infty\leq\|f\|_\infty.$$
Moreover, the existence of at least one invariant probability measure for P is easily shown. In fact, the set of probability measures $\mathcal{P}(S)$ on S, endowed with the topology of weak convergence, is a compact convex set. In addition, the adjoint operator of P, namely

$$P^*\,:\,\mathcal{P}(S)\to\mathcal{P}(S),\qquad (P^*\mu)(A)=\int_S P(x,A)\,\mu(dx),$$
is continuous on $\mathcal{P}(S)$ (since P is weak Feller). Then, the existence of an invariant probability measure $\pi$ is a consequence of Brouwer’s fixed-point theorem.
Definition 2. We say that P is semi-contractive or a semi-contraction on Lip(S) if it maps Lip(S) into itself and there exists a constant $\gamma<1$ such that

$$|Pf|_{Lip}\leq\gamma\,|f|_{Lip}$$
for each $f\in Lip(S)$ .
We now give the definition of compact Markov chain (see [Reference Norman42, Chapter 3] for a general exposition of the theory of these processes, and [Reference Doeblin and Fortet24] for the beginning of this theory).
Definition 3. We say that P is a Doeblin–Fortet operator if it is weak Feller and a bounded operator from $(Lip(S),\|\cdot\|_{Lip})$ into itself, and there are finite constants $n_0\geq 1$ , $\gamma<1$ , and $R\geq 0$ such that

$$|P^{n_0}f|_{Lip}\leq\gamma\,|f|_{Lip}+R\,\|f\|_\infty$$
for each $f\in Lip(S)$ . A Markov chain with a Doeblin–Fortet operator on a compact set S is called compact Markov chain (or process).
Note that the Doeblin–Fortet operator, the weak Feller property, and the semi-contraction may also be defined for a non-compact state space. In general, a compact Markov process is a Doeblin–Fortet process in a compact state space. In our framework, since S is compact, the two concepts coincide and the result below follows immediately.
Lemma 1. If P is weak Feller and a semi-contractive operator on Lip(S), then P is a Doeblin–Fortet operator. In other words, a weak Feller Markov chain such that its transition kernel is semi-contractive on Lip(S) is a compact Markov chain.
Definition 4. We say that P is irreducible and aperiodic if 1 is the unique eigenvalue of P of modulus 1 and the corresponding eigenfunctions are the constant functions.
A Markov chain with an irreducible and aperiodic transition kernel is called an irreducible and aperiodic Markov chain.
Under the hypotheses of the theorem of Ionescu-Tulcea and Marinescu in [Reference Ionescu Tulcea and Marinescu35], the spectral radius of P is 1, the set of eigenvalues of P of modulus 1 has only a finite number of elements, and each relative eigenspace is finite-dimensional. This theorem can always be applied to a compact Markov chain (see [Reference Norman42, Theorem 3.3.1]). More specifically, every compact Markov chain has d disjoint closed sets, called ergodic sets, contained in its compact state space S. These sets are both the support of the base of the ergodic invariant probability measures, and the support of a base of the eigenspaces related to the eigenvalues of modulus 1 (see [Reference Norman42, Theorem 3.4.1]). In addition, each of these ergodic sets may be subdivided into $p_j$ closed disjoint subsets. The number $p_j$ is the period of the jth irreducible component, and the ergodic subdivision gives the support of the eigenfunctions related to the $p_j$ roots of 1 (see [Reference Norman42, Theorem 3.5.1]). Then, as also explained in [Reference Norman42, Section 3.6], there are no other eigenvalues of modulus 1 except 1 (aperiodicity) and no other eigenfunctions except the constant for the eigenvalue equal to 1 (irreducibility) if and only if the compact Markov chain has but one ergodic kernel, and this kernel has period 1. In other words, the following result holds true.
Theorem 7. Let $\psi = (\psi_n)_{n\geq 0}$ be a compact Markov chain and let $\pi$ be an invariant probability measure for its transition kernel. If $\psi = (\psi_n)_{n\geq 0}$ converges in distribution to $\pi$ , regardless of its initial distribution, then $\pi$ is the unique invariant probability measure and $\psi$ is irreducible and aperiodic.
We now note that if P is Doeblin–Fortet, irreducible, and aperiodic, then it satisfies all the conditions given in [Reference Guivarc’h and Hardy30, Définition 0] and [Reference Guivarc’h and Hardy30, Définition 1]. Therefore, it has a unique invariant probability measure $\pi$ and, for any $f\in Lip(S\times S)$ , there exists a function $u_f \in Lip(S)$ , unique up to an additive constant, such that
By means of this function $u_f$ , it is possible to define the (unique) function $ f^{\prime}(x,y) = f(x,y)+u_f(y)-u_f(x) $ so that we have
In addition, we may define the quantity $\sigma^2(f) \geq 0$ as follows (see [Reference Guivarc’h and Hardy30, Equation (6)]):
Finally, we have the following convergence result.
Theorem 8. ([Reference Guivarc’h and Hardy30, Théorème 1 and Théorème 2]) Let $\psi=(\psi_n)_{n\geq 0}$ be an irreducible and aperiodic compact Markov chain and denote by $\pi$ its unique invariant probability measure. Let $f\in Lip(S\times S)$ be such that $m(f)=0$ and $\sigma^2(f)>0$ . Then, setting $S_N(f)=\sum_{n=0}^{N-1}f(\psi_n,\psi_{n+1})$ , we have
and
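As a numerical sanity check of Theorem 8, the sketch below (our own finite-state toy example) computes $u_f$ and $\sigma^2(f)$ and compares $\sigma^2(f)$ with the empirical variance of $S_N(f)/\sqrt{N}$. It assumes that (A.1) is the Poisson equation $u_f - Pu_f = \bar{f}$ with $\bar{f}(x)=\int P(x,dy)f(x,y)$ , and that $\sigma^2(f)$ is the stationary second moment of $f^{\prime}(x,y)=f(x,y)+u_f(y)-u_f(x)$ ; both are assumptions of this illustration, consistent with the martingale decomposition behind [Reference Guivarc’h and Hardy30].

```python
import numpy as np

# Finite-state sanity check of Theorem 8 (our construction): on a finite
# irreducible aperiodic chain, solve the Poisson equation for u_f and
# compute sigma^2(f) as the stationary second moment of
# f'(x, y) = f(x, y) + u_f(y) - u_f(x), then compare with simulation.
rng = np.random.default_rng(2)
k = 4
P = rng.random((k, k)) + 0.1
P /= P.sum(axis=1, keepdims=True)

# Invariant probability measure pi (left eigenvector for eigenvalue 1).
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

F = rng.standard_normal((k, k))          # a function f(x, y)
F -= (pi[:, None] * P * F).sum()         # center it so that m(f) = 0

fbar = (P * F).sum(axis=1)               # fbar(x) = E[f(x, Y) | x]
# Poisson equation (I - P) u = fbar; solvable since pi . fbar = m(f) = 0.
u = np.linalg.lstsq(np.eye(k) - P, fbar, rcond=None)[0]
G = F + u[None, :] - u[:, None]          # f'(x, y) = f + u(y) - u(x)
sigma2 = (pi[:, None] * P * G**2).sum()  # asymptotic variance

# Monte Carlo: S_N(f)/sqrt(N) should be close to N(0, sigma^2).
N, reps, vals = 1000, 400, []
for _ in range(reps):
    x, s = rng.choice(k, p=pi), 0.0
    for _ in range(N):
        y = rng.choice(k, p=P[x])
        s += F[x, y]
        x = y
    vals.append(s / np.sqrt(N))
print("sigma^2 (Poisson eq.):", round(sigma2, 4))
print("empirical variance   :", round(float(np.var(vals)), 4))
```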
Now let us specialize our assumptions, taking as S a compact subset of $\mathbb{R}^k$ . Accordingly, in the sequel we use boldface to highlight the fact that we are working with vectors.
Definition 5. (‘Linearity’ condition.) We say that P and ${\textbf{f}}\,:\, S\times S \to \mathbb{R}^d$ form a linear model if ${\textbf{f}}$ is linear (in $\textbf{x}$ and $\textbf{y}$ ) with $m({\textbf{f}})= \textbf{0}$ and the function
is linear (in $\textbf{x}$ ).
Remark 5. Denote by $\textbf{p}_{\textbf{0}} = \int_S \textbf{x}\, \pi(\textbf{dx})$ the mean value of $\textbf{x}$ under $\pi$ , the invariant probability measure of P. If P and ${\textbf{f}}$ form a linear model, then there exist two matrices $A_{1},A_{2} \in\mathbb{R}^{d\times k}$ such that
and a square matrix $A_{P} \in \mathbb{R}^{k\times k}$ such that
Indeed, if $(P\textbf{y})(\textbf{x}) = A_{P}\textbf{x} + \textbf{b}$ , using that $\pi$ is invariant with respect to P, we obtain
and hence $(P(\textbf{y}-\textbf{p}_{\textbf{0}})) (\textbf{x}) = A_{P} (\textbf{x}-\textbf{p}_{\textbf{0}})$ . Moreover, if ${\textbf{f}}(\textbf{x},\textbf{y}) = A_{1}\textbf{x}+A_{2}\textbf{y}+\textbf{b}$ , then
and hence, if $m({\textbf{f}})= \textbf{0}$ , we obtain ${\textbf{f}}=A_{1}(\textbf{x}-\textbf{p}_{\textbf{0}}) + A_{2}(\textbf{y}-\textbf{p}_{\textbf{0}})$ .
Theorem 9. Let $\psi=(\psi_n)_{n\geq 0}$ be an irreducible and aperiodic compact Markov chain, and denote by P its transition kernel and by $\pi$ its unique invariant probability measure. Assume that P and ${\textbf{f}}$ form a linear model and let $A_1$ , $A_2$ , and $A_P$ be defined as in (A.2) and (A.3). Then, setting $\textbf{S}_{\textbf{N}}({\textbf{f}})=\sum_{n=0}^{N-1}{\textbf{f}}(\boldsymbol{\psi}_n,\boldsymbol{\psi}_{n+1})$ , we have
where
with
(the variance–covariance matrix under the invariant probability measure $\pi$ ),
where $D_0 = (A_{1}+ A_{2} A_{P})(Id - A_{P})^{-1} $ . Moreover, for any $\textbf{c}\in\mathbb{R}^k$ ,
Proof. Since P is semi-contractive (Definition 2), the spectral radius of $A_{P}$ must be less than one, and hence $Id - A_{P}$ is invertible. Therefore, we may define
so that we have
We immediately get that the function $ \textbf{g}(\textbf{x},\textbf{y}) ={\textbf{f}}(\textbf{x},\textbf{y}) +\textbf{u}_{\textbf{f}} (\textbf{y}) -\textbf{u}_{\textbf{f}} (\textbf{x}) $ is linear and it may be written as $ \textbf{g}(\textbf{x},\textbf{y}) =D_{1}(\textbf{x}-\textbf{p}_{\textbf{0}}) + D_{2}(\textbf{y}-\textbf{p}_{\textbf{0}})$ . Taking into account that
and
we can compute the quantity
By the Cramér–Wold device, the theorem is proven with $\Sigma$ given above if we prove that, for any $\textbf{c}$ ,
Therefore, in order to conclude, it is enough to note that the above convergence is a consequence of Theorem 8 with $f =\textbf{c}^\top{\textbf{f}}$ . Indeed, by definition $f\in Lip(S\times S)$ and the function $u_f\in Lip(S)$ in (A.1) may be chosen as $ u_f = \textbf{c}^\top\textbf{u}_{\textbf{f}}$ , so that $m(f)=0 $ and $ \sigma^2(f)=\textbf{c}^\top\Sigma\textbf{c}$ .
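The following Monte Carlo sketch illustrates Theorem 9 on a hypothetical linear model of the kind just described (the parameters a0, a1, r, p0 are ours, not the paper's): the chain on the simplex moves as $\boldsymbol{\psi}_{n+1}=(1-r)\boldsymbol{\psi}_n+r\,\boldsymbol{\xi}_{n+1}$ with $P(\boldsymbol{\xi}_{n+1}=\textbf{e}_{\textbf{i}}\,|\,\boldsymbol{\psi}_n)=a_0 p_{0\,i}+a_1\psi_{n\,i}$ , so that $A_P=(1-a_0r)\,Id$ , and we take $\textbf{f}(\textbf{x},\textbf{y})=\textbf{y}-\textbf{p}_{\textbf{0}}$ , i.e. $A_1=0$ and $A_2=Id$ :

```python
import numpy as np

# Hypothetical linear model on the simplex (our parameters, not the
# paper's): psi_{n+1} = (1-r) psi_n + r xi_{n+1}, with xi_{n+1} a
# canonical-basis vector drawn with P(xi = e_i | psi) = a0*p0_i + a1*psi_i.
# Then (P y)(x) = (1 - a0*r) x + a0*r*p0, i.e. A_P = (1 - a0*r) Id, and
# f(x, y) = y - p0 gives A_1 = 0, A_2 = Id and m(f) = 0.
rng = np.random.default_rng(3)
k, a0, a1, r = 3, 0.4, 0.6, 0.3
p0 = np.array([0.5, 0.3, 0.2])
E = np.eye(k)

def run(N):
    psi, S = p0.copy(), np.zeros(k)
    for _ in range(N):
        psi = (1 - r) * psi + r * E[rng.choice(k, p=a0 * p0 + a1 * psi)]
        S += psi - p0                 # f(psi_n, psi_{n+1}) = psi_{n+1} - p0
    return S / np.sqrt(N)

samples = np.array([run(1000) for _ in range(400)])
print("empirical mean:", samples.mean(axis=0).round(3))   # close to 0
print("empirical cov :\n", np.cov(samples.T).round(4))

# Compare with Sigma computed through g(x,y) = f(x,y) + u_f(y) - u_f(x),
# where u_f(x) = D0 (x - p0) and D0 = (A_1 + A_2 A_P)(Id - A_P)^{-1},
# which here reduces to D0 = d0*Id with d0 = (1 - a0*r)/(a0*r).
d0 = (1 - a0 * r) / (a0 * r)
psi, acc, M = p0.copy(), np.zeros((k, k)), 100_000
for _ in range(M):
    new = (1 - r) * psi + r * E[rng.choice(k, p=a0 * p0 + a1 * psi)]
    g = (1 + d0) * (new - p0) - d0 * (psi - p0)
    acc += np.outer(g, g)
    psi = new
print("Sigma via g   :\n", (acc / M).round(4))
# Note: one direction is degenerate, since the coordinates of psi sum to 1.
```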
A.2. Coupling technique
The result proven in this subsection plays a key rôle in the proof of Theorem 2. Indeed, it shows that, under suitable assumptions, two stochastic processes can be coupled in such a way that their respective joint distributions are preserved.
Set $S^*=\{\textbf{x}\,:\,x_i\geq 0,\; |\textbf{x}|= 1\} $ (i.e. the standard (or probability) simplex in $\mathbb{R}^k$ ), and recall that $\{\textbf{e}_{\textbf{1}}, \ldots, \textbf{e}_{\textbf{k}}\}$ denotes the canonical basis of $\mathbb{R}^k$ . We have the following technical lemma.
Lemma 2. There exist two measurable functions $\textbf{h}^{(\textbf{1})},\textbf{h}^{(\textbf{2})}\,:\,S^*\times S^*\times (0, 1) \to \{\textbf{e}_{\textbf{1}}, \ldots, \textbf{e}_{\textbf{k}}\}$ , such that for any ${\textbf{x} ,\textbf{y} \in S^*}$ ,
and
As a consequence, we have
Proof. Given $\textbf{x},\textbf{y} \in S^*$ , define ${\underline{\textbf{xy}}}= \textbf{x}\wedge \textbf{y}$ so that $\underline{xy}_{\,i}= \min\!(x_{i},y_i)$ . Set $u_0= |{\underline{\textbf{xy}}}| = \sum_{i=1}^k \min\!(x_{i},y_i)$ , and note that $0 \leq u_0 \leq 1$ . Moreover, for any $i \in \{1,\ldots,k\}$ , set
and let
Observe that since $1 = u_0 + \sum_{i=1}^{k} \big(x_i-\underline{xy}_{\,i}\big)= u_0 + \sum_{i=1}^{k} \big(y_i-\underline{xy}_{\,i}\big)$ , the equalities above uniquely define $\textbf{h}^{(\textbf{1})},\textbf{h}^{(\textbf{2})}$ on the whole domain. Moreover, since $x_i =\underline{xy}_{\,i} +\big(x_i-\underline{xy}_{\,i}\big)$ and $y_i = \underline{xy}_{\,i}+\big(y_i-\underline{xy}_{\,i}\big)$ , the two conditions collected in Equation (A.6) are satisfied.
To check (A.7), just note that $\textbf{h}^{(\textbf{1})} (\textbf{x} , \textbf{y}, u) $ is equal to $\textbf{h}^{(\textbf{2})} (\textbf{x}, \textbf{y}, u) $ on the set $\cup_i A_{{\underline{\textbf{xy}}}\,i} =(0,u_0)$ , and we have
Finally, (A.8) follows immediately from (A.7) since $| \textbf{h}^{(\textbf{1})}-\textbf{h}^{(\textbf{2})} |\leq 2$ .
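A minimal implementation of the interval construction used in the proof may help (our own sketch; the probability vectors x and y below are hypothetical): $(0,u_0)$ is partitioned into intervals of lengths $\min(x_i,y_i)$ , on which $\textbf{h}^{(\textbf{1})}=\textbf{h}^{(\textbf{2})}=\textbf{e}_{\textbf{i}}$ , while the remaining interval $(u_0,1)$ is partitioned separately for the two functions:

```python
import numpy as np

# Sketch of the interval construction in the proof of Lemma 2: on (0, u0)
# the two functions coincide and return e_i on an interval of length
# min(x_i, y_i); on (u0, 1) each function has its own partition, with
# lengths x_i - min(x_i, y_i) for h1 and y_i - min(x_i, y_i) for h2.
def h_pair(x, y, u):
    m = np.minimum(x, y)
    common = np.cumsum(m)                 # right endpoints partitioning (0, u0)
    res1 = common[-1] + np.cumsum(x - m)  # partition of (u0, 1) for h1
    res2 = common[-1] + np.cumsum(y - m)  # partition of (u0, 1) for h2
    if u < common[-1]:
        i = j = np.searchsorted(common, u, side="right")
    else:
        i = np.searchsorted(res1, u, side="right")
        j = np.searchsorted(res2, u, side="right")
    i, j = min(i, len(x) - 1), min(j, len(x) - 1)  # guard against round-off
    e = np.eye(len(x))
    return e[i], e[j]

# With U uniform on (0, 1): h1 has law x, h2 has law y, and
# P(h1 != h2) = 1 - sum_i min(x_i, y_i) = |x - y|_1 / 2.
rng = np.random.default_rng(4)
x, y = np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])
draws = [h_pair(x, y, rng.random()) for _ in range(100_000)]
print("law of h1 :", np.mean([d[0] for d in draws], axis=0).round(3))
print("law of h2 :", np.mean([d[1] for d in draws], axis=0).round(3))
print("P(h1!=h2) :", round(float(np.mean([not np.allclose(d[0], d[1])
                                          for d in draws])), 3),
      "vs |x-y|_1/2 =", 0.5 * np.abs(x - y).sum())
```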
Now we are ready to prove the following ‘coupling result’.
Theorem 10. Let $\boldsymbol{\psi}^{(1)}= \big(\boldsymbol{\psi}_{n}^{(1)}\big)_n$ and $\boldsymbol{\psi}^{(2)}=\big(\boldsymbol{\psi}_{n}^{(2)}\big)_n$ be two stochastic processes with values in $S^*$ that evolve according to the following dynamics:
where $a_0,a_1 \geq 0$ , $\textbf{c}\in\mathbb{R}^k$ , $\boldsymbol{\xi}_{n+1}^{(\ell)}$ are random variables taking values in $\{\textbf{e}_{\textbf{1}}, \ldots, \textbf{e}_{\textbf{k}}\}$ and such that
and $\textbf{l}_{n+1}^{(\ell)}$ are measurable functions such that $\big|\textbf{l}^{(\ell)}_{n+1}\big|=O\big(c_{n+1}^{(\ell)}\big)$ . Then there exist two stochastic processes $\widetilde{\boldsymbol{\psi}}^{(\ell)}=\big(\widetilde{\boldsymbol{\psi}}_{n}^{(\ell)}\big)_{n\geq 0}$ , $\ell=1,2$ , evolving according to the dynamics
with $\widetilde{\boldsymbol{\psi}}_0^{(\ell)}=\boldsymbol{\psi}_{0}^{(\ell)}$ and
and such that, for any $n\geq 0$ , we have
Remark 6. As a consequence, for each $\ell=1,2$ , the two stochastic processes $\widetilde{\boldsymbol{\psi}}^{(\ell)}$ and $\widetilde{\boldsymbol{\xi}}^{(\ell)}$ have the same joint distributions as $\boldsymbol{\psi}^{(\ell)}$ and $\boldsymbol{\xi}^{(\ell)}$ , respectively. Indeed, $\widetilde{\boldsymbol{\psi}}^{(\ell)}_{0}=\boldsymbol{\psi}^{(\ell)}_{0}$ , and by (A.9), (A.10), (A.11), and (A.12), the conditional distributions of $\widetilde{\boldsymbol{\psi}}^{(\ell)}_{n+1}$ given $\big[\widetilde{\boldsymbol{\psi}}^{(\ell)}_{0},\dots,\widetilde{\boldsymbol{\psi}}^{(\ell)}_{n}\big]$ and of $\widetilde{\boldsymbol{\xi}}^{(\ell)}_{n+1}$ given $\big[\widetilde{\boldsymbol{\psi}}^{(\ell)}_{0},\widetilde{\boldsymbol{\xi}}^{(\ell)}_{1},\dots,\widetilde{\boldsymbol{\xi}}^{(\ell)}_{n}\big]$ are the same as those of $\boldsymbol{\psi}^{(\ell)}_{n+1}$ given $\big[\boldsymbol{\psi}^{(\ell)}_{0},\dots,\boldsymbol{\psi}^{(\ell)}_{n}\big]$ and of $\boldsymbol{\xi}^{(\ell)}_{n+1}$ given $\big[\boldsymbol{\psi}^{(\ell)}_{0},\boldsymbol{\xi}^{(\ell)}_{1},\dots,\boldsymbol{\xi}^{(\ell)}_{n}\big]$ , respectively.
Moreover, from the inequality (A.13), by recursion, we obtain
Proof. We set $\widetilde{\boldsymbol{\psi}}_{0}^{(\ell)}= \boldsymbol{\psi}_{0}^{(\ell)}$ for $\ell=1,2$ , and we take a sequence $(U_n)_{n\geq 1}$ of i.i.d. uniform random variables on (0, 1), independent of $\sigma\big(\boldsymbol{\psi}_{0}^{(1)},\boldsymbol{\psi}_{0}^{(2)}\big)$ . Then we take the two functions $\textbf{h}^{(\textbf{1})},\textbf{h}^{(\textbf{2})}$ of Lemma 2, and for each $\ell$ and any $n\geq 0$ we recursively define
Setting $\widetilde{\mathcal{F}}_n =\sigma\big(\boldsymbol{\psi}_{0}^{(1)},\boldsymbol{\psi}_{0}^{(2)}, U_1,\ldots, U_n\big)$ , we have that $U_{n+1}$ is independent of $\widetilde{\mathcal{F}}_n$ and, by definition, $\widetilde{\boldsymbol{\xi}}_{n}^{(\ell)}$ and $\widetilde{\boldsymbol{\psi}}_{n}^{(\ell)}$ are $\widetilde{\mathcal{F}}_n$ -measurable for any $\ell=1,2$ and $n\geq 0$ . Therefore, using the relation (A.6), we get, for any $\ell=1,2$ , $n\geq 0$ , and $i=1,\ldots,k$ ,
This means that (A.11), together with (A.12), holds true. Finally, by the relation (A.8), we have
hence, subtracting (A.11) with $\ell=2$ from the same relation with $\ell=1$ , we obtain
and so the inequality (A.13) holds true.
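Since the dynamics displays are not reproduced above, the following purely illustrative sketch (our own, reusing h_pair from the sketch after Lemma 2, which is assumed to be in scope) couples two chains of a hypothetical recursion of the same flavor through common uniforms $U_{n+1}$ , as in the proof; in expectation the $\ell_1$ distance between the coupled chains contracts by a factor $(1-a_0r)$ at each step:

```python
import numpy as np

# Illustrative coupling in the spirit of the proof (hypothetical toy
# recursion, not the theorem's dynamics): both chains share the uniform
# U_{n+1}, and the coupled draws xi1, xi2 come from h_pair of Lemma 2.
rng = np.random.default_rng(5)
k, a0, a1, r = 3, 0.4, 0.6, 0.1
p0 = np.array([0.5, 0.3, 0.2])
E = np.eye(k)
psi1, psi2 = E[0].copy(), E[2].copy()    # two different initial states
for n in range(2001):
    if n % 500 == 0:
        print(f"n = {n:4d}   |psi1 - psi2|_1 = {np.abs(psi1 - psi2).sum():.4f}")
    u = rng.random()
    xi1, xi2 = h_pair(a0 * p0 + a1 * psi1, a0 * p0 + a1 * psi2, u)
    psi1 = (1 - r) * psi1 + r * xi1
    psi2 = (1 - r) * psi2 + r * xi2
```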
Declaration
All the authors developed the theoretical results, performed the numerical simulations, and contributed to the final version of the manuscript.
Acknowledgements
Giacomo Aletti is a member of the Gruppo Nazionale per il Calcolo Scientifico, of the Istituto Nazionale di Alta Matematica (Italy). Irene Crimaldi is a member of the Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni, of the Istituto Nazionale di Alta Matematica.
Funding information
Irene Crimaldi is partially supported by the Programma di Attività Integrata (PAI), under the project ‘TOol for Fighting FakEs’ (TOFFE) funded by the IMT School for Advanced Studies Lucca.
Competing interests
The authors declare that no competing interests arose during the preparation or publication of this article.