SIZE-BIASED PERMUTATION OF DIRICHLET PARTITIONS AND SEARCH-COST DISTRIBUTION
Published online by Cambridge University Press: 01 January 2005
Abstract
Consider the random Dirichlet partition of the interval into n fragments at temperature θ > 0. Explicit results on the law of its size-biased permutation are first supplied. Using these, new results on the comparative search cost distributions from Dirichlet partition and from its size-biased permutation are obtained.
- Type
- Research Article
- Information
- Probability in the Engineering and Informational Sciences, Volume 19, Issue 1, January 2005, pp. 83-97
- Copyright
- © 2005 Cambridge University Press
1. INTRODUCTION AND DESCRIPTION OF MAIN RESULTS
Basic facts on the random Dirichlet partition of the interval into n fragments at temperature θ > 0 are first recalled in Section 2. In Section 3, explicit results on the law of its size-biased permutation are supplied. A size-biased permutation of the fragments' sizes is the one obtained by size-biased sampling without replacement from a Dirichlet partition. The main points that we develop are the following: In Proposition 1, it is recalled that the length of an interval containing a random sample is stochastically larger than the typical fragment size from a Dirichlet distribution. Its law is computed, and the stochastic domination result is made more explicit in Corollary 2. In Theorem 3, the law of the length of the kth fragment in the size-biased permutation is supplied. It is also shown there that the consecutive fragments in the size-biased permutation are arranged in stochastically descending order. In Corollary 4, the expected length of the kth fragment in the size-biased permutation is supplied. In Theorem 5, we give the joint law of the size-biased permutation's fragment sizes explicitly (or, rather, its joint moment function).
Size-biased permutations of random discrete distributions are known to be the random equilibrium distributions of the heaps process consisting of moving sequentially sampled fragments to the front, starting from the original partition. Using the computations from Section 3, new results on the comparative search-cost distributions from a Dirichlet partition and from its size-biased permutation are obtained in Section 4. The search cost of an item in a library is the number of items above it in the heap; averaging over the items gives the search cost of a typical item. The search cost when the library has reached the equilibrium state is expected to be smaller than the search cost in the original Dirichlet partition itself. The results that we describe confirm this intuition. In Proposition 6, the limiting search cost per item in a Dirichlet partition is first shown to be uniformly distributed. In Lemma 7, we compute explicitly the law of the search cost in the size-biased permutation of a Dirichlet partition, using Corollary 4. First and second moments are also obtained, differently from the techniques usually employed to do so. In Theorem 8, the limiting search cost per item in a size-biased permutation of the Dirichlet partition is shown to be beta(1,1 + 1/θ) distributed. Finally, in Proposition 9, considering the asymptotic regime introduced by Kingman, n ↑ ∞, θ ↓ 0, nθ = γ > 0, we find the limiting size-biased permutation search cost to be geometrically distributed.
2. PRELIMINARIES: THE DIRICHLET DISTRIBUTION Dn(θ)
We will consider the following random partition into n fragments of the unit interval: Let θ > 0 be some parameter that we will interpret as the temperature or disorder of the partition. Assume that the random fragments' sizes Sn := (S1,…,Sn) (with Σ_{m=1}^{n} S_m = 1) are distributed according to the (exchangeable) Dirichlet Dn(θ) density function on the simplex; that is to say,

f_{S1,…,Sn}(s1,…,sn) = [Γ(nθ)/Γ(θ)^n] ∏_{m=1}^{n} s_m^{θ−1}, s_m ≥ 0, Σ_{m=1}^{n} s_m = 1. (1)
Alternatively, the law of Sn := (S1,…,Sn) is characterized by its joint moment function

E[∏_{m=1}^{n} S_m^{q_m}] = [Γ(nθ)/Γ(nθ + Σ_{m=1}^{n} q_m)] ∏_{m=1}^{n} [Γ(θ + q_m)/Γ(θ)]. (2)
In this case,

E[S_m^q] = [Γ(nθ)Γ(θ + q)]/[Γ(θ)Γ(nθ + q)], (3)

independently of m, and the individual fragment sizes are all identically distributed (i.d.). Their common density on the interval (0,1) is given by

f_{S_n}(s) = [Γ(nθ)/(Γ(θ)Γ((n − 1)θ))] s^{θ−1}(1 − s)^{(n−1)θ−1},

which is a beta(θ,(n − 1)θ) density, with mean value E(Sn) = 1/n and variance σ²(Sn) = (n − 1)/[n²(nθ + 1)].
We recall that a random variable, say B_{a,b}, with B_{a,b} ∼ beta(a,b), a,b > 0, has density function f_{B_{a,b}}(x) := [Γ(a + b)/Γ(a)Γ(b)] x^{a−1}(1 − x)^{b−1}, x ∈ [0,1], and moment function E[B_{a,b}^q] = [Γ(a + q)Γ(a + b)]/[Γ(a)Γ(a + b + q)], with Γ(·) Euler's gamma function.
We also recall that when θ = 1, the partition model [Eqs. (1) and (2)] corresponds to the standard uniform partition model of the interval.
From Eq. (3), as n ↑ ∞, we next have

E[(nS_n)^q] → Γ(θ + q)/[Γ(θ)θ^q]; that is, nS_n →d θ^{−1}G_θ, with G_θ a gamma(θ)-distributed random variable,

showing that the sizes of the fragments are asymptotically all of order 1/n.
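The moments just recalled can be checked numerically. The following minimal sketch (parameter values n = 10 and θ = 0.5 are illustrative choices, not from the paper; numpy is assumed available) samples Dn(θ) by normalizing n i.i.d. gamma(θ) variables and compares the empirical mean and variance of a fragment with E(Sn) = 1/n and σ²(Sn) = (n − 1)/[n²(nθ + 1)].

```python
import numpy as np

# Monte Carlo sketch of the Dirichlet D_n(theta) partition:
# normalize n i.i.d. gamma(theta) variables to get one partition per row.
rng = np.random.default_rng(0)
n, theta, trials = 10, 0.5, 200_000

g = rng.gamma(theta, size=(trials, n))
S = g / g.sum(axis=1, keepdims=True)       # rows are D_n(theta) partitions

mean_hat = S[:, 0].mean()                  # fragments are exchangeable
var_hat = S[:, 0].var()
var_exact = (n - 1) / (n**2 * (n * theta + 1))

print(mean_hat, 1 / n)
print(var_hat, var_exact)
```

The empirical values should match E(Sn) = 0.1 and σ²(Sn) = 0.015 up to Monte Carlo error.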
Consider next the sequence S(n) := (S(m); m = 1,…,n) obtained while ranking the fragment sizes Sn according to descending sizes, hence with S(1) > ··· > S(m) > ··· > S(n). The S(m) distribution can hardly be derived in closed form. However, one can show that as n ↑ ∞,

n^{(θ+1)/θ} S(n) →d Wθ and nθ S(1) − log(n(log n)^{θ−1}) →d Gθ,

where Wθ is a Weibull random variable and Gθ is a Gumbel random variable such that P(Wθ > t) = exp[−t^θ/sθ], t > 0, and

is a scale parameter.
In the random division of the interval as in Eq. (1) at disorder θ, although all fragments are identically distributed with sizes of order n^{−1}, the smallest fragment's size decays like n^{−(θ+1)/θ} and the largest is of order (1/nθ)log(n(log n)^{θ−1}). The smaller θ is, the larger the largest fragment and the smaller the smallest one; hence, the smaller the disorder θ, the more disparate, with high probability, the values of the Sm. At low disorder, the size of the largest fragment S(1) tends to dominate the other ones and the range S(1) − S(n) increases when θ decreases.
On the contrary, large values of θ correspond to situations in which the range of the fragments' sizes is smaller: the fragments' sizes look more homogeneous, and the distribution in Eq. (1) concentrates on its center. At high disorder, the diversity of the partition is large.
3. SAMPLING WITHOUT REPLACEMENT AND SIZE-BIASED PERMUTATION OF THE FRAGMENTS
Assume some observer samples such an interval as follows: Drop points at random onto this randomly broken interval and record the numbers of the visited fragments. Consider the problem of determining the order in which the various fragments are discovered in such a sampling process. To avoid revisiting the same fragment many times once it has been discovered, we need to remove it from the population as soon as it has been met in the sampling process. However, to do that, an estimate of its size is needed. We first do that for the first visited fragment. Once this is done, after renormalizing the remaining fragments' sizes, we are left with a population of n − 1 fragments, the sampling of which will necessarily supply a so-far-undiscovered fragment. Its size can be estimated, and so forth, renormalizing again, until the whole available fragments' population has been visited. In this way, not only the visiting order of the different fragments but also their sizes can be understood. The purpose of this section is to describe the statistical structure of the size-biased permutation of the fragments' sizes, as obtained while avoiding the fragments previously encountered in a sampling process.
Let Sn := (S1,…,Sn) be the random partition of the interval [0,1] considered here, with Sn ∼ Dn(θ).
Let U be a uniformly distributed random throw on [0,1] and let

L := S_M

be the length of the interval of the random partition containing U, M being the number of this interval. The distribution of L is characterized by the conditional probability

P(L = S_m|Sn) = P(M = m|Sn) = S_m, m = 1,…,n.
In this size-biased picking procedure, long intervals are favored and one expects that L ⪰ Sn in the usual stochastic sense that P(L > s) ≥ P(Sn > s) for all s ∈ [0,1].
Let us first check that the size of the interval containing U is stochastically larger than the typical fragment's length of the original partition.
3.1. Length of the First Size-Biased Sample
From the size-biased picking construction, it follows (see, e.g., [6]) that for all nonnegative measurable functions φ on [0,1],

E[φ(L)/L] = E[Σ_{m=1}^{n} φ(S_m)] = nE[φ(S_n)]. (7)

Taking in particular φ(x) = xI(x > s) in Eq. (7), we get

P(L > s) = nE[S_n I(S_n > s)],

which, since E(S_n) = 1/n, is

P(L > s) = E[S_n I(S_n > s)]/E(S_n).
Proposition 1: L ∼ beta(1 + θ,(n − 1)θ) and it holds that

L ⪰ S_n.
Proof: The condition P(L > s) ≥ P(S_n > s) holds for all s in [0,1] because this is equivalent to

E[S_n I(S_n > s)]/P(S_n > s) ≥ E(S_n),

which is always true because the left-hand side is the conditional expectation of Sn given Sn > s, certainly larger than E(Sn) itself. Because E[L^q] = nE[S_n^{q+1}], one can check directly that L ∼ beta(1 + θ,(n − 1)θ), with

E[L^q] = [Γ(1 + nθ)Γ(1 + θ + q)]/[Γ(1 + θ)Γ(1 + nθ + q)]. █
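A minimal Monte Carlo sketch of this size-biased pick (illustrative parameters; the gamma-normalization sampler for Dn(θ) is an assumption of the sketch, not part of the statement): the interval containing a uniform throw has mean length (1 + θ)/(1 + nθ), the mean of a beta(1 + θ,(n − 1)θ) variable, and is on average larger than a typical fragment.

```python
import numpy as np

# Size-biased pick: find the fragment containing a uniform throw U and
# compare its length L with a typical fragment size S_n.
rng = np.random.default_rng(1)
n, theta, trials = 10, 0.5, 100_000

g = rng.gamma(theta, size=(trials, n))
S = g / g.sum(axis=1, keepdims=True)
edges = np.cumsum(S, axis=1)               # right endpoints of the fragments
U = rng.uniform(size=trials)
idx = (U[:, None] < edges).argmax(axis=1)  # index of the fragment containing U
L = S[np.arange(trials), idx]

EL_exact = (1 + theta) / (1 + n * theta)   # mean of beta(1+theta,(n-1)theta)
print(L.mean(), EL_exact)                  # size-biased pick is larger...
print(S[:, 0].mean(), 1 / n)               # ...than the typical fragment
```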
This apparent paradox (discussed for θ = 1 in Feller [8, pp. 22-23] and subsequently worked out in Hawkes [12, pp. 294-295]) may be understood by observing that in the size-biased picking procedure, long intervals are favored. It is the interval version of the standard waiting-time paradox on the half-line. As a corollary, the following decomposition holds.
Corollary 2: Let Bn be a Bernoulli random variable with parameter 1/n and L′ ∼ beta(1 + θ,(n − 1)θ), a copy of L, independent of Bn. Define a [0,1]-valued random variable Rn with distribution

Rn ∼ beta(θ,(n − 1)θ + 1).

Then, the following decomposition holds:

S_n =d Bn L′ + (1 − Bn)Rn,

where (Bn, L′, Rn) are independent.
Proof: Since P(Bn = 1) = 1/n, we have

E[(Bn L′ + (1 − Bn)Rn)^q] = (1/n)E[L^q] + (1 − 1/n)E[Rn^q].

Taking φ(x) = x^{q+1} in Eq. (7), the moment function of L reads (q > −(1 + θ))

E[L^q] = nE[S_n^{q+1}] = [Γ(1 + nθ)Γ(1 + θ + q)]/[Γ(1 + θ)Γ(1 + nθ + q)],

recalling that E[S_n^q] = [Γ(nθ)Γ(θ + q)]/[Γ(θ)Γ(nθ + q)] is the common moment function of the Sm, m = 1,…,n, with E(Sn) = 1/n. So,

(1/n)E[L^q] + (1 − 1/n)E[Rn^q] = [Γ(nθ)Γ(θ + q)]/[Γ(θ)Γ(nθ + q)] = E[S_n^q]. █
3.2. Size-Biased Permutation of the Fragments
Consider the random partition Sn. Let L1 := L be the length of the first size-biased fragment pick, whose number is M1 := M, so that L1 := S_{M1} and P(M1 = m1|Sn) = S_{m1}. A standard problem is to iterate the size-biased picking procedure, avoiding the fragments already encountered: By doing so, a size-biased permutation (SBP) of the fragments is obtained. We study this process here in some detail.
In the first step of this size-biased picking procedure,

Sn = (S1,…,Sn) → (S_{M1}; S_m, m ≠ M1),

which can be written as Sn → (L1,(1 − L1)S_{n−1}^{(1)}), with

S_{n−1}^{(1)} := (S_m/(1 − L1); m ≠ M1)

a new random partition of the unit interval into n − 1 random fragments.
Given (M1, L1), the conditional joint distribution of the remaining components of Sn is the same as that of (1 − L1)S_{n−1}^{(1)}, where the (n − 1)-vector S_{n−1}^{(1)} ∼ D_{n−1}(θ) has the distribution of a Dirichlet random partition into n − 1 fragments. Next, pick at random an interval in S_{n−1}^{(1)} and call V2 its length, now with distribution beta(1 + θ,(n − 2)θ), and iterate until all fragments have been exhausted.
With V1 := L1, the length of the second fragment by avoiding the first reads L2 = (1 − V1)V2. Iterating, the final size-biased permutation (SBP) of Sn is Ln := (L1,…,Ln). We will set Ln = SBP(Sn).
From this construction, if (V1,…,Vn−1) is an independent sample with distribution

V_k ∼ beta(1 + θ,(n − k)θ), k = 1,…,n − 1, (12)

then

L_k = V_k ∏_{j=1}^{k−1} (1 − V_j), k = 1,…,n, (13)

is the stick-breaking scheme construction of the size-biased permutation of Sn. Note that Σ_{k=1}^{n} L_k = 1 and that V_n should be set to one. From this well-known construction and properties (see Kingman [16, Chap. 9, 9.6], Patil and Taillie [17], and Donnelly [4]), we obtain that the Lk's, k = 1,…,n, are arranged in stochastically decreasing order. More precisely, we have the following:
Theorem 3:
(i) The law of L_k, for k = 1,…,n, is characterized by its moment function

E[L_k^q] = [Γ(1 + θ + q)Γ(1 + (n − k + 1)θ)]/[Γ(1 + θ)Γ(1 + (n − k + 1)θ + q)] ∏_{j=1}^{k−1} [Γ((n − j)θ + q)Γ(1 + (n − j + 1)θ)]/[Γ((n − j)θ)Γ(1 + (n − j + 1)θ + q)]. (14)

(ii) Let B_{(n−k+1)θ,1} ∼ beta((n − k + 1)θ,1). Then,

L_k =d B_{(n−k+1)θ,1} · L_{k−1},

where pairs B_{(n−k+1)θ,1} and L_{k−1} are mutually independent for k = 2,…,n.
(iii) L_1 ⪰ L_2 ⪰ ··· ⪰ L_n, in the stochastic order.
Proof:
Part (i) is a direct consequence of the construction, since

V_k, 1 − V_1,…,1 − V_{k−1}

are mutually independent. Recalling the expression of the moment function for beta distributions, the corresponding expression of E[L_k^q] follows.
Part (iii) being clearly a consequence of (ii), it remains to prove (ii).
Regrouping terms directly from Eq. (14), we have E[L_k^q] = E[L_{k−1}^q]E[B_k^q], with

E[B_k^q] = (n − k + 1)θ/[(n − k + 1)θ + q].

This is the moment function of a beta((n − k + 1)θ,1)-distributed random variable. █
Result (ii) is also in Collet, Huillet, and Martinez [3], with a slightly different proof.
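The stick-breaking scheme of Eqs. (12) and (13) can be checked against direct size-biased sampling without replacement from a Dirichlet partition. The sketch below (illustrative parameters n = 6, θ = 0.7; the samplers are assumptions of the sketch) compares the mean fragment-length profiles E(L1),…,E(Ln) obtained by the two constructions.

```python
import numpy as np

# (a) stick-breaking: L_k = V_k * prod_{j<k}(1 - V_j),
#     V_k ~ beta(1+theta, (n-k)theta), V_n = 1.
rng = np.random.default_rng(2)
n, theta, trials = 6, 0.7, 20_000

V = np.ones((trials, n))
for k in range(n - 1):
    V[:, k] = rng.beta(1 + theta, (n - 1 - k) * theta, size=trials)
stick = np.ones(trials)
L_sb = np.empty((trials, n))
for k in range(n):
    L_sb[:, k] = V[:, k] * stick
    stick *= 1 - V[:, k]

# (b) direct size-biased permutation of a Dirichlet partition:
#     sample fragments proportionally to size, without replacement.
g = rng.gamma(theta, size=(trials, n))
S = g / g.sum(axis=1, keepdims=True)
L_sbp = np.empty_like(S)
for t in range(trials):
    p = S[t].copy()
    for k in range(n):
        m = rng.choice(n, p=p / p.sum())
        L_sbp[t, k] = S[t, m]
        p[m] = 0.0

print(L_sb.mean(axis=0))
print(L_sbp.mean(axis=0))   # the two mean profiles should agree
```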
Corollary 4: With β := 1/θ, we have

E(L_k) = (β + 1)[Γ(n)Γ(β + n − k + 1)]/[Γ(n − k + 1)Γ(β + n + 1)], k = 1,…,n,

with Σ_{k=1}^{n} E(L_k) = 1.
Proof: Putting q = 1 in the expression of E[L_k^q] and if β = 1/θ, we get

E(L_k) = E(V_k) ∏_{j=1}^{k−1} E(1 − V_j) = [(β + 1)/(β + n − k + 1)] ∏_{j=1}^{k−1} (n − j)/(β + n − j + 1),

which is the displayed ratio of gamma functions. From normalization, it holds by construction that Σ_{k=1}^{n} E(L_k) = 1. █
Let us now compute the joint distribution of the size-biased permutation Ln of Sn. We will say in the sequel that, if Ln = SBP(Sn), then Ln ∼ SBDn(θ), assuming that Sn ∼ Dn(θ).
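The closed form of Corollary 4 is easy to evaluate with log-gamma functions. A minimal sketch (illustrative parameters; `E_L` is a hypothetical helper name), checking normalization, the stochastic descending order at the level of the means, and E(L1) = (1 + θ)/(1 + nθ):

```python
from math import exp, lgamma

# E(L_k) = (beta+1) * Gamma(n) * Gamma(beta+n-k+1)
#          / (Gamma(n-k+1) * Gamma(beta+n+1)),  beta = 1/theta.
def E_L(n, theta):
    b = 1.0 / theta
    return [
        (b + 1) * exp(lgamma(n) + lgamma(b + n - k + 1)
                      - lgamma(n - k + 1) - lgamma(b + n + 1))
        for k in range(1, n + 1)
    ]

n, theta = 8, 0.5
e = E_L(n, theta)
print(e)
print(sum(e))                                     # normalization: sums to 1
print(all(e[i] > e[i + 1] for i in range(n - 1)))  # True: descending means
```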
3.3. Joint Law of the SBP
Let us first discuss the visiting order of the fragments in the SBP process. For any permutation (m1,…,mn) of (1,…,n), with M1,…,Mk, k = 1,…,n, the first k distinct fragments' numbers that have been visited in the SBP sampling process, we have

P(M1 = m1,…,Mk = mk|Sn) = ∏_{j=1}^{k} S_{m_j}/(1 − S_{m_1} − ··· − S_{m_{j−1}}),

so that

P(Mk = m|Sn) = Σ_{(m_1,…,m_{k−1})} P(M1 = m1,…,M_{k−1} = m_{k−1}, Mk = m|Sn), (19)

the sum running over the ordered (k − 1)-tuples of distinct fragment numbers avoiding m. As a result,

P(Mk = m) = E[P(Mk = m|Sn)]

is the probability that the kth visited fragment is fragment number m from Dn(θ). If Km is the random position of fragment number m, we then clearly have

P(Km = k|Sn) = P(Mk = m|Sn), (20)

translating the fact that Km and Mk are inverses of one another, hence with K_{M_k} = k and M_{K_m} = m.
Let us now compute the joint distribution of the size-biased permutation Ln of Sn with Sn ∼ Dn(θ). First, we have

P(L1 = S_{m_1},…,Ln = S_{m_n}|Sn) = P(M1 = m1,…,Mn = mn|Sn) = ∏_{j=1}^{n} S_{m_j}/(1 − S_{m_1} − ··· − S_{m_{j−1}})

and, consequently, the conditional law of (L1,…,Ln) given Sn is the one of (S_{M_1},…,S_{M_n}) under these permutation probabilities, the average of which over Sn gives the joint law of Ln := (L1,…,Ln).
We will now consider the joint moment function of the random size-biased permutation Ln = (L1,…,Ln). Indeed, we observe from Eqs. (12) and (13) and the independence of the Vk's that

E[∏_{k=1}^{n} L_k^{q_k}] = ∏_{k=1}^{n−1} E[V_k^{q_k}(1 − V_k)^{q_{k+1}+···+q_n}],

with V_k ∼ beta(1 + θ,(n − k)θ).
Putting all of this together, we obtain the following result.
Theorem 5: The joint moment function of Ln = SBP(Sn) reads

E[∏_{k=1}^{n} L_k^{q_k}] = ∏_{k=1}^{n−1} [Γ(1 + θ + q_k)Γ((n − k)θ + Q_{k+1})Γ(1 + (n − k + 1)θ)]/[Γ(1 + θ)Γ((n − k)θ)Γ(1 + (n − k + 1)θ + Q_k)], (23)

where Q_k := q_k + ··· + q_n.
Proof: Let V ∼ beta(a,b). Then it holds that

E[V^p(1 − V)^q] = [Γ(a + p)Γ(b + q)Γ(a + b)]/[Γ(a)Γ(b)Γ(a + b + p + q)].

Adapting this computation, recalling that V_k ∼ beta(1 + θ,(n − k)θ), the quantity E[V_k^{q_k}(1 − V_k)^{Q_{k+1}}] has the expression displayed inside the product from Eq. (23). █
Remark: Letting q_k = q/n, k = 1,…,n, in Eq. (23), the moment function of the geometric average of Ln, which is (∏_{k=1}^{n} L_k)^{1/n}, follows.
4. COMPARATIVE SEARCH COST IN DIRICHLET PARTITION AND IN THE SIZE-BIASED PERMUTATION OF IT
We now show how these results can be used when considering an old problem from applied probability.
A collection of n books with random popularities Sm, m = 1,…,n, is arranged on a shelf. (If, instead of a collection of books, a population of n species were considered, popularities would translate verbatim into species abundances; see Kingman [15] and Ewens [7] for such interpretations.)
Books' popularities are assumed to satisfy Sn ∼ Dn(θ). When a book is demanded, it is removed and replaced (before the next demand) at the top of the shelf, the other books being shifted accordingly; successive demands are independent. Iterating this heaps process (a positive recurrent Markov chain over the set of permutations), there is intuitively a tendency, once the system has reached equilibrium, to find the more popular books at the top of the heap. At equilibrium, indeed (see Donnelly [5] and references therein to the works of Dies, Hendricks, and Letac), the books' popularities are given by

Ln = SBP(Sn),

and result (iii) in Theorem 3, stating that L1 ⪰ L2 ⪰ ··· ⪰ Ln, confirms and gives some flesh to this intuition. Note from this that Ln = SBP(Ln) (Ln is invariant under size-biased permutation) and that Ln = SBP(S(n)), since S(n) is simply obtained from Sn while rearranging its components in descending order.
Next, define the search cost of an item in a library to be the number of items above it in the heap; a weighted sum over the items yields the search cost of a typical item. The search cost when the library has reached the equilibrium state Ln is, of course, expected to be smaller than the search cost in Sn itself. We would like to revisit these old questions in the light of our preceding results on Ln = SBP(Sn) when Sn ∼ Dn(θ).
4.1. Search Cost in Sn


We start with computing the search cost Cn,S, assuming popularities to be Dirichlet distributed. Here, Cn,S is the discrete random variable taking the value m − 1 with probability E(Sm) = 1/n, m = 1,…,n. The moment generating function of Cn,S is expressed as

E[u^{Cn,S}] = (1/n) Σ_{m=1}^{n} u^{m−1} = (1 − u^n)/[n(1 − u)].

As a result, E(Cn,S) = (n − 1)/2, E(Cn,S²) = (n − 1)(2n − 1)/6, and σ²(Cn,S) = (n − 1)(n + 1)/12, and we have the following proposition.
Proposition 6: With U a uniformly distributed random variable on (0,1), it holds that

Cn,S/n →d U as n ↑ ∞.

Proof: From the expression of the moment generating function of Cn,S, putting u = e^{−λ/n}, λ > 0, we have

E[e^{−λCn,S/n}] = (1 − e^{−λ})/[n(1 − e^{−λ/n})] → (1 − e^{−λ})/λ as n ↑ ∞,

which is the Laplace-Stieltjes transform of a uniformly distributed random variable U on (0,1), with mean value ½. Although in a different (deterministic) partition context, a similar result can be found in Fill [9, Thm. 4.2, p. 198]. █
4.2. Search Cost in Ln
From its definition, the search cost in Ln is the mixture Cn,L = K_M − 1 (the number of fragments above fragment M in the list). Consequently, given Sn, Cn,L will take the value Km − 1 with probability Sm. Its conditional distribution is

P(Cn,L = k − 1|Sn) = Σ_{m=1}^{n} S_m P(Km = k|Sn), k = 1,…,n,

where P(Km = k|Sn) is given by Eqs. (19) and (20). Let us first recall some well-known results on the conditional search cost in Ln, as a functional of Sn.
When P(Km = k|Sn) takes the more usual form

with S_J := Σ_{j∈J} S_j, the average position of original fragment m in the limiting partition SBDn(θ) is known to be

E(Km|Sn) = 1 + Σ_{j≠m} S_j/(S_j + S_m), (28)

so that the expected search cost in a SBDn(θ) partition is

E(Cn,L|Sn) = Σ_{m=1}^{n} S_m Σ_{j≠m} S_j/(S_j + S_m) = 2 Σ_{j<m} S_j S_m/(S_j + S_m). (29)
The results [Eqs. (27)–(29)] were obtained by Burville and Kingman [2]. They are valid for any random (or not) partition Sn. Using Poisson embedding techniques, Fill and Holst [10], following combinatorial results of Flajolet, Gardy, and Thimonier [11], also found the full conditional generating function of Cn,L under the form

See also Hildebrand [13] for related problems.
Averaging over Sn gives the search-cost distribution in the random partition case. This is not so simple a matter as it stands, since it involves complicated Dirichlet integrals in general. When Sn ∼ Dn(θ), averaging over Sn, Kingman [14] obtained E(Cn,L) = (n − 1)θ/(2θ + 1). Recently, Barrera and Paroissin [1, Thm. 1] gave the full generating function E(u^{Cn,L}) under the form of a not so explicit double integral, even in the particular Dirichlet example.
Using the results of the preceding section, we will now show that the full law of Cn,L can be computed more explicitly when Sn ∼ Dn(θ). Several conclusions can then be drawn in this particular case.
Stated differently, Cn,L takes the value k − 1 with probability L_k = S_{M_k}, k = 1,…,n, recalling that Mk and Km are inverses of one another. Averaging over Ln, Cn,L therefore takes the value k − 1 with probability E(L_k). As a result, with β := 1/θ the inverse of temperature (or disorder), we obtain the following lemma.
Lemma 7:
(i) The law of Cn,L is given by

P(Cn,L = k) = E(L_{k+1}) = (β + 1)[Γ(n)Γ(β + n − k)]/[Γ(n − k)Γ(β + n + 1)], k = 0,…,n − 1;

it is unimodal, with mode at k = 0.
(ii) The first and second moments are

E(Cn,L) = (n − 1)/(β + 2) = (n − 1)θ/(2θ + 1) and E(Cn,L²) = (n + β)² − (β + 1)(β + n + 1)[β² + (n + 3)β + 4n − 1]/[(β + 2)(β + 3)].
Proof:
(i) The first part is a consequence of the explicit expression of E(L_k), k = 1,…,n, appearing in Corollary 4. Note that P(Cn,L = 0) = (β + 1)/(β + n) = (1 + θ)/(1 + nθ), which is E(L1). Next, for each k = 0,…,n − 2, we have

P(Cn,L = k + 1)/P(Cn,L = k) = (n − k − 1)/(β + n − k − 1) < 1,

and the mode of this distribution is, thus, at k = 0.
(ii) Equivalently, the moment function of Cn,L is

E[Cn,L^q] = A^{−1} Σ_{k=0}^{n−1} k^q Γ(β + n − k)/Γ(n − k).

Putting A := Γ(β + n + 1)/[(β + 1)Γ(n)], the normalization Σ_{k=0}^{n−1} P(Cn,L = k) = 1 (with q = 0) reads

Σ_{k=0}^{n−1} Γ(β + n − k)/Γ(n − k) = Σ_{j=1}^{n} Γ(β + j)/Γ(j) = A.

Let B be obtained from A while substituting β + 1 to β and n − 1 to n. We get

Σ_{j=1}^{n−1} Γ(β + 1 + j)/Γ(j) = B = Γ(β + n + 1)/[(β + 2)Γ(n − 1)].

Putting q = 1 in E[Cn,L^q] and using these identities, we obtain

E(Cn,L) = n + β − (β + 1)(β + n + 1)/(β + 2) = (n − 1)/(β + 2) = (n − 1)θ/(2θ + 1).

Using similar arguments, the second-order moment (q = 2) can be computed. In more detail, if C is obtained from A while substituting β + 2 to β and n − 2 to n, we get

Σ_{j=1}^{n−2} Γ(β + 2 + j)/Γ(j) = C = Γ(β + n + 1)/[(β + 3)Γ(n − 2)].

Putting q = 2 in E[Cn,L^q], we obtain

E(Cn,L²) = (n + β)² − (β + 1)(β + n + 1)[β² + (n + 3)β + 4n − 1]/[(β + 2)(β + 3)].
The result on the mean value was obtained by Kingman [14], with different techniques. Note that E(Cn,S) ≥ E(Cn,L), as expected. The result on the variance (with a different proof) is also in Barrera and Paroissin's example 2 [1]. In fact, the full preasymptotic law of Cn,L is available from part (i), which seems to be new. █
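The preasymptotic law of Lemma 7(i) is straightforward to evaluate. A minimal sketch (illustrative parameters; `search_cost_law` is a hypothetical helper name), checking normalization, the mode at k = 0, and Kingman's mean E(Cn,L) = (n − 1)θ/(2θ + 1):

```python
from math import exp, lgamma

# P(C_{n,L} = k) = (beta+1) Gamma(n) Gamma(beta+n-k)
#                  / (Gamma(n-k) Gamma(beta+n+1)),  beta = 1/theta.
def search_cost_law(n, theta):
    b = 1.0 / theta
    return [
        (b + 1) * exp(lgamma(n) + lgamma(b + n - k)
                      - lgamma(n - k) - lgamma(b + n + 1))
        for k in range(n)
    ]

n, theta = 12, 0.8
p = search_cost_law(n, theta)
mean = sum(k * pk for k, pk in enumerate(p))
print(sum(p))                          # ~1 : normalization
print(p.index(max(p)))                 # 0 : mode at the front
print(mean, (n - 1) * theta / (2 * theta + 1))
```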
From our approach, we also obtain the asymptotic result.
Theorem 8: As n ↑ ∞,

Cn,L/n →d B_{1,1+β},

where B_{1,1+β} ∼ beta(1,1 + β), with β := 1/θ.
Proof: From the expression of the moment function of Cn,L, we have

E[(Cn,L/n)^q] = (β + 1)[Γ(n)/Γ(β + n + 1)] Σ_{k=0}^{n−1} (k/n)^q Γ(β + n − k)/Γ(n − k).

From Stirling's formula, with a > 0, Γ(a + z)/Γ(z) ∼ z^a as z ↑ ∞. For large n, the sum can thus be approximated by

E[(Cn,L/n)^q] ≈ (β + 1)n^{−(β+1)} Σ_{k=0}^{n−1} (k/n)^q (n − k)^β ≈ (β + 1) ∫₀¹ x^q(1 − x)^β dx.

This shows that for large n,

E[(Cn,L/n)^q] → Γ(q + 1)Γ(β + 2)/Γ(q + β + 2),

which is the moment function of a beta(1,1 + β) random variable with mean value 1/(2 + β) = θ/(2θ + 1) < ½. █
Remark: The limiting search cost per item in SBDn(θ) is distributed like B_{1,1+1/θ}, whereas in Dn(θ) its law is the one of a uniform random variable U on (0,1). Clearly, we have B_{1,1+1/θ} ⪯ U in the stochastic sense, expressing the fact that the search cost per item from the random partition Dn(θ) is asymptotically larger than from the SBDn(θ) one, which is more organized. This result refines the well-known one that E(Cn,S) ≥ E(Cn,L) for each n.
4.3. Search Cost in the Kingman Limit
Consider the situation where n ↑ ∞ and θ ↓ 0 while nθ = γ > 0. Such an asymptotic regime was first considered by Kingman. When k = o(n), recalling the expression of E(L_k) from Corollary 4, we have

E(L_k) → [γ/(1 + γ)]^{k−1}[1/(1 + γ)],

and the SBDn(θ) distribution converges weakly to a GEM(γ) distribution (GEM = Griffiths-Engen-McCloskey). Namely, Ln →d L* := (L_k*; k ≥ 1), where

L_k* = U_k ∏_{j=1}^{k−1} (1 − U_j), k ≥ 1.

Here, (U_k, k ≥ 1) are independent and identically distributed with law U_1 ∼ beta(1,γ). Note that Σ_{k≥1} L_k* = 1, almost surely, and that L* is invariant under size-biased permutation. In the Kingman limit, (S(m), m = 1,…,n) converges in law to a Poisson-Dirichlet distribution PD(γ), say (L(k)*; k ≥ 1), with L(1)* > ··· > L(k)* > ···. The size-biased permutation of a PD(γ)-distributed sequence is GEM(γ) distributed (see Kingman [16, Chap. 9], Tavaré and Ewens [18], and references to Pitman's work therein for a review). As a result, we have the following:
Proposition 9: As n ↑ ∞, θ ↓ 0 while nθ = γ > 0,

Cn,L →d CL*, with P(CL* = k) = [γ/(1 + γ)]^k[1/(1 + γ)], k ≥ 0.

Proof: In particular, we have E(L_k*) = [γ/(1 + γ)]^{k−1}[1/(1 + γ)]. The moment generating function of the search cost in the Kingman limit is thus

E[u^{CL*}] = Σ_{k≥1} u^{k−1}E(L_k*) = [1/(1 + γ)] Σ_{k≥0} [uγ/(1 + γ)]^k = 1/[1 + γ(1 − u)],

which is the one of a geometric distribution with mean γ.
Note that P(CL* = 0) = 1/(1 + γ), which is the average length of L1*. █
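The geometric limit can be seen numerically from the preasymptotic law of Lemma 7 by letting θ = γ/n with n large (illustrative parameters; `search_cost_law` is a hypothetical helper repeating Lemma 7(i)):

```python
from math import exp, lgamma

# P(C_{n,L} = k) from Lemma 7(i), with beta = 1/theta.
def search_cost_law(n, theta):
    b = 1.0 / theta
    return [(b + 1) * exp(lgamma(n) + lgamma(b + n - k)
                          - lgamma(n - k) - lgamma(b + n + 1))
            for k in range(n)]

gamma_, n = 2.0, 5000
p = search_cost_law(n, gamma_ / n)   # Kingman regime: theta = gamma/n
geo = [(gamma_ / (1 + gamma_)) ** k / (1 + gamma_) for k in range(10)]
print([round(x, 4) for x in p[:10]])
print([round(x, 4) for x in geo])    # the two rows should nearly coincide
```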