Published online by Cambridge University Press: 12 December 2005
This article deals with the estimation of rare event probabilities using fast simulation based on the splitting method. In this technique, sample paths are split into multiple copies at various stages of the simulation. Our aim is to optimize the algorithm and to obtain a precise confidence interval for the estimator using branching processes. The numerical results presented suggest that the method is reasonably efficient.
The analysis of rare events is of great importance in many fields because of the risk associated with such events. Their probabilities are often of order 10^−9 to 10^−12. There are several ways to study them: the first is statistical analysis, based on the standard extreme value distributions, but this requires a long observation period (see Aldous [1]); the second is modeling, which leads to estimating the rare event probability either by an analytical approach (see Sadowsky [10]) or by simulation.
In this article we focus on the simulation approach based on the Monte Carlo method. Nevertheless, a crude simulation is impractical for estimating such small probabilities: to estimate probabilities of order 10^−10 with acceptable confidence would require the simulation of at least 10^12 events (which corresponds to the occurrence of 100 rare events).
To overcome these limits, fast simulation techniques are applied. In particular, importance sampling (IS) is a refinement of Monte Carlo methods. The main idea of IS is to make the occurrence of the rare event more frequent. More precisely, IS consists of selecting a change of measure that minimizes the variance of the estimator. Using another method based on particle systems, Cerou, Del Moral, Legland, and Lezaud [3] gave theoretical results on the convergence of this kind of algorithm. In this article, we deal with the RESTART (REpetitive Simulation Trials After Reaching Thresholds) algorithm presented by Villen-Altamirano and Villen-Altamirano [11] and based on splitting. The basic idea of splitting is to partition the state space of the system into a series of nested subsets and to consider the rare event as the intersection of a nested sequence of events. When a given subset is entered by a sample trajectory, random retrials are generated from the initial state corresponding to the state of the system at the entry point. Thus, the system trajectory is split into a number of new subtrajectories. However, the analysis of the RESTART model presents numerous difficulties because of the lack of hypotheses and the complexity of the formulas.
In this article we build a simple model of splitting for which we are able to derive precise conclusions. It is based on the same idea: before entering the rare event A, there exist intermediate states visited more often than A by the trajectory: A = B_{M+1} ⊂ B_M ⊂ ··· ⊂ B_1. Let
. The fact that a sample trajectory enters B_i is represented by a Bernoulli trial. Every time a sample trajectory enters a subset B_i, i = 1,…,M, it is divided into a number R_i of subtrajectories starting from level i. More precisely, we generate N random variables with common Bernoulli law Ber(P_1) and check whether the subset B_1 is reached. If so, we duplicate the trials into R_1 retrials of Ber(P_2) and check whether the subset B_2 is reached. Thus,
and an unbiased estimator of P is
where
are independent and identically distributed (i.i.d.), N_A is the number of trials that reach A during the simulation, and N is the number of particles initially generated. An optimal algorithm is chosen via the minimization of the variance of
for a given budget. For this, we have to describe the cost of a given simulation: Each time a particle is launched, it generates an average cost that is assumed here to be a function h of the transition probability. Therefore, the (average) cost is
where r_i = R_1···R_i, i = 1,…,M, r_0 = 1, and P_{i|0} = P_1···P_i, i = 1,…,M + 1, with P_{0|0} = 1. Then the optimal algorithm is described by
and M is given by M = [ln P/y_0] − 1 or M = [ln P/y_0], where y_0 is the solution of Eq. (30). The optimal sampling numbers are independent of the budget, which only determines the optimal number of independent particles generated initially. In the special case of h = 1,
Thus, the optimal sampling number and the optimal transition probabilities are independent of the rare event probability. For example, if P = 10^−12 and C = 10^3, then M = 16, P_i ≈ 0.2, R_i = 5, and N = 59.
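These optimal values for h = 1 can be reproduced numerically. The sketch below (in Python; the helper name `optimal_params` is ours) solves y = 2(e^y − 1) by bisection, picks M among the two integer candidates around ln P/y_0 via the cost criterion, here taken as (M + 1)²(P^{−1/(M+1)} − 1), a form consistent with Example 3.2, and then sets P_i = P^{1/(M+1)}, R ≈ 1/P_i, and N ≈ C/(M + 1):

```python
import math

def optimal_params(P, C):
    """Optimal splitting parameters in the unit-cost case h = 1 (sketch)."""
    # Solve y = 2(e^y - 1) for its negative root by bisection.
    g = lambda y: y - 2.0 * (math.exp(y) - 1.0)
    lo, hi = -10.0, -1e-9           # g(lo) < 0 < g(hi), one sign change
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    y0 = 0.5 * (lo + hi)            # ~ -1.5936

    # Candidate numbers of thresholds around ln P / y0; pick the cheaper one.
    n = math.log(P) / y0
    cost = lambda M: (M + 1) ** 2 * (P ** (-1.0 / (M + 1)) - 1.0)
    M = min((math.floor(n) - 1, math.floor(n)), key=cost)

    Pi = P ** (1.0 / (M + 1))       # common transition probability
    R = round(1.0 / Pi)             # retrials per threshold
    N = round(C / (M + 1))          # initial particles allowed by budget C
    return y0, M, Pi, R, N
```

For P = 10^−12 and C = 10^3 this recovers M = 16, P_i ≈ 0.2, R = 5, and N = 59, as stated above.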
Example 1.1: To analyze the behavior of the different implementations described earlier, we perform a simulation experiment using these methods. We consider a queuing network and want to estimate the occupancy of the finite-buffer queuing system M/M/1/C_0. The results are presented in Figure 1. As expected, since we proceed for a given cost C (C = 10^4), the crude simulation stops after a few iterations, because the number of samples run at the beginning is insufficient. Note, however, that the splitting simulation and the theoretical analysis give very close results.
Example 1.2: This model can be applied to approximate counting (see Jerrum and Sinclair [7] and Diaconis and Holmes [5]). Given a positive real vector a = (a_i)_{i=1}^n and a real number b, we want to estimate the number of 0–1 vectors x = (x_i)_{i=1}^n such that
For more details, see Section 3.2.
Remark 1.1: Hereafter we will take all the R_i equal to R and all the P_i equal to P_0 = 1/R. Thus, RP_0 = 1.
The aim of the article is to give a precise confidence interval of
. The bound involving the variance of
and given by the Markov inequality is not precise enough. Therefore, as done in the theory of large deviations, we introduce the Laplace transform of
, which can be rewritten as
, where fM is the Mth functional iterate of a Bin(R,P0) generating function (g.f.). The elementary theory of branching processes leads to precise bounds of fM and to a precise confidence interval that we can compare to the confidence interval if we only use the variance. For example, for P = 10−9, C = 108, and α = 0.02, the variance gives a bound about 10−2 and the Laplace transform gives a bound approximating 10−12.
The article is organized as follows. Section 2 describes the importance splitting model, presents our model and goals (the analysis of the behavior of the probability P of a rare event), and introduces an estimator
. Section 3 is dedicated to the optimization of the algorithm. In Section 4 we obtain a precise confidence interval of the estimator via branching processes. Finally, in Section 5 we conclude and discuss the merits of this approach and potential directions for further researches.
Our goal is to estimate the probability of a rare event A corresponding, for example, to a process X(t) hitting a certain level L. The main hypothesis is that before entering the target event, there exist intermediate states visited more frequently than A by the trajectory. Thus, define a sequence of sets of states B_i such that A = B_{M+1} ⊂ B_M ⊂ ··· ⊂ B_1, which determines a partition of the state space into regions B_i \ B_{i+1} called the importance regions (as represented in Figure 2). In general, these sets are defined through a function Φ, called the importance function, from the state space to
such that for all i, B_i = {Φ ≥ T_i} for some values T_i called thresholds, with T_1 ≤ T_2 ≤ ··· ≤ T_M ≤ L.
In this model a more frequent occurrence of the rare event is achieved by performing a number of simulation retrials when the process enters regions where the chance of occurrence of the rare event is higher. The fundamental idea consists of generating N Bernoulli Ber(P_1) trials and checking whether the subset B_1 is reached. If so, we duplicate the trials into R_1 retrials of Ber(P_2) and check whether the subset B_2 is reached. If none of the higher levels is reached, the simulation stops.
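The generate-and-split mechanism just described can be sketched in a few lines (Python; the function name and interface are ours). Each level-i success spawns R_i retrials, and the final count of hits, renormalized by N r_M, estimates P:

```python
import random

def splitting_estimate(N, probs, retrials, rng=random):
    """One run of the Bernoulli splitting model (sketch).

    probs    = [P_1, ..., P_{M+1}]  transition probabilities
    retrials = [R_1, ..., R_M]      splitting factors
    """
    trials = N                            # Bernoulli(P_1) trials at the start
    for i, p in enumerate(probs):
        successes = sum(rng.random() < p for _ in range(trials))
        if i < len(retrials):
            trials = successes * retrials[i]   # split each success into R_i retrials
        else:
            NA = successes                     # hits of the rare event A
    r = 1
    for R in retrials:
        r *= R                            # r_M = R_1 * ... * R_M
    return NA / (N * r)                   # unbiased estimator of P_1 * ... * P_{M+1}
```

Averaging many independent runs recovers P = P_1 ··· P_{M+1}, in line with the unbiasedness shown in Section 3.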
Thus, by the Bayes formula,
Then P is the product of M + 1 quantities (conditional probabilities) that are easier to estimate, and with more accuracy, than the probability P of the rare event itself, for a given simulation effort.
The estimator
defined in (2) can be rewritten as
where 1_{i_0 i_1···i_j} represents the result of the jth trial. In that case,
Moreover, we define
as the probability of reaching A, and we suppose that the process forgets the past after reaching a level; this holds as soon as the process is Markov.
First, note that
is unbiased since
As done in [11], the variance of the estimator
is derived by induction and the variance for k thresholds is given by
where
represents the estimator of P in a simulation with k thresholds.
Clearly, the formula holds in straightforward simulation (i.e., when k = 0), since
is a renormalized sum of i.i.d. Bernoulli variables with parameter P.
To go from k to k + 1, assume (12); we then have to prove that the formula holds for k + 1 thresholds. First, note that for all random variables X and Y that are independent given the set B, with X σ(B)-measurable, we have
Now let
The random variables X_{i_0} are i.i.d. with common law Ber(P_1), and conditionally on the event B_1, X_{i_0} and Z_{i_0} are independent. Note that each Z_{i_0} is the estimator of P in a model with k thresholds, T_2 to T_{k+1}, for the trajectory issued from the success of X_{i_0}. Thus,
and by the induction hypothesis,
So applying (13) with X ∼ Ber(P_1) and Z ∼ Z_{i_0}, we have
Thus, for M thresholds,
Remark 3.1: The induction principle has a concrete interpretation: if in a simulation with M steps, the retrials generated at the first level are not taken into account except for one, which we call the main trial, we obtain a simulation with M − 1 steps.
As stated in Section 1, our aim is to minimize the variance for a fixed budget, giving optimal values for N, R_1,…,R_M, P_1,…,P_{M+1}, and M. Therefore, we have to describe the cost of a given simulation: each time a particle is launched, it generates an average cost, assumed here to be a function h of the transition probability. We assume the following:
The (average) cost is then
where N_i is the number of trials that have reached threshold i. Finally,
Example 3.1: We want to study the model of the simple random walk on
starting from zero, which we kill as soon as it reaches the level −1 or k (success if we reach k, failure otherwise).
So let X_n be such that
, where {Y_n} is a sequence of random variables valued in {−1, 1} with
and define T_k = inf{n ≥ 0 : X_n = −1 or k}.
One can easily check that X_n and X_n² − n are martingales. By Doob's stopping theorem,
, which yields
(i.e., the cost needed to reach the next level is (1/p) − 1 if p is the success probability).
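Under the symmetric ±1 step assumption, these martingale identities give P(hit k before −1) = 1/(k + 1) and E[T_k] = k = (1/p) − 1, which a short simulation confirms (a Python sketch; the function name is ours):

```python
import random

def hit_or_ruin(k, rng):
    """Run the symmetric walk from 0 until it hits -1 or k.

    Returns (reached_k, number_of_steps)."""
    x, steps = 0, 0
    while -1 < x < k:
        x += 1 if rng.random() < 0.5 else -1
        steps += 1
    return x == k, steps
```

Averaging over many runs, the empirical success frequency approaches 1/(k + 1) and the mean stopping time approaches k, matching Doob's identities above.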
To minimize the variance of
, the optimal values are derived in three steps:
Optimal values for N, R_1,…,R_M. Using the method of Lagrange multipliers, we get
Optimal values for P_1,…,P_{M+1}. Thus, the variance becomes
Proceeding as previously under the constraint P = P_1···P_{M+1}, we obtain that all of the P_i satisfy
. If we assume that there exists a unique solution to this equation, we have P_i = g(λ); hence, P = g(λ)^{M+1} and g(λ) = P^{1/(M+1)}. Finally,
Optimal value for M. The optimal values for P_1,…,P_{M+1} imply that the optimal R_i becomes 1/P_i, i = 1,…,M; thus,
which we want to minimize in M. Note that R_i P_i = 1. Let
whose derivative vanishes at
In general, this does not give an integer. We have y_0 = ln P/(M + 1) (i.e., M + 1 = [ln P/y_0] or [ln P/y_0] + 1). Let ln P/y_0 = n + x with 0 < x < 1. Then the following hold:
The value of the ratio ρ := f (n − 1)/f (n) gives the best choice for M as follows:
Thus, the optimal number of thresholds is given by M = [ln P/y_0] − 1 or M = [ln P/y_0], where y_0 solves F(y) = 0. Then M minimizes
Example 3.2: For h = 1, we have to solve y = 2(e^y − 1). We get y_1 = 0 and y_2 ≈ −1.5936; y_2 corresponds to the minimum, and the optimal value of M is
With P = 10^−k, we have
Note that M increases as P decreases, and with this value of M, each R_i and P_i becomes
Thus, the optimal sampling number and the optimal transition probabilities are independent of the rare event probability.
Moreover, asymptotically, M = n = [ln P/y_0] − 1; thus,
Application 3.1: In approximate counting, remember that the goal is to estimate the number of Knapsack solutions (i.e., the cardinality of Ω defined by
for a given positive real vector a = (a_i)_{i=1}^n and real number b). We might try to apply the Markov chain Monte Carlo (MCMC) method [9]: construct a Markov chain
with state space Ω = {x ∈ {0,1}^n : a·x ≤ b} and transitions from each state x = (x_1,…,x_n) ∈ Ω defined by the following:
The new state is y. This random walk on the hypercube, truncated by the hyperplane a·x = b, converges to the uniform distribution over Ω. This suggests a procedure for selecting Knapsack solutions almost uniformly at random. Starting in state (0,…,0), simulate
for sufficiently many steps that the distribution over states is “close” to uniform.
The problem is to bound the number of steps necessary to make the Markov chain
“close” to stationarity. More precisely, we need a bound on the mixing time:
where Δ_x(t) = max_{S⊂Ω} |P^t(x, S) − Π(S)| and Π is the stationary distribution. In [7], it is shown that
steps suffice.
Keep the vector a fixed but allow b to vary. Use
instead of
to emphasize the dependence on b. Assume without loss of generality that a_1 ≤ ··· ≤ a_n and define
. One can check that
Now write
The ratio ρ_i = |Ω(b_i)|/|Ω(b_{i+1})| may be estimated by sampling almost uniformly from Ω(b_{i+1}) using the Markov chain
and computing the fraction of the samples that lie within Ω(b_i).
Now take a = [1, 2, 3, 4], b = 3, h = 1, R = 5, and C = 2600. We choose the levels as follows: first, define b_1 = 0, b_2 = 1, b_3 = 3, b_4 = 3, and b_5 = b; second, define B_0 = Ω, B_1 = Ω(b_4), B_2 = Ω(b_3), B_3 = Ω(b_2), and B_4 = Ω(b_1). Thus, here, M = n − 1, N = C/n, and n_step = 1020. Obviously, Card(Ω) = 5. We run three different simulations: the first, suggested in [7], estimates the n ratios independently; the other two are the crude and the splitting simulations. We obtain different estimations of Card(Ω):
Even though the levels are not optimal, splitting provides an improvement.
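For this small instance the product-of-ratios scheme can be checked end to end. In the sketch below (Python; the function names are ours), we enumerate Ω(b) exactly and sample uniformly from it directly, standing in for the truncated-hypercube Markov chain; with the levels b_i above, the estimate reproduces Card(Ω) = 5:

```python
import itertools
import random

def omega(a, b):
    """Enumerate Omega(b) = {x in {0,1}^n : a.x <= b} (exact, small n only)."""
    return [x for x in itertools.product((0, 1), repeat=len(a))
            if sum(ai * xi for ai, xi in zip(a, x)) <= b]

def count_by_ratios(a, levels, samples, rng):
    """Estimate |Omega(b)| as |Omega(b_1)| / (rho_1 * rho_2 * ...), where
    rho_i = |Omega(b_i)|/|Omega(b_{i+1})| is estimated by sampling uniformly
    from Omega(b_{i+1}).  Exact uniform sampling stands in here for the
    Markov chain of the text."""
    est = float(len(omega(a, levels[0])))        # |Omega(b_1)| = 1 when b_1 = 0
    for b_lo, b_hi in zip(levels, levels[1:]):
        pool = omega(a, b_hi)
        hits = sum(1 for _ in range(samples)
                   if sum(ai * xi for ai, xi in zip(a, rng.choice(pool))) <= b_lo)
        est /= hits / samples                    # divide by the estimated rho_i
    return est
```

With a = [1, 2, 3, 4] and levels (0, 1, 3, 3, 3) as above, the exact count is 5 and the sampled estimate concentrates around it; the two constant ratios (b_i = b_{i+1}) contribute factors of 1.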
Let us briefly describe the possible solutions of (30). Remember that we want to solve (30); that is, if z = e^y and z ≠ 0, 1,
First, let z_0 be the solution of 2(z − 1) = ln z. Since h′ ≤ 0, H is negative, and a quick survey shows that L is positive on ]0, z_0[ and negative on ]z_0, 1[. As a consequence, the solutions of (37) lie in ]z_0, 1[, if they exist. Thus, solving (30) is equivalent to studying the intersections of H and L. Examining these functions shows that there are two cases (see Fig. 3):
Case 1: An odd number of intersections between L and H
Case 2: An even number of intersections (possibly none) between L and H
Note that y = 0 is a solution of (30). In case 1, it corresponds to a maximum, and in case 2, it corresponds to a minimum. The second case is excluded since we made the assumption h(1) > 0.
Remark 3.2: The solution y = 0 corresponds to the following optimal values:
However, P_i = 1 implies that P = 1, and R_i = 1 means that we just perform a crude simulation.
Example 3.3: Here, P = 10^−12 and C = 10^4.
1. In Example 3.1, h(1) = 0 and we are in the second case: the unique solution y = 0 is the minimum.
2. Let h(x) = 1/x − 8x² + 12x − 5. Then h(1) = 0 and we are in the first case: y = 0 and y ≈ −0.9919 are the solutions; y = 0 is the maximum and the other solution is the minimum. Taking y ≈ −0.9919, we obtain
and we can take R = 3 and N = 23.
3. Let h(x) = (1/x − 1)² e^{6x}. Then h(1) = 0 and we are in the second case: y = 0, y_1 ≈ −0.4612, and y_2 ≈ −0.5645 are the solutions; y = 0 is the minimum and the second solution is the maximum.
4. Let h(x) = 1/x. Here, h(1) = 1. We want to solve (30), whose solutions are y = 0 and y ≈ −0.6438. Taking y ≈ −0.6438, we obtain
and we can take R = 2 and N = 34.
Thus, the control of the variance of
gives a crude confidence interval for P. Indeed, we get
This estimate is, in general, useless. For example, for h = 1, M = 12, and α = 10^−2, the upper bound becomes ≈ 5 × 10^5/N. To obtain a bound lower than 1, we would need N ≥ 5 × 10^5. To improve it, we use Chernoff's bounding method instead of the Markov inequality: for all λ > 0,
where
is the log-Laplace of
. Optimization on λ > 0 provides
Similarly,
Let ψ* be the Cramér transform of ψ: ψ*(τ) = sup_λ [λτ − ψ(λ)]. Thus,
So we want to obtain an accurate lower bound on ψ*.
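To see why the Chernoff/Cramér route pays off, here is a small self-contained comparison (in Python; the setting, the empirical mean of N i.i.d. Ber(p) trials, is a simplified stand-in for our estimator, and the function names are ours):

```python
import math

def chebyshev_bound(p, N, t):
    """Chebyshev: P(mean of N Bernoulli(p) >= t) <= Var / (N (t - p)^2)."""
    return p * (1.0 - p) / (N * (t - p) ** 2)

def chernoff_bound(p, N, t):
    """Chernoff: exp(-N psi*(t)), psi* the Cramer transform of Ber(p)."""
    psi_star = t * math.log(t / p) + (1.0 - t) * math.log((1.0 - t) / (1.0 - p))
    return math.exp(-N * psi_star)

p, N, t = 0.2, 1000, 0.3
print(chebyshev_bound(p, N, t))   # ~ 1.6e-2
print(chernoff_bound(p, N, t))    # ~ 6e-13: exponentially sharper
```

The variance bound decays like 1/N, while the Laplace-transform bound decays exponentially in N; this is the gap illustrated in Figure 5.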
Remark 3.3: Although we would therefore like to take R_i so that R_i P_i = 1, we are constrained to choose R_i to be a positive integer. Hereafter, we suppose that we are in the optimal case, where R_i = 1/P_i is an integer.
To study the Laplace transform of
, we turn to the theory of branching processes (see Harris [6], Lyons [8], and Athreya and Ney [2]). More precisely, we consider our splitting model as a Galton–Watson process, the thresholds representing the different generations.
We consider a Galton–Watson model (Z_n), where the size Z_n of the nth generation is the number of particles that have reached the level B_n, with one particle run at the beginning. Then Z_0 = 1 and Z_n satisfies the following recurrence relation:
where X_i^n is the number of particles among R_n that have reached the (n + 1)st level. The (X_i^n)_{i≥1} are i.i.d. with common Binomial law with parameters (R_n, P_{n+1}), and X_i^0 ∼ Ber(P_1). Take the optimal values of Section 3.2:
Let
, the g.f. of Z_1. Then the g.f. of Z_n is the nth iterate of f. Since
, we get
where g is the g.f. of a Ber(P_0) and f is the g.f. of a Bin(R, P_0). Thus, we are interested in the expression of f_M, the Mth functional iterate of f.
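Although f_M has no tractable closed form, the iterates are easy to evaluate numerically. A minimal sketch (Python; names ours) checking the basic properties used in what follows — f_M(1) = 1, f_M′(1) = 1 in the critical case RP_0 = 1, and f_M(0) increasing toward 1 — is:

```python
def gf_binomial(R, P0):
    """Generating function of a Bin(R, P0) law."""
    return lambda s: (P0 * s + 1.0 - P0) ** R

def iterate(f, M, s):
    """M-th functional iterate f_M(s)."""
    for _ in range(M):
        s = f(s)
    return s

f = gf_binomial(5, 0.2)          # critical case: R * P0 = 1
print(iterate(f, 1, 0.0))        # extinction by generation 1: 0.8**5 = 0.32768
print(iterate(f, 20, 0.0))       # extinction probability increases toward 1
```

In the critical case f_M′(1) = (f′(1))^M = 1 for every M, which can also be checked by a numerical derivative at s = 1.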
Here,
, so we are in the critical case of the branching process, which ensures that the simulation algorithm stops with probability 1 as M → ∞ (see [6]), since if f^(3)(1) < ∞,
This reflects the increasing rarity as the number M of thresholds increases and the probabilities between the levels decrease.
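The classical critical-case asymptotic P(Z_n > 0) ∼ 2/(n f″(1)) = 1/(nα_1) (Kolmogorov's estimate, a standard branching-process fact quoted here as background, not a result of this article) can be checked numerically by iterating the g.f. (Python sketch; names ours):

```python
def survival_probability(R, P0, n):
    """P(Z_n > 0) = 1 - f_n(0), computed by iterating the Bin(R, P0) g.f."""
    s = 0.0
    for _ in range(n):
        s = (P0 * s + 1.0 - P0) ** R
    return 1.0 - s

R, P0 = 5, 0.2                          # critical: R * P0 = 1
alpha1 = R * (R - 1) * P0 ** 2 / 2.0    # f''(1)/2 = 0.4 here
for n in (10, 100, 1000):
    # The normalized survival probability tends to 1 as n grows.
    print(n, survival_probability(R, P0, n) * n * alpha1)
```

The ratio approaches 1 from below, consistent with the geometric lower bound of Proposition 4.2, which forces P(Z_n > 0) ≤ 1/(1 + nα_1).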
In our case,
The iterated function f_M has no explicit tractable form, so we will derive bounds on f_M(s) around s = 1. To do this, we state a general result on the Laplace transform in critical Galton–Watson models, which we could not find in the literature.
Remark 4.1: Remember that f_n and its derivatives are convex. Furthermore, for all 0 ≤ s ≤ 1, s ≤ f(s) ≤ f(1) = 1, and by induction, f(s) ≤ f_2(s) ≤ ··· ≤ 1. Finally, we obtain f_n(s) → 1, since f_n(s) ≥ f_n(0) and f_n(0) → 1.
Proposition 4.1: Let α_1 = f″(1)/2, C = (max_{s∈[0,1]} f‴(s))/(6α_1), and γ_n = nα_1[1 − (C/n)(log n + 1)] − α_1. Then, for s close to 1 and large n,
Proof: Upper bound: Using Taylor's expansion, with f_n(s) ≤ θ_n ≤ f_n(1) = 1,
since f′(1) = 1. Let r_n = 1 − f_n(s); then r_n satisfies
Now let α_0 = f″(0)/2. Define the decreasing sequences (a_n) and (b_n) satisfying
Then
1. Upper bound on b_n: Since 0 ≤ b_j ≤ 1, we have
Thus,
2. Lower bound on a_n: Apply this upper bound to a_n (with α_0 becoming α_1):
By substituting (66) in
, we get
Finally, (63) and (67) lead to the upper bound on f_n in (58).
Lower bound: In fact, we prove by induction that
For n = 1, the left-hand side of (68) is given by Remark 4.1. Then note that
; thus, for n large enough,
. For all 1 − (1/n) ≤ s ≤ 1,
However, by induction, we have
, and so, since f is increasing, taking
,
(i.e.,
, where
. Note that
; more precisely, we have
and we finally obtain the left-hand side of (58) since γ → h_γ is increasing. █
In the particular case of f(s) = (P_0 s + 1 − P_0)^R, we can derive a more precise lower bound:
Proposition 4.2: For s close to 1,
Observe that this is precise at s = 1.
Proof: Let h(s) = 1 − ((1 − s)/(1 + α_1(1 − s))). Since f(1) = h(1) = 1, f′(1) = h′(1) = 1, and f″(1) = h″(1) = 2α_1, the sign of f − h depends on the sign of the third derivative of f − h, which is negative here. Hence h ≤ f. Since f is increasing, we deduce (74) by induction. █
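Proposition 4.2 can be checked numerically: the fractional linear function h iterates in closed form, and its nth iterate bounds the binomial iterates from below (a small Python sketch; function names are ours, with alpha1 = f″(1)/2):

```python
def f_iter(R, P0, n, s):
    """n-th functional iterate of the Bin(R, P0) generating function."""
    for _ in range(n):
        s = (P0 * s + 1.0 - P0) ** R
    return s

def geom_iter(alpha1, n, s):
    """Closed-form n-th iterate of h(s) = 1 - (1-s)/(1 + alpha1*(1-s)).

    The Moebius map h satisfies 1/(1 - h(s)) = 1/(1 - s) + alpha1,
    so iteration simply adds n*alpha1."""
    return 1.0 - (1.0 - s) / (1.0 + n * alpha1 * (1.0 - s))
```

On a grid of s in [0, 1] and for various n, f_iter dominates geom_iter, as Proposition 4.2 asserts, with equality at s = 1.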
We plot in Figure 4a the upper bound and the two lower bounds for P = 10^−12 and s near 1.
Remark 4.2: First, let us note that, by convexity, for all s ≥ 1,
hence, f (s) ≥ s, and by induction on n,
We remark that for s > 1, the iterated function increases rapidly to infinity.
Proposition 4.3: Let γ_n′ = nα_1[1 + (C/n)(log n + 1)] − α_1. Then, for s close to 1 and large n,
Proof: Proceeding as in Proposition 4.1 leads to the upper bound. Here, f_n → ∞ as n → +∞, which prevents us from making a Taylor expansion around 1. To overcome this difficulty, consider k_n, the inverse function of f_n; it is the nth functional iterate of the g.f. k (the inverse function of f), which takes the value 1 at 1, has derivative 1 at 1 and negative second derivative, and satisfies k_n → 1 as n → +∞. Thus, making a Taylor expansion and using the same tools as previously, we get
where β_2 = k″(s)/2 and s_n := 1 + 1/(nα_1). Using the link between k_n and f_n and the upper bound on k_n,
The lower bound on k_n leads to an upper bound on f_n; however, it provides no improvement. █
As done earlier, we can derive a more precise upper bound in the particular case of f(s) = (P_0 s + 1 − P_0)^R:
Proposition 4.4: For s close to 1,
We plot in Figure 4b these three bounds for P = 10^−12 and s near 1.
About the geometric distribution. If the law of X is such that the probabilities p_k are in geometric proportion (
for k = 1, 2, …, and p_0 = 1 − p_1 − p_2 − ···, with b, c > 0 and b ≤ 1 − c), then the associated g.f. is a rational function:
Taking b = (1 − c)² and c = α_1/(1 + α_1) leads to
So we have compared the nth functional iterate of a Binomial g.f. to that of a geometric g.f. This suggests comparing the importance splitting models with Binomial and with geometric laws. The geometric-law model is set up in the following way: we run particles one after the other; as long as the next level is not reached, we keep on generating particles, and then we start again from the level the particle is at (the geometric distribution being the law of the first success).
This link is also stressed by Cosnard and Demongeot in [4]: for m = 1 and σ² = f″(1) = 2α_1, the asymptotic behavior of f_{2n} is the same as that of the geometric distribution with the same variance (i.e., of h).
Remember that
Considering the gradients of the functions, we prove that the supremum over λ ≥ 0 is reached near zero. So we can use the upper bounds on f_M obtained in the previous subsection, which lead to lower bounds on ψ*:
where
Finally,
One can easily obtain explicit but complex expressions for F(x) and G(x). We plot in Figure 5 the upper bounds obtained from the variance and from the Laplace transform, for different values of α, the prescribed error of the confidence interval. We take P = 10^−9 and the optimal values obtained above for the parameters. Note that the upper bound given by the Laplace transform is better than the bound given by Chebyshev's inequality, with the variance. We obtain
. In the preceding example, where P = 10^−9, if we fix α = 0.05 and L close to 0.01, then the corresponding costs needed are 3 × 10^7 for the variance and 3 × 10^6 for the Laplace transform.
The simplified model described here has two main shortcomings. First, we cannot, in general, choose the optimal levels P_i. In practice, we only have an empirical estimation of the P_i, and we can adjust the levels according to them. A more precise analysis is then needed to get confidence intervals for the estimation. Second, the optimal sampling number at each level is not an integer in general. Therefore, in practice, the number of particles generated at each step should be chosen at random, either such that
. Thus, we finally need to work in a random environment. This requires precise asymptotics of random iterates of the Laplace transform, whose analysis is more delicate than the one presented here and will be the subject of a forthcoming paper.
I gratefully acknowledge my supervisors Dominique Bakry and Pascal Lezaud for their helpful suggestions and comments. I would also like to thank Persi Diaconis for telling me about approximate counting and Pieter-Tjerk de Boer for making a number of valuable remarks that improved the precision of the figures.