
QUICK SIMULATION METHODS FOR ESTIMATING THE UNRELIABILITY OF REGENERATIVE MODELS OF LARGE, HIGHLY RELIABLE SYSTEMS

Published online by Cambridge University Press:  01 July 2004

Marvin K. Nakayama
Affiliation:
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, E-mail: marvin@njit.edu
Perwez Shahabuddin
Affiliation:
Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, E-mail: perwez.shahabuddin@columbia.edu

Abstract

We investigate fast simulation techniques for estimating the unreliability in large Markovian models of highly reliable systems for which analytical/numerical techniques are difficult to apply. We first show mathematically that for “small” time horizons, the relative simulation error, when using the importance sampling techniques of failure biasing and forcing, remains bounded as component failure rates tend to zero. This is in contrast to naive simulation where the relative error tends to infinity. For “large” time horizons where these techniques are not efficient, we use the approach of first bounding the unreliability in terms of regenerative-cycle-based measures and then estimating the regenerative-cycle-based measures using importance sampling; the latter can be done very efficiently. We first use bounds developed in the literature for the asymptotic distribution of the time to hitting a rare set in regenerative systems. However, these bounds are “close” to the unreliability only for a certain range of time horizons. We develop new bounds that make use of the special structure of the systems that we consider and are “close” to the unreliability for a much wider range of time horizons. These techniques extend to non-Markovian, highly reliable systems as long as the regenerative structure is preserved.

Type
Research Article
Copyright
© 2004 Cambridge University Press

1. INTRODUCTION

This article deals with the estimation of the unreliability (i.e., probability of system failure within a given time horizon) in large regenerative models of highly reliable systems. For example, we might be interested in the probability that the computer system aboard a space shuttle fails before the mission end time. Similarly, we might be interested in the probability that the web server used for real-time monitoring and displaying scores in the Olympic Games crashes during the period of the games.

The class of systems we study in this article includes a large class of systems that can be modeled by the System Availability Estimator (SAVE) modeling tool (Blum, Goyal, Heidelberger, Lavenberg, Nakayama, and Shahabuddin [2,3]; also see Goyal, Carter, de Souza e Silva, Lavenberg, and Trivedi [13] and Goyal and Lavenberg [14] for earlier versions). This tool is mainly used for the availability/reliability modeling of computer and communication systems. It models systems consisting of components, each of which fails and gets repaired. The system is considered to be up or down depending on the states of the individual components. All components are highly reliable (i.e., their expected failure times are much larger than their expected repair times). There are a limited number of repairmen in the system. In addition, there are component dependencies and interactions like failure propagation (the failure of a component could cause some other components to fail instantaneously), operational dependencies (the operation of an up component requires some other components to be up), and repair dependencies (the repair of a failed component requires some other components to be up). SAVE considers models for which component failure- and repair-time distributions are exponentially distributed; we will call these highly reliable Markovian systems. The techniques in this article can easily be extended to systems for which the repair times of components have general distributions.

Highly reliable Markovian systems can be modeled as continuous-time Markov chains (CTMCs). However, due to the size of the state space (the state space grows exponentially with the number of components in the system), analytical and numerical (nonsimulation) methods are very difficult to apply. These methods include Gaussian elimination (see, e.g., Young [42]) for computing steady-state measures and uniformization (see, e.g., Gross and Miller [16] and Jensen [19]) for computing transient measures. Hence, approximations have to be used. Although some useful approximations based on lumping and state aggregation have been developed for the steady-state case (see, e.g., Muntz, de Souza e Silva, and Goyal [26]), very few exist for the transient setting. Further complications arise because the embedded Markov chain is stiff due to the wide difference between the failure rates and the repair rates of components. In any case, when no exact solutions are easily computable, simulation presents a viable alternative.

In the case of simulation, because system failures are rare, it takes long CPU times for naive simulation to produce good estimates. Importance sampling (see, e.g., Glynn and Iglehart [12] for the basic technique) has been used successfully to increase the speed of simulation in highly reliable Markovian systems (see Alexopoulos and Shultes [1], Carrasco [5,6], Conway and Goyal [7], Geist and Smotherman [9], Goyal, Shahabuddin, Heidelberger, Nicola, and Glynn [15], Juneja and Shahabuddin [20,21,22], Lewis and Böhm [25], Obal and Sanders [33], Shahabuddin [36], Shultes [39], and references therein). Surveys can be found in Heidelberger [17], Nakayama [28], Nicola, Shahabuddin, and Nakayama [32], and Shahabuddin [37]. The most widely researched importance-sampling technique for Markovian models is failure biasing. In failure biasing (Lewis and Böhm [25]), we artificially accelerate the occurrence of component-failure events with respect to component-repair events so that the system fails more often. The resulting estimates are then adjusted to make them unbiased. Mathematical analyses of failure-biasing techniques for estimating steady-state measures in highly reliable Markovian systems have been done in Nakayama [27,29], Shahabuddin [36], and Strickland [41]. In [36], it was shown that a failure-biasing heuristic called balanced failure biasing (Goyal et al. [15] and Shahabuddin [36]) produces a bounded relative error (BRE) in the estimation of steady-state measures like the unavailability (the relative error of the estimate remains bounded as the failure rates tend to zero). This is in contrast to naive simulation, for which the relative error tends to infinity. Balanced failure biasing was then implemented in SAVE (Blum et al. [2,3]).

In the case of transient measures, an additional technique that is used is forcing (Lewis and Böhm [25]). In this case, the first component-failure transition is accelerated to make it happen within the time horizon. As has been shown experimentally (Goyal et al. [15] and Shahabuddin [35]), for most transient measures, forcing and failure biasing work well for the case where the time horizon is “small.”

When the time horizon is “large,” simulation using importance sampling is not efficient. Some work in the large-time-horizon case has been done in Carrasco [5] by numerically inverting a discretized Laplace transform. Another approach suggested in Shahabuddin and Nakayama [38] and Shahabuddin [35] is to bound the transient measure in terms of regenerative-cycle-based measures and then directly estimate (using simulation) the regenerative-cycle-based measures. In highly reliable systems, regenerative-cycle-based measures can be estimated with bounded relative error using importance sampling (Shahabuddin [36]). In Shahabuddin [35], such bounds were developed for the expected interval unavailability (the expected fraction of time in (0,t] that the system is down). It was also shown that these bounds converge to the expected interval unavailability as component failure rates tend to zero.

This article has two main contributions. First, we prove that for the case of small time horizons, failure biasing and forcing give bounded relative error in the estimation of the unreliability. A crucial proposition (Proposition 1) proved in this article also allows the bounded-relative-error result to be extended to the small-time-horizon estimation of the expected interval unavailability using the infrastructure already developed by Shahabuddin [35]. Second, we investigate the “bounding approach” mentioned in the previous paragraph, for large-time-horizon estimation of the unreliability. Since a highly reliable Markovian system is regenerative, the time to failure is roughly the “geometric sum” of independent and identically distributed (i.i.d.) random variables, where a “success” in the geometric distribution is defined as failure occurring in a regenerative cycle. It is well known that if the probability of success is small, then this geometric sum is approximately exponentially distributed (see, e.g., Keilson [24] and Solovyev [40]). There is some literature regarding bounds for the cumulative distribution function (which, in our case, is the unreliability) of such random variables (e.g., Brown [4], Kalashnikov [23], and Solovyev [40]). We consider bounds proposed in [4] and [23]. Mathematical analysis reveals that these bounds converge to the unreliability (as the component failure rates tend to zero) only for a certain range of time horizons. We develop new improved bounds that take into consideration the special structure of the type of highly reliable systems we consider; the bounds in [4], [23] and [40] do not make use of this special structure. We show that these new bounds converge to the unreliability for a much wider range of time horizons. We implement these bounds and do some experimental comparisons between our bounds and the bounds in [4]. 
Preliminary versions of some of the results mentioned earlier appeared (without proofs) in Shahabuddin and Nakayama [38].

In Section 2, we briefly review the Markovian framework considered in Shahabuddin [36] and results for regenerative-cycle-based measures. Section 3 contains asymptotic expressions for the unreliability and the variance of unreliability estimates without and with importance sampling. The bounded-relative-error property of the estimate of the unreliability for the small-time-horizon case is proven here. In Section 3.2, we explain why the efficiency deteriorates for large-time-horizon cases. Then, we consider bounds developed in Brown [4] and determine the range of the time horizons for which these bounds converge. Finally, we develop bounds on the unreliability and investigate their convergence. Section 4 contains experimental results. Difficult proofs of intermediate results are relegated to the Appendix.

2. REVIEW OF HIGHLY RELIABLE SYSTEMS AND FAST SIMULATION

2.1. CTMC Models of Highly Reliable Systems

We further elaborate on the description of the highly reliable Markovian systems presented in Section 1. Let there be R types of components in the system with ni components of type i. Some of the ni components can act as spares for that type. Let N be the total number of components of all types in the system (i.e., N = n1 + n2 + ··· + nR). There are many repairman classes in the system, with a finite number of repairmen in each repairman class. Each component type is assigned a repairman class. The component types assigned to a repairman class are divided into priority classes and repaired in a preemptive or nonpreemptive manner. Any service discipline can be used for members of the same priority class.

Let {X(t) : t ≥ 0} be a CTMC model (assumed to be right-continuous with left limits) of a highly reliable Markovian system. For the simplest such system, X(t) = (X1(t),X2(t),…,XR(t)), where Xi(t) is the number of components of type i that are failed. For the general, highly reliable Markovian systems we consider, we have to add some more process descriptors in the state of the system. For example, we might have to add an ordered list of components waiting to be repaired in each repairman class. The following analysis is independent of which definition of X(t) we use.

Transitions in the CTMC correspond either to a component failing (which may cause the instantaneous failure of other components through failure propagation) or a component completing repair. We will refer to the first as a “failure transition” and to the second as a “repair transition.” We label the state with all components up as 1. We will also use 1 to denote the set of states that contains only the single state 1. Assume that X(0) = 1 unless stated otherwise. Let S be the set of states that are accessible from 1. For the purpose of simulation analysis, we will restrict {X(t) : t ≥ 0} to the set S. We partition S into two subsets: S = U ∪ F, where U is the set of up states and F ≠ ∅ is the set of down states. Of course, 1 ∈ U and the state where all components are failed is in F. We will need the following assumptions:

(A.1) The CTMC is irreducible over the set S.

(A.2) From all states in S except 1, there is at least one repair transition (with positive probability).

(A.3) From all states in U, there is at least one failure transition (with positive probability).

(A.4) From 1, there is at least one failure transition to a state in U∖1 (with positive probability).

Let Q = {q(x,y) : x, y ∈ S} be the rate matrix (also called the infinitesimal generator matrix) of the CTMC. In this matrix, we arrange the states in order of increasing number of components failed. Thus, the first state is the one in which no components are failed (i.e., 1). This is followed by the states in which exactly one component is failed and so on. Let q(x) = −q(x,x) denote the total rate out of state x and h(x) = 1/q(x) be the mean holding time in that state. From (A.2) and (A.3), q(x) > 0 for all x ∈ S. Let Φ denote the probability measure on the sample paths of this CTMC. For any E ⊆ S, let TE = inf{t > 0 : X(t−) ∉ E, X(t) ∈ E}. Of particular interest are T1 and TF.

Let {Yn : n ≥ 0} denote the embedded discrete-time Markov chain (DTMC) of {X(t) : t ≥ 0}; that is, {Yn : n ≥ 0} has transition matrix P = {P(x,y) : x, y ∈ S}, where P(x,x) = 0 and P(x,y) = q(x,y)/q(x) for x ≠ y. One can simulate a CTMC by progressively generating the next state of the embedded DTMC and generating the random holding time in that state. For any E ⊆ S, let τE = inf{n ≥ 1 : Yn ∈ E}. Of particular interest are τ1 and τF.
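The embedded-chain simulation just described can be sketched in a few lines of Python (a toy illustration only, not the SAVE implementation; the three-state chain and its rates below are invented for the example):

```python
import random

def simulate_ctmc(P, q, start, absorbing, rng):
    """Simulate a CTMC via its embedded DTMC: in state x, draw an
    Exponential(q[x]) holding time, then jump to y with probability P[x][y].
    Stop upon entering a state in `absorbing`; return (path, elapsed time)."""
    path, t, x = [start], 0.0, start
    while x not in absorbing:
        t += rng.expovariate(q[x])     # holding time in x, with rate q(x)
        u, cum = rng.random(), 0.0
        for y, p in P[x].items():      # inverse-transform step of the DTMC
            cum += p
            if u < cum:
                break
        x = y
        path.append(x)
    return path, t

# Toy chain: state 0 = all components up, 1 = one component down, 2 = failed.
P = {0: {1: 1.0}, 1: {0: 0.999, 2: 0.001}}
q = {0: 0.01, 1: 10.0}
path, t = simulate_ctmc(P, q, start=0, absorbing={2}, rng=random.Random(7))
```

Note how rarely the absorbing state 2 is reached per visit to state 1; this is exactly why naive simulation of such chains is expensive.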

As is well known, positive-recurrent CTMCs are regenerative (see, e.g., Crane and Iglehart [8] for the definition) in nature. In the following study, we will be considering system regenerations that occur when the system enters state 1. In any regenerative cycle, let Z be the random variable denoting the holding time in state 1 and let W be the remaining time until either state 1 is reached (again) or the system fails. In mathematical terms, W = min(T1,TF) − Z, where T1 and TF are measured from the start of the cycle. We will let WF (resp., W1) be the random variable having the distribution of W given that a system failure (resp., no system failure) occurs in a regenerative cycle. Another quantity that will be used in the following analysis is γ ≡ P{TF < T1} = P{τF < τ1} (i.e., the probability of system failure occurring in a cycle). Assumption (A.3) ensures that γ is not trivially equal to zero. Assumptions (A.2) and (A.4) ensure that 1 − γ is not equal to zero. For any regenerative-cycle-based random variable, say W, we will use Wi to denote its value in the ith regenerative cycle.

2.2. Modeling Highly Reliable Components

In mathematical models of highly reliable Markovian systems, the failure rate of any component, say component i, is assumed to be of the form λiεri, where ε is a small positive parameter called the rarity parameter; ri and λi are positive constants (Shahabuddin [36]; see also Gertsbakh [10]). Let r0 = min{r1,…,rN} > 0. Because the repair rates are large compared to failure rates, the repair rate of component i is represented by a constant μi > 0. The failure-propagation probabilities are either assumed to be constants or of the same form as the failure rates (i.e., constants multiplied by ε raised to positive powers). Once we introduce this ε-parameterization, then the performance measures become functions of ε. However, for simplicity, we do not specify this dependence in the notation. For example, we continue to use γ for what should ideally be denoted by γ(ε). In highly reliable Markovian systems, the aim is to study the performance measures and the variance of their estimators for small ε.

This particular ε-parameterization guarantees that if q(x,y) > 0 (resp., P(x,y) > 0) for some ε = ε0 > 0, then q(x,y) > 0 (resp., P(x,y) > 0) for all 0 < ε < ε0. Given the arrangement of states in Q that we mentioned earlier, it can easily be seen that all the elements above the diagonal are O(ε) and all elements below the diagonal are O(1). (A function f(ε) is defined to be O(εd) (resp., O̲(εd)), d ≥ 0, if there exists a constant K > 0 such that |f(ε)| ≤ Kεd (resp., |f(ε)| ≥ Kεd) for all sufficiently small ε. A function is said to be Ω(εd), d ≥ 0, if it is both O(εd) and O̲(εd); that is, a function is Ω(εd) if it is exactly of order εd. A function f(ε) is defined to be o(εd), d ≥ 0, if |f(ε)|/εd → 0 as ε → 0.) This structure played an important role in the steady-state-simulation analysis of highly reliable Markovian systems in Shahabuddin [36].

One key property of highly reliable Markovian systems that will be used later is the special structure of the regenerative cycle. Because state 1 has no repair transitions, q(1) is Ω(εr0) since it is the sum of only component failure rates. An important consequence of this is that E(Z) = 1/q(1) is Ω(ε−r0). By Assumption (A.2) and the fact that the μi's are positive constants, we have that q(x) = Ω(1) for all x ∈ S∖1. Finally, as mentioned earlier, repair rates are very large compared to failure rates. Using these three facts, we see that most of the regenerative cycles consist of a long time interval in which the system is in state 1, after which there is a single failure transition (which might correspond to more than one component failing because of failure propagation), followed by a short time interval in which the failed components are repaired (and thus the cycle completes). This is the intuition behind the fact shown in Shahabuddin [36] that E(min(TF,T1)) = E(Z + W) is of the same order as E(Z) (i.e., Ω(ε−r0)). Using similar methods, we can show that the expected regenerative-cycle time, E(T1), is also Ω(ε−r0) and that E(W), E(WF), and E(W2) are O(1). For completeness, the formal proofs are given in the Appendix. It was also shown in [36] that γ = Ω(εr), where r is some nonnegative number depending on the structure of the system. Using the fact that the mean time to system failure (MTTF) for regenerative systems can be expressed as E(min(TF,T1))/γ (see, e.g., Goyal et al. [15]), we see that the MTTF is Ω(ε−(r0+r)).

2.3. Importance Sampling for Highly Reliable Systems

Consider the problem of estimating the probability, α, of a rare event {X ∈ A}, where X is a random sample path with probability measure Φ and A is a rare set. The “naive” way of estimation is to generate samples of X, say X1, X2, …, Xn, from Φ and then form the sample mean

α̂ = (1/n) Σi=1n I(Xi ∈ A),

where I(·) is the indicator function of the event inside the bracket. The variance of this estimator is (α − α2)/n ≈ α/n, as α is small. The expected half-width of the 100(1 − δ)% confidence interval, HW, is proportional to √(α/n). Thus, the relative error (RE), which is defined by RE = HW/α, is proportional to 1/√(nα). For fixed n, as α → 0, RE → ∞. This is the problem with naive estimation of the probability of rare events.
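The blow-up of the naive relative error can be checked with a short computation (an illustrative sketch; the sample sizes and z = 1.96, the 95% normal quantile, are chosen for the example):

```python
import math

def naive_relative_error(alpha, n, z=1.96):
    """RE = HW/alpha for the naive estimator of a probability alpha
    from n i.i.d. Bernoulli(alpha) samples, at the 95% confidence level."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n)
    return half_width / alpha

# For fixed n, RE grows like 1/sqrt(alpha) as the event becomes rarer.
res = [naive_relative_error(10.0 ** -k, n=10 ** 6) for k in (2, 4, 6)]
```

Each factor of 100 reduction in α inflates the relative error by roughly a factor of 10, as the 1/√(nα) proportionality predicts.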

In importance sampling, we use a change of measure Φnew, such that for each sample path x ∈ A, Φnew(x) > 0 if Φ(x) > 0. Then, we can express α as

α = EΦ(I(X ∈ A)) = EΦnew(I(X ∈ A)L(X)),     (1)

where L(x) = dΦ(x)/dΦnew(x) when dΦnew(x) > 0 and 0 otherwise. The function L is the Radon–Nikodym derivative; it is also called the likelihood ratio. The subscript in the second expectation operator denotes the new probability measure under which the expectation is taken. Equation (1) forms the basis of the technique of importance sampling. The last expectation term suggests that we use the probability measure Φnew(·) and generate the samples (Xi, L(Xi)). Then, a new unbiased estimator is given by

α̂new = (1/n) Σi=1n I(Xi ∈ A)L(Xi),

and its variance is

Var(α̂new) = (EΦnew(I(X ∈ A)L2(X)) − α2)/n.

The main problem in importance sampling is to find an easily implementable Φnew so that EΦnew(I(X ∈ A)L2(X)) ≪ α, and so the variance of the new estimator is significantly less than that of the naive one.
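As a stand-alone illustration of Eq. (1) on a toy problem (not the reliability model: here α = P(X ≥ a) for X ~ Exp(1), and the sampling rate λ is chosen by hand for the example):

```python
import math
import random

def is_tail_estimate(a, n, lam, rng):
    """Importance-sampling estimate of alpha = P(X >= a) for X ~ Exp(1):
    sample X from Exp(lam) with lam < 1, which pushes mass toward the rare
    tail, and weight each hit by the likelihood ratio L(x) = f(x)/g(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(lam)
        if x >= a:
            total += math.exp(-x) / (lam * math.exp(-lam * x))  # L(x)
    return total / n

est = is_tail_estimate(a=20.0, n=20000, lam=0.05, rng=random.Random(0))
# the exact answer, exp(-20) ~ 2e-9, is far beyond the reach of 20000 naive samples
```

Under the new measure, the rare event occurs in a constant fraction of the samples, and the small likelihood ratio supplies the correct scale of the answer.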

The most common importance-sampling technique used for highly reliable systems is failure biasing, proposed by Lewis and Böhm [25]. The basic idea behind failure biasing is to make component-failure transitions in the embedded DTMC happen with a probability that is much higher than in the original system. In state 1, there are no repair transitions. Therefore, we do not need to failure bias in this state. However, in states that have both failure and repair transitions, the total probability of repair transitions ≈ 1 and the total probability of failure transitions ≈ 0. In such states, the total probability of failure transitions is increased to θ, where θ is some constant (i.e., it is independent of ε) between 0 and 1 that is significantly larger than the failure transition probabilities (in practice, θ is typically taken to be 0.5). Therefore, the total probability of repair transitions is decreased to 1 − θ. In a version of failure biasing called balanced failure biasing (Goyal et al. [15] and Shahabuddin [36]), the probability of each failure transition (that had positive probabilities in the original system), conditioned on the event that the transition that occurs is a failure transition, is made the same (this is also done in state 1). The probability of each repair transition, conditioned on the event that the transition that occurs is a repair transition, is left unchanged. Let P′ be the new transition-probability matrix corresponding to balanced failure biasing. Note that P′(x,y) > 0 if and only if P(x,y) > 0.
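The reweighting of a single row of the embedded DTMC under balanced failure biasing can be sketched as follows (a schematic helper, not SAVE's implementation; the transition probabilities are invented, and the default θ = 0.5 follows the common practice noted above):

```python
def balanced_failure_bias(failure, repair, theta=0.5):
    """Rebias one row of the embedded DTMC. `failure`/`repair` map successor
    states to their original transition probabilities. The total failure mass
    is boosted to theta and split equally across failure transitions
    (the "balanced" part); repair transitions keep their original conditional
    probabilities, rescaled to total mass 1 - theta."""
    P_new = {y: theta / len(failure) for y in failure}
    rep_total = sum(repair.values())
    for y, p in repair.items():
        P_new[y] = (1.0 - theta) * p / rep_total
    return P_new

# A state with two rare failure transitions and one dominant repair transition.
P_new = balanced_failure_bias(failure={2: 1e-5, 3: 3e-5}, repair={0: 1.0 - 4e-5})
```

The failure transitions, originally of probability 1e-5 and 3e-5, each receive conditional probability θ/2 = 0.25 under P′, while the repair transition retains all of the remaining 1 − θ mass.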

We will now review the order-of-magnitude results for the variance associated with importance sampling in the estimation of γ. To obtain a sample of I(TF < T1), we need only simulate one regenerative cycle of the CTMC. Thus, γ is called a “regenerative-cycle-based measure.” Other examples of regenerative-cycle-based measures are E(W), E(WI(TF < T1)), E(WI(TF > T1)), E(T1), and E(min{TF,T1}). Let Φ′′ be a change of measure on the sample paths of the CTMC where we use P′ until time min(TF,T1) and then P after that. For any stopping time τ of the embedded DTMC, define

Lτ = Πi=0τ−1 P(Yi,Yi+1)/P′(Yi,Yi+1).

Let τmin = min(τF, τ1). Then, in the same spirit as Eq. (1), we can write

γ = EΦ′′(I(TF < T1)Lτmin).

The variance of the importance sampling estimator, σγ2(Φ′′) ≡ VarΦ′′(I(TF < T1)Lτmin), is given by σγ2(Φ′′) = EΦ′′(I(TF < T1)Lτmin2) − γ2. Let σγ2(Φ) be the variance of the naive estimator.

The following theorem was proved in Shahabuddin [36] under the following strengthened version of Assumption (A.4):

(A.4′) For all x ∈ F, P(1,x) is o(1) (this implies that there exists x ∈ U∖1 such that P(1,x) is Ω(1)).

Theorem 1: Both γ and σγ2(Φ) are Ω(εr), where r is a positive constant depending on the structure of the system, the ri's, and the failure-propagation probabilities. Also, EΦ′′(I(TF < T1)Lτmin2) is Ω(ε2r) and, so, σγ2(Φ′′) is O(ε2r).

This theorem implies that we get a bounded RE in the estimation of γ. One can prove similar bounded RE results for other “rare” regenerative-cycle-based measures like E(WI(TF < T1)). Moreover, observe that Assumption (A.4′) is not at all restrictive. If it does not hold, then γ is Ω(1) and we do not have to use importance sampling to estimate it. Unless otherwise stated, in the following sections we will assume that Assumptions (A.1), (A.2), (A.3), and (A.4′) hold.

For transient measures like the unreliability and the expected interval unavailability, in addition to failure biasing, we have to use another technique called forcing (Lewis and Böhm [25]). If the time horizon t is orders of magnitude less than E(Z), then the system will fail very rarely in [0,t], even though we failure bias. To avoid this, the time of the first event is sampled from the distribution of Z conditioned on the fact that it is less than t. Because Z is exponentially distributed with rate q(1), the time of the first event that we use in the simulation is sampled from the distribution function given by

F(s) = (1 − e−q(1)s)/(1 − e−q(1)t),

where 0 ≤ s ≤ t.
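Forcing amounts to inverse-transform sampling from this truncated exponential (a minimal sketch; the values of q and t below are illustrative):

```python
import math
import random

def sample_forced_first_event(q, t, rng):
    """Sample the first holding time from Exp(q) conditioned on being < t,
    by inverting F(s) = (1 - exp(-q*s)) / (1 - exp(-q*t)) on [0, t]."""
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-q * t))) / q

rng = random.Random(1)
samples = [sample_forced_first_event(q=1e-4, t=10.0, rng=rng) for _ in range(1000)]
# with q*t << 1, the conditioned distribution is nearly Uniform(0, t)
```

Every forced sample lands inside the time horizon, so each replication has a chance of observing a failure within [0,t].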

3. ESTIMATION OF THE UNRELIABILITY

Given a finite time horizon t, the unreliability, U(t), is defined to be the probability that the system fails before time t given that it starts in state 1; that is, U(t) = P{TF < t}.

We wish to estimate the unreliability of the system for different orders of magnitude of the time horizon. A modeling technique that is used in Shahabuddin [35] and Shahabuddin and Nakayama [38] is to represent t as being Ω(ε−rt), where rt ≥ 0, and then model different orders of magnitude of the time horizon by varying rt. Hence, for rt = 0, t is of the same order as the expected repair times, and for rt = r0, t is of the same order as the expected first component-failure time in the system (which is of the same order as the expected regenerative-cycle time). For rt = r0 + r, t is of the same order as the MTTF.

In the following subsection, we will prove that a combination of forcing and importance sampling gives bounded RE in the estimation of the unreliability for the case where rt = 0; that is, the time horizon is “small.” Before that, we will need to significantly extend the importance-sampling theory developed for “rare” regenerative-cycle-based measures (which was partially reviewed in Section 2.3) to a “nonrare” measure. In particular, the following proposition is crucial to the main result of the next subsection. The proof is technical, so it is deferred to the Appendix.

Proposition 1: EΦ′′(I(TF > T1)Lτmin2) − 1 = Ω(1).

3.1. The Small-Time-Horizon Case

We can express the unreliability as U(t) = E(I(A)), where A is the event {TF < t}. Let Φ′ be the new measure on the sample paths of {X(t) : t ≥ 0}, in which in each replication we use P′ until the system fails and P from then on. For the unreliability, for each sample, we only need to simulate the CTMC until either the system fails or the time horizon is exceeded. Let ΦForcing′ be the measure corresponding to both balanced failure biasing and forcing. The variances, without and with balanced failure biasing, are denoted by σU(t)2(Φ) and σU(t)2(Φ′), respectively. The variance with balanced failure biasing and forcing is denoted by σU(t)2(ΦForcing′). The following theorem provides the orders of magnitude of the unreliability and the variances of its estimators when the time horizon is small.

Theorem 2: Consider the case where t = Ω(ε−rt) with rt = 0. Then, both U(t) and σU(t)2(Φ) are Ω(εr+r0). Also, σU(t)2(Φ′) = O(ε2r+r0) and σU(t)2(ΦForcing′) = O(ε2(r+r0)) (r is the same as in the order-of-magnitude expression for γ in Theorem 1).

From Theorem 2, we get the following corollary:

Corollary 1: The RE using ΦForcing′ (corresponding to a fixed 100(1 − δ)% level of confidence), for a fixed number n of replications, remains bounded as ε → 0.

Lemmas 1 and 2 give bounds on U(t) that are used to prove the order-of-magnitude result for U(t) in Theorem 2. The result for σU(t)2(Φ) is a consequence of the fact that σU(t)2(Φ) = U(t) − U2(t). To simplify notation, let q ≡ q(1). Let K be the random variable denoting the number of times the Markov chain is in state 1 (including the first time) before hitting a state in F. Clearly, K has a geometric distribution with parameter (i.e., success probability) γ.

Lemma 1: Let Ū(t) = 1 − e−qγt. Then, for all ε and t, U(t) ≤ Ū(t).

Proof: Let Wtot be the total amount of time the CTMC spends in states other than 1 before hitting F. Let fWtot|k(·) be the density of Wtot given that K = k and let Erlang(q,k) denote an Erlang random variable with rate q and shape parameter k. For a real-valued a, let (a)+ = max(a,0). Then,

U(t) = Σk≥1 P(K = k) ∫0∞ P(Erlang(q,k) < (t − w)+) fWtot|k(w) dw ≤ Σk≥1 P(K = k) P(Erlang(q,k) < t) = 1 − e−qγt,

where the last equality uses the well-known fact that a geometric sum of i.i.d. exponential(q) random variables is exponentially distributed with rate qγ. █

Lemma 2: Let qmin = min{q(x) : x ∈ U∖1}. Then, for all ε and t,

U(t) ≥ cεr(1 − e−qt/k)(1 − e−qmin t/k)k−1,

where c > 0 and k are some constants.

Proof: From Shahabuddin [36], it can be shown that there exists a sequence of transitions that start from state 1 and reach a state in F without reentering state 1 such that the product of their probabilities is Ω(εr). All other such paths have probability of order O(εr). In highly reliable systems terminology (see, e.g., Gertsbakh [10]), this is one of the “most likely paths” to system failure. Let (x0,x1,x2,…,xk) be the sequence of states visited in one such path, where x0 = 1, xk ∈ F, and xi ∈ U∖1 for 1 ≤ i < k. Let Hi denote the holding time in state xi. Then,

U(t) ≥ P{the path (x0,x1,…,xk) is followed and H0 + H1 + ··· + Hk−1 < t} ≥ [Πi=1k P(xi−1,xi)] P{H0 < t/k, H1 < t/k, …, Hk−1 < t/k} ≥ cεr(1 − e−qt/k)(1 − e−qmin t/k)k−1,

where the second inequality follows from the fact that the event that each of the k holding times is less than t/k is contained in the event that their sum is less than t. █

In the same spirit as Eq. (1), we can write U(t) = E(I(TF < t)) = EΦ′(I(TF < t)LτF) because we terminate a replication once the failed state is reached. Hence, now we can use Φ′ and obtain samples of (I(TF < t),LτF). Consequently, if we use balanced failure biasing, the new variance is given by σU(t)2(Φ′) ≡ VarΦ′(I(TF < t)LτF) = EΦ′(I(TF < t)LτF2) − U2(t).

Lemma 3, which follows, gives an upper bound on EΦ′(I(TF < t)LτF2) which will enable us to evaluate the order of magnitude of σU(t)2(Φ′). Let DEΦ′′(I(TF < T1)Lτmin2) = E(I(TF < T1)Lτmin) and BEΦ′′(I(TF > T1)Lτmin2) = E(I(TF > T1)Lτmin), where Φ′′ has been defined in Section 2.3. The order of magnitude of D (resp., B − 1) is given in Theorem 1 (resp., Proposition 1).

The key to this result lies in the fact that we can decompose the likelihood ratio LτF into a product of likelihood ratios over individual cycles. For any sample path with K = k, we can write

LτF = L(1)L(2) ··· L(k),

where L(i) represents the likelihood ratio over the ith regenerative cycle in the sample path. Given K = k, the L(i) are mutually (conditionally) independent, with L(1),…,L(k−1) having the distribution of Lτmin given {τ1 < τF} and L(k) having the distribution of Lτmin given {τ1 > τF}.

Lemma 3: For all ε and t,

EΦ′(I(TF < t)LτF2) ≤ D(e(B−1)qt − 1)/(B − 1).

Proof: In the same spirit as Eq. (1),

EΦ′(I(TF < t)LτF2) = E(I(TF < t)LτF) ≤ E(I(Z1 + Z2 + ··· + ZK < t)LτF),     (4)

because TF ≥ Z1 + Z2 + ··· + ZK. Given K = k, (Z1 + Z2 + ··· + Zk) is (conditionally) independent of LτF. Therefore, we can write

E(I(Z1 + ··· + Zk < t)LτF | K = k) = P(Erlang(q,k) < t) E(LτF | K = k) = P(Erlang(q,k) < t)(B/(1 − γ))k−1(D/γ),

where the last equality follows from the cycle decomposition of LτF. Substituting this in Eq. (4), we obtain

EΦ′(I(TF < t)LτF2) ≤ Σk≥1 (1 − γ)k−1γ P(Erlang(q,k) < t)(B/(1 − γ))k−1(D/γ) = D Σk≥1 Bk−1 P(Erlang(q,k) < t).

Carrying out the necessary algebra, we get the result of the lemma. █

We will now describe the contribution of forcing to the reduction in variance. Recall that ΦForcing′ denotes the new measure on the sample paths of {X(t) : t ≥ 0} when with balanced failure biasing (i.e., Φ′) we also use forcing. Then, we have the following lemma.

Lemma 4: σU(t)2(ΦForcing′) = (1 − e−qt)EΦ′(I(TF < t)LτF2) − U2(t).

Proof: Let LForcing be the likelihood ratio incurred due to forcing. Note that LForcing = 1 − e−qt, a constant. Then,

σU(t)2(ΦForcing′) = EΦForcing′(I(TF < t)(LForcing LτF)2) − U2(t) = (1 − e−qt)2 EΦForcing′(I(TF < t)LτF2) − U2(t) = (1 − e−qt)EΦ′(I(TF < t)LτF2) − U2(t),

where the last equality holds because {TF < t} implies Z1 < t, so that EΦForcing′(I(TF < t)LτF2) = EΦ′(I(TF < t)LτF2)/(1 − e−qt). █

Proof of Theorem 2: From Lemma 1, we get that U(t) is O(εr+r0). Because q is Ω(εr0) and t = Ω(1), we have that (1 − e−qt/k) is Ω(εr0). Moreover, since qmin is Ω(1), we have that (1 − e−qmin t/k) is also Ω(1). It then follows from Lemma 2 that U(t) is O̲(εr+r0), and we get the first part of the theorem. The result for σU(t)2(Φ) follows since σU(t)2(Φ) = U(t) − U2(t).

From Theorem 1, we see that D = Ω(ε2r). Using Proposition 1 and the fact that ex = 1 + x + o(x), we get that e(B−1)qt − 1 is Ω(εr0). Therefore, by Lemma 3, we get that EΦ′(I(TF < t)LτF2) is O(ε2r+r0). Hence, σU(t)2(Φ′) = EΦ′(I(TF < t)LτF2) − U2(t) is O(ε2r+r0). Then, using Lemma 4 and the fact that (1 − e−qt) is O(εr0), we get that σU(t)2(ΦForcing′) is O(ε2(r+r0)). █

3.2. The Large-Time-Horizon Case

Consider the following generalization of Theorem 2, which gives the orders of magnitude of the variances of our estimators for large time horizons.

Theorem 3: Consider the case of large time horizons where t = Ω(ε−rt) with 0 ≤ rt ≤ r0. Then, both U(t) and σU(t)2(Φ) are Ω(εr+r0−rt) and σU(t)2(ΦForcing′) = O(ε2(r+r0−rt)).

The proof of this theorem follows from Lemmas 1–4 and Proposition 1 by substituting t = Ω(ε−rt) in all of the expressions involving t and using the same method as in the proof of Theorem 2.

We saw in the previous section that for small t, for the bounded RE property to hold, the variance under importance sampling had to be of the same order as the square of the unreliability. From Theorem 3, we see that for the case of large time horizons, we also get a variance of the same order as the square of the unreliability as long as rt ≤ r0.
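As a concrete illustration of forcing, the sketch below estimates P(X1 + X2 ≤ t) for a toy two-transition path, where X1 ~ Exp(q) is the rare first (failure) event and X2 ~ Exp(mu) a subsequent event: X1 is sampled conditioned on {X1 ≤ t} by inversion, and each sample is weighted by the likelihood ratio 1 − e^{−qt}. The rates q, mu and the horizon t are assumed toy values, not the paper's model.

```python
import math
import random

def forced_estimate(q, mu, t, n, rng):
    """Forcing: sample X1 from Exp(q) conditioned on {X1 <= t} and
    weight each sample by the likelihood ratio 1 - exp(-q*t)."""
    w = 1.0 - math.exp(-q * t)           # P(X1 <= t), the forcing weight
    total = 0.0
    for _ in range(n):
        u = rng.random()
        x1 = -math.log(1.0 - u * w) / q  # inverse CDF of X1 given X1 <= t
        x2 = rng.expovariate(mu)         # second event, sampled as usual
        if x1 + x2 <= t:
            total += w
    return total / n

rng = random.Random(0)
q, mu, t = 1e-3, 1.0, 1.0                # toy failure rate, O(1) rate, horizon
est = forced_estimate(q, mu, t, 20000, rng)
# Closed form for the sum of two independent exponentials, for comparison:
exact = 1.0 - (mu * math.exp(-q * t) - q * math.exp(-mu * t)) / (mu - q)
```

Naive simulation would see the event only about once every few thousand samples here; with forcing, every sample contributes a nonzero weight.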

Let

We can easily show that for 0 < rt ≤ r0, EΦ′(I(TF < t)LτF²)/V(t) → 1 as ε → 0. We conjecture that this holds even for r0 < rt < r + r0. If this conjecture is true, then we have an approximate expression for the variance for the case when ε is small. Then, it is easy to see why simulation using importance sampling becomes very inefficient for the case where rt > r0. This is also explained by the results in Glynn [11]: the variance of the likelihood ratio increases (roughly) exponentially fast in the number of transitions of the Markov chain. In experiments in Nicola, Nakayama, Heidelberger, and Goyal [30], it is shown that even for the case where rt = r0, one has to tune the value of the failure-biasing parameter θ by trial and error, which can be computationally expensive. Hence, for the case where rt ≥ r0, it is best to use the "bounding approach" mentioned in Section 1.

Motivated by this, we now investigate bounds on the unreliability. As mentioned in Section 1, because the system is regenerative, the time to failure is roughly a geometric sum of i.i.d. random variables. There is some literature on bounds on the distribution function of such random variables, which, in our case, corresponds to the unreliability. We investigate the bounds given in [4]. For completeness, we first present the bound in its original form. We then adapt it to the reliability model described in this article. Let Vi (generically denoted by V) be i.i.d. nonnegative random variables and let V′ be another nonnegative random variable independent of the Vi's. Let

T = V′ + V1 + ··· + V_{N0−1}, where N0 is a geometric random variable, independent of the Vi's and V′, with probability of "success" p (i.e., P(N0 = i) = (1 − p)^{i−1}p for i ≥ 1). It is well known in the literature (e.g., Keilson [24] and Solovyev [40]) that as p → 0, T converges in distribution to an exponentially distributed random variable with mean E(T). Here, E(T) = (1 − p)E(V)/p + E(V′). Theorem 2.2 of Brown [4] gives the following bound:

In our setting, a "success" in the geometric random variable definition corresponds to a system failure occurring in a regenerative cycle. More specifically, if we define N0 ≡ K, Vi ≡ Zi + W̄i for 0 ≤ i ≤ K − 1, V′ ≡ ZK + W̃K (recall that W̄i is the random variable Wi conditioned on no failure event occurring in that cycle and that W̃i is the random variable Wi conditioned on a failure event occurring in that cycle), and p ≡ γ, then T = TF. An important point to note is that in our setting, even though Wi is not independent of K, the W̄i, 0 ≤ i ≤ K − 1, and W̃K are independent of K. Also,

Define Ub(t) = 1 − e^{−t/E(TF)}. Let Ĩ ≡ I(TF < T1) (resp., Ī ≡ I(TF > T1)) and define

Then, using Eq. (5), we see that upper and lower bounds on U(t) = P(TF ≤ t) are given by Ub(t) + Uerr,b(t) and Ub(t) − Uerr,b(t), respectively. The bounds are in terms of regenerative-cycle-based measures.
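The exponential approximation at the heart of these bounds is easy to sanity-check by simulation. The sketch below builds T as a geometric sum, with Exp(1) summands standing in for V and V′ (an illustrative assumption, not the paper's cycle variables), and compares the empirical distribution with 1 − e^{−t/E(T)}:

```python
import math
import random

def geometric_sum_sample(p, rng):
    """Draw T = V' + V_1 + ... + V_{N0-1} with N0 ~ Geometric(p).
    V and V' are Exp(1) here purely for illustration; in the paper they
    are regenerative-cycle lengths."""
    n0 = 1
    while rng.random() >= p:            # P(N0 = i) = (1 - p)^(i-1) * p
        n0 += 1
    total = rng.expovariate(1.0)        # V'
    for _ in range(n0 - 1):
        total += rng.expovariate(1.0)   # V_i
    return total

rng = random.Random(1)
p, t, n = 0.01, 100.0, 20000
emp = sum(geometric_sum_sample(p, rng) <= t for _ in range(n)) / n
mean_T = (1 - p) * 1.0 / p + 1.0        # E(T) = (1 - p)E(V)/p + E(V')
approx = 1.0 - math.exp(-t / mean_T)    # exponential approximation of P(T <= t)
```

For small p, the empirical probability and the exponential approximation are close, which is exactly the regime (rare per-cycle failures) studied in this article.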

The quality of any upper and lower bound combination depends on how close they are to each other. Define the relative error of the bounds, REB (this should not be confused with the relative error, RE, in the simulation context, defined earlier), to be the difference between the upper bound and the lower bound divided by twice the measure of interest. (The "twice" is motivated by the fact that if we use the arithmetic mean of the upper and lower bounds as an approximation for our measure of interest, then the relative error between the approximation and the actual value is always less than the REB.) In this case, the REB is given by REBb(t) ≡ Uerr,b(t)/U(t), since the upper and lower bounds differ by 2Uerr,b(t).
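A quick numeric illustration of why dividing by "twice" makes the midpoint's relative error at most the REB; the bound values used are hypothetical:

```python
# Hypothetical lower bound, upper bound, and true value of the measure.
lb, ub, u = 0.9e-4, 1.3e-4, 1.0e-4
mid = (lb + ub) / 2.0                   # midpoint approximation
reb = (ub - lb) / (2.0 * u)             # relative error of the bounds (REB)
# Since lb <= u <= ub, |mid - u| <= (ub - lb)/2, so the midpoint's
# relative error never exceeds the REB:
assert abs(mid - u) / u <= reb
```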

Theorem 4: If rt > r0, then Uerr,b(t)/Ub(t) → 0 as ε → 0; if rt = r0, then Uerr,b(t)/Ub(t) is Ω(1); if rt < r0, then Uerr,b(t)/Ub(t) → ∞ as ε → 0. Hence, REBb(t) → 0, as ε → 0 if and only if rt > r0.

Proof: We use the representation given in Eq. (6). Assumptions (A.2) and (A.4′) imply that 1 − γ is Ω(1). Using the orders of γ, 1 − γ, E(W̄), E(W̃), E(W̄²), and E(Z) (see Sect. 2.2), we get that E(Z + W̄) = Ω(ε^{−r0}), E(Z + W̃) = Ω(ε^{−r0}), and E((Z + W̄)²) = Ω(ε^{−2r0}). Consequently, Uerr,b(t) = Ω(ε^r). Moreover, since E(TF) = Ω(ε^{−r−r0}), Ub(t) = Ω(ε^{r+r0−rt}) for rt < r + r0 and Ub(t) = Ω(1) for rt ≥ r + r0. The result for all three cases follows from these facts. The last part of the theorem follows from the fact that

Therefore, REBb(t) → 0 if and only if Uerr,b(t)/Ub(t) → 0. █

We also tried using the bounds in Kalashnikov [23], but there seems to be some error in the bounds. Moreover, as mentioned earlier, the bounds do not make use of the special structure of the regenerative cycles; that is, they do not make use of the fact that min{T1,TF} = Z + W, where Z is exponentially distributed and E(Z) ≫ E(W). The following theorem provides bounds that make use of this special structure.

Theorem 5:

(a) Let Ū(t) ≡ 1 − e^{−γqt}. Define

and let U̲(t) ≡ Ū(t) − Uerr(t), where

Then, U̲(t) ≤ U(t) ≤ Ū(t) for all ε and t.

(b) Let REB(t) denote the REB in this case. For rt > 0, REB(t) ≡ (Ū(t) − U̲(t))/2U(t) = Uerr(t)/2U(t) → 0 as ε → 0.

Remark 1: These bounds are in terms of regenerative-cycle-based measures.

Remark 2: These bounds converge for a much wider range of the time horizon than the bounds of Theorem 4.
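The centers of the two bound families can be compared directly from cycle-based quantities. The sketch below uses assumed toy values for γ, q, and the conditional means E(W̄), E(W̃); the Uerr correction terms, whose expressions are not reproduced here, are omitted:

```python
import math

# Assumed toy cycle-based quantities: per-cycle failure probability gamma,
# exponential rate q of Z, and conditional means of W.
gamma, q = 1e-3, 0.5
mean_W_bar, mean_W_tilde = 0.1, 0.1

mean_V = 1.0 / q + mean_W_bar                     # E(Z + W_bar), nonfailure cycle
mean_Vp = 1.0 / q + mean_W_tilde                  # E(Z + W_tilde), failure cycle
mean_TF = (1 - gamma) * mean_V / gamma + mean_Vp  # E(T_F) via the geometric sum

t = 1000.0
u_brown = 1.0 - math.exp(-t / mean_TF)   # center of the M.B.-style bounds
u_new = 1.0 - math.exp(-gamma * q * t)   # upper bound of Theorem 5(a)
```

Because E(Z) = 1/q dominates the W terms, γq ≈ 1/E(TF), so the two centers agree to first order; they differ only through the O(1) cycle-length corrections.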

4. EXPERIMENTAL RESULTS WITH A LARGE MARKOVIAN MODEL

We consider the example of a large computing system originally studied in Goyal et al. [15] and subsequently in many other articles. The system is depicted in Figure 1. It consists of two sets of processors with two processors per set, two sets of disk controllers with two controllers per set, and six clusters of disks with four disks per cluster. In each disk cluster, data are replicated so that one disk can fail without affecting the system. The failure rates of the processors, controllers, and disks are assumed to be

per hour, respectively. If a processor of a given set fails, it has a 0.01 probability of causing the operating processor of the other set to fail. Each unit in the system can fail in one of two failure modes, which occur with equal probability. The repair rates for all mode 1 and all mode 2 failures are 1 per hour and ½ per hour, respectively. The system is defined to be operational if all data are accessible to both processor types, which means that at least one processor in each set, one controller in each set, and three out of four disks in each of the six disk clusters are operational. We also assume that operational components continue to fail at the given rates while the system is failed.
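The operational condition just described can be written as a small predicate; this is an illustrative sketch, not code from SAVE:

```python
def operational(procs, ctrls, disks):
    """System-up condition: at least 1 processor in each of the 2 sets,
    at least 1 controller in each of the 2 sets, and at least 3 of 4
    disks in each of the 6 clusters.  Each argument lists the number of
    working units per group."""
    return (all(p >= 1 for p in procs)
            and all(c >= 1 for c in ctrls)
            and all(d >= 3 for d in disks))

# Fully working configuration: 2 processors/set, 2 controllers/set, 4 disks/cluster.
assert operational([2, 2], [2, 2], [4] * 6)
# Losing two disks in the same cluster fails the system.
assert not operational([2, 2], [2, 2], [2, 4, 4, 4, 4, 4])
```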

Block diagram of the computing system modeled.

To keep the state space within manageable limits (in order to facilitate comparison with approximate numerical results from SAVE), Goyal et al. [15] assumed that after each transition (whether failure or repair), the repairman picks a component at random from the set of failed components. In this way, the state variable does not have to include the order of components waiting in the repair queue. For the purpose of comparison, we first use the same repair discipline. Later, we also consider the same example with the first-come first-served (FCFS) discipline, for which the numerical methods implemented in SAVE cannot be used.

In order to see the effect of our simulation schemes, we estimate the unreliability for different values of the time horizon using different techniques. The results are presented in Table 1. The time horizon is given in the first column. This model has 1,265,625 states and is thus very difficult to solve by exact techniques. Goyal et al. [15] used SAVE to numerically compute approximations for values of the time horizon up to 1024, but gave no bounds on the approximation error. We used SAVE to complete these computations for the other time horizons. All of these are reproduced in column 2. The third column gives the estimate and the RE corresponding to 99% confidence intervals (CIs) if we use naive simulation. The fourth column gives the estimate and the RE using failure biasing and forcing. Each of the naive and importance-sampling simulation cases was simulated for 400,000 events. In the fifth and sixth columns, we estimate the bounds of Brown [4] mentioned earlier (henceforth referred to as the M.B. bounds). We do this by running one simulation of 400,000 events and estimating γ, E(W), E(WI(T1 < TF)), and E(W²I(T1 < TF)). We then use these estimates to compute the bounds for all t. These regenerative-cycle-based measures can be estimated using the dynamic importance sampling (DIS) approach with balanced failure biasing as described in Goyal et al. [15]. For building confidence intervals, we use the delta method (e.g., see Serfling [34, p. 124]) to establish a central limit theorem.
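To make the cycle-based estimation concrete, the following sketch applies balanced failure biasing to a hypothetical four-state embedded chain (state 0 = all up and regenerative, 1 and 2 = degraded, 3 = failed), not the computing system above, to estimate γ = P(failure in a cycle); the failure probability eps and the biasing value θ = 0.5 are illustrative only:

```python
import random

def bfb_cycle(eps, theta, rng):
    """One regenerative cycle under balanced failure biasing: in the
    degraded states the failure probability eps is inflated to theta,
    and the likelihood ratio L is accumulated along the path.
    Returns L if the cycle hits the failed state, else 0."""
    state, L = 1, 1.0                    # the cycle's first transition is 0 -> 1
    while True:
        if rng.random() < theta:         # biased failure transition
            L *= eps / theta
            state += 1
        else:                            # biased repair transition
            L *= (1 - eps) / (1 - theta)
            state -= 1
        if state == 3:
            return L                     # system failure before regeneration
        if state == 0:
            return 0.0                   # regeneration without failure

rng = random.Random(2)
eps, n = 0.01, 20000
gamma_hat = sum(bfb_cycle(eps, 0.5, rng) for _ in range(n)) / n
gamma_exact = eps**2 / (1 - eps * (1 - eps))   # first-passage calculation
```

With eps = 0.01, naive simulation would see a failure only about once per 10,000 cycles, whereas under the biased measure roughly a third of the cycles reach the failed state, each carrying a small likelihood-ratio weight.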

Estimates of the Unreliability, Brown's [4] Bounds, and the New Bounds

Next, we estimate the bounds that we developed. In this case, one has to first estimate the regenerative-cycle-based measures γ, E(W), and E(WI(T1 < TF)). The last two columns of Table 1 give the experimental results using these new bounds. We again use one simulation run of 400,000 events to estimate the bounds for all t. Compared to the M.B. bounds, it is simpler to build confidence intervals in this case, as there are fewer measures to be estimated and the expressions are less complicated.

In this example, E(T1) ≈ 125 and E(TF) ≈ 152,240. For time horizons that are significantly larger than 125, one should not expect balanced failure biasing and forcing to work well. One can observe in Table 1 that for time horizon 1024 and beyond, the estimates of the unreliability and/or their REs, using balanced failure biasing and forcing, become highly unstable. In fact, we suspect infinite variance, either in the estimation of the quantity or of its RE, in this range of the time horizon; that is, no matter how long we run the simulation, we will never get stability. The estimates of the M.B. bounds are "satisfactory" (we define this to be less than 10% REB) only for time horizon 4096 and beyond. However, for all time horizons where failure biasing and forcing do not work well (i.e., time horizon 1024 and beyond), the estimates of the new bounds are satisfactory. For very large values of t, estimates using the M.B. bounds do better than those using the new ones; hence, the M.B. bounds could be used for better accuracy in the higher time ranges.

To verify the robustness of our methods for simulating rare events, we consider a more "rare" case where all the failure rates and the failure propagation probabilities are reduced by a factor of 100. We also use the FCFS service discipline. In this case, E(T1) ≈ 12,500 and E(TF) ≈ 16.44 × 10^8. As earlier, a total of 400,000 events was simulated for each case. The results are presented in Table 2. In this case, importance sampling starts to become unstable from time horizon 10^5 and beyond, and we note the same relative trends in the bounds as we did earlier. In addition, the RE using naive simulation goes up about 10 times from the previous, less rare case; however, the relative errors in the simulation of the bounds are almost unchanged. All of the (estimated) REs in the estimates of the bounds, except for t = 10, ranged from 3.7% to 3.8%.

Estimates of the Unreliability and the Bounds for Another Model

In these experiments, we used balanced failure biasing to estimate the regenerative-cycle-based measures. As mentioned earlier, this is guaranteed to produce a bounded RE in the estimation of these measures. However, other failure-biasing schemes that are efficient in practice (e.g., the failure distance scheme of Carrasco [5]) can also be used in the estimation of these regenerative-cycle-based measures.

5. DISCUSSION AND OPEN PROBLEMS

In this article, we discussed the estimation of the unreliability in large Markovian models of highly reliable systems. We showed that for small time horizons, simulation using the importance-sampling techniques of failure biasing and forcing is provably effective. For large time horizons, simulation using the above techniques becomes difficult. In this case, the approach used is to first bound the unreliability in terms of regenerative-cycle-based measures and then estimate the regenerative-cycle-based measures using importance sampling. We explored bounds existing in the literature and developed some bounds of our own.

For the small-time-horizon case (rt = 0), this work complements the work in Heidelberger, Shahabuddin, and Nicola [18] and Nicola et al. [30], which deals with the estimation of transient measures and proving corresponding BRE results for non-Markovian reliability models. However, the frameworks used for simulation and thus importance sampling in the non-Markovian setting are very different from those used for Markovian models. In particular, a discrete-event-simulation approach and/or uniformization is used in [18] and [30], whereas the CTMC approach is recommended and used for large Markovian models (e.g., SAVE). As was the case in this article, the direct importance-sampling approaches described in [18] and [30] do not work when the time horizon is large.

For the large-time-horizon non-Markovian case, we can easily extend the bounding approach in this article to the case when failure times are exponentially distributed but repair times are generally distributed. This is because the regenerative property is still preserved, and one can use the techniques in Nicola et al. [30] and Nicola, Shahabuddin, Heidelberger, and Glynn [31] to estimate the regenerative-cycle-based measures efficiently. However, results on the convergence of the bounds to the actual measure, although intuitively apparent, are difficult to prove rigorously. The development of efficient large-time-horizon simulation techniques for models in which both the failure times and repair times are nonexponentially distributed is still an open problem because, then, the regenerative structure is lost.

It should be mentioned that even though we use balanced failure biasing for estimating the transient measures (for small time horizons) and the regenerative-cycle-based measures, one could also have used balanced failure transition distance biasing (Carrasco [5,6]) or the balanced likelihood ratio method (Alexopoulos and Shultes [1] and Shultes [39]). Balanced failure transition distance biasing uses structural information about the system to failure bias and, thus, tends to produce more accurate estimates. However, there is an implementation overhead that comes with using more information that is not present in the case of balanced failure biasing. Bounded relative error results for regenerative-cycle-based measures in the case of balanced failure transition distance biasing follow as straightforward extensions of the work in Shahabuddin [36] (see, e.g., Nakayama [27,29] and Nicola et al. [32]); we expect the same to be true for transient measures in the case of small time horizons. The balanced likelihood ratio method has empirically been shown to work better than balanced failure biasing on systems with larger redundancy (minimum number of component failures required for the system to fail) than those originally considered in SAVE (Blum et al. [2,3]). Bounded relative error results for estimating regenerative-cycle-based measures using the balanced likelihood ratio method, and an extension of the technique that uses structural information about the system are also described in Alexopoulos and Shultes [1] and Shultes [39].

Acknowledgments

The work of the first author was supported by National Science Foundation (NSF) CAREER Award DMI-96-24469 and NSF grant DMI-99-00117. The work of the second author was supported in part by NSF grant DMS-95-08709, NSF CAREER Award DMI-96-25291, and a 1998 IBM University Partnership Program Award. The authors would like to thank Chia-wei Hu for computational assistance.

APPENDIX

We first briefly review the notation and framework considered in Shahabuddin [36]. Let PR be the matrix constructed from P by removing the rows and columns corresponding to state 1 and the states in F. Since all states represented in PR have at least one ongoing repair transition, PR is a matrix in which the positive elements above the diagonal (i.e., probabilities of failure transitions) are of the form cε^d + o(ε^d), c > 0, d > 0, where c and d are generic representations of constants. The positive elements below the diagonal (i.e., probabilities of repair transitions) are of the form c + o(1), c > 0. For x ∈ U, let pF(x) = Σ_{y∈F} P(x,y), p1(x) = P(x,1), and p0(x) = P(1,x). Let the vectors pF, p1, and p0 be given by pF = {pF(x) : x ∈ U − {1}}, p1 = {p1(x) : x ∈ U − {1}}, and p0 = {p0(x) : x ∈ U − {1}}, respectively. All vectors are assumed to be column vectors. The positive elements of p1 are of the form c + o(1), c > 0, the positive elements of pF are o(1), and the positive elements of p0 are O(1). From Assumption (A.4), we have that at least one element of p0 is positive. Let vγ(x) = P(TF < T1 | Y0 = x) and vγ = {vγ(x) : x ∈ U − {1}}. By Assumption (A.3), all elements of vγ are positive. By Assumption (A.1), starting from any state x ∈ U − {1}, the Markov chain hits {1} ∪ F with probability 1. Consequently, for x ∈ U, P(T1 < TF | Y0 = x) = 1 − vγ(x). By Assumption (A.2), we have that all elements of e − vγ are positive, where e is a vector of 1's.

By a matrix or a vector being o(ε^d) (resp., O(ε^d), Θ(ε^d), Ω(ε^d)), d ≥ 0, we mean that all elements of that matrix or vector are o(ε^d) (resp., O(ε^d), Θ(ε^d), Ω(ε^d)). It has been shown in Shahabuddin [36] (see the proof of Lemma 3 in that article) that there exists a nonnegative integer constant N0 such that

Therefore,

It was shown in Shahabuddin [36] that

. Similarly, one can show that

. From Eq. (A.2) and the fact mentioned earlier that pF is o(1), we get that vγ is o(1).
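As a toy illustration of this first-passage structure, vγ can be computed by solving the linear system vγ = PR vγ + pF. The two-state chain below is assumed purely for illustration, with one O(ε) failure probability per degraded state and O(1) repair probabilities, mirroring the orders of magnitude described above:

```python
# Toy P_R over two up states (regeneration state and failed set F removed):
# failure probabilities are O(eps), repair probabilities are O(1).
eps = 0.01
p_fail_12 = eps          # up-state 1 -> up-state 2 (a further failure)
p_repair_21 = 1.0 - eps  # up-state 2 -> up-state 1 (a repair)
p_F = [0.0, eps]         # one-step probabilities of entering F

# v_gamma(x) = P(T_F < T_1 | Y_0 = x) solves v = P_R v + p_F; for this
# 2x2 system, substitute the second equation into the first:
v1 = (p_fail_12 * p_F[1] + p_F[0]) / (1.0 - p_fail_12 * p_repair_21)
v2 = p_repair_21 * v1 + p_F[1]
v_gamma = [v1, v2]       # both entries are o(1), as claimed above
```

Here v1 ≈ ε², reflecting that two "rare" failure transitions are needed before a repair intervenes.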

Let

be a matrix constructed out of P by removing the row and column corresponding to state 1.

has the same structure as PR in the sense that the positive elements above the diagonal are of the form cεd + od), c > 0, d > 0, and the positive elements below the diagonal are of the form c + o(1), c > 0. Thus,

for exactly the same reason as Eq. (A.2).

Lemma A.1: E(T1) = Ω(ε^{−r0}).

Proof: Let

(recall that h(x) ≡ 1/q(x)),

, and

. Then, we have that

, from which we get

. From Assumption (A.2) and the fact that the repair rates are Ω(1), we get that

is Ω(1), and so from Eq. (A.3),

is O(1). In addition, p0 is O(1), so

. █

Lemma A.2: E(W̄) and E(W̃) are O(1).

Proof: For x ∈ U, let w̃(x) = E(min(T1,TF)I(TF < T1) | Y0 = x) and w̄(x) = E(min(T1,TF)I(T1 < TF) | Y0 = x). Let w̃ = {w̃(x) : x ∈ U − {1}} and w̄ = {w̄(x) : x ∈ U − {1}}. Let h = {h(x) : x ∈ U − {1}}. Note that

(the

denotes the scalar product), from which we get

. Then, using the fact from Section 2.1 that 1 − γ > 0, we get that

In a similar fashion, we get that

and using the fact from Section 2.1 that γ > 0, we get that

By Assumption (A.4) and the fact that all elements of vγ are positive, p0^T vγ > 0. We will now show that w̄ (resp., w̃) and e − vγ (resp., vγ) are of the same order. From Assumption (A.2) and the fact that the repair rates are Ω(1), we get that h is Ω(1). Consequently, if an element of e − vγ (resp., vγ) is Ω(ε^d) for some d ≥ 0, then the corresponding element of

(resp.,

) is also Ω(ε^d) for the same d. Because

, the corresponding element of w̄ (resp., w̃) is similarly O(ε^d) for the same d. Using this fact in Eq. (A.4) (resp., Eq. (A.5)), we see that E(W̄) (resp., E(W̃)) is O(1). █

Lemma A.3: E(W̄²) is O(1).

Proof: We use notation from the proof of Lemma A.2. Furthermore, let H(x) denote the holding-time random variable in state x. Clearly, H(x) is exponentially distributed with rate q(x). Define l(x) = E(H²(x)) = 2/q²(x), l = {l(x) : x ∈ U − {1}}, ū(x) = E((min(T1,TF))² × I(T1 < TF) | Y0 = x), and ū = {ū(x) : x ∈ U − {1}}. Then, one can easily derive the recursive equation that

from which one gets that

. In that case,

Using the fact that vγ is o(1), we get that the elements of e − vγ are Ω(1). In addition, from what was stated in the proof of Lemma A.2, w̄ is O(1). The l and h are Ω(1),

is O(1), and, therefore, ū is O(1). Then, the proof follows from Eq. (A.6). █

Proof of Proposition 1: For any matrix (vector), say PR, with elements of the form c + o(1), c ≥ 0, let

be the matrix containing the “c part” of the elements. Using the fact mentioned earlier that vγ is o(1), we get that

Using Eq. (A.1) and the fact that the positive elements of p1 are of the form c + o(1), c > 0, we have that

Comparing Eq. (A.8) with Eq. (A.7), we see that

As in Shahabuddin [36], we construct the matrix B = {B(x,y) : x, y ∈ S} by setting B(x,y) = P²(x,y)/P′(x,y) for all x, y such that P′(x,y) ≠ 0, and B(x,y) = 0 otherwise. The BR, bF, b1, and b0 are constructed from B in the same way as PR, pF, p1, and p0 were constructed from P. Note that BR is a matrix in which the positive elements above the diagonal are of the form cε^d + o(ε^d), c > 0, d ≥ 2, and the positive elements below the diagonal are of the form c + o(1), c > 0. It has been shown in Shahabuddin [36] that for the same N0 as earlier,

. Thus,

is O(1). It has also been shown in Shahabuddin [36] that

. Using a similar method, we can show that

. The positive elements of b1 are of the form c + o(1), where c > 0. Hence,

Equation (A.10) follows from the form of the elements of P′ and the fact that in the Markov chain corresponding to P, the sum of all the repair transition probabilities from any state other than 1 is 1 − o(1). The last equality follows from Eq. (A.9). From Eq. (A.11), we get that

where r(1) is the number of positive probability failure transitions from 1. The last equality follows from the fact that P′(1,x) = 1/r(1).

The elements of P(1,x) can be represented in the form c + o(1), c ≥ 0. Let

be the set of x for which P(1,x) is of the form c + o(1), c > 0. From Assumption (A.4′), we see that

. Since

, we have that

The last inequality follows because the minimum of the function

subject to the constraints

, ∀i, is 1. Therefore,

so B − 1 = O(1). Using the fact that

, b0, and b1 are O(1), we get that

is O(1). Hence, B − 1 = Ω(1). █

Proof of Theorem 5:

(a) The U(t) ≤ Ū(t) bound can be proved by exactly the same method as Lemma 1. We will now prove the lower bound.

Let f(q,k)(x) be the density of an Erlang random variable with rate q and shape parameter k (i.e., the density of Z1 + Z2 + ··· + ZK given that K = k). Then,

We now have to prove that the second term above is bounded above by Uerr(t). Given K = k, the Wi's are (conditionally) independent of the Zi's. Also, given K = k, W1, W2, …, Wk−1 are i.i.d. and have the distribution of W̄. Moreover, given K = k, Wk is independent of W1, W2, …, Wk−1 and has the distribution of W̃. Thus, we can rewrite the second term and bound it as follows:

(b) We will first show that Uerr(t)/Ū(t) → 0 as ε → 0. The result then follows from the fact that

We use the representation of Uerr(t) given by Eq. (A.12).

First, let us study the properties of l. If rt < r0 (resp., rt > r0), then for all sufficiently small ε, t < 1/q (resp., t > 1/q), which implies that

(resp.,

). Thus, l is Ω(ε^{−rt/2}) (resp., l is Ω(ε^{−(rt+r0)/2})). If rt = r0, then

, but in either case, l = Ω(ε^{−rt/2}). Since rt > 0 and, for rt > r0, (rt − r0)/2 > 0, we have that 1/l is o(1). Since r0 > 0 and rt > 0, we have that l/t = o(1). It then follows that t − l is Ω(ε^{−rt}).

Consider the case where rt < r + r0, so that ε^{r+r0−rt} → 0 as ε → 0. Using the well-known fact that 1 − e^{−x} = x + o(x), we get that Ū(t) is Ω(ε^{r+r0−rt}). Hence, all we have to show is that Uerr(t) = o(ε^{r+r0−rt}). Since t − l is Ω(ε^{−rt}), (1 − e^{−γq(t−l)}) = γq(t − l) + o(ε^{r+r0−rt}) is Ω(ε^{r+r0−rt}). Lemma A.2 shows that both E(W̄) and E(W̃) are O(1). Since 1/l is o(1), the first term in Eq. (A.12) is o(ε^{r+r0−rt}). Using the well-known fact that 1 − e^{−x} − xe^{−x} is x²/2 + o(x²) and the fact that t − l is Ω(ε^{−rt}), one can easily show that (1 − e^{−γq(t−l)} − γq(t − l)e^{−γq(t−l)}) is γq(t − l) × Ω(ε^{r+r0−rt}). Considering separately the two cases of 0 < rt ≤ r0 and r0 < rt < r + r0, one can show that q(t − l)/l is o(1) and, therefore, the second term in Eq. (A.12) is o(ε^{r+r0−rt}). Finally,

Using the fact mentioned earlier that l/t is o(1), we can show that the last part of Eq. (A.12) is o(ε^{r+r0−rt}).

Now, consider the case where rt ≥ r + r0. In this case, Ū(t) is Ω(1). Thus, all we have to show is that Uerr(t) is o(1). For rt = r + r0, 1 − e^{−γq(t−l)} and 1 − e^{−γq(t−l)} − γq(t − l)e^{−γq(t−l)} are Ω(1). For rt > r + r0, one can show the same by using the fact that 1 − e^{−1/x} and 1 − e^{−1/x} − (1/x)e^{−1/x} approach 1 as x → 0. Using the fact that 1/l is o(1), one can show that the first term of Eq. (A.12) is o(1). Note that rt ≥ r + r0 implies that rt > r0. Consequently,

is Ω(ε^{(rt−r−r0)/2}) = o(1). Using this fact, one can show that the second term of Eq. (A.12) is o(1). Express the last term of Eq. (A.12) as

. For rt > r + r0, it is easy to see that as ε → 0, this term tends to 0 and, hence, it is o(1). For rt = r + r0, e^{−γqt} is Ω(1) and

is o(1), and so the last term is also o(1). █

REFERENCES

Alexopoulos, C. & Shultes, B.C. (2001). Estimating reliability measures of highly-dependable Markovian systems, using balanced likelihood ratios. IEEE Transactions on Reliability 50: 265–280.
Blum, A., Goyal, A., Heidelberger, P., Lavenberg, S.S., Nakayama, M.K., & Shahabuddin, P. (1994). Modeling and analysis of system dependability using the System Availability Estimator. In M. Malek, D.S. Fussel, A. Dahbura, & T. Nanya (eds.), Digest of papers: The twenty-fourth annual international symposium on fault-tolerant computing. New York: IEEE Press, pp. 137–141.
Blum, A., Heidelberger, P., Lavenberg, S.S., Nakayama, M.K., & Shahabuddin, P. (1993). System Availability Estimator (SAVE) language reference and user's manual version 4.0. Research Report RA 219 S, IBM T.J. Watson Research Center, Yorktown Heights, NY.
Brown, M. (1990). Error bounds for exponential approximations of geometric convolutions. The Annals of Probability 18(3): 1388–1402.
Carrasco, J.A. (1991). Efficient transient simulation of failure/repair Markovian models. In V.K. Agarwal & E. Cerny (eds.), Proceedings of the tenth symposium on reliable distributed systems. New York: IEEE Press, pp. 152–161.
Carrasco, J.A. (1992). Failure distance-based simulation of repairable fault-tolerant systems. In G. Balbo & G. Serazzi (eds.), Computer performance evaluation: Modelling techniques and tools. Amsterdam: Elsevier, pp. 351–366.
Conway, A.E. & Goyal, A. (1987). Monte Carlo simulation of computer system availability/reliability models. In J.P. Shen, D.P. Siewiorek, F. Cristian, & J. Goldberg (eds.), Proceedings of the seventeenth symposium of fault-tolerant computing. New York: IEEE Press, pp. 230–235.
Crane, M.A. & Iglehart, D.L. (1975). Simulating stable stochastic systems, III: Regenerative processes and discrete event simulations. Operations Research 23(1): 33–45.
Geist, R.M. & Smotherman, M. (1989). Ultrahigh reliability estimates through simulation. In Proceedings of the annual reliability and maintainability symposium. New York: IEEE Press, pp. 350–355.
Gertsbakh, I.B. (1984). Asymptotic methods in reliability theory: A review. Advances in Applied Probability 16: 147–175.
Glynn, P.W. (1994). Importance sampling for Markov chains: Asymptotics for the variance. Stochastic Models 10: 701–717.
Glynn, P.W. & Iglehart, D.L. (1989). Importance sampling for stochastic simulations. Management Science 35(11): 1367–1393.
Goyal, A., Carter, W.C., de Souza e Silva, E., Lavenberg, S.S., & Trivedi, K.S. (1986). The System Availability Estimator. In H. Kopetz & M. Dal Cin (eds.), Proceedings of the sixteenth symposium on fault tolerant computing. New York: IEEE, pp. 84–89.
Goyal, A. & Lavenberg, S.S. (1987). Modelling and analysis of computer system availability. IBM Journal of Research and Development 31(6): 651–664.
Goyal, A., Shahabuddin, P., Heidelberger, P., Nicola, V.F., & Glynn, P.W. (1992). A unified framework for simulating Markovian models of highly dependable systems. IEEE Transactions on Computers C-41(1): 36–51.
Gross, D. & Miller, D.R. (1984). The randomization technique as a modeling tool and solution procedure for transient Markov processes. Operations Research 32: 343–361.
Heidelberger, P. (1995). Fast simulation of rare events in queueing and reliability models. ACM Transactions on Modeling and Computer Simulation 5(1): 43–85.
Heidelberger, P., Shahabuddin, P., & Nicola, V.F. (1994). Bounded relative error in estimating transient measures of highly dependable Markovian systems. ACM Transactions on Modeling and Computer Simulation 4(2): 137–164.
Jensen, A. (1953). Markoff chains as an aid in the study of Markoff processes. Skandinavisk Aktuarietidskrift 36: 87–91.
Juneja, S. & Shahabuddin, P. (1992). Fast simulation of Markovian reliability/availability models with general repair policies. In D. Pradhan, J. Stiffler, J. Lala, & I. Koren (eds.), Proceedings of the twenty-second annual international symposium on fault tolerant computing. New York: IEEE Computer Society Press, pp. 150–159.
Juneja, S. & Shahabuddin, P. (2001). Efficient simulation of Markov chains with small transition probabilities. Management Science 47: 547–562.
Juneja, S. & Shahabuddin, P. (2001). A splitting-based importance-sampling algorithm for fast simulation of Markov reliability models with general repair policies. IEEE Transactions on Reliability 50: 235–245.
Kalashnikov, V.V. (1989). Analytical and simulation estimates of reliability for regenerative models. Systems Analysis Modelling Simulation 6: 833–851.
Keilson, J. (1979). Markov chain models—rarity and exponentiality. New York: Springer-Verlag.
Lewis, E.E. & Böhm, F. (1984). Monte Carlo simulation of Markov unreliability models. Nuclear Engineering and Design 77: 49–62.
Muntz, R.R., de Souza e Silva, E., & Goyal, A. (1989). Bounding availability of repairable computer systems. IEEE Transactions on Computers 38: 1714–1723.
Nakayama, M.K. (1994). A characterization of the simple failure biasing method for simulations of highly reliable Markovian systems. ACM Transactions on Modeling and Computer Simulation 4(1): 52–88.
Nakayama, M.K. (1994). Fast simulation methods for highly dependable systems. In D.A. Sadowski, A.F. Seila, J.D. Tew, & S. Manivannan (eds.), Proceedings of the 1994 winter simulation conference. New York: IEEE Press, pp. 221–228.
Nakayama, M.K. (1996). General conditions for bounded relative error in simulation of highly reliable Markovian systems. Advances in Applied Probability 28: 687–727.
Nicola, V.F., Nakayama, M.K., Heidelberger, P., & Goyal, A. (1993). Fast simulation of highly dependable systems with general failure and repair processes. IEEE Transactions on Computers 42(8): 1440–1452.
Nicola, V.F., Shahabuddin, P., Heidelberger, P., & Glynn, P.W. (1993). Fast simulation of steady-state availability in non-Markovian highly dependable systems. In J.C. Laprie & A. Goyal (eds.), Proceedings of the 23rd annual international symposium on fault tolerant computing. New York: IEEE Computer Society Press, pp. 491–498.
Nicola, V.F., Shahabuddin, P., & Nakayama, M.K. (2001). Techniques for fast simulation of models of highly dependable systems. IEEE Transactions on Reliability 50: 246–264.
Obal, W.D., II & Sanders, W.H. (1994). Importance sampling simulation in ULTRA-SAN. Simulation 62: 98–111.
Serfling, R.J. (1980). Approximation theorems in mathematical statistics. New York: Wiley.
Shahabuddin, P. (1994). Fast transient simulation of Markovian models of highly dependable systems. Performance Evaluation 20(1–3): 267–286 (Special Issue: Performance '93—16th IFIP Working Group 7.3 International Symposium on Computer Performance Modeling, Measurement and Evaluation).
Shahabuddin, P. (1994). Importance sampling for the simulation of highly reliable Markovian systems. Management Science 40(3): 333–352.
Shahabuddin, P. (2001). Rare event simulation in stochastic models. In B.R. Haverkort, R. Marie, K. Trivedi, & G. Rubino (eds.), Performability modeling techniques and tools. New York: Wiley, pp. 163–178.
Shahabuddin, P. & Nakayama, M.K. (1993). Estimation of unreliability and its derivatives for large time horizons in Markovian systems. In E.C. Russell, W.E. Biles, G.W. Evans, & M. Mollaghasemi (eds.), Proceedings of the 1993 winter simulation conference. New York: IEEE Press, pp. 422–429.
Shultes, B.C. (1997). Regenerative techniques for estimating performance measures of highly dependable systems with repair. Ph.D. thesis, Georgia Institute of Technology, Atlanta.
Solovyev, A.D. (1971). Asymptotic behavior of the time of first occurrence of a rare event. Engineering Cybernetics 9: 1038–1048.
Strickland, S.G. (1995). Optimal importance sampling for quick simulation of highly reliable systems. In E.C. Russell, W.E. Biles, G.W. Evans, & M. Mollaghasemi (eds.), Proceedings of the 1993 winter simulation conference. New York: IEEE Press, pp. 437–444.
Young, D. (1971). Iterative solution of large linear systems. New York: Academic Press.