1. INTRODUCTION
We consider semi-Markov decision processes (SMDPs) with finite state and action spaces. Let $R_s$ be the reward function at time $s$; $R_s$ can be an impulse function corresponding to the reward earned immediately at a transition epoch, or it can be a step function between transition epochs corresponding to the rate of reward. The great majority of the literature in this area is concerned with finding a policy $\mathbf{u}$ that maximizes

$$\phi_{1}(\mathbf{u}) \triangleq \liminf_{t \to \infty} \frac{1}{t}\, E_{\mathbf{u}}\!\left[\int_{0}^{t} R_{s}\,ds\right]. \tag{1}$$

$\phi_1$ denotes the average expected reward [10, 11, 20, 22, 27, 35, 37]. The following alternative to $\phi_1$ is given by Jewell [23], Ross [30, 31], and Mine and Osaki [28] as

$$\phi_{2}(\mathbf{u}) \triangleq \liminf_{n \to \infty} \frac{E_{\mathbf{u}}\!\left[\sum_{m=1}^{n} R_{m}\right]}{E_{\mathbf{u}}[T_{n}]}, \tag{2}$$

where $R_m$ denotes the reward earned between the $(m-1)$st and the $m$th epochs and $T_m$ denotes the $m$th transition time. The performance measure $\phi_2$ is also used by other researchers (see, e.g., [6, 15–17, 21, 22, 29]). In [18], $\phi_2$ is referred to as the ratio-average reward. A sufficient condition for these two definitions to coincide under stationary policies is that every stationary policy generate a semi-Markov chain with only one irreducible class [28, 30].

Although $\phi_1$ is clearly the more appealing criterion, it is easier to write the optimality equations when establishing the existence of an optimal pure policy under criterion $\phi_2$ [34, 35, 39]. On the other hand, for finite-state and finite-action SMDPs there exists an optimal pure policy under $\phi_1$ [11, 34, 39], whereas such an optimal policy might not exist under $\phi_2$ in a general multichain SMDP [24].

Even though there is considerable research on nonstandard criteria for average reward Markov decision processes (MDPs), the same cannot be claimed for average reward SMDPs. A variance-type objective function for discrete-time MDPs has been studied (see, e.g., [3, 7, 19, 38]). Constraints have been introduced for average reward MDPs (see, e.g., [1, 4, 13, 14, 25, 32, 33, 37]). For average reward SMDPs, only the constrained problem has been investigated [5, 16, 18]. Beutler and Ross [5, 6] considered the ratio-average reward with a constraint under a condition stronger than the unichain condition. In [18], Feinberg examined the problem of maximizing both $\phi_1$ and $\phi_2$ subject to a number of constraints. Under the condition that the initial distribution is fixed, he showed that for both criteria there exist optimal mixed stationary policies when an associated linear program (LP) is feasible. Mixed stationary policies are policies with an initial one-step randomization applied to a set of pure policies; obviously, such a policy is not stationary. Feinberg provided a linear programming algorithm for the unichain SMDP under both criteria. However, there remains a need for an efficient algorithm that locates an optimal or ε-optimal stationary policy for communicating and multichain SMDPs under $\phi_1$ with constraints.
In this article we study the following criterion:
$$\psi(\mathbf{u}) \triangleq E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds\right] \tag{3}$$
subject to the sample path constraint
$$P_{\mathbf{u}}\!\left\{\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds \leq \alpha\right\} = 1, \tag{4}$$

where $C_s$ denotes the cost function at time $s$. Fatou's lemma immediately implies that $\psi(\mathbf{u}) \leq \phi_1(\mathbf{u})$ holds for all policies. We prove, however, that for a large class of policies the two rewards are equal. We show that an ε-optimal randomized stationary policy can be obtained for the general SMDP, whereas such a policy might not exist for the expectation problem.

We also consider the problem of locating a policy that maximizes, over all policies $\mathbf{u}$, the following expected average reward:

$$\nu(\mathbf{u}) \triangleq E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds\right], \tag{5}$$

where $h(\cdot,\cdot)$ is a function of the current reward at time $s$ and the average reward over an interval that includes time $s$. Throughout, we assume that $h(\cdot,\cdot)$ is a continuous function. We will refer to $\nu(\mathbf{u})$ as the expected time-average variability. We show that an ε-optimal stationary policy can be obtained for the general SMDP. If $h(x, y) = x - \lambda(x - y)^2$, then the optimal policy is a pure policy. Note that, in this case, maximizing $\nu(\mathbf{u})$ corresponds to maximizing the expected average reward penalized by the expected average variability.

This article is organized as follows. In Section 2 we introduce the notation. In Section 3 we present preliminary results, which will be used in the subsequent sections, and summarize known facts about the decomposition and sample-path theory. In Section 4, the mathematical programs that will be utilized are constructed and upper bounds for the expected average reward and the expected variability are established. Communicating SMDPs are investigated in Section 5, and it is shown that there exist ε-optimal stationary policies for both criteria. Multichain SMDPs are considered in Section 6, an intermediate problem is introduced, and the algorithm to locate ε-optimal stationary policies is given. Finally, we conclude in Section 7 with a brief discussion of the sample-path problem with multiple constraints.
2. NOTATION
Denote $\{X_m,\ m \geq 0\}$ for the state process, which takes values in a finite state space $\mathcal{S}$. At each epoch $m$, the decision-maker chooses an action $A_m$ from the finite action space $\mathcal{A}$. The sojourn time between the $(m-1)$st and the $m$th epochs is a random variable denoted by $\Upsilon_m$. The underlying sample space $\Omega = \{\mathcal{S} \times \mathcal{A} \times (0, \infty)\}^{\infty}$ consists of all possible realizations of states, actions, and transition times. Throughout, the sample space will be equipped with the σ-algebra generated by the random variables $\{X_m, A_m, \Upsilon_{m+1};\ m \geq 0\}$. Denote $P_{xay}$, $x \in \mathcal{S}$, $a \in \mathcal{A}$, $y \in \mathcal{S}$, for the law of motion of the process; that is, for all policies $\mathbf{u}$ and all epochs $m$,

$$P_{\mathbf{u}}\{X_{m+1} = y \mid X_{0}, A_{0}, \ldots, X_{m} = x, A_{m} = a\} = P_{xay}.$$

Also, conditioned on the event that the next state is $y$, $\Upsilon_{m+1}$ has distribution function $F_{xay}(\cdot)$; that is,

$$P_{\mathbf{u}}\{\Upsilon_{m+1} \leq t \mid X_{0}, A_{0}, \Upsilon_{1}, \ldots, X_{m} = x, A_{m} = a, X_{m+1} = y\} = F_{xay}(t).$$

Assume that $F_{xay}(0) < 1$.
The process $\{S_t, B_t : t \geq 0\}$, where $S_t$ is the state of the process at time $t$ and $B_t$ is the action taken at time $t$, is referred to as the SMDP. Let $T_n = \sum_{m=1}^{n} \Upsilon_m$. For $t \in [T_m, T_{m+1})$, clearly

$$S_{t} = X_{m}, \qquad B_{t} = A_{m}.$$

A policy is called stationary if the decision rule at each epoch depends only on the present state of the process; denote $f_{xa}$ for the probability of choosing action $a$ when in state $x$. A stationary policy is said to be pure if for each $x \in \mathcal{S}$ there is only one action $a \in \mathcal{A}$ such that $f_{xa} = 1$. Let $U$, $F$, and $G$ denote the set of all policies, stationary policies, and pure policies, respectively.

Under a stationary policy $\mathbf{f}$, the state process $\{S_t : t \geq 0\}$ is a semi-Markov process, and the process $\{X_m : m \in \mathcal{N}\}$ is the embedded Markov chain with transition probabilities

$$P_{xy}(\mathbf{f}) = \sum_{a \in \mathcal{A}} P_{xay}\, f_{xa}.$$

Clearly, the process $\{S_t, B_t : t \geq 0\}$ is also a semi-Markov process under a stationary policy $\mathbf{f}$, with the embedded Markov chain $\{X_m, A_m : m \in \mathcal{N}\}$.

Under a stationary policy $\mathbf{f}$, state $x$ is recurrent if and only if $x$ is recurrent in the embedded Markov chain; similarly, $x$ is transient if and only if $x$ is transient for the embedded Markov chain. An SMDP is said to be unichain if the embedded Markov chain for each pure policy is unichain [i.e., if the transition matrix $P(\mathbf{g})$ has at most one recurrent class plus a (perhaps empty) set of transient states for all pure policies $\mathbf{g}$]. Similarly, an SMDP is said to be communicating if $P(\mathbf{f})$ is irreducible for all stationary policies that satisfy $f_{xa} > 0$ for all $x \in \mathcal{S}$, $a \in \mathcal{A}$.
Let $\tau(x, a)$ denote the expected sojourn time,

$$\begin{aligned}
\tau(x, a) &\triangleq E_{\mathbf{u}}[\Upsilon_{m} \mid X_{m-1} = x, A_{m-1} = a] \\
&= \int_{0}^{\infty} \sum_{y \in \mathcal{S}} P_{\mathbf{u}}\{X_{m} = y,\ \Upsilon_{m} > t \mid X_{m-1} = x, A_{m-1} = a\}\,dt \\
&= \int_{0}^{\infty} \left[1 - \sum_{y \in \mathcal{S}} P_{xay}\, F_{xay}(t)\right] dt.
\end{aligned}$$
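In practice, the last integral can be evaluated numerically once the primitives of the model are specified. The following is a minimal sketch, assuming the law of motion is stored in a hypothetical array `P[x, a, y]` and the conditional sojourn-time distributions are available as a hypothetical callable `F(x, a, y, t)`; the integral is truncated at a user-chosen horizon.

```python
import numpy as np

def expected_sojourn_times(P, F, t_max=200.0, n_grid=4000):
    """tau(x, a) = int_0^inf [1 - sum_y P_xay F_xay(t)] dt, truncated at t_max.

    P : (S, A, S) array with the law of motion P_xay.
    F : callable F(x, a, y, t) returning the conditional sojourn-time cdf.
    Choose t_max large enough that the sojourn time exceeds it with
    negligible probability.
    """
    S, A, _ = P.shape
    grid = np.linspace(0.0, t_max, n_grid)
    tau = np.zeros((S, A))
    for x in range(S):
        for a in range(A):
            # mixture cdf of the sojourn time out of (x, a)
            cdf = np.zeros_like(grid)
            for y in range(S):
                if P[x, a, y] > 0.0:
                    cdf += P[x, a, y] * np.array([F(x, a, y, t) for t in grid])
            tau[x, a] = np.trapz(1.0 - cdf, grid)  # integrate the survival function
    return tau
```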
Let $W_t(x, a)$ denote the random variables representing the state-action intensities,

$$W_{t}(x, a) \triangleq \frac{1}{t} \int_{0}^{t} \mathbf{1}\{(S_{s}, B_{s}) = (x, a)\}\,ds,$$

where $\mathbf{1}\{\cdot\}$ denotes the indicator function. Let $U_0$ denote the class of all policies $\mathbf{u}$ such that $\{W_t(x, a);\ t \geq 0\}$ converges $P_{\mathbf{u}}$-almost surely ($P_{\mathbf{u}}$-a.s.) for all $x \in \mathcal{S}$ and $a \in \mathcal{A}$. Thus, for $\mathbf{u} \in U_0$, there exist random variables $\{W(x, a)\}$ such that

$$\lim_{t \to \infty} W_{t}(x, a) = W(x, a),$$

$P_{\mathbf{u}}$-a.s. for all $x$ and $a$. Let $U_1$ be the class of all policies $\mathbf{u}$ such that the expected state-action intensities $\{E_{\mathbf{u}}[W_t(x, a)];\ t \geq 0\}$ converge for all $x$ and $a$. For $\mathbf{u} \in U_1$, denote

$$w_{\mathbf{u}}(x, a) = \lim_{t \to \infty} E_{\mathbf{u}}[W_{t}(x, a)].$$

By Lebesgue's dominated convergence theorem, $U_0 \subseteq U_1$.
A well-known result from renewal theory (see Çinlar [9]) is that if $\{Y_t = (S_t, B_t) : t \geq 0\}$ is a homogeneous semi-Markov process and the embedded Markov chain is unichain, then the proportion of time spent in state $y$, that is,

$$\lim_{t \to \infty} \frac{1}{t} \int_{0}^{t} \mathbf{1}\{Y_{s} = y\}\,ds,$$

exists almost surely. Since under a stationary policy $\mathbf{f}$ the process $\{Y_t = (S_t, B_t) : t \geq 0\}$ is a homogeneous semi-Markov process, if the embedded Markov chain is unichain, then the limit of $W_t(x, a)$ as $t$ goes to infinity exists and the proportion of time spent in state $x$ when action $a$ is applied is given as

$$W(x, a) = \lim_{t \to \infty} W_{t}(x, a) = \frac{\tau(x, a)\, Z(x, a)}{\sum_{x, a} \tau(x, a)\, Z(x, a)},$$

$P_{\mathbf{f}}$-a.s. for all $x$ and $a$, where $Z(x, a)$ denotes the associated state-action frequencies. Let $\{z_{\mathbf{f}}(x, a);\ x \in \mathcal{S}, a \in \mathcal{A}\}$ denote the expected state-action frequencies; that is,

$$z_{\mathbf{f}}(x, a) = \lim_{n \to \infty} E_{\mathbf{f}}\, \frac{1}{n} \sum_{m=1}^{n} \mathbf{1}\{X_{m-1} = x, A_{m-1} = a\} = \pi_{x}(\mathbf{f})\, f_{xa},$$

where $\pi_x(\mathbf{f})$ is the steady-state distribution of the embedded Markov chain $P(\mathbf{f})$.

The long-run average number of transitions per unit time into state $x$ with action $a$ applied is

$$\nu_{\mathbf{f}}(x, a) = \frac{\pi_{x}(\mathbf{f})\, f_{xa}}{\sum_{x, a} \tau(x, a)\, \pi_{x}(\mathbf{f})\, f_{xa}} = \frac{z_{\mathbf{f}}(x, a)}{\sum_{x, a} \tau(x, a)\, z_{\mathbf{f}}(x, a)}. \tag{6}$$

This gives $w_{\mathbf{f}}(x, a) = \tau(x, a)\, \nu_{\mathbf{f}}(x, a)$.
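To make these relations concrete, the sketch below computes, for a stationary policy whose embedded chain is unichain, the steady-state distribution $\pi(\mathbf{f})$, the expected state-action frequencies $z_{\mathbf{f}}$, the transition rates $\nu_{\mathbf{f}}$ of Eq. (6), and the time proportions $w_{\mathbf{f}} = \tau\,\nu_{\mathbf{f}}$. The arrays `P`, `f`, and `tau` are assumed inputs (law of motion, randomized decision rule, and expected sojourn times); this is only an illustrative sketch under the unichain assumption.

```python
import numpy as np

def stationary_quantities(P, f, tau):
    """P: (S, A, S) law of motion; f: (S, A) decision rule; tau: (S, A) sojourn means."""
    S, A, _ = P.shape
    P_f = np.einsum('xay,xa->xy', P, f)           # embedded transition matrix P(f)
    # solve pi P(f) = pi together with the normalization sum(pi) = 1
    M = np.vstack([P_f.T - np.eye(S), np.ones((1, S))])
    b = np.zeros(S + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(M, b, rcond=None)
    z = pi[:, None] * f                           # z_f(x, a) = pi_x(f) f_xa
    nu = z / np.sum(tau * z)                      # Eq. (6): transitions per unit time
    w = tau * nu                                  # long-run time proportions
    return pi, z, nu, w
```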
The decision-maker earns an immediate reward $R(X_m, A_m)$ and a reward with rate $r(X_m, A_m)$ until the $(m+1)$st epoch. Thus,

$$R_{m+1} = R(X_{m}, A_{m}) + r(X_{m}, A_{m})\,\Upsilon_{m+1}$$

is the reward earned during the $(m+1)$st transition [5, 36]. Similarly, there is an immediate cost $C(X_m, A_m)$ and a cost with rate $c(X_m, A_m)$, with

$$C_{m+1} = C(X_{m}, A_{m}) + c(X_{m}, A_{m})\,\Upsilon_{m+1}.$$

Hence, at any epoch, if the process is in state $x \in \mathcal{S}$ and action $a \in \mathcal{A}$ is chosen, then the reward earned during this epoch is represented by $\bar{r}(x, a) \triangleq R(x, a) + r(x, a)\,\tau(x, a)$. Similarly, the cost during this epoch is represented by $\bar{c}(x, a) \triangleq C(x, a) + c(x, a)\,\tau(x, a)$.
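In code, these per-epoch quantities are immediate once $\tau(x, a)$ is available; here `R`, `r`, `C`, `c` are assumed $(S, A)$ arrays of the immediate and rate rewards and costs (a sketch only):

```python
def per_epoch_reward_and_cost(R, r, C, c, tau):
    """rbar(x, a) = R(x, a) + r(x, a) tau(x, a); cbar(x, a) analogously."""
    rbar = R + r * tau
    cbar = C + c * tau
    return rbar, cbar
```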
We conclude this section with a fact that will be used in the subsequent theorems. It follows directly from the law of large numbers for martingale differences (see, e.g., Loève [26]): for all policies $\mathbf{u} \in U$, if $\sum_{m=1}^{\infty} (\operatorname{var} \Upsilon_m)/m^2 < \infty$, then

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \left[d(X_{m-1}, A_{m-1})\,\Upsilon_{m} - d(X_{m-1}, A_{m-1})\,\tau(X_{m-1}, A_{m-1})\right] = 0 \tag{7}$$

holds $P_{\mathbf{u}}$-a.s., with $d(\cdot,\cdot)$ an arbitrary bounded function on $\mathcal{S} \times \mathcal{A}$.
Thus, we need the following assumption on the sojourn times.
Assumption 1
For all policies u ∈ U,
$$\sum_{m=1}^{\infty} \frac{\operatorname{var} \Upsilon_{m}}{m^{2}} < \infty.$$

This condition is essentially equivalent to the assumption that $E_{\mathbf{u}}[\Upsilon_m^2 \mid X_{m-1} = x, A_{m-1} = a] < \infty$ for all $x \in \mathcal{S}$ and $a \in \mathcal{A}$.
3. PRELIMINARY RESULTS
In this section we establish some facts that will be used later in the analysis. Proposition 1 shows that the expected average reward (average cost) can be written in terms of the expected state-action frequencies $\{z_{\mathbf{u}}(x, a)\}$ ($\{Z(x, a)\}$).
Proposition 1
Assume that the SMDP is unichain. For any policy $\mathbf{u} \in F$, the expected average reward and the average cost are given, respectively, as

$$\psi(\mathbf{u}) = \frac{\sum_{x, a} \bar{r}(x, a)\, z_{\mathbf{u}}(x, a)}{\sum_{x', a'} \tau(x', a')\, z_{\mathbf{u}}(x', a')} = \sum_{x, a} \bar{r}(x, a)\, \nu_{\mathbf{u}}(x, a) \tag{8}$$

and

$$\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds = \frac{\sum_{x, a} \bar{c}(x, a)\, Z(x, a)}{\sum_{x', a'} \tau(x', a')\, Z(x', a')}, \tag{9}$$

$P_{\mathbf{u}}$-a.s.
Proof
Fix a policy $\mathbf{u} \in F$. Equation (8) is written as

$$\begin{aligned}
\psi(\mathbf{u}) &= E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds\right] \\
&= E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t}\left[\sum_{m=0}^{n(t)} R(X_{m}, A_{m}) + \sum_{m=0}^{n(t)-1} r(X_{m}, A_{m})\,\Upsilon_{m+1} + (t - T_{n(t)})\, r(X_{n(t)}, A_{n(t)})\right]\right] \\
&= E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \sum_{x, a}\left[R(x, a) \sum_{m=0}^{n(t)} \mathbf{1}\{X_{m} = x, A_{m} = a\} + r(x, a)\,\tau(x, a) \sum_{m=0}^{n(t)-1} \mathbf{1}\{X_{m} = x, A_{m} = a\}\right]\right] \\
&= \sum_{x, a} R(x, a)\,\nu_{\mathbf{u}}(x, a) + \sum_{x, a} r(x, a)\,\tau(x, a)\,\nu_{\mathbf{u}}(x, a),
\end{aligned}$$

where $n(t) \triangleq \max\{m : T_m \leq t\}$ denotes the number of transitions up to time $t$. Note that as $t$ goes to infinity, so does $n(t)$; thus, the last term in the second equality goes to zero as $t$ goes to infinity. Equation (9) is obtained similarly.■
Now, consider the expected time-average variability $\nu(\mathbf{u})$. The following proposition shows that the time-average variability can also be expressed in terms of the long-run state-action frequencies $\{Z(x, a)\}$. Let $\Psi_t$ denote the time-average reward random variable (r.v.) up to time $t$:

$$\Psi_{t} \triangleq \frac{1}{t} \int_{0}^{t} R_{s}\,ds$$
and

$$\Psi \triangleq \sum_{x, a} \bar{r}(x, a)\, V(x, a),$$

where $V(x, a) \triangleq Z(x, a)\big/\sum_{x', a'} \tau(x', a')\, Z(x', a')$ denotes the long-run (random) rate of transitions into $(x, a)$ per unit time, the sample-path analogue of $\nu_{\mathbf{f}}(x, a)$ in Eq. (6).
Proposition 2
For all $\mathbf{u} \in U_0$,

$$\begin{aligned}
&\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds \\
&\quad = \left(\sum_{x, a} h\!\left[\bar{r}(x, a),\ \frac{\sum_{y, b} \bar{r}(y, b)\, Z(y, b)}{\sum_{x', a'} \tau(x', a')\, Z(x', a')}\right] Z(x, a)\right) \times \left[\sum_{x', a'} \tau(x', a')\, Z(x', a')\right]^{-1},
\end{aligned}$$

$P_{\mathbf{u}}$-a.s. If $h(x, y) = x - \lambda(x - y)^2$, then for $\mathbf{u} \in U_0$ we have

$$\nu(\mathbf{u}) = \psi(\mathbf{u}) - \lambda \lim_{t \to \infty} \frac{1}{t} \int_{0}^{t} E_{\mathbf{u}}\!\left[R_{s} - \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right]^{2} ds.$$
Proof
Fix a policy $\mathbf{u} \in U_0$. As in Proposition 1, it is straightforward to establish that

$$\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h(R_{s}, \Psi)\,ds = \sum_{x, a} h[\bar{r}(x, a), \Psi]\, V(x, a),$$

$P_{\mathbf{u}}$-a.s. The rest of the proof follows from Proposition 1 in [3].■

The proof of the following proposition, which gives $\psi$ and $\nu$ for the multichain case, is straightforward.
Proposition 3
Let $\mathbf{f}$ be a stationary policy and let $\mathcal{R}_1, \ldots, \mathcal{R}_q$ be the recurrent classes associated with $P(\mathbf{f})$. Denote $(\pi_x^i(\mathbf{f});\ x \in \mathcal{R}_i)$ for the equilibrium probability vector associated with class $i$, $i = 1, \ldots, q$. Further, denote

$$\psi_{i}(\mathbf{f}) = \frac{\sum_{x, a} \bar{r}(x, a)\, \pi_{x}^{i}(\mathbf{f})\, f_{xa}}{\sum_{y, b} \tau(y, b)\, \pi_{y}^{i}(\mathbf{f})\, f_{yb}}. \tag{10}$$

Then

$$\psi(\mathbf{f}) = \sum_{i=1}^{q} P_{\mathbf{f}}\{X_{n} \in \mathcal{R}_{i}\ \text{a.s.}\}\, \psi_{i}(\mathbf{f}) \tag{11}$$

and

$$\nu(\mathbf{f}) = \sum_{i=1}^{q} P_{\mathbf{f}}\{X_{n} \in \mathcal{R}_{i}\ \text{a.s.}\} \sum_{x, a} \frac{h[\bar{r}(x, a), \psi_{i}(\mathbf{f})]\, \pi_{x}^{i}(\mathbf{f})\, f_{xa}}{\sum_{y, b} \tau(y, b)\, \pi_{y}^{i}(\mathbf{f})\, f_{yb}}. \tag{12}$$

If the SMDP is unichain, then

$$\nu(\mathbf{f}) = \sum_{x, a} \frac{h[\bar{r}(x, a), \psi(\mathbf{f})]\, \pi_{x}(\mathbf{f})\, f_{xa}}{\sum_{y, b} \tau(y, b)\, \pi_{y}(\mathbf{f})\, f_{yb}}. \tag{13}$$

Note that Schäl [34] showed in Lemma 2.7 that for finite-state, finite-action multichain SMDPs under a pure policy, $\phi_1$ is equivalently given by Eqs. (10) and (11), which define the expected average reward $\psi$.
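As an illustration of Eqs. (8) and (13) in the unichain case, the sketch below evaluates $\psi(\mathbf{f})$ and, for the particular choice $h(x, y) = x - \lambda(x - y)^2$, the variability criterion $\nu(\mathbf{f})$ from the expected state-action frequencies $z_{\mathbf{f}}(x, a) = \pi_x(\mathbf{f})\, f_{xa}$. The arrays `zf`, `rbar`, `tau` and the scalar `lam` are assumed inputs.

```python
import numpy as np

def unichain_psi_and_nu(zf, rbar, tau, lam):
    """zf, rbar, tau: (S, A) arrays of z_f(x,a), rbar(x,a), tau(x,a); lam >= 0."""
    denom = np.sum(tau * zf)                 # expected sojourn time per transition
    psi = np.sum(rbar * zf) / denom          # Eq. (8)
    h = rbar - lam * (rbar - psi) ** 2       # h(x, y) = x - lam (x - y)^2 at y = psi
    nu = np.sum(h * zf) / denom              # Eq. (13)
    return psi, nu
```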
Decomposition and Sample Path Theory
The following notation will be used in the subsequent sections. A set $\mathcal{C} \subseteq \mathcal{S}$ is said to be a strongly communicating class if (1) $\mathcal{C}$ is a recurrent class for some stationary policy and (2) $\mathcal{C}$ is not a proper subset of some $\mathcal{C}'$ for which (1) holds. Let $\{\mathcal{C}_1, \ldots, \mathcal{C}_I\}$ be the collection of all strongly communicating classes. Let $\mathcal{T}$ be the (possibly empty) set of states that are transient under all stationary policies. It is shown in [33] that $\{\mathcal{C}_1, \ldots, \mathcal{C}_I, \mathcal{T}\}$ forms a partition of the state space $\mathcal{S}$. These decomposition ideas were first introduced by Bather [2]. For each $i = 1, \ldots, I$, denote, for each $x \in \mathcal{C}_i$, the set

$$\mathcal{F}_{x} = \{a \in \mathcal{A} : P_{xay} = 0\ \text{for all}\ y \notin \mathcal{C}_{i}\}.$$

The following result is also proved in [33].
Proposition 4
For all policies u,
$$\sum_{i=1}^{I} P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\} = 1 \tag{14}$$
and
$$P_{\mathbf{u}}\{A_{n} \in \mathcal{F}_{X_{n}}\ \text{a.s.}\} = 1. \tag{15}$$
For each $i = 1, \ldots, I$, define a new SMDP, called SMDP-$i$, as follows: The state space is $\mathcal{C}_i$; for each $x \in \mathcal{C}_i$, the set of available actions is given by the state-dependent action space $\mathcal{F}_x$; the law of motion $P_{xay}$, the conditional sojourn-time distribution $F_{xay}(\cdot)$, the reward function $\bar{r}(x, a)$, and the cost function $\bar{c}(x, a)$ are the same as earlier but restricted to $\mathcal{C}_i$ and $\mathcal{F}_x$ for $x \in \mathcal{C}_i$. Each SMDP-$i$, $i = 1, \ldots, I$, is communicating. For each SMDP-$i$, let $\nu_i(\mathbf{u})$ be the expected average variability under policy $\mathbf{u}$.
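Given a strongly communicating class $\mathcal{C}_i$, the restricted action sets $\mathcal{F}_x$ defining SMDP-$i$ follow directly from the law of motion. A minimal sketch, assuming `P` is an $(S, A, S)$ array and `C_i` a collection of state indices:

```python
import numpy as np

def restricted_action_sets(P, C_i):
    """F_x = {a : P_xay = 0 for all y outside C_i}, for each x in C_i."""
    S, A, _ = P.shape
    outside = np.array([y for y in range(S) if y not in set(C_i)], dtype=int)
    F = {}
    for x in C_i:
        if outside.size == 0:
            F[x] = list(range(A))      # every action keeps the process in C_i
        else:
            F[x] = [a for a in range(A) if np.all(P[x, a, outside] == 0.0)]
    return F
```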
4. OPTIMIZATION RESULTS
In the constrained problem, say $T^{(1)}$, we seek to maximize the expected average reward $\psi(\mathbf{u})$ [Eq. (3)] over the policies that satisfy the sample-path constraint [Eq. (4)]. Let $U_f$ denote the class of feasible policies. The optimal constrained average reward is given as

$$\psi^{*} = \sup_{\mathbf{u} \in U_{f}} \psi(\mathbf{u}).$$

A policy $\mathbf{u}^* \in U_f$ is said to be constrained average optimal if $\psi(\mathbf{u}^*) = \psi^*$. A policy $\mathbf{u} \in U_f$ is said to be ε-average optimal if $\psi(\mathbf{u}) > \psi^* - \epsilon$. The second problem, $T^{(2)}$, maximizes the expected time-average variability [Eq. (5)]. Let

$$\nu^{*} = \sup_{\mathbf{u} \in U} \nu(\mathbf{u}).$$

A policy $\mathbf{u}^*$ is optimal for $\nu(\cdot)$ if $\nu(\mathbf{u}^*) = \nu^*$. An ε-optimal policy for $\nu(\cdot)$ is defined as a policy $\mathbf{u}$ such that $\nu(\mathbf{u}) > \nu^* - \epsilon$.

Note that by choosing $\alpha$ to be sufficiently large, the unconstrained problem can be viewed as a special case of the constrained optimization problem. Also, by choosing $h^{(1)}(x, y) = x$, we have $\nu(\mathbf{u}) = \psi(\mathbf{u})$. Thus, in what follows we will present the general problem of maximizing $\nu^{(j)}(\mathbf{u})$ subject to the sample-path constraint (4), where $j = 1$ corresponds to the constrained problem with $\nu^{(1)}(\mathbf{u}) = \psi(\mathbf{u})$ and $j = 2$ corresponds to the expected average variability with $\nu^{(2)}(\mathbf{u}) = \nu(\mathbf{u})$ and $\alpha^{(2)} = \infty$.

For each $j = 1, 2$ and $i = 1, \ldots, I$, consider the following fractional program with decision variables $z(x, a)$, $x \in \mathcal{C}_i$, $a \in \mathcal{F}_x$. Let $\delta_{xy} = 1$ if $x = y$ and $\delta_{xy} = 0$ otherwise.
Program $T_i^{(j)}$

$$t_{i}^{(j)} = \max \left\{ \frac{\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} h^{(j)}\!\left[\bar{r}(x, a),\ \dfrac{\sum_{y \in \mathcal{C}_{i}, b \in \mathcal{F}_{y}} \bar{r}(y, b)\, z(y, b)}{\sum_{x' \in \mathcal{C}_{i}, a' \in \mathcal{F}_{x'}} \tau(x', a')\, z(x', a')}\right] z(x, a)}{\sum_{x' \in \mathcal{C}_{i}, a' \in \mathcal{F}_{x'}} \tau(x', a')\, z(x', a')} \right\} \tag{16}$$

$$\text{s.t.}\quad \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} (\delta_{xy} - P_{xay})\, z(x, a) = 0, \qquad y \in \mathcal{C}_{i}, \tag{17}$$

$$\sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} z(x, a) = 1, \tag{18}$$

$$\frac{\sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} \bar{c}(x, a)\, z(x, a)}{\sum_{x' \in \mathcal{C}_{i}, a' \in \mathcal{F}_{x'}} \tau(x', a')\, z(x', a')} \leq \alpha^{(j)}, \tag{19}$$

$$z(x, a) \geq 0, \qquad x \in \mathcal{C}_{i},\ a \in \mathcal{F}_{x}. \tag{20}$$
For each $\eta \geq 0$, we will also need to refer to the following fractional program with decision variables $z(x, a)$ for all $x \in \mathcal{S}$, $a \in \mathcal{A}$.

Program $Q_\eta^{(j)}$

$$q_{\eta}^{(j)} = \max \left\{ \frac{\sum_{x \in \mathcal{S}, a \in \mathcal{A}} h^{(j)}\!\left[\bar{r}(x, a),\ \dfrac{\sum_{y \in \mathcal{S}, b \in \mathcal{A}} \bar{r}(y, b)\, z(y, b)}{\sum_{x', a'} \tau(x', a')\, z(x', a')}\right] z(x, a)}{\sum_{x', a'} \tau(x', a')\, z(x', a')} \right\} \tag{21}$$

$$\text{s.t.}\quad \sum_{x \in \mathcal{S}, a \in \mathcal{A}} (\delta_{xy} - P_{xay})\, z(x, a) = 0, \qquad y \in \mathcal{S}, \tag{22}$$

$$\sum_{x \in \mathcal{S}, a \in \mathcal{A}} z(x, a) = 1, \tag{23}$$

$$\frac{\sum_{x \in \mathcal{S}, a \in \mathcal{A}} \bar{c}(x, a)\, z(x, a)}{\sum_{x' \in \mathcal{S}, a' \in \mathcal{A}} \tau(x', a')\, z(x', a')} \leq \alpha^{(j)}, \tag{24}$$

$$z(x, a) \geq \eta, \qquad x \in \mathcal{S},\ a \in \mathcal{A}. \tag{25}$$
We will refer to the feasible regions of Program $T_i^{(j)}$ and Program $Q_\eta^{(j)}$ simply as $T_i^{(j)}$ and $Q_\eta^{(j)}$, respectively. Note that the objective functions of both sets of mathematical programs are continuous functions over polytopes. If the cost constraint is satisfied by some $\{z(x, a)\}$, then $T_i^{(1)}$, $i = 1, \ldots, I$, and $Q_0^{(1)}$ are nonempty; $T_i^{(2)}$, $i = 1, \ldots, I$, and $Q_0^{(2)}$ are always nonempty. For a given solution $\{z(x, a)\}$, we will write

$$z(x) = \sum_{a} z(x, a).$$
First, we consider the constrained problem $T^{(1)}$ given by Eqs. (3) and (4). Thus, use $h^{(1)}(x, y) = x$ in Eqs. (16) and (21). The following lemmas provide bounds on $\psi(\mathbf{u})$ and $\nu(\mathbf{u})$. The proof of Lemma 2 is similar to the proof of Lemma 1; thus, only an outline of the proof will be given.
Lemma 1
If $U_f$ is nonempty, then for all $i = 1, \ldots, I$, $T_i^{(1)}$ is nonempty, and for $\mathbf{u} \in U_f$,

$$P_{\mathbf{u}}\!\left\{\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds \leq t_{i}^{(1)} \,\Big|\, X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\right\} = 1 \tag{26}$$

and, consequently,

$$\psi(\mathbf{u}) \leq \sum_{i=1}^{I} t_{i}^{(1)}\, P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}. \tag{27}$$
Proof
Fix a policy $\mathbf{u} \in U_f$. Let $\Gamma$ be the set of all sample paths $\omega = (x_0, a_0, \tau_1, x_1, a_1, \tau_2, \ldots)$ that satisfy the following:

(i) $a_n \in \mathcal{F}_{x_n}$ for all $n \geq N$, for some positive integer $N$;

(ii) $\sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} P_{xay}\, Z(x, a) = \sum_{a \in \mathcal{A}} Z(y, a)$ for all $y \in \mathcal{S}$;

(iii) $\limsup_{t \to \infty} (1/t) \int_0^t C_s\,ds \leq \alpha^{(1)}$.

Combining Eq. (14) with Eq. (7), where $d(\cdot,\cdot) = 1$ and $\Upsilon_m = \mathbf{1}\{X_m = y\}$, and the fact that $\mathbf{u}$ is feasible yields

$$P_{\mathbf{u}}(\Gamma) = 1.$$
Let $(x_0, a_0, \tau_1, x_1, a_1, \tau_2, \ldots) \in \{X_n \in \mathcal{C}_i\ \text{a.s.}\} \cap \Gamma$ and define

$$Z_{n}(x, a) \triangleq \frac{1}{n} \sum_{m=1}^{n} \mathbf{1}\{X_{m-1} = x, A_{m-1} = a\}.$$

Since $0 \leq Z_n(x, a) \leq 1$, by the standard compactness argument there exists a subsequence $\{N_k(\omega)\}$ along which $\{Z_n(x, a; \omega)\}$ converges to some $Z'(x, a; \omega)$ on $\Phi = \{X_n \in \mathcal{C}_i\ \text{a.s.}\} \cap \Gamma$; that is,

$$\lim_{k \to \infty} Z_{N_{k}}(x, a) = Z'(x, a). \tag{28}$$
By definition, it follows that
$$Z'(x, a) = 0 \quad \text{whenever}\ x \notin \mathcal{C}_{i}\ \text{or}\ a \notin \mathcal{F}_{x}$$
on the set Φ. Thus, on Φ,
$$\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} P_{xay}\, Z'(x, a) = \sum_{a \in \mathcal{F}_{y}} Z'(y, a), \qquad \forall\, y \in \mathcal{C}_{i},$$
and
$$\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} Z'(x, a) = 1, \qquad Z'(x, a) \geq 0, \quad \forall\, x \in \mathcal{C}_{i},\ a \in \mathcal{F}_{x}.$$

Observe that, on $\Phi$, for any bounded function $d(\cdot,\cdot)$,

$$\begin{aligned}
&\left|\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} d(X_{m-1}, A_{m-1})\,\Upsilon_{m} - \sum_{x, a} d(x, a)\,\tau(x, a)\, Z'(x, a)\right| \\
&\qquad \leq \left|\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} \left[d(X_{m-1}, A_{m-1})\,\Upsilon_{m} - d(X_{m-1}, A_{m-1})\,\tau(X_{m-1}, A_{m-1})\right]\right| \\
&\qquad\quad + \left|\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} d(X_{m-1}, A_{m-1})\,\tau(X_{m-1}, A_{m-1}) - \sum_{x, a} d(x, a)\,\tau(x, a)\, Z'(x, a)\right|,
\end{aligned}$$

which, combined with Eqs. (7) and (28), gives

$$\lim_{k \to \infty} \frac{1}{N_{k}} \sum_{m=1}^{N_{k}} d(X_{m-1}, A_{m-1})\,\Upsilon_{m} = \sum_{x, a} d(x, a)\,\tau(x, a)\, Z'(x, a).$$

From this equation the following holds:

$$\lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} C_{m} = \lim_{k \to \infty} \frac{\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} \left[C(X_{m-1}, A_{m-1}) + c(X_{m-1}, A_{m-1})\,\Upsilon_{m}\right]}{\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} \Upsilon_{m}} = \frac{\sum_{x, a} \bar{c}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)}.$$
Also, on Φ,
$$\alpha^{(1)} \geq \limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds \geq \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} C_{m} = \frac{\sum_{x, a} \bar{c}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)}.$$

Thus, $\{Z'(x, a)\}$ is in the feasible set, implying that $T_i^{(1)}$ is nonempty. Hence, on $\Phi$,

$$\frac{\sum_{x, a} \bar{r}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)} \leq t_{i}^{(1)}.$$

In a similar manner,

$$\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds \leq \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} R_{m} = \frac{\sum_{x, a} \bar{r}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)},$$

which gives the desired result. Combining Eq. (26) with Proposition 4 gives Eq. (27).■

Next, we consider the expected time-average variability criterion.
Lemma 2
For all i = 1, … , I and for all policies u, we have
$$P_{\mathbf{u}}\!\left\{\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds \leq t_{i}^{(2)} \,\Big|\, X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\right\} = 1 \tag{29}$$
and, consequently,
$$\nu(\mathbf{u}) \leq \sum_{i=1}^{I} t_{i}^{(2)}\, P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}. \tag{30}$$
Proof
The proof is similar to the proof of Lemma 1. We only need to note that
$$\begin{aligned}
\liminf_{t \to \infty}\ & \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds \\
&\leq \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} h\!\left(R_{m}, \frac{1}{T_{N_{k}}} \sum_{l=1}^{N_{k}} R_{l}\right) \\
&= \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} h\!\left(R_{m}, \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{l=1}^{N_{k}} R_{l}\right) \\
&= \frac{\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} h\!\left[\bar{r}(x, a),\ \dfrac{\sum_{y \in \mathcal{C}_{i}} \sum_{b \in \mathcal{F}_{y}} \bar{r}(y, b)\, Z'(y, b)}{\sum_{x' \in \mathcal{C}_{i}} \sum_{a' \in \mathcal{F}_{x'}} \tau(x', a')\, Z'(x', a')}\right] Z'(x, a)}{\sum_{x' \in \mathcal{C}_{i}} \sum_{a' \in \mathcal{F}_{x'}} \tau(x', a')\, Z'(x', a')},
\end{aligned}$$
on Φ.■
5. THE COMMUNICATING CASE

We assume that the SMDP is communicating. This implies that there is only one strongly communicating class and that $\mathcal{S} = \mathcal{C}_1$. The analysis of this section draws on results and observations from [32].

In this section we will show that, in general, an optimal stationary policy need not exist for either criterion. Instead, we show that an ε-optimal stationary policy can be constructed. First, we consider the constrained problem, Eqs. (3) and (4). Let $h^{(1)}(x, y) = x$ in Eqs. (16) and (21).
Proposition 5
Fix $\eta \geq 0$ and let $\{z^{\eta}(x, a)\}$ be an optimal extreme point for $Q_\eta^{(1)}$. Define a policy $\mathbf{f}^{\eta}$ by the transformation

$$f^{\eta}_{xa} = \begin{cases} \dfrac{z^{\eta}(x, a)}{z^{\eta}(x)} & \text{if}\ z^{\eta}(x) > 0, \\[1.5ex] \text{uniform over the actions} & \text{otherwise.} \end{cases} \tag{31}$$

Then

$$\sum_{x} z^{\eta}(x)\, P_{xy}(\mathbf{f}^{\eta}) = z^{\eta}(y), \tag{32}$$

$$\sum_{x} z^{\eta}(x) = 1. \tag{33}$$

If $P(\mathbf{f}^{\eta})$ is unichain, then $\mathbf{f}^{\eta} \in U_f$ and $P_{\mathbf{f}^{\eta}}\{\liminf_{t \to \infty} (1/t) \int_0^t R_s\,ds = q_{\eta}^{(1)}\} = 1$. In particular, if $P(\mathbf{f}^{0})$ is unichain, then $\mathbf{f}^{0}$ is an optimal stationary policy for the constrained problem.
Proof
It is straightforward to show Eqs. (32) and (33). If $P(\mathbf{f}^{\eta})$ is unichain, there is a unique probability vector $\pi(\mathbf{f}^{\eta})$ associated with $P(\mathbf{f}^{\eta})$. Hence, $\pi_x(\mathbf{f}^{\eta}) = z^{\eta}(x)$, giving, $P_{\mathbf{f}^{\eta}}$-almost surely,

$$\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds = \frac{\sum_{x, a} \bar{c}(x, a)\, \pi_{x}(\mathbf{f}^{\eta})\, f^{\eta}_{xa}}{\sum_{x, a} \tau(x, a)\, \pi_{x}(\mathbf{f}^{\eta})\, f^{\eta}_{xa}} = \frac{\sum_{x, a} \bar{c}(x, a)\, z^{\eta}(x, a)}{\sum_{x, a} \tau(x, a)\, z^{\eta}(x, a)} \leq \alpha^{(1)}.$$

In a similar manner, we have, $P_{\mathbf{f}^{\eta}}$-a.s.,

$$\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds = \frac{\sum_{x, a} \bar{r}(x, a)\, z^{\eta}(x, a)}{\sum_{x, a} \tau(x, a)\, z^{\eta}(x, a)} = q_{\eta}^{(1)}. \qquad \blacksquare$$
Only the outline of the proof of the following theorem will be given, since it follows the proofs of Propositions 5–7 in [32].

Theorem 1

Suppose that the SMDP is communicating. Then $U_f$ is nonempty if and only if $Q_0^{(1)}$ is nonempty. If $Q_0^{(1)}$ is nonempty, then for each ε > 0, there exists an ε-optimal stationary policy for the constrained problem.
Proof
Proposition 5 proves the (only if) part. To prove the (if) part, assume that $\{z^0(x, a)\}$ is an optimal extreme point of $Q_0^{(1)}$. Let $\mathbf{f}^0$ be the policy obtained via transformation (31). It follows from Eq. (32) that the set of states where $z^0(x) > 0$ is a closed set, and by Lemma 2 of [32], all states outside of this closed set are transient. This closed set is the union of $m$ recurrent classes $R_1, \ldots, R_m$ associated with $P(\mathbf{f}^0)$. For each recurrent class, we can define

$$d_{k} = \frac{\sum_{x \in R_{k}} \sum_{a} \bar{c}(x, a)\, z^{0}(x, a)}{\sum_{x \in R_{k}} \sum_{a} \tau(x, a)\, z^{0}(x, a)}.$$

The value $d_k$ has the interpretation of being the average cost per unit time, given that the process has entered $R_k$. Let $l = \arg\min_{1 \leq k \leq m} d_k$. Then, since $\{z^0(x, a)\}$ is feasible for $Q_0^{(1)}$, we have $d_l \leq \alpha^{(1)}$. Since $d_k$ can be greater than $\alpha^{(1)}$ for some $k$, $\mathbf{f}^0$ does not necessarily belong to $U_f$. However, we can define a stationary policy $\tilde{\mathbf{f}}$ that is equal to $\mathbf{f}^0$ in $R_l$ and outside $R_l$ takes every available action with equal probability. Clearly, since the SMDP is communicating, $R_l$ is the only recurrent class associated with $P(\tilde{\mathbf{f}})$ and $\tilde{\mathbf{f}}$ is in $U_f$. Thus, $U_f$ is nonempty.

For the second part of the theorem, we assume that $Q_0^{(1)}$ is nonempty. Using the machinery developed in [32], whenever there exists a policy that strictly meets the constraint, one can construct a feasible stationary policy that chooses every action with positive probability and gives rise to an irreducible Markov chain. Otherwise, the stationary policy $\mathbf{f}^0$ given by transformation (31) gives rise to a unichain $P(\mathbf{f}^0)$; thus, $\mathbf{f}^0$ is the optimal policy.

Thus, we assume that there exists a policy that strictly meets the constraint. In this case, there exists a $\zeta > 0$ such that for each $\eta$ satisfying $0 < \eta < \zeta$, there is a feasible solution for $Q_\eta^{(1)}$, and $P(\mathbf{f}^{\eta})$ is irreducible for $\mathbf{f}^{\eta}$ obtained via transformation (31). From Proposition 5, we have $\mathbf{f}^{\eta} \in U_f$ and $P_{\mathbf{f}^{\eta}}\{\liminf_{t \to \infty} (1/t) \int_0^t R_s\,ds = q_{\eta}^{(1)}\} = 1$. To prove that $\lim_{\eta \to 0} q_{\eta}^{(1)} = q_0^{(1)}$, we can transform the fractional program into a linear program using transformation (6) for $\nu_{\mathbf{f}^{\eta}}$ [8, 12]. Then the desired continuity holds by the piecewise linearity and convexity of the objective function with respect to the right-hand-side value $\eta / \sum_{x, a} \tau(x, a)$.■
Next, we present the mathematical programs obtained via transformation (6) explicitly, in terms of the decision variables $\nu(x, a)$.

Program $LT_i^{(j)}$

$$\begin{aligned}
t_{i}^{(j)} = \max\ & \sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} h^{(j)}\!\left[\bar{r}(x, a),\ \sum_{y \in \mathcal{C}_{i}, b \in \mathcal{F}_{y}} \bar{r}(y, b)\,\nu(y, b)\right] \nu(x, a) \\
\text{s.t.}\ & \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} (\delta_{xy} - P_{xay})\,\nu(x, a) = 0, \qquad y \in \mathcal{C}_{i}, \\
& \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} \tau(x, a)\,\nu(x, a) = 1, \\
& \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} \bar{c}(x, a)\,\nu(x, a) \leq \alpha^{(j)}, \\
& \nu(x, a) \geq 0, \qquad x \in \mathcal{C}_{i},\ a \in \mathcal{F}_{x}.
\end{aligned}$$
For each η ≥ 0, we also define the following program with decision variables ν(x, a), x ∈ 𝒮, a ∈ 𝒜.
Program $LQ_\eta^{(j)}$

$$\begin{aligned}
q_{\eta}^{(j)} = \max\ & \sum_{x \in \mathcal{S}, a \in \mathcal{A}} h^{(j)}\!\left[\bar{r}(x, a),\ \sum_{y \in \mathcal{S}, b \in \mathcal{A}} \bar{r}(y, b)\,\nu(y, b)\right] \nu(x, a) \\
\text{s.t.}\ & \sum_{x \in \mathcal{S}, a \in \mathcal{A}} (\delta_{xy} - P_{xay})\,\nu(x, a) = 0, \qquad y \in \mathcal{S}, \\
& \sum_{x \in \mathcal{S}, a \in \mathcal{A}} \tau(x, a)\,\nu(x, a) = 1, \\
& \sum_{x \in \mathcal{S}, a \in \mathcal{A}} \bar{c}(x, a)\,\nu(x, a) \leq \alpha^{(j)}, \\
& \nu(x, a) \geq \eta, \qquad x \in \mathcal{S},\ a \in \mathcal{A}.
\end{aligned}$$
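For $j = 1$, where $h^{(1)}(x, y) = x$, Program $LQ_\eta^{(1)}$ is an ordinary linear program in the variables $\nu(x, a)$ and can be handed to any LP solver. The sketch below uses scipy; the arrays `P`, `rbar`, `cbar`, `tau` and the scalars `alpha`, `eta` are assumed inputs, and for $j = 2$ the objective is no longer linear, so this routine does not apply.

```python
import numpy as np
from scipy.optimize import linprog

def solve_LQ_eta_1(P, rbar, cbar, tau, alpha, eta=0.0):
    """LQ_eta^(1): maximize sum rbar*nu s.t. flow balance, sum tau*nu = 1,
    sum cbar*nu <= alpha, nu >= eta.  Returns (nu, value) or (None, None)."""
    S, A, _ = P.shape
    n = S * A                                    # variables nu(x, a), flattened row-wise
    A_eq = np.zeros((S + 1, n))
    for y in range(S):                           # flow-balance row for each state y
        for x in range(S):
            for a in range(A):
                A_eq[y, x * A + a] = (1.0 if x == y else 0.0) - P[x, a, y]
    A_eq[S, :] = tau.reshape(-1)                 # normalization: sum tau(x,a) nu(x,a) = 1
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0
    res = linprog(-rbar.reshape(-1),             # maximize by minimizing the negative
                  A_ub=cbar.reshape(1, -1), b_ub=np.array([alpha]),
                  A_eq=A_eq, b_eq=b_eq,
                  bounds=[(eta, None)] * n, method="highs")
    if not res.success:
        return None, None                        # infeasible for this eta
    return res.x.reshape(S, A), -res.fun         # nu(x, a) and q_eta^(1)
```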
Now we can present the following procedure to locate optimal or near-optimal policies for the constrained problem.

Step 1: Solve the LP $LQ_0^{(1)}$ by the simplex method. If $LQ_0^{(1)}$ is not feasible, then there does not exist a policy that meets the sample-path constraint; stop. Otherwise, go to Step 2.

Step 2: Let $\{\nu^0(x, a)\}$ be an optimal extreme point for the LP $LQ_0^{(1)}$ and let $\mathbf{f}^0$ be the corresponding stationary policy obtained via transformation (31). If $P(\mathbf{f}^0)$ is unichain, then $\mathbf{f}^0$ is an optimal stationary policy; stop. Otherwise, go to Step 3.

Step 3: Solve the parametric LP $LQ_\eta^{(1)}$, $\eta \geq 0$, over some interval $[0, \delta]$, beginning with $\eta = 0$. Then employ transformation (31) to obtain an ε-optimal stationary policy for ε as small as desired; a sketch of this policy-recovery step follows.
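The policy recovery in Steps 2 and 3 is transformation (31); since $\nu(x, a)$ is proportional to $z(x, a)$ within each state, the same normalization applies to a solution of $LQ_\eta^{(1)}$. A small sketch, with the array `nu` as an assumed input:

```python
import numpy as np

def policy_from_solution(nu):
    """Transformation (31): f_xa = nu(x,a)/nu(x) if nu(x) > 0, else uniform over actions."""
    S, A = nu.shape
    f = np.full((S, A), 1.0 / A)        # uniform by default
    nu_x = nu.sum(axis=1)
    positive = nu_x > 0
    f[positive] = nu[positive] / nu_x[positive, None]
    return f
```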
For the second criterion, the right-hand side of the cost constraint is infinite; that is, $\alpha^{(2)} = \infty$, and the objective function is $\nu(\mathbf{u})$. We have the following lemma, which follows easily from the invariance of the steady-state distribution.
Lemma 3
Let $\{z(x, a)\}$ be a feasible solution for Program $LQ_0^{(2)}$ and let $\mathbf{f}$ be defined as in Eq. (31). If $P(\mathbf{f})$ is unichain, then

$$\nu(\mathbf{f}) = \frac{\sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} h^{(2)}\!\left[\bar{r}(x, a),\ \dfrac{\sum_{x, a} \bar{r}(x, a)\, z(x, a)}{\sum_{x, a} \tau(x, a)\, z(x, a)}\right] z(x, a)}{\sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} \tau(x, a)\, z(x, a)}.$$

Since the SMDP is communicating, $Q_\eta^{(2)}$, and consequently the feasible region of Program $LQ_\eta^{(2)}$, is nonempty for all $\eta \in [0, \delta]$ for some $\delta > 0$. Now, for each $\eta$, let $\nu^{\eta}$ be an optimal solution to Program $LQ_\eta^{(2)}$. If there is an optimal extreme-point solution to Program $LQ_0^{(2)}$, further require $\nu^0$ to be an extreme point. For each $\eta \in [0, \delta]$, let $\mathbf{f}^{\eta}$ be defined from $\nu^{\eta}$ according to the transformation given in Eq. (31).
Theorem 2
Fix ε > 0. If the SMDP is communicating, then for $\eta > 0$ sufficiently small, the stationary policy $\mathbf{f}^{\eta}$ is ε-optimal for $\nu(\mathbf{u})$. If, in addition, $h^{(2)}(x, y) = x - \lambda(x - y)^2$ with $\lambda > 0$, then the policy $\mathbf{f}^0$ is the optimal pure policy for the expected average variability criterion.
Proof
Noting that the objective function of Program $LQ_\eta^{(2)}$ is continuous over the feasible region of Program $LQ_0^{(2)}$, the proof follows from the proof of Theorem 1 in [3].■
6. MULTICHAIN SMDPs

In this section we impose no restrictions on the law of motion $P_{xay}$, $x \in \mathcal{S}$, $a \in \mathcal{A}$, $y \in \mathcal{S}$. We now construct ε-optimal stationary policies for the constrained problem and for the expected average variability problem. Since the arguments are similar for both criteria, we will present the combined results. The construction of the optimal policy follows closely the developments for the MDP problem in [33]; thus, we will only give outlines of the proofs.

Recall that each SMDP-$i$ is communicating. By Theorems 1 and 2 we can construct an ε-optimal stationary policy $\mathbf{f}_i^{(j)}$ for each SMDP-$i$, $i = 1, \ldots, I$, and for either criterion, $j = 1, 2$. Recall that $t_i^{(j)}$ is the value of Program $LT_i^{(j)}$. We make the following modification to $t_i^{(1)}$ in the constrained problem: $t_i^{(1)}$ is assigned to each communicating class $i$ whenever Program $LT_i^{(1)}$ has a feasible solution; if there does not exist any feasible policy for Program $LT_i^{(1)}$, then $t_i^{(1)} = -\infty$ is assigned to discourage the process from entering class $\mathcal{C}_i$.
Consider the problem of finding a policy that maximizes the following time-average expected reward for each criterion:
$$\beta^{(j)}(\mathbf{u}) = \liminf_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} E_{\mathbf{u}}\!\left[\sum_{i=1}^{I} t_{i}^{(j)}\, \mathbf{1}\{X_{m-1} \in \mathcal{C}_{i}\}\right].$$

This problem is referred to as the intermediate SMDP. At this stage, the decision-maker decides which communicating class generates the maximum reward while satisfying the constraint. It is known that there exists an optimal pure policy $\mathbf{g}^{(j)}$ for each criterion that can be found by policy improvement, value iteration, or linear programming. Let

$$H^{(j)} = \{i : \mathcal{C}_{i}\ \text{contains a recurrent class under}\ P(\mathbf{g}^{(j)})\}.$$

Modify $\mathbf{g}^{(j)}$ so that $\mathcal{C}_i$ is closed for each $i \in H^{(j)}$ and so that $\mathbf{g}^{(j)}$ remains optimal for the intermediate problem (see [33]).

We now construct the stationary policy $\mathbf{f}^{(1)*}$ ($\mathbf{f}^{(2)*}$) as follows: When in state $x \in \mathcal{C}_i$, $i \in H^{(1)}$ ($H^{(2)}$), apply $\mathbf{f}_i^{(1)}$ ($\mathbf{f}_i^{(2)}$); otherwise, apply $\mathbf{g}^{(1)}$ ($\mathbf{g}^{(2)}$). The main result is as follows:
Theorem 3
The stationary policy $\mathbf{f}^{(1)*}$ ($\mathbf{f}^{(2)*}$) is ε-optimal for $\psi(\mathbf{u})$ ($\nu(\mathbf{u})$).
Proof
Employing Eq. (14), it can be shown that

$$\beta^{(j)}(\mathbf{u}) = \sum_{i=1}^{I} t_{i}^{(j)}\, P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}$$

for all policies $\mathbf{u} \in U_f$ and $j = 1, 2$. Thus, from Lemma 1, we have

$$\psi(\mathbf{u}) \leq \beta^{(1)}(\mathbf{g}^{(1)})$$

for all policies $\mathbf{u} \in U_f$. From Lemma 2, we have

$$\nu(\mathbf{u}) \leq \beta^{(2)}(\mathbf{g}^{(2)})$$

for all policies $\mathbf{u}$. From Proposition 3 and the construction of $\mathbf{f}^{(1)*}$ and $\mathbf{f}^{(2)*}$, we have

$$\psi(\mathbf{f}^{(1)*}) = \sum_{i=1}^{I} \psi_{i}(\mathbf{f}_{i}^{(1)})\, P_{\mathbf{g}^{(1)}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}$$

and

$$\nu(\mathbf{f}^{(2)*}) = \sum_{i=1}^{I} \nu_{i}(\mathbf{f}_{i}^{(2)})\, P_{\mathbf{g}^{(2)}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}.$$
Combining the above equations with Theorems 1 and 2 gives the desired results.■
In order to construct the ε-optimal (respectively optimal) stationary (respectively pure) policy $\mathbf{f}^*$ for the constrained problem and for the expected variability criterion (the expected time-average variability criterion when $h^{(2)}(x, y) = x - \lambda(x - y)^2$, $\lambda > 0$), we can use the following procedure.

Step 1: Determine the strongly communicating classes $\mathcal{C}_i$, $i = 1, \ldots, I$.

Step 2: For the constrained problem (respectively the expected time-average variability criterion), solve Program $LT_i^{(1)}$ and obtain policies $\mathbf{f}_i^{(1)}$ and optimal values $t_i^{(1)}$ (respectively $LT_i^{(2)}$, $\mathbf{f}_i^{(2)}$, and $t_i^{(2)}$) for $i = 1, \ldots, I$.

Step 3: For the constrained problem (respectively the expected time-average variability criterion), solve the intermediate SMDP and obtain $\mathbf{g}^{(1)}$ and $H^{(1)}$ (respectively $\mathbf{g}^{(2)}$ and $H^{(2)}$). Then combine it with $\mathbf{f}_i^{(1)}$ ($\mathbf{f}_i^{(2)}$) for $i \in H^{(1)}$ ($H^{(2)}$) to get the ε-optimal (or optimal) policy $\mathbf{f}^{(1)*}$ ($\mathbf{f}^{(2)*}$), as sketched below.
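The combination in Step 3 simply patches the per-class policies together with the intermediate-SMDP policy on the remaining states. A minimal sketch under assumed inputs: `class_of` maps each state to its strongly communicating class index (or `None` for states in $\mathcal{T}$), `H` is the retained index set, `f_i` a dict of per-class decision rules, and `g` the intermediate-SMDP decision rule.

```python
import numpy as np

def combine_policies(class_of, H, f_i, g):
    """Build f*: in state x in C_i with i in H apply f_i; otherwise apply g."""
    f_star = np.array(g, dtype=float, copy=True)
    for x, i in enumerate(class_of):
        if i is not None and i in H:
            f_star[x] = f_i[i][x]   # f_i[i] is an (S, A) rule; only rows in C_i are used
    return f_star
```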
7. CONCLUSIONS

In this article, we first considered the expected time-average reward $\psi(\mathbf{u})$ subject to a sample-path constraint on the time-average cost. In general, there exists an ε-optimal stationary policy that can be obtained from the decomposition algorithm outlined in Section 6. If the SMDP is unichain, then the policy is optimal for the constrained problem. The optimal (ε-optimal) policy can be found for unichain (respectively communicating) SMDPs from the algorithm presented in Section 5.

Then we considered the expected time-average variability $\nu(\mathbf{u})$. In general, there exists an ε-optimal stationary policy that can be obtained from the decomposition algorithm outlined in Section 6. If $h(x, y) = x - \lambda(x - y)^2$ with $\lambda > 0$, then there exists an optimal pure policy that can again be obtained from the decomposition algorithm; moreover, in this case, each restricted SMDP can be solved with a parametric LP. For general $h(\cdot,\cdot)$, an optimal (ε-optimal) policy can be found for unichain (respectively communicating) SMDPs by solving the mathematical program $LQ_0^{(2)}$ (respectively the mathematical programs $LQ_\eta^{(2)}$, $\eta \geq 0$).
Multiple Constraints
Multiple sample-path constraints can be handled by the theory presented above and in [32]; they were omitted in order to simplify the notation. Such constraints would be introduced as

$$P_{\mathbf{u}}\!\left\{\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}^{k}\,ds \leq \alpha_{k}^{(1)}\right\} = 1$$

for all $k = 1, \ldots, K$, where $C_s^k$ denotes the $k$th cost function at time $s$. To incorporate these constraints, the programs $T_i^{(j)}$, $Q_\eta^{(j)}$, $LT_i^{(j)}$, and $LQ_\eta^{(j)}$ should be modified accordingly. One can see that all of the results in Sections 3, 4, and 6 continue to hold. However, note that except in the unichain case, for general SMDPs, the existence of a feasible stationary policy is not implied by the nonemptiness of $Q_0^{(1)}$ when there is more than one constraint. Thus, Theorem 1 should be altered, similarly to [32], as below.
Theorem 4
Suppose that the SMDP is communicating. If there exists a policy $\mathbf{u}$ and a $\delta > 0$ such that

$$P_{\mathbf{u}}\!\left\{\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}^{k}\,ds \leq \alpha_{k}^{(1)} - \delta\right\} = 1$$
for all k = 1, … , K, then for any ε > 0, there exists an ε-optimal stationary policy for the sample-path criterion.
Since the modified program $LQ_0^{(1)}$ is an LP with $|\mathcal{S}| + K$ linearly independent constraints, one can see that the number of additional actions that an ε-optimal policy uses in communicating SMDP problems is equal to the number of constraints.
Acknowledgments
The first author's research was supported by the NSF under grant No. NCR-9110105. The first author would like to thank K.W. Ross for introducing this problem and for his valuable comments. The authors acknowledge with gratitude the insightful comments and suggestions by an anonymous referee that improved the presentation substantially.