1. INTRODUCTION
We consider semi-Markov decision processes (SMDPs) with finite state and action spaces. Let $R_s$ be the reward function at time $s$; $R_s$ can be an impulse function corresponding to the reward earned immediately at a transition epoch, or it can be a step function between transition epochs corresponding to the rate of reward. The great majority of the literature in this area is concerned with finding a policy $\mathbf{u}$ that maximizes

$$\phi_{1}(\mathbf{u}) \triangleq \liminf_{t \to \infty} \frac{1}{t}\, E_{\mathbf{u}}\!\left[\int_{0}^{t} R_{s}\,ds\right]. \tag{1}$$

$\phi_1$ denotes the average expected reward [10, 11, 20, 22, 27, 35, 37]. The following alternative to $\phi_1$ is given by Jewell [23], Ross [30, 31], and Mine and Osaki [28] as

$$\phi_{2}(\mathbf{u}) \triangleq \liminf_{n \to \infty} \frac{E_{\mathbf{u}}\!\left[\sum_{m=1}^{n} R_{m}\right]}{E_{\mathbf{u}}[T_{n}]}, \tag{2}$$

where $R_m$ denotes the reward earned between the $(m-1)$st and the $m$th epochs and $T_m$ denotes the $m$th transition time. The performance measure $\phi_2$ is also used by other researchers (see, e.g., [6, 15–17, 21, 22, 29]). In [18], $\phi_2$ is referred to as the ratio-average reward. A sufficient condition for these two definitions to coincide under stationary policies is that every stationary policy generate a semi-Markov chain with only one irreducible class [28, 30].

Although $\phi_1$ is clearly the more appealing criterion, it is easier to write the optimality equations when establishing the existence of an optimal pure policy under criterion $\phi_2$ [34, 35, 39]. On the other hand, for finite-state and finite-action SMDPs there exists an optimal pure policy under $\phi_1$ [11, 34, 39], whereas such an optimal policy might not exist under $\phi_2$ in a general multichain SMDP [24].

Even though there is considerable research on nonstandard criteria for average reward Markov decision processes (MDPs), the same cannot be claimed for average reward SMDPs. A variance-type objective function for discrete-time MDPs has been studied (see, e.g., [3, 7, 19, 38]). Constraints have been introduced for average reward MDPs (see, e.g., [1, 4, 13, 14, 25, 32, 33, 37]). For average reward SMDPs, only the constrained problem has been investigated [5, 16, 18]. Beutler and Ross [5, 6] considered the ratio-average reward with a constraint under a condition stronger than the unichain condition. In [18], Feinberg examined the problem of maximizing both $\phi_1$ and $\phi_2$ subject to a number of constraints. Under the condition that the initial distribution is fixed, he showed that for both criteria there exist optimal mixed stationary policies when an associated linear program (LP) is feasible. Mixed stationary policies are policies with an initial one-step randomization applied to a set of pure policies; obviously, such a policy is not stationary. Feinberg provided a linear programming algorithm for the unichain SMDP under both criteria. However, there remains a need for an efficient algorithm that locates an optimal or ε-optimal stationary policy for communicating and multichain SMDPs under $\phi_1$ with constraints.
In this article we study the following criterion:
$$\psi(\mathbf{u}) \triangleq E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds\right] \tag{3}$$
subject to the sample path constraint
$$P_{\mathbf{u}}\!\left\{\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds \leq \alpha\right\} = 1, \tag{4}$$

where $C_s$ denotes the cost function at time $s$. Fatou's lemma immediately implies that $\psi(\mathbf{u}) \leq \phi_1(\mathbf{u})$ holds for all policies. We prove, however, that for a large class of policies the two rewards are equal. We show that an ε-optimal randomized stationary policy can be obtained for the general SMDP, whereas such a policy might not exist for the expectation problem.

We also consider the problem of locating a policy that maximizes, over all policies $\mathbf{u}$, the following expected average reward:

$$\nu(\mathbf{u}) \triangleq E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds\right], \tag{5}$$

where $h(\cdot,\cdot)$ is a function of the current reward at time $s$ and the average reward over an interval that includes time $s$. Throughout, we assume that $h(\cdot,\cdot)$ is a continuous function. We will refer to $\nu(\mathbf{u})$ as the expected time-average variability. We show that an ε-optimal stationary policy can be obtained for the general SMDP. If $h(x, y) = x - \lambda(x - y)^2$, then the optimal policy is a pure policy. Note that, in this case, maximizing $\nu(\mathbf{u})$ corresponds to maximizing the expected average reward penalized by the expected average variability.

This article is organized as follows. In Section 2 we introduce the notation. In Section 3 we present preliminary results, which will be used in the subsequent sections, and summarize known facts about the decomposition and sample-path theory. In Section 4, the mathematical programs that will be utilized are constructed and upper bounds for the expected average reward and the expected variability are established. Communicating SMDPs are investigated in Section 5, and it is shown that there exist ε-optimal stationary policies for both criteria. Multichain SMDPs are considered in Section 6, an intermediate problem is introduced, and the algorithm to locate ε-optimal stationary policies is given. Finally, we conclude in Section 7 with a brief discussion of the sample-path problem with multiple constraints.
2. NOTATION
Denote $\{X_m,\ m \geq 0\}$ for the state process, which takes values in a finite state space $\mathcal{S}$. At each epoch $m$, the decision-maker chooses an action $A_m$ from the finite action space $\mathcal{A}$. The sojourn time between the $(m-1)$st and the $m$th epochs is a random variable denoted by $\Upsilon_m$. The underlying sample space $\Omega = \{\mathcal{S} \times \mathcal{A} \times (0, \infty)\}^{\infty}$ consists of all possible realizations of states, actions, and transition times. Throughout, the sample space will be equipped with the σ-algebra generated by the random variables $\{X_m, A_m, \Upsilon_{m+1};\ m \geq 0\}$. Denote $P_{xay}$, $x \in \mathcal{S}$, $a \in \mathcal{A}$, $y \in \mathcal{S}$, for the law of motion of the process; that is, for all policies $\mathbf{u}$ and all epochs $m$,

$$P_{\mathbf{u}}\{X_{m+1} = y \mid X_{0}, A_{0}, \ldots, X_{m} = x, A_{m} = a\} = P_{xay}.$$

Also, conditioned on the event that the next state is $y$, $\Upsilon_{m+1}$ has distribution function $F_{xay}(\cdot)$; that is,

$$P_{\mathbf{u}}\{\Upsilon_{m+1} \leq t \mid X_{0}, A_{0}, \Upsilon_{1}, \ldots, X_{m} = x, A_{m} = a, X_{m+1} = y\} = F_{xay}(t).$$

Assume that $F_{xay}(0) < 1$.
The process $\{S_t, B_t : t \geq 0\}$, where $S_t$ is the state of the process at time $t$ and $B_t$ is the action taken at time $t$, is referred to as the SMDP. Let $T_n = \sum_{m=1}^{n} \Upsilon_m$. For $t \in [T_m, T_{m+1})$, clearly

$$S_{t} = X_{m}, \qquad B_{t} = A_{m}.$$

A policy is called stationary if the decision rule at each epoch depends only on the present state of the process; denote $f_{xa}$ for the probability of choosing action $a$ when in state $x$. A stationary policy is said to be pure if for each $x \in \mathcal{S}$ there is only one action $a \in \mathcal{A}$ such that $f_{xa} = 1$. Let $U$, $F$, and $G$ denote the set of all policies, stationary policies, and pure policies, respectively.

Under a stationary policy $\mathbf{f}$, the state process $\{S_t : t \geq 0\}$ is a semi-Markov process, and the process $\{X_m : m \in \mathcal{N}\}$ is the embedded Markov chain with transition probabilities

$$P_{xy}(\mathbf{f}) = \sum_{a \in \mathcal{A}} P_{xay}\, f_{xa}.$$

Clearly, the process $\{S_t, B_t : t \geq 0\}$ is also a semi-Markov process under a stationary policy $\mathbf{f}$, with the embedded Markov chain $\{X_m, A_m : m \in \mathcal{N}\}$.

Under a stationary policy $\mathbf{f}$, state $x$ is recurrent if and only if $x$ is recurrent in the embedded Markov chain; similarly, $x$ is transient if and only if $x$ is transient for the embedded Markov chain. An SMDP is said to be unichain if the embedded Markov chain for each pure policy is unichain [i.e., if the transition matrix $P(\mathbf{g})$ has at most one recurrent class plus a (perhaps empty) set of transient states for all pure policies $\mathbf{g}$]. Similarly, an SMDP is said to be communicating if $P(\mathbf{f})$ is irreducible for all stationary policies that satisfy $f_{xa} > 0$ for all $x \in \mathcal{S}$, $a \in \mathcal{A}$.
Let $\tau(x, a)$ denote the expected sojourn time,

$$\begin{aligned}
\tau(x, a) &\triangleq E_{\mathbf{u}}[\Upsilon_{m} \mid X_{m-1} = x, A_{m-1} = a] \\
&= \int_{0}^{\infty} \sum_{y \in \mathcal{S}} P_{\mathbf{u}}\{X_{m} = y,\ \Upsilon_{m} > t \mid X_{m-1} = x, A_{m-1} = a\}\,dt \\
&= \int_{0}^{\infty} \left[1 - \sum_{y \in \mathcal{S}} P_{xay}\, F_{xay}(t)\right] dt.
\end{aligned}$$
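In practice, the last integral can be evaluated numerically once the primitives of the model are specified. The following is a minimal sketch, assuming the law of motion is stored in a hypothetical array `P[x, a, y]` and the conditional sojourn-time distributions are available as a hypothetical callable `F(x, a, y, t)`; the integral is truncated at a user-chosen horizon.

```python
import numpy as np

def expected_sojourn_times(P, F, t_max=200.0, n_grid=4000):
    """tau(x, a) = int_0^inf [1 - sum_y P_xay F_xay(t)] dt, truncated at t_max.

    P : (S, A, S) array with the law of motion P_xay.
    F : callable F(x, a, y, t) returning the conditional sojourn-time cdf.
    Choose t_max large enough that the sojourn time exceeds it with
    negligible probability.
    """
    S, A, _ = P.shape
    grid = np.linspace(0.0, t_max, n_grid)
    tau = np.zeros((S, A))
    for x in range(S):
        for a in range(A):
            # mixture cdf of the sojourn time out of (x, a)
            cdf = np.zeros_like(grid)
            for y in range(S):
                if P[x, a, y] > 0.0:
                    cdf += P[x, a, y] * np.array([F(x, a, y, t) for t in grid])
            tau[x, a] = np.trapz(1.0 - cdf, grid)  # integrate the survival function
    return tau
```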
Let $W_t(x, a)$ denote the random variables representing the state-action intensities,

$$W_{t}(x, a) \triangleq \frac{1}{t} \int_{0}^{t} \mathbf{1}\{(S_{s}, B_{s}) = (x, a)\}\,ds,$$

where $\mathbf{1}\{\cdot\}$ denotes the indicator function. Let $U_0$ denote the class of all policies $\mathbf{u}$ such that $\{W_t(x, a);\ t \geq 0\}$ converges $P_{\mathbf{u}}$-almost surely ($P_{\mathbf{u}}$-a.s.) for all $x \in \mathcal{S}$ and $a \in \mathcal{A}$. Thus, for $\mathbf{u} \in U_0$, there exist random variables $\{W(x, a)\}$ such that

$$\lim_{t \to \infty} W_{t}(x, a) = W(x, a),$$

$P_{\mathbf{u}}$-a.s. for all $x$ and $a$. Let $U_1$ be the class of all policies $\mathbf{u}$ such that the expected state-action intensities $\{E_{\mathbf{u}}[W_t(x, a)];\ t \geq 0\}$ converge for all $x$ and $a$. For $\mathbf{u} \in U_1$, denote

$$w_{\mathbf{u}}(x, a) = \lim_{t \to \infty} E_{\mathbf{u}}[W_{t}(x, a)].$$

By Lebesgue's dominated convergence theorem, $U_0 \subseteq U_1$.
A well-known result from renewal theory (see Çinlar [9]) is that if $\{Y_t = (S_t, B_t) : t \geq 0\}$ is a homogeneous semi-Markov process and the embedded Markov chain is unichain, then the proportion of time spent in state $y$, that is,

$$\lim_{t \to \infty} \frac{1}{t} \int_{0}^{t} \mathbf{1}\{Y_{s} = y\}\,ds,$$

exists almost surely. Since under a stationary policy $\mathbf{f}$ the process $\{Y_t = (S_t, B_t) : t \geq 0\}$ is a homogeneous semi-Markov process, if the embedded Markov chain is unichain, then the limit of $W_t(x, a)$ as $t$ goes to infinity exists and the proportion of time spent in state $x$ when action $a$ is applied is given as

$$W(x, a) = \lim_{t \to \infty} W_{t}(x, a) = \frac{\tau(x, a)\, Z(x, a)}{\sum_{x, a} \tau(x, a)\, Z(x, a)},$$

$P_{\mathbf{f}}$-a.s. for all $x$ and $a$, where $Z(x, a)$ denotes the associated state-action frequencies. Let $\{z_{\mathbf{f}}(x, a);\ x \in \mathcal{S}, a \in \mathcal{A}\}$ denote the expected state-action frequencies; that is,

$$z_{\mathbf{f}}(x, a) = \lim_{n \to \infty} E_{\mathbf{f}}\, \frac{1}{n} \sum_{m=1}^{n} \mathbf{1}\{X_{m-1} = x, A_{m-1} = a\} = \pi_{x}(\mathbf{f})\, f_{xa},$$

where $\pi_x(\mathbf{f})$ is the steady-state distribution of the embedded Markov chain $P(\mathbf{f})$.

The long-run average number of transitions per unit time into state $x$ with action $a$ applied is

$$\nu_{\mathbf{f}}(x, a) = \frac{\pi_{x}(\mathbf{f})\, f_{xa}}{\sum_{x, a} \tau(x, a)\, \pi_{x}(\mathbf{f})\, f_{xa}} = \frac{z_{\mathbf{f}}(x, a)}{\sum_{x, a} \tau(x, a)\, z_{\mathbf{f}}(x, a)}. \tag{6}$$

This gives $w_{\mathbf{f}}(x, a) = \tau(x, a)\, \nu_{\mathbf{f}}(x, a)$.
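To make these relations concrete, the sketch below computes, for a stationary policy whose embedded chain is unichain, the steady-state distribution $\pi(\mathbf{f})$, the expected state-action frequencies $z_{\mathbf{f}}$, the transition rates $\nu_{\mathbf{f}}$ of Eq. (6), and the time proportions $w_{\mathbf{f}} = \tau\,\nu_{\mathbf{f}}$. The arrays `P`, `f`, and `tau` are assumed inputs (law of motion, randomized decision rule, and expected sojourn times); this is only an illustrative sketch under the unichain assumption.

```python
import numpy as np

def stationary_quantities(P, f, tau):
    """P: (S, A, S) law of motion; f: (S, A) decision rule; tau: (S, A) sojourn means."""
    S, A, _ = P.shape
    P_f = np.einsum('xay,xa->xy', P, f)           # embedded transition matrix P(f)
    # solve pi P(f) = pi together with the normalization sum(pi) = 1
    M = np.vstack([P_f.T - np.eye(S), np.ones((1, S))])
    b = np.zeros(S + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(M, b, rcond=None)
    z = pi[:, None] * f                           # z_f(x, a) = pi_x(f) f_xa
    nu = z / np.sum(tau * z)                      # Eq. (6): transitions per unit time
    w = tau * nu                                  # long-run time proportions
    return pi, z, nu, w
```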
The decision-maker earns an immediate reward $R(X_m, A_m)$ and a reward with rate $r(X_m, A_m)$ until the $(m+1)$st epoch. Thus,

$$R_{m+1} = R(X_{m}, A_{m}) + r(X_{m}, A_{m})\,\Upsilon_{m+1}$$

is the reward earned during the $(m+1)$st transition [5, 36]. Similarly, there is an immediate cost $C(X_m, A_m)$ and a cost with rate $c(X_m, A_m)$, with

$$C_{m+1} = C(X_{m}, A_{m}) + c(X_{m}, A_{m})\,\Upsilon_{m+1}.$$

Hence, at any epoch, if the process is in state $x \in \mathcal{S}$ and action $a \in \mathcal{A}$ is chosen, then the reward earned during this epoch is represented by $\bar{r}(x, a) \triangleq R(x, a) + r(x, a)\,\tau(x, a)$. Similarly, the cost during this epoch is represented by $\bar{c}(x, a) \triangleq C(x, a) + c(x, a)\,\tau(x, a)$.
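In code, these per-epoch quantities are immediate once $\tau(x, a)$ is available; here `R`, `r`, `C`, `c` are assumed $(S, A)$ arrays of the immediate and rate rewards and costs (a sketch only):

```python
def per_epoch_reward_and_cost(R, r, C, c, tau):
    """rbar(x, a) = R(x, a) + r(x, a) tau(x, a); cbar(x, a) analogously."""
    rbar = R + r * tau
    cbar = C + c * tau
    return rbar, cbar
```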
We conclude this section with a fact that will be used in the subsequent theorems. It follows directly from the law of large numbers for martingale differences (see, e.g., Loève [26]): for all policies $\mathbf{u} \in U$, if $\sum_{m=1}^{\infty} (\operatorname{var} \Upsilon_m)/m^2 < \infty$, then

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \left[d(X_{m-1}, A_{m-1})\,\Upsilon_{m} - d(X_{m-1}, A_{m-1})\,\tau(X_{m-1}, A_{m-1})\right] = 0 \tag{7}$$

holds $P_{\mathbf{u}}$-a.s., with $d(\cdot,\cdot)$ an arbitrary bounded function on $\mathcal{S} \times \mathcal{A}$.
Thus, we need the following assumption on the sojourn times.
Assumption 1
For all policies u ∈ U,
$$\sum_{m=1}^{\infty} \frac{\operatorname{var} \Upsilon_{m}}{m^{2}} < \infty.$$

This condition is essentially equivalent to the assumption that $E_{\mathbf{u}}[\Upsilon_m^2 \mid X_{m-1} = x, A_{m-1} = a] < \infty$ for all $x \in \mathcal{S}$ and $a \in \mathcal{A}$.
3. PRELIMINARY RESULTS
In this section we establish some facts that will be used later in the analysis. Proposition 1 shows that the expected average reward (average cost) can be written in terms of the expected state-action frequencies $\{z_{\mathbf{u}}(x, a)\}$ ($\{Z(x, a)\}$).
Proposition 1
Assume that the SMDP is unichain. For any policy $\mathbf{u} \in F$, the expected average reward and the average cost are given, respectively, as

$$\psi(\mathbf{u}) = \frac{\sum_{x, a} \bar{r}(x, a)\, z_{\mathbf{u}}(x, a)}{\sum_{x', a'} \tau(x', a')\, z_{\mathbf{u}}(x', a')} = \sum_{x, a} \bar{r}(x, a)\, \nu_{\mathbf{u}}(x, a) \tag{8}$$

and

$$\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds = \frac{\sum_{x, a} \bar{c}(x, a)\, Z(x, a)}{\sum_{x', a'} \tau(x', a')\, Z(x', a')}, \tag{9}$$

$P_{\mathbf{u}}$-a.s.
Proof
Fix a policy $\mathbf{u} \in F$. Equation (8) is written as

$$\begin{aligned}
\psi(\mathbf{u}) &= E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds\right] \\
&= E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t}\left[\sum_{m=0}^{n(t)} R(X_{m}, A_{m}) + \sum_{m=0}^{n(t)-1} r(X_{m}, A_{m})\,\Upsilon_{m+1} + (t - T_{n(t)})\, r(X_{n(t)}, A_{n(t)})\right]\right] \\
&= E_{\mathbf{u}}\!\left[\liminf_{t \to \infty} \frac{1}{t} \sum_{x, a}\left[R(x, a) \sum_{m=0}^{n(t)} \mathbf{1}\{X_{m} = x, A_{m} = a\} + r(x, a)\,\tau(x, a) \sum_{m=0}^{n(t)-1} \mathbf{1}\{X_{m} = x, A_{m} = a\}\right]\right] \\
&= \sum_{x, a} R(x, a)\,\nu_{\mathbf{u}}(x, a) + \sum_{x, a} r(x, a)\,\tau(x, a)\,\nu_{\mathbf{u}}(x, a),
\end{aligned}$$

where $n(t) \triangleq \max\{m : T_m \leq t\}$ denotes the number of transitions up to time $t$. Note that as $t$ goes to infinity, so does $n(t)$; thus, the last term in the second equality goes to zero as $t$ goes to infinity. Equation (9) is obtained similarly.■
Now, consider the expected time-average variability $\nu(\mathbf{u})$. The following proposition shows that the time-average variability can also be expressed in terms of the long-run state-action frequencies $\{Z(x, a)\}$. Let $\Psi_t$ denote the time-average reward random variable (r.v.) up to time $t$:

$$\Psi_{t} \triangleq \frac{1}{t} \int_{0}^{t} R_{s}\,ds$$
and

$$\Psi \triangleq \sum_{x, a} \bar{r}(x, a)\, V(x, a),$$

where $V(x, a) \triangleq Z(x, a)\big/\sum_{x', a'} \tau(x', a')\, Z(x', a')$ denotes the long-run (random) rate of transitions into $(x, a)$ per unit time, the sample-path analogue of $\nu_{\mathbf{f}}(x, a)$ in Eq. (6).
Proposition 2
For all $\mathbf{u} \in U_0$,

$$\begin{aligned}
&\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds \\
&\quad = \left(\sum_{x, a} h\!\left[\bar{r}(x, a),\ \frac{\sum_{y, b} \bar{r}(y, b)\, Z(y, b)}{\sum_{x', a'} \tau(x', a')\, Z(x', a')}\right] Z(x, a)\right) \times \left[\sum_{x', a'} \tau(x', a')\, Z(x', a')\right]^{-1},
\end{aligned}$$

$P_{\mathbf{u}}$-a.s. If $h(x, y) = x - \lambda(x - y)^2$, then for $\mathbf{u} \in U_0$ we have

$$\nu(\mathbf{u}) = \psi(\mathbf{u}) - \lambda \lim_{t \to \infty} \frac{1}{t} \int_{0}^{t} E_{\mathbf{u}}\!\left[R_{s} - \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right]^{2} ds.$$
Proof
Fix a policy $\mathbf{u} \in U_0$. As in Proposition 1, it is straightforward to establish that

$$\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h(R_{s}, \Psi)\,ds = \sum_{x, a} h[\bar{r}(x, a), \Psi]\, V(x, a),$$

$P_{\mathbf{u}}$-a.s. The rest of the proof follows from Proposition 1 in [3].■

The proof of the following proposition, which gives $\psi$ and $\nu$ for the multichain case, is straightforward.
Proposition 3
Let $\mathbf{f}$ be a stationary policy and let $\mathcal{R}_1, \ldots, \mathcal{R}_q$ be the recurrent classes associated with $P(\mathbf{f})$. Denote $(\pi_x^i(\mathbf{f});\ x \in \mathcal{R}_i)$ for the equilibrium probability vector associated with class $i$, $i = 1, \ldots, q$. Further, denote

$$\psi_{i}(\mathbf{f}) = \frac{\sum_{x, a} \bar{r}(x, a)\, \pi_{x}^{i}(\mathbf{f})\, f_{xa}}{\sum_{y, b} \tau(y, b)\, \pi_{y}^{i}(\mathbf{f})\, f_{yb}}. \tag{10}$$

Then

$$\psi(\mathbf{f}) = \sum_{i=1}^{q} P_{\mathbf{f}}\{X_{n} \in \mathcal{R}_{i}\ \text{a.s.}\}\, \psi_{i}(\mathbf{f}) \tag{11}$$

and

$$\nu(\mathbf{f}) = \sum_{i=1}^{q} P_{\mathbf{f}}\{X_{n} \in \mathcal{R}_{i}\ \text{a.s.}\} \sum_{x, a} \frac{h[\bar{r}(x, a), \psi_{i}(\mathbf{f})]\, \pi_{x}^{i}(\mathbf{f})\, f_{xa}}{\sum_{y, b} \tau(y, b)\, \pi_{y}^{i}(\mathbf{f})\, f_{yb}}. \tag{12}$$

If the SMDP is unichain, then

$$\nu(\mathbf{f}) = \sum_{x, a} \frac{h[\bar{r}(x, a), \psi(\mathbf{f})]\, \pi_{x}(\mathbf{f})\, f_{xa}}{\sum_{y, b} \tau(y, b)\, \pi_{y}(\mathbf{f})\, f_{yb}}. \tag{13}$$

Note that Schäl [34] showed in Lemma 2.7 that for finite-state, finite-action multichain SMDPs under a pure policy, $\phi_1$ is equivalently given by Eqs. (10) and (11), which define the expected average reward $\psi$.
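As an illustration of Eqs. (8) and (13) in the unichain case, the sketch below evaluates $\psi(\mathbf{f})$ and, for the particular choice $h(x, y) = x - \lambda(x - y)^2$, the variability criterion $\nu(\mathbf{f})$ from the expected state-action frequencies $z_{\mathbf{f}}(x, a) = \pi_x(\mathbf{f})\, f_{xa}$. The arrays `zf`, `rbar`, `tau` and the scalar `lam` are assumed inputs.

```python
import numpy as np

def unichain_psi_and_nu(zf, rbar, tau, lam):
    """zf, rbar, tau: (S, A) arrays of z_f(x,a), rbar(x,a), tau(x,a); lam >= 0."""
    denom = np.sum(tau * zf)                 # expected sojourn time per transition
    psi = np.sum(rbar * zf) / denom          # Eq. (8)
    h = rbar - lam * (rbar - psi) ** 2       # h(x, y) = x - lam (x - y)^2 at y = psi
    nu = np.sum(h * zf) / denom              # Eq. (13)
    return psi, nu
```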
Decomposition and Sample Path Theory
The following notation will be used in the subsequent sections. A set $\mathcal{C} \subseteq \mathcal{S}$ is said to be a strongly communicating class if (1) $\mathcal{C}$ is a recurrent class for some stationary policy and (2) $\mathcal{C}$ is not a proper subset of some $\mathcal{C}'$ for which (1) holds. Let $\{\mathcal{C}_1, \ldots, \mathcal{C}_I\}$ be the collection of all strongly communicating classes. Let $\mathcal{T}$ be the (possibly empty) set of states that are transient under all stationary policies. It is shown in [33] that $\{\mathcal{C}_1, \ldots, \mathcal{C}_I, \mathcal{T}\}$ forms a partition of the state space $\mathcal{S}$. These decomposition ideas were first introduced by Bather [2]. For each $i = 1, \ldots, I$, denote, for each $x \in \mathcal{C}_i$, the set

$$\mathcal{F}_{x} = \{a \in \mathcal{A} : P_{xay} = 0\ \text{for all}\ y \notin \mathcal{C}_{i}\}.$$

The following result is also proved in [33].
Proposition 4
For all policies u,
$$\sum_{i=1}^{I} P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\} = 1 \tag{14}$$
and
$$P_{\mathbf{u}}\{A_{n} \in \mathcal{F}_{X_{n}}\ \text{a.s.}\} = 1. \tag{15}$$
For each $i = 1, \ldots, I$, define a new SMDP, called SMDP-$i$, as follows: The state space is $\mathcal{C}_i$; for each $x \in \mathcal{C}_i$, the set of available actions is given by the state-dependent action space $\mathcal{F}_x$; the law of motion $P_{xay}$, the conditional sojourn-time distribution $F_{xay}(\cdot)$, the reward function $\bar{r}(x, a)$, and the cost function $\bar{c}(x, a)$ are the same as earlier but restricted to $\mathcal{C}_i$ and $\mathcal{F}_x$ for $x \in \mathcal{C}_i$. Each SMDP-$i$, $i = 1, \ldots, I$, is communicating. For each SMDP-$i$, let $\nu_i(\mathbf{u})$ be the expected average variability under policy $\mathbf{u}$.
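Given a strongly communicating class $\mathcal{C}_i$, the restricted action sets $\mathcal{F}_x$ defining SMDP-$i$ follow directly from the law of motion. A minimal sketch, assuming `P` is an $(S, A, S)$ array and `C_i` a collection of state indices:

```python
import numpy as np

def restricted_action_sets(P, C_i):
    """F_x = {a : P_xay = 0 for all y outside C_i}, for each x in C_i."""
    S, A, _ = P.shape
    outside = np.array([y for y in range(S) if y not in set(C_i)], dtype=int)
    F = {}
    for x in C_i:
        if outside.size == 0:
            F[x] = list(range(A))      # every action keeps the process in C_i
        else:
            F[x] = [a for a in range(A) if np.all(P[x, a, outside] == 0.0)]
    return F
```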
4. OPTIMIZATION RESULTS
In the constrained problem, say $T^{(1)}$, we seek to maximize the expected average reward $\psi(\mathbf{u})$ [Eq. (3)] over the policies that satisfy the sample-path constraint [Eq. (4)]. Let $U_f$ denote the class of feasible policies. The optimal constrained average reward is given as

$$\psi^{*} = \sup_{\mathbf{u} \in U_{f}} \psi(\mathbf{u}).$$

A policy $\mathbf{u}^* \in U_f$ is said to be constrained average optimal if $\psi(\mathbf{u}^*) = \psi^*$. A policy $\mathbf{u} \in U_f$ is said to be ε-average optimal if $\psi(\mathbf{u}) > \psi^* - \epsilon$. The second problem, $T^{(2)}$, maximizes the expected time-average variability [Eq. (5)]. Let

$$\nu^{*} = \sup_{\mathbf{u} \in U} \nu(\mathbf{u}).$$

A policy $\mathbf{u}^*$ is optimal for $\nu(\cdot)$ if $\nu(\mathbf{u}^*) = \nu^*$. An ε-optimal policy for $\nu(\cdot)$ is defined as a policy $\mathbf{u}$ such that $\nu(\mathbf{u}) > \nu^* - \epsilon$.

Note that by choosing $\alpha$ to be sufficiently large, the unconstrained problem can be viewed as a special case of the constrained optimization problem. Also, by choosing $h^{(1)}(x, y) = x$, we have $\nu(\mathbf{u}) = \psi(\mathbf{u})$. Thus, in what follows we will present the general problem of maximizing $\nu^{(j)}(\mathbf{u})$ subject to the sample-path constraint (4), where $j = 1$ corresponds to the constrained problem with $\nu^{(1)}(\mathbf{u}) = \psi(\mathbf{u})$ and $j = 2$ corresponds to the expected average variability with $\nu^{(2)}(\mathbf{u}) = \nu(\mathbf{u})$ and $\alpha^{(2)} = \infty$.

For each $j = 1, 2$ and $i = 1, \ldots, I$, consider the following fractional program with decision variables $z(x, a)$, $x \in \mathcal{C}_i$, $a \in \mathcal{F}_x$. Let $\delta_{xy} = 1$ if $x = y$ and $\delta_{xy} = 0$ otherwise.
Program $T_i^{(j)}$

$$t_{i}^{(j)} = \max \left\{ \frac{\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} h^{(j)}\!\left[\bar{r}(x, a),\ \dfrac{\sum_{y \in \mathcal{C}_{i}, b \in \mathcal{F}_{y}} \bar{r}(y, b)\, z(y, b)}{\sum_{x' \in \mathcal{C}_{i}, a' \in \mathcal{F}_{x'}} \tau(x', a')\, z(x', a')}\right] z(x, a)}{\sum_{x' \in \mathcal{C}_{i}, a' \in \mathcal{F}_{x'}} \tau(x', a')\, z(x', a')} \right\} \tag{16}$$

$$\text{s.t.}\quad \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} (\delta_{xy} - P_{xay})\, z(x, a) = 0, \qquad y \in \mathcal{C}_{i}, \tag{17}$$

$$\sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} z(x, a) = 1, \tag{18}$$

$$\frac{\sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} \bar{c}(x, a)\, z(x, a)}{\sum_{x' \in \mathcal{C}_{i}, a' \in \mathcal{F}_{x'}} \tau(x', a')\, z(x', a')} \leq \alpha^{(j)}, \tag{19}$$

$$z(x, a) \geq 0, \qquad x \in \mathcal{C}_{i},\ a \in \mathcal{F}_{x}. \tag{20}$$
For each $\eta \geq 0$, we will also need to refer to the following fractional program with decision variables $z(x, a)$ for all $x \in \mathcal{S}$, $a \in \mathcal{A}$.

Program $Q_\eta^{(j)}$

$$q_{\eta}^{(j)} = \max \left\{ \frac{\sum_{x \in \mathcal{S}, a \in \mathcal{A}} h^{(j)}\!\left[\bar{r}(x, a),\ \dfrac{\sum_{y \in \mathcal{S}, b \in \mathcal{A}} \bar{r}(y, b)\, z(y, b)}{\sum_{x', a'} \tau(x', a')\, z(x', a')}\right] z(x, a)}{\sum_{x', a'} \tau(x', a')\, z(x', a')} \right\} \tag{21}$$

$$\text{s.t.}\quad \sum_{x \in \mathcal{S}, a \in \mathcal{A}} (\delta_{xy} - P_{xay})\, z(x, a) = 0, \qquad y \in \mathcal{S}, \tag{22}$$

$$\sum_{x \in \mathcal{S}, a \in \mathcal{A}} z(x, a) = 1, \tag{23}$$

$$\frac{\sum_{x \in \mathcal{S}, a \in \mathcal{A}} \bar{c}(x, a)\, z(x, a)}{\sum_{x' \in \mathcal{S}, a' \in \mathcal{A}} \tau(x', a')\, z(x', a')} \leq \alpha^{(j)}, \tag{24}$$

$$z(x, a) \geq \eta, \qquad x \in \mathcal{S},\ a \in \mathcal{A}. \tag{25}$$
We will refer to the feasible regions of Program $T_i^{(j)}$ and Program $Q_\eta^{(j)}$ simply as $T_i^{(j)}$ and $Q_\eta^{(j)}$, respectively. Note that the objective functions of both sets of mathematical programs are continuous functions over polytopes. If the cost constraint is satisfied by some $\{z(x, a)\}$, then $T_i^{(1)}$, $i = 1, \ldots, I$, and $Q_0^{(1)}$ are nonempty; $T_i^{(2)}$, $i = 1, \ldots, I$, and $Q_0^{(2)}$ are always nonempty. For a given solution $\{z(x, a)\}$, we will write

$$z(x) = \sum_{a} z(x, a).$$
First, we consider the constrained problem $T^{(1)}$ given by Eqs. (3) and (4). Thus, use $h^{(1)}(x, y) = x$ in Eqs. (16) and (21). The following lemmas provide bounds on $\psi(\mathbf{u})$ and $\nu(\mathbf{u})$. The proof of Lemma 2 is similar to the proof of Lemma 1; thus, only an outline of the proof will be given.
Lemma 1
If $U_f$ is nonempty, then for all $i = 1, \ldots, I$, $T_i^{(1)}$ is nonempty, and for $\mathbf{u} \in U_f$,

$$P_{\mathbf{u}}\!\left\{\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds \leq t_{i}^{(1)} \,\Big|\, X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\right\} = 1 \tag{26}$$

and, consequently,

$$\psi(\mathbf{u}) \leq \sum_{i=1}^{I} t_{i}^{(1)}\, P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}. \tag{27}$$
Proof
Fix a policy $\mathbf{u} \in U_f$. Let $\Gamma$ be the set of all sample paths $\omega = (x_0, a_0, \tau_1, x_1, a_1, \tau_2, \ldots)$ that satisfy the following:

(i) $a_n \in \mathcal{F}_{x_n}$ for all $n \geq N$, for some positive integer $N$;

(ii) $\sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} P_{xay}\, Z(x, a) = \sum_{a \in \mathcal{A}} Z(y, a)$ for all $y \in \mathcal{S}$;

(iii) $\limsup_{t \to \infty} (1/t) \int_0^t C_s\,ds \leq \alpha^{(1)}$.

Combining Eq. (14) with Eq. (7), where $d(\cdot,\cdot) = 1$ and $\Upsilon_m = \mathbf{1}\{X_m = y\}$, and the fact that $\mathbf{u}$ is feasible yields

$$P_{\mathbf{u}}(\Gamma) = 1.$$
Let $(x_0, a_0, \tau_1, x_1, a_1, \tau_2, \ldots) \in \{X_n \in \mathcal{C}_i\ \text{a.s.}\} \cap \Gamma$ and define

$$Z_{n}(x, a) \triangleq \frac{1}{n} \sum_{m=1}^{n} \mathbf{1}\{X_{m-1} = x, A_{m-1} = a\}.$$

Since $0 \leq Z_n(x, a) \leq 1$, by the standard compactness argument there exists a subsequence $\{N_k(\omega)\}$ along which $\{Z_n(x, a; \omega)\}$ converges to some $Z'(x, a; \omega)$ on $\Phi = \{X_n \in \mathcal{C}_i\ \text{a.s.}\} \cap \Gamma$; that is,

$$\lim_{k \to \infty} Z_{N_{k}}(x, a) = Z'(x, a). \tag{28}$$
By definition, it follows that
$$Z'(x, a) = 0 \quad \text{whenever}\ x \notin \mathcal{C}_{i}\ \text{or}\ a \notin \mathcal{F}_{x}$$
on the set Φ. Thus, on Φ,
$$\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} P_{xay}\, Z'(x, a) = \sum_{a \in \mathcal{F}_{y}} Z'(y, a), \qquad \forall\, y \in \mathcal{C}_{i},$$
and
$$\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} Z'(x, a) = 1, \qquad Z'(x, a) \geq 0, \quad \forall\, x \in \mathcal{C}_{i},\ a \in \mathcal{F}_{x}.$$

Observe that, on $\Phi$, for any bounded function $d(\cdot,\cdot)$,

$$\begin{aligned}
&\left|\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} d(X_{m-1}, A_{m-1})\,\Upsilon_{m} - \sum_{x, a} d(x, a)\,\tau(x, a)\, Z'(x, a)\right| \\
&\qquad \leq \left|\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} \left[d(X_{m-1}, A_{m-1})\,\Upsilon_{m} - d(X_{m-1}, A_{m-1})\,\tau(X_{m-1}, A_{m-1})\right]\right| \\
&\qquad\quad + \left|\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} d(X_{m-1}, A_{m-1})\,\tau(X_{m-1}, A_{m-1}) - \sum_{x, a} d(x, a)\,\tau(x, a)\, Z'(x, a)\right|,
\end{aligned}$$

which, combined with Eqs. (7) and (28), gives

$$\lim_{k \to \infty} \frac{1}{N_{k}} \sum_{m=1}^{N_{k}} d(X_{m-1}, A_{m-1})\,\Upsilon_{m} = \sum_{x, a} d(x, a)\,\tau(x, a)\, Z'(x, a).$$

From this equation the following holds:

$$\lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} C_{m} = \lim_{k \to \infty} \frac{\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} \left[C(X_{m-1}, A_{m-1}) + c(X_{m-1}, A_{m-1})\,\Upsilon_{m}\right]}{\frac{1}{N_{k}} \sum_{m=1}^{N_{k}} \Upsilon_{m}} = \frac{\sum_{x, a} \bar{c}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)}.$$
Also, on Φ,
$$\alpha^{(1)} \geq \limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds \geq \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} C_{m} = \frac{\sum_{x, a} \bar{c}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)}.$$

Thus, $\{Z'(x, a)\}$ is in the feasible set, implying that $T_i^{(1)}$ is nonempty. Hence, on $\Phi$,

$$\frac{\sum_{x, a} \bar{r}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)} \leq t_{i}^{(1)}.$$

In a similar manner,

$$\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds \leq \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} R_{m} = \frac{\sum_{x, a} \bar{r}(x, a)\, Z'(x, a)}{\sum_{x, a} \tau(x, a)\, Z'(x, a)},$$

which gives the desired result. Combining Eq. (26) with Proposition 4 gives Eq. (27).■

Next, we consider the expected time-average variability criterion.
Lemma 2
For all i = 1, … , I and for all policies u, we have
$$P_{\mathbf{u}}\!\left\{\liminf_{t \to \infty} \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds \leq t_{i}^{(2)} \,\Big|\, X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\right\} = 1 \tag{29}$$
and, consequently,
$$\nu(\mathbf{u}) \leq \sum_{i=1}^{I} t_{i}^{(2)}\, P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}. \tag{30}$$
Proof
The proof is similar to the proof of Lemma 1. We only need to note that
$$\begin{aligned}
\liminf_{t \to \infty}\ & \frac{1}{t} \int_{0}^{t} h\!\left(R_{s}, \frac{1}{t} \int_{0}^{t} R_{q}\,dq\right) ds \\
&\leq \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} h\!\left(R_{m}, \frac{1}{T_{N_{k}}} \sum_{l=1}^{N_{k}} R_{l}\right) \\
&= \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{m=1}^{N_{k}} h\!\left(R_{m}, \lim_{k \to \infty} \frac{1}{T_{N_{k}}} \sum_{l=1}^{N_{k}} R_{l}\right) \\
&= \frac{\sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} h\!\left[\bar{r}(x, a),\ \dfrac{\sum_{y \in \mathcal{C}_{i}} \sum_{b \in \mathcal{F}_{y}} \bar{r}(y, b)\, Z'(y, b)}{\sum_{x' \in \mathcal{C}_{i}} \sum_{a' \in \mathcal{F}_{x'}} \tau(x', a')\, Z'(x', a')}\right] Z'(x, a)}{\sum_{x' \in \mathcal{C}_{i}} \sum_{a' \in \mathcal{F}_{x'}} \tau(x', a')\, Z'(x', a')},
\end{aligned}$$
on Φ.■
5. THE COMMUNICATING CASE

We assume that the SMDP is communicating. This implies that there is only one strongly communicating class and that $\mathcal{S} = \mathcal{C}_1$. The analysis of this section draws on results and observations from [32].

In this section we will show that, in general, an optimal stationary policy need not exist for either criterion. Instead, we show that an ε-optimal stationary policy can be constructed. First, we consider the constrained problem, Eqs. (3) and (4). Let $h^{(1)}(x, y) = x$ in Eqs. (16) and (21).
Proposition 5
Fix $\eta \geq 0$ and let $\{z^{\eta}(x, a)\}$ be an optimal extreme point for $Q_\eta^{(1)}$. Define a policy $\mathbf{f}^{\eta}$ by the transformation

$$f^{\eta}_{xa} = \begin{cases} \dfrac{z^{\eta}(x, a)}{z^{\eta}(x)} & \text{if}\ z^{\eta}(x) > 0, \\[1.5ex] \text{uniform over the actions} & \text{otherwise.} \end{cases} \tag{31}$$

Then

$$\sum_{x} z^{\eta}(x)\, P_{xy}(\mathbf{f}^{\eta}) = z^{\eta}(y), \tag{32}$$

$$\sum_{x} z^{\eta}(x) = 1. \tag{33}$$

If $P(\mathbf{f}^{\eta})$ is unichain, then $\mathbf{f}^{\eta} \in U_f$ and $P_{\mathbf{f}^{\eta}}\{\liminf_{t \to \infty} (1/t) \int_0^t R_s\,ds = q_{\eta}^{(1)}\} = 1$. In particular, if $P(\mathbf{f}^{0})$ is unichain, then $\mathbf{f}^{0}$ is an optimal stationary policy for the constrained problem.
Proof
It is straightforward to show Eqs. (32) and (33). If $P(\mathbf{f}^{\eta})$ is unichain, there is a unique probability vector $\pi(\mathbf{f}^{\eta})$ associated with $P(\mathbf{f}^{\eta})$. Hence, $\pi_x(\mathbf{f}^{\eta}) = z^{\eta}(x)$, giving, $P_{\mathbf{f}^{\eta}}$-almost surely,

$$\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}\,ds = \frac{\sum_{x, a} \bar{c}(x, a)\, \pi_{x}(\mathbf{f}^{\eta})\, f^{\eta}_{xa}}{\sum_{x, a} \tau(x, a)\, \pi_{x}(\mathbf{f}^{\eta})\, f^{\eta}_{xa}} = \frac{\sum_{x, a} \bar{c}(x, a)\, z^{\eta}(x, a)}{\sum_{x, a} \tau(x, a)\, z^{\eta}(x, a)} \leq \alpha^{(1)}.$$

In a similar manner, we have, $P_{\mathbf{f}^{\eta}}$-a.s.,

$$\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} R_{s}\,ds = \frac{\sum_{x, a} \bar{r}(x, a)\, z^{\eta}(x, a)}{\sum_{x, a} \tau(x, a)\, z^{\eta}(x, a)} = q_{\eta}^{(1)}. \qquad \blacksquare$$
Only the outline of the proof of the following theorem will be given, since it follows the proofs of Propositions 5–7 in [32].

Theorem 1

Suppose that the SMDP is communicating. Then $U_f$ is nonempty if and only if $Q_0^{(1)}$ is nonempty. If $Q_0^{(1)}$ is nonempty, then for each ε > 0, there exists an ε-optimal stationary policy for the constrained problem.
Proof
Proposition 5 proves the (only if) part. To prove the (if) part, assume that $\{z^0(x, a)\}$ is an optimal extreme point of $Q_0^{(1)}$. Let $\mathbf{f}^0$ be the policy obtained via transformation (31). It follows from Eq. (32) that the set of states where $z^0(x) > 0$ is a closed set, and by Lemma 2 of [32], all states outside of this closed set are transient. This closed set is the union of $m$ recurrent classes $R_1, \ldots, R_m$ associated with $P(\mathbf{f}^0)$. For each recurrent class, we can define

$$d_{k} = \frac{\sum_{x \in R_{k}} \sum_{a} \bar{c}(x, a)\, z^{0}(x, a)}{\sum_{x \in R_{k}} \sum_{a} \tau(x, a)\, z^{0}(x, a)}.$$

The value $d_k$ has the interpretation of being the average cost per unit time, given that the process has entered $R_k$. Let $l = \arg\min_{1 \leq k \leq m} d_k$. Then, since $\{z^0(x, a)\}$ is feasible for $Q_0^{(1)}$, we have $d_l \leq \alpha^{(1)}$. Since $d_k$ can be greater than $\alpha^{(1)}$ for some $k$, $\mathbf{f}^0$ does not necessarily belong to $U_f$. However, we can define a stationary policy $\tilde{\mathbf{f}}$ that is equal to $\mathbf{f}^0$ in $R_l$ and outside $R_l$ takes every available action with equal probability. Clearly, since the SMDP is communicating, $R_l$ is the only recurrent class associated with $P(\tilde{\mathbf{f}})$ and $\tilde{\mathbf{f}}$ is in $U_f$. Thus, $U_f$ is nonempty.

For the second part of the theorem, we assume that $Q_0^{(1)}$ is nonempty. Using the machinery developed in [32], whenever there exists a policy that strictly meets the constraint, one can construct a feasible stationary policy that chooses every action with positive probability and gives rise to an irreducible Markov chain. Otherwise, the stationary policy $\mathbf{f}^0$ given by transformation (31) gives rise to a unichain $P(\mathbf{f}^0)$; thus, $\mathbf{f}^0$ is the optimal policy.

Thus, we assume that there exists a policy that strictly meets the constraint. In this case, there exists a $\zeta > 0$ such that for each $\eta$ satisfying $0 < \eta < \zeta$, there is a feasible solution for $Q_\eta^{(1)}$, and $P(\mathbf{f}^{\eta})$ is irreducible for $\mathbf{f}^{\eta}$ obtained via transformation (31). From Proposition 5, we have $\mathbf{f}^{\eta} \in U_f$ and $P_{\mathbf{f}^{\eta}}\{\liminf_{t \to \infty} (1/t) \int_0^t R_s\,ds = q_{\eta}^{(1)}\} = 1$. To prove that $\lim_{\eta \to 0} q_{\eta}^{(1)} = q_0^{(1)}$, we can transform the fractional program into a linear program using transformation (6) for $\nu_{\mathbf{f}^{\eta}}$ [8, 12]. Then the desired continuity holds by the piecewise linearity and convexity of the objective function with respect to the right-hand-side value $\eta / \sum_{x, a} \tau(x, a)$.■
Next, we present the mathematical programs obtained via transformation (6) explicitly, in terms of the decision variables $\nu(x, a)$.

Program $LT_i^{(j)}$

$$\begin{aligned}
t_{i}^{(j)} = \max\ & \sum_{x \in \mathcal{C}_{i}} \sum_{a \in \mathcal{F}_{x}} h^{(j)}\!\left[\bar{r}(x, a),\ \sum_{y \in \mathcal{C}_{i}, b \in \mathcal{F}_{y}} \bar{r}(y, b)\,\nu(y, b)\right] \nu(x, a) \\
\text{s.t.}\ & \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} (\delta_{xy} - P_{xay})\,\nu(x, a) = 0, \qquad y \in \mathcal{C}_{i}, \\
& \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} \tau(x, a)\,\nu(x, a) = 1, \\
& \sum_{x \in \mathcal{C}_{i}, a \in \mathcal{F}_{x}} \bar{c}(x, a)\,\nu(x, a) \leq \alpha^{(j)}, \\
& \nu(x, a) \geq 0, \qquad x \in \mathcal{C}_{i},\ a \in \mathcal{F}_{x}.
\end{aligned}$$
For each η ≥ 0, we also define the following program with decision variables ν(x, a), x ∈ 𝒮, a ∈ 𝒜.
Program $LQ_\eta^{(j)}$

$$\begin{aligned}
q_{\eta}^{(j)} = \max\ & \sum_{x \in \mathcal{S}, a \in \mathcal{A}} h^{(j)}\!\left[\bar{r}(x, a),\ \sum_{y \in \mathcal{S}, b \in \mathcal{A}} \bar{r}(y, b)\,\nu(y, b)\right] \nu(x, a) \\
\text{s.t.}\ & \sum_{x \in \mathcal{S}, a \in \mathcal{A}} (\delta_{xy} - P_{xay})\,\nu(x, a) = 0, \qquad y \in \mathcal{S}, \\
& \sum_{x \in \mathcal{S}, a \in \mathcal{A}} \tau(x, a)\,\nu(x, a) = 1, \\
& \sum_{x \in \mathcal{S}, a \in \mathcal{A}} \bar{c}(x, a)\,\nu(x, a) \leq \alpha^{(j)}, \\
& \nu(x, a) \geq \eta, \qquad x \in \mathcal{S},\ a \in \mathcal{A}.
\end{aligned}$$
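For $j = 1$, where $h^{(1)}(x, y) = x$, Program $LQ_\eta^{(1)}$ is an ordinary linear program in the variables $\nu(x, a)$ and can be handed to any LP solver. The sketch below uses scipy; the arrays `P`, `rbar`, `cbar`, `tau` and the scalars `alpha`, `eta` are assumed inputs, and for $j = 2$ the objective is no longer linear, so this routine does not apply.

```python
import numpy as np
from scipy.optimize import linprog

def solve_LQ_eta_1(P, rbar, cbar, tau, alpha, eta=0.0):
    """LQ_eta^(1): maximize sum rbar*nu s.t. flow balance, sum tau*nu = 1,
    sum cbar*nu <= alpha, nu >= eta.  Returns (nu, value) or (None, None)."""
    S, A, _ = P.shape
    n = S * A                                    # variables nu(x, a), flattened row-wise
    A_eq = np.zeros((S + 1, n))
    for y in range(S):                           # flow-balance row for each state y
        for x in range(S):
            for a in range(A):
                A_eq[y, x * A + a] = (1.0 if x == y else 0.0) - P[x, a, y]
    A_eq[S, :] = tau.reshape(-1)                 # normalization: sum tau(x,a) nu(x,a) = 1
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0
    res = linprog(-rbar.reshape(-1),             # maximize by minimizing the negative
                  A_ub=cbar.reshape(1, -1), b_ub=np.array([alpha]),
                  A_eq=A_eq, b_eq=b_eq,
                  bounds=[(eta, None)] * n, method="highs")
    if not res.success:
        return None, None                        # infeasible for this eta
    return res.x.reshape(S, A), -res.fun         # nu(x, a) and q_eta^(1)
```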
Now we can present the following procedure to locate optimal or near-optimal policies for the constrained problem.

Step 1: Solve the LP $LQ_0^{(1)}$ by the simplex method. If $LQ_0^{(1)}$ is not feasible, then there does not exist a policy that meets the sample-path constraint; stop. Otherwise, go to Step 2.

Step 2: Let $\{\nu^0(x, a)\}$ be an optimal extreme point for the LP $LQ_0^{(1)}$ and let $\mathbf{f}^0$ be the corresponding stationary policy obtained via transformation (31). If $P(\mathbf{f}^0)$ is unichain, then $\mathbf{f}^0$ is an optimal stationary policy; stop. Otherwise, go to Step 3.

Step 3: Solve the parametric LP $LQ_\eta^{(1)}$, $\eta \geq 0$, over some interval $[0, \delta]$, beginning with $\eta = 0$. Then employ transformation (31) to obtain an ε-optimal stationary policy for ε as small as desired; a sketch of this policy-recovery step follows.
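The policy recovery in Steps 2 and 3 is transformation (31); since $\nu(x, a)$ is proportional to $z(x, a)$ within each state, the same normalization applies to a solution of $LQ_\eta^{(1)}$. A small sketch, with the array `nu` as an assumed input:

```python
import numpy as np

def policy_from_solution(nu):
    """Transformation (31): f_xa = nu(x,a)/nu(x) if nu(x) > 0, else uniform over actions."""
    S, A = nu.shape
    f = np.full((S, A), 1.0 / A)        # uniform by default
    nu_x = nu.sum(axis=1)
    positive = nu_x > 0
    f[positive] = nu[positive] / nu_x[positive, None]
    return f
```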
For the second criterion, the right-hand side of the cost constraint is infinite; that is, $\alpha^{(2)} = \infty$, and the objective function is $\nu(\mathbf{u})$. We have the following lemma, which follows easily from the invariance of the steady-state distribution.
Lemma 3
Let $\{z(x, a)\}$ be a feasible solution for Program $LQ_0^{(2)}$ and let $\mathbf{f}$ be defined as in Eq. (31). If $P(\mathbf{f})$ is unichain, then

$$\nu(\mathbf{f}) = \frac{\sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} h^{(2)}\!\left[\bar{r}(x, a),\ \dfrac{\sum_{x, a} \bar{r}(x, a)\, z(x, a)}{\sum_{x, a} \tau(x, a)\, z(x, a)}\right] z(x, a)}{\sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} \tau(x, a)\, z(x, a)}.$$

Since the SMDP is communicating, $Q_\eta^{(2)}$, and consequently the feasible region of Program $LQ_\eta^{(2)}$, is nonempty for all $\eta \in [0, \delta]$ for some $\delta > 0$. Now, for each $\eta$, let $\nu^{\eta}$ be an optimal solution to Program $LQ_\eta^{(2)}$. If there is an optimal extreme-point solution to Program $LQ_0^{(2)}$, further require $\nu^0$ to be an extreme point. For each $\eta \in [0, \delta]$, let $\mathbf{f}^{\eta}$ be defined from $\nu^{\eta}$ according to the transformation given in Eq. (31).
Theorem 2
Fix ε > 0. If the SMDP is communicating, then for $\eta > 0$ sufficiently small, the stationary policy $\mathbf{f}^{\eta}$ is ε-optimal for $\nu(\mathbf{u})$. If, in addition, $h^{(2)}(x, y) = x - \lambda(x - y)^2$ with $\lambda > 0$, then the policy $\mathbf{f}^0$ is the optimal pure policy for the expected average variability criterion.
Proof
Noting that the objective function of Program $LQ_\eta^{(2)}$ is continuous over the feasible region of Program $LQ_0^{(2)}$, the proof follows from the proof of Theorem 1 in [3].■
6. MULTICHAIN SMDPs

In this section we impose no restrictions on the law of motion $P_{xay}$, $x \in \mathcal{S}$, $a \in \mathcal{A}$, $y \in \mathcal{S}$. We now construct ε-optimal stationary policies for the constrained problem and for the expected average variability problem. Since the arguments are similar for both criteria, we will present the combined results. The construction of the optimal policy follows closely the developments for the MDP problem in [33]; thus, we will only give outlines of the proofs.

Recall that each SMDP-$i$ is communicating. By Theorems 1 and 2 we can construct an ε-optimal stationary policy $\mathbf{f}_i^{(j)}$ for each SMDP-$i$, $i = 1, \ldots, I$, and for either criterion, $j = 1, 2$. Recall that $t_i^{(j)}$ is the value of Program $LT_i^{(j)}$. We make the following modification to $t_i^{(1)}$ in the constrained problem: $t_i^{(1)}$ is assigned to each communicating class $i$ whenever Program $LT_i^{(1)}$ has a feasible solution; if there does not exist any feasible policy for Program $LT_i^{(1)}$, then $t_i^{(1)} = -\infty$ is assigned to discourage the process from entering class $\mathcal{C}_i$.
Consider the problem of finding a policy that maximizes the following time-average expected reward for each criterion:
$$\beta^{(j)}(\mathbf{u}) = \liminf_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} E_{\mathbf{u}}\!\left[\sum_{i=1}^{I} t_{i}^{(j)}\, \mathbf{1}\{X_{m-1} \in \mathcal{C}_{i}\}\right].$$

This problem is referred to as the intermediate SMDP. At this stage, the decision-maker decides which communicating class generates the maximum reward while satisfying the constraint. It is known that there exists an optimal pure policy $\mathbf{g}^{(j)}$ for each criterion that can be found by policy improvement, value iteration, or linear programming. Let

$$H^{(j)} = \{i : \mathcal{C}_{i}\ \text{contains a recurrent class under}\ P(\mathbf{g}^{(j)})\}.$$

Modify $\mathbf{g}^{(j)}$ so that $\mathcal{C}_i$ is closed for each $i \in H^{(j)}$ and so that $\mathbf{g}^{(j)}$ remains optimal for the intermediate problem (see [33]).

We now construct the stationary policy $\mathbf{f}^{(1)*}$ ($\mathbf{f}^{(2)*}$) as follows: When in state $x \in \mathcal{C}_i$, $i \in H^{(1)}$ ($H^{(2)}$), apply $\mathbf{f}_i^{(1)}$ ($\mathbf{f}_i^{(2)}$); otherwise, apply $\mathbf{g}^{(1)}$ ($\mathbf{g}^{(2)}$). The main result is as follows:
Theorem 3
The stationary policy $\mathbf{f}^{(1)*}$ ($\mathbf{f}^{(2)*}$) is ε-optimal for $\psi(\mathbf{u})$ ($\nu(\mathbf{u})$).
Proof
Employing Eq. (14), it can be shown that

$$\beta^{(j)}(\mathbf{u}) = \sum_{i=1}^{I} t_{i}^{(j)}\, P_{\mathbf{u}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}$$

for all policies $\mathbf{u} \in U_f$ and $j = 1, 2$. Thus, from Lemma 1, we have

$$\psi(\mathbf{u}) \leq \beta^{(1)}(\mathbf{g}^{(1)})$$

for all policies $\mathbf{u} \in U_f$. From Lemma 2, we have

$$\nu(\mathbf{u}) \leq \beta^{(2)}(\mathbf{g}^{(2)})$$

for all policies $\mathbf{u}$. From Proposition 3 and the construction of $\mathbf{f}^{(1)*}$ and $\mathbf{f}^{(2)*}$, we have

$$\psi(\mathbf{f}^{(1)*}) = \sum_{i=1}^{I} \psi_{i}(\mathbf{f}_{i}^{(1)})\, P_{\mathbf{g}^{(1)}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}$$

and

$$\nu(\mathbf{f}^{(2)*}) = \sum_{i=1}^{I} \nu_{i}(\mathbf{f}_{i}^{(2)})\, P_{\mathbf{g}^{(2)}}\{X_{n} \in \mathcal{C}_{i}\ \text{a.s.}\}.$$
Combining the above equations with Theorems 1 and 2 gives the desired results.■
In order to construct the ε-optimal (respectively optimal) stationary (respectively pure) policy $\mathbf{f}^*$ for the constrained problem and for the expected variability criterion (the expected time-average variability criterion when $h^{(2)}(x, y) = x - \lambda(x - y)^2$, $\lambda > 0$), we can use the following procedure.

Step 1: Determine the strongly communicating classes $\mathcal{C}_i$, $i = 1, \ldots, I$.

Step 2: For the constrained problem (respectively the expected time-average variability criterion), solve Program $LT_i^{(1)}$ and obtain policies $\mathbf{f}_i^{(1)}$ and optimal values $t_i^{(1)}$ (respectively $LT_i^{(2)}$, $\mathbf{f}_i^{(2)}$, and $t_i^{(2)}$) for $i = 1, \ldots, I$.

Step 3: For the constrained problem (respectively the expected time-average variability criterion), solve the intermediate SMDP and obtain $\mathbf{g}^{(1)}$ and $H^{(1)}$ (respectively $\mathbf{g}^{(2)}$ and $H^{(2)}$). Then combine it with $\mathbf{f}_i^{(1)}$ ($\mathbf{f}_i^{(2)}$) for $i \in H^{(1)}$ ($H^{(2)}$) to get the ε-optimal (or optimal) policy $\mathbf{f}^{(1)*}$ ($\mathbf{f}^{(2)*}$), as sketched below.
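The combination in Step 3 simply patches the per-class policies together with the intermediate-SMDP policy on the remaining states. A minimal sketch under assumed inputs: `class_of` maps each state to its strongly communicating class index (or `None` for states in $\mathcal{T}$), `H` is the retained index set, `f_i` a dict of per-class decision rules, and `g` the intermediate-SMDP decision rule.

```python
import numpy as np

def combine_policies(class_of, H, f_i, g):
    """Build f*: in state x in C_i with i in H apply f_i; otherwise apply g."""
    f_star = np.array(g, dtype=float, copy=True)
    for x, i in enumerate(class_of):
        if i is not None and i in H:
            f_star[x] = f_i[i][x]   # f_i[i] is an (S, A) rule; only rows in C_i are used
    return f_star
```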
7. CONCLUSIONS

In this article, we first considered the expected time-average reward $\psi(\mathbf{u})$ subject to a sample-path constraint on the time-average cost. In general, there exists an ε-optimal stationary policy that can be obtained from the decomposition algorithm outlined in Section 6. If the SMDP is unichain, then the policy is optimal for the constrained problem. The optimal (ε-optimal) policy can be found for unichain (respectively communicating) SMDPs from the algorithm presented in Section 5.

Then we considered the expected time-average variability $\nu(\mathbf{u})$. In general, there exists an ε-optimal stationary policy that can be obtained from the decomposition algorithm outlined in Section 6. If $h(x, y) = x - \lambda(x - y)^2$ with $\lambda > 0$, then there exists an optimal pure policy that can again be obtained from the decomposition algorithm; moreover, in this case, each restricted SMDP can be solved with a parametric LP. For general $h(\cdot,\cdot)$, an optimal (ε-optimal) policy can be found for unichain (respectively communicating) SMDPs by solving the mathematical program $LQ_0^{(2)}$ (respectively the mathematical programs $LQ_\eta^{(2)}$, $\eta \geq 0$).
Multiple Constraints
Multiple sample-path constraints can be handled by the theory presented above and in [32]; they were omitted in order to simplify the notation. Such constraints would be introduced as

$$P_{\mathbf{u}}\!\left\{\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}^{k}\,ds \leq \alpha_{k}^{(1)}\right\} = 1$$

for all $k = 1, \ldots, K$, where $C_s^k$ denotes the $k$th cost function at time $s$. To incorporate these constraints, the programs $T_i^{(j)}$, $Q_\eta^{(j)}$, $LT_i^{(j)}$, and $LQ_\eta^{(j)}$ should be modified accordingly. One can see that all of the results in Sections 3, 4, and 6 continue to hold. However, note that except in the unichain case, for general SMDPs, the existence of a feasible stationary policy is not implied by the nonemptiness of $Q_0^{(1)}$ when there is more than one constraint. Thus, Theorem 1 should be altered, similarly to [32], as below.
Theorem 4
Suppose that the SMDP is communicating. If there exists a policy $\mathbf{u}$ and a $\delta > 0$ such that

$$P_{\mathbf{u}}\!\left\{\limsup_{t \to \infty} \frac{1}{t} \int_{0}^{t} C_{s}^{k}\,ds \leq \alpha_{k}^{(1)} - \delta\right\} = 1$$
for all k = 1, … , K, then for any ε > 0, there exists an ε-optimal stationary policy for the sample-path criterion.
Since the modified program $LQ_0^{(1)}$ is an LP with $|\mathcal{S}| + K$ linearly independent constraints, one can see that the number of additional actions that an ε-optimal policy uses in communicating SMDP problems is equal to the number of constraints.
Acknowledgments
The first author's research was supported by the NSF under grant No. NCR-9110105. The first author would like to thank K.W. Ross for introducing this problem and for his valuable comments. The authors acknowledge with gratitude the insightful comments and suggestions by an anonymous referee that improved the presentation substantially.