
On gradual-impulse control of continuous-time Markov decision processes with exponential utility

Published online by Cambridge University Press:  01 July 2021

Xin Guo*
Affiliation:
Tsinghua University
Aiko Kurushima*
Affiliation:
Sophia University
Alexey Piunovskiy*
Affiliation:
University of Liverpool
Yi Zhang*
Affiliation:
University of Liverpool
*
*Postal address: School of Economics and Management, Tsinghua University, Beijing 100084, China. Email address: guoxin5@sem.tsinghua.edu.cn
**Postal address: Department of Economics, Sophia University, 7-1 Kioi-cho, Chiyoda-ku, Tokyo, 102-8554, Japan. Email address: kurushima@sophia.ac.jp
***Postal address: Department of Mathematical Sciences, University of Liverpool, Liverpool, L69 7ZL, UK.

Abstract

We consider a gradual-impulse control problem of continuous-time Markov decision processes, where the system performance is measured by the expectation of the exponential utility of the total cost. We show, under natural conditions on the system primitives, the existence of a deterministic stationary optimal policy out of a more general class of policies that allow multiple simultaneous impulses, randomized selection of impulses with random effects, and accumulation of jumps. After characterizing the value function using the optimality equation, we reduce the gradual-impulse control problem to an equivalent simple discrete-time Markov decision process, whose action space is the union of the sets of gradual and impulsive actions.

Type
Original Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

This paper considers a gradual-impulse control problem for continuous-time Markov decision processes (CTMDPs) with the performance to be minimized being the expected exponential utility of the total cost. In this model, the decision-maker can control the process gradually via its local characteristics (transition rate), and also has the option of affecting impulsively the state of the process. The system dynamics is depicted in Figure 1 below.

Figure 1. Illustration of the system dynamics in the gradual-impulse control problem, and how the policy acts on the system dynamics. Here $\textbf{X}=[0,\infty).$ The second coordinate indicates the impulse (including the ‘pseudo-impulse’ $\Delta$) used at that state, which is recorded in the first coordinate. At the initial time $t=\theta_1\equiv 0$, three impulses are applied in turn. The first jump in the indicated sample path of the marked point process $\{(T_n,Y_n)\}_{n=1}^\infty$ takes place at $t_2=\theta_2.$ It is triggered by a natural jump because $x_0'\ne x_3.$ Along the displayed sample path, the system state remains at $x_3$ before the first jump of the marked point process. The second jump of the marked point process is triggered by a planned (or let us say active) impulse, because $x_0''=x_0'$. Infinitely many impulses are applied at $t_3=t_2+\theta_3$, so that the process is ‘killed’ after the infinitely many impulses at $t_3$; i.e., $\omega=(y_0,0,y_1,\theta_2,y_2,\theta_3,y_3,\infty,\Delta,\infty,\Delta,\dots).$ Note also that, under the policy $u=\{u_{n}\}_{n=0}^\infty$ in Definition 3, $y_1\in \textbf{Y}_3$ is a realization from the distribution $u_0(\cdot|x_0)$, $\bar{x}(y_1)=x_3;$ $y_2\in \textbf{Y}_0$ is a realization from the distribution $\Gamma^0_1(\cdot|h_1,\theta_2,x_0')$, as the jump at $t_2$ is triggered by a natural jump, $\bar{x}(y_2)=x_0'$; and $y_3\in (\textbf{X}\times\textbf{A}^I)^\infty$ is a realization from the distribution $\Gamma_2^1(\cdot|h_2)$, as the jump at $t_3$ is not triggered by a natural jump, $\bar{x}(y_3)=\Delta.$

There is no lack of situations where an action can affect the state of the controlled process instantaneously. For example, in a susceptible–infected–recovered (SIR) epidemic model, the controller elaborates the immunization policy, affecting the transition rate from the susceptible to the infected population, as well as the isolation policy, which instantaneously reduces the number of infected individuals. Let us formulate another simple example, which contains some features motivating the present paper.

Example 1. A rat (or intruder) may invade the kitchen. For each time unit it remains alive in the kitchen, a constant cost of $l\ge 0$ is incurred. The rat spends an exponentially distributed amount of time with mean $\frac{1}{\mu}>0$ in the kitchen, and then goes outside and settles down in another house (and thus never returns). When the rat is in the kitchen, the housekeeper (defender) can decide to shoot at it, with a chance $p\in(0,1)$ of hitting and killing the rat. If the rat dodges, it remains in the kitchen. Each bullet costs $C>0$. Assume that the successive shootings are independent.

Let us mention some features of the above example. ‘Shoot’ is an impulse. The location of the rat is the state. The effect of an impulse on the post-impulse state is random, as the shooting may be dodged. Each time unit the rat is present in the kitchen is costly. Suppose the cost of an impulse is relatively low. Then it can happen that after one impulse, if the rat is still alive and in the kitchen, it is reasonable to immediately shoot again. This means one should allow multiple impulses at a single time moment in this problem. We will return to this problem in Example 2 below, which demonstrates the situations in which applying only one impulse is insufficient for optimality.

Most previous works on gradual-impulse control do not allow multiple simultaneous impulses at a single time moment; see [Reference Costa and Davis3, Reference Costa and Raymundo4, Reference Davis6, Reference Hordijk and van der Duyn Shouten14, Reference Miller, Miller and Stepanyan17, Reference Palczewski and Stettner18, Reference Van der Duyn Schouten21]. Extra conditions are imposed in these works to guarantee that there is no need to apply more than one impulse at a single time moment. Example 1 describes a situation that does not satisfy those conditions. Allowing only one impulse at a given time moment is convenient, because then there is only one state at each time moment, so that one can construct the process under control in the (original) state space of the gradual-impulse control problem. When there is only gradual control, it is convenient to construct the CTMDP using a marked point process with the mark space being the same as the state space of the original control problem. If multiple impulses were applied in a sequence at a single time moment, then there would be multiple states associated with the single time moment. If one wishes to construct the problem using a marked point process, then the mark space must be enlarged, so that a sequence of impulses applied at a single time moment and the post-impulse states are merged into a single ‘mark’, which will be called an intervention. Necessarily this leads to a more complicated marked point process, with each mark corresponding to a sample path of a discrete-time Markov decision process (DTMDP). This idea was employed and implemented in [Reference Dufour and Piunovskiy7].

Another way of rigorously constructing a gradual-impulse control problem of CTMDPs admitting multiple simultaneous impulses comes from [Reference Yushkevich24]. The idea is to keep the original state space, but to enlarge the time index $t\in[0,\infty)$ to $(n,t)$, with the first coordinate, roughly speaking, counting the number of impulses applied at time t. Consequently, several concepts about stochastic processes need to be extended.

In the present work, we follow the construction of [Reference Dufour and Piunovskiy7] but with more general control policies. Compared to the previous literature on impulse or gradual-impulse control problems of CTMDPs, to the best of our knowledge, we consider the most general setup: the policy allows relaxed gradual controls and randomized impulsive controls with randomized consequences, multiple simultaneous impulses are allowed, and accumulation of jumps of the process is not excluded. We study the gradual-impulse control problem of CTMDPs with the system performance measure being the expectation of the exponential utility of the total cost to be minimized. For risk-sensitive CTMDPs with gradual control only and total or average cost criteria, see e.g. [Reference Ghosh and Saha11, Reference Guo and Zhang12, Reference Kumar and Pal16, Reference Piunovski and Khametov19, Reference Wei22, Reference Zhang25]. In close relation to the present paper, the paper [Reference Bäuerle and Popp1] recently considered the risk-sensitive optimal stopping problem of a continuous-time Markov chain, which is a special impulse control problem but with a more general utility function.

The main optimality results of this paper are the following. We characterize the value function of the gradual-impulse control problem for CTMDPs in terms of the optimality equation, and show the existence of deterministic stationary optimal policies, under quite general and natural conditions compared to the literature. For example, the growth conditions on the gradual cost rates, the impulse cost functions, and the transition rate can be quite general. In comparison, only bounded transition and cost rates were allowed in [Reference Dufour and Piunovskiy7], which deals with a discounted problem with linear utility. The boundedness conditions guarantee that the Dynkin formula is applicable to the functions of interest in [Reference Dufour and Piunovskiy7], which is important for the argument therein.

The method of investigation in the present paper is different from that of [Reference Dufour and Piunovskiy7], but is closer to that of [Reference Zhang25], which studies a similar problem for CTMDPs but with gradual control only. Although both the present paper and [Reference Zhang25] follow the same idea of reducing the original problem to a DTMDP, the implementation for the gradual-impulse control problem is more involved. In particular, the connection between a strategy in the induced DTMDP and a policy in the gradual-impulse control problem, which is at the core of the justification of the reduction, becomes more delicate; see Subsection 4.2 below.

In the induced DTMDP, an action is a triplet, consisting of the time until the next time an impulse is applied (if no natural jump has occurred by then), the next impulse itself, and the decision rule for the selection of gradual controls. Apart from having a more complicated action space than the original problem, the induced DTMDP model is not so convenient. For example, it is not a semicontinuous model even if the system primitives of the gradual-impulse control problem satisfy the compactness–continuity conditions. Consequently, the existence of an optimal policy does not follow automatically from the reduction to this DTMDP. In this connection, we mention that Lemma 5.12 of [Reference Zhang25] is inaccurate unless further conditions are imposed therein. Here we incidentally show that the optimality results in [Reference Zhang25] remain correct in spite of that error. Accordingly, the second step in the investigation is to further reduce the DTMDP model to yet another one, which is a semicontinuous model, and with a simple action space (the union of the set of gradual actions and impulses). This second reduction is done based on the investigation of the optimality equation of the DTMDP obtained from the first reduction.

The rest of the paper is organized as follows. We present the rigorous construction of the controlled process and problem statement in Section 2. Section 3 consists of the main optimality results, whose proof is postponed to Section 5. The argument is based on the connection with a DTMDP model, which is introduced in Section 4. The paper is finished with a conclusion in Section 6. To improve readability, we summarize the relevant notions and facts about DTMDPs in the appendix.

Notation and conventions. In what follows, ${\mathcal{B}}(X)$ is the Borel $\sigma$-algebra of the topological space X, I stands for the indicator function, and $\delta_{x}(\cdot)$ is the Dirac measure concentrated on the singleton $\{x\},$ assumed to be measurable. A measure is $\sigma$-additive and $[0,\infty]$-valued. Here and below, unless stated otherwise, the term ‘measurability’ is always understood in the Borel sense. Throughout this paper, we adopt the conventions $\frac{0}{0}:=0,~0\cdot\infty:=0,~\frac{1}{0}:=+\infty,~\infty-\infty:=\infty.$ For each function f on X, we let $||f||:=\sup_{x\in X}|f(x)|.$

2. Model description and problem statement

2.1. System primitives of the gradual-impulse control problem

We describe the primitives of the model as follows. The state space is $\textbf{X}$, the space of gradual controls is $\textbf{A}^G$, and the space of impulsive controls is $\textbf{A}^I$. It is assumed that $\textbf{X}$, $\textbf{A}^G$, and $\textbf{A}^I$ are all Borel spaces, endowed with their Borel $\sigma$-algebras ${\mathcal B}(\textbf{X}),$ ${\mathcal B}(\textbf{A}^G)$, and ${\mathcal B}(\textbf{A}^I)$, respectively. The transition rate, on which the gradual control acts, is given by $q(dy|x,a)$, which is a signed kernel from $\textbf{X}\times\textbf{A}^G$, endowed with its Borel $\sigma$-algebra, to ${\mathcal B}(\textbf{X}),$ satisfying the following conditions:

\begin{align*}&q(\Gamma|x,a)\in[0,\infty) \quad \text{for each } \Gamma\in{\mathcal B}(\textbf{X}),\ x\notin \Gamma,\ a\in\textbf{A}^G;\\[3pt] &q(\textbf{X}|x,a)=0, \qquad x\in \textbf{X},\ a\in\textbf{A}^G;\\[3pt] &\bar{q}_x:=\sup_{a\in \textbf{A}^G}q_x(a)<\infty, \qquad x\in\textbf{X},\end{align*}

where $q_x(a):=-q(\{x\}|x,a)$ for each $(x,a)\in\textbf{X}\times\textbf{A}^G.$ For notational convenience, we introduce $\tilde{q}(dy|x,a):=q(dy\setminus\{x\}|x,a)$, for every $x\in\textbf{X}$, $a\in\textbf{A}^G$. If the current state is $x\in\textbf{X}$, and an impulsive control $b\in\textbf{A}^I$ is applied, then the state immediately following this impulse obeys the distribution given by $Q(dy|x,b)$, which is a stochastic kernel from $\textbf{X}\times\textbf{A}^I$ to ${\mathcal B}(\textbf{X}).$ Finally, given the current state $x\in\textbf{X}$, the cost rate of applying a gradual control $a\in \textbf{A}^G$ is $c^G(x,a)$, and the cost of applying an impulsive control $b\in\textbf{A}^I$ that leads to the post-impulse state $y\in\textbf{X}$ is $c^I(x,b,y)$, where $c^G$ and $c^I$ are $[0,\infty)$-valued measurable functions on $\textbf{X}\times\textbf{A}^G$ and $\textbf{X}\times\textbf{A}^I\times\textbf{X}$, respectively. Throughout this paper, we assume that $\textbf{A}^G$ and $\textbf{A}^I$ are compact Borel spaces. Without loss of generality we may regard $\textbf{A}^G$ and $\textbf{A}^I$ as two disjoint compact subsets of a Borel space $\tilde{\textbf{A}}=\textbf{A}^G\cup\textbf{A}^I$. Furthermore, we assume that

(1) \begin{eqnarray}\sup_{a\in\textbf{A}^G}c^G(x,a)<\infty \qquad \forall x\in\textbf{X}.\end{eqnarray}

The system dynamics in the gradual-impulse control problem of interest can be described as follows. In the absence of impulses, the system is just a controlled Markov pure jump process in the state space $\textbf{X}$, where the (gradual) control, selected from $\textbf{A}^G$, acts on the local characteristics of the process, leading to natural jumps. This is conveniently described as a marked point process, which consists of the pairs of subsequent jump moments and the post-jump states (marks). The mark space is thus $\textbf{X}$.

In the gradual-impulse control problem, we will again describe the system using a marked point process. However, when the decision-maker is allowed to apply a finite or countably infinite sequence of impulses from $\textbf{A}^I$ at a single time moment, with each impulse resulting in a post-impulse state, there will be a sequence of states in $\textbf{X}$ at a single time moment. Moreover, the order of the impulses and their resulting states are also relevant. Therefore, the marked point process we use now is in an enlarged mark space. More precisely, each mark contains a sequence of impulses applied at the same time moment, the state before the impulses are applied, and all the states resulting from these impulses. Each jump moment is triggered either by an impulse (or a sequence of impulses), or by a natural jump. A mark in this marked point process is referred to as an intervention. This term is naturally understandable when the mark consists of impulses. That said, we will also allow ‘interventions’ that do not contain any impulses, i.e., that consist of an empty sequence of impulses. This occurs when the decision-maker chooses not to apply any impulse immediately after a natural jump. In the rest of this section, following the method of [Reference Dufour and Piunovskiy7], we will elaborate on this idea and describe rigorously the continuous-time gradual-impulse control problem that we are studying. We begin by stating the precise definition of an intervention in the next subsection.

2.2. Definition and interpretation of an intervention

At the beginning of an intervention, the decision-maker chooses whether to apply an impulse, and which impulse to apply. If the current state is $x\in\textbf{X}$, after an impulse $b\in\textbf{A}^I$ is chosen, the new state, say $y\in\textbf{X}$, is instantaneously realized, following the distribution $Q(dy|x,b)$. Then, based on x, b, y, the decision-maker chooses the next impulse, if any at all, and so on. To be consistent, a cemetery point $\Delta\notin \textbf{A}^I\cup \textbf{X}$ is artificially fixed, which is chosen when the decision-maker decides not to apply any more impulses at the current instant; this leads to the post-impulse state, also denoted by $\Delta,$ which is absorbing, i.e., $Q(\{\Delta\}|\Delta,\Delta)\equiv 1.$ Therefore, an intervention is itself a sequential decision process. More precisely, an intervention can be regarded as a trajectory of the following DTMDP, which we refer to as the ‘intervention’ DTMDP model, to distinguish it from several other DTMDP models to appear subsequently.

Definition 1. The intervention DTMDP model is specified by the tuple $\{\mathbf{X}_{\Delta},\mathbf{A}^{I}_{\Delta},Q\}$, which is defined in terms of the primitives of the gradual-impulse control problem given in Subsection 2.1, where the state space is $\mathbf{X}_{\Delta}:=\mathbf{X}\cup\{\Delta\}$ with $\Delta\notin \textbf{X}\cup \textbf{A}^I$ being a cemetery point, the action space is $\mathbf{A}^{I}_{\Delta}:= \mathbf{A}^{I}\cup\{\Delta\},$ and the one-step transition probability from $\textbf{X}_\Delta\times\textbf{A}^I_\Delta$ to ${\mathcal B}(\textbf{X}_\Delta)$ is $Q(dy|x,b)$. Here we have accepted that $Q(\{\Delta\}|x,b):=1$ if $x=\Delta$ or $b=\Delta$.

Let the initial distribution in the intervention DTMDP be always concentrated on $\textbf{X}.$ Then its canonical sample space is $\mathbf{Y}:=\left(\bigcup_{k=0}^{\infty}\mathbf{Y}_k\right) \cup(\textbf{X}\times\textbf{A}^I)^\infty,$ where, for each $\infty>k\ge 1$,

\begin{equation*}\mathbf{Y}_k:=(\mathbf{X}\times \mathbf{A}^I)^k\times(\mathbf{X}\times\{\Delta\})\times(\{\Delta\}\times\{\Delta\})^\infty,\end{equation*}

and $\textbf{Y}_0:=(\mathbf{X}\times\{\Delta\})\times(\{\Delta\}\times\{\Delta\})^\infty.$ Here, if $y\in \textbf{Y}_k$, $\infty>k\ge 0$, then there are k impulses applied in the intervention y. Similarly, if $y\in (\textbf{X}\times\textbf{A}^I)^\infty,$ then there are infinitely many impulses applied in the intervention y.

Now we give the following definition.

Definition 2. An intervention is an element of $\textbf{Y}.$ In other words, $\textbf{Y}$ as defined above is the space of all interventions. It will be the mark space of the marked point process $\{(T_n,Y_n)\}$ introduced in the next subsection.

With the notation introduced above, we now reiterate, more rigorously than in the beginning of this subsection, the interpretation of an intervention. Given the current state $x\in \textbf{X}$, if the controller decides to use $\Delta$, then this means that no more impulses are used at this instant, and the intervention DTMDP is absorbed at $\Delta$ in the next step; if the controller decides to use an impulse $b\in\textbf{A}^I$, then the post-impulse state follows the distribution $Q(dy|x,b)$. At the next post-impulse state y, if $y=\Delta,$ then the only decision is $\Delta;$ if $y\ne \Delta,$ then the controller decides either to use no impulse, leading to the next post-impulse state $\Delta$, or to use an impulse $b'$, leading to the next post-impulse state, which follows the distribution given by $Q(\cdot|y,b')$, and so on. In other words, an intervention consists of a state and a finite or countable sequence of pairs of impulsive actions and the associated post-impulse states. In particular, no impulse is applied in an intervention if the intervention belongs to $\mathbf{Y}_{0}$; see Figure 1 and its caption for an example. Let

\begin{equation*}\mathbf{Y}^{*}:=\textbf{Y}\setminus \textbf{Y}_0=\left(\bigcup_{k=1}^{\infty}\mathbf{Y}_k\right) \cup (\textbf{X}\times\textbf{A}^I)^\infty\end{equation*}

be the set of interventions where some impulses are applied.
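
To make the structure of an intervention concrete, the following is a minimal simulation sketch (an illustration, not part of the model) in Python, using the primitives of Example 1: state 1 means the rat is in the kitchen, state 2 that it is dead or gone, and the simulated intervention strategy keeps shooting while in state 1. The names SHOOT and DELTA and the value of p are illustrative, and the infinite tail of $\Delta$'s in an element of $\textbf{Y}_k$ is truncated.

\begin{verbatim}
import random

DELTA, SHOOT, p = 'Delta', 'shoot', 0.5   # illustrative names/values

def Q(x, b):
    """Post-impulse kernel of Example 1: a shot kills with probability p."""
    if b == SHOOT and x == 1:
        return 2 if random.random() < p else 1
    return x

def sample_intervention(x):
    """Sample an element of Y_k: (x_0, b_0, x_1, ..., x_k, Delta, Delta, ...)."""
    y = [x]
    while y[-1] == 1:              # strategy: shoot while the rat is in the kitchen
        y += [SHOOT, Q(y[-1], SHOOT)]
    return y + [DELTA, DELTA]      # no further impulses: absorbed at the cemetery point

print(sample_intervention(1))     # e.g. [1, 'shoot', 1, 'shoot', 2, 'Delta', 'Delta']
\end{verbatim}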

In an intervention, locally, the selection of impulses (including the ‘pseudo-impulse’ $\Delta$) from $\textbf{A}^I_\Delta$ is governed by a strategy in the intervention DTMDP model. The adverb ‘locally’ is understood in comparison with the definition of a policy for the gradual-impulse control problem, as given in Definition 3 below, which governs the selection of impulsive controls as well as gradual controls, and is thus ‘global’. Let $\Xi$ be the set of (possibly randomized and history-dependent) strategies $\sigma$ in the intervention DTMDP. We refer the reader to the appendix for standard terminology related to DTMDPs. The way that a strategy in the intervention DTMDP model is incorporated into a policy in Definition 3 below is through its strategic measure. Let $\beta^\sigma(\cdot|x)$ denote the corresponding strategic measure of a strategy $\sigma$ of the intervention DTMDP, given the initial state $x\in\textbf{X}$. By the Ionescu-Tulcea theorem (see e.g. Proposition C.10 in [Reference Hernández-Lerma and Lasserre13]), the mapping $x\in\textbf{X}\rightarrow \beta^\sigma(\cdot|x)$ is measurable. Let $\mathcal{P}^\mathbf{Y}$ be the collection of all such stochastic kernels generated by some strategy $\sigma\in\Xi$, and $\mathcal{P}^\mathbf{Y}(x):=\{\beta^\sigma(\cdot|x): \sigma\in\Xi\}$ for each state $x\in \mathbf{X}$. Let

\begin{equation*}\mathcal{P}^{\mathbf{Y}^\ast}:=\{\beta(\cdot|\cdot)\in \mathcal{P}^{\textbf{Y}}:~\beta(\textbf{Y}^\ast|x)=1\ \forall~x\in\textbf{X}\},\end{equation*}

and for each $x\in\textbf{X},$

\begin{equation*}\mathcal{P}^\mathbf{Y^\ast}(x):=\{\beta(\cdot|x): \beta(\cdot|\cdot)\in \mathcal{P}^{\mathbf{Y}},~\beta(\textbf{Y}^\ast|x)=1\}.\end{equation*}

2.3. Construction of the controlled process

Let us now describe the promised marked point process $\{(T_{n},Y_{n})\}_{n=1}^\infty$ for the system dynamics of the gradual-impulse control problem, where the mark space is the space of interventions. Then the continuous-time process $\{\xi_t\}_{t\ge 0}$ under control is defined based on this marked point process.

Let $\mathbf{Y}_\Delta:=\mathbf{Y}\cup\{\Delta\},$

\begin{equation*}\Omega_0:=\mathbf{Y}\times(\{0\}\times \textbf{Y})\times(\{\infty\}\times\{\Delta\})^\infty,\end{equation*}

and

\begin{equation*} \Omega_{n}:=\mathbf{Y}\times(\{0\}\times \textbf{Y})\times ((0,\infty)\times \mathbf{Y})^n\times(\{\infty\}\times\{\Delta\})^\infty\end{equation*}

for all $n=1,2,\dots.$ The canonical space $\Omega$ is defined as

\begin{equation*}\Omega:=\left(\bigcup_{n=0}^\infty \Omega_{n}\right)\cup \big( \mathbf{Y}\times(\{0\}\times \textbf{Y})\times((0,\infty)\times \mathbf{Y})^\infty \big)\end{equation*}

and is endowed with its Borel $\sigma$-algebra, denoted by $\mathcal{F}$. We will use the following generic notation for a point in $\Omega$: $\omega=(y_0,\theta_1,y_1,\theta_2,y_2,\ldots).$ Below, unless stated otherwise, the notation $x_0\in\textbf{X}$ will indicate the initial state of the gradual-impulse control problem. Then we put

(2) \begin{eqnarray}y_0:=(x_0,\Delta,\Delta,\dots), \qquad \theta_1\equiv 0.\end{eqnarray}

The sequence $\{\theta_n\}_{n=1}^\infty$ represents the sojourn times between consecutive interventions. Here $\theta_1=0$ corresponds to the fact that we allow the possibility of applying impulsive control at the initial time moment; cf. (5) below.

For each $n=0,1,\dots$, let

\begin{eqnarray*}h_{n}:=(y_0,\theta_1,y_1,\theta_2,y_2,\dots, \theta_{n},y_{n})=(y_0,0,y_1,\theta_2,y_2,\dots, \theta_{n},y_{n}),\end{eqnarray*}

where the second equality holds because $\theta_1\equiv 0$; see (2). The collection of all such partial histories $h_n$ is denoted by $\mathbf{H}_{n}$. Let us introduce the coordinate mappings

\begin{eqnarray*}Y_{n}(\omega)=y_{n} \qquad \forall~n\ge 0,\\\Theta_{n}(\omega)=\theta_{n} \qquad \forall~n\ge 1.\end{eqnarray*}

The sequence $\{T_{n}\}_{n=1}^{\infty}$ of $[0,\infty]$-valued mappings is defined by

\begin{equation*}T_{n}(\omega):=\sum_{i=1}^n\Theta_i(\omega)=\sum_{i=1}^n\theta_i\end{equation*}

and

\begin{equation*}T_\infty(\omega):=\lim_{n\to\infty}T_{n}(\omega)\end{equation*}

for all $\omega\in\Omega.$ Let $H_{n}:=(Y_0,\Theta_1,Y_1,\dots,\Theta_{n},Y_{n}).$ Finally, we define the controlled process $\big\{\xi_t\big\}_{t\in [0,\infty)}$ by

\begin{eqnarray*}\xi_t(\omega)=\left\{\begin{array}{ll}Y_{n}(\omega) & \mbox{ if } T_{n}\le t<T_{n+1} \mbox{ for some } n\ge 1, \\\Delta & \mbox{ if } T_\infty\le t.\end{array}\right.\end{eqnarray*}

It is convenient to introduce the random measure $\mu$ of the marked point process $\{(T_{n},Y_{n})\}_{n=1}^\infty$ on $(0,\infty)\times \mathbf{Y}$:

\begin{equation*}\mu(dt\times dy)=\sum_{n\ge 2}I_{\{T_{n}<\infty\}}\delta_{(T_{n},Y_{n})}(dt\times dy).\end{equation*}

Let

\begin{equation*}\mathcal{F}_t:=\sigma\{H_1\}\vee\sigma\{\mu((0,s]\times B):~s\le t,B\in\mathcal{B}(\mathbf{Y})\}\end{equation*}

for $t\in[0,\infty)$.

We will use the following notation in the next definition. For each intervention $y=(x_0,b_0,x_1,b_1,\dots)\in \mathbf{Y}$, define $\bar x(y):=x_{k}$ if $\infty>k=0,1,\dots$ is the unique integer such that $y\in \mathbf{Y}_k$ (if $k\ge 1,$ then $\bar{x}(y)$ is the state after the last impulse in the intervention y); if such an integer k does not exist, then $y\in(\textbf{X}\times\textbf{A}^I)^\infty$ and $\bar{x}(y):=\Delta.$ The previous equality corresponds to the fact that we kill the process after an infinite number of impulses are applied at a single time moment. An example of a trajectory of the system dynamics in the gradual-impulse control problem is displayed in Figure 1.

Definition 3. A policy is a sequence $u=\{u_{n}\}_{n=0}^\infty$ such that $u_{0}\in \mathcal{P}^{\mathbf{Y}}$ and, for each $n=1,2,\dots$, $u_{n}=\left( \Phi_{n},\Pi_{n},\Gamma_{n}^0,\Gamma_n^1 \right),$ where $\Phi_{n}$ is a stochastic kernel on $(0,\infty]$ given $\mathbf{H}_{n}$ such that $\Phi_n(\{\infty\}|h_n)=1$ if $y_n\in (\textbf{X}\times\textbf{A}^I)^\infty$; $\Pi_{n}$ is a stochastic kernel on $\mathbf{A}^{G}$ given $ \mathbf{H}_{n}\times (0,\infty)$; $\Gamma^0_{n}$ is a stochastic kernel on $\mathbf{Y}$ given $ \mathbf{H}_{n}\times (0,\infty)\times \mathbf{X}$ satisfying

\begin{equation*}\Gamma_n^0(\cdot|h_n,t,x)\in {\mathcal P}^{\textbf{Y}}(x)\end{equation*}

for each $h_n\in\textbf{H}_n$, $x\in\textbf{X}$, and $t\in(0,\infty)$; and $\Gamma^1_{n}$ is a stochastic kernel on $\mathbf{Y}$ given $ \mathbf{H}_{n}$ satisfying

\begin{equation*}\Gamma^1_n(\cdot|h_n)\in {\mathcal P}^{\textbf{Y}^\ast}(\bar{x}(y_n))\end{equation*}

for each $h_n\in \textbf{H}_n$. (The above conditions apply when $y_n\ne \Delta$; otherwise, all the values of $\Phi_n$, $\Pi_n$, $\Gamma_n^0$, and $\Gamma_n^1$ are immaterial and may be put arbitrarily.) The set of policies is denoted by $\mathcal U$.

Let us provide an interpretation of how a policy u acts on the system dynamics. Roughly speaking, an intervention is over as soon as the (possibly empty) sequence of simultaneous impulses is over. Given that the nth intervention is over, the kernel $\Phi_n$ specifies the conditional distribution of the planned time until the next impulse (or next sequence of impulses). The (conditional) distribution of the time until the next natural jump (if there are no interventions before it) is the non-stationary exponential distribution with rate $\int_{\textbf{A}^G}q_{\bar{x}(Y_n)}(a)\Pi_n(da|H_n,t)$. Below, we put $q_{\Delta}(a):=0$ for each $a\in\textbf{A}^G.$ In other words, $\Pi_n$ is the (decision rule of) relaxed gradual control. Given that the nth intervention is over, the next intervention is triggered by either the next planned impulse or the next natural jump; in the former case, the new intervention has the distribution given by $\Gamma^1_n$, and in the latter case the new intervention has the distribution given by $\Gamma^0_n.$ This interpretation will be seen to be consistent with (3) and (4) below, where one can see how a policy u acts on the conditional law of the marked point process $\{(T_n,Y_n)\}_{n=1}^\infty.$ See also the caption of Figure 1.

Suppose a policy $u=\{u_{n}\}_{n=0}^\infty $ is fixed. Let us now present the conditional law of the marked point process $\{(T_n,Y_n)\}_{n=1}^\infty$ under the policy u, which determines the underlying probability measure $\mathbb{P}_{x_0}^u$ on $(\Omega,\mathcal F)$, where $x_0\in\textbf{X}$ is the fixed initial state of the system dynamics. For brevity, we introduce the following notation for each $n\ge 1$, $\Gamma\in \mathcal{B}(\mathbf{X})$, and $h_{n}=(y_0,\theta_1,y_1,\ldots,\theta_{n},y_{n})\in \mathbf{H}_{n}$:

\begin{eqnarray*}\lambda_{n}^u(\Gamma|h_{n},t):= \int_{\mathbf{A}^{G}} \tilde{q}(\Gamma | \overline{x}(y_{n}),a) \Pi_{n}(da | h_{n},t),\qquad \Lambda_{n}^u(\Gamma|h_{n},t):= \int_{0}^t \lambda_{n}^u(\Gamma|h_{n},s) ds.\end{eqnarray*}

Now, for each $n\ge 1$, we introduce the stochastic kernel $G_{n}^u$ on $(0,\infty]\times \mathbf{Y}_{\Delta}$ given $\mathbf{H}_{n}$ as follows. For each $h_{n}=(y_0,\theta_1,y_1,\ldots,\theta_{n},y_{n})\in \mathbf{H}_{n}$,

(3) \begin{eqnarray}G_{n}^u(\{+\infty\}\times \{\Delta\} | h_{n}):= \delta_{y_{n}} (\{\Delta\}) + \delta_{y_{n}} (\mathbf{Y})e^{-\Lambda_{n}^u(\mathbf{X}|h_{n},+\infty)}\Phi_{n}(\{+\infty\}|h_n),\end{eqnarray}

and

(4) \begin{align}G_{n}^u(dt\times dy| h_{n})&:=\delta_{y_{n}} (\mathbf{Y}) \left\{ \Gamma_{n}^1(dy| h_{n}) e^{-\Lambda_{n}^u(\mathbf{X}|h_{n},t)} \Phi_{n}(dt | h_{n}) \right.\nonumber \\[2pt]&\quad \left.+ \int_{\mathbf{X}} \Phi_{n}([t,\infty] | h_{n}) \Gamma_{n}^0(dy| h_{n},t,x) \lambda_{n}^u(dx|h_{n},t) e^{-\Lambda_{n}^u(\mathbf{X}|h_{n},t)} dt \right\}\end{align}

on $(0,\infty)\times \textbf{Y}$. For each fixed initial state $x_{0}\in\mathbf{X}$, by the Ionescu-Tulcea theorem (see e.g. Proposition C.10 in [Reference Hernández-Lerma and Lasserre13]), there exists a probability $ \mathbb{P}^{u}_{x_{0}}$ on $(\Omega,\mathcal{F})$ such that the restriction of $\mathbb{P}^{u}_{x_{0}}$ to $(\Omega,\mathcal{F}_{0})$ is given by

(5) \begin{eqnarray}\mathbb{P}^{u}_{x_{0}} \left( \left(\{y_{0}\}\times \{0\} \times \Gamma \times ((0,\infty]\times \mathbf{Y}_{\Delta})^{\infty} \right)\cap \Omega\right) & = &u_{0}(\Gamma|x_{0})\end{eqnarray}

for each $\Gamma\in \mathcal{B}(\mathbf{Y})$; and for each $n\ge 1$, under $\mathbb{P}^{u}_{x_{0}}$, the conditional distribution of $(Y_{n+1},\Theta_{n+1})$ given $\mathcal{F}_{T_{n}}:=\sigma\{H_{n}\}$ is determined by $G_{n}^u(\cdot | H_{n})$, while the conditional survival function of $\Theta_{n+1}$ given $\mathcal{F}_{T_{n}}$ under $\mathbb{P}^{u}_{x_{0}}$ is given by $G_{n}^u([t,+\infty]\times \mathbf{Y}_{\Delta}| H_{n})$.
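
The conditional law (3)-(4) can be read as a race between two clocks: the planned-impulse clock with distribution $\Phi_n$ and the natural-jump clock with hazard rate $\lambda_n^u(\textbf{X}|h_n,t)$, so that the survival function of $\Theta_{n+1}$ is, in effect, $\Phi_n([t,\infty]|h_n)e^{-\Lambda_n^u(\textbf{X}|h_n,t)}$. The following minimal Python sketch samples one step of this race under the simplifying assumption of a constant gradual control, so that the natural-jump clock is exponential with rate $q_x$; for a genuinely time-varying relaxed control one would instead invert $\Lambda_n^u$. All names are illustrative.

\begin{verbatim}
import math, random

def next_jump(planned_time, qx):
    """Sample (Theta_{n+1}, trigger): race the planned impulse time against a
    natural jump with constant rate qx, cf. (3)-(4)."""
    natural = random.expovariate(qx) if qx > 0 else math.inf
    if natural == math.inf and planned_time == math.inf:
        return math.inf, 'no jump'       # the mass at {+infinity} x {Delta} in (3)
    if planned_time <= natural:
        return planned_time, 'impulse'   # next intervention drawn from Gamma^1_n
    return natural, 'natural'            # next intervention drawn from Gamma^0_n

print(next_jump(planned_time=2.0, qx=1.0))
\end{verbatim}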

The cost associated with an intervention $y=(x_{0},b_{0},x_{1},b_{1},\ldots)\in \mathbf{Y}$ is given by $C^I(y):=\sum_{k=0}^\infty c^I(x_k,b_k,x_{k+1}).$ Here, recall that an intervention consists of the current state, the sequence of impulses applied in turn at the same time moment, and the associated post-impulse states; and each impulse b applied at state x results in a cost $c^I(x,b,z)$ if it leads to the post-impulse state z. (We accept that $c^I(x,\Delta,\Delta):= 0$ for all $x\in\textbf{X}_\Delta$.) With this notation, we now introduce the performance measure considered in this paper:

\begin{eqnarray*}\mathcal{V}(x,u)&:=& \mathbb{E}^{u}_{x} \left[ e^{\sum_{n=1}^\infty \left(C^{I}(Y_n)+ \int_{T_n}^{T_{n+1}} \int_{\mathbf{A}^{G}} c^{G}(\bar{x}(\xi_{s}),a) \Pi_n(da |H_n,s-T_n) ds\right)} \right]\end{eqnarray*}

for each $x\in\textbf{X}$ and policy $u\in {\mathcal U}$. Here we recall that $T_1=\Theta_1\equiv 0$; see (2). To illustrate more explicitly how the policy acts on the impulses, consider the example of only one intervention and null gradual cost $c^G(x,a)\equiv 0$. Then we may write

\begin{eqnarray*}\mathbb{E}^{u}_{x} \left[ e^{ C^{I}(Y_1) } \right]&=&\int_{\textbf{X}\times \textbf{A}^I\times \textbf{X}\times \dots} u_0(dx_0\times db_0\times dx_1\times db_1\times\dots|x)e^{\sum_{k=0}^\infty c^I(x_k,b_k,x_{k+1}) }\\[2pt] &=&\int_{\textbf{X}\times \textbf{A}^I\times \textbf{X}\times \dots} u_0(dy|x)e^{ C^I(y) }.\end{eqnarray*}

More generally, one can compute

\begin{equation*}\mathbb{E}^{u}_{x} \left[ e^{ C^{I}(Y_{n+1}) } \right]=\mathbb{E}^{u}_{x} \left[\mathbb{E}^{u}_{x} \left[ e^{C^{I}(Y_{n+1}) }|H_n \right]\right],\end{equation*}

where

\begin{equation*}\mathbb{E}^{u}_{x} \left[ e^{ C^{I}(Y_{n+1}) }|H_n \right]\end{equation*}

can be written out as an integral similar to that of the $n=0$ case using the conditional laws (3) and (4).
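
As a sanity check on this performance measure, here is a small Monte Carlo sketch for Example 1 under the policy that never shoots, assuming $l<\mu$: with no impulses the total cost is l times an exponentially distributed sojourn with rate $\mu$, so ${\mathcal V}(1,u)=\mathbb{E}[e^{l\tau}]=\mu/(\mu-l)$. The parameter values below are illustrative.

\begin{verbatim}
import math, random

mu, l, N = 2.0, 0.5, 200_000   # illustrative values with l < mu

est = sum(math.exp(l * random.expovariate(mu)) for _ in range(N)) / N
print(est, mu / (mu - l))      # both should be close to 4/3
\end{verbatim}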

Define the value function ${\mathcal V}^\ast$ by ${\mathcal V}^\ast(x):=\inf_{u\in {\mathcal U}}{\mathcal V}(x,u)$ for each $x\in\textbf{X}$. A policy $u^\ast$ satisfying ${\mathcal V}(x,u^\ast)={\mathcal V}^\ast(x)$ for all $x\in\textbf{X}$ is called optimal for the following gradual-impulse control problem:

(6) \begin{eqnarray}\mbox{Minimize over $u\in{\mathcal U}:$~}{\mathcal V}(x,u).\end{eqnarray}

In this paper, we will present conditions on the system primitives that guarantee the existence of an optimal policy in a simple form as defined next.

Definition 4. A policy u is called deterministic stationary if there exist some measurable mappings $(\varphi,\psi,f)$ on $\textbf{X}$, where $\varphi(x)\in\{0,\infty\}$ for each $x\in \textbf{X}$, and $\psi$ and f are $\textbf{A}^I$-valued and $\textbf{A}^G$-valued, such that $\Phi_n(\{\infty\}|h_n)=1$,

\begin{equation*}\Pi_n(da|h_n,t)=\delta_{f(\bar{x}(y_n))}(da)\end{equation*}

for all $t\ge 0$, and $u_0(\cdot|x)=\Gamma^0_n(\cdot|h_n,t,x)=\beta^\pi(\cdot|x)$ for some deterministic stationary strategy $\pi$ in the intervention DTMDP model defined by

\begin{equation*}\pi(\{\Delta\}|x_0,b_0,x_1,b_1,\dots,x_n)=I\{\varphi(x_n)=\infty\}\end{equation*}

and

\begin{equation*}\pi(db|x_0,b_0,x_1,b_1,\dots,x_n)=I\{\varphi(x_n)=0\}\delta_{\psi(x_n)}(db).\end{equation*}

In the above definition, $\Gamma^1_n$ was left arbitrary, because, under such a deterministic stationary policy, a new intervention taking place at some $t\in(0,\infty)$ is always triggered by a natural jump.

3. Optimality results

In this section, we present the main optimality results in this paper. In a nutshell, under natural conditions on the system primitives of the gradual-impulse control problem (6), we show that it can be solved via the problem (21) for a simple DTMDP model, which we refer to as the tilde DTMDP model. In this way, we show that the gradual-impulse control problem (6) admits a deterministic stationary optimal policy.

In order to formulate the tilde DTMDP model, we impose the following condition.

Condition 1. There exists a $[1,\infty)$-valued continuous function w on $\textbf{X}$ such that $c^G(x,a)+q_x(a)+1\le w(x)$ for each $(x,a)\in\textbf{X}\times\textbf{A}^G.$

If $c^G$ is a continuous function, then the above condition is a consequence of Condition 2 below and the Berge theorem; see Proposition 7.32 of [Reference Bertsekas and Shreve2]. Several of the statements below do not need the bounding function w in Condition 1 to be continuous. In this connection, we also mention that a Borel measurable function w satisfying the inequality in Condition 1 always exists; see Lemma 1 of [Reference Feinberg, Mandava and Shiryaev9] and recall (1).

Recall that $\tilde{\textbf{A}}=\textbf{A}^I\cup\textbf{A}^G$ is the disjoint union of $\textbf{A}^G$ and $\textbf{A}^I$. We are now in position to define the tilde DTMDP model in terms of the system primitives of the gradual-impulse control problem (6).

Definition 5. The tilde DTMDP model is specified by $\{\textbf{X},\tilde{\textbf{A}},\tilde{Q},\tilde{l}\}$, where $\textbf{X}$ and $\tilde{\textbf{A}}$ are its state and action spaces, and its transition probability $\tilde{Q}$ on $\textbf{X}$ given $\textbf{X}\times\tilde{\textbf{A}}$ and cost function $\tilde{l}$ are defined by

\begin{equation*}\tilde{Q}(dy|x,a):= \frac{q(dy|x,a)}{w(x)}+\delta_{x}(dy),\qquad\tilde{l}(x,a,y):=\ln\frac{w(x)}{w(x)-c^G(x,a)}\end{equation*}

for all $a\in\textbf{A}^G$, and

\begin{equation*}\tilde{Q}(dy|x,b):=Q(dy|x,b), \qquad \tilde{l}(x,b,y):=c^I(x,b,y)\end{equation*}

for all $b\in\textbf{A}^I$.
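
For a finite model, Definition 5 amounts to simple array manipulations. The following Python sketch (with illustrative array conventions, not taken from the paper) builds $\tilde{Q}$ and $\tilde{l}$ for the gradual part from a family of generator matrices, taking as bounding function $w(x)=1+\max_a(c^G(x,a)+q_x(a))$, which satisfies the inequality in Condition 1; the impulsive part is carried over unchanged.

\begin{verbatim}
import numpy as np

def tilde_model(q, cG):
    """q: (nAG, nX, nX) generator matrices, rows summing to 0;
    cG: (nAG, nX) nonnegative gradual cost rates."""
    nAG, nX, _ = q.shape
    qx = -np.einsum('aii->ai', q)            # jump rates q_x(a) >= 0
    w = 1.0 + (cG + qx).max(axis=0)          # bounding function per Condition 1
    # tilde_Q(dy|x,a) = q(dy|x,a)/w(x) + delta_x(dy): a stochastic matrix
    tQ = q / w[None, :, None] + np.eye(nX)[None, :, :]
    # tilde_l(x,a) = ln( w(x) / (w(x) - cG(x,a)) ), independent of y
    tl = np.log(w[None, :] / (w[None, :] - cG))
    return tQ, tl, w

# a two-state example: rate 1.0 out of state 0, gradual cost rate 0.5 there
tQ, tl, w = tilde_model(q=np.array([[[-1.0, 1.0], [0.0, 0.0]]]),
                        cG=np.array([[0.5, 0.0]]))
assert np.allclose(tQ.sum(axis=2), 1.0) and (tQ >= 0).all()
\end{verbatim}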

For the solvability of the problem (21) for the tilde DTMDP model, we impose the following compactness–continuity condition.

Condition 2. The functions $c^I$ and $c^G$ are lower semicontinuous on $\textbf{X}\times\textbf{A}^I\times\textbf{X}$ and $\textbf{X}\times\textbf{A}^G$, respectively, and for each bounded continuous function g on $\textbf{X}$, $\int_{\textbf{X}}g(y)Q(dy|x,b)$ and $\int_{\textbf{X}}g(y)\tilde{q}(dy|x,a)$ are continuous in $(x,b)\in \textbf{X}\times\textbf{A}^I$ and $(x,a)\in \textbf{X}\times\textbf{A}^G$, respectively. (Recall also that $\textbf{A}^G$ and $\textbf{A}^I$ are compact.)

Under Conditions 1 and 2, one can easily check that the tilde DTMDP model is semicontinuous, so that the value function $W^\ast$ of the problem (21) for the tilde DTMDP model is lower semicontinuous, and there exists an optimal deterministic stationary strategy for it; see Proposition 4(e). We collect these observations in the next statement for future reference.

Proposition 1. Suppose Conditions 1 and 2 are satisfied. Then the value function $W^\ast$ of the problem (21) for the tilde DTMDP model is the minimal $[1,\infty]$-valued lower semicontinuous function satisfying

(7) \begin{eqnarray}V(x)= \inf_{\tilde{a}\in\tilde{\textbf{A}}}\left\{\int_{\textbf{X}}e^{\tilde{l}(x,\tilde{a},y)}V(y)\tilde{Q}(dy|x,\tilde{a})\right\}, \qquad x\in\textbf{X}.\end{eqnarray}

($W^\ast$ is also the minimal $[1,\infty]$-valued lower semicontinuous solution to the optimality inequality obtained by replacing ‘$=$’ with ‘$\ge $’ in (7).) A pair of measurable mappings $(\psi^\ast,f^\ast)$ from $\textbf{X}$ to $\textbf{A}^I$ and $\textbf{A}^G$, respectively, is a deterministic stationary optimal strategy for the problem (21) for the tilde DTMDP model if and only if, for all $x\in\textbf{X}$, there is some x-dependent $\tilde{a}^\ast\in\tilde{\textbf{A}}$ such that

(8) \begin{eqnarray}&&\int_{\textbf{X}}e^{\tilde{l}(x,\tilde{a}^\ast,y)}W^\ast(y)\tilde{Q}(dy|x,\tilde{a}^\ast)=\inf_{\tilde{a}\in\tilde{\textbf{A}}}\left\{\int_{\textbf{X}}e^{\tilde{l}(x,\tilde{a},y)}W^\ast(y)\tilde{Q}(dy|x,\tilde{a})\right\}\nonumber\\&=&\int_{\textbf{X}}e^{\tilde{l}(x,\psi^\ast(x),y)}W^\ast(y)\tilde{Q}(dy|x,\psi^\ast(x))I\{\tilde{a}^\ast\in\textbf{A}^I\}\nonumber\\&&+\int_{\textbf{X}}e^{\tilde{l}(x,f^\ast(x),y)}W^\ast(y)\tilde{Q}(dy|x,f^\ast(x))I\{\tilde{a}^\ast\in\textbf{A}^G\}.\end{eqnarray}

Such a pair $(\psi^\ast,f^\ast)$ of measurable selectors exists.

We introduce the notation to be used in the next statement. For each $[1,\infty]$-valued universally measurable function g on $\textbf{X}$, define

(9) \begin{align}\textbf{X}^G(g):=\bigg\{x\in\textbf{X}:\ \infty>g(x),\ 0=\inf_{a\in \textbf{A}^G}\bigg\{&\int_{\textbf{X}}g(y)\tilde{q}(dy|x,a)\nonumber\\&-(q_x(a)-c^G(x,a))g(x)\bigg\}\bigg\}\end{align}

and

\begin{equation*}\textbf{X}^I(g):=\left\{x\in\textbf{X}:~g(x)=\inf_{b\in\textbf{A}^I}\left\{\int_{\textbf{X}}g(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}\right\}.\end{equation*}

Proposition 7 asserts that, without imposing Condition 2, $W^\ast$ is universally measurable, so that the integrals $\int_{\textbf{X}}W^\ast(y)\tilde{q}(dy|x,a)$ and

\begin{equation*}\int_{\textbf{X}}W^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\end{equation*}

are defined.

Theorem 1. Suppose Conditions 1 and 2 are satisfied. Then the following assertions hold.

(a) The value function $W^\ast$ of the problem (21) for the tilde DTMDP model coincides with ${\mathcal V}^\ast$.

(b) $\textbf{X}\setminus \textbf{X}^I(W^\ast)\subseteq \textbf{X}^G(W^\ast).$

(c) There is a deterministic stationary optimal policy for the gradual-impulse control problem (6), which can be obtained as follows. For each pair $(\psi^\ast,f^\ast)$ of measurable mappings satisfying (8) (and there exists such a pair by Proposition 1), the deterministic stationary policy $(\varphi,\psi^\ast,f^\ast)$ is optimal, where $\varphi(x)=\infty$ for all $x\in \textbf{X}\setminus \textbf{X}^I(W^\ast)$ and $\varphi(x)=0$ for all $x\in\textbf{X}^I(W^\ast)$.

The proofs of this and the other statements in this section are postponed to Section 5.

According to Theorem 1, roughly speaking, if the current state is in $\textbf{X}^G(W^\ast)$, then it is optimal not to apply an impulse until the next natural jump; and if the current state is in $\textbf{X}^I(W^\ast)$, then it is optimal to apply an impulse immediately. Also, Equation (7) is the optimality equation for the gradual-impulse control problem (6). It can be written out in an equivalent form that does not involve the function w, which might be more convenient sometimes.
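
To see how the function w drops out for the gradual part, note the following rearrangement of (7) at an $x\in\textbf{X}$ with $V(x)<\infty$ and $a\in\textbf{A}^G$ (a one-line computation using $\tilde{Q}$ and $\tilde{l}$ from Definition 5, and $w(x)-c^G(x,a)\ge 1+q_x(a)>0$ from Condition 1):

\begin{align*}V(x)\le \frac{w(x)}{w(x)-c^G(x,a)}\left[\frac{1}{w(x)}\int_{\textbf{X}}V(y)\tilde{q}(dy|x,a)+\Big(1-\frac{q_x(a)}{w(x)}\Big)V(x)\right]\end{align*}

if and only if

\begin{align*}0\le \int_{\textbf{X}}V(y)\tilde{q}(dy|x,a)-\big(q_x(a)-c^G(x,a)\big)V(x),\end{align*}

which is the expression appearing in (9) and (10); the impulsive part of (7) is (11) verbatim.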

Corollary 1. Suppose Conditions 1 and 2 are satisfied. Then the following assertions hold.

(a) ${\mathcal V}^\ast$ is the minimal $[1,\infty]$-valued lower semicontinuous function on $\textbf{X}$ satisfying

    (10) \begin{align}&\inf_{a\in\textbf{A}^G}\left\{\int_{\textbf{X}} {\mathcal V}^\ast(y) \tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a)){\mathcal V}^\ast(x) \right\}\ge 0\\ &\quad \forall~x\in\textbf{X}^\ast({\mathcal V}^\ast):=\{x\in\textbf{X}:~{\mathcal V}^\ast(x)<\infty\}\nonumber\end{align}
    and
    (11) \begin{eqnarray}{\mathcal V}^\ast(x)\le \inf_{b\in\textbf{A}^I}\left\{ \int_{\textbf{X}} e^{c^I(x,b,y)}{\mathcal V}^\ast(y)Q(dy|x,b) \right\}, \qquad x\in\textbf{X},\end{eqnarray}
    where at each $x\in\textbf{X}$, the inequality in either (10) or (11) holds with equality.
(b) A pair $(\psi^\ast,f^\ast)$ of measurable mappings satisfies (8) if and only if

    \begin{eqnarray*}&&\inf_{a\in \textbf{A}^G}\left\{\int_{\textbf{X}}{\mathcal V}^\ast(y)\tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a)){\mathcal V}^\ast(x)\right\}\\&=&\int_{\textbf{X}}{\mathcal V}^\ast(y)\tilde{q}(dy|x,f^\ast(x))-(q_x(f^\ast(x))-c^G(x,f^\ast(x))){\mathcal V}^\ast(x)\end{eqnarray*}
    for each $x\in\textbf{X}^G({\mathcal V}^\ast)$, and
    \begin{eqnarray*}\inf_{b\in\textbf{A}^I}\left\{\int_{\textbf{X}}e^{c^I(x,b,y)}{\mathcal V}^\ast(y)Q(dy|x,b)\right\}&=&\int_{\textbf{X}}{\mathcal V}^\ast(y)e^{c^I(x,\psi^\ast(x),y)}Q(dy|x,\psi^\ast(x))\end{eqnarray*}
for each $x\in\textbf{X}$. (According to Theorem 1, $(\psi^\ast,f^\ast)$ gives rise to a deterministic stationary optimal policy for the gradual-impulse control problem (6).)

To end this section, we present a simple example to demonstrate a situation where it is natural and necessary to allow multiple impulses at a single time moment.

Example 2. Let us revisit Example 1. The model has state space $\{1,2\}$, where 1 indicates that the rat is present in the kitchen, and 2 indicates that the rat is either dead or outside the house. The space of gradual controls is a singleton and will not be indicated explicitly, and the space of impulses is $\textbf{A}^I=\{0,1\}$, with 1 standing for shooting and 0 for not shooting. Thus the inequalities (10) and (11) for the value function ${\mathcal V}^\ast$ read as follows:

\begin{eqnarray*}&&{\mathcal V}^\ast(2)=1;\\&&\mu {\mathcal V}^\ast(2) -(\mu-l){\mathcal V}^\ast(1)\ge 0; \qquad {\mathcal V}^\ast(1)\le \min\{e^C p {\mathcal V}^\ast(2)+e^C (1-p){\mathcal V}^\ast(1),{\mathcal V}^\ast(1)\}.\end{eqnarray*}

Suppose $1-e^C(1-p)>0.$ By Theorem 1 and Corollary 1, if

\begin{equation*}\frac{e^C p}{1-e^C(1-p)}>\frac{\mu}{\mu-l}>0,\end{equation*}

then

\begin{equation*}{\mathcal V}^\ast(1)=\frac{\mu}{\mu-l},\end{equation*}

and the optimal deterministic stationary policy is to never shoot at the rat; otherwise,

\begin{equation*}{\mathcal V}^\ast(1)=\frac{e^C p}{1-e^C(1-p)}=E[e^{CZ}]\end{equation*}

with Z following the geometric distribution with success probability p, and the optimal deterministic stationary policy is to keep shooting, as long as the rat is in the kitchen, until it is hit.
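
The dichotomy in Example 2 can also be checked numerically by value iteration on the right-hand side of the optimality equation (7). The sketch below (an illustration, not taken from the paper) iterates the operator for the two-state model with $w(1)=l+\mu+1$ and $w(2)=1$, keeping only the shooting impulse, since the ‘do not shoot’ impulse yields the trivial inequality ${\mathcal V}^\ast(1)\le {\mathcal V}^\ast(1)$ in (11).

\begin{verbatim}
import math

def rat_value(mu, l, p, C, iters=5000):
    """Value iteration for Example 2; V(2) = 1 throughout."""
    w1, V1 = l + mu + 1.0, 1.0
    for _ in range(iters):
        gradual = (w1 / (w1 - l)) * (mu / w1 * 1.0 + (1.0 - mu / w1) * V1)
        shoot = math.exp(C) * (p * 1.0 + (1.0 - p) * V1)
        V1 = min(gradual, shoot)
    return V1

mu, l, p, C = 1.0, 0.5, 0.9, 0.1           # illustrative values
print(rat_value(mu, l, p, C),              # approx. 1.118: the shooting regime
      math.exp(C) * p / (1.0 - math.exp(C) * (1.0 - p)),  # closed form if shooting
      mu / (mu - l))                       # closed form if never shooting
\end{verbatim}

With these illustrative values the iteration settles on the shooting regime; swapping in, say, $p=0.5$ and $C=1$ (so that $1-e^C(1-p)<0$) recovers $\mu/(\mu-l)$ instead.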

The proofs of the statements in this section are based on the investigation of an optimal control problem for another DTMDP model, which will be introduced in the next section.

4. The hat DTMDP model

In this section, we describe a DTMDP problem which will serve the investigation of the gradual-impulse control problem. To distinguish it from the intervention DTMDP model, we shall refer to it as the hat DTMDP model. The system primitives of this DTMDP model are defined in terms of those of the gradual-impulse control problem. At the end of this section we will describe in greater detail the connections between the hat DTMDP problem and the gradual-impulse control problem that are relevant to this paper. For a first impression, roughly speaking, the state process of the hat DTMDP model comes from the system dynamics of the gradual-impulse control problem in the following way. The state has two coordinates. Along the (discrete-time) state process of the hat DTMDP model, the second coordinate records the system state of the gradual-impulse control problem immediately after a natural jump (of the marked point process $\{(T_n,Y_n)\}_{n=1}^\infty$) or an ‘actual’ impulse (thus the state immediately after the pseudo-impulse $\Delta$ will not be recorded). The first coordinate records the time elapsed in the gradual-impulse control problem between two consecutive states as recorded in the second coordinate.

The hat DTMDP has a more complicated action space than the original gradual-impulse control problem. To describe the action space of the hat DTMDP model, let us recall some known and general facts. Let ${\mathcal R}$ be the collection of ${\mathcal P}(\textbf{A}^G)$-valued measurable mappings on $[0,\infty)$ with any two elements therein being identified if they differ only on a null set with respect to the Lebesgue measure. Recall that ${\mathcal P}(\textbf{A}^G)$ stands for the space of probability measures on $(\textbf{A}^G, {\mathcal B}(\textbf{A}^G))$. We endow ${\mathcal P}(\textbf{A}^G)$ with its weak topology (generated by bounded continuous functions on $\textbf{A}^G$) and the Borel $\sigma$-algebra, so that ${\mathcal P}(\textbf{A}^G)$ is a Borel space; see Chapter 7 of [Reference Bertsekas and Shreve2]. It is known (see Lemma 1 of [Reference Yushkevich23]) that the space ${\mathcal R}$, endowed with the smallest $\sigma$-algebra with respect to which the mapping

\begin{equation*}\rho=(\rho_t(da))\in{\mathcal R}\rightarrow \int_0^\infty e^{-t}g(t,\rho_t)dt\end{equation*}

is measurable for each bounded measurable function g on $(0,\infty)\times {\mathcal P}(\textbf{A}^G)$, is a Borel space. Recall that $\textbf{A}^I$ and $\textbf{A}^G$ are compact Borel spaces. Then, according to Section 43 of [Reference Davis6], the space ${\mathcal R}$ is a compact metrizable space, endowed with the Young topology, which is the coarsest topology with respect to which the mapping

\begin{equation*}\rho=(\rho_t(da))\in {\mathcal R}\rightarrow \int_0^\infty \int_{\textbf{A}^G} g(t,a)\rho_t(da)dt\end{equation*}

is continuous for each function g on $(0,\infty)\times \textbf{A}^G$ satisfying the following properties:

(a) for each $t\in(0,\infty),$ $g(t,\cdot)$ is continuous on $\textbf{A}^G$;

(b) for each $a\in\textbf{A}^G$, $g(\cdot,a)$ is measurable on $(0,\infty);$ and

(c)

\begin{equation*}\int_0^\infty \sup_{a\in\textbf{A}^G}|g(t,a)|dt<\infty.\end{equation*}

Below we shall use, without special reference, the following notation: if $\mu$ is a measure on a Borel space $(X,{\mathcal{B}}(X))$, then $f(\mu):=\int_X f(x)\mu(dx)$ for each measurable function f on $(X,{\mathcal{B}}(X))$, provided that the integral is well defined.

4.1. Primitives of the hat DTMDP model $\{\hat{\textbf{X}}, \hat{\textbf{A}}, p, l\}$

The state space of the hat DTMDP model is $\hat{\textbf{X}}:=\{(\infty,x_\infty)\}\cup [0,\infty)\times\textbf{X}$, where $(\infty,x_\infty)$ is an isolated point, and the action space of the DTMDP is $\hat{\textbf{A}}:=[0,\infty]\times \textbf{A}^I\times {\mathcal R}$. Endowed with the product topology, where $[0,\infty]$ is compact in the standard topology of the extended real line, $\hat{\textbf{A}}$ is also a compact Borel space. Here, $\textbf{X}$, $\textbf{A}^I$, and $\textbf{A}^G$ are the state, impulse, and gradual action spaces in the gradual-impulse control problem. The transition probability p is defined as follows, using the notation introduced prior to this subsection, e.g., $q_x(\rho_t):=\int_{\textbf{A}^G} q_x(a)\rho_t(da)$ and

\begin{equation*}c^G(x,\rho_t):=\int_{\textbf{A}^G}c^G(x,a)\rho_t(da).\end{equation*}

For each bounded measurable function g on $\hat{\textbf{X}}$ and action $\hat{a}=(c,b,\rho)\in\hat{\textbf{A}}$,

\begin{eqnarray*}&&\int_{\hat{\textbf{X}}}g(t,y)p(dt\times dy|(\theta,x),\hat{a})\\[3pt]&:=&I\{c=\infty\}\left\{ g(\infty,x_\infty)e^{-\int_0^\infty q_x(\rho_s)ds}+\int_{0}^\infty\int_{\textbf{X}}g(t,y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t q_x(\rho_s)ds} dt \right\}\\[3pt]&&+I\{c<\infty\}\left\{\int_0^c \int_{\textbf{X}} g(t,y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t q_x(\rho_s)ds}dt\right.\\[3pt]&&\left.+e^{-\int_0^c q_x(\rho_s)ds}\int_{\textbf{X}}g(c,y)Q(dy|x,b) \right\}\\[3pt]&=&\int_0^c \int_{\textbf{X}} g(t,y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t q_x(\rho_s)ds}dt+I\{c=\infty\} g(\infty,x_\infty)e^{-\int_0^\infty q_x(\rho_s)ds}\\[3pt]&&+I\{c<\infty\}e^{-\int_0^c q_x(\rho_s)ds}\int_{\textbf{X}}g(c,y)Q(dy|x,b)\end{eqnarray*}

for each state $(\theta,x)\in[0,\infty)\times \textbf{X}$, and

\begin{equation*}\int_{\hat{\textbf{X}}}g(t,y)p(dt\times dy|(\infty,x_\infty),\hat{a}):=g(\infty,x_\infty).\end{equation*}

It is known (see e.g. [Reference Costa and Dufour5, Reference Forwick, Schäl and Schmitz10]) that for each bounded measurable function g on $\hat{\textbf{X}}$, the above expressions are indeed measurable on $\hat{\textbf{X}}\times \hat{\textbf{A}}$, and the same also holds for the cost function l on $\hat{\textbf{X}}\times\hat{\textbf{A}}\times\hat{\textbf{X}}$, defined as follows:

\begin{eqnarray*}l((\theta,x),\hat{a},(t,y)):=I\{(\theta,x)\in [0,\infty)\times\textbf{X}\}\left\{\int_0^t c^G(x,\rho_s)ds+I\{t=c\}c^I(x,b,y)\right\}\end{eqnarray*}

for each $((\theta,x),\hat{a},(t,y))\in\hat{\textbf{X}}\times\hat{\textbf{A}}\times\hat{\textbf{X}}$, accepting that $c^I(x,b,x_\infty)\equiv 0.$ Recall that the generic notation $\hat{a}=(c,b,\rho)\in\hat{\textbf{A}}$ for an action in this hat DTMDP model has been in use. The pair (c,b) consists of the planned time until the next impulse and the next planned impulse itself, and $\rho$ is (the rule of) the relaxed control to be used during the next sojourn time. Figure 2 displays the realization of the components $\{(C_n,B_n)\}_{n=0}^\infty$ of the action process in the hat DTMDP model corresponding to the sample path in Figure 1 for the gradual-impulse control problem.

Figure 2. The realization of the state process in the hat DTMDP model corresponding to the sample path in the gradual-impulse control problem in Figure 1. The time index is discrete from $\{0,1,\dots\}$. The realizations of the components $\{(C_n,B_n)\}_{n=0}^\infty$ in the action process $\{\hat{A}_n\}_{n=0}^\infty$ are indicated above the dashed lines between consecutive states. For example, $(0,b_0)$ next to the state $(0,x_0)$ indicates that the decision-maker applies an impulse $b_0$ immediately, which results in the next state $(0,x_1).$ All the components $x_0,x_1,\dots,$ $x_0'$, $x_1''$, $x_2''$ and $b_1,b_2,b_0'',b_1'',b_2''$ are the same as in Figure 1. The only exception is $(c_3,b_3),$ which does not appear in Figure 1. Nevertheless, $c_3>\theta_2$, because in Figure 1, the first jump in the marked point process therein at the time moment $\theta_1+\theta_2=\theta_2$ is triggered by a natural jump.
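
The kernel p admits a direct sampling description: it encodes the same race between the planned-impulse clock and the natural-jump clock as in Section 2, packaged as a single DTMDP transition. The following Python sketch samples one step under the simplifying assumption that the relaxed control $\rho$ is constant in t, so that the natural-jump clock is exponential with rate $q_x$; the callables qx, q_tilde, and Q_imp are illustrative stand-ins for the primitives.

\begin{verbatim}
import math, random

def hat_step(x, c, b, qx, q_tilde, Q_imp):
    """One transition of the hat DTMDP from (theta, x) under action (c, b, rho)
    with a constant relaxed control; returns the next state (t, y)."""
    natural = random.expovariate(qx(x)) if qx(x) > 0 else math.inf
    if natural < c:
        return natural, q_tilde(x)     # natural jump strictly before time c
    if c == math.inf:
        return math.inf, 'x_inf'       # no jump at all: absorb at (infinity, x_inf)
    return c, Q_imp(x, b)              # the planned impulse b fires at time c

# Example 1 primitives as stand-ins: state 1 jumps to 2 at rate mu; a shot kills w.p. p
mu, p = 1.0, 0.5
print(hat_step(1, c=2.0, b='shoot',
               qx=lambda x: mu if x == 1 else 0.0,
               q_tilde=lambda x: 2,
               Q_imp=lambda x, b: 2 if random.random() < p else 1))
\end{verbatim}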

The optimal control problem for the hat DTMDP model reads as follows:

(12) \begin{eqnarray}\mbox{Minimize over $\sigma$:}~\hat{\mathbb{E}}_{(\theta,x)}^\sigma\left[e^{\sum_{n=0}^\infty l(\hat{X}_n,\hat{A}_n,\hat{X}_{n+1})}\right]=:V((\theta,x),\sigma),\end{eqnarray}

where $\{\hat{X}_n\}_{n=0}^\infty$ and $\{\hat{A}_n\}_{n=0}^\infty$ are the state and action processes, and the minimization problem is over all strategies $\sigma$ in the hat DTMDP model. We denote by $V^\ast$ the value function of this optimal control problem, i.e.,

\begin{equation*}V^\ast(\theta,x):=\inf_{\sigma}\hat{\mathbb{E}}_{\hat{x}}^\sigma\left[e^{\sum_{n=0}^\infty l(\hat{X}_n,\hat{A}_n,\hat{X}_{n+1})}\right]\end{equation*}

for each $\hat{x}=(\theta,x)\in\hat{\textbf{X}}$, where the infimum is over all strategies. Clearly, $V^\ast(\infty,x_\infty)=1$. It will be seen in Lemma 5 that $V^\ast$ depends on $(\theta,x)$ only through x, and a strategy $\sigma$ is optimal if $V((0,x),\sigma)=V^\ast(x)$ for each $x\in\textbf{X}$. Below, when the context is clear, we often consider the restriction of $V^\ast$ to $\textbf{X}$ but still use the same notation.

Let $\hat{h}_n=((\theta_0,x_0),(c_0,b_0,\rho_0),(\theta_1,x_1), (c_1,b_1,\rho_1),(\theta_2,x_2),\dots,(\theta_n,x_n))$ be the generic notation for an n-history in the hat DTMDP model. A strategy in the hat DTMDP model is a sequence $\sigma=\{\sigma_n\}_{n=0}^\infty$, where for each $n\ge 0,$ $\sigma_n(d\hat{a}|\hat{h}_n)$ is a stochastic kernel on $\hat{\textbf{A}}$ given $\hat{h}_n$, which specifies the conditional distribution of the next action $(c,b,\rho)$ given $\hat{h}_n$. In general, a strategy in the hat DTMDP model can make use of past decision rules of relaxed controls, and the selection of the next relaxed control and that of the next planned impulse time and impulse need not be (conditionally) independent. Therefore, a general strategy in the hat DTMDP model does not immediately correspond to a policy in the gradual-impulse control problem described in the previous section. To relate the gradual-impulse control problem (6) and the hat DTMDP problem (12) (see Proposition 2 below), we introduce the following class of strategies in the hat DTMDP model.

Definition 6. A strategy $\sigma$ in the hat DTMDP model is called typical if under it, given $\hat{h}_n$, the selection of the next action (c,b) and $\rho$ are conditionally independent, and moreover, the selection of $\rho$ is deterministic, i.e.,

\begin{equation*}\sigma_n(dc\times db\times d\rho|\hat{h}_n)=\sigma_n'(dc\times db|\hat{h}_n)\delta_{F^n(\hat{h}_n)}(d\rho),\end{equation*}

where $F^n(\hat{h}_n)$ is measurable in its argument and takes values in ${\mathcal R}$, and $\sigma_n'(dc\times db|\hat{h}_n)$ is a stochastic kernel on $[0,\infty]\times \textbf{A}^I$ given $\hat{h}_n$.

One can always write $\sigma_n'(dc\times db|\hat{h}_n)=\varphi_n(dc|\hat{h}_n)\psi_n(db|\hat{h}_n,c)$ for some stochastic kernels $\varphi_n$ and $\psi_n$. Intuitively, $\varphi_n$ defines the (conditional) distribution of the planned time duration till the next impulse, and $\psi_n(db|\hat{h}_n,c)$ specifies the distribution of the next impulsive action given the history $\hat{h}_n$ and the next impulse moment c, provided that it takes place before the next natural jump. Therefore, we identify a typical strategy $\sigma=\{\sigma_n\}_{n=0}^\infty$ as $\{(\varphi_n,\psi_n,F^n)\}_{n=0}^\infty.$ For further notational brevity, when the stochastic kernels $\varphi_n$ are identified with underlying measurable mappings, we will use $\varphi_n$ for the measurable mappings, and write $\varphi_n(\hat{h}_n)$ instead of $\varphi_n(dc|\hat{h}_n)$. The same applies to other stochastic kernels such as $\psi_n$. The context will exclude any potential confusion. Finally, in general, we often do not indicate the arguments that do not affect the values of the mappings in question. For example, if $\varphi_n(\hat{h}_n)$ depends on $\hat{h}_n$ only through $x_n$, then we write $\varphi_n(dc|\hat{h}_n)$ as $\varphi_n(dc|x_n)$.

4.2 Relation between gradual-impulse control and hat DTMDP problems

Each policy u as introduced in Definition 1 induces a strategy $\{(\varphi_n,\psi_n,F^n)\}_{n=0}^\infty$ in the hat DTMDP model as follows, where we need only consider $x_n\in\textbf{X}$, as the definition of the strategies at $x_n=x_\infty$ is immaterial and can be arbitrary. For each $m\ge 1$, and $h_m\in\textbf{H}_m$, there exists a strategy

\begin{equation*}\pi^{\Gamma^1_m,h_m}=\{\pi^{\Gamma^1_m,h_m}_n\}_{n=0}^\infty\end{equation*}

in the intervention DTMDP model such that

\begin{equation*}\Gamma^1_m(dy|h_m)=\beta^{\pi^{\Gamma^1_m,h_m}}(dy|\bar{x}(y_m)).\end{equation*}

Similarly, for each $x\in\textbf{X}$, $t>0$, there exists a strategy

\begin{equation*}\pi^{\Gamma^0_m,h_m,t,x}=\{\pi^{\Gamma^0_m,h_m,t,x}_n\}_{n=0}^\infty\end{equation*}

in the intervention DTMDP model such that

\begin{equation*}\Gamma^0_m(dy|h_m,t,x)=\beta^{\pi^{\Gamma^0_m,h_m,t,x}}(dy|x).\end{equation*}

Finally, there is a strategy

\begin{equation*}\pi^{u_0}=\{\pi^{u_0}_n\}_{n=0}^\infty\end{equation*}

in the intervention DTMDP model satisfying

\begin{equation*}u_0(dy|x)=\beta^{\pi^{u_0}}(dy|x)\end{equation*}

for each $x\in\textbf{X}$.

Consider the case $n=0$. Here we define

\begin{align*}\varphi_0(\{0\}|\theta,x)&:= 1-\pi_0^{u_0}(\{\Delta\}|x);\\[3pt] \varphi_0(dc|\theta,x)&:=\pi_0^{u_0}(\{\Delta\}|x)\Phi_1(dc|(x,\Delta,\Delta,\dots),0,(x,\Delta,\Delta,\dots)) \quad \text{on } (0,\infty]; \\[3pt] \psi_0(db|\theta,x,c):=&\frac{\pi^{u_0}_0(db|x)}{1-\pi^{u_0}_0(\{\Delta\}|x)}I\{c=0\}\\&\quad +I\{c>0\} \frac{\pi^{\Gamma^1_1,((x,\Delta,\dots),0,(x,\Delta,\dots))}_0(db|x)}{1-\pi^{\Gamma^1_1,((x,\Delta,\dots),0,(x,\Delta,\dots))}_0(\{\Delta\}|x)}\\& =\frac{\pi^{u_0}_0(db|x)}{1-\pi^{u_0}_0(\{\Delta\}|x)}I\{c=0\}+I\{c>0\} \pi^{\Gamma^1_1,((x,\Delta,\dots),0,(x,\Delta,\dots))}_0(db|x);\end{align*}

and $F^0(\theta,x)_t(da):=\Pi_1(da|(x,\Delta,\Delta,\dots),0,(x,\Delta,\Delta,\dots),t),$ where the second equality in the definition of $\psi_0(db|\theta,x,c)$ holds because

\begin{equation*}\pi^{\Gamma^1_1,((x,\Delta,\dots),0,(x,\Delta,\dots))}_0(\{\Delta\}|x)=0,\end{equation*}

which follows from the requirement that

\begin{equation*}\Gamma^1_n(\cdot|h_n)\in {\mathcal P}^{\textbf{Y}^\ast}(\bar{x}(y_n))\end{equation*}

for all $n\ge 1$ in Definition 3. Also concerning the definition of $\psi_0(db|\theta,x,c)$, note that if the denominator $ {1-\pi^{u_0}_0(\{\Delta\}|x)}$ equals 0, we let

\begin{equation*}\frac{\pi^{u_0}_0(db|x)}{1-\pi^{u_0}_0(\{\Delta\}|x)}\end{equation*}

be an arbitrary stochastic kernel. The reason is that in the expression

\begin{equation*}\frac{\pi^{u_0}_0(db|x)}{1-\pi^{u_0}_0(\{\Delta\}|x)}I\{c=0\},\end{equation*}

equality ${1-\pi^{u_0}_0(\{\Delta\}|x)}=0$ would indicate that the probability of selecting an instantaneous impulse is zero, and so $I\{c=0\}=0$ almost surely. The same explanation applies to the definitions of $\psi_n(db|\hat{h}_n,c)$ below, and will not be repeated there. Note that the right-hand side does not depend on $\theta\in[0,\infty)$, because the initial time moment is always fixed to be $\theta=0$.

The intuition behind the above definition of $(\varphi_0,\psi_0,F^0)$ is as follows. Recall that if the initial system state is $x\in\textbf{X}$, then the intervention $y_1\in\textbf{Y}$ at the initial time in the gradual-impulse control problem is a realization from the distribution

\begin{equation*}u_0(\cdot|x)=\beta^{\pi^{u_0}}(\cdot|x),\end{equation*}

which is the strategic measure of some strategy $\pi^{u_0}=\{\pi_n^{u_0}\}_{n=0}^\infty$ in the intervention DTMDP model. Then $\pi_0^{u_0}(\{\Delta\}|x)$ is the probability that no impulse is applied at the initial time 0 (given the initial system state x) in the gradual-impulse control problem. Consequently, $1-\pi_0^{u_0}(\{\Delta\}|x)$ is the probability of applying an impulse immediately, i.e., of waiting time 0 until the next impulse, and thus $\varphi_0(\{0\}|\theta,x)$. This quantity does not depend on $\theta$, because the initial time is always $0.$ Then for a measurable subset $\Gamma_1\subseteq(0,\infty]$,

\begin{eqnarray*}&&\pi_0^{u_0}(\{\Delta\}|x)\Phi_1(\Gamma_1|(x,\Delta,\Delta,\dots),0,(x,\Delta,\Delta,\dots))\\[3pt]&=&\mbox{$\mathbb{P}$(no impulse at initial time 0 given initial system state \textit{x})}\\[3pt] &&\times \mbox{$\mathbb{P}$(time to wait until next impulse is in $\Gamma_1$ given no impulse is immediately} \\[3pt] &&\mbox{applied at the initial time with the initial state \textit{x})},\end{eqnarray*}

which is equal to

\begin{eqnarray*}&&\mbox{$\mathbb{P}$(no immediate impulse, and the time duration until the next planned} \\[3pt] &&\mbox{impulse is in $\Gamma_1$)}=\mbox{$\mathbb{P}$(the time duration until the next planned impulse is in $\Gamma_1$)},\end{eqnarray*}

and thus equals $\varphi_0(\Gamma_1|\theta,x)$, where the equality follows because $\Gamma_1\subseteq(0,\infty].$ (Recall that a planned impulse takes place if no natural jump occurs during the time leading up to it.) Finally, as for $\psi_0(db|\theta,x,c)$, if $c=0$ and $\Gamma_2\in{\mathcal B}(\textbf{A}^I)$, then

\begin{eqnarray*}&&\frac{\pi^{u_0}_0(\Gamma_2|x)}{1-\pi^{u_0}_0(\{\Delta\}|x)}=\frac{\mbox{$\mathbb{P}$(an immediate impulse from $\Gamma_2$ is applied)}}{\mbox{$\mathbb{P}$(an immediate impulse is applied)}}\\[3pt] &=&\mbox{$\mathbb{P}$(an impulse is applied immediately from $\Gamma_2$}\\[3pt] &&\mbox{given that an impulse is applied after time duration 0)},\end{eqnarray*}

which is thus $\psi_0(\Gamma_2|\theta,x,0).$ One can understand $\psi_0(db|\theta,x,c)$ when $c>0$ in the same manner. Very similar intuition guides the definition of $(\varphi_n,\psi_n,F^n)$ below.
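For a concrete (hypothetical) numerical illustration of these formulas, suppose $\pi_0^{u_0}(\{\Delta\}|x)=0.3$ and $\pi_0^{u_0}(\Gamma_2|x)=0.6$ for some $\Gamma_2\in{\mathcal B}(\textbf{A}^I)$. Then

\begin{equation*}\varphi_0(\{0\}|\theta,x)=1-0.3=0.7,\qquad \psi_0(\Gamma_2|\theta,x,0)=\frac{0.6}{0.7}=\frac{6}{7}\colon\end{equation*}

with probability 0.7 an impulse is applied immediately, and, conditionally on this, it belongs to $\Gamma_2$ with probability $6/7$; with the remaining probability 0.3 the waiting time until the next planned impulse is drawn from $\Phi_1(\cdot|(x,\Delta,\Delta,\dots),0,(x,\Delta,\Delta,\dots))$ on $(0,\infty]$.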

Now consider $n\ge 1.$ Let $\hat{h}_n$ be the n-history in the hat DTMDP model. If $\{1\le i\le n: ~\theta_i>0\}= \emptyset,$ then we define $\varphi_n(\{0\}|\hat{h}_n):=1-\pi^{u_0}_n(\{\Delta\}|x_0,b_0,\dots,b_{n-1},x_n);$

\begin{align*}\varphi_n(dc|\hat{h}_n):=&\pi^{u_0}_n(\{\Delta\}|x_0,b_0,\dots,b_{n-1},x_n)\Phi_1(dc|y_0,0,(x_0,b_0,\dots,x_n,\Delta,\Delta,\dots))\\[3pt] &\mbox{on~} (0,\infty];\\[3pt] \psi_n(db|\hat{h}_n,c):=&\frac{\pi_n^{u_0}(db|x_0,b_0,x_1,b_1,\dots,x_n)}{1-\pi^{u_0}_n(\{\Delta\}|x_0,b_0,x_1,b_1,\dots,x_n)}I\{c=0\}\\[3pt] &+I\{c>0\}\frac{\pi^{\Gamma^1_1,(y_0,0,(x_0,b_0,\dots,x_n,\Delta,\dots))}_0(db|x_n)}{1-\pi^{\Gamma^1_1,(y_0,0,(x_0,b_0,\dots,x_n,\Delta,\dots))}_0(\{\Delta\}|x_n)}\\[3pt] =&\frac{\pi_n^{u_0}(db|x_0,b_0,x_1,b_1,\dots,x_n)}{1-\pi^{u_0}_n(\{\Delta\}|x_0,b_0,x_1,b_1,\dots,x_n)}I\{c=0\}\\[3pt] &+I\{c>0\}\pi^{\Gamma^1_1,(y_0,0,(x_0,b_0,\dots,x_n,\Delta,\dots))}_0(db|x_n);\end{align*}

and $F^n(\hat{h}_n)_t(da):=\Pi_1(da|y_0,0,(x_0,b_0,\dots,x_n,\Delta,\Delta,\dots),t).$ Recall the notation that was introduced earlier: $y_0=(x_0,\Delta,\Delta,\dots).$

If $\{1\le i\le n: ~\theta_i>0\}\ne \emptyset,$ then let $m(\hat{h}_n):=\#\{1\le i\le n: ~\theta_i>0\}$ and $l(\hat{h}_n):=\max\{1\le i\le n: \theta_i>0\}.$ When the context is clear, we write m and l instead of $m(\hat{h}_n)$ and $l(\hat{h}_n)$ for brevity. Let $h_m$ be the m-history in the gradual-impulse control problem contained in $\hat{h}_n$. More precisely, $h_m$ is defined based on $\hat{h}_n$ as follows. Let $\tau_0(\hat{h}_n)=0,$ and $\tau_i(\hat{h}_n):=\inf\{j>\tau_{i-1}:~\theta_j>0\}$ for each $i\ge 1.$ Note that $l=\tau_m.$ Then

\begin{equation*}h_m=h_m(\hat{h}_n)=(y_0,0,y_1,\theta_{\tau_1},y_{2},\dots,\theta_{\tau_{m-1}},y_m),\end{equation*}

where $y_0=(x_0,\Delta,\Delta,\dots)$;

\begin{equation*}y_1=(x_0,b_0,x_1,b_1,\dots,x_{\tau_1-1},\Delta,\Delta,\dots);\end{equation*}

if $\theta_{\tau_1}=c_{\tau_1-1}$, then

\begin{equation*}y_2=(x_{\tau_1-1},b_{\tau_1-1},x_{\tau_1},b_{\tau_1},\dots,x_{\tau_2-1},\Delta,\Delta,\dots),\end{equation*}

while if $\theta_{\tau_1}<c_{\tau_1-1}$, then

\begin{equation*}y_2=(x_{\tau_1},b_{\tau_1},\dots,x_{\tau_2-1},\Delta,\Delta,\dots);\end{equation*}

$\dots$; if $\theta_{\tau_{m-1}}=c_{\tau_{m-1}-1}$, then

\begin{equation*}y_m=(x_{\tau_{m-1}-1},\dots,x_{\tau_{m}-1},\Delta,\Delta,\dots),\end{equation*}

while if $\theta_{\tau_{m-1}}<c_{\tau_{m-1}-1}$, then

\begin{equation*}y_m=(x_{\tau_{m-1}},\dots,x_{\tau_{m}-1},\Delta,\Delta,\dots).\end{equation*}

For example, if

\begin{eqnarray*}\hat{h}_5&=&((0,x_0),(0,b_0,\rho^0),(0,x_1),(3,b_1,\rho^1),(3,x_2),(0,b_2,\rho^2),(0,x_3),\\[3pt]&&(2,b_3,\rho^3),(1,x_4),(0,b_4,\rho^4),(0,x_5)), \end{eqnarray*}

then $n=5$, $m=2$, $l=4,$ $\tau_1=2$, $\tau_2=4,$ and $h_2=(y_0,0,y_1,3,y_2)$ with $y_1=(x_0,b_0,x_1,\Delta,\dots)$ and $y_2=(x_1,b_1,x_2,b_2,x_3,\Delta,\dots).$ Roughly speaking, the integer $m(\hat{h}_n)$ counts the number of interventions (except $y_0$) contained in the n-history of the hat DTMDP model.

If $0<\theta_l=c_{l-1},$ we define

\begin{align*}\varphi_n(\{0\}|\hat{h}_n):=&1-\pi^{\Gamma^1_m,h_m}_{n-l+1}(\{\Delta\}|x_{l-1},b_{l-1},\dots,b_{n-1},x_n);\\[3pt]\varphi_n(dc|\hat{h}_n):=&\pi^{\Gamma^1_m,h_m}_{n-l+1}(\{\Delta\}|x_{l-1},b_{l-1},\dots,b_{n-1},x_n)\Phi_m(dc|h_m)~\mbox{on~} (0,\infty];\\[3pt]\psi_n(db|\hat{h}_n,c):=&\frac{\pi^{\Gamma^1_m,h_m}_{n-l+1}(db|x_{l-1},b_{l-1},\dots,b_{n-1},x_n)}{1-\pi^{\Gamma^1_m,h_m}_{n-l+1}(\{\Delta\}|x_{l-1},b_{l-1},\dots,b_{n-1},x_n)}I\{c=0\}\\[3pt] &+I\{c>0\}\frac{\pi^{\Gamma^1_{m+1},(h_m,\theta_l,(x_{l-1},b_{l-1},\dots,x_n,\Delta,\dots))}_0(db|x_n)}{1-\pi^{\Gamma^1_{m+1},(h_m,\theta_l,(x_{l-1},b_{l-1},\dots,x_n,\Delta,\dots))}_0(\{\Delta\}|x_n)}\\[3pt] = &\frac{\pi^{\Gamma^1_m,h_m}_{n-l+1}(db|x_{l-1},b_{l-1},\dots,b_{n-1},x_n)}{1-\pi^{\Gamma^1_m,h_m}_{n-l+1}(\{\Delta\}|x_{l-1},b_{l-1},\dots,b_{n-1},x_n)}I\{c=0\}\\[3pt] &+I\{c>0\} \pi^{\Gamma^1_{m+1},(h_m,\theta_l,(x_{l-1},b_{l-1},\dots,x_n,\Delta,\dots))}_0(db|x_n);\\[3pt] F^n(\hat{h}_n)_t(da):=&\Pi_m(da|h_m,t).\end{align*}

Finally, if $0<\theta_l<c_{l-1}$, then we define

\begin{align*}\varphi_n(\{0\}|\hat{h}_n):=&1-\pi^{\Gamma^0_m,h_m,\theta_l,x_l}_{n-l}(\{\Delta\}|x_{l},b_{l},\dots,b_{n-1},x_n);\\[3pt] \varphi_n(dc|\hat{h}_n):=&\pi^{\Gamma^0_m,h_m,\theta_l,x_l}_{n-l}(\{\Delta\}|x_{l},b_{l},\dots,b_{n-1},x_n)\Phi_m(dc|h_m)~\mbox{on~} (0,\infty];\\[3pt] \psi_n(db|\hat{h}_n,c):=&\frac{\pi^{\Gamma^0_m,h_m,\theta_l,x_l}_{n-l}(db|x_{l},b_{l},\dots,b_{n-1},x_n)}{1-\pi^{\Gamma^0_m,h_m,\theta_l,x_l}_{n-l}(\{\Delta\}|x_{l},b_{l},\dots,b_{n-1},x_n)}I\{c=0\}\\[3pt] &+I\{c>0\}\frac{\pi^{\Gamma^1_{m+1},(h_m,\theta_l,(x_{l},b_{l},\dots,x_n,\Delta,\dots))}_0(db|x_n)}{1-\pi^{\Gamma^1_{m+1},(h_m,\theta_l,(x_{l},b_{l},\dots,x_n,\Delta,\dots))}_0(\{\Delta\}|x_n)}\\[3pt] =&\frac{\pi^{\Gamma^0_m,h_m,\theta_l,x_l}_{n-l}(db|x_{l},b_{l},\dots,b_{n-1},x_n)}{1-\pi^{\Gamma^0_m,h_m,\theta_l,x_l}_{n-l}(\{\Delta\}|x_{l},b_{l},\dots,b_{n-1},x_n)}I\{c=0\}\\[3pt] &+I\{c>0\} \pi^{\Gamma^1_{m+1},(h_m,\theta_l,(x_{l},b_{l},\dots,x_n,\Delta,\dots))}_0(db|x_n);\\[3pt] F^n(\hat{h}_n)_t(da):=&\Pi_m(da|h_m,t).\end{align*}
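Continuing the example of $\hat{h}_5$ above: there $0<\theta_l=\theta_4=1<2=c_3=c_{l-1}$, so this last case applies with $n=5$, $m=2$, and $l=4$, giving, for instance,

\begin{equation*}\varphi_5(\{0\}|\hat{h}_5)=1-\pi^{\Gamma^0_2,h_2,1,x_4}_{1}(\{\Delta\}|x_4,b_4,x_5),\qquad F^5(\hat{h}_5)_t(da)=\Pi_2(da|h_2,t).\end{equation*}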

To be specific, we refer to the typical strategy $\sigma=\{(\varphi_n,\psi_n,F^n)\}_{n=0}^\infty$ defined above as the strategy induced by the policy u. The next statement reveals a connection between a policy u and its induced strategy $\sigma$ for the hat DTMDP model.

Proposition 2. For each policy u and the strategy $\sigma=\{(\varphi_n,\psi_n,F^n)\}_{n=0}^\infty$ induced by u, ${\mathcal V}(x,u)=V((0,x),\sigma)$, and therefore ${\mathcal V}^\ast(x)\ge V^\ast(x)$ for each $x\in\textbf{X}.$

Proof. One can verify that

\begin{eqnarray*}&&\mathbb{E}_x^u\left[e^{\sum_{i=1}^n C^I(Y_i) +\sum_{i=2}^n\int_{0}^{\Theta_{i}} \int_{\textbf{A}^G}c^G(\overline{x}(Y_{i-1}),a)\Pi_{i-1}(da|H_{i-1},s)ds}\right]\\[3pt]&=&\hat{\mathbb{E}}_{(0,x)}^{\sigma}\left[e^{\sum_{i=0}^{\tau_n-1} c^I(X_i,B_i,X_{i+1})I\{C_i=\Theta_{i+1}\}}\right.\\[3pt]&&\left. e^{\sum_{i=2}^{n}\int_{0}^{\Theta_{\tau_{i-1}}} \int_{\textbf{A}^G}c^G(X_{\tau_{i-1}-1},a)F^{\tau_{i-1}-1}(\hat{H}_{\tau_{i-1}-1})_s(da)ds}\right]\end{eqnarray*}

for each $n\ge 1$. Passing to the limit as $n\rightarrow \infty$ and applying the monotone convergence theorem yields the equality in the statement. The last assertion follows automatically from the first assertion.

Remark 1. A deterministic stationary policy, say $u^D$, identified by $(\varphi,\psi,f)$ as in Definition 4 is associated with a strategy $\sigma^D=(\varphi,\psi,F)$ in the hat DTMDP model, where $F(x)_t(da)=\delta_{f(x)}(da)$ for all $t\ge 0,$ and vice versa. It is evident that ${\mathcal V}(x,u^D)=V((0,x),\sigma^D)$ for each $x\in\textbf{X}.$ Thus, if the hat DTMDP problem (12) has an optimal strategy of this form $\sigma^D=(\varphi,\psi,F)$, then the previous discussions imply that ${\mathcal V}^\ast(x)=V^\ast(x)$, and that the deterministic stationary policy $u^D$ associated with $\sigma^D$ is optimal for the gradual-impulse control problem (6).

To end this section, note that Condition 2 does not imply that the hat DTMDP model is semicontinuous, as defined in the appendix. In fact, the transition probability p, in general, does not satisfy the weak continuity condition, even under Condition 2. The simplest example is as follows.

Example 3. Suppose $q_x(a)\equiv 0$, and $\textbf{A}^G$ and $\textbf{A}^I$ are both singletons. Consider $\hat{a}_n=(c_n,b,\rho)$, where $c_n\rightarrow \infty$ and $c_n\in[0,\infty)$ for each $n\ge 1$, together with the bounded continuous function on $\hat{\textbf{X}}$ defined by $g(t,x)\equiv 1$ for each $(t,x)\in[0,\infty)\times \textbf{X}$, and $g(\infty,x_\infty)=0$. Then

\begin{equation*}\int_{\hat{\textbf{X}}}g(t,y)p(dt\times dy|(\theta,x),\hat{a}_n)=\int_{\textbf{X}}g(c_n,y)Q(dy|x,b)=1\end{equation*}

for each $n\ge 1,$ whereas

\begin{equation*}\int_{\hat{\textbf{X}}}g(t,y)p(dt\times dy|(\theta,x),(\infty,b,\rho))=g(\infty,x_\infty)=0\ne 1.\end{equation*}

One can also construct examples where the transition probability p is not continuous with respect to $\rho\in{\mathcal R}.$
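For readers who prefer to see Example 3 numerically, the following minimal Python sketch reproduces the two displayed integrals (the identifiers are ours; the state space is collapsed to a single point, so that $Q(\cdot|x,b)=\delta_x$, which is consistent with the assumptions of the example):

\begin{verbatim}
import math

X_INF = "x_inf"   # the artificial cemetery state (notation assumed here)

def g(t, x):
    # the bounded function of Example 3: 1 on [0, infty) x X, 0 at (infty, x_inf)
    return 0.0 if (t == math.inf and x == X_INF) else 1.0

def integral_g_dp(c, x):
    # integral of g against p(.|(theta, x), (c, b, rho)) when q_x(a) = 0:
    # no natural jump can occur, so the planned impulse at time c is certain,
    # and the next hat-state is (c, x) (Q is degenerate at x in this toy model);
    # c = infty means no impulse is ever planned, and the next state is
    # (infty, x_inf)
    return g(c, x) if c < math.inf else g(math.inf, X_INF)

x = 0.0
for c_n in [1.0, 10.0, 1e6]:
    print(integral_g_dp(c_n, x))       # prints 1.0 for every finite c_n
print(integral_g_dp(math.inf, x))      # prints 0.0 at the limiting action
\end{verbatim}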

5. Proofs of the main statements

In this section, we prove the results stated in Section 3. This is based on the investigation of the problem (12) for the hat DTMDP model described in Section 4. In this section, unless specified otherwise, $V^\ast$ is understood as the value function of the problem (12) for the hat DTMDP model. The main facts concerning $V^\ast$ are summarized in the next statement.

Proposition 3. (a) $V^\ast$ is a $[1,\infty]$-valued lower semianalytic function on $\textbf{X}$ satisfying

(13) \begin{align}\inf_{a\in\textbf{A}^G}\left\{\int_{\textbf{X}} V^\ast(y) \tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a))V^\ast(x) \right\}\ge 0&\\\forall~x\in\textbf{X}^\ast(V^\ast):=\{x\in\textbf{X}:~V^\ast(x)<\infty\}&\nonumber\end{align}

and

(14) \begin{eqnarray}V^\ast(x)\le \inf_{b\in\textbf{A}^I}\left\{ \int_{\textbf{X}} e^{c^I(x,b,y)}V^\ast(y)Q(dy|x,b) \right\}, \qquad x\in\textbf{X},\end{eqnarray}

where at each $x\in\textbf{X}$, the inequality in either (13) or (14) holds with equality.

(b) $\textbf{X}\setminus \textbf{X}^I\subseteq \textbf{X}^G,$ where $\textbf{X}^G:=\textbf{X}^G(V^\ast)$ (see (9)), and $\textbf{X}^I:=\textbf{X}^I(V^\ast)$. (Lemma 1 below asserts that $V^\ast$ is universally measurable, so that the integrals $\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a)$ and

\begin{equation*}\int_{\textbf{X}}V^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\end{equation*}

are defined.)

Proof. See Lemmas 1, 3, and 4 below.

Lemma 1. (a) The value function $V^\ast$ depends on the state $(\theta,x)$ only through the second coordinate, and thus we write $V^\ast(x)$ instead of $V^\ast(\theta,x).$ The function $V^\ast$ is a $[1,\infty]$-valued lower semianalytic function satisfying

(15) \begin{eqnarray}V(x)&=&\inf_{\hat{a}\in\hat{\textbf{A}}}\left\{\int_0^c \int_{\textbf{X}}V(y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}dt\right. \\[3pt]&&\left. +I\{c=\infty\} e^{-\int_0^\infty q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds}\right.\nonumber\\[3pt]&&\left.+I\{c<\infty\}e^{-\int_0^c (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}V(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}, \qquad x\in\textbf{X}, \nonumber\\[3pt]V(x_\infty)&=&1,\nonumber\end{eqnarray}

and is the minimal $[1,\infty]$-valued lower semianalytic function satisfying the following inequality:

(16) \begin{eqnarray}V(x)&\ge&\inf_{\hat{a}\in\hat{\textbf{A}}}\left\{\int_0^c \int_{\textbf{X}}V(y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}dt \right.\\[3pt]&&\left.+I\{c=\infty\} e^{-\int_0^\infty q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds}\right.\nonumber\\[3pt]&&\left.+I\{c<\infty\}e^{-\int_0^c (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}V(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}, \qquad x\in\textbf{X},\nonumber\\[3pt]V(x_\infty)&=&1.\nonumber\end{eqnarray}

(b) For each $\epsilon>0,$ there exists, for the hat DTMDP problem (12), an $\epsilon$-optimal deterministic Markov universally measurable strategy that depends on the state $(\theta,x)$ only through the second coordinate.

(c) A deterministic stationary strategy that depends on the state $(\theta,x)$ only through x is optimal if and only if it attains the infimum in (15) with $V^\ast$ replacing V, for each $x\in\textbf{X}.$

(d) For each $x\in\textbf{X}$, $V^\ast(x)=\inf_{\pi\in\Pi^{U}}V(x,\pi)$, where $\Pi^U$ is the class of universally measurable strategies in the hat DTMDP model.

Proof. The fact that the value function $V^\ast$ is the minimal $[1,\infty]$-valued lower semianalytic function satisfying

\begin{align*}g(\theta,x)\ge &\inf_{\hat{a}\in\hat{\textbf{A}}}\left\{\int_0^c \int_{\textbf{X}}g(t,y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}dt\right.\\&\left. +I\{c=\infty\} e^{-\int_0^\infty q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds}\right.\\&\left.+I\{c<\infty\}e^{-\int_0^c (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}g(c,y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}, \qquad x\in\textbf{X},\\g(\infty,x_\infty)=&1,\end{align*}

where the inequality can be replaced by equality, follows from Proposition 4. The existence of an $\epsilon$-optimal deterministic Markov universally measurable strategy follows from Proposition 4, too. Furthermore, note that the first coordinate in the state $(\theta,x)$ does not affect the cost function or the transition probability, from which the independence from the first coordinate of the state $(\theta,x)$ follows; cf. [8, 25]. Now assertions (a) and (b) follow. Finally, the last two assertions follow from Proposition 4.

Lemma 2. The function of $t\in[0,\infty)$ defined by

\begin{eqnarray*} \int_0^t \int_{\textbf{X}} e^{-\int_0^\tau (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(y)\tilde{q}(dy|x,\rho_\tau)d\tau +e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x)\end{eqnarray*}

is nondecreasing, for each $x\in\textbf{X}$ and $\rho\in{\mathcal R}.$

Proof. Let $0\le t_1<t_2<\infty$ and $x\in\textbf{X}$ be fixed, and we will verify

\begin{eqnarray*}&&\int_0^{t_2} e^{-\int_0^\tau (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_\tau)d\tau\\&&+e^{-\int_0^{t_2} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x)\\&\ge&\int_0^{t_1} e^{-\int_0^\tau (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_\tau)d\tau\\&&+e^{-\int_0^{t_1} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x),\end{eqnarray*}

as follows. It is sufficient to consider the case when the left-hand side is finite, for otherwise the above inequality would hold automatically. Then the goal is to show, by subtracting the right-hand side from the left-hand side, that

\begin{eqnarray*}&&0\le \int_{t_1}^{t_2} e^{-\int_0^\tau (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_\tau)d\tau \\&&+ e^{-\int_0^{t_2} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x)-e^{-\int_0^{t_1} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x).\end{eqnarray*}

The right-hand side of this inequality can be further rewritten as

\begin{eqnarray*}&&\int_0^{t_2-t_1} e^{-\int_0^{t_1} (q_x(\rho_s)-c^G(x,\rho_s))ds }e^{-\int_{t_1}^{\tau+t_1}(q_x(\rho_s)-c^G(x,\rho_s))ds }\int_{\textbf{X}} V^\ast(y) \\&&\tilde{q}(dy|x,\rho_{\tau+t_1})d\tau+e^{-\int_{0}^{t_1}(q_x(\rho_s)-c^G(x,\rho_s))ds }\left(e^{-\int_{t_1}^{t_2}(q_x(\rho_s)-c^G(x,\rho_s))ds}-1\right)V^\ast(x)\\&=&e^{-\int_{0}^{t_1}(q_x(\rho_s)-c^G(x,\rho_s))ds } \left\{\int_0^{t_2-t_1} e^{-\int_0^\tau (q_x(\rho_{s+t_1})-c^G(x,\rho_{s+t_1}))ds} \right.\\&&\left.\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_{\tau+t_1})d\tau +\left(e^{-\int_{0}^{t_2-t_1 }(q_x(\rho_{t_1+s})-c^G(x,\rho_{t_1+s}))ds}-1\right)V^\ast(x) \right\}.\end{eqnarray*}

Introduce $\tilde{\rho}_s:=\rho_{t_1+s}$ for each $s\ge 0.$ It then remains to show that

\begin{eqnarray*}&&\int_0^{t_2-t_1} e^{-\int_0^\tau (q_x(\tilde{\rho}_s)-c^G(x,\tilde{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\tilde{\rho}_\tau)d\tau \\&& +e^{-\int_0^{t_2-t_1}(q_x(\tilde{\rho}_s)-c^G(x,\tilde{\rho}_s))ds }V^\ast(x)\ge V^\ast(x).\end{eqnarray*}

To this end, for a fixed $\epsilon>0$, let us consider a deterministic Markov $\epsilon$-optimal universally measurable strategy $\{(\varphi^\ast_n,\psi^\ast_n,F^{\ast,n})\}_{n=0}^\infty$ coming from Lemma 1, and an associated universally measurable strategy $\pi^{New}=\{(\varphi_n,\psi_n,F^n)\}_{n=0}^\infty$ defined by $\varphi_0(\theta, x):=\varphi_0^\ast(x)+t_2-t_1,$ $\psi_0(\theta,x)=\psi^\ast_0(x)$, $F^0(\theta,x)_s=\tilde{\rho}_s$ if $s\le t_2-t_1$, and

\begin{equation*} F^0(\theta,x)_s=F^{\ast,0}(\theta,x)_{s-(t_2-t_1)}\end{equation*}

if $s>t_2-t_1;$ and for $n\ge 1$, $\varphi_n((\theta,x),\hat{a},(t,y))= \varphi^\ast_{n-1}(y)$, $\psi_n((\theta,x),\hat{a},(t,y))=\psi^\ast_{n-1}(y),$ and $F^n((\theta,x),\hat{a},(t,y))_s= F^{\ast,n-1}(y)_s$ for all $s\ge 0.$ Under the universally measurable strategy $\pi^{New}$, only the gradual control action $\tilde{\rho}$ is used up to either $t_2-t_1$ or the natural jump moment, whichever takes place first, after which the $\epsilon$-optimal universally measurable strategy is in use, and so

\begin{eqnarray*}&&V^\ast(x)\le V(x,\pi^{New})\le \int_0^{t_2-t_1} e^{-\int_0^\tau (q_x(\tilde{\rho}_s) -c^G(x,\tilde{\rho}_s))ds}\\&&\int_{\textbf{X}}(V^\ast(y)+\epsilon)\tilde{q}(dy|x,\tilde{\rho}_\tau)d\tau+e^{-\int_0^{t_2-t_1}(q_x(\tilde{\rho}_s)-c^{G}(x,\tilde{\rho}_s))ds }(V^\ast(x)+\epsilon)\\&=&\int_0^{t_2-t_1} e^{-\int_0^\tau (q_x(\tilde{\rho}_s) -c^G(x,\tilde{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\tilde{\rho}_\tau)d\tau\\&&+e^{-\int_0^{t_2-t_1}(q_x(\tilde{\rho}_s)-c^{G}(x,\tilde{\rho}_s))ds }V^\ast(x)\\&&+\epsilon\left( \int_0^{t_2-t_1} e^{-\int_0^\tau (q_x(\tilde{\rho}_s) -c^G(x,\tilde{\rho}_s))ds}q_x(\tilde{\rho}_\tau) d\tau+e^{-\int_0^{t_2-t_1}(q_x(\tilde{\rho}_s)-c^{G}(x,\tilde{\rho}_s))ds}\right),\end{eqnarray*}

where the first inequality holds because of the last assertion of Lemma 1. Since the expression in the last pair of brackets is nonnegative and finite, and $\epsilon>0$ was arbitrarily fixed, we see that

\begin{align*}V^\ast(x) \le &\int_0^{t_2-t_1} e^{-\int_0^\tau (q_x(\tilde{\rho}_s) -c^G(x,\tilde{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\tilde{\rho}_\tau)d\tau\\&+e^{-\int_0^{t_2-t_1}(q_x(\tilde{\rho}_s)-c^{G}(x,\tilde{\rho}_s))ds }V^\ast(x),\end{align*}

as desired.

Lemma 3. The relations (13) and (14) hold. (Recall from Lemma 1 that $V^\ast$ is lower semianalytic, hence universally measurable.)

Proof. Let $x\in\textbf{X}$ be fixed. Inequality (14) immediately follows from Lemma 1, if on the right-hand side of (15), with $V^\ast$ replacing V, one takes the infimum over actions $\hat{a}\in\hat{\textbf{A}}$ with $c=0$. (Recall the notation in use: $\hat{a}=(c,b,\rho)\in\hat{\textbf{A}}$.) Let us verify (13) as follows. Suppose $V^\ast(x)<\infty$. Let $a\in\textbf{A}^G$ be arbitrarily fixed. If $\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a)=\infty$, then trivially

\begin{equation*}\int_{\textbf{X}} V^\ast(y) \tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a))V^\ast(x)\ge 0.\end{equation*}

Consider the case when $\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a)<\infty.$ Let $t>0$ be arbitrarily fixed. Then

\begin{equation*}\int_0^t e^{-\tau (q_x(a) -c^G(x,a))} \int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a)d\tau+e^{-t (q_x(a)-c^G(x,a))}V^\ast(x)\end{equation*}

is finite. Upon differentiating it with respect to t and applying the fundamental theorem of calculus, we see

\begin{equation*} e^{-(q_x(a)-c^G(x,a))t} \int_{\textbf{X}} V^\ast(y)\tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a))e^{-t(q_x(a)-c^G(x,a))}V^\ast(x)\ge 0,\end{equation*}

where the inequality follows from Lemma 2. Thus,

\begin{equation*}\int_{\textbf{X}} V^\ast(y) \tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a))V^\ast(x)\ge 0.\end{equation*}

Since $a\in\textbf{A}^G$ was arbitrarily fixed, we see that (13) holds.
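For a closed-form check of the differentiation step above: for the constant relaxed control $\rho_s\equiv a$, with $\alpha:=q_x(a)-c^G(x,a)\ne 0$ and $K:=\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a)<\infty$, the function of Lemma 2 reduces to

\begin{equation*}t\mapsto \frac{K}{\alpha}\left(1-e^{-\alpha t}\right)+e^{-\alpha t}V^\ast(x),\end{equation*}

whose derivative is $e^{-\alpha t}(K-\alpha V^\ast(x))$; this is nonnegative for all $t\ge 0$ precisely when $K-\alpha V^\ast(x)\ge 0$, which is the inequality in (13) for the action a. (When $\alpha=0$ the function is $Kt+V^\ast(x)$, and the same conclusion holds.)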

Lemma 4. For each $x\in \textbf{X}$, the inequality in either (13) or (14) holds with equality.

Proof. Let $x\in\textbf{X}$ be fixed. If the equality in (14) holds at this point, then there is nothing to prove. Suppose that strict inequality holds in (14). Then necessarily $V^\ast(x)<\infty.$ The objective is to show that, in this case, (13) holds with equality. For the infimum in (15) with $V^\ast$ replacing V, it suffices to consider $c>0$, because (14) holds with strict inequality at the fixed point $x\in\textbf{X}$ here. Let $\epsilon>0$ be fixed, and let $(c^\ast,b^\ast,\rho^\ast)\in\hat{\textbf{A}}$ be such that

\begin{align*}V^\ast(x)+\epsilon\ge &\int_0^{c^\ast} \int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho^\ast_t)e^{-\int_0^t (q_x(\rho^\ast_s)-c^G(x,\rho^\ast_s))ds}dt\\& +I\{c^\ast=\infty\} e^{-\int_0^\infty q_x(\rho^\ast_s)ds} e^{\int_0^\infty c^G(x,\rho^\ast_s)ds}\nonumber\\&+I\{c^\ast<\infty\}e^{-\int_0^{c^\ast} (q_x(\rho^\ast_s)-c^G(x,\rho^\ast_s))ds}\int_{\textbf{X}}V^\ast(y)e^{c^I(x,b^\ast,y)}Q(dy|x,b^\ast).\end{align*}

There are two cases to be considered: (a) $0<c^\ast<\infty$, and (b) $c^\ast=\infty$.

Consider Case (a). Then

\begin{eqnarray*}&&\epsilon+V^\ast(x)\ge \int_0^{c^\ast} \int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho^\ast_t)e^{-\int_0^t (q_x(\rho^\ast_s)-c^G(x,\rho^\ast_s))ds}dt\\&&+e^{-\int_0^{c^\ast} (q_x(\rho^\ast_s)-c^G(x,\rho^\ast_s))ds}\int_{\textbf{X}}V^\ast(y)e^{c^I(x,b^\ast,y)}Q(dy|x,b^\ast)\\&\ge&\inf_{\rho\in{\mathcal R}}\left\{\int_0^{c^\ast} e^{-{\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds }} \int_{\textbf{X}} V^\ast(y)\tilde{q}(dy|x,\rho_t)dt\right.\\&&\left.+e^{-\int_0^{c^\ast} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x) \right\}\ge V^\ast(x),\end{eqnarray*}

where the second inequality holds because of (14), and the last inequality holds because of Lemma 2. Thus, as $\epsilon>0$ was arbitrarily fixed,

(17) \begin{eqnarray}V^\ast(x)&=&\inf_{\rho\in{\mathcal R}}\left\{\int_0^{c^\ast} e^{-{\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds }} \int_{\textbf{X}} V^\ast(y) \tilde{q}(dy|x,\rho_t)dt\right.\nonumber\\[3pt]&&\left.+e^{-\int_0^{c^\ast} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x) \right\}.\end{eqnarray}

Let $\delta>0$ be fixed. There is some $\rho\in{\mathcal R}$ such that $\int_0^{c^\ast} (q_x(\rho_s)-c^G(x,\rho_s))ds<\infty$,

\begin{equation*}\int_0^{c^\ast} e^{-{\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds }} \int_{\textbf{X}} V^\ast(y)\tilde{q}(dy|x,\rho_t)dt<\infty,\end{equation*}

and

\begin{eqnarray*}\delta&\ge&\int_0^{c^\ast} e^{-\int_0^s (q_x(\rho_v)-c^G(x,\rho_v))dv}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_s)ds\\[3pt]&&+e^{-\int_0^{c^\ast} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x)-V^\ast(x)\\[3pt]&=&\int_0^{c^\ast} e^{-\int_0^s (q_x(\rho_v)-c^G(x,\rho_v))dv}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_s)ds\\[3pt]&&-\int_0^{c^\ast} (q_x(\rho_\tau)-c^G(x,\rho_\tau))e^{-\int_0^\tau (q_x(\rho_s)-c^G(x,\rho_s))ds}d\tau V^\ast(x)\\[3pt]&=& \int_0^{c^\ast} e^{-\int_0^s (q_x(\rho_v)-c^G(x,\rho_v))dv}\left\{\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_s)\right.\\[3pt]&&\left. - (q_x(\rho_s)-c^G(x,\rho_s))V^\ast(x)\right\}ds\ge \int_0^{c^\ast} e^{-\int_0^s (q_x(\rho_v)-c^G(x,\rho_v))dv}ds \\[3pt]&&\times \inf_{a\in\textbf{A}^G}\left\{\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a) - (q_x(a)-c^G(x,a))V^\ast(x)\right\}\\[3pt]&\ge& \int_0^{c^\ast} e^{-\overline{q}_x s}ds \inf_{a\in\textbf{A}^G}\left\{\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a) - (q_x(a)-c^G(x,a))V^\ast(x)\right\} \ge 0,\end{eqnarray*}

where the last inequality holds because of (13). Since $\int_0^{c^\ast} e^{-\overline{q}_x s}ds>0$ and $\delta>0$ was arbitrarily fixed, we see that (13) holds with equality.

Now consider Case (b). Here,

\begin{eqnarray*}\epsilon+V^\ast(x)&\ge&\inf_{\rho\in {\mathcal R}}\left\{\int_0^{\infty} e^{-{\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds }} \int_{\textbf{X}} V^\ast(y)\tilde{q}(dy|x,\rho_t)dt\right.\\[3pt]&&\left.+e^{-\int_0^{\infty} q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds} \right\}.\end{eqnarray*}

One can show that for each $t\in[0,\infty),$

(18) \begin{eqnarray}V^\ast(x)&=& \inf_{\rho\in{\mathcal R}}\left\{\int_0^{t} e^{-{\int_0^\tau (q_x(\rho_s)-c^G(x,\rho_s))ds }} \int_{\textbf{X}} V^\ast(y)\tilde{q}(dy|x,\rho_\tau)d\tau\right.\nonumber\\[3pt]&&\left.+e^{-\int_0^{t} (q_x(\rho_s)-c^G(x,\rho_s))ds}V^\ast(x) \right\}.\end{eqnarray}

The details are as follows. We only need consider $t>0$; the case of $t=0$ is trivial. Let $\delta>0$ be arbitrarily fixed. Then there is some $\hat{\rho}\in {\mathcal R}$ such that

\begin{eqnarray*}\epsilon+V^\ast(x)+\delta&\ge& \int_0^\infty e^{-\int_0^\tau (q_x(\hat{\rho}_s)-c^G(x,\hat{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\hat{\rho}_\tau)d\tau\\[3pt]&&+e^{-\int_0^\infty q_x(\hat{\rho}_s)ds}e^{\int_0^\infty c^G(x,\hat{\rho}_s)ds}.\end{eqnarray*}

Define $\tilde{\rho}\in {\mathcal R}$ by $\tilde{\rho}_s=\hat{\rho}_{t+s}$ for each $s\ge 0.$ Then, for each $t\ge 0,$

\begin{eqnarray*}&& \epsilon+V^\ast(x)+\delta\ge \int_0^t e^{-\int_0^\tau (q_x(\hat{\rho}_s)-c^G(x,\hat{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\hat{\rho}_\tau)d\tau \\[3pt]&& +\int_t^\infty e^{-\int_0^\tau (q_x(\hat{\rho}_s)-c^G(x,\hat{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\hat{\rho}_\tau)d\tau\\[3pt]&&+e^{-\int_0^t (q_x(\hat{\rho}_s)-c^G(x,\hat{\rho}_s))ds} e^{-\int_t^\infty q_x(\hat{\rho}_s)ds}e^{\int_t^\infty c^G(x,\hat{\rho}_s)ds}\\[3pt]&=& \int_0^t e^{-\int_0^\tau (q_x(\hat{\rho}_s)-c^G(x,\hat{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\hat{\rho}_\tau)d\tau +e^{-\int_0^t (q_x(\hat{\rho}_v)-c^G(x,\hat{\rho}_v))dv}\\[3pt]&&\times \left\{\int_0^\infty e^{-\int_0^s (q_x(\tilde{\rho}_v)-c^G(x,\tilde{\rho}_v))dv}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\tilde{\rho}_s)ds\right.\\[3pt]&&\left.+ e^{-\int_0^\infty q_x(\tilde{\rho}_s)ds}e^{\int_0^\infty c^G(x,\tilde{\rho}_s)ds} \right\}\\[3pt]&\ge &\int_0^t e^{-\int_0^\tau (q_x(\hat{\rho}_s)-c^G(x,\hat{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\hat{\rho}_\tau)d\tau +e^{-\int_0^t (q_x(\hat{\rho}_v)-c^G(x,\hat{\rho}_v))dv}V^\ast(x)\\[3pt]&\ge& \inf_{\rho\in{\mathcal R}} \left\{\int_0^t e^{-\int_0^\tau (q_x(\rho_s)-c^G(x,{\rho}_s))ds}\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,{\rho}_\tau)d\tau\right.\\[3pt]&&\left. +e^{-\int_0^t (q_x({\rho}_v)-c^G(x,{\rho}_v))dv}V^\ast(x)\right\}\ge V^\ast(x),\end{eqnarray*}

where the second inequality is by Lemma 1(a), which in particular asserts that $V^\ast$ satisfies (15), and the last inequality is by Lemma 2. Since $\epsilon>0$ and $\delta>0$ were arbitrarily fixed, the above implies (18). Comparing (18) with (17), we see that Case (b) is reduced to Case (a).

Lemma 5. Let w be a measurable $[1,\infty)$-valued function satisfying the inequality in Condition 1, whose existence is guaranteed as mentioned in the paragraph below Condition 1. Consider the transition probability $\tilde{p}(dy|x,a)$ on ${\mathcal B}(\textbf{X})$ given $(x,a)\in\textbf{X}\times\textbf{A}^G$ defined by

\begin{equation*}\tilde{p}(\Gamma|x,a):=\frac{q(\Gamma|x,a)}{w(x)}+\delta_{x}(\Gamma)\end{equation*}

for each $\Gamma\in{\mathcal B}(\textbf{X})$, $(x,a)\in\textbf{X}\times\textbf{A}^G$. Then a $[1,\infty]$-valued lower semianalytic function $V^\ast$ (here the notation $V^\ast$ does not necessarily mean the value function) satisfies (13) and (14), with equality holding in either (13) or (14) at each $x\in\textbf{X}$, if and only if it satisfies (14) and, for each $x\in\textbf{X}$,

(19) \begin{eqnarray}V^\ast(x)\le \inf_{a\in\textbf{A}^G} \left\{\frac{w(x)}{w(x)-c^G(x,a)}\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,a)\right\},\end{eqnarray}

and either (14) or (19) holds with equality, i.e.,

(20) \begin{eqnarray}V^\ast(x)&=&\min\left\{\inf_{a\in\textbf{A}^G} \left\{\frac{w(x)}{w(x)-c^G(x,a)}\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,a)\right\},\right.\nonumber\\[3pt]&&\left.\inf_{b\in\textbf{A}^I}\left\{ \int_{\textbf{X}}V^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\} \right\}.\end{eqnarray}

Note that (19) automatically holds with equality at $x\in \textbf{X}\setminus\textbf{X}^\ast(V^\ast):=\{x\in\textbf{X}:~V^\ast(x)=\infty\}.$ Also note that the function w in Lemma 5 need not be continuous.

Proof of Lemma 5. We prove the ‘only if’ part (the argument for the ‘if’ part is the same, and is omitted). Consider a $[1,\infty]$-valued lower semianalytic function $V^\ast$ that satisfies (13) and (14), such that for each $x\in\textbf{X}$, either (13) or (14) holds with equality. For $x\in \textbf{X}^\ast(V^\ast)=\{x\in\textbf{X}:~V^\ast(x)<\infty\}$, (13) implies for each $a\in\textbf{A}^G$ that

\begin{align*}0&\le c^G(x,a)V^\ast(x)+\int_{\textbf{X}}V^\ast(y)q(dy|x,a)\\[3pt]&=(c^G(x,a)-w(x))V^\ast(x)+w(x)\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,a),\end{align*}

and so

\begin{equation*}V^\ast(x)\le \inf_{a\in\textbf{A}^G}\left\{\frac{w(x)}{w(x)-c^G(x,a)}\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,a)\right\},\end{equation*}

i.e., (19) holds. Let $x\in \textbf{X}^\ast(V^\ast)$ be a point where (13) holds with equality. Let us verify that at this point $x\in\textbf{X}^\ast(V^\ast),$ (19) also holds with equality. For each $\epsilon>0$, there is some $a_\epsilon\in\textbf{A}^G$ such that $\epsilon\ge c^G(x,a_{\epsilon})V^\ast(x)+\int_{\textbf{X}}V^\ast(y)q(dy|x,a_\epsilon)$, so that

\begin{align*}V^\ast(x)+\epsilon \ge &V^\ast(x)+\frac{\epsilon}{w(x)-c^G(x,a_\epsilon)}\\[3pt]\ge &V^\ast(x)+\frac{c^G(x,a_\epsilon)V^\ast(x)+\int_{\textbf{X}}V^\ast(y)q(dy|x,a_\epsilon)}{w(x)-c^G(x,a_{\epsilon})}\\[3pt]=&\frac{w(x)}{w(x)-c^G(x,a_{\epsilon})}\int_{\textbf{X}}\tilde{p}(dy|x,a_{\epsilon})V^\ast(y)\\[3pt]\ge& \inf_{a\in\textbf{A}^G} \left\{\frac{w(x)}{w(x)-c^G(x,a)}\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,a)\right\},\end{align*}

and thus

\begin{equation*}V^\ast(x)\ge \inf_{a\in\textbf{A}^G} \left\{\frac{w(x)}{w(x)-c^G(x,a)}\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,a)\right\}.\end{equation*}

The opposite direction of this inequality was seen earlier, and so (19) holds with equality at this point. This completes the ‘only if’ part. As previously mentioned, the argument for the ‘if’ part is the same, and is omitted.

Remark 2. Consider the function $V^\ast$ in the previous statement. By inspecting the above proof we see the following fact: a pair of measurable mappings $\psi^\ast$ and $f^\ast$ from $\textbf{X}$ to $\textbf{A}^I$ and $\textbf{A}^G$ satisfy

\begin{eqnarray*}&&\frac{w(x)}{w(x)-c^G(x,f^\ast(x))}\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,f^\ast(x))\\[3pt]&=&\inf_{a\in\textbf{A}^G} \left\{\frac{w(x)}{w(x)-c^G(x,a)}\int_{\textbf{X}}V^\ast(y)\tilde{p}(dy|x,a)\right\}\end{eqnarray*}

for each $x\in\textbf{X}$ at which (19) holds with equality, and

\begin{eqnarray*} \int_{\textbf{X}} e^{c^I(x,\psi^\ast(x),y)} V^\ast(y)Q(dy|x,\psi^\ast(x)) =\inf_{b\in\textbf{A}^I}\left\{ \int_{\textbf{X}} e^{c^I(x,b,y)}V^\ast(y)Q(dy|x,b) \right\}~\forall~x\in\textbf{X},\end{eqnarray*}

if and only if

\begin{eqnarray*}&&\inf_{a\in \textbf{A}^G}\left\{\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a))V^\ast(x)\right\}\\[3pt]&=&\int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,f^\ast(x))-(q_x(f^\ast(x))-c^G(x,f^\ast(x)))V^\ast(x)\end{eqnarray*}

for each $x\in\textbf{X}$ at which the left-hand side equals zero, and

\begin{eqnarray*} \int_{\textbf{X}} e^{c^I(x,\psi^\ast(x),y)} V^\ast(y)Q(dy|x,\psi^\ast(x)) =\inf_{b\in\textbf{A}^I}\left\{ \int_{\textbf{X}} e^{c^I(x,b,y)}V^\ast(y)Q(dy|x,b) \right\}~\forall~x\in\textbf{X}.\end{eqnarray*}
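To make the two-branch structure of (20) concrete, the following minimal Python sketch runs value iteration for (20) on a hypothetical finite two-state model (all numbers are ours, not from the paper; $w$ is chosen so that $w(x)>c^G(x,a)$ and $w(x)\ge q_x(a)$, making $\tilde p(\cdot|x,a)$ a probability measure). Starting from $V_0\equiv 1$ and iterating the minimum of the gradual and impulsive branches produces a nondecreasing sequence, whose limit solves (20) for such a finite toy model.

\begin{verbatim}
import numpy as np

# Hypothetical two-state toy model: state 1 is absorbing and cost-free.
A_G, A_I = [0, 1], [0]
q  = np.array([[[-1.0, 1.0], [-2.0, 2.0]],   # q[x][a][y]; rows sum to 0
               [[ 0.0, 0.0], [ 0.0, 0.0]]])
cG = np.array([[0.2, 0.4], [0.0, 0.0]])      # gradual cost rates c^G(x, a)
Q  = np.array([[[0.0, 1.0]], [[0.0, 1.0]]])  # impulse kernel Q[x][b][y]
cI = np.array([[[0.3, 0.3]], [[0.5, 0.5]]])  # impulse costs c^I(x, b, y)
w  = np.array([5.0, 5.0])                    # w(x), dominating rates and costs

nX = 2
p_tilde = np.zeros((nX, len(A_G), nX))       # p_tilde = q / w + delta_x
for x in range(nX):
    for a in A_G:
        p_tilde[x, a] = q[x, a] / w[x]
        p_tilde[x, a, x] += 1.0

V = np.ones(nX)                              # value iteration from V_0 = 1
for _ in range(500):
    grad = np.array([min(w[x] / (w[x] - cG[x, a]) * (p_tilde[x, a] @ V)
                         for a in A_G) for x in range(nX)])
    imp  = np.array([min((np.exp(cI[x, b]) * V) @ Q[x, b]
                         for b in A_I) for x in range(nX)])
    V = np.minimum(grad, imp)
print(V)   # approximately [1.25, 1.0]
\end{verbatim}

In this toy model $V^\ast(1)=1$, since state 1 is absorbing and cost-free, while from state 0 the gradual branch (the expected exponential utility of the cost accrued until the jump to state 1) gives $V^\ast(0)=1.25$, beating the impulse branch $e^{0.3}\approx 1.35$.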

Lemma 6. Suppose Conditions 1 and 2 are satisfied. Then $W^\ast(x)=V^\ast(x)$ for each $x\in\textbf{X}$.

Proof. According to Proposition 4(a)–(b), the value function $W^\ast$ for the tilde model is the minimal $[1,\infty]$-valued lower semianalytic function satisfying (7) as well as the inequality obtained by replacing the equality in (7) by ‘$\ge$’. Let us verify that $W^\ast=V^\ast$ as follows. According to Lemmas 3, 4, and 5, the value function $V^\ast$ is a $[1,\infty]$-valued lower semianalytic function satisfying (7); cf. (20). Therefore, $W^\ast\le V^\ast$ pointwise.

For the opposite direction of this inequality, let $x\in\textbf{X}$ be fixed. It suffices to show that $W^\ast$ satisfies (16) at the point x. Then, since the point $x\in\textbf{X}$ was arbitrarily fixed, one can apply the minimality assertion of Lemma 1 to obtain $V^\ast\le W^\ast$ pointwise.

Recall that, as observed in the beginning of this proof, $W^\ast$ satisfies (20). By Lemma 5, it satisfies (13) and (14), one of which holds with equality at this point x. If (14) holds with equality for $W^\ast$ at x, then

\begin{eqnarray*}W^\ast(x)&=&\inf_{b\in\textbf{A}^I}\left\{\int_{\textbf{X}}W^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}\\[3pt]&\ge& \inf_{\hat{a}\in\hat{\textbf{A}}}\left\{\int_0^c \int_{\textbf{X}}W^\ast(y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}dt \right.\\[3pt]&&\left.+I\{c=\infty\} e^{-\int_0^\infty q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds}\right.\nonumber\\[3pt]&&\left.+I\{c<\infty\}e^{-\int_0^c (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}W^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\},\end{eqnarray*}

and thus (16) is satisfied by $W^\ast$ at x, as required. Now suppose (13) holds with equality for $W^\ast$ at x. It suffices to consider $W^\ast(x)<\infty,$ for otherwise (16) automatically holds for $W^\ast$ at x. According to Remark 2 after Lemma 5, and because the tilde model is semicontinuous, there is some $a^\ast\in\textbf{A}^G$ satisfying

\begin{eqnarray*}&&\int_{\textbf{X}} W^\ast(y) \tilde{q}(dy|x,a^\ast)-(q_x(a^\ast)-c^G(x,a^\ast))W^\ast(x)\\[3pt]&=&\inf_{a\in\textbf{A}^G}\left\{\int_{\textbf{X}} W^\ast(y) \tilde{q}(dy|x,a)-(q_x(a)-c^G(x,a))W^\ast(x) \right\}=0,\end{eqnarray*}

and hence $\int_{\textbf{X}} W^\ast(y) \tilde{q}(dy|x,a^\ast)=(q_x(a^\ast)-c^G(x,a^\ast))W^\ast(x)$. This implies $q_x(a^\ast)\ge c^G(x,a^\ast)$, as the left-hand side of the previous equality is nonnegative and $W^\ast(x)\ge 1,$ and for the same reason, if $c^G(x,a^\ast)=q_x(a^\ast)$, then $c^G(x,a^\ast)=q_x(a^\ast)=0$, in which case

\begin{align*}W^\ast(x)\ge &1\\[3pt]= &\int_0^\infty \int_{\textbf{X}}W^\ast(y)\tilde{q}(dy|x,a^\ast)e^{-\int_0^t (q_x(a^\ast)-c^G(x,a^\ast))ds}dt\\[3pt]&+ e^{-\int_0^\infty q_x(a^\ast)ds} e^{\int_0^\infty c^G(x,a^\ast)ds}\\[3pt]\ge &\inf_{\hat{a}\in\hat{\textbf{A}}}\left\{\int_0^c \int_{\textbf{X}}W^\ast(y)\tilde{q}(dy|x,\rho_t) e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}dt \right.\\[3pt]&+I\{c=\infty\} e^{-\int_0^\infty q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds}\\[3pt]&\left.+I\{c<\infty\}e^{-\int_0^c (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}W^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}.\end{align*}

That is, (16) is satisfied by $W^\ast$ at x, as desired. Finally, if $c^G(x,a^\ast)<q_x(a^\ast)$, then

\begin{eqnarray*} &&\inf_{\hat{a}\in\hat{\textbf{A}}}\left\{\int_0^c \int_{\textbf{X}}W^\ast(y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}dt \right.\\[3pt]&&\left.+I\{c=\infty\} e^{-\int_0^\infty q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds}\right.\nonumber\\[3pt]&&\left.+I\{c<\infty\}e^{-\int_0^c (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}W^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}\\[3pt] &\le&\int_0^\infty \int_{\textbf{X}}W^\ast(y)\tilde{q}(dy|x,a^\ast)e^{-\int_0^t (q_x(a^\ast)-c^G(x,a^\ast))ds}dt+ e^{-\int_0^\infty q_x(a^\ast)ds} e^{\int_0^\infty c^G(x,a^\ast)ds}\\[3pt] &=&\frac{\int_{\textbf{X}}W^\ast(y)\tilde{q}(dy|x,a^\ast)}{q_x(a^\ast)-c^G(x,a^\ast)}+0=W^\ast(x),\end{eqnarray*}

as desired. Thus, $W^\ast$ satisfies (16). Consequently, $W^\ast= V^\ast$ on $\textbf{X}$, as required.

Proof of Theorem 1. Part (b) was seen in the proof of Lemma 4.

Consider the pair of measurable mappings $(\psi^\ast,f^\ast)$ from Proposition 1. Recall that $W^\ast=V^\ast$ on $\textbf{X}$ by Lemma 6. Keeping in mind Remark 2, we find from an inspection of the proof of Lemma 6 that the deterministic stationary strategy $(\varphi(x),\psi^\ast(x),t\rightarrow \delta_{f^\ast(x)}(da))\in\hat{\textbf{A}}$ in the hat DTMDP model, where $\varphi$ is defined in Part (c) of this theorem, attains the infimum in

\begin{eqnarray*}V^\ast(x)&=&\inf_{\hat{a}\in\hat{\textbf{A}}}\left\{\int_0^c \int_{\textbf{X}}V^\ast(y)\tilde{q}(dy|x,\rho_t)e^{-\int_0^t (q_x(\rho_s)-c^G(x,\rho_s))ds}dt \right.\\[3pt]&&\left.+I\{c=\infty\} e^{-\int_0^\infty q_x(\rho_s)ds} e^{\int_0^\infty c^G(x,\rho_s)ds}\right.\nonumber\\[3pt]&&\left.+I\{c<\infty\}e^{-\int_0^c (q_x(\rho_s)-c^G(x,\rho_s))ds}\int_{\textbf{X}}V^\ast(y)e^{c^I(x,b,y)}Q(dy|x,b)\right\}\nonumber\end{eqnarray*}

for each $x\in\textbf{X}$. By Theorem 3, this deterministic stationary strategy $(\varphi(x),\psi^\ast(x),t\rightarrow \delta_{f^\ast(x)}(da))\in\hat{\textbf{A}}$ is optimal for the problem (12) for the hat DTMDP model. This and Remark 1 imply that $V^\ast={\mathcal V}^\ast$ on $\textbf{X}$ and that Part (c) holds. By Lemma 6, we see now that ${\mathcal V}^\ast=W^\ast$ on $\textbf{X}$, and thus Part (a) holds.

Proof of Corollary 1. This corollary follows at once from Theorem 3, Lemma 5, and Remark 2.

6. Conclusion

In this paper we investigated the gradual-impulse control problem of CTMDPs via a rigorous and general construction, which accommodates quite a large class of control policies and is not restricted to the particular performance measure considered here. Possible future research thus includes the investigation of gradual-impulse control problems of CTMDPs with other performance measures (such as the long-run average cost in the risk-neutral as well as risk-sensitive setups).

Appendix A. Relevant results about DTMDPs

In this appendix we present the relevant facts about DTMDPs. The proofs of these statements can be found in [15] or [25]. The standard description of a DTMDP can be found in e.g. [13, 20]. The notation used in this section is independent of the previous sections.

A DTMDP has the primitives $\{\textbf{X},\textbf{A}, p, l\}$, where $\textbf{X}$ is a nonempty Borel state space, $\textbf{A}$ is a nonempty Borel action space, $p(dy|x,a)$ is a stochastic kernel on ${\mathcal B}(\textbf{X})$ given $(x,a)\in \textbf{X}\times\textbf{A}$, and l is a $[0,\infty]$-valued measurable cost function on $\textbf{X}\times\textbf{A}\times\textbf{X}.$

Condition 3. (a) The function l(x,a,y) is lower semicontinuous in $(x,a,y)\in \textbf{X}\times\textbf{A}\times\textbf{X}.$

(b) For each bounded continuous function f on $\textbf{X}$, $\int_{\textbf{X}}f(y)p(dy|x,a)$ is continuous in $(x,a)\in\textbf{X}\times \textbf{A}.$

(c) The space $\textbf{A}$ is a compact Borel space.

The DTMDP model $\{\textbf{X},\textbf{A}, p, l\}$ is called semicontinuous if it satisfies Condition 3.

Let us define $\textbf{H}_n:=\textbf{X}\times(\textbf{A}\times\textbf{X})^n$ for each $n=1,2,\dots,\infty,$ and $\textbf{H}_{0}:=\textbf{X}.$ A strategy $\sigma=(\sigma_n)_{n=0}^\infty$ in the DTMDP is given by a sequence of stochastic kernels $\sigma_n(da|h_{n})$ on ${\mathcal B}(\textbf{A})$ from $h_{n}\in \textbf{H}_{n}$ for $n=0,1,2,\dots.$ A strategy $\sigma=(\sigma_n)$ is called deterministic Markov if for each $n=0,1,2,\dots$ we have $\sigma_n(da|h_{n})=\delta_{\{\varphi_n(x_{n})\}}(da)$, where $\varphi_{n}$ is an $\textbf{A}$-valued measurable mapping on $\textbf{X}.$ We identify such a deterministic Markov strategy with $(\varphi_n).$ A deterministic Markov strategy $(\varphi_n)$ is called deterministic stationary if $\varphi_n$ does not depend on n, and it is identified with the underlying measurable mapping $\varphi$ from $\textbf{X}$ to $\textbf{A}.$ Let $\Sigma$ be the space of strategies, and let $\Sigma_{DM}$ be the space of all deterministic Markov strategies for the DTMDP.

Let the controlled and controlling process be denoted by $\{Y_n\}_{n=0}^\infty$ and $\{A_n\}_{n=0}^\infty$. Here, for each $n=0,1,\dots,$ $Y_n$ is the projection of $\textbf{H}_\infty$ to the $(2n+1)$th coordinate, and $A_n$ to the $(2n+2)$th coordinate. Under a strategy $\sigma=(\sigma_n)$ and a given initial probability distribution $\nu$ on $(\textbf{X},{\mathcal B}(\textbf{X}))$, by the Ionescu-Tulcea theorem (cf. [13, 20]), one can construct a probability measure $\mathbb{P}_\nu^\sigma$ on $(\textbf{H}_\infty,{\mathcal B}(\textbf{H}_\infty))$ such that

\begin{eqnarray*}&&\mathbb{P}_\nu^\sigma(Y_0\in dx)=\nu(dx),\\[3pt]&&\mathbb{P}_\nu^\sigma(A_n\in da|Y_0,A_0,\dots,Y_n)=\sigma_{n}(da|Y_0,A_0,\dots,Y_n), \qquad n=0,1,\dots,\\[3pt]&&\mathbb{P}_\nu^\sigma(Y_{n+1}\in dx|Y_0,A_0,\dots,Y_n,A_n)=p(dx|Y_n,A_n), \qquad n=0,1,\dots.\end{eqnarray*}

As usual, equalities involving conditional expectations and probabilities are understood in the almost sure sense.

The probability measure $\mathbb{P}_\nu^\sigma$ is called a strategic measure (of the strategy $\sigma$) in the DTMDP model $\{\textbf{X},\textbf{A}, p, l\}$ (with the initial distribution $\nu$). The expectation taken with respect to $\mathbb{P}_\nu^\sigma$ is denoted by $\mathbb{E}_\nu^\sigma.$ When $\nu$ is concentrated on the singleton $\{x\}$, $\mathbb{P}_\nu^\sigma$ and $\mathbb{E}_\nu^\sigma$ are written as $\mathbb{P}_x^\sigma$ and $\mathbb{E}_x^\sigma.$

Consider the following optimal control problem:

(21) \begin{eqnarray}\mbox{Minimize over $\sigma$}:&& \mathbb{E}_x^\sigma\left[e^{\sum_{n=0}^\infty l(Y_n,A_n,Y_{n+1})}\right]=:\textbf{V}(x,\sigma), \qquad x\in \textbf{X}.\end{eqnarray}

We denote the value function of the problem (21) by $\textbf{V}^\ast$. Then a strategy $\sigma^\ast$ is called optimal for the problem (21) if $\textbf{V}(x,\sigma^\ast)=\textbf{V}^\ast(x)$ for each $x\in \textbf{X}.$ For a constant $\epsilon>0$, a strategy $\sigma$ is called $\epsilon$-optimal for the problem (21) if $\textbf{V}(x,\sigma)\le\textbf{V}^\ast(x)+\epsilon$ for each $x\in \textbf{X}.$
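The multiplicative structure that appears in (22) below comes from the standard one-step conditioning argument (sketched here, omitting the measurability details): for any strategy $\sigma$ and $x\in\textbf{X}$,

\begin{eqnarray*}&&\mathbb{E}_x^\sigma\left[e^{\sum_{n=0}^\infty l(Y_n,A_n,Y_{n+1})}\right]\\[3pt]&=&\int_{\textbf{A}}\sigma_0(da|x)\int_{\textbf{X}}e^{l(x,a,y)}\,\mathbb{E}_x^\sigma\left[e^{\sum_{n=1}^\infty l(Y_n,A_n,Y_{n+1})}\,\Big|\,Y_0=x,A_0=a,Y_1=y\right]p(dy|x,a),\end{eqnarray*}

so the exponential utility of the remaining costs enters as a factor multiplying $e^{l(x,a,y)}$, rather than as a summand.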

Occasionally we will also consider the so-called universally measurable strategies, in which case the stochastic kernels $\sigma_n(da|h_{n})$ are universally measurable, i.e., for each measurable subset $\Gamma$ of $\textbf{A}$, $\sigma_n(\Gamma|h_n)$ is universally measurable in $h_n\in\textbf{H}_n.$ The meaning of ‘universally measurable deterministic Markov strategy’ or ‘universally measurable deterministic stationary strategy’ is understood similarly, i.e., the underlying mappings are universally measurable in their arguments. See Chapter 7.7 of [2] for the definition of universal measurability and other related measurability concepts, such as the definition of a lower semianalytic function.

We collect the relevant statements from Section 3 of [25] in the next proposition.

Proposition 4. (a) The value function $\textbf{V}^\ast$ is the minimal $[1,\infty]$-valued lower semianalytic solution to

(22) \begin{eqnarray}\textbf{V}(x)=\inf_{a\in \textbf{A}}\left\{\int_{\textbf{X}}e^{l(x,a,y)}\textbf{V}(y)p(dy|x,a)\right\}, \qquad x\in \textbf{X}.\end{eqnarray}

(b) Let $\textbf{U}$ be a $[1,\infty]$-valued lower semianalytic function on $\textbf{X}$. If

\begin{eqnarray*}\textbf{U}(x)\ge \inf_{a\in \textbf{A}}\left\{\int_{\textbf{X}}e^{l(x,a,y)}\textbf{U}(y)p(dy|x,a)\right\} \qquad \forall~x\in \textbf{X},\end{eqnarray*}

then $\textbf{U}(x)\ge \textbf{V}^\ast(x)$ for each $x\in \textbf{X}.$

(c) Let $\varphi$ be a deterministic stationary strategy for the DTMDP model $\{\textbf{X},\textbf{A},p,l\}$. If

(23) \begin{eqnarray}\textbf{V}^\ast(x)=\int_{\textbf{X}}e^{l(x,\varphi(x),y)}\textbf{V}^\ast(y)p(dy|x,\varphi(x)) \qquad \forall~x\in \textbf{X},\end{eqnarray}

then $\textbf{V}^\ast(x)=\textbf{V}(x,\varphi)$ for each $x\in \textbf{X}.$

(d) $\textbf{V}^\ast(x)=\inf_{\sigma\in \Sigma^U}\textbf{V}(x,\sigma),$ where $\Sigma^U$ is the set of universally measurable strategies. Moreover, for each $\epsilon>0$, there is some universally measurable deterministic stationary $\epsilon$-optimal strategy for the problem (21).

(e) Suppose Condition 3 is satisfied. Then the value function $\textbf{V}^\ast$ is the minimal $[1,\infty]$-valued lower semicontinuous solution to (22). Moreover, there exists a deterministic stationary strategy $\varphi$ satisfying (23), and so in particular there exists a deterministic stationary optimal strategy for the problem (21).

Part (d) of the above statement follows from the proof of Proposition 3.2 of [25], while all the other parts follow from Propositions 3.1, 3.4, and 3.7 of [25].

Acknowledgements

We thank the editors and referees for comments and remarks that significantly improved the readability of this paper. This work was supported by the Royal Society (grant number IE160503) and the Daiwa Anglo-Japanese Foundation (UK) (grant reference 4530/12801).

References

[1] Bäuerle, N. and Popp, A. (2018). Risk-sensitive stopping problems for continuous-time Markov chains. Stochastics 90, 411–431.
[2] Bertsekas, D. and Shreve, S. (1978). Stochastic Optimal Control. Academic Press, New York.
[3] Costa, O. and Davis, M. (1989). Impulsive control of piecewise-deterministic processes. Math. Control Signals Systems 2, 187–206.
[4] Costa, O. and Raymundo, C. (2000). Impulse and continuous control of piecewise deterministic Markov processes. Stochastics 70, 75–107.
[5] Costa, O. and Dufour, F. (2013). Continuous Average Control of Piecewise Deterministic Markov Processes. Springer, New York.
[6] Davis, M. (1993). Markov Models and Optimization. Chapman and Hall, London.
[7] Dufour, F. and Piunovskiy, A. (2015). Impulsive control for continuous-time Markov decision processes. Adv. Appl. Prob. 47, 106–127.
[8] Feinberg, E. (2005). On essential information in sequential decision processes. Math. Meth. Operat. Res. 62, 399–410.
[9] Feinberg, E., Mandava, M. and Shiryaev, A. (2017). Kolmogorov's equations for jump Markov processes with unbounded jump rates. To appear in Ann. Operat. Res.
[10] Forwick, L., Schäl, M. and Schmitz, M. (2004). Piecewise deterministic Markov control processes with feedback controls and unbounded costs. Acta Appl. Math. 82, 239–267.
[11] Ghosh, M. and Saha, S. (2014). Risk-sensitive control of continuous time Markov chains. Stochastics 86, 655–675.
[12] Guo, X. and Zhang, Y. (2020). On risk-sensitive piecewise deterministic Markov decision processes. Appl. Math. Optimization 81, 685–710.
[13] Hernández-Lerma, O. and Lasserre, J. (1996). Discrete-Time Markov Control Processes. Springer, New York.
[14] Hordijk, A. and van der Duyn Schouten, F. (1984). Discretization and weak convergence in Markov decision drift processes. Math. Operat. Res. 9, 121–141.
[15] Jaśkiewicz, A. (2008). A note on negative dynamic programming for risk-sensitive control. Operat. Res. Lett. 36, 531–534.
[16] Kumar, S. and Pal, C. (2013). Risk-sensitive control of pure jump process on countable space with near monotone cost. Appl. Math. Optimization 68, 311–331.
[17] Miller, A., Miller, B. and Stepanyan, K. (2018). Simultaneous impulse and continuous control of a Markov chain in continuous time. Automation Remote Control 81, 469–482.
[18] Palczewski, J. and Stettner, L. (2017). Impulse control maximising average cost per unit time: a non-uniformly ergodic case. SIAM J. Control Optimization 55, 936–960.
[19] Piunovski, A. and Khametov, V. (1985). New effective solutions of optimality equations for the controlled Markov chains with continuous parameter (the unbounded price-function). Problems Control Inf. Theory 14, 303–318.
[20] Piunovskiy, A. (1997). Optimal Control of Random Sequences in Problems with Constraints. Kluwer, Dordrecht.
[21] Van der Duyn Schouten, F. (1983). Markov Decision Processes with Continuous Time Parameter. Mathematisch Centrum, Amsterdam.
[22] Wei, Q. (2016). Continuous-time Markov decision processes with risk-sensitive finite-horizon cost criterion. Math. Meth. Operat. Res. 84, 461–487.
[23] Yushkevich, A. (1980). On reducing a jump controllable Markov model to a model with discrete time. Theory Prob. Appl. 25, 58–68.
[24] Yushkevich, A. (1988). Bellman inequalities in Markov decision deterministic drift processes. Stochastics 23, 25–77.
[25] Zhang, Y. (2017). Continuous-time Markov decision processes with exponential utility. SIAM J. Control Optimization 55, 2636–2660.
Figure 2. The realization of the state process in the hat DTMDP model corresponding to the sample path in the gradual-impulse control problem in Figure 1. The time index is discrete from $\{0,1,\dots\}$. The realizations of the components $\{(C_n,B_n)\}_{n=0}^\infty$ in the action process $\{\hat{A}_n\}_{n=0}^\infty$ are indicated above the dashed lines between consecutive states. For example, $(0,b_0)$ next to the state $(0,x_0)$ indicates that the decision-maker applies an impulse $b_0$ immediately, which results in the next state $(0,x_1).$ All the components $x_0,x_1,\dots,$ $x_0'$, $x_1''$, $x_2''$ and $b_1,b_2,b_0'',b_1'',b_2''$ are the same as in Figure 1. The only exception is $(c_3,b_3),$ which does not appear in Figure 1. Nevertheless, $c_3>\theta_2$, because in Figure 1, the first jump in the marked point process therein at the time moment $\theta_1+\theta_2=\theta_2$ is triggered by a natural jump.