
An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits

Published online by Cambridge University Press:  03 September 2019

Gabriel Zayas-Cabán*
Affiliation:
University of Wisconsin-Madison
Stefanus Jasin*
Affiliation:
University of Michigan
Guihua Wang*
Affiliation:
University of Michigan
*
*Postal address: Mechanical Engineering Building, University of Wisconsin-Madison, 1513 University Avenue, Room 3011 Madison, WI 53706-1691, USA. Email address: zayascaban@wisc.edu
**Postal address: Stephen M. Ross School of Business, University of Michigan, 701 Tappan Street, Ann Arbor, MI 48109, USA.

Abstract

We propose an asymptotically optimal heuristic, which we term randomized assignment control (RAC), for a restless multi-armed bandit problem in discrete time with finite states. It is constructed using a linear programming relaxation of the original stochastic control formulation. In contrast to most of the existing literature, we consider a finite-horizon problem with multiple actions and a time-dependent (i.e. nonstationary) upper bound on the number of bandits that can be activated at each time period; indeed, our analysis can also be applied in the setting with nonstationary transition matrix and nonstationary cost function. The asymptotic setting is obtained by letting the number of bandits and other related parameters grow to infinity. Our main contribution is that the asymptotic optimality of RAC in this general setting does not require indexability properties or the usual stability conditions of the underlying Markov chain (e.g. unichain) or fluid approximation (e.g. global stable attractor). Moreover, our multi-action setting is not restricted to the usual dominant action concept. Finally, we show that RAC is also asymptotically optimal for a dynamic population, where bandits can randomly arrive and depart the system.

Type
Original Article
Copyright
© Applied Probability Trust 2019 

1. Introduction

We present a heuristic control/policy that is asymptotically optimal for a finite-horizon restless multi-armed, multi-action bandit problem with a fixed population (i.e. the set of bandits remains constant throughout the entire horizon) and dynamic population (i.e. bandits can arrive or depart the system). A multi-armed bandit problem (MABP) involves activating competing bandits/arms sequentially over time. In its original form, a fixed number of bandits need to be activated at any given time, and the state of each bandit evolves according to a controlled stochastic process when it is activated. A solution to an MABP specifies which bandits need to be activated at each decision epoch to minimize either the expected discounted or long-run average cost associated with how the states of the bandits evolve over time. In the MABP literature, a celebrated result is the so-called Gittins index policy, introduced in Gittins and Jones (Reference Gittins and Jones1974) for a fixed population bandit problem. This policy assigns each bandit an index as a function of its current state and then activates the bandit(s) with the largest indices. When only one bandit can be activated at each period, this policy optimizes the infinite-horizon expected discounted costs and infinite-horizon long-run average cost. Whittle (Reference Whittle and Gani1988) generalized the MABP to allow nonactive bandits to also change their states (dubbed by Whittle as a changing world setting), still assuming a fixed population setting, giving rise to the restless multi-armed bandit problem (RMABP).

The RMABP is a general modeling framework encompassing applications in sequential selection of clinical trials in medicine, sensor management, manufacturing systems, queueing networks, appointment scheduling, and capacity management in healthcare. We refer interested readers to the classic text by Gittins et al. (Reference Gittins, Glazebrook and Weber2011) and to Bertsimas and Nino-Mora (Reference Bertsimas and Niño-Mora2000) (and the references therein) for a more detailed discussion of applications of RMABPs in clinical trials, manufacturing systems, and queueing networks, among others; to Washburn (Reference Washburn2008), Mahajan and Teneketzis (Reference Mahajan and Teneketzis2008), and Ahmad et al. (Reference Ahmad2009) for applications in sensor management; and to Deo et al. (Reference Deo2013), Ayer et al. (Reference Ayer2016), and Lee et al. (Reference Lee, Lavieri and Volk2018) for applications in healthcare. The optimal policy for an RMABP is rarely an index policy and it is frequently difficult to determine in any tractable manner (cf. Papadimitriou and Tsitsiklis (Reference Papadimitriou and Tsitsiklis1999)). As a result, Whittle (Reference Whittle and Gani1988) proposed to solve a relaxed version of the RMABP in which the number of activated bandits per period is no longer fixed, but has an upper bound given by the total number of bandits. He then defined an indexability property to ensure the relaxed problem has an optimal policy that is similar to a Gittins index policy, i.e. one that assigns each bandit an index and activates bandits based on their index. This policy, famously known as Whittle’s index policy, approximates the solution of the original RMABP and reduces to the Gittins index policy when nonactivated bandits do not change their states. Whittle conjectured that this index policy is asymptotically optimal when the number of bandits that can be activated per period and the size of the population of bandits grow proportionally large. This conjecture was proved in the seminal work of Weber and Weiss (Reference Weber and Weiss1990) for the case of bandits that are governed by the same probability transition matrix, as long as the differential equation corresponding to the fluid approximation of the index policy has a globally stable attractor. Weber and Weiss also showed that Whittle’s index policy can fail to be asymptotically optimal if the global attractor condition is not satisfied.

In the same asymptotic setting considered in Whittle (Reference Whittle and Gani1988) and Weber and Weiss (Reference Weber and Weiss1990), we introduce an asymptotically optimal heuristic control, called randomized assignment control (RAC), that does not require an indexability property and applies to general, finite-horizon fixed population and dynamic population RMABPs. RAC works as follows. At each time period, we check the state of each bandit. Depending on the state of the bandit under evaluation, we then randomize the action to be applied to the bandit according to a certain probability distribution, as long as we have not exceeded a prespecified budget for that period. Otherwise, we do not activate the bandit. The control parameters under RAC (i.e. the probability of applying a certain action to a bandit in a given state) are computed using the optimal solution of a linear programming (LP) relaxation of the RMABP. Although the asymptotic optimality of RAC in this paper is proved for the finite-horizon problem, our analysis shows that RAC remains asymptotically optimal when the length of the decision horizon grows at a certain rate. We explain in detail this important point in discussions at several places throughout the paper (cf. Propositions 1 and 2 in Sections 4 and 6, respectively).

In addition to providing a theoretical analysis for the asymptotic optimality of RAC, we also show numerically that RAC performs sufficiently well in most cases considered in the numerical study, even for instances when the number of bandits that can be activated at each period and the size of the population of bandits are relatively small (i.e. the nonasymptotic setting). This suggests that RAC can be used in many applications. As mentioned, we consider both fixed and dynamic population RMABPs. In the dynamic population case, we allow bandits to arrive (and leave) stochastically at the beginning of every period (cf. Verloop (Reference Verloop2016)). Again, RAC is shown to be asymptotically optimal, thereby extending the utility of RAC to applications that can be modeled as discrete-time controlled Markov processes, including server scheduling problems and flow and server control in discrete-time queueing systems, when decisions are made in batches (cf. Nain and Ross (Reference Nain and Ross1986) and Altman (Reference Altman1999, Chapters 5 and 10)).

We view our work as having the following contributions.

  1. To the best of our knowledge, we are the first to propose an asymptotically optimal heuristic control for a general finite-horizon RMABP for fixed and dynamic population models. Our LP relaxation approach provides a tractable alternative to solving a dynamic program (DP) exactly, which is usually not tractable.

  2. Our proposed heuristic control does not rely on any structural assumptions made in the existing literature. These assumptions include indexability properties, as well as assumptions on bandit dynamics, such as the global attractor property referenced above and/or assumptions regarding the recurrence structure of the underlying Markov chain—see Section 2 for a more detailed discussion.

  3. We allow for a nonstationary (or time-dependent) bandit activation budget and an arbitrary finite number of actions, without having to restrict to policies that implement a so-called dominant action, i.e. the optimal policy always chooses the same action for each activated bandit from a specific class and in a specific state. (Indeed, our analysis can also be applied to the setting with nonstationary transition matrix and nonstationary cost function.) We emphasize that the analysis of the RMABP with more than two actions has remained elusive in the literature unless additional structure is assumed, such as the dominant action. Our analysis can be easily extended to the setting with multi-class bandits (cf. Verloop (Reference Verloop2016)).

The remainder of the paper is organized as follows. In Section 2 we summarize related literature and highlight our contributions. Section 3 details the basic model with a fixed population of bandits, along with the corresponding LP relaxation of the stochastic RMABP, and some preliminary results. In Section 4 we provide the definition of RAC and analyze its performance for the fixed population model under both the total and discounted expected cost criteria. In Section 5 we provide results from simple numerical examples to test the nonasymptotic performance of RAC. In Section 6 we analyze the performance of RAC in the setting with a dynamic population of bandits. Section 7 concludes the paper.

2. Related literature

Here we only review papers that are most closely related to our study and refer interested readers to the classic text by Gittins et al. (Reference Gittins, Glazebrook and Weber2011) for a systematic and comprehensive treatment of MABPs and to the recent paper by Verloop (Reference Verloop2016) for other related references.

Existing literature on the MABP and RMABP often assumes a stationary activation budget (i.e. the maximum number of bandits that can be activated per period is independent of time) and only two possible actions per bandit (Verloop (Reference Verloop2016)). To the best of our knowledge, a nonstationary activation budget has only been considered in Cohen et al. (Reference Cohen, Zhao and Scaglione2014). The contrast between Cohen et al. (Reference Cohen, Zhao and Scaglione2014) and our work is that we allow for an arbitrary finite number of actions and states, and we focus on developing an asymptotically optimal policy instead of deriving sufficient conditions for optimality of a certain policy form.

RMABPs with multiple actions are often referred to as superprocesses and were first considered for the MABP in the setting where only one bandit can be activated per period (Whittle (Reference Whittle1980)). For this setting, it was shown that, under the condition that each state has a dominant action, there is an optimal policy that is indexable. A less strict condition is given in Gittins et al. (Reference Gittins, Glazebrook and Weber2011), but it serves the same purpose of ensuring that there is an optimal policy that is indexable. Multiple actions were also considered in Verloop (Reference Verloop2016), which generalizes the concept of dominant action to that setting. Our proposed RAC makes no such restriction to a class of dominant action policies.

Verloop (Reference Verloop2016) analyzed multi-class restless bandits with a long-run average cost criterion. A class of priority policies that do not require indexability is introduced, and their asymptotic optimality is analyzed for a fixed and a dynamic population of bandits that can arrive and depart from the system. Conditions for asymptotic optimality require that the differential equation that describes the fluid model associated with the RMABP have a globally attracting equilibrium point, similar to the conditions needed in Weber and Weiss (Reference Weber and Weiss1990). This condition guarantees that the equilibrium point induces a priority policy and that the process under this priority policy converges to the equilibrium point independent of the initial conditions. To guarantee the condition holds, the family of processes that scales to the fluid model must each have a unique invariant probability distribution with finite first moment, and the collection of these unique invariant probability distributions must be tight and uniformly integrable. For a fixed population of bandits, these conditions are satisfied when the generated Markov process is unichain (i.e. the state space for the Markov chain has a single recurrent class C and there is no other absorbing set disjoint from C), so that the resulting Markov chain has a unique equilibrium distribution. For a dynamic population of bandits, they are satisfied when the generated Markov process is irreducible and state 0 (i.e. the ‘empty’ state) is positive recurrent for any bandit that is never activated, i.e. inactivated bandits eventually leave the system.

Our dynamic population framework differs from that in Verloop (Reference Verloop2016) in four ways. First, we allow time-varying constraints on the number of active bandits per period (again, our results can also be extended to the setting with nonstationary transition matrices and nonstationary costs). Second, we consider a finite-horizon setting, so we do not require a global attractor condition. Third, we make no assumptions about the transition probability matrices or the underlying dynamics generated by the policies under consideration, such as being unichain. Fourth, we propose a new type of policy that is neither a priority policy nor an index policy, but is still asymptotically optimal under certain relatively mild conditions.

Kelly (Reference Kelly1981) considered an expected long-run average reward criterion for a finite-population bandit problem. Each arm generates a sequence of Bernoulli random variables whose parameters are themselves random variables, and are independent with common distribution satisfying a regularity condition. He showed that, as the discount factor approaches 1, the optimal policy converges to the policy that pulls the arm(s) having incurred the fewest ‘failures’. Moreover, as the discount factor approaches 1, the performance of the optimal discounted expected reward policy, as judged with respect to the expected average reward criterion, approaches the best possible.

Our work is also related to the literature that uses LP relaxation to approximate RMABPs. Bertsimas and Nino-Mora (Reference Bertsimas and Niño-Mora2000) were the first to consider a sequence of LP relaxations to obtain a primal-dual index policy for the infinite-horizon discounted-cost RMABP. Le Ny et al. (Reference Le Ny, Dahleh and Feron2008) extended the work of Bertsimas and Nino-Mora (Reference Bertsimas and Niño-Mora2000) to include switching times/costs between activating bandits. Both Bertsimas and Nino-Mora (Reference Bertsimas and Niño-Mora2000) and Le Ny et al. (Reference Le Ny, Dahleh and Feron2008) considered LP relaxations of the dynamic program formulation of infinite-horizon RMABPs. An alternative LP relaxation can be derived by considering a fluid model of the RMABP that only takes into account mean drifts of bandit dynamics (Weber and Weiss (Reference Weber and Weiss1990) and Verloop (Reference Verloop2016)). This fluid approach is also called the certainty equivalent approach in the operations research literature and is closest to the LP formulation in this paper.

Finite-horizon MABPs have been studied in Robbins (Reference Robbins1952) and Bradt et al. (Reference Bradt, Johnson and Karlin1956), who focused on the case where engaging a project corresponds to sampling from a Bernoulli population with unknown success probability and the objective is to maximize the expected number of successes over a finite number of plays. We refer interested readers to the monograph by Berry and Fristedt (Reference Berry and Fristedt1985) for additional references on finite-horizon MABPs. More recently, Caro and Gallien (Reference Caro and Gallien2007) considered a setting motivated by a dynamic assortment problem in the fashion retail industry. Nino-Mora (2011) (see also the references therein) considered a class of finite-horizon discrete-state bandit problems whose optimal policy is known to be of index type (i.e. the counterpart of the Gittins index for a finite-horizon discrete-state bandit), and proposed an efficient and exact algorithm to compute the index.

Finite-horizon RMABPs have received less attention. To the best of our knowledge, the work by Hu and Frazier (Reference Hu and Frazier2017) is the closest to the current study. Hu and Frazier (Reference Hu and Frazier2017) considered the RMABP with a finite number of independent and identically distributed bandits, each of which can be either activated or not, but over a finite horizon and with a fixed and stationary number M of pulls per period. They proposed an index policy for this model and showed that it is asymptotically optimal. Their asymptotically optimal index policy was obtained in three main steps. First, they constructed a Lagrangian relaxation of the original RMABP, which provides an upper bound to the optimal value of the RMABP, and computed Lagrange multipliers that minimize the value of the RMABP relaxation using subgradient descent. Second, they decomposed the relaxation into smaller Markov decision processes (MDPs), one for each bandit, each of which can be optimally solved using backward induction. For each smaller MDP, they obtained an ‘index’ for each time period. They then constructed an index policy for the original RMABP by activating, in each time period, the bandits whose smaller MDPs have the M largest indices. One potential issue is that there can be ties between indices, so that it may not be possible to choose exactly M indices. As a result, the third step is a tie-breaking rule that activates the remaining subprocesses among tied states according to a probability distribution induced by an optimal policy that is obtained by solving a linear program with respect to occupation measures and that depends on the Lagrange multipliers.

In addition to our population and multi-action setting, there are three other components of our approach that distinguish it from that of Hu and Frazier (Reference Hu and Frazier2017) and which make their approach not directly applicable to ours. First is the objective. We introduce a set of undesirable states, which are the states in which a bandit will incur a high penalty cost at the end of the horizon. In particular, a cost is incurred for each excess bandit that ends up in an undesirable state immediately after the end of the horizon, and we allow for at most a fixed number of bandits to be in any of the undesirable states at this time without being penalized. The dependency between the bandits introduced by this terminal cost implies that the Lagrangian relaxation and the subsequent decomposition into sub-MDPs in Hu and Frazier (Reference Hu and Frazier2017) are not directly applicable. The second difference is that we allow the number of bandits activated in each period to be upper bounded by a time-dependent constant. Because the latter constraint need not be binding, the Lagrangian relaxation of Hu and Frazier is again not directly applicable. Finally, the index policy in Hu and Frazier (Reference Hu and Frazier2017) is based on aggregating ‘indices’ from smaller MDPs obtained from the Lagrangian relaxation of the original RMABP, together with a tie-breaking rule based on an LP formulation for state occupation measures and the Lagrange multipliers, which is akin to the formulation of our stochastic control problem. By contrast, ours is based on an LP relaxation where the random variables in the original RMABP are replaced by their means.

Another potential distinction from Hu and Frazier (Reference Hu and Frazier2017) is that our analysis can be extended to include nonstationary transition probability matrices and costs and state-dependent terminal costs. It can also be generalized to the setting where we have several constraints for several subsets of bandits or one constraint on the total number of activated bandits over the entire finite-time horizon. The proposed approach would also work for the multi-action setting where each action contributes a different weight to the budget. To the best of our knowledge, we are the first to analyze finite-horizon RMABPs with a nonstationary activation budget and asymptotically optimal policies that are nonindexable.

3. Fixed population model

We now describe the setting of a fixed population model where the number of bandits in the system remains constant throughout the horizon (i.e. no new bandits arrive and no existing bandits depart); this setting will be assumed throughout Sections 3 to 5. We consider a finite-horizon, discrete-time model, where time $t \in \{1, \dots, T + 1 \}$ with $T \lt\infty$. Let $\mathbb{J} = \{j\colon 1 \le j \le J\}$ with $J \lt \infty$ denote the set of feasible states, and let $\mathbb{A} = \{a\colon 0 \le a \le A\}$ with $A \lt \infty$ denote the set of feasible actions. Our results and analysis will still apply when each state has its own set of feasible actions; so, without loss of generality, we will simply assume that all states share the same set of feasible actions $\mathbb{A}$. We also introduce a set of states $\mathbb{U} \subseteq \mathbb{J}$, called the undesirable states, which are the states in which a bandit will incur a high penalty cost at the end of the horizon. Undesirable states can represent machine failure in maintenance and replacement models, customers that did not complete service before the end of the horizon or abandoned the system before completing service in queueing scheduling applications, or deteriorated/critical health conditions in healthcare applications. We refer to action $a = 0$ as no action (or no treatment) and any action $a \gt 0$ as a proper action (or proper treatment). Moreover, using the standard terminology from the RMABP literature, we will call a bandit active when a proper action is applied to it and passive otherwise. We assume that the state of each bandit transitions from state i to state j under action a with probability $p^a_{ij}$ and that the maximum number of active bandits at time t is $b_t \gt 0$, which we call the activation budget. Two types of costs can be incurred by the system. First, a cost $c^a_j$ is incurred each time action a is applied to a bandit in state j. Second, a cost $\phi$ is incurred for each excess bandit that ends up in an undesirable state at time $T+1$; we allow for at most $m \ge 0$ bandits to be in any of the undesirable states at time $T+1$ without being penalized. Finally, we use $n_{{\rm{tot}}}$ to denote the total number of bandits in the system. For ease of exposition, we have used a time-independent transition matrix and cost function, but remark that our analysis can be applied to the setting with a time-dependent transition matrix and cost function.
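For concreteness, the primitives above can be collected in a single container. The sketch below (in Python) is illustrative only: all names and the field layout are our own choices rather than anything from the paper; indices follow the text, with $a = 0$ denoting no action.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RMABInstance:
    """Illustrative container for the fixed-population primitives of this section.

    Indices follow the paper: states j = 1..J (stored as 0..J-1), actions
    a = 0..A, where a = 0 is 'no action'.  All field names are our own.
    """
    P: np.ndarray       # shape (A+1, J, J); P[a, i, j] = p^a_{ij}
    c: np.ndarray       # shape (A+1, J);    c[a, j]    = c^a_j
    b: np.ndarray       # shape (T,);        b[t-1]     = activation budget b_t
    undesirable: list   # indices of the undesirable states U ⊆ J
    phi: float          # terminal penalty per excess bandit in U
    m: int              # number of bandits allowed in U at time T+1 without penalty
    n0: np.ndarray      # shape (J,); n0[j] = number of bandits in state j at t = 1

    def __post_init__(self):
        # Each P[a] must be a transition probability matrix (rows sum to one).
        assert np.allclose(self.P.sum(axis=2), 1.0)
        assert self.c.shape == self.P.shape[:2]

    @property
    def T(self):
        return len(self.b)

    @property
    def n_tot(self):
        return int(self.n0.sum())
```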

The decision-making scenario is as follows. At the beginning of time t, the decision maker views the state of each bandit in the system and then decides the action/treatment to be applied to each bandit. After receiving the treatment, each bandit incurs a cost and its state transitions to a potentially new state at the beginning of time $t+1$. The objective of the decision maker is to minimize the expected total treatment and penalty costs, or simply, the expected total costs.

Let $\Pi$ denote the set of all nonanticipating controls, and let $\pi$ denote a feasible control in $\Pi$. For a given control $\pi$, since all bandits share identical transition dynamics and costs, we do not need to keep track of the individual bandits; instead, it is sufficient to simply keep track of the number of bandits in state j that receive treatment a at time t, which we denote by $X^{\pi, a}_j(t)$, as well as the number of bandits in state i that receive treatment a at time t whose states transition to state j at time $t+1$, which we denote by $\smash{Y^{\pi,a}_{ij}(t)}$. Let $n_{j}$ denote the number of bandits in state j at time $t=1,$ and let $V^{S}$ denote the expected total costs under an optimal control $\pi^*$. Then

(1a)\begin{equation}V^{S} = \min_{\pi \in \Pi} \mathbb{E}^{\pi} \bigg[\sum_{t=1}^T\sum_{a=0}^A \sum_{j=1}^J c^a_j \cdot X^{\pi, a}_j(t) + \phi \cdot \bigg(\sum_{a=0}^A \sum_{j \in \mathbb{U}} X^{\pi,a}_j(T+1) - m \bigg)^+\bigg]\end{equation}

such that

(1b)\begin{equation}\sum_{a=0}^A X^{\pi, a}_j(t) = \sum_{a=0}^A\sum_{i=1}^J Y^{\pi, a}_{ij}(t-1) \quad\text{for all } j\ge 1, \, t \ge2, \end{equation}
(1c)\begin{equation}\sum_{a=0}^A X^{\pi, a}_j(1) = n_{j} \quad\text{for all } j \ge 1,\end{equation}
(1d)\begin{equation}\sum_{a=1}^A \sum_{j=1}^J X^{\pi, a}_j(t) \le b_t \quad\text{for all } t\ge 1 ,\end{equation}

where the constraints hold almost surely (i.e. with probability 1).

We remark that the seemingly simple stochastic control model (1) is in fact surprisingly general. As alluded to in Section 1, it can be used for machine maintenance and replacement problems or capacity management and/or resource allocation in healthcare, among others. Additionally, this model can be modified to include dynamic populations (i.e. ‘birth’ and ‘death’ RMABP) by including nonstationary random arrival processes, nonstationary random service capacity (e.g. nonstationary budget), random service completions, and random abandonments. Lastly, we remark that the formulation (1) includes the budget constraints explicitly, which are typically included in the definition of the class $\Pi$ of admissible controls. In what follows, and for ease of notation, whenever it is clear from the context which policy is being used, we will suppress the notational dependency on $\pi$.

In theory, the stochastic control problem (1) can be solved using the standard DP algorithm. Unfortunately, this approach suffers from the well-known curse of dimensionality. Thus, rather than solving (1) exactly, we instead construct and analyze a provably near-optimal heuristic control. To describe our heuristic control in Section 4, we need to introduce the following family of linear programs indexed by $\boldsymbol{\epsilon} \,:\!= (\epsilon_1, \epsilon_2,\dots, \epsilon_T) \geq \mathbf{0} = (0, 0, \dots, 0)$:

(2a)\begin{equation}V^D(\boldsymbol{\epsilon}) = \min_{x,z} \sum_{t = 1}^T \sum_{a=0}^A \sum_{j =1}^{J} c_{j}^a \cdot x_{j}^a(t, \boldsymbol{\epsilon}) + \phi \cdot z(\boldsymbol{\epsilon})\end{equation}

such that

(2b)\begin{equation}\sum_{a=0}^{A} x_{j}^a(t, \boldsymbol{\epsilon}) =\sum_{a=0}^A \sum_{i=1}^{J} x_{i}^a(t-1, \boldsymbol{\epsilon}) \cdot p^a_{ij}\quad\text{for all } j\ge 1 , \, t \ge 2, \end{equation}
(2c)\begin{equation}\sum_{a = 0}^A x_{j}^{a} (1, \boldsymbol{\epsilon}) = n_{j} \quad\text{for all }j \ge 1, \end{equation}
(2d)\begin{equation}\sum_{a=1}^A \sum_{j=1}^J x^a_j(t, \boldsymbol{\epsilon}) \le b_t - \epsilon_t\quad\text{for all } t \ge 1, \end{equation}
(2e)\begin{equation}z(\boldsymbol{\epsilon}) \ge \sum_{j \in \mathbb{U}} \sum_{a=0}^A \sum_{i=1}^J x^a_i(T,\boldsymbol{\epsilon}) \cdot p^a_{ij} - m, \end{equation}
(2f)\begin{equation}z(\boldsymbol{\epsilon}), \, x^a_j(t, \boldsymbol{\epsilon}) \ge 0 \quad\text{for all } a\ge 0, \, j \ge 1, \, t \ge 1. \end{equation}

One can arrive at linear program (2) by replacing all the random variables in (1) with their expected values, replacing the original activation budget $b_t$ in (1) with $b_t -\epsilon_t$, and introducing a new variable $z(\boldsymbol{\epsilon})$ to capture the number of excess bandits in undesirable states at the end of the time horizon. In the special case when $\boldsymbol{\epsilon} = \mathbf{0}$, linear program (2) is simply the deterministic relaxation of the stochastic control problem (1). Otherwise, when $\boldsymbol{\epsilon}> \mathbf{0}$, we interpret vector $\boldsymbol{\epsilon}$ as a buffer for making sure that the activation budgets are not exceeded when a heuristic control constructed using the solution of (2) is implemented in the original stochastic system.
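To illustrate, the following sketch assembles and solves linear program (2) with scipy.optimize.linprog. It is a minimal implementation under our own conventions (the variable ordering, the hypothetical RMABInstance container above, and all helper names are ours), not the authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp_relaxation(inst, eps):
    """Solve linear program (2) for a buffer vector eps of length T.

    Returns (x, z), where x[t-1, a, j] = x^a_j(t, eps) and z = z(eps).
    'inst' is any object with the fields of the hypothetical RMABInstance.
    """
    P, c, b, n0 = inst.P, inst.c, inst.b, inst.n0
    A1, J = c.shape                # A1 = A + 1 actions, including a = 0
    T = len(b)

    nx = T * A1 * J                # x-variables; the single z-variable is appended last
    def idx(t, a, j):              # flat index of x^a_j(t), t = 1..T
        return ((t - 1) * A1 + a) * J + j

    # Objective (2a): sum_{t,a,j} c^a_j x^a_j(t) + phi * z.
    cost = np.concatenate([np.tile(c.reshape(-1), T), [inst.phi]])

    A_eq, b_eq, A_ub, b_ub = [], [], [], []

    # (2c) initial counts: sum_a x^a_j(1) = n_j.
    for j in range(J):
        row = np.zeros(nx + 1)
        row[[idx(1, a, j) for a in range(A1)]] = 1.0
        A_eq.append(row); b_eq.append(n0[j])

    # (2b) flow balance: sum_a x^a_j(t) = sum_{a,i} x^a_i(t-1) p^a_{ij} for t >= 2.
    for t in range(2, T + 1):
        for j in range(J):
            row = np.zeros(nx + 1)
            for a in range(A1):
                row[idx(t, a, j)] += 1.0
                for i in range(J):
                    row[idx(t - 1, a, i)] -= P[a, i, j]
            A_eq.append(row); b_eq.append(0.0)

    # (2d) activation budget: sum_{a>=1} sum_j x^a_j(t) <= b_t - eps_t.
    for t in range(1, T + 1):
        row = np.zeros(nx + 1)
        for a in range(1, A1):
            for j in range(J):
                row[idx(t, a, j)] = 1.0
        A_ub.append(row); b_ub.append(b[t - 1] - eps[t - 1])

    # (2e) terminal excess: sum_{j in U} sum_{a,i} x^a_i(T) p^a_{ij} - z <= m.
    row = np.zeros(nx + 1)
    for j in inst.undesirable:
        for a in range(A1):
            for i in range(J):
                row[idx(T, a, i)] += P[a, i, j]
    row[-1] = -1.0
    A_ub.append(row); b_ub.append(inst.m)

    # (2f) nonnegativity is enforced through the variable bounds.
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    assert res.success
    return res.x[:nx].reshape(T, A1, J), res.x[-1]
```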

We end this section with the following result.

Lemma 1. It holds that $V^D(\mathbf{0}) \le V^S$.

Proof. Let $\pi^*$ denote an optimal policy for our original stochastic control problem (1), and let $\smash{x^a_j(t,\mathbf{0}) =\mathbb{E}[X^{\pi^* a}_j(t)]}$ for all a, j, t. By Jensen’s inequality,

\begin{equation}\mathbb{E}\bigg[\bigg(\sum_{a=0}^A \sum_{j \in U} X^{\pi^* a}_j(T+1) -m\bigg)^+\bigg] \ge \sum_{a=0}^A \sum_{j\in U} x^a_j(T+1,\mathbf{0}) - m.\end{equation}

Thus, let

\begin{align*}z(\mathbf{0}) =\begin{cases}\sum_{a=0}^A \sum_{j\in U} x^a_j(T+1,\mathbf{0}) - m & \text{if$\sum_{a=0}^A \sum_{j\in U} x^a_j(T+1,\mathbf{0}) \geq m$},\\0 & \text{otherwise.}\end{cases}\end{align*}

It follows that $\smash{x^a_j(t,\mathbf{0})}$ for all a, j, t, together with $z(\mathbf{0})$, provide a feasible solution to (2) with $\boldsymbol{\epsilon} \,:\!= (\epsilon_1, \epsilon_2, \dots, \epsilon_T) =\mathbf{0} = (0, 0, \dots, 0)$ whose total cost in (2) is no larger than the total expected cost in (1) under $\pi^*$. This yields $V^D(\mathbf{0}) \le V^S$.□

Lemma 1 allows us to use $V^D(\mathbf{0})$ as a proxy for $V^S$ and study the performance of any feasible control $\pi$ by analyzing the difference between $V^{\pi}$ and $V^D(\mathbf{0})$, as claimed.

4. Randomized activation control

We now discuss our heuristic control, which we call randomized activation control (RAC). To do this, we first need to introduce some notation. For a given $\boldsymbol{\epsilon} \geq \mathbf{0}$, let $x^{*,a}_j(t,\boldsymbol{\epsilon})$ and $z^*(\boldsymbol{\epsilon})$ denote an optimal solution of linear program (2). Define $n^*_j(t, \boldsymbol{\epsilon})$ and $q^{*,a}_j(t, \boldsymbol{\epsilon})$ as

\begin{align*}n^*_j(t, \boldsymbol{\epsilon}) = \sum_{a=0}^A x^{*,a}_j(t, \boldsymbol{\epsilon}), \qquad q^{*,a}_j(t, \boldsymbol{\epsilon})=\begin{cases}\frac{x^{*,a}_j(t, \boldsymbol{\epsilon})}{ n^*_j(t, \boldsymbol{\epsilon}) } &\text{if } n^*_j(t, \boldsymbol{\epsilon}) \gt 0,\\1\{a=0\} & \text{if } n^*_j(t, \boldsymbol{\epsilon}) = 0.\end{cases}\end{align*}

Note that, by definition, we have $\smash{n^*_j(1, \boldsymbol{\epsilon})} =n_{j}$ for all j. Note that $\smash{n^*_j(t,\boldsymbol{\epsilon})}$ is the total number of bandits in state j at time t (in the deterministic system) and $\smash{q^{*,a}_j(t,\boldsymbol{\epsilon})}$ is the fraction of bandits in state j at time t that receive action a. Let $Z_{jl}(t,\boldsymbol{\epsilon})$ be a categorical random variable on the set $\mathbb{A}$ with the property that

\begin{equation}\mathbb{P}(Z_{jl}(t, \boldsymbol{\epsilon}) = a) = q^{*,a}_j(t, \boldsymbol{\epsilon})\quad\text{for all } a \in \mathbb{A},\end{equation}

where $\ell$ denotes a bandit. Also, let $\boldsymbol{\varphi} = (\varphi_1,\varphi_2, \dots, \varphi_{n_{{\rm tot}}})$ denote a permutation (or ordering) of all of the bandits in the system, where $n_{{\rm tot}} =\smash{\sum_{j=1}^J n_{j}},$ and let $j(t, \ell)$ denote the state of bandit $\ell$ at time t before any action is applied. The description of RAC is given below.

Algorithm 1. (RAC.)

  1. Pick $\boldsymbol{\epsilon}$ and solve linear program (2).

  2. Compute $q^{*,a}_j(t,\boldsymbol{\epsilon})$ for all a, j, and t.

  3. Pick a permutation $\boldsymbol{\varphi}$.

  4. For time $t = 1$ to T, do:

    (a) set $k = 1$;

    (b) randomly generate $Z_{j(t, \varphi(k)), k}(t,\boldsymbol{\epsilon})$;

    (c) if $\smash{\sum_{m=1}^k 1\{Z_{j(t, \varphi(m)), m}(t,\boldsymbol{\epsilon}) \not= 0\}} \ge b_t$, apply $a = 0$ to bandit $\varphi(k)$; otherwise, apply action $\smash{Z_{j(t, \varphi(k)), k}(t,\boldsymbol{\epsilon})}$ to bandit $\varphi(k)$;

    (d) update $k = k + 1$;

    (e) while $k \le n_{{\rm tot}}$, go back to step (b).

The idea behind RAC is as follows. At each time t, we check the bandits in the order prescribed by $\varphi$. Suppose that we are currently checking bandit $\ell$. If its current state is j then we apply action a to $\ell$ with probability $\smash{q^{*,a}_j(t,\mathbf{\epsilon})}$ as long as we have not exceeded the allowed budget in period t; otherwise, we simply do not apply a proper treatment to $\ell$. It turns out that the order in which we assess the bandits in RAC (i.e. the permutation $\boldsymbol{\varphi}$) does not affect our analysis and results. Thus, for simplicity, we will simply use $\varphi(k) = k$ for all k. The choice of $\boldsymbol{\epsilon}$, on the other hand, significantly affects the performance of RAC. To see this, note that we can decompose the loss of RAC as

\begin{equation}0 \leq V^{{\rm RAC}} - V^{D}(\mathbf{0}) = [V^{{\rm RAC}} -V^D(\boldsymbol{\epsilon})] + [V^D(\boldsymbol{\epsilon}) - V^D(\mathbf{0})].\end{equation}

First note that the first inequality follows from Lemma 1 since $\smash{V^{{\rm RAC}} \geq V^S}$. Next, observe that if all components of $\boldsymbol{\epsilon}$ are close to 0, then $V^D(\boldsymbol{\epsilon})- V^D(\mathbf{0})$ is also close to 0. However, $V^{{\rm RAC}} -V^D(\boldsymbol{\epsilon})$ could be large. This is so because the deviations of $X^a_j(t)$ from $x^{*,a}_j(t,\boldsymbol{\epsilon})$ (due to the stochastic nature of the system) are likely to lead RAC to exhaust the activation budget at time t before those bandits that should have been activated are even considered for activation. As a result, many bandits may end up in undesirable states at the beginning of period $T+1$. On the other hand, if the components of $\boldsymbol{\epsilon}$ are large, $V^{{\rm RAC}} -V^D(\boldsymbol{\epsilon})$ will be close to 0. This is because there is a negligible probability that we will exhaust the activation budget in each period and we will be able to activate all bandits that need to be activated. However, the term $V^D(\boldsymbol{\epsilon}) - V^D(\mathbf{0})$ can be large (see Lemma 2 below). It follows that care must be taken when choosing $\boldsymbol{\epsilon}$.
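To make Algorithm 1 concrete, the sketch below simulates one realization of $C^{{\rm RAC}}$ on the original stochastic system, using the identity permutation $\varphi(k) = k$ and the LP solution from the earlier (hypothetical) solve_lp_relaxation sketch. The budget check follows the verbal description above (a proper action is applied only while the period's budget has not been exhausted), and all helper names are ours.

```python
import numpy as np

def simulate_rac(inst, x, rng=None):
    """Simulate one cost realization C^RAC of RAC (Algorithm 1).

    'inst' is the hypothetical instance container and x[t-1, a, j] = x^{*,a}_j(t, eps)
    is the LP solution.  Uses the identity permutation, which does not affect the analysis.
    """
    rng = np.random.default_rng() if rng is None else rng
    P, c, b = inst.P, inst.c, inst.b
    A1, J, T = c.shape[0], c.shape[1], len(b)

    # q[t-1, j] is the randomization distribution q^{*,a}_j(t, eps) over actions.
    q = np.zeros((T, J, A1))
    for t in range(T):
        for j in range(J):
            xj = np.clip(x[t, :, j], 0.0, None)    # clip tiny LP round-off
            nj = xj.sum()                          # n^*_j(t, eps)
            q[t, j] = xj / nj if nj > 1e-9 else np.eye(A1)[0]  # default to a = 0

    # states[l] is the current state of bandit l; start from the initial counts.
    states = np.repeat(np.arange(J), inst.n0.astype(int))
    total_cost = 0.0
    for t in range(T):
        activated = 0
        for l in range(len(states)):               # identity permutation phi(k) = k
            j = states[l]
            a = int(rng.choice(A1, p=q[t, j]))     # step 4(b): draw the randomized action
            if a != 0 and activated >= b[t]:       # budget check, cf. step 4(c)
                a = 0
            activated += int(a != 0)
            total_cost += c[a, j]
            states[l] = rng.choice(J, p=P[a, j])   # transition to the next state

    # Terminal penalty: phi per bandit in an undesirable state beyond the allowance m.
    in_U = int(np.isin(states, inst.undesirable).sum())
    return total_cost + inst.phi * max(in_U - inst.m, 0)
```

Averaging many independent realizations of this simulation gives a Monte Carlo estimate of $V^{{\rm RAC}}$.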

The following lemma provides an upper bound for $V^D(\boldsymbol{\epsilon}) -V^D(\mathbf{0})$.

Lemma 2. There exists a constant $M \gt 0$, independent of T, and a vector $\boldsymbol{\epsilon} \ge \boldsymbol{0}$ satisfying $\epsilon_t \le b_t$ for all t such that

\begin{equation}V^D(\boldsymbol{\epsilon}) - V^D(\mathbf{0}) \le M \cdot \bigg[\sum_{t=1}^T\epsilon_t\bigg].\end{equation}

Lemma 2 says that $V^D(\boldsymbol{\epsilon}) - V^D(\mathbf{0})$ is roughly proportional to $\smash{\sum_{t=1}^T \epsilon_t}$. The proof of this lemma can be found in Appendix A and utilizes a standard duality argument for LP sensitivity analysis.

Let $C^{{\rm RAC}}$ denote a realization of total costs under RAC. By definition, we have $\mathbb{E}[C^{{\rm RAC}}] = V^{{\rm RAC}}$. The following lemma tells us that $C^{{\rm RAC}} - V^D(\epsilon)$ too is roughly proportional to $\smash{\sum_{t=1}^T \epsilon_t}$, with a positive probability.

Lemma 3. For any $\boldsymbol{\epsilon} \ge \boldsymbol{0}$, we have

(3)\begin{equation}C^{{\rm RAC}} - V^D(\boldsymbol{\epsilon}) \le [\phi + \max_{a,j} c^a_j ] \cdot \bigg[\sum_{t=1}^T \epsilon_t\bigg]\label{eqn11}\end{equation}

with probability at least

(4)\begin{align}1 - 3 \cdot A \cdot J^2 \cdot \sum_{t=1}^T \exp\bigg\{-\frac{\epsilon^2_t}{12 \cdot A^2 \cdot J^4 \cdot [\sum_{\ell = 1}^J n_{\ell}] }\bigg\} - 3 \cdot A \cdot J^2 \cdot \sum_{t=1}^T \exp\bigg\{- \frac{\epsilon_t}{6\cdot A \cdot J^2 }\bigg\}. \label{eqn12}\end{align}

If the components of $\boldsymbol{\epsilon}$ are uniformly small then the bound in (3) is small and holds with a small probability. If, on the other hand, the components of $\boldsymbol{\epsilon}$ are uniformly large, the bound in (3) is large and holds with a large probability. Ideally, we would like to choose $\boldsymbol{\epsilon}$ that yields a small bound in (3) that holds with a large probability.

To better characterize the impact of the choice of $\boldsymbol{\epsilon}$ on the performance of RAC, we now turn to an asymptotic setting where $n_{j}$ (for all j), $b_t$ (for all t), and m are uniformly scaled by a factor of $\theta \gt 0$. This is the standard asymptotic setting considered in the RMABP literature (e.g. Weber and Weiss (Reference Weber and Weiss1990) and Verloop (Reference Verloop2016)). Let $\smash{V^S_{\theta}}$ and $\smash{V^{\pi}_{\theta}}$ denote the expected total costs under the optimal control and a feasible control $\pi$, respectively, when we use the scaling factor $\theta$. Also, let $\smash{V^D_{\theta}(\boldsymbol{\epsilon})}$ denote the optimal value of the corresponding linear program (2). It is not difficult to see that the optimal solution of linear program (2) for $\boldsymbol{\epsilon} =\mathbf{0}$ and $\theta \gt 0$ is given by

\begin{equation}x^{*,a}_{\theta, j}(t, \mathbf{0}) = \theta \cdot x^{*,a}_j(t,\mathbf{0}) \quad\text{and}\quad z^{*}_{\theta}(\mathbf{0}) = \theta\cdot z^{*}(\mathbf{0}),\end{equation}

which implies that $\smash{V^D_{\theta}(\mathbf{0}) = \theta \cdot V^D(\mathbf{0})}$. We also define the terms $\smash{n^*_{\theta,j}(t,\boldsymbol{\epsilon})}$ and $\smash{q^{*,a}_{\theta,j}}(t, \boldsymbol{\epsilon})$ analogously to how $\smash{n^*_j(t, \boldsymbol{\epsilon})}$ and $\smash{q^{*,a}_j(t, \boldsymbol{\epsilon})}$ are defined, with $\smash{x^{*,a}_j(t, \mathbf{0})}$ being replaced with $\smash{x^{*,a}_{\theta, j}(t, \mathbf{0})}$. Let $n_{{\rm tot}} \,:\!=\smash{\sum_{i=1}^J n_{i}},$ and define $S(t) \,:\!= \smash{\{(a,j)\colon x^{*,a}_j(t, \mathbf{0}) \gt 0 \mbox{ and } c^a_j> 0\}}$. We state our first theorem below.

Theorem 1. Let d be a positive number. Suppose that $S(t) \not= \emptyset$ for all t, and let

\begin{equation} \epsilon_t = 6 \cdot A \cdot J^2 \cdot \sqrt{d \cdot n_{{\rm tot}} \cdot \theta \cdot \ln \theta }\end{equation}

for all time periods t. For all $\theta \gt \theta^*$ where

(5)\begin{equation}\frac{\theta^*}{\ln \theta^*} = \frac{36 \cdot A^2 \cdot J^4 \cdot d\cdot n_{{\rm tot}}}{\min_t b_t},\label{eqn13}\end{equation}

there exists a constant $M \gt 0$ independent of T, d, and $\theta$ such that

\begin{equation} \frac{V^{{\rm RAC}}_{\theta} - V^S_{\theta}}{V^S_{\theta}} \le M \cdot \bigg[\frac{T}{\theta^{d}} + \sqrt{\frac{d \cdot \ln\theta}{\theta}}\bigg].\end{equation}

Proof. Fix $d \gt 0$. It is not difficult to see that condition (5) is equivalent to $\theta \cdot b_t \gt \epsilon_t$ for all t. Its role is to guarantee that the right-hand side of budget constraints in linear program (2) is positive and, therefore, the linear program has a feasible solution. Now, note that $\smash{\sqrt{d \cdot n_{{\rm tot}}\cdot \theta \cdot \ln \theta}} \ge d \cdot \ln \theta$ for all large $\theta$. Thus, by Lemma 3, under the choice of $\boldsymbol{\epsilon}$ in Theorem 1,

\begin{equation}C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon}) \, = \,O(T \cdot \sqrt{d \cdot \theta \cdot \ln \theta})\end{equation}

with probability at least

\begin{equation}1 - \Theta\bigg(\frac{T}{\theta^{3d}}\bigg) -\Theta\bigg(\frac{T}{\theta^d}\bigg) = 1 -\Theta\bigg(\frac{T}{\theta^d}\bigg).\end{equation}

Since there is a total of $\theta \cdot n_{{\rm tot}}$ bandits in each period, $\smash{C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon})}$ is at most $\Theta(T \cdot \theta \cdot n_{{\rm tot}})$ with probability at most $\Theta({T}/{\theta^d})$. We conclude that

\begin{equation}V^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon}) = O\bigg(T \cdot \sqrt{d \cdot \theta \cdot \ln \theta} + \frac{T^2}{\theta^{d-1}}\bigg).\end{equation}

By Lemma 2, under the choice of $\boldsymbol{\epsilon}$ in Theorem 1, we have

\begin{equation}V^D_{\theta}(\boldsymbol{\epsilon}) - V^D_{\theta}(\mathbf{0}) = O(T \cdot \sqrt{d \cdot \theta \cdot \ln \theta}).\end{equation}

Putting the bounds for $V^{{\rm RAC}}_{\theta} -V^D_{\theta}(\boldsymbol{\epsilon})$ and $V^D_{\theta}(\boldsymbol{\epsilon}) -V^D_{\theta}(\mathbf{0})$ together yields

\begin{equation}V^{{\rm RAC}}_{\theta} - V^D_{\theta}(\mathbf{0}) = O\bigg(T \cdot \sqrt{d \cdot \theta \cdot \ln \theta} + \frac{T^2}{\theta^{d-1}}\bigg).\end{equation}

The proof is complete by noting that $\smash{{(V^{{\rm RAC}}_{\theta} -V^S_{\theta})}/{V^S_{\theta}}} \le \smash{{(V^{{\rm RAC}}_{\theta} -V^D_{\theta}(\mathbf{0}))}/{V^D_{\theta}(\mathbf{0})}}$ and $\smash{V^D_{\theta}(\mathbf{0})}$ is $\Theta(T \cdot \theta)$ (because $S(t) \not= \emptyset$ for all t, which implies that in each period we always incur a total cost that is at least proportional to $\theta$).□

Remarks 1. (Theorem 1.) (a) The condition $S(t) \not= \emptyset$ for all t simply means that, in each period, we always activate some bandits in the deterministic system. This condition is quite mild and is immediately satisfied, for example, when $\smash{c^a_j }> 0$ for all a, j. It is also satisfied when, for each action $a, \smash{c^a_j}$ is monotone nondecreasing in j so that lower (higher) states correspond to ‘better’ (‘worse’) states. The latter condition is typically satisfied in the context of machine maintenance and replacement problems and capacity management/resource allocation problems in healthcare.

(b) The function $f(\theta) = {\theta}/{\ln \theta}$ is increasing on $[{\rm e}, \infty)$ with $\lim_{\theta \to \infty} f(\theta) =\infty$. As a result, the parameter $\theta^*$ in (5) is well defined.

(c) Since we do not scale T in our asymptotic setting, the bound in Theorem 1 tells us that RAC is asymptotically optimal as long as $d \gt 0$. However, the bound depends on T only through the term ${T}/{\theta^{d}}$. This means that we can also consider an alternative asymptotic setting where T is allowed to grow as a function of $\theta$. For example, if $T \sim \theta^n$, where n is an arbitrary natural number, then we can choose $d \gt n$ and RAC is still asymptotically optimal. Thus, while our model considers a finite-horizon setting, RAC is versatile in the sense that it can be applied to problems with a very long decision horizon.
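For reference, the Theorem 1 choice of buffer can be written down directly and passed to the hypothetical LP sketch of Section 3; the helper name and the usage lines below are ours.

```python
import numpy as np

def theorem1_buffer(A, J, n_tot, d, theta, T):
    """Buffer of Theorem 1: eps_t = 6*A*J^2*sqrt(d*n_tot*theta*ln(theta)), constant in t."""
    eps = 6.0 * A * J**2 * np.sqrt(d * n_tot * theta * np.log(theta))
    return np.full(T, eps)

# Hypothetical usage with the earlier sketch, on an instance already scaled by theta:
#   eps = theorem1_buffer(A, J, n_tot, d=1.0, theta=theta, T=T)
#   x, z = solve_lp_relaxation(scaled_inst, eps)
```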

The bound in Theorem 1 is for the expected total cost criterion, but our analysis can also be applied to the expected total discounted cost criterion. This is detailed in Proposition 1 below, which is the direct analogue of Theorem 1.

Proposition 1. Let d be a positive number. Suppose that we multiply the cost at time t with $\delta^{t-1}$ for some discount factor $\delta \in (0,1)$. Suppose that $S(t) \not= \emptyset$ for all t. Let

\begin{equation}\epsilon_t = 6 \cdot A \cdot J^2 \cdot \sqrt{d \cdot n_{{\rm tot}} \cdot \ln(t+e-1) \cdot \theta \cdot \ln \theta \,}\end{equation}

for all t (note that $\epsilon_t$ now depends on t). For all $\theta> \theta^*$ where

\begin{equation}\frac{\theta^*}{\ln \theta^*} = \frac{36 \cdot A^2 \cdot J^4 \cdot d\cdot n_{{\rm tot}} \cdot \log (T + {\rm e} - 1)}{\min_t b_t},\end{equation}

there exists a constant $M \gt 0$ independent of T and $\theta$ such that

\begin{equation}\frac{V^{{\rm RAC}}_{\theta} - V^S_{\theta}}{V^S_{\theta}} \le\frac{V^{{\rm RAC}}_{\theta} -V^D_{\theta}(\mathbf{0})}{V^D_{\theta}(\mathbf{0})} \le M \cdot \bigg[\frac{1}{\theta^d} + \sqrt{\frac{d \cdot \ln\theta}{\theta}}\bigg].\end{equation}

Proof. The proof is akin to the proof of Theorem 1. Here, we will only provide its outline. Note that, when we multiply the cost at time t with $\delta^{t-1}$ for some discount factor $\delta \in (0,1)$, the bound in Lemma 2 becomes

\begin{equation}V^D(\boldsymbol{\epsilon}) - V^D(\mathbf{0}) \le M \cdot \bigg[\sum_{t=1}^T\delta^{t-1} \cdot \epsilon_t\bigg]\end{equation}

for some $M \gt 0$ independent of T and $\boldsymbol{\epsilon} \ge \mathbf{0}$. The bound can be shown using exactly the same arguments as used in the proof of Lemma 2. See Remark 4 in Appendix A.1. Similarly, the bound in (3) becomes

\begin{equation}C^{{\rm RAC}} - V^D(\boldsymbol{\epsilon}) \le \Big[\phi + \max_{a,j} \, c^a_j\Big] \cdot \bigg[\sum_{t=1}^T \delta^{t-1} \cdot \epsilon_t\bigg],\end{equation}

which can be shown using exactly the same arguments as used in the proof of Lemma 3. See Appendix A.2. Next, observe that, akin to the proof of Theorem 1, our choice of $\epsilon_t$ implies that

\begin{equation}C^{{\rm RAC}}_{\theta} -V^D_{\theta}(\boldsymbol{\epsilon}) =O\bigg(\sum_{t=1}^T\delta^{t-1} \cdot \sqrt{d \cdot \ln(t+{\rm e}-1) \cdot \theta \cdot \ln \theta}\bigg) = O(\sqrt{d \cdot \theta \cdot \ln\theta})\end{equation}

with probability at least

\begin{equation}1 - \Theta\bigg(\sum_{t=1}^T \frac{1}{\theta^{3d \cdot \ln(t + {\rm e} -1)}}\bigg) - \Theta\bigg(\sum_{t=1}^T \frac{1}{\theta^{d \cdot \ln(t+{\rm e}-1)}}\bigg) = 1 - \Theta\bigg(\sum_{t=1}^T \frac{1}{(t+{\rm e}-1)^{d\cdot \ln \theta}}\bigg).\end{equation}

Additionally, $\smash{C^{{\rm RAC}}_{\theta} -V^D_{\theta}(\boldsymbol{\epsilon})}$ is at most

\begin{equation}\Theta \bigg(\sum_{t=1}^T \delta^{t-1} \cdot \theta \cdot n_{{\rm tot}}\bigg) = \Theta(\theta \cdot n_{{\rm tot}})\end{equation}

with probability at most $\smash{\Theta(\sum_{t=1}^T {1}/{(t+{\rm e}-1)^{d\cdot \ln \theta}})}.$ We conclude that

\begin{equation}V^{{\rm RAC}}_{\theta} - V^D_{\theta}(\mathbf{0}) = O\bigg(\sqrt{d \cdot \theta \cdot \ln \theta} + \sum_{t=1}^T \frac{\theta}{(t+{\rm e}-1)^{d \cdot \ln \theta}}\bigg).\end{equation}

Since $\smash{V^D_{\theta}(\mathbf{0})}$ is $\smash{\Theta(\sum_{t=1}^T\delta^{t-1} \cdot \theta)} = \Theta(\theta)$ (by our assumption that $S(t) \not= \emptyset$ for all t) and

\begin{equation}\sum_{t=1}^T \frac{1}{(t+{\rm e}-1)^{d \cdot \ln \theta}} \le \frac{1}{{\rm e}^{d\cdot \ln \theta}} + \int_1^{\infty} \frac{{\rm d}x}{(x+{\rm e}-1)^{d \cdot \ln\theta}} \le \frac{2}{{\rm e}^{d \cdot \ln \theta}} = \frac{2}{\theta^{d}}\end{equation}

for large $\theta$, we have the desired bound.□

Remark 2. (Discounted cost.) Akin to Theorem 1, in order for the bound in Proposition 1 to hold, the condition $\theta \gt \theta^*$ is equivalent to $\theta \cdot b_t \ge \epsilon_t$ holding for all t. This is important for making sure that the right-hand side of linear program (2) is positive. Given our choice of $\epsilon_t$, this condition is immediately satisfied for all t and all large $\theta$ so long as $T = o(\exp \{{\theta}/{d \cdot \ln \theta} \})$. Thus, not only is RAC asymptotically optimal for discounted expected cost criteria, its relative loss is also independent of T for very large T.

5. Numerical experiments

5.1 RAC performance

In this section, we test the performance of RAC using two experiments. In the first experiment, we consider an instance of the RMABP with two states and two actions; in the second experiment we consider an instance of the RMABP with five states and five actions. In each experiment, we use $\epsilon_t = \smash{\sqrt{\theta \cdot \log \theta}}$ for all t and a wide range of $\theta$. For both experiments, we randomly generated the cost coefficients as uniformly distributed U(0,10) random variables. We also generated the transition probabilities by first generating each entry as a uniformly distributed U(0,1) random variable and then normalizing each row to obtain the transition probability matrix P. We used the same budget $b_t=1$ for all t for both experiments. The remaining details regarding all other parameters can be found in the online supplement Zayas-Cabán et al. (2019). In what follows, we report the percentage losses of RAC (i.e. $\smash{{(V^{{\rm{RAC}}}_{\theta} - V^D_{\theta}(\mathbf{0}))}/{V^D_{\theta}(\mathbf{0})}}$) and their corresponding confidence intervals out of 1000 Monte Carlo runs for the two experiments in Tables 1 and 2, respectively. Additional simulation scenarios and their corresponding results can be found in the online supplement Zayas-Cabán et al. (2019).

Table 1: Percentage loss for two states and two actions.

Table 2: Percentage loss for five states and five actions.
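For readers who wish to reproduce the flavor of these experiments, a random instance of the kind described above could be generated as in the sketch below. The exact parameters of our experiments are in the online supplement; the choices of $\mathbb{U}$, $\phi$, m, and the initial counts below are placeholders of our own, and we read the budget $b_t = 1$ as being scaled by $\theta$ together with the other parameters.

```python
import numpy as np

def random_instance(J, A, T, theta, seed=0):
    """Random test instance in the spirit of Section 5.1 (illustrative values only).

    Costs are U(0, 10); each transition row is a vector of U(0, 1) draws,
    normalized to sum to one; the budget is b_t = 1 for all t, scaled by theta.
    """
    rng = np.random.default_rng(seed)
    P = rng.uniform(0.0, 1.0, size=(A + 1, J, J))
    P /= P.sum(axis=2, keepdims=True)        # row-normalize each transition matrix
    c = rng.uniform(0.0, 10.0, size=(A + 1, J))
    return RMABInstance(                     # hypothetical container from Section 3
        P=P, c=c,
        b=np.full(T, 1 * theta),             # b_t = 1, scaled by theta
        undesirable=[J - 1],                 # placeholder choice of U
        phi=10.0, m=0,                       # placeholder penalty and allowance
        n0=np.full(J, theta),                # placeholder initial counts (theta per state)
    )

# Buffer used in these experiments: eps_t = sqrt(theta * log(theta)) for all t.
def experiment_buffer(T, theta):
    return np.full(T, np.sqrt(theta * np.log(theta)))
```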

As predicted by Theorems 1 and 2, RAC performs better as $\theta$ grows large. This confirms the asymptotic optimality of RAC. The performance of RAC for small $\theta$ depends on the problem parameters (see the online supplement Zayas-Cabán et al. (2019)). However, in most cases, the percentage loss of RAC is only a single digit number starting from $\theta = 10$, which suggests the robustness of RAC even for problems with relatively small $\theta$. Finally, we remark that while the expected percentage loss should be positive, it is possible to have negative percentage loss in simulations. This is due to sampling error in estimating the average costs using Monte Carlo simulations, and most of the negative percentage losses are very close to 0.

5.2 Comparing RAC with an index policy

In this section, we compare the performance of the RAC with a strict priority rule (or index policy). In the two examples below, we consider two states and two actions (i.e. treatment/active or no treatment/inactive) and assume that there is a budget of 1 (i.e. a maximum of one active bandit) in each time period. Our priority rule works as follows: we use our heuristic to calculate the probability of activating an arm in each state and then activate the arm with the highest probability directly. Note that, since we are activating an arm in each period, we will use the entire budget in each period. We then compare the costs for this priority rule with our heuristic. The results are summarized in Tables 3 and 4. Note that RAC outperforms the specific priority rule. Moreover, it does so without always using the full budget in every period. (We report in Table 5 the percentage budget used in RAC. That is, if the budget in each period is b and the number of periods under consideration is T, then the total budget allowed is $b\times T$. If the total budget actually used during the T time periods is B, then the percentage budget is calculated as ${B}/{(b\times T)}$.) This is perhaps not too surprising.

Table 3: Percentage loss for two states and one action example using RAC.

Table 4: Percentage loss for two states and one action using priority rule.

Table 5: Percentage budget used for two states and one action.

Since a treatment could be costly, it may not be optimal to always use up the entire budget; it is also important to take into account the trade-off between the cost of a treatment and its benefit.
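A minimal sketch of this priority rule (our own naming, two-action case only): in each period, states are ranked by their LP-based activation probability $1 - q^{*,0}_j(t,\boldsymbol{\epsilon})$ and the full budget is spent on the highest-ranked bandits.

```python
import numpy as np

def priority_rule_actions(q_t, states, budget):
    """One period of the strict priority rule of Section 5.2 (two actions: 0 and 1).

    q_t[j, a] = q^{*,a}_j(t, eps) from the LP; 'states' lists each bandit's state.
    The bandits whose states have the largest activation probability 1 - q_t[j, 0]
    are activated until the budget is used up, so the full budget is always spent.
    """
    activation_prob = 1.0 - q_t[:, 0]                          # per-state activation probability
    order = np.argsort(-activation_prob[np.asarray(states)])   # most 'deserving' bandits first
    actions = np.zeros(len(states), dtype=int)
    actions[order[:budget]] = 1
    return actions
```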

6. Dynamic population model

In this section, we allow new bandits to arrive in each time period, and we also consider random departures (e.g. random service completions or random abandonments). The same notation as before will be used, with the addition that we also use $\Lambda_{jt}$ to denote the (random) number of bandits arriving in state j at time t, where $\Lambda_{jt}$ is a Poisson random variable with mean $\lambda_{jt} \gt 0$. The stochastic control formulation of our problem can now be written as

\begin{gather} V^S = \min_{\pi \in \Pi} \mathbb{E}\bigg[\sum_{t=1}^T \sum_{a=0}^A\sum_{j=1}^J c^a_j \cdot X^{\pi, a}_j(t) + \phi \cdot \bigg(\sum_{a=0}^A\sum_{j \in \mathbb{U}} X^{\pi, a}_j(T+1) - m \bigg)^+\bigg]\\[2pt]\text{such that} \sum_{a=0}^A X^{\pi, a}_j(t) = \sum_{a=0}^A\sum_{i=1}^J Y^{\pi, a}_{ij}(t-1) + \Lambda_{jt} \quad\text{for all }j\ge 1, \, t \ge 2,\\[2pt]\sum_{a=0}^A X^{\pi, a}_j(1) = n_{j} + \Lambda_{j1} \quad\text{for all }j \ge 1,\\[2pt]\sum_{a=1}^A \sum_{j=1}^J X^{\pi, a}_j(t) \le b_t \quad\text{for all } t\ge 1.\end{gather}

The corresponding deterministic formulation is given by

(6a)\begin{equation}V^D(\mathbf{\epsilon}) = \min_{x, z} \sum_{t = 1}^T \sum_{a=0}^A \sum_{j= 1}^{J} c_{j}^a \cdot x_{j}^a(t, \boldsymbol{\epsilon}) + \phi \cdot z(\boldsymbol{\epsilon})\end{equation}

such that

(6b)\begin{equation}\sum_{a=0}^{A} x_{j}^a(t, \boldsymbol{\epsilon}) =\sum_{a=0}^A \sum_{i=1}^{J} x_{i}^a(t-1, \boldsymbol{\epsilon}) \cdot p^a_{ij} \,+ \, \lambda_{jt} \quad\text{for all } j\ge 1 , \, t \ge 2,\end{equation}
(6c)\begin{equation}\sum_{a = 0}^A x_{j}^{a} (1, \boldsymbol{\epsilon}) = n_{j} + \lambda_{j1}\quad\text{for all } j \ge 1,\end{equation}
(6d)\begin{equation}\sum_{a=1}^A \sum_{j=1}^J x^a_j(t, \boldsymbol{\epsilon}) \le b_t - \epsilon_t\quad\text{for all } t \ge 1,\end{equation}
(6e)\begin{equation}z(\boldsymbol{\epsilon}) \, \ge \, \sum_{j \in \mathbb{U}} \sum_{a=0}^A \sum_{i=1}^Jx^a_i(T, \boldsymbol{\epsilon}) \cdot p^a_{ij} - m,\end{equation}
(6f)\begin{equation}z(\boldsymbol{\epsilon}), \, x^a_j(t, \boldsymbol{\epsilon}) \ge 0 \quad\text{for all }\, a \ge 0, \, j \ge 1, \, t \ge 1.\end{equation}

Both Lemmas 1 and 2 still hold in the new setting (we omit the details), and the definition of RAC is still the same as that presented in Section 4, with one difference: since the number of bandits in the system potentially changes at every time period, the permutation $\boldsymbol{\varphi}$, which provides the order in which we assess the state of existing bandits in step 3 in the original definition of RAC, must now be selected at the beginning of every time period. That said, as noted in Section 4, the exact choice of $\boldsymbol{\varphi}$ does not affect our asymptotic analysis and thus we can choose an arbitrary permutation $\boldsymbol{\varphi}$ that randomizes the order of existing bandits at each time period. The following result is the analogue of Lemma 3 and the proof can be found in Appendix B.
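Before turning to that result, we note that in a simulation the only change this modification requires on top of the earlier RAC sketch is a per-period step that merges newly arrived bandits into the population and redraws the assessment order. A hedged version of that step (with lam_t[j] standing for $\lambda_{jt}$ and all names of our own choosing) is given below.

```python
import numpy as np

def add_arrivals_and_reshuffle(states, lam_t, theta, rng):
    """Per-period update for the dynamic-population version of RAC (Section 6).

    New bandits arrive in state j as Poisson(theta * lambda_{jt}) draws and are
    merged with the existing population; the permutation used to assess the
    bandits is then redrawn, as required in this setting (use theta = 1 for the
    unscaled system).
    """
    J = len(lam_t)
    arrivals = rng.poisson(theta * np.asarray(lam_t))   # one arrival count per state j
    new_states = np.repeat(np.arange(J), arrivals)      # states of the arriving bandits
    states = np.concatenate([states, new_states])
    rng.shuffle(states)                                 # arbitrary new permutation each period
    return states
```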

Lemma 4. For any $\boldsymbol{\epsilon} \ge \boldsymbol{0}$, we have

(7)\begin{equation}C^{{\rm RAC}} - V^D(\mathbf{\epsilon}) \le \Big[\phi + \max_{a,j} c^a_j\Big] \cdot \bigg[\sum_{t=1}^T \epsilon_t\bigg]\label{eqn20}\end{equation}

with probability at least

(8)\begin{align}&1 - 3 \cdot A \cdot J^2 \cdot \sum_{t=1}^T \exp\bigg\{\!-\frac{\epsilon^2_t}{48 \cdot A^2 \cdot J^4 \cdot [\sum_{\ell = 1}^Jn_{\ell}] }\bigg\}- 3 \cdot A \cdot J^2 \cdot \sum_{t=1}^T \exp\bigg\{\!-\frac{\epsilon_t}{12 \cdot A \cdot J^2 }\bigg\} \nonumber\\&- 3 \cdot A \cdot J^2 \cdot \sum_{t=1}^T\exp\bigg\{-\frac{\epsilon_t^2}{64 \cdot A^2 \cdot J^4 \cdot[\sum_{s=1}^t \sum_{i=1}^J \lambda_{is}]} \bigg\}.\label{eqn21}\end{align}

The last summation in (8) is due to the arrivals of new bandits at each time period. If $\lambda_{jt} = 0$ for all t and j then the last summation equals 0 and the bound in (8) is almost identical to the bound in (4), with the exception that the numbers 12 and 6 that appear in the denominator in the original bound have now been replaced with 48 and 12 in the denominator above. Let $\lambda_{{\rm tot}} \,:\!= \max_t \smash{\sum_{j=1}^J\lambda_{jt}}$. We consider the same asymptotic setting as in Section 4, where we also scale the arrival rate $\lambda_{jt}$ by $\theta$. Let $n_{{\rm tot}}$ and S(t) be as defined earlier. The following theorem is the analogue of Theorem 1.

Theorem 2. Let d be a positive number. Suppose that $S(t) \not= \emptyset$ for all t, and let

\begin{equation} \epsilon_t = 12 \cdot A \cdot J^2 \cdot \sqrt{t \cdot d \cdot \max\{n_{{\rm tot}}, \, \lambda_{{\rm tot}}\} \cdot \theta \cdot \ln\theta \,}\end{equation}

for all t. For all $\theta \gt \theta^*$ where

(9)\begin{equation}\frac{\theta^*}{\ln \theta^*} = 144 \cdot A^2 \cdot J^4 \cdot d \cdot \max\{n_{{\rm tot}}, \lambda_{{\rm tot}}\} \cdot \Big[\max_t\frac{t}{b_t} \Big],\label{eqn22}\end{equation}

there exists a constant $M \gt 0$ independent of T and $\theta$ such that

(10)\begin{equation}\frac{V^{{\rm RAC}}_{\theta} - V^S_{\theta}}{V^S_{\theta}} \le M \cdot \bigg[\frac{T^{3/2}}{\theta^{d/2}} + \sqrt{\frac{d \cdot T \cdot \ln\theta}{\theta}}\bigg].\label{eqn23}\end{equation}

Proof. The proof is similar to that of Theorem 1. First, condition (9) is equivalent to $\theta \cdot b_t \gt \epsilon_t$ for all t. This guarantees that the right-hand side of the budget constraints in linear program (6) is positive and, therefore, that the linear program has a feasible solution.

Let E denote the event where (7) is satisfied. By Lemma 4 and our choice of $\boldsymbol{\epsilon}$, $\mathbb{P}(E)$ is at least $1 - \Theta({T}/{\theta^d})$. Now, we consider the event $E^c$. Unlike in the proof of Theorem 1 where we can simply bound $\smash{C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon})}$ with a number that is of order $\Theta(T \cdot \theta \cdot n_{{\rm tot}})$, we now need to take into account new bandits that arrive at each time period. At time t, we have at most $\theta\cdot n_{{\rm tot}} +\smash{\sum_{s=1}^t \sum_{j=1}^J \Lambda^{\theta}_{js}}$ bandits present in the system, where $\Lambda^{\theta}_{js}$ is a Poisson random variable with rate $\theta \cdot \lambda_{js}$. As a result, we can bound:

(11)\begin{align}&\mathbb{E}[(C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon})) \cdot \mathbf{1}\{E^c\}]\nonumber\\& \le \Big[\phi + \max_{a,j} c^a_j\Big] \cdot \mathbb{E}\bigg[\bigg( T\cdot \theta \cdot n_{{\rm tot}} + \sum_{t=1}^T t \cdot \bigg(\sum_{j=1}^J \Lambda^{\theta}_{j,t}\bigg)\bigg) \cdot \mathbf{1}\{E^c\}\bigg].\label{eqn24}\end{align}

Note that $\smash{\sum_{t=1}^T t \cdot (\sum_{j=1}^J\Lambda^{\theta}_{jt})}$ is stochastically dominated by the random variable $X \sim $ Pois$(T^2 \cdot \theta \cdot \lambda_{{\rm tot}})$. By the Cauchy–Schwarz inequality, we can further bound (11) as

(12)\begin{align}\mathbb{E}[(C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon})) \cdot \mathbf{1}\{E^c\}] &\le \Big[\phi + \max_{a,j} c^a_j\Big] \cdot \mathbb{E}[(T \cdot \theta \cdot n_{{\rm tot}} + X)^2]^{1/2} \cdot \mathbb{P}(E^c)^{1/2} \nonumber\\& = O\bigg(\frac{T^{5/2}}{\theta^{d/2 - 1}}\bigg)\label{eqn25}\end{align}

where the last equality follows since

\begin{equation}\mathbb{P}(E^c)^{1/2} = O\bigg(\frac{T^{1/2}}{\theta^{d/2}}\bigg)\quad\mbox{and}\quad \mathbb{E}[X^2]^{1/2} = O(T^2 \cdot \theta \cdot \lambda_{{\rm tot}}),\end{equation}

which implies that $\mathbb{E}[(T \cdot \theta \cdot n_{{\rm tot}} +X)^2]^{1/2} = O(T^2 \cdot \theta)$.

Putting (12) together with

\begin{equation}\mathbb{E}[(C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon})) \cdot \mathbf{1}\{E\}] = O(T^{3/2} \cdot \sqrt{d \cdot \theta \cdot \log\theta})\end{equation}

(by Lemma 4 and our choice of $\boldsymbol{\epsilon}$) and the fact that $\smash{V^D_{\theta}(\mathbf{0})} = \Omega(T \cdot \theta)$ (because $S(t) \not= \emptyset$ for all t) yields the desired result. This completes the proof of Theorem 2.□

Remark 3. (Theorem 2.) Unlike the choice of $\epsilon_t$ in Theorem 1, our choice of $\epsilon_t$ in Theorem 2 scales with $\sqrt{t}$. This is crucial for dealing with additional randomness (due to random arrivals of bandits) in each time period, which requires a more conservative buffer. This scaling results in a weaker bound for RAC in Theorem 2 compared to that in Theorem 1. However, we want to emphasize that the bound in (10) holds for very general settings and does not exploit any special structure that might arise in specific instances of this model (e.g. control of queueing systems). For example, if bandits are leaving the system at a geometric rate then it is possible to derive a much tighter bound for the relative loss of RAC. We discuss this in more detail below.

The following proposition is similar to Proposition 1 in Section 4.

Proposition 2. Let d be a positive number. Suppose that we multiply the cost at time t with $\delta^{t-1}$ for some discount factor $\delta \in (0,1)$. Suppose that $S(t) \not= \emptyset$ for all t. Let

\begin{equation}\epsilon_t = 12 \cdot A \cdot J^2 \cdot \sqrt{t \cdot \log(t + {\rm e} - 1)\cdot d \cdot \max\{n_{{\rm tot}}, \lambda_{{\rm tot}}\} \cdot \theta\cdot \ln \theta }\end{equation}

for all t (note that $\epsilon_t$ now depends on t). For all $\theta> \theta^*$ where

\begin{equation}\frac{\theta^*}{\ln \theta^*} = 144 \cdot A^2 \cdot J^4 \cdot d \cdot \max\{n_{{\rm tot}}, \lambda_{{\rm tot}}\} \cdot \log(T+{\rm e}-1) \cdot \Big[\max_t \frac{t}{b_t} \Big],\end{equation}

there exists a constant $M \gt 0$ independent of T and $\theta$ such that

\begin{equation}\frac{V^{{\rm RAC}}_{\theta} - V^S_{\theta}}{V^S_{\theta}} \le\frac{V^{{\rm RAC}}_{\theta} - V^D_{\theta}(\mathbf{0})}{V^D_{\theta}(\mathbf{0})} \le M \cdot \bigg[\frac{1}{\theta^{d/2}} +\sqrt{\frac{d \cdot \ln \theta}{\theta}}\bigg].\end{equation}

The proof of Proposition 2 is very similar to the proof of Theorem 2 and, therefore, is omitted. Unlike in Remark 2, where the bound in Proposition 1 holds for $T= o(\exp \{{\theta}/{(d \cdot \ln \theta)} \})$, the bound in Proposition 2 holds so long as

\begin{equation}T \cdot \log T = o\bigg(\frac{\theta}{\log \theta}\bigg).\end{equation}

In what follows, we discuss a special case of the dynamic population model in which bandits can arrive and leave randomly over time.

6.1. Bandit departure and abandonment

Suppose now that each bandit waits to receive a proper treatment ($a \ge1$) and immediately leaves the system once it receives a proper treatment (i.e. a service completion). Moreover, suppose that, if a bandit has not received a proper treatment up until the end of the current time period, it stays in the system for the next period with probability $\alpha\in (0,1)$ and leaves (i.e. abandons) the system with probability $1 - \alpha$. These features can be included in the current modeling framework by introducing an extended state variable that includes the type of treatment a bandit received in the previous time period and a tracking variable that indicates whether the bandit is still in the system at the current time period. For example, we can let $j_t = (j'_t,a_{t-1}, \ell_t)$ be our extended state variable, where $j'_t$ is the current state/condition of the bandit, $a_{t-1}$ is the treatment received in the previous time period, and $\ell_t \in \{0, 1\}$ is an indicator of the bandit's presence in the system, with $\ell_t = 0$ denoting that the bandit has left the system and $\ell_t = 1$ denoting that it remains in the system at the beginning of period t. The one-step transition probabilities for this model now satisfy the following conditions:

(13)\begin{align}p^{a''}_{(j', a', 0), (j'', a'', 1)} = 0 \quad\text{for all } j', a', j'', a'', \label{eqn27}\end{align}
(14)\begin{align}p^{a''}_{(j',0,1), (j'', a'',1)} = 0 \quad\text{for all } j', j'', a''\ge 1, \label{eqn28}\end{align}
(15)\begin{align}\sum_{j''} p^0_{(j', 0, 1), \, (j'', 0, 1)} = \alpha \quad\text{for all }j'.\label{eqn29}\end{align}

Condition (13) simply says that a bandit that leaves the system never returns; condition (14) says that a bandit immediately leaves the system after receiving a proper treatment; and condition (15) says that the probability a bandit stays in the system for the next time period given that it has not received a proper treatment is $\alpha$.

For simplicity, suppose that, at the end of the horizon, we only get penalized for the number of bandits who leave the system without being properly treated (i.e. abandonments). This requires keeping track of the number of abandonments at each time period. To do this, we set $\smash{c^a_{(j',0,0)}} = M$ for all $j'$ and $a \ge 1$, where M is a sufficiently large number, and ${c^0_{(j',0,0)}} = 0$ for all $j'$. Since applying a proper action to a nonexisting bandit is very costly, this setup implies that, once a bandit leaves the system without being properly treated, the last two components of its state remain (0,0) throughout the horizon. As a result, the total number of abandonments throughout the horizon simply equals $\smash{\sum_{a=0}^A \sum_{j'}X^{a}_{(j', 0, 0)}}(T+1)$ and its corresponding total penalty cost is

\begin{equation}\phi \cdot \bigg(\sum_{a=0}^A \sum_{j'} X^{a}_{(j', 0, 0)}(T+1) -m\bigg)^+,\end{equation}

where $m \ge 0$ can be interpreted as the maximum number of acceptable abandonments.
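As an illustration of this extended-state construction, the sketch below builds the transition law on states $(j', a_{t-1}, \ell_t)$ so that conditions (13)-(15) hold. The base transition matrix, the number of base states, and the helper function are made up; departed states are treated as absorbing, which is consistent with the cost assignment above.

from itertools import product

J = 2                        # base states j' in {0, 1} (hypothetical)
ACTIONS = [0, 1]             # 0 = no treatment, a >= 1 = proper treatment
alpha = 0.7                  # per-period stay probability for untreated bandits
p0 = {0: {0: 0.8, 1: 0.2},   # base transition matrix under action 0 (made up)
      1: {0: 0.3, 1: 0.7}}

def extended_transitions():
    """Return P[(state, action)] = {next_state: prob} on states (j, a_prev, present)."""
    P = {}
    for (j, a_prev, present), a in product(product(range(J), ACTIONS, (0, 1)), ACTIONS):
        s = (j, a_prev, present)
        if present == 0:
            # condition (13): a bandit that has left the system never returns (absorbing)
            P[(s, a)] = {s: 1.0}
        elif a >= 1:
            # condition (14): a proper treatment triggers an immediate departure
            P[(s, a)] = {(j, a, 0): 1.0}
        else:
            # condition (15): an untreated bandit stays with probability alpha and
            # abandons (moves to presence 0 with previous action 0) otherwise
            nxt = {(k, 0, 1): alpha * p0[j][k] for k in range(J)}
            nxt[(j, 0, 0)] = 1.0 - alpha
            P[(s, a)] = nxt
    return P

P = extended_transitions()
assert all(abs(sum(d.values()) - 1.0) < 1e-12 for d in P.values())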

The setting with service completions and abandonments discussed above is a special case of the more general setting considered in Theorem 2. As a result, the bound in (10) still holds in this setting. In fact, as previously noted, this bound can be significantly tightened by exploiting the special structure of the problem. That is, we have the following result.

Proposition 3. Let d be a positive number. Consider the setting with bandit departure and abandonment described above. Suppose that $S(t) \not= \emptyset$ for all t. If we let

\begin{equation}\epsilon_t = \frac{12 \cdot A \cdot J^2}{\min_t b_t} \cdot \sqrt{\bigg(\sum_{s=0}^{t-1} \alpha^s \bigg) \cdot d \cdot \max\{n_{{\rm{tot}}}, \lambda_{{\rm tot}}\} \cdot \theta \cdot \ln \theta }\end{equation}

for all t, then

\begin{equation}\frac{V^{{\rm RAC}}_{\theta} - V^S_{\theta}}{V^S_{\theta}} =O\bigg(\frac{1}{1 - \alpha} \cdot \frac{T^{1/2}}{\theta^{d/2}} +\sqrt{\frac{d \cdot \ln \theta}{(1 - \alpha) \cdot \theta}}\bigg).\end{equation}

Proof. The proof of Proposition 3 is very similar to the proof of Theorem 2. Here, we will only provide its outline. Note that we can strengthen the bound in Lemma 4 by replacing the term $\smash{\sum_{s=1}^t \sum_{j=1}^J \lambda_{js}}$ in the third summation in (8) with $\smash{\sum_{s=1}^t \alpha^{t-s}\cdot [\sum_{j=1}^J \lambda_{js}]}$ (see Remark 5 in Appendix B.1). Let E denote the event where (7) is satisfied. By our choice of $\boldsymbol{\epsilon}$, we have

\begin{equation}\mathbb{E}[(C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon})) \cdot \mathbf{1}\{E\}] = O\bigg(T \cdot \sqrt{\frac{d \cdot \theta \cdot \ln\theta}{1-\alpha}} \bigg).\end{equation}

As for $\smash{\mathbb{E}[(C^{{\rm RAC}}_{\theta} -V^D_{\theta}(\boldsymbol{\epsilon})) \cdot \mathbf{1}\{E^c\}]}$, since each bandit stays in the system with probability at most $\alpha$, we can upper bound (in the stochastic dominance sense) the number of initial bandits that are still present in the system at time t by Binomial $(\theta \cdot n_{{\rm tot}}, \alpha^{t-1})$. Similarly, we can also upper bound the number of bandits that arrived in the system at time s in state j and are still present in the system at time t by Poisson $(\theta \cdot \lambda_{js} \cdot \alpha^{t-s})$. Since $\mathbb{P}(E^c)= O({T}/{\theta^d})$, by similar arguments as in the proof of Theorem 2 (using the Cauchy–Schwarz inequality), we have

\begin{equation}\mathbb{E}[(C^{{\rm RAC}}_{\theta} - V^D_{\theta}(\boldsymbol{\epsilon})) \cdot \mathbf{1}\{E^c\}] = O\bigg(\frac{T \cdot \theta}{1 - \alpha}\bigg) \cdot O\bigg(\frac{T^{1/2}}{\theta^{d/2}}\bigg).\end{equation}

The proof is complete by noting that $\smash{V^D_{\theta}(\mathbf{0})} =\Omega(T \cdot \theta)$ (because $S(t) \not= \emptyset$ for all t).

Note that, if $T \sim \theta^n$, where n is a natural number, then we can choose $d \gt n$ and RAC is asymptotically optimal. This illustrates that RAC performs well for a large decision horizon T with Poisson arrivals and geometrically distributed waiting times.
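To see this, substituting $T \sim \theta^n$ into the bound of Proposition 3 gives

\begin{equation}\frac{1}{1 - \alpha} \cdot \frac{T^{1/2}}{\theta^{d/2}} \sim \frac{\theta^{(n-d)/2}}{1 - \alpha} \to 0 \quad\mbox{and}\quad \sqrt{\frac{d \cdot \ln \theta}{(1 - \alpha) \cdot \theta}} \to 0 \quad\text{as } \theta \to \infty,\end{equation}

so the relative loss of RAC indeed vanishes whenever $d \gt n$.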

7. Closing remarks

In this paper, we consider discrete-time, finite-horizon, RMABPs for fixed and dynamic populations of bandits. To our knowledge, we are the first to simultaneously allow for multiple actions and a nonstationary number of active bandits. We proposed a heuristic control, RAC, that is based on an LP relaxation of the original stochastic control formulation and showed that it is asymptotically optimal. Similar to Verloop (2016), our heuristic does not depend on any indexability properties. In contrast to Verloop (2016), it does not require assumptions on the underlying structure of the Markov process generated by each policy or assumptions on the dynamics of the associated deterministic approximation.

As mentioned, we can include nonstationary transition probability matrices and nonstationary costs. We can also generalize to the setting where there are several constraints for several subsets of bandits, or a single constraint on the total number of activated bandits over the entire finite horizon. Since we consider an asymptotic regime where the initial budgets and the number of initial bandits are proportionally scaled, it can be shown that these constraints (whether one or several) are violated only with a small probability. Also, having state-dependent terminal costs will not change the asymptotic order. Finally, the proposed approach also works for the multi-action setting in which each action contributes a different weight to the budget: once the linear program is solved, we can show that the budget constraints are violated only with a very small probability, regardless of whether the actions contribute equal or different weights to the budget.

There are several avenues for further research. The approach in Hu and Frazier (2017) does not directly translate to our setting, but our approach may be applicable to their model since we make no indexability assumptions. One topic for further exploration is to test our heuristic using the data and the specific examples considered in their study. Developing and analyzing alternative policies is of clear interest, and a numerical comparison of such policies with RAC would also be worthwhile. One might further consider the same model but allow states to be observed only when a treatment is applied. This would extend the model considered by Deo et al. (2013), which is motivated by capacity management in healthcare. Such a model requires keeping track of additional information (i.e. the time between interventions) and, hence, a larger state space. In this case, RAC may no longer be asymptotically optimal and, more broadly, the model would present significant technical challenges for the analysis and computation of good control policies. Another consideration is the possibility of allowing more general (i.e. nonlinear) cost structures and constraints. For instance, to capture the fact that, in many settings (e.g. EMS response, humanitarian logistics), the number of available servers is random, the budget $b_t$ may be assumed to be a random variable for each t. Notwithstanding these challenges, RMABPs are a very useful modeling framework that can be applied to a broad range of problems, and the technical challenges they pose for optimal control make them a promising direction for future research.

The supplementary material for this article can be found at https://doi.org/10.1017/apr.2019.29.

Appendix A. Proofs of the results in Section 4

A.1. Proof of Lemma 2

Note that linear program (2) can be equivalently written as

\begin{gather}V^D(\boldsymbol{\epsilon}) = \min_{x, z} \sum_{t = 1}^T \sum_{a=0}^A \sum_{j =1}^{J} c_{j}^a \cdot x_{j}^a(t, \boldsymbol{\epsilon}) + \phi \cdot z(\boldsymbol{\epsilon})\\\text{such that} \sum_{a=0}^A \sum_{i=1}^{J} x_{i}^a(t,\boldsymbol{\epsilon}) \cdot p^a_{ij} = n_j(t+1, \boldsymbol{\epsilon}) \quad\text{for \ all } j\ge 1 , \, t \ge 1 ,\\\sum_{a = 0}^A x_{j}^{a} (t, \boldsymbol{\epsilon}) = n_{j}(t,\boldsymbol{\epsilon})\quad\text{for all } j \ge 1, \, t \ge 2,\\\sum_{a = 0}^A x_{j}^{a} (1, \boldsymbol{\epsilon}) = n_{j} \quad\text{for all }j \ge 1,\\\sum_{a=1}^A \sum_{j=1}^J x^a_j(t, \boldsymbol{\epsilon}) \le b_t - \epsilon_t\quad\text{for all } t \ge 1,\\z(\boldsymbol{\epsilon}) \, \ge \, \sum_{j \in \mathbb{U}} \sum_{a=0}^A \sum_{i=1}^Jx^a_i(T, \boldsymbol{\epsilon}) \cdot p^a_{ij} - m,\\z(\boldsymbol{\epsilon}), \, x^a_j(t, \boldsymbol{\epsilon}), \, n_j(t, \boldsymbol{\epsilon})\ge 0 \quad\text{for all } a \ge 0, \, j \ge 1, \, t \ge 1.\end{gather}

Let $\boldsymbol{\mu}^*(\boldsymbol{\epsilon}) = \smash{(\mu^*_1(\boldsymbol{\epsilon}),\mu^*_2(\boldsymbol{\epsilon}), \dots, \mu^*_T(\boldsymbol{\epsilon}))}$ denote the optimal dual solution corresponding to constraints $\smash{\sum_{a=1}^A\sum_{j=1}^J x^a_j(t, \boldsymbol{\epsilon}) \le b_t} - \epsilon_t$ for all $t\ge 1$. By the standard duality argument (see, for example, Schrijver (2000, Section 10.4, Equation (24))), we can bound

\begin{equation}V^D(\boldsymbol{\epsilon}) - V^D(\boldsymbol{0}) \le \bigg| \sum_{t=1}^T \mu^*_t(\boldsymbol{0})\cdot \epsilon_t \bigg|.\end{equation}

Since, by definition, $\mu^*_t(\boldsymbol{0})$ is trivially independent of $\boldsymbol{\epsilon}$, we simply need to show that $\mu^*_t(\boldsymbol{0})$ is independent of T (i.e. there exists a constant $M \gt 0$ independent of T such that $\mu^*_t(\boldsymbol{0}) \le M$ for all t). We do this by first writing the optimal value of linear program (2) as a sum of the optimal values of T separable linear programs and then arguing that the optimal dual solution for linear program (2) is also optimal for these linear programs, and vice versa. Specifically,

\begin{equation}V^D(\boldsymbol{\epsilon}) = \sum_{t=1}^T V^D_t(\boldsymbol{\epsilon}),\end{equation}

where $V^D_t(\boldsymbol{\epsilon})$ is defined as

(16a)\begin{equation}V^D_t(\boldsymbol{\epsilon}) = \min_{x, z} \sum_{a=0}^A \sum_{j = 1}^{J} c_{j}^a\cdot x_{j}^a(t, \boldsymbol{\epsilon})\end{equation}

such that

(16b)\begin{equation}\sum_{a=0}^A \sum_{i=1}^{J} x_{i}^a(t,\boldsymbol{\epsilon}) \cdot p^a_{ij} = n^*_j(t+1, \boldsymbol{\epsilon}) \quad\text{for \ all } j\ge 1,\end{equation}
(16c)\begin{equation}\sum_{a = 0}^A x_{j}^{a} (t, \boldsymbol{\epsilon}) = n^*_{j}(t,\boldsymbol{\epsilon})\quad\text{for all } j \ge 1,\end{equation}
(16d)\begin{equation}\sum_{a=1}^A \sum_{j=1}^J x^a_j(t, \boldsymbol{\epsilon}) \le b_t -\epsilon_t,\end{equation}
(16e)\begin{equation}x^a_j(t, \boldsymbol{\epsilon}) \ge 0 \quad\text{for all } a \ge 0, \, j \ge1,\end{equation}

for $t \le T - 1$ and

(17a)\begin{equation}V^D_T(\boldsymbol{\epsilon}) = \min_{x, z} \sum_{a=0}^A \sum_{j = 1}^{J} c_{j}^a\cdot x_{j}^a(T, \boldsymbol{\epsilon}) + \phi \cdot z(\boldsymbol{\epsilon})\end{equation}

such that

(17b)\begin{equation}\sum_{a=0}^A \sum_{i=1}^{J} x_{i}^a(T,\boldsymbol{\epsilon}) \cdot p^a_{ij} = n^*_j(T+1, \boldsymbol{\epsilon}) \quad\text{for \ all } j\ge 1,\end{equation}
(17c)\begin{equation}\sum_{a = 0}^A x_{j}^{a} (T, \boldsymbol{\epsilon}) = n^*_{j}(T,\boldsymbol{\epsilon})\quad\text{for all } j \ge 1,\end{equation}
(17d)\begin{equation}\sum_{a=1}^A \sum_{j=1}^J x^a_j(T, \boldsymbol{\epsilon}) \le b_T -\epsilon_T,\end{equation}
(17e)\begin{equation}z(\boldsymbol{\epsilon}) \, \ge \, \sum_{j \in \mathbb{U}} \sum_{a=0}^A \sum_{i=1}^Jx^a_i(T, \boldsymbol{\epsilon}) \cdot p^a_{ij} - m,\end{equation}
(17f)\begin{equation}z(\boldsymbol{\epsilon}), \, x^a_j(T, \boldsymbol{\epsilon}) \ge 0 \quad\text{for all } a\ge 0, \, j \ge 1.\end{equation}

Observe that the union of all the constraints in linear programs (16) (for all $t \le T-1$) and (17) constitutes all the constraints in linear program (2), with the minor difference that we replace the variable $n_j(t,\boldsymbol{\epsilon})$ with $n^*_j(t,\boldsymbol{\epsilon})$. Since $n_j(t,\boldsymbol{\epsilon}) =n^*_j(t,\boldsymbol{\epsilon})$ is optimal for linear program (2), it is not difficult to see that the optimal solution for linear program (2) is also optimal for linear programs (16) (for all $t\le T-1$) and (17), and vice versa. Moreover, the optimal dual solution corresponding to each constraint in linear programs (16) (for $t \le T-1$) and (17) is also optimal for the corresponding constraint in linear program (2) (this can be easily checked by considering the Karush–Kuhn–Tucker (KKT) conditions at the optimal solution; that is, by checking that complementary slackness holds). This observation has an important consequence: we can indirectly compute $\mu^*_t(\boldsymbol{0})$ by calculating the optimal dual solution corresponding to the constraint $\smash{\sum_{a=1}^A \sum_{j=1}^J x^a_j(t, \boldsymbol{0}) \le b_t}$ in either linear program (16) (for $t \le T-1$) or (17), with $\boldsymbol{\epsilon} = \boldsymbol{0}$. The proof is complete by noting that the optimal dual solutions for linear programs (16) (for $t \le T-1$) and (17) cannot possibly scale up with T (see Remark 4 below); hence, they can be uniformly bounded by a constant that is independent of T.
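For illustration, the per-period linear program (16) can be solved numerically and the dual variable of its budget constraint read off directly, which is one way to compute $\mu^*_t(\boldsymbol{0})$ in practice. The sketch below uses made-up data for a single period; it assumes a SciPy version (1.7 or later, with the HiGHS methods) that exposes dual values through res.ineqlin.marginals, and the sign convention used there is also an assumption of this sketch.

import numpy as np
from scipy.optimize import linprog

J, A = 2, 1
c = np.array([[1.0, 2.0], [3.0, 4.0]])                 # costs c_j^a, indexed [j, a]
p = np.stack([np.array([[0.8, 0.2], [0.3, 0.7]]),      # P^0
              np.array([[0.5, 0.5], [0.9, 0.1]])])     # P^1
n_t = np.array([6.0, 4.0])                             # stand-in for n*_j(t, 0)
n_next = n_t @ ((p[0] + p[1]) / 2)                     # stand-in for n*_j(t+1, 0)
b_t = 5.0                                              # budget b_t

idx = lambda j, a: j * (A + 1) + a
nx = J * (A + 1)
cost = np.array([c[j, a] for j in range(J) for a in range(A + 1)])

A_eq, b_eq = [], []
for j in range(J):                                     # (16b): outgoing mass matches n*(t+1)
    row = np.zeros(nx)
    for i in range(J):
        for a in range(A + 1):
            row[idx(i, a)] += p[a, i, j]
    A_eq.append(row); b_eq.append(n_next[j])
for j in range(J):                                     # (16c): mass in state j matches n*(t)
    row = np.zeros(nx)
    row[[idx(j, a) for a in range(A + 1)]] = 1.0
    A_eq.append(row); b_eq.append(n_t[j])

A_ub = [np.array([1.0 if a >= 1 else 0.0 for j in range(J) for a in range(A + 1)])]
b_ub = [b_t]                                           # (16d): budget constraint

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
# HiGHS marginals of '<=' constraints are nonpositive in a minimization; negating gives mu*_t >= 0
mu_t = -res.ineqlin.marginals[0]
print("V^D_t =", res.fun, " mu*_t =", mu_t)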

Remark 4. For any generic linear program $\max\{c'x \colon Ax \le b\}$, the optimal dual solution $y^*$ must satisfy $A'y^* = c'$, so $||y^*||_1 \le n \cdot \Delta \cdot ||c||_1,$ where n is the dimension of x and $\Delta$ is an upper bound of the absolute values of the components of $B^{-1}$ for all invertible submatrices B of A (see, for example, Schrijver (2000, Section 10.4)). Applied to our setting, the values of the corresponding terms $n \cdot \Delta \cdot ||c||_1$ in linear programs (16) and (17) obviously do not depend on T. So, $\mu^*_t(\boldsymbol{0})$ is independent of T. Under the discounted cost setting discussed in Remark 3, the term $||c||_1$ will now be multiplied by $\delta^{t-1}$. So, there exists a constant $M \gt 0$ such that $\mu^*_t(\boldsymbol{0}) \le M \cdot \delta^{t-1}$ for all t.

A.2. Proof of Lemma 3

To prove Lemma 3, it is more convenient to work with a modified version of RAC defined below instead of the original RAC.

Algorithm 2. (Modified randomized assignment control (MRAC).)

1. Pick $\boldsymbol{\epsilon}$ and solve linear program (2).

2. Compute $q^{*,a}_j(t,\boldsymbol{\epsilon})$ for all a, j, and t.

3. Pick a permutation $\boldsymbol{\varphi}$.

4. For time $t = 1$ to T, do:

  (a) set $k = 1$;

  (b) randomly generate $Z_{j(t, \varphi(k)), k}(t,\boldsymbol{\epsilon})$;

  (c) apply action $Z_{j(t, \varphi(k)), k}(t,\boldsymbol{\epsilon})$ to bandit $\varphi(k)$;

  (d) update $k = k + 1$;

  (e) while $k \le n_{{\rm tot}}$, go back to step (b).
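The following is a minimal simulation sketch of Algorithm 2 for a hypothetical instance; the transition matrices, the activation probabilities standing in for $q^{*,a}_j(t,\boldsymbol{\epsilon})$, and the initial states are all made up, and, as in MRAC, no budget check is performed.

import numpy as np

rng = np.random.default_rng(1)
J, A_plus_1, T = 3, 2, 5
p = rng.dirichlet(np.ones(J), size=(A_plus_1, J))          # p[a, j, :] = row j of P^a
q_star = rng.dirichlet(np.ones(A_plus_1), size=(T, J))     # q_star[t, j, :] = action distribution
states = rng.integers(J, size=50)                          # initial states of n_tot = 50 bandits

for t in range(T):
    perm = rng.permutation(len(states))                    # step 3: an arbitrary permutation
    actions = np.empty(len(states), dtype=int)
    for k in perm:                                          # steps (a)-(e), with no budget check
        actions[k] = rng.choice(A_plus_1, p=q_star[t, states[k]])   # step (b): draw Z from q*
    # all bandits transition at the end of the period according to the action they received
    states = np.array([rng.choice(J, p=p[actions[k], states[k]]) for k in range(len(states))])
    print(f"t={t+1}: activated {np.count_nonzero(actions)} of {len(states)} bandits")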

Note that MRAC proceeds in the same manner as RAC, with the exception that it ignores the budget constraint in the original step 4c (i.e. MRAC continues activating bandits regardless of whether the given budget $b_t$ has been exhausted). Let $\smash{\tilde{X}^a_j(t, \boldsymbol{\epsilon})}$ denote the number of bandits in state j being applied action a at period t under MRAC, and let $\smash{\tilde{N}_j(t, \boldsymbol{\epsilon})}$ denote the number of bandits in state j at the beginning of period t under MRAC. Define four events $\smash{\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})}$, $\smash{\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon})}$, $\smash{\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})}$, and $\smash{\tilde{\mathcal{A}}(\boldsymbol{\epsilon})}$ as follows:

\begin{gather}\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon}) = \bigg\{\tilde{X}^a_j(t,\boldsymbol{\epsilon}) - x^{*a}_j(t, \boldsymbol{\epsilon}) \le \frac{\epsilon_t}{2A J}\quad\text{for all } a \ge 1, \, j, \, t \bigg\},\\\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon}) = \bigg\{\tilde{X}^0_j(t,\boldsymbol{\epsilon}) - x^{*0}_j(t, \boldsymbol{\epsilon}) \le \frac{\epsilon_t}{2J}\quad\text{for all } j, \, t \bigg\},\\\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon}) = \bigg\{\tilde{N}_j(T+1, \epsilon)- n^*_j(T+1, \boldsymbol{\epsilon}) \le \frac{\epsilon_T}{|\mathbb{U}|} \quad\text{for \ all } j \in \mathbb{U} \bigg\},\\\tilde{\mathcal{A}}(\boldsymbol{\epsilon}) = \tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})\cap \tilde{\mathcal{A}}_2(\boldsymbol{\epsilon}) \cap\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon}).\end{gather}

Note that the conditions in $\smash{\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})}$ collectively imply that $\smash{\sum_{a=1}^A \sum_{j=1}^J \tilde{X}^a_j(t, \boldsymbol{\epsilon})} \le b_t$ for all t; thus, under the same sample path realizations, MRAC is equivalent to RAC on $\smash{\tilde{\mathcal{A}}(\boldsymbol{\epsilon})}$. Moreover, on $\smash{\tilde{\mathcal{A}}(\boldsymbol{\epsilon})}$, we also have

\begin{equation}C^{\rm{MRAC}} - V^D(\boldsymbol{\epsilon}) \le \Big[\max_{a,j} c^a_j \Big] \cdot \bigg[\sum_{t=1}^T \epsilon_t\bigg] + \phi \cdot \epsilon_T \le \Big[\phi+ \max_{a,j} c^a_j \Big] \cdot \bigg[\sum_{t=1}^T \epsilon_t\bigg],\end{equation}

where the first inequality holds because $(a - x)^+ - (b - x)^+ \le (a -b)^+$ for all a, b, x. Since MRAC is equivalent to RAC on $\smash{\tilde{\mathcal{A}}(\boldsymbol{\epsilon})}$, to prove Lemma 3, we simply need to show that the lower bound stated in Lemma 3 holds for $\smash{\mathbb{P}(\tilde{\mathcal{A}}(\boldsymbol{\epsilon}))}$ under MRAC.

Note that we can bound

(18)\begin{equation}\mathbb{P}(\tilde{\mathcal{A}}(\boldsymbol{\epsilon})) \ge 1 -\mathbb{P}(\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})^c) -\mathbb{P}(\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon})^c) -\mathbb{P}(\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})^c).\label{eqn41}\end{equation}

We will now compute an upper bound for each $\smash{\mathbb{P}(\tilde{\mathcal{A}}_i(\boldsymbol{\epsilon})^c),}\,i = 1, 2, 3$. We start with $\smash{\mathbb{P}(\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})^c)}$. By the subadditive property of probability,

(19)\begin{equation}\mathbb{P}(\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})^c) \le \sum_{t=1}^T \sum_{a=1}^A\sum_{j=1}^J \mathbb{P}\bigg(\tilde{X}^a_j(t, \boldsymbol{\epsilon}) - x^{*a}_j(t,\mathbf{\epsilon}) \gt \frac{\epsilon_t}{2A J}\bigg).\label{eqn42}\end{equation}

We make an important observation: For all a and j, the random variable $\smash{\tilde{X}^a_j(t, \boldsymbol{\epsilon})}$ can be written as a sum of J independent binomial random variables. Specifically,

(20)\begin{equation}\tilde{X}^a_j(t, \boldsymbol{\epsilon}) \sim \sum_{i=1}^J \mbox{Bin}(n_{i},v_{iaj}(t, \boldsymbol{\epsilon})),\label{eqn43}\end{equation}

where $v_{iaj}(t, \boldsymbol{\epsilon})$ is the probability that, under MRAC, a bandit that starts in state i at the beginning of period 1 ends up in state j at the beginning of period t and is then applied action a in period t. (It is possible to give an explicit expression of $v_{iaj}(t, \boldsymbol{\epsilon})$ in terms of the transition probabilities $\{\,p^a_{ij}\}$ and the activation probabilities $\smash{\{q^{*a}_j(t,\boldsymbol{\epsilon})\}}$; but, this is not necessary for our purpose.) Note that $x^{*a}_j(t, \boldsymbol{\epsilon}) = \smash{\sum_{i=1}^J n_{i}} \cdot v_{iaj}(t, \boldsymbol{\epsilon})$. Let $\smash{\tilde{S}_{aj}(t, \boldsymbol{\epsilon})}= \{i\colon v_{iaj}(t, \boldsymbol{\epsilon}) \gt 0\}$. Using our observation in (20), we can further bound $\smash{\mathbb{P}(\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})^c)}$ as

\begin{align}&\mathbb{P}(\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})^c) \label{eqn44}\end{align}
\begin{align}&\le \sum_{t=1}^T \sum_{\substack{(a,j)\colon \tilde{S}_{aj}(t,\boldsymbol{\epsilon}) \not= \emptyset\\ i \in \tilde{S}_{aj}(t, \boldsymbol{\epsilon})}}\mathbb{P} \bigg(\mbox{Bin}(n_{i}, v_{iaj}(t, \boldsymbol{\epsilon})) - n_i \cdot v_{iaj}(t, \boldsymbol{\epsilon}) \, \gt \, \frac{\epsilon_t}{2 \cdot|\tilde{S}_{aj}(t, \boldsymbol{\epsilon})| \cdot A \cdot J}\bigg)\nonumber\end{align}
(21)\begin{align}& \le \sum_{t=1}^T \sum_{\substack{(a,j)\colon \tilde{S}_{aj}(t,\boldsymbol{\epsilon}) \not= \emptyset \\ i \in \tilde{S}_{aj}(t,\boldsymbol{\epsilon})}} \exp\bigg\{- \frac{\epsilon^2_t}{12 \cdot|\tilde{S}_{aj}(t, \boldsymbol{\epsilon})|^2 \cdot A^2 \cdot J^2 \cdot n_{i}\cdot v_{iaj}(t, \boldsymbol{\epsilon})}\bigg\} \nonumber\\& + \sum_{t=1}^T \sum_{\substack{(a,j)\colon \tilde{S}_{aj}(t,\boldsymbol{\epsilon}) \not= \emptyset \\ i \in \tilde{S}_{aj}(t,\boldsymbol{\epsilon})}} \exp\bigg\{- \frac{\epsilon_t}{6 \cdot|\tilde{S}_{aj}(t, \boldsymbol{\epsilon})| \cdot A \cdot J }\bigg\} \nonumber\\& \le A \cdot J^2 \cdot \sum_{t=1}^T \exp\bigg\{-\frac{\epsilon^2_t}{12 \cdot A^2 \cdot J^4 \cdot [\sum_{\ell = 1}^Jn_{\ell}] }\bigg\} \nonumber\\& + A \cdot J^2 \cdot \sum_{t=1}^T \exp\bigg\{-\frac{\epsilon_t}{6 \cdot A \cdot J^2 }\bigg\}.\label{eqn45}\end{align}

The first inequality in (21) follows since $\smash{\tilde{S}_{a,j}(t, \boldsymbol{\epsilon})} = \emptyset$ implies that

\begin{equation}\mathbb{P}\bigg(\tilde{X}^a_j(t, \boldsymbol{\epsilon}) - x^{*a}_j(t, \boldsymbol{\epsilon})> \frac{\epsilon_t}{2A J}\bigg) = 0.\end{equation}

The second inequality in (21) follows by application of the Chernoff bound for the binomial random variable: specifically, if $X \sim\mbox{Bin}(n,p)$ then

\begin{gather}\mathbb{P}(X - np \gt \delta) \le \exp \bigg\{- \frac{\delta^2}{3 np} \bigg\}\quad\text{for all }\delta \in [0, np),\\\mathbb{P}(X - np \gt \delta) \le \exp \bigg\{- \frac{\delta}{3} \bigg\}\quad\text{for all }\delta \ge np,\end{gather}

which implies that

\begin{equation}\mathbb{P}(X - np \gt \delta) \le \exp \bigg\{- \frac{\delta^2}{3 np} \bigg\} +\exp\bigg\{- \frac{\delta}{3} \bigg\} \quad\text{for all } \delta \ge 0.\end{equation}
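As a quick numerical sanity check of this combined tail bound (not part of the proof), one can compare it against exact binomial tail probabilities; a small sketch using scipy.stats with made-up parameters:

from math import exp
from scipy.stats import binom

n, p = 1000, 0.3
for delta in [5, 20, 50, 100, 200]:
    tail = binom.sf(n * p + delta, n, p)                  # P(X - np > delta)
    bound = exp(-delta ** 2 / (3 * n * p)) + exp(-delta / 3)
    assert tail <= bound
    print(f"delta={delta:4d}  tail={tail:.3e}  bound={bound:.3e}")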

The last inequality in (21) follows since $\smash{|\tilde{S}_{a,j}(t, \boldsymbol{\epsilon})|} \le J$ and $n_{i} \cdot v_{iaj}(t, \mathbf{\epsilon}) \le \smash{\sum_{\ell = 1}^J n_{\ell}}$.

Similarly, we can also bound $\mathbb{P}(\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon})^c)$ as

(22)\begin{equation}\mathbb{P}(\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon})^c) \le \, J^2 \cdot \sum_{t=1}^T\bigg[\exp\bigg\{- \frac{\epsilon^2_t}{12 \cdot J^4 \cdot [\sum_{\ell =1}^J n_{\ell}] }\bigg\} + \exp\bigg\{- \frac{\epsilon_t}{6 \cdot J^2}\bigg\} \bigg].\label{eqn46}\end{equation}

As for $\smash{\mathbb{P}(\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})^c)}$, note that $\smash{\tilde{N}_j(T+1, \boldsymbol{\epsilon})}$ can also be written as a sum of J independent binomial random variables. Specifically,

\begin{equation}\tilde{N}_j(T+1, \boldsymbol{\epsilon}) \sim \sum_{i=1}^J \mbox{Bin}(n_{i},r_{ij}(t, \boldsymbol{\epsilon})),\end{equation}

where $r_{ij}(t, \boldsymbol{\epsilon})$ is the probability that, under MRAC, a bandit that starts with state i at the beginning of period 1 ends up with state j at the beginning of period $T+1$. Applying the Chernoff bound to $\mathbb{P}(\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})^c) $, together with the fact that $|\mathbb{U}| \le J$, yields

(23)\begin{equation}\mathbb{P}(\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})^c) \le J^2 \cdot \bigg[\exp\bigg\{- \frac{\epsilon^2_T}{3 \cdot J^4 \cdot [\sum_{\ell =1}^J n_{\ell}] }\bigg\} + \exp\bigg\{- \frac{\epsilon_T}{3 \cdot J^2}\bigg\} \bigg].\label{eqn47}\end{equation}

Putting all the bounds in (21), (22), and (23) back into (18), and noting that the bound in (21) is larger than the bounds in (22) and (23), completes the proof.
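For completeness, the probabilities $v_{iaj}(t, \boldsymbol{\epsilon})$ (and similarly $r_{ij}$), and hence the means $x^{*a}_j(t, \boldsymbol{\epsilon}) = \sum_{i} n_i \cdot v_{iaj}(t, \boldsymbol{\epsilon})$ of the binomial sums in (20), can be computed by a simple forward recursion under MRAC. The sketch below is purely illustrative: the instance, the activation probabilities standing in for $q^{*a}_j(t,\boldsymbol{\epsilon})$, and the function name are all made up.

import numpy as np

J, A_plus_1, T = 2, 2, 4
rng = np.random.default_rng(0)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],          # P^0 (action 0)
              [[0.5, 0.5], [0.6, 0.4]]])         # P^1 (action 1)
q = rng.dirichlet(np.ones(A_plus_1), size=(T, J))        # q[t, j, a], sums to 1 over a

def v_probabilities(p, q, T):
    """v[t, i, a, j] = P(a bandit starting in state i is in state j at (0-based) period t
    and is assigned action a there), i.e. v_{iaj}(t+1) in the paper's 1-based indexing."""
    n_actions, n_states = p.shape[0], p.shape[1]
    v = np.zeros((T, n_states, n_actions, n_states))
    w = np.eye(n_states)                         # w[i, j] = P(in state j this period | start i)
    for t in range(T):
        v[t] = w[:, None, :] * q[t].T[None, :, :]        # multiply occupancy by q[t, j, a]
        # one-period forward propagation: w'(i, k) = sum_{j, a} w(i, j) q(t, j, a) p(a, j, k)
        w = np.einsum('ij,ja,ajk->ik', w, q[t], p)
    return v

v = v_probabilities(p, q, T)
n = np.array([30, 70])                           # initial counts n_i
x_star = np.einsum('i,tiaj->taj', n, v)          # x*_j^a(t) = sum_i n_i v_{iaj}(t)
print(x_star[2])                                 # expected counts x*_j^a at the third period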

Appendix B. Proofs of the results in Section 6

B.1. Proof of Lemma 4

The proof is similar to the proof of Lemma 3 (unless otherwise noted, all notation used here has the same meaning as in the proof of Lemma 3). The difference lies in computing a bound for $\smash{\mathbb{P}(\tilde{\mathcal{A}}_i(\boldsymbol{\epsilon})^c)}$ for $i = 1, 2,3$. Note that $\smash{\tilde{X}^a_j(t, \boldsymbol{\epsilon})}$ is now the sum of J independent binomial random variables and a Poisson random variable, i.e.

\begin{equation} \tilde{X}^a_j(t, \boldsymbol{\epsilon}) \sim \sum_{i=1}^J \mbox{Bin}(n_{i},v_{iaj}(t, \boldsymbol{\epsilon})) + \mbox{Pois}\bigg(\sum_{s=1}^t \sum_{i=1}^J\lambda_{is} \cdot \tilde{v}_{iaj}(s,t, \boldsymbol{\epsilon})\bigg),\end{equation}

where $\tilde{v}_{iaj}(s,t, \boldsymbol{\epsilon})$ is the probability that a new bandit that arrives in state i at time s ends up in state j at the beginning of time t and is applied action a at time t. We will use the following inequality for the Poisson random variable: if $X\sim \mbox{Pois}(\lambda)$ then

\begin{align*}\mathbb{P}(X - \lambda \gt \delta) & \le \frac{\mathbb{E}[\exp\{r \cdot (X -\lambda)\}]}{\exp\{r \delta\}}\\& = \exp\{\lambda \cdot ({\rm e}^r - r - 1) - \delta r\}\\&\le \exp\{\lambda r^2 - \delta r \}\end{align*}

for all $r \in [0,1]$; specifically, if $0 \le \delta \le 2 \lambda$, then $\mathbb{P}(X - \lambda \gt \delta) \le \exp\{- {\delta^2}/{4 \lambda} \}$ (this can be proved by simply substituting $r = {\delta}/{2 \lambda}$ in the previous bound). Now, as in the proof of Lemma 3, we can bound

\begin{align}&\mathbb{P}(\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})^c)\label{eqn48}\end{align}
(24)\begin{align}&\quad {\begin{gathered}\le \sum_{t=1}^T\sum_{\substack{(a,j): \, \tilde{S}_{a,j}(t, \boldsymbol{\epsilon}) \not=\emptyset \\ i \in \tilde{S}_{a,j}(t, \boldsymbol{\epsilon})}} \mathbb{P}\bigg(\mbox{Bin}(n_{i}, v_{iaj}(t, \boldsymbol{\epsilon})) - n_i \cdot v_{iaj}(t,\boldsymbol{\epsilon}) \, \gt \, \frac{\epsilon_t}{2 \cdot (1 + |\tilde{S}_{a,j}(t,\boldsymbol{\epsilon})|) \cdot A \cdot J}\bigg)\end{gathered}} \nonumber\\&\quad + \sum_{t=1}^T \mathbb{P}\bigg(\mbox{Pois}\bigg(\sum_{s=1}^t\sum_{i=1}^J \lambda_{si} \cdot \tilde{v}_{iaj}(s,t, \boldsymbol{\epsilon})\bigg)\nonumber\\&\quad - \sum_{s=1}^t \sum_{i=1}^J \lambda_{si} \cdot \tilde{v}_{iaj}(s,t, \boldsymbol{\epsilon}) \, \gt \, \frac{\epsilon_t}{2 \cdot (1+ |\tilde{S}_{a,j}(t, \boldsymbol{\epsilon})|) \cdot A \cdot J}\bigg) \nonumber\\&\quad \le A \cdot J^2 \cdot \sum_{t=1}^T \exp\bigg\{-\frac{\epsilon^2_t}{48 \cdot A^2 \cdot J^4 \cdot [\sum_{\ell = 1}^Jn_{\ell}] }\bigg\} \nonumber\\&\quad + A \cdot J^2 \sum_{t=1}^T \exp\bigg\{- \frac{\epsilon_t}{12\cdot A \cdot J^2 }\bigg\} \nonumber\\&\quad + \sum_{t=1}^T \exp\bigg\{-\frac{\epsilon_t^2}{64 \cdot A^2\cdot J^4 \cdot [\sum_{s=1}^t \sum_{i=1}^J \lambda_{is}]}\bigg\},\label{eqn49}\end{align}

where the last inequality follows since $J \ge 1$ implies that $1 +|\smash{\tilde{S}_{a,j}(t, \boldsymbol{\epsilon})}| \le 1 + J \le 2\,J$ and the simple fact that $\smash{\sum_{s=1}^t \sum_{i=1}^J \lambda_{si}} \cdot \tilde{v}_{iaj}(s,t, \boldsymbol{\epsilon}) \le \smash{\sum_{s=1}^t \sum_{i=1}^J\lambda_{si}}$.

Similarly, we can also bound $\smash{\mathbb{P}(\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon})^c)}$ and $\smash{\mathbb{P}(\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})^c)}$ as follows:

(25)\begin{align}\mathbb{P}(\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon})^c) & \le J^2 \cdot \sum_{t=1}^T\bigg[\exp\bigg\{- \frac{\epsilon^2_t}{48 \cdot J^4 \cdot [\sum_{\ell =1}^J n_{\ell}] }\bigg\} + \exp\bigg\{- \frac{\epsilon_t}{12 \cdot J^2}\bigg\} \bigg] \nonumber\\& + \sum_{t=1}^T \exp\bigg\{-\frac{\epsilon_t^2}{64 \cdot J^4 \cdot[\sum_{s=1}^t \sum_{i=1}^J \lambda_{is}]} \bigg\}, \label{eqn50}\end{align}
\begin{align}\text{and} \mathbb{P}(\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})^c) & \le J^2 \cdot \bigg[\exp\bigg\{- \frac{\epsilon^2_T}{12 \cdot J^2 \cdot[\sum_{\ell = 1}^J n_{\ell}] }\bigg\} + \exp\bigg\{- \frac{\epsilon_T}{6\cdot J }\bigg\} \bigg]\label{eqn51}\end{align}
(26)\begin{align}& + \exp\bigg\{-\frac{\epsilon_T^2}{16 \cdot J^4 \cdot [\sum_{s=1}^T\sum_{i=1}^J \lambda_{is}]} \bigg\}.\label{eqn52}\end{align}

Putting the bounds in (24), (25), and (26) together with $\smash{\mathbb{P}(\tilde{\mathcal{A}}(\boldsymbol{\epsilon}))} \ge 1 -\smash{\mathbb{P}(\tilde{\mathcal{A}}_1(\boldsymbol{\epsilon})^c)} -\smash{\mathbb{P}(\tilde{\mathcal{A}}_2(\boldsymbol{\epsilon})^c)} -\smash{\mathbb{P}(\tilde{\mathcal{A}}_3(\boldsymbol{\epsilon})^c)}$ yields the desired result (simply note that the bound in (24) is larger than the bounds in (25) and (26)). This completes the proof of Lemma 4.
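As a quick numerical sanity check (outside the analysis) of the Poisson tail inequality used above, one can again compare it with exact tail probabilities; the parameters below are made up.

from math import exp
from scipy.stats import poisson

lam = 200.0
for delta in [10, 50, 100, 200, 400]:
    if delta > 2 * lam:                          # the inequality is only claimed for delta <= 2*lam
        continue
    tail = poisson.sf(lam + delta, lam)          # P(X - lam > delta)
    bound = exp(-delta ** 2 / (4 * lam))
    assert tail <= bound
    print(f"delta={delta:4d}  tail={tail:.3e}  bound={bound:.3e}")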

Remark 5. For the setting with bandit departure and abandonment as discussed in Section 6, bounds (24), (25), and (26) can be significantly tightened by replacing the term $\smash{\sum_{s=1}^t\sum_{i=1}^J} \lambda_{is}$ with $\smash{\sum_{s=1}^t \alpha^{t-s} \cdot[\sum_{j=1}^J \lambda_{js}]}$. This is so because $\tilde{v}_{iaj}(s,t,\boldsymbol{\epsilon}) \le \alpha^{t-s}$ for all $t \gt s$, which implies that $\smash{\sum_{s=1}^t \sum_{i=1}^J} \lambda_{si} \cdot \tilde{v}_{iaj}(s,t, \boldsymbol{\epsilon}) \le \smash{\sum_{s=1}^t \sum_{i=1}^J\alpha^{t-s}} \cdot \lambda_{si}$.

References

Ahmad, S. H. A. et al. (2009). Optimality of myopic sensing in multichannel opportunistic access. IEEE Trans. Inf. Theory 55, 4040–4050.
Altman, E. (1999). Constrained Markov Decision Processes. Chapman & Hall/CRC, Boca Raton, FL.
Ayer, T. et al. (2016). Prioritizing hepatitis C treatment in U.S. prisons. Preprint. Available at SSRN: https://ssrn.com/abstract=2869158.
Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, London.
Bertsimas, D. and Niño-Mora, J. (2000). Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operat. Res. 48, 80–90.
Bradt, R. N., Johnson, S. M. and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27, 1060–1074.
Caro, F. and Gallien, J. (2007). Dynamic assortment with demand learning for seasonal consumer goods. Manag. Sci. 53, 276–292.
Cohen, K., Zhao, Q. and Scaglione, A. (2014). Restless multi-armed bandits under time-varying activation constraints for dynamic spectrum access. In 2014 48th Asilomar Conference on Signals, Systems and Computers, IEEE, pp. 1575–1578.
Deo, S. et al. (2013). Improving health outcomes through better capacity allocation in a community-based chronic care model. Operat. Res. 61, 1277–1294.
Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (Budapest, 1972; Colloq. Math. Soc. János Bolyai 9), North-Holland, Amsterdam, pp. 241–266.
Gittins, J., Glazebrook, K. and Weber, R. (2011). Multi-Armed Bandit Allocation Indices, 2nd edn. John Wiley, Chichester.
Hu, W. and Frazier, P. (2017). An asymptotically optimal index policy for finite-horizon restless bandits. Preprint. Available at https://arxiv.org/abs/1707.00205v1.
Kelly, F. P. (1981). Multi-armed bandits with discount factor near one: the Bernoulli case. Ann. Statist. 9, 987–1001.
Le Ny, J., Dahleh, M. and Feron, E. (2008). A linear programming relaxation and a heuristic for the restless bandit problem with general switching costs. Preprint. Available at https://arxiv.org/abs/0805.1563v1.
Lee, E., Lavieri, M. S. and Volk, M. (2018). Optimal screening for hepatocellular carcinoma: a restless bandit model. Manufacturing Service Operat. Manag. 21.
Mahajan, A. and Teneketzis, D. (2008). Multi-armed bandit problems. In Foundations and Applications of Sensor Management, Springer, Boston, MA, pp. 121–151.
Nain, P. and Ross, K. W. (1986). Optimal priority assignment with hard constraint. IEEE Trans. Automatic Control 31, 883–888.
Niño-Mora, J. (2011). Computing a classic index for finite-horizon bandits. INFORMS J. Comput. 23, 254–267.
Papadimitriou, C. H. and Tsitsiklis, J. N. (1999). The complexity of optimal queuing network control. Math. Operat. Res. 24, 293–305.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58, 527–535.
Schrijver, A. (2000). Theory of Linear and Integer Programming. John Wiley, New York.
Verloop, I. M. (2016). Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Prob. 26, 1947–1995.
Washburn, R. B. (2008). Application of multi-armed bandits to sensor management. In Foundations and Applications of Sensor Management, Springer, Boston, MA, pp. 153–175.
Weber, R. R. and Weiss, G. (1990). On an index policy for restless bandits. J. Appl. Prob. 27, 637–648.
Whittle, P. (1980). Multi-armed bandits and the Gittins index. J. R. Statist. Soc. B 42, 143–149.
Whittle, P. (1988). Restless bandits: activity allocation in a changing world. In A Celebration of Applied Probability (J. Appl. Prob. Spec. Vol. 25(A)), ed. Gani, J., Applied Probability Trust, Sheffield, pp. 287–298.
Zayas-Cabán, G., Jasin, S. and Wang, G. (2019). An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Supplementary material. Available at https://doi.org/10.1017/apr.2019.29.
Table 1: Percentage loss for two states and two actions.

Table 2: Percentage loss for five states and five actions.

Table 3: Percentage loss for two states and one action example using RAC.

Table 4: Percentage loss for two states and one action using priority rule.

Table 5: Percentage budget used for two states and one action.
