Abstract
We consider time-average Markov Decision Processes (MDPs), which accumulate a reward and a cost at each decision epoch. A policy meets the sample-path constraint if the time-average cost is below a specified value with probability one. The optimization problem is to maximize the expected average reward over all policies that meet the sample-path constraint. The sample-path constraint is compared with the more commonly studied constraint of requiring the average expected cost to be less than a specified value. Assuming that a policy exists that meets the sample-path constraint, we establish that there exist nearly optimal stationary policies for communicating MDPs. A parametric linear programming algorithm is given to construct nearly optimal stationary policies. The discussion relies on well-known results from the theory of stochastic processes and linear programming.
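As a reading aid, the two constraints the abstract contrasts can be written in one standard notation; the symbols below are assumed for illustration, not taken from the paper itself. Let $C_t$ and $R_t$ denote the cost and reward incurred at epoch $t$ under a policy $\pi$, and let $\alpha$ be the specified value.

```latex
% Assumed notation: C_t = cost at epoch t, R_t = reward, alpha = specified value.
% Sample-path constraint (holds with probability one):
\limsup_{N\to\infty} \frac{1}{N}\sum_{t=1}^{N} C_t \le \alpha \quad \text{a.s.}
% Expected-average-cost constraint (the more commonly studied variant):
\limsup_{N\to\infty} \frac{1}{N}\,\mathbb{E}_\pi\!\left[\sum_{t=1}^{N} C_t\right] \le \alpha
% Objective: maximize the expected average reward over policies that meet
% the sample-path constraint:
\max_\pi \; \liminf_{N\to\infty} \frac{1}{N}\,\mathbb{E}_\pi\!\left[\sum_{t=1}^{N} R_t\right]
```

The abstract also mentions a parametric linear programming algorithm for constructing nearly optimal stationary policies. That algorithm is not reproduced here; the sketch below shows only the classical occupation-measure LP for average-reward MDPs with an added average-cost constraint, the kind of formulation such an algorithm builds on. All model data (`P`, `r`, `c`, `alpha`) are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data for a tiny MDP (all values hypothetical).
S, A = 3, 2                                   # number of states and actions
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] = distribution over next states
r = rng.uniform(size=(S, A))                  # per-epoch reward r(s, a)
c = rng.uniform(size=(S, A))                  # per-epoch cost c(s, a)
alpha = 0.5                                   # specified cost bound

# Classical occupation-measure LP for an average-reward MDP with an
# average-cost constraint (not the paper's parametric algorithm itself):
#   maximize   sum_{s,a} x[s,a] * r[s,a]
#   subject to sum_a x[s',a] = sum_{s,a} x[s,a] * P[s,a,s']  for every s'
#              sum_{s,a} x[s,a] = 1,   x >= 0
#              sum_{s,a} x[s,a] * c[s,a] <= alpha
n = S * A
obj = -r.reshape(n)                           # linprog minimizes, so negate rewards

A_eq = np.zeros((S + 1, n))                   # S flow-balance rows + 1 normalization row
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = (s == sp) - P[s, a, sp]
A_eq[S, :] = 1.0
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

res = linprog(obj, A_ub=c.reshape(1, n), b_ub=[alpha],
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")
x = res.x.reshape(S, A)                       # optimal occupation measure

# Read off a stationary randomized policy pi(a | s) on visited states;
# states the occupation measure never visits get a uniform placeholder.
occ = x.sum(axis=1, keepdims=True)
pi = np.divide(x, occ, out=np.full_like(x, 1.0 / A), where=occ > 1e-12)
print("optimal constrained average reward:", -res.fun)
print("stationary policy:\n", pi)
```

Note that this LP enforces the expected-average-cost constraint; part of the paper's contribution is relating such formulations to the stronger sample-path constraint for communicating MDPs.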
Original language | English (US) |
---|---|
Pages (from-to) | 780-790 |
Number of pages | 11 |
Journal | Operations Research |
Volume | 37 |
Issue number | 5 |
State | Published - 1989 |
ASJC Scopus subject areas
- Computer Science Applications
- Management Science and Operations Research