The study of expectation optimality criteria (the standard criteria) constitutes the vast majority of previous work on Markov decision processes (MDPs). However, the optimal policies obtained from such models are not reliable when only a single or a few decision processes are considered, since only the average performance over many trials is guaranteed to be optimal. In fact, expectation optimality criteria are insufficient to characterize the variability and risk features of practical problems. For many controllable stochastic dynamic systems, the goal is to maximize the reliability of normal operation. In insurance, the risk of the total capital falling below some lower limit should generally be avoided as far as possible; in dynamic portfolio selection, investors are interested in strategies that reach a given profit with maximal probability.
In all of these applications, which demand high reliability, system performance is controlled on a single-trial basis, and the task requirements are therefore formulated as probabilities rather than expectations. Several papers have been devoted to probability criteria for various rewards. [Filar, 1983; Filar et al., 1995] studied percentile performance criteria for the limiting average return. [Zhang et al., 1983; White, 1993] considered threshold probability criteria for discounted MDPs, focusing on the properties of the optimality equations without discussing the existence and properties of optimal policies. We are further motivated to investigate stochastic order optimization problems [Derman & Ignall, 1975; Ross, 1983; Stoyan, 1983; Bouakiz & Kebir, 1995; Grandell, 1991; White, 1993], mainly the distribution function criteria for the non-discounted first arrival target total reward.
In our formulation, target level problems are recast in a total reward and optimal stopping setting. The target is a prescribed set of system states, corresponding to the failure set in reliability applications. Once the system enters one of these states, the decision process is terminated. Different terminal states may have different exit rewards. For a policy p, the first arrival target total return W(p) is the sum of the single-stage rewards plus the exit reward received upon the system's first visit to the target. The objective function of this model, Vi(p, l), is defined as the probability that the total reward exceeds a given reward level l when the initial state is i. For example, the optimal regulation of a hydropower station reservoir should maximize the probability that electric power generation exceeds some given value under normal water levels. The general optimization model is to find a policy p that maximizes Vi(p, l) for every initial state i and for the return levels l of interest.
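As a concrete illustration, Vi(p, l) can be estimated by Monte Carlo simulation for a fixed stationary policy. The toy MDP below (its states, transition probabilities, rewards, exit rewards, and target set {2}) is entirely hypothetical and only sketches the definitions above:

```python
import random

# Toy MDP: states 0 and 1 are transient; state 2 is the target (absorbing).
# All numerical values below are illustrative assumptions, not from the paper.
P = {  # P[state][action] -> list of (next_state, probability)
    0: {"a": [(0, 0.5), (1, 0.3), (2, 0.2)], "b": [(1, 0.7), (2, 0.3)]},
    1: {"a": [(0, 0.4), (2, 0.6)], "b": [(1, 0.5), (2, 0.5)]},
}
reward = {0: {"a": 1.0, "b": 0.5}, 1: {"a": 2.0, "b": 1.5}}  # single-stage rewards
exit_reward = {2: 3.0}                                        # exit reward at the target
policy = {0: "a", 1: "b"}                                     # a fixed stationary policy p
TARGET = {2}

def sample_W(start, rng):
    """One realization of the first arrival target total return W(p)."""
    s, total = start, 0.0
    while s not in TARGET:
        a = policy[s]
        total += reward[s][a]
        u, acc = rng.random(), 0.0
        for nxt, prob in P[s][a]:  # draw the next state
            acc += prob
            if u < acc:
                s = nxt
                break
    return total + exit_reward[s]

def V(start, level, n=100_000, seed=0):
    """Monte Carlo estimate of Vi(p, l) = P(W(p) > l | initial state = start)."""
    rng = random.Random(seed)
    return sum(sample_W(start, rng) > level for _ in range(n)) / n
```

Here `V(0, 4.0)` estimates the probability that, starting from state 0, the total return under the fixed policy exceeds the level 4.0; the optimization model then asks for the policy making this quantity maximal.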
The ideal and dominant optimal policy is the complete stochastic order optimal policy, which provides the maximum reliability at every outcome level simultaneously. We successively weaken the optimality requirement and study three classes of level sets: 1) the infinite interval, for the complete stochastic order optimization model; 2) a finite interval, for the local stochastic order optimization model; 3) a single point, for the single point stochastic order optimization model. This approach can be regarded as an extension of multilevel optimization in many applications.
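Under the notation above, the three successively weaker requirements can be written out explicitly. The following formalization is our reading of the text, with Π denoting the set of admissible policies:

```latex
% Complete stochastic order optimality: p* dominates every policy at all levels
\forall p \in \Pi,\ \forall i,\ \forall l \in (-\infty, +\infty):\quad
  V_i(p^*, l) \ \ge\ V_i(p, l)

% Local stochastic order optimality: dominance only on a finite interval [a, b]
\forall p \in \Pi,\ \forall i,\ \forall l \in [a, b]:\quad
  V_i(p^*, l) \ \ge\ V_i(p, l)

% Single point optimality: dominance at one prescribed level l_0
\forall p \in \Pi,\ \forall i:\quad
  V_i(p^*, l_0) \ \ge\ V_i(p, l_0)
```

Complete optimality implies local optimality, which in turn implies single point optimality at every level of the interval, so existence results for the stronger notions carry over to the weaker ones.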
D. J. White, 1988. Mean, Variance and Probabilistic Criteria in Finite Markov Decision Processes: a Review, Journal of Optimization Theory & Applications, 56(1988), 1-29.
R. A. Howard & J. E. Matheson, 1972. Risk-sensitive Markov Decision Processes, Management Science, 18(1972), 356-369.
C. Derman & E. Ignall, 1975. On the Stochastic Ordering of Markov Chains, Operations Research, 23(1975), 574-576.
S. M. Ross, 1983. “Stochastic Processes,” John Wiley and Sons, New York, 1983.
D. Stoyan, 1983. “Comparison Methods for Queues and Other Stochastic Models,” John Wiley and Sons, New York, 1983.
Kun-Jen Chung & Matthew J. Sobel, 1987. Discounted MDP's: Distribution Functions and Exponential Utility Maximization, SIAM Journal on Control & Optimization, 25(1987), 49-62.
Yuanlie Lin, R. J. Tomkins & Chunglie Wang, 1994. Optimal Models for the First Arrival Time Distribution Function in Continuous Time - A Special Case, Acta Mathematicae Applicatae Sinica, 10(1994), 194-212.
Wanqi Zhang, Qiyuan Jiang & Yuanlie Lin, 1983. Reliability-constrained Markov Decision Programming and Penalty Factor Method, Journal of Tsinghua University, 23(1983), 61-71.
Yuanlie Lin & Qiyuan Jiang, 1984. Three Problems of Applying Markov Decision Programming to the Optimal Regulation of Hydropower Station Reservoirs, in “China-Japan Symposium on Statistics,” Beijing, China, 1984, 143-147.
M. Bouakiz & Y. Kebir, 1995. Target-level Criterion in Markov Decision Processes, Journal of Optimization Theory & Applications, 86(1995), 1-15.
J. A. Filar, 1983. Percentiles and Markov Decision Processes, Operations Research Letters, 2(1983), 13-15.
J. A. Filar, D. Krass & K. W. Ross, 1995. Percentile Performance Criteria for Limiting Average Markov Decision Processes, IEEE Transactions on Automatic Control, 40(1995), 2-9.
D. J. White, 1993. Minimizing a Threshold Probability in Discounted Markov Decision Processes, Journal of Mathematical Analysis & Applications, 173(1993), 634-646.
J. Grandell, 1991. “Aspects of Risk Theory,” Springer-Verlag, 1991.
Jianxing Lin, 1987. “A Study of Models for Markov Decision Programming in Discrete Time,” Master Thesis, Department of Applied Mathematics, Tsinghua University, Beijing, P. R. China, 1987.
Yuanlie Lin, R. J. Tomkins & Chunglie Wang, 1991. Optimal Models for the First Arrival Time Distribution Function in Continuous Time, Proceedings of APORS'91, 14(1991), 292-299.
Yuanlie Lin & Jianxing Lin, 1995. Models for the First Arrival Time Distribution Function Optimization and Risk Minimization, Journal of Tsinghua University, 36(1995), 53-59.
Harold J. Kushner & G. Dupuis, 1992. “Numerical Methods for Stochastic Control Problems in Continuous Time,” Springer-Verlag, New York, 1992.
Yuanlie Lin, 1991. Continuous Time First Arrival Target Models (I) – Discounted Moment Optimal Models, Acta Mathematicae Applicatae Sinica, 14(1991), 116-124.
Yuanlie Lin, 1993. Optimal Models for the First Arrival Target in Continuous Time (II) – L Optimal Problems, Journal of Tsinghua University, 33(1993), 1-9.
Yuanlie Lin, Xingxing Yu & Jianxing Lin, 1995. Models for First Arrival Target Distribution Function Optimization and Risk Minimization in Discrete Time, Operations Research and Its Applications, First International Symposium, ISORA’95, Beijing, P. R. China, August 1995 Proceedings, 368-375.
Xingxing Yu, 1996. “Research on Joint Delay-Filter Identification and Reinforcement Learning Control Models,” Master Thesis, Department of Automation, Tsinghua University, Beijing, P. R. China, 1996.
David Blackwell, 1962. Discrete Dynamic Programming, Annals of Mathematical Statistics, 33(1962), 719-726.