This dissertation studies optimal policies for two problems on Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) with compact state and action spaces: risk-sensitive control maximizing reward, and large deviation control maximizing the "up-side chance" that the long-run average reward exceeds a benchmark.

For the risk-sensitive control problem, under the assumption that the transition probabilities satisfy a multi-step reachability condition, we apply an extension of the Krein-Rutman theorem to nonlinear operators to characterize the optimal risk-sensitive value by the spectral radius of the operator induced by the Bellman equation, and we prove the existence of a deterministic stationary optimal policy.

For the large deviation control problem, we establish a dual relation between large deviation control and risk-sensitive control and prove that, for benchmarks satisfying a specific condition, the optimal policy of the "up-side chance" maximization problem can be approximated by the deterministic stationary optimal policies of risk-sensitive control. We also give examples showing that, when the chosen benchmark violates this condition, the dual relation fails and the optimal policy cannot be approximated by deterministic stationary policies. Given the importance of this duality, we characterize the benchmarks for which it holds via the right derivative of the optimal risk-sensitive value function with respect to the risk-sensitivity factor, and we characterize this right derivative by establishing an important variational formula for the optimal risk-sensitive value function, which yields a more intuitive description of the region of valid benchmarks. Using the variational formula, we further prove that the optimal risk-sensitive value function converges to the optimal reward under the classical risk-neutral long-run average criterion as the risk-sensitivity factor tends to 0. Building on these results, we also investigate extensions of this convergence result, the dual relation, and the variational formula to partially observable Markov decision processes.

The main innovations of this dissertation are as follows:
1. In establishing the existence of deterministic stationary optimal policies for risk-sensitive control, the strong ergodicity condition, or the stringent requirement that the one-step transition probability between any two states be positive, assumed in previous literature is weakened to the positivity of multi-step transition probabilities, and existence is proved via the extended Krein-Rutman theorem, which substantially broadens the applicability of the model.
2. The assumption in previous literature that the optimal risk-sensitive value function is differentiable in the risk-sensitivity factor is removed: the dual relation between large deviation control and risk-sensitive control is established solely through a careful analysis of the convexity of this value function.
3. To the best of our knowledge, previous literature has not described the valid benchmarks intuitively; we derive a characterization of practical significance via the variational formula, extend it to the reducible case by a perturbation technique, and in this general framework obtain the limit of the optimal value function as the risk-sensitivity factor tends to 0.
4. To the best of our knowledge, no analogues of this limit, the dual relation, or the variational formula had been established for partially observable Markov decision processes; under suitable conditions, this dissertation extends the above results to partially observable models.
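For concreteness, the LaTeX sketch below records one standard way to formalize the quantities named above: the risk-sensitive value, the operator induced by the Bellman equation, the dual relation, and the risk-vanishing limit. The notation (reward $r$, risk-sensitivity factor $\theta$, benchmark $\alpha$, policy $\pi$, state-action process $(X_t, A_t)$) is our own illustrative choice; the dissertation's precise definitions and standing conditions may differ.

% A hedged sketch of the objects named in the abstract; notation is illustrative,
% not the dissertation's own. Requires amsmath/amssymb.
% Risk-sensitive optimal value for a risk-sensitivity factor \theta > 0:
\[
  \Lambda(\theta) \;=\; \sup_{\pi}\ \limsup_{n\to\infty}
  \frac{1}{n}\,\log \mathbb{E}^{\pi}\!\Bigl[\exp\Bigl(\theta \sum_{t=0}^{n-1} r(X_t, A_t)\Bigr)\Bigr].
\]
% Operator induced by the (multiplicative) Bellman equation; its spectral radius
% \rho(T_\theta) = e^{\Lambda(\theta)} characterizes the optimal value:
\[
  (T_\theta h)(x) \;=\; \sup_{a}\ e^{\theta r(x,a)} \int h(y)\, P(dy \mid x, a).
\]
% Dual relation with the "up-side chance" large deviation control, for a
% benchmark \alpha in the region where the duality holds:
\[
  \sup_{\pi}\ \limsup_{n\to\infty} \frac{1}{n}\,
  \log \mathbb{P}^{\pi}\!\Bigl(\frac{1}{n}\sum_{t=0}^{n-1} r(X_t, A_t) \ge \alpha\Bigr)
  \;=\; -\sup_{\theta > 0}\,\bigl(\theta\alpha - \Lambda(\theta)\bigr).
\]
% Risk-vanishing limit recovering the risk-neutral long-run average criterion:
\[
  \lim_{\theta \downarrow 0} \frac{\Lambda(\theta)}{\theta}
  \;=\; \sup_{\pi}\ \limsup_{n\to\infty} \frac{1}{n}\,
  \mathbb{E}^{\pi}\!\Bigl[\sum_{t=0}^{n-1} r(X_t, A_t)\Bigr].
\]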