Abstract
In this paper, we study the robustness of policy optimization, specifically the Gauss–Newton gradient descent algorithm, which is equivalent to policy iteration in reinforcement learning, subject to noise at each iteration. By invoking the concept of input-to-state stability and utilizing Lyapunov’s direct method, we show that, if the noise is sufficiently small, the policy iteration algorithm converges to a small neighborhood of the optimal solution even in the presence of noise at each iteration. Explicit expressions are provided for the upper bound on the noise and for the size of the neighborhood to which the policies ultimately converge. Based on Willems’ fundamental lemma, a learning-based policy iteration algorithm is proposed, in which the persistent excitation condition can be readily guaranteed by checking the rank of the Hankel matrix associated with an exploration signal. The robustness of the learning-based policy iteration to measurement noise and unknown system disturbances is established theoretically via the input-to-state stability of the policy iteration. Several numerical simulations demonstrate the efficacy of the proposed method.
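The Hankel-matrix rank test mentioned in the abstract can be illustrated concretely. The sketch below (not the paper's implementation; function names and the scalar-input setting are illustrative assumptions) builds the depth-L Hankel matrix of an exploration signal and declares the signal persistently exciting of order L when that matrix has full row rank, as in Willems' fundamental lemma:

```python
import numpy as np

def hankel_matrix(u, L):
    """Depth-L block Hankel matrix of a signal u with samples indexed by rows."""
    u = np.atleast_2d(u)
    if u.shape[0] < u.shape[1]:
        u = u.T  # ensure rows index time, columns index input channels
    T, m = u.shape
    cols = T - L + 1
    H = np.zeros((L * m, cols))
    for i in range(L):
        # each block row collects the signal shifted by i steps
        H[i * m:(i + 1) * m, :] = u[i:i + cols, :].T
    return H

def is_persistently_exciting(u, L):
    """u is persistently exciting of order L iff its depth-L Hankel
    matrix has full row rank (L * m for an m-channel input)."""
    H = hankel_matrix(u, L)
    return np.linalg.matrix_rank(H) == H.shape[0]

rng = np.random.default_rng(0)
u = rng.standard_normal(50)                  # random exploration signal, m = 1
print(is_persistently_exciting(u, 10))       # random noise passes the rank test
print(is_persistently_exciting(np.ones(20), 2))  # a constant signal fails it
```

A random exploration signal generically satisfies the rank condition, whereas a constant input does not, which is why the paper's learning-based algorithm can verify excitation simply by computing one matrix rank.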
| Original language | English (US) |
|---|---|
| Pages (from-to) | 374-389 |
| Number of pages | 16 |
| Journal | Control Theory and Technology |
| Volume | 21 |
| Issue number | 3 |
| DOIs | |
| State | Published - Aug 2023 |
Keywords
- Input-to-state stability (ISS)
- Lyapunov’s direct method
- Policy iteration (PI)
- Policy optimization
ASJC Scopus subject areas
- Control and Systems Engineering
- Signal Processing
- Information Systems
- Modeling and Simulation
- Aerospace Engineering
- Control and Optimization
- Electrical and Electronic Engineering