Source
@ARTICLE{Thananjeyan_2021_recoveryrl,
author={Thananjeyan, Brijen and Balakrishna, Ashwin and Nair, Suraj and Luo, Michael and Srinivasan, Krishnan and Hwang, Minho and Gonzalez, Joseph E. and Ibarz, Julian and Finn, Chelsea and Goldberg, Ken},
journal={IEEE Robotics and Automation Letters},
title={Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones},
year={2021},
volume={6},
number={3},
pages={4915-4922},
doi={10.1109/LRA.2021.3070252}
}
(UC Berkeley) | arXiv
TL;DR
…

Flash Reading
- Abstract: Learning new tasks in uncertain environments requires extensive exploration, but safety requires limiting that exploration. Recovery RL (RRL) leverages offline data to learn about constraint violations before policy learning, and it separates the goals of improving task performance and satisfying constraints across two policies: a task policy that optimizes the task reward and a recovery policy that guides the agent back to safety when a constraint violation is likely.
- Introduction: Most prior work in safe RL integrates constraint satisfaction into the task objective. This work has two key ideas: two policies (a task policy and a recovery policy), and offline learning of the recovery set and recovery policy. Both the recovery set and the recovery policy continue to be updated during online learning.
- Related Work: Prior work in safe RL focuses on (a) imposing constraints on the expected return [1]; (b) risk measures [2]; and (c) avoiding regions of the MDP where constraint violations are possible [3]. This work uses a learned recovery policy to keep the agent within a learned safe region. In [4], a safety critic is trained to estimate the probability of future constraint violation under the current policy, and a Lagrangian objective is optimized to limit violations. This work differs in that the safety critic is used to decide when to execute a learned recovery policy instead of modifying the task policy's objective. Other works use recovery or shielding mechanisms to restrict exploration but assume prior knowledge of the system dynamics or constraints; this work learns that information from offline data and online experience, which offers better scalability and flexibility.
- Problem Statement: In addition to the standard MDP $(S,A,P,R,\gamma)$, an extra constraint cost function $C:S\rightarrow \{0,1\}$ is defined, with an associated discount factor $\gamma_{\text{risk}}$. The expected discounted constraint return is \(Q^\pi_{\text{risk}}(s_i,a_i)=\mathbb{E}_{\pi,P}\big[\sum_{t=i}^\infty \gamma_{\text{risk}}^{\,t-i} C(s_t)\big].\) The goal is to maximize the expected discounted task return while keeping the expected discounted constraint return below a threshold $\epsilon_\text{risk}$ (a small worked example follows after this list). This formulation is the same as a constrained MDP (CMDP), except that here the constraint cost is a binary indicator. Setting $\gamma_{\text{risk}}=0$ makes it equivalent to a robust control problem.
- Recovery RL: The discounted future probability of constraint violation $Q^\pi_{\text{risk}}$ is estimated by a sample-based approximation $\hat{Q}^\pi_{\phi,\text{risk}}$ parameterized by $\phi$, trained on tuples $(s_t,a_t,s_{t+1},c_t)$ via the MSE loss \(J_{\text{risk}}(s_t,a_t,s_{t+1};\phi) = \frac{1}{2}\left( \hat{Q}^\pi_{\phi,\text{risk}}(s_t,a_t) - \big(c_t + \gamma_{\text{risk}}(1-c_t)\,\mathbb{E}_{a_{t+1}\sim\pi}[\hat{Q}^\pi_{\phi_{\text{targ}},\text{risk}}(s_{t+1},a_{t+1})]\big)\right)^2.\) Define the safe set \(\mathcal{T}^\pi_{\text{safe}} = \{(s,a)\,|\,Q^\pi_{\text{risk}}(s,a) \leq \epsilon_{\text{risk}}\}\) and the recovery set \(\mathcal{T}^\pi_{\text{recover}} = S \times A \setminus \mathcal{T}^\pi_{\text{safe}}\). If the task policy $\pi_{\text{task}}$ proposes an action $a^{\pi_{\text{task}}}$ that lies in the recovery set, an action sampled from the recovery policy $\pi_{\text{rec}}$ is executed instead. The recovery policy is trained to minimize the expected discounted constraint return $\hat{Q}^\pi_{\text{risk}}$ (i.e., to minimize risk). So that the task agent does not mistakenly credit the recovery policy's actions to itself, it is trained on the actions it proposed rather than on the actions actually executed (which may come from the recovery policy); see the sketch after this list. The recovery policy is pretrained offline, on data collected under human supervision for safety, to give it good initial performance. In this work, both an off-policy model-free method (DDPG) and a model-based method (MPC) are used to learn the recovery policy; the task policy is learned with SAC [5].
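To make the constraint-return objective in the Problem Statement concrete, here is a minimal sketch (not from the paper) that computes a Monte Carlo estimate of $Q^\pi_{\text{risk}}$ from one rollout's binary costs; the function name and the hard-coded numbers are illustrative only.

```python
import numpy as np

def discounted_constraint_return(costs, gamma_risk):
    """Monte Carlo estimate of Q_risk(s_i, a_i) from a single rollout.

    costs: binary constraint indicators C(s_t) for t = i, i+1, ...
           e.g. [0, 0, 1] means the constraint was violated two steps later.
    """
    costs = np.asarray(costs, dtype=float)
    discounts = gamma_risk ** np.arange(len(costs))
    return float(np.sum(discounts * costs))

# With binary costs (and episodes that stop at a violation), this value acts as
# a discounted probability of future constraint violation; the agent must keep
# it below eps_risk while maximizing the usual task return.
print(discounted_constraint_return([0, 0, 1], gamma_risk=0.8))  # -> 0.64
```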
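The action-filtering and relabeling logic in the Recovery RL bullet can be summarized in a short sketch. This is an assumed reading of the algorithm, not the authors' code: `pi_task`, `pi_rec`, and `q_risk` stand in for the learned SAC task policy, the recovery policy, and the safety critic, and the environment is assumed to return the binary constraint cost directly.

```python
def safety_critic_target(c, q_risk_next, gamma_risk):
    # Bellman target used in the MSE loss above:
    # c_t + gamma_risk * (1 - c_t) * E_{a' ~ pi}[Q_risk(s_{t+1}, a')]
    return c + gamma_risk * (1.0 - c) * q_risk_next

def recovery_rl_step(env, s, pi_task, pi_rec, q_risk, eps_risk,
                     task_buffer, rec_buffer):
    a_task = pi_task(s)                    # action proposed by the task policy
    if q_risk(s, a_task) > eps_risk:       # (s, a_task) lies in the recovery set
        a_exec = pi_rec(s)                 # override with a recovery action
    else:
        a_exec = a_task                    # safe enough: execute the proposal
    s_next, r, c, done = env.step(a_exec)  # c is the binary constraint cost

    # Relabeling: the task policy is trained on the action it *proposed*, so it
    # experiences the composite "environment + recovery policy" dynamics and
    # cannot take credit for recovery actions it did not choose.
    task_buffer.add((s, a_task, r, s_next, done))
    # The safety critic and recovery policy train on the executed action.
    rec_buffer.add((s, a_exec, c, s_next, done))
    return s_next, done
```

The point this sketch highlights is that filtering happens per action proposal, so the task objective itself is never modified; unsafe proposals are simply intercepted.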
References
- [1] Constrained Policy Optimization, ICML 2017. arXiv
- [2] Worst Cases Policy Gradients, CoRL 2019. arXiv
- [3] Safe Model-based Reinforcement Learning with Stability Guarantees, NeurIPS 2017. arXiv
- [4] Learning to be Safe: Deep RL with a Safety Critic. arXiv
- [5] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML 2018. arXiv
Extension
- The cost function $C:S\to \{0,1\}$ indicates whether a state is safe or unsafe. Instead, we could define whether an action is safe or unsafe, i.e., $C:S\times A\to \{0,1\}$.
- Train a critic alongside the task policy, but only from failed experiences; it would tell us which state-action pairs lead to failure and, by implication, which actions are safe (a rough sketch follows below).
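One possible instantiation of the second extension, under assumptions that go beyond the note: keep only episodes that ended in a constraint violation, label the last $k$ state-action pairs before the failure as unsafe, and fit a binary classifier over $(s,a)$. All names below (`FailureCritic`, `label_failed_episode`, `train_step`) are hypothetical.

```python
import torch
import torch.nn as nn

class FailureCritic(nn.Module):
    """Classifier estimating P(failure | s, a) from failed episodes only."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def label_failed_episode(states, actions, k=5):
    """Mark the last k state-action pairs of a failed episode as unsafe (1)."""
    n = len(states)
    return [(s, a, 1.0 if i >= n - k else 0.0)
            for i, (s, a) in enumerate(zip(states, actions))]

def train_step(critic, optimizer, s, a, y):
    # s, a, y: batched tensors of states, actions, and unsafe labels
    loss = nn.functional.binary_cross_entropy(critic(s, a), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

State-action pairs the critic scores as low-risk would then be treated as safe, playing a role analogous to the paper's safety critic and recovery set, but learned purely from failures.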