Source
@misc{Srinivasan_2020_safecritic,
title={Learning to be Safe: Deep RL with a Safety Critic},
author={Krishnan Srinivasan and Benjamin Eysenbach and Sehoon Ha and Jie Tan and Chelsea Finn},
year={2020},
eprint={2010.14603},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2010.14603},
}
(Stanford) | arXiv
TL;DR
…

Flash Reading
- Abstract: Learning safety specifications is necessary when manual specification is impractical. This work proposes to learn a safety critic in one set of tasks and environments, and then use it to constrain exploration in new tasks and environments.
- Introduction: Learning what is safe requires exploring unsafe states. However, experiences of failure collected in relatively safe environments can be used to learn safety in new, riskier environments, which makes exploration in new tasks safer. This approach, safety Q-functions for reinforcement learning (SQRL), learns a critic that evaluates whether a state-action pair will lead to unsafe behavior under a policy that is itself constrained by the safety critic. This is achieved by training the safety critic and the policy concurrently. During pretraining, the agent is allowed to explore and learn about unsafe behaviors; during finetuning, the agent is constrained by the learned safety critic while exploring the new environment.
- Problem Statement: The conditions leading to failure are difficult to specify before learning. While unsafe states are easy to identify, such as the robot falling down or dropping an object, formally specifying safety constraints that capture these states is non-trivial, can be biased, and can significantly hinder learning. Failures are treated as terminal states to avoid accumulating additional costs. The proposed training has two phases: (1) learning an exploratory policy that solves a simpler/safer task in the pre-training environment, and (2) transferring the learned policy to a more safety-critical target task with safety guarantees.
- SQRL (pretraining): SQRL learns a policy and a notion of safety jointly in Phase 1, then finetunes the policy on the target task in Phase 2 with the learned safety critic as a constraint. The safety critic \(Q^{\bar{\pi}}_{\text{safe}}\) estimates the future failure probability of a policy given a state-action pair. During pretraining, the goal is to learn a safety critic \(Q^{\bar{\pi}}_{\text{safe}}\) and an initial policy \(\pi^*_{\text{pre}}\). Given the pretraining task \(\mathcal{T}_{\text{pre}}\), the safety critic estimates \(Q^{\pi}_{\text{safe}}(s_t,a_t) = \mathcal{I}(s_t) + (1-\mathcal{I}(s_t))\, \mathbb{E}_{\pi,\, P_{\text{pre}}}\big[\sum^{T}_{t'=t+1} \gamma_{\text{safe}}^{t'-t}\, \mathcal{I}(s_{t'})\big]\), where \(\mathcal{I}(s_t)\) is a safety-incident indicator. This is estimated via Q-learning with the Bellman equation \(\hat{Q}^{\pi}_{\text{safe}}(s,a) = \mathcal{I}(s) + (1-\mathcal{I}(s))\, \mathbb{E}_{s'\sim P_{\text{pre}}(\cdot|s,a),\, a'\sim \pi(\cdot|s')}[\gamma_{\text{safe}} \hat{Q}^{\pi}_{\text{safe}}(s',a')]\). Parameterizing the safety critic as \(\hat{Q}^{\pi}_{\phi,\text{safe}}\) with parameters \(\phi\), the training objective is \(J_{\text{safe}}(\phi) = \mathbb{E}_{(s,a,s')\sim D_{\text{pre}},\, a'\sim \pi(\cdot|s')}\big[\big(\hat{Q}^{\pi}_{\phi,\text{safe}}(s,a) - \mathcal{I}(s) - (1-\mathcal{I}(s))\,\gamma_{\text{safe}} \bar{Q}^{\pi}_{\text{safe}}(s',a')\big)^2\big]\) (see the critic-update sketch after this list). To encourage exploration, the policy is learned with a maximum-reward, maximum-entropy RL objective. The safety critic is optimized under \(\bar{\pi}(a|s)\), the mixture of policies constrained by the safety critic itself.
- SQRL (finetuning): In Phase 2, all data is collected using the policy constrained by \(\hat{Q}_{\text{safe}}\) (one possible implementation of this constraint is sketched below). More details are in the paper.
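
The following is a minimal PyTorch-style sketch of the safety-critic update described in the pretraining bullet: regress \(\hat{Q}_{\phi,\text{safe}}(s,a)\) onto the target \(\mathcal{I}(s) + (1-\mathcal{I}(s))\,\gamma_{\text{safe}}\bar{Q}_{\text{safe}}(s',a')\) with \(a'\sim\pi(\cdot|s')\). The names (`SafetyCritic`, `safety_critic_loss`), the network sizes, the `policy.sample` interface, and the value of `gamma_safe` are illustrative assumptions, not taken from the authors' implementation.

```python
import torch
import torch.nn as nn


class SafetyCritic(nn.Module):
    """Q_safe(s, a): estimated discounted probability of a future safety incident."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # keep the output in [0, 1] (a probability)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def safety_critic_loss(q_safe, q_safe_target, policy, batch, gamma_safe=0.7):
    """MSE between Q_safe(s, a) and the Bellman target
    I(s) + (1 - I(s)) * gamma_safe * Q_safe_target(s', a'), with a' ~ pi(.|s')."""
    obs, act, next_obs, incident = batch  # `incident` is the indicator I(s) in {0, 1}
    with torch.no_grad():
        next_act = policy.sample(next_obs)  # hypothetical policy interface
        target = incident + (1.0 - incident) * gamma_safe * q_safe_target(next_obs, next_act)
    return nn.functional.mse_loss(q_safe(obs, act), target)
```

The sigmoid output matches the critic's interpretation as a failure probability, and `q_safe_target` plays the role of the target network \(\bar{Q}^{\pi}_{\text{safe}}\) that stabilizes the bootstrapped target, as in standard Q-learning.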
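
Similarly, here is a sketch of how the learned critic could constrain exploration in Phase 2: draw candidate actions from the policy and keep only those the critic deems safe enough. The threshold `eps_safe`, the number of candidates, and the fallback rule are assumptions for illustration; the paper defines the constrained policy and the choice of safety threshold precisely.

```python
import torch


def safe_action(policy, q_safe, obs, eps_safe=0.05, n_candidates=100):
    """Rejection-sampling approximation of a safety-constrained policy:
    sample candidates from pi(.|s), discard those with Q_safe > eps_safe,
    and fall back to the least risky candidate if none are acceptable."""
    obs_rep = obs.unsqueeze(0).expand(n_candidates, -1)  # obs: 1-D tensor of shape (obs_dim,)
    candidates = policy.sample(obs_rep)                  # hypothetical policy interface
    risk = q_safe(obs_rep, candidates)                   # estimated failure probabilities
    acceptable = risk <= eps_safe
    if acceptable.any():
        idx = torch.multinomial(acceptable.float(), 1).item()  # uniform over acceptable actions
    else:
        idx = torch.argmin(risk).item()                  # least risky action if all exceed eps_safe
    return candidates[idx]
```

In the fine-tuning loop, this would replace direct sampling from the policy, so exploration prefers actions whose estimated failure probability stays below the chosen threshold.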