Source
@misc{fu_2024_furl,
title={{FuRL}: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning},
author={Fu, Yuwei and Zhang, Haichao and Wu, Di and Xu, Wei and Boulet, Benoit},
year={2024},
eprint={2406.00645},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2406.00645},
}
| (McGill, Horizon Robotics) | arXiv |
TL;DR
…

Flash Reading
- Abstract: Leverage pre-trained VLMs for online RL on sparse-reward tasks. The problem to solve is reward misalignment when using VLM outputs as RL rewards. Fuzzy VLM reward-aided RL (FuRL) builds on two components: reward alignment (to improve performance) and relay RL (to avoid local minima).
- Introduction: DRL needs a large number of environment interactions for policy learning, so improving sample efficiency is important. Prior works use better exploration strategies, in-house behavior data, transfer learning, and meta-learning. This work explores using VLMs to generate dense rewards for RL tasks with sparse rewards, and focuses on reward misalignment (inaccurate VLM rewards can trap the agent in local minima).
- Method: Given an observation $s_t$, the agent generates an action $a_t\sim\pi_\theta(a_t|s_t)$ and receives a sparse task reward $r^{\text{task}}_t$, which is usually defined by the final success or failure. To densify it, the sparse reward is augmented with a VLM reward, \(r_t = r^{\text{task}}_t + \rho r^{\text{VLM}}_t.\) For example, the VLM reward can be the CLIP reward that measures the cosine similarity between the task description and the observation, \(r^{\text{VLM}}_t = r^{\text{CLIP}}_t = \frac{\langle\Phi_L(l), \Phi_I(o_t)\rangle}{\|\Phi_L(l)\| \cdot \|\Phi_I(o_t)\|},\) where $\Phi_L(l)$ is the language embedding and $\Phi_I(o_t)$ is the image embedding (see the first sketch after this list). Prior work assumes that VLM rewards are accurate, whereas this work shows that they are fuzzy: when pre-trained VLM representations fail to capture crucial information of the target RL task (mainly due to domain shift), inaccurate VLM rewards can hinder efficient exploration. FuRL uses two techniques: (i) reward alignment, which fine-tunes small heads on top of the frozen VLM using collected data; (ii) relay RL, which mitigates local minima and helps collect more diverse data.
- Method (details): The inaccuracy of VLM rewards can be defined as a misalignment between the image and text embeddings under the cosine similarity. FuRL introduces a lightweight alignment method: it freezes the VLM and appends two small head networks $f_{W_L}$ and $f_{W_I}$ to the VLM's text and image embeddings; the VLM reward is then the cosine similarity between the outputs of these two heads, which are fine-tuned on collected trajectories. For observations $o^p$ from successful (positive) trajectories and $o^n$ from unsuccessful (negative) trajectories, the alignment loss is \(\mathcal{L} = \underbrace{\mathbb{E}\,l_\delta(o^p, o^n)}_{\mathcal{L}_{\text{pos-neg}}} + \underbrace{\mathbb{E}\,l_\delta(o_i^p, o_{i-k}^p)}_{\mathcal{L}_{\text{pos-pos}}},\) where $l_\delta(o^p, o^n):=\max(0, r^{\text{VLM}}(o^n)-r^{\text{VLM}}(o^p)+\delta)$ is a ranking loss with margin $\delta$: the pos-neg term pushes positive samples above negative ones, while the pos-pos term pushes later frames of a positive trajectory above earlier ones, i.e., it captures progress (see the second sketch after this list). In the beginning there are no positive samples, so an optional step can be used when an additional goal image $o_g$ is available, i.e., learning from negative samples only, \(\mathcal{L}_{\text{neg-neg}} = \mathbb{E}_{\{L_2(o^n_i, o_g)<L_2(o^n_j, o_g)-\delta'\}}\big[l_\delta(o^n_i, o^n_j)\big],\) where $L_2$ is the L2 distance. The neg-neg loss speeds up obtaining the first successful trajectories. Relay RL [1] addresses the problem that the VLM agent may get stuck in local minima, i.e., training never reaches the second phase with positive samples. Specifically, an extra SAC agent $\pi_{\text{SAC}}$ is introduced alongside the current VLM agent $\pi_{\text{VLM}}$. Within an episode, the two agents take turns acting, and the collected samples are added to a shared buffer used to train both agents, with $r^{\text{task}}+\rho r^{\text{VLM}}$ for $\pi_{\text{VLM}}$ and $r^{\text{task}}$ alone for $\pi_{\text{SAC}}$ (see the third sketch after this list). The idea is that the SAC agent can help the VLM agent get out of local minima.
- Experiments: Experiments are conducted on ten robotics tasks from the Meta-world MT10 benchmark with state-based observations and sparse rewards; the VLM backbone is LIV [2]. For tasks that require the agent to master multiple subtasks, no method achieves any success under sparse rewards, which is a common weakness of the existing VLM-as-reward framework. (When the goal position is randomized, the success rate of FuRL decreases.) The ablation study shows that relay RL is crucial, while reward alignment also contributes.
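
First sketch (VLM reward shaping): a minimal illustration of the CLIP-style reward and the shaped reward $r^{\text{task}}_t + \rho r^{\text{VLM}}_t$, assuming the text and image embeddings are pre-computed vectors and $\rho=0.1$; this is not the authors' implementation.

```python
# Minimal sketch of the CLIP-style VLM reward; embeddings are assumed to be
# pre-computed vectors, and rho = 0.1 is an assumed coefficient.
import numpy as np

def clip_reward(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between the task description l and observation o_t."""
    return float(np.dot(text_emb, image_emb)
                 / (np.linalg.norm(text_emb) * np.linalg.norm(image_emb) + 1e-8))

def shaped_reward(r_task: float, text_emb: np.ndarray,
                  image_emb: np.ndarray, rho: float = 0.1) -> float:
    """Sparse task reward augmented with the weighted VLM reward."""
    return r_task + rho * clip_reward(text_emb, image_emb)
```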
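Second sketch (reward alignment): two small heads $f_{W_L}$ and $f_{W_I}$ project the frozen VLM embeddings, and margin ranking losses implement $\mathcal{L}_{\text{pos-neg}}$ and $\mathcal{L}_{\text{pos-pos}}$. Head sizes, the margin $\delta$, and all variable names are assumptions, not the paper's exact code.

```python
# Hedged PyTorch sketch of the reward-alignment losses; head dimensions and
# the margin delta are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHeads(nn.Module):
    """Small heads f_{W_L}, f_{W_I} on top of frozen VLM embeddings."""
    def __init__(self, emb_dim: int, out_dim: int = 128):
        super().__init__()
        self.f_text = nn.Linear(emb_dim, out_dim)   # f_{W_L}
        self.f_image = nn.Linear(emb_dim, out_dim)  # f_{W_I}

    def vlm_reward(self, text_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        # Aligned VLM reward: cosine similarity of the projected embeddings.
        return F.cosine_similarity(self.f_text(text_emb), self.f_image(img_emb), dim=-1)

def l_delta(r_high: torch.Tensor, r_low: torch.Tensor, delta: float = 0.1) -> torch.Tensor:
    """Ranking loss with margin: r_high should exceed r_low by at least delta."""
    return torch.clamp(r_low - r_high + delta, min=0.0).mean()

def alignment_loss(heads: AlignmentHeads, text_emb, pos_later, pos_earlier, neg, delta=0.1):
    # L_pos-neg: positive observations should score above negative ones.
    loss_pn = l_delta(heads.vlm_reward(text_emb, pos_later),
                      heads.vlm_reward(text_emb, neg), delta)
    # L_pos-pos: later frames o_i^p of a successful trajectory should score
    # above earlier frames o_{i-k}^p (progress within the trajectory).
    loss_pp = l_delta(heads.vlm_reward(text_emb, pos_later),
                      heads.vlm_reward(text_emb, pos_earlier), delta)
    return loss_pn + loss_pp
```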
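Third sketch (relay RL): the two agents alternate control within an episode and write to a shared replay buffer. The `act`/`step`/`add` interfaces, the switching period `switch_every`, and the Gym-style step signature are all assumptions; the paper's exact relay schedule may differ.

```python
# Hedged sketch of relay RL: pi_VLM and pi_SAC take turns within an episode
# and share one replay buffer. Interfaces and the schedule are assumptions.
def relay_episode(env, pi_vlm, pi_sac, buffer, switch_every: int = 50):
    obs, done, t = env.reset(), False, 0
    agents = [pi_vlm, pi_sac]
    while not done:
        agent = agents[(t // switch_every) % 2]      # alternate controllers
        action = agent.act(obs)
        next_obs, r_task, done, info = env.step(action)
        # The buffer stores the sparse task reward and raw observation;
        # pi_VLM later trains on r_task + rho * r_VLM, pi_SAC on r_task only.
        buffer.add(obs, action, r_task, next_obs, done)
        obs, t = next_obs, t + 1
```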