Source
@inproceedings{Hansen_2022_tdmpc,
title = {Temporal Difference Learning for Model Predictive Control},
author = {Hansen, Nicklas and Wang, Xiaolong and Su, Hao},
year = 2022,
booktitle = {International Conference on Machine Learning (ICML)},
publisher = {PMLR}
}
(UC San Diego) | arXiv
TL;DR
…

Flash Reading
- Abstract: Data-driven MPC has two key benefits: potentially improved sample efficiency, and performance that improves as the computational budget for planning grows. However, it is costly to plan over long horizons and challenging to obtain an accurate model of the environment. This work uses a learned task-oriented latent dynamics model for local trajectory optimization over a short horizon, together with a learned terminal value function that estimates the long-term return.
- Introduction: An environment model lets the agent plan a trajectory of actions ahead of time, rather than relying on the trial-and-error of model-free RL. Model-based methods suffer from two problems: planning over long horizons is expensive, and learned models can be biased. While MPC is expensive over long horizons, a learned terminal value function can summarize the return beyond the planning horizon. The proposed TD-MPC jointly learns a task-oriented latent dynamics model and a terminal value function. Three technical contributions: (a) the latent dynamics model is learned from rewards, without predicting states; (b) gradients from the reward and TD objectives are backpropagated through multiple rollout steps of the model to improve long-term predictions; (c) a modality-agnostic prediction loss in latent space enforces temporal consistency.
- Preliminaries: The aim is to learn a policy $\Pi_\theta:\mathcal{S}\to\mathcal{A}$ that maximizes the expected discounted return $\mathbb{E}[\sum_{t=0}^\infty \gamma^t r_t]$. Model-free TD methods learn a state-action value function $Q_\theta:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ that satisfies the Bellman equation \(Q_\theta(s_t,a_t)=r_t+\gamma \mathbb{E}_{a_{t+1}\sim\Pi_\theta}[Q_\theta(s_{t+1},a_{t+1})]\). The optimal Q-function $Q^*$ can be estimated by iterating \(\theta' \leftarrow \arg\min_\theta \mathbb{E}_{(s_t,a_t,s_{t+1})\sim\mathcal{B}} \| Q_\theta(s_t,a_t) - y \|^2,\) where $\mathcal{B}$ is a replay buffer, $y=r_t+\gamma \max_{a_{t+1}} Q_{\bar{\theta}}(s_{t+1},a_{t+1})$ is the Q-target, and $\bar{\theta}$ are the parameters of a slowly updated target network. The policy can be learned by maximizing $Q_\theta(s_t,\Pi_\theta(s_t))$ (see the TD-objective sketch after this list). In actor-critic RL, the policy is typically a neural network; in MPC, it is obtained by solving a finite-horizon ($H$) optimization problem online. If a value function is available, the finite-horizon MPC objective can be augmented with it as a terminal value, called MPC with a terminal value function. To distinguish the two, $\Pi_\theta$ denotes the MPC (planning) policy and $\pi_\theta$ the parametric RL policy.
- TD-MPC: It uses Model Predictive Path Integral (MPPI) [1] control for planning ($\Pi_\theta$), together with a learned latent dynamics model $d_\theta$, a learned reward model $R_\theta$, a terminal state-action value function $Q_\theta$, and a policy $\pi_\theta$ that guides planning. At inference time, planning runs for $J$ iterations: in iteration $j$, $N$ trajectories are sampled by drawing $H$ actions within the horizon from $\mathcal{N}(\mu^j,\sigma^j)$, an estimated return is computed for each trajectory, and the mean and standard deviation of the action distribution are updated via a return-weighted average of the best trajectories, yielding $\mu^J$ and $\sigma^J$ after the final iteration. To converge faster, planning is warm-started by reusing the one-step-shifted mean $\mu$ obtained at the previous environment step. Model-free RL explores with action noise; here $\sigma$ is instead constrained to stay above a decaying minimum value. In addition to the randomly sampled action sequences, extra trajectories are generated by the policy $\pi_\theta$ (see the planning sketch after this list).
- Task-Oriented Latent Dynamics (TOLD): Rather than attempting to model the full environment, TOLD learns to model only the elements of the environment that are predictive of reward. During training, the agent alternates between two things: (a) improving TOLD using data collected from previous environment interaction; (b) collecting new data from the environment by online planning of action sequences with TD-MPC, using TOLD to generate imagined rollouts. TOLD consists of an encoder $z_t=h_\theta(s_t)$, a latent dynamics model $z_{t+1}=d_\theta(z_t,a_t)$, a reward model $r_t=R_\theta(z_t,a_t)$, a value function $Q_\theta(z_t,a_t)$, and a policy $\pi_\theta(z_t)$ (see the component and training-loss sketch after this list). … More in the paper.
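
TD-objective sketch: a minimal, self-contained illustration of the TD loss and policy objective from the Preliminaries bullet, assuming PyTorch. Network sizes and the dummy batch are illustrative, and the max over next actions is approximated by the policy's action, as is standard in continuous-action actor-critic methods; this is not the paper's code.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 4, 2, 0.99

# Q_theta, the target network Q_bar (a slow-moving copy), and the policy pi_theta.
Q = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ELU(), nn.Linear(64, 1))
Q_bar = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ELU(), nn.Linear(64, 1))
Q_bar.load_state_dict(Q.state_dict())
pi = nn.Sequential(nn.Linear(state_dim, 64), nn.ELU(), nn.Linear(64, action_dim), nn.Tanh())

def td_loss(s, a, r, s_next):
    # y = r_t + gamma * Q_bar(s_{t+1}, a_{t+1}); for continuous actions the max
    # over a_{t+1} is approximated by the policy's action at s_{t+1}.
    with torch.no_grad():
        y = r + gamma * Q_bar(torch.cat([s_next, pi(s_next)], dim=-1))
    return (Q(torch.cat([s, a], dim=-1)) - y).pow(2).mean()

def policy_loss(s):
    # The policy is learned by maximizing Q(s, pi(s)), i.e. minimizing its negation.
    return -Q(torch.cat([s, pi(s)], dim=-1)).mean()

# Usage with a dummy replay-buffer batch:
s, a = torch.randn(32, state_dim), torch.rand(32, action_dim) * 2 - 1
r, s_next = torch.randn(32, 1), torch.randn(32, state_dim)
print(td_loss(s, a, r, s_next).item(), policy_loss(s).item())
```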
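
Planning sketch: a rough illustration of the sampling-based planning loop from the TD-MPC bullet. The TOLD components d, R, Q, pi are passed in as callables, the hyperparameter values are placeholders, and the paper's return normalization, temperature scheme, momentum on $\mu$, and mixing-in of policy-generated trajectories are simplified or omitted, so treat this as a sketch of the loop structure rather than the authors' implementation.

```python
import torch

def estimate_return(z, actions, d, R, Q, pi, gamma=0.99):
    # Roll the latent model forward over H steps, accumulating predicted rewards,
    # and bootstrap with the terminal value Q(z_H, pi(z_H)).
    G, discount = 0.0, 1.0
    for a in actions:                        # actions: (H, N, action_dim)
        G = G + discount * R(z, a)
        z = d(z, a)
        discount = discount * gamma
    return G + discount * Q(z, pi(z))        # shape (N, 1)

def plan(z0, d, R, Q, pi, action_dim, H=5, N=512, K=64, J=6,
         temperature=0.5, sigma_min=0.05, prev_mu=None):
    # Warm start: reuse the previous (one-step-shifted) solution when available.
    mu = torch.zeros(H, action_dim) if prev_mu is None else prev_mu
    sigma = 2.0 * torch.ones(H, action_dim)
    z = z0.expand(N, -1)                                            # repeat current latent
    for _ in range(J):
        actions = (mu.unsqueeze(1) + sigma.unsqueeze(1)
                   * torch.randn(H, N, action_dim)).clamp(-1, 1)    # sample N action sequences
        returns = estimate_return(z, actions, d, R, Q, pi).squeeze(-1)
        elite = returns.topk(K).indices                             # best-scoring trajectories
        w = torch.softmax(returns[elite] / temperature, dim=0)      # return weights
        elite_actions = actions[:, elite]                           # (H, K, action_dim)
        mu = (w.view(1, K, 1) * elite_actions).sum(dim=1)           # return-weighted mean
        sigma = ((w.view(1, K, 1) * (elite_actions - mu.unsqueeze(1)).pow(2))
                 .sum(dim=1).sqrt().clamp(min=sigma_min))           # keep exploration noise
    # Execute the first planned action; return mu so the next call can warm start
    # with a one-step shift, e.g. torch.cat([mu[1:], mu[-1:]]).
    return mu[0], mu
```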
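
Component and training-loss sketch: a compact illustration of the five TOLD components and the multi-step objective described above (reward, TD, and latent-consistency terms summed over imagined steps). The mlp helper, layer sizes, and the omission of loss coefficients and per-step discounting are my simplifications, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))

class TOLD(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=50):
        super().__init__()
        self.h = mlp(state_dim, latent_dim)                              # encoder  z = h(s)
        self.d = mlp(latent_dim + action_dim, latent_dim)                # latent dynamics
        self.R = mlp(latent_dim + action_dim, 1)                         # reward model
        self.Q = mlp(latent_dim + action_dim, 1)                         # value function
        self.pi = nn.Sequential(mlp(latent_dim, action_dim), nn.Tanh())  # policy

    def rollout_loss(self, s_seq, a_seq, r_seq, q_targets, target_encoder):
        # Multi-step objective: starting from z_0 = h(s_0), roll the latent model
        # forward with the recorded actions and sum reward, value, and latent
        # consistency errors; gradients flow back through every imagined step.
        # s_seq: (H+1, B, state_dim), a_seq: (H, B, action_dim),
        # r_seq / q_targets: (H, B, 1); target_encoder is a target-network copy of h.
        z = self.h(s_seq[0])
        loss = 0.0
        for t in range(a_seq.shape[0]):
            za = torch.cat([z, a_seq[t]], dim=-1)
            loss = loss + (self.R(za) - r_seq[t]).pow(2).mean()      # reward prediction
            loss = loss + (self.Q(za) - q_targets[t]).pow(2).mean()  # TD target (precomputed)
            z = self.d(za)                                           # imagined next latent
            with torch.no_grad():
                z_next = target_encoder(s_seq[t + 1])                # consistency target
            loss = loss + (z - z_next).pow(2).mean()                 # temporal consistency
        return loss
```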
References
Extension
MPPI is an MPC algorithm that iteratively updates the parameters of a family of sampling distributions using an importance-weighted average of the estimated top-k sampled trajectories.
The core idea is to simulate many trajectory rollouts with the model, each driven by a random input sequence obtained by adding noise to a nominal one. The updated input is the weighted average of the top-k trajectories, with weights derived from the trajectory costs. MPPI solves this optimization stochastically and is gradient-free; a toy numeric sketch follows.
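
The snippet below is a generic numeric illustration of that update, not tied to any particular implementation: a quadratic stand-in cost replaces real model rollouts, the weights are a softmax over the (negated, shifted) costs of the top-k perturbed input sequences, and the new nominal input is their weighted average.

```python
import numpy as np

rng = np.random.default_rng(0)
H, N, k, lam = 10, 1000, 100, 1.0                       # horizon, samples, elites, temperature

nominal = np.zeros(H)                                   # nominal input sequence (scalar inputs)
noise = rng.normal(scale=0.3, size=(N, H))              # random perturbations
candidates = nominal + noise                            # N perturbed input sequences
costs = (candidates ** 2).sum(axis=1)                   # stand-in for rollout costs

elite = np.argsort(costs)[:k]                           # top-k lowest-cost sequences
w = np.exp(-(costs[elite] - costs[elite].min()) / lam)  # importance weights from costs
w /= w.sum()
nominal = (w[:, None] * candidates[elite]).sum(axis=0)  # weighted-average update
print(nominal.round(3))
```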