Source
@misc{openmind_2024_omone,
title={A Paragraph is All It Takes: Rich Robot Behaviors from Interacting, Trusted LLMs},
author={OpenMind and Shaohong Zhong and Adam Zhou and Boyuan Chen and Homin Luo and Jan Liphardt},
year={2024},
eprint={2412.18588},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2412.18588},
}
(OpenMind) | arXiv
TL;DR
…

Flash Reading
- Abstract: LLMs are compact representations of public human knowledge, so combining them with robots promises general-purpose behavior with little or no task-specific tuning. This work explores the advantages, limitations, and uses of LLMs for controlling physical robots. The basic system consists of four LLMs communicating over a human-language data bus implemented with websockets and ROS2 message passing. Rich robot behaviors and good performance are demonstrated across a variety of tasks. Using natural language as the communication medium lets humans easily inspect and debug the system. The rules used to bias the system's behavior are written to Ethereum.
- Introduction: Past efforts focus on foundation models that generate robot actions from raw sensor data end to end, which makes the resulting systems hard for non-experts to use and understand, hard to extend with new skills, hard to trust, and hard to integrate conveniently into human environments. This work instead combines multiple LLMs, with natural language as the intermediate representation, to address these challenges. The modular design allows new AI capabilities to be added as they become available, and the natural-language data bus makes it possible to impose guardrails written as natural-language rules. Blockchain technology is used to ensure the integrity of those rules.
- Methodology: The goal is robotics software that is (i) easy for non-experts to use and understand, (ii) transparent to human inspection, (iii) amenable to internal guardrails, and (iv) able to accommodate multiple LLMs and flexible modules. The proposed approach is modular: a ROS2 data-distribution layer and websockets connect the LLMs and supporting modules, with natural language as the shared representation (a rough sketch of such a bus appears after this list). The work focuses on handling inputs of different forms and on using multiple LLMs to map those inputs to robot actions, while immutable public rules read from a blockchain provide safety and trust.
- Implementation: The platform is the Unitree Go2 AIR quadruped. The vision node runs the VILA 1.5 (3B) VLM, and the audio node uses NVIDIA's RIVA model. The data fusion node adds a source prefix to the outputs of the vision and audio nodes so they can be distinguished downstream. The blockchain node uses Ethereum (contracts satisfying the ERC-7777 interface) [2]. The LLM node runs a Llama model with a predefined output grammar for robotics [3], taking both the fused sensor output and the blockchain output as input. The final action node maps the LLM output to macro actions such as 'forward', 'turn left', etc. (sketched after this list).
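
A minimal sketch (not the authors' code) of how the natural-language data bus and fusion step could look in ROS2: a fusion node subscribes to plain-text outputs from a vision node and an audio node, adds a source prefix, and republishes the result on a shared topic that a downstream LLM node could consume. All node, topic, and prefix names here are assumptions made for illustration.

```python
# Sketch of a prefix-adding data fusion node on a natural-language bus.
# Topic and node names are assumptions, not the paper's actual identifiers.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class DataFusionNode(Node):
    def __init__(self):
        super().__init__('data_fusion_node')
        # Downstream LLM node would subscribe to this fused natural-language topic.
        self.fused_pub = self.create_publisher(String, 'fused_input', 10)
        # Upstream perception nodes publish plain-English descriptions.
        self.create_subscription(String, 'vision_caption', self.on_vision, 10)
        self.create_subscription(String, 'audio_transcript', self.on_audio, 10)

    def publish_prefixed(self, prefix: str, text: str):
        msg = String()
        msg.data = f'{prefix}: {text}'  # prefix tells the LLM which modality spoke
        self.fused_pub.publish(msg)

    def on_vision(self, msg: String):
        self.publish_prefixed('VISION', msg.data)

    def on_audio(self, msg: String):
        self.publish_prefixed('AUDIO', msg.data)


def main():
    rclpy.init()
    rclpy.spin(DataFusionNode())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```

Because every message on the bus is plain text, a human can log the topic and read exactly what the LLMs saw and said, which is the inspectability argument the paper makes.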
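A similarly hedged sketch of the final action node: assuming the grammar-constrained LLM emits a single command token, a lookup table maps it onto a macro action for the quadruped. The command names, velocity values, and the send_to_robot() stub are illustrative assumptions, not the paper's interface.

```python
# Sketch of mapping constrained LLM output to macro actions.
# Action names and velocity values are assumptions for illustration.
MACRO_ACTIONS = {
    'forward':     {'vx': 0.3,  'vy': 0.0, 'yaw_rate': 0.0},
    'backward':    {'vx': -0.3, 'vy': 0.0, 'yaw_rate': 0.0},
    'turn left':   {'vx': 0.0,  'vy': 0.0, 'yaw_rate': 0.5},
    'turn right':  {'vx': 0.0,  'vy': 0.0, 'yaw_rate': -0.5},
    'stand still': {'vx': 0.0,  'vy': 0.0, 'yaw_rate': 0.0},
}


def send_to_robot(cmd: dict) -> None:
    """Placeholder for the platform-specific velocity/gait command."""
    print(f'sending {cmd}')


def execute_llm_output(llm_output: str) -> None:
    """Map the LLM's grammar-constrained text output onto a predefined macro action."""
    action = llm_output.strip().lower()
    # Fall back to a safe no-op if the output is not a known command.
    send_to_robot(MACRO_ACTIONS.get(action, MACRO_ACTIONS['stand still']))


if __name__ == '__main__':
    execute_llm_output('turn left')
```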