Source
@inproceedings{wu_2024_voronav,
author={Pengying Wu and Yao Mu and Bingxian Wu and Yi Hou and Ji Ma and Shanghang Zhang and Chang Liu},
booktitle={International Conference on Machine Learning (ICML)},
title={{VoroNav}: Voronoi-based Zero-shot Object Navigation with Large Language Model},
year={2024},
pages={53737-53775}
}
| (Peking University) | arXiv |
TL;DR
…

Flash Reading
- Abstract: VoroNav is a semantic exploration framework that proposes the Reduced Voronoi Graph (RVG) to extract the topological structure from a semantic map constructed in real time. Images of the surroundings and text descriptions of paths are generated by VoroNav, both of which are interpretable by LLMs.
- Introduction: Zero-Shot Object Navigation (ZSON) requires a robot to navigate to an object in an unseen environment without prior training. Current methods are either end-to-end (network-based) or modular (map-based). End-to-end methods map RGB-D images to actions with trained policy networks, which lacks interpretability and requires large amounts of training data. Map-based methods leverage maps to store historical topological and semantic information; a new waypoint is planned either every fixed number of steps or when the increment in map building reaches a threshold. However, the selected waypoints can be suboptimal and uninformative. This work develops an RVG generation approach to distill informative waypoints. Another problem is forming an integral representation of observed scenes for long-term planning, which is addressed by fusing observations from both maps and images. LLMs are used for spatial reasoning to understand scenes, and prompts are designed to integrate the map and images: each prompt includes descriptions of the scenes along traversed paths and farsight (egocentric) images. To encourage exploration, a hierarchical reward is designed that combines topological map information and suggestions from the LLM.
- Related Work: (a) ZSON: Image-based [1] and map-based [2] methods. (b) Scene representation: Frontier-based [3] and graph-based [4] methods. (c) LLM for navigation.
- VoroNav: ZSON requires neither purposeful training nor prior knowledge of the target object. The input includes RGB-D images $I$ and the real-time pose $\bm{p}$. The observation data is $\mathcal{O}_t=\{(\bm{p}_0,I_0), \ldots, (\bm{p}_t,I_t)\}$. A 2D semantic map $\mathcal{M}_t$ is built from $I$, which contains $K+2$ channels ($K$ categorical maps for objects, an obstacle map, and an explored map). Given the depth image and the agent's pose, a 3D point cloud is constructed: points near the floor are assigned to the explored map, and the rest are projected onto the obstacle map. Category masks are predicted via Grounded-SAM [5] and mapped into 3D semantic point clouds (see the first sketch after this list).
- VoroNav (Decision Module): The Generalized Voronoi Diagram (GVD) of a map is the set of points equidistant to the two nearest obstacles and can be obtained from the obstacle and explored maps. From the GVD, the RVG is generated as follows (see the second sketch after this list):
  - unoccupied map = explored map - obstacle map
  - denoise the unoccupied map with morphology operations (scikit-image)
  - GVD = skeletonize(unoccupied map) (scikit-image)
  - extract nodes and edges from the GVD -> RVG
  - find the shortest path on the RVG from the current position to the target

  In the RVG, nodes are classified into the agent node (closest to the agent), neighbor nodes (connected to the agent node), exploratory nodes (end nodes leading to unexplored areas), and ordinary nodes. Navigable paths are generated from the RVG (Wavefront), and scene images along the paths are embedded. A path descriptor produces text descriptions of the paths. To explore unknown areas beyond the explored map, farsight images are captioned via BLIP. The target point is selected based on the exploration target, path length, and alignment with typical scene layouts; GPT-3.5 is used to infer the most promising goal node. A hierarchical reward structure is designed covering exploration, efficiency, and semantics. The Fast Marching Method (FMM) is used for path planning.
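A minimal sketch of the top-down semantic mapping step described above (Python with NumPy; the camera intrinsics, cell size, camera height, and all function/variable names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def update_semantic_map(depth, masks, pose, sem_map,
                        fx=320.0, fy=320.0, cx=320.0, cy=240.0,
                        cam_height=0.88, cell_size=0.05, floor_thresh=0.15):
    """Project one RGB-D observation into the (K+2)-channel top-down map.

    depth:   (H, W) metric depth image
    masks:   (K, H, W) boolean category masks (e.g. from Grounded-SAM)
    pose:    (x, y, yaw) agent pose in the map frame
    sem_map: (K+2, M, M) grid; channels 0..K-1 are categories,
             channel K is obstacles, channel K+1 is the explored area
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth.ravel() > 0
    z = depth.ravel()[valid]                                  # forward distance
    x_cam = ((u.ravel() - cx) * depth.ravel() / fx)[valid]    # rightward offset
    y_cam = ((cy - v.ravel()) * depth.ravel() / fy)[valid]    # height above camera

    # Transform into the map frame with the agent pose (one common convention).
    px, py, yaw = pose
    x_map = px + z * np.cos(yaw) - x_cam * np.sin(yaw)
    y_map = py + z * np.sin(yaw) + x_cam * np.cos(yaw)
    height_above_floor = y_cam + cam_height

    M = sem_map.shape[1]
    gx = np.clip((x_map / cell_size).astype(int), 0, M - 1)
    gy = np.clip((y_map / cell_size).astype(int), 0, M - 1)

    K = sem_map.shape[0] - 2
    near_floor = height_above_floor < floor_thresh
    sem_map[K + 1, gx, gy] = 1                           # all observed cells -> explored
    sem_map[K, gx[~near_floor], gy[~near_floor]] = 1     # elevated points -> obstacles
    for k in range(K):                                   # per-category channels
        m = masks[k].ravel()[valid]
        sem_map[k, gx[m], gy[m]] = 1
    return sem_map
```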
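And a minimal sketch of the GVD/RVG construction listed in the decision module (Python with scikit-image and networkx; the degree-based node extraction and all names are assumptions for illustration, not the authors' code):

```python
import numpy as np
import networkx as nx
from skimage.morphology import binary_opening, skeletonize

def build_rvg(explored_map, obstacle_map):
    """Skeletonize the unoccupied area to approximate the GVD and extract RVG nodes.

    Skeleton pixels whose degree is not 2 (endpoints and junctions) are treated
    as graph nodes; the 8-connected skeleton pixel graph gives path lengths.
    """
    unoccupied = (explored_map > 0) & (obstacle_map == 0)
    unoccupied = binary_opening(unoccupied)      # light morphological denoising
    gvd = skeletonize(unoccupied)

    # Build an 8-connected graph over skeleton pixels.
    pixel_graph = nx.Graph()
    for y, x in zip(*np.nonzero(gvd)):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny_, nx_ = y + dy, x + dx
                if (dy or dx) and 0 <= ny_ < gvd.shape[0] and 0 <= nx_ < gvd.shape[1] \
                        and gvd[ny_, nx_]:
                    pixel_graph.add_edge((y, x), (ny_, nx_), weight=np.hypot(dy, dx))

    # Endpoints (degree 1) and junctions (degree >= 3) become RVG nodes.
    nodes = [p for p in pixel_graph if pixel_graph.degree(p) != 2]
    return gvd, pixel_graph, nodes

def shortest_path_on_skeleton(pixel_graph, agent_px, goal_px):
    """Shortest skeleton path from the agent's nearest skeleton pixel to a goal node."""
    return nx.shortest_path(pixel_graph, source=agent_px, target=goal_px, weight="weight")
```

Under these assumptions, endpoints adjacent to unexplored cells would correspond to the exploratory nodes mentioned above, while the node nearest the agent plays the role of the agent node.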
References
- [1] CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation, CVPR 2023. arXiv.
- [2] ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation, ICML 2023. arXiv.
- [3] Navigating to objects in the real world, Science Robotics 2023. arXiv.
- [4] Renderable neural radiance map for visual navigation, CVPR 2023. arXiv.
- [5] Segment Anything, ICCV 2023. arXiv.
Extra:
- Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments, TPAMI 2025.