SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Abstract
3D vision-language (3D-VL) grounding, which aims to align language with 3D physical environments, stands as a cornerstone in developing embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces two significant challenges: (i) the scarcity of paired 3D-VL data to support grounded learning of 3D scenes, especially considering complexities within diverse object configurations, rich attributes, and intricate relationships; and (ii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these major challenges in 3D-VL by examining the potential of systematically upscaling 3D-VL learning in indoor scenes. We introduce the first million-scale 3D-VL dataset, SceneVerse, encompassing 68K indoor scenes and comprising 2.5M vision-language pairs collected from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pretraining for Scenes (GPS), for 3D-VL learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on existing 3D visual grounding and question-answering benchmarks. We also show that the data scaling effect is not limited to GPS, but is generally beneficial for models on tasks like 3D semantic segmentation. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in challenging 3D-VL tasks.
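The abstract mentions a scalable scene-graph-based generation approach for producing vision-language pairs. As an illustration only, the sketch below shows the general idea of template-based description generation from a toy scene graph (object attributes plus pairwise spatial relations); the class names, categories, relations, and templates are assumptions for this example and do not reproduce SceneVerse's actual pipeline.

```python
# Hypothetical sketch: turning a toy scene graph into grounding descriptions.
# Not the paper's pipeline; all names and templates here are illustrative.

from dataclasses import dataclass

@dataclass
class SceneObject:
    obj_id: int
    category: str
    attribute: str  # e.g., color or material

# A tiny scene graph: objects and (subject, relation, object) edges.
objects = {
    0: SceneObject(0, "chair", "wooden"),
    1: SceneObject(1, "table", "round"),
    2: SceneObject(2, "lamp", "black"),
}
relations = [(0, "next to", 1), (2, "on top of", 1)]

def describe(target_id: int) -> str:
    """Compose a referring expression for the target object from its
    attributes and one outgoing relation in the scene graph."""
    target = objects[target_id]
    phrase = f"the {target.attribute} {target.category}"
    for subj, rel, obj in relations:
        if subj == target_id:
            anchor = objects[obj]
            return f"{phrase} {rel} the {anchor.attribute} {anchor.category}"
    return phrase

if __name__ == "__main__":
    # Each (description, target object) pair would form one language-grounding sample.
    for oid in objects:
        print(oid, "->", describe(oid))
```

Pairing each generated description with its target object in the 3D scene yields the kind of language-to-object grounding sample the abstract describes at scale.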
Authors
Baoxiong Jia*, Yixin Chen*, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
Publication Year
2024
http://eng.bigai.ai/wp-content/uploads/sites/7/2024/09/ECCV24_SceneVerse.pdf
Publication Venue
ECCV