Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding

Abstract

Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. MPEC improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks.

Authors

Yan Wang*, Baoxiong Jia*, Ziyu Zhu, Siyuan Huang

Publication Year

2025

https://openaccess.thecvf.com/content/CVPR2025/papers/Wang_Masked_Point-Entity_Contrast_for_Open-Vocabulary_3D_Scene_Understanding_CVPR_2025_paper.pdf

Publication Venue

CVPR