PureSpace: A Benchmark for Abstract Spatial Reasoning in Vision-Language Models

Abstract
Spatial reasoning remains a persistent challenge for Vision-Language Models (VLMs). To study this problem, we introduce PURESPACE, a new benchmark based on abstract geometric objects that isolates three core tasks: rotation, projection, and completion. Our experiments reveal that state-of-the-art models achieve only modest performance, and their accuracy shows no clear relationship with task difficulty, suggesting a lack of genuine spatial understanding. Furthermore, we find that while specialized models can excel at a single task, they fail to generalize and drop to near-random accuracy on unseen tasks. To overcome these shortcomings, we propose a cognitively inspired framework that decomposes the problem: a perception module represents the geometric structure, a language model infers the viewpoint transformations, and a renderer synthesizes the target-view appearance; a VLM then uses these outputs to determine the correct answer. Experiments show that our method achieves substantial improvements on all three tasks and provides enhanced interpretability and robustness.
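The decomposition described in the abstract (structure, viewpoint transformation, rendering, answer selection) can be illustrated with a toy sketch. This is not the authors' implementation: the learned perception module, language model, and VLM are replaced here by deterministic stand-ins, the voxel representation is an assumption, and every function name (`rotate_z90`, `project_front`, `normalize`, `choose_answer`) is hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

Voxel = Tuple[int, int, int]  # assumed structure representation: unit cubes on a grid
Cell = Tuple[int, int]        # one filled cell of a 2D silhouette


def rotate_z90(voxels: Iterable[Voxel], quarter_turns: int) -> FrozenSet[Voxel]:
    """Stand-in for the viewpoint transformation: rotate about the z-axis
    by quarter_turns * 90 degrees, counterclockwise when seen from +z."""
    pts = list(voxels)
    for _ in range(quarter_turns % 4):
        pts = [(-y, x, z) for (x, y, z) in pts]
    return frozenset(pts)


def normalize(cells: Iterable[Cell]) -> FrozenSet[Cell]:
    """Translate a silhouette so its minimum coordinates are (0, 0)."""
    cells = list(cells)
    min_a = min(a for a, _ in cells)
    min_b = min(b for _, b in cells)
    return frozenset((a - min_a, b - min_b) for a, b in cells)


def project_front(voxels: Iterable[Voxel]) -> FrozenSet[Cell]:
    """Stand-in for the renderer: orthographic silhouette along the y-axis,
    obtained by dropping y and keeping (x, z)."""
    return normalize((x, z) for (x, _, z) in voxels)


def choose_answer(voxels: Iterable[Voxel], quarter_turns: int,
                  candidates: List[Iterable[Cell]]) -> int:
    """Stand-in for the final selection step: return the index of the
    candidate silhouette matching the rendered target view, or -1."""
    target = project_front(rotate_z90(voxels, quarter_turns))
    for index, candidate in enumerate(candidates):
        if normalize(candidate) == target:
            return index
    return -1


# An L-shaped object: three cubes along x, plus one cube stacked on the first.
l_shape = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 0, 1)]
candidates = [
    [(0, 0), (1, 0), (2, 0), (0, 1)],  # front view with no rotation
    [(0, 0), (0, 1)],                  # front view after one quarter turn
]
picked = choose_answer(l_shape, 1, candidates)
```

Keeping each stage as a separate function mirrors the interpretability argument in the abstract: the intermediate structure, transformation, and rendered view can each be inspected before the final choice is made.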
Authors
Jinkai Li, Zhenliang Zhang, Lifeng Fan†, Wei Wang†
Publication Year
2026
http://eng.bigai.ai/wp-content/uploads/sites/7/2026/06/PURESPACE-A-Benchmark-for-Abstract-Spatial-Reasoning.pdf
Publication Venue
CVPR