SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

In Proceedings of the Conference on Robot Learning (CoRL 2025)

1 Department of Computer Science, The University of Texas at Austin
2 Army Research Laboratory
3 Sony AI
Correspondence: michaelmunje@utexas.edu

Overview

Overview figure for SocialNav-SUB
An overview of SocialNav-SUB, which facilitates the systematic evaluation of VLMs in social robot navigation scenarios. Using SCAND data, human-labeled VQA datasets, and various VLMs, the framework evaluates models across multiple dimensions of scene understanding, enabling advances in prompt design, social reasoning, and social robot navigation research more broadly.

Abstract

Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding—capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatiotemporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs.

What's in SocialNav-SUB

VQA Example from SocialNav-SUB

A social navigation scenario illustrating three evaluation categories (spatial reasoning, spatiotemporal reasoning, and social reasoning), each shown with a sample question. We build a set of VQA prompts by combining videos with carefully designed questions. Our benchmark evaluates models against human labels for these VQA questions (≈5k questions across the benchmark).
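
As a rough illustration of how such prompts can be assembled, the sketch below combines a clip's frames with a templated question and multiple-choice options. The template wording, option labels, and prompt dictionary format are hypothetical placeholders, not the released prompt format.

# Hypothetical sketch: combine a clip's frames with a templated question and
# answer options to form one VQA prompt. Templates and output format are
# illustrative assumptions, not the benchmark's released prompts.
QUESTION_TEMPLATES = {
    "spatial": "Where is pedestrian {pid} located relative to the robot?",
    "spatiotemporal": "Is pedestrian {pid} moving toward or away from the robot?",
    "social": "Is pedestrian {pid} walking as part of a group?",
}

def build_vqa_prompt(frames, category, pedestrian_id, options):
    question = QUESTION_TEMPLATES[category].format(pid=pedestrian_id)
    option_text = " ".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    return {"images": frames, "text": f"{question} Options: {option_text}"}

# Example: a spatial-reasoning question with three answer options.
prompt = build_vqa_prompt(
    frames=["frame_00.png", "frame_01.png"],
    category="spatial",
    pedestrian_id=2,
    options=["left of the robot", "right of the robot", "in front of the robot"],
)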

Data Processing for VQA Prompts

Data processing pipeline for VQA prompts in SocialNav-SUB
The data processing pipeline for VQA prompts in SocialNav-SUB. We mine social robot navigation scenarios from SCAND, then use the PHALP algorithm for human tracking and 3D location estimation, whose outputs are used to construct bird's-eye-view (BEV) representations and annotated images. Combined with a set of carefully designed questions covering spatial, spatiotemporal, and social reasoning, these form the VQA prompts.
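
As a minimal sketch of the per-clip processing step, the code below assumes PHALP-style tracks (per-frame pedestrian IDs with estimated 3D positions) have already been produced; the Track type and helper names are illustrative stand-ins, not the released tooling.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Track:
    pedestrian_id: int
    positions: List[Tuple[float, float, float]]  # estimated 3D location per frame

def build_bev(tracks: List[Track], frame_idx: int) -> List[Tuple[int, float, float]]:
    """Project each pedestrian's 3D position at a frame onto the ground plane (x, y)."""
    return [(t.pedestrian_id, t.positions[frame_idx][0], t.positions[frame_idx][1])
            for t in tracks]

def annotate_frame(frame, tracks: List[Track]):
    """Placeholder: pair an image with the pedestrian IDs to overlay (drawing code omitted)."""
    return {"image": frame, "labels": [t.pedestrian_id for t in tracks]}

def prepare_vqa_inputs(frames, tracks: List[Track]):
    """Turn one mined SCAND clip (frames plus PHALP-style tracks) into the
    BEV representations and annotated images used to build VQA prompts."""
    return {
        "bev": [build_bev(tracks, i) for i in range(len(frames))],
        "annotated_images": [annotate_frame(f, tracks) for f in frames],
    }

# Example with one pedestrian tracked over two frames.
tracks = [Track(pedestrian_id=1, positions=[(1.0, 0.5, 0.0), (0.9, 0.6, 0.0)])]
inputs = prepare_vqa_inputs(frames=["frame_00.png", "frame_01.png"], tracks=tracks)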

Evaluation of VLMs on Social Robot Navigation Scene Understanding

Average Performance Across Question Categories

The figure compares multiple VLMs with human and rule-based baselines. We report Probability of Agreement (PA) and Consensus-Weighted PA (CWPA) as evaluation metrics (additional details below and in the paper), both over all questions and per question category, with standard errors computed across questions. PA is the fraction of human respondents who chose the model's answer; CWPA additionally weights each question by the strength of human consensus. The strongest VLM PA scores are shown in bold and may be statistically tied.

Metric details

Let $N_Q$ be the number of questions and $N_H$ the number of human respondents per question. For question $q$, let the model's (or baseline's) answer be $A_q$ and the $i$-th human's answer be $A^{h}_{q,i}$. The Probability of Agreement is

$$\mathrm{PA} = \frac{1}{N_Q} \sum_{q=1}^{N_Q} \frac{1}{N_H} \sum_{i=1}^{N_H} \mathbb{1}\left[ A_q = A^{h}_{q,i} \right],$$

where $\mathbb{1}[\cdot]$ equals 1 if its argument is true and 0 otherwise. CWPA re-weights each question by its human-consensus strength (e.g., proportional to the majority-vote probability) before averaging. Standard errors are computed across questions.
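
For concreteness, the sketch below computes PA and CWPA from raw responses following the definitions above. Using the majority-vote probability as the consensus weight and normalizing by the sum of weights are assumptions, not necessarily the benchmark's exact implementation.

import numpy as np

def pa_and_cwpa(model_answers, human_answers):
    """Compute Probability of Agreement (PA) and Consensus-Weighted PA (CWPA).

    model_answers: length-N_Q list with the model's (or baseline's) chosen option per question.
    human_answers: length-N_Q list; entry q holds the N_H human answers for question q.
    """
    agreement, consensus = [], []
    for a_q, humans in zip(model_answers, human_answers):
        # Fraction of human respondents who chose the model's answer (inner sum of PA).
        agreement.append(np.mean([a_q == a_h for a_h in humans]))
        # Consensus strength: majority-vote probability among the human respondents
        # (one plausible weighting; the paper's exact choice may differ).
        counts = [humans.count(a) for a in set(humans)]
        consensus.append(max(counts) / len(humans))
    agreement, consensus = np.array(agreement), np.array(consensus)

    pa = agreement.mean()
    pa_se = agreement.std(ddof=1) / np.sqrt(len(agreement))  # standard error across questions
    cwpa = (consensus * agreement).sum() / consensus.sum()   # consensus-weighted average
    return pa, pa_se, cwpa

# Toy usage: 3 questions, 5 human respondents each.
model = ["A", "B", "C"]
humans = [["A", "A", "B", "A", "A"], ["B", "B", "B", "C", "B"], ["A", "C", "A", "A", "B"]]
print(pa_and_cwpa(model, humans))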

BibTeX

@inproceedings{munje2025socialnavsub,
  title     = {SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation},
  author    = {Munje, Michael J. and Tang, Chen and Liu, Shuijing and Hu, Zichao and Zhu, Yifeng and Cui, Jiaxun and Warnell, Garrett and Biswas, Joydeep and Stone, Peter},
  booktitle = {Proceedings of the Conference on Robot Learning (CoRL)},
  year      = {2025},
  note      = {Project page: \url{https://larg.github.io/socialnav-sub}}
}