Average Performance Across Question Categories.
This figure evaluates multiple VLMs against human and rule-based baselines. We report two metrics, Probability of Agreement (PA) and Consensus-Weighted PA (CWPA), over all questions and within each category, with standard errors computed across questions (additional details below and in the paper). PA is the fraction of human respondents who chose the model’s answer; CWPA additionally weights each question by the strength of the human consensus. The strongest VLM PA scores are shown in bold; multiple scores are bolded when they are statistically tied.
Metric details
Let $N_Q$ be the number of questions and $N_H$ the number of human respondents per question. For question $q$, let the model's (or baseline's) answer be $A_q$ and the $i$-th human's answer be $A^h_{q,i}$. The Probability of Agreement is

$$\mathrm{PA} = \frac{1}{N_Q} \sum_{q=1}^{N_Q} \frac{1}{N_H} \sum_{i=1}^{N_H} \big[\, A_q = A^h_{q,i} \,\big],$$

where $[\cdot]$ is 1 if the enclosed statement is true and 0 otherwise.
CWPA re-weights each question by its human-consensus strength (e.g., proportional to the majority-vote probability) before averaging.
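Under the assumption that the weight is exactly the majority-vote probability (the text offers this only as an example), one consistent formalization is

$$\mathrm{CWPA} = \frac{\sum_{q=1}^{N_Q} w_q \cdot \frac{1}{N_H} \sum_{i=1}^{N_H} \big[\, A_q = A^h_{q,i} \,\big]}{\sum_{q=1}^{N_Q} w_q}, \qquad w_q = \max_a \frac{1}{N_H} \sum_{i=1}^{N_H} \big[\, A^h_{q,i} = a \,\big],$$

where the denominator renormalizes the weights so that CWPA, like PA, lies in $[0, 1]$.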
Standard errors are computed across questions.
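For concreteness, here is a minimal NumPy sketch of both metrics and the across-question standard error. It assumes answers are coded as small non-negative integers and that the CWPA weight takes the majority-vote form given above; both are assumptions for illustration, not confirmed details of the paper.

```python
# Minimal sketch of PA, CWPA, and the across-question standard error.
# Assumptions (not confirmed by the text): answers are coded as small
# non-negative integers, and the CWPA weight w_q is the majority-vote
# probability suggested by the "e.g." above.
import numpy as np

def pa_cwpa(model_answers, human_answers):
    """model_answers: shape (N_Q,); human_answers: shape (N_Q, N_H)."""
    model_answers = np.asarray(model_answers)
    human_answers = np.asarray(human_answers)
    # Per-question agreement: fraction of humans who chose the model's answer.
    per_q = (human_answers == model_answers[:, None]).mean(axis=1)
    pa = per_q.mean()
    # Standard error of PA, computed across questions.
    pa_se = per_q.std(ddof=1) / np.sqrt(len(per_q))
    # Consensus weight w_q: probability of the majority human answer (assumed form).
    w = np.array([np.bincount(row).max() / len(row) for row in human_answers])
    cwpa = (w * per_q).sum() / w.sum()
    return pa, pa_se, cwpa

# Example: 3 questions, 4 respondents each, answer options coded 0/1/2.
humans = np.array([[0, 0, 0, 1],
                   [2, 2, 1, 1],
                   [1, 1, 1, 1]])
model = np.array([0, 2, 1])
print(pa_cwpa(model, humans))  # PA = 0.75, SE ~ 0.14, CWPA ~ 0.81
```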