Understanding fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery,
chip manufacturing, and AR/VR-based human–AI interaction. Despite strong performance on general multimodal benchmarks,
current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially for complex, articulated hand poses.
We introduce HandVQA, a large-scale diagnostic benchmark that evaluates VLMs’ understanding of detailed hand anatomy via
visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA),
HandVQA provides 1.6M+ controlled multiple-choice questions probing spatial relationships between hand joints, including
angles, distances, and relative positions.
We evaluate state-of-the-art VLMs (e.g., LLaVA, DeepSeek, Qwen-VL) in both base and LoRA-fine-tuned settings.
Our results reveal systematic failure modes, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization.
HandVQA not only exposes these reasoning gaps but also provides a validated path to improvement: the 3D-grounded spatial knowledge
learned from HandVQA transfers zero-shot to novel downstream tasks, improving hand gesture recognition (+10.33%) and
hand–object interaction recognition (+2.63%).