HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

MD Khalequzzaman Chowdhury Sayem1*, Mubarrat Tajoar Chowdhury1*, Yihalem Yimolal Tiruneh1, Muneeb A. Khan1, Muhammad Salman Ali1, Binod Bhattarai2,3,4†, Seungryul Baek1†
1UNIST   2University of Aberdeen   3University College London   4Fogsphere (Redev.AI Ltd), UK
CVPR 2026
* Equal contribution. † These authors jointly supervised this work.
HandVQA teaser figure

HandVQA teaches fine-grained 3D hand geometry to vision-language models, enabling spatially-aware reasoning and strong zero-shot transfer to gesture recognition and hand-object interaction tasks.

Abstract

Understanding fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human–AI interaction. Despite strong performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially for complex, articulated hand poses.

We introduce HandVQA, a large-scale diagnostic benchmark that evaluates VLMs’ understanding of detailed hand anatomy via visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), HandVQA provides 1.6M+ controlled multiple-choice questions probing spatial relationships between hand joints, including angles, distances, and relative positions.

We evaluate state-of-the-art VLMs (e.g., LLaVA, DeepSeek, Qwen-VL) in both base and LoRA fine-tuned settings. Our results reveal systematic limitations such as hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these reasoning gaps, but also provides a validated path to improvement: the 3D-grounded spatial knowledge learned from HandVQA transfers zero-shot to novel downstream tasks, improving hand gesture recognition (+10.33%) and hand–object interaction recognition (+2.63%).

Motivation

Despite impressive progress on general vision-language benchmarks, current VLMs still struggle with fine-grained spatial reasoning about articulated objects such as human hands. These failures are often hidden in aggregate benchmark scores but become evident when models must reason about precise finger configurations, inter-finger distances, or spatial ordering.

As shown below, base VLMs frequently hallucinate finger configurations, misjudge inter-finger distances, and confuse spatial ordering (e.g., left/right or front/behind). These errors become especially pronounced under occlusion, viewpoint changes, or complex hand poses.

To systematically diagnose and improve these limitations, we introduce HandVQA, a 3D-geometry-grounded benchmark designed to evaluate fine-grained spatial reasoning about hands through controlled visual question answering.

Where Current VLMs Fail

Are any fingers crossing?

Base VLM: No ✗
Spatial-aware VLM: Yes ✓

Key challenge: Detects self-occlusion patterns.

Which pair of fingers are spread widest?

Base VLM: Index–Middle ✗
Spatial-aware VLM: Middle–Ring ✓

Key challenge: Tests distance reasoning.

Which fingertip is closest to the palm?

Base VLM: Index ✗
Spatial-aware VLM: Ring ✓

Key challenge: Combines distance reasoning from reference points.

Is the index finger crossing the middle finger?

Base VLM: No ✗
Spatial-aware VLM: Yes ✓

Key challenge: Requires depth and left/right ordering.

Is the thumb left or right of the index finger?

Base VLM: Left ✗
Spatial-aware VLM: Right ✓

Key challenge: Tests X-axis spatial reasoning.

Are the two fingers touching?

Base VLM: No ✗
Spatial-aware VLM: Yes ✓

Key challenge: Detects contact between fingers.

HandVQA Benchmark

Overview

  • Evaluates 5 types of hand spatial understanding:
    • Angle (4-way)
    • Distance (3-way)
    • Relative Position X (left/right)
    • Relative Position Y (above/below)
    • Relative Position Z (front/behind)
  • Built from high-quality 3D joint annotations in FreiHAND, InterHand2.6M, and FPHA.
  • Contains 1.6M+ controlled multiple-choice questions.

Questions are generated deterministically from 3D joint coordinates and rendered into natural-language MCQ options, ensuring explicit geometric grounding.

Hand joint map (MANO joints) used in HandVQA

Hand joint map (MANO-style joints) used to compute pose descriptors.


What HandVQA Adds

  • Targets pose hallucination by probing joint-level spatial errors (e.g., finger angles and inter-joint distances).
  • Focuses on part–whole spatial relations within a single object (the hand), rather than inter-object relationships.
  • All questions are grounded in real 3D coordinates, testing Euclidean concepts such as distance and angle.
  • Evaluates structured geometry understanding—a diagnostic axis that is largely missing from standard VQA benchmarks.
  • Large scale: 1.6M+ questions, 100+ joint combinations, and 5 spatial reasoning types.

Pose Descriptors

We compute discrete labels from 3D joints and convert them into controlled MCQ options.

Angle

Angle descriptor illustration
  • Bent completely inward: θ < 105°
  • Bent inward: 105° ≤ θ < 150°
  • Bent slightly inward: 150° ≤ θ < 170°
  • Straight: θ ≥ 170°
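The angle descriptor above can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names are our own, and we assume θ is the angle at a joint between its two adjacent bone vectors; only the threshold bins are taken from the descriptor definition.

```python
import numpy as np

def joint_angle_deg(parent, joint, child):
    """Angle (degrees) at `joint`, formed by the bone vectors to its
    parent and child joints."""
    v1 = np.asarray(parent, dtype=float) - np.asarray(joint, dtype=float)
    v2 = np.asarray(child, dtype=float) - np.asarray(joint, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against floating-point values slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def angle_label(theta):
    """Discretize a joint angle into the four categorical descriptors."""
    if theta < 105:
        return "bent completely inward"
    if theta < 150:
        return "bent inward"
    if theta < 170:
        return "bent slightly inward"
    return "straight"
```

For example, three collinear joints yield θ = 180°, which falls in the "straight" bin.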

Distance

Distance descriptor illustration
  • Close to: d < 0.1
  • Spread from: 0.1 ≤ d < 0.3
  • Spread wide from: d ≥ 0.3

(Distances are measured in the normalized 3D joint space used by the pipeline.)
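A corresponding sketch of the distance descriptor, again with an assumed function name; d is the Euclidean distance between two joints in the normalized joint space, and only the thresholds come from the definition above.

```python
import numpy as np

def distance_label(p, q):
    """Discretize the Euclidean distance between two normalized 3D joints."""
    d = np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    if d < 0.1:
        return "close to"
    if d < 0.3:
        return "spread from"
    return "spread wide from"
```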

Relative Position (X)

Relative position X illustration
  • Left of: Δx < −0.15
  • Aligned: −0.15 ≤ Δx < 0.15
  • Right of: Δx ≥ 0.15

(Aligned cases are excluded in benchmark questions to avoid visually ambiguous labels.)

Relative Position (Y)

Relative position Y illustration
  • Below: Δy < −0.15
  • Aligned: −0.15 ≤ Δy < 0.15
  • Above: Δy ≥ 0.15

(Aligned cases are excluded in benchmark questions to avoid visually ambiguous labels.)

Relative Position (Z)

Relative position Z illustration
  • Behind: Δz < −0.15
  • Aligned: −0.15 ≤ Δz < 0.15
  • In front of: Δz ≥ 0.15

(Aligned cases are excluded in benchmark questions to avoid visually ambiguous labels.)
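The three relative-position descriptors share one discretization rule, differing only in axis and label names. A minimal sketch (function name and axis convention are our assumptions; thresholds and the aligned-case exclusion follow the definitions above):

```python
def relative_position_label(p, q, axis):
    """Position of joint p relative to joint q along axis 0/1/2 (X/Y/Z).
    Returns None for 'aligned' cases, which the benchmark excludes."""
    labels = {
        0: ("left of", "right of"),
        1: ("below", "above"),
        2: ("behind", "in front of"),
    }
    delta = p[axis] - q[axis]
    if delta < -0.15:
        return labels[axis][0]
    if delta >= 0.15:
        return labels[axis][1]
    return None  # aligned: |delta| within the 0.15 band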

Benchmark Generation Pipeline

Overview of the HandVQA Pipeline. Our pipeline converts normalized 3D hand joints into interpretable VQA pairs through three deterministic stages. (1) Pose extraction: continuous pose descriptors — joint angles (θ), distances (d), and relative positions (Δx, Δy, Δz) — are computed and discretized into categorical pose descriptors. (2) Text generation: deterministic sentence templates are filled using these descriptors to produce candidate answer options, including both correct and distractor choices. (3) MCQ construction: each hand image is paired with its answer options to form structured multiple-choice VQA samples with a single correct label.

HandVQA pipeline
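The text-generation and MCQ-construction stages can be illustrated with a toy example. The template wording, option set, and function names here are simplified placeholders, not the paper's exact templates; the key properties shown are that generation is deterministic given the descriptors and that distractors are drawn from the remaining categorical labels.

```python
import random

# The four angle descriptors double as the answer vocabulary.
ANGLE_OPTIONS = [
    "bent completely inward",
    "bent inward",
    "bent slightly inward",
    "straight",
]

def build_angle_mcq(finger, correct_label, seed=0):
    """Fill a sentence template and build a shuffled MCQ with one
    correct option; the other labels serve as distractors."""
    question = f"How is the {finger} finger bent?"  # placeholder template
    options = list(ANGLE_OPTIONS)
    random.Random(seed).shuffle(options)  # deterministic given the seed
    return {
        "question": question,
        "options": options,
        "answer": options.index(correct_label),
    }
```

A sample with `correct_label="straight"` always stores the index of "straight" in its shuffled option list, so each question has exactly one correct label.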

Question Format

Overview of HandVQA Question Format. The benchmark decomposes hand pose understanding into five spatial reasoning sub-tasks: Angle, Distance, and Relative Position along the X, Y, and Z axes. A hand image with annotated joint indices supports multiple-choice questions for each task, generated directly from 3D joint coordinates. Correct answers are highlighted in green.

HandVQA question format

Dataset Scale and Coverage

Breakdown of question types across training (left) and evaluation (right) splits. Each dataset in the HandVQA benchmark maintains a balanced distribution across five spatial reasoning tasks: angle, distance, and relative positions along the X, Y, and Z axes. This balanced composition supports fair evaluation across all pose-related subtasks.

HandVQA dataset statistics

Results

Key Findings

  • Fine-grained articulation remains challenging: Even strong vision-language models struggle with subtle finger bending and articulated pose understanding, indicating limited sensitivity to precise geometric cues.
  • Distance reasoning shows systematic bias: Base models frequently default to visually plausible “close” predictions, whereas geometry-grounded fine-tuning substantially reduces this bias.
  • Directional spatial reasoning improves most: Relations such as left/right, above/below, and front/behind benefit strongly from HandVQA training, suggesting improved spatial consistency through explicit 3D grounding.
  • Zero-shot transfer from 3D spatial grounding: Spatial knowledge learned from HandVQA transfers effectively without additional training, improving downstream tasks such as gesture recognition and hand–object interaction understanding.
  • Improved confidence calibration after fine-tuning: Base models often make incorrect predictions with high confidence, whereas fine-tuned models achieve higher accuracy while exhibiting more reliable confidence estimates.

Qualitative Examples

BibTeX

Coming Soon!