News

  • 🎉 CVPR 2026: Our paper "HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models" has been accepted to CVPR 2026!
  • 🎉 September 2025: Started my new role as a Researcher at UNIST's Vision and Learning Lab!
  • 🎉 August 2025: Successfully defended my Master's thesis and graduated from UNIST!
  • 🎉 AAAI 2025: Our paper "QORT-Former: Query-Optimized Real-Time Transformer for Understanding Two Hands Manipulating Objects" has been accepted! Check out our poster!

Publications

  • HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models. Accepted at CVPR 2026.
    Authors: MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek.
    In this paper, we introduce HandVQA, a large-scale benchmark grounded in 3D hand geometry to systematically evaluate fine-grained spatial reasoning in Vision-Language Models (VLMs). The dataset contains 1.6M+ geometry-derived VQA pairs spanning joint angles, distances, and relative spatial relations (X/Y/Z). We demonstrate that explicit 3D supervision significantly improves spatial reasoning reliability and generalizes to gesture recognition and hand–object interaction tasks.
    Project Page | Paper (coming soon) | Code (coming soon)
  • QORT-Former: Query-Optimized Real-Time Transformer for Understanding Two Hands Manipulating Objects. Accepted at AAAI 2025.
    Authors: Elkhan Ismayilzada*, MD Khalequzzaman Chowdhury Sayem*, Yihalem Yimolal Tiruneh, Mubarrat Tajoar Chowdhury, Muhammadjon Boboev, Seungryul Baek. (* denotes equal contribution)
    In this paper, we present QORT-Former, the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. Our approach optimizes queries to balance efficiency and accuracy, leveraging hand-object contact information and a three-step feature update mechanism. Our method achieves real-time pose estimation at 53.5 FPS on an RTX 3090 Ti GPU while outperforming state-of-the-art models on the H2O and FPHA datasets.
    Project Page | Paper | Code