News

May 2026

Joined LIOPS

Started as a 3D AI Engineer at LIOPS, working on stereo depth estimation while continuing to pursue broader interests in 3D vision and multimodal AI.

March 2026

HandVQA on arXiv

Our HandVQA paper is now available on arXiv.

CVPR 2026

HandVQA Accepted

Our paper on diagnosing and improving fine-grained spatial reasoning about hands in vision-language models has been accepted to CVPR 2026.

September 2025

Joined UNIST

Started my role as a Researcher at the UNIST Vision & Learning Lab, continuing work on grounded multimodal reasoning.

August 2025

Master's Graduation

Successfully defended my master's thesis and graduated from UNIST.

AAAI 2025

QORT-Former Accepted

Our paper on real-time two-hand and object understanding was accepted to AAAI 2025.

Publications

CVPR 2026

Denver, CO, USA

HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

Authors: MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek.

HandVQA teaser image

In this paper, we introduce HandVQA, a large-scale benchmark grounded in 3D hand geometry to systematically evaluate fine-grained spatial reasoning in Vision-Language Models (VLMs). The dataset contains 1.6M+ geometry-derived VQA pairs spanning joint angles, distances, and relative spatial relations (X/Y/Z). We demonstrate that explicit 3D supervision significantly improves spatial reasoning reliability and generalizes to gesture recognition and hand-object interaction tasks.
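To illustrate what "geometry-derived" means here, the following is a minimal sketch of how question-answer pairs could be computed from 3D hand joints. It is not the HandVQA pipeline: the joint names, coordinates, question templates, and answer formats are all hypothetical, chosen only to show the three quantity types mentioned above (joint angles, distances, and relative relations along X/Y/Z).

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z), assumed here to be in centimetres

def sub(a: Point, b: Point) -> Point:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def norm(v: Point) -> float:
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(c * c for c in v))

def joint_angle_deg(parent: Point, joint: Point, child: Point) -> float:
    """Angle at `joint` between the bones joint->parent and joint->child."""
    u, v = sub(parent, joint), sub(child, joint)
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))

def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two joints."""
    return norm(sub(a, b))

def larger_on_axis(name_a: str, a: Point, name_b: str, b: Point, axis: int) -> str:
    """Name of the joint with the larger coordinate on the given axis (0=X, 1=Y, 2=Z)."""
    if a[axis] == b[axis]:
        return "equal"
    return name_a if a[axis] > b[axis] else name_b

def make_qa_pairs(joints: Dict[str, Point]) -> List[Tuple[str, str]]:
    """Build a few (question, answer) pairs from hypothetical templates."""
    angle = joint_angle_deg(joints["index_mcp"], joints["index_pip"], joints["index_tip"])
    dist = distance(joints["thumb_tip"], joints["index_tip"])
    pairs = [
        ("What is the angle at the index PIP joint, in degrees?", f"{angle:.0f}"),
        ("How far apart are the thumb tip and index tip, in cm?", f"{dist:.1f}"),
    ]
    for axis, label in enumerate("XYZ"):
        answer = larger_on_axis("thumb_tip", joints["thumb_tip"],
                                "index_tip", joints["index_tip"], axis)
        pairs.append((f"Which has the larger {label} coordinate, "
                      f"thumb_tip or index_tip?", answer))
    return pairs

# Hypothetical joint positions, not taken from any dataset.
example_joints = {
    "index_mcp": (0.0, 0.0, 0.0),
    "index_pip": (0.0, 4.0, 0.0),
    "index_tip": (0.0, 4.0, 3.0),
    "thumb_tip": (3.0, 4.0, 3.0),
}

qa_pairs = make_qa_pairs(example_joints)
```

Because every answer is computed directly from the joint coordinates, pairs of this kind can be generated at scale from any dataset with 3D hand annotations, with no manual labelling of the answers.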

AAAI 2025

Philadelphia, PA, USA

QORT-Former: Query-Optimized Real-Time Transformer for Understanding Two Hands Manipulating Objects

Authors: Elkhan Ismayilzada*, MD Khalequzzaman Chowdhury Sayem* (Co-First Author), Yihalem Yimolal Tiruneh, Mubarrat Tajoar Chowdhury, Muhammadjon Boboev, Seungryul Baek.

QORT-Former teaser image

In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. Our approach optimizes queries to balance efficiency and accuracy, leveraging hand-object contact information and a three-step feature update mechanism. Our method achieves real-time pose estimation at 53.5 FPS on an RTX 3090 Ti GPU while outperforming state-of-the-art models on the H2O and FPHA datasets.