HandVQA Accepted
Our paper on diagnosing and improving fine-grained spatial reasoning about hands in vision-language models has been accepted to CVPR 2026.
Started my role as a Researcher at UNIST Vision & Learning Lab, continuing work on grounded multimodal reasoning.
Successfully defended my master's thesis and graduated from UNIST.
Our paper on real-time 3D pose estimation of two hands and an object was accepted to AAAI 2025.
In this paper, we introduce HandVQA, a large-scale benchmark grounded in 3D hand geometry for systematically evaluating fine-grained spatial reasoning in Vision-Language Models (VLMs). The dataset contains 1.6M+ geometry-derived VQA pairs spanning joint angles, distances, and relative spatial relations (X/Y/Z). We demonstrate that explicit 3D supervision significantly improves the reliability of spatial reasoning, and that these gains generalize to gesture recognition and hand-object interaction tasks.
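As a rough illustration of what a geometry-derived VQA pair looks like, the sketch below computes a joint angle, a fingertip distance, and a relative-position relation from a hypothetical 21-keypoint 3D hand pose and wraps them into a question-answer record. The keypoint indices, axis convention, and function names are assumptions for illustration only, not the dataset's actual generation pipeline.

```python
import numpy as np

# Hypothetical 21-keypoint hand skeleton: joints[i] is the 3D position (x, y, z)
# of keypoint i in a camera frame with +Y pointing down and +Z away from the camera.
def joint_angle(joints: np.ndarray, parent: int, joint: int, child: int) -> float:
    """Angle (degrees) at `joint` formed by the bones toward `parent` and `child`."""
    u = joints[parent] - joints[joint]
    v = joints[child] - joints[joint]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def relative_relation(joints: np.ndarray, a: int, b: int, axis: int) -> str:
    """Relation of keypoint a with respect to keypoint b along one axis (0=X, 1=Y, 2=Z)."""
    names = [("left of", "right of"), ("above", "below"), ("in front of", "behind")]
    return names[axis][0] if joints[a, axis] < joints[b, axis] else names[axis][1]

def make_vqa_pair(joints: np.ndarray) -> dict:
    """Turn one ground-truth hand pose into a geometry-derived QA record."""
    angle = joint_angle(joints, parent=5, joint=6, child=7)   # e.g. index-finger PIP angle
    dist = float(np.linalg.norm(joints[4] - joints[8]))       # thumb tip to index tip
    return {
        "question": "Is the index fingertip above or below the thumb tip?",
        "answer": relative_relation(joints, a=8, b=4, axis=1),
        "metadata": {"index_pip_angle_deg": angle, "thumb_index_dist": dist},
    }
```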
In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. Our approach optimizes queries to balance efficiency and accuracy, leveraging hand-object contact information and a three-step feature update mechanism. Our method runs at 53.5 FPS on an RTX 3090 Ti GPU while outperforming state-of-the-art models on the H2O and FPHA datasets.
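For intuition only, the minimal sketch below shows a query-based decoder that assigns separate learned queries to the left-hand, right-hand, and object keypoints and regresses 3D coordinates from image tokens. The query layout, the use of three decoder layers as a stand-in for the three-step feature update, and the omission of the contact-guided refinement are all assumptions; this is not the QORT-Former implementation.

```python
import torch
import torch.nn as nn

class TwoHandObjectDecoder(nn.Module):
    """Illustrative DETR-style decoder with one learned query per target keypoint
    (2 x 21 hand joints + 8 object corners); hypothetical, not the paper's design."""
    def __init__(self, d_model: int = 256, n_hand_joints: int = 21, n_obj_corners: int = 8):
        super().__init__()
        self.queries = nn.Embedding(2 * n_hand_joints + n_obj_corners, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Three decoder layers, loosely mirroring the "three-step" update (assumption).
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.head = nn.Linear(d_model, 3)  # regress (x, y, z) per query

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (B, HW, d_model) tokens from a backbone plus projection.
        B = image_features.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(q, image_features)  # queries cross-attend to image tokens
        return self.head(out)                  # (B, 2*21 + 8, 3) keypoints

# Usage: TwoHandObjectDecoder()(torch.randn(1, 49, 256)).shape -> torch.Size([1, 50, 3])
```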