QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects

UNIST, Ulsan, South Korea¹
Michigan State University, MI, USA²
AAAI 2025
*Equal contribution. ✉Corresponding author.

Abstract

In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given a limited number of queries and decoders, we optimize the queries that are fed to the Transformer decoder to secure good accuracy: (1) we divide the queries into three types (a left-hand query, a right-hand query, and an object query), and we enhance the query features (2) by using contact information between the hands and the object and (3) by using a three-step update of the enhanced image and query features in the decoder with respect to one another. With the proposed methods, we achieve real-time pose estimation using just 108 queries and 1 decoder (53.5 FPS on an RTX 3090 Ti GPU). Our method also excels in accuracy, surpassing state-of-the-art results on the H2O dataset by 17.6% (left hand), 22.8% (right hand), and 27.2% (object), and on the FPHA dataset by 5.3% (right hand) and 10.4% (object). Additionally, it sets a new state of the art in interaction recognition while maintaining real-time efficiency with an off-the-shelf action recognition module.
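To make the query budget and the speed figure concrete, the following minimal Python sketch partitions 108 query indices into left-hand, right-hand, and object groups and converts the reported 53.5 FPS into a per-frame latency. The even 36/36/36 split, the helper names, and the 30 FPS real-time threshold are illustrative assumptions; the abstract does not state how the queries are distributed among the three types.

```python
# Illustrative sketch only: the even split, the helper names, and the
# 30 FPS threshold are assumptions, not details taken from the paper.
NUM_QUERIES = 108  # total number of queries reported in the abstract
QUERY_TYPES = ("left_hand", "right_hand", "object")


def partition_queries(num_queries, query_types):
    """Assign contiguous query indices to each type (assumed even split)."""
    if num_queries % len(query_types) != 0:
        raise ValueError("num_queries must divide evenly among query types")
    per_type = num_queries // len(query_types)
    return {
        name: list(range(i * per_type, (i + 1) * per_type))
        for i, name in enumerate(query_types)
    }


def frame_latency_ms(fps):
    """Convert frames per second into milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


groups = partition_queries(NUM_QUERIES, QUERY_TYPES)
latency = frame_latency_ms(53.5)          # speed reported in the abstract
realtime_budget = frame_latency_ms(30.0)  # assumed real-time threshold

for name, indices in groups.items():
    print(f"{name}: {len(indices)} queries ({indices[0]}..{indices[-1]})")
print(f"latency: {latency:.1f} ms/frame, budget: {realtime_budget:.1f} ms/frame")
```

Under these assumptions each query type receives 36 indices, and the reported speed corresponds to roughly 18.7 ms per frame, comfortably inside a 33.3 ms budget.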

BibTeX

@inproceedings{ismayilzada2025qort,
  title={QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects},
  author={Ismayilzada, Elkhan and Sayem, MD Khalequzzaman Chowdhury and Tiruneh, Yihalem Yimolal and Chowdhury, Mubarrat Tajoar and Boboev, Muhammadjon and Baek, Seungryul},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={4},
  pages={3895--3903},
  year={2025}
}

Comparison between our method and the previous SOTA method.

Comparison with competitive state-of-the-art algorithms on the two-hand and object pose estimation task, measured on an RTX 3090 Ti GPU. Despite using a Transformer architecture, our method achieves the fastest speed (53.5 FPS) and the best accuracy among the compared methods.
