
Vote-based model developed for more accurate hand-held object pose estimation


Qualitative results. From left to right: input RGB and depth images from the DexYCB dataset [14]; rendered images using ground-truth hand and object poses; and rendered images using ground-truth hand poses with object poses predicted by our method, an RGB-D method [18], an RGB method [20], and a hand-object pose estimation method [43]. Credit: Alexandria Engineering Journal (2025). DOI: 10.1016/j.aej.2025.02.017

Many robotic applications rely on robotic arms or hands to handle different types of objects. Estimating the pose of such hand-held objects is an important yet challenging task in robotics, computer vision, and even augmented reality (AR). A promising direction is to use multi-modal data, such as color (RGB) and depth (D) images, and with the increasing availability of 3D sensors, many machine learning approaches have emerged to exploit this combination.

However, existing approaches still face two main challenges. First, accuracy drops when hands occlude the objects they hold, hiding features critical for pose estimation. Hand-object interactions also introduce non-rigid transformations that further complicate the task: a hand can change the shape or structure of the object it holds, for instance when squeezing a soft ball, distorting the object's perceived geometry.

Second, most current techniques extract features from separate RGB and RGB-D backbones, which are then fused at the feature level. Since these two backbones handle inherently different modalities, this fusion can result in representation distribution shifts, meaning features learned from RGB images may misalign with those extracted from RGB-D inputs, affecting pose estimation.

Furthermore, during fine-tuning, dense interactions between the two backbones cause performance disruptions and limit the benefits of incorporating RGB features.

To address these issues, a research team led by Associate Professor Phan Xuan Tan from the Innovative Global Program, College of Engineering at Shibaura Institute of Technology, Japan, along with Dr. Dinh-Cuong Hoang and other researchers from FPT University, Vietnam, developed an innovative deep neural network specifically designed for hand-held object pose estimation from RGB-D images.

“The key innovation of our deep learning framework lies in a vote-based fusion mechanism, which effectively integrates both 2D (RGB) and 3D (depth) keypoints, while addressing hand-induced occlusions and the difficulties of fusing multimodal data. Additionally, it decouples the learning process and incorporates a self-attention-based hand-object interaction model, resulting in substantial improvements,” explains Dr. Tan.

Their study was made available online on February 17, 2025, in the Alexandria Engineering Journal.

The proposed deep-learning framework comprises four components: backbones to extract high-dimensional features from 2D images and 3D point cloud data, voting modules, a novel vote-based fusion module, and a hand-aware object pose estimation module.

Example of generated votes projected on a 2D image. Green points indicate precise predictions closely aligned with ground-truth keypoints, while red points represent predictions deviating further from the ground-truth. Credit: Alexandria Engineering Journal (2025). DOI: 10.1016/j.aej.2025.02.017

Initially, the 2D and 3D backbones predict 2D and 3D keypoints of both hands and objects from the RGB-D images. Keypoints refer to the meaningful locations in the input images that help describe the pose of the hands and objects. Next, the voting modules within each backbone independently cast votes for their respective keypoints.
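
To make the voting idea concrete, the sketch below shows a minimal PyTorch voting head in which every feature location predicts an offset toward each keypoint along with a confidence score. The module name, layer sizes, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    """Casts per-seed votes: each feature location predicts an offset toward
    every keypoint plus a confidence score (hypothetical layer sizes)."""
    def __init__(self, feat_dim: int, num_keypoints: int, coord_dim: int = 3):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.coord_dim = coord_dim
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_keypoints * (coord_dim + 1)),  # offsets + confidences
        )

    def forward(self, feats: torch.Tensor, coords: torch.Tensor):
        # feats:  (B, N, feat_dim) features from the 2D or 3D backbone
        # coords: (B, N, coord_dim) seed locations (pixels for 2D, points for 3D)
        B, N, _ = feats.shape
        out = self.mlp(feats).view(B, N, self.num_keypoints, self.coord_dim + 1)
        offsets = out[..., :self.coord_dim]              # displacement toward each keypoint
        conf = torch.sigmoid(out[..., self.coord_dim])   # per-vote confidence in [0, 1]
        votes = coords.unsqueeze(2) + offsets            # (B, N, K, coord_dim) vote positions
        return votes, conf

# Toy usage: 1,024 3D seed points voting for 8 object keypoints.
votes, conf = VotingModule(feat_dim=256, num_keypoints=8)(
    torch.randn(2, 1024, 256), torch.randn(2, 1024, 3))
```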

These votes are then integrated by the vote-based fusion module, which dynamically combines the 2D and 3D votes using radius-based neighborhood projection and channel attention mechanisms. The former preserves local information, while the latter adapts to varying input conditions, ensuring robustness and accuracy.

This vote-based fusion effectively leverages the strengths of RGB and depth information, mitigating the impact of hand-induced occlusions and misalignment, thereby enabling accurate hand-object pose estimation.
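
The article does not give the fusion equations, but a simplified sketch of the idea might look as follows: each 3D vote is projected into the image with the camera intrinsics, 2D votes falling within a pixel radius are pooled, and the concatenated features are re-weighted with a squeeze-and-excitation-style channel attention. The radius, feature sizes, and intrinsics handling below are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VoteFusion(nn.Module):
    """Hypothetical fusion of 3D and 2D votes via radius-based neighborhood
    projection followed by channel attention."""
    def __init__(self, dim: int, radius_px: float = 8.0):
        super().__init__()
        self.radius_px = radius_px
        # Squeeze-and-excitation-style gate over the concatenated feature channels.
        self.channel_attn = nn.Sequential(
            nn.Linear(2 * dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 2 * dim), nn.Sigmoid(),
        )
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, votes3d, feats3d, votes2d, feats2d, K):
        # votes3d: (N, 3) 3D votes, feats3d: (N, C)
        # votes2d: (M, 2) 2D votes in pixels, feats2d: (M, C)
        # K: (3, 3) camera intrinsics, used to project 3D votes into the image.
        proj = (K @ votes3d.T).T
        proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)   # perspective division
        dist = torch.cdist(proj, votes2d)                   # (N, M) pixel distances
        mask = (dist < self.radius_px).float()               # radius-based neighborhood
        denom = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        pooled2d = (mask @ feats2d) / denom                  # average 2D features near each 3D vote
        fused = torch.cat([feats3d, pooled2d], dim=1)        # (N, 2C)
        fused = fused * self.channel_attn(fused)             # channel-wise re-weighting
        return self.out(fused)                               # (N, C) fused vote features

# Toy usage with a pinhole intrinsics matrix and points in front of the camera.
K = torch.tensor([[600., 0., 320.], [0., 600., 240.], [0., 0., 1.]])
fused = VoteFusion(dim=64)(torch.rand(128, 3) + torch.tensor([0., 0., 1.]),
                           torch.randn(128, 64), torch.rand(64, 2) * 640,
                           torch.randn(64, 64), K)
```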

The final component, the hand-aware object pose estimation module, further improves accuracy by using a self-attention mechanism to capture the complex relationships between hand and object keypoints. This allows the system to account for the non-rigid transformations caused by different hand poses and grips.
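
A hand-aware refinement step of this kind can be sketched with a standard transformer-style self-attention block applied to the combined set of hand and object keypoint embeddings. The sizes and the residual/feed-forward layout below are assumptions, since the article only states that self-attention is used to model the hand-object relationship.

```python
import torch
import torch.nn as nn

class HandObjectInteraction(nn.Module):
    """Self-attention over the joint set of hand and object keypoint embeddings
    (hypothetical sizes, for illustration only)."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))

    def forward(self, hand_kp: torch.Tensor, obj_kp: torch.Tensor):
        # hand_kp: (B, H, dim) hand keypoint embeddings (e.g. 21 hand joints)
        # obj_kp:  (B, O, dim) object keypoint embeddings
        tokens = torch.cat([hand_kp, obj_kp], dim=1)      # one joint token set
        attn_out, _ = self.attn(tokens, tokens, tokens)   # every keypoint attends to all others
        tokens = self.norm1(tokens + attn_out)
        tokens = self.norm2(tokens + self.ffn(tokens))
        # Object keypoints are returned refined, now conditioned on the hand configuration,
        # which is how grip-dependent, non-rigid effects can be absorbed.
        return tokens[:, hand_kp.shape[1]:, :]

# Toy usage: 21 hand joints and 8 object keypoints, embedding size 128.
refined_obj = HandObjectInteraction()(torch.randn(2, 21, 128), torch.randn(2, 8, 128))
```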

To test their framework, the researchers conducted experiments on three public datasets. The results showed significant improvements in accuracy (up to 15%) and robustness over state-of-the-art approaches.

Furthermore, on-site experiments demonstrated an average precision of 76.8%, with performance improvements of up to 13.9% compared to existing methods. The framework also achieves inference times of up to 40 milliseconds without refinement and 200 milliseconds with refinement, demonstrating real-world applicability.

“Our research directly addresses a long-standing bottleneck in the robotics and computer vision industries—accurate object pose estimation in occluded, dynamic, and complex hand-object interaction scenarios,” remarks Dr. Tan.

“Our approach is not only more accurate but also simpler than many existing techniques. It has the potential to accelerate the deployment of AI-powered systems, such as efficient automated robotic assembly lines, human-assistive robotics, and immersive AR/VR technologies.”

Overall, this innovative approach represents a significant step forward in robotics, enabling robots to more effectively handle complex objects and advancing AR technologies to model more lifelike hand-object interactions.

More information:
Dinh-Cuong Hoang et al, Vote-based multimodal fusion for hand-held object pose estimation, Alexandria Engineering Journal (2025). DOI: 10.1016/j.aej.2025.02.017

Provided by
Shibaura Institute of Technology

Citation:
Vote-based model developed for more accurate hand-held object pose estimation (2025, May 1)
retrieved 1 May 2025
from https://techxplore.com/news/2025-05-vote-based-accurate-held-pose.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.



