
AI generates data to help embodied agents ground language to 3D world


A new 3D-text dataset, 3D-GRAND, leverages generative AI to create synthetic rooms that are automatically annotated with 3D structures. The dataset's 40,087 household scenes can help train embodied AI, such as household robots, to connect language to 3D spaces. Credit: Joyce Chai

A new, densely annotated 3D-text dataset called 3D-GRAND can help train embodied AI, like household robots, to connect language to 3D spaces. The study, led by University of Michigan researchers, was presented at the Computer Vision and Pattern Recognition (CVPR) Conference in Nashville, Tennessee, on June 15, and published on the arXiv preprint server.

When put to the test against models trained on previous 3D datasets, the model trained on 3D-GRAND reached 38% grounding accuracy, surpassing the previous best model by 7.7%. 3D-GRAND also drastically reduced hallucinations, to only 6.67% from the previous state-of-the-art rate of 48%.

The dataset contributes to the next generation of household robots that will far exceed the robotic vacuums that currently populate homes. Before we can command a robot to “pick up the book next to the lamp on the nightstand and bring it to me,” the robot must be trained to understand what language refers to in space.

“Large multimodal language models are mostly trained on text with 2D images, but we live in a 3D world. If we want a robot to interact with us, it must understand spatial terms and perspectives, interpret object orientations in space, and ground language in the rich 3D environment,” said Joyce Chai, a professor of computer science and engineering at U-M and senior author of the study.

While text- or image-based AI models can pull an enormous amount of information from the internet, 3D data is scarce. It's even harder to find 3D data paired with grounded text, meaning specific words like "sofa" are linked to the 3D coordinates bounding the actual sofa.
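To make the idea concrete, a densely grounded annotation ties each noun phrase in a description to the identifier and 3D bounding box of a specific object in the scene. The sketch below is purely illustrative; the field names and structure are assumptions, not the actual 3D-GRAND schema.

```python
# Hypothetical example of a densely grounded 3D-text annotation.
# Field names are illustrative and do not reflect the real 3D-GRAND schema.
annotation = {
    "scene_id": "synthetic_room_0042",
    "description": "A blue sofa sits next to the wooden coffee table.",
    "groundings": [
        {
            "phrase": "blue sofa",
            "object_id": "obj_17",
            # Axis-aligned box: center (x, y, z) and size (w, h, d) in meters.
            "bbox": {"center": [1.2, 0.4, 2.0], "size": [2.1, 0.9, 1.0]},
        },
        {
            "phrase": "wooden coffee table",
            "object_id": "obj_23",
            "bbox": {"center": [2.5, 0.3, 2.1], "size": [1.0, 0.5, 0.6]},
        },
    ],
}
```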

Like all LLMs, 3D-LLMs perform best when trained on large datasets. However, building a large dataset by imaging rooms with cameras would be time-intensive and expensive, as annotators must manually label objects, specify their spatial relationships, and link words to the objects they refer to.

The research team took a new approach, leveraging generative AI to create synthetic rooms that are automatically annotated with 3D structures. The resulting 3D-GRAND dataset includes 40,087 household scenes paired with 6.2 million densely grounded room descriptions.

“A big advantage of synthetic data is that labels come for free because you already know where the sofa is, which makes the curation process easier,” said Jianing Jed Yang, a doctoral student of computer science and engineering at U-M and lead author of the study.

After generating the synthetic 3D data, the team's AI pipeline first used vision models to describe each object's color, shape and material. A text-only model then generated descriptions of entire scenes, using scene graphs (structured maps of how objects relate to each other) to ensure that each noun phrase was grounded to a specific 3D object.
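As a rough illustration of the scene-graph step, the sketch below builds a tiny graph of objects and spatial relations and assembles a grounded sentence from it. The graph format and the template-based generation are assumptions for illustration only; the actual pipeline relies on large vision and language models rather than hand-written templates.

```python
# Minimal sketch of turning a scene graph into a grounded description.
# The graph format and template generation are illustrative assumptions;
# the real pipeline uses LLMs, not templates.
scene_graph = {
    "objects": {
        "obj_17": {"label": "sofa", "attributes": ["blue", "fabric"]},
        "obj_23": {"label": "coffee table", "attributes": ["wooden"]},
    },
    "relations": [("obj_17", "next to", "obj_23")],
}

def describe(graph: dict) -> tuple[str, list[dict]]:
    """Turn one relation into a sentence plus phrase-to-object groundings."""
    subj_id, relation, obj_id = graph["relations"][0]
    subj = graph["objects"][subj_id]
    obj = graph["objects"][obj_id]
    subj_phrase = f"{' '.join(subj['attributes'])} {subj['label']}"
    obj_phrase = f"{' '.join(obj['attributes'])} {obj['label']}"
    sentence = f"The {subj_phrase} is {relation} the {obj_phrase}."
    groundings = [
        {"phrase": subj_phrase, "object_id": subj_id},
        {"phrase": obj_phrase, "object_id": obj_id},
    ]
    return sentence, groundings

sentence, groundings = describe(scene_graph)
print(sentence)   # The blue fabric sofa is next to the wooden coffee table.
print(groundings)
```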

A final quality-control step used a hallucination filter to ensure that each object mentioned in the generated text actually corresponds to an object in the 3D scene.
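A hallucination filter of this kind can be sketched as a simple membership check: every grounded phrase must point at an object ID that actually exists in the scene. The function below is an assumption about how such a check might look, not the paper's implementation.

```python
# Illustrative hallucination filter: reject annotations whose grounded phrases
# reference objects that are not present in the scene.
def filter_hallucinations(annotation: dict, scene_object_ids: set[str]) -> bool:
    """Return True if every grounded phrase maps to a real object in the scene."""
    return all(
        g["object_id"] in scene_object_ids for g in annotation["groundings"]
    )

scene_object_ids = {"obj_17", "obj_23"}
good = {"groundings": [{"phrase": "blue sofa", "object_id": "obj_17"}]}
bad = {"groundings": [{"phrase": "red lamp", "object_id": "obj_99"}]}  # not in scene
print(filter_hallucinations(good, scene_object_ids))  # True
print(filter_hallucinations(bad, scene_object_ids))   # False
```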

Human evaluators spot-checked 10,200 room-annotation pairs, assessing the AI-generated sentences and objects for inaccuracies. The synthetic annotations had a low error rate of about 5% to 8%, comparable to professional human annotation.

“Given the size of the dataset, the LLM-based annotation reduces both the cost and time by an order of magnitude compared to human annotation, creating 6.2 million annotations in just two days. It is widely recognized that collecting high-quality data at scale is essential for building effective AI models,” said Yang.

To put the new dataset to the test, the research team trained a model on 3D-GRAND and compared it with three baseline models (3D-LLM, LEO and 3D-VISTA). The ScanRefer benchmark evaluated grounding accuracy, measured by how much the predicted bounding box overlaps the true object boundary, while a newly introduced benchmark called 3D-POPE evaluated object hallucinations.
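Grounding accuracy of this kind is typically scored with intersection-over-union (IoU) between the predicted and ground-truth boxes, with a prediction counted as correct if the IoU clears a threshold such as 0.25. The sketch below computes IoU for axis-aligned 3D boxes; it is a generic illustration of the overlap metric, not the benchmark's exact evaluation code.

```python
# Generic 3D IoU for axis-aligned boxes given as (min_corner, max_corner);
# an illustration of the overlap metric behind grounding accuracy,
# not ScanRefer's actual evaluation code.
def iou_3d(box_a, box_b):
    (ax1, ay1, az1), (ax2, ay2, az2) = box_a
    (bx1, by1, bz1), (bx2, by2, bz2) = box_b
    # Overlap along each axis (zero if the boxes do not intersect).
    dx = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    dy = max(0.0, min(ay2, by2) - max(ay1, by1))
    dz = max(0.0, min(az2, bz2) - max(az1, bz1))
    inter = dx * dy * dz
    vol_a = (ax2 - ax1) * (ay2 - ay1) * (az2 - az1)
    vol_b = (bx2 - bx1) * (by2 - by1) * (bz2 - bz1)
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

pred = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
gt = ((1.0, 0.0, 0.0), (3.0, 1.0, 1.0))
print(iou_3d(pred, gt))  # ~0.33, so counted correct at a 0.25 IoU threshold
```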

The model trained on 3D-GRAND reached a 38% grounding accuracy with only a 6.67% hallucination rate, far exceeding the competing generative models. While 3D-GRAND contributes to the 3D-LLM modeling community, testing on robots will be the next step.

“It will be exciting to see how 3D-GRAND helps robots better understand space and take on different spatial perspectives, potentially improving how they communicate and collaborate with humans,” said Chai.

More information:
Jianing Yang et al, 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination, arXiv (2024). DOI: 10.48550/arxiv.2406.05132

Journal information:
arXiv

Provided by
University of Michigan College of Engineering

Citation:
AI generates data to help embodied agents ground language to 3D world (2025, June 16)
retrieved 16 June 2025
from https://techxplore.com/news/2025-06-ai-generates-embodied-agents-ground.html




