Ai2 Unveils MolmoAct, AI Model for 3D Spatial Reasoning

First-of-its-kind model that combines spatial planning and visual reasoning to enable safer, more adaptable robot control
Ai2 (The Allen Institute for AI) today announced the release of MolmoAct 7B, a breakthrough embodied AI model that brings the intelligence of state-of-the-art AI models into the physical world. Instead of reasoning through language and converting that into movement, MolmoAct actually sees its surroundings, understands the relationships between space, movement, and time, and plans its movements accordingly. It does this by generating visual reasoning tokens that transform 2D image inputs into 3D spatial plans—enabling robots to navigate the physical world with greater intelligence and control.
While spatial reasoning isn’t new, most modern systems rely on closed, end-to-end architectures trained on massive proprietary datasets. These models are difficult to reproduce, expensive to scale, and often operate as opaque black boxes. MolmoAct offers a fundamentally different approach: it’s trained entirely on open data, designed for transparency, and built for real-world generalization. Its step-by-step visual reasoning traces make it easy to preview what a robot plans to do and intuitively steer its behavior in real time as conditions change.
“Embodied AI needs a new foundation that prioritizes reasoning, transparency, and openness,” said Ali Farhadi, CEO of Ai2. “With MolmoAct, we’re not just releasing a model; we’re laying the groundwork for a new era of AI, bringing the intelligence of powerful AI models into the physical world. It’s a step toward AI that can reason and navigate the world in ways that are more aligned with how humans do — and collaborate with us safely and effectively.”
A New Class of Model: Action Reasoning
MolmoAct is the first in a new category of AI models Ai2 is calling Action Reasoning Models (ARMs): models that interpret high-level natural language instructions and reason through a sequence of physical actions to carry them out in the real world. Unlike traditional end-to-end robotics models that treat a task as a single, opaque step, ARMs break an instruction down into a transparent chain of spatially grounded decisions:
- 3D-aware perception: grounding the robot’s understanding of its environment using depth and spatial context
- Visual waypoint planning: outlining a step-by-step task trajectory in image space
- Action decoding: converting the plan into precise, robot-specific control commands
This layered reasoning enables MolmoAct to interpret commands like “Sort this trash pile” not as a single step, but as a structured series of sub-tasks: recognize the scene, group objects by type, grasp them one by one, and repeat.
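The three stages above can be pictured as a simple hand-off pipeline. The sketch below is purely illustrative—a minimal stand-in, not MolmoAct's actual API: the function names, data containers, and stub logic are assumptions chosen to show how 3D-aware perception, visual waypoint planning, and action decoding feed into one another.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical data containers -- not MolmoAct's real interfaces.
@dataclass
class ScenePerception:
    depth_tokens: List[int]        # 3D-aware perception: depth / spatial context
    object_labels: List[str]

@dataclass
class Waypoint:
    xy: Tuple[float, float]        # pixel coordinates in the camera image

def perceive_3d(image) -> ScenePerception:
    """Stage 1: ground the scene with depth and spatial context (stub)."""
    return ScenePerception(depth_tokens=[], object_labels=["cup", "bottle"])

def plan_waypoints(scene: ScenePerception, instruction: str) -> List[Waypoint]:
    """Stage 2: outline a step-by-step trajectory in image space (stub)."""
    return [Waypoint((120.0, 80.0)), Waypoint((200.0, 150.0))]

def decode_actions(waypoints: List[Waypoint]) -> List[List[float]]:
    """Stage 3: convert the image-space plan into robot-specific commands (stub)."""
    return [[wp.xy[0] / 100.0, wp.xy[1] / 100.0, 0.0] for wp in waypoints]

if __name__ == "__main__":
    scene = perceive_3d(image=None)                      # camera frame would go here
    plan = plan_waypoints(scene, "Sort this trash pile")
    commands = decode_actions(plan)
    print(commands)
```

The point of the decomposition is that each intermediate result (the perceived scene, the image-space waypoints) can be inspected or corrected before any motor command is issued.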
Built to Generalize and Trained to Scale
MolmoAct 7B, the first in its model family, was trained on a curated dataset of about 12,000 “robot episodes” from real-world environments, such as kitchens and bedrooms. These demonstrations were transformed into robot-reasoning sequences that expose how complex instructions map to grounded, goal-directed actions. Along with the model, Ai2 is releasing this post-training dataset in full. Ai2 researchers spent months curating videos of robots performing actions in diverse household settings, from arranging pillows on a living room couch to putting away laundry in a bedroom.
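For readers who want to inspect the released episodes, a minimal sketch using the Hugging Face `datasets` library is shown below. The repository id is a placeholder (check Ai2's Hugging Face page for the actual dataset name), and the record fields are assumptions—print the keys to see the real schema.

```python
from datasets import load_dataset

# Placeholder repo id -- replace with the actual MolmoAct dataset id on Ai2's Hugging Face page.
REPO_ID = "allenai/molmoact-post-training-data"

# Stream the split so the full episode set is not downloaded up front.
episodes = load_dataset(REPO_ID, split="train", streaming=True)

for i, episode in enumerate(episodes):
    print(sorted(episode.keys()))   # inspect the actual schema of each robot episode
    if i >= 2:
        break
```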
Despite its strong performance, MolmoAct was trained with striking efficiency. It was pretrained on just 18 million samples using 256 NVIDIA H100 GPUs for about 24 hours, then fine-tuned on 64 GPUs for roughly two more hours. In contrast, many commercial models require hundreds of millions of samples and far more compute. Yet MolmoAct outperforms many of these systems on key benchmarks—including a 71.9% success rate on SimPLER—demonstrating that high-quality data and thoughtful design can beat models trained with far more data and compute.
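As a rough back-of-the-envelope check on that efficiency claim, the figures quoted above work out to only a few thousand GPU-hours in total:

```python
# Back-of-the-envelope compute from the figures quoted above.
pretrain_gpu_hours = 256 * 24         # 256 H100s for ~24 hours -> 6,144 GPU-hours
finetune_gpu_hours = 64 * 2           # 64 GPUs for ~2 hours    ->   128 GPU-hours
total = pretrain_gpu_hours + finetune_gpu_hours
print(f"~{total:,} GPU-hours total")  # ~6,272 GPU-hours
```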
Understandable AI You Can Build On
Unlike most robotics models, which operate as opaque systems, MolmoAct was built for transparency. Users can preview the model’s planned movements before execution, with motion trajectories overlaid on camera images. These plans can be adjusted using natural language or quick sketching corrections on a touchscreen—providing fine-grained control and enhancing safety in real-world environments like homes, hospitals, and warehouses.
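To make the idea of previewing a plan concrete, the sketch below overlays a set of predicted image-space waypoints on a camera frame with matplotlib. The waypoint array and the assumption that the model exposes its plan as pixel coordinates are illustrative only; MolmoAct's actual preview and sketch-correction tooling may work differently.

```python
import numpy as np
import matplotlib.pyplot as plt

# A stand-in camera frame and a hypothetical set of predicted waypoints
# (pixel coordinates); in a real setup both would come from the robot stack.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
waypoints = np.array([[320, 400], [300, 320], [260, 250], [230, 180]])

plt.imshow(frame)
plt.plot(waypoints[:, 0], waypoints[:, 1], "o-", linewidth=2, label="planned trajectory")
plt.scatter(*waypoints[-1], s=120, marker="*", label="goal")
plt.legend()
plt.title("Preview of planned motion before execution")
plt.show()
```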
True to Ai2’s mission, MolmoAct is fully open-source and reproducible. Ai2 is releasing everything needed to build, run, and extend the model: training pipelines, pre- and post-training datasets, model checkpoints, and evaluation benchmarks.
MolmoAct sets a new standard for what embodied AI should look like—safe, interpretable, adaptable, and truly open. Ai2 will continue expanding its testing across both simulated and real-world environments, with the goal of enabling more capable and collaborative AI systems.
Download the model and model artifacts – including training checkpoints and evals – from Ai2’s Hugging Face repository.
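One minimal way to pull the released artifacts locally is `snapshot_download` from the `huggingface_hub` library; the repository id below is a placeholder, so substitute the actual MolmoAct model id from Ai2's Hugging Face page.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the actual MolmoAct model id on Ai2's Hugging Face page.
local_dir = snapshot_download(repo_id="allenai/MolmoAct-7B")
print(f"Model files downloaded to: {local_dir}")
```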