MolmoMotion: Language‑guided 3D motion forecasting plus 1.16M‑video dataset

In detail

Model: MolmoMotion takes an RGB observation, a set of 3D query points, and an action description as input, and outputs predicted future 3D point trajectories
MolmoMotion‑1M: 1.16 million videos with paired 3D point trajectories and action descriptions
PointMotionBench: human‑validated benchmark containing 2.7k video clips to measure object‑centric 3D motion forecasting accuracy
Code, model weights, data and a technical report are published (Hugging Face/GitHub/project page)
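
To make the input/output contract listed above concrete, here is a minimal Python sketch. It is not the released MolmoMotion API: the names ForecastRequest, ForecastResult, static_baseline and check_result are hypothetical, and the "model" is a trivial stand-in that predicts no motion at all. It only illustrates the shape of the problem: N query points in, N future paths of T steps out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastRequest:
    """Hypothetical container for the three inputs named in the brief."""
    rgb_shape: Tuple[int, int, int]  # (height, width, 3) of the RGB observation
    query_points: List[Point3D]      # N 3D points whose motion is forecast
    action: str                      # natural-language action description


@dataclass
class ForecastResult:
    """Hypothetical container for the output: one future path per query point."""
    trajectories: List[List[Point3D]]  # N paths, each with T future positions


def static_baseline(request: ForecastRequest, horizon: int) -> ForecastResult:
    """Stand-in for a real model: every point stays where it is for `horizon` steps."""
    return ForecastResult(
        trajectories=[[point] * horizon for point in request.query_points]
    )


def check_result(request: ForecastRequest, result: ForecastResult, horizon: int) -> bool:
    """Return True if there is exactly one path of length `horizon` per query point."""
    if len(result.trajectories) != len(request.query_points):
        return False
    return all(len(path) == horizon for path in result.trajectories)


request = ForecastRequest(
    rgb_shape=(480, 640, 3),
    query_points=[(0.0, 0.0, 1.0), (0.1, 0.2, 0.9)],
    action="push the cup to the left",
)
result = static_baseline(request, horizon=4)
```

A no-motion baseline like this is also a useful sanity floor: a forecasting model that cannot beat it on a given clip set is not adding information.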

Why it matters

Predictive 3D motion models matter for robotics planning and controllable video generation; public models plus a large labeled dataset lower the barrier for applied R&D and standardized evaluation.

For you: Evaluate MolmoMotion on representative tasks (robot grasping, trajectory‑conditioned video) and use PointMotionBench to quantify improvements before changing live planners or generators.
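
One way to quantify improvements is a displacement error over predicted versus reference trajectories. The sketch below computes average displacement error (ADE), a metric commonly used in trajectory forecasting. This brief does not state which metrics PointMotionBench reports, so treat ADE as an assumed stand-in, and the function name and data layout as hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def average_displacement_error(
    predicted: Sequence[Sequence[Point3D]],
    reference: Sequence[Sequence[Point3D]],
) -> float:
    """Mean Euclidean distance between matching predicted and reference 3D positions.

    Both arguments hold N trajectories with the same number of time steps each.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must contain the same number of trajectories")

    total = 0.0
    count = 0
    for pred_path, ref_path in zip(predicted, reference):
        if len(pred_path) != len(ref_path):
            raise ValueError("matching trajectories must have the same number of time steps")
        for pred_point, ref_point in zip(pred_path, ref_path):
            total += math.dist(pred_point, ref_point)  # Euclidean distance in 3D
            count += 1

    if count == 0:
        raise ValueError("no points to compare")
    return total / count


# One tracked point over two time steps; the second prediction is off by 2 units in z.
reference = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
predicted = [[(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]]
ade = average_displacement_error(predicted, reference)
```

Computing the same number for the current planner or generator inputs and for the candidate model, on the same clips, gives a like-for-like comparison before anything changes in production.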

Sources

Hugging Face