Meta has unveiled V-JEPA 2, a groundbreaking world model that achieves state-of-the-art performance in visual understanding and prediction within the physical world. This 1.2 billion-parameter model, built on the Joint Embedding Predictive Architecture (JEPA), enables zero-shot robot planning, allowing robots to interact with unfamiliar objects in new environments without additional task-specific training in those settings. It represents a significant step toward advanced machine intelligence (AMI) and the development of AI agents that can operate effectively in dynamic real-world settings.
World models are inspired by human physical intuition—the innate ability to predict how the world responds to our actions. For instance, we know a tossed ball will fall due to gravity, or we navigate crowded spaces by anticipating others’ movements. These models empower AI with three core capabilities: understanding observations (recognizing objects and actions in videos), predicting how the world evolves, and planning sequences of actions to achieve goals.
V-JEPA 2 is trained primarily on video through self-supervised learning, eliminating the need for extensive human annotation. Its architecture pairs an encoder, which maps raw video into semantic embeddings, with a predictor that forecasts future states in that embedding space. Training occurs in two stages: first, actionless pre-training on more than 1 million hours of diverse video plus a large collection of images, from which the model learns how objects move and interact; second, action-conditioned training on a small amount of robot data (just 62 hours) so the predictor also accounts for control actions, making it usable for planning and real-world control.
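To make the encoder/predictor split and the two training stages concrete, here is a minimal PyTorch sketch. The module sizes, class names, masking scheme, and loss shown here are illustrative assumptions for exposition, not Meta's released implementation.

```python
# Toy encoder/predictor pair in the spirit of the description above.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Maps a clip of video patches to semantic embeddings."""
    def __init__(self, patch_dim=768, embed_dim=1024, depth=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):                        # (B, N, patch_dim)
        return self.backbone(self.proj(patches))       # (B, N, embed_dim)

class Predictor(nn.Module):
    """Predicts embeddings of held-out/future patches, optionally
    conditioned on a robot action (the stage-2 setting)."""
    def __init__(self, embed_dim=1024, action_dim=7, depth=2, heads=8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, context_emb, action=None):
        if action is not None:                         # stage 2 only
            context_emb = context_emb + self.action_proj(action).unsqueeze(1)
        return self.core(context_emb)

# Stage 1 (actionless): predict target embeddings of held-out patches from
# the visible context. Stage 2 reuses the same objective but also feeds the
# predictor the robot action taken between observations.
encoder, predictor = VideoEncoder(), Predictor()
patches = torch.randn(2, 16, 768)                      # toy clip: 16 patches
context = encoder(patches[:, :8])                      # visible half
with torch.no_grad():                                  # simplified target;
    target = encoder(patches)[:, 8:]                   # real JEPA uses an EMA
                                                       # target encoder
pred = predictor(context, action=torch.randn(2, 7))    # stage-2 style call
loss = nn.functional.mse_loss(pred, target)
loss.backward()
```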
The model excels in zero-shot deployment, such as picking and placing objects in environments it has never seen. For short-horizon tasks, the robot is given a goal image; V-JEPA 2 evaluates candidate actions by imagining their outcomes and executes the best one via model-predictive control, replanning after each step. For longer-horizon tasks, the robot follows a series of visual subgoals, similar to visual imitation learning in humans, and reaches success rates of 65%–80% when picking and placing novel objects in unseen environments.
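The planning loop can be sketched as follows: score each candidate action by how close the predicted next embedding lands to the embedding of the goal image, then execute the winner and replan. The random-shooting sampler and L2 distance used here are illustrative assumptions, and the encoder/predictor are the toy modules from the previous sketch rather than the released model.

```python
# Receding-horizon planning sketch: pick the action whose predicted outcome
# is closest (in embedding space) to a goal image.
import torch

def plan_one_step(encoder, predictor, current_patches, goal_patches,
                  num_candidates=256, action_dim=7):
    with torch.no_grad():
        current_emb = encoder(current_patches)            # (1, N, D)
        goal_emb = encoder(goal_patches).mean(dim=1)      # (1, D), pooled goal

        # Randomly sampled candidate actions (e.g. end-effector deltas);
        # a stronger planner would refine them iteratively (e.g. CEM).
        candidates = torch.randn(num_candidates, action_dim)

        best_action, best_dist = None, float("inf")
        for action in candidates:
            pred_emb = predictor(current_emb, action=action.unsqueeze(0))
            dist = torch.norm(pred_emb.mean(dim=1) - goal_emb)
            if dist < best_dist:
                best_action, best_dist = action, dist
    return best_action  # execute on the robot, then replan from the new frame
```

For longer tasks, the same loop would simply be run against each visual subgoal in turn.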

To advance research, Meta has released three benchmarks for evaluating physical reasoning from video. IntPhys 2 tests the ability to distinguish plausible from implausible scenarios; MVPBench assesses understanding through minimal-change video pairs to avoid shortcuts; and CausalVQA focuses on cause-and-effect questions. While humans achieve 85%–95% accuracy on these, current models, including V-JEPA 2, show significant gaps, highlighting areas for improvement.
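As a rough illustration of how a world model can be scored on a plausible/implausible pair, one common approach is violation-of-expectation: the model's prediction error ("surprise") should be higher on the physically impossible clip. The pooling and error metric below are assumptions for exposition, not the official benchmark protocol, and the encoder/predictor are again the toy modules from the first sketch.

```python
# Surprise-based scoring sketch for a plausible/implausible video pair.
import torch
import torch.nn.functional as F

def surprise(encoder, predictor, patches, context_len=8):
    """Mean error between predicted and actual future embeddings."""
    with torch.no_grad():
        context = encoder(patches[:, :context_len])
        actual = encoder(patches)[:, context_len:]     # toy clips: same length
        predicted = predictor(context)                 # actionless prediction
        return F.mse_loss(predicted, actual).item()

def pair_correct(encoder, predictor, plausible, implausible):
    """Count the pair as correct if the implausible clip is more surprising."""
    return surprise(encoder, predictor, implausible) > \
           surprise(encoder, predictor, plausible)
```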
Looking ahead, Meta plans to develop hierarchical JEPA models for multi-scale planning and multimodal versions incorporating vision, audio, and touch. By open-sourcing V-JEPA 2 and its benchmarks, they aim to foster community collaboration and accelerate progress toward intelligent AI systems that enhance everyday life.