@Kseniase on Hugging Face: "12 Powerful World Models World models are one of the most challenging areas…"

Kseniase

posted an update 2 days ago

12 Powerful World Models

World models are one of the most challenging areas in AI, pushing the boundaries of reasoning, perception, and planning. They are generative AI systems that help models and agents learn internal representations of real-world environments.

Today, we invite you to take a look at 12 standout examples:

1. WorldVLA → WorldVLA: Towards Autoregressive Action World Model (2506.21539)
This autoregressive world model integrates action prediction and visual world modeling in a single framework, allowing each to enhance the other. It introduces an attention masking strategy to reduce action prediction errors

2. SimuRA → https://arxiv.org/abs/2507.23773
A general goal-oriented agent architecture that uses an LLM-based world model to simulate and plan actions before execution, enabling more general and flexible reasoning

3. PAN (Physical, Agentic, and Nested) world models → Critiques of World Models (2507.05169)
Has a hybrid architecture that combines discrete concept-based reasoning (via LLMs) with continuous perceptual simulation (via diffusion models), enabling rich multi-level, multimodal understanding and prediction

4. MineWorld by Microsoft Research → MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft (2504.08388)
Enables real-time, interactive world modeling in Minecraft by combining visual and action tokenization within an autoregressive Transformer. It uses parallel decoding for fast scene generation (4–7 FPS)

5. WorldMem → WORLDMEM: Long-term Consistent World Simulation with Memory (2504.12369)
Uses a memory bank with attention over time-stamped frames and states to maintain long-term and 3D spatial consistency in scene generation. This lets it reconstruct past scenes and simulate dynamic world changes across large temporal gaps
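
To make the WorldMem idea a bit more concrete, here is a toy sketch of attention over a memory bank of time-stamped frames and states. It is not the paper's implementation: `MemoryEntry`, `attend`, the 2D state vectors, and the `time_scale` penalty are all hypothetical names and choices made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MemoryEntry:
    timestamp: int             # step at which the frame was observed
    state: Tuple[float, ...]   # hypothetical state vector, e.g. a camera position
    frame: str                 # stand-in for real frame features


def attend(memory: List[MemoryEntry], query_state, query_time, time_scale=0.01):
    """Return softmax attention weights over the memory entries.

    Assumption for this sketch: an entry scores higher when its state is
    close to the query state, with a small penalty for older entries.
    """
    scores = []
    for entry in memory:
        dist2 = sum((a - b) ** 2 for a, b in zip(entry.state, query_state))
        age = abs(query_time - entry.timestamp)
        scores.append(-dist2 - time_scale * age)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


memory = [
    MemoryEntry(0, (0.0, 0.0), "frame_at_origin"),
    MemoryEntry(50, (5.0, 5.0), "frame_far_away"),
    MemoryEntry(90, (0.2, 0.1), "frame_near_origin"),
]
weights = attend(memory, query_state=(0.0, 0.0), query_time=100)
best = memory[weights.index(max(weights))]
```

In this toy setup, frames seen near the queried location keep a meaningful weight even when they are old, which is the intuition behind revisiting a place and getting a consistent scene back.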

Read further below ⬇️

If you like this, also subscribe to the Turing Post: https://www.turingpost.com/subscribe

Plus explore this article for a comprehensive overview of the history and current evolution of world models: https://www.turingpost.com/p/topic-35-what-are-world-models

Kseniase

2 days ago

6. iVideoGPT → https://huggingface.co/papers/2405.15223
Unifies visual observations, actions, and rewards into a single token sequence, enabling scalable, interactive world modeling of high-dimensional environments

7. MaskGWM → https://huggingface.co/papers/2502.11663
A world model for autonomous driving. It improves long-horizon and multi-view prediction by combining video generation with MAE-style feature-level context learning. Its innovations include scalable Diffusion Transformers, diffusion-aware mask tokens, and spatial-temporal masking

8. World-model-augmented (WMA) web agent → https://huggingface.co/papers/2410.13232
Pairs a world model with LLM-based web agents so that agents can simulate future outcomes in natural language and avoid mistakes in long-horizon tasks. The world model's transition-focused abstraction allows for efficient policy improvement

9. Navigation World Models from Meta → https://huggingface.co/papers/2412.03572
Allows agents to simulate and evaluate navigation trajectories before acting. Powered by a large Conditional Diffusion Transformer, NWM adapts to dynamic constraints and can plan in unfamiliar environments from a single input image

10. Cosmos World Foundation Models by NVIDIA → https://huggingface.co/papers/2501.03575
Include 3 model families: 1) Cosmos-Predict1 simulates how the visual world evolves over time, learning physical world dynamics from video clips; 2) Cosmos-Transfer1 guides world generation with multiple spatial control signals: segmentation, depth, edge maps, blurred visual inputs, etc.; 3) Cosmos-Reason1 reasons about what is happening, what will happen next, and what actions are feasible

11. DreamerV3, Google DeepMind → https://arxiv.org/abs/2301.04104
A single, general-purpose world model-based RL algorithm. It demonstrates robust, farsighted planning in complex environments without human data or reward shaping, and excels in tasks like collecting diamonds in Minecraft from scratch

12. Genie 2, Google DeepMind → https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/
Generates diverse training environments for embodied agents. From a single image prompt, it creates playable virtual worlds that both humans and AI systems can control via keyboard and mouse
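
Several entries above (Navigation World Models, the WMA web agent) share one pattern: imagine the outcome of candidate actions with a world model, score the imagined outcomes, then act. Below is a minimal sketch of that loop on a toy grid. `toy_world_model`, `rollout`, and `plan` are hypothetical stand-ins invented for this example, not code from any of the papers, and a real system would use a learned model and a smarter search than brute force.

```python
from itertools import product

ACTIONS = ["up", "down", "left", "right"]


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned model: predicts the next (x, y)."""
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = moves[action]
    return (state[0] + dx, state[1] + dy)


def rollout(state, actions):
    """Imagine a whole trajectory without touching the real environment."""
    for action in actions:
        state = toy_world_model(state, action)
    return state


def plan(start, goal, horizon=3):
    """Score every candidate action sequence in imagination and keep the best."""
    def cost(actions):
        x, y = rollout(start, actions)
        return abs(x - goal[0]) + abs(y - goal[1])  # Manhattan distance to goal

    candidates = product(ACTIONS, repeat=horizon)
    return min(candidates, key=cost)


best_actions = plan(start=(0, 0), goal=(2, 1))
final_state = rollout((0, 0), best_actions)
```

The point of the sketch is the separation of concerns: the planner only ever calls the model, so mistakes are made in imagination rather than in the real environment.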
