World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

Yuejiang Liu^†,*, Fan Feng^‡,*, Lingjing Kong^§,*, Weifeng Lu, Jinzhou Tang^‡,
Kun Zhang^§, Kevin Murphy^¶, Chelsea Finn^†, Yilun Du^||

^†Stanford University ^‡University of California, San Diego ^§Carnegie Mellon University
^¶Google DeepMind ^||Harvard University ^*Equal contribution

Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model needs to be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors, state plausibility and action reachability, and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2× higher sample efficiency while improving downstream policy performance by 18%.

Method: Self-Improving World Model with World Action Verifier

As illustrated above, the subgoal generator $p_\phi$ and the inverse model $h_\psi$ play complementary roles in verification: the former evaluates whether a candidate future is plausible, while the latter checks whether it is reachable under an inferred action. Instead of directly estimating model error, we reformulate verification through two asymmetric factors derived from Bayes’ rule:

$$ p(s_{t+1}\mid s_t,a_t) \propto \underbrace{p(s_{t+1}\mid s_t)}_{\text{state}} \cdot \underbrace{p(a_t\mid s_t,s_{t+1})}_{\text{action}} $$
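
For reference, the proportionality follows from Bayes’ rule applied to $a_t$ and $s_{t+1}$ given $s_t$. The short derivation below only makes the dropped normalizer explicit:

$$ p(s_{t+1}\mid s_t,a_t) = \frac{p(s_{t+1}\mid s_t)\, p(a_t\mid s_t,s_{t+1})}{p(a_t\mid s_t)} $$

The denominator $p(a_t\mid s_t)$ does not depend on $s_{t+1}$, so for a fixed state-action pair it acts as a constant when comparing candidate next states.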

Asymmetry 1: Easier State Verification via Broad Data Coverage. Directly modeling forward dynamics requires action-labeled transitions, which are scarce and expensive to collect. In contrast, verifying state plausibility only requires the marginal transition model $p(s_{t+1}\mid s_t)$, which can be trained from abundant action-free video data and therefore covers substantially more of the state manifold.

Asymmetry 2: Easier Action Verification via Low-Dimensional Structure. Verifying reachability in the full state space is challenging. However, in many tasks, actions depend only on a small subset of state features. We exploit this structure by learning a sparse inverse dynamics model, which reduces the problem from high-dimensional state matching to a simpler, low-dimensional action inference problem.
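
To make the low-dimensional structure concrete, here is a minimal toy sketch, not the paper's model. It assumes a hypothetical grid agent whose state is `[x, y]` followed by eight distractor features, and a feature mask fixed by hand (in the method described above the sparse structure is learned). The names `ACTIONS`, `sparse_inverse_model`, and `mask` are illustrative only.

```python
import numpy as np

# Hypothetical toy setup: state = [x, y, distractor_1, ..., distractor_8].
# Only x and y respond to actions; the distractors change arbitrarily.
ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right


def sparse_inverse_model(s_t, s_next, mask):
    """Infer the action from the masked (action-relevant) features only."""
    delta = (s_next - s_t)[mask]
    # Pick the action whose displacement is closest to the observed one.
    dists = {a: np.linalg.norm(delta - np.array(d)) for a, d in ACTIONS.items()}
    return min(dists, key=dists.get)


rng = np.random.default_rng(0)
# Assumption: the mask is given by hand here rather than learned.
mask = np.array([True, True] + [False] * 8)

s_t = np.concatenate([[3.0, 4.0], rng.normal(size=8)])
s_next = s_t.copy()
s_next[:2] += np.array(ACTIONS[3])   # the agent moves right
s_next[2:] = rng.normal(size=8)      # distractors are resampled

inferred = sparse_inverse_model(s_t, s_next, mask)
```

Because the distractor features never enter the comparison, the inferred action in this toy case is unaffected by how they change, which mirrors the intuition behind restricting the inverse model to a subset of state features.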

Method: Verification-Guided Self-Improving Cycle. Given the two verification criteria above, we connect them into a self-improvement loop for exploration:

$$ s_t \xrightarrow{p_\phi} \tilde{s}_{t+1} \xrightarrow{h_\psi} \hat{a}_t \xrightarrow{f_\theta} \hat{s}_{t+1} $$

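Here the subgoal generator $p_\phi$ proposes a candidate next state $\tilde{s}_{t+1}$, the inverse model $h_\psi$ infers the action $\hat{a}_t$ that would reach it, and the forward world model $f_\theta$ predicts $\hat{s}_{t+1}$ under that action. This closed-loop process enables self-improving exploration: the agent actively seeks out failures of its own predictions and uses them to refine its world model.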
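
The cycle can be sketched in a few lines, with each learned component replaced by a hypothetical one-dimensional stand-in. The Euclidean distance used as the discrepancy and the function names (`verification_score`, `accurate_forward`, `flawed_forward`) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def verification_score(s_t, subgoal_generator, inverse_model, forward_model):
    """Cycle-consistency gap between a proposed subgoal and the forward rollout."""
    s_tilde = subgoal_generator(s_t)      # p_phi: propose a plausible next state
    a_hat = inverse_model(s_t, s_tilde)   # h_psi: infer the action that reaches it
    s_hat = forward_model(s_t, a_hat)     # f_theta: roll the world model forward
    return float(np.linalg.norm(s_hat - s_tilde)), a_hat


# Hypothetical 1-D stand-ins for the three learned components.
def subgoal_generator(s):
    return s + 1.0                        # always proposes a step to the right


def inverse_model(s, s_next):
    return int(np.sign(s_next - s).item())  # -1, 0, or +1


def accurate_forward(s, a):
    return s + a


def flawed_forward(s, a):
    return s + 0.5 * a                    # underestimates the displacement


s0 = np.array([2.0])
good_score, _ = verification_score(s0, subgoal_generator, inverse_model, accurate_forward)
bad_score, _ = verification_score(s0, subgoal_generator, inverse_model, flawed_forward)
```

In this toy setting the accurate forward model closes the cycle exactly, while the flawed one leaves a nonzero gap, which is the kind of signal the loop would use to flag a transition for further data collection.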

Experiments

Robustness of Sparse Inverse Dynamics Models
We test forward world models (WM) and sparse inverse dynamics models (IDM) under controlled shifts:

Sample Efficiency: IDMs consistently outperform WMs, especially in low-data regimes.
Dimensionality: Increasing the number of objects degrades WM performance, while IDMs remain stable.
Stochasticity: Observation noise (e.g., noisy floor tiles) impacts WMs but has minimal effect on IDMs.

These results highlight the reliability of sparse inverse verification across key robustness factors.

MiniGrid OOD generalization and world model improvement

Sparse Inverse Verification for World Model Improvement
We examine the ability of models to generalize to unseen environments and improve world model learning:

Vanilla vs Sparse IDM: Sparse IDMs maintain strong performance on interaction-heavy actions, whereas vanilla IDMs fail in low-data regimes.
World Model Enhancement: WAV-guided exploration selects critical, interaction-rich transitions, achieving lower prediction error than baselines and closely matching the Oracle strategy.
Sample Prioritization: Verification scores effectively reflect transition difficulty, guiding efficient and informative data collection.

Together, these experiments demonstrate that sparse inverse verification improves robustness, guides exploration, and enhances world model learning in under-explored regimes.
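
One simple way to turn verification scores into a data-collection rule is to label the candidates with the largest gaps first. The sketch below assumes that a higher score means a less trustworthy prediction and uses a fixed labeling budget. The function `select_for_labeling` and the example scores are hypothetical, and the actual selection rule used in the experiments may differ.

```python
import numpy as np


def select_for_labeling(scores, budget):
    """Return indices of the `budget` candidates with the largest verification gap."""
    scores = np.asarray(scores, dtype=float)
    # Sort in descending order; a stable sort keeps the original order for ties.
    order = np.argsort(-scores, kind="stable")
    return order[:budget].tolist()


# Hypothetical cycle-consistency gaps for five candidate transitions.
candidate_scores = [0.02, 0.91, 0.10, 0.47, 0.03]
chosen = select_for_labeling(candidate_scores, budget=2)
```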

World Model Prediction Error across Data Budgets

World Model Prediction Across Data Budgets
We evaluate world-model learning quality on RoboMimic and ManiSkill tasks by measuring next-observation prediction MSE:

Performance Gains: Our approach consistently reduces prediction error across all data budgets compared to baselines.
Low-Data Advantage: Gains are particularly pronounced in low-data regimes, demonstrating efficient utilization of limited trajectories.
Sparse IDM Benefit: Sparse inverse dynamics models outperform dense variants, highlighting the role of selective latent action representation.

These results confirm that WAV-guided exploration and sparse modeling improve the accuracy of learned dynamics under diverse conditions.
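
For concreteness, next-observation MSE can be computed as below. Averaging over both transitions and observation dimensions is an assumption, since the exact reduction is not stated here, and the arrays are made-up values for illustration.

```python
import numpy as np


def next_observation_mse(predicted, target):
    """Mean squared error over all transitions and observation dimensions."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    return float(np.mean((predicted - target) ** 2))


# Hypothetical batch of two 3-dimensional next observations.
pred = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
true = [[0.0, 1.0, 4.0], [1.0, 2.0, 1.0]]
mse = next_observation_mse(pred, true)
```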

Impact on Policy Learning via Imagination
We evaluate whether improved world models translate to stronger downstream policy performance:

Higher Rewards: Policies trained with our world model consistently outperform baselines and approach Oracle-level performance.
Complex Task Advantage: Gains are most pronounced on high-ambiguity or contact-rich tasks (Can, Square, PokeCube), where accurate latent dynamics are critical.
Sample Efficiency: Improved world models accelerate policy tuning in imagination, reducing the need for extensive real-world interactions.

Overall, these results confirm that self-improved world models enhance both task-relevant dynamics representation and imagination-based planning.

Videos

Exploration Tasks in MiniGrid

We evaluate WAV on three long-horizon MiniGrid tasks that we designed specifically for this work (Key Delivery, Ball Delivery, and Object Matching), as well as on random play in EmptyEnv with noisy tiles.

Key Delivery

Ball Delivery

Object Matching

Random play in EmptyEnv

Prediction Comparison in MiniGrid

For a selected initial state and action, the comparison shows one-step predictions from Progress, Uncertainty, and Ours alongside the Ground Truth. Green borders indicate correct predictions, red borders indicate prediction errors, and the gold border denotes the ground-truth next state.

Qualitative comparison of world model predictions

The videos compare world model rollouts from different curation strategies. WAV produces predictions that closely match the ground truth, while world models learned with Uncertainty and Progress often yield inaccurate or overly blurred future frames.