Woofun AI reports that "world model" has become the most muddled term in artificial intelligence since 2025: OpenAI labels Sora a simulator, Genie turns images into walkable scenes, robotics firms claim the same status, NVIDIA positions Omniverse as infrastructure, and game engines have joined the narrative. Fei-Fei Li stepped in to clear up the confusion by returning to the classic reinforcement learning diagram, the closed loop of the partially observable Markov decision process (POMDP): an agent performs an action, the action changes the state, the state yields an observation, and the observation guides the next action. She argues that what is currently called a "world model" is actually three distinct projections of this loop: the renderer outputs pixels (observation), the simulator outputs state, and the planner outputs action. This taxonomy cuts through the noise created by companies such as Google and World Labs, and by systems such as Genie 3 and RTFM, all of which share one label that covers fundamentally different technical outputs.
The theoretical foundation for this classification is the standard account of how agents interact with their environment, formalized in the textbook by Sutton and Barto. The POMDP framework describes a cycle in which an agent, whether human, robot, or software, executes actions that alter the world's state, yet the agent can only perceive observations such as photons, sensor readings, or video pixels. The term "state" here refers not to chemical phases but to a physicist's complete description of the world at a given moment, encompassing every object, position, velocity, and attribute. This underlying reality is complete in theory but never directly observable; the agent only ever receives a partial, local view of it through observations. The concept of a "world model" traces back to Kenneth Craik's 1943 proposal that the mind reasons by running a "small-scale model" of reality, a notion carried into neural networks in the late 1980s and early 1990s. Today, the various systems labeled as world models are different projections of this same closed loop, each outputting a specific component of the cycle.
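The closed loop described above can be sketched in a few lines. This is a minimal toy, not anything from Li's essay: the one-dimensional track, the hidden velocity, the noise model, and the names `State`, `transition`, `observe`, `policy`, and `run_loop` are all assumptions chosen for illustration. The point it shows is that the policy only ever receives the observation, never the state.

```python
import random
from dataclasses import dataclass


@dataclass
class State:
    """Complete description of the toy world; the agent never reads this."""
    position: int  # location on a 1-D track
    velocity: int  # hidden from the agent


def transition(state: State, action: int) -> State:
    """World dynamics: the action nudges velocity, velocity moves position."""
    velocity = max(-2, min(2, state.velocity + action))
    return State(state.position + velocity, velocity)


def observe(state: State, rng: random.Random) -> int:
    """Partial, noisy view: position only, off by at most one cell."""
    return state.position + rng.choice([-1, 0, 1])


def policy(observation: int, goal: int) -> int:
    """Choose -1, 0, or +1 using the observation alone."""
    return (goal > observation) - (goal < observation)


def run_loop(steps: int = 20, goal: int = 5, seed: int = 0):
    """Run action -> state -> observation -> action for a number of steps."""
    rng = random.Random(seed)
    state = State(position=0, velocity=0)
    trace = []
    for _ in range(steps):
        obs = observe(state, rng)          # state -> observation
        action = policy(obs, goal)         # observation -> action
        state = transition(state, action)  # action -> next state
        trace.append((obs, action, state.position))
    return trace
```

Each entry of the returned trace holds the observation the agent saw, the action it chose, and the true position that resulted, which makes the gap between observation and state visible step by step.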
The first category, the renderer, focuses exclusively on outputting observations in the form of pixels designed for human visual consumption, where the primary metric is visual fidelity rather than physical correctness. A video model that turns text prompts into cinematic aerial shots functions as a renderer, as do interactive systems like Google's Genie 3 or World Labs' RTFM, which generate images in real time based on user input. These models lack an explicit understanding of three-dimensional structure; they generate what the viewer will see, not how things are actually arranged in physical space. Consequently, buildings in an aerial shot may appear flawless, but an attempt to navigate through the city below would make the scene fall apart, because the model does not represent the geometry or physics required to sustain the structure. The renderer's contract is purely visual, prioritizing aesthetic appeal over the structural integrity necessary for interaction or manipulation.
The second category, the simulator, outputs the state of the world, providing a faithful representation of geometry, physics, and dynamics that both humans and computer programs can compute with and interact with. Unlike the renderer, the simulator's contract is structural: geometry must withstand scrutiny, physics must follow Newton's laws, and dynamics must evolve as those laws predict. This category serves two distinct user groups simultaneously: professionals such as architects, designers, filmmakers, and game developers who need accuracy beyond visual credibility, and computer programs including reinforcement learning agents, robot controllers, and autonomous vehicles that use the simulator as a training ground. These programs interact with the simulated world at large scale to test scenarios that are dangerous, expensive, or impossible to execute in reality. The simulator operates at the level of the world itself, acting as the structural skeleton from which visual representations and action consequences can be derived.
The third category, the planner, outputs actions by answering the question of what an agent should do next given an observation and a goal. In many ways, the planner is the reverse process of the renderer; while the renderer takes actions as input to produce observations, the planner takes observations as input to produce actions, thereby closing the perception-action loop. Vision-language-action (VLA) models, model-based systems, and the new wave of world action models represent different attempts at building planners that can decide what a robot should do in an unstructured world. Despite their potential, these systems remain largely confined to highly constrained laboratory environments with limited object variety and short task durations. The gap between a stunning demonstration video and a robot that can reliably work in a kitchen, warehouse, or operating room remains vast, highlighting the significant distance between current capabilities and real-world deployment.
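The three contracts can be contrasted as function signatures over a toy world. The sketch below is an illustration under stated assumptions, not an implementation of any system named here: the five-cell corridor, the string "frames" standing in for pixels, and the functions `simulate`, `render_next`, and `plan` are all hypothetical. The renderer is deliberately written to work on pixels alone, so it wraps the agent around the edge of the frame, while the simulator enforces the wall.

```python
WIDTH = 5  # cells in a 1-D corridor (illustrative assumption)


def simulate(position: int, action: int) -> int:
    """Simulator: (state, action) -> next state. Walls are enforced."""
    return min(WIDTH - 1, max(0, position + action))


def render_next(frame: str, action: int) -> str:
    """Renderer: (previous frame, action) -> next frame.

    It only shifts the 'A' glyph in the picture. It has no notion of
    walls, so the agent wraps around the edge: plausible-looking
    pixels, physically wrong.
    """
    col = (frame.index("A") + action) % len(frame)
    return "." * col + "A" + "." * (len(frame) - col - 1)


def plan(frame: str, goal: int) -> int:
    """Planner: (observation, goal) -> action in {-1, 0, +1}."""
    col = frame.index("A")
    return (goal > col) - (goal < col)


# Pushing right against the wall: the two models disagree.
print(simulate(4, 1))             # 4, the wall stops the agent
print(render_next("....A", 1))    # A...., the picture wraps around

# The planner closes the loop using frames only.
frame = "A...."
for _ in range(3):
    frame = render_next(frame, plan(frame, goal=3))
print(frame)                      # ...A.
```

The disagreement at the wall is the toy version of a scene that looks right in the frame but does not survive interaction.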
Commercial maturity varies significantly across these three categories, with the renderer currently being the most developed and the planner the least mature. Google's Nano Banana model has brought renderer-level image generation capabilities to potentially hundreds of millions of users, proving the technology and market are real, yet the optimization goal remains visual credibility rather than physical accuracy. This ceiling is significant because renderer outputs, while beautiful, cannot be used to design buildings or train robots. In contrast, the commercial space for simulation is vast, with NVIDIA's Omniverse alone targeting a market size estimated to exceed a trillion dollars, covering factories, warehouses, supply chains, and digital twins. Robot training, autonomous driving testing, architectural visualization, engineering design, and drug discovery all rely on some form of simulation.
However, the field faces severe technical challenges, including a scarcity of three-dimensional data with explicit geometry and physical annotations compared to internet videos, the persistent sim-to-real gap, and the high computational cost of large-scale multiphysics simulations involving rigid bodies, deformable objects, fluids, and fabrics.
Woofun AI data shows that generative simulators introduce a new risk: AI-generated geometry may look correct yet contain self-intersections or incorrect proportions, which produce absurd results once the geometry is passed to a physics simulation.
The convergence trend suggests that the boundaries between rendering, simulation, and planning are beginning to dissolve, driven by the consensus that the knowledge required for all three is largely the same. A model that truly understands how a cup sits on a table, including its geometric shape, material properties, and response to forces, should be able to render that cup from any angle, simulate what happens when it is pushed, and plan how a hand should pick it up. World Labs' Marble represents a step in this direction: it accepts multimodal inputs to generate explorable 3D environments, outputting Gaussian splats for visual exploration and collision meshes for use in physics engines. This approach attempts to unify the renderer and simulator in one model, moving from passive output to interactive systems in which renderers respond to actions and the worlds generated by simulators become more controllable. The logical endpoint is a unified world model, a foundation model capable of rendering photorealistic views, generating physically accurate structures, and planning action sequences, switching between output modes based on downstream needs. This vision echoes the opening proposition of Ludwig Wittgenstein's 1921 Tractatus Logico-Philosophicus, "the world is all that is the case," suggesting that a rich enough world model contains everything an agent needs to see, build, and act within the world.
The ultimate trajectory of this evolution points toward spatial intelligence as the next frontier for AI, where the world model serves as the path for machines to understand, imagine, reason, and interact with the physical world. While language models give machines powerful control over concepts and reasoning, the physical world operates on a foundation of space and time, requiring a different statistical structure to learn how light falls on surfaces or how objects respond to forces. The merging of rendering, simulation, and planning, each of which has already supported multi-billion-dollar industries, will redefine the relationship between machine intelligence and the physical world it inhabits. This synthesis is not merely a technical adjustment but a fundamental shift in how machines perceive and manipulate reality, moving beyond the limitations of text-based abstraction to a comprehensive understanding of the physical universe. For further details on this functional taxonomy, readers can refer to the full analysis at drfeifei.substack.com.