On June 3, 2026, the World Labs team, in collaboration with Stanford University Professor Fei-Fei Li, published a conceptual analysis with a plain, almost unadorned title: "A Functional Classification of World Models." Its opening sentence states outright what much of the industry already recognizes: "World model is one of the most important and most misused terms in the field of artificial intelligence today."
The context of this statement is familiar to anyone who has followed the AI industry.
In February 2024, OpenAI released Sora, a video generation model, with a technical report titled "Video Generation Models as World Simulators." Jim Fan, NVIDIA's director of robotics, left a comment on LinkedIn that has since been repeatedly cited: Sora is essentially a world model that "only allows no-op as the sole action." Meanwhile, according to public reports, Tesla's AI team has repeatedly referred to the prediction components within its Full Self-Driving (FSD) system as a "world model" or "world simulator." Game engines, 3D generation tools, embodied intelligence models: all kinds of products and technologies are being thrown into the same basket and given the same label.
A video generator, an autonomous driving prediction network, a robot control model, a physics engine—what do they have in common? Almost nothing. But they are all called "world models."
This conceptual confusion, which has lasted for more than two years, is finally being systematically clarified. This time, Fei-Fei Li's team didn't release a new model, didn't announce a new benchmark, and didn't demonstrate any product features. Instead, they did something more fundamental: returning to the theory of partially observable Markov decision processes, they reduced all systems marketed as "world models" to three different functional projections of the same cognitive loop.
The three projection types are: renderer, simulator, and planner. Within World Labs' classification framework, Sora and similar video generation models fall under the renderer category.
Why can a single term contain so many contradictory meanings?
To understand the root of this chaos, we need to start by asking a more fundamental question: when a company says, "We are building a world model," what exactly is it talking about?
For OpenAI, Sora's goal is to "understand and represent the physical world in videos." According to technical reports, Sora learns from statistical patterns in massive amounts of video data to generate visually plausible images: a cup will break when dropped, a paper airplane will fly if released, and a person's legs will swing alternately when walking. These images appear to "understand physics."
For Tesla, the "world model" is a neural network in the FSD system that predicts the trajectories of road users over the next few seconds. It needs to output precise 3D position, velocity, and orientation for the path planning module to calculate safe driving decisions. This model doesn't need to output pixels; it outputs vectors and probability distributions.
For robotics companies, a "world model" is an internal simulation mechanism that allows a robotic arm to predict whether a cup will tip over if it is pushed 5 centimeters to the left. It needs to understand object properties, contact mechanics, and stability, and outputs a feasibility assessment of the action.
The three types of companies have completely different goals. Video generation companies are concerned with pixel fidelity, autonomous driving companies are concerned with the accuracy of physical state prediction, and robotics companies are concerned with the predictability of the consequences of actions. They are all building "world models," but they are fundamentally doing different things.
In its article, World Labs directly addresses the core issue: these systems are all given the same name because they do indeed represent a certain aspect of "understanding the world." However, each only completes one link in a complete cognitive cycle, yet they are packaged into a complete world model by marketing language, media reports, and capital narratives.
Another driving force behind the conceptual confusion is the appeal of the term itself. "World model" carries a grand narrative quality: it sounds more imaginative, and supports high valuations and funding narratives better, than "video generation model" or "video prediction model." When technological capabilities fail to meet public expectations, the concept inevitably degenerates into a marketing tool.
Going back to the 1960s, what should a complete "world model" have been?
World Labs' classification framework is based on a seemingly ancient theory: partially observable Markov decision processes (POMDPs).
This framework describes a complete cycle of interaction between an agent and its environment. The agent is in a certain environmental state; it performs an action, which changes the environmental state. The agent obtains partial observations through sensors, which trigger an internal state update. The updated understanding drives the next action. This cycle repeats continuously.
Within this framework, the complete functionality of a "world model" should include three stages: generating observations from states (pixels seen by the human eye or collected by sensors, point clouds, etc.), deducing the next state from actions and the current state (predicting physical changes), and generating actions from observations and goals (decision planning).
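
The three stages above can be sketched as three small functions wired into one loop. This is only a toy illustration, not anything taken from the World Labs article: the 1-D cart world, the zone names, and the functions `render`, `simulate`, `plan`, and `run_loop` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical toy world: a cart on a 1-D track. The agent never sees the
# exact position, only a coarse zone (partial observability).
ZONES = {"left": -1, "center": 0, "right": 1}


@dataclass(frozen=True)
class State:
    position: int  # true environment state, hidden from the agent


def render(state: State) -> str:
    """State -> observation: what a sensor or viewer would perceive."""
    if state.position < 0:
        return "left"
    if state.position > 0:
        return "right"
    return "center"


def simulate(state: State, action: int) -> State:
    """(state, action) -> next state: the transition step."""
    return State(position=state.position + action)


def plan(observation: str, goal: str) -> int:
    """(observation, goal) -> action: move one step toward the goal zone."""
    diff = ZONES[goal] - ZONES[observation]
    return (diff > 0) - (diff < 0)  # sign of diff: -1, 0, or +1


def run_loop(state: State, goal: str, max_steps: int = 10):
    """Run the observe -> decide -> act cycle until the goal zone is observed."""
    trace = []
    for _ in range(max_steps):
        observation = render(state)
        trace.append(observation)
        action = plan(observation, goal)
        if action == 0:
            break
        state = simulate(state, action)
    return state, trace


final_state, trace = run_loop(State(position=-2), goal="center")
print(final_state.position, trace)
```

Each function on its own covers only one segment of the cycle; only `run_loop`, which chains all three, closes the loop.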
Language models learn the statistical patterns of text sequences, while world models learn the statistical characteristics of space and time. How light reflects on different surfaces, how objects move under gravity, and how energy is transferred after a rigid body collision—these are the patterns that world models aim to capture.
In their article, the World Labs team points out that all systems currently on the market that are called "world models" are actually projections of a single functional segment of the complete cycle described above. Some systems only render "from state to observation," some only perform state deduction "from action to the next state," and some only plan "from observation to action." Each captures one segment of the cycle, yet each is labeled as if it were the whole cycle.
The value of this analytical framework lies in providing a basis for comparison that transcends marketing rhetoric. No matter how a company packages its product, once it's placed back into the POMDP cycle to examine its inputs, outputs, and missing components, its capability limitations become readily apparent.
Renderer, simulator, planner: the capability boundaries of three projections
In World Labs' taxonomy, the first category is defined as "renderers." Its core goal is to generate high-fidelity pixel output oriented towards human visual perception. The input is a representation of some environmental state (which can be a text description, 3D scene parameters, or implicit encoding), and the output is a series of consecutive frames.
The renderer is optimized for visual realism rather than physical accuracy. The World Labs article explicitly points out that renderer-generated buildings may appear "shaky" because the model doesn't actually solve the structural mechanics equations; liquid splashes may look realistic, but the liquid volume, flow rate, and impact force may be completely inconsistent with real-world physical quantities. Therefore, such models cannot be used for architectural design, robot training, or other tasks requiring physically accurate simulation.
Google's Genie 3, various text-to-video models, and almost all AI video generation tools belong to this category. Sora is certainly among them.
The second category is the "simulator." Its core goal is not to generate visuals for humans, but to generate precise states that can be used for subsequent calculations. The input is the current environmental state and external forces (or actions), and the output is the next state that is physically and geometrically faithful to the laws of the real world. The state output by the simulator can be used for stress analysis, energy consumption calculation, collision detection, and can also be used as input to a renderer to generate visuals, but its core value lies in the computability of the state itself.
NVIDIA Omniverse is a typical example of this type of system. It is not a native AI model, but a digital twin platform that integrates a traditional physics engine with AI-accelerated computation. The World Labs article notes that simulators serve as a bridge between rendering and planning, but that the scarcity of high-quality 3D physical annotation data is a major bottleneck: by its estimate, the data available for training such models is orders of magnitude smaller than the video data available on the internet.
The third category is the "planner." Its inputs are observation data (camera footage, LiDAR point clouds, tactile sensor readings, etc.) and target instructions, and its output is the action to perform next. VLA (Vision-Language-Action) models and World Action Models belong to this category.
The difference between the three categories is not a minor divergence in technical approaches, but a fundamental functional differentiation. Renderers output pixels for humans to see, simulators output states for machines to calculate, and planners output actions for executors to run. A system can possess multiple capabilities simultaneously, but when most systems called "world models" essentially only perform rendering, equating "rendering" with "understanding the world" is a serious cognitive mismatch.
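
One way to make this distinction concrete is to classify a system purely by what it consumes and what it produces. The sketch below is an illustration under assumptions, not part of the World Labs article: the `SystemProfile` type, the label strings such as `"state"` and `"pixels"`, and the example profiles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical description of a system by its inputs and outputs."""
    name: str
    inputs: frozenset   # what the system consumes
    outputs: frozenset  # what the system produces


def classify(profile: SystemProfile) -> list:
    """Return every projection whose input/output signature the system covers."""
    roles = []
    # Renderer: state representation in, pixels out.
    if "state" in profile.inputs and "pixels" in profile.outputs:
        roles.append("renderer")
    # Simulator: state plus action in, next state out.
    if {"state", "action"} <= profile.inputs and "next_state" in profile.outputs:
        roles.append("simulator")
    # Planner: observation plus goal in, action out.
    if {"observation", "goal"} <= profile.inputs and "action" in profile.outputs:
        roles.append("planner")
    return roles


# Illustrative profiles only; they do not describe any specific product.
video_generator = SystemProfile(
    "text-to-video generator", frozenset({"state"}), frozenset({"pixels"})
)
physics_twin = SystemProfile(
    "physics digital twin", frozenset({"state", "action"}), frozenset({"next_state"})
)
vla_policy = SystemProfile(
    "vision-language-action policy",
    frozenset({"observation", "goal"}),
    frozenset({"action"}),
)

for system in (video_generator, physics_twin, vla_policy):
    print(system.name, "->", classify(system))
```

A system with a richer signature can return more than one role, which matches the point that a single system may possess multiple capabilities at once.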
A two-year debate: Is Sora a world model?
In February 2024, OpenAI released Sora, with the technical report titled "Video Generation Models as World Simulators." This terminology immediately sparked heated debate in academia and the developer community.
Supporters argue that Sora's generated videos demonstrate consistency in 3D space, object persistence, and a certain intuitive understanding of physical interactions. Details such as a bitten hamburger leaving teeth marks and a dog running in the snow kicking up snowflakes suggest the model has learned some physical laws.
The core argument of the opponents comes from the classic definition of a world model in reinforcement learning: a world model must be able to predict state transitions based on actions. That is, given the current state and an action input, the model should output the next state after the action. Sora cannot do this. Users cannot tell Sora "push that cup from the left" and then observe whether the cup will fall, in which direction, and where the fragments will fly.
Jim Fan's comment pinpoints this contradiction precisely: "Sora is essentially a world model, except that it only allows no-op as the sole action." This means that Sora does predict changes in the environment over time, but these changes are not influenced by any external intervention and can only unfold along the inherent causal chains within the video data. It's not performing interactive deduction, but rather continuing a passively observed sequence.
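
The gap between an action-conditioned model and a "no-op only" one can be shown with a toy example. This is a minimal sketch under assumptions, not a description of how Sora works: the 1-D object with a position and a velocity, and the functions `step`, `passive_rollout`, and `interactive_rollout`, are hypothetical names invented for illustration.

```python
def step(position: float, velocity: float, push: float = 0.0):
    """Action-conditioned transition: a push changes the velocity first,
    then the object advances by one time step."""
    velocity = velocity + push
    return position + velocity, velocity


def passive_rollout(position: float, velocity: float, steps: int) -> list:
    """A 'no-op only' predictor: it continues the sequence but accepts
    no external action, so the push is always zero."""
    frames = []
    for _ in range(steps):
        position, velocity = step(position, velocity)
        frames.append(position)
    return frames


def interactive_rollout(position: float, velocity: float, pushes: list) -> list:
    """An interactive predictor: each step is conditioned on an action."""
    frames = []
    for push in pushes:
        position, velocity = step(position, velocity, push)
        frames.append(position)
    return frames


print(passive_rollout(0.0, 1.0, 3))                    # [1.0, 2.0, 3.0]
print(interactive_rollout(0.0, 1.0, [0.0, -2.0, 0.0]))  # [1.0, 0.0, -1.0]
```

With every push set to zero, the interactive rollout reproduces the passive one exactly; the passive predictor is the special case in which the only available action is to do nothing.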
On Reddit's r/MachineLearning, many reinforcement learning researchers have voiced sharper criticism: a system that cannot predict state transitions based on actions cannot be called a world model; it can only be called a video prediction model.
World Labs' classification framework provides a definitive answer to this debate. In the POMDP loop, actions are the key input driving state transitions; a system lacking this input is merely a projection of the "observation generation" stage in the complete cognitive loop. Sora is a renderer, not a complete world model, much less a world simulator.
But this doesn't mean Sora has no value. Renderers solve a different problem: how to generate visuals that meet human visual expectations. This problem is extremely difficult in itself and has enormous commercial value. The problem is that packaging rendering capabilities as the ability to "understand the world" misleads technology decision-makers and investors, giving the false impression that these models already possess the ability to perform physical deductions or embodied interactions.
The industrial value of concept clarification
Clarifying the boundaries of the "world model" definition is not merely an academic exercise in wordplay. It directly impacts technology selection, investment decisions, and the public's understanding of AI capabilities.
For a manufacturing company evaluating whether to use a particular "world model" for robot training, figuring out whether the model is a renderer, simulator, or planner is essential to avoid millions of dollars in trial and error. A model that can only generate video footage, no matter how realistic the footage, cannot replace accurate calculations of the forces acting on objects, their trajectories, and the consequences of collisions.
For investment institutions, distinguishing between the three types of projections means they can more accurately identify a project's position within the technology stack. A startup that bills itself as a "world model" company but whose product is essentially a renderer will compete with video generation companies, not with digital twin platforms or robot control models. This directly determines how market size is estimated and which benchmark companies are selected.
For academia, clear classification is a prerequisite for establishing comparable benchmarks. If the term "world model" continues to be generalized, researchers will find it difficult to define what constitutes an improvement or a breakthrough, and peer review will be based on ambiguity.
World Labs also points out in its article that clarifying concepts is not about creating conflict. The future direction will be the fusion of the three types of projection. A model that truly understands the physical properties of a cup should be able to simultaneously render its visual appearance, simulate the physical process of it being knocked over, and plan how a robotic arm can stably grasp it. But until the technology develops to that point, recognizing the boundaries of each is more realistic than imagining fusion.
According to the World Labs article, simulators and digital twin technologies, represented by NVIDIA Omniverse, are targeting a potential market exceeding one trillion dollars in areas such as factories, warehouses, and supply chains. This figure comes from the vendors' own assessment, and when the market will truly reach this scale depends on whether simulators can overcome the bottleneck of scarce high-quality 3D physical data.
For the AI industry at its current stage, the most important understanding may be quite simple: generating realistic videos does not equate to understanding the physical world; being called a world model does not mean a system is actually simulating the world. Looking beyond marketing rhetoric and examining what inputs a system actually receives, what results it outputs, and which steps of the POMDP loop it is missing is the most honest way to judge the boundaries of its technological capabilities.



