Releasing PSIv0.5

What PSIv0.5 Is

A few months ago, we introduced Probabilistic Structure Integration (PSI), an approach to richly controllable world modeling. Today, we're releasing PSIv0.5, an 8B-parameter PSI instance.

PSIv0.5 is an RGBCFD model, meaning it integrates the following modes: RGB, the standard pixel description of video frames; C, camera motion between frames; F, optical flow between adjacent frames; and D, depth of a given frame. PSIv0.5 was trained on a large dataset of real-world videos, similar to the one described in the PSI paper.

PSIv0.5 treats visual reasoning tasks as sequence-completion problems over visual tokens. A task is defined by a sequence of RGB, C, F, and D tokens, together with a choice of which tokens are observed and which ones the model should predict. Under this view, next-frame prediction, interpolation, motion estimation, and controlled generation are not separate capabilities, but different queries to the same autoregressive model. This lets us use intermediate structure, such as flow and depth, as a control surface for interacting with scenes: poking objects, opening doors, folding paper, or asking how the world might evolve under a different motion.

Figure: initial paper frame. A sparse flow prompt is densified into f01 and used to predict rgb1.
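To make the idea of a sparse flow prompt concrete, here is a minimal sketch of how a few "pokes" could be laid out before the model densifies them. Everything in it is an assumption made for illustration: the function name `sparse_flow_prompt`, the (x, y, dx, dy) poke format, and the flow-plus-mask layout are hypothetical and are not PSIv0.5's actual input format, which is covered by the usage instructions on the model page.

```python
def sparse_flow_prompt(height, width, pokes):
    """Place a few known motion vectors on an otherwise empty flow grid.

    Hypothetical layout, not the PSIv0.5 API.
    pokes: iterable of (x, y, dx, dy), meaning the pixel at column x, row y
    moves by (dx, dy) pixels between frame 0 and frame 1.
    Returns (flow, known): flow[y][x] is a (dx, dy) tuple and known[y][x]
    is True only where a poke was supplied.
    """
    flow = [[(0.0, 0.0)] * width for _ in range(height)]
    known = [[False] * width for _ in range(height)]
    for x, y, dx, dy in pokes:
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"poke at ({x}, {y}) is outside the frame")
        flow[y][x] = (float(dx), float(dy))
        known[y][x] = True
    return flow, known


# One poke: push the pixel at column 2, row 1 three pixels to the right.
flow, known = sparse_flow_prompt(height=4, width=6, pokes=[(2, 1, 3, 0)])
```

The mask matters because a zero vector is ambiguous on its own: it could mean "this pixel stays still" or "nothing is specified here". Keeping the two apart is what leaves the unspecified locations for the model to fill in.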

Below are some of the token sequences that can be used to prompt the model. Tokens to the left of -> are observed, and tokens to the right are predicted. A group that appears on both sides, such as f01, is supplied only partially and completed by the model.

"rgb0->rgb1" next-frame generation
"rgb0,c01->rgb1" camera-conditioned next-frame prediction
"rgb0->f01" motion prediction from visual context
"rgb0,f01->rgb1" flow-conditioned next-frame generation
"rgb0,f01->f01,rgb1" sparse-flow-conditioned generation
"rgb0,rgb2->rgb1" RGB interpolation
"rgb0,d0,f01->f01,d1,rgb1" depth- and partial-flow-conditioned next-frame prediction

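The prompt strings above follow a regular pattern, so they can be split into observed and predicted token groups mechanically. The sketch below is a hypothetical helper and not part of the PSIv0.5 release: the names `TokenGroup`, `parse_prompt`, and `completed_groups` are invented here, and it assumes single-digit frame indices, as in the examples.

```python
import re
from dataclasses import dataclass

# Group names as used in the post: rgb<i> and d<i> refer to one frame,
# c<i><j> and f<i><j> refer to a pair of frames (camera motion, flow).
_GROUP = re.compile(r"^(rgb|d|c|f)(\d+)$")
_NUM_FRAMES = {"rgb": 1, "d": 1, "c": 2, "f": 2}


@dataclass(frozen=True)
class TokenGroup:
    modality: str   # "rgb", "d", "c", or "f"
    frames: tuple   # frame indices, e.g. (0,) or (0, 1)


def parse_group(name):
    match = _GROUP.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized token group: {name!r}")
    modality, digits = match.groups()
    # Assumption: one digit per frame index.
    if len(digits) != _NUM_FRAMES[modality]:
        raise ValueError(f"wrong number of frame indices in {name!r}")
    return TokenGroup(modality, tuple(int(d) for d in digits))


def parse_prompt(prompt):
    """Split 'observed->predicted' into two lists of TokenGroup."""
    if prompt.count("->") != 1:
        raise ValueError("prompt must contain exactly one '->'")
    left, right = prompt.split("->")
    observed = [parse_group(g) for g in left.split(",")]
    predicted = [parse_group(g) for g in right.split(",")]
    return observed, predicted


def completed_groups(observed, predicted):
    """Groups on both sides: given partially, then completed by the model."""
    return [g for g in predicted if g in observed]


observed, predicted = parse_prompt("rgb0,f01->f01,rgb1")
print(completed_groups(observed, predicted))
# prints [TokenGroup(modality='f', frames=(0, 1))]
```

Treating the prompt as two ordered lists mirrors the sequence-completion framing: the same parser covers next-frame prediction, interpolation, and flow-conditioned generation, and only the contents of the two lists change.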
What PSIv0.5 Is Good For

PSIv0.5 is especially good at fine-grained control in complex physical scenarios, such as poking objects, opening doors, or folding paper. It can also be used to estimate statistical properties of motion in a scene; we'll share more on that in a later note.

What PSIv0.5 Is Not Intended For

Fast video rendering. PSIv0.5 is designed for structured prediction, counterfactual prompting, and precise control, rather than for high-throughput video generation.

Code

Check out the PSIv0.5 model page for access to the model and instructions on how to use it.