
PSI: Richly Controllable Physical World Modeling

In the PSI paper, we introduced a method for building a richly controllable world model. Here, we use the newly released PSI-0.5 model to run a variety of complex physical prompts of the kind that arise in robotic manipulation.

In PSI, control commands become prompt handles, so the model starts to feel less like an image sequence generator and more like a visual analogue of an LLM: the prompt determines the capability. In practice, precise controls are just different prompt strings and sparse token values. The simplest prompt asks PSI to continue a scene by predicting the next RGB frame; in the examples below, we show how different prompts can elicit complex and precise behaviors from the model.

PSI Prompt
from PIL import Image
from transformers import AutoModel

predictor = AutoModel.from_pretrained(
    "StanfordNeuroAILab/psi0_5",
    trust_remote_code=True,
    device="cuda",
)

rgb1 = predictor.generate(
    "rgb0->rgb1",
    rgb0=Image.open("scene.png"),
)
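
The prompt strings used throughout this post follow an "inputs->outputs" pattern with comma-separated token names. As a rough illustration of that grammar only, here is a minimal sketch of how such a string could be split. The helper `parse_prompt` is hypothetical and is not part of the PSI package; how PSI itself parses prompts is not described here.

```python
from typing import List, Tuple


def parse_prompt(prompt: str) -> Tuple[List[str], List[str]]:
    """Split 'inputs->outputs' into two lists of token names (hypothetical helper)."""
    if prompt.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {prompt!r}")
    left, right = prompt.split("->")
    # Strip whitespace so stray spaces around names do not matter.
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError(f"prompt needs at least one input and one output: {prompt!r}")
    return inputs, outputs


inputs, outputs = parse_prompt("rgb0,f01->f01,rgb1")
print(inputs)   # ['rgb0', 'f01']
print(outputs)  # ['f01', 'rgb1']
```

Read this way, a token such as f01 that appears on both sides is given sparsely as input and returned densely as output.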

Intuitive Physics


Unconditional rollouts and simple physical scenes where PSI predicts how objects continue moving through contact, gravity, and momentum.

[Figure: Ball Roll. RGB0 context frame, context stack, and predicted RGB frames up to RGB5.] With no prompt patches, PSI rolls forward one RGB frame at a time; each new prediction then joins the context stack.
Ball Roll
frames = predictor.generate(
    "rgb0->rgb1,rgb2,rgb3,rgb4,rgb5,rgb6,rgb7",
    rgb0=Image.open("ball_ramp.png"),
)
[Figure: Car Driving. RGB0 context frame, future stack, and PSI's unconditional RGB predictions up to RGB3.]

This unconditional prediction asks PSI to continue the scene from a single roundabout frame, one RGB frame at a time, feeding each generated frame back into the context for the next prediction.

Car Driving
rgb1, rgb2, rgb3 = predictor.generate(
    "rgb0->rgb1,rgb2,rgb3",
    rgb0=Image.open("car_roundabout.png"),
)
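
The feed-back loop described for these unconditional rollouts can be sketched independently of the model. This is a minimal illustration only: `rollout` and the stand-in `step` function are hypothetical, and PSI may well handle the loop internally when several output frames are requested in one prompt.

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")


def rollout(step: Callable[[List[Frame]], Frame], first: Frame, n_steps: int) -> List[Frame]:
    """Generate n_steps frames, appending each prediction to the context."""
    context = [first]
    for _ in range(n_steps):
        # The new frame becomes part of the context for the next call.
        context.append(step(context))
    return context[1:]


# Stand-in predictor: "frames" are integers and the next frame is last + 1.
frames = rollout(lambda ctx: ctx[-1] + 1, first=0, n_steps=3)
print(frames)  # [1, 2, 3]
```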

[Figure: Block-Slide, in three stages. (1) Unconditional flow: RGB0 ground truth and an empty F01 grid go in, and PSI predicts the optical flow F01. (2) Apply flow to RGB: the RGB0 context plus the F01 motion field as conditioning yield a dense RGB1 prediction. (3) Autoregressive future-frame rollout: each predicted RGB frame joins the context stack, up to RGB5.]

Starting from GT RGB0, PSI first predicts the motion and the next frame unconditionally with rgb0->f01,rgb1. We then ask for rgb2 through rgb5 in one future-frame generation call.

Block-Slide
f01, rgb1 = predictor.generate(
    "rgb0->f01,rgb1",
    rgb0=Image.open("block_slide_rgb0.png"),
)

rgb2, rgb3, rgb4, rgb5 = predictor.generate(
    "rgb0,rgb1->rgb2,rgb3,rgb4,rgb5",
    rgb0=Image.open("block_slide_rgb0.png"),
    rgb1=rgb1,
)

A sparse move patch on the hand pushes the support to the right, and PSI infers the coupled object motion from that local intervention.

[Figure: Support Push. GT RGB0 input; rightward move patch on the hand; RGB1 prediction.]
Support Push
rgb1 = predictor.generate(
    "rgb0,f01->rgb1",
    rgb0=Image.open("support_push.png"),
    f01=flow_prompt(patches=[((70, 221), (168, 221))]),  # move hand right
)
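
To make the idea of a sparse move patch concrete, here is a minimal sketch of one way such a prompt could be represented: a mapping from patch-grid cells to pixel displacements, with every other cell left unconstrained. The helper `sparse_flow`, the 16-pixel patch size, and the dictionary layout are all assumptions for illustration; they are not the `flow_prompt` helper used in this post. Some snippets here write a patch as (start, end), while others appear to give (start, displacement); the sketch assumes the former.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[int, int]


def sparse_flow(
    moves: Iterable[Tuple[Point, Point]] = (),
    holds: Iterable[Point] = (),
    patch_size: int = 16,  # assumed patch size, not taken from PSI
) -> Dict[Point, Tuple[int, int]]:
    """Map patch-grid cells (col, row) to (dx, dy) pixel displacements.

    Cells absent from the result are unconstrained; holds get zero flow.
    """
    flow: Dict[Point, Tuple[int, int]] = {}
    for (x, y) in holds:
        flow[(x // patch_size, y // patch_size)] = (0, 0)
    for (x0, y0), (x1, y1) in moves:
        flow[(x0 // patch_size, y0 // patch_size)] = (x1 - x0, y1 - y0)
    return flow


prompt = sparse_flow(moves=[((70, 221), (168, 221))], holds=[(300, 400)])
print(prompt)  # {(18, 25): (0, 0), (4, 13): (98, 0)}
```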

The same apple scene branches into two PSI predictions from sparse flow prompts: one pushes the apple backward into the box, while the other rotates the apple in place.

[Figure: Apple Push and Rotate. Push branch: GT RGB0, sparse f01 inward push prompt, rgb1 prediction (pushed apple). Rotate branch: GT RGB0, sparse f01 counter-rotating prompts, rgb1 prediction (rotated apple).]
Apple Push and Rotate
push_flow, push_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("apple_rgb0.png"),
    f01=flow_prompt(
        patches=[((400, 286), (-111, -34))],  # push apple inward
    ),
)

rotate_flow, rotate_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("apple_rgb0.png"),
    f01=flow_prompt(
        patches=[
            ((343, 234), (-74, -22)),  # top turns left
            ((268, 399), (78, 21)),    # bottom turns right
        ],
    ),
)

Both billiards paths use the same initial frame and a zero camera-motion token. The only difference is the RGB patch prompt on the cue ball: one moves right to hit the second ball, and the other moves up-left away from it.

[Figure: Billiards. Dense RGB0 input; sparse RGB1 + C01 prompt (move patch, c01 = 0); dense RGB1 prediction in which the cue ball hits the second ball.] The green copied patch moves the cue ball toward the second ball, while c01 = 0 keeps the camera fixed.
Billiards
zero_camera = camera_prompt(
    translation=(0.0, 0.0, 0.0),
    rotation=(0.0, 0.0, 0.0),
)

hit_second_ball = predictor.generate(
    "rgb0,rgb1,c01->rgb1",
    rgb0=Image.open("billiards_rgb0.png"),
    rgb1=rgb_prompt(poke=((390, 615), (710, 615))),
    c01=zero_camera,
)

move_away = predictor.generate(
    "rgb0,rgb1,c01->rgb1",
    rgb0=Image.open("billiards_rgb0.png"),
    rgb1=rgb_prompt(poke=((390, 615), (90, 315))),
    c01=zero_camera,
)
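
One way to picture the copied-patch RGB prompt in the billiards example is as a mostly empty rgb1 canvas that contains a single patch of rgb0 pasted at a new location. The sketch below illustrates that idea on a toy image. The helper `copy_patch`, the use of None for unprompted pixels, and the top-left-corner convention are assumptions for illustration, not the `rgb_prompt` helper used in this post.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Canvas = List[List[Optional[Pixel]]]


def copy_patch(image: List[List[Pixel]], src: Tuple[int, int],
               dst: Tuple[int, int], size: int) -> Canvas:
    """Return a sparse canvas that is None everywhere except a size x size
    patch copied from top-left corner `src` in `image` to top-left `dst`."""
    height, width = len(image), len(image[0])
    canvas: Canvas = [[None] * width for _ in range(height)]
    (sx, sy), (dx, dy) = src, dst
    for row in range(size):
        for col in range(size):
            inside_src = 0 <= sy + row < height and 0 <= sx + col < width
            inside_dst = 0 <= dy + row < height and 0 <= dx + col < width
            if inside_src and inside_dst:
                canvas[dy + row][dx + col] = image[sy + row][sx + col]
    return canvas


# 4x4 toy image whose pixel at (x, y) is (x, y, 0).
image = [[(x, y, 0) for x in range(4)] for y in range(4)]
sparse_rgb1 = copy_patch(image, src=(0, 0), dst=(2, 1), size=2)
print(sparse_rgb1[1][2])  # (0, 0, 0): the source pixel, now at x=2, y=1
print(sparse_rgb1[0][0])  # None: unprompted location
```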

Cause And Effect


Controlled interventions where a small prompt changes what happens next in the scene.

The same RGB0 branches into two PSI predictions from a sparse prompt at the same card-deck location: one moves left into the glasses case, while the other moves right and away from it.

[Figure: Sliding Card Deck. Hit-case branch: GT RGB0, sparse leftward f01 prompt on the card deck, rgb1 prediction. Move-away branch: GT RGB0, sparse rightward f01 prompt on the card deck, rgb1 prediction.]
Sliding Card Deck
hit_flow, hit_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("card_deck.png"),
    f01=flow_prompt(poke=((305, 259), (245, 259))),  # slide left into case
)

move_flow, move_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("card_deck.png"),
    f01=flow_prompt(poke=((305, 259), (355, 259))),  # slide right away
)

A clockwise bowl-rotation prompt changes the downstream state of the fruit and bowl. PSI densifies the motion, predicts the next depth map, and renders RGB1.

[Figure: Mango Bowl. GT RGB0; clockwise rotation prompt; RGB1 intermediate keyframe; RGB2 target keyframe.]
Mango Bowl
dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=Image.open("mango_bowl_rgb0.png"),
    d0=Image.open("mango_bowl_d0.png"),
    f01=flow_prompt(
        patches=[((246, 282), "clockwise")],
    ),
)

A hold patch at the base keeps the can grounded while a down-right push patch on the lid specifies the partial observation. PSI explains that observation as a deformation, crushing the can rather than translating it rigidly.

[Figure: Coke Can Crush. GT RGB0 (upright can); RGB1 prompt with a hold patch at the can base and a down-right push patch on the lid; RGB1 target frame (crushed can).]
Coke Can Crush
rgb1 = predictor.generate(
    "rgb0,rgb1->rgb1",
    rgb0=Image.open("can_rgb0.png"),
    rgb1=rgb_prompt(
        holds=[((144, 222), (0, 0))],      # hold base
        pushes=[((160, 76), (24, 36))],    # push lid down-right
    ),
)

A single down-right flow patch on the top of the first book initiates the motion. PSI propagates that contact through the row, rendering a depth-aware next frame.

[Figure: Books Falling. GT RGB0; sparse f01 down-right push prompt; RGB1 intermediate frame; RGB2 final frame.]
Books Falling
dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=Image.open("books_rgb0.png"),
    d0=Image.open("books_d0.png"),
    f01=flow_prompt(
        pushes=[((140, 212), (236, 260))],  # top book, down-right
    ),
)

A single sparse optical-flow prompt tells PSI where the push should begin. PSI then densifies that local cue into a motion plan for the scene and renders the pushed-box future in RGB.

[Figure: Box Push. GT RGB0; rightward sparse prompt (move arm); rgb1 frame; rgb2 continuation.]
Box Push
dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=Image.open("box.png"),
    d0=Image.open("box_d0.png"),
    f01=flow_prompt(poke=((72, 145), (202, 145))),
)

rgb2 = predictor.generate(
    "rgb0,rgb1->rgb2",
    rgb0=Image.open("box.png"),
    rgb1=rgb1,
)

The lifted lamp frame is used as an RGB conditioning target. PSI renders the coupled change in illumination across the wall and tabletop without a dense-flow stage.

[Figure: Moving Lamp. GT RGB0; RGB1 prompt (lifted-lamp conditioning frame); rgb1 frame.]
Moving Lamp
rgb1_prompt = Image.open("desk_lamp_rgb1.png")

rgb1 = predictor.generate(
    "rgb0,rgb1->rgb1",
    rgb0=Image.open("desk_lamp.png"),
    rgb1=rgb1_prompt,
)

Articulated Objects & Mechanisms


Objects with constrained parts, handles, caps, pages, or mechanisms that move through structured articulation.

The lid follows a hinge-constrained motion plan: PSI closes the screen from the open frame to the final closed state.

[Figure: Closing a Laptop. GT RGB0 (open); sparse f01 closing prompt; dense f01 generated by PSI; rgb1 (half-closed).]
Closing a Laptop
dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=Image.open("laptop_rgb0.png"),
    d0=Image.open("laptop_d0.png"),
    f01=hinge_flow_prompt(
        screen="lid",
        patches=[((443, 80), (-93, 118))],  # close toward hinge
    ),
)

A hold patch keeps the pen body in place while a move patch pulls the cap down and to the right. PSI turns that local articulated prompt into a separated cap and body state.

[Figure: Uncapping a Pen. GT RGB0; sparse f01 with a cap move patch and a body hold patch; dense f01; RGB1 prediction (uncapped).]
Uncapping a Pen
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("pen.png"),
    f01=flow_prompt(
        hold_patches=[(261, 221)],  # hold pen body
        patches=[((299, 165), (363, 221))],  # pull cap right-down
    ),
)

A curved sparse-flow prompt around the cap asks PSI to infer the coupled twist-and-lift motion that removes the cap from the bottle. The preview shows the single unscrewed target frame rather than a back-and-forth loop.

[Figure: Unscrewing Cap. GT RGB0; sparse f01 curved cap prompt; rgb1 target frame (cap unscrewed).]
Unscrewing Cap
rgb0 = Image.open("bottle.png")

dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=rgb0,
    d0=Image.open("bottle_d0.png"),
    f01=flow_prompt(
        cap_twist={
            "center": (252, 90),
            "radius": 46,
            "rotation_degrees": -80,
            "lift": (58, -78),
        },
    ),
)
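
A twist-and-lift prompt like this one can be thought of as a handful of points on a circle around the cap that rotate about its center and then shift by a lift vector. The sketch below computes such (start, end) pairs with plain trigonometry. The helper `twist_patches`, the number of sample points, and the sign convention for the angle are assumptions for illustration; how `flow_prompt` actually interprets `cap_twist` is not specified in this post.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def twist_patches(center: Point, radius: float, rotation_degrees: float,
                  lift: Point = (0.0, 0.0), n_points: int = 4) -> List[Tuple[Point, Point]]:
    """Return (start, end) pairs for points on a circle that rotate about
    `center` and then translate by `lift`.

    Angles are applied with math.cos/math.sin directly in image coordinates
    (x right, y down); whether that matches PSI's sign convention is assumed.
    """
    cx, cy = center
    theta = math.radians(rotation_degrees)
    pairs = []
    for k in range(n_points):
        phi = 2.0 * math.pi * k / n_points
        start = (cx + radius * math.cos(phi), cy + radius * math.sin(phi))
        end = (cx + radius * math.cos(phi + theta) + lift[0],
               cy + radius * math.sin(phi + theta) + lift[1])
        pairs.append((start, end))
    return pairs


# Values taken from the cap_twist dictionary in the snippet above.
for start, end in twist_patches((252, 90), 46, -80, lift=(58, -78)):
    print(start, "->", end)
```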

PSI supports precise optical-flow conditioning, so a small local prompt can specify the desired motion. The model densifies that sparse control into a scene-wide motion plan, then renders the resulting state back into RGB.

[Figure: Weight Pull. GT RGB0; sparse f01 upward prompt; dense f01 generated by PSI; final rgb1 frame.]
Weight Pull
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("weight_lift.png"),
    f01=flow_prompt(poke=((342, 760), (342, 592))),
)

A single sparse flow poke on the left page starts the page moving, while four zero-flow corner patches anchor the scene. PSI densifies the flow and rolls out the page motion.

[Figure: Turning a Page in a Book. GT RGB0; sparse f01 page-turn prompt; dense f01; generated frames rgb1 through rgb5.]
Turning a Page in a Book
frames = predictor.generate(
    "rgb0,f01->f01,rgb1...",
    rgb0=Image.open("open-book.jpg"),
    f01=flow_prompt(
        poke=dict(start=(230, 318), angle=7.25, max_flow=85),  # fields grouped in a dict (assumed form)
        holds=["top_left", "top_right", "bottom_left", "bottom_right"],
    ),
    steps=8,
    seed=303,
)

Human Articulations


Examples where the controllable object is a hand or body with linked joints and precise pose changes.

This example illustrates precise articulated control of a hand. Flow patches on the pointer, middle, and ring fingers close the hand, while zero-motion patches on the pinky and thumb preserve the rest pose.

[Figure: Closing a Hand. GT RGB0 (open hand); sparse f01 finger flow controls; rgb1 target frame (closed hand).]
Closing a Hand
rgb0 = Image.open("hand.png")

dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=rgb0,
    f01=flow_prompt(
        hold_patches=[
            (232, 254),  # thumb
            (300, 274),  # pinky
        ],
        patches=[
            ((150, 132), (183, 173)),  # pointer finger closes
            ((245, 142), (266, 184)),  # middle finger closes
            ((271, 188), (271, 240)),  # ring finger closes
        ],
    ),
)

This example shows direct manipulation of a human body. Hold patches stabilize the torso, grounded leg, and rear arm, while move patches pull the raised front hand and back leg down-forward. PSI densifies those sparse controls into f01 before rendering the new body pose in RGB.

[Figure: Ballet. GT RGB0; sparse f01 hold and move prompts; dense f01 generated by PSI; generated rgb1 frame.]
Ballet
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("ballet.png"),
    f01=flow_prompt(
        hold_patches=[(302, 265), (310, 470), (202, 250)],
        patches=[
            ((398, 150), (412, 241)),  # front hand down-forward
            ((137, 325), (192, 382)),  # back leg down-forward
        ],
    ),
)

Dexterous / Precision Manipulation


Fine motor-control examples where small prompt differences determine delicate contact outcomes.

One right-up hand prompt asks PSI to carry the thread through the needle eye while preserving the thin-object contact geometry.

[Figure: Threading a Needle. GT RGB0; sparse f01 right-up hand prompt; dense f01 generated by PSI; rgb1 target frame.]
Threading a Needle
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("threading_needle.png"),
    f01=flow_prompt(
        patches=[((220, 206), (150, -52))],  # hand moves farther right and up
    ),
)

The same RGB0 can branch into two PSI predictions from different sparse flow prompts. Pick Fruit uses one downward patch on the hand, while Release Fruit uses thumb and pinky patches that pull the hand open around the apple.

[Figure: Picking a Fruit. Pick branch: GT RGB0, sparse f01 downward hand prompt, rgb1 prediction. Release branch: GT RGB0, sparse f01 with two hand release prompts, rgb1 prediction.]
Picking a Fruit
pick_flow, pick_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("fruit_pull_rgb0.png"),
    f01=flow_prompt(
        patches=[((165, 303), (0, 57))],  # wrist pulls down
    ),
)

release_flow, release_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("fruit_pull_rgb0.png"),
    f01=flow_prompt(
        patches=[
            ((165, 303), (0, 57)),    # wrist keeps pulling down
            ((198, 212), (-52, 48)),  # thumb opens left-down
            ((308, 276), (56, 48)),   # pinky opens down-right
        ],
    ),
)

Deformable Materials


Non-rigid materials where sparse controls produce folds, tears, kneading, or stretching.

Paper Tear shows irreversible deformation and non-rigid manipulation. The same hand patches either press inward to crumple the sheet or move outward in opposite directions, causing the densified flow to rip the paper apart.

[Figure: Paper. Press-in branch: GT RGB0, sparse f01 inward hand prompts, dense f01, rgb1 prediction. Tear-apart branch: GT RGB0, sparse f01 outward hand prompts, dense f01, rgb1 prediction.]
Paper
press_flow, press_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("paper.png"),
    f01=flow_prompt(
        patches=[
            ((145, 217), (212, 217)),  # left hand inward
            ((333, 217), (265, 217)),  # right hand inward
        ],
    ),
)

tear_flow, tear_rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("paper.png"),
    f01=flow_prompt(
        patches=[
            ((145, 217), (59, 217)),   # left hand outward
            ((333, 217), (420, 217)),  # right hand outward
        ],
    ),
)

A move patch pulls the red sleeve right and slightly downward, while hold patches keep the white shirt body fixed so the fold is localized to the sleeve.

[Figure: Shirt Folding. GT RGB0; sparse f01 with a red sleeve move prompt and white body hold prompts; rgb1 prediction (folded).]
Shirt Folding
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("shirt_rgb0.png"),
    f01=flow_prompt(
        move_patches=[((86, 160), (190, 223))],  # red sleeve right/down
        hold_patches=[
            (252, 150),
            (297, 262),
            (237, 402),
            (362, 377),
        ],  # hold white shirt body
    ),
)

Two compact hand prompts describe the kneading stroke: first the left hand drags the dough left, then it returns down and right. PSI fills the intermediate physical motion between the selected RGB keyframes.

[Figure: Kneading Dough with Hands. rgb0; prompt 1 (leftward, on the kneading hand); rgb1 midpoint; prompt 2 (down-right, same hand); rgb2 continuation.]
Kneading Dough with Hands
rgb1 = predictor.generate(
    "rgb0,prompt->rgb1",
    rgb0=Image.open("dough.png"),
    prompt=rgb_prompt(poke=((205, 306), (112, 306))),
)

rgb2 = predictor.generate(
    "rgb1,prompt->rgb2",
    rgb1=rgb1,
    prompt=rgb_prompt(poke=((74, 276), (188, 344))),
)

A single downward flow patch on the hand specifies the pull, and PSI densifies that sparse cue into motion over the paper.

[Figure: Toilet paper pull. GT RGB0; sparse f01 downward hand prompt; dense f01 generated by PSI; rgb1 target frame.]
Pulling Toilet Paper off the Roll
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("toilet_paper.png"),
    f01=flow_prompt(
        patches=[((401, 419), (0, 81))],  # hand pulls down
    ),
)

Fluids


Fluid and pouring examples where motion prompts affect liquids and containers.

This coffee pouring example uses a coffee-pot rotation prompt. PSI densifies the motion, renders depth, and shows the next RGB frame.

[Figure: Coffee Pouring. GT RGB0; coffee-pot rotation prompt; RGB1 frame.]
Coffee Pouring
dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=Image.open("coffee_pour_rgb0.png"),
    d0=Image.open("coffee_pour_d0.png"),
    f01=flow_prompt(
        rotations=[((238, 202), "clockwise")],
    ),
)

This example uses the bundled PSI codes directly: a sparse optical-flow prompt nudges the bottle, PSI densifies the flow, and the RGB head renders the filled glass state.

[Figure: Stop Beer Pouring. GT RGB0; sparse f01 prompt; dense f01 generated by PSI; generated rgb1 frame.]
Stop Beer Pouring
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("beer_pour.png"),
    f01=flow_codes("beer_pour_f01.codes.npy"),
    frame_gap_seconds=0.2,
    seed=0,
    num_seq_patches=1024,
)

Compositional Reasoning


Composed tasks where object motion, contact, appearance change, or tool use must stay coordinated.

A single sparse optical-flow patch on the blue glove gives the down-left cleaning motion. PSI densifies that local prompt so the sponge drags a clean tile streak through the wall, then renders the next RGB frame.

[Figure: Cleaning. GT RGB0; sparse f01 down-left prompt; dense f01 generated by PSI; generated rgb1 frame.]
Cleaning
dense_flow, rgb1 = predictor.generate(
    "rgb0,f01->f01,rgb1",
    rgb0=Image.open("cleaning.png"),
    f01=flow_prompt(
        patch=((456, 124), (-144, 92)),  # blue glove
        anchors=["top_right", "bottom_left", "bottom_right"],
    ),
)
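
Named corner anchors such as "top_right" have to be turned into pixel locations at some point. As an illustration only, the sketch below resolves corner names to points inset from the image border. The helpers `corner_points` and `resolve_anchors`, the 16-pixel margin, and the 512 x 512 image size are assumptions, not details of PSI or of `flow_prompt`.

```python
from typing import Dict, List, Tuple


def corner_points(width: int, height: int, margin: int = 16) -> Dict[str, Tuple[int, int]]:
    """Pixel coordinates (x, y) of the four corners, inset by `margin`."""
    return {
        "top_left": (margin, margin),
        "top_right": (width - 1 - margin, margin),
        "bottom_left": (margin, height - 1 - margin),
        "bottom_right": (width - 1 - margin, height - 1 - margin),
    }


def resolve_anchors(names: List[str], width: int, height: int,
                    margin: int = 16) -> List[Tuple[int, int]]:
    """Look up each named corner; unknown names raise KeyError."""
    corners = corner_points(width, height, margin)
    return [corners[name] for name in names]


print(resolve_anchors(["top_right", "bottom_left", "bottom_right"], 512, 512))
# [(495, 16), (16, 495), (495, 495)]
```

Each resolved point would then act as a zero-flow hold, pinning the background while the single move patch drives the glove.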

A two-phase prompt can drive the hammer down onto the nail and then pull it back up. A separate two-phase branch moves the hammer away from the nail and continues to its missed final state.

[Figure: Hammer Nailing a Nail. Hit branch: GT RGB0, down prompt, rgb1 (nail contact), up prompt after impact, rgb2 target. Miss branch: GT RGB0, up-left miss prompt, rgb1, continued miss prompt, rgb2 target.]
Hammer Nailing a Nail
hammer_down = predictor.generate(
    "rgb0,prompt->rgb1",
    rgb0=Image.open("hammer.png"),
    prompt=rgb_prompt(poke=((260, 100), (268, 152))),
)

hammer_nail = predictor.generate(
    "rgb1,prompt->rgb2",
    rgb1=hammer_down,
    prompt=rgb_prompt(poke=((253, 140), (230, 75))),
)

miss_nail_1 = predictor.generate(
    "rgb0,prompt->rgb1",
    rgb0=Image.open("hammer.png"),
    prompt=rgb_prompt(poke=((260, 100), (188, 54))),
)

miss_nail = predictor.generate(
    "rgb1,prompt->rgb2",
    rgb1=miss_nail_1,
    prompt=rgb_prompt(poke=((222, 82), (222, 150))),
)
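
The two-phase pattern used here, where each call takes the previous output as its starting frame and receives its own prompt, can be written as a small loop. This is a sketch under stated assumptions: `run_phases` and the stand-in `step` function are hypothetical and only mimic the chaining, not the model.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Poke = Tuple[Tuple[int, int], Tuple[int, int]]


def run_phases(step: Callable[[Frame, Poke], Frame], rgb0: Frame,
               pokes: Sequence[Poke]) -> List[Frame]:
    """Apply one poke per phase, feeding each output frame into the next phase."""
    frames = []
    current = rgb0
    for poke in pokes:
        current = step(current, poke)
        frames.append(current)
    return frames


# Stand-in predictor: a "frame" is just the hammer-head position, and each
# poke moves it to the poke's end point.
hit = run_phases(
    lambda frame, poke: poke[1],
    (260, 100),
    [((260, 100), (268, 152)), ((253, 140), (230, 75))],
)
print(hit)  # [(268, 152), (230, 75)]
```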

Light Candle drives the match to the wick. Move Away starts from the same hand point and moves the hand right, then farther right again. Extinguish Match moves the hand down and right, then slowly back up.

[Figure: Lighting Candle. Dense RGB0 input; sparse RGB1 prompt (hold and move patches); dense RGB1 prediction.] Red patches hold the candle and background fixed, while the green copied patch moves the hand and match toward the wick.
Lighting Candle
light_candle = predictor.generate(
    "rgb0,rgb1->rgb1...",
    rgb0=Image.open("candle.png"),
    rgb1=rgb_prompt(pokes=[
        ((261, 131), (212, 151)),
        ((212, 151), (365, 151)),
    ]),
)

move_away_rgb1 = predictor.generate(
    "rgb0,prompt->rgb1",
    rgb0=Image.open("candle.png"),
    prompt=rgb_prompt(poke=((365, 182), (445, 182))),
)
move_away_rgb2 = predictor.generate(
    "rgb1,prompt->rgb2",
    rgb1=move_away_rgb1,
    prompt=rgb_prompt(poke=((439, 182), (519, 182))),
)

extinguish_match_rgb1 = predictor.generate(
    "rgb0,prompt->rgb1",
    rgb0=Image.open("candle.png"),
    prompt=rgb_prompt(poke=((365, 182), (428, 251))),
)
extinguish_match_rgb2 = predictor.generate(
    "rgb1,prompt->rgb2",
    rgb1=extinguish_match_rgb1,
    prompt=rgb_prompt(poke=((422, 251), (399, 171))),
)

Novel View Synthesis


Camera-conditioned generations that move the viewpoint while preserving the scene.

This NVS example starts from the first frame of the video and applies a view-change flow prompt. PSI renders the next depth map and RGB frame for the new view.

[Figure: Spinning Around Coffee Mug. GT RGB0 source frame; orbit middle frame (rgb1...rgb2); orbit final frame (autoregressive view).]
Spinning Around Coffee Mug
depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->d1,rgb1",
    rgb0=Image.open("coffee_mug_000.png"),
    d0=Image.open("coffee_mug_d0.png"),
    f01=view_flow_prompt(direction="right"),
)

Entering a House first densifies a sparse door-opening flow prompt. After the door opens, PSI uses a depth-conditioned camera prompt to move into the room without another sparse-to-dense flow stage.

[Figure: Entering a House. GT RGB0 source frame; door push prompt (f01 + c01); rgb1 first forward view; rgb2 straight-through final view.]
Entering a House
door_push = flow_prompt(
    patches=[((138, 286), (93, 318))],  # push door inward-left
)
forward = camera_prompt(translation=(0.0, 0.0, 0.55))

dense_flow, depth1, rgb1 = predictor.generate(
    "rgb0,d0,f01->f01,d1,rgb1",
    rgb0=Image.open("doorway_open.png"),
    d0=Image.open("doorway_d0.png"),
    f01=door_push,
)

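# Second step: the previous outputs (rgb1, depth1) are passed in as the new
# rgb0 and d0, so the returned d1, rgb1 correspond to depth2, rgb2.
depth2, rgb2 = predictor.generate(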
    "rgb0,d0,c01->d1,rgb1",
    rgb0=rgb1,
    d0=depth1,
    c01=forward,
)
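
Why a camera prompt is paired with depth can be seen from standard pinhole geometry: how far a pixel shifts under a camera translation depends on how far away the point is. The sketch below reprojects one pixel after a camera move. It is textbook geometry rather than PSI's internals; the helper `reproject`, the focal length, and the principal point are assumed values, and the units of PSI's translation are not stated in this post.

```python
from typing import Tuple


def reproject(pixel: Tuple[float, float], depth: float,
              translation: Tuple[float, float, float],
              focal: float = 500.0,
              principal: Tuple[float, float] = (256.0, 256.0)) -> Tuple[float, float]:
    """Project a pixel with known depth into a camera translated by `translation`."""
    u, v = pixel
    cx, cy = principal
    # Back-project to a 3D point in the original camera frame.
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    z = depth
    # Moving the camera by t shifts points by -t in the new camera frame.
    tx, ty, tz = translation
    x, y, z = x - tx, y - ty, z - tz
    if z <= 0:
        raise ValueError("point is behind the moved camera")
    return (focal * x / z + cx, focal * y / z + cy)


# A point right of center moves farther right as the camera moves forward;
# a nearer point (smaller depth) moves more than a distant one.
near = reproject((356.0, 256.0), depth=2.0, translation=(0.0, 0.0, 0.55))
far = reproject((356.0, 256.0), depth=20.0, translation=(0.0, 0.0, 0.55))
print(near, far)
```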