Improving Robotic Generalist Policies via Flow Reversal Steering

Abstract

Generalist policies can learn a wide range of skills from diverse robot datasets. To solve or improve on challenging new tasks, we need a way to infer and invoke the appropriate actions from the policy's rich behavioral prior, especially when directly commanding the policy fails. We focus on flow-matching generalists and propose Flow Reversal Steering (FRS): a method that takes suboptimal but "reasonable" actions, finds their latent noises by passing them through the flow policy in reverse, and maps them to nearby generalist action modes. We evaluate FRS across many simulated and real-world manipulation settings. First, FRS can turn coarse semantic guidance from humans or vision-language models (VLMs) into corresponding good robot actions, improving zero-shot control. Second, these gains can be distilled with behavioral cloning by training an auxiliary policy to output noises that the generalist maps to good actions, yielding absolute task success rate boosts of up to 95% in under a minute of training. Finally, FRS enables policy improvement by bootstrapping reinforcement learning with semantic knowledge, improving on several tasks where standard RL fails to improve.

Flow Reversal Steering (FRS)

Flow Reversal Steering: a coarse reference action from a reasoner is reversed through the flow field into latent noise (noise via flow reversal), then denoised via flow matching into a precise in-distribution action from a nearby behavioral mode.

Generalist flow-matching robot policies learn a rich prior over behaviors. These policies contain many skills needed for novel tasks — provided these appropriate behaviors are elicited. How can we use semantic knowledge to steer generalist policies towards sampling "reasonable" actions for new tasks?

We thus propose Flow Reversal Steering (FRS): a method for guiding generalist flow policies' action sampling by finding underlying noises that map to semantically reasonable behaviors. By passing a coarse reference action, which captures roughly how the robot should move, through the flow policy in reverse, FRS finds the noise that approximately maps to that action. When this noise is subsequently denoised, FRS effectively recovers a nearby good behavioral mode from the generalist's prior that is similar to the reference.

Standard Flow Denoising

# Sample action from flow policy (K integration steps)
x ~ N(0, I)                  # Sample noise
dt ← 1 / K
for t = 0 … 1:               # K forward steps, size dt
    x ← x + dt·vθ(x, t)
return x                     # action sample

Flow Reversal Steering

# Steer w/ reference action a_ref (K integration steps)
x ← a_ref
dt ← 1 / K
for t = 1 … 0:               # reverse the flow → noise
    x ← x − dt·vθ(x, t)
for t = 0 … 1:               # denoise → nearby mode
    x ← x + dt·vθ(x, t)
return x                     # refined, in-distribution

In turn, this allows semantic reasoners, like humans or VLMs, to guide the policy towards task-relevant good behaviors. The noises and actions produced by FRS can also be used for policy learning and improvement, especially via noise-space behavioral cloning and reinforcement learning.
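
The two pseudocode routines above can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_velocity` is a hand-written stand-in for the learned field vθ (the closed-form flow-matching velocity for 1-D Gaussian "actions"), the integrator is explicit Euler, and all function names are hypothetical.

```python
from typing import Callable

Velocity = Callable[[float, float], float]  # (x, t) -> dx/dt


def toy_velocity(x: float, t: float, mean: float = 1.0, std: float = 0.1) -> float:
    """Assumed stand-in for v_theta: exact flow-matching velocity for 1-D data
    N(mean, std^2) under the interpolant x_t = (1 - t) * noise + t * action."""
    var_t = (1.0 - t) ** 2 + (t * std) ** 2
    coeff = t * std ** 2 - (1.0 - t)
    return mean + coeff * (x - t * mean) / var_t


def denoise(noise: float, velocity: Velocity, num_steps: int) -> float:
    """Standard flow denoising: Euler steps from t = 0 (noise) to t = 1 (action)."""
    x, dt = noise, 1.0 / num_steps
    for n in range(num_steps):
        x += dt * velocity(x, n * dt)
    return x


def reverse_flow(action: float, velocity: Velocity, num_steps: int) -> float:
    """Flow reversal: Euler steps from t = 1 (action) back to t = 0 (noise)."""
    x, dt = action, 1.0 / num_steps
    for n in range(num_steps, 0, -1):
        x -= dt * velocity(x, n * dt)
    return x


def flow_reversal_steering(a_ref: float, velocity: Velocity, num_steps: int) -> float:
    """Reverse the reference action into noise, then denoise that noise."""
    return denoise(reverse_flow(a_ref, velocity, num_steps), velocity, num_steps)


a_ref = 0.7  # coarse reference, away from the toy mode at 1.0
print("K = 10:  ", flow_reversal_steering(a_ref, toy_velocity, 10))
print("K = 1000:", flow_reversal_steering(a_ref, toy_velocity, 1000))
```

In this toy, a very fine step count nearly reproduces the reference, while a coarse step count returns an action closer to the mode than the reference was. That is a property of explicit Euler on this hand-written field; how a learned VLA field behaves is not something this sketch establishes.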

Overview of the FRS pipeline.

We demonstrate FRS using state-of-the-art π_0.5 vision-language-action models (VLAs) in both simulated LIBERO and real-world DROID manipulation tasks. We present three ways to use FRS:

Zero-shot FRS: directly execute the refined actions FRS elicits from coarse human or VLM guidance, with no additional training.
Diffusion Steering via Behavioral Cloning (DSBC): distill good noises from flow reversal into a noise policy using supervised learning.
Diffusion Steering via Reinforcement Learning (DSRL) + FRS: bootstrap noise-space reinforcement learning with FRS rollouts.

Noising with Flow Reversal vs. Forward Diffusion Process

Interactive comparison with two panels, Flow Reversal and Forward Diffusion. Noising runs t: 1 → 0; denoising runs t: 0 → 1.

Flow reversal maps data to noise by integrating the flow field backward in time. Because flow matching learns a deterministic velocity field, there is a fixed correspondence between noise and data, so the resulting noises denoise back to the original data points (up to numerical integration error).

In contrast, the forward diffusion process linearly interpolates data with Gaussian noise. While this yields the same marginal distributions as flow matching at all times, it does not preserve the exact data-noise correspondence, so the noised points do not necessarily denoise back to the original data.
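
The contrast between the two noising schemes can be checked on a toy problem. The sketch below is an illustration under assumptions, not the authors' code: the velocity field is the closed form for 1-D Gaussian data rather than a learned network, integration is explicit Euler with many small steps, and every name is hypothetical.

```python
import random


def toy_velocity(x: float, t: float, mean: float = 1.0, std: float = 0.1) -> float:
    """Assumed stand-in for a learned field: exact flow-matching velocity for
    1-D data N(mean, std^2) with x_t = (1 - t) * noise + t * data."""
    var_t = (1.0 - t) ** 2 + (t * std) ** 2
    coeff = t * std ** 2 - (1.0 - t)
    return mean + coeff * (x - t * mean) / var_t


def reverse_flow(data: float, num_steps: int = 1000) -> float:
    """Noising by flow reversal: integrate the field from t = 1 down to t = 0."""
    x, dt = data, 1.0 / num_steps
    for n in range(num_steps, 0, -1):
        x -= dt * toy_velocity(x, n * dt)
    return x


def denoise(noise: float, num_steps: int = 1000) -> float:
    """Denoising: integrate the same field from t = 0 up to t = 1."""
    x, dt = noise, 1.0 / num_steps
    for n in range(num_steps):
        x += dt * toy_velocity(x, n * dt)
    return x


def forward_diffusion_noise(data: float, t: float, rng: random.Random) -> float:
    """Noising by interpolation with fresh Gaussian noise: x_t = t * data + (1 - t) * z."""
    z = rng.gauss(0.0, 1.0)
    return t * data + (1.0 - t) * z


for data in (0.9, 1.1):
    reversed_noise = reverse_flow(data)
    diffused_noise = forward_diffusion_noise(data, 0.0, random.Random(0))
    print(f"data={data}: reversal -> {denoise(reversed_noise):.4f}, "
          f"diffusion -> {denoise(diffused_noise):.4f}")
```

With the same random seed, both data points are interpolated to the identical noise sample at t = 0, so forward diffusion carries no information about which point it started from. Flow reversal assigns each point its own noise, and denoising that noise returns approximately the starting point.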

Interactive Examples of Flow Reversal Steering

Example LIBERO frames with an interactive 3D view of the cardinal flow-reversal steering directions (arrows) and the resulting steered action chunks (dots). Black dots represent a sample from the VLA without steering, showing that the steered actions differ from what the VLA would usually output. Three frames are shown per task across the rollout. All actions and steering are done with a π_0.5 VLA.

Task 13: put the black bowl at the front on the plate (frames 0015, 0050, 0080)

Task 40: put the frying pan on the cabinet shelf (frames 0010, 0100, 0170)

Task 47: pick up the cream cheese box and put it in the basket (frames 0010, 0055, 0090)

Task 66: put the red mug on the right plate (frames 0020, 0090, 0105)

Zero-Shot Flow Reversal Steering

FRS converts coarse actions into better fine-grained actions from the generalist policy's prior. The simplest way to use FRS is to directly execute these refined actions without any training. The zero-shot FRS inference loop involves: (1) querying a reasoner (e.g. human or VLM) for a coarse reference action, (2) passing the reference action through flow reversal to find the corresponding noise, and (3) denoising this noise to get the final action to execute at each step.
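
The three-step loop above can be sketched as follows. Everything here is a hypothetical stand-in chosen so the example runs without a VLA or simulator: `coarse_reasoner` plays the role of the human or VLM, `toy_velocity` is a closed-form field whose mode depends on the observation, and the "environment" simply adds the action to a 2-D position.

```python
from typing import Callable, List, Tuple

Vec = List[float]
GOAL: Tuple[float, float] = (1.0, -0.5)  # assumed toy target position


def toy_velocity(x: Vec, t: float, obs: Vec, std: float = 0.1) -> Vec:
    """Stand-in for the policy's field: per-dimension Gaussian mode at 0.5 * (goal - obs)."""
    var_t = (1.0 - t) ** 2 + (t * std) ** 2
    coeff = t * std ** 2 - (1.0 - t)
    mean = [0.5 * (g - o) for g, o in zip(GOAL, obs)]
    return [m + coeff * (xi - t * m) / var_t for xi, m in zip(x, mean)]


def coarse_reasoner(obs: Vec, magnitude: float = 0.3) -> Vec:
    """Stand-in for a human or VLM: a fixed-size step along one cardinal axis."""
    err = [g - o for g, o in zip(GOAL, obs)]
    axis = max(range(len(err)), key=lambda i: abs(err[i]))
    ref = [0.0] * len(err)
    ref[axis] = magnitude if err[axis] > 0 else -magnitude
    return ref


def frs_action(a_ref: Vec, obs: Vec, velocity: Callable, num_steps: int = 10) -> Vec:
    """Step (2): reverse a_ref into noise. Step (3): denoise that noise."""
    dt = 1.0 / num_steps
    x = list(a_ref)
    for n in range(num_steps, 0, -1):
        v = velocity(x, n * dt, obs)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    for n in range(num_steps):
        v = velocity(x, n * dt, obs)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def run_episode(obs: Vec, horizon: int = 20) -> Tuple[Vec, List[Vec]]:
    actions = []
    for _ in range(horizon):
        a_ref = coarse_reasoner(obs)                    # (1) query the reasoner
        action = frs_action(a_ref, obs, toy_velocity)   # (2) + (3) flow reversal, then denoise
        obs = [o + a for o, a in zip(obs, action)]      # toy dynamics: position += action
        actions.append(action)
    return obs, actions


final_obs, executed = run_episode([0.0, 0.0])
print("final position:", final_obs)
```

No extra training happens anywhere in the loop: the reasoner supplies only a cardinal direction each step, and the field decides the executed action.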

We make use of the Gemini-ER-1.6 VLM to scalably produce semantically-meaningful coarse directional reference actions for evaluation in the LIBERO simulator.

Zero-shot FRS allows a VLM to guide the generalist policy based on its semantic reasoning, raising performance across LIBERO.

Zero-shot FRS improves over the base VLA across LIBERO tasks: it converts coarse VLM actions into effective robot actions, outperforming the base policy and prior steering methods.

Zero-shot FRS outperforms the base VLA across LIBERO. Among hard tasks where the base policy achieves ≤ 2% success, 11 of 42 tasks gain at least 10% in absolute success rate with zero-shot FRS. Other steering baselines achieve such gains on only 3 or 4 of these tasks, suggesting FRS is better suited to low-success regimes, which are especially hard for generalist improvement. FRS also outperforms directly executing the VLM actions, showing that flow reversal refines the coarse reference actions rather than simply reconstructing them.

Diffusion Steering via Behavioral Cloning (DSBC)

Flow policies can be steered via small auxiliary noise policies, which output noises that the generalist flow policy maps to good actions. However, finding good noises can be challenging: prior methods like Diffusion Steering via Reinforcement Learning (DSRL) require trial and error and Q-learning to find them. In contrast, flow reversal rapidly and scalably identifies good noises given only reference actions.

Thus, flow reversal enables training noise-steering policies via supervised learning, rather than expensive reinforcement learning. We naturally call this Diffusion Steering via Behavioral Cloning (DSBC).
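
As a rough sketch of the idea, noise-space behavioral cloning reduces to supervised regression from observations to flow-reversed noises. The example below makes several assumptions that are not from the source: observations and actions are scalars, `toy_velocity` is a closed-form stand-in for the generalist's field, `reference_action` is a made-up source of coarse guidance, and the noise policy is a least-squares line rather than a neural network.

```python
from typing import Callable, List


def toy_velocity(x: float, t: float, obs: float, std: float = 0.1) -> float:
    """Stand-in for the generalist's field: Gaussian action mode at 0.5 * obs."""
    mean = 0.5 * obs
    var_t = (1.0 - t) ** 2 + (t * std) ** 2
    coeff = t * std ** 2 - (1.0 - t)
    return mean + coeff * (x - t * mean) / var_t


def reverse_flow(action: float, obs: float, num_steps: int = 10) -> float:
    x, dt = action, 1.0 / num_steps
    for n in range(num_steps, 0, -1):
        x -= dt * toy_velocity(x, n * dt, obs)
    return x


def denoise(noise: float, obs: float, num_steps: int = 10) -> float:
    x, dt = noise, 1.0 / num_steps
    for n in range(num_steps):
        x += dt * toy_velocity(x, n * dt, obs)
    return x


def reference_action(obs: float) -> float:
    """Hypothetical coarse guidance, e.g. from a human or VLM."""
    return 0.3 * obs + 0.2


def fit_linear_noise_policy(observations: List[float], noises: List[float]) -> Callable[[float], float]:
    """Supervised learning step: least-squares fit of noise = slope * obs + intercept."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_w = sum(noises) / n
    cov = sum((o - mean_o) * (w - mean_w) for o, w in zip(observations, noises))
    var = sum((o - mean_o) ** 2 for o in observations)
    slope = cov / var
    intercept = mean_w - slope * mean_o
    return lambda obs: slope * obs + intercept


# 1) Label each observation with the noise found by flow reversal.
observations = [-1.0, -0.5, 0.0, 0.5, 1.0]
noises = [reverse_flow(reference_action(o), o) for o in observations]

# 2) Fit the noise policy on (observation, noise) pairs.
noise_policy = fit_linear_noise_policy(observations, noises)

# 3) At test time, the frozen flow policy denoises the predicted noise.
obs = 0.25
print("steered action:", denoise(noise_policy(obs), obs))
```

Only the small regressor is trained; the flow field is used once to label the data and again at test time to map predicted noises to actions.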

Real-World (DROID)

Noise policies trained with DSBC can steer the generalist VLA on real-world DROID tasks.

Real-world DSBC results showing 80% success vs. 20% for the base VLA on six DROID tasks.

DSBC improves real-world task performance by training on FRS data.

DSBC trained on 10 successful human-steered FRS rollouts per task significantly outperforms the base π_0.5 VLA on six DROID tasks. Standard BC with an equivalent flow policy completely fails in this data regime.

Compared methods: Offline DSBC (Ours), Base VLA, and Standard Flow BC. Offline DSBC solves the precise tape-hanging task, where the base VLA and standard flow BC fail.

Offline DSBC results on DROID tasks: DSBC can also be trained fully offline from a fixed dataset of teleoperated trajectories.

DSBC can also be used with standard offline robotic demonstration data, e.g., collected via teleoperation. Flow reversal augments each frame with the noise that approximately maps to the corresponding action chunk, and the DSBC noise policy is then trained on this augmented dataset. We demonstrate this on a more precise tape-hanging task, training on 20 teleoperated demonstrations.
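
The augmentation step might look like the sketch below, which adds a `noise` field to each demonstration frame. The record layout (`obs`, `action_chunk`), the scalar observation, and the closed-form `toy_velocity` are all assumptions made for illustration; a real pipeline would call the VLA's velocity network on full observations.

```python
from typing import Dict, List


def toy_velocity(x: float, t: float, obs: float, std: float = 0.1) -> float:
    """Stand-in for the generalist's field: Gaussian action mode at 0.5 * obs."""
    mean = 0.5 * obs
    var_t = (1.0 - t) ** 2 + (t * std) ** 2
    coeff = t * std ** 2 - (1.0 - t)
    return mean + coeff * (x - t * mean) / var_t


def reverse_flow_chunk(chunk: List[float], obs: float, num_steps: int = 200) -> List[float]:
    """Integrate every element of the action chunk from t = 1 back to t = 0."""
    dt = 1.0 / num_steps
    x = list(chunk)
    for n in range(num_steps, 0, -1):
        t = n * dt
        x = [xi - dt * toy_velocity(xi, t, obs) for xi in x]
    return x


def denoise_chunk(noise: List[float], obs: float, num_steps: int = 200) -> List[float]:
    """Forward integration, used here only to sanity-check the stored noises."""
    dt = 1.0 / num_steps
    x = list(noise)
    for n in range(num_steps):
        t = n * dt
        x = [xi + dt * toy_velocity(xi, t, obs) for xi in x]
    return x


def augment_with_noise(dataset: List[Dict]) -> List[Dict]:
    """Return copies of the frames with an extra 'noise' entry per action chunk."""
    return [
        {**frame, "noise": reverse_flow_chunk(frame["action_chunk"], frame["obs"])}
        for frame in dataset
    ]


demos = [
    {"obs": 0.2, "action_chunk": [0.10, 0.15, 0.20]},
    {"obs": 0.4, "action_chunk": [0.25, 0.20, 0.10]},
]
augmented = augment_with_noise(demos)
for frame in augmented:
    print(frame["action_chunk"], "->", denoise_chunk(frame["noise"], frame["obs"]))
```

The noise policy is then trained on `(obs, noise)` pairs from the augmented frames, and the original demonstrations are left unmodified.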

Simulation (LIBERO)

DSBC distills FRS trajectories to match zero-shot FRS performance on LIBERO: training on latent noise actions matches zero-shot FRS and beats standard BC on robot actions.

We also evaluate DSBC on LIBERO, where it distills the performance gains of zero-shot VLM FRS on 15 tasks, outperforming standard BC. DSBC is also highly efficient: each noise policy trains in under 1 minute using ~1 GB of GPU memory, without loading the full VLA.

We hypothesize that when the noise policy enters out-of-distribution states, the VLA maps its outputs back to reasonable in-distribution actions, providing implicit robustness against compounding error.

Diffusion Steering via Reinforcement Learning (DSRL) + FRS

Bootstrapping DSRL with FRS trajectories enables faster learning and higher final success on hard tasks. We propose two simple augmentations for DSRL + FRS: (1) prefilling the replay buffer with zero-shot FRS rollouts and (2) adding a BC auxiliary loss on successful trajectories' noise actions.
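
The two augmentations can be sketched independently of any particular RL algorithm. The code below is an assumption-laden illustration: the `Transition` fields, the mean-squared-error form of the BC term, and the `bc_weight` coefficient are hypothetical choices, and `rl_loss` stands for whatever loss the underlying DSRL learner already computes.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Transition:
    obs: float
    noise_action: float  # latent noise chosen at this step
    reward: float
    next_obs: float
    success: bool        # True if the trajectory it came from succeeded


def prefill_replay_buffer(buffer: Deque[Transition], frs_rollouts: List[List[Transition]]) -> None:
    """Augmentation (1): seed the buffer with zero-shot FRS rollouts before RL starts."""
    for rollout in frs_rollouts:
        buffer.extend(rollout)


def bc_auxiliary_loss(batch: List[Transition], noise_policy: Callable[[float], float]) -> float:
    """Augmentation (2): regress toward noise actions from successful trajectories only."""
    good = [tr for tr in batch if tr.success]
    if not good:
        return 0.0
    return sum((noise_policy(tr.obs) - tr.noise_action) ** 2 for tr in good) / len(good)


def total_loss(rl_loss: float, batch: List[Transition],
               noise_policy: Callable[[float], float], bc_weight: float = 1.0) -> float:
    return rl_loss + bc_weight * bc_auxiliary_loss(batch, noise_policy)


successful = [Transition(0.0, 1.0, 0.0, 0.1, True), Transition(0.1, -1.0, 1.0, 0.2, True)]
failed = [Transition(0.0, 5.0, 0.0, 0.0, False)]

replay: Deque[Transition] = deque(maxlen=10_000)
prefill_replay_buffer(replay, [successful, failed])

zero_policy = lambda obs: 0.0  # placeholder noise policy
print("buffer size:", len(replay))
print("loss:", total_loss(0.5, list(replay), zero_policy))
```

Failed rollouts still enter the replay buffer, where they remain useful to the critic, but they contribute nothing to the BC term.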

DSRL plus FRS outperforms standard DSRL on LIBERO tasks.

Left: DSRL + FRS on 15 tasks with effective zero-shot steering.
Right: on 10 hard tasks where FRS and the base policy both perform poorly, using even one FRS success drastically improves RL.

On 15 LIBERO-90 tasks, DSRL + FRS learns faster and reaches higher final success than standard DSRL and PLD-style residual RL. On 10 harder tasks where the base VLA gets ~0% and zero-shot FRS reaches only 8%, even a single successful steered trajectory enables DSRL to learn substantially faster and better. By directing the learner toward semantically meaningful behaviors early in training, DSRL + FRS improves performance in the sparse-reward regime where standard generalist RL struggles.

Training steps shown: 0, 50K, 100K, 150K, 200K.

For example, DSRL + FRS learns Task 52 (pick up the milk and put it in the basket), whereas standard DSRL fails completely because the base VLA struggles to solve the task without semantic steering from VLMs.

BibTeX

@article{tang2026frs,
  author  = {Andy Tang and William Chen and Andrew Wagenmaker and Chelsea Finn and Sergey Levine},
  title   = {Improving Robotic Generalist Policies via Flow Reversal Steering},
  year    = {2026},
}