ChronoFlow-Policy: Unifying Past-Current-Future Interaction Flow in Visuomotor Policy Learning

European Conference on Computer Vision | ECCV 2026
Bokai Lin1,2,*, Yifu Xu1,*, Xinyu Zhan1, Hongjie Fang1, Jialin Tian1,
Fu-Cheng Zhang2, Yong-Lu Li1,2, Cewu Lu1,2,3, Lixin Yang1,†
1Shanghai Jiao Tong University 2Shanghai Innovation Institute 3Noematrix

*Equal contribution   †Corresponding author

ChronoFlow motivation and overview

ChronoFlow-Policy unifies past, current, and future interaction flow. The policy reasons over sparse 3D object and gripper keypoints, then uses ChronoFlow as a co-training target for diffusion-based action generation.

Video

ChronoFlow-Policy demonstrations across simulated and real-world manipulation tasks.

Abstract

Visual signals play a crucial role in policy learning by enabling models to capture object motion and interaction dynamics. Just as humans reason about actions using both past experience and anticipated outcomes, effective policies should integrate past interactions with future predictions.

We introduce ChronoFlow, a temporally unified representation that captures past, current, and future interaction dynamics through sparse 3D keypoints of both objects and the gripper. Based on this representation, ChronoFlow-Policy jointly learns ChronoFlow and action sequences through a diffusion co-training objective.

Experiments on 14 simulated tasks and 5 real-world manipulation tasks show consistent gains over strong diffusion-policy baselines, especially in long-horizon, deformable, and non-Markovian manipulation scenarios.

Why ChronoFlow?

Past

Historical object-gripper keypoint flows provide memory of previous interactions, which is critical for non-Markovian tasks such as Swap-Easy and Swap-Hard.

Current

A single RGB-D observation is encoded as a 3D point-cloud feature, grounding the policy in the current manipulation state.

Future

Future ChronoFlow prediction provides interaction foresight, letting actions be decoded from predicted object and gripper motion rather than action labels alone.

Method

ChronoFlow-Policy pipeline

Overview of ChronoFlow-Policy. ChronoFlow represents past-current-future object-gripper keypoint flows. A ChronoFlow encoder, diffusion backbone, ChronoFlow decoder, and action decoder jointly learn interaction trajectories and robot actions.

ChronoFlow

ChronoFlow tracks sparse 3D keypoints on both the gripper and task-relevant objects, forming compact interaction trajectories instead of dense scene-level motion.
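As a rough illustration of this representation, the sketch below stores gripper and object keypoints per timestep and splits a trajectory into past, current, and future segments. The class name, field names, and horizon parameters are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, ...]   # one 3D keypoint (x, y, z)
Frame = List[Point]         # all keypoints observed at one timestep


@dataclass
class KeypointFlow:
    """Hypothetical container for sparse 3D keypoint trajectories."""
    gripper: List[Frame]   # [T][K_g] gripper keypoints
    objects: List[Frame]   # [T][K_o] task-relevant object keypoints

    def split(self, t: int, past: int, future: int):
        """Return (past, current, future) frames around timestep t.

        Each frame concatenates gripper and object keypoints.
        """
        frames = [g + o for g, o in zip(self.gripper, self.objects)]
        start = max(0, t - past)
        return frames[start:t], frames[t], frames[t + 1:t + 1 + future]


def displacement(frame: Frame, reference: Frame) -> Frame:
    """Per-keypoint offset of `frame` relative to `reference`."""
    return [tuple(a - b for a, b in zip(p, q)) for p, q in zip(frame, reference)]


# Toy trajectory: one gripper point moving along x, one static object point.
flow = KeypointFlow(
    gripper=[[(float(t), 0.0, 0.0)] for t in range(6)],
    objects=[[(0.0, 0.0, 0.5)] for _ in range(6)],
)
past, current, future = flow.split(t=3, past=2, future=2)
next_step_offset = displacement(future[0], current)
```

Storing a handful of keypoints per frame keeps the trajectory compact, which is the contrast with dense scene-level motion drawn above.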

ChronoFlow Encoder

Learnable interaction queries use cross-attention to encode historical ChronoFlow and noisy future ChronoFlow into compact object and gripper tokens.
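The query-based compression described here can be sketched with plain single-head cross-attention. This is a minimal illustration under stated assumptions: the queries are randomly initialized rather than learned, keys and values are the raw tokens with no projection layers, and the token counts and dimension are made up for the example.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product cross-attention.

    Keys and values are both the raw tokens (no learned projections),
    which is a simplification for illustration.
    """
    dim = len(tokens[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return outputs


rng = random.Random(0)
dim = 4
# Hypothetical sizes: 12 keypoint-flow tokens compressed into 2 query slots.
flow_tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(12)]
interaction_queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
compact = cross_attention(interaction_queries, flow_tokens)
```

The output has one vector per query regardless of how many flow tokens go in, which is what makes the encoded tokens compact.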

Joint Diffusion

The diffusion backbone denoises ChronoFlow tokens while conditioned on the current 3D observation, so that the latent tokens come to model object-gripper interaction dynamics.
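One way to picture a co-training objective of this kind is a noise-prediction loss on the flow tokens plus a weighted action loss. The sketch below is an assumption for illustration only: the exact loss terms, the `action_weight` parameter, and the use of mean squared error for actions are not specified on this page.

```python
import math
from typing import List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def add_noise(clean: Vector, noise: Vector, alpha_bar: float) -> Vector:
    """Forward diffusion: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(clean, noise)]


def co_training_loss(pred_noise: Vector, true_noise: Vector,
                     pred_actions: Vector, true_actions: Vector,
                     action_weight: float = 1.0) -> float:
    """Hypothetical joint loss: flow denoising term plus weighted action term."""
    flow_loss = mse(pred_noise, true_noise)
    action_loss = mse(pred_actions, true_actions)
    return flow_loss + action_weight * action_loss


# Toy values: flow error of 1.0 and action error of 4.0 with weight 0.5.
loss = co_training_loss([0.0, 0.0], [1.0, 1.0], [0.0], [2.0], action_weight=0.5)
```

Because both terms backpropagate through the same backbone, the action head benefits from features that are also useful for predicting keypoint motion.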

Action Decoding

A lightweight transformer decoder predicts actions from partially denoised ChronoFlow tokens and approximate future interaction trajectories.

Inference

During deployment, historical ChronoFlow trajectories are obtained asynchronously with TAPIP3D. The policy starts from Gaussian noise, iteratively denoises future ChronoFlow, and decodes both future keypoint trajectories and a horizon of robot actions. The asynchronous tracker keeps the real-world policy running at 5.82 Hz, compared with 0.93 Hz for synchronous tracking.
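The sampling procedure above (start from Gaussian noise, denoise iteratively, then decode) can be sketched as follows. This uses a deterministic DDIM-style update as a stand-in, since the sampler is not named here. The noise predictor is an oracle that already knows the target flow, and `decode_actions` is a placeholder that only reshapes the result. Both are hypothetical substitutes for the trained networks.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def ddim_sample(eps_model: Callable[[Vector, float], Vector],
                alpha_bars: List[float], dim: int, rng: random.Random) -> Vector:
    """Deterministic DDIM-style sampling starting from pure Gaussian noise.

    alpha_bars[i] is the cumulative signal level at step i, decreasing
    from alpha_bars[0] = 1.0 (clean) to a small value (almost pure noise).
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i in range(len(alpha_bars) - 1, 0, -1):
        a_t, a_prev = alpha_bars[i], alpha_bars[i - 1]
        eps = eps_model(x, a_t)
        # Estimate the clean sample, then re-noise it to the previous level.
        x0 = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
              for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x


def decode_actions(future_flow: Vector, horizon: int) -> List[Vector]:
    """Placeholder decoder: chunk the flow vector into `horizon` steps."""
    step = len(future_flow) // horizon
    return [future_flow[i * step:(i + 1) * step] for i in range(horizon)]


# Stand-in for the trained network: an oracle that knows the true future flow.
target_flow = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]


def oracle_eps(x: Vector, a_t: float) -> Vector:
    """Return the noise that explains x given the known target and noise level."""
    return [(xi - math.sqrt(a_t) * ti) / math.sqrt(1.0 - a_t)
            for xi, ti in zip(x, target_flow)]


alpha_bars = [1.0, 0.9, 0.6, 0.3, 0.05]
future_flow = ddim_sample(oracle_eps, alpha_bars, dim=len(target_flow),
                          rng=random.Random(0))
actions = decode_actions(future_flow, horizon=3)
```

With a perfect noise predictor the loop recovers the target flow up to floating-point error. A learned predictor would only approximate it.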

Experiments

  • MetaWorld Avg.: 72% with CFP (Unet), +42 over DP3
  • RoboTwin Avg.: 66% with CFP (Unet), +23 over DP3
  • Swap-Easy: 93% on all three real-world stages
  • Fold Towel: 87% on both deformable stages

Method Comparison

Supervision, history, future modeling, and gains reported in the paper.

Method               Supervision              History   Future   Gain (Sim. / Real)
DP3 / RISE           Action-only              No        No       - / -
HistDP3 / HistRISE   Action-only              Yes       No       +2.2 / +25.8
3D-FDP               Dense scene flow         No        Yes      +4.0 / +6.9
MBA                  Object pose trajectory   No        Yes      +15.0 / +20.0
CFP w/o past         ChronoFlow               No        Yes      +32.5 / +11.8
CFP                  ChronoFlow               Yes       Yes      +32.5 / +35.3

Simulation Benchmarks

Success rates (%) on MetaWorld and RoboTwin 2.0. CFP (Unet) is best and CFP (DiT) is second-best on both benchmarks.

Method       MetaWorld Avg.   RoboTwin Avg.
DP3          30               43
3D-FDP       34               47
MBA          47               56
CFP (Unet)   72               66
CFP (DiT)    70               63
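The headline gains can be re-derived from this table. The short check below computes each method's gain over DP3 on both benchmarks and the mean of the two. That mean appears to match the simulation column of the Method Comparison table, though this page does not state that the column was computed this way.

```python
# Success rates (%) copied from the table: (MetaWorld Avg., RoboTwin Avg.)
results = {
    "DP3": (30, 43),
    "3D-FDP": (34, 47),
    "MBA": (47, 56),
    "CFP (Unet)": (72, 66),
    "CFP (DiT)": (70, 63),
}

base_mw, base_rt = results["DP3"]

# Absolute gain over DP3 in percentage points on each benchmark.
gains = {name: (mw - base_mw, rt - base_rt) for name, (mw, rt) in results.items()}

# Mean gain across the two simulation benchmarks.
sim_gain = {name: (g_mw + g_rt) / 2 for name, (g_mw, g_rt) in gains.items()}

for name in results:
    print(f"{name}: {gains[name]} mean {sim_gain[name]:+.1f}")
```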

Real-World Tasks

Platform

  • Flexiv Rizon robotic arm
  • Robotiq 2F-85 gripper
  • Intel RealSense D415 RGB-D camera

Tasks

  • Prepare Breakfast
  • Fold Towel
  • Pour Ball
  • Swap-Easy and Swap-Hard

Deployment

  • 3D policy only
  • 50 demonstrations per task
  • Async TAPIP3D tracking: 5.82 Hz

Real-world ChronoFlow-Policy tasks

Real-world manipulation tasks. The benchmark covers long-horizon execution, soft-object folding, and non-Markovian object swaps where the correct action depends on past interactions.

Success rates (%) by stage:

Task           Breakfast   Towel      Pour       Swap-Easy        Swap-Hard
Stage          I    II     I    II    I    II    I    II   III    I    II   III
RISE           73   53     80   40    80   47    87   27   13     89   17   11
3D-FDP         73   67     87   53    73   47    93   40   33     83   33   17
HistRISE       87   67     87   47    80   67    93   80   80     94   89   56
CFP w/o past   93   67     87   87    93   60    93   33   20     94   22   11
CFP (ours)     93   80     87   87    87   80    93   93   93     94   94   61

ChronoFlow Visualizations and Ablations

Predicted ChronoFlow on real-world tasks

Predicted object and gripper flows across real-world manipulation tasks.

Ablation study on Fold Towel

Fold Towel ablations show that ChronoFlow supervision, object points, sampling, and augmentation all matter for robustness.

Training Curves

MetaWorld learning curves

RoboTwin learning curves

Citation

If you find this work useful, please consider citing:

@inproceedings{lin2026chronoflowpolicy,
  title={ChronoFlow-Policy: Unifying Past-Current-Future Interaction Flow in Visuomotor Policy Learning},
  author={Lin, Bokai and Xu, Yifu and Zhan, Xinyu and Fang, Hongjie and Tian, Jialin and Zhang, Fu-Cheng and Li, Yong-Lu and Lu, Cewu and Yang, Lixin},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}