VCR: Learning Predictive Visuomotor Coordination

Abstract

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. Our work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations.

We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions.

We evaluate our approach on the large-scale EgoExo4D dataset, where it generalizes well across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.

Method

Visuomotor Coordination Representation

At each timestep, the visuomotor state is defined as S_t = {H_t, G_t, U_t}, where H_t is head pose, G_t is the 3D gaze endpoint, and U_t denotes upper-body joint positions.
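As a rough illustration of this definition, the state S_t can be written as a small container type. This is only a sketch: the class and field names are hypothetical, the head pose is assumed to be stored as a rotation matrix plus a translation, and the two-joint upper body in the example is a placeholder rather than the joint set used in the paper.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass(frozen=True)
class HeadPose:
    """H_t: assumed here to be a rotation matrix plus a 3D position."""
    rotation: Mat3
    translation: Vec3


@dataclass(frozen=True)
class VisuomotorState:
    """S_t = {H_t, G_t, U_t} for a single timestep."""
    head: HeadPose                 # H_t, head pose
    gaze: Vec3                     # G_t, 3D gaze endpoint
    upper_body: Tuple[Vec3, ...]   # U_t, upper-body joint positions


IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Placeholder values, not taken from any dataset.
state = VisuomotorState(
    head=HeadPose(rotation=IDENTITY, translation=(0.0, 0.0, 1.6)),
    gaze=(0.0, 0.1, 1.0),
    upper_body=((0.2, -0.3, 1.3), (-0.2, -0.3, 1.3)),
)
```

A forecasting input or output is then simply an ordered sequence of such states, one per timestep.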

Visuomotor State Canonicalization

Visuomotor states are aligned to the reference frame of the last observed head pose, allowing the model to learn relative motion patterns independent of global movement.
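One way to picture this alignment is as a rigid change of coordinates into the last observed head frame. The sketch below assumes the reference head pose is given as a head-to-world rotation matrix R_ref and a world-space position t_ref; the function names are hypothetical, and plain Python is used only to keep the example dependency-free. The paper's exact conventions may differ.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def transpose(m: Mat3) -> Mat3:
    """Transpose a 3x3 matrix (the inverse, for a rotation matrix)."""
    return tuple(zip(*m))


def matvec(m: Mat3, v: Vec3) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def matmul(a: Mat3, b: Mat3) -> Mat3:
    """Multiply two 3x3 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def canonicalize_point(p: Vec3, r_ref: Mat3, t_ref: Vec3) -> Vec3:
    """Express a world-space point in the reference head frame: R_ref^T (p - t_ref)."""
    offset = tuple(a - b for a, b in zip(p, t_ref))
    return matvec(transpose(r_ref), offset)


def canonicalize_rotation(r: Mat3, r_ref: Mat3) -> Mat3:
    """Express a world-space rotation relative to the reference head: R_ref^T R."""
    return matmul(transpose(r_ref), r)


# Hypothetical reference pose: head rotated 90 degrees about the z-axis.
R_REF: Mat3 = ((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
T_REF: Vec3 = (1.0, 2.0, 0.0)

gaze_world: Vec3 = (1.0, 3.0, 0.0)
gaze_local = canonicalize_point(gaze_world, R_REF, T_REF)
```

Under these assumptions the reference head pose itself maps to the identity rotation at the origin, so gaze endpoints and joint positions (observed and future) are expressed as motion relative to where the head was at the last observed frame.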

Model Architecture

Kinematic features and egocentric visual embeddings are fused through structured head-gaze and head-gaze-arm pathways, then processed by a temporal encoder and diffusion denoiser to forecast future visuomotor states.

Qualitative Results

Predicted visuomotor coordination across diverse real-world activities from EgoExo4D. Each demo visualizes the coupled evolution of head pose, 3D gaze endpoints, and upper-body motion.

Bike Fixing

A precise turn with head and gaze fixed on the bike tire.

Both hands are raised at the end of forecasting, as in the ground-truth egocentric observation.

Despite drastic movement, gaze leads the direction of motion.

Cooking

Head and gaze shift along with the upper-body movement.

Naturally adjusting visuomotor coordination in response to an unusual pose.

Correctly turns left at the end of forecasting.

Turning to the left and raising the left arm.

Basketball

The player is prepared to catch the ball at the end of forecasting.

Head and gaze cue the mid-range jump shooting.

Catches the dropping ball and continues with a Mikan layup.

COVID Test

Attention shifts and returns during the health-related task.

BibTeX


@inproceedings{jia2026learning,
  title={Learning predictive visuomotor coordination},
  author={Jia, Wenqi and Lai, Bolin and Cao, Xu and Liu, Miao and Xu, Danfei and Rehg, James M},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3609--3619},
  year={2026}
}