Multimodal Transformers for Real-Time Surgical Activity Prediction

Published in IEEE International Conference on Robotics and Automation (ICRA), 2024

Figure: Architecture for multimodal surgical activity prediction.

This paper presents a multimodal transformer architecture for real-time recognition and prediction of surgical gestures and trajectories from short windows of kinematic and video data. Through an ablation study across different sensing modalities and feature representations, the work shows that combining robot kinematics with spatial and contextual video cues improves predictive accuracy while preserving real-time performance.
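To make the fusion idea concrete, here is a minimal sketch of a multimodal transformer for windowed gesture classification. It is not the paper's architecture: the feature dimensions, window length, fusion-by-concatenation scheme, and the 15-class output (matching the JIGSAWS gesture vocabulary) are all illustrative assumptions.

```python
# Minimal sketch (NOT the authors' implementation): per-frame kinematic and
# video feature vectors are projected to a shared width, tagged with learned
# modality embeddings, concatenated into one token sequence, and encoded by a
# transformer. All dimensions below are assumptions for illustration.
import torch
import torch.nn as nn

class MultimodalGestureTransformer(nn.Module):
    def __init__(self, kin_dim=38, vid_dim=512, d_model=128, n_classes=15,
                 n_heads=4, n_layers=2, window=30):
        super().__init__()
        self.kin_proj = nn.Linear(kin_dim, d_model)   # robot kinematics per frame
        self.vid_proj = nn.Linear(vid_dim, d_model)   # precomputed video features per frame
        self.modality = nn.Parameter(torch.zeros(2, d_model))        # modality embeddings
        self.pos = nn.Parameter(torch.zeros(1, 2 * window, d_model)) # positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)     # gesture logits for the window

    def forward(self, kin, vid):
        # kin: (batch, window, kin_dim); vid: (batch, window, vid_dim)
        k = self.kin_proj(kin) + self.modality[0]
        v = self.vid_proj(vid) + self.modality[1]
        x = torch.cat([k, v], dim=1) + self.pos       # fuse modalities as one sequence
        x = self.encoder(x)
        return self.head(x.mean(dim=1))               # pool over tokens, then classify

model = MultimodalGestureTransformer()
logits = model(torch.randn(1, 30, 38), torch.randn(1, 30, 512))
print(logits.shape)  # torch.Size([1, 15])
```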

Evaluated on the JIGSAWS dataset, the model outperforms prior gesture prediction approaches and processes one-second input windows in only a few milliseconds, making it suitable for applications such as safety monitoring, autonomy, and intelligent teleoperation assistance in robot-assisted surgery.
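For the real-time claim, a hedged sketch of how one might verify millisecond-scale inference on one-second windows, reusing the hypothetical model above and assuming 30 frames per second:

```python
# Time repeated forward passes over a one-second window (30 frames at an
# assumed 30 Hz) to estimate mean per-window latency.
import time
import torch

model.eval()  # MultimodalGestureTransformer from the sketch above
kin, vid = torch.randn(1, 30, 38), torch.randn(1, 30, 512)
with torch.no_grad():
    model(kin, vid)                      # warm-up pass
    t0 = time.perf_counter()
    for _ in range(100):
        model(kin, vid)
    dt = (time.perf_counter() - t0) / 100
print(f"mean latency: {dt * 1e3:.2f} ms per window")
```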

Download paper here