Stereo Any Video:
Temporally Consistent Stereo Matching

Imperial College London
Rolling Video Comparison



Overview

Overview of the proposed Stereo Any Video. Given a stereo video sequence as input, the model first extracts features and context information using trainable convolutional encoders and a frozen monocular video depth encoder (Video Depth Anything). At each iteration, an all-to-all-pairs correlation computes matching costs between left and right features, followed by an MLP encoder that produces a compact representation. Disparities are iteratively refined by a 3D Gated Recurrent Unit (GRU) that integrates cost-volume features, then upsampled via a temporal convex upsampling layer. This process repeats within a cascaded pipeline to progressively recover full-resolution disparities.
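To make the correlation step concrete, below is a minimal NumPy sketch of an all-pairs matching-cost computation along each scanline, applied per frame. The function name and shapes are illustrative assumptions, not the authors' implementation; the paper's actual correlation operates on learned features inside the cascaded pipeline.

```python
import numpy as np

def all_pairs_correlation(feat_left, feat_right):
    """Hypothetical helper: correlate every left-image column with every
    right-image column on the same scanline, independently per frame.

    feat_left, feat_right: (T, C, H, W) feature maps for T frames.
    Returns: (T, H, W, W) correlation volume, scaled by 1/sqrt(C)
    (the usual dot-product normalization).
    """
    T, C, H, W = feat_left.shape
    # corr[t, h, w1, w2] = <feat_left[t, :, h, w1], feat_right[t, :, h, w2]>
    corr = np.einsum('tchw,tchv->thwv', feat_left, feat_right)
    return corr / np.sqrt(C)

# Tiny usage example with random features
fl = np.random.randn(2, 8, 4, 5).astype(np.float32)
fr = np.random.randn(2, 8, 4, 5).astype(np.float32)
vol = all_pairs_correlation(fl, fr)
print(vol.shape)  # (2, 4, 5, 5): one WxW cost slice per frame and scanline
```

In the full model, this volume would be compressed by the MLP encoder and sampled by the 3D GRU at each refinement iteration.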

Comparison with representative methods on labeled datasets

Comparison on real-world videos

3D Point Tracking

BibTeX

@misc{jing2025stereovideotemporallyconsistent,
  title={Stereo Any Video: Temporally Consistent Stereo Matching},
  author={Junpeng Jing and Weixun Luo and Ye Mao and Krystian Mikolajczyk},
  year={2025},
  eprint={2503.05549},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.05549},
}