Stereo Any Video:
Temporally Consistent Stereo Matching

ICCV 2025 (Highlight)
Imperial College London



Overview

Overview of the proposed Stereo Any Video. Given a stereo video sequence as input, the model first extracts features and context information using trainable convolutional encoders and a frozen monocular video depth encoder (Video Depth Anything). At each iteration, an all-to-all-pairs correlation computes matching costs between the two views, which an MLP encoder compresses into a compact representation. Disparities are iteratively refined by a 3D Gated Recurrent Unit (GRU) conditioned on cost volume features, then upsampled via a temporal convex upsampling layer. This process is repeated within a cascaded pipeline to progressively recover full-resolution disparities.
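To make the upsampling step concrete, below is a minimal numpy sketch of RAFT-style convex upsampling, the spatial building block behind the temporal convex upsampling layer mentioned above: each fine-resolution disparity is a convex (softmax-weighted) combination of a 3x3 coarse neighbourhood, with values scaled by the upsampling factor since disparity is measured in pixels. This is an illustrative sketch, not the authors' implementation; the function names and the `mask_logits` layout are assumptions, and the paper's temporal variant additionally mixes neighbours across frames.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def convex_upsample(disp, mask_logits, factor=4):
    """Upsample a coarse disparity map by `factor` using convex combinations
    of each pixel's 3x3 coarse neighbourhood (RAFT-style; hypothetical API).

    disp:        (H, W) coarse disparity
    mask_logits: (9, factor, factor, H, W) predicted combination logits
    returns:     (H*factor, W*factor) upsampled disparity
    """
    H, W = disp.shape
    w = softmax(mask_logits, axis=0)  # convex weights over the 9 neighbours
    pad = np.pad(disp, 1, mode='edge')
    # gather the 3x3 neighbourhood of every coarse pixel -> (9, H, W)
    neigh = np.stack([pad[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    # weighted sum over neighbours -> (factor, factor, H, W)
    up = (w * neigh[:, None, None]).sum(axis=0)
    # interleave the sub-pixel grid into a full-resolution map
    up = up.transpose(2, 0, 3, 1).reshape(H * factor, W * factor)
    # disparity is in pixels, so it scales with resolution
    return factor * up
```

With uniform (zero) logits every output pixel averages its coarse neighbourhood, so a constant coarse disparity `d` upsamples to a constant `factor * d`; in the model the GRU predicts the logits so edges are preserved rather than blurred.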

Comparison with representative methods on labeled datasets

Comparison on real-world videos

3D Point Tracking

BibTeX


@article{jing2025stereo,
  title={Stereo Any Video: Temporally Consistent Stereo Matching},
  author={Jing, Junpeng and Luo, Weixun and Mao, Ye and Mikolajczyk, Krystian},
  journal={arXiv preprint arXiv:2503.05549},
  year={2025}
}