Lite Any Stereo:
Efficient Zero-Shot Stereo Matching


Imperial College London

Abstract

Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% of their computational cost, setting a new standard for efficient stereo matching.

Overview

Given an input stereo image pair, features are first extracted by a shared-weight feature extraction module. A correlation module then constructs a cost volume from the extracted features, which a hybrid 3D-2D cost aggregation module processes to aggregate costs along both the disparity and spatial dimensions. Finally, a low-resolution disparity map is estimated, and a convex upsampling operation recovers the full-resolution disparity map.
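The non-learned parts of this pipeline can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names, the soft-argmax disparity regression, the 3x3 neighbourhood, and the edge padding are assumptions chosen for illustration, and the learned feature extractor and hybrid 3D-2D aggregation module are omitted.

```python
import numpy as np

def correlation_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) correlation volume from (C, H, W) feature maps."""
    _, h, w = feat_left.shape
    volume = np.zeros((max_disp, h, w), dtype=np.float64)
    for d in range(min(max_disp, w)):
        # Left pixel x is compared with right pixel x - d.
        # Columns with x < d have no counterpart and stay at zero.
        volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :w - d]).mean(axis=0)
    return volume

def soft_argmax_disparity(volume, temperature=1.0):
    """Assumed regression step: softmax over disparity, then the expected disparity."""
    logits = volume / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    disp_values = np.arange(volume.shape[0], dtype=np.float64).reshape(-1, 1, 1)
    return (prob * disp_values).sum(axis=0)

def convex_upsample(disp, mask_logits, factor):
    """Upsample an (H, W) disparity map by `factor`.

    Each fine pixel is a convex combination (softmax weights) of the 3x3
    coarse neighbours. mask_logits has shape (9, factor, factor, H, W) and
    would be predicted by the network; here it is simply an input.
    """
    h, w = disp.shape
    logits = mask_logits - mask_logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    # Disparities are measured in pixels, so they scale with the resolution.
    padded = np.pad(disp * factor, 1, mode="edge")  # edge padding is an assumption
    neighbours = np.stack(
        [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    )  # (9, H, W)
    up = (weights * neighbours[:, None, None]).sum(axis=0)  # (factor, factor, H, W)
    # Interleave the sub-pixel offsets into a (H * factor, W * factor) map.
    return up.transpose(2, 0, 3, 1).reshape(h * factor, w * factor)
```

Because the upsampling weights are non-negative and sum to one, a constant low-resolution disparity map stays constant after upsampling (scaled by `factor`), which is a convenient sanity check for this kind of operator.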

Stage 1: The lite model is trained with a standard supervised setup on a mix of synthetic datasets comprising 1.8M labeled stereo image pairs. Stage 2: Self-distillation is employed, with both the teacher and student models initialized from the Stage 1 weights. The teacher receives clean data, while the student is fed perturbed inputs to encourage learning of domain-invariant representations via feature alignment. Stage 3: The lite model is fine-tuned on unlabeled real-world data using pseudo labels generated by a frozen accurate model.
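The objectives behind Stages 2 and 3 can be sketched as follows. This is a hedged illustration only: the perturbation (random gain plus Gaussian noise), the cosine-based alignment loss, the L1 pseudo-label loss, the optional validity mask, and all function names are assumptions, since the page does not specify the exact losses or augmentations.

```python
import numpy as np

def perturb(image, rng, noise_std=0.05, max_gain=0.2):
    """Hypothetical student-side perturbation: random gain plus Gaussian noise.

    `image` is expected to hold values in [0, 1].
    """
    gain = 1.0 + rng.uniform(-max_gain, max_gain)
    noisy = image * gain + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def feature_alignment_loss(student_feat, teacher_feat, eps=1e-8):
    """Stage 2 (assumed form): mean of 1 - cosine similarity per pixel.

    Both inputs are (C, H, W) feature maps; the teacher sees the clean image
    and the student sees the perturbed one.
    """
    s = student_feat / (np.linalg.norm(student_feat, axis=0, keepdims=True) + eps)
    t = teacher_feat / (np.linalg.norm(teacher_feat, axis=0, keepdims=True) + eps)
    return float(np.mean(1.0 - (s * t).sum(axis=0)))

def pseudo_label_loss(student_disp, pseudo_disp, valid=None):
    """Stage 3 (assumed form): L1 error against disparities from a frozen teacher.

    `valid` is an optional boolean mask and must select at least one pixel.
    """
    err = np.abs(student_disp - pseudo_disp)
    if valid is None:
        return float(err.mean())
    return float(err[valid].mean())
```

In a training loop, the teacher's features and pseudo labels would be computed without gradient tracking, so that only the student (the lite model) is updated.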

Qualitative Comparison

Scenes: Corridor, Wall, Gallery, Office, Staircase.

Methods compared: LightStereo-M, BANet-2D, StereoAnything-L.

Panels: left-right RGB input, disparity (ours), metric point cloud (ours).

Quantitative Comparison

BibTeX


@article{jing2025,
  title={Lite Any Stereo: Efficient Zero-Shot Stereo Matching},
  author={Junpeng Jing and Weixun Luo and Ye Mao and Krystian Mikolajczyk},
  journal={arXiv},
  year={2025}
}

Contact

For questions, please reach out to Junpeng Jing (j.jing23@imperial.ac.uk).