Abstract

Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% of their computational cost, setting a new standard for efficient stereo matching.

Overview

Given an input stereo image pair, features are first extracted by a shared-weight feature extraction module. A correlation module then constructs a cost volume from the extracted features, which a hybrid 3D-2D cost aggregation module processes to obtain a cost volume aggregated along both the disparity and spatial dimensions. Finally, a low-resolution disparity map is estimated, and a convex upsampling operation recovers the full-resolution disparity map.
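The two non-learned steps of this pipeline, cost volume construction and convex upsampling, can be sketched in NumPy. This is a minimal illustration, not the released implementation: the function names, the channel-mean correlation, the edge padding, the 3x3 neighbourhood, and the scaling of disparity by the upsampling factor are all assumptions borrowed from common practice. The feature extractor, the aggregation module, and the network that predicts the upsampling weights are omitted.

```python
import numpy as np


def correlation_volume(feat_left, feat_right, max_disp):
    """Build a correlation cost volume of shape (max_disp, H, W).

    feat_left, feat_right: feature maps of shape (C, H, W).
    Entry [d, y, x] is the channel-wise mean of
    feat_left[:, y, x] * feat_right[:, y, x - d].
    Positions where x - d falls outside the image stay at zero.
    """
    _, h, w = feat_left.shape
    volume = np.zeros((max_disp, h, w), dtype=feat_left.dtype)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d.
        product = feat_left[:, :, d:] * feat_right[:, :, : w - d]
        volume[d, :, d:] = product.mean(axis=0)
    return volume


def convex_upsample(disp, logits, factor):
    """Upsample a coarse disparity map by a convex combination of neighbours.

    disp:   coarse disparity, shape (H, W).
    logits: unnormalised weights, shape (9, factor, factor, H, W); in a real
            model these would be predicted by the network (assumption).
    Returns a map of shape (H * factor, W * factor).
    """
    h, w = disp.shape
    # Softmax over the 9 neighbours so the weights are non-negative and sum to 1.
    weights = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    # Disparity is measured in pixels, so it scales with the resolution.
    padded = np.pad(disp * factor, 1, mode="edge")
    # Gather the 3x3 neighbourhood of every coarse pixel: shape (9, H, W).
    neighbours = np.stack(
        [padded[dy : dy + h, dx : dx + w] for dy in range(3) for dx in range(3)]
    )
    # One convex combination per fine sub-pixel: shape (factor, factor, H, W).
    fine = (weights * neighbours[:, None, None]).sum(axis=0)
    # Interleave the sub-pixels into the full-resolution grid.
    return fine.transpose(2, 0, 3, 1).reshape(h * factor, w * factor)
```

With all-zero logits every neighbour receives equal weight, so a constant coarse map upsamples to a constant fine map scaled by `factor`, which is a convenient sanity check.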

Stage 1: The lite model is trained with a standard supervised setup on a mix of synthetic datasets comprising 1.8M labeled stereo image pairs. Stage 2: Self-distillation is employed, with both the teacher and student models initialized from the Stage 1 weights. The teacher receives clean data, while the student is fed perturbed inputs to encourage learning of domain-invariant representations via feature alignment. Stage 3: The lite model is fine-tuned on unlabeled real-world data using pseudo labels generated by a frozen accurate model.
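The objectives of the three stages can be sketched as plain loss functions. The text above does not specify the loss forms or the perturbations, so everything concrete here is an assumption made for illustration: the L1 disparity loss, the cosine-based feature alignment, the gain-and-noise perturbation, and all function and parameter names. The models are passed in as ordinary callables.

```python
import numpy as np


def disparity_loss(pred_disp, target_disp, valid):
    """Mean absolute disparity error over valid pixels (assumed L1 loss)."""
    return np.abs(pred_disp - target_disp)[valid].mean()


def feature_alignment_loss(student_feat, teacher_feat, eps=1e-8):
    """Mean of (1 - cosine similarity) per pixel; features have shape (C, H, W)."""
    s = student_feat / (np.linalg.norm(student_feat, axis=0, keepdims=True) + eps)
    t = teacher_feat / (np.linalg.norm(teacher_feat, axis=0, keepdims=True) + eps)
    return (1.0 - (s * t).sum(axis=0)).mean()


def perturb(image, rng, noise_std=0.05, gain_range=(0.8, 1.2)):
    """Hypothetical photometric perturbation: random gain plus Gaussian noise."""
    gain = rng.uniform(*gain_range)
    noisy = image * gain + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def stage1_loss(model, left, right, gt_disp, valid):
    """Stage 1: supervised training on labeled synthetic pairs."""
    return disparity_loss(model(left, right), gt_disp, valid)


def stage2_loss(student_encoder, teacher_encoder, image, rng):
    """Stage 2: the teacher sees the clean image, the student a perturbed copy."""
    teacher_feat = teacher_encoder(image)  # treated as a fixed target
    student_feat = student_encoder(perturb(image, rng))
    return feature_alignment_loss(student_feat, teacher_feat)


def stage3_loss(lite_model, frozen_model, left, right):
    """Stage 3: fit pseudo labels produced by a frozen accurate model."""
    pseudo_disp = frozen_model(left, right)
    valid = np.isfinite(pseudo_disp)  # assumed validity rule
    return disparity_loss(lite_model(left, right), pseudo_disp, valid)
```

In an actual training loop these losses would be minimised with an automatic differentiation framework, and gradients would flow only through the student or lite model.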

Qualitative Comparison

Scenes: Corridor, Wall, Gallery, Office, Staircase

Methods: LightStereo-M, BANet-2D, StereoAnything-L

Panel: Metric Point Cloud (Ours)

Quantitative Comparison

BibTeX


@article{jing2025,
  title={Lite Any Stereo: Efficient Zero-Shot Stereo Matching},
  author={Junpeng Jing and Weixun Luo and Ye Mao and Krystian Mikolajczyk},
  journal={arXiv:2511.16555},
  year={2025}
}

Contact

For questions, please reach out to Junpeng Jing (j.jing23@imperial.ac.uk).