# AnyDepth: Depth Estimation Made Easy

> *"Simplicity is prerequisite for reliability." --- Edsger W. Dijkstra*

![teaser](./assets/teaser.png)

> **AnyDepth: Depth Estimation Made Easy**
>
> Zeyu Ren<sup>1</sup>\*, [Zeyu Zhang](https://steve-zeyu-zhang.github.io/)<sup>2</sup>\*†, Wukai Li<sup>2</sup>, Qingxiang Liu<sup>3</sup>, Hao Tang<sup>2</sup>‡
>
> <sup>1</sup>The University of Melbourne, <sup>2</sup>Peking University, <sup>3</sup>Shanghai University of Engineering Science
>
> \*Equal contribution. †Project lead. ‡Corresponding author.
>
> ### [Paper](https://arxiv.org/abs/2601.02760) | [Website](https://aigeeksgroup.github.io/AnyDepth) | [Model](https://huggingface.co/AIGeeksGroup/AnyDepth) | [HF Paper](https://huggingface.co/papers/2601.02760)

## ✏️ Citation

If you find our code or paper helpful, please consider starring ⭐ us and citing:

```bibtex
@article{ren2026anydepth,
  title={AnyDepth: Depth Estimation Made Easy},
  author={Ren, Zeyu and Zhang, Zeyu and Li, Wukai and Liu, Qingxiang and Tang, Hao},
  journal={arXiv preprint arXiv:2601.02760},
  year={2026}
}
```

## 🏃 Intro

We present **AnyDepth**, a simple and efficient training framework for zero-shot monocular depth estimation. The core contribution is the **Simple Depth Transformer (SDT)**, a compact decoder that matches DPT's accuracy while using **85%–89%** fewer parameters.

**Key Features:**

- **Single-Path Fusion**: A fuse-then-reassemble strategy that avoids multi-branch cross-scale alignment
- **Weighted Fusion**: Learnable fusion of 4-layer ViT features with cls-token readout
- **Spatial Detail Enhancer (SDE)**: Depthwise convolution for local spatial modeling
- **DySample Upsampling**: Two-stage learnable upsampling (H/16 -> H/4 -> H)
- **Lightweight**: Only ~5-13M decoder parameters

![architecture](./assets/anydepth.png)

## ⚡ Quick Start

SDT has no dependencies beyond PyTorch. Simply replace DPT (or any other decoder) with SDT in your existing codebase.

### Usage

```python
import torch
from sdt_head import SDTHead

# in_channels: ViT-S=384, ViT-B=768, ViT-L=1024
# extract layers: [2, 5, 8, 11] for ViT-S/B, [4, 11, 17, 23] for ViT-L
in_channels = 384      # e.g. ViT-S
fusion_channels = 256  # placeholder; match your configuration

head = SDTHead(
    in_channels=in_channels,
    fusion_channels=fusion_channels,
    n_output_channels=1,
    use_cls_token=True,
)
```

## 📦 Datasets

We provide the training splits (369K samples) in the `datasets/` folder. To prepare the data:

1. **Hypersim & Virtual KITTI**: Follow the instructions from [Lotus](https://github.com/EnVision-Research/Lotus) to download and prepare these datasets.
2. **IRS**: Follow the official instructions at [IRS](https://github.com/blackjack2015/IRS).
3. **BlendedMVS**: Follow the official instructions at [BlendedMVS](https://github.com/YoYo000/BlendedMVS).
4. **TartanAir**: Follow the official instructions at [TartanAir](https://github.com/castacks/tartanair_tools).

## 📊 SDT vs DPT

![DPT_SDT](./assets/DPT_SDT.png)

**Key Difference**: DPT uses a reassemble-then-fuse strategy (per-layer reassembly followed by cross-scale fusion), while SDT uses a fuse-then-reassemble strategy (fuse tokens first, then single-path upsampling).
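To make the contrast concrete, here is a minimal PyTorch sketch of the fuse-then-reassemble order. It is illustrative only: the module and argument names are hypothetical, and plain bilinear upsampling stands in for DySample; the released `SDTHead` differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseThenReassemble(nn.Module):
    """Illustrative sketch of SDT's single-path decoding order (not the release code)."""

    def __init__(self, in_channels=384, fusion_channels=256, n_layers=4):
        super().__init__()
        # Learnable weights for fusing the 4 extracted ViT feature maps
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Linear(in_channels, fusion_channels)
        # Spatial Detail Enhancer: depthwise conv for local spatial modeling
        self.sde = nn.Conv2d(fusion_channels, fusion_channels, 3,
                             padding=1, groups=fusion_channels)
        self.head = nn.Conv2d(fusion_channels, 1, 3, padding=1)

    def forward(self, feats, cls_tokens, grid_hw):
        # feats: list of 4 token tensors (B, N, C); cls_tokens: list of (B, C)
        weights = self.layer_weights.softmax(dim=0)
        # Fuse first: weighted sum over layers, with cls-token readout added
        fused = sum(w * (f + c.unsqueeze(1))
                    for w, f, c in zip(weights, feats, cls_tokens))
        x = self.proj(fused)                     # (B, N, fusion_channels)
        # Reassemble once: tokens -> H/16 x W/16 feature map (single path)
        B, N, C = x.shape
        h16, w16 = grid_hw
        x = x.transpose(1, 2).reshape(B, C, h16, w16)
        x = x + self.sde(x)                      # local detail enhancement
        # Two-stage upsampling H/16 -> H/4 -> H (bilinear in place of DySample)
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        x = F.interpolate(self.head(x), scale_factor=4, mode="bilinear",
                          align_corners=False)
        return x.squeeze(1)                      # (B, H, W) depth map
```

Because fusion happens once in token space, only a single spatial path needs to be reassembled and upsampled, which is where the parameter savings over DPT's per-layer reassemble branches come from.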
### Decoder Parameters

| Decoder | ViT Backbone | Params (M) |
|---------|--------------|------------|
| DPT     | ViT-S        | 50.83      |
| DPT     | ViT-B        | 76.05      |
| DPT     | ViT-L        | 99.58      |
| **SDT** | ViT-S        | **5.51**   |
| **SDT** | ViT-B        | **9.45**   |
| **SDT** | ViT-L        | **13.38**  |

### Multi-Resolution Efficiency (ViT-L, H100 GPU)

| Resolution | Decoder        | FLOPs (G)   | Latency (ms)     |
|------------|----------------|-------------|------------------|
| 256×256    | DPT            | 444.14      | 6.66 ± 0.22      |
| 256×256    | **SDT (Ours)** | **234.17**  | **6.10 ± 0.33**  |
| 512×512    | DPT            | 1776.56     | 24.65 ± 0.22     |
| 512×512    | **SDT (Ours)** | **936.70**  | **23.17 ± 0.54** |
| 1024×1024  | DPT            | 7106.22     | 99.79 ± 0.79     |
| 1024×1024  | **SDT (Ours)** | **3746.79** | **93.09 ± 0.51** |

## 🧪 Zero-Shot Depth Estimation Results

### AnyDepth vs DPT (DINOv3 Encoder)

| Method       | Data | Encoder | Params | NYUv2   | KITTI   | ETH3D   | ScanNet | DIODE    |
|--------------|------|---------|--------|---------|---------|---------|---------|----------|
| DPT          | 584K | ViT-S   | 71.8M  | 8.4     | 10.8    | 12.7    | 8.3     | 26.0     |
| **AnyDepth** | 369K | ViT-S   | 26.5M  | 8.2     | 10.2    | 8.4     | 8.0     | 24.7     |
| DPT          | 584K | ViT-B   | 162.1M | 7.5     | 10.8    | 10.0    | 7.1     | 24.5     |
| **AnyDepth** | 369K | ViT-B   | 95.5M  | 7.2     | 9.7     | **8.0** | 6.8     | 23.6     |
| DPT          | 584K | ViT-L   | 399.6M | 6.1     | 8.9     | 13.0    | 6.0     | 23.4     |
| **AnyDepth** | 369K | ViT-L   | 313.4M | **6.0** | **8.6** | 9.6     | **5.4** | **22.6** |

*Metric: AbsRel % (lower is better)*

### Zero-Shot Affine-Invariant Depth Estimation with Different Encoders and Decoders

We fine-tune depth foundation models (DAv2, DA3, VGGT) on Hypersim and Virtual KITTI.

| Method | Encoder | Decoder  | NYUv2   | KITTI    | ETH3D   | ScanNet | DIODE    |
|--------|---------|----------|---------|----------|---------|---------|----------|
| DAv2   | ViT-B   | DPT      | 5.8     | **10.4** | 8.8     | 6.2     | **23.4** |
| DAv2   | ViT-B   | **SDT**  | **5.6** | 10.7     | **7.5** | **6.1** | 23.9     |
| DA3    | ViT-L   | DPT      | **4.9** | **8.8**  | 6.9     | 5.0     | 22.5     |
| DA3    | ViT-L   | Dual-DPT | **4.9** | 8.9      | 7.0     | **4.9** | 22.3     |
| DA3    | ViT-L   | **SDT**  | **4.9** | 8.9      | **5.8** | 5.0     | **21.9** |
| VGGT   | VGGT-1B | DPT      | **4.8** | 15.6     | 7.2     | **4.6** | 30.7     |
| VGGT   | VGGT-1B | **SDT**  | **4.8** | **15.5** | **7.0** | **4.6** | **30.6** |

*Metric: AbsRel % (lower is better). Encoders use pre-trained weights; decoders are randomly initialized.*

## 🚀 Real-World Deployment

SDT has been tested on a **Jetson Orin Nano (4GB)**:

| Resolution | Decoder        | Latency (ms) | FPS     |
|------------|----------------|--------------|---------|
| 256×256    | DPT            | 305.65       | 3.3     |
| 256×256    | **SDT (Ours)** | **213.35**   | **4.7** |
| 512×512    | DPT            | 1107.64      | 0.9     |
| 512×512    | **SDT (Ours)** | **831.48**   | **1.2** |

## 😘 Acknowledgement

- [DINOv2](https://github.com/facebookresearch/dinov2) / [DINOv3](https://github.com/facebookresearch/dinov3)
- [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2)
- [Depth Anything 3](https://github.com/ByteDance-Seed/Depth-Anything-3)
- [VGGT](https://github.com/facebookresearch/vggt)
- [DySample](https://arxiv.org/abs/2308.15085)
- [DPT](https://github.com/isl-org/DPT)

## 📜 License

This work is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
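**Note on the metric**: AbsRel is the mean absolute relative error |pred - gt| / gt over valid pixels. For the affine-invariant results above, each prediction is first aligned to the ground truth with a per-image scale and shift. Below is a minimal sketch assuming the common least-squares alignment; it is not the repository's evaluation script.

```python
import torch

def abs_rel(pred, gt, mask=None, eps=1e-6):
    """AbsRel (%) with per-image least-squares scale/shift alignment.

    Reference sketch of the standard affine-invariant protocol, not this
    repository's evaluation code. pred/gt: (H, W) tensors; mask: optional
    boolean tensor marking valid ground-truth pixels.
    """
    if mask is None:
        mask = gt > eps
    p, g = pred[mask].float(), gt[mask].float()
    # Solve min_{s,t} ||s * p + t - g||^2 in closed form
    A = torch.stack([p, torch.ones_like(p)], dim=1)      # (M, 2)
    st = torch.linalg.lstsq(A, g.unsqueeze(1)).solution  # (scale, shift)
    aligned = (A @ st).squeeze(1)
    return 100.0 * ((aligned - g).abs() / g).mean().item()
```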