# HRViT

**Repository Path**: silencewq/HRViT

## Basic Information

- **Project Name**: HRViT
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-01-24
- **Last Updated**: 2024-10-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

This repo is the official implementation of ["Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation"](https://arxiv.org/abs/2111.01236).

## Introduction

**HRViT**, introduced in [this arXiv paper](https://arxiv.org/abs/2111.01236), is a new vision transformer backbone design for semantic segmentation. It has a multi-branch high-resolution (HR) architecture with enhanced multi-scale representability. We balance the model performance and efficiency of HRViT through several branch-block co-optimization techniques: we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block for greater expressiveness.

HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing the state-of-the-art MiT and CSWin backbones by an average of +1.78 mIoU while using 28% fewer parameters and 21% fewer FLOPs.

![teaser](teaser.png)

## Main Results on ImageNet

| model | pretrain | resolution | acc@1 (%) | #params | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: |
| HRViT-b1 | ImageNet-1K | 224x224 | 80.5 | 19.7M | 2.7G |
| HRViT-b2 | ImageNet-1K | 224x224 | 82.3 | 32.5M | 5.1G |
| HRViT-b3 | ImageNet-1K | 224x224 | 82.8 | 37.9M | 5.7G |


## Main Results on Semantic Segmentation

**ADE20K Semantic Segmentation (val)**

| Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | #Params | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| HRViT-b1 | Segformer | ImageNet-1K | 512x512 | 160K | 45.88 | 8.2M | 14.6G |
| HRViT-b2 | Segformer | ImageNet-1K | 512x512 | 160K | 48.76 | 20.8M | 28.0G |
| HRViT-b3 | Segformer | ImageNet-1K | 512x512 | 160K | 50.20 | 28.7M | 67.9G |
| HRViT-b1 | UperNet | ImageNet-1K | 512x512 | 160K | 47.19 | 35.9M | 219G |
| HRViT-b2 | UperNet | ImageNet-1K | 512x512 | 160K | 49.10 | 49.7M | 233G |
| HRViT-b3 | UperNet | ImageNet-1K | 512x512 | 160K | 50.04 | 55.4M | 236G |


**Cityscapes Semantic Segmentation (val)**

| Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | #Params | FLOPs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| HRViT-b1 | Segformer | ImageNet-1K | 512x512 | 160K | 81.63 | 8.1M | 14.1G |
| HRViT-b2 | Segformer | ImageNet-1K | 512x512 | 160K | 82.81 | 20.8M | 27.4G |
| HRViT-b3 | Segformer | ImageNet-1K | 512x512 | 160K | 83.16 | 28.6M | 66.8G |
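
The mIoU columns above report the mean of the per-class intersection-over-union. As a rough illustration of how that metric is defined, here is a minimal pure-Python sketch that computes it from a confusion matrix. The function name `mean_iou` and the toy pixel counts are made up for this example; the evaluation code used in this repository (via MMSegmentation) may differ in details such as ignored labels and handling of absent classes.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[g][p] is the number of pixels with ground-truth class g
    that were predicted as class p.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]                                        # true positives
        fn = sum(confusion[c]) - tp                                 # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp  # wrongly predicted as c
        denom = tp + fp + fn
        if denom == 0:
            # Assumption: a class absent from both prediction and
            # ground truth is skipped rather than counted as 0 or 1.
            continue
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 3-class confusion matrix (values are illustrative only).
toy_confusion = [
    [50, 2, 3],
    [4, 40, 1],
    [0, 6, 30],
]
print(f"mIoU = {100 * mean_iou(toy_confusion):.2f}%")
```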


The training code can be found in [`segmentation`](segmentation).

## Requirements

Dependencies include timm==0.3.4, pytorch>=1.4, opencv, ... To install them, run:

```
bash install_req.sh
```

Data preparation: download ImageNet-1K and arrange it in the following folder structure. You can extract ImageNet with this [script](https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4).

```
│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
```
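
Before launching a long training run, it can help to confirm that the extracted dataset matches the layout above. The following is a minimal sketch using only the Python standard library. The helper names `summarize_split` and `check_imagenet_layout` are hypothetical and not part of this repository; the sketch assumes images use the `.JPEG` extension shown in the tree.

```python
from pathlib import Path


def summarize_split(root, split):
    """Return {class_folder_name: image_count} for one split ('train' or 'val')."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"missing split directory: {split_dir}")
    counts = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        # Count files whose extension is .JPEG (case-insensitive).
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir() if f.is_file() and f.suffix.upper() == ".JPEG"
        )
    return counts


def check_imagenet_layout(root):
    """Summarize class and image counts and flag classes missing from val/."""
    train = summarize_split(root, "train")
    val = summarize_split(root, "val")
    return {
        "train_classes": len(train),
        "val_classes": len(val),
        "train_images": sum(train.values()),
        "val_images": sum(val.values()),
        "classes_missing_from_val": sorted(set(train) - set(val)),
    }
```

Calling `check_imagenet_layout("<data path>")` returns a small report. For a complete ImageNet-1K copy, both class counts should be 1000 and `classes_missing_from_val` should be empty.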

## Train

Three variants can be trained: HRViT-b1, HRViT-b2, and HRViT-b3. Training uses 4 nodes (machines) with 8 GPUs per node.
On each machine, with `NODE_RANK` in {0, 1, 2, 3}, run the following command to train `MODEL` in {HRViT_b1_224, HRViT_b2_224, HRViT_b3_224}:
```
bash train.sh 4 8 <NODE_RANK> --data <data path> --model <MODEL> -b 32 --lr 1e-3 --weight-decay .05 --amp --img-size 224 --warmup-epochs 20 --drop-path 0.1 --head-drop 0.1 --clip-grad 1 --sync-bn
```
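
To make the multi-node setup concrete, the sketch below prints the command for each node and works out the global batch size. It rests on two assumptions that the text above does not state explicitly: that the three positional arguments of `train.sh` are the node count, GPUs per node, and node rank (inferred from the command shown), and that `-b` is the per-GPU batch size, as in timm's training script. The helper names and the `/path/to/imagenet` placeholder are illustrative only.

```python
NUM_NODES = 4
GPUS_PER_NODE = 8
PER_GPU_BATCH = 32  # assumption: -b is the batch size per GPU


def global_batch_size(num_nodes, gpus_per_node, per_gpu_batch):
    """Number of samples processed per optimizer step across all GPUs."""
    return num_nodes * gpus_per_node * per_gpu_batch


def train_command(node_rank, data_path, model):
    """Build the training command for one node, mirroring the flags shown above."""
    return (
        f"bash train.sh {NUM_NODES} {GPUS_PER_NODE} {node_rank} "
        f"--data {data_path} --model {model} -b {PER_GPU_BATCH} "
        "--lr 1e-3 --weight-decay .05 --amp --img-size 224 "
        "--warmup-epochs 20 --drop-path 0.1 --head-drop 0.1 "
        "--clip-grad 1 --sync-bn"
    )


total = global_batch_size(NUM_NODES, GPUS_PER_NODE, PER_GPU_BATCH)
print(f"global batch size: {total}")  # 4 * 8 * 32 = 1024

for rank in range(NUM_NODES):
    # "/path/to/imagenet" is a placeholder for the real dataset location.
    print(f"node {rank}: {train_command(rank, '/path/to/imagenet', 'HRViT_b1_224')}")
```

If fewer GPUs are available, the global batch size shrinks accordingly, and the learning rate may need retuning; the right adjustment is not specified here.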

If GPU memory is insufficient, enable [gradient checkpointing](https://pytorch.org/docs/stable/checkpoint.html) by adding `--with-cp`.


## Cite HRViT

```
@misc{gu2021hrvit,
      title={Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation},
      author={Jiaqi Gu and Hyoukjun Kwon and Dilin Wang and Wei Ye and Meng Li and Yu-Hsin Chen and Liangzhen Lai and Vikas Chandra and David Z. Pan},
      year={2021},
      eprint={2111.01236},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```


## Acknowledgement

This repository is built using the [timm](https://github.com/rwightman/pytorch-image-models) library, the [DeiT](https://github.com/facebookresearch/deit) repository, the [Swin Transformer](https://github.com/microsoft/Swin-Transformer) repository, the [CSWin](https://github.com/microsoft/CSWin-Transformer) repository, the [MMSegmentation](https://github.com/open-mmlab/mmsegmentation) repository, and the [MMCV](https://github.com/open-mmlab/mmcv) repository.

## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

[Meta Open Source Code of Conduct](https://opensource.fb.com/code-of-conduct)

### Contact Information

For help or issues using HRViT, please submit a GitHub issue.

For other communications related to HRViT, please contact Hyoukjun Kwon (`hyoukjunkwon@fb.com`) or Dilin Wang (`wdilin@fb.com`).

## License Information
The majority of HRViT is licensed under CC-BY-NC; however, portions of the project are available under separate license terms:
 - timm is licensed under the Apache-2.0 license
 - DeiT is licensed under the Apache-2.0 license
 - Swin Transformer is licensed under the MIT license 
 - CSWin Transformer is licensed under the MIT license
 - MMSegmentation is licensed under the Apache-2.0 license
 - MMCV is licensed under the Apache-2.0 license