# V2Drop
**Repository Path**: lenghong/V2Drop
## Basic Information
- **Project Name**: V2Drop
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-18
- **Last Updated**: 2026-03-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
**Variation-aware Vision Token Dropping for Faster Large Vision-Language Models**
[Junjie Chen]()<sup>1*</sup>,
[Xuyang Liu](https://xuyang-liu16.github.io/)<sup>1*,†</sup>,
[Zichen Wen](https://scholar.google.com/citations?hl=en&user=N-aPFvEAAAAJ)<sup>2</sup>,
[Yiyu Wang]()<sup>2</sup>,
[Siteng Huang](https://kyonhuang.top/)<sup>3</sup>,
[Honggang Chen](https://sites.google.com/view/honggangchen/)<sup>1†</sup>

<sup>1</sup>Sichuan University, <sup>2</sup>EPIC Lab, Shanghai Jiao Tong University, <sup>3</sup>Zhejiang University
## News
* **`2026.03.15`** Our [code](https://github.com/xuyang-liu16/V2Drop/tree/main/Qwen2-VL) for Qwen2-VL is available! This work also references [DART](https://github.com/ZichenWen1/DART/tree/main/Qwen2-VL); thanks for their contributions.
* **`2026.02.21`** Our [V2Drop](https://arxiv.org/abs/2509.01552) has been accepted by **CVPR 2026**!
* **`2025.08.27`** We release our latest work [V2Drop](https://arxiv.org/abs/2509.01552), a variation-aware vision token dropping method for plug-and-play LVLM inference acceleration. [Code](https://github.com/xuyang-liu16/V2Drop) is available!
> **TL;DR:** Token-wise variation intuitively reflects vision-token importance (green boxes) while remaining compatible with efficient operators. We therefore present V2Drop, a plug-and-play framework that measures token-wise variation across adjacent LLM layers and progressively drops the vision tokens with minimal variation, accelerating LVLM inference without retraining.
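The dropping step described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation (see `llava/model/language_model/V2Drop.py` for that); the function name, tensor layout, keep ratio, and the use of an L2 norm as the variation measure are all assumptions.

```python
import torch

def drop_low_variation_tokens(prev_hidden, curr_hidden, vision_mask, keep_ratio=0.5):
    """Keep the vision tokens whose hidden states changed most between two
    adjacent LLM layers; drop the minimal-variation ones (illustrative sketch).

    prev_hidden, curr_hidden: (batch, seq_len, dim) hidden states from
        adjacent layers.
    vision_mask: (batch, seq_len) bool mask marking vision-token positions.
    Returns a (batch, seq_len) bool mask of tokens to keep.
    """
    # Token-wise variation: L2 distance between adjacent-layer hidden states.
    variation = (curr_hidden - prev_hidden).norm(dim=-1)  # (batch, seq_len)

    keep = torch.ones_like(vision_mask)  # text tokens are never dropped
    for b in range(vision_mask.size(0)):
        vis_idx = vision_mask[b].nonzero(as_tuple=True)[0]
        n_keep = max(1, int(keep_ratio * vis_idx.numel()))
        # Rank vision tokens by variation; drop the lowest-variation ones.
        order = variation[b, vis_idx].argsort(descending=True)
        keep[b, vis_idx[order[n_keep:]]] = False
    return keep
```

In the actual method this would be applied progressively at several layer boundaries, so the surviving vision-token set shrinks as decoding proceeds through the LLM.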
## Core Code
The core implementation is in [`llava/model/language_model/V2Drop.py`](https://github.com/xuyang-liu16/V2Drop/blob/main/llava/model/language_model/V2Drop.py).
## Preparation
### LLaVA
1. Clone this repository.
```bash
git clone https://github.com/xuyang-liu16/V2Drop
cd V2Drop
```
2. Set up the environment:
```bash
conda create -n V2Drop python=3.10 -y
conda activate V2Drop
pip install -e .
pip install flash-attn --no-build-isolation
```
3. Download the multimodal benchmarks.
Please follow the detailed instructions in [LLaVA-Evaluation](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md).
4. Download [LLaVA-1.5-7B](https://huggingface.co/liuhaotian/llava-v1.5-7b) and put it under `./liuhaotian/llava-v1.5-7b`.
> For users with limited access to Hugging Face (e.g., from mainland China), you can refer to this [alternative guide](https://cloud.baidu.com/article/3251091) and use the following commands, with LLaVA-1.5-7B as an example:
```bash
pip install -U huggingface_hub hf_transfer -i https://mirrors.aliyun.com/pypi/simple/
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download liuhaotian/llava-v1.5-7b --local-dir ./liuhaotian/llava-v1.5-7b
```
## Evaluation
Example for evaluating TextVQA results:
```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```
Example for evaluating MME results:
```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```
## Citation
If our findings help your research, please consider citing our paper:
```bibtex
@misc{chen2025variationawarevisiontokendropping,
      title={Variation-aware Vision Token Dropping for Faster Large Vision-Language Models},
      author={Junjie Chen and Xuyang Liu and Zichen Wen and Yiyu Wang and Siteng Huang and Honggang Chen},
      year={2025},
      eprint={2509.01552},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01552},
}
```