# V2Drop

**Repository Path**: lenghong/V2Drop

## Basic Information

- **Project Name**: V2Drop
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-18
- **Last Updated**: 2026-03-18

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

🧷 Variation-aware Vision Token Dropping for Faster Large Vision-Language Models 🚀

[Junjie Chen]()<sup>1*</sup>, [Xuyang Liu](https://xuyang-liu16.github.io/)<sup>1*,†</sup>, [Zichen Wen](https://scholar.google.com/citations?hl=en&user=N-aPFvEAAAAJ)<sup>2</sup>, [Yiyu Wang]()<sup>2</sup>, [Siteng Huang](https://kyonhuang.top/)<sup>3</sup>, [Honggang Chen](https://sites.google.com/view/honggangchen/)<sup>1✉</sup>

<sup>1</sup>Sichuan University, <sup>2</sup>EPIC Lab, Shanghai Jiao Tong University, <sup>3</sup>Zhejiang University

## 🔥 News

* **`2026.03.15`** 💻💻 Our [code](https://github.com/xuyang-liu16/V2Drop/tree/main/Qwen2-VL) for Qwen2-VL is available! This work also references [DART](https://github.com/ZichenWen1/DART/tree/main/Qwen2-VL); thanks for their contributions.
* **`2026.02.21`** 🎊🎊 Our [V2Drop](https://arxiv.org/abs/2509.01552) has been accepted by **CVPR 2026**!
* **`2025.08.27`** 🤗🤗 We release our latest work [V2Drop](https://arxiv.org/abs/2509.01552), a variation-aware vision token dropping method for plug-and-play LVLM inference acceleration. [Code](https://github.com/xuyang-liu16/V2Drop) is available!

> **TLDR:** Token-wise variation intuitively reflects vision token importance (green boxes) while remaining compatible with efficient operators. We therefore present V2Drop, a plug-and-play framework that measures token-wise variation across adjacent LLM layers and progressively drops the vision tokens with minimal variation, accelerating inference without any retraining.

## 💥 Core Codes

The core implementation of our code is in [`llava/model/language_model/V2Drop.py`](https://github.com/xuyang-liu16/V2Drop/blob/main/llava/model/language_model/V2Drop.py).

## 🛠 Preparation

### LLaVA

1. Clone this repository.

```bash
git clone https://github.com/xuyang-liu16/V2Drop
cd V2Drop
```

2. Set up and prepare the environment.

```Shell
conda create -n V2Drop python=3.10 -y
conda activate V2Drop
pip install -e .
pip install flash-attn --no-build-isolation
```

3. Download the multimodal benchmarks. Please follow the detailed instructions in [LLaVA-Evaluation](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md).

4. Download [LLaVA-1.5-7B](https://huggingface.co/liuhaotian/llava-v1.5-7b) and put it under `./liuhaotian/llava-v1.5-7b`.

> For users with limited access to Hugging Face (e.g., from mainland China), you can refer to this [alternative guide](https://cloud.baidu.com/article/3251091) and use the following commands, with LLaVA-1.5-7B as an example:

```
pip install -U huggingface_hub hf_transfer -i https://mirrors.aliyun.com/pypi/simple/
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download liuhaotian/llava-v1.5-7b --local-dir ./liuhaotian/llava-v1.5-7b
```

## 🚀 Evaluation

Example for evaluating TextVQA results:

```
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```

Example for evaluating MME results:

```
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```

## 📌 Citation

Please consider citing our paper in your publications if our findings help your research.
```bibtex
@misc{chen2025variationawarevisiontokendropping,
      title={Variation-aware Vision Token Dropping for Faster Large Vision-Language Models},
      author={Junjie Chen and Xuyang Liu and Zichen Wen and Yiyu Wang and Siteng Huang and Honggang Chen},
      year={2025},
      eprint={2509.01552},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01552},
}
```
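As a rough illustration of the TLDR above — not the repository's actual implementation, which lives in `llava/model/language_model/V2Drop.py` — variation-based token dropping can be sketched in a few lines: measure how much each vision token's hidden state changes between two adjacent LLM layers, then keep only the tokens that changed most. The function name, the `keep_ratio` parameter, and the NumPy stand-in for real hidden-state tensors are all illustrative assumptions.

```python
import numpy as np

def drop_low_variation_tokens(h_prev, h_curr, keep_ratio=0.5):
    """Hypothetical helper: keep the vision tokens whose hidden states
    changed most between two adjacent LLM layers.

    h_prev, h_curr: (num_tokens, hidden_dim) hidden states from
    adjacent layers; keep_ratio: fraction of tokens to retain.
    """
    # Token-wise variation: L2 norm of the per-token state change.
    variation = np.linalg.norm(h_curr - h_prev, axis=-1)
    num_keep = max(1, int(round(keep_ratio * h_curr.shape[0])))
    # Retain the highest-variation tokens, preserving sequence order.
    keep_idx = np.sort(np.argsort(variation)[-num_keep:])
    return h_curr[keep_idx], keep_idx
```

Applied once per layer (or at a few chosen layers), this progressively shrinks the vision token sequence, which is why the method composes with efficient attention operators: it only shortens the input, without modifying the attention kernel itself.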