# BitVLA

**Repository Path**: dong19960127/BitVLA

## Basic Information

- **Project Name**: BitVLA
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-06
- **Last Updated**: 2026-03-06

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

- **March 2026:** 🚀 We have released the **pre-trained BitVLA model**! The evaluation results in the table below have been updated to reflect the performance after pre-training. You can try our new [pre-trained model](https://huggingface.co/lxsy/bitvla-bf16) out of the box.
- **June 2025:** [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](https://arxiv.org/abs/2506.07530)

## Open Source Plan

- ✅ Paper, pre-trained VLM and evaluation code
- ✅ Fine-tuned VLA code and models
- ✅ Pre-trained VLA
- 🧭 Pre-training code

## Contents

- [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](#bitvla-1-bit-vision-language-action-models-for-robotics-manipulation)
  - [Contents](#contents)
  - [Checkpoints](#checkpoints)
  - [Vision-Language](#vision-language)
    - [Evaluation on VQA](#evaluation-on-vqa)
  - [Vision-Language-Action](#vision-language-action)
    - [Robotics Pre-training](#robotics-pre-training)
    - [OFT Training](#oft-training)
      - [1. Preparing OFT](#1-preparing-oft)
      - [2. OFT fine-tuning](#2-oft-fine-tuning)
    - [Evaluation on LIBERO](#evaluation-on-libero)
  - [Acknowledgement](#acknowledgement)
  - [Citation](#citation)
  - [License](#license)
  - [Contact Information](#contact-information)

## Checkpoints

| **Models** | **Size** | **Memory Usage↓** | **LIBERO-Spatial** | **LIBERO-Object** | **LIBERO-Goal** | **LIBERO-Long** | **Avg.** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| *Large Models* | | | | | | | |
| OpenVLA | 7.5B | 15.1GB (10.79×) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| CoT-VLA | 8.0B | 16.2GB (11.57×) | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| UniVLA | 8.5B | 17.0GB (12.14×) | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| UnifiedVLA | 8.5B | 17.0GB (12.14×) | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| OpenVLA-OFT | 7.7B | 15.4GB (11.00×) | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| *Small Models* | | | | | | | |
| SpatialVLA | 4.2B | 8.5GB (6.07×) | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| NORA-Long | 3.8B | 7.5GB (5.36×) | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| 4D-VLA | 4.1B | 8.3GB (5.93×) | 88.9 | 95.2 | 90.9 | 79.1 | 88.6 |
| SmolVLA | 2.3B | 4.6GB (3.29×) | 93.0 | 94.0 | 91.0 | 77.0 | 88.8 |
| GROOT-N1 | 2.2B | 4.4GB (3.14×) | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| π₀ | 3.5B | 7.0GB (5.00×) | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| BitVLA w/o pre-training | 3.0B | 1.4GB (1.00×) | 97.4 | 99.6 | 94.4 | 87.6 | 94.8 |
| 🚀**BitVLA** | 3.0B | 1.4GB (1.00×) | 96.6 | 99.0 | 95.4 | 92.8 | 96.0 |

| Model | Path |
| --- | --- |
| 🚀**BitVLA - VL&VLA pre-trained** | [lxsy/bitvla-bf16](https://huggingface.co/lxsy/bitvla-bf16) |
| BitVLA - VL pre-trained | [hongyuw/bitvla-bitsiglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) |
| BitVLA finetuned on LIBERO-Spatial | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16) |
| BitVLA finetuned on LIBERO-Object | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16) |
| BitVLA finetuned on LIBERO-Goal | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16) |
| BitVLA finetuned on LIBERO-Long | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16) |
| BitVLA w/ BF16 SigLIP | [hongyuw/bitvla-siglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16) |

*Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost. A dedicated inference framework and model are coming soon.*
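For intuition about what "1.58-bit" means here, the sketch below shows absmean ternary weight quantization together with per-token 8-bit activation quantization in the style of BitNet b1.58, which is what W1.58-A8 refers to. This is only an illustrative Python sketch, not the repository's actual quantization kernels, and the function names are ours.

```python
# Illustrative sketch of BitNet-b1.58-style W1.58-A8 quantization (not the repo's kernels).
# Weights collapse to {-1, 0, +1} plus a single scale; activations go to int8 per token.
import torch

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Absmean quantization: scale by mean(|w|), then round and clip to {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)   # ternary values, ~1.58 bits per weight
    return w_q, scale

def quantize_activations_int8(x: torch.Tensor, eps: float = 1e-5):
    """Per-token absmax quantization of activations to the signed 8-bit range."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 127.0
    x_q = (x / scale).round().clamp(-128, 127)
    return x_q, scale

# A quantized linear layer then computes y ≈ (x_q @ w_q.T) * x_scale * w_scale,
# which is why engines such as bitnet.cpp can cut inference memory so sharply.
```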
## Vision-Language

### Evaluation on VQA

We use the [LMM-Eval](https://github.com/ustcwhy/BitVLA/tree/main/lmms-eval) toolkit to conduct evaluations on VQA tasks. We provide the [transformers repo](https://github.com/ustcwhy/BitVLA/tree/main/transformers) in which we modify [modeling_llava.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/llava/modeling_llava.py) and [modeling_siglip.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/siglip/modeling_siglip.py) to support W1.58-A8 quantization.

The evaluation should be run inside the nvidia_24_07 Docker container. Install the packages:

```bash
docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity # only use for multimodal evaluation
docker exec -it nvidia_24_07 bash
git clone https://github.com/ustcwhy/BitVLA.git
cd BitVLA/
bash vl_eval_setup.sh # only use for multimodal evaluation
```

First, download the BitVLA models from HuggingFace:

```bash
git clone https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16 # BitVLA w/ W1.58-A8 SigLIP-L
git clone https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16 # BitVLA w/ BF16 SigLIP-L
```

Then run the following scripts to conduct the evaluations:

```bash
cd lmms-eval/
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16
```

Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.
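Beyond the evaluation scripts above, a quick smoke test of a downloaded checkpoint can be run directly in Python. The snippet below is a minimal sketch under the assumption that the checkpoint loads through the standard LLaVA classes of the bundled transformers fork (whose modified modeling_llava.py / modeling_siglip.py provide the W1.58-A8 layers); the prompt template and image path are placeholders.

```python
# Minimal VQA smoke test; assumes the bundled transformers fork is installed
# and that the checkpoint loads through the stock LLaVA interfaces it modifies.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "/YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16"  # local clone from the step above
processor = AutoProcessor.from_pretrained(model_path)
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to("cuda").eval()

image = Image.open("example.jpg")                                     # any local test image
prompt = "USER: <image>\nWhat objects are on the table? ASSISTANT:"   # assumed LLaVA-style template
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.bfloat16)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```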
## Vision-Language-Action

### Robotics Pre-training

To endow BitVLA with generalizable manipulation priors that transfer across embodiments and environments, we pre-train it with an autoregressive next-action prediction objective following OpenVLA.

**Pre-training Details:**

* **Base model:** We use [hongyuw/bitvla-bitsiglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) as the base model.
* **Dataset:** Following OpenVLA, we use a curated large-scale corpus based on a subset of the [Open X-Embodiment dataset](https://huggingface.co/collections/IPEC-COMMUNITY/openx-lerobot), resulting in ~1M training samples.
* **Hyperparameters:** We train the model for 200K steps with a total batch size of 2048. The peak learning rates are set to 3×10⁻⁴ for the LLM and 1×10⁻⁴ for the ViT.
* **Compute:** The full pre-training takes approximately 14 days on 16 NVIDIA H800 (80GB) GPUs.

### OFT Training

#### 1. Preparing OFT

We fine-tune BitVLA with the OFT recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/tree/main). First, set up the environment as required by that project; see [SETUP.md](https://github.com/moojink/openvla-oft/blob/main/SETUP.md) and [LIBERO.md](https://github.com/moojink/openvla-oft/blob/main/LIBERO.md) for detailed instructions.

```
conda create -n bitvla python=3.10 -y
conda activate bitvla
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
# or use the provided docker
# docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity

cd BitVLA
pip install -e openvla-oft/
pip install -e transformers

cd openvla-oft/

# install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -e LIBERO/
pip install -r experiments/robot/libero/libero_requirements.txt

# install bitvla
pip install -e bitvla/
```

We use the same dataset as OpenVLA-OFT for fine-tuning on LIBERO. You can download it from [HuggingFace](https://huggingface.co/datasets/openvla/modified_libero_rlds):

```
git clone git@hf.co:datasets/openvla/modified_libero_rlds
```

#### 2. OFT fine-tuning

##### Prepare the BitVLA

* 🚀 **New [pre-trained model](https://huggingface.co/lxsy/bitvla-bf16) (Recommended):** This model is ready to use out of the box. No additional processing is required, and you can directly execute our provided scripts.
* 🕰️ **[Old model](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16):** This version was not pre-trained on the Open X-Embodiment dataset. Before using it, you must first convert it into a format compatible with our codebase:

```
python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
```

##### Fine-tuning the BitVLA

After that, you can fine-tune BitVLA using the provided shell scripts:

```
sh ft_script/ft_bitvla_libero_spatial.sh
sh ft_script/ft_bitvla_libero_object.sh
sh ft_script/ft_bitvla_libero_goal.sh
sh ft_script/ft_bitvla_libero_long.sh
```

### Evaluation on LIBERO

You can download our fine-tuned BitVLA models from [HuggingFace](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e). As an example, for the LIBERO-Spatial suite, run the following script for evaluation:

```
python experiments/robot/libero/run_libero_eval_bitnet.py \
  --pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
  --task_suite_name libero_spatial \
  --info_in_path "information you want to show in path" \
  --model_family "bitnet"
```
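To evaluate all four suites in one pass, you can wrap the command above in a small driver such as the sketch below; the checkpoint paths are placeholders for your own fine-tuned (or downloaded) models, and the loop simply replays the documented flags per suite.

```python
# Hypothetical convenience wrapper around run_libero_eval_bitnet.py.
# The checkpoint locations are placeholders; point them at your own models.
import subprocess

suites = {
    "libero_spatial": "/path/to/ft-bitvla-libero_spatial",
    "libero_object": "/path/to/ft-bitvla-libero_object",
    "libero_goal": "/path/to/ft-bitvla-libero_goal",
    "libero_long": "/path/to/ft-bitvla-libero_long",
}

for suite, ckpt in suites.items():
    subprocess.run(
        [
            "python", "experiments/robot/libero/run_libero_eval_bitnet.py",
            "--pretrained_checkpoint", ckpt,
            "--task_suite_name", suite,
            "--info_in_path", suite,       # free-form tag shown in the output path
            "--model_family", "bitnet",
        ],
        check=True,
    )
```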
## Acknowledgement

This repository is built using [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [Hugging Face's transformers](https://github.com/huggingface/transformers), [OpenVLA-OFT](https://github.com/moojink/openvla-oft) and [OpenVLA](https://github.com/openvla/openvla).

## Citation

If you find this repository useful, please consider citing our work:

```
@article{bitvla,
  title={BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation},
  author={Hongyu Wang and Chuyan Xiong and Ruiping Wang and Xilin Chen},
  year={2025},
  eprint={2506.07530},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
}
```

## License

This project is licensed under the MIT License.

## Contact Information

For help or issues using the models, please submit a GitHub issue.