# IFVLA
**Repository Path**: AGIROS_Team/IFVLA
## Basic Information
- **Project Name**: IFVLA
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 1
- **Forks**: 1
- **Created**: 2025-12-22
- **Last Updated**: 2026-03-15
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# IF-VLA: Instruction-Focused Vision-Language-Action Model

> **IF-VLA: Instruction-Focused Vision-Language-Action Model with Perception and Decoupled Policy Guidance**
**IF-VLA** is a novel Vision-Language-Action (VLA) framework designed to address the critical failure of existing models in **multi-object environments**: the inability to reliably ground language instructions to the correct visual target. By enforcing explicit grounding constraints at both the perception and policy levels, IF-VLA effectively mitigates "Attention Drift" and "Lazy Learning."

## 🛠️ Installation & Deployment
To ensure **IF-VLA** runs correctly, please follow the steps below to set up the environment. We recommend using a Linux system equipped with a CUDA-enabled GPU.
### 1. Environment Setup
We recommend using Anaconda to create an isolated virtual environment to prevent dependency conflicts.
```bash
# Create virtual environment (Python 3.11+ is recommended)
conda create -n ifvla python=3.11 -y
# Activate the environment
conda activate ifvla
```
### 2. Install Dependencies
Install the required Python dependencies for the project.
> **Note**: Please ensure you are in the correct directory, or specify the absolute path to `requirements.txt`.
```bash
# Upgrade pip to ensure compatibility
pip install --upgrade pip
pip install -r requirements.txt
```
### 3. Converting Your Own Data to LeRobot
Before starting the conversion, ensure your data is in **HDF5** format.

#### (1) Run `add_bbox_to_hdf5.py`
Add bounding box information for each episode into the dataset:
```bash
python add_bbox_to_hdf5.py data/click_bell/demo_clean 50
```
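The exact datasets `add_bbox_to_hdf5.py` writes are repository-specific, but conceptually each episode gains one bounding box per frame. A minimal sanity check for such per-frame boxes could look like this (an illustrative helper, not code from the repository):

```python
def validate_bboxes(bboxes, num_frames):
    """Sanity-check per-frame bounding boxes before writing them into an
    episode file: one [x_min, y_min, x_max, y_max] row per frame.
    Illustrative helper; the layout used by add_bbox_to_hdf5.py may differ.
    """
    if len(bboxes) != num_frames:
        raise ValueError(f"expected {num_frames} boxes, got {len(bboxes)}")
    for i, (x_min, y_min, x_max, y_max) in enumerate(bboxes):
        if x_min > x_max or y_min > y_max:
            raise ValueError(f"frame {i}: min corner exceeds max corner")
    return bboxes
```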
#### (2) Run `process_data_ifvla.sh`
Perform data cleaning, alignment, and feature reorganization specifically for IF-VLA:
```bash
mkdir process_data training_data
bash process_data_ifvla.sh click_bell demo_clean 50
```
**Note:**
The script must be edited to match your dataset storage path.
On success, the data is saved under the `process_data` directory (e.g., `process_data/click_bell-demo_clean-50/`).
Next, create a new task folder under `training_data` (e.g., `click_bell`) and copy the contents of `process_data/click_bell-demo_clean-50/` into `training_data/click_bell/`.
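The copy step above can be sketched in Python (paths mirror the example names used in this section; adjust them to your layout):

```python
import shutil
from pathlib import Path

def stage_task_data(processed_dir, task_name, training_root="training_data"):
    """Copy processed episode data into a per-task training folder,
    e.g. process_data/click_bell-demo_clean-50 -> training_data/click_bell."""
    dst = Path(training_root) / task_name
    dst.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok=True (Python 3.8+) merges into an existing folder.
    shutil.copytree(processed_dir, dst, dirs_exist_ok=True)
    return dst

# Example:
# stage_task_data("process_data/click_bell-demo_clean-50", "click_bell")
```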
#### (3) Run `generate.sh`
Convert the processed data into the official LeRobot standard format:
```bash
bash generate.sh ${hdf5_path} ${repo_id}
```
e.g., `bash generate.sh training_data/click_bell click_bell_repo`
### 4. Fine-Tuning Base Models on Your Own Data
```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
source .venv/bin/activate
```
#### (1) Defining training configs and running training
* **TrainConfig**: Defines fine-tuning hyperparameters, data configuration, and the weight loader. You need to add specific configurations in `ifvla/src/ifvla/training/config.py` to support the two-stage training process.
```python
TrainConfig(
    name="ifvla_base_aloha_lora",
    model=ifvla.ifvlaConfig(paligemma_variant="gemma_2b_lora", action_expert_variant="gemma_300m_lora"),
    data=LeRobotAlohaDataConfig(
        repo_id="....",  # your dataset's repo_id, e.g. "click_bell_clean_repo"
        adapt_to_pi=False,
        repack_transforms=_transforms.Group(inputs=[
            _transforms.RepackTransform({
                "images": {
                    "cam_high": "observation.images.cam_high",
                    "cam_left_wrist": "observation.images.cam_left_wrist",
                    "cam_right_wrist": "observation.images.cam_right_wrist",
                },
                "state": "observation.state",
                "actions": "action",
                "prompt": "prompt",
            })
        ]),
        base_config=DataConfig(
            local_files_only=True,   # Set to True for local-only datasets.
            prompt_from_task=True,   # Set to True to derive the prompt from the task name.
        ),
    ),
    freeze_filter=ifvla.ifvlaConfig(
        paligemma_variant="gemma_2b_lora",
        action_expert_variant="gemma_300m_lora",
    ).get_freeze_filter(),
    batch_size=32,  # the total batch size, not the per-GPU batch size
    weight_loader=weight_loaders.CheckpointWeightLoader("s3://openpi-assets/checkpoints/pi0_base/params"),
    num_train_steps=30000,
    fsdp_devices=2,  # refer to line 359
)
```
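The `RepackTransform` mapping above renames flat LeRobot dataset keys into the nested structure the model consumes. A simplified stand-in (not the openpi implementation) shows the idea:

```python
def repack(frame, structure):
    """Rebuild `structure`, replacing each leaf string (a source key)
    with the corresponding value from the flat `frame` dict."""
    if isinstance(structure, dict):
        return {k: repack(frame, v) for k, v in structure.items()}
    return frame[structure]

structure = {
    "images": {"cam_high": "observation.images.cam_high"},
    "state": "observation.state",
    "prompt": "prompt",
}
frame = {
    "observation.images.cam_high": "img_high",
    "observation.state": [0.0] * 14,
    "prompt": "click the bell",
}
nested = repack(frame, structure)
# nested["images"]["cam_high"] is now "img_high"
```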
* **Stage 1: Perception Pre-training.**
This stage focuses on training the `BoundingboxHead`.
Freeze filter: set the `freeze_filter` in your config so that only Stage 1 components are updated:
```python
freeze_filter = ifvla.ifvlaConfig(
    paligemma_variant="gemma_2b_lora",
    action_expert_variant="gemma_300m_lora",
).get_freeze_filter_stage1()
```
Weight loading: `s3://openpi-assets/checkpoints/pi0_base/params`
* **Stage 2: Joint Diffusion Training.**
This stage trains both the `BoundingboxHead` and the action-generation (diffusion) components.
Freeze filter: set the `freeze_filter` to the standard version (allowing action-expert updates):
```python
freeze_filter = ifvla.ifvlaConfig(
    paligemma_variant="gemma_2b_lora",
    action_expert_variant="gemma_300m_lora",
).get_freeze_filter()
```
Weight loading: the `weight_loader` for this stage must point to the checkpoint saved during Stage 1 so the model inherits the learned perception features.
#### (2) Compute the normalization statistics
Before running training, compute the normalization statistics for the training data. Run the script below with the name of one of your two training configs (compute the statistics only once; the results can be copied directly to the other config):
```bash
uv run scripts/compute_norm_stats.py --config-name yourconfig
```
#### (3) Begin training
```bash
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py $train_config_name --exp-name=$model_name
```
**Note:**
For Stage 1 (Perception Pre-training), the `compute_loss` function in `ifvla/src/ifvla/models/ifvla.py` must be set to:
```python
total_loss = loss2  # loss2 corresponds to the Perception/BBox loss
```
For Stage 2 (co-training), the `compute_loss` function in `ifvla/src/ifvla/models/ifvla.py` must be set to:
```python
total_loss = loss1 + loss2
```
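Instead of hand-editing `compute_loss` between stages, the two settings above can be expressed as a single switch (a hypothetical refactor, not code from `ifvla.py`):

```python
def total_loss(loss1, loss2, stage):
    """Combine losses per training stage.

    Stage 1 (perception pre-training): only the BBox loss (loss2).
    Stage 2 (co-training): action loss (loss1) + BBox loss (loss2).
    Hypothetical helper; the repository edits compute_loss directly.
    """
    if stage == 1:
        return loss2
    if stage == 2:
        return loss1 + loss2
    raise ValueError(f"unknown stage: {stage}")
```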
#### (4) Spinning up a policy server and running inference
Once training is complete, we can run inference by spinning up a policy server.
### 5. Evaluation
#### (1) Install the `openpi-client` package in your robot environment:
```bash
cd packages/openpi-client
pip install -e .
```
#### (2) Query the remote policy server
You can then use the client to query the remote policy server from your robot code. Here's an example:
```python
from openpi_client import image_tools
from openpi_client import websocket_client_policy

# Outside of the episode loop, initialize the policy client.
# Point to the host and port of the policy server (localhost and 8000 are the defaults).
client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)

for step in range(num_steps):
    # Inside the episode loop, construct the observation.
    # Resize images on the client side to minimize bandwidth / latency. Always return images in uint8 format.
    # We provide utilities for resizing images + uint8 conversion so you match the training routines.
    # The typical resize_size for pre-trained pi0 models is 224.
    # Note that the proprioceptive `state` can be passed unnormalized; normalization is handled on the server side.
    observation = {
        "observation/image": image_tools.convert_to_uint8(
            image_tools.resize_with_pad(img, 224, 224)
        ),
        "observation/wrist_image": image_tools.convert_to_uint8(
            image_tools.resize_with_pad(wrist_img, 224, 224)
        ),
        "observation/state": state,
        "prompt": task_instruction,
    }

    # Call the policy server with the current observation.
    # This returns an action chunk of shape (action_horizon, action_dim).
    # Note that you typically only need to call the policy every N steps and execute
    # the remaining steps from the predicted action chunk open-loop.
    action_chunk = client.infer(observation)["actions"]

    # Execute the actions in the environment.
    ...
```
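The "call the policy every N steps" pattern mentioned in the comments can be factored into a small open-loop executor (a simplified sketch; the client, observation builder, and action executor are assumed from the example above):

```python
def run_episode(client, build_observation, execute, num_steps, replan_every=10):
    """Query the policy every `replan_every` steps and execute the
    returned action chunk open-loop in between."""
    chunk, offset = None, 0
    for _ in range(num_steps):
        # Re-query when the current chunk is exhausted or stale.
        if chunk is None or offset >= replan_every or offset >= len(chunk):
            chunk = client.infer(build_observation())["actions"]
            offset = 0
        execute(chunk[offset])
        offset += 1
```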
#### (3) Starting a remote policy server:
```bash
uv run scripts/serve_policy.py policy:checkpoint --policy.config=yourconfig --policy.dir=checkpoints/yourconfig/my_experiment/20000
```
e.g., `uv run scripts/serve_policy.py policy:checkpoint --policy.config=IFVLA_pi0_click_blue_bell_and_click_red_bell_lora_stage2 --policy.dir /data1/xuhuixin/project/IF-VLA/ifvla/checkpoints/IFVLA_pi0_click_blue_bell_and_click_red_bell_lora_stage2/ifvla_blue_bell_amd_red_bell_repo/20000`
This will spin up a server that listens on port 8000 and waits for observations to be sent to it. We can then run an evaluation script (or robot runtime) that queries the server.
---
### 📝 Notes on Dependencies
If you encounter version conflicts while installing `requirements.txt`, check the version compatibility of the following key libraries:
* `transformers`
* `accelerate`