# Navformer

**Repository Path**: tj1652045/Navformer

## Basic Information

- **Project Name**: Navformer
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-28
- **Last Updated**: 2026-04-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Navformer

Navformer is the end-to-end model training and evaluation component of
[WorldEngine](https://github.com/OpenDriveLab/WorldEngine). It is built on
MMDetection3D, the nuPlan / OpenScene datasets, and NAVSIM.

It supports a full training loop:
**train → open-loop evaluation → rare case extraction → RL fine-tuning**,
with **VADv2** and **HydraMDP** as the supported model architectures.

---

## Table of Contents

- [System Requirements](#system-requirements)
- [Installation](#installation)
- [Environment Variables](#environment-variables)
- [Data](#data)
- [Quick Reference](#quick-reference)
- [Training](#training)
- [Evaluation](#evaluation)
- [Rare Case Extraction](#rare-case-extraction)
- [Configuration](#configuration)
- [Model Architectures](#model-architectures)
- [Advanced Training](#advanced-training)
- [Troubleshooting](#troubleshooting)
- [Performance Optimization](#performance-optimization)

---

## System Requirements

**Minimum:**
- GPU: NVIDIA GPU with 8 GB VRAM (e.g., RTX 2080)
- RAM: 32 GB
- Storage: 500 GB SSD
- CPU: 8 cores

**Recommended:**
- GPU: NVIDIA GPU with 24 GB+ VRAM (e.g., RTX 3090, A100)
- RAM: 64 GB+
- Storage: 5 TB+ SSD
- CPU: 16+ cores

**Software:**
- OS: Linux (Ubuntu 20.04 / 22.04)
- CUDA: 11.8
- Conda / Miniconda

---

## Installation

### 1. Create Conda Environment

```bash
conda create --name navformer python=3.9 -y
conda activate navformer
```

### 2. Install PyTorch

```bash
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 \
    --index-url https://download.pytorch.org/whl/cu118
```

Verify:

```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
# Expected: PyTorch: 2.0.1+cu118, CUDA: True
```

### 3. Install MMCV (build from source)

MMCV must be built from source to include custom CUDA operators:

```bash
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.6.2

# Build with custom ops (takes 10–15 minutes)
# Downgrade setuptools to ~75.1.0 if you encounter build errors
MMCV_WITH_OPS=1 pip install -v -e .

python .dev_scripts/check_installation.py
cd ..
```

Verify:

```bash
python -c "import mmcv; print(f'MMCV: {mmcv.__version__}')"
# Expected: MMCV: 1.6.2
```

### 4. Install OpenMMLab Ecosystem

```bash
pip install mmcls==0.25.0
pip install mmdet==2.25.3
pip install mmdet3d==1.0.0rc6
pip install mmsegmentation==0.29.1
```

### 5. Install Navformer Dependencies

```bash
pip install -r requirements.txt
pip install shapely==2.0.4
```

### 6. Verify Installation

```bash
python -c "
import torch, mmcv, mmdet, mmdet3d, numpy, hydra
print('All Navformer dependencies OK')
print(f'PyTorch {torch.__version__}')
print(f'MMCV {mmcv.__version__}')
print(f'MMDetection3D {mmdet3d.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
"
```

---

## Environment Variables

Navformer relies on the [NAVSIM devkit v1.1](https://github.com/autonomousvision/navsim):

```bash
git clone -b v1.1 https://github.com/autonomousvision/navsim.git
```

Add the following to `~/.bashrc` or `~/.zshrc`:

```bash
export NAVSIM_DEVKIT_ROOT="/path/to/navsim"
export NAVFORMER_ROOT="/path/to/Navformer"
export NUPLAN_MAPS_ROOT="/path/to/nuplan/maps"

export PYTHONPATH="$NAVFORMER_ROOT:$NAVSIM_DEVKIT_ROOT:$PYTHONPATH"
```

Apply:

```bash
source ~/.bashrc   # or source ~/.zshrc
```
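
After sourcing the file, a quick sanity check can catch a missing or mistyped variable before a long job starts. The following is a minimal sketch, not part of Navformer: `check_env` is a hypothetical helper, and it only assumes the three variable names and the `PYTHONPATH` entries shown above.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = ("NAVSIM_DEVKIT_ROOT", "NAVFORMER_ROOT", "NUPLAN_MAPS_ROOT")


def check_env(environ=os.environ):
    """Return a list of human-readable problems; an empty list means OK."""
    path_entries = environ.get("PYTHONPATH", "").split(os.pathsep)
    problems = []
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    # Both source trees are expected to be importable via PYTHONPATH.
    for name in ("NAVFORMER_ROOT", "NAVSIM_DEVKIT_ROOT"):
        value = environ.get(name)
        if value and value not in path_entries:
            problems.append(f"{name} ({value}) is not on PYTHONPATH")
    return problems


# Report anything that looks wrong in the current shell environment.
for problem in check_env():
    print(problem)
```

An empty output means all three variables resolve to existing directories and both source trees are on `PYTHONPATH`.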

---

## Data

### Directory Layout

```
Navformer/
├── data/
│   ├── raw/                       # nuPlan and OpenScene datasets
│   └── alg_engine/                # Navformer-specific data
└── experiments/                   # Experiment outputs (auto-created)
```

### Download

Navformer reuses the **[OpenDriveLab/WorldEngine](https://huggingface.co/datasets/OpenDriveLab/WorldEngine)** dataset on Hugging Face, which contains merged annotation PKLs, PDM caches, model checkpoints, and K-means vocab files.

- **Hugging Face**:
  ```bash
  curl -LsSf https://hf.co/cli/install.sh | bash
  hf download OpenDriveLab/WorldEngine --repo-type dataset --local-dir /path/to/Navformer
  ```

- **ModelScope** (recommended for users in China):
  ```bash
  pip install modelscope
  modelscope download --dataset OpenDriveLab/WorldEngine
  ```

### Raw Data (`data/raw/`)

```
data/raw/
├── nuplan/
│   └── dataset/
│       ├── maps/                  # HD maps (required)
│       │   ├── nuplan-maps-v1.0.json
│       │   ├── us-nv-las-vegas-strip/
│       │   ├── us-ma-boston/
│       │   ├── us-pa-pittsburgh-hazelwood/
│       │   └── sg-one-north/
│       └── nuplan-v1.1/
│           ├── sensor_blobs/      # Camera images and LiDAR
│           └── splits/
│
└── openscene-v1.1/
    ├── sensor_blobs/
    │   ├── trainval/
    │   └── test/
    └── meta_datas/
        ├── trainval/
        └── test/
```

Use symlinks to point at your existing downloads:

```bash
cd data/raw
ln -s /path/to/nuplan nuplan
ln -s /path/to/openscene-v1.1 openscene-v1.1
```

### Navformer Data (`data/alg_engine/`)

```
data/alg_engine/
├── ckpts/                         # Pre-trained model checkpoints
├── merged_infos_navformer/
│   ├── nuplan_openscene_navtrain.pkl
│   └── nuplan_openscene_navtest.pkl
├── pdms_cache/                    # Pre-computed PDM metrics cache
│   ├── pdm_8192_gt_cache_navtrain.pkl
│   └── pdm_8192_gt_cache_navtest.pkl
└── test_8192_kmeans.npy           # K-means clustering for PDM vocab
```
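
To confirm a download landed where the tree above expects it, a small script can list what is missing. This is a sketch under the assumption that the layout above is exact; `missing_entries` is a hypothetical helper, not a Navformer utility.

```python
from pathlib import Path

# Relative paths copied from the data/alg_engine/ tree above.
EXPECTED = (
    "ckpts",
    "merged_infos_navformer/nuplan_openscene_navtrain.pkl",
    "merged_infos_navformer/nuplan_openscene_navtest.pkl",
    "pdms_cache/pdm_8192_gt_cache_navtrain.pkl",
    "pdms_cache/pdm_8192_gt_cache_navtest.pkl",
    "test_8192_kmeans.npy",
)


def missing_entries(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Run from the Navformer repository root; an empty list means nothing is missing.
print(missing_entries("data/alg_engine"))
```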

---

## Quick Reference

```bash
conda activate navformer

# Training
./scripts/e2e_dist_train.sh <config> <num_gpus> [resume_checkpoint]

# Open-loop navtest evaluation
./scripts/e2e_dist_eval.sh <config> <checkpoint> <num_gpus>

# Full train set evaluation
bash scripts/e2e_dist_eval_navtrain.sh <config> <checkpoint> <num_gpus>

# Rare case extraction
python scripts/rare_case_sampling_by_pdms.py \
    --pdm-result <csv_file> \
    --base-split <yaml_file> \
    --output-dir <output_dir>
```

---

## Training

### Training from Scratch

```bash
conda activate navformer

# Train VADv2 (8 GPUs)
./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8
```

**Arguments:**
1. `<config>` — configuration file path
2. `<num_gpus>` — number of GPUs
3. `[resume_checkpoint]` (optional) — checkpoint to resume from

### Resume Training

```bash
./scripts/e2e_dist_train.sh \
    configs/navformer/e2e_vadv2.py \
    8 \
    experiments/navformer/e2e_vadv2/latest.pth
```

If `latest.pth` exists in `experiments/navformer/e2e_vadv2/`, training auto-resumes when you omit the third argument.
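
The resume rule described above can be pictured as follows. This is only an illustration of the documented behavior; the actual logic lives in the launch script, which is not shown here, and `resolve_resume` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_resume(work_dir, explicit: Optional[str] = None) -> Optional[Path]:
    """Pick the checkpoint to resume from, or None to start from scratch."""
    # An explicitly passed checkpoint (third argument) always wins.
    if explicit:
        return Path(explicit)
    # Otherwise fall back to latest.pth in the experiment directory.
    # Path.exists() follows symlinks, so a dangling latest.pth counts as absent.
    latest = Path(work_dir) / "latest.pth"
    return latest if latest.exists() else None


print(resolve_resume("experiments/navformer/e2e_vadv2"))
```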

### Monitor Training

```bash
# Watch training log
tail -f experiments/navformer/e2e_vadv2/logs/train.*

# TensorBoard
tensorboard --logdir experiments/navformer/e2e_vadv2/tf_logs
```

**Key metrics:**
- `loss` — total training loss (should decrease)
- `loss_planning` — planning loss
- `loss_track` — tracking loss
- `ade_4s` — average displacement error at 4 s
- `fde_4s` — final displacement error at 4 s
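
For reference, the two displacement metrics can be computed as below. This is a sketch of the standard ADE/FDE definitions; it assumes 8 waypoints at 0.5 s spacing for the 4 s horizon, and the exact timestamps and averaging used by Navformer may differ. `displacement_errors` is a hypothetical helper.

```python
import math


def displacement_errors(pred, gt):
    """Return (ADE, FDE) in meters for two equal-length lists of (x, y) waypoints."""
    if not pred or len(pred) != len(gt):
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Euclidean distance between predicted and ground-truth waypoint at each step.
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]
    ade = sum(dists) / len(dists)  # mean error over the whole horizon
    fde = dists[-1]                # error at the final waypoint only
    return ade, fde


# Synthetic example: the prediction drifts sideways by 0.1 m per step.
gt = [(float(i), 0.0) for i in range(1, 9)]
pred = [(float(i), 0.1 * i) for i in range(1, 9)]
ade, fde = displacement_errors(pred, gt)
print(f"ADE={ade:.3f} m, FDE={fde:.3f} m")
```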

### Training Output

```
experiments/navformer/e2e_vadv2/
├── e2e_vadv2.py          # config backup
├── logs/
│   └── train.*
├── epoch_1.pth
├── ...
├── epoch_20.pth
└── latest.pth            # symlink to latest checkpoint
```

---

## Evaluation

### Open-Loop Evaluation

#### Full Test Set

```bash
conda activate navformer

./scripts/e2e_dist_eval.sh \
    configs/navformer/e2e_vadv2.py \
    experiments/navformer/e2e_vadv2/epoch_20.pth \
    8
```

Output: `experiments/navformer/e2e_vadv2/navtest.csv`

#### Rare Navtest Cases Only

```bash
./scripts/e2e_dist_eval_navtest_failures.sh \
    configs/navformer/e2e_vadv2.py \
    experiments/navformer/e2e_vadv2/epoch_20.pth \
    8
```

Output: `experiments/navformer/e2e_vadv2/navtest_failures.csv`

#### Full Train Set

Required before [Rare Case Extraction](#rare-case-extraction). Evaluates on the full navtrain split:

```bash
bash scripts/e2e_dist_eval_navtrain.sh \
    configs/navformer/e2e_vadv2.py \
    experiments/navformer/e2e_vadv2/epoch_20.pth \
    8
```

Output: `experiments/navformer/e2e_vadv2/navtrain.csv`

#### Evaluation Metrics

```csv
token,ade_4s,fde_4s,no_at_fault_collisions,drivable_area_compliance,ego_progress,comfort,score
```

| Metric | Description | Direction |
|--------|-------------|-----------|
| `ade_4s` | Average displacement error over the 4 s horizon (m) | lower |
| `fde_4s` | Final displacement error at 4 s (m) | lower |
| `no_at_fault_collisions` | Collision avoidance rate (0–1) | higher |
| `drivable_area_compliance` | Stay in drivable area (0–1) | higher |
| `ego_progress` | Route completion (0–1) | higher |
| `comfort` | Comfort metric (0–1) | higher |
| `score` | Overall PDM score (0–1) | higher |
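
A result CSV with the header above can be summarized with the standard library alone. The sketch below assumes one row per token and numeric values in every metric column; `summarize` is a hypothetical helper and the two rows are synthetic, not real results.

```python
import csv
import io
from statistics import mean

# Column names taken from the CSV header above.
METRICS = (
    "ade_4s", "fde_4s", "no_at_fault_collisions",
    "drivable_area_compliance", "ego_progress", "comfort", "score",
)


def summarize(csv_file):
    """Return the per-metric mean over all rows of an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    return {m: mean(float(r[m]) for r in rows) for m in METRICS}


# Synthetic stand-in for navtest.csv.
demo = io.StringIO(
    "token,ade_4s,fde_4s,no_at_fault_collisions,drivable_area_compliance,"
    "ego_progress,comfort,score\n"
    "tok_a,0.5,1.0,1.0,1.0,0.8,1.0,0.9\n"
    "tok_b,1.5,3.0,0.0,1.0,0.4,1.0,0.0\n"
)
summary = summarize(demo)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For a real run, open the file with `open("experiments/navformer/e2e_vadv2/navtest.csv", newline="")` and pass the handle to `summarize`.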

---

## Rare Case Extraction

Extract failure scenarios from training-set evaluation for targeted fine-tuning.

**Prerequisite:** complete a [Full Train Set Evaluation](#full-train-set) first.

### Basic Extraction

```bash
conda activate navformer

python scripts/rare_case_sampling_by_pdms.py \
    --pdm-result experiments/navformer/e2e_vadv2/navtrain.csv \
    --base-split configs/navsim_splits/navtrain_split/navtrain_50pct.yaml \
    --output-dir configs/navsim_splits/navtrain_split/e2e_vadv2_rare
```

**Output:**

```
configs/navsim_splits/navtrain_split/e2e_vadv2_rare/
├── navtrain_50pct_collision.yaml    # collision scenarios
├── navtrain_50pct_off_road.yaml     # off-road scenarios
└── navtrain_50pct_ep_1pct.yaml      # low ego-progress (bottom 1%)
```

### Custom Thresholds

Edit `scripts/rare_case_sampling_by_pdms.py`:

```python
# Change collision threshold
collision_scenarios = df[df['no_at_fault_collisions'] < 0.95]  # default 1.0

# Change ego-progress percentile
ep_threshold = df['ego_progress'].quantile(0.05)  # default 0.01 (1% → 5%)
```
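
How the two thresholds interact with the three output categories can be sketched without pandas. Everything below is an assumption-laden illustration, not the script's code: `select_rare` and `linear_quantile` are hypothetical, the off-road criterion (`drivable_area_compliance < 1.0`) and the `<=` comparison for low ego-progress are guesses, and the rows are synthetic. The quantile uses linear interpolation, which is the default of `pandas.Series.quantile`.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_rare(rows, collision_thr=1.0, ep_quantile=0.01):
    """Group scenario tokens by failure type (criteria are assumptions)."""
    ep_cut = linear_quantile([r["ego_progress"] for r in rows], ep_quantile)
    return {
        "collision": [r["token"] for r in rows
                      if r["no_at_fault_collisions"] < collision_thr],
        "off_road": [r["token"] for r in rows
                     if r["drivable_area_compliance"] < 1.0],
        "low_ep": [r["token"] for r in rows if r["ego_progress"] <= ep_cut],
    }


# Synthetic rows standing in for navtrain.csv.
rows = [
    {"token": "tok_a", "no_at_fault_collisions": 1.0, "drivable_area_compliance": 1.0, "ego_progress": 0.9},
    {"token": "tok_b", "no_at_fault_collisions": 0.0, "drivable_area_compliance": 1.0, "ego_progress": 0.5},
    {"token": "tok_c", "no_at_fault_collisions": 1.0, "drivable_area_compliance": 0.0, "ego_progress": 0.1},
    {"token": "tok_d", "no_at_fault_collisions": 1.0, "drivable_area_compliance": 1.0, "ego_progress": 1.0},
]
print(select_rare(rows))
```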

---

## Configuration

Configs follow the MMDetection3D hierarchical pattern:

```
configs/
├── _base_/
│   └── default_runtime.py
├── navformer/
│   ├── e2e_vadv2.py
│   ├── e2e_hydramdp.py
│   └── track_map_nuplan_r50_navtrain.py
└── navsim_splits/
    ├── navtrain_split/
    │   ├── navtrain.yaml
    │   ├── navtrain_50pct.yaml
    │   └── e2e_vadv2_rare/
    │       ├── navtrain_50pct_collision.yaml
    │       ├── navtrain_50pct_off_road.yaml
    │       └── navtrain_50pct_ep_1pct.yaml
    └── navtest_split/
        ├── navtest.yaml
        └── navtest_failures.yaml
```

### Key Config Parameters

```python
model = dict(
    type='VADv2',           # or 'HydraMDP'
    num_query=900,
    planning_steps=8,
)

bev_h_, bev_w_ = 200, 200
patch_size = [102.4, 102.4]  # physical range in meters

input_modality = dict(
    use_lidar=False,
    use_camera=True,         # 8 cameras
    use_radar=False,
    use_external=True,       # CAN bus
)

total_epochs = 20
optimizer = dict(type='AdamW', lr=2e-4, weight_decay=0.01)

data = dict(
    samples_per_gpu=1,
    workers_per_gpu=4,
    train=dict(
        ann_file='merged_infos_navformer/nuplan_openscene_navtrain.pkl',
        scenario_filter='configs/navsim_splits/navtrain_split/navtrain_50pct.yaml',
    ),
    val=dict(
        ann_file='merged_infos_navformer/nuplan_openscene_navtest.pkl',
        scenario_filter='configs/navsim_splits/navtest_split/navtest.yaml',
    ),
)
```
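
The BEV grid size and the physical range together fix the ground resolution of each cell, which is useful to keep in mind when lowering `bev_h_` / `bev_w_` to save memory. A minimal sketch of the arithmetic, assuming `patch_size` spans the full grid along each axis (`bev_cell_size` is a hypothetical helper):

```python
def bev_cell_size(patch_size_m, bev_h, bev_w):
    """Meters covered by one BEV cell along each axis."""
    return patch_size_m[0] / bev_h, patch_size_m[1] / bev_w


default_cell = bev_cell_size([102.4, 102.4], 200, 200)
reduced_cell = bev_cell_size([102.4, 102.4], 150, 150)
print(f"200 x 200 grid: {default_cell[0]:.3f} m per cell")
print(f"150 x 150 grid: {reduced_cell[0]:.3f} m per cell")
```

With the values above, the default grid resolves about 0.512 m per cell, while a 150 x 150 grid coarsens that to roughly 0.683 m.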

### Runtime Overrides

```bash
./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8 \
    --cfg-options optimizer.lr=1e-4 total_epochs=30 data.samples_per_gpu=2
```
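
Each `key=value` pair addresses a nested config entry by its dotted path. The sketch below illustrates that semantics on a plain dict; it is not MMCV's implementation, it assumes the script forwards `--cfg-options` to MMCV-style config merging, and `apply_overrides` is a hypothetical name.

```python
def apply_overrides(cfg, overrides):
    """Apply {'a.b.c': value} style overrides to a nested dict in place."""
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = cfg
        # Walk (and create if needed) the intermediate dicts.
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return cfg


# Values mirror the Key Config Parameters section.
cfg = {
    "optimizer": {"type": "AdamW", "lr": 2e-4, "weight_decay": 0.01},
    "total_epochs": 20,
    "data": {"samples_per_gpu": 1, "workers_per_gpu": 4},
}
apply_overrides(cfg, {"optimizer.lr": 1e-4, "total_epochs": 30, "data.samples_per_gpu": 2})
print(cfg)
```

Untouched keys such as `optimizer.weight_decay` and `data.workers_per_gpu` keep their values from the config file.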

---

## Model Architectures

| Architecture | Config | Strengths |
|---|---|---|
| **VADv2** (default) | `configs/navformer/e2e_vadv2.py` | Fast inference, general driving |
| **HydraMDP** | `configs/navformer/e2e_hydramdp.py` | Multi-modal planning, safety-critical |

---

## Advanced Training

### Multi-Node Training

```bash
# Node 0 (master)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16
export RANK=0
./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8

# Node 1 (worker)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=28567
export WORLD_SIZE=16
export RANK=8
./scripts/e2e_dist_train.sh configs/navformer/e2e_vadv2.py 8
```
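
The per-node values in the example follow a simple pattern, sketched below for clusters with more than two nodes. It assumes, based only on the example above, that `RANK` is the global rank of the first process on each node (0 on node 0, 8 on node 1) and that every node has the same number of GPUs; how the launch script consumes these variables is not shown here. `node_launch_env` is a hypothetical helper.

```python
def node_launch_env(node_index, gpus_per_node, num_nodes, master_addr, master_port):
    """Environment variables to export on one node before launching training."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Total number of processes across all nodes.
        "WORLD_SIZE": str(num_nodes * gpus_per_node),
        # Global rank of this node's first process (assumption, see above).
        "RANK": str(node_index * gpus_per_node),
    }


for node in range(2):
    env = node_launch_env(node, 8, 2, "192.168.1.100", 28567)
    print(f"node {node}: " + " ".join(f"{k}={v}" for k, v in env.items()))
```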

### Mixed Precision

```python
# in config
fp16 = dict(loss_scale='dynamic')
```

### Gradient Accumulation

```python
# effective batch = samples_per_gpu * num_gpus * gradient_accumulation_steps
runner = dict(max_epochs=20, gradient_accumulation_steps=4)
```
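
The effective batch size formula in the comment above works out as follows. The second part of the sketch shows the stock MMCV 1.x hook for gradient accumulation as a possible alternative; whether Navformer's runner uses the `gradient_accumulation_steps` key, this hook, or both is an assumption to verify against the project's code.

```python
def effective_batch_size(samples_per_gpu, num_gpus, accumulation_steps):
    """Number of samples contributing to each optimizer step."""
    return samples_per_gpu * num_gpus * accumulation_steps


print(effective_batch_size(1, 8, 4))  # 32

# Stock MMCV 1.x mechanism (assumption: only relevant if the project relies on
# MMCV's optimizer hooks rather than a custom runner option).
optimizer_config = dict(type="GradientCumulativeOptimizerHook", cumulative_iters=4)
```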

---

## Troubleshooting

**CUDA out of memory:**
- Reduce batch size: `data.samples_per_gpu = 1`
- Lower BEV resolution: `bev_h_, bev_w_ = 150, 150`
- Enable gradient checkpointing: `model.img_backbone.with_cp = True`

**Training loss not decreasing:**
```bash
# Confirm that the expected checkpoint was actually loaded
grep "load checkpoint" experiments/navformer/*/logs/train.*
# Retry with a lower learning rate
./scripts/e2e_dist_train.sh ... --cfg-options optimizer.lr=1e-4
```

**Evaluation hangs:**
```bash
# Look for stale evaluation processes
ps aux | grep python
# Kill leftover test processes, then retry
pkill -f "test.py"
./scripts/e2e_dist_eval.sh ... 4   # try fewer GPUs
```

**`ModuleNotFoundError: No module named 'mmdet3d'`:**
```bash
conda activate navformer
python -c "import mmcv; print(mmcv.__version__)"
pip uninstall mmdet3d -y && pip install mmdet3d==1.0.0rc6
```

**Corrupted checkpoint:**
```bash
# Use a previous epoch
./scripts/e2e_dist_train.sh ... experiments/navformer/e2e_vadv2/epoch_18.pth
```

---

## Performance Optimization

**Training speed:**
- `data.workers_per_gpu = 8` (if CPU/RAM allows)
- Store data on NVMe SSD
- `fp16 = dict(loss_scale='dynamic')`
- `data.persistent_workers = True`

**Memory:**
- `data.samples_per_gpu = 1`
- `bev_h_, bev_w_ = 150, 150`
- `model.img_backbone.with_cp = True`

**Multi-node:**
- Use homogeneous GPU types across nodes
- InfiniBand for inter-node communication
- Shared NFS/Lustre for data loading