# Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![Paper](http://img.shields.io/badge/paper-arxiv.2509.14279-B31B1B.svg)](http://arxiv.org/abs/2509.14279)

A comprehensive benchmark suite for evaluating and validating CUDA kernels generated by Large Language Models (LLMs). `robust-kbench` addresses the limitations of existing kernel benchmarks by implementing robust evaluation criteria that prevent LLMs from exploiting benchmark settings.
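As a hypothetical illustration (not the repository's implementation), the kind of exploit the filter checks below guard against can be caught with a few lines of Python: a kernel whose outputs are squeezed into a tiny range, or show almost no variation, can pass loose correctness tolerances while doing essentially no work. The function and parameter names (`passes_filter_checks`, `min_std`, `trivial_range`) are invented for this sketch; the thresholds mirror the [-0.01, 0.01] range and 0.01 standard-deviation bounds described in the Task Filtering Procedure section.

```python
# Hypothetical sketch of degenerate-output detection; not robust-kbench's code.
import statistics

def passes_filter_checks(outputs, min_std=0.01, trivial_range=(-0.01, 0.01)):
    """Return True if the kernel outputs look non-degenerate."""
    lo, hi = trivial_range
    # Output range check: reject if every value sits inside [-0.01, 0.01].
    if all(lo <= x <= hi for x in outputs):
        return False
    # Standard deviation check: reject near-constant outputs.
    if statistics.pstdev(outputs) <= min_std:
        return False
    return True
```

For example, `passes_filter_checks([0.5, -1.2, 3.4, 0.0])` accepts a normally varying output, while an all-tiny or all-constant output is rejected.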
## 🎯 Motivation

Traditional kernel benchmarks often fall short when evaluating LLM-generated CUDA code because:

- They can be easily exploited by LLMs through input shape manipulation
- They don't account for weight magnitude optimizations
- They lack comprehensive validation across different initialization settings
- They don't test for real-world performance characteristics

`robust-kbench` addresses these limitations through:

- Multiple initialization settings
- Varied input configurations
- Comprehensive correctness checks
- Performance profiling capabilities
- Real-world task scenarios

## 🚀 Quick Start

### Installation

```bash
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/SakanaAI/robust-kbench.git

# Create and activate a conda environment
conda create -n robust_kbench python=3.11
conda activate robust_kbench

# Install PyTorch (CUDA 12.4) and the package
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
cd robust-kbench
pip install -e .
```

### Basic Usage

1. **Run Task Filtering**

   ```bash
   python run_filter.py --task_dir tasks/mnist_cross_entropy
   ```

2. **Evaluate a Single Kernel**

   ```bash
   python run_kernel.py --task_dir tasks/mnist_cross_entropy --cuda_code_path highlighted/mnist_cross_entropy/forward/kernel.cu
   ```

## 🔍 Task Filtering Procedure

The benchmark implements several filter checks to ensure robust evaluation:

1. **Output Range Check**: Ensures outputs are not artificially constrained to [-0.01, 0.01]
2. **Standard Deviation Check**: Verifies output variation > 0.01
3. **Axes Variation Check**: Confirms output variation across axes > 0.01
4. **Initialization Impact**: Tests kernel behavior across different initialization settings
5. **Input Impact**: Evaluates performance with varied input configurations
6. **LLM-judge Inefficiency**: Assesses potential inefficiencies in LLM-generated code

## 💻 Detailed Usage

### Parallel Kernel Evaluation

```python
from robust_kbench.parallel import ParallelKernelExecutor

executor = ParallelKernelExecutor(
    task_dir="tasks/mnist_cross_entropy",
    op_atol=1e-5,
    op_rtol=1e-5,
    warmup_time=25,
    repetition_time=10000,
    multi_init_settings=True,
    multi_input_settings=True,
    forward=True,
    timeout=300,
    torch_prof=True,
)

# Evaluate multiple kernels
cuda_files = [
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
]

# Run evaluations
torch_results = executor.torch_eval()
compile_results = executor.compile(cuda_files)
test_results = executor.test(cuda_files)
eval_results = executor.evaluate(cuda_files)
profile_results = executor.profile(cuda_files)
```

### Individual Evaluation Components

#### Torch Baseline Evaluation

```python
import os

from robust_kbench.evaluate import eval_torch_runtime

torch_results, torch_compile_results = eval_torch_runtime(
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```

#### CUDA Kernel Compilation

```python
import os

from robust_kbench.evaluate import compile_cuda_kernel

cuda_compile_results = compile_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
)
```

#### Kernel Correctness Testing

```python
import os

from robust_kbench.evaluate import test_cuda_kernel

correct_results = test_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/forward.cu",
    task_dir="tasks/mnist_linear",
    op_atol=1e-5,
    op_rtol=1e-5,
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```

#### Runtime Evaluation

```python
import os

from robust_kbench.evaluate import eval_cuda_kernel

cuda_results = eval_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```

#### Performance Profiling

```python
from robust_kbench.evaluate import prof_cuda_kernel

prof_results = prof_cuda_kernel(
    cuda_code_path="tasks/linear/forward.cu",
    task_dir="tasks/mnist_linear",
    torch_prof=True,
    ncu_prof=False,
    clang_prof=False,
    forward=True,
)
```

## 📋 Supported Tasks

### Basic Neural Network Operations

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [Linear](tasks/mnist_linear) | Matrix multiplication with bias | ✓ | ✓ | Neural network layers |
| [Linear+ReLU](tasks/mnist_linear_relu) | Linear layer followed by ReLU activation | ✓ | ✓ | Deep neural networks |
| [LayerNorm](tasks/layernorm) | Layer normalization | ✓ | ✓ | Transformer architectures |
| [Cross Entropy](tasks/mnist_cross_entropy) | Cross entropy loss for multi-class classification | ✓ | ✓ | Classification tasks |

### Convolutional Neural Network Operations

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [Conv2D](tasks/unet_conv2d) | 2D convolution operation | ✓ | ✓ | CNN architectures |
| [Conv+ReLU+Pool](tasks/mnist_conv_relu_pool) | Convolution followed by ReLU and pooling | ✓ | ✓ | CNN feature extraction |
| [MaxPool2D](tasks/mnist_pool) | 2D max pooling operation | ✓ | ✓ | CNN downsampling |

### Transformer Architecture Operations

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [LLaMA-FFW](tasks/llama_ffw) | LLaMA feed-forward network | ✓ | ✗ | LLaMA model architecture |
| [LLaMA-RMSNorm](tasks/llama_rmsnorm) | Root mean square normalization | ✓ | ✓ | LLaMA model architecture |

### Complex Network Blocks

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [ResNet Block](tasks/resnet_block) | Residual block with convolutions | ✓ | ✗ | ResNet architectures |
| [UNet Linear](tasks/unet_linear) | Linear operations in UNet architecture | ✓ | ✗ | UNet model architecture |

### Original KernelBench Tasks

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [KernelBench](tasks/kernelbench) | Original KernelBench tasks | ✓ | ✗ | Baseline comparison |

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 License

This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
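As a closing sketch, the individual evaluation components documented above naturally compose into a gated workflow: a kernel must compile before it is tested, and must pass correctness before its runtime is compared against the torch baseline. The stage order mirrors the README's workflow; the function name `run_pipeline` and the dict-based report format are illustrative assumptions, not the repository's actual API.

```python
# Illustrative gating sketch; not robust-kbench's code. Each stage is a
# zero-argument callable returning (ok, info), e.g. a thin wrapper around
# compile_cuda_kernel, test_cuda_kernel, and eval_cuda_kernel.
def run_pipeline(stages):
    """Run compile -> test -> evaluate, stopping at the first failure."""
    report = {}
    for name in ("compile", "test", "evaluate"):
        ok, info = stages[name]()
        report[name] = info
        if not ok:
            report["verdict"] = f"failed at {name}"
            return report
    report["verdict"] = "passed"
    return report
```

Stopping at the first failing stage avoids timing a kernel that is incorrect, which is the same ordering the `ParallelKernelExecutor` methods expose (`compile`, `test`, `evaluate`).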