# Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
[![Paper](http://img.shields.io/badge/paper-arxiv.2509.14279-B31B1B.svg)](http://arxiv.org/abs/2509.14279)

A comprehensive benchmark suite for evaluating and validating CUDA kernels generated by Large Language Models (LLMs). `robust-kbench` addresses the limitations of existing kernel benchmarks by implementing robust evaluation criteria that prevent LLMs from exploiting benchmark settings.
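As a hypothetical illustration (not the repository's implementation), the kind of exploit the filter checks below guard against can be caught with a few lines of Python: a kernel whose outputs are squeezed into a tiny range, or show almost no variation, can pass loose correctness tolerances while doing essentially no work. The function and parameter names (`passes_filter_checks`, `min_std`, `trivial_range`) are invented for this sketch; the thresholds mirror the [-0.01, 0.01] range and 0.01 standard-deviation bounds described in the Task Filtering Procedure section.

```python
# Hypothetical sketch of degenerate-output detection; not robust-kbench's code.
import statistics

def passes_filter_checks(outputs, min_std=0.01, trivial_range=(-0.01, 0.01)):
    """Return True if the kernel outputs look non-degenerate."""
    lo, hi = trivial_range
    # Output range check: reject if every value sits inside [-0.01, 0.01].
    if all(lo <= x <= hi for x in outputs):
        return False
    # Standard deviation check: reject near-constant outputs.
    if statistics.pstdev(outputs) <= min_std:
        return False
    return True
```

For example, `passes_filter_checks([0.5, -1.2, 3.4, 0.0])` accepts a normally varying output, while an all-tiny or all-constant output is rejected.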
## 🎯 Motivation

Traditional kernel benchmarks often fall short when evaluating LLM-generated CUDA code because:

- They can be easily exploited by LLMs through input shape manipulation
- They don't account for weight magnitude optimizations
- They lack comprehensive validation across different initialization settings
- They don't test for real-world performance characteristics

`robust-kbench` addresses these limitations through:

- Multiple initialization settings
- Varied input configurations
- Comprehensive correctness checks
- Performance profiling capabilities
- Real-world task scenarios

## 🚀 Quick Start

### Installation

```bash
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/SakanaAI/robust-kbench.git

# Create and activate a conda environment
conda create -n robust_kbench python=3.11
conda activate robust_kbench

# Install PyTorch (CUDA 12.4) and the package
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
cd robust-kbench
pip install -e .
```

### Basic Usage

1. **Run Task Filtering**

   ```bash
   python run_filter.py --task_dir tasks/mnist_cross_entropy
   ```

2. **Evaluate a Single Kernel**

   ```bash
   python run_kernel.py --task_dir tasks/mnist_cross_entropy --cuda_code_path highlighted/mnist_cross_entropy/forward/kernel.cu
   ```

## 🔍 Task Filtering Procedure

The benchmark implements several filter checks to ensure robust evaluation:

1. **Output Range Check**: Ensures outputs are not artificially constrained to [-0.01, 0.01]
2. **Standard Deviation Check**: Verifies output variation > 0.01
3. **Axes Variation Check**: Confirms output variation across axes > 0.01
4. **Initialization Impact**: Tests kernel behavior across different initialization settings
5. **Input Impact**: Evaluates performance with varied input configurations
6. **LLM-judge Inefficiency**: Assesses potential inefficiencies in LLM-generated code

## 💻 Detailed Usage

### Parallel Kernel Evaluation

```python
from robust_kbench.parallel import ParallelKernelExecutor

executor = ParallelKernelExecutor(
    task_dir="tasks/mnist_cross_entropy",
    op_atol=1e-5,
    op_rtol=1e-5,
    warmup_time=25,
    repetition_time=10000,
    multi_init_settings=True,
    multi_input_settings=True,
    forward=True,
    timeout=300,
    torch_prof=True,
)

# Evaluate multiple kernels
cuda_files = [
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
]

# Run evaluations
torch_results = executor.torch_eval()
compile_results = executor.compile(cuda_files)
test_results = executor.test(cuda_files)
eval_results = executor.evaluate(cuda_files)
profile_results = executor.profile(cuda_files)
```

### Individual Evaluation Components

#### Torch Baseline Evaluation

```python
import os

from robust_kbench.evaluate import eval_torch_runtime

torch_results, torch_compile_results = eval_torch_runtime(
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```

#### CUDA Kernel Compilation

```python
import os

from robust_kbench.evaluate import compile_cuda_kernel

cuda_compile_results = compile_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
)
```

#### Kernel Correctness Testing

```python
import os

from robust_kbench.evaluate import test_cuda_kernel

correct_results = test_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/forward.cu",
    task_dir="tasks/mnist_linear",
    op_atol=1e-5,
    op_rtol=1e-5,
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```

#### Runtime Evaluation

```python
import os

from robust_kbench.evaluate import eval_cuda_kernel

cuda_results = eval_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```

#### Performance Profiling

```python
from robust_kbench.evaluate import prof_cuda_kernel

prof_results = prof_cuda_kernel(
    cuda_code_path="tasks/linear/forward.cu",
    task_dir="tasks/mnist_linear",
    torch_prof=True,
    ncu_prof=False,
    clang_prof=False,
    forward=True,
)
```

## 📋 Supported Tasks

### Basic Neural Network Operations

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [Linear](tasks/mnist_linear) | Matrix multiplication with bias | ✓ | ✓ | Neural network layers |
| [Linear+ReLU](tasks/mnist_linear_relu) | Linear layer followed by ReLU activation | ✓ | ✓ | Deep neural networks |
| [LayerNorm](tasks/layernorm) | Layer normalization | ✓ | ✓ | Transformer architectures |
| [Cross Entropy](tasks/mnist_cross_entropy) | Cross entropy loss for multi-class classification | ✓ | ✓ | Classification tasks |

### Convolutional Neural Network Operations

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [Conv2D](tasks/unet_conv2d) | 2D convolution operation | ✓ | ✓ | CNN architectures |
| [Conv+ReLU+Pool](tasks/mnist_conv_relu_pool) | Convolution followed by ReLU and pooling | ✓ | ✓ | CNN feature extraction |
| [MaxPool2D](tasks/mnist_pool) | 2D max pooling operation | ✓ | ✓ | CNN downsampling |

### Transformer Architecture Operations

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [LLaMA-FFW](tasks/llama_ffw) | LLaMA feed-forward network | ✓ | ✗ | LLaMA model architecture |
| [LLaMA-RMSNorm](tasks/llama_rmsnorm) | Root mean square normalization | ✓ | ✓ | LLaMA model architecture |

### Complex Network Blocks

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [ResNet Block](tasks/resnet_block) | Residual block with convolutions | ✓ | ✗ | ResNet architectures |
| [UNet Linear](tasks/unet_linear) | Linear operations in UNet architecture | ✓ | ✗ | UNet model architecture |

### Original KernelBench Tasks

| Task | Description | Forward | Backward | Use Case |
|------|-------------|---------|----------|----------|
| [KernelBench](tasks/kernelbench) | Original KernelBench tasks | ✓ | ✗ | Baseline comparison |

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📝 License

This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
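As a closing sketch, the individual evaluation components documented above naturally compose into a gated workflow: a kernel must compile before it is tested, and must pass correctness before its runtime is compared against the torch baseline. The stage order mirrors the README's workflow; the function name `run_pipeline` and the dict-based report format are illustrative assumptions, not the repository's actual API.

```python
# Illustrative gating sketch; not robust-kbench's code. Each stage is a
# zero-argument callable returning (ok, info), e.g. a thin wrapper around
# compile_cuda_kernel, test_cuda_kernel, and eval_cuda_kernel.
def run_pipeline(stages):
    """Run compile -> test -> evaluate, stopping at the first failure."""
    report = {}
    for name in ("compile", "test", "evaluate"):
        ok, info = stages[name]()
        report[name] = info
        if not ok:
            report["verdict"] = f"failed at {name}"
            return report
    report["verdict"] = "passed"
    return report
```

Stopping at the first failing stage avoids timing a kernel that is incorrect, which is the same ordering the `ParallelKernelExecutor` methods expose (`compile`, `test`, `evaluate`).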