# claw-eval
**Repository Path**: github_zoo/claw-eval
## Basic Information
- **Project Name**: claw-eval
- **Description**: Claw-Eval is an evaluation harness for benchmarking LLMs as agents. All tasks are verified by humans.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-17
- **Last Updated**: 2026-03-17
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README

# Claw-Eval
> End-to-end transparent benchmark for AI agents acting in the real world.
> 104 tasks, 15 services, Docker sandboxes, and robust grading.
---
## Leaderboard
Browse the full leaderboard and individual task cases at **[claw-eval.github.io](https://claw-eval.github.io)**.
**Evaluation Logic (Updated March 2026):**
* **Primary Metric: Pass^3.** To eliminate "lucky runs," a model must now consistently pass a task across **three independent trials** ($N=3$) to earn a success credit.
* **Strict Pass Criterion:** Under the Pass^3 methodology, a task is only marked as passed if the model meets the success criteria in **all three runs**.
* **Reproducibility:** We are committed to end-to-end reproducibility. Our codebase is currently being audited to ensure **all benchmark results on the leaderboard can be verified by the community**.
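The Pass^3 rule above can be sketched in a few lines. This is a minimal illustration of the aggregation logic, not the harness's actual code; the function and variable names are hypothetical:

```python
def pass_pow_n(trial_results, n=3):
    """A task earns credit only if all n independent trials succeed."""
    if len(trial_results) < n:
        raise ValueError(f"need at least {n} trials, got {len(trial_results)}")
    return all(trial_results[:n])

# Example: aggregate per-task trial outcomes into a benchmark score.
results = {
    "task_a": [True, True, True],   # passes under Pass^3
    "task_b": [True, False, True],  # one failed trial -> no credit
}
score = sum(pass_pow_n(r) for r in results.values()) / len(results)
print(f"Pass^3 rate: {score:.2f}")  # → Pass^3 rate: 0.50
```

Because a single failed trial forfeits the task, Pass^3 is strictly harsher than a best-of-three or average-of-three score.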
## Quick Start
We recommend using [uv](https://docs.astral.sh/uv/) for fast, reliable dependency management:
```bash
pip install uv
uv venv --python 3.11
source .venv/bin/activate
```
Prepare your keys and set up the environments with one command:
```bash
export OPENROUTER_API_KEY=sk-or-...
export SERP_DEV_KEY=... # required for tasks that need real web search
bash scripts/test_sandbox.sh
```
Then run the benchmark 🚀
```bash
claw-eval batch --config model_configs/claude_opus_46.yaml --sandbox --trials 3 --parallel 16
```
---
## Roadmap
- [ ] More real-world, multimodal tasks in complex productivity environments
- [ ] Comprehensive, fine-grained scoring logic with deep state verification
- [ ] Enhanced sandbox isolation and full-trace tracking for transparent, scalable evaluation
## Contribution
We welcome any kind of contribution. Let us know if you have any suggestions!
## Acknowledgements
Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.
## Contributors
[Bowen Ye*](https://github.com/pkuYmiracle) (PKU), [Rang Li*](https://github.com/lirang04) (PKU), [Qibin Yang](https://github.com/yangqibin-caibi) (PKU), [Zhihui Xie](https://zhxie.site/) (HKU), [Lei Li](https://lilei-nlp.github.io)$^\dagger$ (HKU, Project Lead)