# claw-eval

**Repository Path**: github_zoo/claw-eval

## Basic Information

- **Project Name**: claw-eval
- **Description**: Claw-Eval is an evaluation harness for evaluating LLMs as agents. All tasks are verified by humans.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-17
- **Last Updated**: 2026-03-17

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README
# Claw-Eval

[![Tasks](https://img.shields.io/badge/tasks-104-blue)](#tasks) [![Models](https://img.shields.io/badge/models-22-green)](#leaderboard) [![Leaderboard](https://img.shields.io/badge/leaderboard-live-purple)](https://claw-eval.github.io) [![License](https://img.shields.io/badge/license-MIT-orange)](LICENSE)

> End-to-end transparent benchmark for AI agents acting in the real world.
> 104 tasks, 15 services, Docker sandboxes, and robust grading.
---

## Leaderboard

Browse the full leaderboard and individual task cases at **[claw-eval.github.io](https://claw-eval.github.io)**.

**Evaluation Logic (Updated March 2026):**

* **Primary Metric: Pass^3.** To eliminate "lucky runs," a model must consistently pass a task across **three independent trials** ($N=3$) to earn a success credit.
* **Strict Pass Criterion:** Under the Pass^3 methodology, a task is marked as passed only if the model meets the success criteria in **all three runs**.
* **Reproducibility:** We are committed to end-to-end reproducibility. Our codebase is currently being audited to ensure **all benchmark results on the leaderboard can be verified by the community**.

## Quick Start

We recommend [uv](https://docs.astral.sh/uv/) for fast, reliable dependency management:

```bash
pip install uv
uv venv --python 3.11
source .venv/bin/activate
```

Prepare your API keys and set up the environments:

```bash
export OPENROUTER_API_KEY=sk-or-...
export SERP_DEV_KEY=...  # required for tasks that need real web search
bash scripts/test_sandbox.sh
```

Go rock 🚀

```bash
claw-eval batch --config model_configs/claude_opus_46.yaml --sandbox --trials 3 --parallel 16
```

---

## Roadmap

- [ ] More real-world, multimodal tasks in complex productivity environments
- [ ] Comprehensive, fine-grained scoring logic with deep state verification
- [ ] Enhanced sandbox isolation and full-trace tracking for transparent, scalable evaluation

## Contribution

We welcome contributions of any kind. Let us know if you have any suggestions!

## Acknowledgements

Our test cases are built on the work of the community. We draw from and adapt tasks contributed by OpenClaw, PinBench, OfficeQA, OneMillion-Bench, Finance Agent, and Terminal-Bench 2.0.
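The Pass^3 scoring rule described in the Leaderboard section can be sketched in a few lines of Python. This is a minimal illustration of the "pass only if all $N=3$ trials succeed" logic, not the harness's actual API; the function name and the `(task_id, passed)` result format are hypothetical.

```python
# Minimal sketch of the Pass^3 criterion: a task earns credit only if the
# model succeeds in all N independent trials. Names here are illustrative,
# not the actual claw-eval API.
from collections import defaultdict

def pass_all_n(trial_results, n=3):
    """trial_results: list of (task_id, passed) pairs, one per trial.
    Returns the fraction of tasks that passed in all n trials."""
    by_task = defaultdict(list)
    for task_id, passed in trial_results:
        by_task[task_id].append(passed)
    passed_tasks = sum(
        1 for outcomes in by_task.values()
        if len(outcomes) == n and all(outcomes)
    )
    return passed_tasks / len(by_task)

results = [
    ("task_a", True), ("task_a", True), ("task_a", True),   # consistent pass
    ("task_b", True), ("task_b", False), ("task_b", True),  # "lucky run" -> no credit
]
print(pass_all_n(results))  # 0.5
```

Note how `task_b`, despite succeeding in two of three trials, earns no credit; this is exactly what distinguishes Pass^3 from a per-trial pass rate.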
## Contributors

[Bowen Ye\*](https://github.com/pkuYmiracle) (PKU), [Rang Li\*](https://github.com/lirang04) (PKU), [Qibin Yang](https://github.com/yangqibin-caibi) (PKU), [Zhihui Xie](https://zhxie.site/) (HKU), [Lei Li](https://lilei-nlp.github.io)† (HKU, Project Lead)