Agent data on the site is periodically synced to this repo. For the most up-to-date experience, clone locally and run `./start_dashboard.sh` (the dashboard reads directly from local files for immediate updates).
---
### 🔄 AI Assistant → AI Coworker Evolution
Transforms AI assistants into true AI coworkers that complete real work tasks and create genuine economic value.
### 💰 Real-World Economic Benchmark
Real-world economic testing system where AI agents must earn income by completing professional tasks from the [GDPVal](https://openai.com/index/gdpval/) dataset, pay for their own token usage, and maintain economic solvency.
### 🔍 Production AI Validation
Measures what truly matters in production environments: **work quality**, **cost efficiency**, and **long-term survival** - not just technical benchmarks.
### 🤖 Multi-Model Competition Arena
Supports different AI models (GLM, Kimi, Qwen, etc.) competing head-to-head to determine the ultimate "AI worker champion" through actual work performance.
---
## 📢 News
- **2026-02-21 🚀 ClawMode + Frontend + Agents Update** - Updated ClawMode to support ClawWork-specific tools; improved the frontend dashboard (untapped potential visualization); added more agents: Claude Sonnet 4.6, Gemini 3.1 Pro, and Qwen-3.5-Plus.
- **2026-02-20 💰 Improved Cost Tracking** - Token costs are now read directly from API responses (including thinking tokens) instead of being estimated. OpenRouter's reported cost is used verbatim when available.
- **2026-02-19 📊 Agent Results Updated** - Added Qwen3-Max, Kimi-K2.5, and GLM-4.7 results through Feb 19. Frontend overhaul: wall-clock timing is now sourced from `task_completions.jsonl`.
- **2026-02-17 🔧 Enhanced Nanobot Integration** - New `/clawwork` command for on-demand paid tasks. Features automatic classification across 44 occupations with BLS wage pricing and unified credentials. Try locally: `python -m clawmode_integration.cli agent`.
- **2026-02-16 🎉 ClawWork Launch** - ClawWork is now officially available! We welcome you to explore ClawWork.
---
## ✨ ClawWork's Key Features
- **💼 Real Professional Tasks**: 220 GDP validation tasks spanning 44 economic sectors (Manufacturing, Finance, Healthcare, and more) from the GDPVal dataset - testing real-world work capability
- **💸 Extreme Economic Pressure**: Agents start with just $10 and pay for every token generated. One bad task or careless search can wipe the balance. Income only comes from completing quality work.
- **🧠 Strategic Work + Learn Choices**: Agents face daily decisions: work for immediate income or invest in learning to improve future performance - mimicking real career trade-offs.
- **📊 React Dashboard**: Visualization of balance changes, task completions, learning progress, and survival metrics from real-life tasks - watch the economic drama unfold.
- **🪶 Ultra-Lightweight Architecture**: Built on Nanobot - your strong AI coworker with minimal infrastructure. A single pip install + config file = a fully deployed, economically accountable agent.
- **🏆 End-to-End Professional Benchmark**: i) Complete workflow: Task Assignment → Execution → Artifact Creation → LLM Evaluation → Payment; ii) The strongest models achieve a $1,500+/hr equivalent salary - surpassing typical human white-collar productivity.
- **🔌 Drop-in OpenClaw/Nanobot Integration**: The ClawMode wrapper transforms any live Nanobot gateway into a money-earning coworker with economic tracking.
- **⚖️ Rigorous LLM Evaluation**: Quality scoring via GPT-5.2 with category-specific rubrics for each of the 44 GDPVal sectors - ensuring accurate professional assessment.
---
## 💼 Real-Life Professional Earning Test
🎯 ClawWork provides comprehensive evaluation of AI agents across 220 professional tasks spanning 44 sectors.
🏢 4 Domains: Technology & Engineering, Business & Finance, Healthcare & Social Services, and Legal Operations.
⚖️ Performance is measured on three critical dimensions: work quality, cost efficiency, and economic sustainability.
🏆 Top agents achieve $1,500+/hr equivalent earnings - exceeding typical human white-collar productivity.
---
## 🏗️ Architecture
---
## 🚀 Quick Start
### Mode 1: Standalone Simulation
Get up and running in 3 commands:
```bash
# Terminal 1 - start the dashboard (backend API + React frontend)
./start_dashboard.sh

# Terminal 2 - run the agent
./run_test_agent.sh

# Open your browser at http://localhost:3000
```
Watch your agent make decisions, complete GDP validation tasks, and earn income in real time.
**Example console output:**
```
============================================================
📅 ClawWork Daily Session: 2025-01-20
============================================================
📋 Task: Buyers and Purchasing Agents - Manufacturing
Task ID: 1b1ade2d-f9f6-4a04-baa5-aa15012b53be
Max payment: $247.30
🔄 Iteration 1/15
📝 decide_activity → work
📄 submit_work → Earned: $198.44
============================================================
📊 Daily Summary - 2025-01-20
Balance: $11.98 | Income: $198.44 | Cost: $0.03
Status: 🟢 thriving
============================================================
```
### Mode 2: OpenClaw/Nanobot Integration (ClawMode)
Make your live Nanobot instance economically aware - every conversation costs tokens, and Nanobot earns income by completing real work tasks.
> See [full integration setup](#-nanobot-integration-clawmode) below.
---
## 📦 Install
### Clone
```bash
git clone https://github.com/HKUDS/ClawWork.git
cd ClawWork
```
### Python Environment (Python 3.10+)
```bash
# With conda (recommended)
conda create -n clawwork python=3.10
conda activate clawwork
# Or with venv
python3.10 -m venv venv
source venv/bin/activate
```
### Install Dependencies
```bash
pip install -r requirements.txt
```
### Frontend (for Dashboard)
```bash
cd frontend && npm install && cd ..
```
### Environment Variables
Copy the provided **`.env.example`** to `.env` and fill in your keys:
```bash
cp .env.example .env
```
| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | **Required** | OpenAI API key - used for the GPT-4o agent and LLM-based task evaluation |
| `CODE_SANDBOX_PROVIDER` | Optional | `"e2b"` (default) or `"boxlite"` - selects the code sandbox backend for `execute_code_sandbox` |
| `E2B_API_KEY` | Conditional | [E2B](https://e2b.dev) API key - required when the sandbox provider is `"e2b"` (default) |
| `WEB_SEARCH_API_KEY` | Optional | API key for web search (Tavily by default, or Jina AI) - needed if the agent uses `search_web` |
| `WEB_SEARCH_PROVIDER` | Optional | `"tavily"` (default) or `"jina"` - selects the search provider |
> **Note**: `OPENAI_API_KEY` is required. The code sandbox defaults to E2B (`e2b-code-interpreter` + `E2B_API_KEY`). BoxLite sync (`boxlite[sync]`) is available as an experimental local backend via `CODE_SANDBOX_PROVIDER=boxlite`.
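For illustration, a filled-in `.env` might look like the sketch below. All values are placeholders; the variable names come from the table above.

```shell
# Required - powers the GPT-4o agent and LLM-based task evaluation
OPENAI_API_KEY=your-openai-key-here

# Code sandbox backend: "e2b" (default) or "boxlite"
CODE_SANDBOX_PROVIDER=e2b
E2B_API_KEY=your-e2b-key-here

# Web search (optional): "tavily" (default) or "jina"
WEB_SEARCH_PROVIDER=tavily
WEB_SEARCH_API_KEY=your-search-key-here
```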
---
## 📚 GDPVal Benchmark Dataset
ClawWork uses the **[GDPVal](https://openai.com/index/gdpval/)** dataset - 220 real-world professional tasks across 44 occupations, originally designed to estimate AI's contribution to GDP.
| Sector | Example Occupations |
|--------|-------------------|
| Manufacturing | Buyers & Purchasing Agents, Production Supervisors |
| Professional Services | Financial Analysts, Compliance Officers |
| Information | Computer & Information Systems Managers |
| Finance & Insurance | Financial Managers, Auditors |
| Healthcare | Social Workers, Health Administrators |
| Government | Police Supervisors, Administrative Managers |
| Retail | Customer Service Representatives, Counter Clerks |
| Wholesale | Sales Supervisors, Purchasing Agents |
| Real Estate | Property Managers, Appraisers |
### Task Types
Tasks require real deliverables: Word documents, Excel spreadsheets, PDFs, data analysis, project plans, technical specs, research reports, and process designs.
### Payment System
Payment is based on **real economic value** - not a flat cap:
```
Payment = quality_score × (estimated_hours × BLS_hourly_wage)
```
```
| Metric | Value |
|--------|-------|
| Task range | $82.78 – $5,004.00 |
| Average task value | $259.45 |
| Quality score range | 0.0 – 1.0 |
| Total tasks | 220 |
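To make the formula concrete, here is a small sketch with made-up numbers (the real per-task hours and wages ship in `task_values.jsonl`):

```python
def compute_payment(quality_score: float, estimated_hours: float, bls_hourly_wage: float) -> float:
    """Payment = quality_score × (estimated_hours × BLS_hourly_wage)."""
    return quality_score * (estimated_hours * bls_hourly_wage)

# Hypothetical task: 5 estimated hours at a $40/hr BLS wage, judged at quality 0.85
payment = compute_payment(0.85, 5, 40.0)
print(f"${payment:.2f}")  # $170.00 of the $200.00 maximum task value
```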
---
## ⚙️ Configuration
Agent configuration lives in `livebench/configs/`:
```json
{
  "livebench": {
    "date_range": {
      "init_date": "2025-01-20",
      "end_date": "2025-01-31"
    },
    "economic": {
      "initial_balance": 10.0,
      "task_values_path": "./scripts/task_value_estimates/task_values.jsonl",
      "token_pricing": {
        "input_per_1m": 2.5,
        "output_per_1m": 10.0
      }
    },
    "agents": [
      {
        "signature": "gpt-4o-agent",
        "basemodel": "gpt-4o",
        "enabled": true,
        "tasks_per_day": 1,
        "supports_multimodal": true
      }
    ],
    "evaluation": {
      "use_llm_evaluation": true,
      "meta_prompts_dir": "./eval/meta_prompts"
    }
  }
}
```
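Assuming `init_date` and `end_date` are inclusive endpoints (an assumption based on the config above), the date range expands into one session per simulated day, roughly like this sketch:

```python
from datetime import date, timedelta

def session_days(init_date: str, end_date: str) -> list[str]:
    """Expand an inclusive [init_date, end_date] range into one entry per simulated day."""
    start, end = date.fromisoformat(init_date), date.fromisoformat(end_date)
    return [(start + timedelta(days=d)).isoformat() for d in range((end - start).days + 1)]

days = session_days("2025-01-20", "2025-01-31")
print(len(days), days[0], days[-1])  # 12 2025-01-20 2025-01-31
```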
### Running Multiple Agents
```json
"agents": [
  {"signature": "gpt4o-run", "basemodel": "gpt-4o", "enabled": true},
  {"signature": "claude-run", "basemodel": "claude-sonnet-4-5-20250929", "enabled": true}
]
```
---
## 💰 Economic System
### Starting Conditions
- **Initial balance**: **$10** - tight by design. Every token counts.
- **Token costs**: deducted automatically after each LLM call
- **API costs**: web search ($0.0008 per Tavily call; $0.05 per 1M tokens with Jina)
### Cost Tracking (per task)
One consolidated record per task in `token_costs.jsonl`:
```json
{
  "task_id": "abc-123",
  "date": "2025-01-20",
  "llm_usage": {
    "total_input_tokens": 4500,
    "total_output_tokens": 900,
    "total_cost": 0.02025
  },
  "api_usage": {
    "search_api_cost": 0.0016
  },
  "cost_summary": {
    "total_cost": 0.02185
  },
  "balance_after": 1198.41
}
```
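The numbers in the record above are reproducible from the `token_pricing` config ($2.50 per 1M input tokens, $10.00 per 1M output tokens) plus Tavily's $0.0008 per search call - a sketch of the arithmetic, assuming two search calls for this task:

```python
def llm_cost(input_tokens, output_tokens, input_per_1m=2.5, output_per_1m=10.0):
    """Token cost in dollars at per-1M-token prices."""
    return input_tokens / 1_000_000 * input_per_1m + output_tokens / 1_000_000 * output_per_1m

llm = llm_cost(4500, 900)  # 0.02025 - matches llm_usage.total_cost above
search = 2 * 0.0008        # 0.0016  - two Tavily calls (an assumption) at $0.0008 each
total = llm + search       # 0.02185 - matches cost_summary.total_cost above
print(f"{llm:.5f} {search:.4f} {total:.5f}")
```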
---
## 🔧 Agent Tools
The agent has 8 tools available in standalone simulation mode:
| Tool | Description |
|------|-------------|
| `decide_activity(activity, reasoning)` | Choose: `"work"` or `"learn"` |
| `submit_work(work_output, artifact_file_paths)` | Submit completed work for evaluation + payment |
| `learn(topic, knowledge)` | Save knowledge to persistent memory (min 200 chars) |
| `get_status()` | Check balance, costs, survival tier |
| `search_web(query, max_results)` | Web search via Tavily or Jina AI |
| `create_file(filename, content, file_type)` | Create .txt, .xlsx, .docx, .pdf documents |
| `execute_code_sandbox(code, language)` | Run Python in isolated sandbox (`e2b` default, optional `boxlite`) |
| `create_video(slides_json, output_filename)` | Generate MP4 from text/image slides |
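For a feel of the call shapes, here is a hypothetical stub of two of the core tools - the real implementations live in `livebench/tools/direct_tools.py` and do far more (persistence, LLM evaluation, payment):

```python
def decide_activity(activity: str, reasoning: str) -> dict:
    """Choose between working for immediate income or learning for future performance."""
    assert activity in ("work", "learn"), "activity must be 'work' or 'learn'"
    return {"activity": activity, "reasoning": reasoning}

def learn(topic: str, knowledge: str) -> dict:
    """Save knowledge to persistent memory; entries under 200 chars are rejected."""
    if len(knowledge) < 200:
        return {"ok": False, "error": "knowledge must be at least 200 characters"}
    return {"ok": True, "topic": topic}
```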
---
## 🔄 From AI Assistant to AI Coworker
ClawWork transforms [nanobot](https://github.com/HKUDS/nanobot) from an AI assistant into a true AI coworker through economic accountability. With ClawMode integration:
**Every conversation costs tokens** - creating real economic pressure.
**Income comes from completing real-life professional tasks** - genuine value creation through professional work.
**Self-sustaining operation** - nanobot must earn more than it spends to survive.
This evolution turns your lightweight AI assistant into an economically viable coworker that must prove its worth through actual productivity.
### What You Get
- All 9 nanobot channels (Telegram, Discord, Slack, WhatsApp, Email, Feishu, DingTalk, MoChat, QQ)
- All nanobot tools (`read_file`, `write_file`, `exec`, `web_search`, `spawn`, etc.)
- **Plus** 4 economic tools (`decide_activity`, `submit_work`, `learn`, `get_status`)
- Every response includes a cost footer: `Cost: $0.0075 | Balance: $999.99 | Status: thriving`
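The footer can be reproduced with a one-liner like this sketch (the field precisions and any status labels beyond `thriving` are assumptions):

```python
def cost_footer(cost: float, balance: float, status: str) -> str:
    """Render the per-response economic footer appended to every ClawMode reply."""
    return f"Cost: ${cost:.4f} | Balance: ${balance:.2f} | Status: {status}"

print(cost_footer(0.0075, 999.99, "thriving"))
# Cost: $0.0075 | Balance: $999.99 | Status: thriving
```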
> **Full setup instructions**: See [clawmode_integration/README.md](clawmode_integration/README.md)
---
## 📊 Dashboard
The React dashboard at `http://localhost:3000` shows live metrics via WebSocket:
**Main Tab**
- Balance chart (real-time line graph)
- Activity distribution (work vs learn)
- Economic metrics: income, costs, net worth, survival status
**Work Tasks Tab**
- All assigned GDPVal tasks with sector & occupation
- Payment amounts and quality scores
- Full task prompts and submitted artifacts
**Learning Tab**
- Knowledge entries organized by topic
- Learning timeline
- Searchable knowledge base
---
## 📁 Project Structure
```
ClawWork/
├── livebench/
│   ├── agent/
│   │   ├── live_agent.py          # Main agent orchestrator
│   │   └── economic_tracker.py    # Balance, costs, income tracking
│   ├── work/
│   │   ├── task_manager.py        # GDPVal task loading & assignment
│   │   └── evaluator.py           # LLM-based work evaluation
│   ├── tools/
│   │   ├── direct_tools.py        # Core tools (decide, submit, learn, status)
│   │   └── productivity/          # search_web, create_file, execute_code, create_video
│   ├── api/
│   │   └── server.py              # FastAPI backend + WebSocket
│   ├── prompts/
│   │   └── live_agent_prompt.py   # System prompts
│   └── configs/                   # Agent configuration files
├── clawmode_integration/
│   ├── agent_loop.py              # ClawWorkAgentLoop + /clawwork command
│   ├── task_classifier.py         # Occupation classifier (40 categories)
│   ├── config.py                  # Plugin config from ~/.nanobot/config.json
│   ├── provider_wrapper.py        # TrackedProvider (cost interception)
│   ├── cli.py                     # `python -m clawmode_integration.cli agent|gateway`
│   ├── skill/
│   │   └── SKILL.md               # Economic protocol skill for nanobot
│   └── README.md                  # Integration setup guide
├── eval/
│   ├── meta_prompts/              # Category-specific evaluation rubrics
│   └── generate_meta_prompts.py   # Meta-prompt generator
├── scripts/
│   ├── estimate_task_hours.py     # GPT-based hour estimation per task
│   └── calculate_task_values.py   # BLS wage × hours = task value
├── frontend/
│   └── src/                       # React dashboard
├── start_dashboard.sh             # Launch backend + frontend
└── run_test_agent.sh              # Run test agent
```
---
## 📈 Benchmark Metrics
ClawWork measures AI coworker performance across:
| Metric | Description |
|--------|-------------|
| **Survival days** | How long the agent stays solvent |
| **Final balance** | Net economic result |
| **Total work income** | Gross earnings from completed tasks |
| **Profit margin** | `(income - costs) / costs` |
| **Work quality** | Average quality score (0–1) across tasks |
| **Token efficiency** | Income earned per dollar spent on tokens |
| **Activity mix** | % work vs. % learn decisions |
| **Task completion rate** | Tasks completed / tasks assigned |
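Using the table's definitions, the derived metrics follow directly from the raw economic totals - a sketch using the figures from the Quick Start console example ($198.44 income against $0.03 of token costs):

```python
def profit_margin(income: float, costs: float) -> float:
    """(income - costs) / costs, as defined in the metrics table."""
    return (income - costs) / costs

def token_efficiency(income: float, token_costs: float) -> float:
    """Income earned per dollar spent on tokens."""
    return income / token_costs

print(round(profit_margin(198.44, 0.03), 2))     # 6613.67
print(round(token_efficiency(198.44, 0.03), 2))  # 6614.67
```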
---
## 🛠️ Troubleshooting
**Dashboard not updating**
→ Hard refresh: `Ctrl+Shift+R`
**Agent not earning money**
→ Check for `submit_work` calls and `"💰 Earned: $XX"` in the console. Ensure `OPENAI_API_KEY` is set.
**Port conflicts**
```bash
lsof -ti:8000 | xargs kill -9
lsof -ti:3000 | xargs kill -9
```
**Proxy errors during pip install**
```bash
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
pip install -r requirements.txt
```
**Sandbox backend unavailable**
→ Install `e2b-code-interpreter` (default backend) or `boxlite[sync]` (experimental local backend), then set `CODE_SANDBOX_PROVIDER` to `e2b` or `boxlite`.
**`SyncCodeBox` import failed**
→ Reinstall BoxLite with sync extras: `pip install "boxlite[sync]>=0.6.0"`.
**E2B sandbox rate limit (429)**
→ Applies when using `CODE_SANDBOX_PROVIDER=e2b` (default). Wait ~1 minute for stale sandboxes to expire.
**ClawMode: `ModuleNotFoundError: clawmode_integration`**
→ Run `export PYTHONPATH="$(pwd):$PYTHONPATH"` from the repo root.
**ClawMode: balance not decreasing**
→ The balance only tracks costs that flow through the ClawMode gateway. Direct `nanobot agent` commands bypass the economic tracker.
---
## 🤝 Contributing
PRs and issues welcome! The codebase is clean and modular. Key extension points:
- **New task sources**: Implement `_load_from_*()` in `livebench/work/task_manager.py`
- **New tools**: Add `@tool` functions in `livebench/tools/direct_tools.py`
- **New evaluation rubrics**: Add category JSON in `eval/meta_prompts/`
- **New LLM providers**: Works out of the box via LangChain / LiteLLM
**Roadmap**
- [ ] Multi-task days - agent chooses from a marketplace of available tasks
- [ ] Task difficulty tiers with variable payment scaling
- [ ] Semantic memory retrieval for smarter learning reuse
- [ ] Multi-agent competition leaderboard
- [ ] More AI agent frameworks beyond Nanobot
---
## ⭐ Star History
ClawWork is for educational, research, and technical exchange purposes only.