# SGLangServerToolkit

**Repository Path**: madstone_thu/SGLangServerToolkit

## Basic Information

- **Project Name**: SGLangServerToolkit
- **Description**: My toolkit for setup and access SGLang Server for hosting open source models
- **Primary Language**: Unknown
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-12
- **Last Updated**: 2026-04-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Local SGLang Serve Toolkit

This repository serves local models through SGLang and exposes OpenAI-compatible APIs for Codex, Claude Code, and other OpenAI-style clients.

The local service endpoint itself is the OpenAI-compatible gateway. Use its `/v1` base URL directly from external tools instead of adding a second gateway layer unless you explicitly need multi-backend routing.

## Layout

- `src/common/`: shared implementation
- `src/wsl2_agent_stack/`: WSL2 bootstrap and agent config toolkit for OpenCode/OpenClaw + Obsidian
- `models/glm51_fp8/`: GLM-5.1-FP8 wrappers
- `models/minimax_m25/`: Minimax M2.5 wrappers
- `docs/`: design and plan documents
- `README.md`, `Agent.md`, `report.md`: top-level docs

## Shared Files

- `src/common/build_sglang_image.sh`: build the patched SGLang image
- `src/common/Dockerfile.sglang`: image definition for the compatibility patch
- `src/common/glm_moe_dsa_transformers_patch.py`: GLM compatibility shim
- `src/common/benchmark_sglang.py`: generic benchmark runner
- `src/common/check_openai_service.sh`: generic OpenAI-compatible smoke test
- `src/common/install_kv_cache_stats.sh`: install the KV helper into a container
- `src/common/run_kv_cache_stats.sh`: run KV-cache stats against a container
- `src/common/kv_cache_stats_sglang.py`: container-internal KV helper
- `src/common/expose_api.sh`: expose a local API over SSH reverse tunneling

## Models

### GLM-5.1-FP8

- wrapper directory: `models/glm51_fp8`
- model path: `/data/models/ZhipuAI/GLM-5.1-FP8`
- default host port: `8000`
- model name: `glm-5.1-fp8`

Start:

```bash
API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/serve.sh
```

Stop:

```bash
./models/glm51_fp8/stop.sh
```

Smoke test:

```bash
API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/check.sh
```

Benchmark:

```bash
API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/benchmark.sh latency --requests 1 --max-tokens 64
```

KV-cache:

```bash
API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/kv_cache.sh --json
```

### Minimax M2.5

- wrapper directory: `models/minimax_m25`
- model path: `/data/models/Minimax/Minimax-M2.5`
- default host port: `8191`
- model name: `minimax-m2.5`
- host GPU policy: use only GPUs `4,5,6,7`

Start:

```bash
API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/serve.sh
```

Stop:

```bash
./models/minimax_m25/stop.sh
```

Smoke test:

```bash
API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/check.sh
```

Benchmark:

```bash
API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/benchmark.sh latency --requests 1 --max-tokens 64
```

KV-cache:

```bash
API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/kv_cache.sh --json
```

## Image Build

Build the patched local SGLang image:

```bash
./src/common/build_sglang_image.sh
```

The current image remains `glm51-sglang:local`. It is used by both GLM and Minimax wrappers on this machine.

## Fast-Start Defaults

Current serve defaults:

- `SKIP_SERVER_WARMUP=1`
- `DISABLE_CUDA_GRAPH=1`
- `ENABLE_JIT_DEEPGEMM=0`
- `JIT_DEEPGEMM_PRECOMPILE=0`

These remain overridable through environment variables passed to the model `serve.sh` wrapper.

## Benchmark

`src/common/benchmark_sglang.py` is model-agnostic.

The model wrappers already bind the correct defaults, but you can also call the shared script directly:

```bash
BASE_URL='http://127.0.0.1:8191/v1' \
MODEL='minimax-m2.5' \
API_KEY='replace-with-a-strong-secret' \
python3 ./src/common/benchmark_sglang.py latency --requests 1 --max-tokens 64
```

The benchmark script prints proxy state to `stderr` and bypasses proxies automatically for local loopback URLs.

## KV-Cache Observation

Install the helper into a container:

```bash
./src/common/install_kv_cache_stats.sh
```

Run the shared wrapper against a specific container:

```bash
API_KEY='replace-with-a-strong-secret' \
CONTAINER_NAME='minimax-m25-sglang' \
./src/common/run_kv_cache_stats.sh --json
```

The model wrappers already bind the correct container name:

```bash
API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/kv_cache.sh --json
```

The KV wrapper supports both SGLang log formats:

- `KV size: <n> GB`
- `K size: <n> GB, V size: <n> GB`

## Expose Local API Over SSH

Use the shared SSH reverse-tunnel helper:

```bash
./src/common/expose_api.sh \
  --ssh-target user@example.com \
  --remote-port 18191 \
  --public-key-file ~/.ssh/id_rsa.pub \
  --identity-file ~/.ssh/id_rsa \
  --dry-run
```

`--dry-run` validates arguments and key paths without requiring the local service to be up.

Background tunnel:

```bash
./src/common/expose_api.sh \
  --ssh-target user@example.com \
  --remote-port 18191 \
  --public-key-file ~/.ssh/id_rsa.pub \
  --identity-file ~/.ssh/id_rsa \
  --background
```

Result modes:

- `public-bind`: remote SSH server allowed a public bind
- `loopback-bind`: remote SSH server allowed only loopback bind

## Proxy Behavior

- Local smoke tests and benchmarks print proxy state to `stderr`
- Loopback access bypasses local proxies by default
- This behavior is compatible with `/home/zli/proxy/setup_env.sh`

## WSL2 Agent Stack

For a Windows + WSL2 workstation that runs OpenCode and OpenClaw against the public API exposed by this machine, use:

```bash
./src/wsl2_agent_stack/bootstrap_wsl2.sh
./src/wsl2_agent_stack/install_opencode.sh
./src/wsl2_agent_stack/install_openclaw.sh
./src/wsl2_agent_stack/setup_workspace.sh \
  --vault-path /mnt/c/Users/<you>/Documents/ObsidianVault \
  --api-base-url https://your-public-api.example.com/v1 \
  --api-key 'replace-with-a-real-key' \
  --model minimax-m2.5
```

Detailed notes are in `src/wsl2_agent_stack/README.md`.

Both OpenCode and OpenClaw should point at the same self-hosted OpenAI-compatible `/v1` endpoint.

For a one-command WSL2 deployment entry that preserves the underlying four scripts, use:

```bash
export VAULT_PATH=/mnt/c/Users/<you>/Documents/ObsidianVault
export API_BASE_URL=https://your-public-api.example.com/v1
export API_KEY='replace-with-a-real-key'
export MODEL=minimax-m2.5

./src/wsl2_agent_stack/deploy_wsl2_agent_stack.sh
```