# SGLangServerToolkit **Repository Path**: madstone_thu/SGLangServerToolkit ## Basic Information - **Project Name**: SGLangServerToolkit - **Description**: My toolkit for setup and access SGLang Server for hosting open source models - **Primary Language**: Unknown - **License**: MulanPSL-2.0 - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-04-12 - **Last Updated**: 2026-04-13 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Local SGLang Serve Toolkit This repository serves local models through SGLang and exposes OpenAI-compatible APIs for Codex, Claude Code, and other OpenAI-style clients. The local service endpoint itself is the OpenAI-compatible gateway. Use its `/v1` base URL directly from external tools instead of adding a second gateway layer unless you explicitly need multi-backend routing. ## Layout - `src/common/`: shared implementation - `src/wsl2_agent_stack/`: WSL2 bootstrap and agent config toolkit for OpenCode/OpenClaw + Obsidian - `models/glm51_fp8/`: GLM-5.1-FP8 wrappers - `models/minimax_m25/`: Minimax M2.5 wrappers - `docs/`: design and plan documents - `README.md`, `Agent.md`, `report.md`: top-level docs ## Shared Files - `src/common/build_sglang_image.sh`: build the patched SGLang image - `src/common/Dockerfile.sglang`: image definition for the compatibility patch - `src/common/glm_moe_dsa_transformers_patch.py`: GLM compatibility shim - `src/common/benchmark_sglang.py`: generic benchmark runner - `src/common/check_openai_service.sh`: generic OpenAI-compatible smoke test - `src/common/install_kv_cache_stats.sh`: install the KV helper into a container - `src/common/run_kv_cache_stats.sh`: run KV-cache stats against a container - `src/common/kv_cache_stats_sglang.py`: container-internal KV helper - `src/common/expose_api.sh`: expose a local API over SSH reverse tunneling ## Models ### GLM-5.1-FP8 - wrapper directory: `models/glm51_fp8` - model path: `/data/models/ZhipuAI/GLM-5.1-FP8` - default host port: `8000` - model name: `glm-5.1-fp8` Start: ```bash API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/serve.sh ``` Stop: ```bash ./models/glm51_fp8/stop.sh ``` Smoke test: ```bash API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/check.sh ``` Benchmark: ```bash API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/benchmark.sh latency --requests 1 --max-tokens 64 ``` KV-cache: ```bash API_KEY='replace-with-a-strong-secret' ./models/glm51_fp8/kv_cache.sh --json ``` ### Minimax M2.5 - wrapper directory: `models/minimax_m25` - model path: `/data/models/Minimax/Minimax-M2.5` - default host port: `8191` - model name: `minimax-m2.5` - host GPU policy: use only GPUs `4,5,6,7` Start: ```bash API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/serve.sh ``` Stop: ```bash ./models/minimax_m25/stop.sh ``` Smoke test: ```bash API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/check.sh ``` Benchmark: ```bash API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/benchmark.sh latency --requests 1 --max-tokens 64 ``` KV-cache: ```bash API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/kv_cache.sh --json ``` ## Image Build Build the patched local SGLang image: ```bash ./src/common/build_sglang_image.sh ``` The current image remains `glm51-sglang:local`. It is used by both GLM and Minimax wrappers on this machine. ## Fast-Start Defaults Current serve defaults: - `SKIP_SERVER_WARMUP=1` - `DISABLE_CUDA_GRAPH=1` - `ENABLE_JIT_DEEPGEMM=0` - `JIT_DEEPGEMM_PRECOMPILE=0` These remain overridable through environment variables passed to the model `serve.sh` wrapper. ## Benchmark `src/common/benchmark_sglang.py` is model-agnostic. The model wrappers already bind the correct defaults, but you can also call the shared script directly: ```bash BASE_URL='http://127.0.0.1:8191/v1' \ MODEL='minimax-m2.5' \ API_KEY='replace-with-a-strong-secret' \ python3 ./src/common/benchmark_sglang.py latency --requests 1 --max-tokens 64 ``` The benchmark script prints proxy state to `stderr` and bypasses proxies automatically for local loopback URLs. ## KV-Cache Observation Install the helper into a container: ```bash ./src/common/install_kv_cache_stats.sh ``` Run the shared wrapper against a specific container: ```bash API_KEY='replace-with-a-strong-secret' \ CONTAINER_NAME='minimax-m25-sglang' \ ./src/common/run_kv_cache_stats.sh --json ``` The model wrappers already bind the correct container name: ```bash API_KEY='replace-with-a-strong-secret' ./models/minimax_m25/kv_cache.sh --json ``` The KV wrapper supports both SGLang log formats: - `KV size: GB` - `K size: GB, V size: GB` ## Expose Local API Over SSH Use the shared SSH reverse-tunnel helper: ```bash ./src/common/expose_api.sh \ --ssh-target user@example.com \ --remote-port 18191 \ --public-key-file ~/.ssh/id_rsa.pub \ --identity-file ~/.ssh/id_rsa \ --dry-run ``` `--dry-run` validates arguments and key paths without requiring the local service to be up. Background tunnel: ```bash ./src/common/expose_api.sh \ --ssh-target user@example.com \ --remote-port 18191 \ --public-key-file ~/.ssh/id_rsa.pub \ --identity-file ~/.ssh/id_rsa \ --background ``` Result modes: - `public-bind`: remote SSH server allowed a public bind - `loopback-bind`: remote SSH server allowed only loopback bind ## Proxy Behavior - Local smoke tests and benchmarks print proxy state to `stderr` - Loopback access bypasses local proxies by default - This behavior is compatible with `/home/zli/proxy/setup_env.sh` ## WSL2 Agent Stack For a Windows + WSL2 workstation that runs OpenCode and OpenClaw against the public API exposed by this machine, use: ```bash ./src/wsl2_agent_stack/bootstrap_wsl2.sh ./src/wsl2_agent_stack/install_opencode.sh ./src/wsl2_agent_stack/install_openclaw.sh ./src/wsl2_agent_stack/setup_workspace.sh \ --vault-path /mnt/c/Users//Documents/ObsidianVault \ --api-base-url https://your-public-api.example.com/v1 \ --api-key 'replace-with-a-real-key' \ --model minimax-m2.5 ``` Detailed notes are in `src/wsl2_agent_stack/README.md`. Both OpenCode and OpenClaw should point at the same self-hosted OpenAI-compatible `/v1` endpoint. For a one-command WSL2 deployment entry that preserves the underlying four scripts, use: ```bash export VAULT_PATH=/mnt/c/Users//Documents/ObsidianVault export API_BASE_URL=https://your-public-api.example.com/v1 export API_KEY='replace-with-a-real-key' export MODEL=minimax-m2.5 ./src/wsl2_agent_stack/deploy_wsl2_agent_stack.sh ```