# mcoplib **Repository Path**: metax-maca/mcoplib ## Basic Information - **Project Name**: mcoplib - **Description**: No description available - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-04-27 - **Last Updated**: 2026-04-27 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # mcOpLib ## Compilation note: Please prioritize compiling within the published vllm/sglang Docker images, for example: ```shell docker run -it --name=mcoplib-build --shm-size 16384m --device=/dev/dri --device=/dev/mxcd --group-add=video --network=host --ulimit memlock=-1 --privileged=true -v /sw_home/metax/:/home/metax -v /pde_ai/models:/models ai-master/maca/sglang:0.5.1-maca.ai20251013-45-torch2.6-py310-ubuntu22.04-amd64 /bin/bash ``` Install build dependencies: ```shell # Install cmake. Note: If compiling inside a container and the code is stored on a network shared drive, you must first switch to the root user and install cmake as root. pip3 install cmake==3.26.3 # Install pybind11 pip3 install pybind11 pip3 install build pip3 install setuptools-scm==8.0 ``` Environment variable setup: ```shell # Switch to the source code directory and execute the following command source env.sh ``` Project source code compilation: ```shell cd /path/source/code/dir/mcoplib # Source code compilation command. This command will not display compilation logs. If you need to view compilation logs, add parameters: "-v", "-vv", or "-vvv". # After compilation is complete, the generated dynamic libraries and artifacts are located under `mcoplib` in the source directory. Incremental compilation is not supported. pip install -e . --no-build-isolation pip install -e . --no-build-isolation -v pip install -e . --no-build-isolation -vv pip install -e . --no-build-isolation -vvv # mcoplib also supports compilation via python. The following two commands support incremental compilation: python setup.py develop # "build_ext --inplace" focuses only on the extension build strategy itself; "develop" performs "installation/registration/dependency handling" in addition to building. python setup.py build_ext --inplace # Print detailed WCUDA information during compilation export WCUDA_DEBUG=1 ``` note: When compiling using the pip install -e . --no-build-isolation -v (or -vv, -vvv) command, print messages within setup.py will not be printed immediately. This is because pip uses a pipe to capture stdout/stderr from the subprocess in order to echo it upon failure or merge the display in verbose mode. Therefore, print messages from setup.py will only be displayed after compilation fails or completes successfully. CUTLASS OP API接口编译控制 ```shell #The compilation of CUTLASS OP API is enabled by default #The compilation of CUTLASS OP can also be controlled through environment variables #Enable export ENABLE_BUILD_CUTLASS_OP=1 #Disable export ENABLE_BUILD_CUTLASS_OP=0 ``` Project Packaging Command: ```shell # First set environment variables cd /path/source/code/dir python -m build --no-isolation # After the packaging command finishes, the whl package will be in the source code's dist directory, for example: mcoplib-0.1.0+maca3.0.0.8.torch2.6-cp310-cp310-linux_x86_64.whl ``` ## Installation ```shell pip3 install mcoplib-0.1.0+maca3.0.0.8.torch2.6-cp310-cp310-linux_x86_64.whl ``` ## mcoplib CV Op Kernel Compilation and Packaging ```shell # Switch to the source directory (~/mcOpLib/gerrit_mcoplib/mcoplib_dev/mcoplib) and execute the following command source env.sh cd /path/source/code/dir/mcoplib/op/cv/ # Execute commands: Configure + Build cmake_maca -S . -B build -DCMAKE_BUILD_TYPE=Release cmake_maca --build build -j$(nproc) # Generate deb cd build cpack -G DEB ``` ### CV Op Deb Package Installation ```shell # cd pkg directory dpkg -i mcoplib_cv-0.2.0-Linux.deb # sudo sudo dpkg -i mcoplib_cv-0.2.0-Linux.deb # After installation is complete, the /opt/maca-ai/mcoplib directory structure is as follows: root@lt-srv-10-2-182-63:~/mcoplib# tree . |-- include | |-- arithm.h | |-- calsum.h | |-- count_nozero.h | |-- meanstdev.h | |-- process_interface.h | |-- split.h | `-- utils.h `-- lib `-- libmcoplib_cv.so ``` ### Mcoplib cv Op kernel Test ```shell # Testing the mcoplib cv op kernel requires installing the mcoplib_cv-0.2.0-Linux.deb package first. # After installing the deb package, the mcoplib cv library and header files will be located in the /opt/maca-ai/mcoplib directory. dpkg -i mcoplib_cv-0.2.0-Linux.deb # Switch to the source code directory (~/mcOplib/gerrit_mcoplib/mcoplib_dev/mcoplib) and execute the following commands: source env.sh cd /path/source/code/dir/mcoplib/unit_test/cpp mkdir build cmake_maca .. && make_maca ``` ## Installation for Enabling VLLM Custom Operators ```shell pip3 install mcoplib-0.1.0+maca3.0.0.8.torch2.6-cp310-cp310-linux_x86_64.whl ``` ## Get Version Information ```shell # After installing the mcoplib package, execute the following command in the shell terminal to retrieve version information: mcoplib_version ```` ## Control Compilation via Environment Variables ```shell # The BUILD_VLLM_SUBMODULE environment variable controls whether vllm op operators are compiled; enabled by default. export BUILD_VLLM_SUBMODULE=OFF # The BUILD_SGLANG_SUBMODULE environment variable controls whether sglang op operators are compiled; enabled by default. # The sglang inference framework generally depends on the vllm op kernel. export BUILD_SGLANG_SUBMODULE=OFF # The BUILD_LMDEPLOY_SUBMODULE environment variable controls whether lmdeploy op operators are compiled; enabled by default. export BUILD_LMDEPLOY_SUBMODULE=OFF # The BUILD_DEFAULT_OP_SUBMODULE environment variable controls whether default op operators are compiled; enabled by default. # Under normal circumstances, default operators must be enabled as they are reused. # Additionally, when importing mcoplib, it defaults to importing mcoplib.op; if disabled, it will cause an import error. export BUILD_DEFAULT_OP_SUBMODULE=OFF # Control over multiple operator compilation modules export BUILD_VLLM_SUBMODULE=OFF BUILD_SGLANG_SUBMODULE=OFF BUILD_LMDEPLOY_SUBMODULE=OFF ``` ## Dynamic control operator input parameter information terminal output or parameter dump to the local disk ```shell # Enable operator input parameter information to be output to the terminal (including data type, shape and other information) export MCOP_DEBUG_TRACE=1 # enable dump export MCOP_DEBUG_PARAMS_DUMP=1 # (Optional) Configure the number of samples export MCOP_TENSOR_DUMP_SAMPLE_SIZE=20 # or dump all tensor data (optional) export MCOP_TENSOR_DUMP_FULL=1 ``` ### Operator parameter dump local example ```json { "function": "fused_moe_gate_deepseek", "parameters": [ { "name": "gating_outputs", "type": "at::Tensor", "dtype": "Half", "shape": "[16, 448]", "value": "[0.480469, 0.894531, 0.0356445, 0.0322266, 0.498047, 0.899902, 0.887207, 0.763672, 0.192871, 0.271484, ..., 0.730469, 0.484375, 0.0517578, 0.4375, 0.507812, 0.979492, 0.42041, 0.184082, 0.825195, 0.395508] (showing 20 of 7168 elements, set MCOP_TENSOR_DUMP_FULL=1 for all) [data_ptr=0x7f43fbc00000]", "bytes": 14336 }, { "name": "correction_bias", "type": "at::Tensor", "dtype": "Half", "shape": "[448]", "value": "[0.0732422, 0.508789, 0.522461, 0.0961914, 0.373535, 0.535645, 0.0454102, 0.862305, 0.300781, 0.5625, ..., 0.757324, 0.407227, 0.803711, 0.134766, 0.777344, 0.895996, 0.731445, 0.388184, 0.0571289, 0.395996] (showing 20 of 448 elements, set MCOP_TENSOR_DUMP_FULL=1 for all) [data_ptr=0x7f43fbc03800]", "bytes": 896 }, { "name": "out_routing_weights", "type": "at::Tensor", "dtype": "Float", "shape": "[16, 8]", "value": "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] (showing 20 of 128 elements, set MCOP_TENSOR_DUMP_FULL=1 for all) [data_ptr=0x7f43fbc03c00]", "bytes": 512 }, { "name": "out_selected_experts", "type": "at::Tensor", "dtype": "Int", "shape": "[16, 8]", "value": "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] (showing 20 of 128 elements, set MCOP_TENSOR_DUMP_FULL=1 for all) [data_ptr=0x7f43fbc03e00]", "bytes": 512 }, { "name": "topk", "type": "int", "dtype": "int", "shape": "[]", "value": "8", "bytes": 4 }, { "name": "renormalize", "type": "bool", "dtype": "bool", "shape": "[]", "value": "1", "bytes": 1 }, { "name": "num_expert_group", "type": "int", "dtype": "int", "shape": "[]", "value": "1", "bytes": 4 }, { "name": "topk_group", "type": "int", "dtype": "int", "shape": "[]", "value": "1", "bytes": 4 }, { "name": "num_fused_shared_experts", "type": "std::optional", "dtype": "int", "shape": "[]", "value": "nullopt", "bytes": 0 }, { "name": "routed_scaling_factor", "type": "std::optional", "dtype": "float", "shape": "[]", "value": "5.5", "bytes": 4 }, { "name": "moegate_type", "type": "std::optional", "dtype": "int", "shape": "[]", "value": "0", "bytes": 4 } ] } ``` ## Getting started ### samples ```python # Example of calling operators in mcoplib op and vllm _C import contextlib from typing import TYPE_CHECKING, Optional, Union import torch import vllm.envs as envs from vllm.logger import init_logger from vllm.platforms import current_platform from vllm.scalar_type import ScalarType from mcoplib import op as ops logger = init_logger(__name__) if not current_platform.is_tpu() and not current_platform.is_xpu(): try: import mcoplib._C except ImportError as e: logger.warning("Failed to import from vllm._C with %r", e) supports_moe_ops = False with contextlib.suppress(ImportError): import mcoplib._moe_C # noqa: F401 supports_moe_ops = True if TYPE_CHECKING: def register_fake(fn): return lambda name: fn else: try: from torch.library import register_fake except ImportError: from torch.library import impl_abstract as register_fake def rms_norm( hidden_states: Tensor, weight: Tensor, epsilon: float, ) -> Tensor: input_dtype = hidden_states.dtype hidden_states = hidden_states.to(torch.float32) weight = weight.to(torch.float32) output = torch.empty_like(hidden_states) ops.rms_norm(output, hidden_states, weight, epsilon, None, None,False)# Operators in the mcoplib op module # page attention ops def paged_attention_v1( out: torch.Tensor, query: torch.Tensor, key_cache: torch.Tensor, value_cache: torch.Tensor, num_kv_heads: int, scale: float, block_tables: torch.Tensor, seq_lens: torch.Tensor, block_size: int, max_seq_len: int, alibi_slopes: Optional[torch.Tensor], kv_cache_dtype: str, k_scale: torch.Tensor, v_scale: torch.Tensor, tp_rank: int = 0, blocksparse_local_blocks: int = 0, blocksparse_vert_stride: int = 0, blocksparse_block_size: int = 64, blocksparse_head_sliding_step: int = 0, ) -> None: # Invocation of the paged_attention_v1 operator in the mcoplib vllm op kernel _C module torch.ops._C.paged_attention_v1( out, query, key_cache, value_cache, num_kv_heads, scale, block_tables, seq_lens, block_size, max_seq_len, alibi_slopes, kv_cache_dtype, k_scale, v_scale, tp_rank, blocksparse_local_blocks, blocksparse_vert_stride, blocksparse_block_size, blocksparse_head_sliding_step) # sglang sgl_kernel invocation example import torch try: import mcoplib.sgl_kernel as sgl except ImportError as e: print("Failed to import from sgl_kernel with %r", e) try: import mcoplib.sgl_grouped_gemm_cuda except ImportError as e: print("Failed to import from sgl_grouped_gemm_cuda with %r", e) try: import mcoplib.sgl_moe_fused_w4a16 except ImportError as e: print("Failed to import from sgl_moe_fused_w4a16 with %r", e) # Function: In MLA, apply rotary_emb to q, apply rms_normal to latent_cache, update latent_cache and kv_a, then apply rotary_emb to latent_cache. # Call torch's kv_b_proj to calculate kv, copy data from kv to k and v, and copy data from latent_cache to k. # Input: # Output: # Limitations: def fused_mla_normal_rotary_emb( kv_a:torch.tensor, kv_b_proj, q:torch.tensor, # [bs, 128, 192], dtype=bf16 latent_cache:torch.tensor, # [bs, 576], dtype=bf16 positions:torch.tensor, # [bs], dtype=int64 cos_sin_cache:torch.tensor, # [max_position_embeddings, 64], dtype=float norm_weight:torch.tensor, # [512], dtype=bf16 k:torch.tensor, # [bs, 128, 192], dtype=bf16 v:torch.tensor, # [bs, 128, 192], dtype=bf16 q_len:int, #bs qk_nope_head_dim:int, #128 qk_rope_head_dim:int, #64 kv_lora_rank:int, #512 v_head_dim:int, #128 num_local_heads:int , #128 ): out = torch.ops.sgl_kernel.fused_mla_RMS_rotary_emb(q, latent_cache, cos_sin_cache, positions, norm_weight, kv_a, q_len, num_local_heads, kv_lora_rank, qk_rope_head_dim, qk_nope_head_dim) if out != 0: print("Failed to call mcoplib ops.fused_mla_RMS_rotary_emb") kv = kv_b_proj(kv_a) kv = kv[0] if isinstance(kv, tuple) else kv out = torch.ops.sgl_kernel.fused_mla_normal_kv_element_wise(kv, latent_cache, k, v, q_len, num_local_heads, kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim, v_head_dim) if out != 0: print("Failed to call mcoplib ops.fused_mla_normal_kv_element_wise") return q, k, v, latent_cache ``` ## QA - Executing python -m build --no-isolation fails with error: /opt/conda/bin/python: No module named build.__main__; 'build' is a package and cannot be directly executed Answer:When Python tries to execute python -m build, it cannot find the build/__main__.py file, so it cannot run build as an executable module (i.e., __main__ module). The build in your current environment is not the official PyPA build toolkit. You need to install the build package: pip install --force-reinstall build - After building and packaging mcoplib, version information cannot be displayed, and there is no version file in the package directory. Answer: This is caused by the lack of the git command in the build environment. Please install the git command in the build environment. - Error during compilation: FileNotFoundError: [Errno 2] No such file or directory: 'cmake_maca' Answer: Please execute the environment variable script env.sh before compiling: cd /code/dir/mcoplib/ && source env.sh ## Release ### Release 0.4.4 - add cv op kernel - support sglang 0.5.10 op - optimize mcoplib project build - support mxbench for auto test op kernel `s perfromance - support profiler tools check op kernel `s perfromance - support for vllm 0.19.0 op kernels - support Project-customized op kernels - support k-transformer op kernels - support verl op kernels - support all of mcopZoo op kernels - support auto print and dump op input params by setting env - support auto build mxbench running env by shell script - support auto test torch/py/c op api by mxbench cmd ## Acknowledgment Show your appreciation to those who have contributed to the project. ## License For open source projects, say how it is licensed. ## Project status If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers.