A collection of common methods used by HarmonyOS drivers.
TLLM_QMM extracts the quantized kernel implementations from NVIDIA's TensorRT-LLM, removes the NvInfer dependency, and exposes an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with new FP8 quantization.
Bind observables to the lifecycle of Activity or Fragment in a non-invasive way.
Last updated: 5 days ago
Fast implementation of BERT inference directly on NVIDIA (CUDA, cuBLAS) and Intel MKL.
Last updated: 5 days ago