# ai-infra-learning

**Repository Path**: cnbubblefish/ai-infra-learning

## Basic Information

- **Project Name**: ai-infra-learning
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-16
- **Last Updated**: 2025-11-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## AI Infra Study Sessions

| Topic | Date | Pre-reading | Recording | Notes | Feedback & Follow-up Questions |
| --- | --- | --- | --- | --- | --- |
| vLLM Quickstart | 2025-05-11 | [Doc: vLLM](https://docs.vllm.ai/en/latest/index.html) | [AI INFRA Study 01 - LLM Landscape Overview / vLLM Quickstart](https://www.bilibili.com/video/BV1T2EGzLEHi) | [01-vllm-quickstart](https://github.com/cr7258/ai-infra-learning/blob/main/lesson/01-vllm-quickstart.md) | |
| PagedAttention | 2025-05-25 | [Blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html)<br>[Paper: Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/pdf/2309.06180)<br>[Video: Fast LLM Serving with vLLM and PagedAttention](https://www.bilibili.com/video/BV1WUYieQEyL) | [AI INFRA Study 02 - vLLM PagedAttention Paper Deep Dive](https://www.bilibili.com/video/BV1GWjjzfE1b) | [02-pagedattention](https://github.com/cr7258/ai-infra-learning/tree/main/lesson/02-pagedattention) | [02-PagedAttention Feedback](https://github.com/cr7258/ai-infra-learning/issues/2) |
| Prefix Caching | 2025-06-08 | [Doc: Automatic Prefix Caching](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html)<br>[Design Doc: Automatic Prefix Caching](https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html)<br>[Paper: SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104) | [AI INFRA Study 03 - Prefix Caching Principles Explained](https://www.bilibili.com/video/BV1jgTRzSEjS) | [03-prefix-caching](https://github.com/cr7258/ai-infra-learning/tree/main/lesson/03-prefix-caching) | |
| Speculative Decoding | 2025-06-22 | [Doc: Speculative Decoding](https://docs.vllm.ai/en/stable/features/spec_decode.html)<br>[Blog: How Speculative Decoding Boosts vLLM Performance by up to 2.8x](https://blog.vllm.ai/2024/10/17/spec-decode.html)<br>[Video: Hacker's Guide to Speculative Decoding in VLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)<br>[Video: Speculative Decoding in vLLM](https://www.youtube.com/watch?v=eVJBFajJRIU)<br>[Paper: Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)<br>[Paper: Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192) | [AI INFRA Study 04 - Speculative Decoding Implementation Approaches](https://www.bilibili.com/video/BV1Q5KWzQEhn) | [04-speculative-decoding](https://github.com/cr7258/ai-infra-learning/tree/main/lesson/04-speculative-decoding) | |
| Chunked-Prefills | 2025-07-13 | [Doc: vLLM Chunked Prefill](https://docs.vllm.ai/en/stable/configuration/optimization.html#chunked-prefill_1)<br>[Paper: SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills](https://arxiv.org/abs/2308.16369)<br>[Paper: DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference](https://arxiv.org/abs/2401.08671)<br>[Paper: Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve](https://arxiv.org/abs/2403.02310) | [AI INFRA Study 05 - Chunked Prefills](https://www.bilibili.com/video/BV1f2uczGEqt) | [05-chunked-prefills](https://github.com/cr7258/ai-infra-learning/tree/main/lesson/05-chunked-prefills) | [05-Chunked-Prefills Feedback & Follow-up Questions](https://github.com/cr7258/ai-infra-learning/issues/1) |
| Disaggregating Prefill and Decoding | 2025-09-21 | [Doc: Disaggregated Prefilling](https://docs.vllm.ai/en/stable/features/disagg_prefill.html#disaggregated-prefilling-experimental)<br>[Doc: vLLM Production Stack Disaggregated Prefill](https://docs.vllm.ai/projects/production-stack/en/latest/tutorials/disagg.html)<br>[Paper: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving](https://arxiv.org/abs/2401.09670)<br>[Paper: Splitwise: Efficient generative LLM inference using phase splitting](https://arxiv.org/abs/2311.18677)<br>[Video: vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM](https://www.youtube.com/watch?v=FPr37jCOvrA) | [AI INFRA Study 06 - PD Disaggregated Inference Architecture Explained](https://www.bilibili.com/video/BV1ZTWAzmEEc) | [06-disaggregating-prefill-and-decoding](https://github.com/cr7258/ai-infra-learning/tree/main/lesson/06-disaggregating-prefill-and-decoding) | [06-PD Disaggregation Feedback](https://github.com/cr7258/ai-infra-learning/issues/3) |
| LoRA Adapters | | [Doc: LoRA Adapters](https://docs.vllm.ai/en/stable/features/lora.html)<br>[Paper: LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) | | | |
| Quantization | | | | | |
| Distributed Inference and Serving | | [Doc: Distributed Inference and Serving](https://docs.vllm.ai/en/stable/serving/distributed_serving.html) | | | |

## Discussion Group (please state your purpose when requesting to join)

## WeChat Official Account

![Search box promo style, white version](https://github.com/user-attachments/assets/bf4c1c47-4e85-407b-8143-68a59b474186)