# InfiniteBench **Repository Path**: gu-wei666/InfiniteBench ## Basic Information - **Project Name**: InfiniteBench - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-01-21 - **Last Updated**: 2026-01-21 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README
## 简介 理解、处理长文本,是大模型迈向更深层次理解与交互阶段必备的能力。现已有大模型声称可以处理100k+的长序列,但是对应的标准评测集却是空缺的。为此,我们构建了一个面向 100k+ 的评测集,InfiniteBench。该评测集针对大模型在长文本方面的五项能力而设计:检索、数学、代码、问答、和摘要。 ## 特点 - **长上下文:** InfiniteBench 测试数据的平均上下文长度为195k,远超现有评测数据。 - **多领域多语言:** InfiniteBench 评测集包含12个任务,包括中英双语,涵盖了检索、数学、代码、问答、和摘要等5个领域。 - **前瞻性挑战性:** InfiniteBench 测试任务,对标当前最强的模型如 GPT-4, Claude 2 等。 - **真实场景与合成场景:** InfiniteBench 既包含真实场景数据,探测大模型在处理实际问题的能力;也包含合成数据,为测试数据拓展上下文窗口提供了便捷。 ## 任务构成 | Task Name | Context | # Examples | Avg Input Tokens | Avg Output Tokens | Description | | -------------------- | ------------- | ---------- | ---------------- | ----------------- | ------------------------------------------------------------------------------------------- | | En.Sum | Fake Book | 103 | 171.5k | 1.1k | Summarization of a fake book created with core entity substitution. | | En.QA | Fake Book | 351 | 192.6k | 4.8 | Free-form question answering based on the fake book. | | En.MC | Fake Book | 229 | 184.4k | 5.3 | Multiple choice questions derived from the fake book. | | En.Dia | Script | 200 | 103.6k | 3.4 | Identification of talkers in partially anonymized scripts. | | Zh.QA | New Book | 175 | 2068.6k | 6.3 | Question answering on a set of newly collected books. | | Code.Debug | Code Document | 394 | 114.7k | 4.8 | Finding which function in a code repo contains an crashing error (in multiple choice form). | | Code.Run | Synthetic | 400 | 75.2k | 1.3 | Simulating execution of multiple simple, synthetic functions. | | Math.Calc | Synthetic | 50 | 43.9k | 43.9k | Calculations involving super-long arithmetic equations. | | Math.Find | Synthetic | 350 | 87.9k | 1.3 | Finding special integers in a lengthy list. | | Retrieve.PassKey[^1] | Synthetic | 590 | 122.4k | 2.0 | Retrieving hidden keys in a noisy long context. | | Retrieve.Number | Synthetic | 590 | 122.4k | 4.0 | Locating repeated hidden numbers in a noisy long context. | | Retrieve.KV[^2] | Synthetic | 500 | 89.9k | 22.7 | Finding the corresponding value from a dictionary and a key. | ## 评测结果 我们在 SOTA 模型上评测了 InfiniteBench 结果如下: | Task Name | GPT-4 | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K | | ---------------- | ------ | --------------- | --------- | -------- | -----------| -----------| -----------| | Retrieve.PassKey | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% | | Retrieve.Number | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% | | Retrieve.KV | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% | | En.Sum | 14.73% | 9.09% | 17.96% | 14.50% | < 5% | < 5% |< 5% | | En.QA | 22.44% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% |< 5% | | En.MC | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% |38.43% |10.48% | | En.Dia | 8.50% | 7.50% | 11.50% | 46.50% | < 5% |< 5% |< 5% | | Zh.QA | 25.96% | 16.98% | 17.93% | 9.64% | 15.07% |13.61% |< 5% | | Code.Debug | 37.06% | < 5% | 17.77% | < 5% | 9.14% |13.96% |7.36% | | Code.Run | 23.25% | < 5% | < 5% | < 5% | < 5% |< 5% |< 5% | | Math.Calc | < 5% | < 5% | < 5% | < 5% | < 5% |< 5% |< 5% | | Math.Find | 60.00% | 17.14% | 12.57% | 32.29% | < 5% |25.71% |7.71% | 注: 1. YaRN-Mistral-7B 实现代码已开源在仓库,请大家批评指正;Kimi-Chat 和 Claude 2 使用用户界面评测,GPT-4 使用 API 评测,均使用官方默认配置。 ## 评测 ## 获取数据集 从