# AudioStory

**Repository Path**: citiao/AudioStory

## Basic Information

- **Project Name**: AudioStory
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-02
- **Last Updated**: 2025-09-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# AudioStory: Generating Long-Form Narrative Audio with Large Language Models

**[Yuxin Guo¹,²](https://scholar.google.com/citations?user=x_0spxgAAAAJ&hl=en), [Teng Wang²,✉](http://ttengwang.com/), [Yuying Ge²](https://geyuying.github.io/), [Shijie Ma¹,²](https://mashijie1028.github.io/), [Yixiao Ge²](https://geyixiao.com/), [Wei Zou¹](https://people.ucas.ac.cn/~zouwei), [Ying Shan²](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en)**

¹Institute of Automation, CAS ²ARC Lab, Tencent PCG

## 📖 Release

[2025/8/28] 🔥🔥 We release the inference code!

[2025/8/28] 🔥🔥 We release our demo videos!

## 🔎 Introduction

![audiostory](audiostory.png)

✨ **TL;DR: We propose a model for long-form narrative audio generation built upon a unified understanding–generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.**

Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and a consistent emotional tone.

AudioStory has two appealing features:

1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components: a bridging query for intra-event semantic alignment and a consistency query for cross-event coherence preservation.
2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components.

Furthermore, we establish a benchmark, AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.

## ⭐ Demos

### 1. Video Dubbing (Tom & Jerry style)

> Dubbing is achieved using AudioStory (trained on Tom & Jerry) with visual captions extracted from videos.

### 2. Cross-domain Video Dubbing (Tom & Jerry style)

### 3. Text-to-Long Audio (Natural sound)

Instruction: "Develop a comprehensive audio that fully represents jake shimabukuro performs a complex ukulele piece in a studio, receives applause, and discusses his career in an interview. The total duration is 49.9 seconds."

Instruction: "Develop a comprehensive audio that fully represents a fire truck leaves the station with sirens blaring, signaling an emergency response, and drives away. The total duration is 35.1 seconds."

Instruction: "Understand the input audio, infer the subsequent events, and generate the continued audio of the coach giving basketball lessons to the players. The total duration is 36.6 seconds."
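
The demo instructions above share a recognizable pattern: a task phrase, an event description, and a total duration in seconds. The following is a minimal sketch of how such instruction strings could be composed. The function names are hypothetical and not part of the AudioStory codebase, and the templates are only inferred from the three examples shown here; whether the model requires this exact wording is an assumption.

```python
# Hypothetical helpers (not part of the AudioStory repository) that rebuild
# the instruction wording seen in the demo examples.

def build_generation_instruction(description: str, duration_s: float) -> str:
    """Compose a text-to-long-audio instruction in the style of the demos."""
    return (
        f"Develop a comprehensive audio that fully represents {description}. "
        f"The total duration is {duration_s:.1f} seconds."
    )


def build_continuation_instruction(description: str, duration_s: float) -> str:
    """Compose an audio-continuation instruction in the style of the demos."""
    return (
        "Understand the input audio, infer the subsequent events, and generate "
        f"the continued audio of {description}. "
        f"The total duration is {duration_s:.1f} seconds."
    )


if __name__ == "__main__":
    print(build_generation_instruction(
        "a fire truck leaves the station with sirens blaring, "
        "signaling an emergency response, and drives away",
        35.1,
    ))
    print(build_continuation_instruction(
        "the coach giving basketball lessons to the players",
        36.6,
    ))
```

Keeping the duration as a separate numeric argument makes it easy to keep the instruction text consistent with whatever duration is requested at inference time.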

## 🔎 Methods

![audiostory_framework](audiostory_framework.png)

To achieve effective instruction-following audio generation, the model must be able to understand the input instruction or audio stream and reason about the relevant audio sub-events. To this end, AudioStory adopts a unified understanding-generation framework (see the figure above). Specifically, given a textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs **interleaved reasoning generation**, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. The semantic and residual tokens are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.

## 🔩 Installation

### Dependencies

* Python >= 3.10 ([Anaconda](https://www.anaconda.com/download/#linux) is recommended)
* [PyTorch >= 2.1.0](https://pytorch.org/)
* NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)

### Installation

```
git clone https://github.com/TencentARC/AudioStory.git
cd AudioStory
conda create -n audiostory python=3.10 -y
conda activate audiostory
bash install_audiostory.sh
```

## 📊 Evaluation

### Inference

```
python evaluate/inference.py --model_path /path/to/ckpt --guidance 4.0 --save_folder_name audiostory --total_duration 50
```

## 🔋 Acknowledgement

When building the codebase of continuous denoisers, we referred to [SEED-X](https://github.com/AILab-CVC/SEED-X) and [TangoFlux](https://github.com/declare-lab/TangoFlux). Thanks for their wonderful projects.

## 📆 TO DO

- [ ] Release our gradio demo.
- [ ] Release checkpoints of AudioStory.
- [ ] Release training codes of all three stages.

## 📜 License

This repository is under the [Apache 2 License](https://github.com/mashijie1028/Gen4Rep/blob/main/LICENSE).

## 📚 BibTeX

```
@misc{guo2025audiostory,
      title={AudioStory: Generating Long-Form Narrative Audio with Large Language Models},
      author={Yuxin Guo and Teng Wang and Yuying Ge and Shijie Ma and Yixiao Ge and Wei Zou and Ying Shan},
      year={2025},
      eprint={2508.20088},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.20088},
}
```

## 📧 Contact

If you have further questions, feel free to contact me: guoyuxin2021@ia.ac.cn

Discussions and potential collaborations are also welcome.