# AudioStory
**Repository Path**: citiao/AudioStory
## Basic Information
- **Project Name**: AudioStory
- **Default Branch**: main
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-02
- **Last Updated**: 2025-09-02
## README
# AudioStory: Generating Long-Form Narrative Audio with Large Language Models
**[Yuxin Guo](https://scholar.google.com/citations?user=x_0spxgAAAAJ&hl=en)<sup>1,2</sup>,
[Teng Wang](http://ttengwang.com/)<sup>2,✉</sup>,
[Yuying Ge](https://geyuying.github.io/)<sup>2</sup>,
[Shijie Ma](https://mashijie1028.github.io/)<sup>1,2</sup>,
[Yixiao Ge](https://geyixiao.com/)<sup>2</sup>,
[Wei Zou](https://people.ucas.ac.cn/~zouwei)<sup>1</sup>,
[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en)<sup>2</sup>**

<sup>1</sup>Institute of Automation, CAS
<sup>2</sup>ARC Lab, Tencent PCG
## 📖 Release
- [2025/8/28] 🔥🔥 We release the inference code!
- [2025/8/28] 🔥🔥 We release our demo videos!
## 🔎 Introduction

✨ **TL;DR: We propose a model for long-form narrative audio generation built upon a unified understanding–generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.**

Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory has strong instruction-following and reasoning-based generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and a consistent emotional tone. AudioStory has two appealing features:
1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, namely a bridging query for intra-event semantic alignment and a consistency query for cross-event coherence preservation.
2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components.

Furthermore, we establish a benchmark, AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives.
Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
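
To make the idea of "temporally ordered sub-tasks" more concrete, the following is a minimal, illustrative sketch of how a narrative could be laid out as ordered audio events that together fill a target duration. It is not AudioStory's implementation: in AudioStory the LLM produces the decomposition, whereas here the captions and relative weights are supplied by hand. The names `AudioEvent` and `plan_timeline` are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    """One sub-event of a narrative (hypothetical structure for illustration)."""
    index: int      # position in the narrative, starting at 0
    caption: str    # text description of what should be heard
    start: float    # start time in seconds
    end: float      # end time in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def plan_timeline(captions, weights, total_duration):
    """Lay out captions back to back, splitting total_duration by relative weight."""
    if not captions or len(captions) != len(weights):
        raise ValueError("captions and weights must be non-empty and equally long")
    if total_duration <= 0 or any(w <= 0 for w in weights):
        raise ValueError("total_duration and all weights must be positive")

    scale = total_duration / sum(weights)
    events = []
    cursor = 0.0
    for i, (caption, weight) in enumerate(zip(captions, weights)):
        end = cursor + weight * scale
        events.append(AudioEvent(index=i, caption=caption, start=cursor, end=end))
        cursor = end

    # Pin the last boundary to the requested length to avoid floating-point drift.
    events[-1].end = float(total_duration)
    return events


# Hand-written decomposition of one narrative; the 2:1 split is an assumption.
example_plan = plan_timeline(
    captions=[
        "a fire truck leaves the station with sirens blaring",
        "the fire truck drives away",
    ],
    weights=[2.0, 1.0],
    total_duration=35.1,
)
```

Each event starts exactly where the previous one ends, so the plan covers the full requested duration without gaps or overlaps. A real system would then need to condition the audio generator on each caption while carrying context across events.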
## ⭐ Demos
### 1. Video Dubbing (Tom & Jerry style)
> Dubbing is achieved using AudioStory (trained on Tom & Jerry) with visual captions extracted from videos.
- Instruction: "Develop a comprehensive audio that fully represents jake shimabukuro performs a complex ukulele piece in a studio, receives applause, and discusses his career in an interview. The total duration is 49.9 seconds."
- Instruction: "Develop a comprehensive audio that fully represents a fire truck leaves the station with sirens blaring, signaling an emergency response, and drives away. The total duration is 35.1 seconds."
- Instruction: "Understand the input audio, infer the subsequent events, and generate the continued audio of the coach giving basketball lessons to the players. The total duration is 36.6 seconds."
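
The demo instructions above share a visible pattern: a task description followed by "The total duration is N seconds." As a small sketch, the helpers below build a generation instruction in that shape and read the duration back out of any of the instructions. The template is inferred from the examples shown here; whether the model requires this exact phrasing is an assumption, and `build_instruction` and `parse_duration` are hypothetical helper names, not part of the AudioStory codebase.

```python
import re

# Template inferred from the demo instructions; treat the wording as an assumption.
GENERATION_TEMPLATE = (
    "Develop a comprehensive audio that fully represents {description}. "
    "The total duration is {duration:.1f} seconds."
)

# Matches the trailing duration sentence used in all of the demo instructions.
DURATION_PATTERN = re.compile(r"The total duration is (\d+(?:\.\d+)?) seconds\.")


def build_instruction(description: str, duration: float) -> str:
    """Format a text-to-audio instruction in the style of the demos."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return GENERATION_TEMPLATE.format(
        description=description.strip().rstrip("."),
        duration=duration,
    )


def parse_duration(instruction: str):
    """Return the requested duration in seconds, or None if it is not stated."""
    match = DURATION_PATTERN.search(instruction)
    return float(match.group(1)) if match else None


instruction = build_instruction(
    "a fire truck leaves the station with sirens blaring, "
    "signaling an emergency response, and drives away",
    35.1,
)
```

`parse_duration` also works on the continuation-style instruction, since only the final duration sentence is matched.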