# VividHead

**Repository Path**: hf-datasets/VividHead

## Basic Information

- **Project Name**: VividHead
- **Description**: Mirror of https://huggingface.co/datasets/Soul-AILab/VividHead
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-02-26
- **Last Updated**: 2026-02-26

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

---
license: apache-2.0
task_categories:
- image-to-video
pretty_name: VividHead
size_categories:
- 100K
---

# SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

[Tan Yu*](https://jiayoujiayoujiayoua.github.io/), [Qian Qiao*](https://qianqiaoai.github.io/), [Le Shen*](https://openreview.net/profile?id=%7ELe_Shen3), [Ke Zhou](https://github.com/jokerz0624), [Jincheng Hu](#), [Dian Sheng](#), [Bo Hu](#), [Haoming Qin](#), [Jun Gao](#), [Changhai Zhou](#), [Shunshun Yin](#), [Siyuan Liu](#)

\*Equal Contribution &nbsp; Corresponding Author

# VividHead Dataset

## Highlights

- šŸ”„ **Large-scale, high-quality talking-head dataset** with **330K clips** and **782 hours** of head-cropped video
- šŸ”„ **Broad diversity** across **15+ languages** and a **wide age range (0–60+)**
- šŸ”„ **Rich annotations** including age, gender, ethnicity, and language
- šŸ”„ **Unified and standardized processing**, with a consistent **FPS = 25** and **resolution = 512 Ɨ 512**

## 🌰 Examples
## Dataset Statistics

This dataset exhibits strong diversity across multiple dimensions:

- **Duration**: 3s–60s+, bimodal (peaks around 5s and 10s), mean **8.37s**; most clips fall in the 3–15s range.
- **Age**: 31–45 (432.5h), 19–30 (277.2h), 46–60 (61.3h), 60+ (10.4h), 0–19 (0.2h).
- **Language (Top 10)**: English (651.4h), Chinese (67.5h), Russian (8.7h), Spanish (7.1h), Portuguese (6.4h), Welsh (5.4h), Hindi (5.3h), German (3.6h), French (3.0h), Korean (2.7h); 15+ languages in total.
- **Gender & ethnicity**: Male (552.8h), Female (229.0h); White (506.7h), Asian (113.1h), Latino/Hispanic (56.5h), Middle Eastern (42.9h), Black (36.4h).

*(Distribution figures: Duration · Age group · Language (Top 10) · Gender & ethnicity)*
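The per-attribute hour totals above can be reproduced from per-clip metadata with a simple aggregation. Below is a minimal sketch assuming a hypothetical list of records with `duration` (in seconds) and attribute fields; the dataset's actual metadata schema may differ:

```python
from collections import defaultdict

def hours_by_attribute(clips, key):
    """Sum clip durations (seconds) per attribute value, reported in hours."""
    totals = defaultdict(float)
    for clip in clips:
        totals[clip[key]] += clip["duration"] / 3600.0
    return dict(totals)

# Hypothetical per-clip metadata records (not the real schema).
clips = [
    {"duration": 9.0, "age_group": "31-45", "language": "English"},
    {"duration": 6.0, "age_group": "19-30", "language": "Chinese"},
    {"duration": 12.0, "age_group": "31-45", "language": "English"},
]

print(hours_by_attribute(clips, "age_group"))
print(hours_by_attribute(clips, "language"))
```

The same function handles every categorical attribute (age, language, gender, ethnicity) by swapping the `key` argument.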
## Comparison with Other Datasets

| Dataset | Speakers | Face Crop | Clips | Hours | Resolution | Language | Age | Ethnicity | Source |
|---------------|----------|-----------|--------|-------|------------------|----------|--------|-----------|--------|
| MEAD | 60 | āœ… | 281.4K | 39 | 384p | English | 20–35 | – | Lab |
| HDTF | 362 | āœ… | 10K | 15.8 | 512p | – | – | – | Wild |
| AVSpeech | 150K | āŒ | 2.5M | 4700 | 720p, 1080p | – | – | – | Wild |
| Hallo3 | – | āœ… | 101.5K | 70 | 720p | – | – | – | Wild |
| OpenHumanVid | – | āŒ | 13.4M | 16.7K | 720p | – | – | – | Wild |
| TalkVid | 7,729 | āŒ | 281.4K | 1244 | 1080p, 2160p | 15 lang. | 0–60+ | 3 | Wild |
| SpeakerVid | 83K | āŒ | 5.2M | 8.7K | 1080p | – | – | – | Wild |
| **Ours** | **60K** | āœ… | **330K** | **782** | **512p** | **15 lang.** | **0–60+** | **3** | **Wild** |

# Data Processing Pipeline

Our data processing pipeline constructs a large-scale, high-quality talking-head dataset through systematic preprocessing, filtering, and annotation, ensuring sample uniqueness, temporal consistency, and reliable multi-modal supervision.

## Data Preprocessing Stage

1. **Data collection**: Aggregates initial content from web videos and various open-source video collections to build a diverse raw data pool.
2. **Deduplication & slicing**: Employs MD5 hash verification to eliminate redundant content and uses PySceneDetect to divide long videos into coherent clips ranging from 3 to 60+ seconds.
3. **Standardization to 25 FPS**: Normalizes all video clips to a uniform frame rate of 25 FPS using FFmpeg to ensure temporal consistency for model training.

## Data Filter & Annotation Stage

4. **Face detection & crop**: Detects face visibility and crops valid sequences to a centered $512 \times 512$ resolution.
5. **Jump-cut detection**: Uses optical-flow analysis to identify and exclude sequences containing scene discontinuities or abrupt transitions.
6. **Faceless filter**: Screens out frames where a detectable face is missing or the head region is improperly framed.
7. **DWPose extraction & hand filter**: Extracts body keypoints and strictly removes clips featuring hand-over-face occlusion to prevent generation artifacts.
8. **Lip-sync filter**: Uses the SyncNet model to compute confidence scores (LSE-C and LSE-D), discarding any samples with poor audio-visual alignment.
9. **Audio features & attribute labeling**: Extracts robust streaming audio features via Wav2Vec and annotates metadata including language, ethnicity, age, and gender.

## šŸ“š Citation

If you find our work useful in your research, please consider citing:

```
@misc{yu2026soulxflashheadoracleguidedgenerationinfinite,
      title={SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads},
      author={Tan Yu and Qian Qiao and Le Shen and Ke Zhou and Jincheng Hu and Dian Sheng and Bo Hu and Haoming Qin and Jun Gao and Changhai Zhou and Shunshun Yin and Siyuan Liu},
      year={2026},
      eprint={2602.07449},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.07449},
}
```

# License

Our VividHead dataset is released under the CC-BY-4.0 license and is intended for research and non-commercial purposes. The video samples are collected from publicly available datasets.
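The MD5-based deduplication step in the pipeline above can be sketched with Python's standard `hashlib`. The file names and byte contents below are hypothetical, and the production pipeline may hash files in chunks rather than in memory:

```python
import hashlib

def md5_digest(data: bytes) -> str:
    """Hex MD5 digest of a video's raw bytes."""
    return hashlib.md5(data).hexdigest()

def deduplicate(videos: dict) -> list:
    """Keep only the first file seen for each content hash."""
    seen = set()
    kept = []
    for path, data in videos.items():
        digest = md5_digest(data)
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept

# Hypothetical raw pool: b.mp4 is a byte-identical duplicate of a.mp4.
pool = {"a.mp4": b"\x00frames", "b.mp4": b"\x00frames", "c.mp4": b"\x01other"}
print(deduplicate(pool))  # → ['a.mp4', 'c.mp4']
```

Note that MD5 catches only exact byte-level duplicates; re-encoded copies of the same video hash differently and would survive this stage.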
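Similarly, the 25 FPS normalization step can be sketched as an FFmpeg invocation. The exact flags used by the authors are not published, so the option choices below (forcing the output rate with `-r` and copying the audio stream untouched) are an assumption:

```python
import subprocess

def normalize_fps_cmd(src: str, dst: str, fps: int = 25) -> list:
    """Build an ffmpeg command that resamples a clip to a fixed frame rate.

    Placing '-r' before the output file forces the target rate on the
    output; '-c:a copy' leaves the audio stream untouched.
    """
    return [
        "ffmpeg", "-y",   # overwrite output without asking
        "-i", src,        # input clip
        "-r", str(fps),   # target output frame rate (25 FPS here)
        "-c:a", "copy",   # copy audio as-is
        dst,
    ]

cmd = normalize_fps_cmd("clip_0001.mp4", "clip_0001_25fps.mp4")
# subprocess.run(cmd, check=True)  # uncomment where ffmpeg is installed
print(" ".join(cmd))
```

The file names are placeholders; in a batch setting the same command would be issued once per sliced clip.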