# SentiAvatar
**Repository Path**: suppermanljr/SentiAvatar
## Basic Information
- **Project Name**: SentiAvatar
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-17
- **Last Updated**: 2026-04-17
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# SentiAvatar: Towards Expressive and Interactive Digital Humans
Chuhao Jin1,2,*
Rui Zhang2,*
Qingzhe Gao2
Haoyu Shi3
Dayu Wu2
Yichen Jiang2
Yihan Wu1
Ruihua Song1,†
1 Gaoling School of Artificial Intelligence, Renmin University of China
2 SentiPulse
3 College of Computer Science, Inner Mongolia University
* Equal contribution. Chuhao Jin led this project. † Corresponding author.
Paper | Project Page | Dataset | Demo Video
---
## Highlights
- **SuSuInterActs Dataset**: 21K clips, 37 hours of synchronized speech + full-body motion + facial expressions captured via optical motion capture
- **Plan-then-Infill Architecture**: Decouples sentence-level semantic planning from frame-level prosody-driven interpolation
- **State-of-the-Art**: R@1 of 43.64% on SuSuInterActs, a 26.3% relative gain over the strongest baseline (MoMask, 34.55%); FGD 4.941 and BC 8.078 on BEATv2
- **Real-time**: Generates 6 seconds of motion in 0.3 seconds with unlimited multi-turn streaming
Figure 1: SentiAvatar generates high-quality 3D human motion and expression, which are semantically aligned and frame-level synchronized. The same color indicates the same time step.
## Abstract
We present **SentiAvatar**, a framework for building expressive interactive 3D digital humans, and use it to create **SuSu**, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization.
To solve these problems, first, we build **SuSuInterActs** (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a **Motion Foundation Model** on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware **plan-then-infill** architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech.
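Chinese Abstract (translated)
We propose **SentiAvatar**, a framework for building expressive, interactive 3D digital humans. The system uses a three-stage pipeline: (1) an LLM Motion Planner predicts sparse keyframes from action tags and audio; (2) a Mask Transformer performs sliding-window frame interpolation conditioned on audio features; (3) an RVQVAE Decoder decodes the discrete tokens into a continuous motion sequence. A facial animation generation model based on Face VQVAE + HuBERT is also integrated.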
## Dataset: SuSuInterActs
We open-source the **SuSuInterActs** dataset with the following content:
| Type | Directory | Format | Description |
|------|-----------|--------|-------------|
| Face | `SuSuInterActs/arkit_data/` | `.npy` | ARKit facial BlendShape values (51 dims) |
| Audio | `SuSuInterActs/wav_data/` | `.wav` | 16 kHz mono speech audio |
| Motion | `SuSuInterActs/motion_data/` | `.npy` | 63-joint 6D rotation + root displacement |
| Text | `SuSuInterActs/text_data/` | `.json` | Action/expression tags + dialogue text |
| Splits | `SuSuInterActs/split/` | `.txt` | Train (19K) / Val (635) / Test (1479) |
### Motion Data Format
Each `.npy` file is a dictionary:
```python
{
    "body": np.ndarray,   # (T, 153) = root_offset(3) + body_6d(25×6)
    "left": np.ndarray,   # (T, 120) = left_hand_6d(20×6)
    "right": np.ndarray,  # (T, 120) = right_hand_6d(20×6)
}
```
- Frame rate: **20 FPS**
- Joints: **63** (25 body + 20 left hand + 20 right hand)
- Rotation: 6D rotation representation
- Root displacement: velocity form (differential encoding)
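A minimal sketch of reading one motion file under the layout above. It assumes the file was written with `np.save` on a Python dict (hence `allow_pickle=True` and `.item()`) and that the root offset occupies the first three columns of `body`, as the shape comment suggests. `load_motion`, `root_trajectory`, and `duration_seconds` are illustrative names, not functions from this repository, and the demo clip is synthetic.

```python
import os
import tempfile

import numpy as np

FPS = 20  # frame rate stated above


def load_motion(path):
    """Load a motion dict and check it against the documented shapes."""
    # Assumption: the dict was pickled by np.save, so allow_pickle is required.
    data = np.load(path, allow_pickle=True).item()
    body, left, right = data["body"], data["left"], data["right"]
    assert body.shape[1] == 3 + 25 * 6   # root offset + 25 body rotations
    assert left.shape[1] == 20 * 6       # 20 left-hand rotations
    assert right.shape[1] == 20 * 6      # 20 right-hand rotations
    assert len(body) == len(left) == len(right)
    return data


def root_trajectory(body):
    """Integrate per-frame root offsets (velocity form) into positions."""
    # Assumption: columns 0:3 hold the per-frame displacement.
    return np.cumsum(body[:, :3], axis=0)


def duration_seconds(data):
    """Clip length in seconds at the documented frame rate."""
    return len(data["body"]) / FPS


# Demo on a synthetic 40-frame clip with a constant root step along x.
T = 40
fake = {
    "body": np.zeros((T, 153), dtype=np.float32),
    "left": np.zeros((T, 120), dtype=np.float32),
    "right": np.zeros((T, 120), dtype=np.float32),
}
fake["body"][:, 0] = 0.01

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.npy")
    np.save(path, fake)
    motion = load_motion(path)

trajectory = root_trajectory(motion["body"])
print(duration_seconds(motion), trajectory.shape)
```

Taken together, the three arrays hold 3 + 65×6 = 393 values per frame. Whether the root offsets need de-normalization before integration is not stated here, so treat the trajectory as unscaled.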
### Text Format
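`text_data/motion2text.json` maps each sample path to a string of expression and action tags followed by the dialogue text:
```json
{
  "path/to/sample_name": "【表情：认真聆听】【动作：缓慢点头】嗯嗯，这样啊...",
  ...
}
```
In English, the example value reads: [Expression: listening attentively] [Action: slow nod] "Mm-hmm, I see...".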
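A small sketch of splitting such an annotation string into its tags and dialogue. It assumes the tag syntax of the example entry (full-width brackets `【…】`, a full-width colon, and the keys `表情` for expression and `动作` for action); `parse_annotation` and the English key names are illustrative, not part of the dataset tooling.

```python
import json
import re

# Assumption: tags look like 【表情：...】 or 【动作：...】, as in the example entry.
TAG_RE = re.compile(r"【(表情|动作)：([^】]*)】")
KEY_EN = {"表情": "expression", "动作": "action"}


def parse_annotation(text):
    """Return (tags, dialogue) for one motion2text value."""
    # If a key repeats, the last occurrence wins in this simple version.
    tags = {KEY_EN[key]: value for key, value in TAG_RE.findall(text)}
    dialogue = TAG_RE.sub("", text).strip()
    return tags, dialogue


raw = '{"path/to/sample_name": "【表情：认真聆听】【动作：缓慢点头】嗯嗯，这样啊..."}'
for name, text in json.loads(raw).items():
    tags, dialogue = parse_annotation(text)
    print(name, tags, dialogue)
```

Entries that carry only one of the two tags still parse; the missing key is simply absent from the returned dict.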
## Method Overview
Overview of SentiAvatar. (a) Multi-modal inputs are quantized into tokens via encoders. (b) LLM planner predicts sparse keyframe tokens for high-level dialogue content. (c) Audio-aware Infill Transformer performs dense, prosody-driven interpolation for fine-grained temporal synchronization.
Our pipeline consists of three stages:
1. **Motion VQ-VAE (RVQVAE)**: Encodes continuous motion into discrete tokens with a 4-layer residual codebook (512 codes each)
2. **LLM Motion Planner**: A fine-tuned Qwen2-0.5B predicts sparse keyframe motion tokens (every 4th frame) conditioned on action tags + audio tokens
3. **Audio-Aware Infill Transformer**: A masked transformer fills in the remaining 3 frames between each pair of keyframes using HuBERT audio features, achieving prosody-aligned dense motion
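The split between stages 2 and 3 can be pictured as a mask over the frame grid: the planner fills every 4th position and the infill transformer fills the rest. The sketch below only illustrates that indexing. It assumes one token position per frame and shows a single codebook layer; `MASK_ID` and the function names are hypothetical, and it does not reproduce the repository's actual token layout.

```python
import numpy as np

KEYFRAME_STRIDE = 4  # the planner predicts every 4th frame (stage 2)
MASK_ID = -1         # hypothetical placeholder for "to be infilled" (stage 3)


def keyframe_mask(num_frames, stride=KEYFRAME_STRIDE):
    """Boolean mask that is True where the planner supplies a token."""
    mask = np.zeros(num_frames, dtype=bool)
    mask[::stride] = True
    return mask


def scatter_keyframes(keyframe_tokens, num_frames, stride=KEYFRAME_STRIDE):
    """Place sparse keyframe tokens on the dense grid and mask the rest."""
    dense = np.full(num_frames, MASK_ID, dtype=np.int64)
    dense[::stride] = keyframe_tokens
    return dense


num_frames = 12
mask = keyframe_mask(num_frames)
dense = scatter_keyframes(np.array([7, 42, 300]), num_frames)
print(dense)  # keyframes at indices 0, 4, 8; every other slot is MASK_ID
```

With a stride of 4, three quarters of the positions are left for the infill stage. How the real pipeline treats frames after the last keyframe is not specified in this section.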
## Installation
```bash
# Clone the repository
git clone https://github.com/SentiAvatar/SentiAvatar.git
cd SentiAvatar
# Create environment
conda create -n sentiavatar python=3.10 -y
conda activate sentiavatar
# Install dependencies
pip install -r requirements.txt
```
## Model Checkpoints
Download all model weights from HuggingFace:
**[https://huggingface.co/Chuhaojin/SentiAvatar](https://huggingface.co/Chuhaojin/SentiAvatar)**
```bash
# Option 1: Using git lfs
git lfs install
git clone https://huggingface.co/Chuhaojin/SentiAvatar checkpoints/
# Option 2: Using huggingface-cli
pip install huggingface_hub
huggingface-cli download Chuhaojin/SentiAvatar --local-dir checkpoints/
```
Place the downloaded files into the `checkpoints/` directory. The expected structure:
| Model | Description | Size |
|-------|-------------|------|
| `checkpoints/llm/` | Qwen2-0.5B SFT (Motion Token Planner) | 1.1 GB |
| `checkpoints/mask_transformer/` | Audio-Motion Mask Transformer | 276 MB |
| `checkpoints/rvqvae/` | Residual VQ-VAE (body motion codec) | 754 MB |
| `checkpoints/face_vqvae/` | Face VQ-VAE + weight matrices | 50 MB |
| `checkpoints/chinese-hubert-base/` | Chinese HuBERT audio encoder | 361 MB |
| `checkpoints/hubert_kmeans/` | HuBERT K-means quantizer (layer 9 → tokens) | 1.5 MB |
| `checkpoints/eval_model/` | ChronAccRet evaluation model | 434 MB |
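Before starting inference it can help to confirm that the download produced every directory in the table. The helper below is a sketch that checks only for directory names taken from that table; `missing_checkpoints` is an illustrative name, and the files inside each directory are not verified.

```python
import tempfile
from pathlib import Path

# Directory names copied from the checkpoint table above.
EXPECTED = [
    "llm",
    "mask_transformer",
    "rvqvae",
    "face_vqvae",
    "chinese-hubert-base",
    "hubert_kmeans",
    "eval_model",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected sub-directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]


# Demo against a temporary directory that only contains two of the models.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("llm", "rvqvae"):
        (Path(tmp) / name).mkdir()
    demo_missing = missing_checkpoints(tmp)

print(demo_missing)
```

In a real checkout, call `missing_checkpoints()` from the repository root and expect an empty list.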
## Inference
### Data Preprocessing (Required for Batch Mode)
Before running batch inference, you need to preprocess the raw dataset to generate intermediate data:
```bash
# Preprocess all data (audio features + audio tokens + motion tokens)
python scripts/preprocess_data.py --all --device cuda:0
# Or separately:
python scripts/preprocess_data.py --audio # HuBERT features + K-means tokens
python scripts/preprocess_data.py --motion # RVQVAE motion tokens
```
This generates three directories under `data/`:
- `audio_features_hubert_layer9_fps10/`: HuBERT layer-9 features at 10 fps
- `audio_tokens_hubert_layer9_fps10/`: K-means quantized audio tokens at 10 fps
- `motion_token_data/`: RVQVAE encoded motion tokens (for GT comparison)
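The rates quoted in this README (16 kHz audio, 10 fps audio features, 20 FPS motion, keyframes every 4th frame) imply simple index relationships between the streams. The sketch below works through that arithmetic. It assumes plain floor rounding, which may differ from what `scripts/preprocess_data.py` actually does, and all function names are illustrative.

```python
SAMPLE_RATE = 16_000   # audio samples per second
AUDIO_TOKEN_FPS = 10   # HuBERT features / K-means tokens per second
MOTION_FPS = 20        # motion frames per second
KEYFRAME_STRIDE = 4    # planner keyframe spacing, in motion frames


def expected_lengths(num_samples):
    """Sequence lengths implied by a waveform of num_samples samples."""
    frames = num_samples * MOTION_FPS // SAMPLE_RATE
    return {
        "motion_frames": frames,
        "audio_tokens": num_samples * AUDIO_TOKEN_FPS // SAMPLE_RATE,
        "keyframes": -(-frames // KEYFRAME_STRIDE),  # ceiling division
    }


def motion_frame_to_audio_token(frame):
    """Index of the audio token that covers a given motion frame."""
    return frame * AUDIO_TOKEN_FPS // MOTION_FPS


def motion_frame_to_sample(frame):
    """First audio sample belonging to a given motion frame."""
    return frame * SAMPLE_RATE // MOTION_FPS


six_seconds = expected_lengths(6 * SAMPLE_RATE)
print(six_seconds)
```

Under these assumptions, a 6-second clip corresponds to 120 motion frames, 60 audio tokens, and 30 keyframes, and each motion frame spans 800 audio samples.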
### Mode 1: Test Set Evaluation (Batch Mode)
Run inference on the entire test set and generate BVH/JSON outputs:
```bash
# Step 1: Preprocess data (if not done)
python scripts/preprocess_data.py --all
# Step 2: Start vLLM service (background)
bash scripts/start_vllm_server.sh checkpoints/llm 8095 0
# Step 3: Run batch inference
bash scripts/run_test.sh 8095 0
```
Output: `output/reconstructed/` (BVH + JSON + WAV per sample)
### Mode 2: Single Case Inference
Generate motion from your own audio + action tag:
```bash
# Make sure vLLM service is running
bash scripts/start_vllm_server.sh checkpoints/llm 8095 0
# Quick demo (uses built-in example audio, no extra data needed)
bash scripts/run_single_infer.sh
# Custom inference with your own audio
# (the action tag below means "Action: nod and smile")
bash scripts/run_single_infer.sh \
  --audio_path /path/to/your/audio.wav \
  --action_text "动作：点头微笑" \
  --output_dir ./output_single
```
Or use the Python script directly:
```bash
cd motion_generation
# The action tag below means "Action: wave hello"
python single_case_infer.py \
  --audio_path /path/to/audio.wav \
  --action_text "动作：挥手打招呼" \
  --output_dir ./output_single \
  --vllm_port 8095
```
**Output files:**
- `.bvh`: BVH motion file (viewable in Blender)
- `.json`: Animation data (UE engine format)
- `.wav`: Corresponding audio file
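A quick sanity check on a generated `.bvh` is to read its frame count and frame time and compare the duration with the audio. The sketch below parses the standard BVH `MOTION` header lines (`Frames:` and `Frame Time:`). It assumes the exported files follow that standard layout; `bvh_timing` is an illustrative helper and the demo file is a hand-written stub, not pipeline output.

```python
import os
import tempfile


def bvh_timing(path):
    """Return (num_frames, frame_time_seconds) from a BVH file header."""
    frames = frame_time = None
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith("Frames:"):
                frames = int(line.split(":", 1)[1])
            elif line.startswith("Frame Time:"):
                frame_time = float(line.split(":", 1)[1])
                break  # motion rows follow; nothing more to read
    if frames is None or frame_time is None:
        raise ValueError(f"no MOTION header found in {path}")
    return frames, frame_time


# Hand-written stub with a single joint and two frames at 20 FPS.
STUB = """HIERARCHY
ROOT Hips
{
  OFFSET 0.0 0.0 0.0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  End Site
  {
    OFFSET 0.0 1.0 0.0
  }
}
MOTION
Frames: 2
Frame Time: 0.05
0 0 0 0 0 0
0 0 0 0 0 0
"""

with tempfile.TemporaryDirectory() as tmp:
    stub_path = os.path.join(tmp, "stub.bvh")
    with open(stub_path, "w", encoding="utf-8") as handle:
        handle.write(STUB)
    num_frames, frame_time = bvh_timing(stub_path)

print(num_frames, frame_time, num_frames * frame_time)
```

For 20 FPS motion the frame time would be 0.05 s, so `num_frames * frame_time` should roughly match the length of the accompanying `.wav`.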
## Experimental Results
### Quantitative Comparison on SuSuInterActs
**Bold**: best; ↑/↓: higher/lower is better. ESD in seconds. "†" indicates token-by-token autoregressive generation.
| Method | Condition | R@1 ↑ | R@2 ↑ | R@3 ↑ | FID ↓ | ESD ↓ | Diversity ↑ |
|--------|-----------|-------|-------|-------|-------|-------|-------------|
| Real Motion | - | 62.20 | 73.56 | 78.70 | 0.000 | 0.308 | 22.61 |
| *Audio-only methods* | | | | | | | |
| EMAGE | Audio | 5.00 | 9.40 | 13.32 | 441.6 | 0.606 | 12.92 |
| A2M-GPT† | Audio | 8.72 | 15.96 | 20.08 | 13.66 | 0.477 | 22.23 |
| *Text-only methods* | | | | | | | |
| HunYuan-Motion | Text | 5.21 | 8.59 | 11.9 | 352.56 | 0.708 | 16.92 |
| T2M-GPT | Text | 23.12 | 30.49 | 35.43 | 67.78 | 0.721 | 20.65 |
| MoMask | Text | 34.55 | 46.58 | 54.29 | 36.25 | 0.471 | 22.03 |
| *Audio + Text methods* | | | | | | | |
| AT2M-GPT† | Audio, Text | 27.52 | 36.11 | 41.38 | 18.491 | 0.503 | 22.36 |
| **SentiAvatar (Ours)** | **Audio, Text** | **43.64** | **54.94** | **61.84** | **8.912** | **0.456** | **22.41** |
| *Improvement (%)* | | *+26.3* | *+17.9* | *+13.9* | *+34.8* | *+3.2* | *+0.2* |
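The improvement row appears to be the relative change against the best baseline in each column (MoMask for R@K and ESD, A2M-GPT for FID, AT2M-GPT for Diversity). That reading is inferred from the numbers rather than stated by the authors; the sketch below recomputes the row under it, with values copied from the table.

```python
# metric -> (SentiAvatar, best baseline in that column, higher_is_better)
ROWS = {
    "R@1": (43.64, 34.55, True),
    "R@2": (54.94, 46.58, True),
    "R@3": (61.84, 54.29, True),
    "FID": (8.912, 13.66, False),
    "ESD": (0.456, 0.471, False),
    "Diversity": (22.41, 22.36, True),
}


def relative_improvement(ours, baseline, higher_is_better):
    """Percentage gain over the baseline, signed so that positive is better."""
    delta = ours - baseline if higher_is_better else baseline - ours
    return 100.0 * delta / baseline


improvements = {name: round(relative_improvement(*row), 1) for name, row in ROWS.items()}
for name, value in improvements.items():
    print(f"{name}: +{value}%")
```

The recomputed values match the table: 26.3, 17.9, 13.9, 34.8, 3.2, and 0.2 percent.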
### Qualitative Comparison
Qualitative comparison of generated motions across methods. Texts and arrows of the same color indicate the same time step. Red arrows indicate incorrect actions.
## Evaluation
Evaluate generated motion quality using our ChronAccRet evaluation model:
```bash
bash scripts/run_eval.sh ./output/reconstructed 0
```
**Metrics:**
| Metric | Description | Better |
|--------|-------------|--------|
| **R@K** | Text-motion retrieval recall @K | Higher ↑ |
| **FID** | Fréchet Inception Distance | Lower ↓ |
| **Diversity** | Generation diversity in latent space | Higher ↑ |
| **ESD** | Event Sync Distance (seconds) | Lower ↓ |
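For readers unfamiliar with retrieval recall, the sketch below shows the generic R@K computation from a text-to-motion similarity matrix in which row `i` and column `i` form the matching pair. It is not the ChronAccRet protocol: the candidate pool size, the embedding model, and any batching used by `scripts/run_eval.sh` are not described here, and `recall_at_k` is an illustrative name.

```python
import numpy as np


def recall_at_k(similarity, ks=(1, 2, 3)):
    """R@K in percent; similarity[i, j] scores text i against motion j."""
    n = similarity.shape[0]
    # Sort candidates for each text from most to least similar.
    order = np.argsort(-similarity, axis=1)
    # Rank (0-based) at which the matching motion appears for each text.
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1)
    return {k: float(np.mean(ranks < k) * 100.0) for k in ks}


# Toy example: the matching motion ranks 1st, 2nd, and 3rd for the three texts.
toy = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.1],
])
scores = recall_at_k(toy)
print(scores)
```

On the toy matrix one of three texts retrieves its motion at rank 1, two within rank 2, and all three within rank 3.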
## Motion Visualization
Convert `.npy` motion data to BVH files for viewing in Blender or other 3D software:
```bash
# Single file conversion
python tools/visualize_motion.py \
--input data/motion_data/path/to/sample.npy \
--output output_vis/sample.bvh
# Batch conversion (max 10 files)
python tools/visualize_motion.py \
--input_dir data/motion_data \
--output_dir output_bvh \
--max_files 10
# Output both BVH and JSON
python tools/visualize_motion.py \
--input data/motion_data/sample.npy \
--output sample.bvh \
--save_json
```
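BVH stores joint rotations as Euler angles, so any export has to turn each 6D rotation into a rotation matrix first. The sketch below shows the commonly used Gram-Schmidt construction for the 6D representation. It is not taken from `tools/visualize_motion.py`, and it assumes the two 3-vectors are the first two matrix columns; some codebases treat them as rows instead, which yields the transpose.

```python
import numpy as np


def rot6d_to_matrix(rot6d):
    """Convert (..., 6) 6D rotations to (..., 3, 3) rotation matrices."""
    a1, a2 = rot6d[..., :3], rot6d[..., 3:]
    # First basis vector: normalize a1.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: remove the b1 component from a2, then normalize.
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    # Third basis vector completes a right-handed frame.
    b3 = np.cross(b1, b2)
    # Assumption: b1, b2, b3 are stacked as columns.
    return np.stack([b1, b2, b3], axis=-1)


# Shaped like the body rotations of a 5-frame clip: (T, 25 joints, 6).
rng = np.random.default_rng(0)
body_6d = rng.normal(size=(5, 25, 6))
matrices = rot6d_to_matrix(body_6d)
identity = rot6d_to_matrix(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
print(matrices.shape)
```

For a loaded clip, the body rotations would be `data["body"][:, 3:].reshape(T, 25, 6)` under the layout described in the Motion Data Format section.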
## Project Structure
```
SentiAvatar/
├── motion_generation/              # Motion generation module
│   ├── pipeline_infer.py           # LLM + Mask Transformer pipeline
│   ├── single_case_infer.py        # Single-case inference script
│   ├── reconstruct_from_tokens.py  # Token → BVH/JSON decoder
│   ├── vllm_server.py              # vLLM server for LLM inference
│   ├── models/                     # Model definitions (RVQVAE, Mask Transformer)
│   ├── actions/                    # Post-processing (BVH/JSON conversion)
│   ├── utils/                      # Utilities and rotation tools
│   └── meta/                       # Skeleton templates, normalization params
├── evaluation/                     # Evaluation module (ChronAccRet)
├── tools/                          # Visualization tools
├── scripts/                        # Shell scripts
├── data/                           # Dataset (SuSuInterActs)
└── checkpoints/                    # Model weights
```
## Citation
If you find this work useful, please cite our paper:
```bibtex
@article{jin2026sentiavatar,
title={SentiAvatar: Towards Expressive and Interactive Digital Humans},
author={Jin, Chuhao and Zhang, Rui and Gao, Qingzhe and Shi, Haoyu and Wu, Dayu and Jiang, Yichen and Wu, Yihan and Song, Ruihua},
journal={arXiv preprint arXiv:2604.02908},
year={2026}
}
```
## Star History
[Star History Chart](https://star-history.com/#SentiAvatar/SentiAvatar&Date)
##### If you like this project, please give it a star! It would be a great encouragement for us and help more people discover this work.
## Acknowledgments
The authors would like to sincerely thank all collaborators for their valuable contributions to this work. In particular, special thanks to Shi Xueliang and Pan Xuanyue for leading the art design and data production efforts. The project also benefited greatly from the contributions of team members: Shi Xueliang, Yu Yongchang, Li Xing, and Liu Xueying in art design; Pan Xuanyue, Li Huixian, Yang Yijia, Zhang Wenxuan, and Wang Wei (UE) in data production. Their dedicated work and collaboration were essential to the successful completion of this research.
We also thank the following open-source projects:
- [vLLM](https://github.com/vllm-project/vllm): High-throughput LLM inference engine
- [HuggingFace Transformers](https://github.com/huggingface/transformers): Pre-trained model framework
- [Chinese-HuBERT](https://huggingface.co/TencentGameMate/chinese-hubert-base): Chinese speech encoder
- [Qwen2](https://github.com/QwenLM/Qwen2): Base language model
## License
This project is licensed under [SentiPulse Non-Commercial Source License v1.0](LICENSE).
**You are free to**: share, adapt, and build upon this work for non-commercial purposes.
**You may NOT**: use this project, its models, or data for any commercial purpose.
For commercial licensing, please contact the authors.
---
Made with ❤️ by the SentiPulse Team