# SATE

**Repository Path**: ihyc/SATE

## Basic Information

- **Project Name**: SATE
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-09-12
- **Last Updated**: 2021-09-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Fairseq-S2T

An adaptation of the fairseq toolkit for speech-to-text tasks.

Implementation of the paper: [Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders](https://arxiv.org/abs/2105.05752)

## Key Features

### Training

- Complete Kaldi-style recipes
- ASR, MT, and ST pipelines (bin)
- Training configuration read from a YAML file
- CTC multi-task learning
- MT training in the ST style (online tokenizer; may still contain bugs)
- Speed perturbation during pre-processing (requires torchaudio ≥ 0.8.0)

### Model

- Conformer architecture
- Loading pre-trained models for ST
- Relative position encoding
- Stacked acoustic-and-textual encoding

## Installation

Note: we have only tested the following environment.

1. Python == 3.6
2. torch == 1.8, torchaudio == 0.8.0, cuda == 10.2
3. apex

   ```
   pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
   ```

4. nccl

   ```
   make -j src.build CUDA_HOME=
   ```

5. gcc ≥ 4.9 (we use version 5.4)
6. Python libraries

   ```
   pip install pandas sentencepiece configargparse gpustat tensorboard editdistance
   ```

## Code Tree

The shell scripts for each benchmark are in the `egs` folder. We provide the ASR pipeline for LibriSpeech and all three pipelines (ASR, MT, and ST) for MuST-C. Besides, we also provide a template for other benchmarks.
Here is an example for MuST-C:

```
mustc
├── asr
│   ├── binary.sh
│   ├── conf
│   ├── decode.sh
│   ├── local
│   ├── run.sh
│   └── train.sh
├── mt
│   ├── binary.sh
│   ├── conf
│   ├── decode.sh
│   ├── local
│   ├── run.sh
│   └── train.sh
└── st
    ├── binary.sh
    ├── conf
    ├── decode.sh
    ├── ensemble.sh
    ├── local
    ├── run.sh
    └── train.sh
```

- `run.sh`: the core script, covering the whole pipeline
- `train.sh`: calls `run.sh` for training
- `decode.sh`: calls `run.sh` for decoding
- `binary.sh`: generates the binarized datasets on its own
- `conf`: the folder holding the configuration files (`.yaml`)
- `local`: the folder holding utility shell scripts
- `monitor.sh`: checks the GPUs and launches the program automatically
- `parse_options.sh`: parses the parameters for `run.sh`
- `path.sh`: currently unused
- `utils.sh`: shared shell utility functions

## Citations

```bibtex
@inproceedings{xu-etal-2021-stacked,
    title = "Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders",
    author = "Xu, Chen and Hu, Bojie and Li, Yanyang and Zhang, Yuhao and Huang, Shen and Ju, Qi and Xiao, Tong and Zhu, Jingbo",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.204",
    doi = "10.18653/v1/2021.acl-long.204",
    pages = "2619--2630",
}
```
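As a side note on the CTC multi-task learning listed under Key Features: the usual formulation interpolates the main cross-entropy loss with an auxiliary CTC loss computed on the encoder output. The sketch below illustrates only that interpolation; the function name, argument names, and default weight are illustrative and are not taken from this repository's code or configs.

```python
def multitask_loss(st_loss: float, ctc_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate the main ST loss with an auxiliary CTC loss.

    With weight w, the combined objective is
        L = (1 - w) * L_st + w * L_ctc
    so w = 0 disables the CTC term and w = 1 trains on CTC alone.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return (1.0 - ctc_weight) * st_loss + ctc_weight * ctc_loss


# Equal weighting simply averages the two losses.
combined = multitask_loss(2.0, 4.0, ctc_weight=0.5)
```

In the actual recipes, such a weight would be set through the YAML training config rather than hard-coded.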