MOS-Bench
MOS‑Bench is a collection of datasets for training and evaluating the generalization ability of subjective speech quality assessment (SSQA) models, developed by the Toda Laboratory at Nagoya University. The collection comprises seven training sets and twelve test sets covering a range of sampling rates, languages, and speech types: synthetic speech generated by text‑to‑speech (TTS), voice conversion (VC), and speech enhancement (SE) systems, as well as non‑synthetic speech affected by transmission artifacts, noise, and reverberation. It was created by integrating and processing multiple listening‑test datasets, with the goal of addressing the poor generalization of SSQA models on unseen data. MOS‑Bench is used in speech processing research, especially for work on subjective speech quality evaluation.
Description
🗣️ SHEET / MOS-Bench 🎧
Dataset Overview
- MOS-Bench is a benchmark for evaluating the generalization capability of subjective speech quality assessment (SSQA) models.
- SHEET is a toolkit for conducting research experiments with MOS-Bench.
Key Features
- MOS-Bench is the first large‑scale collection of SSQA training and testing datasets, covering a wide range of domains, including synthetic speech generated by text‑to‑speech (TTS), voice conversion (VC), and singing voice synthesis (SVS) systems, as well as distorted speech with artificial and real noise, clipping, transmission, reverberation, and other degradations.
- The repository focuses on providing training recipes: although many ready‑made speech quality assessment tools exist, such as DNSMOS, SpeechMOS, and speechmetrics, most do not come with training recipes and are therefore of limited use for research.
MOS-Bench Overview
- MOS-Bench currently contains 7 training sets and 12 test sets.
Supported Models and Features
Models
- LDNet
- Original repository: https://github.com/unilight/LDNet
- Paper: arXiv
- Example config:
egs/bvcc/conf/ldnet-ml.yaml
- SSL-MOS
- Original repository: https://github.com/nii-yamagishilab/mos-finetune-ssl/tree/main
- Paper: arXiv
- Example config:
egs/bvcc/conf/ssl-mos-wav2vec2.yaml
- Note: Some modifications were made to the original implementation.
- UTMOS (Strong learner)
- Original repository: https://github.com/sarulab-speech/UTMOS22/tree/master/strong
- Paper: arXiv
- Example config:
egs/bvcc/conf/utmos-strong.yaml
- Note: Not all components of UTMOS strong are implemented.
- Modified AlignNet
- Original repository: https://github.com/NTIA/alignnet
- Paper: arXiv
- Example config:
egs/bvcc+nisqa+pstn+singmos+somos+tencent+tmhint-qi/conf/alignnet-wav2vec2.yaml
Features
- Modeling
- Listener modeling
- SSL‑based encoders supported via S3PRL
- Training
- Automatic saving of the best model and early stopping
- Visualization, including prediction score distribution and scatter plots of utterance‑ and system‑level scores
- Model averaging
- Model ensembling via stacking
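The utterance‑ and system‑level scoring mentioned above follows the usual SSQA convention: system‑level scores are the mean of utterance‑level scores for each system, and rank correlation (e.g., Spearman) is then computed across systems. The sketch below illustrates that convention in plain Python; the function names and toy data are illustrative, not part of SHEET's API.

```python
def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation
    of ranks (no tie handling -- assumes distinct values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def system_level(scores):
    """Aggregate utterance-level scores per system: {sys: [scores]} -> {sys: mean}."""
    return {s: sum(v) / len(v) for s, v in scores.items()}

# Toy example: ground-truth MOS vs. model predictions for three systems.
truth = {"sysA": [4.0, 4.5, 4.2], "sysB": [2.0, 2.5, 2.2], "sysC": [3.0, 3.2]}
pred  = {"sysA": [3.8, 4.4, 4.0], "sysB": [2.1, 2.6, 2.4], "sysC": [3.2, 3.4]}

t_sys, p_sys = system_level(truth), system_level(pred)
systems = sorted(t_sys)
srcc = spearman([t_sys[s] for s in systems], [p_sys[s] for s in systems])
print(srcc)  # -> 1.0 (perfect rank agreement on this toy data)
```

System‑level correlation is usually the headline metric in SSQA papers, since per‑utterance ratings are noisy but system rankings are what TTS/VC/SE developers act on.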
Usage Guide
- New users: Provides complete experiment recipes, including scripts for downloading and processing datasets, and for training and evaluating models.
- Existing model users: Provides convenient scripts for collecting test sets.
- Using pre‑trained models: Offers functionality to load pre‑trained SSQA models and predict scores via torch.hub.
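Loading through torch.hub typically looks like the sketch below. The repository tag, entry‑point name, and predictor method are assumptions for illustration, not verified values; check the SHEET README for the exact ones.

```python
def load_mos_predictor(repo="unilight/sheet:v0.1.0", entry="default"):
    """Load a pre-trained SSQA predictor through torch.hub.

    The repo tag and entry-point name here are illustrative assumptions;
    consult the SHEET documentation for the published values.
    """
    import torch  # imported lazily so the sketch itself has no hard torch dependency
    return torch.hub.load(repo, entry, trust_repo=True)

# Usage (downloads model weights on first call):
# predictor = load_mos_predictor()
# score = predictor.predict(wav_path="/path/to/speech.wav")  # hypothetical method name
```

The first call caches the hub repo and checkpoint locally, so subsequent loads are offline.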
Installation
- Perform an editable installation with git clone and make, which automatically builds a virtual environment.
Information
- Citation: If you use the training scripts, benchmark scripts, or pre‑trained models from this project, please cite the associated papers.
- Acknowledgements: This project was inspired by repositories such as ESPNet and ParallelWaveGAN.
- Authors: Wen‑Chin Huang, Toda Laboratory, Nagoya University.
Source
Organization: arXiv
Created: 11/6/2024