
Codecfake Dataset

Codecfake dataset and countermeasures for universal detection of deep‑fake audio. Because of Zenodo repository size limits, the dataset is split into multiple subsets, including training, development, and test sets.

Source

github

Created

May 8, 2024

Updated

May 16, 2024

Dataset Overview

Dataset Name

Codecfake Dataset

Dataset Description

The Codecfake dataset supports universal detection of deep‑fake audio. It is distributed as several subsets: a training set, a development set, and test sets.

Dataset Subsets

Subset Name	Description	Link
Training set (part 1 of 3) & Labels	train_split.zip & train_split.z01 - train_split.z06	Link
Training set (part 2 of 3)	train_split.z07 - train_split.z14	Link
Training set (part 3 of 3)	train_split.z15 - train_split.z19	Link
Development set	dev_split.zip & dev_split.z01 - dev_split.z02	Link
Test set (part 1 of 2)	Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip	Link
Test set (part 2 of 2)	Codec unseen test: C7.zip	Link
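
Because the training and development archives are split across several download records, it can help to confirm that every part is present before extracting. The following is a minimal sketch, not part of the dataset's own tooling: the function names and the download directory are hypothetical, and the part counts (z01 to z19 for training, z01 to z02 for development) are taken from the table above. Python's standard zipfile module does not handle multi-disk ZIP files, so the actual extraction would still need an external tool that supports split archives.

```python
from pathlib import Path


def split_zip_parts(stem, last_part):
    """Return the file names of a split zip archive.

    A split archive consists of numbered parts (stem.z01 .. stem.zNN)
    followed by the final stem.zip segment.
    """
    numbered = [f"{stem}.z{i:02d}" for i in range(1, last_part + 1)]
    return numbered + [f"{stem}.zip"]


def missing_parts(download_dir, stem, last_part):
    """List the expected archive parts that are not in download_dir."""
    root = Path(download_dir)
    return [
        name
        for name in split_zip_parts(stem, last_part)
        if not (root / name).is_file()
    ]


if __name__ == "__main__":
    # "./downloads" is a placeholder for wherever the parts were saved.
    for stem, last in (("train_split", 19), ("dev_split", 2)):
        gaps = missing_parts("./downloads", stem, last)
        status = "complete" if not gaps else f"missing {gaps}"
        print(f"{stem}: {status}")
```

All parts of one archive need to sit in the same directory before extraction, which is why the check looks in a single folder even though the training parts come from three separate records.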

Dataset License

CC BY‑NC‑ND 4.0

Dataset Structure

The dataset should be organized as follows:

├── Codecfake
│   ├── label
│   │   └── *.txt
│   ├── train
│   │   └── *.wav (740,747 samples)
│   ├── dev
│   │   └── *.wav (92,596 samples)
│   ├── test
│   │   └── C1
│   │   │   └── *.wav (26,456 samples)
│   │   └── C2
│   │   │   └── *.wav (26,456 samples)
│   │   └── C3
│   │   │   └── *.wav (26,456 samples)
│   │   └── C4
│   │   │   └── *.wav (26,456 samples)
│   │   └── C5
│   │   │   └── *.wav (26,456 samples)
│   │   └── C6
│   │   │   └── *.wav (26,456 samples)
│   │   └── C7
│   │   │   └── *.wav (145,505 samples)
│   │   └── A1
│   │   │   └── *.wav (8,902 samples)
│   │   └── A2
│   │   │   └── *.wav (8,902 samples)
│   │   └── A3
│   │   │   └── *.wav (99,112 samples)
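
After extraction, the layout above can be checked against the stated sample counts. This is a rough sketch under a few assumptions: the .wav files sit directly inside each listed folder, the root folder is named Codecfake, and the helper names (EXPECTED_WAV_COUNTS, count_wavs, check_layout) are illustrative rather than anything shipped with the dataset.

```python
from pathlib import Path

# Sample counts copied from the directory listing above.
EXPECTED_WAV_COUNTS = {
    "train": 740_747,
    "dev": 92_596,
    "test/C1": 26_456,
    "test/C2": 26_456,
    "test/C3": 26_456,
    "test/C4": 26_456,
    "test/C5": 26_456,
    "test/C6": 26_456,
    "test/C7": 145_505,
    "test/A1": 8_902,
    "test/A2": 8_902,
    "test/A3": 99_112,
}


def count_wavs(folder):
    """Count .wav files directly inside folder (0 if it does not exist)."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(1 for _ in folder.glob("*.wav"))


def check_layout(root):
    """Return {subfolder: (found, expected)} for every count that differs."""
    root = Path(root)
    mismatches = {}
    for rel, expected in EXPECTED_WAV_COUNTS.items():
        found = count_wavs(root / rel)
        if found != expected:
            mismatches[rel] = (found, expected)
    return mismatches


if __name__ == "__main__":
    # "./Codecfake" is a placeholder path for the extracted dataset root.
    problems = check_layout("./Codecfake")
    if not problems:
        print("All subsets match the expected sample counts.")
    for rel, (found, expected) in problems.items():
        print(f"{rel}: found {found} wav files, expected {expected}")
```

A mismatch usually points to an incomplete download or an archive extracted into the wrong folder, so it is worth running a check like this before starting training.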

Usage Recommendations

If you wish to jointly train with the ASVspoof2019 dataset, first download the corresponding training, development, and evaluation sets from the ASVspoof2019 LA Database.

Pre‑trained Models

Several pre‑trained models are provided, including Vocoder‑trained ADD, Codec‑trained ADD, and Co‑trained ADD models, stored in the ./pretrained_model directory.

Citation

When using this dataset, please cite:

@article{xie2024codecfake,
  title={The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio},
  author={Xie, Yuankun and Lu, Yi and Fu, Ruibo and Wen, Zhengqi and Wang, Zhiyong and Tao, Jianhua and Qi, Xin and Wang, Xiaopeng and Liu, Yukun and Cheng, Haonan and others},
  journal={arXiv preprint arXiv:2405.04880},
  year={2024}
}
