CoRal is a comprehensive Automatic Speech Recognition (ASR) dataset designed to capture the diversity of the Danish language across various dialects, accents, genders, and age groups. The primary goal of the CoRal dataset is to provide a robust resource for training and evaluating ASR models that can understand and transcribe spoken Danish in all its variations.
Key Features:
Dialect and Accent Diversity: The dataset includes speech samples from all major Danish dialects as well as multiple accents, ensuring broad geographical coverage and the inclusion of regional linguistic features.
Gender Representation: Both male and female speakers are well-represented, offering balanced gender diversity.
Age Range: The dataset includes speakers from a wide range of age groups, providing a comprehensive resource for age-agnostic ASR model development.
High-Quality Audio: All recordings are of high quality, ensuring that the dataset can be used for both training and evaluation of high-performance ASR models.
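To check how balanced a sample of the data is across the dimensions listed above, one option is to tally the speaker metadata. The following is a minimal sketch, not an official loader. The repository ID is taken from the citation URL, while the split name, the configuration name, and the metadata field names (`dialect`, `gender`, `age`) are assumptions that should be checked against the dataset page.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def tally(records: Iterable[Mapping[str, object]], field: str) -> Counter:
    """Count how often each value of `field` occurs across the records."""
    return Counter(str(record.get(field, "unknown")) for record in records)


def stream_coral(split: str = "train", config: Optional[str] = None):
    """Stream CoRal from the Hugging Face Hub (requires `pip install datasets`).

    The split and configuration names are assumptions, not confirmed
    by this document; consult the dataset page for the real values.
    """
    from datasets import load_dataset  # imported lazily, optional dependency

    return load_dataset(
        "alexandrainst/coral", name=config, split=split, streaming=True
    )


# Toy records with hypothetical field names, standing in for real samples.
samples = [
    {"dialect": "dialect_a", "gender": "female", "age": 34},
    {"dialect": "dialect_b", "gender": "male", "age": 61},
    {"dialect": "dialect_a", "gender": "female", "age": 19},
]

gender_counts = tally(samples, "gender")
dialect_counts = tally(samples, "dialect")
```

With real data, the toy list could be replaced by the first few hundred records of the stream, for example via `itertools.islice(stream_coral(), 500)`, assuming the default split and configuration resolve.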
Forbidden Use Cases
The CoRal dataset may not be used for speech synthesis or biometric identification. For more information, see addition 4 in our license (https://huggingface.co/datasets/alexandrainst/coral/blob/main/LICENSE).
A research paper will be submitted soon, but until then, if you use the CoRal dataset in your research or development, please cite it as follows:
@dataset{coral2024,
author = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Torben Blach},
title = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
year = {2024},
url = {https://hf.co/datasets/alexandrainst/coral},
}