Dataset

Stortinget Speech Corpus version 1.0

The Stortinget Speech Corpus (SSC) is a speech dataset of more than 5,000 hours for weakly supervised ASR, created from audio and aligned proceedings text from Stortinget, the Norwegian Parliament. It contains speech segments of up to 30 seconds, with transcriptions in Norwegian Bokmål (nob) and Norwegian Nynorsk (nno) taken from the official proceedings.

The dataset is distributed as a JSONL file. Audio files, proceedings files and transcription files (with ASR output) are included in this repository, and the JSONL file refers to them by relative file path. Note that only the segmented audio files are part of the release.
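As a minimal sketch of how a JSONL file with relative paths can be read, the Python snippet below yields one record per line and resolves a relative path against the repository root. The file name `ssc.jsonl` and the field name `audio_path` in the usage comment are hypothetical placeholders, not names confirmed by the release; the documentation PDF listed under Additional information is the place to check the actual schema.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_segments(jsonl_path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def resolve_path(repo_root: Path, relative_path: str) -> Path:
    """Join a relative path stored in a record onto the repository root."""
    return repo_root / relative_path


# Usage (hypothetical file and field names, adjust to the real schema):
# root = Path("path/to/repository")
# for segment in iter_segments(root / "ssc.jsonl"):
#     audio_file = resolve_path(root, segment["audio_path"])
```

Reading line by line keeps memory use flat, which matters for a file with several hundred thousand records.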

Dataset statistics
- Number of segments: 724 783
- Total duration in hours: 5 190
- Number of unique speakers: 729
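The three figures above also give some rough averages. The short Python calculation below derives the mean segment length and the mean amount of speech per speaker. It uses only the published totals, so these are plain averages and say nothing about how the values are distributed.

```python
# Totals taken from the dataset statistics above.
NUM_SEGMENTS = 724_783
TOTAL_HOURS = 5_190
NUM_SPEAKERS = 729

# Mean segment length in seconds (total seconds divided by segment count).
mean_segment_seconds = TOTAL_HOURS * 3600 / NUM_SEGMENTS

# Mean hours of speech per unique speaker.
mean_hours_per_speaker = TOTAL_HOURS / NUM_SPEAKERS

print(f"Mean segment length: {mean_segment_seconds:.1f} s")
print(f"Mean speech per speaker: {mean_hours_per_speaker:.1f} h")
```

The mean segment length comes out at roughly 25.8 seconds, consistent with the stated upper bound of 30 seconds per segment.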

The research article describing this dataset is available at https://aclanthology.org/2023.resourceful-1.7.pdf

Data and resources

Stortinget Speech Corpus version 1.0 (format: http://publications.europa.eu/resource/authority/file-type/JSON_LD)
GitHub repository: Stortinget Speech Corpus version 1.0

Keywords

Additional information

URI	https://data.gov.dk/dataset/lang/40959b78-712c-426f-9a02-f13a86ad26c2
Landing page	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-91/
Harvested by Datavejviser
Publication date	01-08-2019
Last modified date	15-11-2023
Update frequency	unknown
Coverage period	/
Topic(s)	16.05.07 Language and orthography; Education, culture and sport
Access rights	public
Conforms to
Provenance statement
Documentation	https://www.nb.no/sbfil/talegjenkjenning/ssc/SSC_1.pdf