
WikiMatrix

135 million parallel sentences (1,620 language pairs, 85 languages) mined from Wikipedia.

License: The mined data is distributed under the Creative Commons Attribution-ShareAlike license.

Please cite reference [1] if you use this data.

References:

[1] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia", arXiv, July 11, 2019.

[2] Mikel Artetxe and Holger Schwenk, "Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings", arXiv, Nov 3, 2018.

[3] Mikel Artetxe and Holger Schwenk, "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond", arXiv, Dec 26, 2018.

[4] Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan and Graham Neubig, "When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?", NAACL, pages 529-535, 2018.

Data and resources

Keywords

Additional info

URI https://data.gov.dk/dataset/lang/752bf522-baca-4ff3-80af-6b7e3ea632bd
Landing page https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
Harvested by Datavejviser
Publication date
Last modified date
Update frequency unknown
Coverage period  / 
Topic(s) Education, culture and sport
Access rights public
Conforms to
Provenance statement
Documentation