Description

English - Catalan corpus

Data files

| Data file | Description | Segments | Import date | License | Comments |
| --- | --- | --- | --- | --- | --- |
| globalvoices.en-ca | https://ca.globalvoices.org/ | 21342 | Jan 2020 | Creative Commons Attribution-Only | |
| MemoriesProjectesLliures.en-ca | Open source translations processed by https://www.softcatala.org/recursos/memories.html | 771458 | Jan 2020 | Several open source licenses | |
| OpenSubtitles2018.en-ca | http://www.opensubtitles.org/ | 482009 | Jan 2020 | Not free (every sentence belongs to its author) | |
| WikiMatrix.en-ca.txt | https://ai.facebook.com/blog/wikimatrix/ | 977466 | Sep 2020 | | Pairs extracted with quality >= 1.04, then cleaned up with language detection and by comparing the target to a machine translation |
| europarl.en-ca | https://www.statmt.org/europarl/ | 1965734 | Jan 2020 | ? | Original corpus was English -> Spanish; the Catalan side was translated using MT |
| tedtalks.en-ca | https://www.ted.com/ | 50979 | Jan 2020 | Creative Commons BY-NC-ND | |
| tatoeba.en-ca | https://tatoeba.org/eng/downloads | 5500 | Jan 2020 | CC0 and CC-BY | |
| covost2.ca-en | https://github.com/facebookresearch/covost | 79633 | Aug 2020 | CC0 | Catalan original sentences from the Common Voice corpus + English translations |
| covost2.en-ca | https://github.com/facebookresearch/covost | 263891 | Aug 2020 | CC0 | English original sentences from the Common Voice corpus + Catalan translations |
| macocu-ca-en* | https://macocu.eu/ | 3130519 | Nov 2023 | CC0 | Corpus crawled from several websites |
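
The WikiMatrix row mentions keeping only pairs with quality >= 1.04. A minimal sketch of that threshold step is shown below. It assumes the raw WikiMatrix data is a tab-separated file with one `score<TAB>source<TAB>target` triple per line. The function name `filter_pairs` is hypothetical, and this is not necessarily the script used to build this corpus. The later clean-up steps (language detection and comparison against a machine translation) are not shown.

```python
def filter_pairs(lines, threshold=1.04):
    """Yield (source, target) pairs whose score is >= threshold.

    Assumption: each line has the form score<TAB>source<TAB>target.
    Lines that do not match this layout are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not a score/source/target triple
        score_text, source, target = parts
        try:
            score = float(score_text)
        except ValueError:
            continue  # first column is not a number
        source, target = source.strip(), target.strip()
        # Keep only pairs at or above the threshold with non-empty text.
        if score >= threshold and source and target:
            yield source, target


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real TSV file.
    sample = [
        "1.10\tGood morning\tBon dia\n",
        "1.03\tLow score pair\tParell de puntuació baixa\n",
        "1.04\tThank you\tGràcies\n",
    ]
    for en, ca in filter_pairs(sample):
        print(en, "|||", ca)
```

To run it on a real file, pass an open file handle instead of the sample list, for example `filter_pairs(open(path, encoding="utf-8"))`, where `path` points at the downloaded TSV.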