English - Catalan corpus
Data file | Description | Segments | Import date | License | Comments |
---|---|---|---|---|---|
globalvoices.en-ca | https://ca.globalvoices.org/ | 21342 | Jan 2020 | Creative Commons Attribution-Only | |
MemoriesProjectesLliures.en-ca | Open source translations processed by https://www.softcatala.org/recursos/memories.html | 771458 | Jan 2020 | Several open source licenses | |
OpenSubtitles2018.en-ca | http://www.opensubtitles.org/ | 482009 | Jan 2020 | Not free (every sentence belongs to its author) | |
WikiMatrix.en-ca.txt | https://ai.facebook.com/blog/wikimatrix/ | 977466 | Sep 2020 | | Pairs extracted with quality score >= 1.04, then cleaned up with language detection and by comparing the target to a machine translation |
europarl.en-ca | https://www.statmt.org/europarl/ | 1965734 | Jan 2020 | ? | The original corpus was English -> Spanish; the Catalan side was produced with machine translation |
tedtalks.en-ca | https://www.ted.com/ | 50979 | Jan 2020 | Creative Commons BY-NC-ND | |
tatoeba.en-ca | https://tatoeba.org/eng/downloads | 5500 | Jan 2020 | CC0 and CC-BY | |
covost2.ca-en | https://github.com/facebookresearch/covost | 79633 | Aug 2020 | CC0 | Catalan original sentences from Common Voice corpus + English translations |
covost2.en-ca | https://github.com/facebookresearch/covost | 263891 | Aug 2020 | CC0 | English original sentences from Common Voice corpus + Catalan translations |
macocu-ca-en* | https://macocu.eu/ | 3130519 | Nov 2023 | CC0 | Crawled corpus from several websites |
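
The quality threshold used for the WikiMatrix extraction can be illustrated with a minimal Python sketch. It is not the script that produced the file above: it assumes a tab-separated input with three columns (score, English sentence, Catalan sentence), and the function name `filter_pairs` is hypothetical. The language detection and machine translation comparison steps are not shown.

```python
from typing import Iterable, Iterator, Tuple


def filter_pairs(
    lines: Iterable[str], threshold: float = 1.04
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs whose score is at least `threshold`.

    Assumed input layout (not confirmed by this README):
    score<TAB>source sentence<TAB>target sentence
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not have exactly three columns
        score_text, source, target = parts
        try:
            score = float(score_text)
        except ValueError:
            continue  # skip lines whose first column is not a number
        source, target = source.strip(), target.strip()
        if score >= threshold and source and target:
            yield source, target


# Small in-memory sample instead of a real corpus file.
sample = [
    "1.20\tGood morning.\tBon dia.\n",
    "1.04\tThank you.\tGràcies.\n",
    "1.01\tUnrelated text.\tUn altre text.\n",
    "malformed line\n",
]
kept = list(filter_pairs(sample))
print(kept)
```

To run it over a real file, pass an open file handle (for example one returned by `gzip.open(path, "rt", encoding="utf-8")`) as `lines`.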