Merge branch 'master' into exp8
jordimas committed Sep 7, 2024
2 parents 163ea82 + 7603ce4 commit 134b7aa
Showing 42 changed files with 2,065 additions and 22 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -15,14 +15,14 @@ Language pair | SC model BLEU | SC Flores200 BLEU | Google BLEU | Meta NLLB200 B
|Catalan-German | 28.5 |25.4 |32.9 |29.1|15.8| 3142257 | [cat-deu-2022-11-16.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-deu-2022-11-16.zip)
|English-Catalan | 46.9 |43.8 |46.0 |41.7|29.8| 7856208 | [eng-cat-2023-10-30.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/eng-cat-2023-10-30.zip)
|Catalan-English | 47.4 |43.5 |47.0 |48.0|29.6| 7856208 | [cat-eng-2023-10-29.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-eng-2023-10-29.zip)
-|Basque-Catalan | 38.8 |24.9 |29.6 |N/A|N/A| 9546180 | [eus-cat-2024-08-09.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/eus-cat-2024-08-09.zip)
-|Catalan-Basque | 27.3 |17.1 |18.0 |N/A|N/A| 9546180 | [cat-eus-2024-08-12.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-eus-2024-08-12.zip)
-|French-Catalan | 41.3 |31.6 |37.3 |33.3|27.2| 2566302 | [fra-cat-2022-11-09.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/fra-cat-2022-11-09.zip)
-|Catalan-French | 41.4 |35.4 |41.7 |39.6|27.9| 2566302 | [cat-fra-2022-11-14.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-fra-2022-11-14.zip)
-|Galician-Catalan | 74.1 |31.4 |36.5 |33.2|N/A| 2710149 | [glg-cat-2022-11-17.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/glg-cat-2022-11-17.zip)
-|Catalan-Galician | 80.7 |31.9 |33.1 |31.7|N/A| 2710149 | [cat-glg-2022-11-21.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-glg-2022-11-21.zip)
-|Italian-Catalan | 39.7 |26.5 |30.6 |27.8|22.0| 2584598 | [ita-cat-2022-11-11.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/ita-cat-2022-11-11.zip)
-|Catalan-Italian | 36.2 |24.5 |27.5 |26.0|19.2| 2584598 | [cat-ita-2022-11-15.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-ita-2022-11-15.zip)
+|Basque-Catalan | 38.8 |24.9 |29.6 |25.7|N/A| 9546180 | [eus-cat-2024-08-09.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/eus-cat-2024-08-09.zip)
+|Catalan-Basque | 27.3 |17.1 |18.0 |10.5|N/A| 9546180 | [cat-eus-2024-08-12.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-eus-2024-08-12.zip)
+|French-Catalan | 43.0 |33.8 |37.3 |33.3|27.2| 6392858 | [fra-cat-2024-08-29.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/fra-cat-2024-08-29.zip)
+|Catalan-French | 43.1 |37.0 |41.7 |39.6|27.9| 6392858 | [cat-fra-2024-08-30.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-fra-2024-08-30.zip)
+|Galician-Catalan | 66.4 |32.5 |36.5 |33.2|N/A| 5644577 | [glg-cat-2024-09-03.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/glg-cat-2024-09-03.zip)
+|Catalan-Galician | 69.4 |32.2 |33.1 |31.7|N/A| 5644577 | [cat-glg-2024-09-04.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-glg-2024-09-04.zip)
+|Italian-Catalan | 41.0 |27.0 |30.6 |27.8|22.0| 4146825 | [ita-cat-2024-08-29.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/ita-cat-2024-08-29.zip)
+|Catalan-Italian | 38.3 |25.1 |27.5 |26.0|19.2| 4146825 | [cat-ita-2024-08-30.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-ita-2024-08-30.zip)
|Japanese-Catalan | 24.9 |17.8 |23.4 |N/A|N/A| 1997740 | [jpn-cat-2023-02-17.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/jpn-cat-2023-02-17.zip)
|Catalan-Japanese | 21.3 |19.8 |32.5 |N/A|N/A| 1997740 | [cat-jpn-2023-02-18.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-jpn-2023-02-18.zip)
|Dutch-Catalan | 30.4 |20.3 |27.1 |24.8|15.8| 2208538 | [nld-cat-2022-11-19.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/nld-cat-2022-11-19.zip)
6 changes: 4 additions & 2 deletions data-processing-tools/join-single-file.py
@@ -150,6 +150,9 @@ def _is_sentence_len_good(src, trg):
    return True

# How to split test-val sets based on https://en.wikipedia.org/wiki/Per_mille
+# Two goals:
+# - The split is predictable instead of random
+# - The split has a good distribution over the different corpus
def _get_val_test_split_steps(lines, per_mille_val, per_mille_test):
    lines_val = round(lines * per_mille_val / 1000)
    steps_val = round(lines / lines_val)
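
The hunk above shows only the first lines of `_get_val_test_split_steps`, so the rest of the function is not visible here. The following is a minimal sketch of how a per-mille, step-based split could work, not the repository's implementation. It assumes the test step is computed the same way as the validation step, that a zero-line sample disables that set, and that every `steps_val`-th line goes to validation. `get_val_test_split_steps` mirrors the visible lines; `assign_splits` is a hypothetical helper introduced only for illustration.

```python
def get_val_test_split_steps(lines, per_mille_val, per_mille_test):
    # Lines wanted in each set, expressed per thousand corpus lines.
    lines_val = round(lines * per_mille_val / 1000)
    lines_test = round(lines * per_mille_test / 1000)
    # Assumption: a sample of zero lines disables that set (step 0)
    # instead of dividing by zero on very small corpora.
    steps_val = round(lines / lines_val) if lines_val else 0
    steps_test = round(lines / lines_test) if lines_test else 0
    return steps_val, steps_test


def assign_splits(total_lines, steps_val, steps_test):
    """Hypothetical helper: label each line 'val', 'test' or 'train'."""
    cnt_val = cnt_test = 0
    labels = []
    for _ in range(total_lines):
        cnt_val += 1
        cnt_test += 1
        if steps_val and cnt_val >= steps_val:
            # Every steps_val-th line goes to validation.
            labels.append("val")
            cnt_val = 0
        elif steps_test and cnt_test >= steps_test:
            # When both counters fire on the same line, validation wins
            # and the test sample slips to the following line.
            labels.append("test")
            cnt_test = 0
        else:
            labels.append("train")
    return labels


steps_val, steps_test = get_val_test_split_steps(10_000, 1, 1)
labels = assign_splits(10_000, steps_val, steps_test)
print(steps_val, steps_test, labels.count("val"), labels.count("test"))
```

Because the held-out lines are picked at a fixed stride rather than at random, two runs over the same input give the same split, and the samples are spread across the whole concatenated file instead of clustering in one corpus.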
Expand Down Expand Up @@ -179,7 +182,7 @@ def split_in_six_files(src_filename, tgt_filename, directory, source_lang, targe
SAMPLE_PER_MILLE_VAL = 1
SAMPLE_PER_MILLE_TEST = 1
steps_val, steps_test = _get_val_test_split_steps(total_lines, SAMPLE_PER_MILLE_VAL, SAMPLE_PER_MILLE_TEST)
-clean_src = clean_trg = 0
+cnt_steps_val = cnt_steps_test = clean_src = clean_trg = 0
equal = 0
bad_length = 0
dots = 0
Expand Down Expand Up @@ -209,7 +212,6 @@ def split_in_six_files(src_filename, tgt_filename, directory, source_lang, targe

print("total_lines {0}".format(total_lines))

-cnt_steps_val = cnt_steps_test = clean_src = clean_trg = 0
while True:

src = read_source.readline()
4 changes: 3 additions & 1 deletion evaluate/meta-bleu.json
@@ -16,5 +16,7 @@
"glg-cat": "33.2",
"cat-glg": "31.7",
"oci-cat": "36.2",
-"cat-oci": "27.8"
+"cat-oci": "27.8",
+"eus-cat": "25.7",
+"cat-eus": "10.5"
}
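
The two new entries match the README rows for Basque, where the Meta NLLB200 column changes from N/A to 25.7 and 10.5. A minimal sketch of a lookup with an N/A fallback follows. It assumes, and the diff does not show this, that the README column is filled from this file. `nllb_bleu` is a hypothetical helper, and the JSON string holds only the entries visible in the hunk.

```python
import json

# Entries copied from the hunk above; the real file holds more language pairs.
META_BLEU_JSON = """
{
    "glg-cat": "33.2",
    "cat-glg": "31.7",
    "oci-cat": "36.2",
    "cat-oci": "27.8",
    "eus-cat": "25.7",
    "cat-eus": "10.5"
}
"""


def nllb_bleu(scores, pair):
    """Hypothetical helper: stored score for a pair, or "N/A" if it is missing."""
    return scores.get(pair, "N/A")


scores = json.loads(META_BLEU_JSON)
print(nllb_bleu(scores, "eus-cat"))
print(nllb_bleu(scores, "jpn-cat"))
```

The comma added after `"cat-oci": "27.8"` is required: JSON needs a separator between members, so the file would fail to parse without it once the Basque entries follow.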