Datasets of language identification #596

imenelydiaker · 2024-04-29T07:32:27Z

imenelydiaker
Apr 29, 2024
Maintainer

Some datasets for language identification have been added to MTEB, although I'm not sure how relevant these tasks are for a language benchmark (see tasks: #564, #512, NordicLangClassification).

They can only be used with multilingual models, since the datasets are mixed and contain multiple languages. I'm not sure we can use them to benchmark classification in a given language; imo the goal of these datasets is to show that a multilingual model is able to differentiate between languages.

Any thoughts about it?
@Muennighoff @KennethEnevoldsen @orionw @isaac-chung @asparius @dokato and any other person that is interested in discussing this.

isaac-chung · 2024-04-29T07:59:49Z

isaac-chung
Apr 29, 2024
Collaborator

That's a good point. How I understand "classification in xx language" is that the task is solely in that xx language, not a mix (though I understand that some languages are a mix, sometimes with English).

From how MTEB is structured, perhaps LangID can be a separate task from "classification". Another option is to completely exclude langID datasets.

9 replies

KennethEnevoldsen Apr 29, 2024
Maintainer

Hmm right, but if you only want datasets which are "only" English, you would have to do more regardless. E.g. bitext mining tasks and cross-lingual datasets are also not differentiable (at least atm.). You could also imagine code-switching (language switch mid-sentence). All of these would have the same issues.

You can of course just filter out the datasets that have other languages:

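import mteb


def is_only_english(task):
    # True if the task lists exactly one language, False otherwise.
    return len(task.languages) == 1


# Select the tasks that include English, then keep the English-only ones.
tasks = mteb.get_tasks(languages=["eng"])

only_eng = []
for task in tasks:
    if is_only_english(task):
        only_eng.append(task)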

Here we assume that task.languages is updated when langs=["eng"] is passed to the task, i.e. task(langs=["eng"]), which is not the case atm.

KennethEnevoldsen Apr 29, 2024
Maintainer

We might want to add a flag to get_tasks on languages to exclude these (any suggestions?)

Additionally, we do not pass the "langs" to the dataset atm.; we should also do that (but it requires some handling as well).
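To make the flag idea concrete, here is a rough sketch of how such a filter could behave. Everything in it is an assumption for illustration: `ToyTask` is a stand-in for a real task object, and `filter_tasks` with its `exclusive` flag is a hypothetical helper, not the current `get_tasks` signature.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    """Hypothetical stand-in for an MTEB task; only the fields needed here."""

    name: str
    languages: list[str]


def filter_tasks(tasks, languages, exclusive=False):
    """Keep tasks that cover at least one of the requested languages.

    With exclusive=True (a hypothetical flag), also drop tasks that contain
    languages outside the requested set, e.g. language identification datasets.
    """
    wanted = set(languages)
    selected = []
    for task in tasks:
        task_langs = set(task.languages)
        if not task_langs & wanted:
            continue  # task does not cover any requested language
        if exclusive and not task_langs <= wanted:
            continue  # task mixes in languages that were not requested
        selected.append(task)
    return selected


# Toy data, made up for this example.
tasks = [
    ToyTask("EnglishSentiment", ["eng"]),
    ToyTask("ToyLangID", ["eng", "dan", "fra"]),
    ToyTask("FrenchTopics", ["fra"]),
]

print([t.name for t in filter_tasks(tasks, ["eng"])])
print([t.name for t in filter_tasks(tasks, ["eng"], exclusive=True)])
```

Without the flag, the mixed-language task is still selected for English; with it, only the English-only task remains. Whether the default should be exclusive or not is an open design question.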

imenelydiaker Apr 29, 2024
Maintainer Author

Actually, bitext mining tasks are not included in language benchmarks (see mteb_french_script & mteb_english_script); we only consider the other tasks in the selected language.

imenelydiaker Apr 29, 2024
Maintainer Author

Yes maybe we should add a metadata field to these tasks to differentiate them from other MultilingualTasks

KennethEnevoldsen Apr 29, 2024
Maintainer

Right, but the benchmarks are just a list (so it could be anything).

"Yes maybe we should add a metadata field to these tasks to differentiate them from other MultilingualTasks"

I believe it is already there (or at least can be constructed from existing data). You are welcome to create a property for it.
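As a minimal sketch of what such a property might look like, assuming a metadata object that lists the task's languages: `ToyMetadata`, `labels_are_languages`, and `is_language_identification` are all hypothetical names used for illustration, not existing MTEB fields.

```python
from dataclasses import dataclass


@dataclass
class ToyMetadata:
    """Hypothetical stand-in for task metadata; field names are illustrative."""

    name: str
    languages: list[str]
    labels_are_languages: bool = False  # hypothetical marker for LangID datasets

    @property
    def is_multilingual(self) -> bool:
        # Derived from existing data: more than one distinct language.
        return len(set(self.languages)) > 1

    @property
    def is_language_identification(self) -> bool:
        # A LangID task is multilingual and its labels are the languages themselves.
        return self.is_multilingual and self.labels_are_languages


# Toy data, made up for this example.
lang_id = ToyMetadata("ToyLangID", ["dan", "nob", "swe"], labels_are_languages=True)
reviews = ToyMetadata("ToyMultilingualReviews", ["eng", "fra"])

print(lang_id.is_language_identification)
print(reviews.is_language_identification)
```

A benchmark script could then skip any task whose `is_language_identification` is true while still keeping ordinary multilingual tasks.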
