Datasets of language identification #596
imenelydiaker started this conversation in General
Replies: 1 comment 9 replies
-
That's a good point. As I understand it, "classification in xx language" means the task is solely in language xx, not a mix (though I understand that some languages are themselves a mix, sometimes with English). Given how MTEB is structured, perhaps LangID could be a task type separate from "classification". Another option is to exclude LangID datasets entirely.
-
Some language identification datasets have been added to MTEB, although I'm not sure how relevant these tasks are for a language benchmark (see tasks: #564, #512, NordicLangClassification).
They can only be used with multilingual models, since the datasets are mixed and contain multiple languages. I'm not sure we can use them to benchmark classification in a particular language; imo the goal of these datasets is to show that a multilingual model can differentiate between languages.
Any thoughts about it?
@Muennighoff @KennethEnevoldsen @orionw @isaac-chung @asparius @dokato and any other person that is interested in discussing this.