Clarifications regarding datasets and task #459
-
I have a couple of questions regarding datasets. My understanding is that, since the purpose of MTEB is to benchmark text embedding models, any HF dataset we add must have either a test or a validation split. Or can we also use, say, "n" rows from the train split? Also, is there a restriction on the dataset's license, or would any dataset on HF qualify? As for tasks, does a new subtask within a main task qualify as a new task contribution?
Replies: 1 comment 1 reply
-
@shreeya-dhakal, you can use whatever split you want. However, if there is a dev or test split, we encourage using that.
There are no restrictions on the license (as long as it permits us to refer to the dataset). We also include datasets with no license attached. Users can use the task metadata to filter out tasks without permissible licenses.
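To illustrate the license filtering mentioned above, here is a minimal sketch. It does not use the actual MTEB API: `TaskInfo`, `ALLOWED_LICENSES`, and `filter_by_license` are hypothetical names made up for this example, and the task entries are invented. It only assumes that each task carries a license field that may be empty.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


# Hypothetical stand-in for a task's metadata record (not the real MTEB class).
@dataclass
class TaskInfo:
    name: str
    license: Optional[str]  # None when the dataset has no license attached
    eval_splits: List[str]


# Example allow-list; which licenses are acceptable is the user's own decision.
ALLOWED_LICENSES: Set[str] = {"mit", "apache-2.0", "cc-by-4.0"}


def filter_by_license(tasks: List[TaskInfo],
                      allowed: Set[str] = ALLOWED_LICENSES) -> List[TaskInfo]:
    """Keep only tasks whose license is in the allow-list (case-insensitive).

    Tasks with no license attached are dropped.
    """
    return [
        t for t in tasks
        if t.license is not None and t.license.lower() in allowed
    ]


# Invented example tasks, including one without a license.
tasks = [
    TaskInfo("ExampleClassification", "mit", ["test"]),
    TaskInfo("ExampleRetrieval", None, ["dev"]),
    TaskInfo("ExampleSTS", "cc-by-nc-4.0", ["train"]),
]

kept = filter_by_license(tasks)
print([t.name for t in kept])
```

The same pattern would apply to whatever metadata object the library exposes, as long as the license is available as a field on it.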