Hi,

I've recently been trying to run lm-eval on the Pythia models using the benchmarks listed in the paper. All of the benchmarks give results similar to those reported, except WSC: the paper reports WSC scores of roughly 0.3–0.5 for the Pythia models, while the same models easily reach 0.6–0.8 accuracy on the WSC273 task in lm-eval. Could you confirm which WSC task the paper reports, and how it was evaluated?
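For reference, a command along these lines reproduces the comparison between the two WSC variants (a minimal sketch against the current lm-evaluation-harness CLI; the model and batch size here are placeholders, and task names may differ between harness versions):

```bash
# Evaluate one Pythia checkpoint on both WSC variants.
# "wsc" is the SuperGLUE WSC task; "wsc273" is the standalone
# 273-example Winograd Schema Challenge set. The two are scored
# differently, which may account for the gap with the paper's numbers.
lm_eval --model hf \
  --model_args pretrained=EleutherAI/pythia-1.4b \
  --tasks wsc,wsc273 \
  --batch_size 8
```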
Thanks!