
Additional Datasets? #16

Open
PierreColombo opened this issue Jan 5, 2021 · 7 comments

Comments

@PierreColombo

Hello,
Thanks for your work! Do you have the code for the two other datasets: BAGEL and SFHOTEL?
Kindest regards :)

@andyweizhao
Collaborator

Hello Pierre,

Thanks a lot for your interest! I'd like to sort out the code on BAGEL and SFHOTEL in my free time. Meanwhile, you could check the datasets in https://github.com/jeknov/EMNLP_17_submission and run moverscore on them easily :-)

@PierreColombo
Author

PierreColombo commented Jan 7, 2021

Many thanks for your reply :)
For correlation on the datasets, do you take the average of the 3 annotators' scores, or the median value? (I obtain very low correlations, around 0.04 for quality, for instance.)
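For concreteness, here is a small sketch of the two aggregation options I mean, assuming NumPy/SciPy; all numbers are placeholders, not actual dataset values:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder data: one row per system output, three human annotators,
# plus the metric score (e.g. MoverScore) for the same outputs.
human = np.array([[3, 4, 4],
                  [1, 2, 1],
                  [5, 5, 4],
                  [2, 3, 2]], dtype=float)
metric = np.array([0.62, 0.35, 0.81, 0.44])

mean_human = human.mean(axis=1)          # average of the 3 annotators
median_human = np.median(human, axis=1)  # median of the 3 annotators

print("Pearson (mean):   ", pearsonr(metric, mean_human)[0])
print("Pearson (median): ", pearsonr(metric, median_human)[0])
print("Spearman (mean):  ", spearmanr(metric, mean_human)[0])
```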

@PierreColombo
Author

PierreColombo commented Jan 26, 2021

Hello, can you help me with COCO? Where did you get the data? The paper mentions 5k images, whereas the val set has 40k images :)

@PierreColombo
Author

It seems that the official split is no longer available; however, you report results on the official split, e.g., for METEOR. How do you sample the 5k images?

@andyweizhao
Collaborator

andyweizhao commented Jan 26, 2021

@PierreColombo

We follow the setup in the paper to report the results on MSCOCO. First, you need to download the system captions from the link, generated by the participating teams. In the MSCOCO leaderboard, you can find the system-level human scores, e.g., the M1-M5 scores. Next, you measure metric scores between system and reference captions and average these instance-level scores over all instances to obtain system-level scores. Lastly, you compute the system-level correlation between the system-level metric scores and the human scores.
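For concreteness, a minimal sketch of that system-level procedure, assuming NumPy/SciPy; the system names and all scores below are placeholders, not actual leaderboard numbers:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder instance-level metric scores for each participating system
# (one list per system, one value per image/caption).
metric_per_system = {
    "system_A": [0.61, 0.58, 0.70],
    "system_B": [0.45, 0.50, 0.48],
    "system_C": [0.72, 0.69, 0.75],
}

# Placeholder system-level human scores (e.g. M1 from the MSCOCO leaderboard),
# keyed by the same system names.
human_m1 = {"system_A": 0.27, "system_B": 0.19, "system_C": 0.32}

systems = sorted(metric_per_system)
# Average instance-level metric scores to get one system-level score per system.
metric_sys = [np.mean(metric_per_system[s]) for s in systems]
human_sys = [human_m1[s] for s in systems]

# System-level correlation between metric and human scores.
r, _ = pearsonr(metric_sys, human_sys)
print("system-level Pearson r:", r)
```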

Note that the M1-M5 human scores are reported on the test set; however, we use the metric scores on the validation set (because human scores on the test set are not released), which causes a mismatch between the metric and human scores (see the discussion in https://arxiv.org/abs/1806.06422).

Hope this could help.

@PierreColombo
Author

Hello,
Thanks for your reply :) but the val set contains 40k images, while your paper mentions 5k. Did you use 5k or the full 40k?

@andyweizhao
Collaborator

The results are reported on the full validation set (40k). Apologies for accidentally obscuring the setup in the paper.
