Additional Datasets? #16

PierreColombo · 2021-01-05T15:34:04Z

Hello,
Thanks for your work!
Do you have the code for the two other datasets, BAGEL and SFHOTEL?
Kindest regards :0

andyweizhao · 2021-01-06T19:58:16Z

Hello Pierre,

Thanks a lot for your interest! I'd like to sort out the code for BAGEL and SFHOTEL in my free time. In the meantime, you can find the datasets at https://github.com/jeknov/EMNLP_17_submission and run moverscore on them easily :-)

PierreColombo · 2021-01-07T07:45:38Z

:) Many thanks for your reply :0
For the correlation on these datasets, do you average the scores of the 3 annotators, or do you take the median value? (I obtain very low correlations, around 0.04 for quality, for instance.)

PierreColombo · 2021-01-26T09:29:15Z

Hello, can you help me with COCO? Where did you get the data? The paper mentions 5k images, whereas the val set has 40k images :)

PierreColombo · 2021-01-26T10:20:31Z

It seems that the official split is no longer available; however, you report results on the official split, for instance for METEOR. How do you sample the 5k images?

andyweizhao · 2021-01-26T22:13:48Z

@PierreColombo

We follow the setup in the paper to report the results on MSCOCO. First, you need to download the system captions in the link, generated by participating teams. In the MSCOCO leaderboard, you could find human scores at system-level, e.g., M1-M5 scores. Next, you measure metric scores between system and reference captions and average these instance-level scores over all instances to obtain system-level scores. Lastly, you compute system-level correlation between system-level metric and human scores.
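The three steps above could be sketched roughly as follows. This is only an illustration, not the code used for the paper: the function names, the system names, and all numbers are made up, and Pearson correlation is assumed here as the correlation measure since the comment does not say which one is used. The instance-level metric scores are assumed to have been computed beforehand.

```python
from math import sqrt
from statistics import mean


def system_level_scores(instance_scores):
    """Average each system's instance-level metric scores into one score."""
    return {system: mean(scores) for system, scores in instance_scores.items()}


def pearson(xs, ys):
    """Plain Pearson correlation (assumed choice of correlation measure)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_level_correlation(instance_scores, human_scores):
    """Correlate system-level metric scores with system-level human scores."""
    metric = system_level_scores(instance_scores)
    systems = sorted(metric)  # fixed order so both lists line up
    return pearson([metric[s] for s in systems],
                   [human_scores[s] for s in systems])


# Hypothetical toy data: metric scores per caption for three systems,
# and one human score per system (e.g. an M1-style leaderboard value).
instance_scores = {
    "system_a": [0.2, 0.4, 0.6],
    "system_b": [0.5, 0.5, 0.8],
    "system_c": [0.7, 0.9, 0.8],
}
human_m1 = {"system_a": 0.1, "system_b": 0.2, "system_c": 0.3}

r = system_level_correlation(instance_scores, human_m1)
print(f"system-level correlation: {r:.3f}")
```

With real data, `instance_scores` would hold one metric score per validation caption for each participating system, and the human scores would come from the leaderboard.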

Note that the M1-M5 human scores are reported on the test set; however, we use the metric scores on the validation set (b/c human scores in the test set are not released), which causes a mismatch between metric and human scores (see the discussion in https://arxiv.org/abs/1806.06422).

Hope this could help.

PierreColombo · 2021-01-27T08:14:20Z

Hello,
Thanks for your reply :) But the val set contains 40k images, and your paper mentions 5k. Did you use 5k or the full 40k?

andyweizhao · 2021-01-27T22:17:02Z

The results are reported on the full validation set (40k). Apologies for accidentally obscuring the setup in the paper.
