Chatbot evaluation is hard: there is no accepted standard, and ChatEval is our attempt to address at least some parts of this problem.
Right now we use ParlAI as our framework for data as well as experiments, and we used OpenNMT-py for training models. All of our checkpoints will be made publicly available, including all configurations. See this link for checkpoints from the paper.
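If you want a quick look at the conversational data ParlAI provides, the sketch below follows ParlAI's standard display-data example. The task name `dailydialog` is only an illustrative choice, and the module paths may differ across ParlAI versions.

```python
# Minimal sketch: inspect a conversational dataset through ParlAI.
# Module paths and the example task ("dailydialog") are assumptions and
# may need adjusting for your ParlAI version.
from parlai.core.params import ParlaiParser
from parlai.core.worlds import create_task
from parlai.agents.repeat_label.repeat_label import RepeatLabelAgent

parser = ParlaiParser()
opt = parser.parse_args(['--task', 'dailydialog'])

agent = RepeatLabelAgent(opt)    # trivial agent that just echoes the label
world = create_task(opt, agent)  # pairs the agent with the dataset teacher

for _ in range(5):               # print a few dialogue turns
    world.parley()
    print(world.display())
```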
Submit your model! Please take a look at our submission form.
Amazon Mechanical Turk is not free, so we are actively looking for funding.
Please find our paper here.
What does ChatEval solve?
- Shared and publicly available model code and checkpoints.
- Standard evaluation datasets.
- Standard human annotator framework (currently using Amazon Mechanical Turk).
- Pairwise comparisons of the performance of Model A vs. Model B, with both a summary and the complete underlying data available.