
Added multi_pivot_paraphrases_generation transformation #252

Open · wants to merge 9 commits into main
Conversation

AudayBerro

This transformation generates a list of paraphrases of an English sentence in two steps:

1. Candidate over-generation: leverage pivot-translation techniques by translating the sentence through a curated list of languages, using the Hugging Face MarianMT and UKPLab EasyNMT machine-translation models.

2. Candidate selection: after over-generation, the list may contain semantically unrelated or duplicated paraphrases. This step filters them from the final list using the Universal Sentence Encoder (USE) embedding model. The idea is to compare the cosine similarity of the USE embeddings of the reference sentence and each candidate paraphrase. If the score is below 0.5, the candidate is considered semantically unrelated to the reference sentence; if the score is above 0.95, the candidate is a duplicate of the reference; if 0.5 < score < 0.95, the candidate is accepted.
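The candidate-selection step can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names are hypothetical, and in the real transformation the embedding vectors would come from the Universal Sentence Encoder (e.g. loaded via tensorflow-hub) rather than being passed in directly.

```python
import numpy as np

# Thresholds described above (both configurable in the transformation).
UNRELATED_BELOW = 0.5   # below this, the candidate is semantically unrelated
DUPLICATE_ABOVE = 0.95  # above this, the candidate duplicates the reference


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_candidates(reference_vec, candidate_vecs, candidates):
    """Keep only candidates with UNRELATED_BELOW < similarity < DUPLICATE_ABOVE."""
    kept = []
    for cand, vec in zip(candidates, candidate_vecs):
        score = cosine_similarity(reference_vec, vec)
        if UNRELATED_BELOW < score < DUPLICATE_ABOVE:
            kept.append(cand)
    return kept
```

With a toy reference embedding `[1, 0]`, a candidate at `[0.8, 0.6]` (similarity 0.8) is kept, while `[1, 0]` (similarity 1.0, a duplicate) and `[0, 1]` (similarity 0.0, unrelated) are filtered out.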

@mille-s
Contributor

mille-s commented Sep 21, 2021

Thanks for the submission! Can you please make sure this is not a duplicated transformation? For instance, #94 already creates paraphrases. Should we merge the two?

@kaustubhdhole
Collaborator

@AudayBerro ping!

@kaustubhdhole
Collaborator

Okay, this seems to be a great transformation and should be added to NL Augmenter. Here are a few comments; it would be great if you could address them, and we would be happy to merge.


## What type of a transformation is this?
This transformation generates paraphrases of natural English sentences by leveraging pivot-translation techniques, which yield lexically and syntactically diverse paraphrases.

Collaborator


As @mille-s mentioned, you might want to specify how this is different from earlier PRs.

Author


Thank you for this question.

In general, to generate paraphrases we followed a data-flow principle, splitting the process into two main steps:
a) candidate over-generation: generate as many candidates as possible using pivot-translation techniques through a predefined set of pivot languages;
b) candidate selection: semantically irrelevant paraphrase candidates are removed from the final list based on cosine similarity scores.

To summarize, here are the differences:

  1. We generate the paraphrases by pivot-translation techniques using two pivot levels (1-pivot meaning one pivot language, 2-pivot meaning two pivot languages). The level is configurable: the user can choose it from the beginning. E.g. 1-pivot => English-Italian-English; 2-pivot => English-Chinese-Russian-English.
  2. What makes our work different from others is that we use a manually defined list of pivot languages, so that the generated sentences are more distinct while remaining semantically related to the reference sentence. The languages were selected according to two criteria: a) the more the grammar of the pivot language differs from the source language (English in our case), the more syntactic diversity we get; b) the closer the grammar of the pivot language is to the source language, the more lexical diversity and semantic relatedness we get. In short, we do not use a single pivot language as in other works; instead we use the entire list of predefined languages at the selected pivot level. E.g. at the 1-pivot level, paraphrases are generated by translating to each language in the list in turn and translating back to English.
  3. Generating paraphrases is not enough: we must ensure that the paraphrase candidates are semantically related to the reference sentence, since the machine-translation engine may generate duplicate or semantically unrelated sentences. To ensure high-quality results (in our case, semantic relatedness to the reference sentence), we perform a quality-control step, which can happen during or after generation. In our transformation we apply quality control after paraphrase generation, by computing the cosine similarity between the embedding vectors of the reference sentence and the candidate paraphrase. We support Universal Sentence Encoder embeddings; other embedding models such as BERT and ELMo could be added, but due to time constraints we used USE.
  4. Candidate selection, as mentioned in 3, is configurable: the user can choose whether or not to apply it after generation.
  5. The semantic-relatedness thresholds are configurable and can be changed; in our work we used a minimal score of 0.5 (if the cosine score is lower than 0.5, the candidate is considered semantically unrelated).
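The pivot chains described in point 1 can be sketched as follows. This is an illustrative outline only: `translate` stands in for any machine-translation call (e.g. a MarianMT or EasyNMT wrapper), and the function name and language codes are assumptions, not the PR's actual API.

```python
def pivot_paraphrase(sentence, translate, pivots, level=1):
    """Generate paraphrase candidates by round-trip (pivot) translation.

    `translate(text, src, tgt)` is any machine-translation callable;
    `pivots` is the list of pivot-language codes; `level` selects
    1-pivot (en -> p -> en) or 2-pivot (en -> p1 -> p2 -> en) chains.
    """
    candidates = []
    if level == 1:
        # 1-pivot: English -> pivot -> English, once per pivot language.
        for p in pivots:
            mid = translate(sentence, "en", p)
            candidates.append(translate(mid, p, "en"))
    elif level == 2:
        # 2-pivot: English -> p1 -> p2 -> English for each ordered pair
        # of distinct pivot languages.
        for p1 in pivots:
            for p2 in pivots:
                if p1 == p2:
                    continue
                step1 = translate(sentence, "en", p1)
                step2 = translate(step1, p1, p2)
                candidates.append(translate(step2, p2, "en"))
    else:
        raise ValueError("level must be 1 or 2")
    return candidates
```

Because the translator is passed in as a callable, the chain logic can be exercised with a stub before wiring in the (heavy) translation models.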

scikit-learn
tensorflow
tensorflow-hub
transformers
Collaborator


I think most of these libraries are present in the main requirements.txt file. You might want to skip adding them.

Author


You are right.

TaskType.QUESTION_GENERATION,
TaskType.TEXT_TO_TEXT_GENERATION
]
languages = ["en"]
Collaborator


Please add the relevant keywords here.

Collaborator


Also, at the end you can add a heavy=True parameter.

Author


You are right. I will add the lexical and syntactical keywords.
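For illustration, the transformation's metadata after these changes might look like the sketch below. The class name is hypothetical, and a plain class stands in for the NL Augmenter operation subclass; only the attributes discussed in this thread are shown.

```python
# Sketch of the transformation's class attributes after the review comments:
# keywords and heavy=True added (illustrative names, not the PR's exact code).
class MultiPivotParaphrasesGeneration:
    tasks = ["QUESTION_GENERATION", "TEXT_TO_TEXT_GENERATION"]
    languages = ["en"]
    keywords = ["lexical", "syntactic"]  # as suggested in the reply above
    heavy = True  # marks model-loading transformations as heavy
```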

"inputs": {
"Reference sentence": "How does COVID-19 spread?"
},
"outputs": [
Collaborator


Honestly, these examples look great. :) I would also suggest performing the robustness evaluation for your transformation (or at least in a separate PR).

Author


Regarding the evaluation: as I wrote to you by email, I can't get the evaluation script to work properly; I've tried several times and I always hit the same problem.

The problem was a dependency conflict: the runtime environment was not able to download the suggested version of the transformers package.

return response


if __name__ == '__main__':
Collaborator


This could be commented or deleted.

Author


ok

@kaustubhdhole
Collaborator

@AudayBerro would you like to address the above comments?

@AudayBerro
Author

AudayBerro commented Oct 29, 2021 via email

@AudayBerro
Author

AudayBerro commented Oct 31, 2021 via email

3 participants