
Added multi_pivot_paraphrases_generation transformation #252

Open · wants to merge 9 commits into base: main
44 changes: 44 additions & 0 deletions transformations/multi_pivot_paraphrases_generation/README.md
# From one English Sentence to a list of paraphrases 🦎 + ⌨️ → 🐍
This transformation generates a list of paraphrases for an English sentence by leveraging a pivot-translation approach.
Pivot translation is an approach where a sentence in a source language is translated into a foreign language, called the pivot language, and then translated back into the source language to obtain a paraphrase candidate, e.g. translate an English sentence to French, then translate back to English.

The paraphrase generation is divided into two steps:
- Step 1: candidate over-generation by leveraging pivot translation. At this step, we generate a pool of possible paraphrases.
- Step 2: candidate selection over the pool of paraphrases, since the pool can contain semantically unrelated or duplicate candidates.
We leverage an embedding model such as the Universal Sentence Encoder (USE) to disqualify candidate paraphrases from the pool by computing the cosine similarity score of the
USE embeddings of the reference sentence and the candidate paraphrase. Let R = USE_Embedding(reference_english_sentence) and P = USE_Embedding(candidate):
- if Cosine(R, P) < alpha => the candidate is semantically unrelated and is removed from the final list of paraphrases
- if Cosine(R, P) > beta => the candidate is a duplicate and is removed from the final list of paraphrases
- By default alpha = 0.5 and beta = 0.95; we set these values as suggested by the work of [Parikh et al.](https://arxiv.org/pdf/2004.03484.pdf)
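The selection rule above can be sketched as follows. This is a minimal sketch operating on precomputed embedding vectors with plain NumPy; the toy 2-D vectors only illustrate the alpha/beta thresholds and are not real USE embeddings:

```python
import numpy as np

def cosine(r, p):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(r, p) / (np.linalg.norm(r) * np.linalg.norm(p)))

def select_candidates(ref_emb, cand_embs, alpha=0.5, beta=0.95):
    """Keep indices of candidates whose cosine similarity to the reference
    lies strictly between alpha and beta: neither semantically unrelated
    (score < alpha) nor near-duplicates (score > beta)."""
    kept = []
    for i, p in enumerate(cand_embs):
        score = cosine(ref_emb, p)
        if alpha < score < beta:
            kept.append(i)
    return kept

# Toy 2-D "embeddings" purely to illustrate the thresholds:
ref = np.array([1.0, 0.0])
cands = [np.array([1.0, 0.05]),   # cos ~ 0.999 > beta  -> dropped as duplicate
         np.array([0.0, 1.0]),    # cos = 0.0   < alpha -> dropped as unrelated
         np.array([1.0, 0.6])]    # cos ~ 0.857         -> kept
print(select_candidates(ref, cands))  # -> [2]
```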

Please refer to test.json for all of the test cases covered.

This transformation translates an English sentence into a list of predefined languages using Hugging Face MarianMT and EasyNMT as machine translation models.
- The transformation supports two pivot-translation levels:
- If pivot level = 1 => translate through only one foreign language, e.g. English -> French -> English || English -> Arabic -> English || English -> Japanese -> English
- If pivot level = 2 => translate through two foreign languages, e.g. English -> French -> Arabic -> English || English -> Russian -> Chinese -> English
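The two pivot levels amount to chaining translations through one or two foreign languages before returning to English. A minimal sketch of that chaining logic; the `translate` callable is a stand-in for MarianMT/EasyNMT, not the actual implementation:

```python
def pivot_paraphrase(sentence, translate, pivots):
    """Translate through each pivot language in order, then back to English.
    `translate(text, src, tgt)` is any machine-translation callable.
    `pivots` has length 1 (pivot level 1) or 2 (pivot level 2)."""
    src = "en"
    text = sentence
    for lang in pivots:
        text = translate(text, src, lang)
        src = lang
    return translate(text, src, "en")

# Stub translator that just records the chain, to show the call order:
def fake_translate(text, src, tgt):
    return f"{text}->{tgt}"

print(pivot_paraphrase("Hello", fake_translate, ["fr"]))        # Hello->fr->en
print(pivot_paraphrase("Hello", fake_translate, ["ru", "zh"]))  # Hello->ru->zh->en
```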

Author name: Auday Berro ([email protected])

## What type of a transformation is this?
This transformation generates paraphrases for natural English sentences by leveraging pivot-translation techniques. Pivot translation makes it possible to obtain lexically and syntactically diverse paraphrases.

Collaborator: As @mille-s mentioned, you might want to specify how this is different from earlier PRs.

Author: Thank you for this question.

In general, to generate paraphrases we followed a data-flow principle, splitting the process into two main steps:
a) candidate over-generation: generate as many candidates as possible using pivot-translation techniques through a predefined set of pivot languages
b) candidate selection: semantically irrelevant paraphrase candidates are removed from the final list using cosine similarity scores.

To summarize, here are the differences:

  1. We generate the paraphrases by pivot-translation techniques using two pivot levels (1-pivot meaning we have one pivot language, 2-pivot meaning we have two pivot languages) that are configurable; the user can choose the level of paraphrase generation from the beginning, e.g. 1-level => English-Italian-English || 2-level => English-Chinese-Russian-English.
  2. What makes our work different from others is that we use a manually defined list of pivot languages so that the sentences are more distinct and semantically related to the reference sentence. The languages were selected according to two criteria: a) the more the grammar of the pivot language differs from the source language (in our case English), the more syntactic diversity we get; b) the closer the grammar of the pivot language is to the source language, the more lexical diversity and semantic relatedness we get. To summarize, we do not use a single pivot language as in other works; instead we use the entire list of predefined languages respecting the selected pivot level, e.g. if you choose the 1-pivot level, the paraphrases are generated by translating to each language in the list in turn and translating back to English.
  3. Generating paraphrases is not enough; we should ensure that the paraphrase candidates are semantically related to the reference sentence, since the machine translation engine may generate duplicate or semantically unrelated sentences. The idea is to ensure that the result is of high quality (in our case, semantically related to the reference sentence), so we need to perform a quality-control step, either during or after paraphrase generation. In our transformation we apply quality control after paraphrase generation by computing the cosine similarity of the embedding vectors of the reference sentence and the candidate paraphrase. We support Universal Sentence Encoder embeddings; other embedding models like BERT and ELMo could be added, but due to time constraints we used USE.
  4. Candidate selection, as mentioned in 3, is configurable; the user can choose whether or not to apply it after generation.
  5. The semantic-relatedness thresholds are configurable and can be changed; in our work we used a minimal score of 0.5 (if the cosine score is lower than 0.5, the candidate is considered semantically unrelated).
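The two-step data flow described above can be sketched end to end. This is a hypothetical sketch: `generate_candidates` and `similarity` are stand-ins for the pivot translators and the USE-based scorer, not the actual implementation:

```python
def paraphrase_pipeline(sentence, generate_candidates, similarity,
                        alpha=0.5, beta=0.95, apply_selection=True):
    """Step a) over-generate candidates; step b) optionally filter them
    by keeping only scores strictly between alpha and beta."""
    pool = generate_candidates(sentence)              # candidate over-generation
    if not apply_selection:                           # selection is configurable
        return pool
    return [c for c in pool
            if alpha < similarity(sentence, c) < beta]  # candidate selection

# Stubs purely to exercise the flow:
gen = lambda s: [s.upper(), s, "unrelated text"]
sim = lambda a, b: {("hi", "HI"): 0.8,               # kept
                    ("hi", "hi"): 1.0,               # duplicate -> dropped
                    ("hi", "unrelated text"): 0.1}[(a, b)]  # unrelated -> dropped
print(paraphrase_pipeline("hi", gen, sim))  # -> ['HI']
```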

## What tasks does it intend to benefit?
This transformation would benefit all tasks that take a sentence as input, like question generation, sentence generation, etc.

## What are the limitations of this transformation?

1. The transformation does not generate paraphrases for non-English sentences, e.g. it cannot generate paraphrases for German or Chinese sentences.

2. This transformation only generates paraphrases for natural-language English sentences.

## Previous Work


This work is partly inspired by the following work on robustness for Machine Translation:
```bibtex
@article{berroextensible,
title={An Extensible and Reusable Pipeline for Automated Utterance Paraphrases},
author={Berro, Auday and Zade, Mohammad-Ali Yaghub and Baez, Marcos and Benatallah, Boualem and Benabdeslem, Khalid}
}
```
from .transformation import *
16 changes: 16 additions & 0 deletions transformations/multi_pivot_paraphrases_generation/constants.py
# Hugging Face MarianMT machine translation models to load. Set of tuples of the form: (source-to-target language pair, Hugging Face MarianMT Helsinki-NLP model name)
HUGGINGFACE_MARIANMT_MODELS_TO_LOAD = {
('en2romance','Helsinki-NLP/opus-mt-en-ROMANCE'),
('romance2en','Helsinki-NLP/opus-mt-ROMANCE-en'),
('de2en','Helsinki-NLP/opus-mt-de-en'),
('ru2en','Helsinki-NLP/opus-mt-ru-en'),
('en2ar','Helsinki-NLP/opus-mt-en-ar'),
('en2zh','Helsinki-NLP/opus-mt-en-zh'),
('en2jap','Helsinki-NLP/opus-mt-en-jap'),
('en2ru','Helsinki-NLP/opus-mt-en-ru'),
('en2de','Helsinki-NLP/opus-mt-en-de'),
('zh2en','Helsinki-NLP/opus-mt-zh-en')
}


EASYNMT_MODEL_NAME = 'm2m_100_418M'
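For illustration, the pair identifiers above can be resolved to model names with a simple lookup. This is a sketch: `get_model_name` is a hypothetical helper, not part of the module, and only a subset of the pairs is repeated here:

```python
# Subset of the pairs from constants.py, for illustration:
HUGGINGFACE_MARIANMT_MODELS_TO_LOAD = {
    ('en2romance', 'Helsinki-NLP/opus-mt-en-ROMANCE'),
    ('romance2en', 'Helsinki-NLP/opus-mt-ROMANCE-en'),
    ('en2ar', 'Helsinki-NLP/opus-mt-en-ar'),
    ('en2zh', 'Helsinki-NLP/opus-mt-en-zh'),
}

def get_model_name(pair, models=HUGGINGFACE_MARIANMT_MODELS_TO_LOAD):
    """Return the Helsinki-NLP model name for a direction key like 'en2ar'."""
    for key, model_name in models:
        if key == pair:
            return model_name
    raise KeyError(f"no MarianMT model registered for {pair!r}")

print(get_model_name('en2ar'))  # -> Helsinki-NLP/opus-mt-en-ar
```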
22 changes: 22 additions & 0 deletions transformations/multi_pivot_paraphrases_generation/easy_nmt.py
""" EasyNMT - Easy to use, state-of-the-art Neural Machine Translation - https://github.com/UKPLab/EasyNMT """
from easynmt import EasyNMT

def load_easynmt_model(model_name='m2m_100_418M'):
    """
    Load an EasyNMT model.
    :param model_name: name of the model to load - for the list of supported models, visit https://github.com/UKPLab/EasyNMT#available-models
    :return: EasyNMT machine translation model
    """
    return EasyNMT(model_name)

def get_easynmt_translation(sentence, model, target_lang, source_lang=None):
    """
    Translate a sentence.
    :param sentence: sentence to translate
    :param model: EasyNMT model
    :param target_lang: target language for the translation
    :param source_lang: source language for the translation; if None, the source language is detected automatically
    :return: translated sentence
    """
    return model.translate(sentence, source_lang=source_lang, target_lang=target_lang)
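A usage sketch of the helper above, with a stub standing in for the real EasyNMT model (loading `m2m_100_418M` downloads model weights, so the stub only mirrors the `translate` keyword-argument interface; the stub's output format is invented for illustration):

```python
class StubModel:
    # Mimics EasyNMT's model.translate(sentence, source_lang=..., target_lang=...) signature.
    def translate(self, sentence, source_lang=None, target_lang=None):
        return f"[{target_lang}] {sentence}"

def get_easynmt_translation(sentence, model, target_lang, source_lang=None):
    return model.translate(sentence, source_lang=source_lang, target_lang=target_lang)

print(get_easynmt_translation("Hello world", StubModel(), "fr"))  # -> [fr] Hello world
```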
EasyNMT
numpy
206 changes: 206 additions & 0 deletions transformations/multi_pivot_paraphrases_generation/test.json
{
"type": "multi_pivot_paraphrases_generation",
"test_cases": [
{
"class": "MultiPivotParaphrasesGeneration",
"inputs": {
"Reference sentence": "How does COVID-19 spread?"
},
"outputs": [
Collaborator: Honestly these examples look great. :) I would suggest you also perform the robustness evaluation for your transformation (or at least in a separate PR).

Author: For the evaluation, I wrote to you in an email that I can't get the evaluation script to work properly; I've tried several times and I always have the same problem. The problem was a dependency conflict: the runtime environment was not able to download the suggested version of the transformers package.

{
"Paraphrase": "How is COVID-19 disseminated?"
},
{
"Paraphrase": "How is COVID-19 spread?"
},
{
"Paraphrase": "How did COVID-19 spread?"
},
{
"Paraphrase": "How is COVID-19 spreading?"
},
{
"Paraphrase": "How does COVID-19 spread?"
}
]
},
{
"class": "MultiPivotParaphrasesGeneration",
"inputs": {
"Reference sentence": "Book a flight from Lyon to Sydney?"
},
"outputs": [
{
"Paraphrase": "To book a flight from Lyon to Sydney?"
},
{
"Paraphrase": "Have you booked a flight from Lyon to Sydney?"
},
{
"Paraphrase": "What is the journey from Lyon to Sydney?"
},
{
"Paraphrase": "Book a flight from Lyon to Sydney?"
},
{
"Paraphrase": "Are you booking a flight from Lyon to Sydney?"
}
]
},
{
"class": "MultiPivotParaphrasesGeneration",
"inputs": {
"Reference sentence": "Reserve an Italian Restaurant near Paris"
},
"outputs": [
{
"Paraphrase": "Reserve an Italian restaurant near Paris"
},
{
"Paraphrase": "Italian restaurants near Paris"
},
{
"Paraphrase": "Book an Italian restaurant near Paris"
},
{
"Paraphrase": "It's a reservation at the Italian restaurant near Paris."
},
{
"Paraphrase": "Save the Italian restaurant near Paris."
}
]
},
{
"class": "MultiPivotParaphrasesGeneration",
"inputs": {
"Reference sentence": "how many 10 euros are worth in dollars"
},
"outputs": [
{
"Paraphrase": "how many 10 euros are worth in dollars"
},
{
"Paraphrase": "how much 10 euros are worth in dollars"
},
{
"Paraphrase": "10 Euros in Dollars."
},
{
"Paraphrase": "How many Euros are worth in United States dollars?"
},
{
"Paraphrase": "How much is 10 euros in dollars?"
},
{
"Paraphrase": "how many 10 euros is worth in dollars"
},
{
"Paraphrase": "how many 10 euros in dollars are worth"
}
]
},
{
"class": "MultiPivotParaphrasesGeneration",
"inputs": {
"Reference sentence": "which company makes the ipod?"
},
"outputs": [
{
"Paraphrase": "Which company is making iPods?"
},
{
"Paraphrase": "What company does the iPod make?"
},
{
"Paraphrase": "Which company does the ipod?"
},
{
"Paraphrase": "What kind of company does an iPod?"
},
{
"Paraphrase": "Which company manufactures ipods?"
},
{
"Paraphrase": "What company does the iPod do?"
},
{
"Paraphrase": "Which company makes the iPod?"
},
{
"Paraphrase": "What company manufactures the ipod?"
}
]
},
{
"class": "MultiPivotParaphrasesGeneration",
"inputs": {
"Reference sentence": "what states does the connecticut river flow through?"
},
"outputs": [
{
"Paraphrase": "In what states does the connected river flow?"
},
{
"Paraphrase": "What state is the link to the river?"
},
{
"Paraphrase": "What states is the connecticut river going through?"
},
{
"Paraphrase": "Where does the river flow? What is the way the Nile flows?"
},
{
"Paraphrase": "What are you running through the Connecticut River?"
},
{
"Paraphrase": "What states does the river connecticut flow through?"
},
{
"Paraphrase": "In what state does the river connecticut flow?"
},
{
"Paraphrase": "What states pass through the river Kinkito?"
},
{
"Paraphrase": "What conditions does the Connecticut River flow through?"
},
{
"Paraphrase": "What states the river connecticut flows?"
}
]
},
{
"class": "MultiPivotParaphrasesGeneration",
"inputs": {
"Reference sentence": "in which tournaments did west indies cricket team win the championship?"
},
"outputs": [
{
"Paraphrase": "In which tournaments did Western Indians win the championship?"
},
{
"Paraphrase": "What tournaments did the West Indies cricket team win the championship?"
},
{
"Paraphrase": "Which team won the World Cup in West India?"
},
{
"Paraphrase": "in which tournaments has West India cricket team won the championship?"
},
{
"Paraphrase": "In which tournaments did the cricket team of the West Indies win the championship?"
},
{
"Paraphrase": "What game did the Cricket Team of the West Indies win?"
},
{
"Paraphrase": "In what tournaments did the cricket team of the West Indies win the championship?"
},
{
"Paraphrase": "What tournament did the West Indies cricket team win?"
}
]
}
]
}