Added multi_pivot_paraphrases_generation transformation #252
base: main
Changes from all commits
ca97cd9
0fc1aad
e3c0a3d
03ee9f7
0f47a8f
dc317f6
08a5dc5
e48070e
fbd10fd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# From one English Sentence to a list of paraphrases 🦎 + ⌨️ → 🐍 | ||
This transformation generates a list of paraphrases for an English sentence by leveraging the pivot-translation approach. | ||
Pivot translation is an approach where a sentence in a source language is translated into a foreign language, called the pivot language, and then translated back into the source language to obtain a paraphrase candidate, e.g. translate an English sentence to French, then translate it back to English. | ||
|
||
Paraphrase generation is divided into two steps: | ||
- Step 1: paraphrase candidate over-generation by leveraging pivot translation. At this step, we generate a pool of possible paraphrases. | ||
- Step 2: apply candidate selection over the pool of paraphrases, since the pool can contain semantically unrelated or duplicate paraphrases. | ||
We leverage an embedding model such as the Universal Sentence Encoder (USE) to disqualify candidate paraphrases from the pool, by computing the cosine similarity score between the | ||
USE embeddings of the reference sentence and of the candidate paraphrase. Let R = USE_Embedding(reference_english_sentence) and P = USE_Embedding(candidate): | ||
- if Cosine(R,P) < alpha => the candidate is semantically unrelated and is removed from the final list of paraphrases | ||
- if Cosine(R,P) > beta => the candidate is a duplicate and is removed from the final list of paraphrases | ||
- By default alpha=0.5 and beta=0.95; these values follow the suggestion of [Parikh et al.](https://arxiv.org/pdf/2004.03484.pdf) | ||
|
||
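The selection rule above can be sketched as follows. This is a minimal illustration and not the code from this PR: `cosine_similarity` and `select_candidates` are hypothetical helper names, and the toy 2-D vectors stand in for real USE embeddings, which would come from an actual encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_u * norm_v)


def select_candidates(reference_vec, candidates, alpha=0.5, beta=0.95):
    """Keep candidates whose similarity to the reference lies in [alpha, beta].

    `candidates` is a list of (sentence, embedding) pairs.
    """
    kept = []
    for sentence, vec in candidates:
        score = cosine_similarity(reference_vec, vec)
        if score < alpha:  # semantically unrelated
            continue
        if score > beta:  # duplicate of the reference
            continue
        kept.append(sentence)
    return kept


# Toy vectors standing in for USE embeddings (illustrative only).
reference = [1.0, 0.0]
pool = [
    ("near duplicate", [1.0, 0.01]),
    ("good paraphrase", [1.0, 0.5]),
    ("unrelated", [0.0, 1.0]),
]
selected = select_candidates(reference, pool)
```

With these toy vectors only the middle candidate falls between the two thresholds; the other two are dropped by the beta and alpha checks respectively.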
Please refer to test.json for all of the test cases covered. | ||
|
||
This transformation translates an English sentence to a list of predefined languages using Hugging Face MarianMT and EasyNMT as machine translation models. | ||
- The transformation supports two pivot-translation levels. | ||
- If Pivot-level = 1 => translate through only one foreign language, e.g. English -> French -> English || English -> Arabic -> English || English -> Japanese -> English | ||
- If Pivot-level = 2 => translate through two foreign languages, e.g. English -> French -> Arabic -> English || English -> Russian -> Chinese -> English | ||
|
||
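The two pivot levels described above can be pictured as enumerating translation paths and then translating hop by hop. The sketch below is an assumption about how such paths could be built, not the PR's implementation: `pivot_paths`, `translate_along`, and the `tagging_translate` stub are hypothetical names, and level 2 is assumed to use ordered pairs of distinct pivot languages.

```python
from itertools import permutations


def pivot_paths(pivot_languages, pivot_level, source="en"):
    """Enumerate translation paths source -> pivot(s) -> source."""
    if pivot_level not in (1, 2):
        raise ValueError("pivot_level must be 1 or 2")
    # permutations(..., 1) yields single pivots; (..., 2) yields ordered pairs.
    return [(source, *pivots, source)
            for pivots in permutations(pivot_languages, pivot_level)]


def translate_along(sentence, path, translate):
    """Apply `translate(sentence, src, tgt)` for each consecutive hop in `path`."""
    for src, tgt in zip(path, path[1:]):
        sentence = translate(sentence, src, tgt)
    return sentence


def tagging_translate(sentence, src, tgt):
    """Stub translator: appends the target language instead of translating."""
    return f"{sentence}>{tgt}"


level_1 = pivot_paths(["fr", "ar", "ru"], pivot_level=1)
level_2 = pivot_paths(["fr", "ar", "ru"], pivot_level=2)
traced = translate_along("x", ("en", "fr", "ar", "en"), tagging_translate)
```

In a real run the stub would be replaced by a call into a MarianMT or EasyNMT model, and each path would contribute one candidate to the pool.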
Author name: Auday Berro ([email protected]) | ||
|
||
## What type of a transformation is this? | ||
This transformation generates paraphrases of natural English sentences by leveraging pivot-translation techniques. Pivot translation makes it possible to obtain lexically and syntactically diverse paraphrases. | ||
|
||
## What tasks does it intend to benefit? | ||
This transformation would benefit all tasks with a sentence as input like question generation, sentence generation, etc. | ||
|
||
## What are the limitations of this transformation? | ||
|
||
1. The transformation does not generate paraphrases for non-English sentences, e.g. it cannot generate paraphrases for German or Chinese sentences. | ||
|
||
2. This transformation only generates paraphrases for natural-language English sentences. | ||
|
||
## Previous Work | ||
|
||
|
||
2) This work is partly inspired by the following work on robustness for Machine Translation: | ||
```bibtex | ||
@article{berroextensible, | ||
title={An Extensible and Reusable Pipeline for Automated Utterance Paraphrases}, | ||
author={Berro, Auday and Zade, Mohammad-Ali Yaghub and Baez, Marcos and Benatallah, Boualem and Benabdeslem, Khalid} | ||
} | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from .transformation import * |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Hugging Face MarianMT models to load. Set of tuples of the form (source-to-target language pair, Hugging Face Helsinki-NLP MarianMT model name) | ||
HUGGINGFACE_MARIANMT_MODELS_TO_LOAD = { | ||
('en2romance','Helsinki-NLP/opus-mt-en-ROMANCE'), | ||
('romance2en','Helsinki-NLP/opus-mt-ROMANCE-en'), | ||
('de2en','Helsinki-NLP/opus-mt-de-en'), | ||
('ru2en','Helsinki-NLP/opus-mt-ru-en'), | ||
('en2ar','Helsinki-NLP/opus-mt-en-ar'), | ||
('en2zh','Helsinki-NLP/opus-mt-en-zh'), | ||
('en2jap','Helsinki-NLP/opus-mt-en-jap'), | ||
('en2ru','Helsinki-NLP/opus-mt-en-ru'), | ||
('en2de','Helsinki-NLP/opus-mt-en-de'), | ||
('zh2en','Helsinki-NLP/opus-mt-zh-en') | ||
} | ||
|
||
|
||
EASYNMT_MODEL_NAME = 'm2m_100_418M' |
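As a hedged illustration of how these constants might be consumed, the set of 2-tuples can be turned into a dictionary keyed by language pair. The `pair_key` helper and the `ROMANCE_CODES` subset below are assumptions made for this sketch and do not appear in the PR.

```python
HUGGINGFACE_MARIANMT_MODELS_TO_LOAD = {
    ('en2romance', 'Helsinki-NLP/opus-mt-en-ROMANCE'),
    ('romance2en', 'Helsinki-NLP/opus-mt-ROMANCE-en'),
    ('de2en', 'Helsinki-NLP/opus-mt-de-en'),
    ('ru2en', 'Helsinki-NLP/opus-mt-ru-en'),
    ('en2ar', 'Helsinki-NLP/opus-mt-en-ar'),
    ('en2zh', 'Helsinki-NLP/opus-mt-en-zh'),
    ('en2jap', 'Helsinki-NLP/opus-mt-en-jap'),
    ('en2ru', 'Helsinki-NLP/opus-mt-en-ru'),
    ('en2de', 'Helsinki-NLP/opus-mt-en-de'),
    ('zh2en', 'Helsinki-NLP/opus-mt-zh-en'),
}

# A set of (key, value) tuples converts directly into a lookup table.
MODEL_BY_PAIR = dict(HUGGINGFACE_MARIANMT_MODELS_TO_LOAD)

# Illustrative subset of Romance language codes (assumption, not from the PR).
ROMANCE_CODES = ("fr", "es", "it", "pt", "ro")


def pair_key(source, target):
    """Map a (source, target) language pair to a key of MODEL_BY_PAIR."""
    if source == "en" and target in ROMANCE_CODES:
        return "en2romance"
    if target == "en" and source in ROMANCE_CODES:
        return "romance2en"
    return f"{source}2{target}"


en_fr_model = MODEL_BY_PAIR.get(pair_key("en", "fr"))
jap_en_model = MODEL_BY_PAIR.get(pair_key("jap", "en"))  # not in the set
```

Using `dict.get` means a pair without an entry in the set yields `None`, so a caller can fall back to another translator such as the EasyNMT model named in `EASYNMT_MODEL_NAME`.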
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
""" EasyNMT - Easy to use, state-of-the-art Neural Machine Translation - https://github.com/UKPLab/EasyNMT """ | ||
from easynmt import EasyNMT | ||
|
||
def load_easynmt_model(model_name='m2m_100_418M'): | ||
    """ | ||
    Load an EasyNMT model. | ||
    :param model_name: name of the model to load; for the list of supported models, see https://github.com/UKPLab/EasyNMT#available-models | ||
    :return: EasyNMT machine translation model | ||
    """ | ||
 |
||
    return EasyNMT(model_name) | ||
|
||
def get_easynmt_translation(sentence, model, target_lang, source_lang=None): | ||
    """ | ||
    Translate a sentence. | ||
    :param sentence: sentence to translate | ||
    :param model: EasyNMT model | ||
    :param target_lang: target language for the translation | ||
    :param source_lang: source language for the translation. If None, the source language is detected automatically. | ||
    :return: translated sentence | ||
    """ | ||
    return model.translate(sentence, source_lang=source_lang, target_lang=target_lang) |
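A round trip through one pivot language can be built from two calls to `get_easynmt_translation`. The sketch below is illustrative only: `back_translate` and the `EchoModel` stand-in are hypothetical, and the stand-in merely tags the text so the call order can be inspected without downloading a real EasyNMT model.

```python
def get_easynmt_translation(sentence, model, target_lang, source_lang=None):
    """Same call shape as the helper in this file."""
    return model.translate(sentence, source_lang=source_lang, target_lang=target_lang)


def back_translate(sentence, model, pivot_lang, source_lang="en"):
    """Translate source -> pivot -> source to obtain one paraphrase candidate."""
    pivoted = get_easynmt_translation(sentence, model, pivot_lang, source_lang)
    return get_easynmt_translation(pivoted, model, source_lang, pivot_lang)


class EchoModel:
    """Stand-in for an EasyNMT model: records calls and tags the text."""

    def __init__(self):
        self.calls = []

    def translate(self, sentence, source_lang=None, target_lang=None):
        self.calls.append((source_lang, target_lang))
        return f"[{source_lang}->{target_lang}] {sentence}"


model = EchoModel()
result = back_translate("How does COVID-19 spread?", model, "fr")
```

With a real model, `EchoModel()` would be replaced by the object returned from `load_easynmt_model()`, and `result` would be an English paraphrase candidate.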
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
EasyNMT | ||
numpy |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,206 @@ | ||
{ | ||
"type": "multi_pivot_paraphrases_generation", | ||
"test_cases": [ | ||
{ | ||
"class": "MultiPivotParaphrasesGeneration", | ||
"inputs": { | ||
"Reference sentence": "How does COVID-19 spread?" | ||
}, | ||
"outputs": [ | ||
Honestly, these examples look great. :) I would suggest that you also perform the robustness evaluation for your transformation (or at least do so in a separate PR).
Regarding the evaluation, I wrote to you in an email that I can't get the evaluation script to work properly. I have tried several times and always hit the same problem: a dependency conflict, where the runtime environment was not able to download the suggested version of the transformers package. |
||
{ | ||
"Paraphrase": "How is COVID-19 disseminated?" | ||
}, | ||
{ | ||
"Paraphrase": "How is COVID-19 spread?" | ||
}, | ||
{ | ||
"Paraphrase": "How did COVID-19 spread?" | ||
}, | ||
{ | ||
"Paraphrase": "How is COVID-19 spreading?" | ||
}, | ||
{ | ||
"Paraphrase": "How does COVID-19 spread?" | ||
} | ||
] | ||
}, | ||
{ | ||
"class": "MultiPivotParaphrasesGeneration", | ||
"inputs": { | ||
"Reference sentence": "Book a flight from Lyon to Sydney?" | ||
}, | ||
"outputs": [ | ||
{ | ||
"Paraphrase": "To book a flight from Lyon to Sydney?" | ||
}, | ||
{ | ||
"Paraphrase": "Have you booked a flight from Lyon to Sydney?" | ||
}, | ||
{ | ||
"Paraphrase": "What is the journey from Lyon to Sydney?" | ||
}, | ||
{ | ||
"Paraphrase": "Book a flight from Lyon to Sydney?" | ||
}, | ||
{ | ||
"Paraphrase": "Are you booking a flight from Lyon to Sydney?" | ||
} | ||
] | ||
}, | ||
{ | ||
"class": "MultiPivotParaphrasesGeneration", | ||
"inputs": { | ||
"Reference sentence": "Reserve an Italian Restaurant near Paris" | ||
}, | ||
"outputs": [ | ||
{ | ||
"Paraphrase": "Reserve an Italian restaurant near Paris" | ||
}, | ||
{ | ||
"Paraphrase": "Italian restaurants near Paris" | ||
}, | ||
{ | ||
"Paraphrase": "Book an Italian restaurant near Paris" | ||
}, | ||
{ | ||
"Paraphrase": "It's a reservation at the Italian restaurant near Paris." | ||
}, | ||
{ | ||
"Paraphrase": "Save the Italian restaurant near Paris." | ||
} | ||
] | ||
}, | ||
{ | ||
"class": "MultiPivotParaphrasesGeneration", | ||
"inputs": { | ||
"Reference sentence": "how many 10 euros are worth in dollars" | ||
}, | ||
"outputs": [ | ||
{ | ||
"Paraphrase": "how many 10 euros are worth in dollars" | ||
}, | ||
{ | ||
"Paraphrase": "how much 10 euros are worth in dollars" | ||
}, | ||
{ | ||
"Paraphrase": "10 Euros in Dollars." | ||
}, | ||
{ | ||
"Paraphrase": "How many Euros are worth in United States dollars?" | ||
}, | ||
{ | ||
"Paraphrase": "How much is 10 euros in dollars?" | ||
}, | ||
{ | ||
"Paraphrase": "how many 10 euros is worth in dollars" | ||
}, | ||
{ | ||
"Paraphrase": "how many 10 euros in dollars are worth" | ||
} | ||
] | ||
}, | ||
{ | ||
"class": "MultiPivotParaphrasesGeneration", | ||
"inputs": { | ||
"Reference sentence": "which company makes the ipod?" | ||
}, | ||
"outputs": [ | ||
{ | ||
"Paraphrase": "Which company is making iPods?" | ||
}, | ||
{ | ||
"Paraphrase": "What company does the iPod make?" | ||
}, | ||
{ | ||
"Paraphrase": "Which company does the ipod?" | ||
}, | ||
{ | ||
"Paraphrase": "What kind of company does an iPod?" | ||
}, | ||
{ | ||
"Paraphrase": "Which company manufactures ipods?" | ||
}, | ||
{ | ||
"Paraphrase": "What company does the iPod do?" | ||
}, | ||
{ | ||
"Paraphrase": "Which company makes the iPod?" | ||
}, | ||
{ | ||
"Paraphrase": "What company manufactures the ipod?" | ||
} | ||
] | ||
}, | ||
{ | ||
"class": "MultiPivotParaphrasesGeneration", | ||
"inputs": { | ||
"Reference sentence": "what states does the connecticut river flow through?" | ||
}, | ||
"outputs": [ | ||
{ | ||
"Paraphrase": "In what states does the connected river flow?" | ||
}, | ||
{ | ||
"Paraphrase": "What state is the link to the river?" | ||
}, | ||
{ | ||
"Paraphrase": "What states is the connecticut river going through?" | ||
}, | ||
{ | ||
"Paraphrase": "Where does the river flow? What is the way the Nile flows?" | ||
}, | ||
{ | ||
"Paraphrase": "What are you running through the Connecticut River?" | ||
}, | ||
{ | ||
"Paraphrase": "What states does the river connecticut flow through?" | ||
}, | ||
{ | ||
"Paraphrase": "In what state does the river connecticut flow?" | ||
}, | ||
{ | ||
"Paraphrase": "What states pass through the river Kinkito?" | ||
}, | ||
{ | ||
"Paraphrase": "What conditions does the Connecticut River flow through?" | ||
}, | ||
{ | ||
"Paraphrase": "What states the river connecticut flows?" | ||
} | ||
] | ||
}, | ||
{ | ||
"class": "MultiPivotParaphrasesGeneration", | ||
"inputs": { | ||
"Reference sentence": "in which tournaments did west indies cricket team win the championship?" | ||
}, | ||
"outputs": [ | ||
{ | ||
"Paraphrase": "In which tournaments did Western Indians win the championship?" | ||
}, | ||
{ | ||
"Paraphrase": "What tournaments did the West Indies cricket team win the championship?" | ||
}, | ||
{ | ||
"Paraphrase": "Which team won the World Cup in West India?" | ||
}, | ||
{ | ||
"Paraphrase": "in which tournaments has West India cricket team won the championship?" | ||
}, | ||
{ | ||
"Paraphrase": "In which tournaments did the cricket team of the West Indies win the championship?" | ||
}, | ||
{ | ||
"Paraphrase": "What game did the Cricket Team of the West Indies win?" | ||
}, | ||
{ | ||
"Paraphrase": "In what tournaments did the cricket team of the West Indies win the championship?" | ||
}, | ||
{ | ||
"Paraphrase": "What tournament did the West Indies cricket team win?" | ||
} | ||
] | ||
} | ||
] | ||
} | ||
|
As @mille-s mentioned, you might want to specify how this is different from earlier PRs.
Thank you for this question.
In general, to generate paraphrases we follow a data-flow principle, splitting the process into two main steps:
a) candidate over-generation: generate as many candidates as possible using pivot-translation techniques through a predefined set of pivot languages
b) candidate selection: semantically irrelevant paraphrase candidates are removed from the final list using cosine similarity scores.
To summarize, here are the differences: