add abbreviation replacement data augmentation op and test #732

abbeyyyy · 2022-04-09T20:58:23Z

This PR fixes #736.

Description of changes

Add Abbreviation Replacement Augmentation Method

Possible influences of this PR.

This PR provides a new replacement-based data augmentation method

Test Conducted

Test cases included in abbreviation_replacement_op_test.py

hunterhector · 2022-04-10T02:50:36Z

A couple of preparations for the PR:

We will need to create an issue and associate a PR with the issue: https://github.com/asyml/forte/blob/master/CONTRIBUTING.md#pull-requests
We've sent you an invitation for this repo, so you can then run the CI workflow without approval

codecov · 2022-04-11T14:48:39Z

Codecov Report

Merging #732 (82da6ef) into master (61c44ac) will increase coverage by 0.03%.
The diff coverage is 91.80%.

@@            Coverage Diff             @@
##           master     #732      +/-   ##
==========================================
+ Coverage   80.94%   80.98%   +0.03%     
==========================================
  Files         249      251       +2     
  Lines       18664    18725      +61     
==========================================
+ Hits        15108    15164      +56     
- Misses       3556     3561       +5

Impacted Files	Coverage Δ
..._augment/algorithms/abbreviation_replacement_op.py	`85.71% <85.71%> (ø)`
...ent/algorithms/abbreviation_replacement_op_test.py	`96.96% <96.96%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 61c44ac...82da6ef.

Pushkar-Bhuse · 2022-04-13T03:25:56Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+
+class AbbreviationReplacementOp(SingleAnnotationAugmentOp):
+    r"""
+    This class is a replacement op utilizing a pre-defined


The docstring should be more comprehensive. This is what the user is going to see if they want to use this DA op.

Pushkar-Bhuse · 2022-04-13T03:26:22Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

@@ -0,0 +1,104 @@
+# Copyright 2020 The Forte Authors. All Rights Reserved.


Pushkar-Bhuse · 2022-04-13T03:29:02Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+        super().__init__(configs)
+        if "dict_path" in configs.keys():
+            self.dict_path = configs["dict_path"]
+        else:


An if-else block is not needed here, since you are already setting a default value in default_configs.
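
A minimal sketch of why the branch is redundant, using a plain-Python stand-in rather than Forte's actual config machinery. `DEFAULT_CONFIGS`, `merge_configs` and `AbbreviationOpSketch` are hypothetical names, and the `dict_path` value is a placeholder, not the real default:

```python
from typing import Any, Dict, Optional

# Hypothetical defaults; in the op these would come from default_configs().
DEFAULT_CONFIGS: Dict[str, Any] = {
    "dict_path": "abbreviate.json",  # placeholder path, not the real default
    "prob": 0.5,
}


def merge_configs(user_configs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Overlay user-supplied values on top of the defaults."""
    merged = dict(DEFAULT_CONFIGS)
    merged.update(user_configs or {})
    return merged


class AbbreviationOpSketch:
    """Stand-in op: reads dict_path directly, with no if/else fallback."""

    def __init__(self, configs: Optional[Dict[str, Any]] = None):
        self.configs = merge_configs(configs)
        # Always present after merging, so a membership check is unnecessary.
        self.dict_path = self.configs["dict_path"]
```

Once the defaults are merged in, `self.configs["dict_path"]` is always defined, whether or not the user passed the key.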

Pushkar-Bhuse · 2022-04-13T03:29:44Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+        self, input_anno: Annotation
+    ) -> Tuple[bool, str]:
+        r"""
+        This function replaces a word from an abbreviation dictionary.


Again, we should add a better description of what this function will do.

Pushkar-Bhuse · 2022-04-13T03:31:44Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+        # If the replacement does not happen, return False.
+        if random.random() > self.configs.prob:
+            return False, input_anno.text
+        if input_anno.text in self.data.keys():


Since you already return from the function when the program enters the earlier if statement, you don't need to add this if.

Also, I am not sure if this check (input_anno.text in self.data.keys()) is necessary.

I was thinking that if the input phrase does not have a corresponding abbreviation, an error will occur.

When checking dict existence, use text in self.data; there is no need to call keys().

Now we can see that prob only applies to an annotation that has an abbreviation, which should probably be specified in the class docstring.

Pushkar-Bhuse · 2022-04-13T03:34:14Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+            - dict_path: the `url` or the path to the pre-defined
+              abbreviation json file. The key is a word / phrase we want
+              to replace. The value is an abbreviated word of the
+              corresponding key.


I'd recommend adding the default value of dict_path in the docstring as well since this is what will be rendered in the documentation and it would be easier for users to see.

Pushkar-Bhuse

Your implementation of the Op seems fine, but it looks like you may have missed some of the intricacies of how to adapt SingleAnnotationAugmentOp to different types of annotations. Please take a closer look at that.

Pushkar-Bhuse · 2022-04-13T03:34:56Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+            A dictionary with the default config for this processor.
+        Following are the keys for this dictionary:
+            - prob: The probability of replacement,
+              should fall in [0, 1]. Default value is 0.1


The default value below is 0.5. Make sure you check the documentation thoroughly.

Pushkar-Bhuse · 2022-04-13T03:39:46Z

tests/forte/processors/data_augment/algorithms/abbreviation_replacement_op_test.py

+        data_pack = DataPack()
+        text = "see you later"
+        data_pack.set_text(text)
+        token = Token(data_pack, 0, len(text))


The Token class is generally used for a single word. When annotating the whole sequence, you should use the Sentence class. Also, if your SingleAnnotationOp augments an annotation other than a Token, you must specify that in the default_configs. For your reference, look at the implementation of the BackTranslationOp and its test cases. Even that Op augments sentences.

We also have Document https://github.com/asyml/forte/blob/master/ft/onto/base_ontology.py#L136 for the whole article.

I know it is just a test case so it doesn't matter too much, but still worth noting.
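
To illustrate the difference in granularity, here is a plain-Python sketch of the character spans involved. It does not use Forte's API; `word_spans` and `whole_text_span` are hypothetical helpers that only compute the (begin, end) offsets a Token or a Sentence annotation would typically cover:

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def word_spans(text: str) -> List[Span]:
    """One span per whitespace-delimited word: the granularity of a Token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def whole_text_span(text: str) -> Span:
    """A single span over the full string: the granularity of a Sentence."""
    return (0, len(text))


text = "see you later"
print(word_spans(text))       # [(0, 3), (4, 7), (8, 13)]
print(whole_text_span(text))  # (0, 13)
```

The span `(0, len(text))` used in the test covers the whole string, which is why a sentence-level annotation describes it better than a Token.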

Pushkar-Bhuse · 2022-04-13T03:40:46Z

tests/forte/processors/data_augment/algorithms/abbreviation_replacement_op_test.py

+        augmented_data_pack = self.abre.perform_augmentation(data_pack)
+
+        augmented_token = list(
+            augmented_data_pack.get("ft.onto.base_ontology.Token")


I think you should take the comment above into consideration and rework your test cases accordingly.

Pushkar-Bhuse

I think the changes look good.

hunterhector · 2022-05-18T19:21:09Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+    abbreviation dictionary to replace word or phrase
+    with an abbreviation. The abbreviation dictionary can
+    be user-defined, we also provide a default dictionary.
+    `prob` indicates the probability of replacement.


What does "probability of replacement" mean? For example, if prob is 0.4, does the replacement happen in 40% of cases, or the other way around? Or does it mean that 40% of the words will be replaced? Let's specify this clearly.

hunterhector · 2022-05-18T19:24:37Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+        # If the replacement does not happen, return False.
+        if random.random() > self.configs.prob:
+            return False, input_anno.text
+        if input_anno.text in self.data.keys():


When checking dict existence, use text in self.data; there is no need to call keys().

Now we can see that prob only applies to an annotation that has an abbreviation, which should probably be specified in the class docstring.
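
One way to make that behavior explicit, sketched as a standalone function rather than the op's actual method. `replace_abbreviation` and its `rng` parameter are hypothetical; the probability test mirrors the `random.random() > prob` check in the diff:

```python
import random
from typing import Dict, Optional, Tuple


def replace_abbreviation(
    text: str,
    abbreviations: Dict[str, str],
    prob: float,
    rng: Optional[random.Random] = None,
) -> Tuple[bool, str]:
    """Return (replaced, new_text) for a single annotation's text."""
    rng = rng or random.Random()
    # Membership test on the dict itself; no .keys() call is needed.
    if text not in abbreviations:
        return False, text
    # prob is only consulted for text that has an abbreviation, so it is
    # the chance that an eligible annotation gets replaced.
    if rng.random() > prob:
        return False, text
    return True, abbreviations[text]
```

Checking membership first means `prob` reads as "the chance that an annotation with a known abbreviation is replaced", which is the wording the docstring could use.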

hunterhector · 2022-05-18T19:28:08Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+        if random.random() > self.configs.prob:
+            return False, input_anno.text
+        if input_anno.text in self.data.keys():
+            result: str = self.data[input_anno.text]


Something about this replacement:

Do we need to consider letter case? Maybe we should lowercase both your dictionary and the user input.

How about substrings? For example, given "see you later": "syl8r", what if we have the input "i will see you later"? It looks like we won't replace this.

Maybe you need to consider using an Aho-Corasick data structure here: https://pyahocorasick.readthedocs.io/en/latest/
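
A rough sketch of case-insensitive substring replacement, to show the intended behavior. Note that this is not Aho-Corasick: it swaps in a standard-library regex alternation, which is simpler but would likely scale worse on a large dictionary. `abbreviate_text` is a hypothetical helper, not part of the PR:

```python
import re
from typing import Dict


def abbreviate_text(text: str, abbreviations: Dict[str, str]) -> str:
    """Replace every dictionary phrase found inside text, ignoring case."""
    # Lowercase the keys so lookups are case-insensitive.
    lowered = {key.lower(): value for key, value in abbreviations.items()}
    if not lowered:
        return text
    # Longest phrases first, so "see you later" wins over "see you".
    phrases = sorted(lowered, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    # Fall back to the matched text if the lowercased match is not a key.
    return pattern.sub(
        lambda m: lowered.get(m.group(0).lower(), m.group(0)), text
    )
```

With this approach the phrase does not need to be annotated in advance: `abbreviate_text("I will See You Later!", {"see you later": "syl8r"})` rewrites the phrase in place and leaves the rest of the sentence untouched.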

hunterhector · 2022-05-18T19:31:30Z

forte/processors/data_augment/algorithms/abbreviation_replacement_op.py

+              to replace. The value is an abbreviated word of the
+              corresponding key. Default dictionary is from a web-scraped
+              slang dictionary
+              ("https://github.com/abbeyyyy/JsonFiles/blob/main/abbreviate.json").


Did we adopt the dictionary from another library?

hunterhector · 2022-05-18T19:32:22Z

tests/forte/processors/data_augment/algorithms/abbreviation_replacement_op_test.py

+        data_pack_1 = DataPack()
+        text_1 = "I will see you later!"
+        data_pack_1.set_text(text_1)
+        phrase_1 = Phrase(data_pack_1, 7, len(text_1) - 1)


I see that you have to first identify the phrase before doing the match, which is not a very typical use case.

add abbreviation replacement data augmentation op and test

5902d7b

abbeyyyy added 2 commits April 11, 2022 10:18

black reformatting

aa98d53

reformatting string line length

a3df17d

hepengfe reviewed Apr 11, 2022

abbeyyyy and others added 2 commits April 11, 2022 19:17

Merge branch 'master' into abbreviation_replacement_op

f46e40d

fix docstring

f779549

abbeyyyy marked this pull request as ready for review April 12, 2022 00:17

abbeyyyy requested a review from Pushkar-Bhuse April 12, 2022 00:20

Pushkar-Bhuse reviewed Apr 13, 2022

abbeyyyy added 10 commits April 27, 2022 00:09

fix documentation / changed the replaced annotation to phrase

247ef59

Merge branch 'asyml:master' into abbreviation_replacement_op

e7f8d4c

fix argument

cc471ab

Add test and docs

d295df0

Merge branch 'asyml:master' into abbreviation_replacement_op

0e00ac5

Merge branch 'asyml:master' into abbreviation_replacement_op

0fa98b2

fix docs

e657d49

fix docs

8ae0112

fix docs error

512b1a0

Update abbreviation_replacement_op.py

a62c864

Pushkar-Bhuse approved these changes May 11, 2022

Merge branch 'master' into abbreviation_replacement_op

82da6ef

hunterhector reviewed May 18, 2022
