Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding two filters -- ambiguous characters and alphanumeric characters #274

Closed
wants to merge 10 commits into from

Conversation

motiwari
Copy link

@motiwari motiwari commented Sep 1, 2021

This PR contains two filters:

  1. One to filter text that contains ambiguous characters. For example, the sentence "Buy some piIIs here" actually has two capital Is in place of ls.

  2. Another filters text that is not alphanumeric or uses symbols other than very common punctuation.

This is of use in spam detection, profanity detection, etc. where adversaries may obfuscate messages in a way to avoid spam filters but still be legible by humans

@motiwari motiwari changed the title Adding ambiguous characters filter [WIP - Do not review] Adding ambiguous characters filter Sep 7, 2021
@motiwari
Copy link
Author

motiwari commented Sep 7, 2021

Hi, this PR is still a work-in-progress; please do not review (UPDATE: Now ready)

@motiwari motiwari changed the title [WIP - Do not review] Adding ambiguous characters filter Adding two filters -- ambiguous characters and alphanumeric characters Sep 17, 2021
@motiwari
Copy link
Author

motiwari commented Sep 17, 2021

Thanks all - I've updated the PR now and it is ready for review

filters/alphanumeric/filter.py Show resolved Hide resolved
filters/ambiguouscharacters/filter.py Show resolved Hide resolved
Comment on lines +22 to +30
def filter(self, sentence: str = None) -> bool:
for c in sentence:
if c in self.ambiguous_chars:
return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, this filter will return "False" for every sentence which contains the above-defined ambiguous chars? So for the sentence "Buy some piIIs here" it returns "False" which looks fine, but sentences like "This is the last 007 movie of Daniel Craig.", it will also return "False" (because of 0). So every sentence which contains 0 (or "G" or "Z") is an ambiguous sentence and gets filtered. Because this list contains two vowels (O, I), and numbers (0,1,2,5,6,8), most of the sentences will fall under the ambiguous category.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @motiwari , I like the idea of ambiguouscharacters filter. But I have the same question as @ashish3586 . After I check out your PR and run it locally, I cannot pass all the test cases with the following error.

E           AssertionError: Executing AmbiguousCharactersFilter The filter should return True for the inputs {'sentence': 'She has a cat.'}
E           assert False == True

If I understand it correctly, 'She has a cat.' returned False since it contains ambiguous_chars 'S' .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Yiwen-Shi -- I have fixed that test case.

And @Yiwen-Shi and @ashish3586 your understandings are correct; for this filter, any sentence with a single ambiguous character will get filtered. This is a crude, first-pass filter that might be valuable to the community.

For future work, it would be interesting to filter sentences only if the ambiguous character actually leads to a difference in interpretation of the sentence, e.g. have some dictionary of English words and consider all possible substitutions of the ambiguous characters that could create multiple words. For example:

piIIs returns "ambiguous" because the capital is can be changed to l to create a new (valid) English word
She would return "not ambiguous" because the S cannot be changed to a 5 to create a new English word

Unfortunately such an approach is significantly more complex and I won't be able to implement it in this PR; however, I hope that someone will build off of this idea to implement it

Copy link
Collaborator

@aadesh11 aadesh11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add your filter name in /test/mapper.py so that the test job can run your test cases.

@motiwari
Copy link
Author

motiwari commented Nov 2, 2021

Thanks @aadesh11 @Yiwen-Shi and @ashish3586 , I have addressed your comments!

@@ -0,0 +1,21 @@
from tasks.TaskTypes import TaskType
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your build is failing:probably because of missing

from interfaces.SentenceOperation import SentenceOperation

@motiwari
Copy link
Author

@kaustubhdhole I added that but still failing :( any ideas?

@kaustubhdhole
Copy link
Collaborator

This is what the logs say :

      assert (
            output == test["outputs"]
        ), f"Executing {test['class']} The filter should return {test['outputs']} for the inputs {test['inputs']}"

E AssertionError: Executing AmbiguousCharactersFilter The filter should return True for the inputs {'sentence': 'Characters passes the filter.'}
E assert False == True

Copy link
Collaborator

@AbinayaM02 AbinayaM02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @motiwari: Please pull the latest main branch once. Also, I see that you have force-pushed your commits once. If possible, could you do a clean commit of this PR?

@motiwari
Copy link
Author

Thanks @AbinayaM02 , I've created a new PR in #371 as per your request.

@kaustubhdhole I'm not sure what to do about that error message; it says The filter should return True for the inputs {'sentence': 'Characters passes the filter.'} but this filter should actually return False, no?

@motiwari motiwari closed this Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants