-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding two filters -- ambiguous characters and alphanumeric characters #274
Conversation
Hi, this PR is still a work-in-progress; please do not review (UPDATE: Now ready) |
Thanks all - I've updated the PR now and it is ready for review |
def filter(self, sentence: str = None) -> bool: | ||
for c in sentence: | ||
if c in self.ambiguous_chars: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question, this filter will return "False" for every sentence which contains the above-defined ambiguous chars? So for the sentence "Buy some piIIs here" it returns "False" which looks fine, but sentences like "This is the last 007 movie of Daniel Craig.", it will also return "False" (because of 0). So every sentence which contains 0 (or "G" or "Z") is an ambiguous sentence and gets filtered. Because this list contains two vowels (O, I), and numbers (0,1,2,5,6,8), most of the sentences will fall under the ambiguous category.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @motiwari , I like the idea of ambiguouscharacters filter. But I have the same question as @ashish3586 . After I check out your PR and run it locally, I cannot pass all the test cases with the following error.
E AssertionError: Executing AmbiguousCharactersFilter The filter should return True for the inputs {'sentence': 'She has a cat.'}
E assert False == True
If I understand it correctly, 'She has a cat.' returned False since it contains ambiguous_chars 'S' .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Yiwen-Shi -- I have fixed that test case.
And @Yiwen-Shi and @ashish3586 your understandings are correct; for this filter, any sentence with a single ambiguous character will get filtered. This is a crude, first-pass filter that might be valuable to the community.
For future work, it would be interesting to filter sentences only if the ambiguous character actually leads to a difference in interpretation of the sentence, e.g. have some dictionary of English words and consider all possible substitutions of the ambiguous characters that could create multiple words. For example:
piIIs
returns "ambiguous" because the capital i
s can be changed to l
to create a new (valid) English word
She
would return "not ambiguous" because the S
cannot be changed to a 5
to create a new English word
Unfortunately such an approach is significantly more complex and I won't be able to implement it in this PR; however, I hope that someone will build off of this idea to implement it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add your filter name in /test/mapper.py so that the test job can run your test cases.
593775b
to
3646e93
Compare
Thanks @aadesh11 @Yiwen-Shi and @ashish3586 , I have addressed your comments! |
@@ -0,0 +1,21 @@ | |||
from tasks.TaskTypes import TaskType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your build is failing:probably because of missing
from interfaces.SentenceOperation import SentenceOperation
@kaustubhdhole I added that but still failing :( any ideas? |
This is what the logs say :
E AssertionError: Executing AmbiguousCharactersFilter The filter should return True for the inputs {'sentence': 'Characters passes the filter.'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @motiwari: Please pull the latest main branch once. Also, I see that you have force-pushed your commits once. If possible, could you do a clean commit of this PR?
Thanks @AbinayaM02 , I've created a new PR in #371 as per your request. @kaustubhdhole I'm not sure what to do about that error message; it says |
This PR contains two filters:
One to filter text that contains ambiguous characters. For example, the sentence "Buy some piIIs here" actually has two capital
I
s in place ofl
s.Another filters text that is not alphanumeric or uses symbols other than very common punctuation.
This is of use in spam detection, profanity detection, etc. where adversaries may obfuscate messages in a way to avoid spam filters but still be legible by humans