add fix_encoding to preprocessing #129

Closed
cedricconol opened this issue Jul 29, 2020 · 3 comments

@cedricconol
Contributor

I think it would be nice to have a fix_encoding function in preprocessing that repairs badly encoded input text (mojibake). We can build it on top of ftfy.

Examples from ftfy's readme:

>>> print(fix_text('This text should be in â€œquotesâ€\x9d.'))
This text should be in "quotes".

>>> print(fix_text('uÌˆnicode'))
ünicode

>>> print(fix_text('Broken text… it’s flubberific!',
...                normalization='NFKC'))
Broken text... it's flubberific!
@henrifroese
Collaborator

This is certainly useful, I'm just not sure how common these errors are. I assume we would not put it into the standard clean pipeline, since the problem is probably not very common and running the fix on every text adds significant overhead.

Then the only case where this would be used is when a user notices this encoding error in their Series. Would they not just google the problem, land on StackOverflow, import ftfy and fix it themselves? I guess I'm just not really seeing when a user would look for a texthero function to do this.

The only exception I can see is if these errors are much more common than I think. I'm not sure.

@jbesomi
Owner

jbesomi commented Jul 31, 2020

Agree with @henrifroese.

The way we would implement this is by simply calling s.apply(fix_text). This can be done directly by the user.

@cedricconol if you believe this function might be useful for many, you can write a blog article about that subject. The idea would be to load a dataset, explain the problem, and show the code to fix the issue.

I'm closing this now as the idea is to prioritize: #85

@jbesomi jbesomi closed this as completed Jul 31, 2020
@cedricconol
Contributor Author

Thanks for your feedback, @henrifroese and @jbesomi.

@jbesomi jbesomi mentioned this issue Aug 12, 2020