add fix_encoding to preprocessing #129

cedricconol · 2020-07-29T09:30:58Z

I think it would be nice to have a fix_encoding function in preprocessing to fix bad encoding in input text. We can build this using ftfy.

Examples from ftfy's readme:

>>> print(fix_text('This text should be in â€œquotesâ€\x9d.'))
This text should be in "quotes".

>>> print(fix_text('uÌˆnicode'))
ünicode

>>> print(fix_text('Broken text&hellip; it&#x2019;s ﬂubberiﬁc!',
...                normalization='NFKC'))
Broken text... it's flubberific!

henrifroese · 2020-07-30T15:17:26Z

This is certainly useful; I'm just not sure how common these errors are. I assume we would not put it into the standard clean pipeline, as the problem is probably not very common and running it adds significant overhead.

Then the only case where this would be used is when a user notices this encoding error in their Series. Would they not just google the problem, land on StackOverflow, import ftfy, and fix it themselves? I guess I'm just not really seeing when a user would look for a texthero function to do this.

The only exception I can see is that maybe these errors are much more common than I think? I'm not sure.

jbesomi · 2020-07-31T14:43:42Z

Agree with @henrifroese.

The way we would implement this is by simply calling s.apply(fix_text). This can be done directly by the user.

@cedricconol if you believe this function might be useful for many, you can write a blog article about that subject. The idea would be to load a dataset, explain the problem, and show the code to fix the issue.

I'm closing this now, as the priority is #85.

cedricconol · 2020-08-01T01:15:35Z

Thanks for your feedback, @henrifroese and @jbesomi.

jbesomi closed this as completed Jul 31, 2020

jbesomi mentioned this issue Aug 12, 2020

Spell checker #27

Draft