Bug: clean() discards text which contains a single '<' character #1894

Closed
whchdorf opened this issue Feb 7, 2023 · 3 comments
Labels
duplicate: This is a duplicate issue or root-cause of another issue

Comments

whchdorf commented Feb 7, 2023

Hello,

I discovered that Jsoup.clean() apparently discards all text after a less-than sign (<) that is followed by an alphabetic (non-numeric) character.
Here is a short test that demonstrates the behaviour:

@Test
public void test() {
    var result = Jsoup.clean("this is <some harmless input text", Safelist.basic());
    // Current behaviour: everything from the '<' onward is dropped.
    assertEquals("this is", result);
}

I would expect the assertion to look like this instead:

assertEquals("this is <some harmless input text", result);

This is potentially severe, since harmless input text can be silently discarded when it is cleaned with jsoup.
It might be caused by logic that tries, unsuccessfully, to find the end of an opening tag. However, I would expect jsoup not to touch this text at all, so the test should return the input unchanged.

Is there a way to fix this?
Thanks

@akshaya57148

I guess for reserved characters like <, >, etc., you will need to use HTML encoding:
&lt; for <
&gt; for >, and so on.
For details, refer to: https://www.w3schools.com/HTML/html_entities.asp
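
For illustration, the encoding suggested above can be sketched with the standard library alone. The class and method names here (`HtmlEscapeSketch`, `escapeHtml`) are hypothetical and not part of jsoup; this is a minimal sketch that assumes only `&`, `<` and `>` need encoding.

```java
public class HtmlEscapeSketch {
    /** Hypothetical helper: encodes the three characters that are significant in HTML text. */
    public static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: this is &lt;some harmless input text
        System.out.println(escapeHtml("this is <some harmless input text"));
    }
}
```

Note that this encodes every `<` and `>`, including those that belong to real tags, so after this step there is no markup left for a cleaner to remove.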

@whchdorf
Author

I guess for reserved chars like < , > etc , you will need to use HTML encoding: &lt; for < &gt; for > and so on. For details refer :https://www.w3schools.com/HTML/html_entities.asp

When I escape the whole input string, jsoup does not do anything, because the input is then already clean.

In case I did not state it clearly enough: I still want jsoup's cleaning of tags to work, but a lone opening '<' should not stop it from working.
Here is another test that should be green if jsoup works correctly:

@Test
public void test() {
    // Complete tags should be stripped; the unclosed '<' and the text after it should survive.
    var result = Jsoup.clean("<a>RemoveThisTag</a><b>AndThisTag</b>but <not this harmless string", Safelist.none());
    assertEquals("RemoveThisTagAndThisTagbut <not this harmless string", result);
}

@jhy
Owner

jhy commented Feb 17, 2023

This is effectively the same as the feature idea in #1230 - rewrite illegal tags as escaped entities.

If implemented, it should be as an option to Cleaner.
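
Until such an option exists, one possible workaround is to encode only the unclosed `<` characters before handing the string to the cleaner. The following is a rough sketch under stated assumptions, not jsoup's behaviour or API: the class `StrayLessThanEscaper` and its method are hypothetical, and the heuristic simply treats a `<` as stray when no `>` follows it before the next `<` or the end of the input.

```java
import java.util.regex.Pattern;

public class StrayLessThanEscaper {
    // Assumption: a '<' is "stray" if no '>' closes it before the next '<' or the end of input.
    private static final Pattern UNCLOSED_LT = Pattern.compile("<(?=[^<>]*(?:<|\\z))");

    /** Hypothetical helper: replaces each unclosed '<' with &lt; and leaves complete tags alone. */
    public static String escapeStrayLessThan(String input) {
        return UNCLOSED_LT.matcher(input).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String raw = "<a>RemoveThisTag</a><b>AndThisTag</b>but <not this harmless string";
        // prints: <a>RemoveThisTag</a><b>AndThisTag</b>but &lt;not this harmless string
        System.out.println(escapeStrayLessThan(raw));
    }
}
```

The pre-processed string could then be passed to `Jsoup.clean(...)` as in the tests above. Because the cleaned result is HTML, the character would be expected to come back encoded as `&lt;` rather than as a literal `<`. The heuristic is deliberately crude: text such as `a < b > c` still looks like a tag to it and is left untouched.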

jhy closed this as completed Feb 17, 2023
jhy added the duplicate label Feb 17, 2023