Duplicated lines when nested in <article> and <main>, with <br> in front #768
Comments
This example also has the same problem, even without <html>
<article>
<article>
Line that has to have at least 125 characters for the bug to appear so here is some filler text text text text text text text
</article>
</article>
</html>
It's a bug indeed, but as a side note, it's possible to use Trafilatura's duplicate filter with a low threshold to prevent it.
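Until the bug is fixed, the duplicated output can also be cleaned up after extraction. The following is a minimal sketch of that idea, not Trafilatura's own duplicate filter: `drop_repeated_lines` and its `min_length` parameter are hypothetical names introduced here for illustration, and the sample text only stands in for the duplicated output described in this issue.

```python
def drop_repeated_lines(text: str, min_length: int = 100) -> str:
    """Drop later copies of long lines, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        # Only long lines are checked; short lines may repeat legitimately.
        if len(key) >= min_length:
            if key in seen:
                continue
            seen.add(key)
        kept.append(line)
    return "\n".join(kept)


# Stand-in for extracted text in which the long line appears twice.
sample_line = "Line that has to have at least 125 characters " + " ".join(["text"] * 20)
extracted = sample_line + "\n" + sample_line

cleaned = drop_repeated_lines(extracted)
print(cleaned.count(sample_line))
```

This filter is deliberately blunt: it removes every later copy of a long line, including repeats that are intentional in the source page, so it is a stopgap rather than a fix.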
Hi @adbar, can you please elaborate on how to deduplicate these examples? In my case I am doing
Which still returns the text twice.
@sarahyurick see the following line in
Thanks @adbar! The following worked for me:
(For some reason I have to include
Title might be misleading as it may have nothing to do with these tags, but I haven't been able to create a more minimal example.
This outputs the line twice. I am not sure which part is actually causing the issue, because whatever I remove from this structure, the problem disappears. It seems like this special configuration of main + article + br + long line is causing issues.
I'm on the newest trafilatura-2.0.0.
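A hedged reproduction sketch of the structure described above. The markup is rebuilt from the description (main + article + br + long line), so the exact nesting order and the filler text are assumptions rather than the original report's code. `count_occurrences` is a hypothetical helper; it relies only on `trafilatura.extract` and returns `None` when the library is not installed or nothing is extracted.

```python
# The long line must be at least 125 characters, per the report.
LONG_LINE = "Line that has to have at least 125 characters " + " ".join(["text"] * 20)

# Assumed structure: <main> wrapping <article>, with <br> directly before the line.
HTML = (
    "<html><main><article>"
    "<br>" + LONG_LINE +
    "</article></main></html>"
)


def count_occurrences(html: str, needle: str):
    """Return how often `needle` appears in the extracted text, or None if
    trafilatura is unavailable or the extraction yields nothing."""
    try:
        import trafilatura
    except ImportError:
        return None
    text = trafilatura.extract(html)
    return None if text is None else text.count(needle)


print(len(LONG_LINE), count_occurrences(HTML, LONG_LINE))
```

A count above 1 would indicate the duplication reported here; the result may differ between versions.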