Duplicated lines when nested in <article> and <main>, with <br> in front #768

Open
ibestvina opened this issue Dec 14, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@ibestvina

The title might be misleading, as the bug may have nothing to do with these tags, but I haven't been able to create a more minimal example.

from trafilatura import extract

html = """
<!doctype html>

<body>
  <main>
    <article>
      <div>
        <br>Line that has to have at least 125 characters for the bug to appear so here is some filler text text text text text text text
      </div>
    </article>
  </main>
</body>

</html>
"""

page = extract(html, output_format="txt")
print(page)

This outputs the line twice. I am not sure which part is actually causing the issue, because whatever I remove from this structure, the problem disappears. It seems like this special configuration of main + article + br + long line is causing issues.

I'm on the newest trafilatura-2.0.0.

@ibestvina
Author

ibestvina commented Dec 14, 2024

This example has the same problem, even without <br> and the additional <div>:

<html>
<article>
<article>
    Line that has to have at least 125 characters for the bug to appear so here is some filler text text text text text text text
</article>
</article>
</html>
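
To check for the duplication programmatically rather than by eye, a small helper can count repeated lines in the extracted text. This is a minimal sketch: `repeated_lines` is a hypothetical helper, not part of trafilatura, and the sample string is a stand-in for the output of `extract(html, output_format="txt")` so that the snippet runs without trafilatura installed.

```python
from collections import Counter


def repeated_lines(text: str) -> dict[str, int]:
    """Return each non-empty line that occurs more than once, with its count."""
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


# Stand-in for the extractor output (assumed, not produced by trafilatura here).
sample_output = "Some long line of text\nSome long line of text\nAnother line"
print(repeated_lines(sample_output))
```

An empty dictionary means no line was emitted more than once.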

@adbar added the bug label on Dec 18, 2024
@adbar
Owner

adbar commented Dec 18, 2024

It's a bug indeed, but as a side note, it's possible to use Trafilatura's duplicate filter with a low threshold to prevent it.

@sarahyurick

Hi @adbar, can you please elaborate on how to deduplicate these examples? In my case I am doing:

from trafilatura import extract


# Modified from https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/abdf1966fb3cefe3e0790e510ab5cb1446f99a79/tests/resiliparse/extract/test_html2text.py
html_string = """<!doctype html>
    <head>
        <title>My Title</title>
        <meta charset="utf-8">
        <style>* { margin: 0; }</style>
    </head>
    <body>
        <section id="wrapper">
            <nav>
                <ul>
                    <li>Nav 1</li>
                    <li>
                        <p>Nav 2</p>
                        <ul>
                            <li><p>Nav 3</p></li>
                        </ul>
                    </li>
                </ul>
            </nav>
            <main>
                This is a sample paragraph. In it we write words.
                These are stopwords: because did than has near we almost while what still.
                <a href="#foo" hidden>bar</a>

                <p>
                This paragraph doesn't have many stopwords.
                <br>A new paragraph: either came does last new took taken making became from.
                </p>

                <button aria-hidden="true">Click here</button>
                <input type="hidden" value="foo">
                <input type="text" value="Some text" placeholder="Insert text">
                <input type="text" placeholder="Insert text">
                <img src="" alt="Some image">
                <object data="" class="some-class hidden">Cannot display object</object>
            </main>
            <script language="vbscript" type="text/vbscript">MsgBox("Hello World!")</script>
            <noscript>Sorry, your browser doesn't support VB Script!</noscript>
            <div><div><div><footer id="global-footer">
                Copyright (C) 2021 Foo Bar
            </footer></div></div></div>
        </section>
    </body>
</html>"""

text = extract(html_string, deduplicate=True)
print(text)
# This paragraph doesn't have many stopwords.
# A new paragraph: either came does last new took taken making became from.
# This paragraph doesn't have many stopwords.
# A new paragraph: either came does last new took taken making became from.

Which still returns the text twice.

@adbar
Owner

adbar commented Dec 23, 2024

@sarahyurick see the following line in settings.cfg and the documentation page on settings:
MAX_REPETITIONS = 2

@sarahyurick

Thanks @adbar ! The following worked for me:

from copy import deepcopy
from trafilatura import extract
from trafilatura.settings import DEFAULT_CONFIG


# Modified from https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/abdf1966fb3cefe3e0790e510ab5cb1446f99a79/tests/resiliparse/extract/test_html2text.py
html_string = """<!doctype html>
    <head>
        <title>My Title</title>
        <meta charset="utf-8">
        <style>* { margin: 0; }</style>
    </head>
    <body>
        <section id="wrapper">
            <nav>
                <ul>
                    <li>Nav 1</li>
                    <li>
                        <p>Nav 2</p>
                        <ul>
                            <li><p>Nav 3</p></li>
                        </ul>
                    </li>
                </ul>
            </nav>
            <main>
                This is a sample paragraph. In it we write words.
                These are stopwords: because did than has near we almost while what still.
                <a href="#foo" hidden>bar</a>

                <p>
                This paragraph doesn't have many stopwords.
                <br>A new paragraph: either came does last new took taken making became from.
                </p>

                <button aria-hidden="true">Click here</button>
                <input type="hidden" value="foo">
                <input type="text" value="Some text" placeholder="Insert text">
                <input type="text" placeholder="Insert text">
                <img src="" alt="Some image">
                <object data="" class="some-class hidden">Cannot display object</object>
            </main>
            <script language="vbscript" type="text/vbscript">MsgBox("Hello World!")</script>
            <noscript>Sorry, your browser doesn't support VB Script!</noscript>
            <div><div><div><footer id="global-footer">
                Copyright (C) 2021 Foo Bar
            </footer></div></div></div>
        </section>
    </body>
</html>"""

my_config = deepcopy(DEFAULT_CONFIG)
my_config['DEFAULT']['MIN_EXTRACTED_SIZE'] = '10'
my_config['DEFAULT']['MIN_DUPLCHECK_SIZE'] = '10'
my_config['DEFAULT']['MAX_REPETITIONS'] = '1'

text = extract(html_string, config=my_config)
print(text)
# This is a sample paragraph. In it we write words.
# These are stopwords: because did than has near we almost while what still.
# bar
# This paragraph doesn't have many stopwords.
# A new paragraph: either came does last new took taken making became from.

(For some reason I have to include MIN_EXTRACTED_SIZE for it to work.)
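
If changing the settings is not an option, a post-processing pass over the extracted text is another possible workaround. The sketch below is not trafilatura's duplicate filter: `drop_repeated_lines` is a hypothetical helper that keeps only the first occurrence of each non-empty line, and the input string reproduces the duplicated output from my earlier comment.

```python
def drop_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each non-empty line, preserving order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # exact repeat of an earlier non-empty line: skip it
        seen.add(key)
        kept.append(line)  # blank lines are always kept
    return "\n".join(kept)


# Duplicated output copied from the earlier comment (not generated here).
duplicated = (
    "This paragraph doesn't have many stopwords.\n"
    "A new paragraph: either came does last new took taken making became from.\n"
    "This paragraph doesn't have many stopwords.\n"
    "A new paragraph: either came does last new took taken making became from."
)
print(drop_repeated_lines(duplicated))
```

Note that this also removes lines that are legitimately repeated in the source page, so it only suits documents where exact repeats are unwanted.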
