Duplicated lines when nested in <article> and <main>, with <br> in front #768

ibestvina · 2024-12-14T11:42:50Z

Title might be misleading as it may have nothing to do with these tags, but I haven't been able to create a more minimal example.

from trafilatura import extract

html = """
<!doctype html>

<body>
  <main>
    <article>
      <div>
        <br>Line that has to have at least 125 characters for the bug to appear so here is some filler text text text text text text text
      </div>
    </article>
  </main>
</body>

</html>
"""

page = extract(html, output_format="txt")
print(page)

This outputs the line twice. I am not sure which part is actually causing the issue, because whatever I remove from this structure, the problem disappears. It seems like this special configuration of main + article + br + long line is causing issues.

I'm on the newest trafilatura-2.0.0.

ibestvina · 2024-12-14T11:50:23Z

This example also has the same problem, even without <br> and the additional <div>:

<html>
<article>
<article>
    Line that has to have at least 125 characters for the bug to appear so here is some filler text text text text text text text
</article>
</article>
</html>

adbar · 2024-12-18T09:40:16Z

It's a bug indeed, but as a side note, it's possible to use Trafilatura's duplicate filter with a low threshold to prevent it.

sarahyurick · 2024-12-20T19:59:33Z

Hi @adbar, can you please elaborate on how to deduplicate these examples? In my case I am doing:

from trafilatura import extract


# Modified from https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/abdf1966fb3cefe3e0790e510ab5cb1446f99a79/tests/resiliparse/extract/test_html2text.py
html_string = """<!doctype html>
    <head>
        <title>My Title</title>
        <meta charset="utf-8">
        <style>* { margin: 0; }</style>
    </head>
    <body>
        <section id="wrapper">
            <nav>
                <ul>
                    <li>Nav 1</li>
                    <li>
                        <p>Nav 2</p>
                        <ul>
                            <li><p>Nav 3</p></li>
                        </ul>
                    </li>
                </ul>
            </nav>
            <main>
                This is a sample paragraph. In it we write words.
                These are stopwords: because did than has near we almost while what still.
                <a href="#foo" hidden>bar</a>

                <p>
                This paragraph doesn't have many stopwords.
                <br>A new paragraph: either came does last new took taken making became from.
                </p>

                <button aria-hidden="true">Click here</button>
                <input type="hidden" value="foo">
                <input type="text" value="Some text" placeholder="Insert text">
                <input type="text" placeholder="Insert text">
                <img src="" alt="Some image">
                <object data="" class="some-class hidden">Cannot display object</object>
            </main>
            <script language="vbscript" type="text/vbscript">MsgBox("Hello World!")</script>
            <noscript>Sorry, your browser doesn't support VB Script!</noscript>
            <div><div><div><footer id="global-footer">
                Copyright (C) 2021 Foo Bar
            </footer></div></div></div>
        </section>
    </body>
</html>"""

text = extract(html_string, deduplicate=True)
print(text)
# This paragraph doesn't have many stopwords.
# A new paragraph: either came does last new took taken making became from.
# This paragraph doesn't have many stopwords.
# A new paragraph: either came does last new took taken making became from.

Which still returns the text twice.
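As a stopgap that does not depend on the extractor's settings, the repeated lines can also be dropped after extraction. The following is a minimal sketch, not part of trafilatura: `drop_repeated_lines` is a hypothetical helper, and it assumes a repeated line is always an unwanted duplicate, which will not hold for documents that legitimately repeat a line.

```python
def drop_repeated_lines(text: str) -> str:
    """Keep the first occurrence of each non-empty line, preserving order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # this line was already emitted earlier: skip the repeat
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


# The duplicated output reported above, used as sample input.
duplicated = (
    "This paragraph doesn't have many stopwords.\n"
    "A new paragraph: either came does last new took taken making became from.\n"
    "This paragraph doesn't have many stopwords.\n"
    "A new paragraph: either came does last new took taken making became from."
)
print(drop_repeated_lines(duplicated))
```

Blank lines are kept as-is so paragraph spacing survives; only non-empty lines are compared.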

adbar · 2024-12-23T11:52:16Z

@sarahyurick see the following line in settings.cfg and the documentation page on settings:
MAX_REPETITIONS = 2
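For anyone preferring a settings file, here is a rough sketch of lowering that value with the standard-library `configparser`. The excerpt below is hypothetical and deliberately incomplete (a real copy of settings.cfg has more keys, and all of them should be kept); `write_low_threshold_settings` and `my_settings.cfg` are illustrative names, and the `MIN_DUPLCHECK_SIZE` value is a placeholder.

```python
import configparser
import tempfile
from pathlib import Path

# Hypothetical excerpt of a settings file; values other than
# MAX_REPETITIONS = 2 are placeholders, not confirmed defaults.
SETTINGS_EXCERPT = """\
[DEFAULT]
MIN_DUPLCHECK_SIZE = 100
MAX_REPETITIONS = 2
"""


def write_low_threshold_settings(source_text: str, target: Path) -> configparser.ConfigParser:
    """Parse settings text, lower the repetition threshold, and save it to target."""
    config = configparser.ConfigParser()
    config.read_string(source_text)
    # configparser stores every value as a string, hence "1" rather than 1.
    config["DEFAULT"]["MAX_REPETITIONS"] = "1"
    with target.open("w", encoding="utf-8") as handle:
        config.write(handle)
    return config


target_path = Path(tempfile.mkdtemp()) / "my_settings.cfg"
custom = write_low_threshold_settings(SETTINGS_EXCERPT, target_path)
print(custom["DEFAULT"]["MAX_REPETITIONS"])
```

If the file holds the full set of keys, it should be loadable through the mechanism described on the settings documentation page and passed to `extract` via its `config` argument; that step is left out here because it was not run as part of this sketch.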

sarahyurick · 2024-12-23T17:00:50Z

Thanks @adbar! The following worked for me:

from copy import deepcopy
from trafilatura import extract
from trafilatura.settings import DEFAULT_CONFIG


# Modified from https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/abdf1966fb3cefe3e0790e510ab5cb1446f99a79/tests/resiliparse/extract/test_html2text.py
html_string = """<!doctype html>
    <head>
        <title>My Title</title>
        <meta charset="utf-8">
        <style>* { margin: 0; }</style>
    </head>
    <body>
        <section id="wrapper">
            <nav>
                <ul>
                    <li>Nav 1</li>
                    <li>
                        <p>Nav 2</p>
                        <ul>
                            <li><p>Nav 3</p></li>
                        </ul>
                    </li>
                </ul>
            </nav>
            <main>
                This is a sample paragraph. In it we write words.
                These are stopwords: because did than has near we almost while what still.
                <a href="#foo" hidden>bar</a>

                <p>
                This paragraph doesn't have many stopwords.
                <br>A new paragraph: either came does last new took taken making became from.
                </p>

                <button aria-hidden="true">Click here</button>
                <input type="hidden" value="foo">
                <input type="text" value="Some text" placeholder="Insert text">
                <input type="text" placeholder="Insert text">
                <img src="" alt="Some image">
                <object data="" class="some-class hidden">Cannot display object</object>
            </main>
            <script language="vbscript" type="text/vbscript">MsgBox("Hello World!")</script>
            <noscript>Sorry, your browser doesn't support VB Script!</noscript>
            <div><div><div><footer id="global-footer">
                Copyright (C) 2021 Foo Bar
            </footer></div></div></div>
        </section>
    </body>
</html>"""

my_config = deepcopy(DEFAULT_CONFIG)
my_config['DEFAULT']['MIN_EXTRACTED_SIZE'] = '10'
my_config['DEFAULT']['MIN_DUPLCHECK_SIZE'] = '10'
my_config['DEFAULT']['MAX_REPETITIONS'] = '1'

text = extract(html_string, config=my_config)
print(text)
# This is a sample paragraph. In it we write words.
# These are stopwords: because did than has near we almost while what still.
# bar
# This paragraph doesn't have many stopwords.
# A new paragraph: either came does last new took taken making became from.

(For some reason I have to include MIN_EXTRACTED_SIZE for it to work.)

adbar added the "bug" label (Something isn't working) on Dec 18, 2024

sarahyurick mentioned this issue on Dec 20, 2024: Add TrafilaturaExtractor class NVIDIA/NeMo-Curator#431 (Open)
