drop invalid XML element attributes
Fixes issue #375

The bug occurred when an attribute name contained a `:` whose prefix did not match any declared XML namespace, which is invalid XML. In the example, the attributes were `padding:1px=""; margin:15px=""`.
We can work around it by manually dropping those invalid attributes.
I hope it doesn't impact performance too much.
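
The rule described above can be sketched in isolation. This is only an illustration, not the code from the commit: `is_invalid_attribute` is a hypothetical helper, and `nsmap` is assumed to be a plain mapping of declared prefixes to namespace URIs.

```python
def is_invalid_attribute(key, value, nsmap):
    """Return True if an attribute should be dropped (hypothetical helper)."""
    if ":" not in key:
        return False  # no colon, so there is no namespace prefix to validate
    prefix = key.split(":")[0]
    # Drop when the value is empty or the prefix is not a declared namespace.
    return not value or prefix not in nsmap


# Assumed example data: one declared prefix, plus attributes like those in the bug report.
nsmap = {"xlink": "http://www.w3.org/1999/xlink"}
attributes = {"style": "", "padding:1px": "", "margin:15px": "", "xlink:href": "#a"}
kept = {k: v for k, v in attributes.items() if not is_invalid_attribute(k, v, nsmap)}
print(kept)
```

Only `style` and `xlink:href` survive here: the first has no colon, and the second has a declared prefix and a non-empty value.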

To reproduce:
`trafilatura -u "https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G" --xml`

Minimal reproduction example:
```
echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>
' | trafilatura --xml
```
vbarbaresi committed Dec 31, 2023
1 parent d31c8d7 commit 5cf9b2e
Showing 2 changed files with 10 additions and 1 deletion.
5 changes: 4 additions & 1 deletion tests/unit_tests.py
@@ -387,7 +387,10 @@ def test_external():
        teststring = f.read()
    assert extract(teststring, no_fallback=True, include_tables=False) == ''
    assert extract(teststring, no_fallback=False, include_tables=False) == ''

    # invalid XML attributes: namespace colon in attribute key (issue #375). Those attributes should be stripped
    bad_xml = 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>'
    res = extract(bad_xml, output_format='xml')
    assert "Features" in res

def test_images():
    '''Test image extraction function'''
6 changes: 6 additions & 0 deletions trafilatura/utils.py
@@ -303,6 +303,12 @@ def sanitize_tree(tree):
        preserve_space = elem.tag in SPACING_PROTECTED or parent_tag in SPACING_PROTECTED
        trailing_space = elem.tag in FORMATTING_PROTECTED or parent_tag in FORMATTING_PROTECTED or preserve_space

        for attrib_key in elem.attrib.keys():
            # Remove invalid attributes
            if ':' in attrib_key:  # colon is reserved for namespaces in XML
                if not elem.attrib[attrib_key] or attrib_key.split(':')[0] not in tree.nsmap:
                    elem.attrib.pop(attrib_key)

        if elem.text:
            elem.text = sanitize(elem.text, preserve_space, trailing_space)
        if elem.tail:
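
For readers without the project checked out, the same tree walk can be approximated with the standard library alone. This is a sketch under stated assumptions, not the trafilatura implementation: it uses `xml.etree.ElementTree` instead of lxml, `strip_invalid_attributes` is a hypothetical name, and because the standard-library elements carry no `nsmap`, the declared prefixes are passed in as an assumed `known_prefixes` set.

```python
import xml.etree.ElementTree as ET


def strip_invalid_attributes(root, known_prefixes):
    """Remove colon-containing attributes with an empty value or unknown prefix."""
    for elem in root.iter():
        # Snapshot the keys: attrib is a plain dict here and is mutated below.
        for key in list(elem.attrib):
            if ":" in key:
                prefix = key.split(":")[0]
                if not elem.attrib[key] or prefix not in known_prefixes:
                    del elem.attrib[key]
    return root


# Assumed example tree, built by hand to mimic the attributes from the bug report.
root = ET.Element("ul", {"style": "", "padding:1px": "", "margin:15px": ""})
ET.SubElement(root, "li", {"foo:bar": "baz"}).text = "Saves the cost of two dedicated phone lines."
strip_invalid_attributes(root, known_prefixes=set())
print(root.attrib, root[0].attrib)
```

Copying the keys with `list(...)` before deleting matters in this variant, since removing entries from a plain dict while iterating over it raises a `RuntimeError`.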
