Drop invalid XML element attributes (#462)
* drop invalid XML element attributes

Fixes issue #375

The bug occurred when an element attribute contained a `:` whose prefix did not match any XML namespace (invalid XML). In the example it was `padding:1px=""; margin:15px=""`.
We can work around it by manually dropping those invalid attributes.
I hope it doesn't impact performance too much.
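
The rule described above can be sketched on plain dictionaries, without any parser. This is only an illustration: `drop_invalid_attributes` is a hypothetical helper, not part of trafilatura, and the `ex` prefix and its URI are made up. It mirrors the condition used in the patch: an attribute whose name contains a colon is dropped if its value is empty or its prefix is not a declared namespace.

```python
def drop_invalid_attributes(attrib, nsmap):
    """Return a copy of attrib without colon-named keys that are not valid namespaced attributes.

    Hypothetical helper for illustration; attrib and nsmap are plain dicts here.
    """
    cleaned = {}
    for key, value in attrib.items():
        if ':' in key:  # colon is reserved for namespaces in XML
            prefix = key.split(':', 1)[0]
            # same condition as the patch: empty value or unknown prefix
            if not value or prefix not in nsmap:
                continue
        cleaned[key] = value
    return cleaned


# attributes as they appear in the example from the issue
attrs = {"style": "", "padding:1px": "", "margin:15px": ""}
print(drop_invalid_attributes(attrs, nsmap={}))  # {'style': ''}
```

Note that the emptiness check means a correctly prefixed attribute is also dropped when its value is empty.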

To reproduce:
`trafilatura -u  https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G --xml`

Minimal reproduction example:
```
echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>
' | trafilatura --xml
```

* pin lxml to < 5

* syntax

---------

Co-authored-by: Adrien Barbaresi <[email protected]>
3 people authored Jan 2, 2024
1 parent 396b991 commit de57ac1
Showing 2 changed files with 10 additions and 1 deletion.
5 changes: 4 additions & 1 deletion tests/unit_tests.py
@@ -387,7 +387,10 @@ def test_external():
teststring = f.read()
assert extract(teststring, no_fallback=True, include_tables=False) == ''
assert extract(teststring, no_fallback=False, include_tables=False) == ''

# invalid XML attributes: namespace colon in attribute key (issue #375). Those attributes should be stripped
bad_xml = 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>'
res = extract(bad_xml, output_format='xml')
assert "Features" in res

def test_images():
'''Test image extraction function'''
6 changes: 6 additions & 0 deletions trafilatura/utils.py
@@ -303,6 +303,12 @@ def sanitize_tree(tree):
preserve_space = elem.tag in SPACING_PROTECTED or parent_tag in SPACING_PROTECTED
trailing_space = elem.tag in FORMATTING_PROTECTED or parent_tag in FORMATTING_PROTECTED or preserve_space

# remove invalid attributes
for attribute in elem.attrib:
    if ':' in attribute:  # colon is reserved for namespaces in XML
        if not elem.attrib[attribute] or attribute.split(':', 1)[0] not in tree.nsmap:
            elem.attrib.pop(attribute)

if elem.text:
    elem.text = sanitize(elem.text, preserve_space, trailing_space)
if elem.tail:
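
The attribute loop in this hunk removes keys from the mapping it is iterating over. With a plain Python `dict` that pattern raises `RuntimeError`; whether lxml's attribute proxy is affected is not established here. A defensive variant iterates over a snapshot of the keys. The sketch below uses a plain dictionary and a hypothetical helper name, `pop_colon_keys`, purely to show the pattern; it omits the namespace check.

```python
def pop_colon_keys(attrib):
    """Remove keys containing ':' in place, iterating over a snapshot of the keys.

    Hypothetical helper for illustration only.
    """
    for key in list(attrib):  # list() copies the keys before anything is removed
        if ':' in key:
            attrib.pop(key)
    return attrib


attrs = {"padding:1px": "", "class": "features", "margin:15px": ""}
print(pop_colon_keys(attrs))  # {'class': 'features'}
```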