Commit
Drop invalid XML element attributes (#462)
* drop invalid XML element attributes

Fixes issue #375

The bug happened when we had a `:` in an element attribute that didn't match any XML namespace (invalid XML). In the example it was `padding:1px=""; margin:15px=""`

We can work around it by manually dropping those bad elements. I hope it doesn't impact performance too much.

To reproduce:
`trafilatura -u https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G --xml`

Minimal reproduction example:
```
echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b>
<li>Saves the cost of two dedicated phone lines.</li>
al station using Internet or cellular technology.</li>
<li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>
' | trafilatura --xml
```

* pin lxml to < 5

* syntax

---------

Co-authored-by: Adrien Barbaresi <[email protected]>
Co-authored-by: Adrien Barbaresi <[email protected]>
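
The workaround described above can be illustrated with a small standalone sketch. This is not the code from this commit: the function name `drop_invalid_attributes`, the `declared_prefixes` parameter, and the plain-dict representation of attributes are all assumptions made for illustration. It only shows the general idea of discarding attribute names that contain a `:` whose prefix is not a known namespace prefix.

```python
def drop_invalid_attributes(attrib, declared_prefixes=("xml",)):
    """Return a copy of `attrib` without names that use an undeclared prefix.

    Hypothetical helper: `attrib` is a plain dict of attribute names to
    values, and `declared_prefixes` lists the namespace prefixes that are
    considered valid (only the reserved `xml` prefix by default).
    """
    cleaned = {}
    for name, value in attrib.items():
        prefix, sep, _local = name.partition(":")
        # A colon with an unknown prefix is what makes the attribute invalid XML.
        if sep and prefix not in declared_prefixes:
            continue
        cleaned[name] = value
    return cleaned


# Attribute names loosely modelled on the ones reported in the issue.
attributes = {"style": "", "padding:1px": "", "margin:15px": ""}
print(drop_invalid_attributes(attributes))
```

Filtering on the attribute name alone keeps the check to one pass per element, which matters given the performance concern raised in the message.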