Commit
Fixes issue #375

The bug happened when an element attribute contained a `:` that didn't match any XML namespace (invalid XML). In the example it was `padding:1px=""; margin:15px=""`.

We can work around it by manually dropping those bad elements. I hope it doesn't impact performance too much.

To reproduce:
`trafilatura -u https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G --xml`

Minimal reproduction example:
```
echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li> ' | trafilatura --xml
```
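The workaround described above can be illustrated with a minimal sketch. It is not the code from this commit: it uses only the standard library's `xml.etree.ElementTree`, the helper name `strip_unbound_prefix_attributes` is hypothetical, and it drops only the offending attributes rather than whole elements, which may differ from what the actual fix does.

```python
import xml.etree.ElementTree as ET


def strip_unbound_prefix_attributes(root):
    """Hypothetical helper: delete attributes whose name contains a ':'
    that is not part of a resolved namespace. Returns the number removed."""
    removed = 0
    for element in root.iter():
        # Copy the keys first, since we delete from the mapping while looping.
        for name in list(element.attrib):
            # ElementTree stores properly namespaced names as '{uri}local',
            # so a literal ':' in any other name is an unbound prefix.
            if ":" in name and not name.startswith("{"):
                del element.attrib[name]
                removed += 1
    return removed


# Build a tree that mimics the broken markup from the report.
ul = ET.Element("ul", {"style": "", "padding:1px": "", "margin:15px": ""})
ET.SubElement(ul, "li").text = "Saves the cost of two dedicated phone lines."

removed = strip_unbound_prefix_attributes(ul)
serialized = ET.tostring(ul, encoding="unicode")

# After cleaning, the serialized output parses as XML again.
reparsed = ET.fromstring(serialized)
```

The extra pass visits every element and attribute once, so its cost grows linearly with the size of the tree, which is the performance trade-off the commit message mentions.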