Issue with LXML on M1 / Apple arm64 platforms #166
Comments
Hi @naftalibeder, all tests pass except for the dev version of Python 3.11, which is experimental: https://github.com/adbar/trafilatura/actions/runs/1730002541 I don't understand what is happening here. You're using Python 3.9.9? On which OS? Can you try
Very odd! I’m running this on an M1 Mac. I wouldn’t have thought this would cause a problem, since I haven’t had any transition issues with other Python projects. But here’s the latest after some testing:
I don’t know what the issue could be (especially since the actual software mostly works right), and I may look into it. I’ll also see if I can run this through the x86 emulator.
This is really strange. It could have something to do with lxml and its underlying XML library, but I'm not sure. Please keep me posted if you find an explanation.
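When comparing machines, it can help to record which architecture the interpreter is running on and which libxml2 build lxml is linked against. The following is only a sketch of such a check: the function name `describe_environment` is made up for illustration, and it assumes lxml is installed in the active environment.

```python
import platform

from lxml import etree


def describe_environment():
    """Collect architecture and lxml/libxml2 version details (illustrative helper)."""
    return {
        # "arm64" when running natively on Apple Silicon, "x86_64" under Rosetta
        "machine": platform.machine(),
        "python": platform.python_version(),
        # Version tuples exposed by lxml itself
        "lxml": etree.LXML_VERSION,
        "libxml2_runtime": etree.LIBXML_VERSION,
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this on both a working and a failing machine and comparing the output may show whether the two are using different libxml2 builds.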
I’ve narrowed it down to the offending line, and it looks like you were right.
I’ve tried a variety of tweaks to the parser configuration, with no meaningful effect. Given that almost all of the tests pass, my vague hypothesis is that the HTML of this particular webpage is invalid, and for whatever reason the underlying library on my M1 is more sensitive to that. (But I don’t have a good understanding of this stack.) Basic googling turned up nothing. I would be happy to try any ideas you might have! Please let me know what you think.
Could you try the underlying library LXML alone on the problem at hand? You open the file, load it, and try to perform an operation on the tree, here is the gist of a possible test:
This should print a number higher than 1 and a list of nodes in the document. Could you please try it out?
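For readers following along, a test of this kind might look like the sketch below. It is not the snippet originally posted in this thread; the function name `summarize_tree` and the command-line usage are assumptions made for illustration. It only relies on lxml's `html.fromstring` and element iteration, with no Trafilatura code involved.

```python
import sys

from lxml import html


def summarize_tree(raw_html):
    """Parse HTML with lxml alone and return (element count, list of tag names)."""
    tree = html.fromstring(raw_html)
    # iter("*") walks the root and all descendants, skipping comments
    tags = [element.tag for element in tree.iter("*")]
    return len(tags), tags


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass the path of the problematic HTML file
    with open(sys.argv[1], "rb") as handle:
        count, tags = summarize_tree(handle.read())
    print(count)
    print(tags)
```

On a healthy setup, a normal web page should give a count well above 1. A count of 1, or an error, would point at the parser rather than at Trafilatura.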
Sure enough, it's reproducible in a minimal project: https://github.com/naftalibeder/example-lxml. I tried messing around with the offending HTML file to see which part of it leads to the corruption, but I wasn't able to figure that out easily. I submitted a bug report at https://bugs.launchpad.net/lxml/+bug/1959358.
Thanks, let's follow the resolution of the issue there.
@naftalibeder It doesn't appear to be moving forward. Did you try building LXML from source?
Please note that brew can now be used to install Trafilatura on macOS in a seamless way:
On a clean install on the master branch, metadata_tests.py and realworld_tests.py fail. Please see the full clone/install/test flow and output below.
Shell output