'lxml.etree._Element' object has no attribute 'text_content' #319
Comments
Hi @asjsrep, I cannot reproduce the bug, and on a conceptual level this error should not happen, so I do not know what is going on here. Do you have more details to share?
@adbar sure, I'm running it on a Mac M2 with Python 3.10.8. This is the pip freeze in a new virtual environment:
Is there any specific debugging data which would be useful?
I've had a chance to look into this in more detail now. The error was being raised in the lxml Changing: to
i.e. forcing the UTF-8 encoding seems to fix the issue, but I haven't tested it widely. I'm sure there's a better place to correct the character encoding, but I'm not very familiar with the codebase or Python.
Thanks for the additional information. The document has already been parsed once at this stage, so I don't quite understand how the problem can arise. Unfortunately the underlying parser (LXML) is known to have trouble with Mac M1/M2 systems (see #166) and your issue could be related to that. It should be fixed soon (with version 5). In the meantime we'll keep track of the issue.
A workaround for lxml==4.9.2 on M1/M2/arm64 Macs seems to be building a specific wheel for lxml before installing trafilatura
Yes, there are already temporary fixes. Could you try one of them and report whether it solves the problem?
Yes, building lxml with
Nice, so it's more a documentation issue, unless LXML v5 gets released soon.
Hi, I am getting the same error on an M1 MacBook. I have tried the steps above but they don't seem to work. I have attached my pip freeze.
I am using Python 3.11.2. Are there any other steps I can take to resolve this issue? Thanks!
@SnowstormAI not sure if it'll work, but try deleting the pip cache and then building the wheel for lxml
Hi, thanks for the suggestion @asjsrep, but this doesn't seem to work either. Are there other workarounds I can try out?
@SnowstormAI did you try editing the trafilatura code directly? For me this also worked (but obviously isn't ideal). Changing: to return fromstring(doc.summary().encode('UTF-8'), parser=HTML_PARSER). Or, though I haven't tried it myself, running your code inside a Docker container might be a workaround.
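The edit suggested above can be tried in isolation. The following is only a minimal sketch of the idea (encode the markup to UTF-8 bytes before handing it to lxml), not trafilatura's actual code: `parser` stands in for trafilatura's HTML_PARSER, whose exact configuration is assumed here, and `parse_summary` and the sample markup are hypothetical names made up for illustration.

```python
from lxml import html

# Assumption: a stand-in for trafilatura's HTML_PARSER; the real
# parser may be configured with additional options.
parser = html.HTMLParser(encoding="utf-8")


def parse_summary(markup: str) -> html.HtmlElement:
    """Hypothetical helper: parse markup after forcing UTF-8 bytes."""
    # Passing bytes instead of str is the workaround reported in this thread.
    return html.fromstring(markup.encode("utf-8"), parser=parser)


tree = parse_summary("<div><p>Hello <b>world</b></p></div>")
# text_content() exists on lxml.html elements, not on plain etree elements,
# so this call is a quick check that the right element class came back.
print(type(tree).__name__, tree.text_content())
```

If the call to `text_content()` succeeds, the parsed tree is made of lxml.html elements, which is what the failing code path expects.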
See also PR #331. The problem is that I cannot run automated tests on such devices and thus I don't know if/when the issues are solved. A new LXML version (v5) is pending and will hopefully solve this or the other problem.
@asjsrep I had the same issue on an M1 Mac, and updating the code as suggested above fixes it. Thank you!
@adbar are there any tests that you would like to be run on M1? I can run them and share the results if required. Alternatively, how about adding a specific check for M1/arm64 and applying the above change only there? It would be a band-aid, but it could serve as a workaround until the lxml v5 changes are done.
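The platform check proposed above could look roughly like this. It is a sketch under stated assumptions, not the fix that was merged: `is_apple_silicon` and `to_parser_input` are hypothetical helper names, and the detection relies only on the standard library's `platform` module.

```python
import platform
from typing import Optional, Union


def is_apple_silicon(system: Optional[str] = None,
                     machine: Optional[str] = None) -> bool:
    """Hypothetical check for macOS on arm64 (M1/M2)."""
    # Arguments default to the running platform; they can be overridden
    # so the logic can be exercised on any machine.
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


def to_parser_input(markup: str,
                    apple_silicon: Optional[bool] = None) -> Union[str, bytes]:
    """Hypothetical helper: apply the UTF-8 workaround only where needed."""
    if apple_silicon is None:
        apple_silicon = is_apple_silicon()
    # On Apple Silicon, hand the parser UTF-8 bytes; elsewhere keep the str.
    return markup.encode("utf-8") if apple_silicon else markup
```

Limiting the workaround to one platform keeps behavior unchanged everywhere else, at the cost of a code path that cannot be covered by CI that lacks such hardware.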
Thank you everyone! This seems to fix my issue, but it isn't an ideal long-term solution. Do we have an ETA on when lxml v5 will roll out?
This issue can be closed as it isn't directly connected to Trafilatura code. Thanks again for everyone's help!
Dear all, thanks for your feedback! As far as I know it is not possible to test the software on Apple Silicon with GitHub Actions, which is the CI/CD solution I use (feel free to suggest ideas if you know a way). But since there are concordant reports I am filing it as a bug. I have no idea when LXML v5 will be released, so I am planning to edit and accept PR #331 accordingly.
I chose to apply two different fixes. However, I cannot reproduce the bug, so I cannot be sure both of them are necessary. Please get in touch if problems persist.
Extraction of the following URL fails:
trafilatura -u "https://buffer.com/resources/ai-content-creation/"
trafilatura version: 1.5.0