Update core.py #331
When I use a lambda function in Python like this one:
extract_content = lambda url: trafilatura.extract(trafilatura.fetch_url(url), output_format='html', include_links=True)
I encounter the following error:
AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'
Indeed, in the trafilatura library, the text_content() method is used to extract the text from the _Element object of lxml.etree. We can get the same result by using the etree.tostring() method with the argument method='text'.
So in the compare_extraction function of trafilatura/core.py, I replaced this line:
algo_text = trim(temppost_algo.text_content())
with this line:
algo_text = trim(etree.tostring(temppost_algo, method='text', encoding='utf-8').decode('utf-8'))
Don't forget to import etree in core.py:
from lxml import etree
Tested and working.
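The difference described above can be reproduced outside trafilatura. The following is a minimal sketch, assuming only that lxml is installed; the sample markup and variable names are made up for illustration and are not taken from core.py. It shows that elements parsed with lxml.html expose text_content(), that plain lxml.etree elements do not, and that etree.tostring(..., method='text') or itertext() work on both.

```python
from lxml import etree, html

# Illustrative markup, not taken from trafilatura.
markup = "<div><p>Hello <b>world</b>!</p></div>"

# lxml.html elements are HtmlElement instances and provide text_content().
html_elem = html.fromstring(markup)
# lxml.etree elements are plain _Element instances without that method.
plain_elem = etree.fromstring(markup)

has_text_content_html = hasattr(html_elem, "text_content")
has_text_content_plain = hasattr(plain_elem, "text_content")

# Serializing with method="text" works on any element type.
text_via_tostring = etree.tostring(
    plain_elem, method="text", encoding="utf-8"
).decode("utf-8")

# itertext() is another option that works on plain _Element objects.
text_via_itertext = "".join(plain_elem.itertext())

print(has_text_content_html, has_text_content_plain)
print(text_via_tostring)
print(text_via_itertext)
```

One caveat to keep in mind: etree.tostring() includes the element's tail text by default (with_tail=True), whereas text_content() does not, so the two may differ for elements that have trailing text inside a larger tree.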
Hi @Korben00, thanks. Don't mind the tests above, there is a problem with … Your code looks good, but I wonder whether the issue is related to LXML's behavior on Apple M1/M2. A new version is pending; could you check whether this solution fixes the bug for you: #319?
PS: THE Monsieur Korben, pleased to meet you ;)
Codecov Report
@@           Coverage Diff           @@
##           master     #331   +/-   ##
=======================================
  Coverage   96.47%   96.47%
=======================================
  Files          22       22
  Lines        3379     3380    +1
=======================================
+ Hits         3260     3261    +1
  Misses        119      119