
Update core.py #331

Merged: 4 commits merged on May 11, 2023


Korben00
Contributor

@Korben00 Korben00 commented Apr 23, 2023

When I use a lambda function in Python like this one:

extract_content = lambda url: trafilatura.extract(trafilatura.fetch_url(url), output_format='html', include_links=True)

I encounter the following error:

AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'

Indeed, trafilatura calls text_content() to extract the text of the element, but that method is provided by lxml.html elements and is missing on a plain lxml.etree._Element. The same result can be obtained with etree.tostring() and the argument method='text'.

So in the compare_extraction function of trafilatura/core.py, I replaced this line:

algo_text = trim(temppost_algo.text_content())

with this line:

algo_text = trim(etree.tostring(temppost_algo, method='text', encoding='utf-8').decode('utf-8'))

Don't forget to import etree in core.py:

from lxml import etree

Tested and working.
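To illustrate the replacement described above, here is a minimal sketch of extracting text from a plain etree element with tostring(method='text'). The helper name element_text and the sample markup are assumptions made for this example and are not part of trafilatura; the standard-library fallback is only there so the snippet runs where lxml is not installed.

```python
# Sketch only: element_text is a hypothetical helper, not trafilatura code.
try:
    from lxml import etree  # the library the fix targets
except ImportError:
    # Fallback so the sketch still runs without lxml installed;
    # the standard library offers the same tostring(method="text") call.
    import xml.etree.ElementTree as etree


def element_text(elem):
    """Serialize only the text nodes of elem and return them as a str."""
    raw = etree.tostring(elem, method="text", encoding="utf-8")
    return raw.decode("utf-8")


# A plain etree element: it has no text_content() method.
tree = etree.fromstring("<div><p>Hello <b>world</b>!</p><p>Second.</p></div>")
has_text_content = hasattr(tree, "text_content")
text = element_text(tree)
print(has_text_content, repr(text))
```

The encode/decode round trip mirrors the line proposed in this PR. Joining the pieces from tree.itertext() should give the same string for an element without tail text.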

@adbar
Owner

adbar commented Apr 24, 2023

Hi @Korben00, thanks. Don't mind the tests above; there is a problem with httpbin.org at the moment.

Your code looks good, but I wonder whether the issue is related to LXML's behavior on Apple M1/M2. A new version is pending; could you check whether this solution fixes the bug for you: #319?

PS: THE Monsieur Korben, pleased to meet you ;)

@adbar adbar added the "feedback" label (Feedback from users requested) Apr 25, 2023
@codecov-commenter

codecov-commenter commented May 11, 2023

Codecov Report

Merging #331 (5cfa9df) into master (d66cc7c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 5cfa9df differs from pull request most recent head 6785ee7. Consider uploading reports for the commit 6785ee7 to get more accurate results


@@           Coverage Diff           @@
##           master     #331   +/-   ##
=======================================
  Coverage   96.47%   96.47%           
=======================================
  Files          22       22           
  Lines        3379     3380    +1     
=======================================
+ Hits         3260     3261    +1     
  Misses        119      119           

| Impacted Files | Coverage Δ |
|---|---|
| trafilatura/core.py | 98.10% <100.00%> (+<0.01%) ⬆️ |

@adbar adbar linked an issue May 11, 2023 that may be closed by this pull request
@adbar
Owner

adbar commented May 11, 2023

@Korben00 I applied a variant of your fix and added a line suggested in #319. However, I cannot reproduce the bug, so I cannot be sure both fixes are necessary. Please get in touch if you still run into errors.

@adbar adbar merged commit 27d7b3f into adbar:master May 11, 2023

Successfully merging this pull request may close these issues.

'lxml.etree._Element' object has no attribute 'text_content'