Update core.py #331

Korben00 · 2023-04-23T15:58:49Z

When I use a lambda function in Python like this one:

extract_content = lambda url: trafilatura.extract(trafilatura.fetch_url(url), output_format='html', include_links=True)

I encounter the following error:

AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'

Indeed, trafilatura calls the text_content() method to extract the text of the element, but a plain lxml.etree._Element object does not provide that method. We can get the same result by using the etree.tostring() method with the argument method='text'.

So in the compare_extraction function of trafilatura/core.py, I replaced this line:

algo_text = trim(temppost_algo.text_content())

with this line:

algo_text = trim(etree.tostring(temppost_algo, method='text', encoding='utf-8').decode('utf-8'))

Don't forget to import etree in core.py:

from lxml import etree

Tested and working.
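As a rough illustration of the equivalence described above (not the code that was merged into trafilatura), the sketch below parses the same fragment once with lxml.html and once with lxml.etree. The fragment and the helper name `element_text` are made up for this example.

```python
from lxml import etree, html

# Hypothetical sample markup, used only for this illustration.
fragment = "<div><p>Hello <b>world</b></p><p>again</p></div>"

# Parsed with lxml.html: the element is an HtmlElement and has text_content().
html_elem = html.fromstring(fragment)

# Parsed with lxml.etree: a plain _Element, which lacks text_content().
plain_elem = etree.fromstring(fragment)


def element_text(elem):
    """Serialize only the text nodes of an element (hypothetical helper)."""
    return etree.tostring(elem, method="text", encoding="utf-8").decode("utf-8")


print(hasattr(plain_elem, "text_content"))   # the plain element lacks the method
print(element_text(plain_elem))              # text obtained via etree.tostring()
print(html_elem.text_content() == element_text(plain_elem))
```

Because `etree.tostring(..., method='text')` works on both element types, it avoids depending on which parser produced the tree.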


adbar · 2023-04-24T11:05:59Z

Hi @Korben00, thanks. Don't mind the tests above; there is a problem with httpbin.org at the moment.

Your code looks good, but I wonder whether the issue is related to LXML's behavior on Apple M1/M2. A new version is pending; could you check whether the solution in #319 fixes the bug for you?

PS: THE Monsieur Korben, pleased to meet you ;)

codecov-commenter · 2023-05-11T09:17:35Z

Codecov Report

Merging #331 (5cfa9df) into master (d66cc7c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 5cfa9df differs from pull request most recent head 6785ee7. Consider uploading reports for the commit 6785ee7 to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@           Coverage Diff           @@
##           master     #331   +/-   ##
=======================================
  Coverage   96.47%   96.47%           
=======================================
  Files          22       22           
  Lines        3379     3380    +1     
=======================================
+ Hits         3260     3261    +1     
  Misses        119      119

| Impacted Files | Coverage Δ | |
|---|---|---|
| trafilatura/core.py | `98.10% <100.00%> (+<0.01%)` | ⬆️ |

adbar · 2023-05-11T09:42:24Z

@Korben00 I applied a variant of your fix and added a line suggested in #319. However, I cannot reproduce the bug, so I cannot be sure both fixes are necessary. Please get in touch if you still run into errors.

Merge branch 'master' into master

5cfa9df

adbar added the feedback label (Feedback from users requested) Apr 25, 2023

adbar mentioned this pull request Apr 27, 2023

'lxml.etree._Element' object has no attribute 'text_content' #319

Closed

simplify code

a0f1d3d

adbar linked an issue May 11, 2023 that may be closed by this pull request

'lxml.etree._Element' object has no attribute 'text_content' #319

Closed

apply both possible fixes for safety

6785ee7

adbar merged commit 27d7b3f into adbar:master May 11, 2023
