Pass on encoding knowledge in HTTP layer to Beautiful Soup #49

Open: wants to merge 1 commit into master


orangain

Requests's response.encoding holds the encoding extracted from the Content-Type HTTP header. Passing that encoding to Beautiful Soup improves its encoding detection even when the chardet package is not installed: Beautiful Soup can try the HTTP-level encoding first, then fall back to an encoding extracted from the HTML content.

Note that response.encoding is None when no encoding is specified in the Content-Type header. In that case, the behavior of Beautiful Soup does not change.
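
The idea can be sketched roughly as follows. This is an illustration only, not the actual diff of this pull request: build_soup_kwargs is a hypothetical helper name, while features and from_encoding are keyword arguments of the BeautifulSoup constructor. When the HTTP layer reports no encoding, from_encoding is simply left out, so Beautiful Soup behaves as before.

```python
def build_soup_kwargs(parser, http_encoding):
    """Return keyword arguments for a BeautifulSoup(markup, **kwargs) call.

    parser        -- parser name, e.g. 'html.parser'
    http_encoding -- value of Requests's response.encoding, or None

    Hypothetical helper for illustration; not part of RoboBrowser.
    """
    kwargs = {'features': parser}
    if http_encoding is not None:
        # Beautiful Soup tries from_encoding before its own detection.
        kwargs['from_encoding'] = http_encoding
    return kwargs


# Intended use (needs requests and beautifulsoup4, so left as a comment):
#   soup = BeautifulSoup(response.content,
#                        **build_soup_kwargs('html.parser', response.encoding))
```

Passing the raw bytes (response.content) rather than already-decoded text matters here, because from_encoding only applies when Beautiful Soup does the decoding itself.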

Before

Beautiful Soup fails to detect the encoding, and mojibake occurs.

$ python google-jp.py
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
���{Python���[�U��
/url?q=http://www.python.jp/&sa=U&ved=0CBQQFjAAahUKEwiehOWOlLfHAhXBK6YKHdSRDb8&usg=AFQjCNELqejY_005Ae42b32WasF4ZxfcwA
Python - Wikipedia
/url?q=https://ja.wikipedia.org/wiki/Python&sa=U&ved=0CB8QFjABahUKEwiehOWOlLfHAhXBK6YKHdSRDb8&usg=AFQjCNG0M-PAefbKlN35PwDKmLrobJrrEw
Welcome to Python.org
/url?q=https://www.python.org/&sa=U&ved=0CCkQFjACahUKEwiehOWOlLfHAhXBK6YKHdSRDb8&usg=AFQjCNGmnVbDknSqhbM0lNPMg1-OOCl-XQ
���S�҂ł��قږ�����Python��׋��ł���R���e���c10�I - paiza�J�� ...
/url?q=http://paiza.hatenablog.com/entry/2015/04/09/%25E5%2588%259D%25E5%25BF%2583%25E8%2580%2585%25E3%2581%25A7%25E3%2582%2582%25E3%2581%25BB%25E3%2581%25BC%25E7%2584%25A1%25E6%2596%2599%25E3%2581%25A7Python%25E3%2582%2592%25E5%258B%2589%25E5%25BC%25B7%25E3%2581%25A7%25E3%2581%258D%25E3%2582%258B%25E3%2582%25B3%25E3%2583%25B3%25E3%2583%2586%25E3%2583%25B3%25E3%2583%258410&sa=U&ved=0CC8QFjADahUKEwiehOWOlLfHAhXBK6YKHdSRDb8&usg=AFQjCNH3VelkfeyP09b_NqaHGWT03sYAUA
...

Note that www.google.co.jp puts an encoding in the HTTP Content-Type header, but its HTML content does not declare one (for example, via <meta charset="...">).
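
A small standard-library sketch of why the header matters in this situation. It assumes the response body is Shift_JIS, which the mojibake above is consistent with but which is not stated in this pull request; the header value and the sample markup are made up for illustration, and charset_from_content_type is a hypothetical helper, not a Requests or Beautiful Soup function.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


# Sample body with no <meta charset="...">, encoded as Shift_JIS (assumption).
body = '<title>日本Pythonユーザ会</title>'.encode('shift_jis')

# With the charset from the header, the bytes decode cleanly.
charset = charset_from_content_type('text/html; charset=Shift_JIS')
decoded = body.decode(charset)

# Without it, a wrong guess such as UTF-8 yields replacement characters.
garbled = body.decode('utf-8', errors='replace')
```

Because the markup itself carries no encoding declaration, the header is the only reliable hint available to the parser in this case.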

After

The correct encoding is detected.

$ python google-jp.py
日本Pythonユーザ会
/url?q=http://www.python.jp/&sa=U&ved=0CBQQFjAAahUKEwj3mIjFk7fHAhVBIaYKHUkLBR4&usg=AFQjCNELqejY_005Ae42b32WasF4ZxfcwA
Python - Wikipedia
/url?q=https://ja.wikipedia.org/wiki/Python&sa=U&ved=0CB8QFjABahUKEwj3mIjFk7fHAhVBIaYKHUkLBR4&usg=AFQjCNG0M-PAefbKlN35PwDKmLrobJrrEw
Welcome to Python.org
/url?q=https://www.python.org/&sa=U&ved=0CCkQFjACahUKEwj3mIjFk7fHAhVBIaYKHUkLBR4&usg=AFQjCNGmnVbDknSqhbM0lNPMg1-OOCl-XQ
初心者でもほぼ無料でPythonを勉強できるコンテンツ10選 - paiza開発 ...
/url?q=http://paiza.hatenablog.com/entry/2015/04/09/%25E5%2588%259D%25E5%25BF%2583%25E8%2580%2585%25E3%2581%25A7%25E3%2582%2582%25E3%2581%25BB%25E3%2581%25BC%25E7%2584%25A1%25E6%2596%2599%25E3%2581%25A7Python%25E3%2582%2592%25E5%258B%2589%25E5%25BC%25B7%25E3%2581%25A7%25E3%2581%258D%25E3%2582%258B%25E3%2582%25B3%25E3%2583%25B3%25E3%2583%2586%25E3%2583%25B3%25E3%2583%258410&sa=U&ved=0CC8QFjADahUKEwj3mIjFk7fHAhVBIaYKHUkLBR4&usg=AFQjCNH3VelkfeyP09b_NqaHGWT03sYAUA
...

Environment

$ cat google-jp.py
from robobrowser import RoboBrowser

browser = RoboBrowser(parser='html.parser')
browser.open('https://www.google.co.jp/search?q=Python')

for a in browser.select('h3 > a'):
    print(a.text)
    print(a.get('href'))
$ pip freeze
Werkzeug==0.10.4
beautifulsoup4==4.4.0
requests==2.7.0
robobrowser==0.5.3
six==1.9.0

Requests's response.encoding holds the encoding extracted from the
Content-Type HTTP header. Passing that encoding to Beautiful Soup
improves its encoding detection. Beautiful Soup tries that encoding
first, then tries an encoding extracted from the HTML content.

Note that response.encoding is None when no encoding is specified
in the Content-Type header. In that case, the behavior of Beautiful
Soup does not change.