Can't get full text of articles for last few days #115

vincenzon · 2024-07-21T14:22:18Z

vincenzon
Jul 21, 2024

I have an automated process that searches for and downloads articles every few hours. As of July 19, 2024, it stopped getting the article text. I traced some examples, and the problem appears to be here:

https://github.com/ranahaani/GNews/blob/a322163a40a0db2294b68ab50b1a6243fb69d2d4/gnews/utils/utils.py#L25C15-L25C62

The Google News URL is supposed to be dereferenced to the original source URL, but that is not happening. If I manually decode the Google URL to the original source URL, things work as expected.

I'm unsure whether a change on the Google side or on my side broke this. For now I am inserting a base64 decode of the Google link into my processing pipeline. If there is a cleaner or more permanent fix, I'd like to hear it.

T3z3nis · 2024-07-22T16:22:59Z

T3z3nis
Jul 22, 2024

I have the same problem. How do you decode the link?

vincenzon · 2024-07-23T11:02:35Z

vincenzon
Jul 23, 2024
Author

I found this, I think through Stack Overflow.

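import base64
from urllib.parse import urlparse

def decode_google_news_url(source_url):
    """Return the source URL embedded in a Google News article link, or the input unchanged."""
    url = urlparse(source_url)
    path = url.path.split('/')
    # Only handle links shaped like news.google.com/.../articles/<base64 payload>.
    if (
        url.hostname == "news.google.com" and
        len(path) > 1 and
        path[-2] == "articles"
    ):
        base64_str = path[-1]
        # Append padding in case the payload's trailing '=' characters were stripped.
        decoded_bytes = base64.urlsafe_b64decode(base64_str + '==')
        decoded_str = decoded_bytes.decode('latin1')

        # Strip the fixed 3-byte prefix if present.
        prefix = bytes([0x08, 0x13, 0x22]).decode('latin1')
        if decoded_str.startswith(prefix):
            decoded_str = decoded_str[len(prefix):]

        # Strip the fixed 3-byte suffix if present.
        suffix = bytes([0xd2, 0x01, 0x00]).decode('latin1')
        if decoded_str.endswith(suffix):
            decoded_str = decoded_str[:-len(suffix)]

        # The first remaining byte is a length prefix; a value >= 0x80 is
        # treated as a two-byte length field, so two bytes are skipped.
        bytes_array = bytearray(decoded_str, 'latin1')
        length = bytes_array[0]
        if length >= 0x80:
            decoded_str = decoded_str[2:length+1]
        else:
            decoded_str = decoded_str[1:length+1]

        return decoded_str
    else:
        return source_url

# Usage: n is assumed to be one article dict from the search results.
url = decode_google_news_url(n['url'])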

caiolivf · 2024-07-23T16:59:11Z

caiolivf
Jul 23, 2024

Same problem here! The article.title output is "Google News".

from gnews import GNews
google_news = GNews()
json_resp = google_news.get_news('Pakistan')
article = google_news.get_full_article(json_resp[0]['url'])  # a newspaper3k article instance; all newspaper3k attributes are available on it
article.title

# Google News
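
A minimal sketch of how a pipeline could flag this failure, assuming only the symptom reported in this thread (a title of "Google News") and that an undereferenced link still points at news.google.com. The helper name `looks_unresolved` is hypothetical and is not part of GNews or newspaper3k.

```python
from urllib.parse import urlparse

def looks_unresolved(title, url):
    """Return True when a fetched article appears to be the Google News page itself.

    Hypothetical helper: it relies only on the symptom described in this thread.
    """
    host = urlparse(url).hostname or ""
    # Either signal alone is treated as "not dereferenced".
    return (title or "").strip() == "Google News" or host == "news.google.com"
```

A caller could skip or re-queue any article for which this returns True instead of storing an empty text.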

Isaaq-Khader · 2024-07-25T04:12:59Z

Isaaq-Khader
Jul 25, 2024

I also have the same problem with the articles. I tried running the base64 decoder mentioned here, but it gives me what looks like a random series of characters. I wonder whether Google changed the length of the article links and that is breaking the decoding. Either way, this seems to be a decoding issue that results in no article text and a title of "Google News".

Example:

    source_url = 'https://news.google.com/rss/articles/CBMiWkFVX3lxTE80Y0I5WjZtTlBBcTJYM2hVTkN1R0oxd0JLQk9tUHFlV3pKRVFsZzk2RnRETnd5RmJuOVdQTlM5VG1tYlQyMmNvenpRN0FNcndZdm4xdnJ3Qk90UdIBX0FVX3lxTE05ekJYblBXZkJGQ0gwRGgwaXcyeDJkMERSRWxvTkpfN09YNDdiX295N3g2UlVBbFhUUkoxVzVKeU12Nk1yQ1o5UThMTnJhOTZ0T1FWV19ta1p6SnBZajlr?oc=5&hl=en-US&gl=US&ceid=US:en'
    print(decode_google_news_url(source_url))

Output:
AU_yqLO4cB9Z6mNPAq2X3hUNCuGJ1wBKBOmPqeWzJEQlg96FtDNwyFbn9WPNS9TmmbT22cozzQ7AMrwYvn1vrwBOtQ

When it should link to: https://www.bbc.com/news/articles/ce58p0048r0o
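
The output above is an opaque token rather than a URL. A minimal sketch, assuming only what this example shows, of how a caller could detect that case and fall back to another method. The name `decoded_is_url` is hypothetical.

```python
from urllib.parse import urlparse

def decoded_is_url(decoded):
    """Return True if the decoder's output looks like an http(s) URL."""
    parts = urlparse(decoded)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

With this check, the token printed above is rejected while the expected BBC link is accepted.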

phytal · 2024-07-25T05:46:54Z

phytal
Jul 25, 2024

Having the same encoding issue :(

ranahaani · 2024-07-25T06:19:57Z

ranahaani
Jul 25, 2024
Maintainer

@vincenzon Thanks for this. Can you please create a PR for this patch?

sif-gondy · 2024-07-25T07:58:43Z

sif-gondy
Jul 25, 2024

The fix from @vincenzon initially worked for a couple of days, but I am now getting the same output as @Isaaq-Khader for links.

bckenstler · 2024-07-25T08:06:07Z

bckenstler
Jul 25, 2024

Same!

sif-gondy · 2024-07-25T11:20:06Z

sif-gondy
Jul 25, 2024

Possible workaround found here: https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=4500912

Source code: https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=5132769#gistcomment-5132769

xiyuanHou · 2024-07-25T16:26:15Z

xiyuanHou
Jul 25, 2024

You can try the decoder function from https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=5132769#gistcomment-5132769

It works for me.

Isaaq-Khader · 2024-07-26T02:43:41Z

Isaaq-Khader
Jul 26, 2024

That worked for me! I have it in my code now and it allows me to fetch the articles as before. Hopefully, this is a nice, permanent fix. Thank you guys for sharing :)

TomoyaKuroda · 2024-07-29T18:45:13Z

TomoyaKuroda
Jul 29, 2024

Great! Can someone make a pull request for this issue?

jun0-ds · 2024-08-02T02:47:11Z

jun0-ds
Aug 2, 2024

I assumed Google had blocked the base64 decoding approach, so I solved it another way.

I get the original URL from Selenium's current_url:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


def get_driver():
    # Headless Chrome; webdriver_manager installs a chromedriver for it.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )

    return driver

def url_changes(old_url):
    # Wait condition: true once the browser has navigated away from old_url.
    def predicate(driver):
        return driver.current_url != old_url

    return predicate

def original_url_selenium(url: str, driver) -> str:
    driver.get(url)

    # Wait up to 5 seconds for the redirect away from the Google News link.
    WebDriverWait(driver, 5).until(url_changes(url))

    return driver.current_url


driver = get_driver()
url = "https://news.google.com/rss/articles/CBMifEFVX3lxTE9OVFhuaUsxTTFkZ3J6RURXNHRfOVRfMUp4aUQyV19FbXFITTRhQlQyTG9yd2lwb2lTWUY5cGU3YnV2R0JfbEVUZGhRWDN3cVluYTR1eFNRNDhzeUVJUHJZc196Zkxmb0t5U05veURCaDNoc0dMYXlTcWRzWng?oc=5&hl=en-US&gl=US&ceid=US:en"

original_url = original_url_selenium(url, driver)

edmundman · 2024-09-01T09:32:24Z

edmundman
Sep 1, 2024

Seems like this is happening again

sif-gondy · 2024-09-02T08:55:02Z

sif-gondy
Sep 2, 2024

New solution available here for the decoding. I tested it, and it seems to solve the issue.

neeley-pate · 2024-09-03T17:21:31Z

neeley-pate
Sep 3, 2024

How do you resolve the 429 timeout issues when decoding the URLs?
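
HTTP 429 is the "Too Many Requests" status, so one common approach is to retry with an increasing delay. The following is a minimal sketch of that idea, not a fix confirmed in this thread. It assumes the decoding call raises `urllib.error.HTTPError` on a 429 (a decoder built on another HTTP library would raise a different exception type), and `call_with_backoff` is a hypothetical name.

```python
import time
import urllib.error

def call_with_backoff(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not rate limiting, or the final failure.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.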

Isaaq-Khader · 2024-10-09T18:52:36Z

Isaaq-Khader
Oct 9, 2024

In case anyone else comes across this issue, I found @SSujitX's solution to work like a charm. Although it is slower due to rate limiting, it's a great way to get news retrieval working again. Perhaps the package could be integrated into GNews so that others can get their articles?

https://gist.github.com/huksley/bc3cb046157a99cd9d1517b32f91a99e?permalink_comment_id=5224328#gistcomment-5224328

ManiacUrgency · 2024-10-09T21:43:59Z

ManiacUrgency
Oct 9, 2024

@Isaaq-Khader I love you man. Thank you for sharing @SSujitX's solution. I have been searching all over the internet.
