Alameda news doesn't detect multiple items listed under one date header #149

Open
Mr0grog opened this issue Nov 1, 2020 · 1 comment
Labels
bug Something isn't working news Related to scraping news (rather than data)


Mr0grog commented Nov 1, 2020

In the past, Alameda news items have always followed (roughly) one of two formats:

  1. <strong>date</strong> <a href="...">Title</a> <br> Summary

  2. <strong>date</strong> Title <br> Summary: <a href="...">English</a> | <a href="...">Spanish</a> | etc...

(Note the <br> Summary portion is optional. We added support for it in #113, but I’m not sure any items currently on the page still use it. ¯\_(ツ)_/¯)

Even if there were multiple news items on the same day, they’d each get their own entry with a date prefix like above. Here’s a screenshot of an example with two items on September 29th:

[Image: Screen Shot 2020-11-01 at 10 55 27 AM]

However, we now have some entries that are structured to list multiple news items under a single date heading (with a summary of all the items). The HTML looks like:

<strong>Date</strong>
<strong>Summary of two news items<br></strong>
<a href="url1">Title of Item 1</a><br>
<a href="url2">Title of Item 2</a>

Screenshot:

[Image: Screen Shot 2020-11-01 at 10 57 42 AM]

We currently fail to parse either of these two news items because the actual items look like part of the summary rather than the title, and we only pick up URLs from the title or from the language links at the end. It doesn’t cause an error; we just skip them because they look malformed (there are actually a few legitimately malformed/broken items on the page with no link at all, and we have to handle those gracefully).

So we need to figure out a way to distinguish this setup from either of the two earlier structures. Note that this isn’t the way it’s done in all cases. It only occurs once on the page so far, although I have no idea whether they’ll do this more in the future or not. (It’s becoming increasingly clear this page is probably hand-written in a WYSIWYG editor, rather than an HTML template being filled out from a list of news entries in a CMS. It’s messy.)
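As a rough illustration of one way to tell the three structures apart, here is a minimal sketch that uses only the standard library. It is not the scraper's actual parser: `FragmentTokenizer`, `classify`, and the returned labels are hypothetical names, and the heuristic (links separated by `<br>` means a grouped block, links separated by other text such as `|` means language links) is an assumption drawn from the single example above. It deliberately does not rely on the summary being wrapped in `<strong>`.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

Token = Tuple[str, str, Optional[str]]  # (kind, text, href)


class FragmentTokenizer(HTMLParser):
    """Reduce an HTML fragment to a flat list of strong/link/text/br tokens."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[Token] = []
        self._open: List[str] = []        # currently open <strong>/<a> tags
        self._href: Optional[str] = None  # href of the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.tokens.append(("br", "", None))
        elif tag in ("strong", "a"):
            self._open.append(tag)
            if tag == "a":
                self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if "a" in self._open:
            self.tokens.append(("link", text, self._href))
        elif "strong" in self._open:
            self.tokens.append(("strong", text, None))
        else:
            self.tokens.append(("text", text, None))


def tokenize(html: str) -> List[Token]:
    parser = FragmentTokenizer()
    parser.feed(html)
    parser.close()  # flush any trailing text
    return parser.tokens


def classify(html: str) -> str:
    """Guess which structure a date block uses (hypothetical labels)."""
    tokens = tokenize(html)
    links = [i for i, (kind, _, _) in enumerate(tokens) if kind == "link"]
    if not links:
        return "no_link"  # broken item; the caller should skip it quietly
    if len(links) == 1:
        # A lone link right after the date is a linked title; plain text
        # before it suggests a title followed by a single language link.
        plain_before = any(kind == "text" for kind, _, _ in tokens[:links[0]])
        return "language_links" if plain_before else "linked_title"
    # Several links: look at what separates the first two.
    between = tokens[links[0] + 1:links[1]]
    if any(kind == "br" for kind, _, _ in between):
        return "grouped"
    return "language_links"
```

Run against the grouped snippet above, `classify` should report `"grouped"`, while the two older formats should come back as `"linked_title"` and `"language_links"`. A block that mixes styles would need a per-link decision rather than one label for the whole block.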

Some thoughts:

  1. This either involves another massive restructuring of the parser (so that instead of being passed each potential news item to be parsed, it just parses a whole containing element of items, e.g. all the October news items), or the parser needs to be able to return lists of news items that we then flatten. Neither change is small. The first is probably more sound, while the second is at least a little smaller and more self-contained.

  2. Even though the <br> Summary portion is not currently used on the page, I worry it might come back at any time, so we should try to continue supporting it.

  3. The one example we have of this new pattern includes a summary in a <strong> tag, although I’m wary of depending on that as a hallmark of the pattern. We only have one example to go by so far.

  4. We should probably assume the list of items under a single date could be the current style, could be the style with language-specific links, or a mix.
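To make the second option in thought 1 concrete, here is a minimal sketch of a parser that always returns a list per date block, which the caller then flattens. Everything here is hypothetical: `NewsItem`, `DateBlock`, `parse_block`, and `parse_all` are illustrative names rather than the scraper's real types, the dates and URLs are made up, and the sketch assumes the links under each date have already been extracted as `(title, url)` pairs.

```python
from itertools import chain
from typing import Iterable, List, NamedTuple, Tuple


class NewsItem(NamedTuple):
    date: str
    title: str
    url: str


class DateBlock(NamedTuple):
    """Hypothetical pre-extracted block: one date plus the links under it."""
    date: str
    links: List[Tuple[str, str]]  # (title, url) pairs


def parse_block(block: DateBlock) -> List[NewsItem]:
    # Always return a list, even for zero or one item. A block with no
    # usable link yields [] instead of raising, so broken entries are
    # skipped gracefully.
    return [
        NewsItem(block.date, title, url)
        for title, url in block.links
        if url
    ]


def parse_all(blocks: Iterable[DateBlock]) -> List[NewsItem]:
    # Flatten the per-block lists into one list of news items.
    return list(chain.from_iterable(parse_block(block) for block in blocks))


# Illustrative data only: one grouped block, one single item, one broken item.
blocks = [
    DateBlock("October 30", [("Title of Item 1", "url1"), ("Title of Item 2", "url2")]),
    DateBlock("October 29", [("Single item", "url3")]),
    DateBlock("October 28", []),
]
items = parse_all(blocks)
```

With this shape, the existing single-item formats are just the one-element case, so callers would not need to know which structure a given date used.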


Mr0grog commented Nov 1, 2020

Also: this will need to build on #148, which fixes the Alameda scraper so it at least runs. A different messy change on the page is currently breaking the scraper entirely.
