(Note the <br> Summary portion is optional. We added support for it in #113, but I’m not sure any items currently on the page still use it. ¯\_(ツ)_/¯)
Even if there were multiple news items on the same day, they’d each get their own entry with a date prefix like above. Here’s a screenshot of an example with two items on September 29th:
However, we now have some entries that are structured to list multiple news items under a single date heading (with a summary of all the items). The HTML looks like:
<strong>Date</strong><strong>Summary of two news items<br></strong><a href="url1">Title of Item 1</a><br><a href="url2">Title of Item 2</a>
Screenshot:
We currently don’t succeed in parsing either of these two news items because the actual items look like part of the summary rather than the title, and we only pick up URLs from the title or from the language links at the end. It doesn’t cause an error; we just skip them because they look malformed (there are actually a few legitimately malformed/broken items on the page with no link at all, and we have to handle those gracefully).
So we need to figure out a way to distinguish this setup from either of the two earlier structures. Note that this isn’t the way it’s done in all cases. It only occurs once on the page so far, although I have no idea whether they’ll do this more in the future or not. (It’s becoming increasingly clear this page is probably hand-written in a WYSIWYG editor, rather than an HTML template being filled out from a list of news entries in a CMS. It’s messy.)
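One possible way to tell the layouts apart, as a rough sketch only: split each entry on <br> and count which resulting lines contain links. This assumes Python and uses only the standard library's html.parser; EntryTokenizer, split_lines, classify, and the label strings are hypothetical names for illustration, not anything in the existing scraper. The heuristic is also ambiguous for a date heading with only one item under it, so treat it as a starting point.

```python
from html.parser import HTMLParser


class EntryTokenizer(HTMLParser):
    """Flatten one news entry into (kind, value) tokens (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._strong_depth = 0
        self._href = None
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._strong_depth += 1
        elif tag == "br":
            self.tokens.append(("br", ""))
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []

    def handle_endtag(self, tag):
        if tag == "strong" and self._strong_depth:
            self._strong_depth -= 1
        elif tag == "a" and self._href is not None:
            text = "".join(self._link_text).strip()
            self.tokens.append(("link", (self._href, text)))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            # Text inside an <a href="..."> belongs to the link.
            self._link_text.append(data)
        elif data.strip():
            kind = "strong" if self._strong_depth else "text"
            self.tokens.append((kind, data.strip()))


def split_lines(tokens):
    """Group tokens into visual lines, using <br> as the separator."""
    lines = [[]]
    for kind, value in tokens:
        if kind == "br":
            lines.append([])
        else:
            lines[-1].append((kind, value))
    return lines


def classify(entry_html):
    """Guess which layout an entry uses (assumed heuristic, not verified
    against the real page)."""
    tokenizer = EntryTokenizer()
    tokenizer.feed(entry_html)
    tokenizer.close()
    lines = split_lines(tokenizer.tokens)
    link_counts = [sum(1 for kind, _ in line if kind == "link") for line in lines]
    lines_with_links = sum(1 for count in link_counts if count)
    if lines_with_links == 0:
        return "malformed"       # no link at all: skip gracefully
    if lines_with_links > 1:
        return "multi_item"      # links on several <br>-separated lines
    if link_counts[0] == 1:
        return "title_link"      # the title itself is the link
    return "language_links"      # links trail the title or summary
```

This deliberately does not rely on the summary being wrapped in <strong>, since there is only one example of that so far.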
Some thoughts:
This either involves another massive restructuring of the parser (so that instead of being passed each potential news item to be parsed, it just parses a whole containing element of items, e.g. all the October news items), or the parser needs to be able to return lists of news items that we then flatten. Neither change is small. The first is probably more sound, while the second is at least a little smaller and more self-contained.
Even though the <br> Summary portion is not currently used on the page, I worry it might come back at any time, so we should try to continue supporting it.
The one example we have of this new pattern includes a summary in a <strong> tag, although I’m wary of depending on that as a hallmark of the pattern. We only have one example to go by so far.
We should probably assume the list of items under a single date could be the current style, could be the style with language-specific links, or a mix.
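To make the second option above concrete, here is a minimal sketch of a per-entry parser that returns a list of items, which the caller then flattens. It assumes Python with only the standard library; NewsItem, LinkCollector, parse_entry, and parse_page are hypothetical names, not the scraper's real API. As a simplification it treats every link as its own news item, so the language-specific link style would still need separate handling.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from itertools import chain
from typing import List


@dataclass
class NewsItem:
    date: str
    title: str
    url: str


class LinkCollector(HTMLParser):
    """Collect the first <strong> text as the date, plus every link."""

    def __init__(self):
        super().__init__()
        self.date = None
        self.links = []  # (href, link text) in document order
        self._in_strong = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False
        elif tag == "a" and self._href:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._href:
            self._text.append(data)
        elif self._in_strong and self.date is None and data.strip():
            # Only the first <strong> counts as the date; a later
            # <strong> summary is ignored.
            self.date = data.strip()


def parse_entry(entry_html: str) -> List[NewsItem]:
    """Return zero, one, or several items for a single entry."""
    collector = LinkCollector()
    collector.feed(entry_html)
    collector.close()
    if collector.date is None:
        return []
    return [NewsItem(collector.date, title, url) for url, title in collector.links]


def parse_page(entries: List[str]) -> List[NewsItem]:
    """Flatten the per-entry lists into one list of news items."""
    return list(chain.from_iterable(parse_entry(entry) for entry in entries))
```

Malformed entries with no link simply contribute an empty list, so they drop out during flattening without raising an error.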
Also: this will need to build on #148, which fixes the Alameda scraper so it at least runs. A different messy change on the page is currently breaking the scraper entirely.
In the past, Alameda news items have always followed (roughly) one of two formats:
<strong>date</strong> <a href="...">Title</a> <br> Summary
<strong>date</strong> Title <br> Summary: <a href="...">English</a> | <a href="...">Spanish</a> | etc...