KeyError [b'frontier'] on Request Creation from Spider #401

Open
dkipping opened this issue Jul 21, 2020 · 3 comments

Issue might be related to #337

Hi,

I have already read in discussions here that the scheduling of requests should be done by frontera, and apparently even the creation should happen in the frontier rather than in the spider.
However, the scrapy and frontera documentation both say that requests shall be yielded in the spider's parse function.

What should the process look like if requests are to be created by the crawling strategy and not yielded by the spider? How does the spider trigger that?
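
For context, here is roughly what request creation inside the crawling strategy looks like, modeled on the BaseCrawlingStrategy API from frontera 0.8's documentation. This is a sketch, not this project's actual strategy: the spider only extracts links, and the strategy decides what gets scheduled.

from frontera.core.components import States
from frontera.strategy import BaseCrawlingStrategy


class BasicStrategy(BaseCrawlingStrategy):
    """Schedules every URL once; the spider never schedules anything."""

    def read_seeds(self, stream):
        # Seeds come from a stream handed to the strategy worker,
        # not from the spider's start_requests.
        for url in stream:
            request = self.create_request(url.strip())
            self.refresh_states([request])
            self._schedule_once(request)

    def filter_extracted_links(self, request, links):
        # First hook: drop links that should never reach the backend.
        return links

    def links_extracted(self, request, links):
        # Links yielded by the spider arrive here; the strategy,
        # not the spider, decides what gets queued.
        for link in links:
            self._schedule_once(link)

    def page_crawled(self, response):
        response.meta[b'state'] = States.CRAWLED

    def request_error(self, request, error):
        request.meta[b'state'] = States.ERROR

    def _schedule_once(self, request):
        if request.meta[b'state'] == States.NOT_CRAWLED:
            request.meta[b'state'] = States.QUEUED
            self.schedule(request)

With this split the spider just yields what it finds in its callbacks; frontera's scheduler feeds it whatever the strategy queued.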

In my use case, I am using scrapy-selenium with scrapy and frontera (I use SeleniumRequest to be able to wait for JS-loaded elements).

I have to generate the URLs I want to scrape in two phases: I first yield them in the start_requests() method of the spider instead of using a seeds file, and then yield requests for the extracted links in the first of two parse functions.

Yielding SeleniumRequests from start_requests works, but yielding SeleniumRequests from the parse function afterwards results in the following error (only an extract is pasted, as the generator frames repeat the same lines over and over):

return (_set_referer(r) for r in result or ())
  File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
    for r in iterable:
  File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
    frontier_request = response.meta[b'frontier_request']
KeyError: b'frontier_request'
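
The failing line shows the assumption behind the error: frontera's scheduler puts a b'frontier_request' key into request.meta when it schedules a request, and process_spider_output expects to find it again on the response, so a response whose request never went through the frontier cannot be post-processed. A hypothetical guard to confirm which responses bypassed frontera (a debugging aid only, not part of either library):

def parse1(self, response):
    # b'frontier_request' is attached to request.meta by frontera's
    # scheduler; response.meta proxies request.meta, so a missing key
    # means this response's request never went through the frontier.
    if b'frontier_request' not in response.meta:
        self.logger.warning('response for %s bypassed the frontier', response.url)
    # ... normal link extraction would continue here ...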

I would be very thankful for any hints and examples!

dkipping commented Jul 21, 2020

For reference, the yielding in the spider's start_requests (imports added for completeness):

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def start_requests(self):
    # ... manually getting some parameters from a webpage with selenium ...
    total = results_from_above
    pagination_start = 0
    while pagination_start < total:
        url = f'{self.my_start_url}&from={pagination_start}'
        pagination_start += self.pagination_size
        yield SeleniumRequest(
            url=url,
            callback=self.parse1,
            wait_time=5,
            wait_until=EC.visibility_of_element_located((By.XPATH, self.xpaths[1])),
        )

And yielding in the parse1 function:

def parse1(self, response):
    urls_to_follow = response.selector.xpath(self.xpaths[2]).extract()
    for url in urls_to_follow:
        yield SeleniumRequest(
            url=url,
            callback=self.parse2,
            wait_time=10,
            wait_until=EC.presence_of_element_located((By.XPATH, self.xpaths[3])),
        )

A custom crawling strategy has not really been necessary so far, as with this approach the link filtering already happens via the xpaths...
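
If the filtering ever has to move into frontera, filter_extracted_links would be the place for it. The strategy only sees extracted links as URL objects, not the page DOM, so the xpath selection would have to be re-expressed as a URL filter; url_pattern below is a hypothetical attribute, and the class builds on the BasicStrategy sketch above:

import re

class FilteringStrategy(BasicStrategy):
    # Hypothetical pattern: the strategy has no access to the DOM,
    # so xpath-based selection becomes URL-based selection here.
    url_pattern = re.compile(r'/detail/\d+')

    def filter_extracted_links(self, request, links):
        return [link for link in links if self.url_pattern.search(link.url)]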

dkipping commented:

Following the distributed quickstart in the documentation (except the seed injection step), I am monitoring the prints from the ZeroMQ broker:

2020-07-21 11:38:15 {'started': 1595324105.494039, 'spiders_out_recvd': 0, 'spiders_in_recvd': 0, 'db_in_recvd': 1, 'db_out_recvd': 0, 'sw_in_recvd': 1, 'sw_out_recvd': 0}

... which means that the spider is never registered by frontera? Could that be the point where it breaks? (And what could cause that? The configuration also mostly follows the general-spider example.)
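
For cross-checking the configuration, this is the scrapy-side wiring that frontera's documentation and scrapy-selenium's README prescribe (the middleware paths are the documented ones, the priorities follow the docs' examples, and the FRONTERA_SETTINGS target is a placeholder for the project's own module):

# settings.py (scrapy side)
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    # scrapy-selenium's middleware, per its README
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# Placeholder: points at the project's frontera settings module.
FRONTERA_SETTINGS = 'myproject.frontera_settings'

If the spider never shows up in the broker stats, the first thing to verify is that the SCHEDULER setting is actually in effect for the crawl.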

dkipping commented Jul 24, 2020

After some debugging I can say that at least the start_requests seem to be working properly; the issue arises from the requests yielded from the parse function.
