Project Status? #409

Open
psdon opened this issue Dec 7, 2020 · 8 comments

Comments

psdon commented Dec 7, 2020

It's been a year since the last commit on the master branch. Do you have any plans to maintain this? I noticed a lot of issues don't get resolved, and lots of PRs are still pending.

leopucci commented Jan 3, 2021

Same feeling here. Should I invest my time in using it? Bugs have been fixed on master, but no version has been released that includes those fixes.

getorca commented Jan 25, 2021

Also wondering the same.

@aryaniyaps

Any updates on this?

leopucci commented May 22, 2021 via email

@aryaniyaps

Thanks for the reply! I am considering moving on to some other library or implementing my own solution.

leopucci commented May 23, 2021 via email

aryaniyaps commented May 31, 2021

I ended up implementing my own distributed crawler based on this paper.
https://nlp.stanford.edu/IR-book/pdf/20crawl.pdf

It talks about creating a URL frontier that enqueues and manages URLs.
I would just like to give some tips to anyone who would look at this in the future.

When adapting this to Scrapy, the whole concept of "back queues" mentioned in the paper can be discarded: Scrapy already implements it in the downloader (more precisely, through "download slots"), so that part is taken care of. When scaling out, you might need to write your own downloader that keeps its slots in Redis, since the default Scrapy downloader stores slots in memory, which can be very inefficient.
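
For reference, here is a minimal sketch of the built-in settings that control those download slots; the values are illustrative, not recommendations:

```python
# settings.py -- the knobs behind Scrapy's download slots, which play
# the role of the paper's "back queues" (per-host politeness).
CONCURRENT_REQUESTS = 32            # global cap across all slots
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # concurrency within a single slot
DOWNLOAD_DELAY = 0.5                # per-slot politeness delay, in seconds
```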

That said, what we still need to implement is the "front queues", and the best place to do that is the scheduler.

Say you have N front queues: push each incoming request into one of the queues
according to its priority (if a request has a priority of 3, it is pushed into queue number 3).

When getting the next request, use a weighted random pick to choose one of the front
queues, then pop the first request in that queue. Each front queue must be a FIFO queue,
and the weights should be chosen so that important requests flow more frequently.
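
Here is a minimal sketch of such a scheduler, assuming in-memory deques and exponential weights; the class name, queue count, and weighting scheme are my own illustration, not from the paper (swap the deques for Redis lists in a distributed setup):

```python
# A sketch of the "front queues" idea as a Scrapy scheduler.
import random
from collections import deque


class FrontQueueScheduler:
    N_QUEUES = 5  # assumed number of priority levels

    def __init__(self):
        self.queues = [deque() for _ in range(self.N_QUEUES)]
        # Higher-priority queues get larger weights, so their
        # requests flow more frequently.
        self.weights = [2 ** i for i in range(self.N_QUEUES)]

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def enqueue_request(self, request):
        # Clamp the request's priority into a valid queue index.
        i = max(0, min(request.priority, self.N_QUEUES - 1))
        self.queues[i].append(request)  # FIFO: append at the tail
        return True

    def next_request(self):
        # Weighted random pick among the non-empty front queues.
        candidates = [i for i in range(self.N_QUEUES) if self.queues[i]]
        if not candidates:
            return None
        i = random.choices(candidates,
                           weights=[self.weights[j] for j in candidates])[0]
        return self.queues[i].popleft()  # FIFO: pop from the head

    def has_pending_requests(self):
        return any(self.queues)

    def open(self, spider):
        pass

    def close(self, reason):
        pass
```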

The next part is the dupefilter.
I store dupefilter keys in Redis and set them to expire after a certain amount of time.
If a request is already in the filter, I reject it.
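
A minimal sketch of that dupefilter, assuming redis-py and a Redis server on localhost; hashing only the URL is a simplification (Scrapy's own request fingerprinting also covers the method and body):

```python
# A Redis-backed dupefilter whose keys expire after a TTL.
import hashlib

import redis


class ExpiringRedisDupeFilter:
    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self.redis = redis.Redis()  # assumed local Redis instance
        self.ttl = ttl_seconds

    def request_seen(self, request):
        key = "dupefilter:" + hashlib.sha1(request.url.encode()).hexdigest()
        # SET with NX and EX is atomic: the key is only created if absent,
        # with a TTL attached, so the first caller "wins" and later callers
        # see the request as a duplicate until the key expires.
        is_new = self.redis.set(key, 1, nx=True, ex=self.ttl)
        return not is_new  # True -> duplicate, reject the request
```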

This gives a more scalable frontier.
I believe this is the concept Frontera is built around, though they've implemented it differently.

@davidsu-citylitics

@aryaniyaps great insight, thanks for sharing!
