Project Status? #409

Open
psdon opened this issue Dec 7, 2020 · 8 comments

Comments

psdon commented Dec 7, 2020

It's been a year since the last commit on the master branch. Do you have any plans to maintain this? I noticed a lot of issues don't get resolved, and lots of PRs are still pending.

leopucci commented Jan 3, 2021

Same feeling here. Should I invest my time in using it? Bugs have been fixed on master, but no version has been released that includes those fixes.

getorca commented Jan 25, 2021

Also wondering the same.

@aryaniyaps

Any updates on this?

leopucci commented May 22, 2021 via email

@aryaniyaps

Thanks for the reply! I am considering moving on to some other library or implementing my own solution.

leopucci commented May 23, 2021 via email

aryaniyaps commented May 31, 2021

I ended up implementing my own distributed crawler based on this paper.
https://nlp.stanford.edu/IR-book/pdf/20crawl.pdf

It talks about creating a URL frontier that enqueues and manages URLs.
I would just like to give some tips to anyone who would look at this in the future.

When adapting this to Scrapy, the whole concept of "back queues" mentioned in the paper can be discarded: Scrapy already implements it in the downloader (more precisely, through "download slots"), so that part is taken care of. When scaling out, you might need to write your own downloader that keeps its slots in Redis, since the default Scrapy downloader stores slots in memory, which can be very inefficient.
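
For reference, here is a minimal sketch of the built-in settings that control those download slots; the values are illustrative, not recommendations:

```python
# settings.py -- the knobs behind Scrapy's download slots, which play
# the role of the paper's "back queues" (per-host politeness).
CONCURRENT_REQUESTS = 32            # global cap across all slots
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # concurrency within a single slot
DOWNLOAD_DELAY = 0.5                # per-slot politeness delay, in seconds
```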

That said, what we still need to implement is the "front queues", and the best place to do that is the scheduler.

Say you have N front queues: push each incoming request into one of the queues
according to its priority (if a request has a priority of 3, it is pushed into queue number 3).

When getting the next request, use a weighted random pick to choose one of the front
queues, then pop the first request in that queue. Each front queue must be a FIFO queue,
and the weights should be chosen so that important requests flow more frequently.
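
Here is a minimal sketch of such a scheduler, assuming in-memory deques and exponential weights; the class name, queue count, and weighting scheme are my own illustration, not from the paper (swap the deques for Redis lists in a distributed setup):

```python
# A sketch of the "front queues" idea as a Scrapy scheduler.
import random
from collections import deque


class FrontQueueScheduler:
    N_QUEUES = 5  # assumed number of priority levels

    def __init__(self):
        self.queues = [deque() for _ in range(self.N_QUEUES)]
        # Higher-priority queues get larger weights, so their
        # requests flow more frequently.
        self.weights = [2 ** i for i in range(self.N_QUEUES)]

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def enqueue_request(self, request):
        # Clamp the request's priority into a valid queue index.
        i = max(0, min(request.priority, self.N_QUEUES - 1))
        self.queues[i].append(request)  # FIFO: append at the tail
        return True

    def next_request(self):
        # Weighted random pick among the non-empty front queues.
        candidates = [i for i in range(self.N_QUEUES) if self.queues[i]]
        if not candidates:
            return None
        i = random.choices(candidates,
                           weights=[self.weights[j] for j in candidates])[0]
        return self.queues[i].popleft()  # FIFO: pop from the head

    def has_pending_requests(self):
        return any(self.queues)

    def open(self, spider):
        pass

    def close(self, reason):
        pass
```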

The next part is the dupefilter.
I store dupefilter keys in Redis and set them to expire after a certain amount of time.
If a request is already in the filter, I reject it.
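
A minimal sketch of that dupefilter, assuming redis-py and a Redis server on localhost; hashing only the URL is a simplification (Scrapy's own request fingerprinting also covers the method and body):

```python
# A Redis-backed dupefilter whose keys expire after a TTL.
import hashlib

import redis


class ExpiringRedisDupeFilter:
    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self.redis = redis.Redis()  # assumed local Redis instance
        self.ttl = ttl_seconds

    def request_seen(self, request):
        key = "dupefilter:" + hashlib.sha1(request.url.encode()).hexdigest()
        # SET with NX and EX is atomic: the key is only created if absent,
        # with a TTL attached, so the first caller "wins" and later callers
        # see the request as a duplicate until the key expires.
        is_new = self.redis.set(key, 1, nx=True, ex=self.ttl)
        return not is_new  # True -> duplicate, reject the request
```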

This gives a more scalable frontier.
I believe this is the concept Frontera is built around, though they've implemented it differently.

@davidsu-citylitics

@aryaniyaps great insight, thanks for sharing!
