
Keyword BACKEND Meaning Inconsistent Between Spider and Workers #318

Open
grammy-jiang opened this issue Feb 2, 2018 · 2 comments
Comments

@grammy-jiang

Hi there,

I am working on Frontera these days, and Frontera is a great tool for cluster crawling!

But I still find some parts hard to understand because of gaps in the documentation. After reading and trying the settings described in the Cluster setup guide (Frontera 0.7.1 documentation), I noticed that the meaning of the keyword BACKEND is inconsistent between the spider and the workers:

  • in the spider, it means the message bus, which would normally be Kafka
  • in the workers (the DB worker and the strategy worker), it means the distributed storage, which would normally be HBase or SQLAlchemy in distributed mode
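To make the inconsistency concrete, here is a minimal sketch of the two settings modules on my cluster. The class paths are the ones from the 0.7 documentation; the Kafka and HBase locations are assumptions from my own setup:

```python
# --- spider_settings.py (loaded by the Scrapy spider processes) ---
# Here BACKEND is not storage at all: it is a proxy that forwards
# frontier calls over the message bus to the workers.
BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
KAFKA_LOCATION = 'localhost:9092'  # assumption: local Kafka broker

# --- worker_settings.py (loaded by the DB worker / strategy worker) ---
# Here the very same key names the actual distributed storage backend.
BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
HBASE_THRIFT_HOST = 'localhost'    # assumption: local HBase Thrift server
HBASE_THRIFT_PORT = 9090
```

So the same setting name selects a message-bus proxy in one process and a database in another, which is exactly what confused me.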

I do not understand the purpose of this design: the inconsistent meaning can mislead users when they set this keyword in spiders and workers.

Could anyone tell me the reason for this design? Or is it just a mistake?

@sibiryakov
Member

Hi @grammy-jiang, it's quite an interesting finding. The thing is that Frontera tries to be both a distributed and a non-distributed crawl frontier framework, and the backend became the place in the internal architecture that makes this possible: the storage backend is effectively moved to another process by means of MessageBusBackend.

You can find more information here: http://frontera.readthedocs.io/en/latest/topics/architecture.html#single-process

The second reason is historical: Frontera started as a non-distributed framework, and that left some architectural artefacts behind.

I agree this is misleading. You are welcome to propose your own way of organising these components to make them easier to understand and use.

@grammy-jiang
Author

@sibiryakov Thanks for your reply!

Emmm, I only use Frontera in cluster mode and did not read the other parts of the documentation carefully. Frontera is a fantastic framework for cluster crawling, but the documentation is not as clear as Scrapy's.

I am a heavy Scrapy user and have written some useful middlewares (both spider and downloader middlewares, with unit test cases), most of which are published on my GitHub page. I would like to contribute this code back to the community, but I do not know how. Would you please review my code and mentor me on how to contribute?
