ModuleNotFoundError: No module named 'frontera.contrib.scrapy.middlewares.seeds' #371

ghost opened this issue Jun 7, 2019 · 12 comments
ghost commented Jun 7, 2019

@sibiryakov Hi, thanks for your suggestion about Kafka, but I have already installed it on my machine. I intend to build a Kafka + HBase crawler.

I have a few questions. First, when I run these commands:
python -m frontera.utils.add_seeds --config tutorial.config.dbw --seeds-file seeds.txt

scrapy crawl tutorial -L INFO -s SPIDER_PARTITION_ID=0
I get this error:
ModuleNotFoundError: No module named 'frontera.contrib.scrapy.middlewares.seeds'

After I removed the seed-loader middleware, I can run Scrapy, but 0 pages are crawled:

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
        # removed: 'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 1,
    }

Besides that, my Kafka didn't consume any messages.

All my configuration follows the cluster setup guide in the documentation.

Regarding the Kafka problems: after I added the line

    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'

and removed 'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 1, from SPIDER_MIDDLEWARES, I got an error when starting the DB worker, the strategy worker, and the crawler (the traceback is quoted in my comment below).

My configuration files:
common.py

from __future__ import absolute_import
from frontera.settings.default_settings import MIDDLEWARES

MAX_NEXT_REQUESTS = 512
SPIDER_FEED_PARTITIONS = 2 # number of spider processes
SPIDER_LOG_PARTITIONS = 2 # worker instances
MIDDLEWARES.extend([
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware'
])

QUEUE_HOSTNAME_PARTITIONING = True
KAFKA_LOCATION = 'localhost:9092'
URL_FINGERPRINT_FUNCTION = 'frontera.utils.fingerprint.hostname_local_fingerprint'
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
SPIDER_LOG_TOPIC = 'frontier-done'
SPIDER_FEED_TOPIC = 'frontier-todo'
SCORING_TOPIC = 'frontier-score'

dbw.py

from __future__ import absolute_import
from .worker import *
LOGGING_CONFIG = 'logging-db.conf'

spider.py

from __future__ import absolute_import
from .common import *
BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
KAFKA_GET_TIMEOUT = 0.5
LOCAL_MODE = False  # by default Frontera is prepared for single process mode

sw.py

from __future__ import absolute_import
from .worker import *
CRAWLING_STRATEGY = 'frontera.strategy.basic.BasicCrawlingStrategy' # path to the crawling strategy class
LOGGING_CONFIG = 'logging-sw.conf'  # if needed

worker.py

from __future__ import absolute_import
from .common import *
BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
HBASE_DROP_ALL_TABLES = True
MAX_NEXT_REQUESTS = 2048
NEW_BATCH_DELAY = 3.0
HBASE_THRIFT_HOST = 'localhost' # HBase Thrift server host and port
HBASE_THRIFT_PORT = 9090

How I create the Kafka topics:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic frontier-done
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic frontier-todo
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic frontier-score
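
As a sanity check, the same CLI can confirm the partition count of each topic with --describe:

    kafka-topics.sh --describe --zookeeper localhost:2181 --topic frontier-todo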

I set the partition count to 2 in common.py:

SPIDER_FEED_PARTITIONS = 2 # number of spider processes
SPIDER_LOG_PARTITIONS = 2 # worker instances

How I watch the topics (console consumers):
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic frontier-done --from-beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic frontier-todo --from-beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic frontier-score --from-beginning

Versions of tools:
frontera 0.8.1
Scrapy 1.6.0
Python 3.7.3
Kafka 2.2.1

I think maybe the docs weren't updated for v0.8.1 and still describe v0.8.0.1. Should I downgrade Frontera to the stable version, v0.8? I would rather use the latest version, though.

Thanks in advance!

@Gallaecio (Member) commented:

Please use Stack Overflow to ask this type of question. See also https://stackoverflow.com/help/mcve and https://stackoverflow.com/help/how-to-ask

ghost (Author) commented Jun 7, 2019

@Gallaecio I thought about asking this question on Stack Overflow, but it is less responsive than here, and I believe there are bugs involved. Have you read my entire problem?

Also, I checked all the previous questions; @sibiryakov is very responsive in solving problems, which is why I am asking here.

I will try asking on Stack Overflow...

I have posted the question on Stack Overflow: https://stackoverflow.com/questions/56493245/modulenotfounderror-no-module-named-frontera-contrib-scrapy-middlewares-seeds

Sorry, I don't have enough reputation to post images on Stack Overflow, so I used imgur.com instead. I hope I can get an answer soon.

ghost (Author) commented Jun 9, 2019

@sibiryakov I found a solution for this error:

  File "/home/liho/anaconda3/lib/python3.7/site-packages/frontera/contrib/messagebus/kafkabus.py", line 60, in __init__
    self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]
TypeError: 'NoneType' object is not iterable

You should add this line

            self._consumer.topics()

before

            self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]

It seems partitions_for_topic does not request a metadata refresh, whereas topics does. No clue why this worked in kafka-python 1.4.4, as it seems neither function has changed. Maybe in 1.4.4 the metadata was always refreshed right away when the consumer was created?

Making partitions_for_topic call the same code as topics before returning the partitions would obviously also solve the problem.
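
Putting the fix together, the affected lines in Consumer.__init__ in kafkabus.py would look roughly like this (a sketch reconstructed from the traceback above, not necessarily the exact upstream code):

    from kafka import TopicPartition  # already imported at the top of kafkabus.py

    # Force a metadata refresh first: on kafka-python >= 1.4.5,
    # partitions_for_topic() alone can return None because it does
    # not trigger a metadata fetch by itself.
    self._consumer.topics()
    self._partitions = [TopicPartition(self._topic, pid)
                        for pid in self._consumer.partitions_for_topic(self._topic)]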

Have a look, they have been fixing this problem recently:
dpkp/kafka-python#1789
dpkp/kafka-python#1781
dpkp/kafka-python#1774
Yelp/kafka-utils@607a577

ghost (Author) commented Jun 9, 2019

@sibiryakov After I successfully start the cluster:

python -m frontera.worker.db --config tutorial.config.dbw --no-incoming --partitions 0 1
python -m frontera.worker.strategy --config tutorial.config.sw --partition-id 0

When I inject the seeds file with the command below,

python -m frontera.utils.add_seeds --config tutorial.config.sw --seeds-file seeds.txt
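
For reference, seeds.txt is assumed here to be a plain-text list of start URLs, one per line, which is what the crawling strategy reads, e.g.:

    https://example.com/
    https://en.wikipedia.org/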

I get this error in the DB worker terminal in the meantime (screenshot omitted), but after the seeds are injected it goes away...

scrapy crawl tutorial -L INFO -s SPIDER_PARTITION_ID=1

But I still get 0 pages crawled...

Please help me when you are free. Thanks in advance!

@sibiryakov (Member) commented:

Hi @liho00, your seeds weren't injected because the strategy worker was unable to create the table crawler:queue. Check that it can connect to the HBase Thrift server and that the namespace crawler exists.
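
For example, from the HBase shell (standard HBase shell commands; adjust to your deployment):

    $ hbase shell
    hbase> list_namespace
    hbase> create_namespace 'crawler'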

ghost (Author) commented Jun 10, 2019

@sibiryakov Hi, I am sure I created the namespace crawler before, and I am also sure the queue table was created... I need to clarify that I'm using Frontera v0.8.1, in which 'frontera.contrib.scrapy.middlewares.seeds' has been removed.

After I tried again, the error still shows up after entering this command:

python -m frontera.utils.add_seeds --config tutorial.config.sw --seeds-file seeds.txt

DB worker terminal: (screenshot omitted)

But after a few seconds it shows the seeds were injected?

Seeds terminal: (screenshot omitted)

I am still getting 0 pages crawled.

Besides that, can you tell me how to inject the seeds? If the module from this error is no longer needed,

ModuleNotFoundError: No module named 'frontera.contrib.scrapy.middlewares.seeds'

should I inject the seeds into my strategy worker?

Lastly, I cannot force-close my crawler; it is trapped in an endless loop.

My Kafka, ZooKeeper, HBase, and Hadoop have all started.

ghost (Author) commented Jun 13, 2019

Solved by downgrading kafka-python to v1.4.4.
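
For anyone else hitting this, the downgrade is just:

    pip install kafka-python==1.4.4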

ghost closed this as completed Jun 13, 2019
Gallaecio reopened this Jun 14, 2019
@Gallaecio (Member) commented:

If that’s the only fix, then we need to either update setup.py accordingly or add support for later versions of kafka-python.
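
A minimal sketch of the first option, assuming kafka-python is declared via extras_require (the exact layout of Frontera's setup.py may differ):

    # setup.py (sketch): cap kafka-python below the release where the
    # metadata-refresh behaviour changed
    extras_require={
        'kafka': ['kafka-python>=1.4.0,<1.4.5'],
    },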

Gallaecio added the bug label and removed the question label on Jun 14, 2019
@sibiryakov (Member) commented:

@Gallaecio it should be a tiny PR #371 (comment)

ghost (Author) commented Jun 14, 2019

Besides that, I cannot force-close the spiders; they get trapped in an endless loop of [kafka client] warnings, "Unable to send to wakeup socket!", when using kafka-python v1.4.5 and v1.4.6 (latest).

kafka/client_async.py

            except socket.error:
                log.warning('Unable to send to wakeup socket!')

dpkp/kafka-python#1837
dpkp/kafka-python#1842


psdon commented Dec 7, 2020

I also get the same problem. How can we solve this?

@yenicelik commented:

Getting the same issue here
