new broad crawling options
sibiryakov committed Jul 22, 2016
1 parent fabe19d commit 3aaaf91
Showing 4 changed files with 48 additions and 2 deletions.
31 changes: 31 additions & 0 deletions docs/source/topics/frontera-settings.rst
@@ -87,6 +87,37 @@ Default: ``'frontera.contrib.backends.memory.FIFO'``
The :class:`Backend <frontera.core.components.Backend>` to be used by the frontier. For more info see
:ref:`Activating a backend <frontier-activating-backend>`.


.. setting:: BC_MIN_REQUESTS

BC_MIN_REQUESTS
---------------

Default: ``64``

The broad crawling queue get operation will keep retrying until the specified number of requests has been collected.
The maximum number of retries is hard-coded to 3.

.. setting:: BC_MIN_HOSTS

BC_MIN_HOSTS
------------

Default: ``24``

Keep retrying when getting requests from the queue until requests for the specified minimum number of hosts have been
collected. The maximum number of retries is hard-coded to 3.

.. setting:: BC_MAX_REQUESTS_PER_HOST

BC_MAX_REQUESTS_PER_HOST
------------------------

Default: ``128``

When possible, don't include requests for a specific host in a batch if that host already exceeds the specified
maximum number of requests per host. This is a hint to the broad crawling queue get algorithm rather than a hard limit.
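Taken together, the three ``BC_*`` settings above describe a retry policy for collecting a batch. A minimal self-contained sketch of that policy (the ``get_batch`` function and the ``fetch_chunk`` callable are illustrative, not Frontera's actual queue code):

```python
# Illustrative sketch of the broad-crawling queue "get" retry policy
# described by BC_MIN_REQUESTS, BC_MIN_HOSTS and BC_MAX_REQUESTS_PER_HOST.
MAX_RETRIES = 3  # hard-coded in Frontera


def get_batch(fetch_chunk, min_requests=64, min_hosts=24,
              max_requests_per_host=128):
    per_host = {}  # host -> requests already taken into the batch
    batch = []
    for _ in range(MAX_RETRIES):
        for host, request in fetch_chunk():
            # A suggestion, not a hard limit: skip hosts that already
            # contributed max_requests_per_host requests.
            if per_host.get(host, 0) >= max_requests_per_host:
                continue
            per_host[host] = per_host.get(host, 0) + 1
            batch.append(request)
        # Stop retrying once both thresholds are satisfied.
        if len(batch) >= min_requests and len(per_host) >= min_hosts:
            break
    return batch
```

With defaults of 64 requests and 24 hosts, a chunk drawn from too few hosts triggers another fetch, up to the retry budget.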

.. setting:: CANONICAL_SOLVER

CANONICAL_SOLVER
6 changes: 6 additions & 0 deletions docs/source/topics/frontier-backends.rst
@@ -283,6 +283,12 @@ tuning a block cache to fit states within one block for average size website. T
to achieve documents closeness within the same host. This function can be selected with :setting:`URL_FINGERPRINT_FUNCTION`
setting.

.. TODO: document details of block cache tuning,
   BC* settings and queue get operation concept,
   hbase tables schema and data flow,
   queue exploration,
   shuffling with MR jobs

.. _FIFO: http://en.wikipedia.org/wiki/FIFO
.. _LIFO: http://en.wikipedia.org/wiki/LIFO_(computing)
.. _DFS: http://en.wikipedia.org/wiki/Depth-first_search
10 changes: 8 additions & 2 deletions frontera/contrib/backends/hbase.py
@@ -376,6 +376,10 @@ def __init__(self, manager):
port = settings.get('HBASE_THRIFT_PORT')
hosts = settings.get('HBASE_THRIFT_HOST')
namespace = settings.get('HBASE_NAMESPACE')
self._min_requests = settings.get('BC_MIN_REQUESTS')
self._min_hosts = settings.get('BC_MIN_HOSTS')
self._max_requests_per_host = settings.get('BC_MAX_REQUESTS_PER_HOST')

self.queue_partitions = settings.get('SPIDER_FEED_PARTITIONS')
host = choice(hosts) if type(hosts) in [list, tuple] else hosts
kwargs = {
@@ -456,8 +460,10 @@ def get_next_requests(self, max_next_requests, **kwargs):
for partition_id in range(0, self.queue_partitions):
if partition_id not in partitions:
continue
results = self.queue.get_next_requests(max_next_requests, partition_id, min_requests=64,
min_hosts=24, max_requests_per_host=128)
results = self.queue.get_next_requests(max_next_requests, partition_id,
min_requests=self._min_requests,
min_hosts=self._min_hosts,
max_requests_per_host=self._max_requests_per_host)
next_pages.extend(results)
self.logger.debug("Got %d requests for partition id %d", len(results), partition_id)
return next_pages
3 changes: 3 additions & 0 deletions frontera/settings/default_settings.py
@@ -3,6 +3,9 @@

AUTO_START = True
BACKEND = 'frontera.contrib.backends.memory.FIFO'
BC_MIN_REQUESTS = 64
BC_MIN_HOSTS = 24
BC_MAX_REQUESTS_PER_HOST = 128
CANONICAL_SOLVER = 'frontera.contrib.canonicalsolvers.Basic'
DELAY_ON_EMPTY = 5.0
DOMAIN_FINGERPRINT_FUNCTION = 'frontera.utils.fingerprint.sha1'
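Since Frontera settings modules are plain Python, a project can override the new defaults above simply by redefining the same names. A hypothetical project settings module (the names mirror ``frontera/settings/default_settings.py``; the values are arbitrary examples):

```python
# Hypothetical project settings module overriding the new broad
# crawling defaults; values here are arbitrary examples, not
# recommendations.
BC_MIN_REQUESTS = 256          # wait for larger batches before returning
BC_MIN_HOSTS = 48              # require more distinct hosts per batch
BC_MAX_REQUESTS_PER_HOST = 64  # spread each batch across more hosts
```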
