Maximum URLs to compute is different even with same configuration. #105
The algorithm which computes the URLs is unordered and asynchronous. Let me explain how it works.
Now, how does the crawler work exactly? We said that each URL is opened and parsed, and that the URLs present in the document are enqueued. Great. But how are URLs enqueued, and in which order? To provide more significant results as fast as possible, URLs are not put in queue A directly. Why? Imagine a menu called Products. The first items in the menu could lead to the “same” page, i.e. products of the same kind, so probably with the same HTML (modulo product information and details). Our goal is to check very different pages as fast as possible.
So when a URL is opened and parsed, the extracted URLs are enqueued into different sub-queues. To simplify, say we have several queues B_i, where i is the name of the queue; we call this name a bucket. For instance, take 3 URLs: […]. The URLs in the queues B_i are all dequeued asynchronously. When a URL is dequeued, it is opened and parsed, and the new URLs are extracted and pushed into the appropriate queue B_i. Per queue B_i, only one URL is computed at the same time. When a URL is opened, it is automatically added to queue A. Consequently, queue A is unlikely to receive many similar URLs in a row.
So: your results are not identical because the algorithm's behavior is non-deterministic. Given enough time, all URLs must be computed, but not in the same order and not in the same time, especially if the limit on the maximum number of URLs to compute is low.
In your example, it is not normal that the run stops at 123 or 105 if 128 URLs were found the first time. Maybe there is a special URL that makes the crawler stop. This special URL is never met the first time, is met at position 123 the second time, and at position 105 the third time. That's my guess. Do you have an idea about which URL it could be?
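To make the description above concrete, here is a minimal sketch of the bucketed sub-queues, in Node.js. It is not the actual a11ym source: the bucketing rule (first path segment of the URL) and the use of the async library's queue are assumptions for illustration only.

```js
// Minimal sketch of the bucketed sub-queues described above (not the real
// a11ym implementation). Assumed here: buckets are keyed by the first path
// segment of the URL, and each bucket is an async.queue with concurrency 1,
// so only one URL per bucket is computed at the same time.
const async = require('async');
const { URL } = require('url');

const buckets = new Map(); // bucket name -> its own queue B_i

function bucketOf(url) {
    // Hypothetical bucketing rule: "/products/shoes" and "/products/hats"
    // share the bucket "products", while "/blog/..." gets its own bucket.
    return new URL(url).pathname.split('/')[1] || '/';
}

function enqueue(url, crawl) {
    const name = bucketOf(url);

    if (!buckets.has(name)) {
        // Concurrency 1 per bucket: similar pages wait behind each other,
        // while different buckets drain in parallel, in no fixed order.
        buckets.set(name, async.queue(crawl, 1));
    }

    buckets.get(name).push(url);
}

// The worker opens and parses the page, then feeds the URLs it finds
// back into enqueue(), which spreads them over the buckets.
enqueue('https://example.org/products/shoes', function (url, done) {
    console.log('crawling', url);
    done();
});
```

Because the buckets drain independently and asynchronously, two runs over the same site can visit URLs in a different order, which is why the counts can differ when a maximum-URL limit cuts the run short.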
Hi Ivan, thank you so much for the detailed insight into the crawling algorithm; it really helped my understanding. You asked, "Do you have an idea about which URL it could be?" Culprit URL: https://www.drupal.org/promet-source. As per my understanding, if the disruption is caused by the URL mentioned above, then that URL should not be crawled again by the tool. I am not sure why the crawling stopped at this particular URL only. Please guide. Also, can we track this disruption in some log? Thanks & Regards,
I am sorry, but I cannot reproduce. I have been able to successfully generate a report for 128 URLs 3 consecutive times. What is your OS? What is your version of PhantomJS? What is your version of NodeJS?
OS version: macOS Sierra 10.12.1
Hi Ivan, I agree that there is no regular pattern to the crawling issue, which in fact makes it difficult to track down the actual root cause. I have attached a PDF file to show how the issue occurs on my end. I picked another small website, http://zomig.com/, to test. In total, I performed four executions; the observations, with screenshots, are provided in the attached PDF. Please take a look and kindly provide your input. Regards,
Please try with NodeJS 7.x. This is likely to be NodeJS crashing because it is NodeJS. |
Thank you Ivan, but upgrading NodeJS to 7.x didn't solve the issue.
I'm going to take a guess here. I've found that the check below causes the tool to quit if the queue isn't populated fast enough or at a consistent rate. So any time the queue drops to 0, whether or not it has reached the maximum URLs, it will quit. (Lines 220 to 222 in fb6acb8.)
You can comment out that check; however, if the crawler doesn't reach your maximum URLs, the queue will remain open and you will have to quit manually. Hope this helps. A sketch of the kind of check I mean is included below.
UPDATE:
drain - a callback that is called when the last item from the queue has returned from the worker.
empty - a callback that is called when the last item from the queue is given to a worker.
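For context, here is a hedged sketch of the kind of drain-based check described above. It is not the actual lines 220 to 222 of a11ym; it only assumes the crawler uses an async queue (async 2.x style, where drain is assigned as a property) and exits as soon as the queue drains.

```js
// Hedged sketch only, not the real a11ym source. If the process exits as
// soon as the queue drains, a momentarily empty queue ends the run even
// though pages crawled later would have produced new URLs.
const async = require('async');

const queue = async.queue(function (url, done) {
    // open the page, run the accessibility checks, push newly found URLs...
    done();
}, 4);

// async 2.x style: drain is called when the last queued item has returned
// from its worker. Exiting here reproduces the behavior described above:
// the run stops whether or not the maximum number of URLs was reached.
queue.drain = function () {
    process.exit(0);
};
```

Commenting out such an exit keeps the process alive, which matches the note above that you then have to quit manually when the maximum is never reached.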
I randomly picked one website to understand how crawling works in a11ym. I observed that the number of URLs crawled was different on every run.
$ a11ym https://www.drupal.org/
First run: 128/128 URLs were computed.
Second run: 123/128 URLs were computed.
Third run: 105/128 URLs were computed.
Note: the values for "maxURLs" and "maxDepth" were left unaltered for every execution, taking their default values of 128 and 3 respectively.