## Summary

Currently, `crawly` is designed to efficiently crawl and scrape static web pages while adhering to `robots.txt` rules. However, many modern websites use JavaScript to generate content dynamically, which the current crawler cannot capture.
## Enhancement proposal
This feature request aims to integrate support for a web driver (such as Selenium or headless browsers like Puppeteer) to enable the crawling and rendering of dynamic content created with JavaScript.
## Goals
- Dynamic content rendering: use a web driver to fully render pages before scraping, so that JavaScript-generated content is captured;
- Integration with existing architecture: seamlessly connect web driver capabilities into the current `Crawler` and `CrawlerBuilder` setup (a sketch of one possible shape follows this list);
- Respect existing configurations: ensure that the rendering process adheres to existing settings such as `robots.txt` rules, rate limits, and depth limits.
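As a rough sketch of how the integration goal could be met without disturbing the existing crawl loop, the crawler could fetch pages through an abstraction like the `PageFetcher` trait below. This trait is hypothetical, not part of `crawly` today:

```rust
use async_trait::async_trait;

/// Hypothetical abstraction: the crawler asks a fetcher for page HTML, so
/// robots.txt checks, rate limiting, and depth tracking remain in the
/// crawler core no matter how the HTML is produced.
#[async_trait]
pub trait PageFetcher: Send + Sync {
    async fn fetch(&self, url: &str) -> anyhow::Result<String>;
}

/// Current behaviour: a plain HTTP GET.
pub struct HttpFetcher;

#[async_trait]
impl PageFetcher for HttpFetcher {
    async fn fetch(&self, url: &str) -> anyhow::Result<String> {
        Ok(reqwest::get(url).await?.text().await?)
    }
}

// A `WebDriverFetcher` implementing the same trait would return the
// post-JavaScript DOM instead, leaving the rest of the crawler unchanged.
```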
## Implementation suggestions
### Option 1: Selenium WebDriver

- Utilize Selenium WebDriver to control a browser and render dynamic content;
- Integrate via existing Rust WebDriver crates such as `fantoccini` or `thirtyfour`.

### Option 2: Headless browsers

- Use headless browsers like Puppeteer or Playwright for improved performance in rendering and scraping dynamic content;
- This might involve creating Rust bindings or using existing native integrations.
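As a rough illustration of Option 1, a rendering helper built on `fantoccini` might look like the sketch below. The `render_page` function and its wiring into `crawly` are assumptions, and it presumes a WebDriver server (e.g. geckodriver or chromedriver) is already running on the default port:

```rust
use fantoccini::ClientBuilder;

/// Hypothetical helper: fetch a URL through a WebDriver-controlled browser
/// and return the fully rendered HTML.
async fn render_page(url: &str) -> anyhow::Result<String> {
    // Connect to a locally running WebDriver server.
    let mut client = ClientBuilder::native()
        .connect("http://localhost:4444")
        .await?;
    client.goto(url).await?;
    // The browser has executed the page's JavaScript by the time we read
    // the DOM, so `source` returns the post-render HTML.
    let html = client.source().await?;
    client.close().await?;
    Ok(html)
}
```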
## Proposed API changes
Introduce a new setting in `CrawlerBuilder` to enable dynamic content rendering:
```rust
let crawler = CrawlerBuilder::new()
    .with_max_depth(10)
    .with_max_pages(100)
    .with_max_concurrent_requests(50)
    .with_rate_limit_wait_seconds(2)
    .with_robots(true)
    .with_dynamic_rendering(true) // New configuration
    .build()?;
```
## Example usage
Demonstrate how users would take advantage of the new feature in their projects:
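A possible end-to-end flow, assuming the `with_dynamic_rendering` option proposed above; the `crawl_url` call and the `(url, html)` result shape are illustrative:

```rust
use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    // Same builder as today; dynamic rendering is simply switched on.
    let crawler = CrawlerBuilder::new()
        .with_max_depth(2)
        .with_max_pages(20)
        .with_robots(true)
        .with_dynamic_rendering(true) // proposed option
        .build()?;

    // The key point: the returned HTML now reflects the post-JavaScript DOM.
    let pages = crawler.crawl_url("https://example.com").await?;
    for (url, html) in &pages {
        println!("{url}: {} bytes of rendered HTML", html.len());
    }
    Ok(())
}
```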
## Expected benefits

- Ability to crawl and scrape modern, JavaScript-heavy websites;
- Dynamic rendering is opt-in, keeping `crawly` lightweight for static sites.