robots.txt #194

Open · edsu opened this issue May 31, 2023 · 0 comments

Comments

edsu (Contributor) commented May 31, 2023

Users have noticed that swap.stanford.edu can become unresponsive under load. Some investigation of the logs showed that this happens when there is sustained attention from bots (e.g. Yandex). In the most recent case Yandex was making about 1.5 requests per second from about 8 IP addresses, which left swap pretty much unusable because all the CPUs were at 100% utilization.

Even though Yandex does not respect the Crawl-delay directive in robots.txt files, we think it would be good to instruct the crawlers that do (Google, Bing, Facebook, etc.) with:

User-agent: *
Crawl-delay: 10
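Crawl-delay is conventionally interpreted as a number of seconds between successive requests, so Crawl-delay: 10 would cap a single well-behaved bot at about 6 requests per minute.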

If we continue to run into performance problems we should consider:

  • provisioning more CPU for swap
  • load-balancing across two or more was-pywb nodes
  • shaping network traffic upstream (throttling, rate limiting, etc.); see the sketch after this list
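
For the last two options, here is a minimal nginx sketch of what per-IP throttling plus load-balancing could look like; the backend hostnames, ports, and rate numbers are hypothetical, not our actual config:

# Hypothetical sketch: rate-limit each client IP and round-robin
# across two was-pywb nodes. Names and numbers are assumptions.
limit_req_zone $binary_remote_addr zone=swap_ratelimit:10m rate=2r/s;

upstream pywb_backends {
    server pywb-node-1.internal:8080;
    server pywb-node-2.internal:8080;
}

server {
    server_name swap.stanford.edu;

    location / {
        # Allow short bursts of up to 10 queued requests;
        # beyond that nginx answers 503 by default.
        limit_req zone=swap_ratelimit burst=10;
        proxy_pass http://pywb_backends;
    }
}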

See https://github.com/sul-dlss/puppet/pull/9614 for the robots.txt change.

If Yandex is going to be blocked in perpetuity, it would be preferable to do that in robots.txt rather than at the IP level, which is what we are doing currently: https://github.com/sul-dlss/puppet/pull/9619
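
A robots.txt block would look something like this (Yandex documents "Yandex" as the user-agent token its crawlers obey):

User-agent: Yandex
Disallow: /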
