Improve PubMed download #352

Open
gaurav opened this issue Oct 1, 2024 · 0 comments


gaurav commented Oct 1, 2024

Our PubMed download is currently somewhat constrained:

  • Their website uses robots.txt to prevent us from downloading all of PubMed via HTTP, so we have to use FTP.
  • FTP doesn't seem to have a good (or working) way to check when a file was last modified, so our recursive download currently starts from the beginning and re-downloads every PubMed file.

This means that when a couple of files fail or become corrupted during transfer, we need to either re-download everything or come up with some hacky way to re-download just the broken files. However, the recursive download also fetches the MD5 checksum for every file, which we could use to build in a mechanism for detecting and working around this case:

  1. If a file already exists in the PubMed download directories, verify it by checking its MD5 checksum against the expected value.
  2. Somehow signal to the recursive download system that we don't want to re-download verified files.
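
A rough sketch of what the two steps above might look like in Python. This is not existing Babel code: the function names (`md5_of_file`, `expected_md5`, `is_verified`, `files_to_download`) are made up for illustration, and it assumes each data file `X` has a sidecar file `X.md5` in the same directory that contains the expected 32-character hex digest somewhere in its text. How the resulting list gets handed to the recursive downloader is left open.

```python
import hashlib
import re
from pathlib import Path

# Assumption: the sidecar .md5 file contains the digest as a run of 32 hex
# characters; the exact layout of the rest of the line is not relied upon.
MD5_HEX = re.compile(r"\b[0-9a-fA-F]{32}\b")


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Return the first 32-hex-digit token in the .md5 file, or None."""
    match = MD5_HEX.search(Path(md5_path).read_text())
    return match.group(0).lower() if match else None


def is_verified(data_path):
    """True only if the file and its .md5 sidecar exist and the digests match."""
    data_path = Path(data_path)
    md5_path = data_path.with_name(data_path.name + ".md5")
    if not data_path.is_file() or not md5_path.is_file():
        return False
    expected = expected_md5(md5_path)
    return expected is not None and md5_of_file(data_path) == expected


def files_to_download(directory, remote_names):
    """Filter a list of remote file names down to those not yet verified locally."""
    directory = Path(directory)
    return [name for name in remote_names if not is_verified(directory / name)]
```

One caveat with this approach: it trusts the local `.md5` files, so those would probably still need to be re-fetched on every run (they are tiny) in case a data file has been updated upstream.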
@gaurav gaurav added this to the Babel November 2024 milestone Oct 1, 2024