Improve PubMed download #352

gaurav · 2024-10-01T13:57:15Z

Our PubMed download is currently somewhat constrained:

Their website uses robots.txt to prevent us from downloading all of PubMed via HTTP, so we have to use FTP.
FTP doesn't have a (good? working?) method to check when a file was last changed, so our recursive download method currently starts at the beginning and re-downloads all PubMed files every time.

This means that when a couple of files fail or become corrupted during transfer, we have to either re-download all the files or come up with some hacky way to re-download just the broken ones. However, the recursive download also fetches the MD5 checksum for every file, which we could use to build in a mechanism for detecting and working around this case:

If files already exist in the PubMed download directories, verify each file by checking its MD5 checksum against the expected value.
Somehow signal to the recursive download system that we don't want to re-download verified files.
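
The two steps above could be sketched roughly as follows. This is only a minimal illustration, not existing Babel code: the function names (`compute_md5`, `read_expected_md5`, `find_unverified_files`) are hypothetical, and it assumes each data file `X.xml.gz` has a sidecar file `X.xml.gz.md5` next to it that contains a 32-character hexadecimal MD5 digest somewhere in its text. The returned list is one possible "signal" for the downloader: only the files it names would need to be fetched again.

```python
import hashlib
import re
from pathlib import Path

# Assumption: the checksum file contains one 32-character hex digest somewhere
# in its text (for example "MD5(name)= <digest>" or "<digest>  name").
MD5_PATTERN = re.compile(r"\b[0-9a-fA-F]{32}\b")


def compute_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_md5(md5_path):
    """Return the first MD5 digest found in a checksum file, or None."""
    match = MD5_PATTERN.search(Path(md5_path).read_text())
    return match.group(0).lower() if match else None


def find_unverified_files(directory, suffix=".xml.gz"):
    """List local data files whose checksum is missing or does not match.

    Assumption: the checksum for "X.xml.gz" is stored in "X.xml.gz.md5"
    in the same directory.
    """
    unverified = []
    for data_file in sorted(Path(directory).glob("*" + suffix)):
        md5_file = Path(str(data_file) + ".md5")
        expected = read_expected_md5(md5_file) if md5_file.exists() else None
        if expected is None or compute_md5(data_file) != expected:
            unverified.append(data_file.name)
    return unverified
```

Note that this only checks files that are already on disk. Files that were never downloaded at all would still have to be found by comparing against the remote directory listing.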

gaurav added this to the Babel November 2024 milestone Oct 1, 2024
