Compress cache files #113

Open
katrinleinweber opened this issue Jul 22, 2019 · 4 comments

Comments

@katrinleinweber
Contributor

katrinleinweber commented Jul 22, 2019

As a team of scientometricians, my colleagues and I are considering sharing our ~/.scopus/scopus_search/ directories to avoid redownloading data and to parallelise multiple downloads for a single project.

In order to speed up synchronisation and to avoid filling up our local drives too much, gz compression (or any other) of the md5-named cache files would be tremendously helpful.

@Michael-E-Rose
Contributor

Yes, we also share our cache.

Compression seems like a good idea. Do you have an idea of decompression times? That's the cost of saving space on disk.

@katrinleinweber
Contributor Author

I haven't measured, but I used rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison to select gz for my use case.

In any case, compared to the download rate from Scopus of 30-40 MB per hour, any delay due to (de)compression will be negligible.
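For anyone who wants numbers rather than an estimate, a rough way to time the (de)compression step is sketched below. It uses only the Python standard library. The payload is synthetic JSON with made-up field names, not a real Scopus response, so the sizes and timings are indicative at best.

```python
import gzip
import json
import time


def time_gzip_roundtrip(payload: bytes, level: int = 6) -> dict:
    """Compress and decompress `payload` in memory; return sizes and timings."""
    t0 = time.perf_counter()
    packed = gzip.compress(payload, compresslevel=level)
    t1 = time.perf_counter()
    unpacked = gzip.decompress(packed)
    t2 = time.perf_counter()
    if unpacked != payload:  # the round trip must be lossless
        raise RuntimeError("gzip round trip changed the data")
    return {
        "raw_bytes": len(payload),
        "gz_bytes": len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


# Synthetic stand-in for a cached search result (hypothetical field names).
records = [
    {"eid": f"2-s2.0-{i:010d}", "title": "Example title", "citedby_count": i % 50}
    for i in range(20_000)
]
payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")

stats = time_gzip_roundtrip(payload)
print(stats)
```

Running this on a few real cache files instead of the synthetic payload would give a fairer picture of the trade-off.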

@Michael-E-Rose
Contributor

Okay, this sounds good and certainly makes sense.

I am thinking about how to best implement the compression:

  • Should the filename change or should it not?
  • Should it always be there or should there be a switch?
  • Should all classes be affected or just ScopusSearch results?

Depending on the answers, all previously cached files could become useless, which I'd like to avoid.

In any case, that's something for pybliometrics 3.0.

Michael-E-Rose added this to the pybliometrics 3.0 milestone Jul 24, 2019

@katrinleinweber
Contributor Author

katrinleinweber commented Jul 24, 2019

… previously cached files will be useless which I'd like to avoid.

I presume that in this case, some kind of inference & if … else … will be needed, regardless of whether the files receive an extension or not. Maybe Pandas' compression inference is a good example of that?
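A minimal sketch of such an inference, for illustration only: pandas infers the compression from the file extension, whereas the variant below checks the two-byte gzip magic number instead, so md5-named files without an extension keep working whether or not they are compressed. `read_cache_file` is a hypothetical helper, not part of pybliometrics.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_cache_file(path) -> bytes:
    """Return the cached content, transparently decompressing gzip files."""
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw  # legacy, uncompressed cache file
```

This assumes the uncompressed cache files hold text (for example JSON), which would not start with those two bytes. Under that assumption, previously cached files stay readable and no renaming is needed.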

… just ScopusSearch results?

Although I have only used the latter, I'd still guess that significant benefits are possible for each search class.

… should there be a switch?

Yes, please :-) Different situations require different prioritisations of speed over storage or the other way round. (De)compression will most likely add some delay. The main question is probably: What should the default be? I vote for compression='gz'.
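To make the proposal concrete, here is a rough sketch of what such a switch could look like on the writing side. The function name `write_cache_file` and its `compression` parameter are hypothetical, as are the accepted values `'gz'` and `None`; nothing here reflects an existing pybliometrics API.

```python
import gzip
from pathlib import Path
from typing import Optional


def write_cache_file(path, content: bytes, compression: Optional[str] = "gz") -> None:
    """Write `content` to `path`, gzip-compressed unless `compression` is None."""
    if compression == "gz":
        data = gzip.compress(content)
    elif compression is None:
        data = content  # store as-is, favouring speed over disk space
    else:
        raise ValueError(f"unsupported compression: {compression!r}")
    Path(path).write_bytes(data)
```

Users who prefer speed over storage would pass `compression=None`; everyone else gets gzip by default.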


2 participants