Compress cache files #113

katrinleinweber · 2019-07-22T10:35:28Z

As a team of scientometricians, my colleagues and I are considering sharing our ~/.scopus/scopus_search/ directories to avoid redownloading data and to parallelise multiple downloads for a single project.

To speed up synchronisation and to avoid filling up our local drives, gzip compression (or any other format) of the md5-named cache files would be tremendously helpful.

Michael-E-Rose · 2019-07-22T11:03:11Z

Yes, we also share our cache.

Compression seems like a good idea. Do you have an idea of decompression times? That is the cost of saving space on disk.

katrinleinweber · 2019-07-22T18:27:51Z

I haven't measured, but I used rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison to select gz for my use case.

In any case, compared to the download rate from Scopus of 30-40 MB per hour, any delay due to (de)compression will be negligible.
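For anyone who does want to measure, here is a rough sketch of how (de)compression time could be timed with Python's standard `gzip` module. The payload is synthetic: the `eid` and `title` fields and the record count are assumptions for illustration, not the actual cache format, so real numbers will differ.

```python
import gzip
import json
import time

# Synthetic stand-in for a cached search result (assumed shape, not the real cache format).
records = [{"eid": f"2-s2.0-{i}", "title": "An example title " * 5} for i in range(20000)]
raw = json.dumps(records).encode("utf-8")

# Time compression of the whole payload.
start = time.perf_counter()
packed = gzip.compress(raw)
compress_seconds = time.perf_counter() - start

# Time decompression, which is the cost paid on every cache read.
start = time.perf_counter()
unpacked = gzip.decompress(packed)
decompress_seconds = time.perf_counter() - start

print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
print(f"compress: {compress_seconds:.3f} s, decompress: {decompress_seconds:.3f} s")
```

Running this against a copy of a real cache file instead of the synthetic payload would give a more representative figure.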

Michael-E-Rose · 2019-07-24T16:19:41Z

Okay, this sounds good and certainly makes sense.

I am thinking about how to best implement the compression:

Should the filename change or should it not?
Should it always be there or should there be a switch?
Should all classes be affected or just ScopusSearch results?

Depending on the answers, all previously cached files will be useless, which I'd like to avoid.

In any case, that's something for pybliometrics 3.0.

katrinleinweber · 2019-07-24T17:05:58Z

… previously cached files will be useless which I'd like to avoid.

I presume that in this case, some kind of inference & if … else … will be needed, regardless of whether the files receive an extension. Maybe Pandas' compression inference is a good example of that?
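A minimal sketch of what such an inference could look like, assuming the cache files keep their extension-less md5 names. Pandas infers compression from the file extension; since that is not available here, this sketch checks the two-byte gzip magic number instead. The function name `read_cache_file` is hypothetical and not part of pybliometrics.

```python
import gzip
from pathlib import Path

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"

def read_cache_file(path):
    """Return the content of a cache file, gunzipping it if it looks gzip-compressed.

    Hypothetical helper: old uncompressed files and new compressed ones
    can then live side by side under the same naming scheme.
    """
    data = Path(path).read_bytes()
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    else:
        return data
```

With this approach, previously cached files would stay readable, because plain text or JSON content does not begin with the gzip magic bytes.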

… just ScopusSearch results?

Having used only the latter, I still guess that significant benefits are possible for each search class.

… should there be a switch?

Yes, please :-) Different situations require different prioritisations of speed over storage or the other way round. (De)compression will most likely add some delay. The main question is probably: What should the default be? I vote for compression='gz'.
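To illustrate the kind of switch I mean, here is a sketch of a write function with a `compression` argument. The function name, the argument name and the accepted values are assumptions for illustration only; nothing like this exists in pybliometrics yet.

```python
import gzip
from pathlib import Path

def write_cache_file(path, data, compression="gz"):
    """Write bytes to a cache file, gzip-compressed by default.

    Hypothetical helper: compression="gz" compresses the data,
    compression=None writes it unchanged.
    """
    if compression == "gz":
        data = gzip.compress(data)
    elif compression is not None:
        raise ValueError(f"unsupported compression: {compression!r}")
    Path(path).write_bytes(data)
```

Users who prioritise read speed over storage could then pass `compression=None`, while everyone else gets the smaller files by default.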

Michael-E-Rose added this to the pybliometrics 3.0 milestone Jul 24, 2019

Michael-E-Rose added Effort: High Enhancement labels Sep 9, 2019
