
[cache] Validate cached file with hashlib instead of dask tokenize #395

Merged 1 commit into dev from cache on Dec 18, 2024

Conversation

bdestombe (Collaborator)

No description provided.

@bdestombe bdestombe changed the title [cache] Use hashlib instead of dask tokenize to validate cached file on disk [cache] Validate cached file with hashlib instead of dask tokenize Dec 18, 2024
@dbrakenhoff (Collaborator)

Does this make #393 unnecessary?

@bdestombe (Collaborator, Author)

Yes.

It works in my script, and I tested it with Onno's code.
Using `hashlib_hash2 = hashlib.sha256(open(fp_cache, 'rb').read()).hexdigest()`, the hash returned by the first script is the same as the one returned by the second script.
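The one-liner above reads the whole file into memory. A minimal sketch of the same check as a reusable helper that hashes the file in chunks (the helper name `file_sha256` is illustrative, not part of nlmod):

```python
import hashlib


def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes.

    Reading in fixed-size chunks keeps memory use constant even for
    large cached netCDF files; the digest is identical to hashing the
    whole file at once.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because the digest depends only on the bytes on disk, two processes hashing the same unchanged file always agree, which is the property the scripts below verify.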

@bdestombe (Collaborator, Author) commented Dec 18, 2024

Onno's adjusted code:

File 1

```python
import hashlib
import os
import pickle

import dask
import nlmod
import xarray as xr

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = "."
extent = [204800, 205000, 438000, 438200]

# %%
# get regis dataset, removing any stale cache first
fp_cache = "regis.nc"
if os.path.exists(fp_cache):
    os.remove(fp_cache)

regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_orig = dask.base.tokenize(regis_ds.to_dict())
with open(fp_cache, "rb") as f:
    hashlib_hash1 = hashlib.sha256(f.read()).hexdigest()

# read cached netcdf
regis_from_cache = xr.open_dataset("regis.nc")
hash_cache_direct = dask.base.tokenize(regis_from_cache.to_dict())

# read pickle
with open("regis_ds.pickle", "rb") as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle.to_dict())

# call get_regis again (this time the cache will be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind.to_dict())

# save hashes
with open("hashes.txt", "w") as handle:
    handle.write(f"regis orig           : {hash_orig}\n")
    handle.write(f"regis cache direct   : {hash_cache_direct}\n")
    handle.write(f"regis pickle         : {hash_pickle}\n")
    handle.write(f"regis cache indirect : {hash_cache_indirect}\n")
    handle.write(f"hashlib1             : {hashlib_hash1}\n")
```

File 2

```python
import hashlib
import pickle

import dask
import nlmod
import xarray as xr

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = "."
extent = [204800, 205000, 438000, 438200]

# hash and read the cached netcdf written by the first script
fp_cache = "regis.nc"
with open(fp_cache, "rb") as f:
    hashlib_hash2 = hashlib.sha256(f.read()).hexdigest()
regis_from_cache = xr.open_dataset("regis.nc")
hash_cache_direct = dask.base.tokenize(regis_from_cache.to_dict())
regis_from_cache.close()

# read pickle
with open("regis_ds.pickle", "rb") as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle.to_dict())

# call get_regis again (this time the cache won't be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind.to_dict())

# save hashes
with open("hashesv2.txt", "w") as handle:
    handle.write(f"regis cache direct   : {hash_cache_direct}\n")
    handle.write(f"regis pickle         : {hash_pickle}\n")
    handle.write(f"regis failed cache   : {hash_cache_indirect}\n")
    handle.write(f"hashlib2             : {hashlib_hash2}\n")
```

`hashlib_hash1` is the same as `hashlib_hash2`.
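This stability is what makes a file-level hash suitable for cache validation. A minimal sketch of such a check, comparing the cached file's digest against a hash stored in a sidecar file (the helper name `is_cache_valid` and the sidecar convention are hypothetical, not nlmod's actual implementation):

```python
import hashlib
import os


def is_cache_valid(fp_cache: str, fp_hash: str) -> bool:
    """Return True if the cached file still matches its stored SHA-256 digest.

    fp_cache: path to the cached file (e.g. a netCDF file).
    fp_hash: path to a sidecar text file holding the hex digest written
    when the cache was created. Missing files invalidate the cache.
    """
    if not (os.path.exists(fp_cache) and os.path.exists(fp_hash)):
        return False
    with open(fp_cache, "rb") as f:
        current = hashlib.sha256(f.read()).hexdigest()
    with open(fp_hash) as f:
        stored = f.read().strip()
    return current == stored
```

Hashing the raw bytes on disk sidesteps the problem the linked issue describes: `dask.base.tokenize` on a round-tripped dataset can differ between runs, so the cache never validated.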

@bdestombe bdestombe merged commit d982eb2 into dev Dec 18, 2024
4 checks passed
@bdestombe bdestombe deleted the cache branch December 18, 2024 16:04
@bdestombe bdestombe linked an issue Dec 22, 2024 that may be closed by this pull request

Successfully merging this pull request may close these issues.

Cache is never used because hashes do not match