Cache is never used because hashes do not match #389

Closed
dbrakenhoff opened this issue Nov 29, 2024 · 3 comments · Fixed by #395
Labels
caching All caching related issues

Comments

@dbrakenhoff
Collaborator

At the moment I cannot use the caching functionality in one of my projects. Confirmed to happen for REGIS and AHN datasets.

For some reason the hashes never match, causing the cache to be invalidated. However, when I try reproducing this in a separate minimal example, the caching is working fine.

If anyone has any ideas, I'd love to hear them, otherwise as my investigation continues I will post updates here.

This is working fine...

import nlmod

nlmod.util.get_color_logger("DEBUG")

cachedir = "."
extent = [100_000, 101_000, 400_000, 401_000]

regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")

Versions:

Python version     : 3.11.10
NumPy version      : 1.26.4
Xarray version     : 2024.9.0
Matplotlib version : 3.9.2
Flopy version      : 3.9.0.dev1

nlmod version      : 0.9.0
@OnnoEbbens
Collaborator

OnnoEbbens commented Dec 11, 2024

I tried the following in a GH codespace (Debian GNU/Linux 12 (bookworm)):

  1. run script 1 (see below)
  2. restart the kernel
  3. run script 2 (see below)
  4. compare the hashes created in both scripts (see below)

What I get from this:

  • Reading the same cached netcdf file after restarting the kernel results in a different hash. I visually compared the two datasets (dimensions, coordinates, data variables and attributes) and I don't see any differences.
  • Same for reading the same pickled dataset.

I don't know why the hashes are different but I guess it is not something we can easily solve.

The intention of the hash is to check whether the pickled function arguments were created together with the cached netCDF file, in other words whether the .pklz and the .nc file belong together. This is a nice-to-have check, but the cache will work without it. So I would propose disabling the hash check until we find a solution for this.
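
One way to keep a "do these two files belong together" check without tokenizing the in-memory dataset would be to hash the bytes of the netCDF file on disk, since those do not change when the kernel restarts. The following is a minimal stdlib-only sketch of that idea. It is not nlmod's implementation and not necessarily what the fix does; `file_sha256`, `write_sidecar`, and `sidecar_matches` are hypothetical names, and the sidecar layout is an assumption.

```python
import hashlib
import pickle
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_path, func_args, sidecar_path):
    """Pickle the function arguments together with the data file's digest.

    Hypothetical sidecar layout: a dict with the arguments and the digest.
    """
    payload = {"func_args": func_args, "data_sha256": file_sha256(data_path)}
    with open(sidecar_path, "wb") as handle:
        pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)


def sidecar_matches(data_path, func_args, sidecar_path):
    """Return True if the arguments are equal and the data file is unchanged."""
    if not (Path(data_path).exists() and Path(sidecar_path).exists()):
        return False
    with open(sidecar_path, "rb") as handle:
        payload = pickle.load(handle)
    return (
        payload["func_args"] == func_args
        and payload["data_sha256"] == file_sha256(data_path)
    )
```

This only detects changes to the file on disk. It does not address comparing dataset-valued function arguments, which is where an in-memory token would still be needed.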

Package versions used:
python 3.11.10
nlmod 0.9.0
xarray 2023.6.0
dask 2024.12.0

script 1

import nlmod
import pickle
import xarray as xr
import dask

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = '.'
extent = [204800, 205000, 438000, 438200]

#%%
# get regis dataset
regis_ds = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_orig = dask.base.tokenize(regis_ds)

# write to pickle
with open('regis_ds.pickle', 'wb') as handle:
    pickle.dump(regis_ds, handle, protocol=-1)

# read cached netcdf
regis_from_cache = xr.open_dataset('regis.nc')
hash_cache_direct = dask.base.tokenize(regis_from_cache)

# read pickle
with open('regis_ds.pickle', 'rb') as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle)

# call get_regis again (this time the cache will be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind)

# save hashes
with open('hashes.txt', 'w') as handle:
    handle.write(f'regis orig           : {hash_orig}\n')
    handle.write(f'regis cache direct   : {hash_cache_direct}\n')
    handle.write(f'regis pickle         : {hash_pickle}\n')
    handle.write(f'regis cache indirect : {hash_cache_indirect}\n')

yields:

>>> INFO:nlmod.cache.wrapper:caching data -> regis.nc
>>> INFO:nlmod.cache.wrapper:using cached data -> regis.nc

script 2

import nlmod
import pickle
import xarray as xr
import dask

# logging settings
nlmod.util.get_color_logger("INFO")

cachedir = '.'
extent = [204800, 205000, 438000, 438200]

# read cached netcdf
regis_from_cache = xr.open_dataset('regis.nc')
hash_cache_direct = dask.base.tokenize(regis_from_cache)
regis_from_cache.close()

# read pickle
with open('regis_ds.pickle', 'rb') as handle:
    regis_from_pickle = pickle.load(handle)
hash_pickle = dask.base.tokenize(regis_from_pickle)

# call get_regis again (this time the cache won't be used)
regis_ds_ind = nlmod.read.regis.get_regis(extent, cachedir=cachedir, cachename="regis")
hash_cache_indirect = dask.base.tokenize(regis_ds_ind)

# save hashes
with open('hashesv2.txt', 'w') as handle:
    handle.write(f'regis cache direct   : {hash_cache_direct}\n')
    handle.write(f'regis pickle         : {hash_pickle}\n')
    handle.write(f'regis failed cache   : {hash_cache_indirect}\n')

yields:

INFO:nlmod.cache._same_function_arguments:cache was created using different function argument values, do not use cached data
INFO:nlmod.cache.wrapper:caching data -> regis.nc

hashes.txt:

regis orig           : ff4715f0946be500d54235e53b080e99
regis cache direct   : c408096ea4ae53ca7407605cf0ed6f33
regis pickle         : ff4715f0946be500d54235e53b080e99
regis cache indirect : c408096ea4ae53ca7407605cf0ed6f33

hashesv2.txt:

regis cache direct   : eea86620b9909cfd5b5cfbc0831b984d
regis pickle         : 05935639d8c6bf40843612b9082fb667
regis failed cache   : 05935639d8c6bf40843612b9082fb667
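
To make the comparison in step 4 easier to read, the two listings can be parsed and compared label by label. This is a small sketch that uses only the values posted above; `parse_hashes` is a hypothetical helper, not part of nlmod.

```python
def parse_hashes(text):
    """Parse 'label : hash' lines into a dict keyed by label."""
    hashes = {}
    for line in text.strip().splitlines():
        label, _, value = line.partition(":")
        hashes[label.strip()] = value.strip()
    return hashes


# Contents of hashes.txt (script 1, before the kernel restart)
run1 = parse_hashes("""
regis orig           : ff4715f0946be500d54235e53b080e99
regis cache direct   : c408096ea4ae53ca7407605cf0ed6f33
regis pickle         : ff4715f0946be500d54235e53b080e99
regis cache indirect : c408096ea4ae53ca7407605cf0ed6f33
""")

# Contents of hashesv2.txt (script 2, after the kernel restart)
run2 = parse_hashes("""
regis cache direct   : eea86620b9909cfd5b5cfbc0831b984d
regis pickle         : 05935639d8c6bf40843612b9082fb667
regis failed cache   : 05935639d8c6bf40843612b9082fb667
""")

# For labels present in both runs: did the token survive the restart?
stable = {label: run1[label] == run2[label] for label in run1.keys() & run2.keys()}
for label in sorted(stable):
    print(f"{label:<20}: {'same' if stable[label] else 'different'}")
```

Both shared labels, `regis cache direct` and `regis pickle`, come out as different across the restart. Within the first run, the original and the pickled dataset share one token, and the two reads from the cache share another.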

@OnnoEbbens OnnoEbbens added the caching All caching related issues label Dec 11, 2024
@OnnoEbbens OnnoEbbens linked a pull request Dec 16, 2024 that will close this issue
@bdestombe
Collaborator

#395

@bdestombe
Collaborator

Solved with #395

@github-project-automation github-project-automation bot moved this from Todo to Done in NHFLO Dec 22, 2024