Making coordinate variables searchable #201
That was specifically for a project where we were using the intake catalogue as a source for an "experiment explorer", to expose the variables saved in an experiment on a timeline and assist users in understanding what variables are available at different times in an experiment. For this purpose, we really only wanted diagnostic model variables that have a time-varying component.
I'm confused. Does this mean
BTW, this is a somewhat related issue, I think, about encoding grid information:
I'd also be interested in knowing whether this would cover your use cases @marc-white @anton-seaice
@charles-turner-1 my main concern to this point hasn't been how to search for the coordinates, but how to access them. In particular:
@marc-white In response to question 1, currently searching for a coordinate will load the entire dataset - it works very much like searching on other metadata, e.g.:
>>> ocean_files = cat_with_coords.search(start_date='2086.*', frequency='1mon', file_id='ocean').to_dask()
>>> print(ocean_files)
<xarray.Dataset> Size: 847GB
Dimensions: (time: 12, st_ocean: 75, yt_ocean: 2700,
xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
grid_xu_ocean: 3600, grid_yu_ocean: 2700,
grid_xt_ocean: 3600, neutral: 80, nv: 2,
st_edges_ocean: 76, sw_edges_ocean: 76,
potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
* st_edges_ocean (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
* time (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
* nv (nv) float64 16B 1.0 2.0
... ...
* potrho (potrho) float64 640B 1.028e+03 ... 1.038e+03
* potrho_edges (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
* grid_xt_ocean (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
* grid_yu_ocean (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
* neutral (neutral) float64 640B 1.028e+03 ... 1.038e+03
* neutralrho_edges (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
pot_temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
salt (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
age_global (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
... ...
bih_fric_v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u_dot_grad_vert_pv (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
average_T1 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_T2 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_DT (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
time_bounds (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_vars: ['temp', 'pot_temp', 'salt', 'a...
intake_esm_attrs:filename: ocean.nc
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds',...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'met...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key:                  ocean.1mon

I like the suggestion that we only load
>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean',start_date = '2086.*')
>>> print(st_edges_search.df.head(3))
filename file_id path \
0 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
1 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
2 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
filename_timestamp frequency start_date end_date \
0 NaN 1mon 2086-01-01, 00:00:00 2086-04-01, 00:00:00
1 NaN 1mon 2086-04-01, 00:00:00 2086-07-01, 00:00:00
2 NaN 1mon 2086-07-01, 00:00:00 2086-10-01, 00:00:00
>>> st_edges = st_edges_search.to_dask()
>>> print(st_edges)
<xarray.Dataset> Size: 847GB
Dimensions: (time: 12, st_ocean: 75, yt_ocean: 2700,
xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
grid_xu_ocean: 3600, grid_yu_ocean: 2700,
grid_xt_ocean: 3600, neutral: 80, nv: 2,
st_edges_ocean: 76, sw_edges_ocean: 76,
potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
* st_edges_ocean (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
* time (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
* nv (nv) float64 16B 1.0 2.0
... ...
* potrho (potrho) float64 640B 1.028e+03 ... 1.038e+03
* potrho_edges (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
* grid_xt_ocean (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
* grid_yu_ocean (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
* neutral (neutral) float64 640B 1.028e+03 ... 1.038e+03
* neutralrho_edges (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
pot_temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
salt (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
age_global (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
... ...
bih_fric_v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u_dot_grad_vert_pv (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
average_T1 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_T2 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_DT (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
time_bounds (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_vars: ['temp', 'pot_temp', 'salt', 'a...
intake_esm_attrs:filename: ocean.nc
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds',...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'met...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key:                  ocean.1mon

As before, searching for a variable will only load relevant data, e.g.:

>>> pot_temp_search = cat_with_coords.search(variable='pot_temp', frequency='1mon', file_id='ocean', start_date='2086.*')
>>> print(pot_temp_search.df.head(3))
filename file_id path \
0 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
1 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
2 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
filename_timestamp frequency start_date end_date \
0 NaN 1mon 2086-01-01, 00:00:00 2086-04-01, 00:00:00
1 NaN 1mon 2086-04-01, 00:00:00 2086-07-01, 00:00:00
2 NaN 1mon 2086-07-01, 00:00:00 2086-10-01, 00:00:00
...
>>> pot_temp = pot_temp_search.to_dask()
>>> print(pot_temp)
<xarray.Dataset> Size: 35GB
Dimensions: (time: 12, st_ocean: 75, yt_ocean: 2700, xt_ocean: 3600)
Coordinates:
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03
* time (time) object 96B 2086-01-16 12:00:00 ... 2086-12-16 12:00:00
Data variables:
pot_temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
Attributes: (12/22)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_vars: ['pot_temp']
intake_esm_attrs:filename: ocean.nc
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds',...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'met...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key: ocean.1mon
st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean', start_date='2086.*')
st_edges = st_edges_search.to_dask()

is messy, I agree - in the search above, a large ARE instance will crash with default chunking, which seems to be the result of Dask trying to set one chunk per file. Subsetting down to just a few files (…). With the current implementation, we can easily search & load the dataset as follows:

>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean')
>>> varnames = st_edges_search.df.loc[0, 'variable']
>>> print(st_edges_search.to_dask(xarray_open_kwargs={'drop_variables': varnames}))
<xarray.Dataset> Size: 229kB
Dimensions: (xt_ocean: 3600, yt_ocean: 2700, st_ocean: 75,
st_edges_ocean: 76, time: 2760, nv: 2, xu_ocean: 3600,
yu_ocean: 2700, sw_ocean: 75, sw_edges_ocean: 76,
grid_xu_ocean: 3600, grid_yt_ocean: 2700, potrho: 80,
potrho_edges: 81, grid_xt_ocean: 3600,
grid_yu_ocean: 2700, neutral: 80, neutralrho_edges: 81)
Coordinates: (12/18)
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
* st_edges_ocean (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
* time (time) object 22kB 1950-01-16 12:00:00 ... 2179-12-16 1...
* nv (nv) float64 16B 1.0 2.0
... ...
* potrho (potrho) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
* potrho_edges (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
* grid_xt_ocean (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* grid_yu_ocean (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 89.96 90.0
* neutral (neutral) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
* neutralrho_edges (neutralrho_edges) float64 648B 1.028e+03 ... 1.038e+03
Data variables:
*empty*
Attributes: (12/16)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_attrs:filename: ocean.nc
intake_esm_attrs:file_id: ocean
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ''...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds', '...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'meter...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key: ocean.1mon
I think the most straightforward solution here is to have searches on coordinates drop all data variables by default.
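A minimal sketch of what that default could look like, using plain xarray on a synthetic dataset. The helper `keep_coords_only` and the dataset contents are hypothetical, not existing intake-esm API:

```python
import numpy as np
import xarray as xr

def keep_coords_only(ds: xr.Dataset) -> xr.Dataset:
    """Drop every data variable, keeping only the (cheap) coordinates."""
    return ds.drop_vars(list(ds.data_vars))

# Synthetic stand-in for one file matched by a coordinate search
ds = xr.Dataset(
    data_vars={"pot_temp": (("time", "st_ocean"), np.zeros((2, 3), dtype="float32"))},
    coords={
        "time": [0, 1],
        "st_ocean": [0.5413, 1.681, 2.94],
        "st_edges_ocean": ("st_edges_ocean", [0.0, 1.083, 2.3, 3.5]),
    },
)

# Applying this before concatenation would give a coordinate-only dataset,
# avoiding the 847GB of lazy data variables shown above
coords_only = keep_coords_only(ds)
```

This is essentially what the `drop_variables` workaround above does by hand; doing it inside the coordinate-search path would hide that step from the user.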
This makes sense to me - generally you are searching for a coordinate when trying to load it from a different file than the one the variable is in anyway, so you'll need a separate search & .to_dask operation. E.g., you might want SST and area_t, and this is where I'm not sure they should have their own field: e.g. would return 0 datasets? And we would have to do this to return two datasets that then need to be combined? But a different implementation would allow
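For illustration, the two-search-then-combine pattern being discussed, with plain xarray Datasets standing in for the results of the two hypothetical `.to_dask()` calls (names and shapes are made up):

```python
import numpy as np
import xarray as xr

# Stand-in for the result of search(variable='sst', ...).to_dask()
sst = xr.Dataset(
    {"sst": (("time", "yt_ocean", "xt_ocean"), np.ones((2, 3, 4), dtype="float32"))},
    coords={"time": [0, 1], "yt_ocean": np.arange(3.0), "xt_ocean": np.arange(4.0)},
)

# Stand-in for the result of search(coords='area_t', ...).to_dask() on a grid file
area = xr.Dataset(
    {"area_t": (("yt_ocean", "xt_ocean"), np.full((3, 4), 25.0))},
    coords={"yt_ocean": np.arange(3.0), "xt_ocean": np.arange(4.0)},
)

# The manual combination step the two-search design requires of the user
combined = xr.merge([sst, area])
```

Since the grid coordinates match, `xr.merge` aligns the two results into a single dataset containing both `sst` and `area_t`.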
I spent a good chunk of time thinking about this yesterday afternoon. I've got a couple of thoughts:
The major difficulty (that I'm thinking about right now) is that adding coordinate variables to the variables we search leads to intake-esm telling xarray to request those variables when it loads the dataset. I'm looking into whether we can add a step to check whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets. If not, I'm not sure that there's a straightforward solution - bundling coordinate variables in with data variables effectively loses the information that the two are different and need to be treated differently, until we actually open the files. TL;DR: I'm going to keep prodding - I agree that it would be nice to be able to search for coordinates or data variables, rather than coordinates and data variables together.
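The check described above could, in principle, look something like the following: inspect one representative dataset and partition the requested names into coordinate and data variables before concatenation. `split_requested` is a hypothetical helper, not existing intake-esm API:

```python
import numpy as np
import xarray as xr

def split_requested(ds: xr.Dataset, names: list[str]) -> tuple[list[str], list[str]]:
    """Partition requested names into (coordinate variables, data variables)."""
    coord_names = [n for n in names if n in ds.coords]
    data_names = [n for n in names if n in ds.data_vars]
    return coord_names, data_names

# Synthetic stand-in for one dataset about to be concatenated
ds = xr.Dataset(
    {"temp": (("time", "st_ocean"), np.zeros((2, 3)))},
    coords={"time": [0, 1], "st_ocean": [0.5, 1.7, 2.9]},
)

coord_names, data_names = split_requested(ds, ["st_ocean", "temp"])
```

The hard part is not the check itself but where to hook it in: it has to run after the files are opened but before xarray is asked to select/concatenate, which is exactly the point at which the bundled variable list has already lost the distinction.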
Ah yes - I think my point was just ... do we need an extra field, can we just put the coordinates in the variable field?
I think we need this for #117 anyway, right?
I see your point - I think if we can put the coordinates into the variable field then that would be preferable - it doesn't add any user complexity. And yeah, looks like #117 is the same issue. In fact, it looks like the cosima_cookbook
Is your feature request related to a problem? Please describe.
Currently, the ACCESS-NRI Intake Catalog doesn't allow for searching of coordinate variables: for example, searching for st_edges_ocean will return 0 datasets. This can make searching for coordinate variables difficult, with 2 main pain points:

- Although this doesn't require the user to (semi-)manually work out what file to open, it's still messy as it requires passing around file names.
- ocean_grid.nc files only contain coordinate variables, and so cannot be found using the catalogue. The only way to currently access these files is to search the catalogue to get a handle on the directory structure, and then construct a file path and load it. This requires the user to start poking around in directory structures to try to work out where to load their data from - which is the problem intake is trying to solve.
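The manual workaround amounts to something like the following sketch; the paths are hypothetical (the real ones are elided in the example output above), and the layout assumption is that ocean_grid.nc sits alongside the catalogued files:

```python
from pathlib import Path

# Hypothetical path of a file that *is* in the catalogue
catalogued = Path("/some/experiment/output000/ocean/ocean.nc")

# ocean_grid.nc cannot be found via the catalogue itself, so the user has to
# construct its path by hand, assuming it lives in the same output directory
grid_path = catalogued.parent / "ocean_grid.nc"

# grid = xr.open_dataset(grid_path)  # then opened directly, bypassing intake
```

This is exactly the directory-structure poking the issue describes: the catalogue provides the neighbouring path, but the grid file itself is loaded outside intake.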
This has caused some pain points migrating COSIMA recipes from cosima_cookbook => intake.
I also think this might be the same issue as discussed in #63? @aidanheerdegen - there seem to be some concerns about coordinates being listed as variables when they shouldn't be there?
Describe the feature you'd like
Searchable coordinates: in the same way that the catalog currently lets you perform searches over variables, it would be useful to be able to do the same on coordinates:
Doing this is subject to a couple of constraints: xr.combine_by_coords will fail if passed a coordinate variable.

Proposed Solution
(data_vars => variables), as this would then confuse coordinates & variables in the ACCESS-NRI Intake Catalog as well as causing concatenation issues. This is implemented on branch 660-coordinate-variables.

Additional Info
datastore.csv.gz files written by builder.save() are typically approximately doubled in size.