Making coordinate variables searchable #201
That was specifically for a project where we were using the intake catalogue as a source for an "experiment explorer", to expose the variables saved in an experiment on a timeline and assist users in understanding what variables are available at different times in an experiment. For this purpose, we really only wanted diagnostic model variables that have a time-varying component.
I'm confused. Does this mean
BTW, this is a somewhat related issue, I think, about encoding grid information:
I'd also be interested in knowing whether this would cover your use cases @marc-white @anton-seaice
@charles-turner-1 my main concern to this point hasn't been how to search for the coordinates, but how to access them. In particular:
@marc-white In response to question 1, currently searching for a coordinate will load the entire dataset - it works very much like searching on other metadata, e.g.:
>>> ocean_files = cat_with_coords.search(start_date='2086.*', frequency='1mon', file_id='ocean').to_dask()
>>> print(ocean_files)
<xarray.Dataset> Size: 847GB
Dimensions: (time: 12, st_ocean: 75, yt_ocean: 2700,
xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
grid_xu_ocean: 3600, grid_yu_ocean: 2700,
grid_xt_ocean: 3600, neutral: 80, nv: 2,
st_edges_ocean: 76, sw_edges_ocean: 76,
potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
* st_edges_ocean (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
* time (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
* nv (nv) float64 16B 1.0 2.0
... ...
* potrho (potrho) float64 640B 1.028e+03 ... 1.038e+03
* potrho_edges (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
* grid_xt_ocean (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
* grid_yu_ocean (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
* neutral (neutral) float64 640B 1.028e+03 ... 1.038e+03
* neutralrho_edges (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
pot_temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
salt (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
age_global (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
... ...
bih_fric_v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u_dot_grad_vert_pv (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
average_T1 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_T2 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_DT (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
time_bounds (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_vars: ['temp', 'pot_temp', 'salt', 'a...
intake_esm_attrs:filename: ocean.nc
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds',...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'met...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key:                  ocean.1mon

I like the suggestion that we only load
>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean',start_date = '2086.*')
>>> print(st_edges_search.df.head(3))
filename file_id path \
0 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
1 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
2 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
filename_timestamp frequency start_date end_date \
0 NaN 1mon 2086-01-01, 00:00:00 2086-04-01, 00:00:00
1 NaN 1mon 2086-04-01, 00:00:00 2086-07-01, 00:00:00
2 NaN 1mon 2086-07-01, 00:00:00 2086-10-01, 00:00:00
>>> st_edges = st_edges_search.to_dask()
>>> print(st_edges)
<xarray.Dataset> Size: 847GB
Dimensions: (time: 12, st_ocean: 75, yt_ocean: 2700,
xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
grid_xu_ocean: 3600, grid_yu_ocean: 2700,
grid_xt_ocean: 3600, neutral: 80, nv: 2,
st_edges_ocean: 76, sw_edges_ocean: 76,
potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
* st_edges_ocean (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
* time (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
* nv (nv) float64 16B 1.0 2.0
... ...
* potrho (potrho) float64 640B 1.028e+03 ... 1.038e+03
* potrho_edges (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
* grid_xt_ocean (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
* grid_yu_ocean (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
* neutral (neutral) float64 640B 1.028e+03 ... 1.038e+03
* neutralrho_edges (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
pot_temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
salt (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
age_global (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
... ...
bih_fric_v (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
u_dot_grad_vert_pv (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
average_T1 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_T2 (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
average_DT (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
time_bounds (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_vars: ['temp', 'pot_temp', 'salt', 'a...
intake_esm_attrs:filename: ocean.nc
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds',...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'met...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key:                  ocean.1mon

As before, searching for a variable will only load relevant data, e.g.:

>>> pot_temp_search = cat_with_coords.search(variable='pot_temp', frequency='1mon', file_id='ocean', start_date='2086.*')
>>> print(pot_temp_search.df.head(3))
filename file_id path \
0 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
1 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
2 ocean.nc ocean /g/data/ik11/outputs/access-om2-01/01deg_jra55...
filename_timestamp frequency start_date end_date \
0 NaN 1mon 2086-01-01, 00:00:00 2086-04-01, 00:00:00
1 NaN 1mon 2086-04-01, 00:00:00 2086-07-01, 00:00:00
2 NaN 1mon 2086-07-01, 00:00:00 2086-10-01, 00:00:00
...
>>> pot_temp = pot_temp_search.to_dask()
>>> print(pot_temp)
<xarray.Dataset> Size: 35GB
Dimensions: (time: 12, st_ocean: 75, yt_ocean: 2700, xt_ocean: 3600)
Coordinates:
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03
* time (time) object 96B 2086-01-16 12:00:00 ... 2086-12-16 12:00:00
Data variables:
pot_temp (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
Attributes: (12/22)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_vars: ['pot_temp']
intake_esm_attrs:filename: ocean.nc
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds',...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'met...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key: ocean.1mon
st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean', start_date='2086.*')
st_edges = st_edges_search.to_dask()

is messy, I agree - in the search above, a large ARE instance will crash with default chunking, which seems to be the result of Dask trying to set one chunk per file. Subsetting down to just a few files (…). With the current implementation, we can easily search & load the dataset as follows:

>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean')
>>> varnames = st_edges_search.df.loc[0, 'variable']
>>> print(st_edges_search.to_dask(xarray_open_kwargs={'drop_variables': varnames}))
<xarray.Dataset> Size: 229kB
Dimensions: (xt_ocean: 3600, yt_ocean: 2700, st_ocean: 75,
st_edges_ocean: 76, time: 2760, nv: 2, xu_ocean: 3600,
yu_ocean: 2700, sw_ocean: 75, sw_edges_ocean: 76,
grid_xu_ocean: 3600, grid_yt_ocean: 2700, potrho: 80,
potrho_edges: 81, grid_xt_ocean: 3600,
grid_yu_ocean: 2700, neutral: 80, neutralrho_edges: 81)
Coordinates: (12/18)
* xt_ocean (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* yt_ocean (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
* st_ocean (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
* st_edges_ocean (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
* time (time) object 22kB 1950-01-16 12:00:00 ... 2179-12-16 1...
* nv (nv) float64 16B 1.0 2.0
... ...
* potrho (potrho) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
* potrho_edges (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
* grid_xt_ocean (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
* grid_yu_ocean (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 89.96 90.0
* neutral (neutral) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
* neutralrho_edges (neutralrho_edges) float64 648B 1.028e+03 ... 1.038e+03
Data variables:
*empty*
Attributes: (12/16)
filename: ocean.nc
title: ACCESS-OM2-01
grid_type: mosaic
grid_tile: 1
intake_esm_attrs:filename: ocean.nc
intake_esm_attrs:file_id: ocean
... ...
intake_esm_attrs:coord_calendar_types: ['', '', '', '', 'NOLEAP', '', ''...
intake_esm_attrs:coord_bounds: ['', '', '', '', 'time_bounds', '...
intake_esm_attrs:coord_units: ['degrees_E', 'degrees_N', 'meter...
intake_esm_attrs:realm: ocean
intake_esm_attrs:_data_format_: netcdf
intake_esm_dataset_key: ocean.1mon
I think the most straightforward solution here is to have searches on coordinates drop all data variables by default.
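A minimal sketch of what that default could look like, using plain xarray on a synthetic dataset. The helper `keep_coords_only` and the dataset contents are hypothetical, not existing intake-esm API:

```python
import numpy as np
import xarray as xr

def keep_coords_only(ds: xr.Dataset) -> xr.Dataset:
    """Drop every data variable, keeping only the (cheap) coordinates."""
    return ds.drop_vars(list(ds.data_vars))

# Synthetic stand-in for one file matched by a coordinate search
ds = xr.Dataset(
    data_vars={"pot_temp": (("time", "st_ocean"), np.zeros((2, 3), dtype="float32"))},
    coords={
        "time": [0, 1],
        "st_ocean": [0.5413, 1.681, 2.94],
        "st_edges_ocean": ("st_edges_ocean", [0.0, 1.083, 2.3, 3.5]),
    },
)

# Applying this before concatenation would give a coordinate-only dataset,
# avoiding the 847GB of lazy data variables shown above
coords_only = keep_coords_only(ds)
```

This is essentially what the `drop_variables` workaround above does by hand; doing it inside the coordinate-search path would hide that step from the user.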
This makes sense to me - generally you are searching for a coordinate when trying to load it from a different file than the one the variable is in anyway, so you'll need a separate search & .to_dask operation. E.g., you might want SST and area_t, and this is where I'm not sure they should have their own field: e.g. would return 0 datasets? And we would have to do this to return two datasets that then need to be combined? But a different implementation would allow
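For illustration, the two-search-then-combine pattern being discussed, with plain xarray Datasets standing in for the results of the two hypothetical `.to_dask()` calls (names and shapes are made up):

```python
import numpy as np
import xarray as xr

# Stand-in for the result of search(variable='sst', ...).to_dask()
sst = xr.Dataset(
    {"sst": (("time", "yt_ocean", "xt_ocean"), np.ones((2, 3, 4), dtype="float32"))},
    coords={"time": [0, 1], "yt_ocean": np.arange(3.0), "xt_ocean": np.arange(4.0)},
)

# Stand-in for the result of search(coords='area_t', ...).to_dask() on a grid file
area = xr.Dataset(
    {"area_t": (("yt_ocean", "xt_ocean"), np.full((3, 4), 25.0))},
    coords={"yt_ocean": np.arange(3.0), "xt_ocean": np.arange(4.0)},
)

# The manual combination step the two-search design requires of the user
combined = xr.merge([sst, area])
```

Since the grid coordinates match, `xr.merge` aligns the two results into a single dataset containing both `sst` and `area_t`.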
I spent a good chunk of time thinking about this yesterday afternoon. I've got a couple of thoughts:
The major difficulty (that I'm thinking about right now) is that adding coordinate variables to the variables we search leads to intake-esm telling xarray to request those variables when it loads the dataset. I'm looking into whether we can add a step to check whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets. If not, I'm not sure that there's a straightforward solution - bundling coordinate variables in with data variables effectively loses the information that the two are different and need to be treated differently, until we actually open the files. TL;DR: I'm going to keep prodding - I agree that it would be nice to be able to search for coordinates or data variables, rather than coordinates and data variables together.
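The check described above could, in principle, look something like the following: inspect one representative dataset and partition the requested names into coordinate and data variables before concatenation. `split_requested` is a hypothetical helper, not existing intake-esm API:

```python
import numpy as np
import xarray as xr

def split_requested(ds: xr.Dataset, names: list[str]) -> tuple[list[str], list[str]]:
    """Partition requested names into (coordinate variables, data variables)."""
    coord_names = [n for n in names if n in ds.coords]
    data_names = [n for n in names if n in ds.data_vars]
    return coord_names, data_names

# Synthetic stand-in for one dataset about to be concatenated
ds = xr.Dataset(
    {"temp": (("time", "st_ocean"), np.zeros((2, 3)))},
    coords={"time": [0, 1], "st_ocean": [0.5, 1.7, 2.9]},
)

coord_names, data_names = split_requested(ds, ["st_ocean", "temp"])
```

The hard part is not the check itself but where to hook it in: it has to run after the files are opened but before xarray is asked to select/concatenate, which is exactly the point at which the bundled variable list has already lost the distinction.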
Ah yes - I think my point was just ... do we need an extra field, can we just put the coordinates in the variable field?
I think we need this for #117 anyway, right?
I see your point - I think if we can put the coordinates into the variable field then that would be preferable - it doesn't add any user complexity. And yeah, looks like #117 is the same issue. In fact, it looks like the cosima_cookbook
Is your feature request related to a problem? Please describe.
Currently, the ACCESS-NRI Intake Catalog doesn't allow for searching of coordinate variables: for example, searching for st_edges_ocean will return 0 datasets. This can make searching for coordinate variables difficult, with 2 main pain points:

- Although this doesn't require the user to (semi-)manually work out what file to open, it's still messy as it requires passing around file names.
- ocean_grid.nc files only contain coordinate variables, and so cannot be found using the catalogue. The only way to currently access these files is to search the catalogue to get a handle on the directory structure, and then construct a file path and load it. This requires the user to start poking around in directory structures to try to work out where to load their data from - which is the problem intake is trying to solve.
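The manual workaround amounts to something like the following sketch; the paths are hypothetical (the real ones are elided in the example output above), and the layout assumption is that ocean_grid.nc sits alongside the catalogued files:

```python
from pathlib import Path

# Hypothetical path of a file that *is* in the catalogue
catalogued = Path("/some/experiment/output000/ocean/ocean.nc")

# ocean_grid.nc cannot be found via the catalogue itself, so the user has to
# construct its path by hand, assuming it lives in the same output directory
grid_path = catalogued.parent / "ocean_grid.nc"

# grid = xr.open_dataset(grid_path)  # then opened directly, bypassing intake
```

This is exactly the directory-structure poking the issue describes: the catalogue provides the neighbouring path, but the grid file itself is loaded outside intake.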
This has caused some pain points migrating COSIMA recipes from cosima_cookbook => intake.
I also think this might be the same issue as discussed in #63? @aidanheerdegen - there seem to be some concerns about coordinates being listed as variables when they shouldn't be there?
Describe the feature you'd like
Searchable coordinates: in the same way that the catalog currently lets you perform searches over variables, it would be useful to be able to do the same on coordinates:
Doing this is subject to a couple of constraints: xr.combine_by_coords will fail if passed a coordinate variable.

Proposed Solution
(data_vars => variables), as this would then confuse coordinates & variables in the ACCESS-NRI Intake Catalog as well as causing concatenation issues. This is implemented on branch 660-coordinate-variables.

Additional Info
datastore.csv.gz files written by builder.save() are typically approximately doubled in size.