Making coordinate variables searchable #201

Closed
charles-turner-1 opened this issue Sep 27, 2024 · 11 comments · Fixed by #212
Assignees
Labels
enhancement New feature or request

Comments

@charles-turner-1 (Collaborator) commented Sep 27, 2024

Is your feature request related to a problem? Please describe.

Currently, the ACCESS-NRI Intake Catalog doesn't allow searching for coordinate variables: for example, searching for st_edges_ocean will return 0 datasets. This makes working with coordinate variables difficult, with two main pain points:

  1. If the coordinate variable is known to be stored in a netCDF file with specific data variables:
    • The user needs to know that these data & coordinate variables are found in the same files, and then search for the data variable in order to access the coordinate variables.
    • In some instances, the user can directly access the coordinate variables by searching the data variables. In others, they need to perform something like
        import xarray as xr

        VARNAME = ...  # a data variable known to share files with the coordinate
        COORD = ...    # the coordinate variable we actually want

        fname = cat.search(variable=VARNAME).df.loc[0, 'path']
        ds = xr.open_dataset(fname)
        coord = ds[COORD]

Although this doesn't require the user to (semi) manually work out what file to open, it's still messy, as it requires passing file names around.

  2. In other instances, coordinate variables are stored completely separately. For example, ocean_grid.nc files contain only coordinate variables, and so cannot be found using the catalogue at all. Currently, the only way to access these files is to search the catalogue to get a handle on the directory structure, and then construct a file path and load it, e.g.:
        import os
        from pathlib import Path
        import xarray as xr

        fname = cat.search(variable=VARNAME).df.loc[0, 'path']
        dirname = Path(fname).parent
        # exactly one *_grid.nc file is expected in the directory
        grid_file, = [file for file in os.listdir(dirname) if file.endswith('_grid.nc')]
        # os.listdir returns bare file names, so re-join with the directory
        ds = xr.open_dataset(dirname / grid_file)
        coord = ds[COORD]

This requires the user to start poking around in directory structures to try to work out where to load their data from - which is exactly the problem intake is trying to solve.

This has caused some pain points migrating COSIMA recipes from cosima_cookbook to intake.

I also think this might be the same issue as discussed in #63? @aidanheerdegen - there seem to be some concerns about coordinates being listed as variables when they shouldn't be?

Describe the feature you'd like

Searchable coordinates: in the same way that the catalog currently lets you perform searches over variables, it would be useful to be able to do the same for coordinates:

var = cat.search(variable=VARNAME).to_dask()
coord = cat.search(coord=COORD).to_dask()

Doing this is subject to a couple of constraints:

  1. The catalog needs to know that coordinates & data variables aren't the same & need to be treated differently - xr.combine_by_coords will fail if passed a coordinate variable.
  2. It requires an upstream patch of intake-esm - see intake-esm issue 660.
  3. It cannot cause serious performance regressions - see the same issue.

Proposed Solution

  1. Update intake-esm with @dougiesquire's proposed solution from intake-esm issue 660.
  2. Add separate coordinate variable fields to the ACCESS-NRI Intake Catalog, rather than just making the same change as in Intake-ESM (data_vars => variables): reusing the variable field would confuse coordinates & variables in the ACCESS-NRI Intake Catalog as well as causing concatenation issues. This is implemented on branch 660-coordinate-variables.

Additional Info

  • Due to the release cycle of Intake-ESM, this solution will probably require us to maintain a fork - at least for some time.
  • I've performance tested the proposed solution: changes in catalogue build times are small (typically ~5%), catalogue read times change by a similar amount (typically 5-10% slower, sometimes faster), and the size of the datastore.csv.gz files written by builder.save() approximately doubles.
@charles-turner-1 charles-turner-1 added the enhancement New feature or request label Sep 27, 2024
@charles-turner-1 charles-turner-1 self-assigned this Sep 27, 2024
@aidanheerdegen (Member)

I also think this might be the same issue as discussed in #63? @aidanheerdegen - there seem to be some concerns about coordinates being listed as variables when they shouldn't be?

That was specifically for a project where we were using the intake catalogue as a source for an "experiment explorer", to expose the variables saved in an experiment as a timeline, to help users understand what variables are available at different times in an experiment. For this purpose we really only wanted diagnostic model variables that have a time-varying component.

Add separate coordinate variable fields to the ACCESS-NRI Intake Catalog

I'm confused. Does this mean

  1. Have a "coordinate" flag (field)?
  2. Move coordinates into a separate catalogue?
  or neither?

BTW, I think this is a somewhat related issue about encoding grid information:

#112

@charles-turner-1 (Collaborator, Author)

Yup, a coordinate flag - sorry for the confusion with terms.

As currently implemented, this would just add coordinates & coordinate-related information to the catalogue - I've tried to keep changes to a minimum. We could think about having a separate catalogue specifically for coordinates, but that feels potentially a bit harder for users to me.
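For illustration, here's a hypothetical sketch of what a single datastore row might look like with the coordinate information added alongside the existing fields (the column names and values are assumptions for illustration, not the final schema):

    # Hypothetical datastore row: coordinates get their own column,
    # rather than being folded into the existing `variable` column,
    # so concatenation logic can still tell the two apart.
    row = {
        "filename": "ocean.nc",
        "file_id": "ocean",
        "frequency": "1mon",
        "variable": ["temp", "pot_temp", "salt"],          # data variables only
        "coords": ["st_ocean", "st_edges_ocean", "time"],  # coordinate variables, kept separate
    }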

I've attached a screencap with a demo of how the change affects catalogue results (N.B. both are run using the same kernel, so cat_without_coords.search(coords='st_edges_ocean') returns an empty search - running it on main would raise an error).

[Screenshot: demo of search results with and without coordinate support, 2024-09-30]

@charles-turner-1 (Collaborator, Author)

I'd also be interested to know whether this would cover your use cases @marc-white @anton-seaice

@marc-white (Collaborator)

@charles-turner-1 my main concern to this point hasn't been how to search for the coordinates, but how to access them. In particular:

  • Currently, when you search for a variable and then do to_dask, you get back a Dask array of the requested variables, plus the coordinates that the variable relates to. Would this work in reverse? That is, if I searched for a coordinate and then converted to an array, would the Dask array include the coordinate, then the related variables, and then the other coordinates that those variables rely on?
  • There's also the issue of intake-esm crashing when it tries to pull out the coordinate data by stacking it on itself (in effect). How could we modify things so that variables and coordinates are treated differently in this regard?

@charles-turner-1 (Collaborator, Author) commented Sep 30, 2024

@marc-white In response to question 1: currently, searching for a coordinate will load the entire dataset - it works very much like searching any other metadata, e.g.

>>> ocean_files = cat_with_coords.search(start_date='2086.*', frequency='1mon', file_id='ocean').to_dask()
>>> print(ocean_files)
<xarray.Dataset> Size: 847GB
Dimensions:                (time: 12, st_ocean: 75, yt_ocean: 2700,
                            xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
                            sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
                            grid_xu_ocean: 3600, grid_yu_ocean: 2700,
                            grid_xt_ocean: 3600, neutral: 80, nv: 2,
                            st_edges_ocean: 76, sw_edges_ocean: 76,
                            potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean               (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean               (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean               (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean         (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time                   (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
  * nv                     (nv) float64 16B 1.0 2.0
    ...                     ...
  * potrho                 (potrho) float64 640B 1.028e+03 ... 1.038e+03
  * potrho_edges           (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean          (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
  * grid_yu_ocean          (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
  * neutral                (neutral) float64 640B 1.028e+03 ... 1.038e+03
  * neutralrho_edges       (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
    temp                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    pot_temp               (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    salt                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    age_global             (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    v                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    ...                     ...
    bih_fric_v             (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u_dot_grad_vert_pv     (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    average_T1             (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_T2             (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_DT             (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    time_bounds            (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
    filename:                                 ocean.nc
    title:                                    ACCESS-OM2-01
    grid_type:                                mosaic
    grid_tile:                                1
    intake_esm_vars:                          ['temp', 'pot_temp', 'salt', 'a...
    intake_esm_attrs:filename:                ocean.nc
    ...                                       ...
    intake_esm_attrs:coord_calendar_types:    ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:            ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:             ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                   ocean
    intake_esm_attrs:_data_format_:           netcdf
    intake_esm_dataset_key:                   ocean.1mon

I do like the suggestion that we only load
  1. data variables that depend on the coordinate variable searched for, and
  2. other coordinate variables that those data variables depend on.
I'll have a look into how straightforward that might be to implement - a first sketch is below.
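As a minimal sketch of that filtering step, assuming we already have a combined dataset in hand (the helper name is illustrative, not part of the catalogue API):

    import xarray as xr

    def subset_by_coord(ds: xr.Dataset, coord: str) -> xr.Dataset:
        """Keep only data variables that depend on `coord`."""
        keep = [v for v in ds.data_vars
                if coord in ds[v].dims or coord in ds[v].coords]
        # selecting a list of variables returns a Dataset, and xarray
        # automatically carries over the coordinates those variables use
        return ds[keep]

    # e.g. subset_by_coord(ocean_files, 'st_ocean')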

  • With regard to the second question: searching for a coordinate & obtaining a multi-file dataset works fine right now, so I think this approach has sidestepped the crashing issue:
>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean',start_date = '2086.*')
>>> print(st_edges_search.df.head(3))
filename file_id                                               path  \
0  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   
1  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   
2  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   

  filename_timestamp frequency            start_date              end_date  \
0                NaN      1mon  2086-01-01, 00:00:00  2086-04-01, 00:00:00   
1                NaN      1mon  2086-04-01, 00:00:00  2086-07-01, 00:00:00   
2                NaN      1mon  2086-07-01, 00:00:00  2086-10-01, 00:00:00   

>>> st_edges = st_edges_search.to_dask()
>>> print(st_edges)
<xarray.Dataset> Size: 847GB
Dimensions:                (time: 12, st_ocean: 75, yt_ocean: 2700,
                            xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
                            sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
                            grid_xu_ocean: 3600, grid_yu_ocean: 2700,
                            grid_xt_ocean: 3600, neutral: 80, nv: 2,
                            st_edges_ocean: 76, sw_edges_ocean: 76,
                            potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean               (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean               (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean               (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean         (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time                   (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
  * nv                     (nv) float64 16B 1.0 2.0
    ...                     ...
  * potrho                 (potrho) float64 640B 1.028e+03 ... 1.038e+03
  * potrho_edges           (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean          (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
  * grid_yu_ocean          (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
  * neutral                (neutral) float64 640B 1.028e+03 ... 1.038e+03
  * neutralrho_edges       (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
    temp                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    pot_temp               (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    salt                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    age_global             (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    v                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    ...                     ...
    bih_fric_v             (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u_dot_grad_vert_pv     (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    average_T1             (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_T2             (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_DT             (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    time_bounds            (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
    filename:                                 ocean.nc
    title:                                    ACCESS-OM2-01
    grid_type:                                mosaic
    grid_tile:                                1
    intake_esm_vars:                          ['temp', 'pot_temp', 'salt', 'a...
    intake_esm_attrs:filename:                ocean.nc
    ...                                       ...
    intake_esm_attrs:coord_calendar_types:    ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:            ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:             ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                   ocean
    intake_esm_attrs:_data_format_:           netcdf
    intake_esm_dataset_key:                   ocean.1mon

As before, searching for a variable will only load the relevant data, e.g.:

>>> pot_temp_search = cat_with_coords.search(variable='pot_temp',frequency='1mon',file_id='ocean',start_date = '2086.*')
>>> print(pot_temp_search.df.head(3))
 filename file_id                                               path  \
0  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   
1  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   
2  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   

  filename_timestamp frequency            start_date              end_date  \
0                NaN      1mon  2086-01-01, 00:00:00  2086-04-01, 00:00:00   
1                NaN      1mon  2086-04-01, 00:00:00  2086-07-01, 00:00:00   
2                NaN      1mon  2086-07-01, 00:00:00  2086-10-01, 00:00:00   
...
>>> pot_temp = pot_temp_search.to_dask()
>>> print(pot_temp)
<xarray.Dataset> Size: 35GB
Dimensions:   (time: 12, st_ocean: 75, yt_ocean: 2700, xt_ocean: 3600)
Coordinates:
  * xt_ocean  (xt_ocean) float64 29kB -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
  * yt_ocean  (yt_ocean) float64 22kB -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
  * st_ocean  (st_ocean) float64 600B 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03
  * time      (time) object 96B 2086-01-16 12:00:00 ... 2086-12-16 12:00:00
Data variables:
    pot_temp  (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
Attributes: (12/22)
    filename:                                 ocean.nc
    title:                                    ACCESS-OM2-01
    grid_type:                                mosaic
    grid_tile:                                1
    intake_esm_vars:                          ['pot_temp']
    intake_esm_attrs:filename:                ocean.nc
    ...                                       ...
    intake_esm_attrs:coord_calendar_types:    ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:            ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:             ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                   ocean
    intake_esm_attrs:_data_format_:           netcdf
    intake_esm_dataset_key:                   ocean.1mon

@anton-seaice (Collaborator)

Instead of this:

st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean',start_date = '2086.*')
st_edges = st_edges_search.to_dask()

We should just be able to do:

cat_with_coords.search(coords='st_edges_ocean').to_dask() (possibly the file_id part is still needed)

'st_edges_ocean' doesn't change over time, so we should check that all files have the same values, but having to supply start_date = '2086.*' is messy. And calling to_dask would ideally return only the coordinates (without the data_vars), e.g.

[Screenshots: mock-ups of the proposed coordinate-only to_dask output, 2024-10-01]

@charles-turner-1 (Collaborator, Author) commented Oct 1, 2024

<xarray.Dataset> Size: 229kB
Dimensions:           (xt_ocean: 3600, yt_ocean: 2700, st_ocean: 75,
                       st_edges_ocean: 76, time: 2760, nv: 2, xu_ocean: 3600,
                       yu_ocean: 2700, sw_ocean: 75, sw_edges_ocean: 76,
                       grid_xu_ocean: 3600, grid_yt_ocean: 2700, potrho: 80,
                       potrho_edges: 81, grid_xt_ocean: 3600,
                       grid_yu_ocean: 2700, neutral: 80, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean          (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean          (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean          (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean    (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time              (time) object 22kB 1950-01-16 12:00:00 ... 2179-12-16 1...
  * nv                (nv) float64 16B 1.0 2.0
    ...                ...
  * potrho            (potrho) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
  * potrho_edges      (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean     (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * grid_yu_ocean     (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 89.96 90.0
  * neutral           (neutral) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
  * neutralrho_edges  (neutralrho_edges) float64 648B 1.028e+03 ... 1.038e+03
Data variables:
    *empty*
Attributes: (12/16)
    filename:                               ocean.nc
    title:                                  ACCESS-OM2-01
    grid_type:                              mosaic
    grid_tile:                              1
    intake_esm_attrs:filename:              ocean.nc
    intake_esm_attrs:file_id:               ocean
    ...                                     ...
    intake_esm_attrs:coord_calendar_types:  ['', '', '', '', 'NOLEAP', '', ''...
    intake_esm_attrs:coord_bounds:          ['', '', '', '', 'time_bounds', '...
    intake_esm_attrs:coord_units:           ['degrees_E', 'degrees_N', 'meter...
    intake_esm_attrs:realm:                 ocean
    intake_esm_attrs:_data_format_:         netcdf
    intake_esm_dataset_key:                 ocean.1mon

I've looked into it, & I think adding data_vars that depend on the coordinate vars we've searched is going to
  1. Probably require modifications to ecgtools.
  2. Potentially cause memory usage blowups.

Subsetting down to just a few files by providing a year etc., e.g.

st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean', start_date='2086.*')
st_edges = st_edges_search.to_dask()

is messy, I agree. In the search above, a large ARE instance will crash with default chunking - this seems to be the result of Dask trying to set one chunk per file - and subsetting down to just a few files (start_date = '2086.*') resolves this.
I'd rather not change the default chunking when we perform coordinate searches - it seems messy and error-prone - depending on the coordinate, we might end up loading most of the data variables in the file.

With the current implementation, we can easily search & load the dataset as follows:

>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean')
>>> varnames = st_edges_search.df.loc[0, 'variable']
>>> print(st_edges_search.to_dask(xarray_open_kwargs={'drop_variables': varnames}))
<xarray.Dataset> Size: 229kB
Dimensions:           (xt_ocean: 3600, yt_ocean: 2700, st_ocean: 75,
                       st_edges_ocean: 76, time: 2760, nv: 2, xu_ocean: 3600,
                       yu_ocean: 2700, sw_ocean: 75, sw_edges_ocean: 76,
                       grid_xu_ocean: 3600, grid_yt_ocean: 2700, potrho: 80,
                       potrho_edges: 81, grid_xt_ocean: 3600,
                       grid_yu_ocean: 2700, neutral: 80, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean          (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean          (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean          (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean    (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time              (time) object 22kB 1950-01-16 12:00:00 ... 2179-12-16 1...
  * nv                (nv) float64 16B 1.0 2.0
    ...                ...
  * potrho            (potrho) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
  * potrho_edges      (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean     (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * grid_yu_ocean     (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 89.96 90.0
  * neutral           (neutral) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
  * neutralrho_edges  (neutralrho_edges) float64 648B 1.028e+03 ... 1.038e+03
Data variables:
    *empty*
Attributes: (12/16)
    filename:                               ocean.nc
    title:                                  ACCESS-OM2-01
    grid_type:                              mosaic
    grid_tile:                              1
    intake_esm_attrs:filename:              ocean.nc
    intake_esm_attrs:file_id:               ocean
    ...                                     ...
    intake_esm_attrs:coord_calendar_types:  ['', '', '', '', 'NOLEAP', '', ''...
    intake_esm_attrs:coord_bounds:          ['', '', '', '', 'time_bounds', '...
    intake_esm_attrs:coord_units:           ['degrees_E', 'degrees_N', 'meter...
    intake_esm_attrs:realm:                 ocean
    intake_esm_attrs:_data_format_:         netcdf
    intake_esm_dataset_key:                 ocean.1mon
   

I think the most straightforward solution here is going to be to get searches on coordinates to drop all data variables by default.
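A minimal sketch of that default, wrapping the drop_variables pattern above into a helper (the function name is illustrative; the coords search field is the one proposed in this issue):

    def search_coords(catalog, coord, **query):
        """Search on a coordinate and open only the coordinates,
        dropping every data variable listed in the datastore."""
        result = catalog.search(coords=coord, **query)
        varnames = result.df.loc[0, 'variable']
        return result.to_dask(xarray_open_kwargs={'drop_variables': varnames})

    # e.g. search_coords(cat_with_coords, 'st_edges_ocean', file_id='ocean')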

@anton-seaice (Collaborator)

I think the most straightforward solution here is going to be to get searches on coordinates to drop all data variables by default.

This makes sense to me. Generally you are searching for a coordinate when you're trying to load it from a different file than the one the variable is in anyway, so you'll need a separate search & .to_dask operation - e.g., you might want SST and area_t.

And this is where I'm not sure they should have their own field:

e.g.
cat.search(variable='SST', coords='area_t')

would return 0 datasets?

And we would have to do this

cat.search(variable='SST')
cat.search(coords='area_t')

to return two datasets that then need to be combined?

But a different implementation would allow cat.search(variable=['area_t', 'SST']) to return the two datasets we want?

@charles-turner-1 (Collaborator, Author) commented Oct 2, 2024

I spent a good chunk of time thinking about this yesterday afternoon. I've got a couple of thoughts:

  • I agree it's counterintuitive that cat.search(variable='SST', coords='area_t') returns nothing.

  • In a search like this:

    cat.search(variable='SST')
    cat.search(coords='area_t')

    to return two datasets that then need to be combined?

    but a different implementation would allow cat.search(variable=['area_t', 'SST']) to return the two datasets we want?

    we're going to obtain two datasets from the search anyway, right? And so downstream of the search we're necessarily going to have something like:

    sst, area = datasets.get('SST'), datasets.get('area_t')

    which the user will have to merge anyway. So perhaps the file structure enforces a bit of unavoidable clunkiness in these situations?
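    A sketch of that downstream merge step, using the names from the snippet above (and assuming the two search results align on the horizontal grid):

        import xarray as xr

        sst, area = datasets.get('SST'), datasets.get('area_t')
        # the user combines the two results themselves, e.g.
        combined = xr.merge([sst, area])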

The major difficulty (that I'm thinking about right now) is that adding coordinate variables to the variables we search leads to intake-esm telling xarray to request those variables when it loads the dataset. I'm looking into whether we can add a step that checks whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets - a sketch of such a check is below.

If not, I'm not sure that there's a straightforward solution - bundling coordinate variables in with data variables effectively loses the information that the two are different and need to be treated differently, right up until we actually open the files.
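A sketch of the kind of check I mean, assuming it runs once per file at catalogue build time (the function name is illustrative):

    import xarray as xr

    def classify_variables(path: str) -> tuple[list[str], list[str]]:
        """Split a file's variables into data variables and coordinates.

        Recording this split in the datastore would let intake-esm know
        which names are coordinates before opening and concatenating."""
        with xr.open_dataset(path, chunks={}) as ds:
            return list(ds.data_vars), list(ds.coords)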

TL;DR: I'm going to keep prodding - I agree that it would be nice to be able to search for coordinates or data variables, rather than coordinates and data variables.

@anton-seaice (Collaborator)

  • we're going to obtain two datasets from the search anyway, right? And so downstream of the search we're necessarily going to have something like:

Ah yes - I think my point was just: do we need an extra field, or can we just put the coordinates in the variable field?

I'm looking into whether we can add a step to check whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets.

I think we need this for #117 anyway, right?

@charles-turner-1 (Collaborator, Author)

I see your point - I think if we can put the coordinates into the variable field then that would be preferable, since it doesn't add any user complexity.

And yeah, it looks like #117 is the same issue. In fact, the cosima_cookbook n=1 argument @dougiesquire mentions there looks like the workaround cosima_cookbook used - which makes me suspect this problem is hiding in a few other places too.
