Group together decoding options into a single argument #4490

shoyer · 2020-10-06T06:15:18Z

Is your feature request related to a problem? Please describe.

open_dataset() currently has a very long function signature. This makes it hard to keep track of everything it can do, and is particularly problematic for the authors of new backends (e.g., see #4477), which might need to know how to handle all these arguments.

Describe the solution you'd like

To simple the interface, I propose to group together all the decoding options into a new DecodingOptions class. I'm thinking something like:

from dataclasses import dataclass, field, asdict
from typing import Optional, List

@dataclass(frozen=True)
class DecodingOptions:
    mask: Optional[bool] = None
    scale: Optional[bool] = None
    datetime: Optional[bool] = None
    timedelta: Optional[bool] = None
    use_cftime: Optional[bool] = None
    concat_characters: Optional[bool] = None
    coords: Optional[bool] = None
    drop_variables: Optional[List[str]] = None

    @classmethods
    def disabled(cls):
        return cls(mask=False, scale=False, datetime=False, timedelta=False,
                  concat_characters=False, coords=False)

    def non_defaults(self):
        return {k: v for k, v in asdict(self).items() if v is not None}

    # add another method for creating default Variable Coder() objects,
    # e.g., those listed in encode_cf_variable()

The signature of open_dataset would then become:

def open_dataset(
    filename_or_obj,
    group=None,
    *
    engine=None,
    chunks=None,
    lock=None,
    cache=None,
    backend_kwargs=None,
    decode: Union[DecodingOptions, bool] = None, 
    **deprecated_kwargs
):
    if decode is None:
        decode = DecodingOptions()
    if decode is False:
        decode = DecodingOptions.disabled()
    # handle deprecated_kwargs...
    ...

Question: are decode and DecodingOptions the right names? Maybe these should still include the name "CF", e.g., decode_cf and CFDecodingOptions, given that these are specific to CF conventions?

Note: the current signature is open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None, decode_times=True, autoclose=None, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None, backend_kwargs=None, use_cftime=None, decode_timedelta=None)

Usage with the new interface would look like xr.open_dataset(filename, decode=False) or xr.open_dataset(filename, decode=xr.DecodingOptions(mask=False, scale=False)).

This requires a little bit more typing than what we currently have, but it has a few advantages:

It's easier to understand the role of different arguments. Now there is a function with ~8 arguments and a class with ~8 arguments rather than a function with ~15 arguments.
It's easier to add new decoding arguments (e.g., for more advanced CF conventions), because they don't clutter the open_dataset interface. For example, I separated out mask and scale arguments, versus the current mask_and_scale argument.
If a new backend plugin for open_dataset() needs to handle every option supported by open_dataset(), this makes that task significantly easier. The only decoding options they need to worry about are non-default options that were explicitly set, i.e., those exposed by the non_defaults() method. If another decoding option wasn't explicitly set and isn't recognized by the backend, they can just ignore it.

Describe alternatives you've considered

For the overall approach:

We could keep the current design, with separate keyword arguments for decoding options, and just be very careful about passing around these arguments. This seems pretty painful for the backend refactor, though.
We could keep the current design only for the user facing open_dataset() interface, and then internally convert into the DecodingOptions() struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?

The text was updated successfully, but these errors were encountered:

dcherian · 2020-10-06T15:41:18Z

Totally in favour of this. Option 2 does seem like a good intermediate step.

This proposal would make error handling easier (#3020)

I agree that we should add CF to the names. Can we use Decoders instead of DecodingOptions? So decode_cf and CFDecoders?

shoyer · 2020-10-06T16:25:54Z

I agree that we should add CF to the names. Can we use Decoders instead of DecodingOptions? So decode_cf and CFDecoders?

works for me!

aurghs · 2020-10-06T23:35:30Z

I agree, open_dataset() currently has a very long signature that should be changed.
The interface you proposed is obviously clearer, but a class could give a false idea that all backends support all the decoding options listed in the class. I see two other alternatives:

Instead of a class we could use a dictionary. Pros 1, 2 and 3 would still hold.
With the interface proposed by @alexamici in Flexible Backend - AbstractDataStore definition #4309 the pros 2 and 3 would still hold and partially 1 (since the open_dataset interface would be greatly simplified).

For both these proposals, we would lose the autocompletion with the tab but, on the other hand, the user would be relieved of managing a class.
Finally, I'm not sure that for the user it would be clear the separation between backend_kwargs and decode, since they both contain arguments that will be passed to the backend. Especially if the backend needs more specific decoding options that must be set in backend_kwargs. In this sense, #4309 seems less error-prone.

shoyer · 2020-10-07T01:02:41Z

I see this as complementary to @alexamici's proposal in #4309 (comment), which I also like. I guess the main difference is moving decode_cf into the explicit signature of open_dataset, rather than leaving it in **kwargs.

Auto-completion is one reason to prefer a class, but enforced error checking and consistency between backends in the data model are also good reasons. In particular, it is important that users get an error if they mis-spell an argument name, e.g., open_dataset(path, decode_times=False) vs open_dataset(path, decode_time=False). We can definitely achieve this with putting decoding options into a dict, too, but we would need to be carefully to always validate the set of dict keys.

I guess cfgrib is an example of a backend with its own CF decoding options? This is indeed a tricky design question. I don't know if it's possible to make xarray.open_dataset() directly extensible in this way -- this could still be a reason for a user to use a backend-specific cfgrib.open_dataset() function.

alexamici · 2020-10-08T14:06:35Z

@shoyer I favour option 2. as a stable solution, possibly with all arguments moved to keyword-only ones:

users don't need to import and additional class
users get the argument completion onopen_dataset
xarray does validation and mangling in the class and passes to the backends only the non default values

I'm for using the words decode/decoding but I'm against the use of CF.

What backend will do is map the on-disk representation of the data (typically optimised for space) to the in-memory representation preferred by xarray (typically optimised of easy of use). This mapping is especially tricky for time-like variables.

CF is one possible way to specify the map, but it is not the only one. Both the GRIB format and all the spatial formats supported by GDAL/rasterio can encode rich data and decoding has (typically) nothing to do with the CF conventions.

My preferred meaning for the decode_-options is:

True: the backend attempts to map the data to the xarray natural data types (np.datetime64, np.float with mask and scale)
False: the backend attempts to return a representation of the data as close as possible to the on-disk one

Typically when a user asks the backend not to decode they intend to insepct the content of the data file to understand why something is not mapping in the expected way.

As an example: in the case of GRIB time-like values are represented as integers like 20190101, but cfgrib at the moment is forced to convert them into a fake CF representation before passing them to xarray, and when using decode_times=False a GRIB user is presented with something that has nothing to do with the on-disk representation.

aurghs · 2020-10-29T04:07:46Z

Taking into account the comments in this issue and the calls, I would propose this solution: #4547

TomNicholas · 2024-08-19T16:07:20Z

I had a similar idea in #9379 but forgot to check this issue didn't already exist 🤦‍♂️ . A few points made there that weren't already made here:

I agree that we should add CF to the names.

I'm for using the words decode/decoding but I'm against the use of CF.

It is unclear to me now whether or not the role of the decoding step needs to be CF-specific.

Decoding according to some conventions is something that is enshrined for the Climate Forecast community, but definitely exists for other fields of science too. The ubiquity of netCDF means that it is not actually unusual to see e.g. plasma physics data with their own plasma-specific conventions stored in a netCDF file. Even the full definition of what variables are coordinate variables is mediated by the CF conventions, but other domains mights have different ideas about that. If you're opening Zarr then clearly you don't want to be constrained to follow the CF conventions.

As all these kwargs are passed on to the public xarray.decode_cf function, I suggested in #9379 that we instead make decode a Callable[AbstractDatastore | Dataset, Dataset], with the idea being that it would point to xr.decode_cf by default but be general beyond CF if a different callable is passed.

Though perhaps I'm missing the point here and we are conflating "decode data from format returned by backend to useful in-memory types" with "apply domain-specific metadata-dependent transformations". One could for example use a decode callable to construct flexible indexes based on a variable's metadata...

group together all the decoding options into a new DecodingOptions class.

I had suggested a dict, but a dedicated DecodingOptions class could also work, so long as it doesn't prevent the generality above. Note that since this issue was originally raised we now also have from_array_kwargs (a dict) in addition to backend_kwargs (a dict).

keewis · 2024-08-20T10:22:02Z

throwing in another idea: what if instead of boolean flags we change decode_cf to:

xr.decode(ds, decoders: Sequence[str | callable] | None = ["cf"], disable: Sequence[str] | None = None)

where the order in decoders would dictate the order of application of decoders.

You'd use it like this:

xr.decode(ds, decoders=["cf", custom_decoder_func], disable=["mask", "scale", "datetime"])
xr.decode(ds, decoders=["mask", "scale", custom_decoder_func])

where "cf" is an alias for a list containing all the decoders we currently apply by default (another option would be to publicly expose that list, which allows filtering manually and avoiding the need for the second kwarg)

Then all open_* would have to do is forward the kwargs, and to disable entirely you pass decoders=None.

dcherian · 2024-08-20T16:57:20Z

I'm not sure what my previous comment was referring to but the current coders are CF-specific and so should be exposed with CF in the name IMO. I agree with using xr.decode instead of xr.decode_cf.

Also some of these "coders" can have complicated options: Example: https://cf-xarray.readthedocs.io/en/latest/coding.html#compression-by-gathering could decode to multiindex or sparse. We could encode/decode multiindex/sparse to a number of Discrete Sampling Geometry specifications etc. So using dataclasses seems like a good idea here.

How about:

xr.decode(ds: Dataset, coders: Sequence[Coder] | None = CFCoders())

with a helper xr.coders.CFCoders(sane_defaults) -> Sequence[Coder] to construct a list of Coders and allow turning specific ones on/off.

TomNicholas · 2024-08-20T17:10:45Z

the current coders are CF-specific and so should be exposed with CF in the name IMO

I think the point is just that if we are going to refactor this we should take the opportunity to generalize to allow supporting non-CF decoders too.

xr.decode(ds: Dataset, coders: Sequence[Coder] | None = CFCoders())

I like this, this is very neat. Then is the idea is that xr.open_dataset has a kwarg decode: Sequence[Coder] = xr.coders.CFCoders(sane_defaults)?

keewis · 2024-08-20T17:19:36Z

and would sane_defaults refer to options where you can enable / disable certain coders?

dcherian · 2024-08-20T17:27:28Z

yes to be clear: sane_defaults expands to today's defaults

CFCoders(
     mask: Optional[bool] = None
    scale: Optional[bool] = None
    datetime: Optional[bool] = None
    timedelta: Optional[bool] = None
    use_cftime: Optional[bool] = None
    concat_characters: Optional[bool] = None
    coords: Optional[bool] = None
    drop_variables: Optional[List[str]] = None
    )

for example.

xr.open_dataset has a kwarg decode: Sequence[Coder] = xr.coders.CFCoders(sane_defaults)?

Yes.

The wrinkle is that we might want to set different options for different backends: example, zarr can roundtrip datetime64 just fine without any encoding. Perhaps this is another reason to keep the separate open_zarr that turns off datetime encoding.

kmuehlbauer · 2024-08-20T17:51:23Z

👍 for the suggestion

It seems a good idea to check the metadata if a certain coder is applicable or not.

Another thing, currently we do not differentiate mask and scale, it's mask_and_scale.

One possible reason is that we can't mask integer values without casting to float (at least atm, until there will be nullable integers).

keewis · 2024-08-20T18:24:51Z

zarr can roundtrip datetime64 just fine without any encoding.

Backends could, in theory, filter the coders they forward to store_entrypoint.open_dataset for standard coders that don't apply to them. Another option would be to just go ahead and decode, the decoders should skip variables that are already decoded.

kmuehlbauer · 2024-12-17T16:32:12Z

Breathing new life into this.

Over in #9618 we are working on time decoding. @dcherian had the idea to split out CFDatetimeCoder to be able for users to parameterize it (deprecating use_cftime kwarg, adding time_unit kwarg later on).

I've split out this implementation from #9618 into #9901. As this adds new API, @pydata/xarray please have a review over there. Thanks!

This was referenced Oct 6, 2020

API Design for Xarray Backends #1970

Open

WIP: Proposed refactor of read API for backends #4477

Merged

shoyer added API design topic-backends labels Oct 6, 2020

dcherian mentioned this issue Oct 6, 2020

Consider how to deal with the proliferation of decoder options on open_dataset #939

Closed

aurghs mentioned this issue Oct 28, 2020

Update signature open_dataset for API v2 #4547

Merged

2 tasks

dcherian mentioned this issue Oct 7, 2022

xarray.open_dataset has issues if the dataset returned by the backend contains a multiindex #7139

Closed

4 tasks

dcherian mentioned this issue Aug 19, 2024

Simplify signature of xr.open_dataset using new decoding_kwargs dict #9379

Closed

TomNicholas added the topic-CF conventions label Aug 19, 2024

TomNicholas mentioned this issue Aug 20, 2024

Use xarray's encode_cf / decode_cf functions to handle CF conventions zarr-developers/VirtualiZarr#157

Open

dcherian mentioned this issue Nov 22, 2024

Relax nanosecond datetime restriction in CF time decoding #9618

Open

9 tasks

kmuehlbauer mentioned this issue Dec 17, 2024

split out CFDatetimeCoder, deprecate use_cftime as kwarg #9901

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Group together decoding options into a single argument #4490

Group together decoding options into a single argument #4490

shoyer commented Oct 6, 2020 •

edited

Loading

dcherian commented Oct 6, 2020

shoyer commented Oct 6, 2020

aurghs commented Oct 6, 2020

shoyer commented Oct 7, 2020

alexamici commented Oct 8, 2020

aurghs commented Oct 29, 2020

TomNicholas commented Aug 19, 2024

keewis commented Aug 20, 2024 •

edited

Loading

dcherian commented Aug 20, 2024

TomNicholas commented Aug 20, 2024

keewis commented Aug 20, 2024

dcherian commented Aug 20, 2024 •

edited

Loading

kmuehlbauer commented Aug 20, 2024

keewis commented Aug 20, 2024 •

edited

Loading

kmuehlbauer commented Dec 17, 2024

Group together decoding options into a single argument #4490

Group together decoding options into a single argument #4490

Comments

shoyer commented Oct 6, 2020 • edited Loading

dcherian commented Oct 6, 2020

shoyer commented Oct 6, 2020

aurghs commented Oct 6, 2020

shoyer commented Oct 7, 2020

alexamici commented Oct 8, 2020

aurghs commented Oct 29, 2020

TomNicholas commented Aug 19, 2024

keewis commented Aug 20, 2024 • edited Loading

dcherian commented Aug 20, 2024

TomNicholas commented Aug 20, 2024

keewis commented Aug 20, 2024

dcherian commented Aug 20, 2024 • edited Loading

kmuehlbauer commented Aug 20, 2024

keewis commented Aug 20, 2024 • edited Loading

kmuehlbauer commented Dec 17, 2024

shoyer commented Oct 6, 2020 •

edited

Loading

keewis commented Aug 20, 2024 •

edited

Loading

dcherian commented Aug 20, 2024 •

edited

Loading

keewis commented Aug 20, 2024 •

edited

Loading