Best practices for operational forecast data #1169
Replies: 10 comments 9 replies
-
Sounds good to me.

> Handling datetimes

Usually, the data is concatenated along the dimension [...]. On IRIDL, the S2S data is concatenated along [...]. For the [...]

Regarding 1.: Most often I think in [...]

Regarding 2.: Yes. Agreed. IRIDL must have the same problem I guess. @aaron-kaplan might know someone [...]

> How to handle multiple "streams" and How to handle datacubes?
If there are no naming duplicates, I think they could go into the same collection, if the total number of variables is not overwhelming. @floriankrb might know much about this. He put [...]
-
Disclaimer: I have no clue about forecast data, but that might actually help to think a bit out of the box? ;-)
Yes, that's also my first thought. And I'm not sure I agree with all the arguments below.
Why is that an issue?
Aren't older forecasts separate items in a catalog (or sub-collection)? I'd have expected that...
I would suggest using the versioning extension: use the deprecated flag and add links to the next version, and potentially also to the newest version (although the latter would require a lot of updates to old files). You can then exclude the deprecated items in a search (by default?) and enable/disable them specifically by setting the "deprecated" query/filter option. This would nicely return only the newest items and probably make for a nicer user experience, I assume.
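To illustrate, here is a minimal sketch of a superseded forecast item, assuming the versioning extension's `version`/`deprecated` fields and `*-version` link relations; all ids and hrefs are hypothetical:

```python
# Sketch of a deprecated forecast item using the versioning extension.
# Ids, hrefs, and the schema URL version are placeholders.
old_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://stac-extensions.github.io/version/v1.0.0/schema.json"
    ],
    "id": "forecast-2022-01-01T00",
    "properties": {
        "datetime": "2022-01-01T00:00:00Z",
        "version": "1",
        "deprecated": True,  # excluded from searches unless explicitly requested
    },
    "links": [
        # Points at the forecast that superseded this one.
        {"rel": "successor-version", "href": "./forecast-2022-01-01T12.json"},
        # Optional, but requires rewriting old items on every update.
        {"rel": "latest-version", "href": "./forecast-2022-01-05T00.json"},
    ],
}
```

A search could then filter on `deprecated = false` by default and only surface the current forecast.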
I guess it depends on how you define a data cube and what exactly the data cubes look like. Either you work with more dimensions, or you could link to the same file (as asset) twice, with different data cube definitions in each asset.
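For the second option, here is a minimal sketch of what that could look like, assuming the datacube extension's fields are allowed at the asset level; hrefs, variable names, and values are made up:

```python
# One GRIB2 file exposed as two assets, each carrying its own datacube
# definition. Everything here is illustrative, not a tested catalog.
assets = {
    "surface": {
        "href": "https://example.com/data/forecast.grib2",
        "type": "application/wmo-GRIB2",
        "cube:dimensions": {
            "x": {"type": "spatial", "axis": "x", "extent": [-180, 180]},
            "y": {"type": "spatial", "axis": "y", "extent": [-90, 90]},
        },
        "cube:variables": {
            "t2m": {"dimensions": ["y", "x"], "type": "data"},
        },
    },
    "pressure-levels": {
        "href": "https://example.com/data/forecast.grib2",  # same file
        "type": "application/wmo-GRIB2",
        "cube:dimensions": {
            "x": {"type": "spatial", "axis": "x", "extent": [-180, 180]},
            "y": {"type": "spatial", "axis": "y", "extent": [-90, 90]},
            "level": {"type": "vertical", "values": [1000, 850, 500]},
        },
        "cube:variables": {
            "t": {"dimensions": ["level", "y", "x"], "type": "data"},
        },
    },
}
```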
-
A note about naming these dimensions: for the time when the forecast actually applies / is valid for, you called it "forecast datetime", but this is ambiguous, while "valid_time" is not (or "valid_datetime" if you like). BTW, following the CF conventions, this dimension has standard_name="time".

Regarding making recommendations on how to group the data by "reference datetime" (when the model starts) or by "valid datetime" (when the data is valid for): unfortunately, I will not be very helpful here. The access patterns to our archive do not favor one or the other, as very different users have very different access patterns: we saw that we need to support both of them, and more. But accessing a specific subset of our data may be different, and I look forward to having feedback from you and your users on this topic.
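For concreteness, a small xarray sketch of the CF-style layout described above, with made-up values: valid_time is derived as reference_time plus lead_time and carries standard_name="time".

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two model runs, each with lead times from 0 to 48 hours.
reference_time = pd.date_range("2022-01-01", periods=2, freq="12H")
lead_time = pd.timedelta_range("0H", "48H", freq="12H")

ds = xr.Dataset(
    {
        "t2m": (
            ("reference_time", "lead_time"),
            np.random.rand(len(reference_time), len(lead_time)),
        )
    },
    coords={"reference_time": reference_time, "lead_time": lead_time},
)

# valid_time is two-dimensional: one value per (reference_time, lead_time) pair.
ds = ds.assign_coords(valid_time=ds.reference_time + ds.lead_time)
ds["valid_time"].attrs["standard_name"] = "time"
```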
-
@aaronspring thanks for pulling me in. I've been meaning to get acquainted with STAC.

The IRIDL represents a forecast dataset as a variable with (at least) four dimensions: Y (latitude), X (longitude), S (reference date), and L (lead time, which is the difference between forecast date and reference date). As Aaron said, we extend the S dimension as new forecasts are issued.

I gather that with STAC, each "item" needs to be identified by a single datetime, so what the IRIDL considers one variable would have to be split up into thousands of "items." This is a really bad fit for some common workflows. Our users often need to retrieve a time series spanning multiple decades for a single geographic point or region. (This applies more often to observational data than to forecasts, but I have seen it with forecasts too, in an application where we evaluate a model's historical accuracy for a particular location.) If a product with daily resolution on the time dimension is split up into one "item" (file, URL, catalog entry, whatever) per day, then to get a 20-year time series for a single point you have to open more than 7,000 items, retrieving just a couple of bytes from each, which is extremely inefficient.

Representing the dataset as a single zarr store with a time dimension allows us to use a different chunking strategy that better supports this kind of request. I don't know how we would represent that kind of entity in STAC.
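A rough sketch of that chunking idea, assuming the IRIDL-style dimension names above; the file paths are hypothetical:

```python
import xarray as xr

# One zarr store holds the whole archive: S x L x Y x X.
ds = xr.open_dataset("forecasts.nc")  # decades of forecasts

# Keep the full S (reference date) and L (lead time) ranges in each chunk,
# but tile space finely: a 20-year series at one grid point then reads a
# handful of chunks instead of opening 7,000+ per-date items.
ds.chunk({"S": -1, "L": -1, "Y": 16, "X": 16}).to_zarr("forecasts.zarr")
```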
-
I found some previous discussion of the issue I mentioned here: #781. Apparently a STAC item can be identified by a time range rather than a single datetime. There's an open issue about allowing the end date to be open rather than fixed.
-
I put up some test items at https://pct-apis-staging.westeurope.cloudapp.azure.com/stac/collections/ecmwf-forecast/. These differ from my original proposal: [...]
https://notebooksharing.space/view/cc994864817996cda7a6e13c09fb9018be8e1b41450ed1a30db3504f39bb7956 has a demo going through it with various queries.

For timeseries of items (I want all the forecasts leading up to a specific valid datetime, or I want all the forecasts from a specific reference datetime), you specify just that in your query. To find a single forecast, you'll make a query specifying both the reference datetime and the valid datetime, or you'll make a query specifying either the reference or valid datetime and then select the one you want (e.g. taking the [...]).

In general, I like the idea of not combining assets with the same "datetime" (reference datetime or valid datetime) into a single item. It wasn't obvious which datetime to use, and so perhaps not guessing is the right thing to do. This does increase the number of items substantially (by roughly 200x), but hopefully that won't become a burden on the database.
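For reference, those queries might look roughly like this with pystac-client, assuming the staging API implements the query extension; the property names mirror the demo notebook and may differ in the final collection:

```python
from pystac_client import Client

catalog = Client.open(
    "https://pct-apis-staging.westeurope.cloudapp.azure.com/stac/"
)

# All forecasts issued from a specific reference datetime.
issued = catalog.search(
    collections=["ecmwf-forecast"],
    query={"ecmwf:reference_datetime": {"eq": "2022-01-01T00:00:00Z"}},
)

# All forecasts leading up to a specific valid datetime.
leading_up = catalog.search(
    collections=["ecmwf-forecast"],
    query={"ecmwf:forecast_datetime": {"eq": "2022-01-02T12:00:00Z"}},
)

items = list(issued.items())
```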
-
FWIW, it looks like the Google Earth Engine STAC catalog includes GFS forecasts using the terms [...]
-
It seems many people implement "proprietary"/custom solutions now, which should be avoided. As such, I just started and made a very slim proposal as "Forecast extension": https://github.com/stac-extensions/forecast#fields I still don't have a lot of background knowledge about this domain so feel encouraged to comment and improve it through PRs.
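For anyone skimming, item properties using the proposed fields might look roughly like this; the field names are taken from the proposal as of this writing and may change, so check the repo:

```python
# Sketch of forecast-extension properties on an item. Values are made up;
# forecast:horizon is an ISO 8601 duration from reference to valid time.
properties = {
    "datetime": "2022-01-02T12:00:00Z",                      # when the data is valid
    "forecast:reference_datetime": "2022-01-01T00:00:00Z",   # when the model ran
    "forecast:horizon": "PT36H",                             # valid - reference
}
```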
-
@m-mohr fantastic, greatly appreciated! Looks like a really nice start to me. I took the liberty of opening a few issues over there regarding terminology that probably need more discussion, and I would love to hear input from anyone on those issues before I'd be ready to shape them into a PR.
-
Hi all, I just came across this really interesting discussion and the mentioned forecast STAC-extension GH repo. I looked through the STAC extension, but don't fully understand how one is supposed to use it. We access various forecast datasets (often grib/grib2) and everyone has their own code to process them, either in R or Python. Our team uses a lot of the same terms discussed above: "valid_time", "time", "leadtime", "reference_time", "publication_date"... not going to lie, it's incredibly confusing and very easy to mess up. Therefore, I would like to use STAC and the most standardized conventions to date to simplify and standardize this process for myself and my team.
Assuming the COG structure I wrote above (1 tif per "reference_datetime" + "valid_time" combo) makes sense and I create these tifs and put them in an S3/Blob/Bucket container, I don't really understand what the next step would be. Does anyone have any links they can share where people have gone through something similar? I can't really find too many examples of people creating and interacting with these systems.
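Not authoritative, but a common next step is to write one STAC item per tif and then either publish them as a static catalog or load them into a STAC API backend (e.g. stac-fastapi with pgstac). A rough pystac sketch, with hypothetical ids, hrefs, and a global bbox, reusing the forecast-extension field names from the proposal above:

```python
import datetime

import pystac

# One item per (reference_datetime, valid_time) COG.
reference = datetime.datetime(2022, 1, 1, 0, tzinfo=datetime.timezone.utc)
valid = datetime.datetime(2022, 1, 2, 12, tzinfo=datetime.timezone.utc)

item = pystac.Item(
    id=f"forecast-{reference:%Y%m%dT%H}-{valid:%Y%m%dT%H}",
    geometry={
        "type": "Polygon",
        "coordinates": [
            [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]
        ],
    },
    bbox=[-180, -90, 180, 90],
    datetime=valid,  # item datetime = valid time
    properties={"forecast:reference_datetime": reference.isoformat()},
)
item.add_asset(
    "data",
    pystac.Asset(
        href="s3://my-bucket/forecasts/20220101T00/20220102T12.tif",  # hypothetical
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)
```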
-
Hi all,
I'm working through how to model some operational forecast data. I put together a notebook at https://notebooksharing.space/view/e3f3eceaf2cd8da3d14d5cb0f7873e1909c836b64267421f36b10a260a720d99#displayOptions= that has some initial thoughts for a specific dataset. I wanted to generalize a few things, get the STAC community's input, and eventually update https://github.com/radiantearth/stac-spec/blob/master/best-practices.md with guidance for these types of datasets. We could possibly codify some of the recommendations as an extension.
Handling datetimes
Every forecast involves two datetimes:

- The reference datetime: when the model was run.
- The forecast datetime (often called the valid datetime): the time the forecasted data are valid for.
Most of the data assets I've seen are GRIB2 files with a single timestamp. So when we're grouping assets into items (if at all), we must choose which datetime to use.
Initially, I thought to use the forecast datetime. After all, that's the time the data are valid for, so why not use it? But @justinfisk made a compelling argument for using the reference datetime (when the model ran).
If items used the forecast datetime, users querying with `datetime=<some timestamp>` would get back many items and have to filter them down somehow (most likely choosing the most recent).

I suspect, but would really appreciate feedback here, that the dominant usage pattern is people examining one or all of the forecast datetimes from the most recent reference datetime. I would guess that there's not much value in yesterday's forecast for 36 hours from now when I have today's forecast for 12 hours from now.
In short, we would group together all (similar) assets with the same reference datetime into an item. This would encourage a usage pattern like the sketch below.
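A minimal sketch of that pattern, assuming a STAC API that supports the sort extension; the endpoint, collection id, and asset key are placeholders:

```python
from pystac_client import Client

catalog = Client.open("https://example.com/stac")

# 1. Grab the item for the most recent reference datetime.
search = catalog.search(
    collections=["example-forecast"],
    sortby="-datetime",  # items are dated by reference datetime
    max_items=1,
)
latest = next(search.items())

# 2. Pick out the assets for the forecast steps of interest.
step_12h = latest.assets["step-12h"]  # hypothetical asset key
```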
How to handle multiple "streams"
This is probably too dataset-specific to write best practices for, but how should we determine what goes in a single collection? The ECMWF publishes multiple "streams", which IIUC roughly correspond to different models (some are focused on the atmosphere, others on the oceans; some are ensembles, others are single members). Should these go into separate collections? How do we make that decision?
How to handle datacubes?
Again, this might be too dataset-specific, but these GRIB2 files contain data that could be cataloged with the datacube extension. Unfortunately, a GRIB2 file might contain multiple datacubes (one variable might be measured at the surface; other variables might be at different pressure levels; some variables might be part of an ensemble, others might not be).
In the case of the ECMWF data, the contents of the GRIB2 files depend on `stream` and `type` (see the notebook for details). If we have a single collection per `(stream, type)`, then we might be able to use something like a nested version of the datacube extension, where we have a list or mapping of datacubes.

https://github.com/stactools-packages/ecmwf-forecast is coding some of this up, if you want to play around with things for yourself. The example notebook has more details.
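As a concrete illustration of the "multiple datacubes in one file" problem, this is roughly how cfgrib forces you to open such a file cube by cube; the file name is a placeholder:

```python
import xarray as xr

# One GRIB2 file, several incompatible cubes: each open_dataset call
# selects a single cube via filter_by_keys.
surface = xr.open_dataset(
    "ecmwf-forecast.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)
pressure = xr.open_dataset(
    "ecmwf-forecast.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "isobaricInhPa"}},
)
```

A nested datacube listing in the collection would mirror this structure: one entry per cube that cfgrib can extract.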