Best practices for operational forecast data #1169
Replies: 10 comments 9 replies
-
Sounds good to me.

> Handling datetimes

Usually, the data is concatenated along the dimension [...]. On IRIDL, the S2S data is concatenated along [...]. For the [...]

Regarding 1.: Most often I think in [...]

Regarding 2.: Yes. Agreed. IRIDL must have the same problem I guess. @aaron-kaplan might know someone [...]

> How to handle multiple "streams" and How to handle datacubes?
If there are no naming duplicates, I think they could go into the same collection, if the total number of variables is not overwhelming. @floriankrb might know much about this. He put [...]
-
Disclaimer: I have no clue about forecast data, but that might actually help to think a bit out of the box? ;-)
Yes, that's also my first thought. And I'm not sure I agree with all the arguments below.
Why is that an issue?
Aren't older forecasts separate items in a catalog (or sub-collection)? I'd have expected that...
I would suggest using the versioning extension: use the deprecated flag and add links to the next version, and potentially also to the newest version (although the latter would require a lot of updates to old files). You can then exclude the deprecated items in a search (by default?) and enable/disable them specifically by setting the "deprecated" query/filter option. This would nicely return only the newest items and probably make for a nicer user experience, I assume.
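To illustrate, here is a minimal sketch of a superseded forecast item, assuming the versioning extension's `version`/`deprecated` fields and `*-version` link relations; all ids and hrefs are hypothetical:

```python
# Sketch of a deprecated forecast item using the versioning extension.
# Ids, hrefs, and the schema URL version are placeholders.
old_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://stac-extensions.github.io/version/v1.0.0/schema.json"
    ],
    "id": "forecast-2022-01-01T00",
    "properties": {
        "datetime": "2022-01-01T00:00:00Z",
        "version": "1",
        "deprecated": True,  # excluded from searches unless explicitly requested
    },
    "links": [
        # Points at the forecast that superseded this one.
        {"rel": "successor-version", "href": "./forecast-2022-01-01T12.json"},
        # Optional, but requires rewriting old items on every update.
        {"rel": "latest-version", "href": "./forecast-2022-01-05T00.json"},
    ],
}
```

A search could then filter on `deprecated = false` by default and only surface the current forecast.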
I guess it depends on how you define a data cube and what exactly the data cubes look like. Either you work with more dimensions, or you could link to the same file (as asset) twice, with different data cube definitions in each asset.
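For the second option, here is a minimal sketch of what that could look like, assuming the datacube extension's fields are allowed at the asset level; hrefs, variable names, and values are made up:

```python
# One GRIB2 file exposed as two assets, each carrying its own datacube
# definition. Everything here is illustrative, not a tested catalog.
assets = {
    "surface": {
        "href": "https://example.com/data/forecast.grib2",
        "type": "application/wmo-GRIB2",
        "cube:dimensions": {
            "x": {"type": "spatial", "axis": "x", "extent": [-180, 180]},
            "y": {"type": "spatial", "axis": "y", "extent": [-90, 90]},
        },
        "cube:variables": {
            "t2m": {"dimensions": ["y", "x"], "type": "data"},
        },
    },
    "pressure-levels": {
        "href": "https://example.com/data/forecast.grib2",  # same file
        "type": "application/wmo-GRIB2",
        "cube:dimensions": {
            "x": {"type": "spatial", "axis": "x", "extent": [-180, 180]},
            "y": {"type": "spatial", "axis": "y", "extent": [-90, 90]},
            "level": {"type": "vertical", "values": [1000, 850, 500]},
        },
        "cube:variables": {
            "t": {"dimensions": ["level", "y", "x"], "type": "data"},
        },
    },
}
```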
-
A note about naming these dimensions: for the time when the forecast actually applies / is valid for, you called it "forecast datetime", but this is ambiguous, while "valid_time" is not (or "valid_datetime" if you like). BTW, following the CF conventions, this dimension has standard_name="time".

Regarding making recommendations on how to group the data by "reference datetime" (when the model starts) or by "valid datetime" (when the data is valid for): unfortunately, I will not be very helpful here. The access patterns to our archive do not favor one or the other, as very different users have very different access patterns: we saw that we need to support both of them, and more. But accessing a specific subset of our data may be different, and I look forward to having feedback from you and your users on this topic.
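For concreteness, a small xarray sketch of the CF-style layout described above, with made-up values: valid_time is derived as reference_time plus lead_time and carries standard_name="time".

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two model runs, each with lead times from 0 to 48 hours.
reference_time = pd.date_range("2022-01-01", periods=2, freq="12H")
lead_time = pd.timedelta_range("0H", "48H", freq="12H")

ds = xr.Dataset(
    {
        "t2m": (
            ("reference_time", "lead_time"),
            np.random.rand(len(reference_time), len(lead_time)),
        )
    },
    coords={"reference_time": reference_time, "lead_time": lead_time},
)

# valid_time is two-dimensional: one value per (reference_time, lead_time) pair.
ds = ds.assign_coords(valid_time=ds.reference_time + ds.lead_time)
ds["valid_time"].attrs["standard_name"] = "time"
```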
-
@aaronspring thanks for pulling me in. I've been meaning to get acquainted with STAC.

The IRIDL represents a forecast dataset as a variable with (at least) four dimensions: Y (latitude), X (longitude), S (reference date), and L (lead time, which is the difference between forecast date and reference date). As Aaron said, we extend the S dimension as new forecasts are issued.

I gather that with STAC, each "item" needs to be identified by a single datetime, so what the IRIDL considers one variable would have to be split up into thousands of "items." This is a really bad fit for some common workflows. Our users often need to retrieve a time series spanning multiple decades for a single geographic point or region. (This applies more often to observational data than to forecasts, but I have seen it with forecasts too, in an application where we evaluate a model's historical accuracy for a particular location.) If a product with daily resolution on the time dimension is split up into one "item" (file, URL, catalog entry, whatever) per day, then to get a 20-year time series for a single point you have to open more than 7,000 items, retrieving just a couple of bytes from each, which is extremely inefficient.

Representing the dataset as a single zarr store with a time dimension allows us to use a different chunking strategy that better supports this kind of request. I don't know how we would represent that kind of entity in STAC.
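A rough sketch of that chunking idea, assuming the IRIDL-style dimension names above; the file paths are hypothetical:

```python
import xarray as xr

# One zarr store holds the whole archive: S x L x Y x X.
ds = xr.open_dataset("forecasts.nc")  # decades of forecasts

# Keep the full S (reference date) and L (lead time) ranges in each chunk,
# but tile space finely: a 20-year series at one grid point then reads a
# handful of chunks instead of opening 7,000+ per-date items.
ds.chunk({"S": -1, "L": -1, "Y": 16, "X": 16}).to_zarr("forecasts.zarr")
```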
-
I found some previous discussion of the issue I mentioned here: #781. Apparently a STAC item can be identified by a time range rather than a single datetime. There's an open issue about allowing the end date to be open rather than fixed.
-
I put up some test items at https://pct-apis-staging.westeurope.cloudapp.azure.com/stac/collections/ecmwf-forecast/. These differ from my original proposal: [...]
https://notebooksharing.space/view/cc994864817996cda7a6e13c09fb9018be8e1b41450ed1a30db3504f39bb7956 has a demo going through it with various queries.

For timeseries of items (I want all the forecasts leading up to a specific valid datetime, or I want all the forecasts from a specific reference datetime), you specify just that in your query. To find a single forecast, you'll make a query specifying both the reference datetime and the valid datetime, or you'll make a query specifying either the reference or valid datetime and then select the one you want (e.g. taking the [...]).

In general, I like the idea of not combining assets with the same "datetime" (reference datetime or valid datetime) into a single item. It wasn't obvious which datetime to use, and so perhaps not guessing is the right thing to do. This does increase the number of items substantially (by roughly 200x), but hopefully that won't become a burden on the database.
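For reference, those queries might look roughly like this with pystac-client, assuming the staging API implements the query extension; the property names mirror the demo notebook and may differ in the final collection:

```python
from pystac_client import Client

catalog = Client.open(
    "https://pct-apis-staging.westeurope.cloudapp.azure.com/stac/"
)

# All forecasts issued from a specific reference datetime.
issued = catalog.search(
    collections=["ecmwf-forecast"],
    query={"ecmwf:reference_datetime": {"eq": "2022-01-01T00:00:00Z"}},
)

# All forecasts leading up to a specific valid datetime.
leading_up = catalog.search(
    collections=["ecmwf-forecast"],
    query={"ecmwf:forecast_datetime": {"eq": "2022-01-02T12:00:00Z"}},
)

items = list(issued.items())
```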
-
FWIW, it looks like the Google Earth Engine STAC catalog includes GFS forecasts using the terms [...]
-
It seems many people implement "proprietary"/custom solutions now, which should be avoided. As such, I just started and made a very slim proposal as "Forecast extension": https://github.com/stac-extensions/forecast#fields I still don't have a lot of background knowledge about this domain so feel encouraged to comment and improve it through PRs.
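For anyone skimming, item properties using the proposed fields might look roughly like this; the field names are taken from the proposal as of this writing and may change, so check the repo:

```python
# Sketch of forecast-extension properties on an item. Values are made up;
# forecast:horizon is an ISO 8601 duration from reference to valid time.
properties = {
    "datetime": "2022-01-02T12:00:00Z",                      # when the data is valid
    "forecast:reference_datetime": "2022-01-01T00:00:00Z",   # when the model ran
    "forecast:horizon": "PT36H",                             # valid - reference
}
```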
-
@m-mohr fantastic, greatly appreciated! Looks like a really nice start to me. I took the liberty of opening a few issues over there regarding terminology that probably need more discussion, and I would love to hear input from anyone on those issues before I'd be ready to shape them into a PR.
-
Hi all, I just came across this really interesting discussion and the mentioned forecast STAC-extension GH repo. I looked through the STAC extension, but don't fully understand how one is supposed to use it. We access various forecast datasets (often grib/grib2) and everyone has their own code to process them, either in R or Python. Our team uses a lot of the same terms discussed above: "valid_time", "time", "leadtime", "reference_time", "publication_date"... not going to lie, it's incredibly confusing and very easy to mess up. Therefore, I would like to use STAC and the most standardized conventions to date to simplify and standardize this process for myself and my team.
Assuming the COG structure I wrote above (1 tif per "reference_datetime" + "valid_time" combo) makes sense and I create these tifs and put them in an S3/Blob/Bucket container, I don't really understand what the next step would be. Does anyone have any links they can share where people have gone through something similar? I can't really find too many examples of people creating and interacting with these systems.
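Not authoritative, but a common next step is to write one STAC item per tif and then either publish them as a static catalog or load them into a STAC API backend (e.g. stac-fastapi with pgstac). A rough pystac sketch, with hypothetical ids, hrefs, and a global bbox, reusing the forecast-extension field names from the proposal above:

```python
import datetime

import pystac

# One item per (reference_datetime, valid_time) COG.
reference = datetime.datetime(2022, 1, 1, 0, tzinfo=datetime.timezone.utc)
valid = datetime.datetime(2022, 1, 2, 12, tzinfo=datetime.timezone.utc)

item = pystac.Item(
    id=f"forecast-{reference:%Y%m%dT%H}-{valid:%Y%m%dT%H}",
    geometry={
        "type": "Polygon",
        "coordinates": [
            [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]
        ],
    },
    bbox=[-180, -90, 180, 90],
    datetime=valid,  # item datetime = valid time
    properties={"forecast:reference_datetime": reference.isoformat()},
)
item.add_asset(
    "data",
    pystac.Asset(
        href="s3://my-bucket/forecasts/20220101T00/20220102T12.tif",  # hypothetical
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)
```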
-
Hi all,
I'm working through how to model some operational forecast data. I put together a notebook at https://notebooksharing.space/view/e3f3eceaf2cd8da3d14d5cb0f7873e1909c836b64267421f36b10a260a720d99#displayOptions= that has some initial thoughts for a specific dataset. I wanted to generalize a few things, get the STAC community's input, and eventually update https://github.com/radiantearth/stac-spec/blob/master/best-practices.md with guidance for these types of datasets. We could possibly codify some of the recommendations as an extension.
Handling datetimes
Every forecast involves two datetimes:

- The reference datetime: when the model was run.
- The forecast datetime (often called the valid datetime): the time the forecasted data are valid for.
Most of the data assets I've seen are GRIB2 files with a single timestamp. So when we're grouping assets into items (if at all), we must choose which datetime to use.
Initially, I thought to use the forecast datetime. After all, that's the time the data are valid for, so why not use it? But @justinfisk made a compelling argument for using the reference datetime (when the model ran).
If items used the forecast datetime, users querying with `datetime=<some timestamp>` would get back many items and have to filter them down somehow (most likely choosing the most recent).

I suspect, but would really appreciate feedback here, that the dominant usage pattern is people examining one or all of the forecast datetimes from the most recent reference datetime. I would guess that there's not much value in yesterday's forecast for 36 hours from now when I have today's forecast for 12 hours from now.
In short, we would group together all (similar) assets with the same reference datetime into an item. This would encourage a usage pattern like the sketch below.
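A minimal sketch of that pattern, assuming a STAC API that supports the sort extension; the endpoint, collection id, and asset key are placeholders:

```python
from pystac_client import Client

catalog = Client.open("https://example.com/stac")

# 1. Grab the item for the most recent reference datetime.
search = catalog.search(
    collections=["example-forecast"],
    sortby="-datetime",  # items are dated by reference datetime
    max_items=1,
)
latest = next(search.items())

# 2. Pick out the assets for the forecast steps of interest.
step_12h = latest.assets["step-12h"]  # hypothetical asset key
```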
How to handle multiple "streams"
This is probably too dataset-specific to write best practices for, but how should we determine what goes in a single collection? The ECMWF publishes multiple "streams", which IIUC roughly correspond to different models (some are focused on the atmosphere, others on the oceans; some are ensembles, others are single members). Should these go into separate collections? How do we make that decision?
How to handle datacubes?
Again, this might be too dataset-specific, but these GRIB2 files contain data that could be cataloged with the datacube extension. Unfortunately, a GRIB2 file might contain multiple datacubes (one variable might be measured at the surface; other variables might be at different pressure levels; some variables might be part of an ensemble, others might not be).
In the case of the ECMWF data, the contents of the GRIB2 files depend on `stream` and `type` (see the notebook for details). If we have a single collection per `(stream, type)`, then we might be able to use something like a nested version of the datacube extension, where we have a list or mapping of datacubes.

https://github.com/stactools-packages/ecmwf-forecast is coding some of this up, if you want to play around with things for yourself. The example notebook has more details.
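As a concrete illustration of the "multiple datacubes in one file" problem, this is roughly how cfgrib forces you to open such a file cube by cube; the file name is a placeholder:

```python
import xarray as xr

# One GRIB2 file, several incompatible cubes: each open_dataset call
# selects a single cube via filter_by_keys.
surface = xr.open_dataset(
    "ecmwf-forecast.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)
pressure = xr.open_dataset(
    "ecmwf-forecast.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "isobaricInhPa"}},
)
```

A nested datacube listing in the collection would mirror this structure: one entry per cube that cfgrib can extract.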