
Load ICON data from HF #66

Open
peterdudfield opened this issue Sep 4, 2024 · 17 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Sep 4, 2024

Detailed Description

It would be great to be able to load the ICON data from HF in our nwp open dataset

Context

Possible Implementation

  • Try to find a method that can lazily load the ICON data from HF in load nwp
@peterdudfield peterdudfield changed the title Load ICon data from HF Load ICON data from HF Sep 4, 2024
@gabrielelibardi

@peterdudfield I'd like to have a look at this, could you give me permission to create a branch?

@peterdudfield
Contributor Author

Hi @gabrielelibardi , we actually paused development on this in ocf_datapipes. But this could be done in ocf-data-sampler. Would you like to try it there?

@gabrielelibardi

I managed to run the pvnet_datapipe on some of the icon-eu Hugging Face data that I downloaded locally (one day's worth of data). This needs some changes in the code, but I think the problem you refer to in this issue is different: it seems to be independent of the postprocessing done in ocf_datapipes and is just about this line: ds = xr.open_mfdataset(zarr_paths, engine="zarr", combine="nested", concat_dim="time"). If zarr_paths contains the paths to every .zarr.zip file on Hugging Face, then ds is never initialized because it just takes too long. I presume this is because of all the metadata that needs to be downloaded from each .zarr.zip file. Once ds is initialized, the data will be loaded lazily as you create the batches. Maybe caching the metadata locally could speed things up. Do I understand this correctly? @peterdudfield
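For reference, a minimal sketch of the pattern I mean (the chained fsspec URLs are illustrative; the real directory layout under openclimatefix/dwd-icon-eu may differ):

```python
import xarray as xr

# Illustrative chained fsspec URLs into the HF dataset; the real layout
# of the openclimatefix/dwd-icon-eu repo may differ.
zarr_paths = [
    "zip:///::https://huggingface.co/datasets/openclimatefix/dwd-icon-eu"
    f"/resolve/main/data/2024/1/{day}/2024010{day}_00.zarr.zip"
    for day in (1, 2)
]

# Each URL is read right to left: fetch the zip over HTTPS, then expose its
# contents as a zarr store. Every store still triggers its own metadata
# requests, which is where the time goes once there are hundreds of files.
ds = xr.open_mfdataset(
    zarr_paths,
    engine="zarr",
    combine="nested",
    concat_dim="time",
    # parallel=True (with dask installed) may help by opening files concurrently.
)
print(ds)
```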

@peterdudfield
Contributor Author

Thanks, great that you managed it with one day.

Hmmm, interesting. The metadata is normally quite small.

Does it scale linearly with the number of files you provide to xr.open_mfdataset? For example, can you load 2 data files from HF quickly?

I have seen before that if the files have different shapes, xr.open_mfdataset can take a long time, as it tries to sort these out.

@gabrielelibardi

It definitely gets slower with more files. Here is a flamegraph SVG of the profiling: profile. I am getting a year's worth of Hugging Face .zarr.zip files and trying to make an xarray Dataset from the first 20. Most of the time is spent in requests to the HF server. If I try to make an xarray Dataset with too many paths, I eventually hit the HF server's rate limit: huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/openclimatefix/dwd-icon-eu.

@peterdudfield peterdudfield transferred this issue from openclimatefix/ocf_datapipes Oct 14, 2024
@peterdudfield
Contributor Author

Thanks for doing this, and nice to see the profile. What are the current timings for making an xarray Dataset from the first 20?
Are you totally sure it's only getting the metadata, not downloading all the data?

Do you know roughly how many paths it takes until you get a 429?

@gabrielelibardi

It takes 58 secs to create an xarray Dataset from 20 files. Pretty sure it is not downloading the data (that would be more than 60 GB in 60 secs, a bit too crazy for my laptop).
I think I was trying 50 or so and was already getting a 429. Maybe it is because the .zmetadata files are so small that there are a lot of separate requests in quick succession; if I download a whole .zarr.zip file it does not complain.

@gabrielelibardi

For training with ECMWF data, do you also use multiple zarr files or just one? Do you stream it from S3 buckets or keep it on the training server?

@peterdudfield
Contributor Author

For training with ECMWF data, do you also use multiple zarr files or just one? Do you stream it from S3 buckets or keep it on the training server?

For training on ECMWF, we tend to join our daily ECMWF files into yearly (or monthly) files and then open the multiple zarrs together. We also try to keep the data close to where the model is running, so either locally if training locally, or in the cloud if we are training in the cloud. It would be nice to stream from HF though, so we don't have to reorganise the files, and the live data is available.
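As a rough illustration (the paths and the concat dimension name are made up, not our exact setup), joining daily files into a monthly file looks something like:

```python
import glob

import xarray as xr

# Hypothetical local layout of daily ECMWF zarrs for one month.
daily_paths = sorted(glob.glob("/data/ecmwf/daily/2024-01-*.zarr"))

# Open all the daily files lazily and concatenate them along the init-time
# dimension (called "init_time" here; use whatever your files actually have).
ds = xr.open_mfdataset(
    daily_paths, engine="zarr", combine="nested", concat_dim="init_time"
)

# Write a single monthly zarr, so training only has to open one store.
ds.to_zarr("/data/ecmwf/monthly/2024-01.zarr", mode="w")
```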

Thanks for these benchmark figures. Yeah, I agree it's just downloading metadata, not the whole thing.

Feels like there should be some caching we can do, so that, for example, we load the metadata for each month once and can then quickly load any new file in the future. I really don't know the solution for this, sorry.

@devsjc @Sukh-P @AUdaltsova might have some ideas?

@jacobbieker
Member

You should be able to do something like kerchunk or virtualizarr to save the metadata into one file and open that. That way you wouldn't need all the requests for the metadata. Getting the data is still then limited by HF request limits, but is at least then less of an issue.
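A rough, untested sketch of the kerchunk route, assuming kerchunk can scan through the zip-over-HTTPS chaining used above (virtualizarr would be the more modern alternative; the URLs are still illustrative):

```python
import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.zarr import single_zarr

# Same illustrative chained URLs as in the earlier sketch.
zarr_paths = [
    "zip:///::https://huggingface.co/datasets/openclimatefix/dwd-icon-eu"
    f"/resolve/main/data/2024/1/{day}/2024010{day}_00.zarr.zip"
    for day in (1, 2)
]

# One reference dict per store -- this is the slow, one-off metadata scan,
# which you would run once and save (e.g. as JSON) rather than on every open.
refs = [single_zarr(path) for path in zarr_paths]

# Combine the references along time into a single virtual dataset.
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()

# Open the combined references; only the chunks you touch are fetched from HF.
fs = fsspec.filesystem("reference", fo=combined)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
print(ds)
```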

@gabrielelibardi

Thanks a lot @jacobbieker @peterdudfield! I haven't tried it yet, but it looks promising. For now I have put 1 month of data on cloud storage (something like S3). This solves the problem with the 429 error. I tried to run script/save_batches.py though and it is impractically slow (1 sample every 3 secs or so). It is not a problem for me to self-host the ICON dataset (or parts of it), but I would like to keep large amounts of data off the training instance, as we would spin it up and down. I would be curious to know, in your case for training PVNet, how long you needed to create the batches for the training data, and which solutions worked best for you when training in the cloud.

@peterdudfield
Contributor Author

Thanks @gabrielelibardi

Is this slow running speed when you load from S3? Where are you running the code from, an EC2 instance?

We tend to get a speed of more like 1 batch every second or so (I would have to look up the batch size), so it can still take a day or so to make batches. We get the fastest results by having the data very near, i.e. locally or on a disk attached to a VM. This is more setup, but faster. Using multiprocessing helps too; I think that's in the batch-making script already.

@gabrielelibardi

Thank you @peterdudfield. It is a colocated S3 bucket and I am running from a VPS, but these are not AWS services, so there might be some differences in terms of bandwidth. In general, though, it is faster than streaming from HF, which makes the HF streaming quite useless in my opinion. Maybe you have tried this before with a more powerful instance and it was faster? Still, I think implementing this is beneficial, as it should be the same code whether you stream from local or remote .zarr.zip files. I can implement this in the ocf-data-sampler repo; this one worked pretty well for me though, what is the major difference?

@peterdudfield
Contributor Author

peterdudfield commented Oct 21, 2024

Thanks.

Yeah, we first used HF to save the ICON data, as there is no rolling archive at the moment. You're totally right though, we should consider a further setup that makes it useful for other people to use.

We are migrating from ocf_datapipes to ocf-data-sampler, essentially to simplify the code and remove torch.datapipes, which doesn't fit our use case anymore.

It would be greatly appreciated if you could put the code for ICON into this repo.

@gabrielelibardi

@peterdudfield, can you give me permission to push a feature branch to ocf-data-sampler, please?

@peterdudfield
Contributor Author

Yeah, I can definitely open access. What we generally do is let OS contributors:

  1. clone the repo
  2. make their changes
  3. open a PR back to OCF's repo

Does that work for you?

@gabrielelibardi

Sure!
